On illumos Day, Joyent’s own Ben Rockwood gave us an in-depth look at the tools and techniques he uses to manage Joyent’s public cloud on SmartOS, covering everything from monitoring to configuration management and troubleshooting. Here’s video of the complete talk: 55 minutes of deep devops goodness, packed full of knowledge – and humor.
Agenda
- Principles of Operation
- Provisioning
- Monitoring
- Configuration Management
- Orchestration
- Authentication
- Access Control
- Auditing
- Logging
- Metrics
- Tips & Tools
Obligatory DevOps Pitch
DevOps is about 3 things:
- The Collaboration of People
- The Convergence of Process
- The Creation & Exploitation of Tools
Its primary goal is providing quality & value to customers
It is concerned with flow and encourages system thinking
Born from TPS/LEAN, TOC, Agile, & classical Operations Management
Principles of Operation
Goals:
- Omnipotence: All Powerful
- Omnipresence: All Seeing
- Omniscience: All Knowing
… Since we can’t really do that…
- Make change control simple and standardized
- Monitor as deeply as possible and alert a human as needed
- Leverage a suite of tools to help us analyze problems quickly
Man is mortal
- Follow the “no snowflake” rule; minimize variation to maximize maintainability and predictability
- Maintain a set of standard operating procedures (SOPs) to ensure quality across the organization
- Make tools as simple and productive as possible to avoid ad hoc (rogue) administration
- Keep it simple stupid (KISS); cleverness is temporary but grok’ability is forever
- Leverage industry standard tools and stock supplied facilities, avoid excessive customization
Provisioning
USB Keys, CD/DVD/ISO, or PXE possible
PXE is preferred for all serious production deployments
- PXE greatly simplifies upgrade/downgrade; just change the TFTP image to boot and reboot the machine.
- Much faster and more controllable than re-imaging USB keys in place
Shameless Smart Data Center Plug
Monitoring
JPC uses Zabbix
- Free & Open Source
- Proxy architecture allows for multiple data centers easily
- Agent or Agent-less Operation
- Agent-less: Supports IPMI & SNMP
- Agent is tiny, written in C, and can be compiled statically for easy binary installation without dependencies
- Extremely easy to customize and add custom metrics
- Dashboard provides a “single pane of glass” view of your entire infrastructure
- Includes historical graphing of all metrics
Caution: Use Percona as the backend database
Monitoring: Completing the Solution
Zabbix Agents installed by Chef
- Statically compiled binaries distributed in Cookbook
Zabbix Alerts
- All alerts sent to Ops Staff Jabber & Email directly
- “Disaster” alerts sent to PagerDuty (SMS Ops)
Pingdom used as backup/redundant solution, alerts sent to PagerDuty
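The routing rule above fits in a few lines. This is only a sketch of the rule: the severity names are Zabbix's standard trigger severities, but the channel names are hypothetical labels, and in practice this lives in Zabbix actions and media types rather than in Python.

```python
# Zabbix trigger severities, lowest to highest.
SEVERITIES = ["Not classified", "Information", "Warning",
              "Average", "High", "Disaster"]


def route_alert(severity):
    """Return the channels (hypothetical names) an alert of this severity goes to."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    channels = ["jabber", "email"]     # every alert goes straight to Ops staff
    if severity == "Disaster":
        channels.append("pagerduty")   # only disasters page someone by SMS
    return channels


print(route_alert("Warning"))
print(route_alert("Disaster"))
```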
Configuration Management
In SmartOS, CM is mandatory (imho)
JPC uses Chef-Solo for Configuration Management
Bootstrap script is curled and piped to bash, which:
- Downloads Chef “Fat Client”
- Creates Chef-Solo Configuration
- Creates SMF Service and Runs Chef
Each data center has its own “attributes file” which specifies Zabbix Server, LDAP servers, SSH Keys, etc.
One set of Cookbooks is used for all DCs
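To make the "one set of cookbooks, one attributes file per DC" idea concrete, here is a minimal sketch that renders a JSON attributes file of the kind chef-solo accepts with its -j option. The data center names, hostnames, and attribute keys are all made up for illustration; JPC's real attribute files may look quite different.

```python
import json

# Hypothetical per-data-center settings; every name and address is a placeholder.
DATACENTERS = {
    "dc-east": {
        "zabbix_server": "zabbix.east.example.com",
        "ldap_servers": ["ldap1.east.example.com", "ldap2.east.example.com"],
    },
    "dc-west": {
        "zabbix_server": "zabbix.west.example.com",
        "ldap_servers": ["ldap1.west.example.com", "ldap2.west.example.com"],
    },
}

# The run list is identical everywhere: one set of cookbooks for all DCs.
RUN_LIST = ["recipe[joyent]", "recipe[ldap]", "recipe[zabbix]"]


def attributes_for(dc):
    """Build the attributes for one data center: shared run list plus local values."""
    attrs = dict(DATACENTERS[dc])
    attrs["run_list"] = list(RUN_LIST)
    return attrs


print(json.dumps(attributes_for("dc-east"), indent=2, sort_keys=True))
```

Only the values differ between data centers; the cookbooks themselves never need to know which DC they are running in.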
Config Management w/Chef
JPC Chef Cookbooks include:
- “joyent”: Base cookbook run on all nodes, installs basic tools, fixes anything undesirable in SmartOS, adds BMC driver, adds MegaSAS tools, etc.
- “computenode”: Modifications specific to general purpose compute nodes (currently empty)
- “ldap”: Configures LDAP client, modifies PAM for netgroups support, creates user directories, configures ZFS for delegated administration, etc.
- “zabbix”: Installs and configures Zabbix
- … others, including “northstar”, “bart”, “logging”, etc.
- SmartOS Cookbooks & Tools: github.com/joyent/smartos_cookbooks
Orchestration
Orchestration layer is required for ad-hoc mass control of nodes, for:
- Re-running Chef
- Mass service control (“svcadm disable zones” on all nodes)
- Auditing
- … things you can’t foresee
Several options exist: MCollective, pssh, mussh, etc.
SDC includes an MCollective-like solution (sdc-oneachnode)
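What all of these tools have in common is "run this command on every node and collect the results". A minimal sketch of that idea follows, assuming key-based ssh access to each node. It is not how sdc-oneachnode, mussh, or MCollective are implemented, and the function names are made up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_run(host, command, timeout=30):
    """Run one command on one host over ssh; return (exit status, combined output)."""
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, command],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None, "timed out"
    return proc.returncode, proc.stdout


def on_each_node(hosts, command, runner=ssh_run, workers=10):
    """Run command on every host in parallel; return {host: (status, output)}."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda host: runner(host, command), hosts)
        return dict(zip(hosts, results))
```

Calling on_each_node(nodes, "svcadm disable zones") would then attempt the mass service control mentioned above. The runner argument is there so the fan-out logic can be exercised without touching real machines.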
Orchestration w/ Mussh
Other Orchestration Tools to Consider
ClusterSSH (cssh): http://sourceforge.net/projects/clusterssh/
RunDeck (formerly ControlTier): http://rundeck.org
User Management & Authentication
Use LDAP!
JPC uses OpenLDAP
- Easy to manage; lots of resources
- Flexible replication schemes
- Flat text file configuration makes change control easier
Client Access via Simple-SSL (636)
Perform daily management via Apache DirectoryStudio
Generate User Passwords using apg (20 char len)
- Don’t enable Anon access, you do NOT need it
- Firewalled legacy 389 access provided for some appliances
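If apg is not at hand, a 20-character random password can be generated with Python's standard library. This is a stand-in, not apg itself: apg produces pronounceable passwords by default, while this sketch just draws uniformly from letters and digits (the alphabet is an arbitrary choice).

```python
import secrets
import string

# Letters and digits only; an arbitrary choice for this sketch.
ALPHABET = string.ascii_letters + string.digits


def generate_password(length=20):
    """Return a random password drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(generate_password())
```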
LDAP Considerations
The “Hard Part” is creating the Schema & seeding the DIT; JPC’s “ldap_kit” will be open sourced soon
Always deploy LDAP Servers in pairs
Use MirrorMode replication
Enforce auth for all users (no anon) and only use SSL if you can
Don’t mess around with anything other than OpenLDAP & the standard illumos LDAP Client (i.e., don’t go chasing Linux PAM projects, you don’t need them)
When configuring clients via CM, modify files directly. Trying to exec ldapclient init may have mixed results.
A Word About Kerberos
It’s not worth the administrative overhead (imho)
I don’t believe in SSO for administration in production environments (password entry encourages boundary awareness)
Keep an eye on ApacheDS (directory.apache.org) & FreeIPA (freeipa.org) Projects
Access Control
Use Role Based Access Control (RBAC)
It’s not hard… really!
Manage RBAC in LDAP, if possible
Create abstraction profiles, ex:
- Joyent Level D: Normal user + DTrace
- Joyent Level 1: Normal user + Zone/VM Management
- Joyent Level 2: Admin, All but security
- Joyent Level 3: “Primary Administrator” (uid=0)
RBAC: Learning More
Authorizations are in /etc/security/auth_attr
Execs are in /etc/security/exec_attr
Profiles associate auths and execs for easy reference in /etc/security/prof_attr
They are associated with users in /etc/user_attr
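To see how a user ends up with a profile, it helps to pull an /etc/user_attr entry apart. Entries have five colon-separated fields (user:qualifier:res1:res2:attr), and the last field is a semicolon-separated list of key=value pairs. The sketch below parses one entry; the user "alice" is made up, and escaped characters inside values are not handled.

```python
def parse_user_attr(line):
    """Parse one user_attr entry into (user, {key: [values]})."""
    user, _qualifier, _res1, _res2, attr = line.rstrip("\n").split(":", 4)
    attrs = {}
    for pair in filter(None, attr.split(";")):
        key, _, value = pair.partition("=")
        attrs[key] = value.split(",")   # profiles, auths, etc. are comma lists
    return user, attrs


# Hypothetical entry granting one of the abstraction profiles named earlier.
entry = "alice::::type=normal;profiles=Joyent Level 1,Basic Solaris User"
user, attrs = parse_user_attr(entry)
print(user, attrs["profiles"])
```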
RBAC Shells
pfbash, pfcsh, pfsh, etc. Avoid them; intended for roles, not users.
Auditing
Basic Security Module (BSM) Lives!
BSM Auditing is enabled by Default
Audit trails in /var/audit
Make sure to add a crontab entry to rotate audit trails (“audit -n”) daily or weekly; by default they are never rotated.
Print audit trails using “praudit -ls <trail>”
Auditing with BART
BART == Basic Auditing & Reporting Tool
Similar to TripWire
Consider using “BARTlog”
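The idea behind BART and TripWire is simple: record a manifest of the file system, record another one later, and report what differs. Here is a conceptual sketch of that comparison. It does not use BART's manifest format or command line; it only borrows BART's "control" and "test" terminology.

```python
import hashlib
import os


def build_manifest(root):
    """Map each file under root (by relative path) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            manifest[os.path.relpath(path, root)] = digest
    return manifest


def compare(control, test):
    """Return (added, removed, changed) paths between two manifests."""
    added = sorted(set(test) - set(control))
    removed = sorted(set(control) - set(test))
    changed = sorted(p for p in set(control) & set(test)
                     if control[p] != test[p])
    return added, removed, changed
```

Build a control manifest when the system is in a known-good state, build a test manifest on a schedule, and alert on any non-empty result from compare.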
Logging
SmartOS ships with Rsyslog; you can fall back to stock syslogd if you wish
Rsyslog is a syslog server for this century, includes TCP support, TLS, filtering, compression, database support, etc.
SMF Services log to /var/svc/log
System logs found in /var/adm & /var/log
Logging Tips
Enable BSM Syslog Plugin
- Sadly, command executions do not include ARGV today
Use logger(1) in your scripts to write syslog messages
Centralize Syslog
Leverage Rsyslog’s TCP capabilities for clients
Leverage Rsyslog’s filtering capabilities for building centralized syslog servers
… if you can afford it, buy Splunk or SumoLogic
… if you can’t, consider Graylog2 and/or Logstash
If you have too much time on your hands, go Hadoop
Metrics
“If it moves, graph it. If it’s important, alert on it.”
Kstats are your friend (See all available: “kstat -p”)
For everything else, there is DTrace
Metrics: Kstats
A “registry” of kernel statistics
Most stats are counters; to calculate activity find the delta
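Finding the delta means sampling the same counter twice and dividing the difference by the time between samples. A minimal sketch, assuming the tab-separated name/value lines that “kstat -p” prints; the sample values below are invented, and counter wrap-around is ignored.

```python
def parse_kstat(text):
    """Parse `kstat -p` output (name<TAB>value per line), keeping integer values."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("\t")
        if sep and value.strip().isdigit():
            stats[name] = int(value)
    return stats


def rates(before, after, interval_seconds):
    """Per-second rate for every counter present in both samples."""
    return {name: (after[name] - before[name]) / interval_seconds
            for name in after if name in before}


# Two invented samples taken 10 seconds apart.
first = parse_kstat("cpu:0:sys:syscall\t1000\ncpu:0:sys:class\tmisc\n")
second = parse_kstat("cpu:0:sys:syscall\t1600\ncpu:0:sys:class\tmisc\n")
print(rates(first, second, 10))
```

Non-numeric statistics such as the class string are dropped, since a delta only makes sense for counters.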
Most common tools use Kstats as their source data, ex:
Metrics Graphing Solutions
RRDtool: All-in-One database and graphing solution; local
Ganglia: Flexible cluster graphing solution, based on RRDtool (agent-based)
Graphite: Modern alternative to RRDtool; network based graphing and “rrd” data storage. (agent-less)
In the end, data is fed into nearly all tools as key/value pairs with a timestamp.
Northstar RRDtool Example
Graphite In Use
Feed Graphite Data via netcat:
echo "test.cpu 20 $(date +%s)" | nc graphite-server 2003
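The same thing from a script: each metric is one line of text, "path value timestamp", sent over TCP. A minimal sketch, reusing the graphite-server host and port 2003 from the netcat example; the function names are made up.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext format: 'path value timestamp'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="graphite-server", port=2003):
    """Open a TCP connection and send a single metric line."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))


# Formatting only; send_metric() needs a reachable Graphite server.
print(graphite_line("test.cpu", 20, 1700000000), end="")
```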
View the graph via the “URL API”:
http://graphite-server:8888/render/?width=400&height=250&target=dtrace.newton.syscall.read.entry&from=-1hours
API also supports CSV, JSON, and XML output!
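Since the render API is just a URL with query parameters, building one programmatically is straightforward. A small sketch, using the host, port, and target from the example above; the helper name is made up, and format=json is the parameter that switches the output from an image to JSON.

```python
from urllib.parse import urlencode


def render_url(target, params, base="http://graphite-server:8888"):
    """Build a Graphite render URL for one target plus extra query parameters."""
    query = {"target": target}
    query.update(params)
    return f"{base}/render/?{urlencode(query)}"


target = "dtrace.newton.syscall.read.entry"
# A 400x250 graph of the last hour.
print(render_url(target, {"width": 400, "height": 250, "from": "-1hours"}))
# The same data as JSON instead of an image.
print(render_url(target, {"from": "-1hours", "format": "json"}))
```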
Examples of DTrace & Graphite at: https://github.com/benr/graphite-dtrace
Other Tools & Tips
Use the Ptools to observe processes
- pfiles: List open file descriptors of a process
- pargs: List arguments & env vars on a process
- pmap: Show memory allocation of a process
- and… pldd, pflags, pcred, pstack, pstop, prun, pwait, etc.
Monitor per-mount file system activity with fsstat
SmartOS includes ziostat and zmemstat
Use IPMI if you’ve got it! IPMI goodies include:
- Sensor Data Repository (sdr)
- System Event Log (sel)
- Serial Console Redirection Over LAN (sol)
- FRU Inventory (fru)
Know your place! Use LLDP if your network provides it.
‘getldp.pl’ uses snoop to listen for LLDP packets
… now go forth and operate that thing!