SmartOS Operations – Ben Rockwood at illumos Day

On illumos Day, Joyent’s own Ben Rockwood gave us an in-depth look at the tools and techniques he uses to manage Joyent’s public cloud on SmartOS, covering everything from monitoring to configuration management and troubleshooting. Here’s video of the complete talk: 55 minutes of deep devops goodness, packed full of knowledge – and humor.

Agenda

  • Principles of Operation
  • Provisioning
  • Monitoring
  • Configuration Management
  • Orchestration
  • Authentication
  • Access Control
  • Auditing
  • Logging
  • Metrics
  • Tips & Tools

Obligatory DevOps Pitch

DevOps is about 3 things:

  • The Collaboration of People
  • The Convergence of Process
  • The Creation & Exploitation of Tools

Its primary goal is providing quality & value to customers

It is concerned with flow and encourages system thinking

Born from TPS/LEAN, TOC, Agile, & classical Operations Management

Principles of Operation

Goals:

  • Omnipotence: All Powerful
  • Omnipresence: All Seeing
  • Omniscience: All Knowing

… Since we can’t really do that…

  • Make change control simple and standardized
  • Monitor as deeply as possible and alert a human as needed
  • Leverage a suite of tools to help us analyze problems quickly

Man is mortal

  • Follow the “no snowflake” rule; minimize variation to maximize maintainability and predictability
  • Maintain a set of standard operating procedures (SOPs) to ensure quality across the organization
  • Make tools as simple and productive as possible to avoid ad hoc (rogue) administration
  • Keep it simple stupid (KISS); cleverness is temporary but grok’ability is forever
  • Leverage industry standard tools and stock supplied facilities, avoid excessive customization

Provisioning

USB Keys, CD/DVD/ISO, or PXE possible

PXE is preferred for all serious production deployments

  • PXE greatly simplifies upgrade/downgrade; just change the TFTP image to boot and reboot the machine.
  • Much faster and more controllable than re-imaging USB keys in place

Shameless Smart Data Center Plug

Monitoring

JPC uses Zabbix

  • Free & Open Source
  • Proxy architecture allows for multiple data centers easily
  • Agent or Agent-less Operation
  • Agent-less: Supports IPMI & SNMP
  • Agent is tiny, written in C, and can be compiled statically for easy binary installation without dependencies
  • Extremely easy to customize and add custom metrics
  • Dashboard provides a “single pane of glass” view of your entire infrastructure
  • Includes historical graphing of all metrics

Caution: Use Percona as the backend database

Monitoring: Completing the Solution

Zabbix Agents installed by Chef - Statically compiled binaries distributed in Cookbook

Zabbix Alerts

  • All alerts sent to Ops Staff Jabber & Email directly
  • “Disaster” alerts sent to PagerDuty (SMS Ops)

Pingdom used as backup/redundant solution, alerts sent to PagerDuty

Configuration Management

In SmartOS, CM is mandatory (imho)

JPC uses Chef-Solo for Configuration Management

Bootstrap script is curled and piped to bash, which:

  • Downloads Chef “Fat Client”
  • Creates Chef-Solo Configuration
  • Creates SMF Service and Runs Chef

Each data center has its own “attributes file” which specifies Zabbix Server, LDAP servers, SSH Keys, etc.

One set of Cookbooks is used for all DCs

Config Management w/Chef

JPC Chef Cookbooks include:

  • “joyent”: Base cookbook run on all nodes, installs basic tools, fixes anything undesirable in SmartOS, adds BMC driver, adds MegaSAS tools, etc.
  • “computenode”: Modifications specific to general purpose compute nodes (currently empty)
  • “ldap”: Configures LDAP client, modifies PAM for netgroups support, creates user directories, configures ZFS for delegated administration, etc.
  • “zabbix”: Installs and configures Zabbix
  • … others, including “northstar”, “bart”, “logging”, etc.
  • SmartOS Cookbooks & Tools: github.com/joyent/smartos_cookbooks

Orchestration

Orchestration layer is required for ad-hoc mass control of nodes, for:

  • Re-running Chef
  • Mass service control (“svcadm disable zones” on all nodes)
  • Auditing
  • … things you can’t foresee

Several options exist: MCollective, pssh, mussh, etc.

SDC includes an MCollective-like solution (sdc-oneachnode)
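As a rough illustration of what these tools do, here is a minimal sketch of a serial "run this everywhere" loop over plain ssh. It is not how mussh or sdc-oneachnode are implemented. The function name on_each_node, the node-list file format (one hostname per line) and the SSH_CMD override are all assumptions made for this example.

```shell
#!/bin/sh
# Run one command on every node listed in a file (one hostname per line).
# SSH_CMD is a hypothetical override hook so the loop can be dry-run;
# it defaults to plain ssh.
on_each_node() {
    nodes_file=$1
    shift
    while read -r node; do
        # Skip blank lines and comment lines in the node list.
        case $node in ''|'#'*) continue ;; esac
        # -n stops ssh from consuming the rest of the node list on stdin.
        ${SSH_CMD:-ssh} -n "$node" "$@" || echo "FAILED: $node" >&2
    done < "$nodes_file"
}

# Usage (nodes.txt is a hypothetical file name):
#   on_each_node nodes.txt svcadm disable zones
```

The loop is serial and keeps going after a failure, reporting the failed node on stderr. The dedicated tools add parallelism and better reporting on top of the same idea.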

 

Orchestration w/ Mussh

Other Orchestration Tools to Consider

ClusterSSH (cssh): http://sourceforge.net/projects/clusterssh/

RunDeck (formerly ControlTier): http://rundeck.org

User Management & Authentication

Use LDAP!

JPC uses OpenLDAP

  • Easy to manage; lots of resources
  • Flexible replication schemes
  • Flat text file configuration makes change control easier

Client Access via Simple-SSL (636)

Perform daily management via Apache Directory Studio

Generate user passwords using apg (20 characters long)

  • Don’t enable Anon access, you do NOT need it
  • Firewalled legacy 389 access provided for some appliances

LDAP Considerations

The “Hard Part” is creating the Schema & seeding the DIT; JPC’s “ldap_kit” will be open sourced soon

Always deploy LDAP Servers in pairs

Use MirrorMode replication

Enforce auth for all users (no anon) and use only SSL wherever you can

Don’t mess around with anything other than OpenLDAP & the standard illumos LDAP Client (i.e., don’t go chasing Linux PAM projects, you don’t need them)

When configuring clients via CM, modify files directly. Trying to exec ldapclient init may have mixed results.

A Word About Kerberos

It’s not worth the administrative overhead (imho)

I don’t believe in SSO for administration in production environments (password entry encourages boundary awareness)

Keep an eye on ApacheDS (directory.apache.org) & FreeIPA (freeipa.org) Projects

Access Control

Use Role Based Access Control (RBAC). It’s not hard… really!

Manage RBAC in LDAP, if possible

Create abstraction profiles, e.g.:

  • Joyent Level D: Normal user + DTrace
  • Joyent Level 1: Normal user + Zone/VM Management
  • Joyent Level 2: Admin, All but security
  • Joyent Level 3: “Primary Administrator” (uid=0)

RBAC: Learning More

Authorizations are in /etc/security/auth_attr

Execs are in /etc/security/exec_attr

Profiles associate auths and execs for easy reference in /etc/security/prof_attr

They are associated with users in /etc/user_attr
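To make the user-to-profile link concrete, the sketch below pulls the profiles= attribute for one user out of a user_attr-style file. It assumes the usual colon-separated layout, in which the fifth field holds semicolon-separated key=value attributes. The function name user_profiles and the sample entry are made up for this example. It only reads the local file, so users managed in LDAP will not show up.

```shell
#!/bin/sh
# Print the profiles assigned to a user in a user_attr-format file.
# Entry layout assumed: user:qualifier:res1:res2:attr, where attr is a
# semicolon-separated list such as "type=normal;profiles=Some Profile".
user_profiles() {
    awk -F: -v user="$1" '
        $1 == user {
            n = split($5, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^profiles=/) {
                    sub(/^profiles=/, "", attrs[i])
                    print attrs[i]
                }
        }' "${2:-/etc/user_attr}"
}

# Usage with an illustrative entry such as
#   benr::::type=normal;profiles=Joyent Level 2
# user_profiles benr
```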

RBAC Shells

pfbash, pfcsh, pfsh, etc. Avoid them; intended for roles, not users.

Auditing

Basic Security Module (BSM) Lives!

BSM Auditing is enabled by Default

Audit trails in /var/audit

Make sure to add a crontab entry to rotate audit trails (“audit -n”) daily or weekly; they are not rotated by default.
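A daily rotation entry might look like the one printed by this sketch. The midnight schedule, the /usr/sbin path and the helper name audit_rotate_entry are assumptions; check them against the local system before installing anything.

```shell
#!/bin/sh
# Print a crontab entry that closes the current audit trail and starts a
# new one every night at midnight. Schedule and path are assumptions.
audit_rotate_entry() {
    printf '%s\n' '0 0 * * * /usr/sbin/audit -n'
}

# One way to install it for the current user, keeping existing entries
# (illumos crontab reads the new table from stdin when given no file):
#   (crontab -l; audit_rotate_entry) | crontab
```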

Print audit trails using “praudit -ls <trail>”

Auditing with BART

BART == Basic Auditing & Reporting Tool

Similar to TripWire

Consider using “BARTlog”

Logging

SmartOS ships with Rsyslog; you can fall back to the stock syslogd if you wish

Rsyslog is a syslog server for this century, includes TCP support, TLS, filtering, compression, database support, etc.

SMF Services log to /var/svc/log

System logs found in /var/adm & /var/log

Logging Tips

Enable BSM Syslog Plugin

  • Sadly, command executions do not include ARGV today :(

Use logger(1) in your scripts to write syslog messages
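A small wrapper keeps the tag and priority consistent across a script. This is only a sketch: the function name log_event, the default tag and the LOGGER_CMD override (handy for dry runs) are assumptions, while -p and -t are standard logger(1) options.

```shell
#!/bin/sh
# Send one message to syslog through logger(1).
# LOG_TAG and LOGGER_CMD are hypothetical knobs for this example.
log_event() {
    priority=$1
    shift
    # -p sets facility.level, -t sets the tag shown in the log line.
    ${LOGGER_CMD:-logger} -p "$priority" -t "${LOG_TAG:-myscript}" "$*"
}

# Usage:
#   LOG_TAG=backup
#   log_event local0.notice "nightly backup finished"
```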

Centralize Syslog

Leverage Rsyslog’s TCP capabilities for clients

Leverage Rsyslog’s filtering capabilities for building centralized syslog servers

… if you can afford it, buy Splunk or SumoLogic

… if you can’t, consider Graylog2 and/or Logstash

If you have too much time on your hands, go Hadoop

Metrics

“If it moves, graph it. If it’s important, alert on it.”

Kstats are your friend (See all available: “kstat -p”)

For everything else, there is dtrace

Metrics: Kstats

A “registry” of kernel statistics

Most stats are counters; to calculate activity find the delta
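Finding the delta can be sketched as follows: take two samples of a counter and divide the difference by the time between them. The helper names are made up, the statistic in the usage comment is only an illustration, and counter wrap-around is not handled.

```shell
#!/bin/sh
# Per-second rate of a counter from two samples taken $3 seconds apart.
# Integer arithmetic only; a wrapped counter gives a negative result.
kstat_rate() {
    prev=$1
    cur=$2
    interval=$3
    echo $(( ($cur - $prev) / $interval ))
}

# Read one statistic's value; "kstat -p" prints "name<TAB>value".
kstat_value() {
    kstat -p "$1" | awk '{ print $2 }'
}

# Usage on a SmartOS host (statistic name chosen for illustration):
#   a=$(kstat_value cpu:0:sys:syscall); sleep 10
#   b=$(kstat_value cpu:0:sys:syscall)
#   kstat_rate "$a" "$b" 10
```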

Most common tools use Kstats as their source data, ex:

Metrics Graphing Solutions

RRDtool: All-in-One database and graphing solution; local

Ganglia: Flexible cluster graphing solution, based on RRDtool (agent-based)

Graphite: Modern alternative to RRDtool; network based graphing and “rrd” data storage. (agent-less)

In the end, data is fed into nearly all tools as key/value pairs with a timestamp.

Northstar RRDtool Example

Graphite In Use

Feed Graphite Data via netcat:

echo "test.cpu 20 $(date +%s)" | nc graphite-server 2003
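The same one-liner can be wrapped in a small helper so scripts emit consistently formatted lines. This is a sketch; the function name graphite_line is an assumption, and the host and port are the ones from the example above.

```shell
#!/bin/sh
# Format one metric as "path value timestamp", the line format used in
# the netcat example. The timestamp defaults to the current epoch time.
graphite_line() {
    printf '%s %s %s\n' "$1" "$2" "${3:-$(date +%s)}"
}

# Usage:
#   graphite_line test.cpu 20 | nc graphite-server 2003
```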

View the graph via the “URL API”:

http://graphite-server:8888/render/?width=400&height=250&target=dtrace.newton.syscall.read.entry&from=-1hours

API also supports CSV, JSON, and XML output!
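For scripted access, the render URL can be assembled with a small helper and fetched with any HTTP client. The function name graphite_render_url and its defaults are assumptions; the format parameter selects the output type (json is used here).

```shell
#!/bin/sh
# Build a Graphite render URL.
# Arguments: base URL, target metric, time range (default -1hours),
# output format (default png).
graphite_render_url() {
    base=$1
    target=$2
    from=${3:--1hours}
    format=${4:-png}
    printf '%s/render/?target=%s&from=%s&format=%s\n' \
        "$base" "$target" "$from" "$format"
}

# Usage (quote the URL so the shell does not interpret the "&"):
#   curl -s "$(graphite_render_url http://graphite-server:8888 \
#       dtrace.newton.syscall.read.entry -1hours json)"
```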

Examples of DTrace & Graphite at: https://github.com/benr/graphite-dtrace

Other Tools & Tips

Use the Ptools to observe processes

  • pfiles: List open file descriptors of a process 
  • pargs: List arguments & env vars on a process 
  • pmap: Show memory allocation of a process
  • and… pldd, pflags, pcred, pstack, pstop, prun, pwait, etc.

Monitor per-mount file system activity with fsstat

SmartOS includes ziostat and zmemstat

Use IPMI if you’ve got it! IPMI goodies include:

  • Sensor Data Repository (sdr)
  • System Event Log (sel)
  • Serial Console Redirection Over LAN (sol)
  • FRU Inventory (fru)

Know your place! Use LLDP if your network provides it.

‘getldp.pl’ uses snoop to listen for LLDP packets

… now go forth and operate that thing!
