CFEngine and Ansible are complementary

CFEngine is designed for ongoing maintenance and verification of desired state, whereas Ansible is designed as a simple tool for making changes quickly.

CFEngine and Ansible complement each other. For example, I have a CFEngine promise that inventories the CliQr version by reading /usr/local/osmosix/MANIFEST.MF. By pulling up a CFEngine Enterprise report, I can tell in 3 seconds how many of my thousands of hosts are on which version of CliQr. Running this report with Ansible would take minutes (the more hosts, the longer it would take).
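For the curious, an inventory promise along these lines is only a few lines of policy. This is a minimal sketch, not the actual promise I run; the bundle and variable names are mine, while readfile(), fileexists() and the Enterprise inventory meta tags are standard CFEngine:

```cf3
bundle agent inventory_cliqr
# Sketch: surface the CliQr version as a CFEngine Enterprise
# inventory attribute. Names here are illustrative.
{
  classes:
      "have_cliqr_manifest"
        expression => fileexists("/usr/local/osmosix/MANIFEST.MF");

  vars:
    have_cliqr_manifest::
      "cliqr_version"
        string => readfile("/usr/local/osmosix/MANIFEST.MF", 512),
        meta   => { "inventory", "attribute_name=CliQr version" };
}
```

Once a variable carries the "inventory" tag, the hub collects it on its normal schedule, which is why the report is there waiting for you instead of being generated on demand.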

However, developing, testing and deploying this CFEngine promise took its own time (an initial up-front investment).

Another example: in preparation for a CliQr upgrade (an actual example), I wanted to check quickly, just one time, whether /usr/local/osmosix/MANIFEST.MF was a symlink to /usr/local/osmosix/agent/MANIFEST.MF on ALL hosts. I used Ansible, as that was quicker than (a) writing a CFEngine promise and (b) scheduling a window to deploy it. Ansible is faster for an ad-hoc report like this.
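An ad-hoc check along these lines can be done with Ansible’s stat module. This is a sketch rather than the exact command I ran; the "all" host pattern is a placeholder for whatever your inventory uses:

```shell
# One-time, ad-hoc check: is MANIFEST.MF a symlink, and to where?
# "all" is a placeholder host pattern; adjust to your inventory.
ansible all -m stat -a "path=/usr/local/osmosix/MANIFEST.MF"
# Inspect "islnk" and "lnk_source" in each host's returned stat.
```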

However, if I then wanted to ensure that symlink is present on all hosts (current and future), I’d use CFEngine, as doing it with Ansible would only reach the hosts that were up at that instant.
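The continuous version is a single files promise. A minimal sketch, assuming the CFEngine standard library (which provides the ln_s body) is in your inputs; the bundle name is mine:

```cf3
bundle agent cliqr_manifest_symlink
# Sketch: keep MANIFEST.MF as a symlink to the agent's copy on
# every host, every agent run. Assumes the standard library.
{
  files:
      "/usr/local/osmosix/MANIFEST.MF"
        link_from         => ln_s("/usr/local/osmosix/agent/MANIFEST.MF"),
        move_obstructions => "true";  # replace a plain file if one is in the way
}
```

Because the agent runs periodically, a host that was down today converges tomorrow, and a host built next year converges on its first run.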

So these tools are complementary, and I intend to use them together. For example, CFEngine could keep servers up to date on patches, but I’d use Ansible to reboot specific groups of servers during scheduled maintenance windows (to load the new patchset).

LOPSA Mentorship Program protege Ionut Cadariu earns RHCE

I got a touching letter yesterday from a computer system administrator I’ve been mentoring over the years through the LOPSA Mentorship Program:

On Fri, Oct 16, 2015, Ionut Cadariu wrote:
Hello Aleksey,

I finally achieved my long term objective -> to be Red Hat Certified Engineer…it was a long journey and I wanted to thank you for all the hard work you did with me in order to achieve this.

A big part of my career is because of you and I can’t thank you enough!

Ionut

This letter made me ridiculously happy.

Ionut’s a young man based in Romania and has a family to support. He’s a hard worker and a joy to work with – completes his programs, answers emails in a timely manner, is appreciative, etc.

Thanks, Ionut, and keep up the good work. 🙂

CCNA is next. Get your hands on physical equipment to practice with, and use GNS3 in the meantime.

Senior Sysadmin becomes Director of Operations: Retrospective on Seven Years in Digital Cinema

In my position as Director of Production Systems at Deluxe Digital Cinema in Burbank (June 2013 – September 2014), I was responsible for the Operation and Quality of Alchemy, Deluxe’s theatrical booking and delivery management system. This system booked which movies would play in which cinemas and when, so Deluxe Digital Cinema could deliver the movie to each theatre and then deliver the decryption key so the cinema could play it.

Digital cinema took off faster than anybody predicted. We shut down our Hollywood film lab (a city block in size) and sold the facility, including the data center! We moved services to a backup facility in London; moved equipment from LA to our new primary site in Las Vegas, and moved services from London to Vegas. At the same time, we signed a prolific studio client (hence increased workload); and digital cinema technology marched on (witness Dolby Atmos multidimensional sound), so we had to upgrade in flight to keep up.

To handle the upgrades and migrations in a short time with a small staff, I organized dedicated departments for Infrastructure and Application Operations and secured staff allocations. At first, our Application Operations department existed on paper only, so Infrastructure Operations wore both hats while I procured and trained the App Ops staff. We ended up growing Operations from 2 to 5 staff (2 Infrastructure, 2 App Ops, 1 Manager).

To organize for high production, I put in place:
– an organizing chart listing every function and the products of each area
– an organizing chart listing every post and its functions
– a complete knowledge base enabling 100% competent staff
– process tooling and automation
– a ticketing system

In the meantime, our two-person Quality section tirelessly tested bug fixes and upgrades and automated their testing to get Testing ahead of the curve.

I’m proud of how the teams pulled together to handle a Herculean task. I am grateful to my senior for putting me into this harness (he promoted me from Senior Sys Admin to Director when our prior Director left). Kudos to the Dev team, we worked closely with them every step of the way and couldn’t have done this without them.

After the move, we had a fully staffed and operational Support section, 100% industry adoption of digital cinema and most major studios as our clients. Seeing the high growth phase was over, I decided to take a break and then look for a new challenge. What happened next is a separate story.

P.S. Recently, Deluxe and Technicolor created a joint digital cinema venture, Deluxe Technicolor Digital Cinema. “The unit will be managed by Deluxe and based in Burbank, Calif.” –Variety

Resources
1. “Ops Report Card” by Limoncelli, et al
2. “A Sysadmin’s Guide to Navigating the Business World” by Mark Burgess and Carolyn Rowland
3. “Basics of Organizing” course based on materials by L. Ron Hubbard

CFEngine Bootstrapping Primer

Definitions

Managed server
A server managed by CFEngine. Presumably one of many.

Policy server
A file-sharing service used to distribute policy from
some centralized point to a fleet of servers.

Hub
A commercial add-on to CFEngine that collects reports
from managed servers. These reports are available through
the Web UI (another commercial add-on) and give you instant
insight into the state of your infrastructure.

The policy server, the hub and the Web UI usually run on the
same machine, called the “hub” for short.

Bootstrapping
In CFEngine, bootstrapping establishes the trust relationship
between a policy server/hub and a managed server such that:

a) the managed server can download policy updates from the policy server, and

b) the hub can poll the server for reports.

How it works

Bootstrapping is done by running cf-agent on the managed server
with the --bootstrap (-B) switch, passing the IP address of the
policy server as the argument:

cf-agent -B 1.2.3.4

Running cf-agent with the --bootstrap (-B) switch is saying “yes”
to the SSH-like question, “I haven’t seen the other server before,
are you sure you want to accept its key and connect?”

After exchanging keys (which are stored in /var/cfengine/ppkeys),
the managed node records the hub’s address, and the hub records
the managed node’s address, so that the managed node can download
policy updates from the hub and the hub can download reports from
the managed node.

Data is never pushed. It is always pulled. This is part
of CFEngine’s security model. CFEngine 3 has had zero
major security vulnerabilities (remote compromise).

The managed nodes pull policy from the policy server:

[managed node]   <---policy----   [policy server]

The hub pulls reports from the managed nodes:

[hub]  <---reports----   [managed node]

The CFEngine component that handles inter-node communication is
cf-serverd. That’s the part of CFEngine that listens on a TCP
socket (cfengine/5308) for incoming connections.

For bootstrapping to be possible, the managed node must be able
to connect to the hub on port 5308, and the policy server must be
configured to trust first-time connections from the address range
containing the managed node.
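A quick way to check the connectivity condition from the managed
node before bootstrapping. This is a sketch: it assumes nc
(netcat) is installed, and uses the example hub address from the
successful bootstrap below:

```shell
# From the managed node: is the policy server reachable on
# CFEngine's port? (nc and the example address are assumptions.)
nc -zv 10.10.10.20 5308
```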

Example of a successful bootstrap:

# cf-agent -B 10.10.10.20
R: This autonomous node assumes the role of voluntary client
R: Updated local policy from policy server
R: Started the scheduler
2015-07-08T14:38:12+0000   notice: Bootstrap to '10.10.10.20' completed successfully!
#

Example of an unsuccessful bootstrap. Note especially the last
line, which states the bootstrap failed and we don’t have a usable
promises.cf file (the default input file for CFEngine, containing
configuration policy in the form of CFEngine promises).

# cf-agent -B 10.10.10.10
2015-07-09T02:55:45+0000    error: /default/cfe_internal_update/files/'/var/cfengine/inputs'[0]: No suitable server responded to hail
R: This autonomous node assumes the role of voluntary client
R: Failed to copy policy from policy server at 10.10.10.10:/var/cfengine/masterfiles
       Please check
       * cf-serverd is running on 10.10.10.10
       * network connectivity to 10.10.10.10 on port 5308
       * masterfiles 'body server control' - in particular allowconnects, trustkeysfrom and skipverify
       * masterfiles 'bundle server' -> access: -> masterfiles -> admit/deny
       It is often useful to restart cf-serverd in verbose mode (cf-serverd -v) on 10.10.10.10 to diagnose connection issues.
       When updating masterfiles, wait (usually 5 minutes) for files to propagate to inputs on 10.10.10.10 before retrying.
R: Did not start the scheduler
2015-07-09T02:56:06+0000   notice: /default/cfe_internal_call_update/commands/'"/var/cfengine/bin/cf-agent" -f update.cf'[0]: Q: ".../cf-agent" -f u": 2015-07-09T02:55:45+0000    error: There is no readable input file at '/var/cfengine/inputs/update.cf'. (stat: No such file or directory)
Q: ".../cf-agent" -f u": 2015-07-09T02:55:45+0000    error: CFEngine was not able to get confirmation of promises from cf-promises, so going to failsafe
Q: ".../cf-agent" -f u": 2015-07-09T02:56:06+0000    error: /default/cfe_internal_update/files/'/var/cfengine/inputs'[0]: No suitable server responded to hail
Q: ".../cf-agent" -f u": R: Failed to copy policy from policy server at 10.10.10.10:/var/cfengine/masterfiles
Q: ".../cf-agent" -f u":        Please check
Q: ".../cf-agent" -f u":        * cf-serverd is running on 10.10.10.10
Q: ".../cf-agent" -f u":        * network connectivity to 10.10.10.10 on port 5308
Q: ".../cf-agent" -f u":        * masterfiles 'body server control' - in particular allowconnects, trustkeysfrom and skipverify
Q: ".../cf-agent" -f u":        * masterfiles 'bundle server' -> access: -> masterfiles -> admit/deny
Q: ".../cf-agent" -f u":        It is often useful to restart cf-serverd in verbose mode (cf-serverd -v) on 10.10.10.10 to diagnose connection issues.
Q: ".../cf-agent" -f u":        When updating masterfiles, wait (usually 5 minutes) for files to propagate to inputs on 10.10.10.10 before retrying.
Q: ".../cf-agent" -f u": R: Did not start the scheduler
Q: ".../cf-agent" -f u": 2015-07-09T02:56:06+0000   notice: /default/cfe_internal_call_update/commands/'"/var/cfengine/bin/cf-agent" -f update.cf'[0]: Q: ".../cf-agent" -f u": 2015-07-09T02:56:06+0000    error: There is no readable input file at '/var/cfengine/inputs/update.cf'. (stat: No such file or directory)
Q: ".../cf-agent" -f u": Q: ".../cf-agent" -f u": 2015-07-09T02:56:06+0000    error: CFEngine was not able to get confirmation of promises from cf-promises, so going to failsafe

2015-07-09T02:56:06+0000    error: Bootstrapping failed, no input file at '/var/cfengine/inputs/promises.cf' after bootstrap
#

Can CFEngine compare server configurations?

One question I often hear from sysadmins just getting into configuration management is: can CFEngine report on configuration drift between servers? As in, take a server “A” and baseline it, compare it to server “B”, and produce a report showing where they differ.

Comparing existing servers is the hard way to go about getting consistency.

I have a story about how I got into configuration management.

Back in 2000, I was working at EarthLink, which had about 5 million users. We had 8 mid-range Sun servers (a third of a rack each) handling relay of outgoing email. We noticed one of the servers had 35% better performance than the average, and a couple of servers had 10-20% worse. The rest were +/- 5% or so.

Naturally we wanted to configure all the servers like the leader.

I tried to analyze their configuration to find the magic sauce on the leader. The longer I looked (with ssh “for” loops), the more differences I found: sendmail version, sendmail config settings, OS version, patch sets, kernel tunables. Every server was unique, even though we had a build procedure. NOTHING was consistent.
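The combinatorics of that hunt can be sketched in a few lines: treat each host’s configuration as a flat map of details, and a naive pairwise diff grows with every detail you think to compare. The hosts and values below are made up for illustration:

```python
# Naive config comparison: every key/value pair on a server is a
# potential difference, so the report grows with every detail compared.
def config_diff(a, b):
    """Return {key: (value_on_a, value_on_b)} for every detail that differs."""
    return {k: (a.get(k), b.get(k))
            for k in set(a) | set(b)
            if a.get(k) != b.get(k)}

# Made-up snapshots of two relay hosts:
leader  = {"sendmail": "8.11.0", "patch_set": "2000-09", "tcp_conn_req_max_q": "1024"}
laggard = {"sendmail": "8.9.3",  "patch_set": "2000-03", "tcp_conn_req_max_q": "128"}

for key, (left, right) in sorted(config_diff(leader, laggard).items()):
    print(f"{key}: {left} vs {right}")
```

Which of those differences actually matters is a judgment call no diff can make for you; that is the point of what follows.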

We ended up moving to exim and consolidating to 6 servers with better performance, but that’s another story.

But that’s when I realized there had to be a better way and started looking into configuration management; and that’s when I found CFEngine.

There are a lot of configuration details on a server. Every file is potentially a configuration detail. Every line in every file. Every package. It takes live judgment to decide which configuration details are important; those are the ones you want to manage.

The CFEngine Way is to model the desired configuration in policy; CFEngine will converge your infrastructure to that configuration.

You do have to analyze your configuration to identify what’s important to manage. You have to pick out the senior configuration details in a sea of detail. In our case (back to EarthLink), the exact version of sendmail wasn’t important — what turned out to be important was using exim instead. There is no substitute for live intelligence. It is an essential component of the man/machine hybrid modern civilization is shaping up to be.

Speaking of configuration differences, recently I’ve been enjoying the reporting features of CFEngine Enterprise (free for up to 25 nodes) at scale. You tag the configuration aspect you care about, and the hub polls the servers to inventory it, then summarizes and graphs the results. When upgrading to a new configuration across a fleet of servers, I can see at a glance that the upgrade succeeded on 99% of hosts, and I can drill down to the 1% that require special attention. This helps me get to 100%. No more unique snowflakes – not where it counts.