Todd's Tenth Rule of certificate automation

Abstract

I’m an old engineer at heart. Many of my ideals were formed by Joel’s Things You Should Never Do, Fred’s No Silver Bullet, and Brian’s Big Ball of Mud. One of my favorites was Greenspun’s Tenth Rule:

Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

The joke isn’t really about programming languages. It’s about a pattern: certain problems have a shape, and no matter how you approach them, you end up building the same solution, in the same order, until you arrive at the same messy place. That place looks a lot like something that already existed before you started.

Certificate management has exactly that shape. I’m ready to name it.

Todd’s Tenth Rule of certificate automation

Any sufficiently complicated SSL certificate script contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of a certificate lifecycle management system.

Just like Greenspun, I don’t have the other nine rules yet. But this one stands. Here’s how it happens.

Phase 1: the first win

You install Certbot on a Linux web server. It renews the certificate from a cron job and validates over HTTP. It’s easier than you thought, and you call it a win.

Phase 2: the second server

You do it again for another server. This one’s different: an old Tomcat app that needs the certificate in a Java keystore. You have to move files around, convert formats with OpenSSL, and set passwords. Now you own a 50-line deploy hook script.
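For a sense of what that hook looks like, here is a minimal sketch in Python. It assumes the Tomcat connector can read a PKCS#12 keystore (a JKS file would need an extra keytool step), that the keystore password is in an environment variable I've called KEYSTORE_PASSWORD, and that the keystore path is whatever your server uses. The path and variable name here are hypothetical. RENEWED_LINEAGE is the variable Certbot sets for deploy hooks.

```python
import os
import subprocess
from pathlib import Path


def pkcs12_command(lineage: Path, keystore: Path, alias: str = "tomcat") -> list[str]:
    """Build the openssl command that bundles cert, chain, and key into PKCS#12."""
    return [
        "openssl", "pkcs12", "-export",
        "-in", str(lineage / "fullchain.pem"),
        "-inkey", str(lineage / "privkey.pem"),
        "-out", str(keystore),
        "-name", alias,
        # Read the keystore password from the environment, not the command line.
        "-passout", "env:KEYSTORE_PASSWORD",
    ]


def main() -> None:
    # Certbot points RENEWED_LINEAGE at the renewed certificate's live directory.
    lineage = Path(os.environ["RENEWED_LINEAGE"])
    keystore = Path("/opt/tomcat/conf/keystore.p12")  # hypothetical location
    subprocess.run(pkcs12_command(lineage, keystore), check=True)
    # Restarting or reloading Tomcat is site-specific and left out of this sketch.


# Only act when Certbot has actually invoked this script as a deploy hook.
if __name__ == "__main__" and "RENEWED_LINEAGE" in os.environ:
    main()
```

This is the short version. The other 30 lines of the real script are file permissions, backups of the old keystore, and the restart logic.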

Phase 3: wildcards and DNS credentials

The next server hosts dozens of sites on subdomains, so you want a wildcard certificate. HTTP validation won’t work anymore, because wildcards require DNS validation. You spend an afternoon in your DNS provider’s documentation figuring out DNS-01 validation.

Your first set of DNS credentials goes into the script. You don’t think too hard about what else those credentials can do. The cert got issued, so who cares?
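If you're curious what DNS-01 actually asks for, it's a single TXT record. The sketch below follows the ACME spec (RFC 8555): the record lives at _acme-challenge under the base domain, and its value is the base64url-encoded SHA-256 of the key authorization. The function names are mine, and the token and thumbprint in the usage line are made-up placeholders. Your ACME client normally does all of this for you.

```python
import base64
import hashlib


def dns01_record_name(domain: str) -> str:
    """Name of the TXT record; a wildcard is validated at its base domain."""
    base = domain[2:] if domain.startswith("*.") else domain
    return f"_acme-challenge.{base}"


def dns01_record_value(token: str, account_thumbprint: str) -> str:
    """base64url(SHA-256(key authorization)) with the padding stripped."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Placeholder inputs, for illustration only.
name = dns01_record_name("*.example.com")
value = dns01_record_value("placeholder-token", "placeholder-thumbprint")
```

The math is the easy part. The hard part is that the API key that writes this one TXT record can usually rewrite every other record in the zone too.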

Phase 4: the distribution problem

Next you have a Microsoft Exchange cluster. The cert renewed on one node, but all three nodes need it. How do you get it there?

You figure out a shared folder, or a scheduled task that copies the files after renewal. Something. The cert has to get from where Certbot wrote it to everywhere else that needs it, and Certbot has nothing to say about that part. The official guidance is essentially: figure it out yourself.

So you do. You’ve now built certificate distribution. It works until someone changes a folder permission, or the service account password rotates, or a new node gets added to the cluster and nobody updates the script.
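The homegrown version tends to look something like this sketch. It assumes every node is reachable as a directory path (a mounted share, say) and it only copies the file. Importing the certificate into Exchange and binding it to services is a separate step that isn't shown. The function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def distribute(source: Path, target_dirs: list[Path]) -> dict[str, str]:
    """Copy source into every target directory; return {target: error} for failures."""
    expected = sha256_of(source)
    failures: dict[str, str] = {}
    for target_dir in target_dirs:
        destination = target_dir / source.name
        try:
            shutil.copy2(source, destination)
            # Verify the copy instead of trusting that it worked.
            if sha256_of(destination) != expected:
                failures[str(target_dir)] = "checksum mismatch after copy"
        except OSError as exc:
            # Missing share, changed permission, expired service account, and so on.
            failures[str(target_dir)] = str(exc)
    return failures
```

Notice that the list of targets is hardcoded by whoever calls it. That's the new node nobody remembers to add.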

Phase 5: the outage

A certificate expires anyway, at 2am. Some service didn’t pick up the new cert because the configuration was pointing at the old path. Nobody knew until users started complaining.

Embarrassed, you add logging to every job. You add a check that emails you when a cert has less than 30 days left.
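A minimal version of that check, using only Python's standard library, might look like the following. The function names and the 30-day default are my assumptions. The design choice worth copying is to ask the running service which certificate it is serving, rather than reading the file on disk, since a service pointing at the old path is exactly what caused the outage.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_remaining(not_after: str, now: datetime) -> float:
    """Days between `now` and a notAfter string like 'Jan 31 00:00:00 2030 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_alert(not_after: str, now: datetime, threshold_days: int = 30) -> bool:
    """True when the certificate has less than threshold_days left."""
    return days_remaining(not_after, now) < threshold_days


def served_not_after(host: str, port: int = 443) -> str:
    """Fetch notAfter from the certificate the server is actually presenting.

    An already-expired certificate fails verification and raises instead,
    which the caller should treat as an alert in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A cron job would call needs_alert(served_not_after(host), datetime.now(timezone.utc)) for each host and send the email. Which hosts, and who gets the email, is now one more list to maintain.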

Phase 6: the spreadsheet

You need to pass a security audit. Who owns these certificates? Who authorizes the rotation? You have a spreadsheet now. Every certificate, when it renews, which team owns it, what to check when something goes wrong.
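Written down as code, the spreadsheet is just a record type and a CSV export. This is a sketch with column names I've made up from the questions above, not a schema anyone should adopt.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class CertificateRecord:
    common_name: str
    renews_on: str   # ISO date of the next scheduled renewal
    owner_team: str
    runbook: str     # what to check when something goes wrong


def to_csv(records: list[CertificateRecord]) -> str:
    """Render the inventory as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(CertificateRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()
```

Nothing in it knows whether a row is still true. Keeping it accurate is a manual job.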

It looks a lot like the certificate management system you said you didn’t need.

You arrived at the same destination, just slowly and one outage at a time.

Certbot handled your renewal. You kludged together distribution and monitoring. Maybe you built auditing too. Tom built that part, actually, and he left last March.

What you didn’t build is the operational layer. A single place to see every certificate, where it’s deployed, when it was last verified, and who changed it. Something you can show your CISO to prove everything is fine, or check at 2am to see what’s broken.

You also didn’t build a way to stay current. Certificate management isn’t static. ACME gets updated. Certificate lifetimes keep shrinking. CAs have security incidents that force emergency revocations. Your DNS provider changes their API and breaks your validation. Every one of those is a maintenance event for your scripts. Someone has to read the specs, test the edge cases, and update the jobs before they quietly start failing.

That’s the build vs. buy problem. You didn’t just build a certificate manager once. You signed up to maintain one forever.

You should have just used Lisp

Greenspun’s point wasn’t that Lisp is magic. It’s that the solution already existed, and all that effort reimplementing pieces of it, poorly, could have been skipped.

My rule is similarly humble. If your organization needs certificate lifecycle management, just buy certificate lifecycle management and spend your energy on the things that actually differentiate your business.

Your worst option is the one you’re hacking together with scripts, shared drives, and cron jobs. There are lots of CLM tools, but CertKit is one worth looking at.