Abstract

Certbot is good software in the classic Linux tradition: it does one thing simply and expects you to chain it together with everything else. One server, one certificate, done.

The trouble is that most environments are not simple. And the moment yours isn’t, you discover that renewing a certificate and getting it deployed are two different problems, and deployment is your problem.

Certificate renewal isn’t enough

In April 2021, Epic Games had a wildcard certificate expire across hundreds of backend services. Their monitoring caught it within 12 minutes. Recovery took 5.5 hours and involved 25 people.

Renewing the certificate only took a few minutes. The hard part was getting the new certificate to every service that needed it, confirming each one had picked it up, and dealing with the cascading failures that had already started.

They had certificate automation. It just didn’t handle deployment well.

This is not an unusual story. A 2024 PKI survey found organizations averaged three certificate-related outages over 24 months. In almost every case, the certificate renewed fine. Distribution is where things fell apart.

Endpoints that can’t run ACME

You have endpoints that need a certificate, but can’t run Certbot or speak ACME.

Network appliances are the obvious case: VPNs, load balancers, firewalls, legacy mail servers. These devices terminate TLS and need valid certificates, but they have no way to complete ACME validation: they can't serve HTTP-01 challenge files, they can't update DNS records for DNS-01, and you usually can't install software on them.

Other servers could theoretically run Certbot, but shouldn’t. HTTP-01 validation requires port 80 to be publicly reachable. DNS-01 validation requires API credentials to your DNS provider, and handing those out to every server in your infrastructure is exactly the kind of credential sprawl that we’re trying to avoid.

The practical answer is a centralized server that handles ACME. One machine runs Certbot, holds the credentials, handles the validation.
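A sketch of what that centralized server might run, using Certbot's DNS-01 plugin support; the plugin, credentials path, domains, and hook script here are all placeholders, not a prescription:

```shell
# Hypothetical centralized issuance. This one machine holds the DNS
# API credentials and performs DNS-01 validation for every name;
# nothing else in the fleet ever sees those credentials.
certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d vpn.example.com -d mail.example.com \
  --deploy-hook /usr/local/bin/distribute-certs
```

The deploy hook runs only after a successful issuance or renewal, which makes it a natural trigger for the distribution step that the rest of this article is about.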

Now you have to figure out how to deploy certificates to other devices.

One other wrinkle: as certificate lifetimes shorten, your centralized Certbot server needs to run more frequently, especially to comply with ARI (ACME Renewal Information). Run it often enough across enough names and you're going to hit Let's Encrypt rate limits, and things are going to fail, often at the worst possible moment.

Centralizing certificate orchestration

Trying to run certbot across multiple servers is where most automation plans hit a wall. The answers on the Let’s Encrypt forums are all variations of the same thing:

just write a script.

But what does that actually mean? You have to figure out how to transport the certificate files to where they need to go, what format each endpoint needs them in, and how to make the software reload them. And do all of that inside a change window so it doesn't break things for users. And it should be secure. And backed up. And monitored for when it breaks.

There’s a lot in “just write a script”.

Deploying SSL certificates to multiple servers

A network share is a common approach. Mount the certificate directory over NFS or map a Windows shared folder. Endpoints pull certificates when they want to update.

With pull, you put the certificates and keys on the share and make every endpoint responsible for polling it for changes. When a new certificate appears, each endpoint had better keep a local copy of it; otherwise the Certbot server becomes a single point of failure for your whole infrastructure the moment it goes down.

Alternatively, you could push certificates out via authenticated transport like SSH or SCP. With push, you’ve centralized the update logic, but your Certbot server needs skeleton-key credentials to reach every machine in your infrastructure.
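The pull side can be sketched as a small script on each endpoint; all paths and the reload command below are illustrative assumptions, not fixed conventions:

```shell
#!/bin/sh
# Sketch of the pull side of a network-share design. Each endpoint
# keeps its own local copy, so losing the share (or the Certbot
# server) does not take TLS down with it.
pull_cert() {
    share="$1"       # mounted share, e.g. /mnt/certs/example.com
    local_dir="$2"   # endpoint-local copy, e.g. /etc/ssl/local
    reload_cmd="$3"  # e.g. "nginx -s reload"

    mkdir -p "$local_dir"
    # No change on the share means nothing to do.
    if cmp -s "$share/fullchain.pem" "$local_dir/fullchain.pem"; then
        return 0
    fi
    # Copy under temporary names, then rename, so a reader never
    # sees a half-written certificate.
    cp "$share/fullchain.pem" "$local_dir/fullchain.pem.tmp" || return 1
    cp "$share/privkey.pem" "$local_dir/privkey.pem.tmp" || return 1
    mv "$local_dir/fullchain.pem.tmp" "$local_dir/fullchain.pem"
    mv "$local_dir/privkey.pem.tmp" "$local_dir/privkey.pem"
    $reload_cmd
}

# Typical cron cadence on each endpoint (illustrative):
#   */30 * * * *  pull-cert-script
```

Note the atomic rename and the change check: the reload only fires when the certificate actually changed, so a flapping cron job doesn't bounce your services every half hour.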

Backup

If your Certbot server dies, could you rebuild it? Do you have the ACME account keys backed up? The private keys?

If the backup is encrypted (and it should be), who holds the decryption key? If it goes offsite on physical media, are your private keys riding around in someone’s bag?

Have you ever actually restored from it? You don’t have a backup until you’ve verified it restores. Most teams find this out at the worst possible time.

Automation failure

You need to make sure your centralized Certbot server continues to do its job. If certbot renew returns errors, is anyone listening? An rsync job exits 0 whether it transferred 50 files or none.

Expand your monitoring to watch the logs and exit codes, and make sure that certificate files are being updated.
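A minimal sketch of that second check, assuming POSIX sh and openssl on the Certbot server; notify() and the paths in the comments are placeholders for your own tooling:

```shell
#!/bin/sh
# Sketch of renewal monitoring on the centralized Certbot server.
# notify() is a placeholder; wire it to whatever actually pages you.
notify() { echo "ALERT: $1" >&2; }

# Succeeds only if the certificate file exists and stays valid for
# at least min_days more days; alerts otherwise.
check_cert_freshness() {
    cert="$1"; min_days="$2"
    if ! openssl x509 -in "$cert" -noout \
         -checkend $((min_days * 24 * 3600)) >/dev/null 2>&1; then
        notify "$cert is missing or expires within $min_days days"
        return 1
    fi
}

# In the real cron job, check the renew exit code first:
#   certbot renew --quiet || notify "certbot renew exited non-zero"
# then verify every certificate that should have been refreshed:
#   check_cert_freshness /etc/letsencrypt/live/example.com/fullchain.pem 14
```

The point of the second check is that it catches failures the exit code can't: a renew run that "succeeded" but skipped the certificate you cared about still trips the freshness alarm.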

Every endpoint is different

The next step is actually updating your software to read the new certificates.

There are dozens of certificate formats. Nginx wants PEM. Windows and IIS want PFX. Java applications want JKS or PKCS#12. Some appliances have their own proprietary formats. If your infrastructure is heterogeneous, and most real infrastructure is, your distribution scripts need to handle format conversion for each destination.
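For the most common conversions, stock openssl is enough. A sketch, using a throwaway self-signed pair in place of the real renewed certificate (the filenames and password are placeholders):

```shell
# Sketch of per-endpoint format conversion with stock openssl.
# A throwaway self-signed pair stands in for the renewed certificate;
# in practice the inputs live under /etc/letsencrypt/live/<name>/.
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout privkey.pem -out fullchain.pem \
    -subj "/CN=example.com" -days 30 2>/dev/null

# Windows/IIS endpoints: bundle key and chain into a PFX.
# The export password here is a placeholder.
openssl pkcs12 -export -out site.pfx \
    -inkey privkey.pem -in fullchain.pem -passout pass:changeit

# Java endpoints: modern JVMs read PKCS#12 keystores directly; a
# legacy JKS keystore would be produced from the PFX with keytool:
#   keytool -importkeystore -srckeystore site.pfx -srcstoretype PKCS12 \
#           -destkeystore site.jks
```

Each destination in your distribution scripts ends up owning one of these conversion stanzas, which is exactly how "just write a script" quietly grows.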

File paths vary too. Nginx might expect the cert at /etc/nginx/ssl/example.com.crt. Your mail server has its own convention. Your legacy app has the path hardcoded in a config file that nobody has touched in four years. Each endpoint has its own expectation about where the file goes, what it’s named, and what permissions it needs.

Then, you need to reload the service. Dropping a new certificate file on disk is not enough. The service needs to know the file changed. Nginx needs nginx -s reload. IIS needs its bindings updated. Some services need a full restart. Some appliances require an explicit import step through their API.

And when does this happen? Some services, like Microsoft RRAS, break sessions when you update the certificate. If a renewal happens in the middle of the day and kicks everyone off the VPN, that's probably not okay.

So you need to separate certificate renewal from reloading, and queue changes for the next update window. Now you have a certificate sitting on disk that is not being served, and you need a separate mechanism to track that state, schedule the reload, confirm it ran, and alert if it didn’t. That mechanism is its own small system to build and maintain.
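That state tracking can start as small as a marker file. A sketch, assuming POSIX sh; the marker path, cron schedule, and reload command are placeholders:

```shell
#!/bin/sh
# Sketch of separating renewal from reload. The renewal hook records
# that a reload is owed; a job that only runs inside the change
# window applies it and clears the queue.
MARKER="${MARKER:-/var/run/cert-reload-pending}"

# Called from the renewal deploy hook: note that a reload is pending.
stage_reload() { touch "$MARKER"; }

# Called from cron inside the change window, e.g. Sundays at 03:00:
#   0 3 * * 0  apply-pending-reload-script
apply_pending_reload() {
    reload_cmd="${1:-nginx -s reload}"
    [ -f "$MARKER" ] || return 0           # nothing pending
    if $reload_cmd; then
        rm -f "$MARKER"                    # reload ran; clear the queue
    else
        echo "ALERT: deferred reload failed; marker left in place" >&2
        return 1
    fi
}
```

Even at this size, the design decisions show: a failed reload leaves the marker in place so the next window retries, and a window with nothing pending does nothing. Everything past this — confirming the reload ran, alerting when it didn't — is the extra system the paragraph above describes.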

Monitoring: because software breaks

Let’s say you built all of this. You have a centralized Certbot server, a distribution mechanism, format converters, per-host reload commands, change window logic, and encrypted backups.

You still need to verify it stays working.

Not just “did the certificate land on disk”. The distribution script could succeed and deposit a malformed certificate or the wrong certificate. The service reload could fail silently. The endpoint could be serving the old certificate from memory while the new one sits on disk waiting for a restart that never happened.

Verification means actually connecting to the endpoint and checking what it serves. A real TLS handshake against the public hostname, checking the expiry, the subject, the SANs, and the chain. If that check passes, distribution worked. If it doesn’t, you need to know about it before your users do.
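A sketch of such a check, built on openssl s_client; the host, port, and 14-day threshold are assumptions to adjust for your environment:

```shell
#!/bin/sh
# Sketch of an external serving check: a real TLS handshake against
# the endpoint, then assertions on the certificate it actually
# serves, not whatever file happens to be on disk.
check_served_cert() {
    host="$1"; port="$2"; min_days="$3"
    # Grab the certificate the endpoint is serving right now.
    served=$(echo | openssl s_client -connect "$host:$port" \
        -servername "$host" 2>/dev/null | openssl x509 2>/dev/null) || {
        echo "ALERT: no certificate served by $host:$port" >&2
        return 1
    }
    # Fail if what is being served expires within min_days.
    printf '%s\n' "$served" | openssl x509 -noout \
        -checkend $((min_days * 24 * 3600)) >/dev/null || {
        echo "ALERT: $host:$port certificate expires within $min_days days" >&2
        return 1
    }
    # Log what we saw: subject and expiry of the served certificate.
    printf '%s\n' "$served" | openssl x509 -noout -subject -enddate
}
```

A fuller version would also compare the SANs and chain against what you expect; the essential property is the same either way: the check talks to the endpoint, not to the distribution machinery.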

The 47-day certificate lifetime amplifier

Certificate lifetimes are shrinking. The CA/Browser Forum has mandated 47-day maximum lifetimes by 2029, with intermediate steps at 200 days in March 2026 and 100 days in March 2027.

Every distribution problem you have today gets worse in proportion to how often it runs. At annual renewals, a flaky rsync script fails once a year and you fix it. At 47-day renewals, it fails eight times a year, and one of those failures lands on a holiday weekend.

This is not a hypothetical future problem. The 200-day era started this month. The distribution infrastructure you have right now needs to run reliably twice as often as it did last year.

What good certificate distribution actually looks like

The architecture that works is not complicated to describe. Issuance is centralized. Endpoints do not run ACME clients or hold DNS credentials. They subscribe to the certificates they need, pulling updates automatically when something changes. Format conversion happens at the endpoint, tailored to what each piece of software expects. Service reloads are scheduled within change windows you define. External monitoring checks that each endpoint is actually serving the new certificate, not just that a file was written somewhere.

The honest conversation that comes after describing that architecture is whether you want to build it.

You can. Plenty of teams have, assembling it from Certbot, hourly cron jobs, rsync scripts, format converters, monitoring hooks, and institutional knowledge held by whoever set it up. It works, mostly, until someone leaves, or a dependency changes, or the cron job starts failing silently and nobody notices until a certificate expires.

That assembled system is your own bespoke certificate management platform. You built it incrementally and it doesn’t feel like a platform, but that’s what it is. And unlike software your team ships to customers, it generates no revenue and attracts no investment. It just needs to keep working.

Whether building and maintaining that is the right use of your team’s time is a question worth asking before your next 2am page.


CertKit automates certificate lifecycle management so certificate distribution is someone else’s problem.
