Searching Certificate Transparency Logs (Part 1)

Abstract

Every TLS certificate issued by a publicly trusted Certificate Authority (CA) ends up in one or more publicly accessible logs. Collectively, these logs make up the Certificate Transparency (CT) ecosystem. Unfortunately, the logs are not very searchable: you can’t easily type in a domain and find all associated certificates.

At CertKit we’re building CT monitoring capabilities to notify our customers when a new certificate is issued. For that reason, and others, we need fast and reliable CT search capabilities.

There are tools available that store the logs in a search-friendly format. Probably the best known is crt.sh. It’s comprehensive (and free!), but it does suffer from some issues. The biggest is that searching is slow. If your query matches many log entries, you will only get a truncated result set back. And fairly frequently the site fails to respond or is down entirely.

That’s why we built our own free certificate transparency search tool!

This series of posts focuses on how we went about building this faster, more reliable Certificate Transparency search for CertKit.

What is Certificate Transparency used for?

The primary use case is to make SSL/TLS certificate issuance visible and auditable: to hold CAs accountable and ensure they’re not mis-issuing certificates.

In addition, end users can find forgotten or outdated infrastructure that’s still getting certificates, see who’s issuing certificates on their behalf, and even figure out if they’re about to miss a certificate renewal by looking at historical records.

You might be surprised at how much the certificate transparency logs can tell you about your own applications!

There are some potential “off-label” uses of CT logs as well.

Because every publicly trusted certificate is recorded, you could also monitor when competitors launch new products, test staging environments, or stand up new services, all by watching the domains they register certificates for. I’ve even heard people hypothesize you could use the logs as a signal when considering stock trades.

How did Certificate Transparency start?

In 2011 a Dutch CA named DigiNotar was compromised, and the attacker issued more than 500 rogue certificates. The fake certificates were used to perform man-in-the-middle attacks.

Following this and other high profile CA issues, the IETF created RFC 6962, which established the certificate transparency protocol (and resulting logs).

How does Certificate Transparency work?

There’s a comprehensive explanation of the certificate transparency protocol on the main CT website, but the important points are these:

Before a real certificate can be issued, a CA needs to submit a precertificate to at least 2 CT logs. The precertificate contains the same data as the real cert, along with a “poison” extension that prevents it from being used.
Upon submitting the precertificate to a log, the log returns a “Signed Certificate Timestamp” (SCT), which is a promise to include the certificate in the publicly accessible log.
The resulting SCTs from each log are included in the final certificate.
When a browser reads the certificate, it checks that the included SCTs come from trusted logs before it considers the certificate valid.
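
The browser-side check in the last step can be sketched as a toy model. This is only an illustration: the `Sct` class, the log IDs, and the `has_enough_scts` function are made up for this post, and real browser policies also verify SCT signatures and apply further rules that are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sct:
    """Hypothetical, simplified stand-in for a Signed Certificate Timestamp."""
    log_id: str        # identifier of the log that issued the SCT
    timestamp_ms: int  # when the log promised to include the certificate


def has_enough_scts(scts, trusted_log_ids, minimum=2):
    """Return True if the SCTs come from at least `minimum` distinct trusted logs."""
    distinct_trusted = {s.log_id for s in scts if s.log_id in trusted_log_ids}
    return len(distinct_trusted) >= minimum


# Made-up log IDs, purely for illustration.
trusted = {"log-a", "log-b", "log-c"}
scts = [Sct("log-a", 1700000000000), Sct("log-b", 1700000000500)]
print(has_enough_scts(scts, trusted))
```

Counting distinct logs rather than raw SCTs means two SCTs from the same log do not satisfy the minimum of two in this model.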

Where are the Certificate Transparency logs?

By design, the certificate transparency ecosystem is distributed. The idea is to ensure there are enough folks hosting and monitoring the logs to catch any bad actors. Many companies in the web PKI space run logs, including Let’s Encrypt, Google, Sectigo and DigiCert. As mentioned earlier, Chrome and other browsers require a minimum of 2 SCTs to consider a certificate valid.

Each browser vendor has a list of CT logs that it considers usable or qualified.

You can find them here:

Chrome
Safari
Firefox - Uses the same list as Chrome, but the link goes to the header file containing them in source.

Log Shards

Because there are so many certificates issued every day, the logs are broken into shards. The shards are split by date. For example, you’ll see that Google’s Argon log has a 2026h1 and a 2026h2 shard. These roughly correspond to the first half and second half of the year. The NotAfter date of the certificate is used to decide which shard it goes to.
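
As a rough sketch of that mapping, the function below turns a NotAfter date into a half-year shard label. The clean January-June / July-December split is an assumption made for illustration, and `shard_for` is a hypothetical helper; each log defines its own exact date range per shard, so treat the published ranges as the source of truth.

```python
from datetime import datetime, timezone


def shard_for(not_after: datetime) -> str:
    """Map a certificate's NotAfter date to a label such as '2026h1'.

    Assumes shards split cleanly at the start of January and July.
    """
    half = "h1" if not_after.month <= 6 else "h2"
    return f"{not_after.year}{half}"


print(shard_for(datetime(2026, 3, 15, tzinfo=timezone.utc)))
print(shard_for(datetime(2026, 9, 1, tzinfo=timezone.utc)))
```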

Types of Logs

There are two types of CT logs. The OG logs are called RFC 6962 logs and the new logs are variously called tiled logs, static-ct logs or sunlight logs, depending on context and how old the information you’re looking at is. Both are based on Merkle trees.
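
To give a feel for the Merkle tree part, here is a minimal sketch of the tree hash as RFC 6962 defines it: leaves are hashed with a 0x00 prefix, interior nodes with a 0x01 prefix, and the list is split at the largest power of two smaller than its length. The byte strings at the bottom are placeholders, not real log entries, and a real log hashes a structured leaf rather than raw bytes.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    """Hash a leaf with the 0x00 domain-separation prefix."""
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Hash an interior node with the 0x01 domain-separation prefix."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_tree_hash(entries: list) -> bytes:
    """Compute the root hash over an ordered list of entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    # k is the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_tree_hash(entries[:k]), merkle_tree_hash(entries[k:]))


# Placeholder entries, purely for illustration.
root = merkle_tree_hash([b"cert-1", b"cert-2", b"cert-3"])
print(root.hex())
```

Because every entry feeds into the root, changing or removing any entry changes the root hash, which is what makes the log tamper-evident.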

It’s beyond the scope of this post to get into the nitty gritty, but if you are interested you could do worse than read Russ Cox’s post on the subject. Let’s Encrypt also wrote a blog post explaining the rationale for switching to tiled logs.

The old RFC 6962 logs suffered from scalability issues and large hosting costs. Tiled logs are a response to those problems. The downside is that the read path for each kind of log is different, which means different approaches are required to scan them. (More on this in the next post.)
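
To make the difference concrete, here is a rough sketch of how the two read paths are addressed. An RFC 6962 log is queried through its `get-entries` endpoint with a start and end index, while a tiled log serves fixed-size data tiles as static files at a path derived from the tile index. The hostnames are placeholders and the function names are ours. The sketch also ignores partial tiles and the fact that a log may return fewer entries than requested.

```python
from urllib.parse import urlencode

TILE_WIDTH = 256  # entries per full data tile in a tiled log


def rfc6962_get_entries_url(log_url: str, start: int, end: int) -> str:
    """RFC 6962 read path: request entries start..end (inclusive) from the log server."""
    query = urlencode({"start": start, "end": end})
    return f"{log_url.rstrip('/')}/ct/v1/get-entries?{query}"


def tile_index_path(n: int) -> str:
    """Encode a tile index as 3-digit path elements, all but the last prefixed with 'x'."""
    groups = []
    while True:
        groups.append(f"{n % 1000:03d}")
        n //= 1000
        if n == 0:
            break
    groups.reverse()
    return "/".join([f"x{g}" for g in groups[:-1]] + [groups[-1]])


def data_tile_url(monitoring_prefix: str, entry_index: int) -> str:
    """Tiled read path: URL of the full data tile that holds the given entry."""
    tile = entry_index // TILE_WIDTH
    return f"{monitoring_prefix.rstrip('/')}/tile/data/{tile_index_path(tile)}"


# Placeholder hostnames, purely for illustration.
print(rfc6962_get_entries_url("https://ct.example.com/log", 0, 255))
print(data_tile_url("https://tiles.example.com", 1000))
```

The practical consequence is that an RFC 6962 scanner pages through an API, while a tiled-log scanner fetches files.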

How big are the logs?

The short answer is very big. There are billions of certificates stored in the various logs, and there will be even more when certificate lifetimes fall to 47 days. You can explore the data a bit using Cloudflare’s helpful Radar website. At the time of this writing there were 96 million unique certificates and 103 million unique precertificates issued in the last 7 days. It’s a lot of data.
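
For a sense of scale, a quick back-of-the-envelope calculation turns those weekly numbers into a rate. It assumes certificates and precertificates each count as separate log entries and that issuance is spread evenly across the week, neither of which is exactly true.

```python
certs = 96_000_000      # unique certificates in the last 7 days
precerts = 103_000_000  # unique precertificates in the last 7 days

total = certs + precerts
seconds_per_week = 7 * 24 * 60 * 60

per_day = total / 7
per_second = total / seconds_per_week

print(f"{per_day:,.0f} entries per day")
print(f"{per_second:,.0f} entries per second")
```

That works out to roughly 28 million entries per day, or a bit over 300 per second, that a scanner has to keep up with just to stay current.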

Conclusion

There’s a lot of interesting data in the Certificate Transparency logs, but it’s not stored in a very searchable format. Our goal in the next post is to start scanning the logs and pulling the data. Given the sheer volume of data in the logs this is not a trivial task. The fact that there are two separate types of logs also makes things more complicated.