Abstract
Clickhouse is an incredible database. Here at Certkit, we’ve long worked in the world of “No SQL” databases like Elasticsearch precisely for their ability to query large amounts of data. But for every database, there’s an amount of data that’s “Too big”. Too big to query quickly or too big to store affordably. Clickhouse manages to thread the needle by efficiently storing truly ridiculous amounts of data while still providing impressive query performance.
Certkit is building Certificate Transparency (CT) log monitoring for our customers. In previous posts, we learned what CT logs are and how to pull data from them. Now, we need to store all that data in Clickhouse. This is no small task. About 100 million new certificates enter the CT logs every week!
This is the final post in a series on how we built a fast, reliable Certificate Transparency Search for CertKit.
What Is Clickhouse?
In case you missed it the first time, Clickhouse is a remarkable database. While it is SQL based, don’t let that fool you: Clickhouse is not your father’s database. Its performance is what we’ve come to associate with No SQL databases in recent years.
Column Based
Clickhouse tables are column-based, as opposed to the usual row-based SQL database. A “column-oriented” database simply means all data for a column is stored together, separate from other columns in the same table. Clickhouse uses this column orientation to do two big things: store data efficiently and query it quickly.
Efficient Storage
A column’s data is all the same type, and entries are often similar to each other. Clickhouse takes advantage of this fact to run compression over each column individually. Compression ratios range from essentially none for random data up to 1000x or more for highly repetitive data.
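You can see these ratios for yourself. As a rough sketch (the database and table names here are placeholders, not our production schema), Clickhouse’s system.columns table reports compressed and uncompressed sizes per column:

SELECT
    name AS column_name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = 'default' AND table = 'ct_log_entries'
ORDER BY data_compressed_bytes DESC;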
Fast Queries Over Large Data Sets
Querying a column also benefits from Clickhouse’s columnar nature. Because a column’s values are co-located, they can be read very efficiently. The impressive compression often helps as well: when the size on disk is so small, it’s easy to read a large number of rows with relatively little disk activity.
This only touches the surface of what makes Clickhouse unique, but suffice it to say, it can hold lots of data in a small amount of space and query it very quickly.
Goals
We have three main goals for our CT monitoring, and each one greatly shapes our database schema:
Fast Search
Both CT log alerting and searching depend on one main feature: fast domain name searches. Without fast results, interactive searches will be too slow and alerts will be delayed or broken. Our schema needs to sell out to make search fast (Foreshadowing!).
Completeness Needed For Alerts
We want to alert the customer when certificates are created for their domain. Completeness is important; otherwise, alerts won’t be sent when they should be. Therefore, we need to store any and all recently issued certificates.
Affordable
We have an unused server that would be a great start for this service. Unfortunately, it only has 2.5TB of SSD disk space. A space-efficient schema is needed to fit all 3+ billion certificates issued in the last year into this small amount of space.
Using Clickhouse To Store Every SSL Certificate
We’d like to know about every certificate that is issued. The potential size of this dataset greatly influenced our decision to use Clickhouse.
ReplacingMergeTree
A certificate has many entries spread across different CT Log servers. These entries are purposely stored in multiple logs for redundancy. For our purposes, it is good enough to have just one reference to a CT log for every certificate. Deduplicating these entries saves on table size and query time, both of which we care about very much.
Clickhouse offers an easy way to deduplicate our overly numerous log entries. The ReplacingMergeTree table engine deduplicates rows using the table ordering as a uniqueness key. One catch: deduplication is not guaranteed and is performed in the background or not at all, depending on the situation. This is usually a fine tradeoff, but needs to be kept in mind when writing queries.
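When a query truly needs deduplicated results despite that caveat, the FINAL modifier asks Clickhouse to merge duplicates at query time, at some extra cost. A minimal sketch, using a placeholder table name:

-- Counts may include not-yet-merged duplicates:
SELECT count() FROM ct_log_entries;

-- FINAL forces deduplication at query time (slower, but exact):
SELECT count() FROM ct_log_entries FINAL;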
Table Ordering
Table ordering determines how the ReplacingMergeTree identifies duplicates. It also has storage size implications: sorting similar rows near each other gives more opportunity for efficient compression.
We settled on ordering by SerialNumber and SHA256: ORDER BY (SerialNumber, SHA256).
Certificate Serial Numbers
Primarily sorting by SerialNumber enables fast lookups of specific certificates and increases compression ratios by placing all rows about a single certificate next to each other. It has the added bonus of letting us correlate pre-certificate and issued certificate log entries.
However, there is one catch: A certificate’s serial number is not guaranteed to be globally unique. It is only required to be unique for a specific issuer. From our data, serial numbers are relatively unique and collisions are uncommon. We can verify we’ve gotten the correct certificate by checking that the returned row has the expected issuer or SHA256 hash. This inconvenience is outweighed by the ability to associate a certificate’s pre-certificate and final certificate entries.
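As a sketch of what this looks like in practice (the table name, column names, and serial value below are illustrative), a point lookup by serial number returns both the pre-certificate and final certificate entries, and the hash lets us confirm we found the right certificate:

-- Returns the pre-certificate and issued certificate rows for one serial.
SELECT IsPrecert, SHA256, LeafIndex
FROM ct_log_entries
WHERE SerialNumber = '04AD1EC2B3F45A6978BC01D2E3F4A5B6';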
SHA256
Log entries come in two types: pre-certificate and issued certificate. Only a pre-certificate entry is guaranteed to appear in a log, but many issued certificate entries exist as well. We’d like to keep one example of each for every certificate. In practice, we do this by ordering the table secondarily by SHA256, which ensures we keep at least one entry each for the pre-certificate and the issued certificate.
Partitioning and Deletes
Our completeness goal only covers recently issued certificates, so we’re perfectly happy to save space by deleting data for old, expired ones. Here lies a small problem: Clickhouse deletes are very expensive because rows are immutable once written. The only way to update or delete a row is to rewrite the entire file containing it, which is slow.
What is fast for Clickhouse is deleting an entire file containing many rows. Partitioning tells Clickhouse how rows should be divided between files. Our table is partitioned by the certificate’s expiry date (PARTITION BY toYYYYMM(CertificateExpiryDate)), letting us easily drop a month’s worth of certificates without thrashing the disk.
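Putting the pieces so far together, a sketch of the main table might look like the following. The table name, column names, and types are our illustrations, not the exact production schema:

CREATE TABLE ct_log_entries
(
    SerialNumber          String,
    SHA256                FixedString(32),
    IsPrecert             UInt8,
    LeafIndex             UInt64,
    LogURL                String,
    Domains               Array(String),
    CertificateExpiryDate Date
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(CertificateExpiryDate)
ORDER BY (SerialNumber, SHA256);

-- Dropping a whole month of expired certificates is a cheap metadata operation:
ALTER TABLE ct_log_entries DROP PARTITION 202401;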
(Not) Storing Raw Certificates
TLDR: We don’t store the raw certificate bytes. This might seem odd, because the whole point is to know about every issued certificate. The problem is, certificates are HUGE.
A single certificate is just a few KB in size, but a few KB multiplied by 3 billion certificates works out to several terabytes of raw certificate data alone, far more than our 2.5TB disk. We avoid the storage cost by not storing the certificate at all.
Retrieving A Certificate
We do need the original, raw certificate from time to time. Luckily, the full certificate is stored as part of each CT Log entry. The Leaf Index (which we DO store and is just a single integer) allows us to ask the log for the entry along with the original certificate.
Searching Every SSL Certificate In Clickhouse, Very Quickly
So we’ve stored a CT log entry for every certificate, but we have not addressed how to quickly search certificates by their associated domains. Running a domain query on our large log entries table results in a full table scan. We need something much faster than that.
Skip Indexes Have Limits
Skip indexes are Clickhouse’s answer to secondary indexes for cases where the table’s sort order is not conducive to the query. We could use a skip index on the domain names column. Clickhouse uses these indexes to skip blocks of data where it knows a value does not exist. Skip indexes still result in some table scanning, but it is greatly reduced. This almost works for our use case, but it results in unpredictable query times for large result sets. For example, UPS.com has generated over 3 million certificates in the last year; a query for UPS takes many times longer to complete than one for a domain with just a few certificates.
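For reference, a skip index on the domains column might look something like this bloom-filter sketch (assuming a Domains Array(String) column like the one in the table sketch above; the index name and parameters are illustrative):

ALTER TABLE ct_log_entries
    ADD INDEX domains_bloom Domains TYPE bloom_filter(0.01) GRANULARITY 4;

-- The index only applies to newly written parts until existing data is materialized:
ALTER TABLE ct_log_entries MATERIALIZE INDEX domains_bloom;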
A Table Optimized For Search
When in doubt, make more tables. To get consistent query times, we opted for a denormalized table created specifically for domain name searching. This additional table is denormalized to a row per SAN (aka domain name) and is also ordered by SAN. That means there are often several rows in the search table for a single certificate in our main table. Ordering the entire table by domain name lets us take full advantage of the table’s primary index.
Reversing Strings for Performance
Storing a string in reversed order increases the performance of some LIKE queries. Our search patterns generally look like “Find all certificates where the domain is *.example.com” or “Find all certificates ending with *.vpn.example.com.” In SQL, this query looks something like WHERE Domain LIKE '%example.com' or WHERE Domain LIKE '%vpn.example.com'. Clickhouse reverts to table scanning instead of using the primary index because it doesn’t know how the domain begins!
Clickhouse can use the primary index in LIKE queries when the wildcard is at the end. Storing the domains in reversed order lets us run queries where the wildcard is at the end of the string. example.com is stored as moc.elpmaxe and the query changes to: WHERE Domain LIKE 'moc.elpmaxe%'. To make a more readable query, Clickhouse will happily reverse the string for you: WHERE Domain LIKE reverse('%example.com').
Significant Domains
When we search domain names, we don’t think it’s useful to match a search like “*.com”. We need to match at least the whole first “significant domain”. A simple test of whether something is a significant domain is, “Could I buy this domain at a registrar?”. nyt.com or bbc.co.uk are both significant domains even though one contains an additional level.
Clickhouse has a built-in function to extract the significant domain. However, doing it on the fly will again affect how well we can use the table’s indexes. Saving off the significant domain into its own column lets us improve our query performance:
SELECT cutToFirstSignificantSubdomain('www.nyt.com');
SELECT cutToFirstSignificantSubdomain('www.bbc.co.uk');
-- RESULT --
'nyt.com'
'bbc.co.uk'
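Putting the search table together, a sketch might look like the following. The names, types, engine choice, and especially the exact key order are our guesses for illustration, not the exact production schema:

CREATE TABLE certificate_search
(
    SignificantDomain     String,  -- e.g. 'example.com'
    ReversedDomain        String,  -- e.g. 'moc.elpmaxe.npv'
    SerialNumber          String,
    CertificateExpiryDate Date
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(CertificateExpiryDate)
ORDER BY (SignificantDomain, ReversedDomain, SerialNumber);

-- A search for anything under vpn.example.com pins the significant domain
-- exactly and uses the index for the reversed-prefix match:
SELECT SerialNumber
FROM certificate_search
WHERE SignificantDomain = 'example.com'
  AND ReversedDomain LIKE reverse('%vpn.example.com');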
Materialized Views
We’d rather not change our processing code to insert every entry into two tables. Clickhouse can do it for us with materialized views. A materialized view takes new data inserted into one table and inserts it, in some modified form, into a second table.
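A sketch of such a view, again with illustrative names and assuming the main table carries its SANs in a Domains Array(String) column as in the earlier sketch:

CREATE MATERIALIZED VIEW certificate_search_mv
TO certificate_search
AS
SELECT
    cutToFirstSignificantSubdomain(Domain) AS SignificantDomain,
    reverse(Domain)                        AS ReversedDomain,
    SerialNumber,
    CertificateExpiryDate
FROM ct_log_entries
ARRAY JOIN Domains AS Domain;

-- Every insert into ct_log_entries now fans out to one search row per SAN.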
Conclusion
So far, we’ve stored billions of certificates, which Clickhouse has handled easily. If anything, this amount of data is small by Clickhouse standards, and we often see domain query times under a tenth of a second. You can use it yourself in our free Certificate Transparency search tool.
We built this tool as part of Certkit: a simple certificate issuance and management tool. It’s just a small part of what makes Certkit great. Check us out if you’re tired of manually issuing certificates or fighting with Certbot.