Time-Series storage efficiency
After exploring how to query our Wikipedia statistics dataset, let's focus on optimizing its storage efficiency in ClickHouse. This section demonstrates practical techniques to reduce storage requirements while maintaining query performance.
Type optimization
The general approach to optimizing storage efficiency is using optimal data types.
Let's take the project and subproject columns. These columns are of type String, but have a relatively small amount of unique values:
This means we can use the LowCardinality() data type, which uses dictionary-based encoding. This causes ClickHouse to store the internal value ID instead of the original string value, which in turn saves a lot of space:
We've also used UInt64 type for the hits column, which takes 8 bytes, but has a relatively small max value:
Given this value, we can use UInt32 instead, which takes only 4 bytes, and allows us to store up to ~4b as a max value:
This will reduce the size of this column in memory by at least 2 times. Note that the size on disk will remain unchanged due to compression. But be careful, pick data types that are not too small!
Specialized codecs
When we deal with sequential data, like time-series, we can further improve storage efficiency by using special codecs. The general idea is to store changes between values instead of absolute values themselves, which results in much less space needed when dealing with slowly changing data:
We've used the Delta codec for time column, which is a good fit for time series data.
The right ordering key can also save disk space.
Since we usually want to filter by a path, we will add path to the sorting key.
This requires recreation of the table.
Below we can see the CREATE command for our initial table and the optimized table:
And let's have a look at the amount of space taken up by the data in each table:
The optimized table takes up just over 4 times less space in its compressed form.
