Storage Engine
The Cassandra storage engine is optimized for high performance, write-oriented workloads. The architecture is based on Log Structured Merge (LSM) trees, which utilize an append-only approach instead of the traditional relational database design with B-trees. This creates a write path free of read lookups and bottlenecks.
While the write path is highly optimized, it comes with tradeoffs in terms of read performance and write amplification. To enhance read operations, Cassandra uses Bloom filters when accessing data from stables. Bloom filters are remarkably efficient, leading to generally well-balanced performance for both reads and writes.
Compaction is a necessary background activity required by the ‘merge’ phase of Log Structured Merge trees. Compaction creates write amplification when several small SSTables on disk are read, merged, updates and deletes processed, and a new ssstable is re-written. Every write of data in Cassandra is re-written multiple times, known as write amplification, and this adds background I/O to the database workload.
The core storage engine consists of memtables for in-memory data and immutable SSTables (Sorted String Tables) on disk. Data in SSTables is stored sorted to enable efficent merge sort during compaction. Additionally, a write-ahead log (WAL), referred to as the commit log, ensures resiliency for crash and transaction recovery.
The sequence of the steps in the write path:
-
Logging data in the commit log
-
Writing data to the memtable
-
Flushing data from the memtable
-
Storing data on disk in SSTables
Logging writes to commit logs
When a write occurs, Cassandra writes the data to a local append-only (cassandra.apache.org//glossary.html#commit-log)[commit log] on disk. This action provides configurable durability by logging every write made to a Cassandra node. If an unexpected shutdown occurs, the commit log provides permanent durable writes of the data. On startup, any mutations in the commit log will be applied to (cassandra.apache.org//glossary.html#memtable)[memtables]. The commit log is shared among tables.
All mutations are write-optimized on storage in commit log segments, reducing the number of seeks needed to write to disk.
Commit log segments are limited by the commitlog_segment_size
option.
Once the defined size is reached, a new commit log segment is created.
Commit log segments can be archived, deleted, or recycled once all the data is flushed to
(SSTables.
Commit log segments are truncated when Cassandra has written data older than a certain point to the SSTables.
Running nodetool drain
before stopping Cassandra will write everything in the memtables
to SSTables and remove the need to sync with the commit logs on startup.
-
commitlog_segment_size
: The default size is 32MiB, which is almost always fine, but if you are archiving commitlog segments (see commitlog_archiving.properties), then you probably want a finer granularity of archiving; 8 or 16 MiB is reasonable.commitlog_segment_size
also determines the default value ofmax_mutation_size
incassandra.yaml
. By default,max_mutation_size
is a half the size ofcommitlog_segment_size
.
If |
-
commitlog_sync
: may be either periodic or batch.-
batch
: In batch mode, Cassandra won’t acknowledge writes until the commit log has been fsynced to disk. -
periodic
: In periodic mode, writes are immediately acknowledged, and the commit log is simply synced every "commitlog_sync_period" milliseconds.-
commitlog_sync_period
: Time to wait between "periodic" fsyncs Default Value: 10000ms
-
-
Default Value: batch
In the event of an unexpected shutdown, Cassandra can lose up to the sync period or more if the sync is delayed.
If using |
-
commitlog_directory
: This option is commented out by default. When running on magnetic HDD, this should be a separate spindle than the data directories. If not set, the default directory is$CASSANDRA_HOME/data/commitlog
.
Default Value: /var/lib/cassandra/commitlog
-
commitlog_compression
: Compression to apply to the commitlog. If omitted, the commit log will be written uncompressed. LZ4, Snappy,Deflate and Zstd compressors are supported.
Default Value: (complex option):
# - class_name: LZ4Compressor
# parameters:
-
commitlog_total_space
: Total space to use for commit logs on disk. This option is commented out by default. If space gets above this value, Cassandra will flush every dirty table in the oldest segment and remove it. So a small total commit log space will tend to cause more flush activity on less-active tables. The default value is the smallest between 8192 and 1/4 of the total space of the commitlog volume.
Default Value: 8192MiB
Memtables
When a write occurs, Cassandra also writes the data to a memtable.
Memtables are in-memory structures where Cassandra buffers writes.
In general, there is one active memtable per table.
The memtable is a write-back cache of data partitions that Cassandra looks up by key.
Memtables may be stored entirely on-heap or partially off-heap, depending on memtable_allocation_type
.
The memtable stores writes in sorted order until reaching a configurable limit. When the limit is reached, memtables are flushed onto disk and become immutable SSTables. Flushing can be triggered in several ways:
-
The memory usage of the memtables exceeds the configured threshold (see
memtable_cleanup_threshold
) -
The
commit log
approaches its maximum size, and forces memtable flushes in order to allow commit log segments to be freed.
When a triggering event occurs, the memtable is put in a queue that is flushed to disk. Flushing writes the data to disk, in the memtable-sorted order. A partition index is also created on the disk that maps the tokens to a location on disk.
The queue can be configured with either the memtable_heap_space
or memtable_offheap_space
setting in the cassandra.yaml
file.
If the data to be flushed exceeds the memtable_cleanup_threshold
, Cassandra blocks writes until the next flush succeeds.
You can manually flush a table using nodetool flush
or nodetool drain
(flushes memtables without listening for connections to other nodes).
To reduce the commit log replay time, the recommended best practice is to flush the memtable before you restart the nodes.
If a node stops working, replaying the commit log restores writes to the memtable that were there before it stopped.
Data in the commit log is purged after its corresponding data in the memtable is flushed to an SSTable on disk.
SSTables
SSTables are the immutable data files that Cassandra uses for persisting data on disk. SSTables are maintained per table. SSTables are immutable, and never written to again after the memtable is flushed. Thus, a partition is typically stored across multiple SSTable files, as data is added or modified.
Each SSTable is comprised of multiple components stored in separate files:
Data.db
-
The actual data, i.e. the contents of rows.
Partitions.db
-
The partition index file maps unique prefixes of decorated partition keys to data file locations, or, in the case of wide partitions indexed in the row index file, to locations in the row index file.
Rows.db
-
The row index file only contains entries for partitions that contain more than one row and are bigger than one index block. For all such partitions, it stores a copy of the partition key, a partition header, and an index of row block separators, which map each row key into the first block where any content with equal or higher row key can be found.
Index.db
-
An index from partition keys to positions in the
Data.db
file. For wide partitions, this may also include an index to rows within a partition. Summary.db
-
A sampling of (by default) every 128th entry in the
Index.db
file. Filter.db
-
A Bloom Filter of the partition keys in the SSTable.
CompressionInfo.db
-
Metadata about the offsets and lengths of compression chunks in the
Data.db
file. Statistics.db
-
Stores metadata about the SSTable, including information about timestamps, tombstones, clustering keys, compaction, repair, compression, TTLs, and more.
Digest.crc32
-
A CRC-32 digest of the
Data.db
file. TOC.txt
-
A plain text list of the component files for the SSTable.
SAI*.db
-
Index information for Storage-Attached indexes. Only present if SAI is enabled for the table.
Note that the |
Within the Data.db
file, rows are organized by partition.
These partitions are sorted in token order (i.e. by a hash of the partition key when the default partitioner, Murmur3Partition
, is used).
Within a partition, rows are stored in the order of their clustering keys.
SSTables can be optionally compressed using block-based compression.
As SSTables are flushed to disk from memtables
or are streamed from other nodes, Cassandra triggers compactions which combine multiple SSTables into one.
Once the new SSTable has been written, the old SSTables can be removed.
SSTable Versions
From (BigFormat#BigVersion.
The version numbers, to date are:
Version 0
-
b (0.7.0): added version to sstable filenames
-
c (0.7.0): bloom filter component computes hashes over raw key bytes instead of strings
-
d (0.7.0): row size in data component becomes a long instead of int
-
e (0.7.0): stores undecorated keys in data and index components
-
f (0.7.0): switched bloom filter implementations in data component
-
g (0.8): tracks flushed-at context in metadata component
Version 1
-
h (1.0): tracks max client timestamp in metadata component
-
hb (1.0.3): records compression ration in metadata component
-
hc (1.0.4): records partitioner in metadata component
-
hd (1.0.10): includes row tombstones in maxtimestamp
-
he (1.1.3): includes ancestors generation in metadata component
-
hf (1.1.6): marker that replay position corresponds to 1.1.5+ millis-based id (see CASSANDRA-4782)
-
ia (1.2.0):
-
column indexes are promoted to the index file
-
records estimated histogram of deletion times in tombstones
-
bloom filter (keys and columns) upgraded to Murmur3
-
-
ib (1.2.1): tracks min client timestamp in metadata component
-
ic (1.2.5): omits per-row bloom filter of column names
Version 2
-
ja (2.0.0):
-
super columns are serialized as composites (note that there is no real format change, this is mostly a marker to know if we should expect super columns or not. We do need a major version bump however, because we should not allow streaming of super columns into this new format)
-
tracks max local deletiontime in sstable metadata
-
records bloom_filter_fp_chance in metadata component
-
remove data size and column count from data file (CASSANDRA-4180)
-
tracks max/min column values (according to comparator)
-
-
jb (2.0.1):
-
switch from crc32 to adler32 for compression checksums
-
checksum the compressed data
-
-
ka (2.1.0):
-
new Statistics.db file format
-
index summaries can be downsampled and the sampling level is persisted
-
switch uncompressed checksums to adler32
-
tracks presence of legacy (local and remote) counter shards
-
-
la (2.2.0): new file name format
-
lb (2.2.7): commit log lower bound included
Version 3
-
ma (3.0.0):
-
swap bf hash order
-
store rows natively
-
-
mb (3.0.7, 3.7): commit log lower bound included
-
mc (3.0.8, 3.9): commit log intervals included
-
md (3.0.18, 3.11.4): corrected sstable min/max clustering
-
me (3.0.25, 3.11.11): added hostId of the node from which the sstable originated
Trie-indexed Based SSTable Versions (BTI)
Cassandra 5.0 introduced new SSTable formats BTI for Trie-indexed SSTables.
To use the BTI formats configure it cassandra.yaml
like
sstable:
selected_format: bti
Versions come from (BtiFormat#BtiVersion.
For implementation docs see (BtiFormat.md.