Cassandra Glossary
A
B
back pressure
Pausing or blocking the buffering of incoming requests once a threshold is reached, until the internal processing of already-buffered requests catches up.
C
cardinality
The number of unique values in a column. For example, a column of ID numbers unique for each employee would have high cardinality while a column of employee ZIP codes would have low cardinality because multiple employees can have the same ZIP code.
An index on a column with low cardinality can boost read performance because the index is significantly smaller than the column. An index for a high-cardinality column may reduce performance. If your application requires a search on a high-cardinality column, a materialized view is ideal.
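To make the distinction concrete, a minimal Python sketch (the sample columns are invented for illustration):

```python
# Cardinality is the number of distinct values in a column.
employee_ids = [101, 102, 103, 104, 105]                    # unique per employee
zip_codes = ["10001", "10001", "94105", "94105", "10001"]   # shared by employees

def cardinality(column):
    """Count the distinct values in a column."""
    return len(set(column))

print(cardinality(employee_ids))  # high cardinality: all 5 values are unique
print(cardinality(zip_codes))     # low cardinality: only 2 distinct values
```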
clustering
The storage engine process that creates an index and keeps data in order based on the index.
clustering column
In the table definition, a clustering column is a column that is part of the compound primary key definition. Note that the clustering column cannot be the first column because that position is reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
coalescing strategy
Strategy to combine multiple network messages into a single packet for outbound TCP connections to nodes in the same data center (intra-DC) or to nodes in a different data center (inter-DC). A coalescing strategy is provided with a blocking queue of pending messages and an output collection for messages to send.
column family
A container for rows, similar to the table in a relational system. Called a table in CQL 3.
commit log
A file to which the database appends changed data for recovery in the event of a hardware failure.
compaction
The process of consolidating SSTables, discarding tombstones, and regenerating the SSTable index. The available compaction strategies are:
compound primary key
A primary key consisting of the partition key, which determines the node on which data is stored, and one or more additional columns that determine clustering.
consistency level
A setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively.
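For example, the widely used QUORUM level requires acknowledgment from a majority of replicas; a small Python sketch of that arithmetic (the helper function is illustrative, not a Cassandra API):

```python
def quorum(replication_factor):
    """A QUORUM read or write must be acknowledged by a majority of replicas."""
    return replication_factor // 2 + 1

# With RF = 3, two replicas must acknowledge; with RF = 5, three must.
print(quorum(3))  # 2
print(quorum(5))  # 3
```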
coordinator node
The node that determines which nodes in the ring should get the request based on the cluster configured snitch.
cosine similarity
A metric measuring the similarity between two non-zero vectors in a multi-dimensional space. It is the cosine of the angle between the vectors, which reflects their orientation relative to each other. One (1) indicates identical orientation (maximum similarity), zero (0) indicates orthogonal vectors (no similarity), and negative one (-1) indicates exactly opposite orientation.
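A minimal Python sketch of the metric (not Cassandra's implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (same orientation)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite orientation)
```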
D
datacenter
A group of related nodes that are configured together within a cluster for replication and workload segregation purposes. Not necessarily a separate location or physical data center. Datacenter names are case sensitive and cannot be changed.
E
EBNF
EBNF (Extended Backus-Naur Form) syntax expresses a context-free grammar that formally describes a language. EBNF extends its precursor BNF (Backus-Naur Form) with additional operators allowed in expansions. Syntax (railroad) diagrams graphically depict EBNF grammars.
embeddings
A mathematical technique in machine learning where complex, high-dimensional data is represented as points in a lower-dimensional space. The process of creating an embedding preserves the relevant properties of the original data, such as distance and similarity, enabling easier computational processing. For instance, words with similar meanings in Natural Language Processing (NLP) can be set close to each other in the reduced space, facilitating their use in machine learning models.
Euclidean distance
A coordinate geometry non-negative distance metric between two points, quantifying the similarity or dissimilarity between those data points represented as vectors. Use it to compare generated samples to real data points.
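A minimal Python sketch of this distance (not Cassandra's implementation):

```python
import math

def euclidean_distance(a, b):
    """Non-negative straight-line distance between two points given as vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0 (the classic 3-4-5 triangle)
print(euclidean_distance([1, 1], [1, 1]))  # 0.0 (identical points)
```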
eventual consistency
The database maximizes availability and partition tolerance. The database ensures eventual data consistency by updating all replicas during read operations and periodically checking and updating any replicas not directly accessed. This updating and checking ensures that all replicas of any given row eventually become completely consistent with each other, although an individual read is not guaranteed to see the most recent write.
G
H
HDD
A hard disk drive (HDD) or spinning disk is a data storage device used for storing and retrieving digital information using one or more rigid rapidly rotating disks. Compare to SSD.
HDFS
Hadoop Distributed File System (HDFS) is Hadoop's distributed storage layer, which stores data across the nodes of a cluster. HDFS is a necessary component, in addition to MapReduce, in a Hadoop distribution.
I
L
LeveledCompactionStrategy (LCS)
This compaction strategy creates SSTables of a fixed, relatively small size that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. Each level (L0, L1, L2, and so on) is ten times as large as the previous level. Disk I/O is more uniform and predictable on higher levels than on lower levels as SSTables are continuously being compacted into progressively larger levels. At each level, row keys are merged into non-overlapping SSTables in the next level. This process improves performance for reads because the database can determine which SSTables in each level to check for the existence of row key data.
linearizable consistency
Also called serializable consistency, linearizable consistency is the restriction that one operation cannot be executed unless and until another operation has completed.
The database supports Lightweight transactions to ensure linearizable consistency in writes. The first phase of a Lightweight transaction works at SERIAL consistency and follows the Paxos protocol to ensure that the required operation succeeds. If this phase succeeds, the write is performed at the consistency level specified for the operation. Reads performed at the SERIAL consistency level execute without database built-in read repair operations.
M
Machine Learning (ML)
A branch of artificial intelligence (AI) and computer science that uses and develops computer systems capable of learning and adapting without explicit instruction. ML uses algorithms and statistical models to analyze data and identify patterns, make decisions, and improve its system.
MapReduce
Hadoop’s parallel processing engine that quickly processes large data sets. A necessary component, in addition to HDFS, in a Hadoop distribution.
N
Natural Language Processing (NLP)
A field of AI that helps computers interpret, process, and generate human language.
P
partition
A partition is a collection of data addressable by a key. This data resides on one node in a Cassandra cluster. A partition is replicated on as many nodes as the replication factor specifies.
partition key
A partition key represents a logical entity and tells a Cassandra cluster which node holds the requested data.
The partition key is the first column declared in the primary key definition. In a composite partition key, multiple columns together form the partition key.
partition range
The limits of the partition range differ depending on the configured partitioner. The Murmur3Partitioner (default) range is -2^63 to +2^63-1, and the RandomPartitioner range is 0 to 2^127-1.
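A quick Python check of these ranges (the constant names are illustrative, not Cassandra APIs):

```python
# Token ranges used by each partitioner (constant names are illustrative).
MURMUR3_MIN, MURMUR3_MAX = -2**63, 2**63 - 1  # full signed 64-bit range
RANDOM_MIN, RANDOM_MAX = 0, 2**127 - 1

# The Murmur3 range covers exactly 2**64 possible tokens.
print(MURMUR3_MAX - MURMUR3_MIN + 1 == 2**64)  # True
print(RANDOM_MAX)  # 170141183460469231731687303715884105727
```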
partition summary
A subset of the partition index. By default, 1 partition key out of every 128 is sampled.
Partitioner
Distributes data across a cluster. The types of partitioners are Murmur3Partitioner (default), RandomPartitioner, and OrderPreservingPartitioner.
primary key
The partition key. One or more columns that uniquely identify a row in a table.
R
read repair
A process that updates database replicas with the most recent version of frequently-read data.
replication factor (RF)
The total number of replicas across the cluster, abbreviated as RF. A replication factor of 1 means that there is only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved. A replication factor of 2 indicates two copies of each row and that each copy is on a different node. All replicas are equally important; there is no primary or master replica.
replication group
See datacenter.
role
A set of permissions assigned to users that limits their access to database resources. When using internal authentication, roles can also have passwords and represent a single user, DSE client tool, or application.
rolling restart
A procedure that is performed during upgrading nodes in a cluster for zero downtime. Nodes are upgraded and restarted one at a time while other nodes continue to operate online.
row
1) Columns that have the same primary key.
2) A collection of cells per combination of columns in the storage engine.
row cache
A database component for improving the performance of read-intensive operations. In off-heap memory, the row cache holds the most recently read rows from the local SSTables. Each local read operation stores its result set in the row cache and sends it to the coordinator node. The next read first checks the row cache. If the required data is there, the database returns it immediately. This initial read can save further seeks in the Bloom filter, partition key cache, partition summary, partition index, and SSTables.
The database uses LRU (least-recently-used) eviction to ensure that the row cache is refreshed with the most frequently accessed rows. The size of the row cache can be configured in the cassandra.yaml file.
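The LRU eviction described above can be sketched with a toy Python cache (a simplified illustration, not Cassandra's off-heap implementation):

```python
from collections import OrderedDict

class RowCache:
    """A toy LRU cache sketch: most recently used rows survive, oldest are evicted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()

    def get(self, key):
        if key not in self.rows:
            return None  # cache miss: a real read would continue to the SSTables
        self.rows.move_to_end(key)  # mark as most recently used
        return self.rows[key]

    def put(self, key, row):
        self.rows[key] = row
        self.rows.move_to_end(key)
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict the least recently used row

cache = RowCache(capacity=2)
cache.put("k1", "row1")
cache.put("k2", "row2")
cache.get("k1")          # touch k1 so it becomes most recently used
cache.put("k3", "row3")  # evicts k2, the least recently used entry
print(cache.get("k2"))   # None
print(cache.get("k1"))   # row1
```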
S
seed
A seed, or seed node, is used to bootstrap the gossip process for new nodes joining a cluster. A seed node provides no other function and is not a single point of failure for a cluster.
SizeTieredCompactionStrategy (STCS)
The default compaction strategy. This strategy triggers a minor compaction when there are a number of similar sized SSTables on disk as configured by the table subproperty, min_threshold. A minor compaction does not involve all the tables in a keyspace. Also see STCS compaction subproperties in the relevant CQL documentation.
slice
A set of clustered columns in a partition that you query as a set using, for example, a conditional WHERE clause.
Snitch
The mapping from the IP addresses of nodes to physical and virtual locations, such as racks and datacenters. The request routing mechanism is affected by which of the several types of snitches is used.
SSD
A solid-state drive (SSD) is a solid-state storage device that uses integrated circuits to persistently store data. Compare to HDD.
SSTable
A sorted string table (SSTable) is an immutable data file to which the database writes memtables periodically. SSTables are stored on disk sequentially and maintained for each database table.
streaming
A component that handles the exchange of data, in the form of SSTable files, among nodes in a cluster.
Examples include:
- When bootstrapping a new node, the new node gets data from existing nodes using streaming.
- When running nodetool repair, nodes exchange out-of-sync data using streaming.
- When bulk-loading data from a backup, sstableloader uses streaming to complete the task.
strong consistency
When the database reads data, it performs a read repair before returning results.
superuser
Superuser is a role attribute that provides root database access.
Superusers have all permissions on all objects.
Apache Cassandra databases include the superuser role cassandra
with password cassandra
by default.
This account runs queries, including logins, with a consistency level of QUORUM
.
It is recommended that users create a superuser for deployments and remove the cassandra
role.
T
table
A collection of columns ordered by name and fetched by row. A row consists of columns and has a primary key; the first part of the key is a column name. Subsequent parts of a compound key are other column names that define the order of columns in the table.
TimeWindowCompactionStrategy (TWCS)
This compaction strategy compacts SSTables based on a series of time windows. During the current time window, the SSTables are compacted into one or more SSTables. At the end of the current time window, all SSTables are compacted into a single larger SSTable. The compaction process repeats at the start of the next time window. Each TWCS time window contains data within a specified range and contains varying amounts of data.
token
An element on the ring that depends on the partitioner. A token determines the node’s position on the ring and the portion of data for which it is responsible. The range for the Murmur3Partitioner (default) is -2^63 to +2^63-1. The range for the RandomPartitioner is 0 to 2^127-1.
tombstone
A marker in a row that indicates a column was deleted. During compaction, marked columns are deleted.
tunable consistency
The database ensures that all replicas of any given row eventually become completely consistent. For situations requiring immediate and complete consistency, the database can be tuned to provide 100% consistency for specified operations, datacenters, or clusters. The database cannot be tuned to complete consistency for all data and operations.
U
UnifiedCompactionStrategy (UCS)
This compaction strategy unifies the applications of leveled, tiered, and time-windowed compaction strategies, including combinations of leveled and tiered at different levels of the compaction hierarchy. UCS can work in modes similar to STCS (with w = T4, matching STCS’s default threshold of 4) and LCS (with w = L10, matching LCS’s default fan factor of 10), and can also work well enough for time-series workloads when used with a large tiered fan factor (for example, w = T20). Read-heavy workloads, especially those that cannot benefit from Bloom filters or time order (that is, wide-partition, non-time-series workloads), are best served by leveled configurations. Write-heavy, time-series, or key-value workloads are best served by tiered ones.
V
Vector Search
Searches a database by computing the distance between vectors. The closer the vectors, the more similar the underlying data; the greater the distance, the less similar the data.
Vnode
Vnode is a virtual node. Normally, nodes are responsible for a single partitioning range in the full token range of a cluster. With vnodes enabled, each node is responsible for several virtual nodes, effectively spreading a partitioning range across more nodes in the cluster. Enabling vnodes can reduce the risk of hotspotting or straining one node in the cluster.
X, Y, Z
zombie
A row or cell that reappears in a database table after deletion. This can happen if a node goes down for a long period of time and is then restored without being repaired.
Deleted data is not erased from database tables; it is marked with tombstones until compaction. The tombstones created on one node must be propagated to the nodes containing the deleted data. If one of these nodes goes down before this happens, the node may not receive the most up-to-date tombstones. If the node is not repaired before it comes back online, the database finds the non-tombstoned items and propagates them to other nodes as new data.
To avoid this problem, run nodetool repair on any restored node before rejoining it to its cluster.