A Comprehensive Guide to Cassandra Database
Table of contents
- Introduction to Cassandra
- Cassandra Architecture Overview
- Installing Cassandra
- Basic Cassandra Commands
- Advanced Cassandra Concepts
- Monitoring and Performance Tuning
- Advanced Cassandra Queries
- 1. Counter Columns
- 2. Using Collections (Set, List, Map)
- 3. Using Static Columns
- 4. Using the ALLOW FILTERING Clause
- 5. Time-Series Data with TTL and Bucketing
- 6. Secondary Indexes on Non-Primary Key Columns
- 7. Paging with LIMIT and OFFSET
- 8. Time-Window Compaction Strategy (TWCS) for Time-Series Data
- 9. Lightweight Transactions (LWT) for Conditional Updates
- 10. Cassandra’s Query Language (CQL) for Aggregations (Limitation)
Introduction to Cassandra
Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers, with no single point of failure. It's known for its linear scalability and fault tolerance, making it ideal for applications that require high availability, like social media platforms, e-commerce sites, and IoT applications.
Key Features of Cassandra:
Scalability: Horizontal scalability by adding more nodes.
Fault Tolerance: Data is replicated across multiple nodes.
High Performance: Handles large amounts of data with low latency.
Flexible Data Model: Supports both structured and unstructured data.
No Single Point of Failure: Fully decentralized peer-to-peer architecture.
Cassandra Architecture Overview
Cassandra follows a distributed architecture based on a peer-to-peer model. Each node in a Cassandra cluster holds a part of the data and acts as an equal. The data is divided and distributed using consistent hashing.
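The token-ring placement described above can be sketched in a few lines of Python. This is a simplified model: real Cassandra uses Murmur3 tokens and virtual nodes, and the node names here are purely illustrative.

```python
import hashlib
from bisect import bisect_right

# Toy 4-node ring; real Cassandra uses Murmur3 tokens and vnodes.
ring = sorted(
    (int(hashlib.md5(f"node{i}".encode()).hexdigest(), 16), f"node{i}")
    for i in range(4)
)
tokens = [t for t, _ in ring]

def owner(partition_key: str) -> str:
    # Hash the partition key to a token; the next node clockwise owns it.
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    idx = bisect_right(tokens, token) % len(tokens)
    return ring[idx][1]
```

Because placement depends only on the hash of the partition key, any node can compute which peer owns a given key without a central coordinator.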
Key Architectural Components:
Node: The basic unit of Cassandra; it stores part of the data.
Cluster: A collection of nodes forming the database.
Keyspace: A namespace that defines data replication across nodes.
Column Family (Table): Similar to tables in relational databases.
Partitioner: Determines the distribution of data across nodes.
Replication Factor: Number of copies of the data stored across nodes.
Consistency Level: Controls the trade-off between consistency and availability.
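The interplay between replication factor and consistency level follows a simple rule: a read is guaranteed to see the latest write when read replicas plus write replicas exceed the replication factor (R + W > RF). A minimal sketch of that arithmetic (function names are mine, not a driver API):

```python
def quorum(rf: int) -> int:
    # QUORUM = floor(RF / 2) + 1
    return rf // 2 + 1

def strongly_consistent(rf: int, write_replicas: int, read_replicas: int) -> bool:
    # Reads overlap writes when R + W > RF
    return read_replicas + write_replicas > rf
```

For example, with RF = 3, QUORUM is 2, so QUORUM writes plus QUORUM reads (2 + 2 > 3) give strong consistency while tolerating one node failure.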
Installing Cassandra
For Linux (Debian/Ubuntu, after adding the Apache Cassandra apt repository):
sudo apt-get update
sudo apt-get install cassandra
On macOS you can install via Homebrew instead: brew install cassandra.
For Windows:
Download the latest Cassandra binary from the Apache Cassandra download page.
Extract the archive and run cassandra.bat from the bin directory.
After installation, you can start the Cassandra service:
sudo service cassandra start
Verify the installation:
nodetool status
Basic Cassandra Commands
1. Creating a Keyspace
A keyspace is a container for tables in Cassandra.
CREATE KEYSPACE demo_keyspace WITH REPLICATION = {
'class': 'SimpleStrategy',
'replication_factor': 3
};
2. Using a Keyspace
USE demo_keyspace;
3. Creating a Table
CREATE TABLE users (
user_id UUID PRIMARY KEY,
name TEXT,
age INT,
email TEXT
);
4. Inserting Data
INSERT INTO users (user_id, name, age, email) VALUES (uuid(), 'Alice', 25, 'alice@example.com');
5. Querying Data
SELECT * FROM users;
6. Filtering Data
Note that age is not part of the primary key, so this query needs either a secondary index or the ALLOW FILTERING clause (covered later):
SELECT * FROM users WHERE age > 20 ALLOW FILTERING;
7. Updating Data
UPDATE users SET age = 26 WHERE user_id = <some-uuid>;
8. Deleting Data
DELETE FROM users WHERE user_id = <some-uuid>;
9. Altering a Table
ALTER TABLE users ADD phone_number TEXT;
10. Dropping a Table
DROP TABLE users;
11. Dropping a Keyspace
DROP KEYSPACE demo_keyspace;
Advanced Cassandra Concepts
1. Partitioning and Clustering Keys
Partition Key: Determines which node stores the data.
Clustering Key: Determines the order of data within a partition.
Example:
CREATE TABLE orders (
order_id UUID,
customer_id UUID,
product_name TEXT,
order_date TIMESTAMP,
PRIMARY KEY (customer_id, order_date)
) WITH CLUSTERING ORDER BY (order_date DESC);
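As a rough illustration of what the clustering order above means, here is how rows within a single customer_id partition would come back, with plain Python standing in for Cassandra's on-disk ordering (the sample rows are invented):

```python
from datetime import datetime

# Illustrative rows belonging to one customer_id partition.
rows = [
    {"order_date": datetime(2024, 8, 1), "product_name": "Keyboard"},
    {"order_date": datetime(2024, 8, 18), "product_name": "Laptop"},
    {"order_date": datetime(2024, 8, 9), "product_name": "Mouse"},
]

# CLUSTERING ORDER BY (order_date DESC): newest order first within the partition.
on_disk = sorted(rows, key=lambda r: r["order_date"], reverse=True)
```

Because the rows are already stored in this order, "latest N orders for a customer" is a cheap sequential read rather than a sort at query time.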
2. Indexes
CREATE INDEX ON users(email);
3. Batch Operations
Logged batches apply multiple statements as a single atomic unit (all statements eventually succeed, or none do); note that they do not provide isolation and are not a general performance optimization.
BEGIN BATCH
INSERT INTO users (user_id, name, age, email) VALUES (uuid(), 'Bob', 30, 'bob@example.com');
UPDATE users SET age = 31 WHERE user_id = <some-uuid>; -- updates must target the primary key, not a regular column like email
APPLY BATCH;
4. TTL (Time-to-Live)
Set an expiration time for data:
INSERT INTO users (user_id, name, age, email) VALUES (uuid(), 'Carol', 28, 'carol@example.com') USING TTL 86400;
5. Lightweight Transactions (LWT)
Ensures conditional updates:
INSERT INTO users (user_id, name, age, email) VALUES (uuid(), 'Dave', 32, 'dave@example.com') IF NOT EXISTS;
6. Materialized Views
Materialized views provide automatic denormalization (note: they are flagged experimental and disabled by default in recent Cassandra releases):
CREATE MATERIALIZED VIEW user_by_email AS
SELECT * FROM users
WHERE email IS NOT NULL
PRIMARY KEY (email, user_id);
Monitoring and Performance Tuning
nodetool status: Monitor cluster health.
nodetool repair: Run repairs to ensure data consistency.
Compaction and Compression: Tune for optimal performance.
Advanced Cassandra Queries
1. Counter Columns
Counters are special columns used to store numeric values that can only be incremented or decremented. They are ideal for tracking metrics like page views, likes, or upvotes.
CREATE TABLE page_views (
url TEXT PRIMARY KEY,
view_count COUNTER
);
-- Incrementing the counter
UPDATE page_views SET view_count = view_count + 1 WHERE url = 'https://example.com/home';
-- Decrementing the counter
UPDATE page_views SET view_count = view_count - 1 WHERE url = 'https://example.com/home';
-- Querying the counter
SELECT url, view_count FROM page_views;
2. Using Collections (Set, List, Map)
Cassandra supports complex data types such as lists, sets, and maps.
List: Ordered collection of values.
Set: Unordered collection of unique values.
Map: Key-value pairs.
Creating a table with collections:
CREATE TABLE user_profiles (
user_id UUID PRIMARY KEY,
emails LIST<TEXT>,
phone_numbers SET<TEXT>,
addresses MAP<TEXT, TEXT> -- Key: Address type, Value: Address
);
Inserting data into collections:
INSERT INTO user_profiles (user_id, emails, phone_numbers, addresses)
VALUES (uuid(), ['user1@example.com', 'user2@example.com'], {'9876543210'}, {'home': '123 Main St', 'work': '456 Office Blvd'});
Updating collections:
-- Adding an email to the list
UPDATE user_profiles SET emails = emails + ['newuser@example.com'] WHERE user_id = <some-uuid>;
-- Adding a phone number to the set
UPDATE user_profiles SET phone_numbers = phone_numbers + {'1234567890'} WHERE user_id = <some-uuid>;
-- Updating a specific map entry
UPDATE user_profiles SET addresses['home'] = '789 New Home St' WHERE user_id = <some-uuid>;
Querying collections:
SELECT emails, phone_numbers, addresses FROM user_profiles WHERE user_id = <some-uuid>;
3. Using Static Columns
Static columns are columns that have the same value for all rows within a partition. They are useful when you want to store metadata that applies to all rows in a partition.
Example:
CREATE TABLE order_status (
customer_id UUID,
order_id UUID,
order_status TEXT STATIC, -- Static column
product_name TEXT,
PRIMARY KEY (customer_id, order_id)
);
-- Inserting data (both rows use the same customer_id, so they share one partition and its static column)
INSERT INTO order_status (customer_id, order_id, order_status, product_name) VALUES (<customer-uuid>, uuid(), 'Pending', 'Laptop');
INSERT INTO order_status (customer_id, order_id, product_name) VALUES (<customer-uuid>, uuid(), 'Mobile Phone');
-- Updating the static column applies to every order in that customer's partition
UPDATE order_status SET order_status = 'Shipped' WHERE customer_id = <customer-uuid>;
4. Using the ALLOW FILTERING Clause
By default, Cassandra does not allow filtering on columns that are not part of the primary key, for performance reasons. The ALLOW FILTERING clause lets you filter on non-key columns, but it should be used cautiously, as it can force a full table scan and degrade performance.
Example:
SELECT * FROM users WHERE age = 30 ALLOW FILTERING;
5. Time-Series Data with TTL and Bucketing
When working with time-series data, it’s common to use bucketing (storing data in fixed intervals like days or hours) combined with TTL to automatically expire old data.
Example:
CREATE TABLE sensor_data (
sensor_id UUID,
bucket_date DATE,
timestamp TIMESTAMP,
reading DOUBLE,
PRIMARY KEY ((sensor_id, bucket_date), timestamp)
);
-- Inserting time-series data with TTL
INSERT INTO sensor_data (sensor_id, bucket_date, timestamp, reading)
VALUES (uuid(), '2024-08-18', toTimestamp(now()), 25.7) USING TTL 86400; -- TTL = 1 day
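The bucket_date value above is typically derived from the event timestamp on the application side. A minimal sketch of daily bucketing (the helper name is mine):

```python
from datetime import datetime, timezone

def bucket_date(ts: datetime) -> str:
    # Daily bucket: one partition per sensor per day keeps partitions bounded.
    return ts.date().isoformat()

reading_ts = datetime(2024, 8, 18, 14, 30, tzinfo=timezone.utc)
```

Every reading taken on the same UTC day maps to the same bucket, so a partition never grows beyond one day of data and expired buckets disappear wholesale once their TTL lapses.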
6. Secondary Indexes on Non-Primary Key Columns
Secondary indexes can be created on non-primary key columns to optimize queries.
Example:
CREATE INDEX ON users (email);
-- Querying using the index
SELECT * FROM users WHERE email = 'alice@example.com';
7. Paging with LIMIT and OFFSET
For large datasets, pagination is often required. Cassandra supports paging using LIMIT
, but OFFSET
is not directly supported. You achieve pagination using the token()
function.
Example:
SELECT * FROM users WHERE token(user_id) > token(last_user_id) LIMIT 10;
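Token-based paging can be modeled in plain Python: keep rows sorted by token, then fetch each page starting strictly after the last token seen. This is a sketch, not driver code, and crc32 stands in for Cassandra's partitioner:

```python
import zlib

def page_by_token(rows, page_size, last_token=None):
    # rows: (token, value) pairs sorted by token, mimicking ring order.
    if last_token is not None:
        rows = [(t, v) for t, v in rows if t > last_token]
    return rows[:page_size]

# crc32 stands in for the partitioner's token function.
data = sorted((zlib.crc32(f"user{i}".encode()), f"user{i}") for i in range(25))
page1 = page_by_token(data, 10)
page2 = page_by_token(data, 10, last_token=page1[-1][0])
```

In practice the client drivers automate this for you via a paging state, so you rarely write token() comparisons by hand.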
8. Time-Window Compaction Strategy (TWCS) for Time-Series Data
When managing time-series data, Cassandra’s TimeWindowCompactionStrategy (TWCS) is useful for optimizing read and write performance.
Example:
CREATE TABLE temperature_readings (
sensor_id UUID,
timestamp TIMESTAMP,
temperature DOUBLE,
PRIMARY KEY (sensor_id, timestamp)
) WITH compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_size': 1,
'compaction_window_unit': 'DAYS'
};
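TWCS groups data into fixed time windows and compacts each window separately, so old windows are never re-compacted with new writes. A rough model of how writes fall into the daily windows configured above (the window arithmetic is a simplification of the real strategy):

```python
from datetime import datetime, timezone
from collections import defaultdict

WINDOW_SECONDS = 86400  # compaction_window_size = 1, unit = DAYS

def window(ts: datetime) -> int:
    # TWCS buckets SSTables by the time window their data falls in.
    return int(ts.timestamp()) // WINDOW_SECONDS

writes = [
    datetime(2024, 8, 18, 3, tzinfo=timezone.utc),
    datetime(2024, 8, 18, 22, tzinfo=timezone.utc),
    datetime(2024, 8, 19, 1, tzinfo=timezone.utc),
]
buckets = defaultdict(list)
for ts in writes:
    buckets[window(ts)].append(ts)
```

Combined with TTL, this lets Cassandra drop a whole expired window's SSTables at once instead of rewriting them.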
9. Lightweight Transactions (LWT) for Conditional Updates
Lightweight transactions ensure that conditional updates or inserts are atomic and isolated.
Example:
-- Insert if the record doesn't already exist
INSERT INTO users (user_id, name, email) VALUES (uuid(), 'John Doe', 'john@example.com') IF NOT EXISTS;
-- Update only if a condition is met
UPDATE users SET email = 'newemail@example.com' WHERE user_id = <some-uuid> IF email = 'oldemail@example.com';
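The IF condition above is a compare-and-set: the write applies only when the current value matches the expected one, and Cassandra reports the outcome in an [applied] column. A toy model of that semantics (not the Paxos machinery Cassandra actually runs underneath):

```python
def compare_and_set(row: dict, column: str, expected, new_value):
    # Apply the write only if the current value matches `expected`,
    # mirroring the [applied] result row CQL returns for an LWT.
    if row.get(column) != expected:
        return {"applied": False, column: row.get(column)}
    row[column] = new_value
    return {"applied": True}
```

Note that each LWT costs extra round trips between replicas, so they should be reserved for writes that genuinely need the condition check.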
10. Cassandra’s Query Language (CQL) for Aggregations (Limitation)
Cassandra’s CQL does not support complex aggregations like SQL, but you can use functions like COUNT
, MAX
, and MIN
in basic cases.
Example:
SELECT COUNT(*) FROM users;
SELECT MAX(age) FROM users;