A Comprehensive Guide to Cassandra Database


Introduction to Cassandra

Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers, with no single point of failure. It's known for its linear scalability and fault tolerance, making it ideal for applications that require high availability, like social media platforms, e-commerce sites, and IoT applications.

Key Features of Cassandra:

  • Scalability: Horizontal scalability by adding more nodes.

  • Fault Tolerance: Data is replicated across multiple nodes.

  • High Performance: Handles large amounts of data with low latency.

  • Flexible Data Model: Supports both structured and unstructured data.

  • No Single Point of Failure: Fully decentralized peer-to-peer architecture.

Cassandra Architecture Overview

Cassandra follows a distributed, peer-to-peer architecture: every node in a cluster is an equal and owns a portion of the data. Data is partitioned and distributed around a token ring using consistent hashing.
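The token-ring idea behind consistent hashing can be sketched in a few lines of Python. This is purely illustrative: Cassandra's default partitioner is Murmur3Partitioner, and md5 here is just a stand-in hash; the class and names are invented for the sketch.

```python
import hashlib
from bisect import bisect

def token(key: str) -> int:
    # Hash a partition key to a position on the ring
    # (md5 as a stand-in for Cassandra's Murmur3 partitioner).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns the ring range ending at its own token.
        self.ring = sorted((token(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's token to the next node token,
        # wrapping around at the end of the ring.
        tokens = [t for t, _ in self.ring]
        i = bisect(tokens, token(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # always maps to the same node
```

Because only the hash of the key decides placement, adding a node moves just the keys in one ring segment rather than reshuffling everything.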

Key Architectural Components:

  1. Node: The basic unit of Cassandra; it stores part of the data.

  2. Cluster: A collection of nodes forming the database.

  3. Keyspace: A namespace that defines data replication across nodes.

  4. Column Family (Table): Similar to tables in relational databases.

  5. Partitioner: Determines the distribution of data across nodes.

  6. Replication Factor: Number of copies of the data stored across nodes.

  7. Consistency Level: Controls the trade-off between consistency and availability.
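The trade-off in the last two items follows simple arithmetic: with write consistency level W, read level R, and replication factor RF, a read is guaranteed to see the latest write whenever R + W > RF, because the read and write replica sets must overlap. A tiny check (just arithmetic, not part of Cassandra's API):

```python
def is_strongly_consistent(r: int, w: int, rf: int) -> bool:
    # Read and write replica sets overlap on at least one node
    # whenever R + W > RF.
    return r + w > rf

rf = 3
quorum = rf // 2 + 1  # QUORUM for RF=3 is 2
print(is_strongly_consistent(quorum, quorum, rf))  # True: QUORUM/QUORUM
print(is_strongly_consistent(1, 1, rf))            # False: ONE/ONE may read stale data
```

This is why QUORUM reads plus QUORUM writes are the usual recipe for strong consistency, while ONE/ONE trades consistency for latency and availability.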

Installing Cassandra

For Debian/Ubuntu (add the Apache Cassandra package repository first):

sudo apt-get update
sudo apt-get install cassandra

For macOS (via Homebrew):

brew install cassandra

For Windows:

  1. Download the latest Cassandra binary from the Apache Cassandra download page.

  2. Extract the folder and run cassandra.bat from the bin directory.

After installation, you can start the Cassandra service:

sudo service cassandra start

Verify the installation:

nodetool status

Basic Cassandra Commands

1. Creating a Keyspace

A keyspace is a container for tables in Cassandra. SimpleStrategy is fine for a single-datacenter demo like this; production clusters typically use NetworkTopologyStrategy.

CREATE KEYSPACE demo_keyspace WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

2. Using a Keyspace

USE demo_keyspace;

3. Creating a Table

CREATE TABLE users (
  user_id UUID PRIMARY KEY,
  name TEXT,
  age INT,
  email TEXT
);

4. Inserting Data

INSERT INTO users (user_id, name, age, email) VALUES (uuid(), 'Alice', 25, 'alice@example.com');

5. Querying Data

SELECT * FROM users;

6. Filtering Data

Filtering on a non-key column such as age forces a scan, so Cassandra requires ALLOW FILTERING (or a secondary index) for this query:

SELECT * FROM users WHERE age > 20 ALLOW FILTERING;

7. Updating Data

UPDATE users SET age = 26 WHERE user_id = <some-uuid>;

8. Deleting Data

DELETE FROM users WHERE user_id = <some-uuid>;

9. Altering a Table

ALTER TABLE users ADD phone_number TEXT;

10. Dropping a Table

DROP TABLE users;

11. Dropping a Keyspace

DROP KEYSPACE demo_keyspace;

Advanced Cassandra Concepts

1. Partitioning and Clustering Keys

  • Partition Key: Determines which node stores the data.

  • Clustering Key: Determines the order of data within a partition.

Example:

CREATE TABLE orders (
  order_id UUID,
  customer_id UUID,
  product_name TEXT,
  order_date TIMESTAMP,
  PRIMARY KEY (customer_id, order_date, order_id)  -- order_id breaks ties between same-time orders
) WITH CLUSTERING ORDER BY (order_date DESC);

2. Indexes

CREATE INDEX ON users(email);

3. Batch Operations

Logged batches guarantee that all statements in the batch eventually execute (atomicity, even across partitions), but they are not a performance optimization and add coordinator overhead. Each statement must still address rows by primary key:

BEGIN BATCH
  INSERT INTO users (user_id, name, age, email) VALUES (uuid(), 'Bob', 30, 'bob@example.com');
  UPDATE users SET age = 31 WHERE user_id = <some-uuid>;
APPLY BATCH;

4. TTL (Time-to-Live)

Set an expiration time for data:

INSERT INTO users (user_id, name, age, email) VALUES (uuid(), 'Carol', 28, 'carol@example.com') USING TTL 86400;

5. Lightweight Transactions (LWT)

LWT makes conditional writes atomic by running a Paxos round under the hood, at the cost of extra latency:

INSERT INTO users (user_id, name, age, email) VALUES (uuid(), 'Dave', 32, 'dave@example.com') IF NOT EXISTS;

6. Materialized Views

Materialized views provide automatic denormalization, though they are marked experimental and disabled by default in recent Cassandra releases. Every primary-key column of the view must carry an IS NOT NULL restriction:

CREATE MATERIALIZED VIEW user_by_email AS
  SELECT * FROM users
  WHERE email IS NOT NULL AND user_id IS NOT NULL
  PRIMARY KEY (email, user_id);

Monitoring and Performance Tuning

  1. nodetool status: Monitor cluster health and per-node load.

  2. nodetool repair: Run regular repairs (for example, nodetool repair -pr on each node) to keep replicas consistent.

  3. Compaction and Compression: Tune per-table via the compaction and compression table options.

Advanced Cassandra Queries

1. Counter Columns

Counters are special columns used to store numeric values that can only be incremented or decremented. They are ideal for tracking metrics like page views, likes, or upvotes.

CREATE TABLE page_views (
  url TEXT PRIMARY KEY,
  view_count COUNTER
);

-- Incrementing the counter
UPDATE page_views SET view_count = view_count + 1 WHERE url = 'https://example.com/home';

-- Decrementing the counter
UPDATE page_views SET view_count = view_count - 1 WHERE url = 'https://example.com/home';

-- Querying the counter
SELECT url, view_count FROM page_views;

2. Using Collections (Set, List, Map)

Cassandra supports complex data types such as lists, sets, and maps.

  • List: Ordered collection of values.

  • Set: Unordered collection of unique values.

  • Map: Key-value pairs.

Creating a table with collections:

CREATE TABLE user_profiles (
  user_id UUID PRIMARY KEY,
  emails LIST<TEXT>,
  phone_numbers SET<TEXT>,
  addresses MAP<TEXT, TEXT>  -- Key: Address type, Value: Address
);

Inserting data into collections:

INSERT INTO user_profiles (user_id, emails, phone_numbers, addresses) 
VALUES (uuid(), ['user1@example.com', 'user2@example.com'], {'9876543210'}, {'home': '123 Main St', 'work': '456 Office Blvd'});

Updating collections:

-- Adding an email to the list
UPDATE user_profiles SET emails = emails + ['newuser@example.com'] WHERE user_id = <some-uuid>;

-- Adding a phone number to the set
UPDATE user_profiles SET phone_numbers = phone_numbers + {'1234567890'} WHERE user_id = <some-uuid>;

-- Updating a specific map entry
UPDATE user_profiles SET addresses['home'] = '789 New Home St' WHERE user_id = <some-uuid>;

Querying collections:

SELECT emails, phone_numbers, addresses FROM user_profiles WHERE user_id = <some-uuid>;

3. Using Static Columns

Static columns are columns that have the same value for all rows within a partition. They are useful when you want to store metadata that applies to all rows in a partition.

Example:

CREATE TABLE order_status (
  customer_id UUID,
  order_id UUID,
  order_status TEXT STATIC, -- Static column
  product_name TEXT,
  PRIMARY KEY (customer_id, order_id)
);

-- Inserting data
INSERT INTO order_status (customer_id, order_id, order_status, product_name) VALUES (uuid(), uuid(), 'Pending', 'Laptop');
INSERT INTO order_status (customer_id, order_id, product_name) VALUES (uuid(), uuid(), 'Mobile Phone');

-- Updating the static column (applies to all orders for the customer)
UPDATE order_status SET order_status = 'Shipped' WHERE customer_id = <some-uuid>;

4. Using the ALLOW FILTERING Clause

By default, Cassandra does not allow filtering on columns that are not part of the primary key for performance reasons. The ALLOW FILTERING clause allows you to filter on non-key columns, but it should be used cautiously as it can lead to performance degradation.

Example:

SELECT * FROM users WHERE age = 30 ALLOW FILTERING;

5. Time-Series Data with TTL and Bucketing

When working with time-series data, it’s common to use bucketing (storing data in fixed intervals like days or hours) combined with TTL to automatically expire old data.

Example:

CREATE TABLE sensor_data (
  sensor_id UUID,
  bucket_date DATE,
  timestamp TIMESTAMP,
  reading DOUBLE,
  PRIMARY KEY ((sensor_id, bucket_date), timestamp)
);

-- Inserting time-series data with TTL
INSERT INTO sensor_data (sensor_id, bucket_date, timestamp, reading)
VALUES (uuid(), '2024-08-18', toTimestamp(now()), 25.7) USING TTL 86400;  -- TTL = 1 day
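Client code has to derive the same bucket on every write and read so that one sensor-day always lands in one partition. A minimal sketch in Python (the day-sized bucket matches the bucket_date DATE column above; the function name is illustrative):

```python
from datetime import datetime, timezone

def bucket_for(ts: datetime) -> str:
    # Day-sized bucket matching the bucket_date DATE column:
    # all readings from one sensor on one day share a partition.
    return ts.date().isoformat()

ts = datetime(2024, 8, 18, 14, 30, tzinfo=timezone.utc)
print(bucket_for(ts))  # '2024-08-18' -> partition key (sensor_id, '2024-08-18')
```

Smaller buckets (hours) keep partitions bounded for high-frequency sensors; larger buckets (days) mean fewer partitions to query per time range.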

6. Secondary Indexes on Non-Primary Key Columns

Secondary indexes can be created on non-primary-key columns, but they are served node-by-node, so they suit low-cardinality columns and should not replace query-driven table design.

Example:

CREATE INDEX ON users (email);

-- Querying using the index
SELECT * FROM users WHERE email = 'alice@example.com';

7. Paging with LIMIT

For large datasets, pagination is required. Cassandra supports LIMIT, but there is no OFFSET; instead you page keyset-style with the token() function, treating the last row of the previous page as a cursor.

Example:

SELECT * FROM users WHERE token(user_id) > token(last_user_id) LIMIT 10;
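The keyset pattern can be simulated in plain Python: order rows by token, then resume strictly after the last-seen token on each request. This is a sketch of the idea only, not a driver API (real drivers also offer automatic paging via fetch size and paging state):

```python
def fetch_page(rows, last_token=None, limit=10):
    # rows: (token, row) pairs sorted by token, as the partitioner
    # would order them. Resume strictly after last_token.
    if last_token is not None:
        rows = [r for r in rows if r[0] > last_token]
    return rows[:limit]

# Deterministic fake tokens for 25 rows (Knuth multiplicative hash).
rows = sorted(((i * 2654435761) % (2**32), f"user-{i}") for i in range(25))
page1 = fetch_page(rows, limit=10)
page2 = fetch_page(rows, last_token=page1[-1][0], limit=10)
print(len(page1), len(page2))  # 10 10
```

Because the cursor is the token itself, each page is an independent query and the server never has to skip rows, which is why OFFSET is unnecessary (and unsupported).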

8. Time-Window Compaction Strategy (TWCS) for Time-Series Data

When managing time-series data, Cassandra’s TimeWindowCompactionStrategy (TWCS) is useful for optimizing read and write performance.

Example:

CREATE TABLE temperature_readings (
  sensor_id UUID,
  timestamp TIMESTAMP,
  temperature DOUBLE,
  PRIMARY KEY (sensor_id, timestamp)
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_size': 1,
  'compaction_window_unit': 'DAYS'
};

9. Lightweight Transactions (LWT) for Conditional Updates

Lightweight transactions ensure that conditional updates or inserts are atomic and isolated.

Example:

-- Insert if the record doesn't already exist
INSERT INTO users (user_id, name, email) VALUES (uuid(), 'John Doe', 'john@example.com') IF NOT EXISTS;

-- Update only if a condition is met
UPDATE users SET email = 'newemail@example.com' WHERE user_id = <some-uuid> IF email = 'oldemail@example.com';

10. Cassandra’s Query Language (CQL) for Aggregations (Limitation)

Cassandra’s CQL does not support the rich aggregations of SQL. Built-ins like COUNT, MAX, MIN, SUM, and AVG exist, but across a whole table they require a full scan and can time out on large clusters, so they are best restricted to a single partition.

Example:

SELECT COUNT(*) FROM users;
SELECT MAX(age) FROM users;

Written by Abhishek Jaiswal