I'll be writing a series of articles exploring large-scale data systems, from storage to processing and from interface to implementation.

This is the first one -- Dataset.

Logically, we treat all the data to be processed as a dataset.

Representation

A dataset can be represented in a variety of ways:

Tuples
Nodes and Edges
KV pairs
Documents
Files
Objects
Messages
Logs

Each representation corresponds, in the same order, to a particular kind of storage:

Relational Database
Graph Database
KV Store
Document Storage
File System
Object Store
Message Queue
Log System

Each type of storage exposes its own user interface and suits particular use cases. At a high level, though, each one can be thought of as a dataset made up of some fundamental data component.

Throughout this series, we refer to this fundamental data component uniformly as an object.
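
As a rough illustration of this uniform view, here is a minimal Python sketch. The names `DataObject`, `as_dataset`, and `count_objects` are hypothetical and introduced only for this example; they are not taken from any particular system. The point is that code written against "a dataset of objects" does not need to know whether the objects started out as tuples, KV pairs, or log lines.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass(frozen=True)
class DataObject:
    """Hypothetical wrapper for one fundamental data component."""
    kind: str      # which representation it came from, e.g. "tuple", "kv", "log"
    payload: Any   # the underlying tuple, KV pair, log line, ...


def as_dataset(kind: str, items: Iterable[Any]) -> List[DataObject]:
    """View any iterable of items as a dataset of objects."""
    return [DataObject(kind, item) for item in items]


def count_objects(dataset: List[DataObject]) -> int:
    """An operation that works the same way for every representation."""
    return len(dataset)


# Three different representations, one uniform abstraction.
tuple_dataset = as_dataset("tuple", [(1, "alice"), (2, "bob")])
kv_dataset = as_dataset("kv", {"k1": "v1", "k2": "v2", "k3": "v3"}.items())
log_dataset = as_dataset("log", ["GET /index 200"])

for name, dataset in [("tuples", tuple_dataset),
                      ("kv pairs", kv_dataset),
                      ("logs", log_dataset)]:
    print(name, count_objects(dataset))
```

Real systems differ a great deal in how they store and index these objects; the sketch only captures the logical view.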

Partition

To distribute the logical dataset across different machines, we first split it into several parts, which are called shards, partitions, or splits. This yields a three-level abstraction:

DataSet - Partition - Object
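
The three levels can be sketched in a few lines of Python. The article does not say how the split is done, so hash partitioning is assumed here purely for illustration; range or round-robin splitting would fit the same abstraction. The function `partition_dataset` and its `key` parameter are hypothetical names. `zlib.crc32` is used instead of the built-in `hash` because string hashes in Python vary between interpreter runs, which would make the placement unstable.

```python
import zlib
from typing import Any, Callable, Iterable, List


def partition_dataset(
    objects: Iterable[Any],
    num_partitions: int,
    key: Callable[[Any], str] = repr,
) -> List[List[Any]]:
    """Split a dataset (iterable of objects) into a list of partitions.

    Assumption: hash partitioning on a string key derived from each object.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    partitions: List[List[Any]] = [[] for _ in range(num_partitions)]
    for obj in objects:
        # Stable checksum of the key, so the same object always lands
        # in the same partition across runs.
        digest = zlib.crc32(key(obj).encode("utf-8"))
        partitions[digest % num_partitions].append(obj)
    return partitions


# DataSet -> Partition -> Object
dataset = [("user", i) for i in range(10)]
partitions = partition_dataset(dataset, num_partitions=3)

for index, partition in enumerate(partitions):
    print(f"partition {index}: {len(partition)} objects")
```

In a real deployment each partition would then be assigned to a machine; that assignment step is outside this sketch.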

Large-Scale Data Systems (1): DataSet

muniao
