Databricks: 9 Types of Physical Tables You Need to Know

Mario Herrera
4 min read

If you're a data engineer working with Databricks, you know that choosing the right type of table is crucial for your workflows. Databricks offers a variety of table types, each designed for specific use cases. Here’s a breakdown of the 9 most important table types you’ll use in your daily work, along with simple examples to help you understand when and how to use them.


1. Managed Tables

  • What it is: Databricks manages both the data and the metadata. The data lives in Databricks-managed storage (DBFS for legacy Hive metastore tables, or the catalog's managed location under Unity Catalog), so dropping the table also deletes the underlying files.

  • When to use: Ideal for storing cleaned and processed data that you’ll use for analysis or machine learning.

  • Example:

      CREATE TABLE managed_sales (order_id INT, product STRING, amount FLOAT);
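
Because Databricks owns the storage, you rarely need to think about where the files live, but you can check at any time. A minimal sketch against the managed_sales table above:

      -- Shows the physical location, size, and format of the managed table
      DESCRIBE DETAIL managed_sales;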
    

2. External Tables

  • What it is: These tables point to data stored in an external system like S3, ADLS, or GCS. Databricks only manages the metadata.

  • When to use: Use this when you want to access raw data stored in an external bucket without moving it.

  • Example:

      -- USING tells Databricks the format of the files already in the bucket (adjust to JSON, CSV, etc.)
      CREATE TABLE external_logs (log_id INT, message STRING)
      USING PARQUET
      LOCATION 's3://my-bucket/logs/';
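
Since Databricks only manages the metadata here, dropping the table never touches the files in the bucket, unlike dropping the managed table above:

      -- Removes only the table definition; the files under s3://my-bucket/logs/ remain
      DROP TABLE external_logs;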
    

3. Delta Tables

  • What it is: Tables that use the Delta Lake format, providing ACID transactions, versioning, and performance optimizations.

  • When to use: Perfect for transactional data that requires frequent updates and consistency.

  • Example:

      CREATE TABLE delta_transactions (txn_id INT, user_id INT, amount FLOAT) USING delta;
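
Since the whole point of Delta is ACID updates and versioning, here is a quick sketch of both against delta_transactions (the version number is just an example):

      -- In-place update with ACID guarantees
      UPDATE delta_transactions SET amount = amount * 1.1 WHERE user_id = 42;

      -- List the table's versions and query an older snapshot (time travel)
      DESCRIBE HISTORY delta_transactions;
      SELECT * FROM delta_transactions VERSION AS OF 0;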
    

4. Views (not a physical table, but super useful)

  • What it is: Saved queries that don’t store data physically but provide a virtual representation of the data.

  • When to use: Use views to simplify complex queries or control access to specific data.

  • Example:

      CREATE VIEW high_value_customers AS
      SELECT * FROM customers WHERE total_spent > 1000;
    

5. Temporary Tables

  • What it is: Session-scoped objects that exist only for the duration of the Spark session in which they’re created; in Databricks you typically define them as temporary views.

  • When to use: Great for storing intermediate data during a processing session.

  • Example:

      -- Spark SQL scopes these to the session as temporary views
      CREATE TEMPORARY VIEW temp_orders AS
      SELECT * FROM orders WHERE order_date = '2023-10-01';
    

6. Streaming Tables

  • What it is: Tables designed for real-time data processing using Structured Streaming.

  • When to use: Use these for processing continuous data streams, like application logs or real-time transactions.

  • Example:

      # df is a streaming DataFrame (e.g. from spark.readStream); a checkpoint location is required
      df.writeStream.format("delta").option("checkpointLocation", "/path/to/checkpoints").outputMode("append").start("/path/to/streaming_table")
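
On newer Databricks releases the same pattern can also be declared directly in SQL as a streaming table; a minimal sketch, assuming JSON files land in a hypothetical s3://my-bucket/events/ prefix:

      -- Incrementally ingests new files as they arrive
      CREATE OR REFRESH STREAMING TABLE raw_events
      AS SELECT * FROM STREAM read_files('s3://my-bucket/events/', format => 'json');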
    

7. Catalog Tables

  • What it is: Tables managed by Unity Catalog, which provides centralized governance and security.

  • When to use: Ideal for sharing data across teams with access control and auditing.

  • Example:

      -- Three-level name: <catalog>.<schema>.<table>
      CREATE TABLE catalog.sales.orders (order_id INT, product STRING, amount FLOAT);
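
The governance side means access is granted (and audited) declaratively; a minimal sketch, assuming a hypothetical analysts group already exists in your workspace:

      -- Give the analysts group read access; Unity Catalog records the grant for auditing
      GRANT SELECT ON TABLE catalog.sales.orders TO `analysts`;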
    

8. Delta Live Tables (DLT)

  • What it is: Tables created and managed automatically by the Delta Live Tables framework for building reliable data pipelines.

  • When to use: Use DLT for building and maintaining ETL/ELT pipelines.

  • Example:

      import dlt

      # DLT materializes the returned DataFrame as a managed table each time the pipeline runs
      @dlt.table
      def cleaned_sales():
          return spark.read.format("delta").load("/path/to/raw_sales")
    

9. Iceberg Tables

  • What it is: Tables that use the Apache Iceberg format for improved performance and data management on large datasets.

  • When to use: Use Iceberg for handling large-scale data with versioning and high performance.

  • Example:

      CREATE TABLE iceberg_sales (order_id INT, product STRING, amount FLOAT) USING iceberg;
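
A common alternative on Databricks is Delta's UniForm feature, which keeps the table in Delta format but also writes Iceberg metadata so external Iceberg clients can read it; a minimal sketch, assuming a recent runtime with UniForm support:

      -- Delta table that also exposes Iceberg metadata (UniForm)
      CREATE TABLE uniform_sales (order_id INT, product STRING, amount FLOAT)
      TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
      );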
    

Choosing the right table type in Databricks can make your life as a data engineer much easier. Whether you’re handling raw data, building pipelines, or processing real-time streams, there’s a table type designed for your use case. Keep this guide handy, and you’ll be able to pick the perfect table type for any scenario.

Quick reference:

Table Type       | Best For
-----------------|--------------------------------------------------------------------------
Managed Tables   | Storing cleaned and processed data for analysis or ML.
External Tables  | Accessing raw data in external storage (e.g., S3, ADLS) without moving it.
Delta Tables     | Handling transactional data with ACID guarantees and versioning.
Views            | Simplifying complex queries or controlling access to specific data.
Temporary Tables | Storing intermediate data during a single Spark session.
Streaming Tables | Processing real-time data streams (e.g., logs, transactions).
Catalog Tables   | Sharing data across teams with governance and access control.
DLT Tables       | Building reliable and maintainable ETL/ELT pipelines.
Iceberg Tables   | Managing large-scale datasets with high performance and versioning.