Databricks: 9 Types of Physical Tables You Need to Know

Mario Herrera
4 min read

If you're a data engineer working with Databricks, you know that choosing the right type of table is crucial for your workflows. Databricks offers a variety of table types, each designed for specific use cases. Here’s a breakdown of the 9 most important table types you’ll use in your daily work, along with simple examples to help you understand when and how to use them.


1. Managed Tables

  • What it is: Databricks manages both the data and the metadata. The data lives in Databricks-managed storage (DBFS for legacy Hive metastore tables, or the catalog's managed location under Unity Catalog), so dropping the table also deletes the underlying files.

  • When to use: Ideal for storing cleaned and processed data that you’ll use for analysis or machine learning.

  • Example:

      CREATE TABLE managed_sales (order_id INT, product STRING, amount FLOAT);
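
Because Databricks owns the storage, you rarely need to think about where the files live, but you can check at any time. A minimal sketch against the managed_sales table above:

      -- Shows the physical location, size, and format of the managed table
      DESCRIBE DETAIL managed_sales;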
    

2. External Tables

  • What it is: These tables point to data stored in an external system like S3, ADLS, or GCS. Databricks only manages the metadata.

  • When to use: Use this when you want to access raw data stored in an external bucket without moving it.

  • Example:

      -- USING tells Databricks the format of the files already in the bucket (adjust to JSON, CSV, etc.)
      CREATE TABLE external_logs (log_id INT, message STRING)
      USING PARQUET
      LOCATION 's3://my-bucket/logs/';
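
Since Databricks only manages the metadata here, dropping the table never touches the files in the bucket, unlike dropping the managed table above:

      -- Removes only the table definition; the files under s3://my-bucket/logs/ remain
      DROP TABLE external_logs;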
    

3. Delta Tables

  • What it is: Tables that use the Delta Lake format, providing ACID transactions, versioning, and performance optimizations.

  • When to use: Perfect for transactional data that requires frequent updates and consistency.

  • Example:

      CREATE TABLE delta_transactions (txn_id INT, user_id INT, amount FLOAT) USING delta;
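
Since the whole point of Delta is ACID updates and versioning, here is a quick sketch of both against delta_transactions (the version number is just an example):

      -- In-place update with ACID guarantees
      UPDATE delta_transactions SET amount = amount * 1.1 WHERE user_id = 42;

      -- List the table's versions and query an older snapshot (time travel)
      DESCRIBE HISTORY delta_transactions;
      SELECT * FROM delta_transactions VERSION AS OF 0;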
    

4. Views (not a physical table, but super useful)

  • What it is: Saved queries that don’t store data physically but provide a virtual representation of the data.

  • When to use: Use views to simplify complex queries or control access to specific data.

  • Example:

      CREATE VIEW high_value_customers AS
      SELECT * FROM customers WHERE total_spent > 1000;
    

5. Temporary Tables

  • What it is: Session-scoped objects that exist only for the duration of the Spark session in which they’re created; in Databricks you typically define them as temporary views.

  • When to use: Great for storing intermediate data during a processing session.

  • Example:

      -- Spark SQL scopes these to the session as temporary views
      CREATE TEMPORARY VIEW temp_orders AS
      SELECT * FROM orders WHERE order_date = '2023-10-01';
    

6. Streaming Tables

  • What it is: Tables designed for real-time data processing using Structured Streaming.

  • When to use: Use these for processing continuous data streams, like application logs or real-time transactions.

  • Example:

      # df is a streaming DataFrame (e.g. from spark.readStream); a checkpoint location is required
      df.writeStream.format("delta").option("checkpointLocation", "/path/to/checkpoints").outputMode("append").start("/path/to/streaming_table")
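
On newer Databricks releases the same pattern can also be declared directly in SQL as a streaming table; a minimal sketch, assuming JSON files land in a hypothetical s3://my-bucket/events/ prefix:

      -- Incrementally ingests new files as they arrive
      CREATE OR REFRESH STREAMING TABLE raw_events
      AS SELECT * FROM STREAM read_files('s3://my-bucket/events/', format => 'json');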
    

7. Catalog Tables

  • What it is: Tables managed by Unity Catalog, which provides centralized governance and security.

  • When to use: Ideal for sharing data across teams with access control and auditing.

  • Example:

      -- Three-level name: <catalog>.<schema>.<table>
      CREATE TABLE catalog.sales.orders (order_id INT, product STRING, amount FLOAT);
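
The governance side means access is granted (and audited) declaratively; a minimal sketch, assuming a hypothetical analysts group already exists in your workspace:

      -- Give the analysts group read access; Unity Catalog records the grant for auditing
      GRANT SELECT ON TABLE catalog.sales.orders TO `analysts`;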
    

8. Delta Live Tables (DLT)

  • What it is: Tables created and managed automatically by the Delta Live Tables framework for building reliable data pipelines.

  • When to use: Use DLT for building and maintaining ETL/ELT pipelines.

  • Example:

      import dlt

      # DLT materializes the returned DataFrame as a managed table each time the pipeline runs
      @dlt.table
      def cleaned_sales():
          return spark.read.format("delta").load("/path/to/raw_sales")
    

9. Iceberg Tables

  • What it is: Tables that use the Apache Iceberg format for improved performance and data management on large datasets.

  • When to use: Use Iceberg for handling large-scale data with versioning and high performance.

  • Example:

      CREATE TABLE iceberg_sales (order_id INT, product STRING, amount FLOAT) USING iceberg;
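
A common alternative on Databricks is Delta's UniForm feature, which keeps the table in Delta format but also writes Iceberg metadata so external Iceberg clients can read it; a minimal sketch, assuming a recent runtime with UniForm support:

      -- Delta table that also exposes Iceberg metadata (UniForm)
      CREATE TABLE uniform_sales (order_id INT, product STRING, amount FLOAT)
      TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
      );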
    

Choosing the right table type in Databricks can make your life as a data engineer much easier. Whether you’re handling raw data, building pipelines, or processing real-time streams, there’s a table type designed for your use case. Keep this guide handy, and you’ll be able to pick the perfect table type for any scenario.

Quick reference:

Table Type       | Best For
-----------------|--------------------------------------------------------------------------
Managed Tables   | Storing cleaned and processed data for analysis or ML.
External Tables  | Accessing raw data in external storage (e.g., S3, ADLS) without moving it.
Delta Tables     | Handling transactional data with ACID guarantees and versioning.
Views            | Simplifying complex queries or controlling access to specific data.
Temporary Tables | Storing intermediate data during a single Spark session.
Streaming Tables | Processing real-time data streams (e.g., logs, transactions).
Catalog Tables   | Sharing data across teams with governance and access control.
DLT Tables       | Building reliable and maintainable ETL/ELT pipelines.
Iceberg Tables   | Managing large-scale datasets with high performance and versioning.