Loading Data into Snowflake – Mastering the Essentials

Sriram Krishnan

Every powerful analytics platform is built on a foundation of well-loaded data. The process of ingestion is the critical first step that defines the speed, freshness, and reliability of your entire data ecosystem. Snowflake provides a flexible and powerful suite of tools for this, tailored to every need—from massive nightly batch loads to real-time event streams.

In this chapter, we will master the essentials of data loading, exploring the trade-offs between bulk, continuous, and streaming methods. We'll cover the "what," "why," and "how" for each, so you can choose the right tool for the right job.


Part 1: The Core Loading Methods

A. The Workhorse: Bulk Loading with COPY INTO

The COPY INTO <table> command is the bedrock of high-volume data ingestion in Snowflake. It is designed for efficient, large-scale batch loading of files from a staging area into a Snowflake table.

  • When to Use It: For predictable, scheduled data loads. Imagine you receive a 100 GB data drop from a vendor every night, or you need to run a large data backfill.

  • How it Works: It uses a running virtual warehouse to load data in parallel from files in an internal or external stage (like S3, Azure Blob, or GCS).

  • Warehouse Sizing: The size of the warehouse determines parallelism. A larger warehouse does not make loading a single file faster, but it loads more files in parallel. Snowflake recommends that the number of data files you load should be a multiple of the number of threads in your warehouse (an X-Small has 8 threads, a Small has 16, and so on).

  • Transformations on the Fly: COPY is more than just a load command. You can reorder columns, omit columns, and perform light transformations like casting data types directly within the statement.

Example: A Typical Nightly Batch Load

-- First, define your external stage and file format
CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://my-company-data/daily_drops/'
  CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...');
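-- (In production, a STORAGE INTEGRATION is generally preferred over embedding credentials.)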

CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = 'PARQUET';

-- Execute the load using a running warehouse
COPY INTO production_db.public.orders
  FROM (
    -- Select specific columns and cast a data type during load
    SELECT
      $1:order_id::STRING,
      $1:customer_id::STRING,
      $1:order_total::NUMBER(10, 2),
      $1:order_datetime::TIMESTAMP_NTZ,
      METADATA$FILENAME, -- Ingest file metadata as well
      CURRENT_TIMESTAMP()
    FROM @my_s3_stage
  )
  PATTERN = '.*[.]parquet' -- Use a pattern to select only Parquet files
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');
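
Once a bulk load finishes, it is worth checking what actually landed before downstream jobs run. The sketch below uses the COPY_HISTORY table function to list files loaded into the orders table over the last day; adjust the database, table name, and time window to your environment.

-- Review the outcome of recent COPY operations against the orders table
SELECT file_name, row_count, row_parsed, first_error_message, status
FROM TABLE(production_db.information_schema.copy_history(
  TABLE_NAME => 'ORDERS',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())
));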

B. The Automator: Continuous Loading with Snowpipe

Snowpipe automates the COPY INTO process for micro-batches. Instead of you scheduling a load, Snowpipe loads new files automatically as soon as they arrive in your stage.

  • When to Use It: For continuous, event-driven data feeds where low latency is important. For example, an e-commerce platform that generates order confirmation files every few seconds.

  • How it Works: It's a serverless feature. You define a PIPE object that links a stage to a target table. When a new file appears, a cloud provider notification (e.g., S3 Event Notification) triggers the pipe, and Snowflake loads the data using its own internal compute resources.

  • The Cost Model: Snowpipe billing is different from warehouse billing. It has two components:

    1. A small, per-file overhead charge.

    2. A serverless compute charge based on the actual processing time.
      This model makes it crucial to avoid sending thousands of tiny files.

Example: Creating an Automated Ingestion Pipeline

-- Create a pipe that wraps a COPY INTO statement.
-- The AUTO_INGEST=TRUE flag tells Snowflake to listen for stage events.
CREATE OR REPLACE PIPE production_db.public.orders_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO production_db.public.orders
  FROM @my_s3_stage
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');

-- Note: For auto-ingest from S3, you must also configure an event notification on the
-- bucket that targets the SQS queue ARN shown in the pipe's notification_channel
-- column (visible via SHOW PIPES).
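
Once the pipe exists, it helps to confirm it is wired up and running. A minimal check, assuming the pipe created above: SHOW PIPES exposes the notification_channel (the SQS queue ARN to use in the S3 event notification), and SYSTEM$PIPE_STATUS reports the pipe's execution state and pending file count.

-- Locate the notification channel to configure on the S3 bucket
SHOW PIPES LIKE 'orders_pipe' IN SCHEMA production_db.public;

-- Check that the pipe is running and whether files are pending
SELECT SYSTEM$PIPE_STATUS('production_db.public.orders_pipe');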

C. The Real-Time Engine: Snowpipe Streaming & Kafka

For true real-time use cases where latency must be measured in seconds, not minutes, Snowpipe Streaming is the answer.

  • When to Use It: For high-volume, row-by-row event streams. Think of IoT sensors sending thousands of readings per second or real-time mobile application logging.

  • The Key Distinction: Unlike COPY or Snowpipe, which load files, Snowpipe Streaming ingests rows directly into Snowflake tables without creating intermediate files.

  • How it Works: It uses a native Java SDK that you integrate directly into your data-producing application. Your application opens a channel to a Snowflake table and writes rows directly over it. This bypasses the need for staging files entirely.

  • Kafka Integration: For those already using Apache Kafka, the Snowflake Kafka Connector can run in SNOWPIPE_STREAMING mode to achieve the same low-latency, row-based ingestion.
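
On the Snowflake side, nothing special is required beyond the target table; the channels themselves are opened from the Java SDK or the Kafka connector, not from SQL. A minimal sketch using a hypothetical sensor_readings table, with SHOW CHANNELS to inspect open streaming channels and their committed offsets:

-- Target table that streaming channels write rows into
CREATE TABLE IF NOT EXISTS production_db.public.sensor_readings (
  device_id   STRING,
  reading_ts  TIMESTAMP_NTZ,
  temperature NUMBER(6, 2)
);

-- Inspect open Snowpipe Streaming channels and their offsets
SHOW CHANNELS IN TABLE production_db.public.sensor_readings;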


Part 2: Supporting Methods and Best Practices

Ad-Hoc Uploads: The Web Interface

For quick, one-off tasks, the Snowsight UI provides a simple wizard to upload local files. It's perfect for a business analyst who just received a small CSV from a colleague or for initial data exploration. It supports schema inference and can create a new table for you on the fly.
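
The same schema inference that powers the wizard can also be scripted. The sketch below shows one way to do it in SQL with INFER_SCHEMA, assuming a hypothetical staged CSV folder; the stage, file format, and table names are placeholders.

-- CSV file format that reads the header row (required for INFER_SCHEMA on CSV)
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  PARSE_HEADER = TRUE;

-- Create a table whose columns are inferred from the staged files
CREATE OR REPLACE TABLE analytics_db.public.uploaded_sample
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(
      LOCATION    => '@my_s3_stage/ad_hoc/',
      FILE_FORMAT => 'my_csv_format'
    ))
  );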

Best Practices for Efficient Loading

  1. Choose the Right Method: This is the most important decision.

    • Scheduled Batch: Use COPY INTO.

    • Near Real-Time (Files): Use Snowpipe.

    • Ultra-Low Latency (Rows): Use Snowpipe Streaming.

  2. Optimize File Sizes: The sweet spot for files loaded into Snowflake is 100-250 MB (compressed).

    • For Snowpipe: Consolidating smaller events into larger files before uploading is critical to avoid the per-file overhead cost.

    • For COPY INTO: This size range allows for optimal parallelism without creating too much management overhead.

  3. Use a Clear Staging Strategy: Organize your files logically in your cloud storage using prefixes (e.g., /app_name/table_name/YYYY/MM/DD/). This makes loading, reprocessing, and auditing much simpler.
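
A dated prefix layout pays off because both LIST and COPY INTO accept a path beneath the stage, so a single day can be inspected or reloaded in isolation. A minimal sketch, reusing the stage and table from the earlier examples with a hypothetical prefix and date; MATCH_BY_COLUMN_NAME stands in for the column transformations shown earlier.

-- See which files arrived under a specific day's prefix (hypothetical path)
LIST @my_s3_stage/retail_app/orders/2024/06/01/;

-- Reload or backfill just that day's files
COPY INTO production_db.public.orders
  FROM @my_s3_stage/retail_app/orders/2024/06/01/
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE; -- map Parquet fields to table columns by name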


Summary Table: Choosing Your Loading Method

| Loading Method | Typical Latency | Compute Model | Complexity | Ideal For |
| --- | --- | --- | --- | --- |
| COPY INTO (Batch) | Minutes–Hours | User-managed Warehouse | Low (SQL) | Large nightly batches, backfills, scheduled ETL. |
| Snowpipe (Auto-Ingest) | Minutes | Serverless (Snowflake-managed) | Medium (SQL + Cloud Setup) | Continuous, event-driven micro-batches. |
| Snowpipe Streaming | Seconds | Serverless (Snowflake-managed) | High (Requires Java SDK) | True real-time, row-based event streams. |
| Web UI Upload | Minutes | User-managed Warehouse | Minimal (UI Wizard) | Ad-hoc uploads, small files, testing. |

Final Thoughts

Loading data is the gateway to analytics in Snowflake. By understanding the fundamental trade-offs between batch, micro-batch, and streaming methods, you can design a data ingestion architecture that is both performant and cost-efficient. Whether you’re loading gigabytes of data nightly with COPY INTO or millions of events per minute with Snowpipe Streaming, Snowflake provides the tools to match your needs. Choosing the right tool is the first and most critical step toward building a successful data platform.
