The Ultimate Guide to Snowpipe vs Snowpipe Streaming

Jacqueline Cheong
Updated on May 28, 2025
Data know-how

Artie replicates operational data into warehouses and lakes - reliably, in real time, and without the engineering overhead most pipelines come with. In this post, we’re breaking down the real differences between Snowpipe and Snowpipe Streaming - what they do well, where they fall short, and how to pick the right one.

Note: This post is mostly technical and product-agnostic, but we’ll briefly explain how Artie fits into the picture at the end.

Getting Data Into Snowflake

If you're pushing data into Snowflake, you've probably heard of Snowpipe. Maybe Snowpipe Streaming too. They sound similar and even serve a similar purpose: continuously loading data into Snowflake. But they work very differently.

Knowing when to use one over the other can save you a lot of time, debugging effort, and compute credits. So let’s break down how each one works, where they shine, and where they fall short. And yeah, we’ll talk about what happens when you need UPSERTs, schema changes, or real-time guarantees.

Snowpipe

Snowpipe’s your go-to when your data shows up in files and you just want Snowflake to grab them. You drop a file (CSV, JSON, Parquet, etc.) into a cloud bucket like S3 or GCS or Azure Blob, and Snowflake picks it up automatically using COPY INTO behind the scenes.

It’s serverless and scales on its own, which is great. But you still have to wire up cloud events (like S3 + SNS/SQS) to tell Snowflake when files arrive.

And pricing? You’re billed per file plus serverless compute for the data loaded. If you’re dropping thousands of tiny files, that gets expensive fast.
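
To make that concrete, here’s a minimal sketch of the Snowflake side of that wiring. All object names are hypothetical, it assumes a storage integration already exists, and the S3 event notification (SNS/SQS) piece still has to be configured on the AWS side.

```sql
-- Hypothetical names; assumes the storage integration is already set up.
CREATE STAGE raw_orders_stage
  URL = 's3://my-bucket/orders/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = 'JSON');

-- AUTO_INGEST = TRUE makes Snowpipe load files as S3 event notifications arrive.
-- raw_orders is assumed to be a landing table with a single VARIANT column.
CREATE PIPE orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_orders
  FROM @raw_orders_stage;
```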

Pros

  • No infra to manage
  • Auto-scales and runs serverless
  • Great for batch-style ingestion (e.g. daily files)
  • Pretty simple architecture - just drop files in a stage and Snowpipe handles the load

Cons

  • 1-5 minute latency from file drop to table load
  • No UPSERT support - Snowpipe is an append-only operation and you’ll need to implement downstream processing for deduplication or updates
  • Requires setting up cloud event notifications (e.g. S3 + SNS/SQS, GCS + Pub/Sub)
  • Snowpipe is designed to handle files, not individual events 
  • No explicit ordering guarantees across files
  • Snowpipe deduplicates at the file level by filename, so if you re-upload a file with the same name, it will be skipped. Remember to use unique filenames to ensure proper ingestion and avoid silent data loss
  • Schema evolution is limited (no renames, no drops) but you can configure it to auto-add some columns

DML and DDL Handling

UPSERTs: ❌ Not supported. Use Snowflake Streams & Tasks, Dynamic Tables, or scheduled MERGE statements after load - and pray you don’t forget a config somewhere.

New Columns: ⚠️ New columns are ignored unless:
1. You explicitly set the COPY INTO to use MATCH_BY_COLUMN_NAME
2. ENABLE_SCHEMA_EVOLUTION = TRUE is set on the table

If both settings are enabled, Snowflake will add the new column and infer its data type from the file content. Note that the inference rules are limited and there are other constraints. For example:
- Numeric-looking strings (e.g. "123") might be inferred as NUMBER
- Alphabetic strings (e.g. "abc") would be inferred as TEXT
- Ambiguous or mixed types might default to VARIANT
- You can only add a maximum of 10 new columns per COPY operation

Deleted Columns: 🚫 This is where things get messy. If a previously present column is missing in new data, Snowflake assumes it’s nullable and sets it to NULL. No error is thrown, so this can be risky if it goes unnoticed.

Type Changes: 🚨 COPY INTO won’t convert column types. If the data can’t be cast to the table’s column type, it results in an error. For example: trying to load a string like `abc` into an INTEGER column will fail.

Best Practice for Using Snowpipe

Land raw data in a staging table first. Then handle dedupes and UPSERTs downstream using MERGE, Streams & Tasks, or Dynamic Tables.
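
As a rough sketch of that downstream step (the table and column names here are made up), a scheduled MERGE that keeps only the latest row per key might look like this:

```sql
-- orders_staging is the raw Snowpipe landing table; orders is the deduped target.
MERGE INTO orders AS t
USING (
    -- Keep only the most recent row per order_id from the staging table.
    SELECT *
    FROM orders_staging
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
  VALUES (s.order_id, s.status, s.updated_at);
```

Wrapping that statement in a Task (or letting Streams & Tasks or Dynamic Tables manage the incremental logic) is what keeps it running on a schedule.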

If you want Snowflake to auto-add new columns, you’ll need to turn on ENABLE_SCHEMA_EVOLUTION = TRUE and use MATCH_BY_COLUMN_NAME. Otherwise, extra columns in your files will just get ignored.
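
Concretely, that combination looks something like this (the table and stage names are placeholders):

```sql
-- Let Snowflake add columns it discovers in incoming files.
ALTER TABLE orders_staging SET ENABLE_SCHEMA_EVOLUTION = TRUE;

-- In the pipe's COPY INTO, load by column name instead of column position.
COPY INTO orders_staging
FROM @orders_stage
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```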

Stick to JSON or Parquet if you can - Snowflake does a better job inferring types with those. And if you're loading files, batch them. Too many tiny files = extra overhead.

Format Quirks (CSV vs. JSON vs. Parquet)

CSV is the easiest to generate - but the hardest to work with. Every field is loaded as a string unless you manually cast it. Schema evolution is also more fragile with CSV. JSON and Parquet are better choices: they support richer type inference and cleaner schema evolution.

Pro tip: avoid mixed-type columns in JSON (e.g. 123 in one row, "abc" in another). Snowflake will treat the column as VARIANT, which makes downstream modeling harder.
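
If a column does land as VARIANT anyway, you can still pull typed values out downstream. A small illustrative example (the column and table names are made up):

```sql
-- amount landed as VARIANT with a mix of numbers and strings.
-- TRY_TO_NUMBER returns NULL instead of erroring on values that don't parse.
SELECT
    amount                        AS amount_raw,
    TRY_TO_NUMBER(amount::STRING) AS amount_numeric
FROM raw_orders;
```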

Snowpipe Streaming

Snowpipe Streaming skips the whole file-drop dance. Instead of waiting for files to land in S3, you push rows straight into Snowflake using their SDK or Kafka connector. It’s built for low-latency use cases.

Here’s the catch: you’re the one doing the pushing. You’ll need to read data from the source, then call the Snowflake Ingest API to insert it. Snowflake takes care of the backend infra, but you're on the hook for running the producer - and you're billed per row.

Pros

  • Lower latency since there is no staging delay
  • No need to manage cloud storage or files
  • Exactly-once semantics (per stream channel)
  • Ideal for Kafka, Flink, or custom producers

Cons

  • You have to build and run a producer (or use a tool that does this)
  • INSERT-only; no native support for UPSERTs or DELETEs - UPSERTs must be handled post-ingestion using Snowflake Streams & Tasks, Dynamic Tables, or scheduled MERGE statements
  • Schema changes must be handled manually unless you're using Kafka with a schema registry
  • Operational complexity: you own the producer, retries, logging, and monitoring

Handling Failures in Snowpipe Streaming

If your producer crashes mid-ingestion, Snowflake won’t recover the data for you. Exactly-once delivery relies on your app tracking offsets and retrying intelligently. For Kafka, this usually means persisting the last committed offset and making ingestion idempotent - otherwise you risk data inconsistencies.

There’s no built-in retry mechanism in the Snowflake SDK. You’ll need to implement backoff logic and error handling yourself. It works - but only if you’re disciplined.

DML and DDL Handling

UPSERTs: ❌ Not supported. Use Snowflake Streams & Tasks, Dynamic Tables, or scheduled MERGE statements after load - and pray you don’t forget a config somewhere.

New Columns: ⚠️ Limited support. In most cases you’ll need to add new columns yourself using ALTER TABLE (see the sketch below). Automatic column addition is only supported if you’re using the Kafka Snowpipe Streaming connector with a schema registry - and even then, it’s limited to certain formats like Avro.

Deleted Columns: ⚠️ If you insert fewer columns than the table expects:
- Missing values = NULL (if nullable)
- If a missing column is NOT NULL, it will error out

Type Changes: ❌ There’s no automatic type casting or evolution. If you’re sending data that doesn’t match the column type, ingestion will fail for that row. For example: sending `abc` into an INTEGER column.
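
In practice, “manual schema management” usually means running an ALTER TABLE before the producer starts sending the new field. A hypothetical example:

```sql
-- Add the new column as nullable so rows that don't include it yet still load.
ALTER TABLE raw_events ADD COLUMN device_type STRING;
```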

Best Practice for Using Snowpipe Streaming

Start by streaming into a raw table with a flexible schema - think: mostly strings, all columns nullable. That way, you won’t break the pipeline when something unexpected shows up.

Then clean it up later. Do your transformations in a second step (classic ELT). It gives you breathing room and keeps ingestion stable.
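
A minimal sketch of that pattern, with made-up table and column names:

```sql
-- Landing table for Snowpipe Streaming: permissive types, everything nullable.
CREATE TABLE raw_events (
    event_id    STRING,
    payload     VARIANT,
    received_at TIMESTAMP_NTZ
);

-- Downstream ELT step: project typed columns out of the raw payload.
CREATE OR REPLACE TABLE events_clean AS
SELECT
    event_id,
    payload:user_id::NUMBER                       AS user_id,
    payload:event_type::STRING                    AS event_type,
    TRY_TO_TIMESTAMP(payload:occurred_at::STRING) AS occurred_at
FROM raw_events;
```

A scheduled Task or a Dynamic Table can own that second step, so the clean table stays fresh without manual runs.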

TL;DR

Snowpipe is best suited for file-based batch loads where minimal setup is important and real-time latency isn’t a strict requirement. It works well when your data lands in cloud storage periodically, like hourly or daily.

Snowpipe Streaming offers lower latency and is designed for streaming data in near real-time. But with that speed comes complexity - you’ll need to manage a producer application, handle retries, ensure row ordering, and deal with schema changes manually.

Neither tool does UPSERTs - both are append-only. If you want to deduplicate, UPDATE, or DELETE rows, you'll need to roll your own logic downstream.

Same for schema changes. Snowpipe supports optional column adds (with config). Snowpipe Streaming won't evolve schemas unless you're using the Kafka connector + Schema Registry (and even then, only certain formats are supported).

Artie: Real-Time Pipelines, Minus the Headaches

Sometimes you don’t want to glue together 5 Snowflake features to make updates work. You just want your database changes - INSERT, UPDATE, DELETE - to land in Snowflake in the right order, in real time, without breaking when a column gets renamed.

We built Artie for the folks who are tired of babysitting pipelines. We get your data into Snowflake fast, with zero engineering overhead.

If you're tired of:

  • Append-only limitations
  • Hand-rolled MERGE logic
  • Breakage from schema changes

...Artie might be what you're really looking for. 

If you’re interested, here's how all three options stack up, side by side.

Ease of Setup
  • Snowpipe: Moderate. Requires setting up an external stage (e.g. S3), a pipe, and cloud event notifications (e.g. S3 → SNS/SQS). Some manual steps.
  • Snowpipe Streaming: Higher. Requires building or deploying a client (e.g. Java SDK app or Kafka connector) to push data to Snowflake’s streaming API.
  • Artie: Very easy. Fully managed via UI. Connect source DB and Snowflake with minimal configuration. No infrastructure to run or maintain.

Data Consistency
  • Snowpipe: At-least-once at file level. Duplicate file uploads are skipped based on file name. No row-level deduplication or ordering guarantees.
  • Snowpipe Streaming: Exactly-once per stream channel, assuming proper use of offset tracking. Maintains row order within each channel.
  • Artie: Exactly-once across the entire change stream. Artie preserves commit order and avoids duplicates using log positions and transactional updates.

UPSERT Support
  • Snowpipe: ❌ Not natively supported. Loads are append-only. To deduplicate or apply updates, you must build a MERGE pipeline using Streams & Tasks.
  • Snowpipe Streaming: ❌ Not supported. INSERT-only API. To apply changes, stream into a staging table and use downstream MERGE logic.
  • Artie: ✅ Native support for INSERT, UPDATE, and DELETE. Artie continuously syncs changes using log-based CDC and applies them to Snowflake.

Schema Evolution
  • Snowpipe: ⚠️ Can auto-add new columns if ENABLE_SCHEMA_EVOLUTION and MATCH_BY_COLUMN_NAME are enabled. No support for drops or renames.
  • Snowpipe Streaming: ⚠️ Manual schema management unless using the Kafka connector with a schema registry, which can auto-add new fields.
  • Artie: ✅ Fully automated. Artie detects schema changes in the source (adds/drops/renames) and applies them in Snowflake to keep schemas aligned.

Latency
  • Snowpipe: ⏱️ Near real-time (1–5 min). Loads files in micro-batches after they land in storage.
  • Snowpipe Streaming: ⚡ Low latency (typically sub-second). Flushes small batches directly into Snowflake in near real-time.
  • Artie: ⚡ Low latency (seconds). Continuously streams change events from the database with minimal lag.

Cost Model
  • Snowpipe: 💵 Charged per file plus serverless compute for the data loaded. May be costlier with high file churn.
  • Snowpipe Streaming: Charged per row. You manage the streaming infrastructure (e.g. Kafka Connect). Cost depends on volume and client setup.
  • Artie: Subscription-based, with charges correlated to rows processed (excluding backfills). Fully managed, includes compute and infra.

Ordering Guarantees
  • Snowpipe: ❌ No guarantees. Files may be loaded out of order depending on when they arrive and are processed.
  • Snowpipe Streaming: ✅ In-order within a single channel (e.g. Kafka partition). No ordering across multiple channels.
  • Artie: ✅ Global ordering within each table, based on database commit logs. Preserves real-world update sequences.

Monitoring / Observability
  • Snowpipe: Basic monitoring via the Snowflake UI and table functions (e.g. COPY_HISTORY). No pipeline-level observability or alerting.
  • Snowpipe Streaming: Depends on the client. The Kafka connector provides metrics; the SDK requires custom logging. Some observability but not centralized.
  • Artie: Robust web dashboard with ingestion health, row-level stats, schema changes, and alerting.

Best For
  • Snowpipe: Batch-style data (e.g. periodic file drops from external vendors or systems). Simple to maintain once set up.
  • Snowpipe Streaming: Real-time event ingestion from stream processors or pub/sub systems. Ideal when latency is critical.
  • Artie: Full-fidelity, low-latency database replication to Snowflake. Great for analytics, CDC, and ELT with minimal maintenance overhead.