Artie replicates operational data into warehouses and lakes - reliably, in real time, and without the engineering overhead most pipelines come with. In this post, we’re diving into a small but powerful validation step that helps catch silent mismatches before they turn into downstream surprises.
TL;DR: we explain how validating the number of rows affected during data ingestion (e.g. COPY INTO, MERGE) can catch silent mismatches in pipelines. This in-flight data quality check adds another layer of assurance without adding overhead - and it’s now built into every Artie pipeline.
Why In-Flight Data Validation Matters for Data Pipelines
Our pipelines are built to be fast, reliable, and hands-off. But even with strong transactional guarantees across the stack, we believe in having guardrails - lightweight checks that catch the edge cases you didn’t see coming.
One of those guardrails? A dead-simple check: making sure the number of rows affected matches what we expect. This in-flight data pipeline validation helps protect against data quality issues that wouldn’t be caught by traditional observability.
How Validating Rows Affected Improves Data Accuracy
Sometimes things don’t break - they just quietly go wrong. A MERGE that succeeds but touches zero rows. A COPY INTO that loads fewer records than expected. These aren’t errors that necessarily get flagged. But left unchecked, they can lead to stale or incomplete data downstream.
That’s why we’re introducing a validation step: we track the number of rows we expect to load, and compare that against what actually lands in the destination. This real-time ETL check gives us confidence that what we buffered is what we committed.
This is one part of a broader effort to make correctness a first-class citizen in our pipeline architecture - from schema change detection to offset-aware retries - all focused on improving CDC pipeline quality.
Where In-Flight Data Quality Checks Happen in the Pipeline
Stepping back a bit, Artie pipelines look something like this:

Let’s use Snowflake as an example. Here’s what Artie Transfer does to land data (sketched in code after the list):
- Create a delta GZIP CSV file
- Upload the file into the staging table's internal stage (PUT)
- Copy the internal stage into the staging table (COPY INTO)
- Merge the staging table into the target table (MERGE)
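To make those steps concrete, here’s a rough Go sketch of the same sequence using database/sql and the Snowflake driver. The DSN, table names, columns, and the __deleted flag are illustrative assumptions for this post, not Artie’s actual implementation.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/snowflakedb/gosnowflake" // registers the "snowflake" driver
)

func main() {
	// DSN and object names below are illustrative.
	db, err := sql.Open("snowflake", "user:password@account/analytics/public?warehouse=compute_wh")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	steps := []string{
		// 1. Upload the gzipped delta CSV into the staging table's internal stage.
		`PUT file:///tmp/orders_delta.csv.gz @%orders_staging`,

		// 2. Copy the staged file into the staging table.
		`COPY INTO orders_staging
		 FROM @%orders_staging
		 FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')`,

		// 3. Merge the staging table into the target table.
		`MERGE INTO orders AS t
		 USING orders_staging AS s ON t.id = s.id
		 WHEN MATCHED AND s.__deleted THEN DELETE
		 WHEN MATCHED THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at
		 WHEN NOT MATCHED THEN INSERT (id, status, updated_at) VALUES (s.id, s.status, s.updated_at)`,
	}

	for _, stmt := range steps {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatalf("step failed: %v", err)
		}
	}
}
```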
We now check the number of rows affected - a form of in-flight data pipeline validation - during two key steps:
- The ROWS_LOADED result from COPY INTO
- The sum of INSERTED, UPDATED, and DELETED rows from MERGE

If either of those doesn’t match what we expect, we won’t commit the Kafka offset and will immediately flag the issue. This real-time validation protects data integrity - without adding overhead or complexity.
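Conceptually, the check reduces to a comparison like the sketch below. The struct and function names are ours for illustration; in practice the counts come from the COPY INTO result set (its ROWS_LOADED column) and the per-clause counts reported by MERGE.

```go
package main

import (
	"fmt"
	"log"
)

// mergeResult holds the per-clause counts reported by the destination's MERGE.
type mergeResult struct {
	inserted int64
	updated  int64
	deleted  int64
}

// validateRowsAffected compares what we buffered against what the warehouse
// reports it actually touched. expected is the number of rows in the delta file.
func validateRowsAffected(expected, copyRowsLoaded int64, m mergeResult) error {
	if copyRowsLoaded != expected {
		return fmt.Errorf("COPY INTO loaded %d rows, expected %d", copyRowsLoaded, expected)
	}
	if affected := m.inserted + m.updated + m.deleted; affected != expected {
		return fmt.Errorf("MERGE affected %d rows (%d inserted, %d updated, %d deleted), expected %d",
			affected, m.inserted, m.updated, m.deleted, expected)
	}
	return nil
}

func main() {
	// Counts here are stand-ins for values read from the COPY INTO and MERGE result sets.
	expected := int64(1_000)
	rowsLoaded := int64(1_000)
	merged := mergeResult{inserted: 400, updated: 600}

	if err := validateRowsAffected(expected, rowsLoaded, merged); err != nil {
		// Mismatch: flag the issue and do NOT commit the Kafka offset,
		// so the batch is retried instead of silently dropped.
		log.Fatalf("rows-affected check failed: %v", err)
	}

	// Only now is it safe to commit the Kafka offset for this batch.
	fmt.Println("rows-affected check passed; committing offset")
}
```

Because the offset only advances after the check passes, a mismatch means the batch is retried rather than silently skipped.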
How Row-Level Validation Helped Optimize Redshift Ingestion
Here’s a real example from our Redshift ingestion path.
We wrap INSERT, UPDATE, and DELETE commands into a single transaction. At first, we ran them in this order: insert → update → delete. The pipeline worked fine - data landed correctly, and there was no visible inconsistency.
But then our rows affected check flagged something unexpected: the UPDATE step was consistently affecting zero rows.
It turned out to be a sequencing issue. By running INSERT before UPDATE, we were sometimes trying to update rows we had just inserted. Since those values were already fresh, the update became a no-op - wasted work that added latency and compute overhead.
By switching to update → insert → delete, we made the process more efficient: updates now target existing records, inserts only add new ones, and deletes clean up what’s no longer needed.
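Here’s a minimal Go sketch of what that reordered Redshift transaction could look like, with a rows-affected check after each statement. It assumes a Postgres-compatible driver (lib/pq, since Redshift speaks the Postgres wire protocol) plus illustrative table names and a __deleted flag - none of this is Artie’s actual schema.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-compatible driver works for Redshift
)

func mergeIntoRedshift(db *sql.DB) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // effectively a no-op once the transaction commits

	// New order: update -> insert -> delete, with a rows-affected check per step.
	steps := []string{
		`UPDATE orders SET status = s.status, updated_at = s.updated_at
		   FROM orders_staging s WHERE orders.id = s.id AND NOT s.__deleted`,
		`INSERT INTO orders (id, status, updated_at)
		   SELECT id, status, updated_at FROM orders_staging s
		   WHERE NOT s.__deleted AND NOT EXISTS (SELECT 1 FROM orders o WHERE o.id = s.id)`,
		`DELETE FROM orders USING orders_staging s WHERE orders.id = s.id AND s.__deleted`,
	}

	for _, stmt := range steps {
		res, err := tx.Exec(stmt)
		if err != nil {
			return err
		}
		n, err := res.RowsAffected()
		if err != nil {
			return err
		}
		// Under the old insert-first ordering, the UPDATE step reported 0 here -
		// exactly the signal the rows-affected check surfaced.
		log.Printf("rows affected: %d for %.30s...", n, stmt)
	}

	return tx.Commit()
}
```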
The data was always correct. But without this check, we would’ve kept paying a silent cost in unnecessary operations - the kind that add up at scale.
Building Data Quality Into Your Real-Time Data Stack
Most replication issues don’t explode - they whisper. A few rows go missing. A change doesn’t land. And a month later, someone (or worse, a customer) notices a dashboard that looks weird.
By checking row counts at each stage, we’re making sure your data doesn’t just move - it arrives as expected.
We’ve rolled this out across all pipelines, and it’s already helping us tighten the screws on efficiency and correctness. Want to see it in action or compare notes? Let’s talk.