How backfilling works with Artie

When a backfill happens, Artie spins up two separate processes:

1. Backfill process (historical data)

  • Scans the full table and writes directly to your destination
  • Skips Kafka entirely
  • Runs in batches with parallelism (default: 5)
  • Can use a Read Replica to reduce load on your primary DB
2. CDC process (live changes)

  • Starts reading from the database’s transaction logs immediately
  • Queues inserts, updates, and deletes in Kafka while the backfill runs
  • Waits until backfill is done before applying changes

Once the backfill finishes, we hand off to the CDC stream and apply the queued changes in order, effectively “catching up” and bringing the destination into a real-time state.
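
To make the handoff concrete, here is a minimal, self-contained sketch of the two processes and the catch-up step. This is not Artie's actual code: the source table, transaction log, and destination are in-memory stand-ins, and a Go channel plays the role of Kafka.

```go
package main

import "fmt"

type event struct {
	key, value string
}

var (
	sourceTable = []event{{"a", "1"}, {"b", "2"}, {"c", "3"}} // historical rows
	txnLog      = []event{{"b", "2'"}, {"d", "4"}}            // live changes
	destination = map[string]string{}
)

func main() {
	queued := make(chan event, len(txnLog)) // stands in for the Kafka queue

	// CDC process: starts reading the transaction log immediately,
	// but only queues changes while the backfill is running.
	go func() {
		for _, ev := range txnLog {
			queued <- ev
		}
		close(queued)
	}()

	// Backfill process: scans the full table and writes directly to the
	// destination, skipping the queue entirely. (Artie does this in
	// parallel batches; a single loop keeps the sketch short.)
	for _, row := range sourceTable {
		destination[row.key] = row.value
	}

	// Handoff: once the backfill finishes, apply the queued changes in
	// order, bringing the destination into a real-time state.
	for ev := range queued {
		destination[ev.key] = ev.value
	}
	fmt.Println(destination) // map[a:1 b:2' c:3 d:4]
}
```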

This architecture is inspired by Netflix’s DBLog pattern, which lets backfills and CDC coexist without conflicts and without degrading database performance.

Benefits

Because Artie keeps these two streams separate, we avoid the usual tradeoffs and failure modes:

Data correctness issues

If CDC writes land while a backfill is running, older rows can overwrite newer ones — or duplicate records can sneak in.

Database strain

Some databases (like Postgres) retain WAL files until every replication slot has consumed them. If the CDC process stalls during a long backfill, retained WAL piles up on the source (slot bloat) and can eventually exhaust disk space.
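
If you want to watch for this yourself during a long backfill, Postgres exposes per-slot WAL retention in pg_replication_slots. Below is a minimal sketch using Go’s database/sql with the lib/pq driver; the connection string is a placeholder.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Retained WAL per slot; a value that keeps growing while the CDC
	// consumer is stalled is the "slot bloat" failure mode described above.
	rows, err := db.Query(`
		SELECT slot_name,
		       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
		FROM pg_replication_slots`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var slot, retained string
		if err := rows.Scan(&slot, &retained); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s retains %s of WAL\n", slot, retained)
	}
}
```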

When do we do backfills?

  1. When you first launch a pipeline
  2. When you add new tables
  3. When you trigger an ad hoc backfill from the dashboard

How many tables backfill at once?

By default, Artie backfills 10 tables in parallel per pipeline.

This keeps the load on your source DB manageable, especially during the initial sync of a large schema. Tables are marked as:

  • Queued to backfill — Waiting their turn in the queue
  • Backfilling — Actively running

If you have a high table count (some customers backfill 500+ tables), we can increase or decrease this parallelism. Just contact us and we’ll tune it for your setup.

How are backfills ordered?

We use a simple FIFO (first-in, first-out) model. Tables are backfilled in the order they were added, up to the concurrency limit.
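
A buffered channel is a natural way to express this model in Go. The sketch below is illustrative, not Artie’s implementation; backfillTable and the table names are made up, and only the FIFO-dispatch-with-a-concurrency-cap behavior matches the description above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const maxConcurrentBackfills = 10 // Artie's default per pipeline

func backfillTable(name string) {
	time.Sleep(100 * time.Millisecond) // stand-in for the actual backfill work
	fmt.Println("finished", name)
}

func main() {
	tables := []string{"users", "orders", "payments"} // in the order they were added

	sem := make(chan struct{}, maxConcurrentBackfills) // caps "Backfilling" tables
	var wg sync.WaitGroup

	// Tables are dispatched in FIFO order; anything beyond the cap waits
	// here, in the "Queued to backfill" state.
	for _, t := range tables {
		wg.Add(1)
		sem <- struct{}{} // blocks once 10 backfills are active
		go func(t string) {
			defer wg.Done()
			defer func() { <-sem }()
			backfillTable(t) // table is now "Backfilling"
		}(t)
	}
	wg.Wait()
}
```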

What happens to CDC changes during a backfill?

The CDC stream keeps running in the background while a backfill is in progress. Every change is captured in Kafka. Once the backfill is complete, Artie:

  1. Switches to the CDC stream
  2. Applies all queued changes in order
  3. Transitions the table to a fully streaming state

This prevents stale data from overwriting live updates and guarantees consistency.
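
Here is a tiny illustration of why step 2 applies changes in order: if two queued updates to the same row were applied out of order, the older value would win. The events and offsets below are made up.

```go
package main

import (
	"fmt"
	"sort"
)

type change struct {
	offset int // position in the log (e.g. a Kafka offset)
	key    string
	value  string
}

func main() {
	// Two updates to the same row, captured while the backfill ran.
	queued := []change{
		{offset: 2, key: "user:42", value: "email=new@example.com"},
		{offset: 1, key: "user:42", value: "email=old@example.com"},
	}

	// Applying in log order guarantees the destination converges on the
	// source's latest state, not on whichever write happened to land last.
	sort.Slice(queued, func(i, j int) bool { return queued[i].offset < queued[j].offset })

	dest := map[string]string{}
	for _, c := range queued {
		dest[c.key] = c.value
	}
	fmt.Println(dest["user:42"]) // email=new@example.com
}
```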

Edge cases and tuning

Some things to keep in mind:

  • Large tables (100M+ rows) may take time. We batch intelligently (see the sketch after this list), but performance depends on schema, indexes, and source DB size.
  • High write volume during backfill is fine — CDC logs will queue and apply in order.
  • Parallelism is tunable — we can adjust table concurrency or batch sizes if needed.
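
On batching: one common way to scan a very large table in cheap, resumable batches is keyset pagination on the primary key. This is a sketch under that assumption, not a statement of how Artie batches; the table, columns, and connection string are placeholders.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

const batchSize = 5000

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	lastID := int64(0)
	for {
		// Each batch resumes after the last primary key seen, so the scan
		// stays index-backed and never needs a large OFFSET.
		rows, err := db.Query(
			`SELECT id, payload FROM events WHERE id > $1 ORDER BY id LIMIT $2`,
			lastID, batchSize)
		if err != nil {
			log.Fatal(err)
		}
		n := 0
		for rows.Next() {
			var id int64
			var payload string
			if err := rows.Scan(&id, &payload); err != nil {
				log.Fatal(err)
			}
			lastID = id
			n++
			// The write to the destination would go here.
		}
		rows.Close()
		if n < batchSize {
			break // table fully scanned
		}
	}
}
```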