How backfilling works with Artie

When a backfill happens, Artie spins up two separate processes:

1. Backfill process (historical data)

  • Scans the full table and writes directly to your destination
  • Skips Kafka entirely
  • Runs in batches with parallelism (default: 5)
  • Can use a Read Replica to reduce load on your primary DB
2. CDC process (live changes)

  • Starts reading from the database’s transaction logs immediately
  • Queues inserts, updates, and deletes in Kafka while the backfill runs
  • Waits until backfill is done before applying changes

Once the backfill finishes, we hand off to the CDC stream and apply the queued changes in order, effectively “catching up” and bringing the destination into a real-time state.
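
To make the handoff concrete, here is a minimal, self-contained sketch of the two processes and the catch-up step. This is not Artie's actual code: the source table, transaction log, and destination are in-memory stand-ins, and a Go channel plays the role of Kafka.

```go
package main

import "fmt"

type event struct {
	key, value string
}

var (
	sourceTable = []event{{"a", "1"}, {"b", "2"}, {"c", "3"}} // historical rows
	txnLog      = []event{{"b", "2'"}, {"d", "4"}}            // live changes
	destination = map[string]string{}
)

func main() {
	queued := make(chan event, len(txnLog)) // stands in for the Kafka queue

	// CDC process: starts reading the transaction log immediately,
	// but only queues changes while the backfill is running.
	go func() {
		for _, ev := range txnLog {
			queued <- ev
		}
		close(queued)
	}()

	// Backfill process: scans the full table and writes directly to the
	// destination, skipping the queue entirely. (Artie does this in
	// parallel batches; a single loop keeps the sketch short.)
	for _, row := range sourceTable {
		destination[row.key] = row.value
	}

	// Handoff: once the backfill finishes, apply the queued changes in
	// order, bringing the destination into a real-time state.
	for ev := range queued {
		destination[ev.key] = ev.value
	}
	fmt.Println(destination) // map[a:1 b:2' c:3 d:4]
}
```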

This architecture is inspired by Netflix’s DBLog pattern, which lets backfills and CDC coexist without conflicts and without degrading database performance.

Benefits

Because Artie keeps these two streams separate, we avoid the usual tradeoffs and failure modes:

Data correctness issues

If CDC writes land while a backfill is running, older rows can overwrite newer ones — or duplicate records can sneak in.

Database strain

Some databases (like Postgres) retain WAL files until every replication slot has consumed them. If the CDC process stalls during a long backfill, retained WAL piles up on the source (slot bloat) and can eventually exhaust disk space.
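
If you want to watch for this yourself during a long backfill, Postgres exposes per-slot WAL retention in pg_replication_slots. Below is a minimal sketch using Go’s database/sql with the lib/pq driver; the connection string is a placeholder.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Retained WAL per slot; a value that keeps growing while the CDC
	// consumer is stalled is the "slot bloat" failure mode described above.
	rows, err := db.Query(`
		SELECT slot_name,
		       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
		FROM pg_replication_slots`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var slot, retained string
		if err := rows.Scan(&slot, &retained); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s retains %s of WAL\n", slot, retained)
	}
}
```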

When do we do backfills?

  1. When you first launch a pipeline
  2. When you add new tables
  3. When you trigger an ad hoc backfill from the dashboard

How many tables backfill at once?

By default, Artie backfills 10 tables in parallel per pipeline.

This keeps the load on your source DB manageable, especially during the initial sync of a large schema. Tables are marked as:

  • Queued to backfill — Waiting their turn in the queue
  • Backfilling — Actively running

If you have a high table count (some customers backfill 500+ tables), we can increase or decrease this parallelism. Just contact us and we’ll tune it for your setup.

How are backfills ordered?

We use a simple FIFO (first-in, first-out) model. Tables are backfilled in the order they were added, up to the concurrency limit.
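
A buffered channel is a natural way to express this model in Go. The sketch below is illustrative, not Artie’s implementation; backfillTable and the table names are made up, and only the FIFO-dispatch-with-a-concurrency-cap behavior matches the description above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const maxConcurrentBackfills = 10 // Artie's default per pipeline

func backfillTable(name string) {
	time.Sleep(100 * time.Millisecond) // stand-in for the actual backfill work
	fmt.Println("finished", name)
}

func main() {
	tables := []string{"users", "orders", "payments"} // in the order they were added

	sem := make(chan struct{}, maxConcurrentBackfills) // caps "Backfilling" tables
	var wg sync.WaitGroup

	// Tables are dispatched in FIFO order; anything beyond the cap waits
	// here, in the "Queued to backfill" state.
	for _, t := range tables {
		wg.Add(1)
		sem <- struct{}{} // blocks once 10 backfills are active
		go func(t string) {
			defer wg.Done()
			defer func() { <-sem }()
			backfillTable(t) // table is now "Backfilling"
		}(t)
	}
	wg.Wait()
}
```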

What happens to CDC changes during a backfill?

The CDC stream keeps running in the background while a backfill is in progress. Every change is captured in Kafka. Once the backfill is complete, Artie:

  1. Switches to the CDC stream
  2. Applies all queued changes in order
  3. Transitions the table to a fully streaming state

This prevents stale data from overwriting live updates and guarantees consistency.
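
Here is a tiny illustration of why step 2 applies changes in order: if two queued updates to the same row were applied out of order, the older value would win. The events and offsets below are made up.

```go
package main

import (
	"fmt"
	"sort"
)

type change struct {
	offset int // position in the log (e.g. a Kafka offset)
	key    string
	value  string
}

func main() {
	// Two updates to the same row, captured while the backfill ran.
	queued := []change{
		{offset: 2, key: "user:42", value: "email=new@example.com"},
		{offset: 1, key: "user:42", value: "email=old@example.com"},
	}

	// Applying in log order guarantees the destination converges on the
	// source's latest state, not on whichever write happened to land last.
	sort.Slice(queued, func(i, j int) bool { return queued[i].offset < queued[j].offset })

	dest := map[string]string{}
	for _, c := range queued {
		dest[c.key] = c.value
	}
	fmt.Println(dest["user:42"]) // email=new@example.com
}
```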

Edge cases and tuning

Some things to keep in mind:

  • Large tables (100M+ rows) may take time. We batch intelligently (see the sketch after this list), but performance depends on schema, indexes, and source DB size.
  • High write volume during backfill is fine — CDC logs will queue and apply in order.
  • Parallelism is tunable — we can adjust table concurrency or batch sizes if needed.
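
On batching: one common way to scan a very large table in cheap, resumable batches is keyset pagination on the primary key. This is a sketch under that assumption, not a statement of how Artie batches; the table, columns, and connection string are placeholders.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

const batchSize = 5000

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	lastID := int64(0)
	for {
		// Each batch resumes after the last primary key seen, so the scan
		// stays index-backed and never needs a large OFFSET.
		rows, err := db.Query(
			`SELECT id, payload FROM events WHERE id > $1 ORDER BY id LIMIT $2`,
			lastID, batchSize)
		if err != nil {
			log.Fatal(err)
		}
		n := 0
		for rows.Next() {
			var id int64
			var payload string
			if err := rows.Scan(&id, &payload); err != nil {
				log.Fatal(err)
			}
			lastID = id
			n++
			// The write to the destination would go here.
		}
		rows.Close()
		if n < batchSize {
			break // table fully scanned
		}
	}
}
```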