Backfills
Artie supports both real-time change data capture (CDC) and full historical backfills — and we’ve designed them to work together seamlessly.
When you connect a new table, we don’t just start streaming live changes. We also backfill historical rows so your destination has a complete and accurate copy from day one.
And unlike most systems, our backfills won’t overload your production database.
How backfilling works with Artie
When a backfill starts, Artie spins up two separate processes:
Backfill process (historical data)
- Scans the full table and writes directly to your destination
- Skips Kafka entirely
- Runs in batches with parallelism (default: 5); see the sketch after this list
- Can use a Read Replica to reduce load on your primary DB
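To make the batching concrete, here is a minimal sketch of a chunked backfill in Go. This is an illustration, not Artie's implementation: it assumes a Postgres source and destination reachable through database/sql, a monotonically increasing integer primary key `id`, and hypothetical `src_orders` / `dst_orders` tables with placeholder connection strings.

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver; any database/sql driver works
	"golang.org/x/sync/errgroup"
)

const (
	batchSize   = 10_000 // rows per chunk (illustrative)
	parallelism = 5      // concurrent chunk copies, mirroring the default above
)

// copyChunk copies rows with id in (lo, hi] straight to the destination,
// bypassing Kafka. The upsert keeps re-runs of a chunk idempotent.
func copyChunk(ctx context.Context, src, dst *sql.DB, lo, hi int64) error {
	rows, err := src.QueryContext(ctx,
		`SELECT id, payload FROM src_orders WHERE id > $1 AND id <= $2 ORDER BY id`, lo, hi)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var payload []byte
		if err := rows.Scan(&id, &payload); err != nil {
			return err
		}
		if _, err := dst.ExecContext(ctx,
			`INSERT INTO dst_orders (id, payload) VALUES ($1, $2)
			 ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload`,
			id, payload); err != nil {
			return err
		}
	}
	return rows.Err()
}

// backfill walks the key space in fixed-size ranges, copying up to
// `parallelism` ranges at a time. Assumes ids start at 1.
func backfill(ctx context.Context, src, dst *sql.DB) error {
	var maxID int64
	if err := src.QueryRowContext(ctx,
		`SELECT COALESCE(MAX(id), 0) FROM src_orders`).Scan(&maxID); err != nil {
		return err
	}

	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(parallelism)
	for lo := int64(0); lo < maxID; lo += batchSize {
		lo := lo // capture for the goroutine
		g.Go(func() error { return copyChunk(ctx, src, dst, lo, lo+batchSize) })
	}
	return g.Wait()
}

func main() {
	src, err := sql.Open("postgres", "postgres://user:pass@source-replica/db") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	dst, err := sql.Open("postgres", "postgres://user:pass@destination/db") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	if err := backfill(context.Background(), src, dst); err != nil {
		log.Fatal(err)
	}
}
```

The upsert makes each chunk idempotent, which is what makes copying several chunks in parallel (and retrying failed ones) safe.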
CDC process (live changes)
- Starts reading from the database’s transaction logs immediately
- Queues inserts, updates, and deletes in Kafka while the backfill runs
- Waits until backfill is done before applying changes
Once the backfill finishes, we hand off to the CDC stream and apply the queued changes in order, effectively “catching up” and bringing the destination into a real-time state.
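The handoff is easiest to see as a toy model. In the sketch below (again an illustration, not Artie's code), a Go channel stands in for the Kafka topic, and the `Seq` field plays the role of the log position (for example, a Postgres LSN); the applier deliberately waits for the backfill before draining the queue.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is one change read from the transaction log, in commit order.
type Event struct {
	Seq int    // log position (e.g., an LSN)
	Op  string // "insert", "update", or "delete"
	Row string
}

func main() {
	// The buffer stands in for the Kafka topic that queues CDC
	// events while the backfill is still running.
	buffer := make(chan Event, 1024)
	var backfillDone sync.WaitGroup
	backfillDone.Add(1)

	// CDC reader: starts immediately, but only enqueues.
	// (A real stream is endless; this toy one sends three events and stops.)
	go func() {
		for seq := 1; seq <= 3; seq++ {
			buffer <- Event{Seq: seq, Op: "update", Row: fmt.Sprintf("row-%d", seq)}
		}
		close(buffer)
	}()

	// Backfill: writes the historical snapshot directly to the destination.
	go func() {
		defer backfillDone.Done()
		fmt.Println("backfill: copying historical rows...")
	}()

	// Handoff: apply queued events strictly in log order, but only
	// after the backfill has finished.
	backfillDone.Wait()
	for ev := range buffer {
		fmt.Printf("apply %s at seq %d: %s\n", ev.Op, ev.Seq, ev.Row)
	}
}
```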
This architecture is inspired by Netflix’s DBLog pattern, which lets backfills and CDC coexist without conflicts and without degrading database performance.
Benefits
Because Artie keeps these two streams separate, we avoid the usual tradeoffs and failure modes:
Data correctness issues
If CDC writes land while a backfill is running, older rows can overwrite newer ones, or duplicate records can sneak in. For example, a backfill batch that read a row before an update could be written after that update lands, silently reverting the row to its stale snapshot.
Database strain
Some databases (like Postgres) retain WAL files until the replication slot has consumed them. If the CDC process is blocked during a long backfill, WAL accumulates behind the slot, which can lead to slot bloat or disk overflow.
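If you run Postgres, you can watch for this yourself: `pg_replication_slots` reports how much WAL each slot is holding back. A small sketch, assuming Postgres 10+ and the lib/pq driver (the connection string is a placeholder):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

// Lists each replication slot and how much WAL Postgres is retaining
// for it. A number that keeps growing during a long backfill is the
// slot bloat described above.
func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@primary/db") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT slot_name,
		       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
		FROM pg_replication_slots`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var name, retained string
		if err := rows.Scan(&name, &retained); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s retains %s of WAL\n", name, retained)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

A retained WAL figure that keeps climbing while a backfill runs is exactly the failure mode described above; because Artie keeps the CDC consumer reading throughout the backfill, the slot keeps advancing instead.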
When do we do backfills?
- When you first launch a pipeline
- When you add new tables
- When you trigger an ad hoc backfill from the dashboard
How many tables backfill at once?
By default, Artie backfills 10 tables in parallel per pipeline.
This keeps the load on your source DB manageable, especially during the initial sync of a large schema. Tables are marked as:
- Queued to backfill: waiting their turn in the queue
- Backfilling: actively running
If you have a high table count (some customers backfill 500+ tables), we can increase or decrease this parallelism. Just contact us and we’ll tune it for your setup.
How are backfills ordered?
We use a simple FIFO (first-in, first-out) model. Tables are backfilled in the order they were added, up to the concurrency limit.
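A minimal sketch of that scheduling model in Go: a channel acts as the FIFO queue, and a fixed pool of workers enforces the concurrency limit. The table names are hypothetical and the sleep stands in for the actual copy; the real scheduler is more involved.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const tableParallelism = 10 // default number of tables backfilled at once

func main() {
	// Tables enter the queue in the order they were added (FIFO).
	tables := []string{"users", "orders", "invoices", "events" /* ... */}

	queue := make(chan string) // unbuffered: dequeued in submission order
	var wg sync.WaitGroup

	// A fixed pool of workers caps concurrent backfills; everything
	// else stays "Queued to backfill" until a worker frees up.
	for w := 0; w < tableParallelism; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for table := range queue {
				fmt.Println("backfilling:", table)
				time.Sleep(10 * time.Millisecond) // stand-in for the actual copy
			}
		}()
	}

	for _, t := range tables {
		queue <- t
	}
	close(queue)
	wg.Wait()
}
```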
What happens to CDC changes during a backfill?
The CDC stream is still running in the background while the backfill is in progress. Every change is captured in Kafka. Once the backfill is complete, Artie:
- Switches to the CDC stream
- Applies all queued changes in order
- Transitions the table to a fully streaming state
This prevents stale data from overwriting live updates and guarantees consistency.
Edge cases and tuning
Some things to keep in mind:
- Large tables (100M+ rows) may take time. We batch intelligently, but performance depends on schema, indexes, and source DB size.
- High write volume during backfill is fine — CDC logs will queue and apply in order.
- Parallelism is tunable — we can adjust table concurrency or batch sizes if needed.