How it works
Flushing is a critical part of Artie’s data pipeline that determines when and how data gets written to your destination.

1. Data buffering
- Artie’s reading process will read changes from your source database and publish them to Kafka
- Artie’s writing process will read messages from Kafka and write them to your destination
- Messages are temporarily stored in memory and deduplicated based on primary key(s) or unique index
- Multiple changes to the same record are merged to reduce write volume, as sketched below
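To make the buffering and deduplication step concrete, here is a minimal sketch in Python. The class and field names are illustrative only and are not Artie’s actual implementation; it assumes the latest change for a primary key supersedes earlier ones.

```python
# Minimal sketch of in-memory buffering with primary-key deduplication.
# Names are illustrative; this is not Artie's actual implementation.

class Buffer:
    def __init__(self):
        self.rows = {}        # primary key -> latest merged change
        self.total_bytes = 0  # approximate payload size after deduplication

    def add(self, primary_key, change, size_bytes):
        if primary_key in self.rows:
            # Multiple changes to the same record are merged,
            # so only the latest state is written downstream.
            old_size = self.rows[primary_key][1]
            self.total_bytes -= old_size
        self.rows[primary_key] = (change, size_bytes)
        self.total_bytes += size_bytes

    def message_count(self):
        # Count of deduplicated messages (unique primary keys).
        return len(self.rows)


buf = Buffer()
buf.add(42, {"op": "insert", "name": "alice"}, 120)
buf.add(42, {"op": "update", "name": "alicia"}, 130)  # merged with the insert
print(buf.message_count(), buf.total_bytes)  # 1 130
```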
2. Flush trigger evaluation
- Artie continuously monitors three flush conditions
- When any condition is met, a flush is triggered
- Reading from Kafka pauses during the flush operation
3. Data loading
- Buffered data is written to your destination in an optimized batch
- After completion, Artie will commit the offset and resume reading from Kafka
- The cycle repeats for continuous data flow; the full loop is sketched below
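The three steps above form a loop: read and buffer, pause to flush, commit the Kafka offset, then resume. The sketch below shows that control flow in simplified form; consumer, write_batch_to_destination, and should_flush are stand-ins assumed for illustration, not Artie’s actual APIs.

```python
# Simplified control flow of the buffer -> flush -> commit cycle.
# consumer, write_batch_to_destination, and should_flush are stand-ins
# for illustration only; they are not Artie's actual APIs.
import time


def run_pipeline(consumer, write_batch_to_destination, should_flush):
    buffer = {}                    # primary key -> latest change
    started_at = time.monotonic()  # used for the time-based flush rule

    while True:
        message = consumer.poll(timeout=1.0)
        if message is not None:
            # Deduplicate on primary key: the latest change wins.
            buffer[message["primary_key"]] = message

        if should_flush(buffer, started_at):
            # Reading from Kafka pauses while the flush runs.
            write_batch_to_destination(list(buffer.values()))
            consumer.commit()      # only commit after the write succeeds
            buffer.clear()
            started_at = time.monotonic()
```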
Conditions
Artie evaluates three conditions to determine when to flush data. Any one of these conditions will trigger a flush:

Time elapsed
Maximum time in seconds — Ensures data freshness even during low-volume periods
Message count
Number of deduplicated messages — Based on unique primary key(s) or a unique index
Byte size
Total bytes of deduplicated data — Actual payload size after deduplication
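Checking these three conditions amounts to a simple any-of test. The sketch below illustrates that logic with made-up threshold names; consult Artie’s configuration reference for the actual setting names.

```python
# Illustrative any-of check over the three flush conditions.
# The threshold names are made up for this sketch.
import time
from dataclasses import dataclass


@dataclass
class FlushRules:
    max_seconds: float    # time elapsed
    max_messages: int     # deduplicated message count
    max_bytes: int        # deduplicated payload size


def should_flush(rules, started_at, message_count, total_bytes):
    return (
        time.monotonic() - started_at >= rules.max_seconds
        or message_count >= rules.max_messages
        or total_bytes >= rules.max_bytes
    )


rules = FlushRules(max_seconds=60, max_messages=5_000, max_bytes=50 * 1024**2)
print(should_flush(rules, time.monotonic(), message_count=5_200, total_bytes=12 * 1024**2))  # True
```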
Setting optimal rules
The right flush configuration depends on your destination type, data volume, and latency requirements.

OLTP destinations
For transactional databases like PostgreSQL, MySQL, or SQL Server:

Recommended approach
Smaller, frequent flushes work well because:
- Row-based storage handles individual record operations efficiently
- Native UPSERT/MERGE operations minimize overhead

Example configuration:
- Messages: 1,000-5,000 records
- Bytes: 10-50 MB
- Time: 30-60 seconds
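As a rough sanity check on these values, you can estimate which rule would fire first for a given workload. The throughput and row-size figures below are assumptions for illustration only; deduplication will typically lower the effective counts.

```python
# Back-of-the-envelope check of which OLTP flush rule fires first.
# The 500 rows/sec throughput and 1 KB average row size are assumed
# purely for illustration; substitute your own numbers.
rows_per_second = 500
avg_row_bytes = 1_000

max_messages = 5_000
max_bytes = 50 * 1024**2
max_seconds = 60

seconds_to_message_limit = max_messages / rows_per_second              # 10 s
seconds_to_byte_limit = max_bytes / (rows_per_second * avg_row_bytes)  # ~105 s

first = min(seconds_to_message_limit, seconds_to_byte_limit, max_seconds)
print(f"First rule fires after ~{first:.0f}s")  # the message-count rule, at ~10s
```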
OLAP destinations
For analytical databases like Snowflake, Databricks, BigQuery, or Redshift:
Setting the flush rules too low can hinder throughput and cause latency spikes:
- Fixed overhead costs: Each flush has connection/metadata overhead that dominates processing time with small batches
- Inefficient resource usage: OLAP systems are designed for large parallel operations, not frequent micro-operations
- Storage and query degradation: Many small files hurt compression, increase metadata lookups, and trigger excessive compaction
- Recommendation: For OLAP destinations, set higher row/byte limits and rely on time-based triggers
Recommended approach
Larger, less frequent flushes are optimal because:
- Columnar storage benefits from batch processing
- Reduced metadata overhead and better compression
- More efficient query performance with fewer small files
Example configuration:
- Messages: 25,000-500,000 records
- Bytes: 50-500 MB
- Time: 3-15 minutes
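The same back-of-the-envelope check with OLAP-sized limits shows why the time rule usually ends up being the trigger, in line with the recommendation above. Again, the throughput and row-size figures are assumed for illustration.

```python
# With OLAP-sized limits, the time rule is usually what fires first.
# The 500 rows/sec and 1 KB/row figures are assumptions for illustration.
rows_per_second = 500
avg_row_bytes = 1_000

max_messages = 250_000
max_bytes = 250 * 1024**2
max_seconds = 5 * 60  # 5 minutes

seconds_to_message_limit = max_messages / rows_per_second              # 500 s
seconds_to_byte_limit = max_bytes / (rows_per_second * avg_row_bytes)  # ~524 s

first = min(seconds_to_message_limit, seconds_to_byte_limit, max_seconds)
print(f"First rule fires after ~{first:.0f}s")  # the time rule, at 300s
```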
Best practices
Start conservative
Begin with smaller flush values and increase based on observed performance and destination capabilities.
Validate through flush metrics
As you experiment and fine-tune the flush rules, you can see which rule triggered each flush as the reason shown in the “Flush Count” graph in the analytics portal.
Monitor and adjust
Track flush frequency, batch sizes, and end-to-end latency to optimize over time.
Consider your SLA
Your time threshold should align with your data freshness requirements and business SLAs.
Advanced
See flush reason in the analytics portal
