Artie is a fully managed real-time streaming platform that continuously replicates database changes into warehouses and lakes. We automate the entire data ingestion lifecycle - from capturing changes to merges, schema evolution, backfills, and observability - and scale to billions of change events per day.
This post is for data engineers and backend developers who run MySQL in production and want near-real-time analytics in Redshift.
TL;DR: MySQL's binary log captures every row-level change in your database. With CDC, you can stream those changes to Redshift in under a minute instead of waiting for nightly batch jobs. This post walks through how binlog CDC works, how to set it up, and the Redshift-specific patterns that make it performant.
The Problem: Your Analytics Are Always a Day Behind
Say you're running a meal-kit delivery company - let's call it FreshBowl. Orders land in MySQL. Your ops team needs to see order fulfillment status in Redshift so they can catch warehouse bottlenecks before they cascade into late deliveries.
With batch ETL, here's what happens: an order comes in at 3pm, the ETL job runs at midnight, and the ops team sees it at 6am the next day. By then, the bottleneck has been backing up for 15 hours.
This is the core problem with batch pipelines. Your analytics are always looking at yesterday. Revenue dashboards show last night's close, not this afternoon's spike. Fulfillment metrics lag behind actual warehouse throughput. Your dbt models are technically correct but practically stale.
What you actually want is for FreshBowl's orders table in MySQL to show up in Redshift within a minute of a row changing. That's what change data capture (CDC) gives you.
How Binlog CDC Actually Works
MySQL's binary log (binlog) is a sequential record of every data change on the server. It exists primarily for replication between MySQL instances, but it's also the foundation for CDC.
What is CDC? Change data capture is a pattern where you read a database's internal change log instead of querying tables directly. You get a stream of "row X was inserted," "row Y was updated," "row Z was deleted" - without polling the source or adding triggers to your application code. Wikipedia has a solid overview.
There are three binlog formats, and only one works for CDC:
ROW format records the actual before-and-after state of each row. If someone updates total from 42.00 to 47.50 on FreshBowl order #1234, the binlog captures both values. This is what CDC needs.
STATEMENT format logs the SQL statement itself (UPDATE orders SET total = 47.50 WHERE id = 1234). Sounds equivalent, but it breaks down with non-deterministic functions, triggers, and stored procedures. You can't reliably reconstruct row state from statements.
MIXED format switches between the two depending on the query. Too unpredictable for CDC.
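Before changing anything, it's worth checking what the server is doing today. These introspection queries run on any MySQL 5.7+ instance:

```sql
-- Must return ROW for CDC to work
SELECT @@global.binlog_format;

-- Must return 1 (ON); without the binlog there is nothing to read
SELECT @@global.log_bin;

-- List the current binlog files and their sizes
SHOW BINARY LOGS;
```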
So the first step is making sure your MySQL instance uses ROW format. If you're on RDS or Aurora, this means editing a parameter group - it's not a my.cnf change you can make directly.
GTIDs (Global Transaction IDs) are the other key piece. Without GTIDs, the CDC reader tracks position by binlog file name and byte offset - coordinates that are meaningless on a replica, so a failover forces a full re-sync. GTIDs give every transaction a cluster-wide unique identifier that survives failover. Enable them.
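A quick sanity check that GTIDs are actually on, and a look at the coordinates a GTID-aware reader resumes from:

```sql
-- Both should return ON before you point a CDC reader at the server
SELECT @@global.gtid_mode, @@global.enforce_gtid_consistency;

-- The set of transactions this server has executed; a GTID-aware
-- reader checkpoints its position in terms of this set
SELECT @@global.gtid_executed;
```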
Here's the full architecture, from binlog to Redshift:
And here's the MySQL setup:
-- In my.cnf or RDS parameter group
binlog_format = ROW
log_bin = ON
gtid_mode = ON
enforce_gtid_consistency = ON
-- Create a dedicated replication user
CREATE USER 'cdc_reader'@'%' IDENTIFIED BY 'strong_password_here';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdc_reader'@'%';

Once this is in place, a CDC reader connects as a replication client, reads the binlog stream, and forwards each change event downstream - in our case, to Redshift.
One thing that bites people early: binlog retention. MySQL doesn't keep binlogs forever. If your CDC reader goes offline for longer than the retention period, it loses its place and needs a full re-sync. On RDS, the default retention is often just a few hours. Set it to at least 24 hours (72 is safer):
CALL mysql.rds_set_configuration('binlog retention hours', 72);
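You can confirm the new value took effect with the companion procedure:

```sql
-- Returns current RDS-level settings, including binlog retention hours
CALL mysql.rds_show_configuration;
```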
Landing Data in Redshift
Getting changes out of MySQL is half the problem. The other half is applying them to Redshift efficiently.
The staging + MERGE pattern. Redshift doesn't handle row-level upserts the way a transactional database does. The standard approach is to land a batch of changes into a temporary staging table, then run a MERGE to apply inserts and updates to the target table in one atomic operation. (A Redshift MERGE takes a single WHEN MATCHED action, so deletes from the CDC stream are applied alongside it, typically in the same transaction.)
Here's what that looks like for FreshBowl's orders table:
-- 1. Changes land in a staging table
CREATE TEMP TABLE orders_staging (LIKE orders);
-- (CDC process inserts changed rows here)
-- 2. MERGE applies inserts and updates (delete events are applied separately)
MERGE INTO orders
USING orders_staging AS s
ON orders.id = s.id
WHEN MATCHED THEN
UPDATE SET user_id = s.user_id, total = s.total,
status = s.status, updated_at = s.updated_at,
metadata = s.metadata
WHEN NOT MATCHED THEN
INSERT (id, user_id, total, status, created_at, updated_at, metadata)
VALUES (s.id, s.user_id, s.total, s.status, s.created_at,
s.updated_at, s.metadata);
-- 3. Truncate staging after merge - no storage growth
TRUNCATE orders_staging;

This gives you effectively exactly-once delivery: if a batch retries, the MERGE is idempotent, so no duplicate rows land in the target table.
But MERGE does take a lock on the target table. If you're running analytical queries against the same table that's receiving merges, you can hit contention. Batching changes and merging on a short interval (say, every 30 seconds) keeps the lock windows small. For very high-volume tables, you'll want to think about merge frequency vs. query concurrency.
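One detail the MERGE above glosses over: Redshift expects each target row to match at most one staging row, and a busy CDC batch often carries several changes to the same order. A common fix is to collapse the staging table to the latest change per key before merging - a sketch, assuming updated_at orders the changes:

```sql
-- Keep only the most recent change per order id before running MERGE
-- (assumes updated_at strictly increases per id; use a change sequence
-- number from the binlog if it doesn't)
DELETE FROM orders_staging
USING (
    SELECT id, MAX(updated_at) AS max_updated_at
    FROM orders_staging
    GROUP BY id
) latest
WHERE orders_staging.id = latest.id
  AND orders_staging.updated_at < latest.max_updated_at;
```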
Sort keys and distribution keys matter a lot in Redshift, and they're worth getting right up front. For FreshBowl's orders table, sorting by created_at makes time-range queries fast - "show me orders from the last hour" only scans the relevant blocks instead of the whole table. Distributing by user_id means queries that filter or join on a user are local to a single node instead of shuffling data across the network.
A good rule of thumb: sort by your most common time filter, distribute by your most common join key. For small lookup tables (like a warehouses dimension with 50 rows), use ALL distribution so every node has a local copy and joins never require redistribution. Redshift's sort key docs go deeper on the trade-offs.
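Concretely, the FreshBowl tables from above might be declared like this - column names match the earlier examples, and the types are illustrative:

```sql
CREATE TABLE orders (
    id          BIGINT,
    user_id     BIGINT,
    total       DECIMAL(10,2),
    status      VARCHAR(32),
    created_at  TIMESTAMP,
    updated_at  TIMESTAMP,
    metadata    SUPER
)
DISTSTYLE KEY
DISTKEY (user_id)      -- joins and filters on a user stay node-local
SORTKEY (created_at);  -- time-range scans touch only recent blocks

-- Small dimension table: replicate to every node so joins never shuffle
CREATE TABLE warehouses (
    warehouse_id INT,
    name         VARCHAR(64)
)
DISTSTYLE ALL;
```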
JSON columns in MySQL map to Redshift's SUPER type. FreshBowl stores order metadata as JSON - item counts, special instructions, delivery preferences. In Redshift, you query it with path expressions:
-- FreshBowl: pending orders with special instructions, last hour
SELECT
warehouse_id,
COUNT(*) AS order_count,
  metadata.special_instructions::VARCHAR AS instructions
FROM orders
WHERE status = 'pending'
AND created_at > GETDATE() - INTERVAL '1 hour'
GROUP BY warehouse_id, instructions
ORDER BY order_count DESC;

One caveat: SUPER has nesting depth limits. If your JSON goes five or more levels deep, flatten it in a dbt model rather than querying the raw SUPER column directly.
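If you do query deeply nested metadata regularly, a dbt model that materializes the hot paths as typed columns keeps downstream queries simple. A sketch - the source name and JSON paths here are illustrative, not part of the pipeline above:

```sql
-- models/orders_flat.sql (dbt): extract frequently queried paths once
SELECT
    id,
    status,
    created_at,
    metadata.special_instructions::VARCHAR AS special_instructions,
    metadata.items_count::INT              AS items_count
FROM {{ source('freshbowl', 'orders') }}
```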
Type mappings are mostly intuitive - INT maps to INTEGER, VARCHAR stays VARCHAR, TIMESTAMP stays TIMESTAMP. The non-obvious ones worth knowing: MySQL's unsigned BIGINT can exceed the range of Redshift's signed BIGINT, TINYINT(1) is usually a boolean in disguise, ENUM and SET land as VARCHAR, and JSON becomes SUPER as described above.
How Artie Handles This
Everything above - reading the binlog, managing positioning with GTIDs, staging changes, running MERGE, handling type mappings - is what Artie automates end to end.
You don't operate Kafka or Debezium yourself. You don't write MERGE SQL. You don't build retry logic or monitor binlog position. Artie runs the full pipeline as a managed service: MySQL binlog reader, through Kafka, into Redshift via the staging + MERGE pattern described above.
A few things worth calling out that are genuinely hard to build yourself:
Schema evolution. When someone runs ALTER TABLE orders ADD COLUMN priority VARCHAR(20) on MySQL, Artie detects the change and adds the column to Redshift automatically. No pipeline restart, no manual DDL. Renamed or dropped columns are handled per your configuration.
Backfills. The first sync does a parallel backfill of existing data while simultaneously streaming new changes from the binlog. You don't have to choose between "catch up on history" and "start streaming now" - both happen at the same time.
Monitoring. Replication lag, throughput, and error rates are visible in the Artie dashboard. Set up alerts so you know the moment something falls behind.
Honest latency expectations. Typical end-to-end latency is under 60 seconds from MySQL commit to Redshift availability. But be realistic - very large transactions, Redshift Serverless cold starts after idle periods, or catch-up from an outage can add delay. And if your team only checks dashboards once a day, real-time CDC might be more infrastructure than you need. A batch tool syncing every hour works fine for daily reporting. Real-time matters when you're making decisions on data less than an hour old - like FreshBowl's ops team catching a warehouse bottleneck at 3pm instead of discovering it at 6am tomorrow.
Getting started takes three steps:
- Enable row-based binlog on MySQL and create a replication user
- Connect your MySQL source and Redshift destination in Artie
- Select tables and deploy - Artie validates connectivity before anything starts
Start replicating MySQL to Redshift with Artie
Further Reading
- MySQL Binary Log documentation - official reference for binlog configuration
- Redshift MERGE statement - syntax and best practices for upserts
- Redshift SUPER type - querying semi-structured JSON data
- Redshift sort key best practices - choosing the right sort and distribution keys
- Change data capture - the concept behind binlog CDC

