Artie moves your operational database data into warehouses and lakes - in real time, with minimal overhead. In this post, we’re diving into one of Postgres’s quirks - TOAST columns - and how you can prevent them from silently corrupting your data downstream.
The Most Annoying Thing About Postgres TOAST
You’ve got a Postgres database. You’re syncing it into a warehouse. And then - suddenly - your jsonb column turns into NULL. Sound familiar?
This post is about why that happens and how to fix it without bloating your Write Ahead Log (WAL) or rewriting your table configs.
What is TOAST in Postgres?
TOAST stands for "The Oversized-Attribute Storage Technique."
Postgres has an 8KB page size limit for rows. If a column's value is too large to fit, Postgres compresses and stores it out-of-line in a separate TOAST table. This keeps your main table lean and keeps writes fast. This is super common with:
- text
- jsonb
- bytea
- array
But here's the catch: if that TOASTed column doesn’t change, the actual value is omitted from the WAL entry. Which means if you’re doing Change Data Capture (CDC), your downstream system sees the column as NULL. Not “unchanged.” Not “preserved.” Just gone.
Why TOAST Columns Break CDC
In Postgres, CDC streams row-level changes - like inserts, updates, and deletes - from the WAL so downstream systems can stay in sync in real time.
The problem is, if the TOASTed column doesn’t change, Postgres doesn’t include it in the WAL. That can silently break CDC replication tools.
How Most CDC Tools Handle TOAST (Incorrectly)
Most CDC tools assume that if a column is missing in the WAL, it must be NULL. This is true for Snowflake’s native Postgres connector and AWS DMS. With Debezium + Kafka setups, Debezium replaces these values as `__debezium_unavailable_value`
and lands that as a string. This behavior breaks things fast in target downstream tables:
- JSON blobs mysteriously disappear
- Large text fields revert to NULL or the placeholder value
To avoid this, some tools force you to use REPLICA IDENTITY FULL. But that:
- Adds performance overhead
- Increases WAL volume dramatically
- Slows replication throughput
The Right Way to Handle TOAST Columns
To handle TOAST columns correctly in a CDC pipeline, you need logic that can tell the difference between a truly NULL value and one that was simply omitted from the WAL because it didn't change.
Here’s what that should look like:
- When a column is missing in a CDC event, check:
- Is the column TOASTable?
- Was it present in the previous state?
- If it hasn’t changed, re-use the previous value.
- If it has changed, sync the new value.
Here’s what that MERGE
logic looks like in Snowflake:
MERGE INTO target_table AS tgt
USING incoming_changes AS src
ON tgt.id = src.id
WHEN MATCHED THEN
UPDATE SET
tgt.column_a = CASE WHEN src.column_a NOT LIKE '%__unchanged_value_placeholder%'
THEN src.column_a ELSE tgt.column_a END,
tgt.column_b = CASE WHEN src.column_b NOT LIKE '%__unchanged_value_placeholder%'
THEN src.column_b ELSE tgt.column_b END;
This avoids false NULLs, preserves correctness, and doesn’t require switching to REPLICA IDENTITY FULL.
If you're using Artie, the good news is: we do all of this automatically and preserve your data’s integrity, behind the scenes - without requiring you to change your replica identity settings. No manual setup needed.