Why TOAST Columns Break Postgres CDC and How to Fix It

Jacqueline Cheong

Updated on

May 7, 2025

Data know-how

Artie moves your operational database data into warehouses and lakes - in real time, with minimal overhead. In this post, we’re diving into one of Postgres’s quirks - TOAST columns - and how you can prevent them from silently corrupting your data downstream.

‍

The Most Annoying Thing About Postgres TOAST

You’ve got a Postgres database. You’re syncing it into a warehouse. And then - suddenly - your jsonb column turns into NULL. Sound familiar?

This post is about why that happens and how to fix it without bloating your Write Ahead Log (WAL) or rewriting your table configs.

‍

What is TOAST in Postgres?

TOAST stands for "The Oversized-Attribute Storage Technique."

Postgres has an 8KB page size limit for rows. If a column's value is too large to fit, Postgres compresses and stores it out-of-line in a separate TOAST table. This keeps your main table lean and keeps writes fast. This is super common with:

text
jsonb
bytea
array

But here's the catch: if that TOASTed column doesn’t change, the actual value is omitted from the WAL entry. Which means if you’re doing Change Data Capture (CDC), your downstream system sees the column as NULL. Not “unchanged.” Not “preserved.” Just gone.

‍

Why TOAST Columns Break CDC

In Postgres, CDC streams row-level changes - like inserts, updates, and deletes - from the WAL so downstream systems can stay in sync in real time.

The problem is, if the TOASTed column doesn’t change, Postgres doesn’t include it in the WAL. That can silently break CDC replication tools.

‍

How Most CDC Tools Handle TOAST (Incorrectly)

Most CDC tools assume that if a column is missing in the WAL, it must be NULL. This is true for Snowflake’s native Postgres connector and AWS DMS. With Debezium + Kafka setups, Debezium replaces these values as `__debezium_unavailable_value` and lands that as a string. This behavior breaks things fast in target downstream tables:

JSON blobs mysteriously disappear
Large text fields revert to NULL or the placeholder value

‍

To avoid this, some tools force you to use REPLICA IDENTITY FULL. But that:

Adds performance overhead
Increases WAL volume dramatically
Slows replication throughput

‍

The Right Way to Handle TOAST Columns

To handle TOAST columns correctly in a CDC pipeline, you need logic that can tell the difference between a truly NULL value and one that was simply omitted from the WAL because it didn't change.

Here’s what that should look like:

When a column is missing in a CDC event, check:
- Is the column TOASTable?
- Was it present in the previous state?
If it hasn’t changed, re-use the previous value.
If it has changed, sync the new value.

‍

Here’s what that MERGE logic looks like in Snowflake:

MERGE INTO target_table AS tgt
USING incoming_changes AS src
ON tgt.id = src.id
WHEN MATCHED THEN
  UPDATE SET
    tgt.column_a = CASE WHEN src.column_a NOT LIKE '%__unchanged_value_placeholder%'
      THEN src.column_a ELSE tgt.column_a END,
    tgt.column_b = CASE WHEN src.column_b NOT LIKE '%__unchanged_value_placeholder%'
      THEN src.column_b ELSE tgt.column_b END;

This avoids false NULLs, preserves correctness, and doesn’t require switching to REPLICA IDENTITY FULL.

If you're using Artie, the good news is: we do all of this automatically and preserve your data’s integrity, behind the scenes - without requiring you to change your replica identity settings. No manual setup needed.

AUTHOR

Jacqueline Cheong