How Alloy Powers Real-Time Fraud Detection with Reliable Postgres Replication

Alloy is an AI-powered identity and fraud prevention platform that more than 800 financial institutions and fintechs trust to stop fraud and automate risk management across the customer lifecycle. Its data pipelines replicate over 20 billion rows per month from production Postgres databases into Redshift and Snowflake, powering recurring compliance evaluations, real-time machine learning models, internal analytics, and billing.

"With Artie, I don't have to look at our replication. I just trust that it will be working."

– Josh Kostal, Senior Software Engineer II at Alloy


Company Website	alloy.com
Switched from	AWS DMS, Fivetran
Switched	February 2025
Use case	Postgres to Redshift + Snowflake ingestion for fraud detection ML and compliance products

Initial ingestion setup

Alloy originally relied on AWS DMS for replicating data from Postgres into Redshift, and Fivetran for replicating into Snowflake.

The Redshift path powered two critical product lines. The first was portfolio evaluations, a decisioning product that runs recurring compliance checks, validating customer entities against rules on a daily or monthly cadence using aggregations computed in Redshift. The second was a set of machine learning models for fraud detection, where features generated from Redshift data get combined with real-time signals from production Postgres to power both online and offline models. An expected replication latency of roughly 30 minutes was baked into the ML architecture.

On the Snowflake side, the data supported BI, internal QA, and billing. This path was less latency-sensitive, but accuracy still mattered. Replication was managed by a small number of engineers as part of a distributed function, with much of the day-to-day pipeline work falling to one or two people.

What was breaking

DMS failures were unpredictable and often unexplained. Replication would stop for a table, sometimes silently, and the root cause was frequently unclear. The team had Datadog monitors around row counts and replication slot sizes, but these didn't catch everything. Digging through CloudWatch logs was difficult and time-consuming.
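As an illustration of the replication slot monitoring mentioned above, here is a minimal sketch of how retained WAL per slot could be measured and compared against an alert threshold. It is not Alloy's actual monitor: the slot names, LSN values, and threshold are hypothetical, and the SQL is shown only as the kind of query such a check might be built on. The Python helpers reproduce the byte arithmetic so the logic can be checked without a database.

```python
# Sketch only: slot names, LSN values, and the threshold are hypothetical.

# A query one might run against Postgres to see how much WAL each slot retains.
SLOT_LAG_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""


def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL a slot is holding back (the same idea as pg_wal_lsn_diff)."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


def slots_over_threshold(slots: dict[str, str], current_lsn: str, max_bytes: int) -> list[str]:
    """Return the names of slots whose retained WAL exceeds max_bytes."""
    return sorted(
        name
        for name, restart_lsn in slots.items()
        if retained_bytes(current_lsn, restart_lsn) > max_bytes
    )
```

A growing retained-bytes figure usually means a consumer has stopped advancing its slot. A check like this only sees the slot, though, so it can miss a task that keeps consuming WAL while silently skipping one table, which may be why monitors of this kind "didn't catch everything."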

When failures hit large tables with millions of rows, recovery meant a full backfill from scratch, a process that could take one to two days. During that window, downstream products were operating on stale or incomplete data. On multiple occasions, the team ended up on hours-long calls with AWS support, trying to get the right DMS specialist and Postgres SME on the same line to troubleshoot configuration.

DMS also didn't recover gracefully from routine Postgres operations. Upgrades, hiccups, or configuration changes upstream could cascade into replication failures that required manual intervention to resolve.

Data type handling added another layer of friction. DMS was stringifying timestamps and dropping timezone information, requiring downstream transformations to compensate for what should have been accurate replication.
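To show why a dropped timezone is more than a cosmetic problem, the sketch below mimics a lossy pipeline in general terms. It does not reproduce DMS's exact behavior, which the account above does not detail, and the function names and sample timestamps are made up for illustration. Two different instants become the same string once the offset is gone, and any downstream fix-up has to guess the zone.

```python
from datetime import datetime, timedelta, timezone

# Two different instants: 09:00 in UTC and 09:00 in a UTC-5 zone.
utc_event = datetime(2025, 2, 1, 9, 0, tzinfo=timezone.utc)
est_event = datetime(2025, 2, 1, 9, 0, tzinfo=timezone(timedelta(hours=-5)))


def stringify_without_tz(ts: datetime) -> str:
    """Mimic a lossy pipeline: keep the wall-clock time, drop the offset."""
    return ts.replace(tzinfo=None).isoformat(sep=" ")


def restore_assuming_utc(text: str) -> datetime:
    """Downstream fix-up that can only guess the zone (here: assume UTC)."""
    return datetime.fromisoformat(text).replace(tzinfo=timezone.utc)


# Both events collapse to the same text, so the original instants are unrecoverable.
same_text = stringify_without_tz(utc_event) == stringify_without_tz(est_event)

# Guessing UTC silently shifts the second event by its original offset.
drift = est_event - restore_assuming_utc(stringify_without_tz(est_event))
```

Here `same_text` is true and `drift` is five hours, which is the kind of error a downstream transformation would have to detect and correct on every affected column.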

On the Fivetran side, the problems were different but cumulative. Cost was a concern, and the tool didn't integrate well with infrastructure-as-code workflows since it was managed through a UI rather than Terraform. The team also had reliability concerns based on prior experiences, which contributed to the decision to move off the platform.

Portfolio evaluations changed the stakes

The catalyst was the launch of portfolio evaluations as a live, customer-facing product. This decisioning feature runs recurring compliance checks against customer entities using aggregations computed in Redshift. While the product had been in development alongside existing DMS pipelines, taking it to production with customers paying for it changed the calculus entirely.

Multi-day replication failures weren't just an operational annoyance anymore. They meant a product customers were paying for couldn't function. And the ML models that powered real-time fraud detection were degraded whenever replication fell behind, feeding an inaccurate view of the world to models that needed to make split-second decisions about suspicious transactions.

"If you have a product that's live, you can't have multi-day outages. You just can't accept that."

– Josh Kostal, Senior Software Engineer II at Alloy

Alternatives considered

The team never seriously considered building an internal replication solution. The engineering effort (months of hiring, likely a year or more of development, and a permanent maintenance burden) was impractical for a team without a dedicated data platform function.

"Unless you work at a large company that's got all this expertise and time, it's kind of like a fool's errand. It's probably going to take longer than expected, harder than you think. It just seems like the perfect thing to offload to a vendor."

– Josh Kostal, Senior Software Engineer II at Alloy

On the vendor side, the team evaluated several options. Two made it to the proof-of-concept stage alongside Artie: Confluent and Redpanda. Both positioned themselves as streaming solutions with plug-and-play connectors.

In practice, setup was significantly more involved than expected. Both required help from Alloy's infrastructure team, multiple calls spread over weeks, and were difficult to test against production-representative data. The gap between what the connectors promised and the actual setup complexity was substantial.

How the decision was made

Artie's proof of concept stood apart. Setup was straightforward, configured through a UI rather than requiring weeks of infrastructure engineering. Testing was smooth, and results were verifiable.

The bigger concern was Artie's stage as a company. At the time, Artie was an early-stage startup. The team worried about longevity, acquisition risk, and whether their scale (billions of rows per month) would exceed Artie's production capabilities over time. A bad outcome would mean going back to DMS and starting the evaluation over, likely under different leadership.

What tipped the decision was a combination of technical depth and responsiveness. The Artie team demonstrated clear expertise in replication failure modes and recovery paths. Customer reference calls confirmed reliability and a pattern of minimal production incidents at comparable scale.

"You were very responsive right off the bat. Technically very strong. We felt comfortable knowing where there are pitfalls and how quickly you can move to resolve issues."

– Yan Karklin, Staff Data Scientist at Alloy

The team also recognized that the risk of staying on DMS was at least as high as the risk of switching. With DMS already causing multi-day outages and portfolio evaluations in production, inaction wasn't a safe default.

	Before: DMS / Fivetran	After: Artie
Failure behavior	Replication stalls randomly, root cause unclear	Replication runs continuously, failures are diagnosable
Recovery	Full backfill from scratch, 1-2 day outages	Targeted recovery, no full reloads
Support	Hours-long calls with AWS to get the right specialist	Issues resolved quickly with Artie team
Data accuracy	Timestamps stringified, timezones dropped	Data types replicated accurately
Postgres operations	Failovers required manual DMS restart	Failovers handled automatically
Infrastructure fit	Fivetran managed via UI, not Terraform	Straightforward setup and management

"Our entire organization now knows that when Postgres clusters roll over or there's a failover, it's not going to impact anything. Before, I don't think we were confident DMS would handle that. We would have had to pause the tasks and restart them."

– Josh Kostal, Senior Software Engineer II at Alloy

What changed after Artie

Artie was deployed in phases. The Redshift migration began shortly after signing, with the first tables live within the first few weeks. The team took a careful, zero-downtime approach: renaming DMS destination tables, spinning up duplicate DMS pipelines as a safety net, creating views that could be pointed at either source, and slowly cutting over from smallest to largest tables.
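The view-based cutover described above can be sketched roughly as follows. This is an assumption-laden illustration, not Alloy's tooling: the schema names (`analytics`, `artie_replica`, `dms_replica`), table names, and row counts are hypothetical. The point is that consumers query a stable view, so switching sources or rolling back is a single statement per table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    row_count: int


def cutover_order(tables: list[Table]) -> list[Table]:
    """Smallest tables first, so early cutovers carry the least risk."""
    return sorted(tables, key=lambda t: t.row_count)


def view_sql(table: str, source_schema: str, view_schema: str = "analytics") -> str:
    """Build the statement that points a stable view at one replicated copy."""
    return (
        f"CREATE OR REPLACE VIEW {view_schema}.{table} AS "
        f"SELECT * FROM {source_schema}.{table};"
    )


# Hypothetical tables and schemas, for illustration only.
tables = [
    Table("transactions", 900_000_000),
    Table("entities", 40_000_000),
    Table("rules", 12_000),
]

# Cut over one table at a time by repointing its view at the new copy.
plan = [view_sql(t.name, "artie_replica") for t in cutover_order(tables)]

# Rolling a table back is the same statement aimed at the old copy.
rollback = view_sql("rules", "dms_replica")
```

Because the old pipeline keeps writing to its own tables during the migration, either copy stays a valid target for the view. This assumes both copies expose the same columns and types, which replacing a view in place generally requires.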

One particularly large table presented a challenge. Its volume caused replication latency that was affecting machine learning products downstream. The Artie team built a partitioned replication feature to handle it, extending that final migration over the following months. The rest of the tables were fully cut over soon after.

The Snowflake migration from Fivetran took a couple of months and was simpler, thanks to existing view layers that made rerouting from Fivetran sources to Artie sources straightforward.

The most significant change has been operational trust. The team spends far less time monitoring replication pipelines day-to-day. Postgres failovers and cluster maintenance, operations that would have required pausing and restarting DMS tasks, now pass through without impact.

Table maintenance that would have been risky under DMS, like migrating integer primary keys that were running out of space, now happens without disrupting replication.

The team is exploring expanding Artie to additional data sources and considering how reliable streaming replication can support new architectural patterns that weren't feasible with batch-oriented or fragile pipelines.

About Artie: Artie is a real-time data replication solution for databases and data warehouses. Artie leverages change data capture (CDC) and stream processing to perform data syncs in a more efficient way, which enables sub-minute latency and helps optimize compute costs. With Artie, any company can set up streaming pipelines in minutes without coding.

About Alloy: Alloy is an AI-powered identity and fraud prevention platform that more than 800 financial institutions and fintechs trust to stop fraud and automate risk management across the customer lifecycle.