How ClickUp Built an AI-Ready Data Platform Across Micro-sharded Postgres

ClickUp is the everything app for work, where software, AI, and humans converge. With 20M+ users worldwide and unlimited AI Agents, ClickUp brings every team, project, and workflow into one AI-powered workspace. ClickUp's data platform manages the data replication that powers analytics across the company: board-level reporting, top-line metrics, and intraday operational decisions such as marketing spend and forecasting. As backend engineering evolved the production database from a single Postgres instance to sharded Postgres, and later to a micro-sharded environment, the complexity and risk of data ingestion increased. Michael Revelo, Director of Data Platform at ClickUp, and his data team were responsible for ensuring that increasingly fragmented operational data could still be reliably replicated into Snowflake, unified into analytics tables, and trusted by the business.
TL;DR

| | |
| --- | --- |
| Company Website | clickup.com |
| Switched from | AWS DMS |
| Switched | October 2025 |
| Use case | Postgres → Snowflake ingestion across micro-sharded databases |
Initial ingestion setup
ClickUp initially relied on Amazon DMS for logical replication from Postgres into Snowflake.
In a relatively simple sharded environment, this setup was workable. When DMS was healthy, it required little day-to-day attention. The issue was not that failures were frequent; it was that they were irregular, hard to diagnose, and expensive to recover from.
"When DMS was working, it worked. It was just when it wasn't working… and most of the time it wasn't clear why."
— Michael Revelo, Director of Data Platform at ClickUp
What was breaking
Over time, the team encountered a recurring set of failure modes. Amazon DMS would occasionally fall behind or stop replicating entirely. When this happened, it was often unclear whether the issue was related to replication slots, task state, replication lag, or source database pressure. There wasn't always a clear and reliable signal that pointed to a specific cause.
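One of the few concrete signals available in this situation is the state of the logical replication slots on the source database. The sketch below is illustrative, not ClickUp's actual tooling: the query uses the standard `pg_replication_slots` catalog view, while the slot names, lag threshold, and helper function are assumptions for the example.

```python
# Sketch: flag Postgres logical replication slots that look stalled.
# The query columns come from the standard pg_replication_slots view;
# the 1 GiB lag threshold and the slot names are illustrative assumptions.

LAG_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS bytes_behind
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""

def suspect_slots(rows, max_bytes_behind=1 << 30):
    """Return names of slots that are inactive or lagging past the threshold.

    `rows` is a list of dicts shaped like the result of LAG_QUERY above.
    """
    flagged = []
    for row in rows:
        if not row["active"] or row["bytes_behind"] > max_bytes_behind:
            flagged.append(row["slot_name"])
    return flagged

# Fabricated rows: one healthy slot, one inactive, one far behind.
rows = [
    {"slot_name": "dms_shard_01", "active": True,  "bytes_behind": 10_000},
    {"slot_name": "dms_shard_02", "active": False, "bytes_behind": 0},
    {"slot_name": "dms_shard_03", "active": True,  "bytes_behind": 5 << 30},
]
print(suspect_slots(rows))  # → ['dms_shard_02', 'dms_shard_03']
```

Even a check like this only narrows the search space: an inactive or lagging slot says that replication has stalled, not why, which is exactly the diagnostic gap the team kept running into.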
When replication stalled, recovery followed a blunt sequence: pause the DMS task, drop and recreate replication slots, and trigger a full reload. For large tables, this meant hours of backfill before replication could resume.
These reloads placed sustained throughput pressure on source Postgres and required careful coordination to avoid customer impact. Database ops often needed to be involved to monitor lag and production health.
In some cases, restoring progress meant excluding large tables from reloads to unblock the rest of the pipeline. This kept ingestion moving, but at the cost of incomplete downstream analytics until those tables could be reloaded later.
Microsharding changed the risk profile
As ClickUp prepared to move from sharding to microsharding, ingestion complexity would compound.
Microsharding improves application performance by splitting customer data into smaller partitions. From the analytics side, tables that were previously ingested as a single unit would now arrive as multiple fragments that needed to be reassembled into a single logical table downstream.
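The reassembly step can be pictured as a union of per-shard fragments into one logical table, tagged with provenance. This is a minimal in-memory sketch; the shard names and the `_shard_id` column are assumptions, and in a warehouse this would typically be expressed as a `UNION ALL` view rather than application code.

```python
# Sketch: reassemble micro-shard fragments of one logical table.
# Shard names and the provenance column are illustrative assumptions.

def unify_shards(shard_tables):
    """Merge per-shard row lists into one logical table,
    tagging each row with the shard it came from."""
    unified = []
    for shard_id, shard_rows in shard_tables.items():
        for row in shard_rows:
            unified.append({**row, "_shard_id": shard_id})
    return unified

shards = {
    "tasks_shard_01": [{"task_id": 1, "title": "Plan launch"}],
    "tasks_shard_02": [{"task_id": 7, "title": "Ship report"},
                       {"task_id": 9, "title": "Review spend"}],
}
unified = unify_shards(shards)
print(len(unified))  # → 3
```

The operational point is that every additional shard adds another fragment to this union, so a failure in any one replication task now leaves the unified table incomplete rather than cleanly missing.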
This increased the number of replication tasks and the likelihood of overlapping failures. Issues were less likely to affect a single table and more likely to impact related datasets together.
Recovery paths that already relied on full reloads became harder to manage. Reloads took longer, placed additional load on source Postgres, and required more coordination to run safely. Restoring a consistent analytics state took more time and effort.
At that point, the concern was predictability. Ingestion needed to behave consistently as microshard counts increased, without expanding recovery time or operational involvement.
Alternatives considered
Before evaluating vendors, the team tried to determine whether a change was actually necessary.
They stress-tested Amazon DMS in its existing configuration as well as DMS Serverless. The goal was to see whether the operational issues they were experiencing would improve enough to support microsharding. While these options reduced some day-to-day operational management, they did not change the underlying recovery behavior. Full reloads were still the primary recovery mechanism, and failures were still difficult to reason about.
In parallel, the team considered building internally, most likely some flavor of a Kafka- or Kinesis-based streaming architecture. This approach offered control, but it also meant committing engineering resources to designing, operating, and maintaining ingestion infrastructure long term.
"Once you really think about what it entails, it's at minimum a years-long project. And even once you build it, you still have to maintain it."
— Michael Revelo, Director of Data Platform at ClickUp
Streaming itself was not the immediate requirement. The team needed ingestion that could handle a microsharded database architecture reliably, without introducing another large system to own. Given the timeline of the migration, an internal build would have delayed risk reduction rather than accelerating it.
Initial skepticism on outsourcing to vendors
When the team began evaluating third-party vendors, there was some skepticism and uncertainty. Ingestion is mission-critical infrastructure. Adopting a vendor meant relying on an external system during failures and, in practice, sharing on-call responsibility. The team was concerned about long-term cost, support quality after procurement, and whether a smaller vendor could handle complex failure scenarios.
As a result, the evaluation covered feature availability but indexed heavily on operational confidence: whether support would be available during incidents, whether failure scenarios were well understood, and whether costs would remain predictable over time.
The team evaluated several vendors through technical questionnaires and proofs of concept, eventually narrowing the field to two candidates, including Artie.
How a decision was made
The decision came down to which option both reduced operational risk and scaled through a microsharding migration.
Continuing with the existing DMS-based approach meant accepting increasingly expensive recovery work as microshard counts increased. Building internally meant committing to operating a new ingestion system long term. Third-party tools introduced vendor risk, but also offered a more sustainable path forward and a way to reduce recovery complexity.
During the evaluation, Artie stood out because its behavior under failure was easier to reason about. It supported both sharded and microsharded architectures, held up during microsharding tests, and reduced the reliance on full reloads as the primary recovery mechanism.
Equally important, detailed questions about edge cases and recovery scenarios were answered directly. The team felt they understood how the system would behave when things went wrong and what operating it would involve.
The deciding factor was not a single feature or benchmark, but the sharpness of the Artie team. The ClickUp team felt confident Artie could operate through failures with fewer unknowns. The Artie team showed strong expertise in sharded and microsharded architectures and was consistently quick and thorough in responding to technical questions.
What changed after Artie
Artie was deployed in the existing sharded environment and carried forward into microsharding. Fragmented source data could be reliably unified into analytics tables in Snowflake across both setups without rearchitecting pipelines at each stage.
Ingestion required less active management. Pipelines ran without the same level of monitoring, coordination, and manual intervention that had previously been necessary to keep data in sync. As a result, the team had more capacity to support analytics, experimentation, and new downstream use cases.
Streaming was not the original goal of the migration, but moving away from reload-heavy ingestion patterns made real-time workflows possible while keeping Snowflake as the operational center of gravity.
With ingestion stabilized, downstream teams were able to act on fresher, more complete data. This enabled new operational initiatives, including personalized outbound motions that contributed to revenue generation, while improving business-as-usual reporting and establishing an AI-ready Snowflake-native platform.
About Artie: Artie is a real-time data replication solution for databases and data warehouses. Artie leverages change data capture (CDC) and stream processing to perform data syncs in a more efficient way, which enables sub-minute latency and helps optimize compute costs. With Artie, any company can set up streaming pipelines in minutes without coding.
About ClickUp: ClickUp is the everything app for work, where software, AI, and humans converge. With 20M+ users worldwide and unlimited AI Agents, ClickUp brings every team, project, and workflow into one AI-powered workspace.
