
How Vector Solutions Built Reliable Data Pipelines to Iceberg at Scale


Vector Solutions builds learning management systems and operational applications for online training, scheduling, and incident tracking. The company serves over 24,000 organizations and 31 million users worldwide, and has grown through acquisition to operate 15 distinct applications, each with its own tech stack, database type, and data model. John Schwegler, VP of Data and AI Architecture, was brought in to centralize data from this heterogeneous environment into a single data lake. As data demands grew, the ingestion infrastructure began to break down, impacting both internal analytics and customer-facing reporting.
Initial ingestion setup
Vector Solutions initially relied on AWS DMS to replicate data from application databases into a centralized data lake. As an AWS shop, DMS was the natural starting point.
The challenge was the heterogeneity of Vector's environment. Every acquired application brought its own database — MySQL, SQL Server, Oracle, Azure SQL — each with its own schema and data model. DMS had to handle all of them, transporting hundreds of millions of rows per month into the data lake.
In a stable environment, this setup was workable.
"It initially worked, but we found a lot of issues with DMS - in terms of performance, reliability, silent failures. There's a number of things that made it hairy to manage."
— John Schwegler, VP Data & AI Architecture at Vector Solutions
What was breaking
DMS had multiple failure modes. The most dangerous was tied to seasonal volume spikes: Vector's higher education systems see 75% of their activity concentrated in a three- to four-month window. DMS would gradually fall behind as volume increased, then break entirely - with no warning.
The second failure mode was silent. On MySQL sources, DMS would stop reading binary logs and simply stop bringing in new data. There was no alert, no error - the data just stopped arriving. When failures hit, recovery was expensive. DMS required full restarts and complete data reprocessing from scratch. For Vector's largest data sources, this meant days of lost reporting - during the exact period when customers were most intensely looking at their data.
The operational cost extended beyond the data team. AWS support calls sometimes lasted weeks, with engineers tuning obscure parameters to get things running again. Five or six different engineers could get pulled into a single incident. And the failures weren't isolated to internal reporting - Vector's pipeline also powered customer-facing survey results and learner analytics. When the pipeline broke, customers saw stale data.
The cost of unreliability
The failures were predictable enough that Vector's cloud manager had to build them into the budget. The team maintained a standing cost allocation for DMS reprocessing - a line item dedicated to expected failures.
Beyond direct costs, the maintenance overhead was delaying the team's roadmap and putting new projects at risk.
At the same time, Vector was being asked to take on increasingly complex projects - including a new reporting initiative against an Azure SQL environment with 32 separate customer databases, each with multiple tenants. Merging those databases reliably with DMS would have been a complex undertaking.
Why Iceberg
Vector's data architecture had already been evolving independently of the CDC problem. The team originally followed a standard flow: export data from source databases, run ETL in the data lake, load into Redshift, and report from Redshift. The problem with that model is that whenever compute is coupled to storage, the warehouse becomes a bottleneck. Redshift was not cheap to run, and as more processes needed to query the same data, the team faced a choice: spend significantly more money or accept degraded performance.
When Apache Iceberg started gaining traction, it offered something no prior format could: atomic commits and a well-defined current state for tables in the data lake. Before Iceberg, Vector's data lake consisted of Glue Catalog tables backed by loose collections of Parquet files - with no reliable way to know what the current state was, whether changes had completed properly, or whether a producer had added or removed columns.
"It's basically a big container of a lot of Parquet files, and it's impossible to know what the state is, who changed it, whether the changes were completed properly or not, or even keep track of changes to the schema."
— John Schwegler, VP Data & AI Architecture at Vector Solutions
Schwegler evaluated all three open table formats - Iceberg, Delta Lake, and Apache Hudi - in detail. Delta Lake was too tied to its original creator. Hudi required pulling in more of its software ecosystem to use effectively. Iceberg stood out as a standalone data standard that could be consumed by many different query engines.
"Iceberg was not particularly tied like that. It was pretty much a standalone data standard that could be used by a lot of different consumers."
— John Schwegler, VP Data & AI Architecture at Vector Solutions
Vector became one of the early adopters of AWS S3 Tables for Iceberg, which added automatic maintenance on top of the format. In production, one challenge surfaced: on high-volume source databases, the number of Iceberg snapshots grew so rapidly that tables became unusable until the team tuned S3 Tables to retain only a limited number of snapshots. But this was a configuration issue with Iceberg's metadata management - not a problem with the CDC pipeline itself.
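Snapshot retention of this kind is usually expressed as two limits: a minimum number of snapshots to keep and a maximum snapshot age, where the minimum-to-keep rule wins if the two conflict. The article doesn't detail Vector's exact settings, and the real maintenance job is run by S3 Tables itself, but the pruning semantics can be sketched as a small function (names and defaults here are illustrative):

```python
from datetime import datetime, timedelta

def snapshots_to_expire(snapshot_times, now, min_to_keep=1,
                        max_age=timedelta(hours=120)):
    """Return snapshot timestamps eligible for expiration.

    Mirrors Iceberg-style expire-snapshots semantics: a snapshot is
    removed only if it is older than `max_age` AND not among the
    `min_to_keep` most recent snapshots.
    """
    ordered = sorted(snapshot_times, reverse=True)  # newest first
    protected = set(ordered[:min_to_keep])          # always retained
    cutoff = now - max_age
    return [t for t in ordered if t < cutoff and t not in protected]
```

Note the interaction of the two knobs: raising `min_to_keep` protects old snapshots that the age rule alone would expire, which is why a high-volume CDC source - producing a new snapshot on every commit - needs both limits tuned, not just one.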
The move to Iceberg meant Vector needed a CDC solution that could deliver directly to Iceberg tables in S3.
Alternatives considered
Schwegler periodically reviewed the CDC vendor landscape. Building in-house was never seriously considered.
"Code is not an asset. It's a liability — it's something you have to maintain and update as other systems change. CDC is a very complicated ecosystem. That's not something I want us to build."
— John Schwegler, VP Data & AI Architecture at Vector Solutions
Vendors like Fivetran were evaluated and ruled out. Their pricing was either too high or too opaque, making it impossible to predict costs as data volumes grew; Schwegler wanted a cost model that scaled linearly or sublinearly with volume. Using open-source tools like Debezium directly would still have required significant build effort on top.
AWS itself was not evolving DMS to meet Vector's needs. In conversations with AWS enterprise support engineers, it was clear that DMS was considered a stable, mature product — not one that would see major capability improvements.
Initial skepticism and gradual adoption
When Schwegler first evaluated Artie, the architecture and transparent pricing stood out. But Vector's CTO had concerns: Artie was a small, young company. The concern was straightforward: what if the company doesn't exist in a year? What if they get acquired? What would the fallback be?
Rather than committing to a big-bang migration, Schwegler proposed a gradual adoption strategy. He ran an initial trial against Vector's highest-volume data sources, which went smoothly with no glitches. Then, when budget was approved the following year, Artie was deployed on a new Oracle-based project — one with flexible timelines and a clear fallback to DMS if needed.
"I didn't go to my CTO and say we should replace DMS for our biggest services. I said, let's go with relatively small projects where we can drop it out if we have to, and we'll have time to evaluate whether it's going to work."
— John Schwegler, VP Data & AI Architecture at Vector Solutions
How confidence grew
Over time, two things shifted the CTO's confidence. First, reliability: Artie's pipelines worked consistently, with clear observability. The team received automatic notifications when sources went unavailable and when they resumed. The UI made pipeline status immediately clear.
Second, support responsiveness. When issues did arise, fixes were often shipped the same day. That responsiveness built trust over time.
Then Vector was asked to take on the complex multi-tenant Azure SQL reporting project - 32 customer databases that needed to be unified into a single reporting source. In a conversation with Artie's CTO Robin, Schwegler learned that database unification was already on Artie's roadmap - and was delivered shortly after. Features like S3 Tables Iceberg as a destination and multi-source database merging saved months of development work that Vector would otherwise have had to build and maintain.
"You're giving us the finished product. We've got a source, we've got a destination, and you deal with all the details in between."
— John Schwegler, VP Data & AI Architecture at Vector Solutions
What changed after Artie
With Artie handling CDC, the data team spends less time on ingestion maintenance. The team is also building a new unified reporting platform, consolidating reporting from over 15 different applications - each with its own aging UI running queries against transactional databases - into a single, new reporting framework powered by the data lake.
For one particular project, 33 Azure SQL databases are being merged via Artie into a single Iceberg database. Reports that previously took one to two minutes for customers to generate now return in five seconds or less, with data recency within an hour. The new reporting framework went into production in March 2026, with early customer access planned for April. Other teams across Vector are already requesting to adopt the same model for their applications.
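A common pattern for this kind of unification is to union rows from every source database into one table while tagging each row with the database it came from, so tenants remain distinguishable after the merge. The sketch below illustrates that semantics only; the column name `_source_db` is an assumption for illustration, not Artie's actual schema:

```python
def merge_tenant_rows(sources):
    """Union rows from many source databases into one logical table.

    `sources` maps a source-database name to its rows (dicts).
    Each output row carries a `_source_db` column identifying its
    origin, so identical primary keys from different tenants do not
    collide in the merged table.
    """
    merged = []
    for db_name, rows in sources.items():
        for row in rows:
            merged.append({**row, "_source_db": db_name})
    return merged
```

In practice the source tag would also belong in the merged table's key (or its Iceberg partition spec), since the same primary-key value can legitimately appear in every tenant database.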
"When you get something reliable as a service that hides all the complications, that's less context you have to worry about. It means it's easier to do new projects because you can just drop this in."
— John Schwegler, VP Data & AI Architecture at Vector Solutions
About Artie: Artie is a real-time data replication solution for databases and data warehouses. Artie leverages change data capture (CDC) and stream processing to perform data syncs in a more efficient way, which enables sub-minute latency and helps optimize compute costs. With Artie, any company can set up streaming pipelines in minutes without coding.
About Vector Solutions: Vector Solutions is an AI-enabled workforce management platform trusted by over 24,000 organizations. The company provides learning management, safety management, workforce scheduling, and operational tools for industries including higher education, K-12, public safety, manufacturing, and government.
