
As part of Artie’s Data & Engineering Innovators Spotlight, we profile leaders shaping the future of modern data infrastructure, real-time data systems, and AI-driven engineering. This series highlights the practitioners designing scalable architectures, modernizing legacy stacks, and pushing the boundaries of what data engineering teams can achieve.
Today, we’re excited to feature Ved Prakash, Principal Data Engineer at GitLab and a leader in real-time data platforms. He is the architect behind Project Siphon, GitLab’s enterprise-scale CDC backbone delivering real-time streaming with sub-hour latency.
About Ved Prakash: A Leader in Modern Data Engineering
Ved architects Project Siphon, GitLab’s enterprise-scale Change Data Capture (CDC) backbone that powers real-time streaming and sub-hour latency, while maintaining disciplined cost optimization. As one of only 17 members globally on Snowflake’s Data Superhero Council, he drives high-impact platform decisions across Snowflake, Apache Iceberg, Open Catalog, and cloud-native CDC pipelines, including AI cost optimization strategies for Snowflake deployments. He also shares lessons through talks (Berlin Buzzwords, DSC Europe) and DataAI Chronicles, where he explores next-gen real-time and AI-integrated data platform architecture.
His work reflects how top data organizations are evolving: adopting real-time pipelines, improving data reliability, enabling AI workloads, and building foundations that scale with the business.
Interview With Ved Prakash - Insights on Data Architecture, Real-Time Systems, and Engineering Leadership
Q1: How has the role of a data engineering leader evolved since you started your career?
When I started in data engineering, leaders were primarily focused on keeping pipelines running and ensuring data quality—success meant reports were accurate and stakeholders got their data on time. It was very execution-focused: build the ETL, optimize the queries, manage the infrastructure.
Today, data engineering leadership has transformed into a strategic function. Leaders are now at the table making build-versus-buy decisions worth millions, influencing company-wide technology choices, and directly enabling or constraining what organizations can do with AI and analytics. You're expected to speak the language of business—ROI, cost optimization, competitive advantage—not just technical architecture. The shift from "keeping the lights on" to "defining what's possible" has been dramatic.
The other major evolution is the expansion of responsibilities. Data engineering leaders used to own pipelines and maybe some warehouse optimization. Now you're responsible for platform economics—managing Snowflake costs at scale, building governance frameworks for AI workloads, architecting real-time systems that fundamentally change how the business operates. You're thinking about data as a product, about enabling self-service, and about how your infrastructure decisions either accelerate or slow down innovation across the entire company. It's gone from a purely technical role to one that requires equal parts engineering depth, business acumen, and strategic vision about where data technology is heading.
Q2: What’s a recent architectural decision you’re proud of — and why?
I'm really proud of the AI Cost Control Framework we built for Snowflake Cortex at GitLab. As AI adoption exploded across the organization, we realized nobody had visibility into Cortex credit consumption, and costs could easily spiral out of control without anyone noticing until the bill arrived.
We architected an automated monitoring system that tracks Cortex credit usage in real-time, implements configurable threshold alerts, and triggers breach procedures before costs get out of hand. The key decision was making it proactive rather than reactive—the system doesn't just tell you after you've blown the budget, it warns stakeholders as consumption approaches thresholds and provides actionable insights on which services or teams are driving costs. What makes me proud is that we're solving a problem most companies haven't even recognized yet, and as AI becomes more embedded in data platforms, this kind of intelligent cost governance will be table stakes. It's not just about building infrastructure—it's about enabling innovation while maintaining financial discipline.
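To make the threshold-alert pattern concrete, here is a minimal sketch of the idea, not GitLab’s actual implementation. It assumes Cortex consumption is visible in SNOWFLAKE.ACCOUNT_USAGE.METERING_DAILY_HISTORY under an AI-related SERVICE_TYPE; the view name, the 'AI_SERVICES' filter, the budget, and the notify() stub are all assumptions to adapt to your own account.

```python
# Minimal sketch of a proactive Cortex credit threshold check.
# Assumptions: Cortex usage appears in ACCOUNT_USAGE.METERING_DAILY_HISTORY
# under SERVICE_TYPE = 'AI_SERVICES'; adjust the view, filter, and thresholds
# to match your account. Illustrative only, not GitLab's implementation.
import os
import snowflake.connector

MONTHLY_BUDGET_CREDITS = 500.0   # hypothetical budget
WARN_AT = 0.8                    # warn at 80% of budget

QUERY = """
    SELECT COALESCE(SUM(credits_used), 0)
    FROM snowflake.account_usage.metering_daily_history
    WHERE service_type = 'AI_SERVICES'   -- assumed Cortex service type
      AND usage_date >= DATE_TRUNC('month', CURRENT_DATE())
"""

def check_cortex_spend() -> None:
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    )
    try:
        used = float(conn.cursor().execute(QUERY).fetchone()[0])
    finally:
        conn.close()

    ratio = used / MONTHLY_BUDGET_CREDITS
    if ratio >= 1.0:
        notify(f"Cortex budget breached: {used:.1f} credits used")  # breach procedure
    elif ratio >= WARN_AT:
        notify(f"Cortex spend at {ratio:.0%} of monthly budget")    # proactive warning

def notify(message: str) -> None:
    # Placeholder for Slack/PagerDuty/email integration.
    print(message)

if __name__ == "__main__":
    check_cortex_spend()
```

Run on a schedule, a check like this surfaces consumption trends before the invoice does, which is the proactive posture described above.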
Q3: What’s the biggest misconception you see about CDC or streaming systems?
The biggest misconception I see is that CDC is just about getting data faster—people think if you implement CDC, you automatically get real-time insights and everything just works. What I've learned building Project Siphon at GitLab is that CDC is really about architectural transformation: you're changing how your entire data platform thinks about data movement, consistency, and downstream dependencies.
The reality is much more complex than "turn on CDC and data flows in real-time." You're dealing with schema evolution, handling deletes and updates differently than batch pipelines, managing backfills alongside streaming changes, and rethinking how your downstream consumers—whether it's dbt models, BI tools, or ML systems—need to adapt to incremental data. At GitLab, we've had to carefully architect around issues like ensuring UI updates actually persist to the database before CDC captures them, handling Postgres logical replication slots, and designing Iceberg table structures that can efficiently handle both batch historical loads and streaming updates.
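As a toy illustration of why deletes and updates need different handling than batch appends, here is a minimal, library-free sketch of applying a stream of change events to a keyed target state. It is a simplification of what a real sink (for example an Iceberg merge) does, not Project Siphon’s logic, and the event shape is assumed.

```python
# Toy sketch: applying CDC events to a keyed target, in commit order.
# A batch pipeline could simply append rows; a CDC consumer must respect
# operation type and ordering. The event shape here is assumed.
from typing import Iterable

def apply_cdc_events(target: dict, events: Iterable[dict]) -> dict:
    """Apply insert/update/delete events keyed by primary key."""
    for event in events:
        key = event["pk"]
        op = event["op"]
        if op in ("insert", "update"):
            target[key] = event["row"]    # upsert semantics
        elif op == "delete":
            target.pop(key, None)         # deletes must actually remove rows
        else:
            raise ValueError(f"unknown op: {op}")
    return target

state = {}
apply_cdc_events(state, [
    {"op": "insert", "pk": 1, "row": {"id": 1, "status": "open"}},
    {"op": "update", "pk": 1, "row": {"id": 1, "status": "closed"}},
    {"op": "delete", "pk": 1, "row": None},
])
assert state == {}   # the row was created, changed, then removed
```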
The other big one? People underestimate the operational maturity required. CDC isn't a "set it and forget it" solution—it's a living system that requires monitoring, alerting, understanding of database internals, and close collaboration between data engineering, platform teams, and application developers. The payoff is absolutely worth it in terms of latency and cost optimization, but it's a journey that requires serious engineering investment and organizational alignment.
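One example of the operational tooling that maturity implies: a logical replication slot that nobody is consuming holds WAL on the source Postgres and can eventually fill its disk, so slot lag is worth alerting on. Below is a minimal monitoring sketch using standard Postgres catalog views; the DSN, threshold, and notify() stub are placeholders.

```python
# Minimal sketch: alert when a logical replication slot falls too far behind.
# Uses standard Postgres catalog views/functions; the DSN and threshold are
# placeholders, and notify() is a stub for a real alerting system.
import os
import psycopg2

LAG_ALERT_BYTES = 5 * 1024**3   # hypothetical 5 GiB threshold

SLOT_LAG_SQL = """
    SELECT slot_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
    FROM pg_replication_slots
    WHERE slot_type = 'logical'
"""

def check_slot_lag() -> None:
    with psycopg2.connect(os.environ["SOURCE_PG_DSN"]) as conn:
        with conn.cursor() as cur:
            cur.execute(SLOT_LAG_SQL)
            for slot_name, lag_bytes in cur.fetchall():
                if lag_bytes is not None and lag_bytes > LAG_ALERT_BYTES:
                    notify(f"slot {slot_name} is {lag_bytes / 1024**3:.1f} GiB behind")

def notify(message: str) -> None:
    print(message)   # replace with Slack/PagerDuty integration

if __name__ == "__main__":
    check_slot_lag()
```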
Q4: Where do you think AI will have the biggest impact on data engineering over the next five years?
- Intelligent Data Pipeline Orchestration & Self-Healing Systems: I think we'll see pipelines that actually understand what they're doing and can fix themselves when things go wrong. Instead of getting paged at 2 AM because a job failed, the system will detect the issue, understand the downstream impact, and either retry intelligently or route around the problem (see the sketch after this list).
- Code Generation & Platform Engineering Acceleration: AI is already changing how quickly we can build things—imagine describing what you need in plain English and receiving a working dbt model or CDC pipeline configuration. We'll spend way less time on boilerplate code and repetitive setup work, and more time on the interesting architectural challenges that actually move the business forward.
- Cost Optimization Through Intelligent Resource Management: This one's huge for me—AI that watches how your Snowflake or Databricks environment actually gets used and adjusts resources automatically. Right now, we're setting policies and hoping for the best, but AI can learn patterns and optimize in real time, which is critical when you're running CDC pipelines and streaming data, where costs can explode if you're not careful.
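To ground the self-healing idea from the first bullet, here is a minimal, framework-agnostic sketch of the pattern: classify a failure, retry transient errors with backoff, and escalate only when retries are exhausted. The names, error classification, and flaky task are hypothetical; a real orchestrator (Airflow, Dagster, etc.) would supply the scheduling and alerting.

```python
# Minimal sketch of "self-healing" task execution: classify failures,
# retry transient ones with exponential backoff, escalate the rest.
# The classification rules and the task itself are hypothetical placeholders.
import time

TRANSIENT_ERRORS = (TimeoutError, ConnectionError)   # assumed transient classes

def run_with_self_healing(task, max_retries: int = 3, base_delay: float = 2.0):
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except TRANSIENT_ERRORS as exc:
            if attempt == max_retries:
                escalate(f"giving up after {attempt} attempts: {exc}")
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"transient failure ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
        except Exception as exc:
            # Non-transient failures go straight to a human.
            escalate(f"non-recoverable failure: {exc}")
            raise

def escalate(message: str) -> None:
    print(f"PAGE ON-CALL: {message}")   # stand-in for real alerting

def flaky_extract():
    raise TimeoutError("source API timed out")   # hypothetical flaky task

if __name__ == "__main__":
    try:
        run_with_self_healing(flaky_extract, max_retries=2, base_delay=0.1)
    except TimeoutError:
        pass
```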
Why Leaders Like Ved Prakash Inspire the Future of Data Engineering
Innovators like Ved Prakash are redefining what modern data engineering looks like - from real-time data architectures to AI-powered operational systems. Their insights help teams rethink scalability, data quality, and the future of intelligent infrastructure.
At Artie, we’re proud to feature leaders building the next generation of data platforms, CDC pipelines, and real-time analytics systems.
If you're advancing your company’s data infrastructure, we’d love to spotlight your work in a future edition.

