Building Robust Data Products: 5 Pillars Every Data Engineer Should Apply

Clean Architecture Principles for Robust and Evolvable Data Products
6:30 Mins • May 29, 2026

https://www.moderndata101.com/blogs/building-robust-data-products-5-pillars-every-data-engineer-should-apply/

TL;DR

For years, data was merely a technical component of applications, a simple support mechanism for maintenance. Today, it has taken on a new dimension: it has become a product with its own uses, requirements, and lifecycle.

However, most data systems still act like fragile pipelines:

  • They break as soon as business rules change
  • Every update is costly
  • They fail quietly without proper monitoring
  • They constantly need someone to fix them

These situations have shaped a clear conviction: The real challenge today is to design systems that adapt, preserve trust, and remain coherent as the business evolves.

A robust data product is therefore much more than a workflow. It is a long-lived digital asset that encodes business meaning, enforces data quality, traces change, and supports continuous evolution without friction.

In this article, I’m laying out five engineering pillars that I’ve seen consistently turn data products from “some scripts that work for now” into components you can trust long-term.


1. Separate Business Logic from Infrastructure Using Clean Architecture

Clean Architecture, based on the classic onion model, gives you a practical way to draw a line between what should stay stable and what will inevitably change. In real-world scenarios, infrastructure shifts all the time, but the business logic usually doesn’t. The onion just makes that separation obvious.

At the center is the code that should barely move: the rules, the definitions, the meaning of the business itself. As you move outward, the layers get more volatile and more technical. And that’s the whole point.

The core stays calm while the edges take the churn. The further outward you go, the more technical and replaceable the components become.

What this means in practice

Business logic sits at the core. A dedicated domain layer encapsulates the true semantics of the business. That layer is:

  • Versioned
  • Testable
  • Technology-agnostic
  • Isolated from external concerns

The infrastructure forms the outer layers that can change without affecting the domain logic. We include in infrastructure:

  • Orchestrator
  • Cloud Resources
  • Ingestion Systems
  • Storage Formats
  • Compute engines
  • ...

Each domain carries its own:

  • Responsibilities
  • Business Code
  • Tests
  • Data contract, which reinforces autonomy and clarity

How does the onion model apply to data products?

Onion Model for Clean Data Architecture Applied to Data | Source: Author

From the center outward:

  1. Domain (core business rules) - stable, semantic, versioned
  2. Application layer - orchestrates the domain logic, defines workflows
  3. Adapters - readers, writers, format converters, I/O logic
  4. Infrastructure - compute engines, cloud services, orchestrators, storage systems

Business rules depend on nothing. Outer layers depend on inner layers, never the other way around.
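
To make the dependency rule concrete, here is a minimal sketch in Python. The `Order` type, the `total_spend_by_customer` rule, and the reader/writer names are all hypothetical and only illustrate the layering: the domain function imports nothing technical, the application layer talks to interfaces, and the adapters are the only part that would change if storage or compute changed.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


# --- Domain (core): pure business rule, no I/O, no framework imports ---
@dataclass(frozen=True)
class Order:
    customer_id: str
    amount: float


def total_spend_by_customer(orders: Iterable[Order]) -> dict[str, float]:
    """Hypothetical business rule: total order amount per customer."""
    totals: dict[str, float] = {}
    for order in orders:
        totals[order.customer_id] = totals.get(order.customer_id, 0.0) + order.amount
    return totals


# --- Ports: interfaces owned by the inner layers, implemented by adapters ---
class OrderReader(Protocol):
    def read(self) -> Iterable[Order]: ...


class SpendWriter(Protocol):
    def write(self, totals: dict[str, float]) -> None: ...


# --- Application layer: orchestrates the domain logic through the ports ---
def run_spend_job(reader: OrderReader, writer: SpendWriter) -> None:
    writer.write(total_spend_by_customer(reader.read()))


# --- Adapters: replaceable I/O (in-memory here; could be files, a warehouse, ...) ---
class InMemoryOrderReader:
    def __init__(self, orders: list[Order]) -> None:
        self._orders = orders

    def read(self) -> Iterable[Order]:
        return self._orders


class InMemorySpendWriter:
    def __init__(self) -> None:
        self.saved: dict[str, float] = {}

    def write(self, totals: dict[str, float]) -> None:
        self.saved = totals


reader = InMemoryOrderReader([Order("c1", 10.0), Order("c2", 5.0), Order("c1", 2.5)])
writer = InMemorySpendWriter()
run_spend_job(reader, writer)
print(writer.saved)  # {'c1': 12.5, 'c2': 5.0}
```

Swapping the in-memory adapters for real ones would leave `total_spend_by_customer` and `run_spend_job` untouched, which is the portability the onion model is after.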


Benefits

  • Maximum portability: Switching orchestrators or compute engines requires zero business logic changes.
  • Massively reduced technical debt: Clear boundaries make it easier to upgrade tools, refactor pipelines, and evolve models.
  • Stronger alignment with business domains: Each domain becomes a cohesive, understandable unit for data engineers and business stakeholders alike.

2. Add Automated Tests: From Unit Tests to Functional Validation

Most data products are still tested primarily at the unit level, and even this coverage can vary significantly depending on team maturity. Many data engineers are early in their careers, and pipelines are often treated as “scripts that work” rather than full software products.

A helpful framework to think about test coverage is the testing pyramid:

Automated Test and Architecture Pyramid | Source: Author

  • Foundation: Unit tests
    These validate individual functions or transformations. They are fast, low-level checks.
  • Middle Layer: Integration tests
    These verify that different components, modules, or pipelines work together as expected.
  • Top: Functional tests
    These validate that the actual behavior is aligned with business requirements or goals.
  • Bonus: End-to-end tests
    These validate the overall system behavior in a prod-like environment.

Let’s focus on the functional tests.

They should represent the business rules independently of the infrastructure.

Beyond this foundational testing, Behavior-Driven Development (BDD) is gaining traction in data engineering. BDD allows teams to capture complex business rules in executable, human-readable scenarios using the Gherkin language.

Why integrate BDD in data products?

  • Living documentation, understandable by business teams, analysts, and engineers
  • Clear, unambiguous business expectations
  • Automatically testable scenarios
  • Strong alignment during schema changes or updates in logic

Example of Functional Scenarios

Scenario: Customer Score Calculation | Source: Author

Business rules are expressed in natural language, versioned in Git, and validated before deployment.

Integrating BDD Into a Modern CI/CD Workflow

A robust workflow looks like this:

  • The Data Product Manager writes Gherkin scenarios and has them validated by the business and the data engineering team
  • Data Engineers integrate the feature file and implement the BDD steps
    • Set the data context
    • Implement and call the business logic
    • Assert that the actual behavior matches the expected behavior
  • CI executes the Gherkin scenarios to validate any change in:
    • schema
    • contracts
    • business logic
    • transformations
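
The three BDD steps above can be sketched in plain Python. The scenario text, the `customer_score` rule, and its thresholds are hypothetical and are not taken from the scenario in the figure. In a real project the step functions would typically be bound to the feature file by a BDD framework such as behave or pytest-bdd; this sketch leaves the framework out so that it stays self-contained.

```python
# Hypothetical Gherkin scenario, kept here only as documentation for the steps.
SCENARIO = """
Scenario: Loyal customer gets a high score
  Given a customer with 12 orders and 0 late payments
  When the customer score is calculated
  Then the score is "HIGH"
"""


# Business logic under test (hypothetical rule that would live in the domain layer)
def customer_score(orders: int, late_payments: int) -> str:
    if orders >= 10 and late_payments == 0:
        return "HIGH"
    if orders >= 3 and late_payments <= 1:
        return "MEDIUM"
    return "LOW"


# Step 1: set the data context
def given_customer(orders: int, late_payments: int) -> dict:
    return {"orders": orders, "late_payments": late_payments}


# Step 2: call the business logic
def when_score_is_calculated(context: dict) -> dict:
    context["score"] = customer_score(context["orders"], context["late_payments"])
    return context


# Step 3: assert that the actual behavior matches the expected behavior
def then_score_is(context: dict, expected: str) -> None:
    assert context["score"] == expected, f"expected {expected}, got {context['score']}"


def test_loyal_customer_gets_high_score() -> None:
    context = given_customer(orders=12, late_payments=0)
    context = when_score_is_calculated(context)
    then_score_is(context, "HIGH")


# CI would run this on every change; calling it directly works too.
test_loyal_customer_gets_high_score()
```

Because the steps only touch the business function, the same scenario keeps passing or failing for business reasons, regardless of which engine or orchestrator runs the pipeline.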

By combining a solid base of unit and integration tests with functional BDD scenarios, data products can evolve safely, maintain trust, and make business rules explicit.



3. Design for Reprocessing from Day One (Backfill-as-a-Service)

Modern pipelines must be replayable on demand. A change in business logic should propagate to historical data safely and automatically.

Core practices

  • Idempotent pipelines by design
  • Automated backfilling via a dedicated service (e.g., GitOps + API-driven orchestration + on-demand compute)
  • Use of time travel/snapshots (Iceberg, Delta Lake…)
  • Versioned transformations (Spark, dbt…)
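
As an illustration of idempotency by design, here is a minimal sketch that replaces a whole partition instead of appending to it. The in-memory `Table`, the `transform` function, and the field names are hypothetical stand-ins for a real table format and a real transformation.

```python
# Hypothetical in-memory "table": partition key (a date string) -> list of rows
Table = dict[str, list[dict]]


def transform(row: dict) -> dict:
    """Hypothetical versioned transformation: convert cents to euros."""
    return {**row, "amount_eur": round(row["amount_cents"] / 100, 2)}


def write_partition(table: Table, partition: str, raw_rows: list[dict]) -> None:
    """Replace the whole partition instead of appending to it.

    Running this twice with the same input leaves the table in the same state,
    which is what makes a replay or a backfill safe.
    """
    table[partition] = [transform(r) for r in raw_rows]


table: Table = {}
raw = [{"id": 1, "amount_cents": 1250}, {"id": 2, "amount_cents": 399}]

write_partition(table, "2024-01-01", raw)
first_run = {k: list(v) for k, v in table.items()}

write_partition(table, "2024-01-01", raw)  # replay: same result, no duplicates
assert table == first_run
print(len(table["2024-01-01"]))  # 2
```

An append-based write would have produced four rows after the replay; overwriting by partition is one common way to avoid that, though not the only one.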

How it works

A dedicated service handles everything that traditionally makes backfills painful:

  • partition/date/table level scheduling
  • parallelization and autoscaling
  • monitoring, retries, and error isolation
  • automatic compute allocation
  • DAG dependency management

A modern workflow

  1. You update your dbt/Spark code and push to Git.
  2. A GitOps event detects the change.
  3. The service identifies impacted partitions.
  4. It allocates compute and triggers the backfill through APIs.
  5. Historical data is updated seamlessly.
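
Steps 3 to 5 can be sketched as a small function that lists the impacted daily partitions, runs them in parallel, and retries each one in isolation. Everything here is an assumption made for illustration: daily partitions, a thread pool standing in for on-demand compute, and a `recompute_partition` job that would in practice call an orchestrator or compute API.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def impacted_partitions(start: date, end: date) -> list[date]:
    """One daily partition per date in [start, end], both inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def run_with_retries(job, partition: date, max_attempts: int = 3) -> tuple[date, bool]:
    """Run one partition; a failure here never stops the other partitions."""
    for _ in range(max_attempts):
        try:
            job(partition)
            return partition, True
        except Exception:
            continue
    return partition, False


def backfill(job, start: date, end: date, workers: int = 4) -> dict[date, bool]:
    """Process all impacted partitions in parallel and report success per partition."""
    partitions = impacted_partitions(start, end)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: run_with_retries(job, p), partitions))
    return dict(results)


# Hypothetical job: fails once for one date, then succeeds on retry.
attempts: dict[date, int] = {}


def recompute_partition(partition: date) -> None:
    attempts[partition] = attempts.get(partition, 0) + 1
    if partition == date(2024, 1, 2) and attempts[partition] == 1:
        raise RuntimeError("transient failure")


status = backfill(recompute_partition, date(2024, 1, 1), date(2024, 1, 3))
print(all(status.values()))  # True
```

The returned status map is what a monitoring view would be built on: partitions that still fail after the retries are reported individually instead of aborting the whole run.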

No more clicking through 500 dates in an orchestrator UI. The system does the heavy lifting.


4. Manage Data Model Evolution and Breaking Changes

A data model is a living system - it grows, mutates, and sometimes breaks. A mature data product must support:

  • adding new fields
  • modifying existing structures
  • deprecating fields
  • backward compatibility or fully versioned schemas

What I recommend doing

  • Version schemas (v1, v2, v3…).
  • Document all changes clearly.
  • Allow coexistence during breaking migrations.
  • Use contract-first schemas (Avro, JSON Schema …).
  • Include a schema registry for streaming pipelines.

Data Model Versioning Management | Source: Author


Your data model should evolve without stopping production or impacting your consumers.
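
One way to enforce this in CI is a compatibility check between two schema versions. The sketch below uses a deliberately simplified, hypothetical schema representation (field name mapped to type and required flag) and an assumed rule set; a real setup would rely on the compatibility rules of the chosen contract format or schema registry.

```python
# Hypothetical schema representation: field name -> (type, required)
Schema = dict[str, tuple[str, bool]]

ORDER_V1: Schema = {"order_id": ("string", True), "amount": ("double", True)}
ORDER_V2: Schema = {
    "order_id": ("string", True),
    "amount": ("double", True),
    "currency": ("string", False),  # new optional field: treated as non-breaking
}
ORDER_V3: Schema = {"order_id": ("string", True), "amount": ("string", True)}  # type change


def breaking_changes(old: Schema, new: Schema) -> list[str]:
    """List changes that, under the assumed rules, require a new major version."""
    problems = []
    for name, (old_type, _) in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name][0] != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new[name][0]}")
    for name, (_, required) in new.items():
        # Data written with the old schema cannot satisfy a new required field.
        if name not in old and required:
            problems.append(f"new required field: {name}")
    return problems


print(breaking_changes(ORDER_V1, ORDER_V2))  # []
print(breaking_changes(ORDER_V1, ORDER_V3))  # ['type changed: amount double -> string']
```

An empty list means the new version can replace the old one in place; a non-empty list signals that both versions should coexist during the migration.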


5. Monitor Data Quality and Detect Anomalies Proactively

Data quality is often treated as a low priority, something that matters mainly for reporting. However, this approach can lead to incidents that have a significant business impact.

Pipeline without Data Quality | Source: Author

Integrating data quality checks directly into data pipelines enables data producers to proactively detect anomalies. By exposing data quality metadata, data consumers can determine whether the data meets the required standards for their use cases and better control potential business risks.

Pipeline with Data Quality | Source: Author

Coverage should include:

  • Completeness: Are some fields missing?
  • Freshness: How much time elapses between the business event and the data ingestion?
  • Business-rule consistency: Does the data comply with the defined business rules?
  • Statistical drift: Is the data distribution changing abnormally over time?
  • Duplicates: Do we have duplicate data?
  • Expected volumes: Does the volume deviate from our usual expectations?
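
Several of these dimensions can be expressed as small functions that run inside the pipeline and publish their results as metadata. The sketch below is plain Python with hypothetical field names (`email`, `event_at`, `ingested_at`) and an assumed 20% volume tolerance; tools such as Great Expectations or Soda provide richer, declarative versions of the same idea.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows: list[dict], field: str) -> float:
    """Share of rows where `field` is present and not None."""
    if not rows:
        return 0.0
    return sum(r.get(field) is not None for r in rows) / len(rows)


def duplicate_count(rows: list[dict], key: str) -> int:
    """Number of rows whose key value was already seen."""
    keys = [r[key] for r in rows]
    return len(keys) - len(set(keys))


def max_ingestion_lag(rows: list[dict]) -> timedelta:
    """Freshness: worst delay between the business event and ingestion."""
    return max(r["ingested_at"] - r["event_at"] for r in rows)


def volume_ok(row_count: int, expected: int, tolerance: float = 0.2) -> bool:
    """Expected volumes: row count within +/- tolerance of the usual count."""
    return abs(row_count - expected) <= tolerance * expected


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": 1, "email": "a@example.com", "event_at": t0, "ingested_at": t0 + timedelta(minutes=5)},
    {"id": 2, "email": None, "event_at": t0, "ingested_at": t0 + timedelta(minutes=30)},
    {"id": 2, "email": "b@example.com", "event_at": t0, "ingested_at": t0 + timedelta(minutes=7)},
]

# Quality metadata that could be exposed to consumers alongside the data
report = {
    "email_completeness": completeness(rows, "email"),
    "duplicate_ids": duplicate_count(rows, "id"),
    "max_lag_minutes": max_ingestion_lag(rows).total_seconds() / 60,
    "volume_ok": volume_ok(len(rows), expected=3),
}
print(report)
```

Statistical drift and business-rule consistency are left out here because they depend on historical baselines and domain rules that this sketch does not model.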

Silent anomalies are extremely expensive and too often detected by end users, not by the Data Product team.

To include in the solution

  • Automated checks (Great Expectations, Soda…)
  • Alerting on drifts or anomalies
  • Monitoring dashboards for data and business teams

A Coherent Mindset, Not Just Best Practices

When you look at each pillar on its own, it solves a specific challenge. But when you bring them together, they reinforce each other and form a unified approach. Downstream users are able to trust and consume the data confidently. Even as business logic, schemas, or infrastructure evolve, downstream requirements are not compromised.

Clean architecture, automated tests, reprocessing capabilities, controlled schema evolution, and proactive monitoring create a foundation that lets teams move from reactive firefighting to predictable, stable operations.

This is what turns a simple pipeline into a robust data product, and a Data Engineer into someone who builds software-quality systems rather than fragile workflows.


Originally published on Modern Data 101 Newsletter; the above is a revised edition.
