Building Robust Data Products: 5 Pillars Every Data Engineer Should Apply

Clean Architecture Principles for Robust and Evolvable Data Products

6:30 Mins • May 29, 2026

TL;DR

For years, data was merely a technical component of applications, a simple support mechanism for maintenance. Today, it has taken on a new dimension: it has become a product with its own uses, requirements, and lifecycle.

However, most data systems still act like fragile pipelines:

They break as soon as business rules change
Every update is costly
They fail quietly without proper monitoring
They constantly need someone to fix them

These situations have shaped a clear conviction: The real challenge today is to design systems that adapt, preserve trust, and remain coherent as the business evolves.

A robust data product is therefore much more than a workflow. It is a long-lived digital asset that encodes business meaning, enforces data quality, traces change, and supports continuous evolution without friction.

In this article, I’m laying out five engineering pillars that I’ve seen consistently turn data products from “some scripts that work for now” into components you can trust long-term.

1. Separate Business Logic from Infrastructure Using Clean Architecture

Clean Architecture, based on the classic onion model, gives you a practical way to draw a line between what should stay stable and what will inevitably change. In real-world scenarios, infrastructure shifts all the time, but the business logic usually doesn’t. The onion just makes that separation obvious.

At the center is the code that should barely move: the rules, the definitions, the meaning of the business itself. As you move outward, the layers become more volatile, more technical, and more replaceable.

And that’s the whole point: the core stays calm while the edges take the churn.

What this means in practice

Business logic sits at the core. A dedicated domain layer encapsulates the true semantics of the business:

Versioned
Testable
Technology-agnostic
Isolated from external concerns

The infrastructure forms the outer layers that can change without affecting the domain logic. We include in infrastructure:

Orchestrator
Cloud Resources
Ingestion Systems
Storage Formats
Compute engines
...

Each domain carries its own:

Responsibilities
Business Code
Tests
Data Contract

Together, these reinforce the autonomy and clarity of each domain.

How does the onion model apply to data products?

Onion Model for Clean Data Architecture Applied to Data | Source: Author

From the center outward:

Domain (core business rules) - stable, semantic, versioned
Application layer - orchestrates the domain logic, defines workflows
Adapters - readers, writers, format converters, I/O logic
Infrastructure - compute engines, cloud services, orchestrators, storage systems

Business rules depend on nothing. Outer layers depend on inner layers, never the other way around.
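To make the dependency rule concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Order` type, the `net_revenue` rule, and the in-memory adapters are hypothetical and stand in for whatever your domain and storage actually look like.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


# --- Domain (core): pure business rule, no I/O, no framework imports ---
@dataclass(frozen=True)
class Order:
    order_id: str
    amount: float
    is_cancelled: bool


def net_revenue(orders: Iterable[Order]) -> float:
    """Hypothetical business rule: cancelled orders do not count as revenue."""
    return sum(o.amount for o in orders if not o.is_cancelled)


# --- Ports: interfaces the application layer expects from the outside ---
class OrderReader(Protocol):
    def read(self) -> list[Order]: ...


class RevenueWriter(Protocol):
    def write(self, value: float) -> None: ...


# --- Application layer: orchestrates the domain through the ports ---
def compute_revenue(reader: OrderReader, writer: RevenueWriter) -> float:
    revenue = net_revenue(reader.read())
    writer.write(revenue)
    return revenue


# --- Adapters: replaceable I/O; these in-memory ones stand in for real storage ---
class InMemoryOrderReader:
    def __init__(self, orders: list[Order]) -> None:
        self._orders = orders

    def read(self) -> list[Order]:
        return list(self._orders)


class InMemoryRevenueWriter:
    def __init__(self) -> None:
        self.values: list[float] = []

    def write(self, value: float) -> None:
        self.values.append(value)


reader = InMemoryOrderReader([
    Order("o1", 100.0, False),
    Order("o2", 40.0, True),
    Order("o3", 25.5, False),
])
writer = InMemoryRevenueWriter()
result = compute_revenue(reader, writer)
```

Swapping the in-memory adapters for ones backed by a warehouse or object store would not touch `net_revenue` or `compute_revenue`, which is the portability the onion model is after.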

Benefits

Maximum portability: switching orchestrators or compute engines requires zero business logic changes.
Massively reduced technical debt: Clear boundaries make it easier to upgrade tools, refactor pipelines, and evolve models.
Stronger alignment with business domains: Each domain becomes a cohesive, understandable unit for data engineers and business stakeholders alike.

2. Add Automated Tests: From Unit Tests to Functional Validation

Most data products are still tested primarily at the unit level, and even this coverage can vary significantly depending on team maturity. Many data engineers are early in their careers, and pipelines are often treated as “scripts that work” rather than full software products.

A helpful framework to think about test coverage is the testing pyramid:

Automated Test and Architecture Pyramid | Source: Author

Foundation: Unit tests
These validate individual functions or transformations. They are fast, low-level checks.
Middle layer: Integration tests
These verify that different components, modules, or pipelines work together as expected.
Top: Functional tests
These validate that the actual behavior matches business requirements or goals.
Bonus: End-to-end tests
These validate the overall system behavior in a prod-like environment.

Let’s focus on functional tests. They should express the business rules independently of the infrastructure.

Beyond this foundational testing, Behavior-Driven Development (BDD) is gaining traction in data engineering. BDD allows teams to capture complex business rules in executable, human-readable scenarios using the Gherkin language.

Why integrate BDD in data products?

Living documentation, understandable by business teams, analysts, and engineers
Clear, unambiguous business expectations
Automatically testable scenarios
Strong alignment during schema changes or updates in logic

Example of Functional Scenarios

Scenario: Customer Score Calculation | Source: Author

Business rules are expressed in natural language, versioned in Git, and validated before deployment.
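As a rough illustration of how a scenario maps to code, here is a small Python sketch. The scoring rule and the scenario text are hypothetical (they are not the ones from the figure), and the steps are plain functions so the example stays dependency-free. In a real project, a BDD library such as behave or pytest-bdd would bind each Gherkin line to its step.

```python
from dataclasses import dataclass

# Illustrative Gherkin scenario (hypothetical rule).
# In a real project this text would live in a .feature file versioned in Git.
SCENARIO = """
Feature: Customer score
  Scenario: Loyal customer with one return
    Given a customer with 12 orders in the last year
    And the customer has 1 returned order
    When the customer score is calculated
    Then the score should be 55
"""


@dataclass
class Customer:
    orders_last_year: int
    returned_orders: int


def customer_score(customer: Customer) -> int:
    """Hypothetical domain rule: +5 per order, -5 per return, clamped to 0..100."""
    raw = 5 * customer.orders_last_year - 5 * customer.returned_orders
    return max(0, min(100, raw))


# Step implementations: set the context, call the business logic, assert.
def given_customer(orders: int, returns: int) -> Customer:
    return Customer(orders_last_year=orders, returned_orders=returns)


def when_score_is_calculated(customer: Customer) -> int:
    return customer_score(customer)


def then_score_should_be(actual: int, expected: int) -> None:
    assert actual == expected, f"expected score {expected}, got {actual}"


def run_scenario() -> int:
    customer = given_customer(orders=12, returns=1)  # Given / And
    score = when_score_is_calculated(customer)       # When
    then_score_should_be(score, 55)                  # Then
    return score


score = run_scenario()
```

Note that the steps only touch `customer_score`, not Spark, dbt, or an orchestrator. That is what keeps the scenario valid when the infrastructure changes.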

Integrating BDD Into a Modern CI/CD Workflow

A robust workflow looks like this:

The Data Product Manager writes Gherkin scenarios and has them validated by the business and the data engineering team
Data Engineers integrate the feature file and implement the BDD steps
- Set the data context
- Implement and call the business logic
- Assert that the actual behavior matches the expected behavior
CI executes the Gherkin scenarios to validate any change in:
- schema
- contracts
- business logic
- transformations

By combining a solid base of unit and integration tests with functional BDD scenarios, data products can evolve safely, maintain trust, and make business rules explicit.

3. Design for Reprocessing from Day One (Backfill-as-a-Service)

Modern pipelines must be replayable on demand. A change in business logic should propagate to historical data safely and automatically.

Core practices

Idempotent pipelines by design
Automated backfilling via a dedicated service (i.e., GitOps + API-driven orchestration + on-demand compute)
Use of time travel/snapshots (Iceberg, Delta Lake…)
Versioned transformations (Spark, dbt…)
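Idempotency is the practice everything else here relies on, so here is a minimal sketch of it. It assumes a table partitioned by date and uses a plain dictionary as a stand-in for real storage. The `transform` rule and the row fields are hypothetical.

```python
# In-memory stand-in for a partitioned table: partition key -> list of rows.
Table = dict[str, list[dict]]


def transform(raw_rows: list[dict]) -> list[dict]:
    """Hypothetical transformation: keep paid orders, add the amount in cents."""
    return [
        {**row, "amount_cents": round(row["amount"] * 100)}
        for row in raw_rows
        if row["status"] == "paid"
    ]


def run_partition(table: Table, partition: str, raw_rows: list[dict]) -> None:
    """Idempotent write: the whole partition is replaced, never appended to."""
    table[partition] = transform(raw_rows)


raw = [
    {"order_id": "o1", "amount": 19.99, "status": "paid"},
    {"order_id": "o2", "amount": 5.00, "status": "cancelled"},
]

table: Table = {}
run_partition(table, "2024-01-01", raw)
first_run = list(table["2024-01-01"])
run_partition(table, "2024-01-01", raw)  # replay of the same date
second_run = list(table["2024-01-01"])
```

Because the run overwrites its partition instead of appending, replaying a date leaves the table in the same state. An append-based write would have duplicated `o1` on the second run.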

How it works

A dedicated service handles everything that traditionally makes backfills painful:

partition-, date-, and table-level scheduling
parallelization and autoscaling
monitoring, retries, and error isolation
automatic compute allocation
DAG dependency management

A modern workflow

You update your dbt/Spark code and push to Git.
A GitOps event detects the change.
The service identifies impacted partitions.
It allocates compute and triggers the backfill through APIs.
Historical data is updated seamlessly.

No more clicking through 500 dates in an orchestrator UI. The system does the heavy lifting.
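The "identify impacted partitions" step can be sketched as follows. This is only an illustration under assumptions: the lineage graph, the model names, and the daily partitioning are all hypothetical, and a real service would read lineage from dbt or the orchestrator and order tasks by dependency.

```python
from datetime import date, timedelta

# Hypothetical lineage: model -> models that read from it.
DOWNSTREAM = {
    "stg_orders": ["fct_revenue"],
    "fct_revenue": ["rpt_weekly_revenue"],
    "rpt_weekly_revenue": [],
    "stg_customers": [],
}


def impacted_models(changed: set[str]) -> list[str]:
    """Return the changed models plus everything downstream of them.

    The result is sorted alphabetically for determinism only.
    """
    seen: set[str] = set()
    stack = list(changed)
    while stack:
        model = stack.pop()
        if model in seen:
            continue
        seen.add(model)
        stack.extend(DOWNSTREAM.get(model, []))
    return sorted(seen)


def daily_partitions(start: date, end: date) -> list[str]:
    """Inclusive list of daily partition keys between start and end."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def plan_backfill(changed: set[str], start: date, end: date) -> list[tuple[str, str]]:
    """One (model, partition) task per impacted model and day."""
    return [
        (model, partition)
        for model in impacted_models(changed)
        for partition in daily_partitions(start, end)
    ]


tasks = plan_backfill({"fct_revenue"}, date(2024, 1, 1), date(2024, 1, 3))
```

Here a change to `fct_revenue` produces six tasks (two models over three days) and leaves `stg_orders` untouched, since it sits upstream of the change.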

4. Manage Data Model Evolution and Breaking Changes

A data model is a living system - it grows, mutates, and sometimes breaks. A mature data product must support:

adding new fields
modifying existing structures
deprecating fields
backward compatibility or fully versioned schemas

What I recommend doing

Version schemas (v1, v2, v3…).
Document all changes clearly.
Allow coexistence during breaking migrations.
Use contract-first schemas (Avro, JSON Schema …).
Include a schema registry for streaming pipelines.

Data Model Versioning Management | Source: Author

Your data model should evolve without stopping production or impacting your consumers.
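One way to catch breaking changes before they reach consumers is a compatibility check in CI. The sketch below uses a deliberately simplified rule set (removed fields, changed types, and new required fields count as breaking), and the product schemas are hypothetical. Formats like Avro and schema registries define their own, richer compatibility rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    dtype: str
    required: bool = True


Schema = dict[str, FieldSpec]

# Hypothetical versions of a product schema.
PRODUCT_V1: Schema = {
    "product_id": FieldSpec("string"),
    "price": FieldSpec("double"),
}
PRODUCT_V2: Schema = {  # additive change: new optional field
    "product_id": FieldSpec("string"),
    "price": FieldSpec("double"),
    "sport": FieldSpec("string", required=False),
}
PRODUCT_V3: Schema = {  # breaking change: type changed, field removed
    "product_id": FieldSpec("long"),
}


def breaking_changes(old: Schema, new: Schema) -> list[str]:
    """List the changes that this simplified rule set treats as breaking."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name].dtype != spec.dtype:
            problems.append(f"type changed: {name} {spec.dtype} -> {new[name].dtype}")
    for name, spec in new.items():
        if name not in old and spec.required:
            problems.append(f"new required field: {name}")
    return problems


v1_to_v2 = breaking_changes(PRODUCT_V1, PRODUCT_V2)
v1_to_v3 = breaking_changes(PRODUCT_V1, PRODUCT_V3)
```

An empty result (v1 to v2) means the change can ship in place. A non-empty one (v1 to v3) is the signal to publish a new version and let both coexist during the migration.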

5. Monitor Data Quality and Detect Anomalies Proactively

Data quality is often treated as a low priority, useful mainly for reporting. This approach can lead to incidents with significant business impact.

Pipeline without Data Quality | Source: Author

Integrating data quality checks directly into data pipelines enables data producers to proactively detect anomalies. By exposing data quality metadata, data consumers can determine whether the data meets the required standards for their use cases and better control potential business risks.

Pipeline with Data Quality | Source: Author

Coverage should include:

Completeness: Are some fields missing?
Freshness: How much time passes between the business event and the data ingestion?
Business-rule consistency: Does the data comply with the defined business rules?
Statistical drift: Is the data distribution changing abnormally over time?
Duplicates: Do we have duplicate data?
Expected volumes: Does the volume deviate from our usual expectations?

Silent anomalies are extremely expensive and too often detected by end users, not by the Data Product team.
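To show what "exposing data quality metadata" can look like, here is a small sketch covering four of the dimensions above: completeness, duplicates, expected volumes, and freshness. The function name, field names, and thresholds are hypothetical, and freshness is approximated as the age of the newest event in the batch. Tools like Great Expectations or Soda offer the same kind of checks declaratively.

```python
from datetime import datetime, timedelta, timezone


def quality_report(
    rows: list[dict],
    key: str,
    required: list[str],
    expected_volume: tuple[int, int],
    max_lag: timedelta,
    now: datetime,
) -> dict:
    """Compute simple quality metadata for one batch (`required` must be non-empty)."""
    total = len(rows)
    low, high = expected_volume
    if total == 0:
        # Empty batch: nothing to measure, so report it as incomplete and stale.
        return {"completeness": 0.0, "duplicates": 0, "volume_ok": low <= 0, "fresh": False}
    missing = sum(1 for row in rows for name in required if row.get(name) is None)
    keys = [row[key] for row in rows]
    newest_event = max(row["event_time"] for row in rows)
    return {
        "completeness": 1 - missing / (total * len(required)),
        "duplicates": len(keys) - len(set(keys)),
        "volume_ok": low <= total <= high,
        "fresh": now - newest_event <= max_lag,
    }


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": "o1", "amount": 10.0, "event_time": now - timedelta(hours=1)},
    {"order_id": "o2", "amount": None, "event_time": now - timedelta(hours=2)},
    {"order_id": "o2", "amount": 7.5, "event_time": now - timedelta(hours=3)},
]
report = quality_report(
    rows,
    key="order_id",
    required=["order_id", "amount"],
    expected_volume=(2, 100),
    max_lag=timedelta(hours=6),
    now=now,
)
```

Publishing a report like this next to each batch lets consumers decide for themselves whether one missing amount and one duplicate key are acceptable for their use case.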

What to include in the solution

Automated checks (Great Expectations, Soda…)
Alerting on drifts or anomalies
Monitoring dashboards for data and business teams

A Coherent Mindset, Not Just Best Practices

On its own, each pillar solves a specific challenge. Brought together, they reinforce one another and form a unified approach. Downstream users can trust and consume the data confidently, and their requirements are not compromised even as business logic, schemas, or infrastructure evolve.

Clean architecture, automated tests, reprocessing capabilities, controlled schema evolution, and proactive monitoring create a foundation that lets teams move from reactive firefighting to predictable, stable operations.

This is what turns a simple pipeline into a robust data product, and a Data Engineer into someone who builds software-quality systems rather than fragile workflows.

Author Connect 🖋️

Najate BOUAD

Engineering Manager of Sport Product Data Platform & Product Referential at Decathlon Digital

Najate is an Engineering Manager at Decathlon Digital, leading Sport Product Data Platform & Product Referential initiatives. She focuses on scalable data ecosystems, product data strategy, platform engineering, and enabling connected digital experiences across retail and sport products.

Originally published on the Modern Data 101 Newsletter; the above is a revised edition.
