
Here's your Playbook
Download now
Oops! Something went wrong while submitting the form.
Learn More
TABLE OF CONTENT
This piece is a community contribution from Alejandro Aboy, a data engineer with nearly a decade of experience across analytics, engineering, and digital tracking. Currently at Workpath, he is the main data platform stack owner, building scalable pipelines, automating workflows, and prototyping AI-powered SaaS analytics using tools like Airflow, dbt, Aurora, and Metabase. With prior roles at AILY Labs and Ironhack, Alejandro has built a strong foundation in web and marketing analytics, shifting into modern data stack tools to deliver robust ELT workflows, visualisations, and governance systems. He also writes The Pipe & The Line, where he reflects on modern data engineering and the evolving stack. We’re thrilled to feature their unique insights on Modern Data 101!
We actively collaborate with data experts to bring the best resources to a 10,000+ strong community of data leaders and practitioners. If you have something to share, reach out!
🫴🏻 Share your ideas and work: community@moderndata101.com
*Note: Opinions expressed in contributions are not our own and are only curated by us for broader access and discussion. All submissions are vetted for quality & relevance. We keep it information-first and do not support any promotions, paid or otherwise!
TOC
Over the last few years, we’ve been reading authors such as Inmon and Kimball to understand how to prepare data in ways that could help our future selves maintain data warehouses in the most optimal way.
Inspired by these, we applied techniques such as Star Schema, OBT, Data Vault, and different levels of normalization to keep our data assets clearly defined.
Then, we started having cooler toys; the Modern Data Stack brought tools such as:
Combined together, we end up with powerful setups to maintain data infrastructures to keep our company’s data going.
But there’s a problem:
We started focusing on tools and tech more than on the fundamentals, and many companies now have the greatest stack ever but discovering a new planet seems easier than adding a column to one of their data models.
And AI won’t fix that situation.
With the current pace of AI’s evolution, we are landing in a context where productivity can skyrocket if you incorporate LLMs the right way.
But, just as many other times in the past, companies might not be ready because they are already carrying a lot of technical debt, outdated documentation, and persistent data quality issues.
So, if you don’t solve all those problems first, you shouldn’t jump into adding AI to your architecture. *Surprise: we jump anyway.
In so many ways, Data Modeling is the backbone of a company's analytics-related systems:
Data Modeling is optimal not only because we can anticipate data evolution issues and tackle them before they become a problem, but also because every new team member gets an amazing onboarding experience if it is in the right place.
Combined with a decent data governance and semantic layer setup, you are in a good spot.
However, like any good foundation, this takes time, requires consistency, and considering how fast-paced environments behave nowadays, it's likely impossible.
I like to simplify my mental model into two angles: Constructive modeling and Destructive data modeling.
A new feature is added to the product, a new tool is added to the stack, and a new marketing channel needs some tables.
All these scenarios imply, as a baseline:
They are usually easier to tackle since they are new artifacts to implement, but sometimes they are extensions of what’s currently there, and you need extra thinking on how to connect the dots.
For example, if backend team adds 3 new columns to a raw tables that’s used by 20 models, you will need to do an extensive scanning to get all the downstream dependencies and don’t miss any of them.
Seems easier, but most of the time it’s not. Deprecating features, removing models, and everything related to deletions.
You would follow the same approach as constructive, but you can skip some parts since these models will be gone.
The big challenge is when we need to change how a model behaves after removing upstream dependencies, since that’s:
…a great opportunity to mess up a whole chain of relationships if we don’t reverse engineer it correctly.
If it’s so complex for us to maintain a basic Data Modeling mindset, what can we expect from Agents when we try to put them down to do some data modeling for us?
The previous example on Constructive Data Modeling might get problematic when trying to achieve that with an AI IDE like Cursor, since it will skip some models out of its context eventually, unless your scope is really well defined and targeted.
Let’s ask ourselves some questions on how Agents will behave:
We can’t attach our company data catalog to a context window, nor can we just input all companies’ documentation or dbt project docs to an LLM and hope for the best.
In a world of “quick requests,” we are throwing our Data Modeling design into the trash every time we proceed blindly with those quick requests, so feeding this operational inefficiency into Agents will just speed up technical debt instead of enhancing it.
We need solid foundations, a system that already works for us, and a system to put everything together that can help speed up our data modeling processes.
As mentioned before, what works for us will scale amazingly for Agents if we find ways of putting it together. Just to recap on the most common Data Modeling Approaches, we have:
We spent a lot of time trying to decide between performance and cost optimisation (Normalization) or simplicity in spite of higher costs (Star Schema & OBT). I think the latter is ideal because:
JOINs
, which normally ends up with LLMs confusing foreign keys and over-engineering queries.If you can handle OBT or Star Schema computational costs, go ahead and design good ERDs using Database Markup Language to feed data into context.
On the other hand, I’ve never seen solid implementations of highly normalized tables on companies with OLAP systems.
However if you have good documentation for LLMs to know how to use JOINs without ambiguity, go ahead. Your warehouse bills will appreciate the savings.
Let’s daydream a little.
Since it’s the most common Data Transformation tool out there, the example will be approached with dbt, but you can adapt it to your use cases.
Extend Cursor or Claude Code rules to teach the LLMs how to follow your data modeling approach, scan your project to find dependencies, relationships, create-update documentation.
The workflow would be:
On Cursor (Example):
With the Level 1 approach, I might still be missing relevant business requirements details that would change the output quality. The absence of the data catalog can also lead the IDE to some mistakes due to missing some of the nuances. Lastly, to provide output examples on SQL queries, I need to run them manually and paste the output, so the UX is not perfect.
Imagine:
catalog.json
and manifest.json
on an AWS S3 bucket, so it can be easily retrieved.These are all tools for an MCP server:
So now, while you are on Cursor in your dbt repo, you can ask for meetings summaries to plan for a Data Modeling implementation and check the dbt catalog before going ahead with any change.
If context gets too much, you can temporarily save the outputs on markdown files so they can be fetched on demand and not handled through conversation context.
The good thing is, you don’t have to deploy this on production; you can just keep it for local usage.
On Levels 1 and 2, your AI IDE can pick up your meeting documents from a Drive folder, expose your data catalog to the server to check any nuances, and start modeling by scanning all dependencies perfectly.
But it doesn’t end here; it can go beyond:
This is not about AI replacing us, but using it in our favour to make us better. These examples are realistic approaches that I have been experimenting with, but they can be more or less useful depending on your company’s context.
Small startups with small data infrastructures can quickly benefit from this since context management is not a handicap, but big data companies need to work on more isolated tasks or consider more extended versions of these approaches.
At the end of the day, there will be things that models won’t catch or understand, even if we plug in context. There are lots of things we know without realizing that we aren’t yet ready to feed into LLMs, so these kinds of projects need tons of iterations before reaching autonomous scenarios.
Thanks for reading Modern Data 101! Subscribe for free to receive new posts and support our work.
Find me on LinkedIn 🙌🏻
Hands-on guides, tools, and experiments to sharpen your Data & AI Engineering skills from someone who learned it all in the wild.
By Alejandro Aboy
Top Artefacts From MD101 🧡
230+ Industry voices with 15+ years of experience on average and from across 48 Countries came together to participate in the first edition of the Modern Data Survey
Here’s your own copy of the Actionable Data Product Playbook. With 3000+downloads so far and quality feedback, we are thrilled with the response to this 6-week guide we’ve built with industry experts and practitioners. Stay tuned to moderndata101.com for more actionable resources from us!
Last quarter had so much piled up in the world of Data Products! The challenges of GenAI implementations, what VCs and Founders are saying, deriving the role of data solutions for GenAI, the vision of Data Platforms and Design Principles, “Operating Systems ” for GenAI, and so much more!
Related: