
What Is Well-Modeled Data for Analysis?

Modeling data is not enough — you need to do it well

Jesse Paquette
Towards Data Science
May 4, 2022


This article is limited to modeling read-access data for analysis — i.e. the O part of I/O.

I’m also not going to discuss data quality or pre-processing — but I trust you won’t forget about those things.

Why should you care about data modeling?

It’s about getting the most value from your data — quickly, repeatedly and robustly.

You want to use your data, right? Otherwise, why bother?

Consider what happens when you have poorly-modeled data. Sure, you can dump everything into a data lake without thinking — effectively creating a “data swamp”. That’s fast for data engineers to do. But then, every time a data scientist or data product developer wants to use the data, they have to spend days or weeks examining the schema and iterating on complex queries until they get the exact data they need, in the structure they need. In other words, you’ve done the easy thing once, and you’re forcing your collaborators to do the hard things repeatedly.

Now consider when you have well-modeled data. Admittedly, it takes much longer for data engineers to work with domain experts, data scientists and data product developers to get the modeling done well. But if you do that “once” — not forgetting about maintenance and updates as data types / needs change — then you make your collaborators’ lives easier. They can find their data faster, they can query their data faster, and it gets delivered in the structure that they need to move quickly. You do the hard thing once, and then you can do the easy things repeatedly.

Isn’t that obvious?

Aspects of well-modeled data

To produce well-modeled data, consider the following seven attributes (I’m sure I’m missing some, but seven is a good number):

1. Schema / shape of stored data

You can have 6 tables, or you can have 60 (or more). You can implement many-to-one and many-to-many relationships in separate tables, or you can find ways to squeeze all that information into a single table. You could choose to implement a star schema, a key-value model, or even try a graph model.

I’m not going to advocate for or against any specific schema or technology.

The best model for storing data depends on the scope / requirements of your consumers — do you know what they are? Don’t choose a technology or schema before you know.

Consider that it might be best to segregate completely different types of data in different schemas / technologies — especially if your consumers are segregated. Don’t assume that all data needs to be in a centralized monolith — see: Data Mesh.

You may even want to store the same data in different models if your consumer use cases are diverse enough.

As a rule of thumb, KISS — i.e. Keep It Simple, Stupid.
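To make the trade-off concrete, here is a minimal sketch of the same purchase data modeled two ways, using SQLite purely as a stand-in; the table and column names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Option A: a small star schema -- facts reference dimension tables.
con.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE fact_purchase (
    purchase_id  INTEGER PRIMARY KEY,
    customer_id  INTEGER REFERENCES dim_customer(customer_id),
    product_id   INTEGER REFERENCES dim_product(product_id),
    purchased_at TEXT,
    amount_usd   REAL
);
""")

# Option B: one denormalized, read-optimized table -- wider, but no joins.
con.execute("""
CREATE TABLE purchases_wide (
    purchase_id   INTEGER PRIMARY KEY,
    customer_name TEXT,
    product_name  TEXT,
    purchased_at  TEXT,
    amount_usd    REAL
)
""")
```

Neither option is inherently right; which one serves your consumers better is exactly the question to answer before you commit.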

2. Simplicity of queries

If your data scientists and data product developers have to make complex queries to get the data they need, in the structure they need — you’ve made a mistake modeling the data. It will take them too long to get each query formed properly, every time they need to make one. And when they or someone else needs to tweak a complex query to do something differently, that will take forever, too.

Some considerations:

  • How many table joins or graph traversals are required?
  • Are transformations or calculations on data occurring at the query level?
  • How many characters / lines / loops are in each query?
  • Know and avoid SQL anti-patterns.

Advanced tip — consider going query-free. Implement a middle layer of API methods / microservices on top of your data model that performs all the complex queries from a well-governed, testable codebase. Tiny plug — my company, Tag.bio, has a platform which does this really well.
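As a rough illustration of that middle-layer idea (not Tag.bio’s actual implementation; the function, table and column names are the hypothetical ones from the sketch above), a consumer-facing method can hide the joins behind a simple, testable signature:

```python
import sqlite3
import pandas as pd

def purchases_by_customer(con: sqlite3.Connection, customer_name: str) -> pd.DataFrame:
    """Consumers call this method instead of hand-writing the join every time."""
    query = """
        SELECT p.purchased_at, prod.product_name, p.amount_usd
        FROM fact_purchase AS p
        JOIN dim_customer AS c    ON c.customer_id = p.customer_id
        JOIN dim_product  AS prod ON prod.product_id = p.product_id
        WHERE c.customer_name = ?
    """
    return pd.read_sql_query(query, con, params=(customer_name,))
```

Because the query lives in one governed codebase, it can be reviewed, tested and optimized once instead of being reinvented by every consumer.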

3. Speed of queries

Executing a query should not take hours. Of course, it depends on data volume, but as a rule of thumb — if the execution of a query to get a small amount of data takes an inordinate amount of time — i.e. more than a few seconds — you’ve got a bad data model for that use case.

Monitor performance and have your consumers start to query the data before finalizing your data model.
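One lightweight way to start that monitoring, sketched in Python; in a real setup you would send this to your observability stack rather than print it, and the threshold is an arbitrary placeholder:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_query(label: str, warn_after_s: float = 3.0):
    """Time a block of query code and flag anything slower than a few seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if elapsed > warn_after_s:
            print(f"SLOW QUERY [{label}]: {elapsed:.2f}s")

# Usage (purchases_by_customer is the hypothetical method from earlier):
# with timed_query("purchases_by_customer"):
#     df = purchases_by_customer(con, "Acme Corp")
```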

4. Schema / understandability of query outputs

Data scientists and data product developers love data frames. Do queries on your data model produce useful data frames? Data frames should only include primitive values — i.e. text, numbers, boolean flags — not complex objects.

For example — if a query produces rows containing nested JSON objects — or, god forbid, XML objects — your data scientist consumers are going to hate you.
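If nested payloads are unavoidable for now, flattening them before they reach consumers is a reasonable stopgap. A minimal sketch with pandas, with made-up field names:

```python
import pandas as pd

raw_rows = [
    {"order_id": 1, "customer": {"name": "Acme Corp", "country": "BE"}, "amount_usd": 120.0},
    {"order_id": 2, "customer": {"name": "Globex", "country": "US"}, "amount_usd": 75.5},
]

# json_normalize expands the nested objects into flat, primitive-valued columns:
# order_id, amount_usd, customer_name, customer_country
flat = pd.json_normalize(raw_rows, sep="_")
```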

Of course it may be the case that your consumers would like the query output schema to be something else — work with them to enable exactly what they need.

Finally, consumers of your data model need to understand the data they’re getting back in their queries. Data dictionaries are great — but please, please consider using human-readable column headers. Nobody understands poorly-designed acronyms — not even you, a year from now.
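If cryptic legacy acronyms are already baked into the storage layer, a data dictionary can at least be applied mechanically at the output layer. A small sketch, with an invented mapping:

```python
import pandas as pd

# Hypothetical data dictionary: legacy acronyms -> human-readable headers.
COLUMN_DICTIONARY = {
    "cust_nm": "customer_name",
    "ord_dt": "order_date",
    "amt_usd": "amount_usd",
}

def apply_data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame with human-readable column headers."""
    return df.rename(columns=COLUMN_DICTIONARY)
```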

5. Volume of query outputs

Queries on big data sources might produce large result sets — this may or may not be a problem. Clearly, if a query is returning columns or rows that will not be used by the consumer, get rid of them — I mean the columns / rows, not the consumer. :)

There are a number of other design patterns and technologies to solve the “firehose” problem of large query outputs — including, but not limited to, a middle layer which caches and serves data in smaller chunks — but I can’t cover them all here.
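As one sketch of that chunking pattern, assuming a standard DB-API cursor and an arbitrary batch size:

```python
def iter_chunks(cursor, query: str, chunk_size: int = 10_000):
    """Yield query results in manageable batches instead of one firehose."""
    cursor.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows
```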

Ok, here’s one more option — consider implementing a way to allow consumers to get a random subset of data from a query, with an optional seed parameter.
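A hedged sketch of what that could look like on the serving side, assuming pandas; the function name and default fraction are invented:

```python
from typing import Optional

import pandas as pd

def sample_result(df: pd.DataFrame, fraction: float = 0.05,
                  seed: Optional[int] = None) -> pd.DataFrame:
    """Return a random subset of a query result; same seed, same subset."""
    return df.sample(frac=fraction, random_state=seed)

# subset = sample_result(full_result, fraction=0.01, seed=42)
```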

Just don’t assume “this is just the way it has to be”. If query output volume is a problem, and it slows down your consumers, fix it.

6. Aggregation vs. data loss

Data scientists can get pretty frustrated when they perform a query and they see that the data they care about has already been “conveniently” aggregated — e.g. “monthly total” — when they want the granular data — e.g. “daily count”.

Plus, creating totals and averages in your data model is “lossy” — i.e. you’re losing information from the input sources.

At the same time, serving up granular data when consumers don’t need it will increase query complexity, query compute time, and query volume. I’m thinking specifically about the IoT example where sensors record data at millisecond frequencies. Maybe your consumers need that millisecond-level data, or maybe they just need the daily average.

Don’t assume; ask them.

Consider storing and modeling data in both raw (lossless) and aggregated (lossy) structures. Your consumers will appreciate having options, as long as the schema doesn’t get too complicated.
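A minimal sketch of keeping both, assuming pandas and an invented sensor feed: store the raw readings losslessly and derive the lossy daily view from them, not the other way around.

```python
import pandas as pd

# Hypothetical raw sensor readings at sub-second frequency (lossless).
raw = pd.DataFrame(
    {"reading": [20.1, 20.3, 19.8, 21.0]},
    index=pd.to_datetime([
        "2022-05-01 00:00:00.001",
        "2022-05-01 00:00:00.002",
        "2022-05-02 08:15:30.500",
        "2022-05-02 08:15:30.501",
    ]),
)

# Derived daily aggregate (lossy) -- cheap to query, and regenerable from raw.
daily = raw.resample("1D").mean()
```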

7. Testing, maintenance and emerging data types

You can’t assume that your data model will be perfect forever — needs change, new data types emerge. But you also can’t just change everything on a whim later on without breaking use cases in production.

Put together a rational plan for maintenance and change for your data model.

In particular, establishing a data testing framework is a great way to monitor any potential problems with production use cases as you adjust the data model to changing needs / emerging data types.
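The framework can start very small; even assertion-style checks run in CI will catch a lot. A sketch with invented expectations and a hypothetical loader:

```python
import pandas as pd

def check_daily_aggregate(daily: pd.DataFrame) -> None:
    """Raise if the aggregated table breaks its contract with consumers."""
    expected_columns = {"date", "sensor_id", "mean_reading"}
    missing = expected_columns - set(daily.columns)
    assert not missing, f"missing columns: {missing}"
    assert not daily["mean_reading"].isna().any(), "null readings found"

# In CI / pytest, run this against a fresh query of the data model:
# check_daily_aggregate(load_daily_aggregate())  # loader is hypothetical
```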

As a data engineer / data modeler — data testing is your responsibility, not the responsibility of data scientists and data product developers.

Conclusion

Thanks for reading! Please let me know if you have any thoughts or questions.

