Lessons learned after deploying over 200 domain-driven data products
If you’re in my field (and presumably you are), you’ve started hearing and using terms like data product and data mesh on a daily basis. I’m even seeing job titles on LinkedIn such as “Director of Data Products”.
Don’t get me wrong — these popular terms are useful! Partly because, as described in this article about harmonization, both terms refer to an abstract dream of a perfect system that does everything right with data and solves everyone’s problems. I mean, who doesn’t want that?
The definition of Data Product is changing
The initial definition of data product was coined by DJ Patil in 2012 — “a product that facilitates an end goal through the use of data”. Unfortunately, Patil’s definition is unspecific — doesn’t that statement essentially describe all software? What software doesn’t use some form of data to facilitate end goals?
A different, better definition for data product came in 2019 — in the original Data Mesh article:
For a distributed data platform to be successful, domain data teams must apply product thinking with similar rigor to the datasets that they provide; considering their data assets as their products and the rest of the organization’s data scientists, ML and data engineers as their customers.
— How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, by Zhamak Dehghani
Dehghani originally framed the concept with the phrase data as a product. I’m now seeing most folks in the data architecture and applications field refer to instantiations of that concept, i.e. decentralized, domain-driven data sources (potentially combined with end-user applications that match Patil’s definition), as data products.
The fluidity and subjectivity in the definition of data product is partially responsible for fueling the hype — as the term can mean different good things to different groups.
It’s more than just hype, of course — the advantages of this emerging architectural pattern are real and transformative. I know this, because my company Tag.bio has been designing and implementing decentralized, domain-driven data products since 2014.
At present count, our technology has helped researchers and customers design, build, maintain and use over 200 disparate, domain-driven data products.
Here’s what we’ve learned.
Making domain-driven data products a reality
How does an organization achieve the ambitious, dreamlike promise of harmonized, domain-driven data products? It takes work, of course — design, development, and iteration with domain experts.
From my perspective, the work begins with proper modeling of domain-specific data for optimal use. Reporting, analytics, exploration, and machine learning all require data to be in a well-modeled, well-described, domain-driven schema.
For example, no Machine Learning (ML) engineer wants to spend weeks figuring out how to join data from 30 input tables, or how to traverse a complicated knowledge graph, just to get the right data into a data frame. They just want the data frame! A well-designed data product should give the ML engineer their desired output with minimal effort and coding, which makes the work faster to do once and far cheaper to maintain over time.
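To make this concrete, here is a minimal sketch of the idea, with made-up table names and a hypothetical `get_dataframe` helper (this is not a real product API): the data product owns the joins internally, and the ML engineer only ever sees the modeled result.

```python
import pandas as pd

# Toy stand-ins for source tables; a real product might have 30 of these.
raw_patients = pd.DataFrame({"patient_id": [1, 2], "age": [54, 61]})
raw_labs = pd.DataFrame({"patient_id": [1, 2], "glucose": [98, 110]})

class CohortDataProduct:
    """A data product that hides source-table joins behind a domain-driven schema."""

    def __init__(self, tables):
        self.tables = tables

    def get_dataframe(self):
        # The join logic lives (and is maintained) in one place,
        # inside the data product, not in every consumer's notebook.
        return self.tables["patients"].merge(self.tables["labs"], on="patient_id")

product = CohortDataProduct({"patients": raw_patients, "labs": raw_labs})
df = product.get_dataframe()  # ML-ready frame, no schema spelunking
```

The design point is that the consumer's code shrinks to one call, so schema changes are absorbed inside the product rather than rippling out to every downstream script.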
The same goes for reporting and simple dashboard software, e.g. Tableau — it’s a lot easier, and requires far less maintenance, if the data is already represented as a single, dashboard-ready table.
In a basic sense, an organization can implement an initial version of domain-driven data products using a cluster of data warehouses and data views — while perhaps also enabling ownership of those assets by domain-specific groups — i.e. the marketing group owns the marketing data products.
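A sketch of that initial version, using an in-memory SQLite database as a stand-in for a warehouse (all table, column, and view names here are invented for illustration): the domain group owns a view that presents its data in one well-modeled, query-ready shape.

```python
import sqlite3

# In-memory stand-in for a data warehouse.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders(order_id INTEGER, customer_id INTEGER, total REAL);
CREATE TABLE customers(customer_id INTEGER, region TEXT);
INSERT INTO orders VALUES (1, 10, 99.0), (2, 11, 25.0);
INSERT INTO customers VALUES (10, 'EMEA'), (11, 'APAC');

-- The marketing group owns this view: a first-cut "data product"
-- exposing one domain-driven representation of its data.
CREATE VIEW marketing_orders AS
SELECT o.order_id, c.region, o.total
FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
""")
rows = con.execute(
    "SELECT region, total FROM marketing_orders ORDER BY order_id"
).fetchall()
```

Consumers query the view, not the underlying tables, which is what lets ownership and remodeling stay inside the domain group.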
Data products must also be findable and accessible in order to be useful across an organization, so implementing a catalog of data products with metadata and access information is usually the next step in the process.
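A catalog can start very simply. The sketch below assumes a few metadata fields (owner, description, endpoint, schema) purely for illustration; real catalogs carry far richer metadata and access controls.

```python
# Minimal data product catalog: name -> metadata.
# Field names and the example URL are illustrative assumptions.
catalog = {
    "marketing.orders": {
        "owner": "marketing",
        "description": "Order-level facts modeled for campaign analytics",
        "endpoint": "https://example.internal/products/marketing.orders",
        "schema": {"order_id": "int", "region": "str", "total": "float"},
    },
}

def find_products(keyword):
    """Findability: match a keyword against product names and descriptions."""
    keyword = keyword.lower()
    return [
        name
        for name, meta in catalog.items()
        if keyword in name or keyword in meta["description"].lower()
    ]

matches = find_products("campaign")
```

Even this dictionary-level version captures the two properties the paragraph above calls for: findable (search) and accessible (owner plus endpoint).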
However, data products must be much more than findable, queryable sources of well-modeled data:
- Data products are also the codebases and algorithms that query and analyze the data.
- Data products are also responsible for data quality, observability and governance.
- Data products are responsible for domain-specific, useful end-user experiences.
- Data products are responsible for versioning, provenance and reproducibility of data analysis artifacts, not just data.
That’s where the real challenge, cost, and time lie. The initial steps — data modeling, domain ownership of data products, and data product cataloging — are just the beginning.
And that’s exactly where using a Domain Driven Data Product Design Kit (4DK) and an out-of-the-box Data Mesh platform offers a significant cost, time, scaling, and maintenance advantage: e.g. 3 months instead of 3 years.
How to accelerate the process — and minimize maintenance later on
- Continue to utilize your existing databases, data warehouses, and data lakes. Don’t waste time and money re-implementing that wheel unless you really have to. On the other hand, it may be important to transfer ownership of these existing data resources to domain-specific groups.
- Use a data product layer on top of those data sources to perform domain-specific modeling and integrate domain-specific applications.
- Harmonize the technology (i.e. use the same technology) for the layer that ingests and models domain-specific data in each data product. Data quality testing, data observability, and data governance then become instantly available for every data product.
- It’s better if the harmonized data ingestion and modeling technology is low-code. Your data engineers will thank you, engineers won’t stick you with unmaintainable code after they leave, and onboarding new data engineers will be a snap.
- Make sure all data products speak the same API language. With this, data products can describe themselves to the larger data catalog, and polyglot client applications can run domain-specific applications across multiple data products.
- Embed domain-specific algorithms and applications inside of each data product. This not only significantly increases the efficiency of the algorithms, it also makes the applications available to all consumers of the data product.
- Embedding also allows automated testing and governance systems within a data product to extend beyond data elements to algorithms and applications.
- Domain-specific algorithms usually require pluggable pro-code elements — in this case, low-code is not better. Let your data scientists bring the appropriate algorithms (e.g. via R/Python) into the data product.
- Use the same containerization/deployment/CI/CD process for every data product. This ensures harmonized error detection, testing, observability and governance over the entire mesh of data products.
- Iterate on the usefulness of data products with end users and consumers of the data product. Don’t stop iterating until it’s what they need.
- If a data product has divergent use cases which produce a design conflict, split the data product into two.
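Two of the points above — a shared API contract that lets every product describe itself, and an embedded, pluggable slot for pro-code algorithms — can be sketched in a few lines. Every class, method, and product name here is an illustrative assumption, not a real platform API.

```python
from statistics import mean

class DataProduct:
    """Sketch: a data product with a uniform self-description contract
    and a registry for domain-specific, pluggable algorithms."""

    def __init__(self, name, data):
        self.name = name
        self.data = data          # column name -> values
        self.algorithms = {}

    def describe(self):
        # Same shape for every product, so a catalog (or polyglot
        # client) can ingest any product's description uniformly.
        return {
            "name": self.name,
            "columns": sorted(self.data.keys()),
            "algorithms": sorted(self.algorithms.keys()),
        }

    def register(self, name, fn):
        # Domain scientists plug in their own (e.g. Python) analyses.
        self.algorithms[name] = fn

    def run(self, name, column):
        # Embedded execution: the algorithm runs next to the data,
        # and every consumer of the product can invoke it.
        return self.algorithms[name](self.data[column])

product = DataProduct("clinical.vitals", {"heart_rate": [62, 71, 80]})
product.register("mean", mean)
summary = product.describe()
result = product.run("mean", "heart_rate")
```

Because the description and invocation contract is identical for every product, the same testing, observability, and governance machinery can cover the registered algorithms as well as the data, which is the point of the embedding bullets above.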
I’ll wrap it up here — there’s a lot more we’ve learned and integrated into the Tag.bio Data Mesh platform and 4DK — but I hope these initial lessons offer some useful advice to those of you who are rolling your own.