https://tag.bio/solutions/data-products/

Lessons learned after deploying over 200 domain-driven data products

Two hot buzzwords are competing for the data industry's 2021 Concept of the Year: Data Harmonization and Data Product.

If you’re in my field — and presumably you are — you’ve started hearing and using these terms on a daily basis. I’m even seeing job titles on LinkedIn such as “Director of Data Products”.

Don’t get me wrong — these popular terms are useful! Partly because, as described in this article about harmonization, both terms refer to an abstract dream of a perfect system that does everything right with data and solves everyone’s problems. I mean, who doesn’t want that?

The definition of Data Product is changing

A different, better definition for data product came in 2019 — in the original Data Mesh article:

For a distributed data platform to be successful, domain data teams must apply product thinking with similar rigor to the datasets that they provide; considering their data assets as their products and the rest of the organization’s data scientists, ML and data engineers as their customers.
 — How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, by Zhamak Dehghani

Dehghani originally framed the concept with the phrase data as a product. I’m now seeing most folks in the data architecture and applications field refer to instantiations of that concept — decentralized, domain-driven data sources, potentially combined with end-user applications matching DJ Patil’s earlier definition — as data products.

The fluidity and subjectivity in the definition of data product is partially responsible for fueling the hype — as the term can mean different good things to different groups.

It’s more than just hype, of course — the advantages of this emerging architectural pattern are real and transformative. I know this because my company, Tag.bio, has been designing and implementing decentralized, domain-driven data products since 2014.

At present count, our technology has helped researchers and customers design, build, maintain and use over 200 disparate, domain-driven data products.

Here’s what we’ve learned.

Making domain-driven data products a reality

From my perspective, the work begins with proper modeling of domain-specific data for optimal use. Reporting, analytics, exploration, and machine learning all require data to be in a well-modeled, well-described, domain-driven schema.

For example, no Machine Learning (ML) engineer wants to spend weeks figuring out how to join data from 30 input tables or traverse a complicated knowledge graph just to get the right data into a data frame — they just want the data frame! A well-designed data product should give the ML engineer their desired output with minimal effort and coding, which makes the work faster to do once and far cheaper to maintain over time.
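To make that concrete, here is a minimal sketch of the idea — the join logic lives inside the data product, once, instead of in every consumer's notebook. All names here (`PatientOutcomesProduct`, the tables, the method) are hypothetical illustrations, not any real product's API:

```python
# Illustrative raw tables that would normally force every consumer to
# rediscover the join logic on their own.
RAW_PATIENTS = [
    {"patient_id": 1, "age": 54},
    {"patient_id": 2, "age": 61},
]
RAW_LABS = [
    {"patient_id": 1, "glucose": 98},
    {"patient_id": 2, "glucose": 110},
]

class PatientOutcomesProduct:
    """Hypothetical domain-driven data product: the join lives here, once."""

    def training_frame(self):
        # One well-modeled, denormalized row per patient -- the "data frame"
        # the ML engineer actually wants, with no join work on their side.
        labs_by_id = {row["patient_id"]: row for row in RAW_LABS}
        return [
            {**patient, "glucose": labs_by_id[patient["patient_id"]]["glucose"]}
            for patient in RAW_PATIENTS
        ]

frame = PatientOutcomesProduct().training_frame()
```

The consumer's entire job collapses to one call; the modeling knowledge is maintained in one place.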

The same goes for reporting and simple dashboard software, e.g. Tableau — it’s a lot easier and requires far less maintenance if the data is already represented as a single, dashboard-ready table.

In a basic sense, an organization can implement an initial version of domain-driven data products using a cluster of data warehouses and data views — while perhaps also enabling ownership of those assets by domain-specific groups — i.e. the marketing group owns the marketing data products.
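That "warehouses plus views" starting point can be sketched in a few lines. The table and view names below are invented for illustration, with SQLite standing in for a warehouse:

```python
import sqlite3

# In-memory stand-in for an existing warehouse: raw tables stay as they are,
# and a domain-owned view presents the first version of a data product.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE campaigns (campaign_id INTEGER, name TEXT);
    CREATE TABLE clicks (campaign_id INTEGER, clicks INTEGER);
    INSERT INTO campaigns VALUES (1, 'spring_sale'), (2, 'launch');
    INSERT INTO clicks VALUES (1, 120), (2, 45);

    -- The marketing group owns this view: it IS their initial data product.
    CREATE VIEW marketing_campaign_performance AS
        SELECT c.name, k.clicks
        FROM campaigns c JOIN clicks k ON c.campaign_id = k.campaign_id;
""")

rows = conn.execute(
    "SELECT name, clicks FROM marketing_campaign_performance ORDER BY clicks DESC"
).fetchall()
```

Consumers query the view, never the raw tables, so the domain team can evolve the underlying schema without breaking anyone downstream.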

Data products must also be findable and accessible in order to be useful across an organization, so implementing a catalog of data products with metadata and access information is usually the next step in the process.

However, data products must be much more than findable, queryable sources of well-modeled data:

  • Data products are also the codebases and algorithms that query and analyze the data.
  • Data products are also responsible for data quality, observability and governance.
  • Data products are responsible for domain-specific, useful end-user experiences.
  • Data products are responsible for versioning, provenance and reproducibility of data analysis artifacts, not just data.
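The last bullet — provenance and reproducibility of analysis artifacts, not just data — can be reduced to a simple idea: fingerprint each artifact by hashing the data version, code version, and parameters that produced it. This is a generic sketch of that idea, not any specific platform's mechanism:

```python
import hashlib
import json

def analysis_fingerprint(data_version: str, code_version: str, params: dict) -> str:
    """Deterministic fingerprint of an analysis artifact's full provenance."""
    payload = json.dumps(
        {"data": data_version, "code": code_version, "params": params},
        sort_keys=True,  # stable serialization -> reproducible hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Stamp every result with its fingerprint, and any artifact can later be traced back to — and re-run against — the exact data, code, and parameters that created it.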

That’s where the real challenge, cost, and time lie. The initial steps — data modeling, domain ownership of data products, and data product cataloging — are just the beginning.

And that’s exactly where using a Domain Driven Data Product Design Kit (4DK) and an out-of-the-box Data Mesh platform offers a significant cost, time, scaling, and maintenance advantage — i.e. 3 months instead of 3 years.

How to accelerate the process — and minimize maintenance later on

  • Continue to utilize your existing databases, data warehouses, and data lakes. Don’t waste time and money reinventing that wheel unless you really have to. On the other hand, it may be important to transfer ownership of these existing data resources to domain-specific groups.
  • Use a data product layer on top of those data sources to perform domain-specific modeling and integrate domain-specific applications.
  • Harmonize — i.e. use the same — technology for the layer that ingests and models domain-specific data in each data product. Data quality testing, data observability, and data governance then become instantly available for every data product.
  • It’s better if the harmonized data ingestion and modeling technology is low-code. Your data engineers will thank you, departing engineers won’t leave you with unmaintainable code, and onboarding new data engineers will be a snap.
  • Make sure all data products speak the same API language. With this, data products can describe themselves to the larger data catalog, and polyglot client applications can run domain-specific applications across multiple data products.
  • Embed domain-specific algorithms and applications inside of each data product. This not only significantly increases the efficiency of algorithms, but also makes the applications available to all consumers of a data product.
  • Embedding also allows automated testing and governance systems within a data product to extend beyond data elements to algorithms and applications.
  • Domain-specific algorithms usually require pluggable pro-code elements — in this case, low-code is not better. Let your data scientists bring the appropriate algorithms (e.g. via R/Python) into the data product.
  • Use the same containerization/deployment/CI/CD process for every data product. This ensures harmonized error detection, testing, observability and governance over the entire mesh of data products.
  • Iterate on the usefulness of data products with end users and consumers of the data product. Don’t stop iterating until it’s what they need.
  • If a data product has divergent use cases which produce a design conflict, split the data product into two.
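Several of the bullets above — a shared API language, self-description to the catalog, and embedded, pluggable pro-code algorithms — can be combined into one small sketch. Every class, method, and name here is hypothetical, chosen only to illustrate the pattern:

```python
class DataProduct:
    """Hypothetical uniform shape every data product in the mesh shares:
    describe() for the catalog, run() for embedded algorithms."""

    def __init__(self, name, domain, rows):
        self.name, self.domain, self.rows = name, domain, rows
        self._algorithms = {}   # pro-code plug-ins registered per domain

    def register(self, algo_name, fn):
        """Let domain scientists plug in their own (e.g. Python/R) algorithms."""
        self._algorithms[algo_name] = fn

    def describe(self):
        """Self-description a catalog can harvest from every product uniformly."""
        return {"name": self.name, "domain": self.domain,
                "algorithms": sorted(self._algorithms)}

    def run(self, algo_name, **kwargs):
        """Algorithms execute next to the data -- embedded in the product."""
        return self._algorithms[algo_name](self.rows, **kwargs)

product = DataProduct("patient_outcomes", "clinical",
                      [{"glucose": 98}, {"glucose": 110}])
product.register("mean_glucose",
                 lambda rows: sum(r["glucose"] for r in rows) / len(rows))
```

Because every product exposes the same two entry points, the catalog, testing, and governance machinery can treat the whole mesh uniformly, while the logic inside each product stays fully domain-specific.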

I’ll wrap it up here — there’s a lot more we’ve learned and integrated into the Tag.bio Data Mesh platform and 4DK — but I hope that these initial lessons offer some useful advice to those of you who are rolling your own.