Architectural diagram of the Tag.bio data product engine, AKA the Flux Capacitor (https://patents.google.com/patent/US10642868B2/)

Anatomy of a Data Product — Part One

Jesse Paquette

--

My company, Tag.bio, has implemented over 200 data products for a wide range of use cases in the scientific, healthcare, and pharmaceutical sectors. Examples include:

  • Top 10 US hospital — a platform to extract and compare cohorts of patients across 50 million inpatient encounters and outpatient visits.
  • Top 5 global pharma — harmonization of diverse clinical trials across oncology and precision medicine for systematic cross-analysis of genomic biomarkers and outcomes.

Data products make complex, evolving, multi-modal data usable — for researchers who need fast answers to their scientific and business questions, and for data scientists who need fast access to precise data frames. A well-designed data product will save an organization years of work by eliminating the arduous, repeated effort of finding, querying, combining, and modeling the data every time a new question arises.

Data products have been slow and expensive to build — until now. Over the past few years, we have developed and optimized a software framework which simplifies, harmonizes, and significantly accelerates the design, development, and maintenance of data products, for any source of data.

This series of articles will provide a deep-dive into how a domain-specific data product can be designed, implemented, versioned, and deployed within our framework — with working code examples for a basic dataset.

Here, in Part One, I will discuss the purpose and high-level design of a data product, and I’ll introduce a set of key terms that will be used throughout the series.

What is a Data Product?

A data product is a software application that makes complex, multi-modal data easily consumable by domain experts and data scientists. It contains:

  • A single- or multi-modal dataset, ingested from one or more sources/schemas and transformed via configuration into a well-modeled, versioned, immutable data snapshot. Versioning and immutability guarantee complete reproducibility of analysis.
  • A harmonized API for serving metadata about the data product (i.e. what data it contains and what it can do with that data) and for invoking the domain-specific queries, algorithms, and visualizations within.
  • A suite of versioned, testable API methods — called protocols — which define parameterized data queries, algorithms and visualizations. Domain-specific algorithms and visualizations can be added to protocols as plugins created by data scientists in R & Python. These versioned API methods represent a data contract for each data product.
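
To make these protocols concrete, here is a minimal sketch of what a parameterized protocol backed by a Python plugin might look like. The function, table names, and registration scheme are hypothetical illustrations of the concept, not the actual Tag.bio SDK, which is covered in Parts Four and Five.

```python
# A hypothetical protocol plugin. The snapshot is assumed to expose
# pandas DataFrames keyed by table name; the real framework's data
# model and registration API are introduced in Parts Four and Five.

import pandas as pd


def survival_by_mutation(snapshot: dict, gene: str, cutoff_days: int = 365) -> pd.DataFrame:
    """Compare patient survival across mutation status for one gene."""
    patients = snapshot["patients"]    # one row per patient
    mutations = snapshot["mutations"]  # one row per (patient, gene) event

    carriers = set(mutations.loc[mutations["gene"] == gene, "patient_id"])
    annotated = patients.assign(
        has_mutation=patients["patient_id"].isin(carriers),
        survived_past_cutoff=patients["survival_days"] > cutoff_days,
    )
    # Aggregate into a small, serializable table for the API response.
    return (
        annotated.groupby("has_mutation")["survived_past_cutoff"]
        .mean()
        .rename("fraction_surviving")
        .reset_index()
    )


# Registering the function under a versioned name is what makes it a
# data contract: clients can depend on v1 behaving the same forever.
PROTOCOLS = {"survival_by_mutation/v1": survival_by_mutation}
```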

What can you do with a Data Product?

A data product supports two types of investigators who want to quickly access and utilize the data within:

  1. End-users — clickers — who will use a no-code user interface (UI) to explore the data product’s capabilities and invoke protocols as apps to answer their questions.
  2. Data scientists — coders — who will access the data model within the data product via API requests or via programming libraries — e.g. R & Python — to extract data frames for model-building, exploratory analysis, and visualization.
Example results from the no-code user interface after invoking an app within a data product: an investigation into DNA mutations associated with patient survival in a cancer clinical trial
An example coding environment showing data product access and exploration using the Tag.bio R library
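
To illustrate the coder path, the sketch below walks through that workflow end to end: discover the data product's metadata, invoke a versioned protocol, and load the result into a data frame. The URLs, routes, and response shapes are assumptions for illustration; the Tag.bio R and Python libraries wrap the real API.

```python
# Hypothetical example of the "coder" workflow against a data
# product's API. Endpoints and JSON shapes are assumptions; the
# Tag.bio R/Python libraries wrap the actual API details.

import pandas as pd
import requests

BASE = "https://dataproducts.example.com/api/clinical-trial-xyz"

# 1. Discover what the data product contains and can do.
manifest = requests.get(f"{BASE}/manifest").json()
print(manifest["name"], manifest["snapshot_version"])
print([p["id"] for p in manifest["protocols"]])

# 2. Invoke a versioned protocol with parameters.
result = requests.post(
    f"{BASE}/protocols/survival_by_mutation/v1",
    json={"gene": "TP53", "cutoff_days": 365},
).json()

# 3. Load the tabular payload into a data frame for model-building.
df = pd.DataFrame(result["rows"], columns=result["columns"])
print(df.head())
```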

What constitutes a Data Product?

A data product is concisely defined by and deployed from two digital components:

  1. A versioned & well-modeled, immutable snapshot of the source data.
  2. A versioned, low-code + pro-code codebase stored within a single repository — e.g. Git — which comprehensively defines all operations within, including:
    - Data ingestion and modeling code for creating the snapshot.
    - Configurable protocols & plugins registered as API methods.
    - A manifest of metadata about the data product.
A screenshot of Visual Studio Code showing the codebase for a data product, backed by a git repository
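
As a conceptual illustration of component 1, the sketch below builds a versioned, immutable snapshot from a single source table. The content-hash versioning and file layout here are illustrative stand-ins, not the framework's actual mechanism; the real low-code ingestion config is the subject of Part Three.

```python
# Conceptual sketch of snapshot creation: ingest source data, apply
# modeling transforms, and write an immutable, versioned artifact.
# The hashing and layout are illustrative only (see Part Three).

import hashlib
from pathlib import Path

import pandas as pd


def build_snapshot(source_csv: str, out_dir: str) -> Path:
    # Ingest: read the raw source.
    df = pd.read_csv(source_csv)

    # Model: normalize column names (a stand-in for real transforms).
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # Version: derive an identifier from the content itself, so the
    # same source data always yields the same snapshot version.
    payload = df.to_csv(index=False).encode()
    version = hashlib.sha256(payload).hexdigest()[:12]

    # Immutability: write once to a version-stamped path, never overwrite.
    path = Path(out_dir) / f"snapshot-{version}.parquet"
    if not path.exists():
        df.to_parquet(path)
    return path
```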

Data Products within an enterprise ecosystem

A single data product can be useful when deployed in isolation. However, data products are significantly more useful when deployed into an ecosystem alongside other data products and software components, such as:

  • A transactional and/or long-term storage system for the source data. Organizations typically have such components already implemented, e.g. RDBMS, data lakes, bucket storage, etc. Data products can systematically connect to and create data snapshots from these sources.
  • A CI/CD system¹ for testing, deploying, and hosting data products securely within a decentralized Data Mesh architecture.
  • A no-code UI¹ enabling clickers to interact with data product protocols as apps and define governance rules around data products.
  • An access control and governance system¹ for ensuring data products, API methods, and data objects are secure, and only accessible to authorized users & software clients.
  • A storage and retrieval system¹ for useful data artifacts (UDATs) — e.g. analysis results, cohorts, feature sets — derived from performing analyses on the data product.
  • R & Python libraries² and a Notebook environment¹, enabling coders to query one or more data products in production and extract useful data frames.
  • A low-code + pro-code development framework (SDK)² for developing, compiling, and testing data products.

¹Tag.bio delivers these enterprise components into customer cloud accounts.
²The SDK and R & Python libraries are open-access, provided by Tag.bio.
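
To show how provenance ties these components together, here is a hypothetical example of a UDAT that records exactly which data product, snapshot, and protocol version produced it. The field names are assumptions, not Tag.bio's actual artifact format.

```python
# Illustrative sketch of a "useful data artifact" (UDAT) carrying
# provenance. Because the snapshot is immutable and the protocol is
# versioned, the record below is enough to reproduce the result.

import json
from datetime import datetime, timezone

udat = {
    "type": "cohort",
    "created_at": datetime.now(timezone.utc).isoformat(),
    "provenance": {
        "data_product": "clinical-trial-xyz",
        "snapshot_version": "9f2c41aa03de",   # immutable snapshot id
        "protocol": "survival_by_mutation/v1",
        "parameters": {"gene": "TP53", "cutoff_days": 365},
    },
    "payload": {"patient_ids": ["P-001", "P-017", "P-204"]},
}

# Stored in the UDAT system, this record lets any coder or clicker
# re-run the same protocol against the same snapshot and retrieve
# the same cohort later.
print(json.dumps(udat, indent=2))
```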

What’s coming in Part Two and beyond?

We’ll explore a working data product!

  • Part Two — Introducing the example data source and a link to a public Git repository for readers to follow along.
  • Part Three — Describing how the low-code config components ingest & transform the data source into a versioned, well-modeled data snapshot.
  • Part Four — Presenting API method and algorithm design via low-code protocols.
  • Part Five — Demonstrating how pro-code plugins can be implemented within protocols to provide customized R & Python analysis and visualizations as API methods.

In memory of Susann Edler-Childress, AKA The Eldress, for the talent and joy she brought to our workplace.

--

Jesse Paquette

Full-stack data scientist, computational biologist, and pick-up soccer junkie. Brussels and San Francisco. Opinions are mine alone.