An architectural diagram of the Tag.bio data product engine — AKA, the Flux Capacitor — from https://patents.google.com/patent/US10642868B2/

Anatomy of a Data Product — Part Four

Jesse Paquette
7 min read · Dec 27, 2022

This is an article in a series about building data products with Tag.bio.

To begin the series, check out Part One, which outlines the reason for and definition of a data product, along with key concepts and terms. To access the data & codebase to follow along with these examples, see Part Two.

Here, in Part Four, we will introduce the low-code protocols layer, which defines API methods for the data product that present and invoke domain-specific queries, algorithms and visualizations, using the data model produced by the config.

The protocols layer is defined by low-code JSON objects, arrays, and primitives (alternatively YAML). A complete reference to the Tag.bio JSON syntax for protocols is here.

Remember — in the Tag.bio syntax, the value for any JSON object attribute can be replaced with a string representing a path to a file containing a JSON object or array. This facilitates code simplicity, modularization and re-use.
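To make the path-substitution idea concrete, here is a minimal Python sketch of how such references could be resolved. It is not the Tag.bio implementation: the function name `resolve_refs` and the rule for recognizing a reference (a string ending in `.json` that names an existing file) are assumptions made purely for illustration.

```python
import json
from pathlib import Path


def resolve_refs(value, root):
    """Recursively replace path strings with the JSON they point to.

    Assumption (hypothetical rule): a string counts as a file reference
    only if it ends in ".json" and the file exists under `root`.
    Cyclic references are not detected in this sketch.
    """
    if isinstance(value, dict):
        return {key: resolve_refs(item, root) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_refs(item, root) for item in value]
    if isinstance(value, str) and value.endswith(".json"):
        path = Path(root) / value
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                # The referenced file may itself contain further references.
                return resolve_refs(json.load(handle), root)
    # Numbers, booleans, ordinary strings and unresolved paths pass through.
    return value
```

Under these assumptions, a large protocol can be split into small files that are stitched together at load time, which is the modularity benefit described above.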

The main file

The main file defines how the data product will operate in a production environment — it has three primary functions:

  • Define metadata about the data product that can be accessed via API.
  • Register protocols as API methods and organize them into categories.
  • Integrate tests for protocols that can be run manually or automatically in a CI/CD system.

The main file for the fc-iris-demo example is named main.json, located at the root of the project, and looks like this:

Note these top-level attributes:

  • data_product_definition — contains the metadata for the data product.
  • protocols — lists paths to files containing protocol objects, which will be registered.
  • tests — lists paths to files containing test objects, typically one test for each protocol.
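As a rough illustration of that shape, the Python sketch below parses a main file and checks for the three top-level attributes. The example content is hypothetical and is not copied from fc-iris-demo: the `name` field, the file paths, and the `check_main` helper are all assumptions, and the grouping of protocols into categories is left out.

```python
import json

REQUIRED_TOP_LEVEL = ("data_product_definition", "protocols", "tests")

# Hypothetical main file content; values are illustrative only.
EXAMPLE_MAIN = """
{
  "data_product_definition": {"name": "example-data-product"},
  "protocols": ["protocols/summary.json"],
  "tests": ["tests/summary_test.json"]
}
"""


def check_main(text):
    """Parse a main file and verify the three top-level attributes exist."""
    main = json.loads(text)
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in main]
    if missing:
        raise ValueError("main file is missing: " + ", ".join(missing))
    return main


main = check_main(EXAMPLE_MAIN)
```

A check like this could plausibly run as an early step in a CI/CD pipeline, before any protocol tests are executed.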

Extended documentation for the main file syntax is coming soon.

Protocols and their attributes

Protocols define how an API method can be presented to a client, and the operations that method will perform to query, analyze, and visualize data.

Protocol files are conventionally placed in the protocols/ folder at the root of the data product.

The JSON object defining each protocol utilizes three top-level attributes:

  • protocol_definition — contains metadata for the protocol, and defines arguments which will present parameter choices to client applications.
  • script — declares how the protocol will query and analyze data when executed, given the parameter choices submitted by the client application.
  • protocol_output — customizes the output of the API response after execution of the script within the protocol.

Here’s an example of one of the protocols in the fc-iris-demo example — it will be described more in sections below:

A screenshot from Visual Studio Code showing the JSON structure of a protocol.

Protocol definition

Here’s a zoomed-in screenshot of the protocol_definition section in the protocol above. The full syntax reference for protocol_definition is here.

A screenshot from Visual Studio Code showing the JSON structure of the protocol_definition section of a protocol.

Note the attributes within this protocol_definition:

  • name — assigns a unique ID to this protocol for testing and API method invocation.
  • title — presents a short, human-readable title so this protocol can be located and understood by end users.
  • description — presents a longer, human-readable explanation of what this protocol will do.
  • asset — a path to an image file that will be presented as a thumbnail for this protocol.
  • argument_sets — defines a set of configurable parameters which API clients and end users will use to modify the behavior of this protocol.

The first four attributes above — name, title, description, asset — are all static, but the argument_sets array contains (file references to) powerful, dynamic functionality.

Here’s the file for the first argument_set in that array:

In the interests of brevity, we will not cover all argument_set attributes here. See comments in the screenshot above, or visit the syntax guide for more information. However, note the argument_expanders attribute — this array will auto-generate an argument from each of the data functions listed.
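The expansion behavior can be pictured with a small Python sketch. This is only a guess at the general idea, not the framework's code: the `expand_arguments` helper, the `argument_id` field, and the example file paths are hypothetical.

```python
def expand_arguments(argument_set):
    """Generate one argument per data function listed in argument_expanders.

    The shape of the generated argument is an assumption for illustration.
    """
    expanders = argument_set.get("argument_expanders", [])
    return [
        {
            "argument_id": f"{argument_set['name']}_{index}",
            "data_function": data_function,
        }
        for index, data_function in enumerate(expanders)
    ]


# Hypothetical argument set listing two data function files.
arguments = expand_arguments(
    {
        "name": "background",
        "argument_expanders": [
            "protocols/data_functions/species.json",
            "protocols/data_functions/petal_length.json",
        ],
    }
)
```

Adding a data function to the list yields a new argument without writing any further argument definitions, which is what keeps the argument set compact.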

Here’s the information from that protocol_definition shown in a UI after making an API call to the data product. Note how the argument sets represent groupings of arguments, and how arguments display values from within the data model for customized invocation of the protocol.

A screenshot from the Tag.bio UI showing the “summary” protocol configuration after making an API call to the data product.

Note how something interesting has happened — the arguments are presenting data for selection from specific collections and variables within the fc-iris-demo data model.

This occurs because the arguments are constructed using data functions.

Data functions

Data functions are modular building blocks which reference and transform data from the data model for display to the end user as arguments, or for utilization of data by the algorithms defined in the script section.

All references to data are conventionally located within the protocols/data_functions/ folder for efficient understanding and maintenance of each data product.

The screenshots below show the most basic forms of data functions — categorical and numeric — which simply reference collections and/or variables within the data model.

A screenshot from Visual Studio Code showing a simple categorical data function.
A categorical data function that references the Species collection within the data model
A screenshot from Visual Studio Code showing a simple numeric data function.
A numeric data function that references the Petal length variable in the Petal collection within the data model

Data functions can range from the simple examples shown above to dynamic, complex transformations of data. Advanced data functions use modular references to inner data functions as building blocks. For a full overview, see the syntax guide.
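One way to think about data functions is as small, reusable callables over the data model. The Python sketch below follows that reading under loud assumptions: the data model is reduced to a flat list of row dictionaries with made-up values, and the `categorical` and `numeric` builders are hypothetical names, not Tag.bio syntax.

```python
# Tiny stand-in for a data model: one dict per entity (row).
# Values are illustrative, not taken from the real iris dataset.
ROWS = [
    {"Species": "Iris-setosa", "Petal length": 1.4},
    {"Species": "Iris-virginica", "Petal length": 6.0},
    {"Species": "Iris-virginica", "Petal length": 5.1},
]


def categorical(collection):
    """Build a data function that lists the distinct values of a collection."""
    def data_function(rows):
        return sorted({row[collection] for row in rows})
    return data_function


def numeric(variable):
    """Build a data function that reports the range of a numeric variable."""
    def data_function(rows):
        values = [row[variable] for row in rows]
        return {"min": min(values), "max": max(values)}
    return data_function


species = categorical("Species")
petal_length = numeric("Petal length")

print(species(ROWS))       # ['Iris-setosa', 'Iris-virginica']
print(petal_length(ROWS))  # {'min': 1.4, 'max': 6.0}
```

The same callable could feed either an argument shown to the user (which species can be chosen?) or an algorithm in the script section, which matches the dual role described above.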

Script

The script section of each protocol defines how the protocol will query, analyze and visualize data after the API method is invoked, given argument values provided in the API request.

Here’s a zoomed-in view of the script section within the example protocol:

The attributes of the script section will vary, depending on the method used. Most protocols, including this simple example, will contain the following attributes in their script:

  • method — defines the algorithm that this protocol will invoke.
  • background — defines the entities — i.e. rows — which this protocol will utilize in the algorithm.
  • analysis_variables — defines the collections / variables — i.e. columns — which this protocol will utilize in the algorithm.

In this case, the “summary” method will invoke the native summary algorithm in the Tag.bio framework. All the analysis_variables will be summarized across the set of entities defined in the background.

Note how the background and analysis_variables attributes use special data functions which are not explicit references to data within the data model. These data functions are argument-set-references, which will dynamically utilize the argument values provided in the API request.

A screenshot from Visual Studio Code highlighting the background and analysis_variables section of a protocol, utilizing argument-set-references.

For more information about argument-references and argument-set-references, see the syntax guide.

How it works

  1. An API request is made to invoke the protocol.
  2. The protocol will collect the argument values provided in the request, replace argument-references and argument-set-references in the script section with data functions that reference data in the data model (or remove them if no argument value is provided), then invoke the algorithm defined by the method on a data frame defined by background and analysis_variables.
  3. The output is ready to be returned with the API response. There’s (usually) only one more step…
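The steps above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Tag.bio engine: the `argument_set_reference` key, the shape of the request arguments, the `run_protocol` function, and the sample rows are all hypothetical.

```python
from statistics import mean

# Illustrative rows only; not real iris measurements.
ROWS = [
    {"Species": "Iris-setosa", "Sepal length": 5.0},
    {"Species": "Iris-virginica", "Sepal length": 6.0},
    {"Species": "Iris-virginica", "Sepal length": 7.0},
]

# Hypothetical script section using argument-set-references.
SCRIPT = {
    "method": "summary",
    "background": {"argument_set_reference": "background"},
    "analysis_variables": {"argument_set_reference": "analysis_variables"},
}


def run_protocol(script, request_arguments, rows):
    """Resolve references from the request, then summarize the data frame."""
    if script["method"] != "summary":
        raise ValueError("only the summary method is sketched here")
    # Replace each reference with the value supplied in the request;
    # a reference with no supplied value resolves to None and is dropped.
    resolved = {
        key: request_arguments.get(script[key]["argument_set_reference"])
        for key in ("background", "analysis_variables")
    }
    # background selects entities (rows); no background keeps every row.
    background = resolved["background"] or {}
    selected = [
        row for row in rows
        if all(row[name] in allowed for name, allowed in background.items())
    ]
    # analysis_variables selects the columns to summarize.
    summary = {}
    for variable in resolved["analysis_variables"] or []:
        values = [row[variable] for row in selected]
        if values:
            summary[variable] = {
                "n": len(values),
                "mean": mean(values),
                "min": min(values),
                "max": max(values),
            }
    return summary


result = run_protocol(
    SCRIPT,
    {
        "background": {"Species": ["Iris-virginica"]},
        "analysis_variables": ["Sepal length"],
    },
    ROWS,
)
print(result)  # {'Sepal length': {'n': 2, 'mean': 6.5, 'min': 6.0, 'max': 7.0}}
```

Note that the script itself never names Species or Sepal length; those choices arrive with the request, which is why one protocol can serve many different questions.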

Protocol output

The protocol_output section of a protocol contains attributes which modify the API response after the script section executes.

Attributes of protocol_output include:

  • dynamic_title — utilizes provided argument values and analysis properties to produce a customized title for the API response.
  • dynamic_description — utilizes provided argument values and analysis properties to produce a customized description for the API response.
  • visualization — specifies statistical attributes and dynamic text that will be returned with the API response for interpretable visualizations.

For more information about the protocol_output section of a protocol (with more attributes not described here), see the syntax guide.
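As a rough mental model, the dynamic title and description can be thought of as templates filled in after the script runs. The Python sketch below assumes a `{placeholder}` template style and a `render_output` helper, both invented for illustration; the real placeholder syntax is documented in the syntax guide.

```python
# Hypothetical protocol_output section using Python-style placeholders.
PROTOCOL_OUTPUT = {
    "dynamic_title": "Summary of {variable}",
    "dynamic_description": "{n} entities analyzed for background: {background}",
}


def render_output(protocol_output, context):
    """Fill the title and description templates with values from the run."""
    return {
        "title": protocol_output["dynamic_title"].format(**context),
        "description": protocol_output["dynamic_description"].format(**context),
    }


response = render_output(
    PROTOCOL_OUTPUT,
    {"variable": "Sepal length", "n": 2, "background": "Iris-virginica"},
)
print(response["title"])        # Summary of Sepal length
print(response["description"])  # 2 entities analyzed for background: Iris-virginica
```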

Putting it all together

Here’s a series of screenshots from a UI, showing the workflow of a protocol, from configuration to results.

The protocol will be run only for the Iris-virginica entities (i.e. rows) and will summarize the Sepal length variable.

A screenshot from the Tag.bio UI showing the configuration for the example protocol. Iris-virginica entities are selected for the background, and Sepal length is selected as an analysis variable.

Clicking the green “Run protocol” button on the right will invoke the protocol. The UI then shows the progress of protocol execution.

A screenshot from the Tag.bio UI showing the execution progress of the example protocol.

Below is the protocol results page after execution. Note the number of entities analyzed in the dynamic_description at the top, and the auto-generated box plot (alternatively, a histogram) for the selected analysis variable.

A screenshot from the Tag.bio UI showing the results from execution of the example protocol.

What’s next?

Part Five in this series will cover a specific protocol method that integrates a custom R script, highlighting the framework's ability to add powerful API methods to a data product using any algorithm or visualization available in R and Python.

Go to Part Five

Back to Part One

Jesse Paquette

Full-stack data scientist, computational biologist, and pick-up soccer junkie. Brussels and San Francisco. Opinions are mine alone.