Photo by Kari Shea on Unsplash

You know how sometimes you get sick after an intense period of Adulting, and someone inevitably says “it’s just your body telling you to slow down”?

After 7 years of startup hustle, on Saturday my 2-year-old laptop told me to slow down.

It happened out of nowhere. My trusty machine, which had been happily crunching code just a few hours before, presented me with what folks here in Brussels (presumably) call l’écran noir.

Hm.

Turned it off, then on again. Nope. Tried again.

Panic. Please, not this, not now.

Over the next few hours, no web-searched solution or secret-key-combo-into-safe-mode helped. The computer seemed to start up just fine, but the screen stayed black. …


Photo by Amy-Leigh Barnard on Unsplash

The era of the monolithic data warehouse/data lake is coming to an end — long live the decentralized data mesh!

Oh, do not despair! All those person-years spent cleaning, transferring, and loading data into your centralized systems haven’t been in vain. With data mesh, you don’t have to start again from scratch with new technology; that is, you don’t have to replace your RDBMS, Snowflake, or Databricks with a new vendor or open-source solution. A data mesh will simply use your existing databases, warehouses, and lakes as nodes in its greater, decentralized network of data products.

If this is your first introduction to the data mesh concept, here’s a technical deep-dive from ThoughtWorks, and an executive-level overview from McKinsey.


Photo by Patrick Tomasso on Unsplash

Data Science —
Answering questions with data
Is presumed to be an art,
Or at least a high-tech craft,
Producing exponential value and driving innovation.

Organizations know
Answering questions with data
Needs to be faster,
Automated.
They design a plan —
A centralized data lake with dashboards!
But it takes too long to build.
It goes over budget.
Centralized data doesn’t scale
And dashboards aren’t specific enough to be useful.

To this day,
Eighty percent of questions are answered the slow way.
It’s a human-scale process —
Emails, meetings, queries, modeling & analysis —
Waiting, waiting for weeks
For the bottleneck
Artisan data specialists handcrafting answers. …


Using the most successful, scalable pattern in software history to solve the worst problems in Data Science and Analytics.

Photo by Greg Rakozy on Unsplash (https://unsplash.com/@grakozy)

Data, data, everywhere…

…brimming with immense potential value for discovery in science, business and society.

Unfortunately, the actual utility of most collected data is greatly diminished for value discovery/extraction purposes — like drinking salty seawater, the cost/benefit is a net loss. Why is this?

  1. Collected data suffers from extreme variety — in encoding formats, access methods, and bespoke, domain-specific schemas. Even if it were possible to reformat and restructure the entire universe of data sources as standard, SQL-compatible tables, there would still remain the impossible task of joining and deciphering the vast diversity of table/column schemas that would arise.
  2. Collected data suffers from incompleteness of content — and I’m not just talking about missing values. What does the data represent? Why was the data collected? What was the design/purpose? Who was the data collected from and for? What biases or special conditions exist? Typically the missing information about a data source is stored in the brains of domain experts — i.e. specific humans. And those humans are rarely in the same place as the data sources they own or once owned. …


https://commons.wikimedia.org/wiki/File:Punishment_sisyph.jpg

An ever-growing list of anti-patterns and symptoms, in no particular order.

I think about this mostly from the SELECT-side, so I’m sure there’s a fair amount missing on the INSERT/UPDATE-side, and also from the NoSQL perspective.

  1. Entities are referenced across your codebase via unique keys generated within the database. H/T — Nicklas Millard.
  2. You have SQL queries in front-end code.
  3. Users can see table or column names in the front-end.
  4. You have to mitigate risk of SQL injection from API calls.
  5. You have an extensive authentication/authorization model implemented within the database.
  6. Your database has an authentication/authorization model that is aware of individual developers or users. …
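To make item 4 concrete, here is a minimal sketch of the risk it refers to and the usual mitigation. Everything in it is an assumption for illustration: the `users` table, the `find_user` and `find_user_unsafe` functions, and the use of Python's built-in `sqlite3` as a stand-in for whatever database sits behind the API. The safe version also renames fields before returning them, so callers never see column names (item 3).

```python
import sqlite3


def make_demo_db():
    # Hypothetical in-memory database with made-up rows, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    conn.commit()
    return conn


def find_user_unsafe(conn, name):
    # Anti-pattern: caller-supplied text is concatenated into the SQL string,
    # so the caller can change the meaning of the query.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user(conn, name):
    # Mitigation: the value is bound as a parameter and is never parsed as SQL.
    rows = conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
    # Return API-level field names rather than raw column names.
    return [{"user_id": row[0], "display_name": row[1]} for row in rows]


conn = make_demo_db()
malicious = "x' OR '1'='1"

print(len(find_user_unsafe(conn, malicious)))  # 2: the injected OR matches every row
print(find_user(conn, malicious))              # []: the input is treated as a plain string
print(find_user(conn, "alice"))
```

The sketch only shows the mechanics; whether an API should be building SQL from caller input at all is the larger question the list raises.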



You should be able to read this straight through, even though terms are presented in alphabetical order. Alternatively, you can jump around to specific terms of interest.

Terms in bold (←except this one) are all defined in this glossary. I’m going to figure out how to turn them into anchor links later.

API-Driven Design

What if a user’s Data Experience in software were primarily driven by server-defined functionality — instead of being driven by front-end functionality?

This would turn a front-end application into a simple browser of server content — which seems feature-weak — that is, if you only have one type of data application server to connect to. …


Example output from the Java UMAP library in Tag.bio, embedding ~11,000 TCGA Pan Cancer Atlas tumor samples using gene expression dimensions.

Developed in collaboration between San Francisco, California-based Tag.bio and New Zealand-based Real Time Genomics, the umap-java library is a port of the original Uniform Manifold Approximation and Projection (UMAP) Python implementation by Leland McInnes.

The open source project is available here on GitHub, and is released for use under a BSD-3 License.

On a personal note, I’d like to offer heartfelt thanks to everyone who has contributed to this work so far.


Vending machine, automated choice and delivery.
Data analysis systems should be systematic, like vending machines. Your question or request goes in, and a Useful Data Artifact comes out.

For starters, I’d like to acknowledge Josh Dunn for first using the term “Useful Data Artifacts” in a conversation over lunch at the Boston Seaport a few weeks back. I’d been using the term “Data Artifacts” for some time, but what’s the value of one, if it’s not useful?

What is a Useful Data Artifact?

First and foremost, a Useful Data Artifact is an actual digital thing. It is not an idea, a thought, a realization, or an insight. It’s not in your brain — it’s a structured data object, created when you or an algorithm do something with data.

More technically — a Useful Data Artifact is a nonrandom subset or derivative digital product of a data source, created by an intelligent agent (human or software) after performing a function on the data source. …
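One way to picture that definition is as a small structured object that records the derived content together with where it came from. The sketch below is only an illustration under stated assumptions: the `DataArtifact` class, its field names, the `mean_by_group` function, and the toy `samples` rows are all hypothetical and are not taken from any existing platform.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DataArtifact:
    # Hypothetical structure: the derived content plus minimal provenance.
    source_name: str    # which data source the artifact was derived from
    function_name: str  # which function was performed on that source
    created_by: str     # the agent, "human" or "software"
    content: dict       # the derivative product itself


def mean_by_group(rows, group_key, value_key):
    # The "function performed on the data source": average a value per group.
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


# Made-up toy rows standing in for a real data source.
samples = [
    {"tissue": "lung", "expression": 2.0},
    {"tissue": "lung", "expression": 4.0},
    {"tissue": "liver", "expression": 5.0},
]

artifact = DataArtifact(
    source_name="samples",
    function_name="mean_by_group",
    created_by="human",
    content=mean_by_group(samples, "tissue", "expression"),
)

print(artifact.content)  # {'lung': 3.0, 'liver': 5.0}
```

The point of the sketch is that the artifact is a concrete object that can be stored, shared, and inspected, unlike an insight that lives only in someone's head.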



I’m a co-founder at a San Francisco-based startup, but I spend most of my time working from Brussels, so I find myself on both sides of this issue.

  1. Know their time zone, and use it when scheduling things.
    This doesn’t mean you, as employer/manager, need to spend as much time working in the middle of the night as they do. It’s their business to accommodate your workday. I’m just saying: know the time difference, and when scheduling things, use their time zone.
  2. Know their holidays, religious and/or secular.
    Beyond acknowledging their time off, say something nice. Let them know their cultural traditions have value to you. …



I’m the technical co-founder of Tag.bio, and the software platform we’ve built is my baby. So naturally, I’m biased—this 10-point overview is intended to clarify my declarations of technological awesomeness with more objectivity.

Disclaimer — this is intended to be as concise as possible, so it’s chock-full of technical jargon. If that suits you, please continue.

The 30,000-foot view

About

Jesse Paquette

Full-stack data scientist, co-founder at Tag.bio, pick-up soccer junkie, located in Brussels and San Francisco.
