Towards: Infrastructure for AI

How do we best design our data infrastructure? That is today a pretty big question.

Matt Bornstein and his colleagues from the tech VC Andreessen Horowitz recently wrote an article on this topic. They describe three blueprints for modern data architectures, based on best practices gathered from interviews with many US-based practitioners.

A first glance

blueprints.png

The three blueprints are distributed over two axes (on the left):

  • Analytic Systems power decision-making. This is about dashboarding, reports, SQL queries, data visualization and so on.

  • Operational Systems are production software systems that are powered by data and algorithms. So this is really about engineering.

The blueprints are then positioned along these axes:

  • Modern Business Intelligence on the analytics side of things.

  • Multimodal Data Processing as a hybrid between the two, supporting both use-cases (analytical and operational).

  • AI and ML on the pure operational side of things.

In contrast to the original article, in this post I want to explore the individual components of the three blueprints in more detail. Why? Because I’m interested in building things and want to shed more light on what technologies actually go into these blueprints.

Without further ado, let’s get straight into it. Starting with the modern world of BI.

1. Modern Business Intelligence

When to use it:

  • You are a smaller startup with a small data team, taking the first steps to build up your data infrastructure and BI stack.

  • You don’t have a need for advanced ML and real-time data processing.

  • You don’t want to (or can’t) operate larger pieces of infrastructure yourself.

This is essentially the blueprint for analytical workloads. Let’s look at the flow.

bi-blueprint.png
  1. We get data from the Sources and use Connectors (e.g. Fivetran) to easily pipe this data into the Data Warehouse (e.g. Snowflake). To get the data into the format we want, we use tools for Data Modeling (e.g. dbt).

  2. After the data is stored, analysts can query it directly through a SQL-like interface to perform Historical analysis of the data (see the sketch after this list).

  3. We provide access through higher-level software tools. This could be building something static like Dashboards. It could also be allowing business users to do interactive analysis in the form of visualizations and drill-downs that they click together themselves, which is what is here called Embedded Analytics (e.g. Looker). Augmented Analytics then describes a capability that was actually new to me; from what I understand it is about automatically analysing and correlating developments in your data over time, e.g. changes in buyer segments. (Check e.g. Anodot)
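
To make step 2 concrete, here is a minimal sketch of querying the warehouse from Python, assuming a Snowflake warehouse (matching the example above) and the snowflake-connector-python package; the account, credentials and the orders table are placeholders, not something from the original article.

```python
# Minimal sketch: querying the Data Warehouse (here: Snowflake) from Python.
# Account, credentials, warehouse and table names are placeholders.
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="analyst",            # placeholder
    password="***",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT order_date, SUM(amount) AS revenue
    FROM orders                -- hypothetical table
    GROUP BY order_date
    ORDER BY order_date
    """
)
for order_date, revenue in cur.fetchall():
    print(order_date, revenue)

cur.close()
conn.close()
```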

On the bottom you see “horizontal”, cross-cutting functionality that supports all phases.

  • Data Quality and Testing can be performed with tools like Great Expectations, which in a nutshell allows you to define expectations on your data, e.g. that column values need to be between x and y, or that the median value of a field should be z (see the sketch after this list).

  • Entitlement and Security covers the growing field of regulations and security breaches by, for example, applying rule-based policies to roles that limit what a user can see across services. (See for instance Immuta)
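
To illustrate the Data Quality and Testing point, here is a minimal sketch using Great Expectations’ classic pandas-based API; the column and thresholds are made up, and exact method names can differ between library versions.

```python
# Minimal sketch of data quality checks with Great Expectations
# (classic pandas-based API; exact names may differ between versions).
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.DataFrame({"price": [9.99, 14.50, 3.20]}))

# Expectation: all prices fall within a plausible range.
result = df.expect_column_values_to_be_between("price", min_value=0, max_value=1000)
print(result)  # the result contains a success flag plus details

# Expectation: the median price is close to a known reference value.
result = df.expect_column_median_to_be_between("price", min_value=5, max_value=20)
print(result)
```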

2. Multimodal Data Processing

When to use it:

  • You are a larger startup or established company.

  • You have heterogeneous requirements for both analytical as well as operational (engineering) use-cases.

  • You have a sufficiently sized team to maintain and take care of infrastructure (or intend to do so).

Multimodal Data Processing. Big words. What this means is essentially an infrastructure that supports both modes of data processing: both the analytical as well as the operational.

Let’s break it down.

multimodal-blueprint.png

We still have our Sources.

But now in the Ingest and Transformation layer it gets more crowded.

  • We now have Workflow Managers, such as Airflow, that can programmatically build a graph of transformations that you can apply to your data. Simply said, this is no longer a simple “take this raw data and pipe it into storage” but something more evolved: it allows you to take your data, perform bigger transformations and orchestrate them. You can define steps of transformations, and it’s a bit like what the connectors do, but on steroids (see the sketch after this list).

  • Transformations are no longer just simple SQL queries; with Python Libraries (Pandas, Ray) you can leverage a distributed execution environment such as a Spark Platform, e.g. the AWS-hosted one called EMR.

  • For the sake of completeness, if we are storing data in Hadoop we can make use of Batch Query Engines such as Hive. It is not 100% clear to me how the authors intended it to be used here. I could imagine querying data in the Data Lake, doing some processing, and writing enriched data back to the Data Lake.
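
To illustrate the Workflow Manager idea, here is a minimal sketch of an Airflow DAG with two dependent steps; the task logic, names and schedule are placeholders.

```python
# Minimal sketch of a Workflow Manager pipeline: an Airflow DAG with two
# dependent transformation steps. Task logic and names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the source")


def transform():
    print("clean and enrich the extracted data")


with DAG(
    dag_id="example_transform_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Defines the graph: extract runs before transform.
    extract_task >> transform_task
```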

Storage

  • The transformed data can go into the Data Warehouse (like before) and/or the Data Lake. The Data Lake here is a vast, horizontally scalable storage for structured as well as unstructured data (video files, audio, images, …). You can also write into the Data Lake first and then push or pull the relevant data into the Data Warehouse.

  • Inside the Data Lake you see three kinds of subsections.

    • On the bottom you find the service being used, whether that’s AWS S3 or Google Cloud Storage.

    • Next, you see the file format being used, where the goal is fast query times and a low disk footprint. Think of it as a binary format that lets you specify metadata (column names and types) and thereby allows for SQL-like queries. (e.g. Avro; see the sketch after this list)

    • On the top, you then find transactional layers, e.g. Delta Lake from Databricks, that bring transactional guarantees that you know from relational databases to the Data Lake world.
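
To illustrate the file-format layer, here is a minimal sketch that writes and reads Avro files with the fastavro library (one possible Python binding, my choice rather than the article’s); the schema and records are purely illustrative.

```python
# Minimal sketch of a binary file format with an embedded schema, using Avro
# via the fastavro library. Schema and records are illustrative.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "int"},
        {"name": "url", "type": "string"},
    ],
})

records = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/checkout"},
]

# Write records together with their schema (column names and types).
with open("pageviews.avro", "wb") as out:
    writer(out, schema, records)

# Read them back; downstream query engines rely on the same embedded schema.
with open("pageviews.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```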

Historical

  • With the addition of Event Streaming (e.g. Kafka) and Stream Processing (e.g. Flink) in the ingest layer - which just means we now retrieve data in (more or less) real time - we also have Real-time Analytics solutions (e.g. Rockset) to get a real-time view of the data flowing in (see the consumer sketch after this list).

  • With the Data Lake as a new storage we can use different Ad Hoc Query Engines to query data from it when we want (e.g. Presto).
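
To illustrate the event-streaming side, here is a minimal sketch of consuming a Kafka topic with the kafka-python client; the topic name, broker address and message format are assumptions.

```python
# Minimal sketch of consuming an event stream with kafka-python.
# Topic name, broker address and message format are placeholders.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "clickstream",                                   # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # In a real pipeline this would feed a stream processor or a
    # real-time analytics store instead of being printed.
    print(event.get("user_id"), event.get("url"))
```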

Predictive

Where it gets interesting now is that we move further into the future (what is here called predictive).

  • The Data Science Platform gets data from the Data Warehouse and/or from the Data Lake. What does it do? The authors grouped different tools here, but in essence it is about giving Data Scientists and Engineers an environment in which they can build and monitor models. Essentially, these environments abstract away a lot of the heavy lifting that is necessary for training, deploying and monitoring ML models in production. Check for instance Databricks, SageMaker or DataRobot.

  • The Data Science and ML Libraries are less automatic and represent more the core tools that do the heavy lifting. I won’t say it’s bare-metal (because we are far from hardware here) but it goes metaphorically in that direction. Examples: scikit-learn, Spark ML.
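
To make this concrete, here is a minimal scikit-learn sketch that trains and evaluates a simple model; the dataset and model choice are just for illustration.

```python
# Minimal sketch of training and evaluating a model with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```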

Output

Remember the difference with the first blueprint? We now talk about operational (software) integration.

  • New output: Custom Apps (think of Amazon’s recommender system) or interactive apps built with an App Framework like Streamlit (see the sketch below).
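
As a sketch of such an interactive app, here is a minimal Streamlit script; the data is made up and would normally come from the warehouse or an API.

```python
# Minimal sketch of an interactive app built with the Streamlit App Framework.
# Run with: streamlit run app.py
import pandas as pd
import streamlit as st

st.title("Daily revenue")

# Hypothetical data; in practice this would come from the warehouse or an API.
df = pd.DataFrame(
    {"day": ["Mon", "Tue", "Wed"], "revenue": [120, 150, 90]}
).set_index("day")

min_revenue = st.slider("Show days with revenue above", 0, 200, 100)
st.line_chart(df)
st.table(df[df["revenue"] > min_revenue])
```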

Horizontal Enablement

  • Metadata Management is about cataloging stored data and making it easy for consumers, e.g. Analysts or Data Scientists, to discover and consume this data. (Read e.g. here about LinkedIn’s DataHub)

  • Observability is about detecting data issues in real time, uncovering anomalies and in the case of AI it is also about observing model performance and related aspects. (Check e.g. Fiddler)

3. Artificial Intelligence and Machine Learning

When to use it:

  • You have strong ML requirements for advanced analysis and/or products powered by algorithms and data.

  • You have enough power in your team to build things and stitch together the tooling (or plan to do so).

The previous blueprint covers both the analytical as well as the operational workloads. Highlighted below you see the parts that are essential for building operational applications powered by Data & ML.

Highlighted: AI-specific parts of the infrastructure

Given the complexity of the tool landscape, Bornstein and his colleagues described the AI/ML blueprint in more detail.

ai-blueprint.png

As you can see, the dimensions (on the top) are different for AI/ML. Instead of having a flow from ingest and transformation over storage to historical and predictive data access, we now simply have:

  1. Data Transformation (getting data into the right format),

  2. Model Training and Development

  3. Model Inference (making the model accessible to whatever client awaits downstream).

A lot of stuff is going on here, so let’s unpack it. We start from the left, where we have our Data Sources, which are - in contrast to before - the Data Lake, the Data Warehouse or the streaming engine (so no raw data directly from the source).

Data Transformation

In order to build ML models we need to preprocess and transform this data.

  • For this purpose, we can perform Data Labeling to create training data. (E.g. Labelbox)

  • We can orchestrate the set of transformations through tools for Dataflow Automation such as Airflow or Kubeflow.

  • Depending on the storage, we can use different Query Engines (Presto) to query data. We use different Data Science Libraries (Pandas, NumPy) to do the low-level transformations for us (see the sketch after this list).

  • The Data Science Platform represents the “human interface” for this tooling. Spin up a notebook and interact with the data and experiment in an interactive fashion (e.g. Deepnote, Jupyter).
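
As a sketch of such low-level transformations, here is a minimal Pandas/NumPy example that imputes missing values, aggregates per user and one-hot encodes a categorical column; all column names are placeholders.

```python
# Minimal sketch of low-level data transformations with Pandas and NumPy
# to prepare model features. Column names and values are placeholders.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "amount": [10.0, np.nan, 25.0, 5.0],
    "country": ["DE", "DE", "US", "FR"],
})

# Impute missing values and aggregate per user.
raw["amount"] = raw["amount"].fillna(raw["amount"].median())
features = raw.groupby("user_id").agg(
    total_amount=("amount", "sum"),
    n_orders=("amount", "count"),
)

# One-hot encode a categorical column for downstream models.
country = pd.get_dummies(
    raw.drop_duplicates("user_id").set_index("user_id")["country"]
)
features = features.join(country)
print(features)
```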

Model Training and Development

In this phase, the goal is to train and determine the best model for the problem we want to solve. It involves both the manual, exploratory part of model development as well as the part that automates the whole process in the end.

  • To enable us to do our feature transformations, model training and hyperparameter search in a repeatable way that doesn’t have to involve humans, we use a Dataflow Automation tool like Kubeflow.

  • Feature Stores allow us to store re-usable features in a central storage without the need to recalculate them over and over again for different models or services (Tecton).

  • To build the best models we perform Experiment Tracking on different models and compare loss metrics, validation errors, etc., e.g. with MLflow, Comet or Weights & Biases (see the sketch after this list).

  • We perform Visualization of our models, e.g. Neural Networks with Tensorboard, or our data with Fiddler.

  • We perform Model Tuning through tools that allow us to optimize hyperparameters, e.g. Ray Tune.

  • Depending on the model class we want to train we can make use of ML Frameworks (e.g. Scikit-learn for a wide range of algorithms), DL Frameworks (e.g. PyTorch) or RL Libraries for Reinforcement Learning (RLlib, Coach).

  • To store models, version them and recover them with their set of hyperparameters we use a Model Registry. See MLflow.

  • In case we need it, we can use a Compiler to translate any Deep Learning model we have (TensorFlow, PyTorch, …) to run on any hardware platform (whether an NVIDIA GPU or an x86-compatible processor). See TVM.
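
To illustrate Experiment Tracking (and, optionally, the Model Registry), here is a minimal MLflow sketch; the experiment name and hyperparameter are placeholders, and the exact logging API may vary slightly between MLflow versions.

```python
# Minimal sketch of Experiment Tracking with MLflow.
# Experiment name, parameters and metrics are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

mlflow.set_experiment("churn-model")  # hypothetical experiment name

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)

    mlflow.log_param("C", C)                              # hyperparameter
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Logs the model as a versioned artifact; passing registered_model_name
    # would additionally register it in the Model Registry (needs a server).
    mlflow.sklearn.log_model(model, "model")
```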

Model Inference

Finally, we want to retrieve model predictions in production and serve them to clients, e.g. a native mobile app.

  • Transform real-time data into features for our models through Feature Servers (Tecton).

  • If you have no low-latency requirements, you can use Batch Prediction (Spark) to perform inference, say, at night, to update your customer segments for the day. If you need low-latency, real-time predictions, e.g. for fraud use-cases, use an Online Model Server (Seldon Core, Ray Serve, TensorFlow Serving; see the sketch after this list).

  • Model Monitoring extends the visualisation part that we saw during model training into production. We can essentially monitor model or data drift and prediction quality in real time, and take either manual or automated action: Fiddler, Arize
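
As a sketch of calling an Online Model Server, here is a minimal request against a TensorFlow Serving REST endpoint; host, port, model name and the feature vector are placeholders.

```python
# Minimal sketch of calling an Online Model Server over REST, here a model
# hosted by TensorFlow Serving. Host, port and model name are placeholders.
import requests

instances = [[5.1, 3.5, 1.4, 0.2]]  # one feature vector; shape depends on the model

response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",  # hypothetical endpoint
    json={"instances": instances},
    timeout=5,
)
response.raise_for_status()

print(response.json()["predictions"])
```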

How to apply this in practice

So, obviously the whole Data/ML landscape is always in flux. And I don’t believe we should use the tools in the blueprints as “the truth”. Rather, I believe the blueprints can serve as a source of inspiration and make a nice framework to check our own architectures and see how we can evolve or improve over time. I will certainly try and map out the architecture at my company and then see where we could go from there.

In this sense, happy building!

Resources

Original Article: Emerging Architectures for Modern Data Infrastructure

Concise collection of all blueprints: Blueprint collection

An interesting interview with the CEO of Databricks, Ali Ghodsi, talking more about the possible unification of Data Infrastructure in the future.

Podcast: Data Alone Is Not Enough: The Evolution of Data Architectures
