Often I talk to customers who are on a journey to implement Artificial Intelligence (AI) in their organization. In my last blog, I wrote about some best practices around Deep Learning. In this blog, I am going to get more prescriptive about how you could, and probably should, explore “Scaling AI” inside your organization.
To start, I am assuming that your organization has already developed a noteworthy model. It might be an Inception-style image model or Natural Language Processing. You can deploy on GPUs or CPUs. Stochastic Gradient Descent or mini-batches. You have trained your model and are ready for inference deployments. Or are you?
Models – Day 2
What happens when you want to train more than one model at a time? What happens when you are serving multiple inference models at the same time?
Welcome to a few days into your production AI/Deep Learning environment, when it starts hosting multiple models and multiple Data Science teams at once.
What is the best way to capture the relevant metrics at scale? Which components matter most when you Scale AI?
Queueing Theory
Queueing theory is probably most eloquently articulated in “Little’s Law,” which relates the queueing of “arrivals” to response times. In algebraic terms for infrastructure: process or file occupancy = latency x bandwidth, or in queueing notation L = λW (requests in the system = arrival rate × time spent in the system). Each process (be it GPU or CPU) training or deploying a production model depends on how fast its requests can be serviced and completed. The latency-versus-bandwidth tradeoff has been debated for decades, and it is complicated by every storage operating system’s locks and latches, network latency, storage bandwidth, network bandwidth, and the compute parallelism that drives high arrival rates. It has been argued that bandwidth matters far more than latency for servicing requests at scale, yet pushing bandwidth higher increases latency, depending on the latches and locks of the underlying storage infrastructure. This is the puzzle that often has customers asking for help as they wrap their heads around Scaling AI.
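To make Little’s Law tangible, here is a minimal back-of-the-envelope sketch in Python. The throughput, IO size, and latency numbers are placeholders chosen for illustration, not measurements from any particular system:

```python
# Little's Law: occupancy (L) = arrival rate (lambda) x time in system (W).
# Used here to ask: how many reads must be in flight to hit a target throughput?

def required_concurrency(throughput_gb_per_sec: float,
                         io_size_kb: float,
                         latency_ms: float) -> float:
    """Concurrent outstanding reads needed to sustain a given throughput
    when each read takes latency_ms on average."""
    reads_per_sec = (throughput_gb_per_sec * 1e9) / (io_size_kb * 1e3)  # lambda
    return reads_per_sec * (latency_ms / 1e3)                           # L = lambda * W

# Example: sustaining 15 GB/s of 128 KB random reads at 2 ms average latency
# requires roughly 234 reads in flight at all times.
print(required_concurrency(throughput_gb_per_sec=15, io_size_kb=128, latency_ms=2))
```

The takeaway is that latency and bandwidth trade against each other through concurrency: hold the bandwidth target fixed, and every millisecond of added latency must be absorbed by a deeper queue of outstanding requests.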
Other Factors
There have been a number of studies looking at various elements of running analytics at scale. Compression and encoding algorithms for data files were the topic of one recent study of storage content at scale. For most of my lab testing, we use Protocol Buffers (ProtoBuf) with LZO compression. The Deep Learning frameworks (e.g., TensorFlow, Caffe2) have native hooks for ProtoBuf file serialization and streaming on read, which allows effective use of binary encoding and CPU-optimized compression.
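As a hedged illustration of what that serialization path can look like in practice: TensorFlow’s TFRecord format wraps each record in a protocol buffer and streams it back through tf.data on read. TFRecord’s built-in compression options are GZIP and ZLIB rather than LZO, so GZIP stands in below, and the file name and dummy samples are mine:

```python
import tensorflow as tf

# Dummy in-memory samples standing in for real (image_bytes, label) pairs.
samples = [(b"\x00" * 16, 0), (b"\xff" * 16, 1)]

# Write samples as protobuf-encoded tf.train.Example records, compressed on disk.
# (TFRecord natively supports GZIP/ZLIB; LZO would require an external library.)
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("train_shard_000.tfrecord.gz", options) as writer:
    for image_bytes, label in samples:
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())

# Stream the serialized, compressed records back at training time.
dataset = tf.data.TFRecordDataset(["train_shard_000.tfrecord.gz"],
                                  compression_type="GZIP",
                                  num_parallel_reads=tf.data.AUTOTUNE)
```

The benefit is the one described above: binary-encoded records decode cheaply, and compression trades a little CPU for fewer bytes pulled off storage.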
When you are running a single training model or deploying a single model for inference, it is easy to tweak the data so the whole model fits in cached memory. What happens when the environment becomes multi-tenant? A developer or data scientist can no longer guarantee that all model data will be loaded into memory for training or inference. This pushes more concurrency and more arrivals down to the storage layer. Caches, spill space, and software-defined caches can certainly help, but at any real scale a large number of concurrent requests will probably still land on the storage layer to service training or inference.
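The caching tradeoff can be sketched with a tf.data input pipeline. Whether the working set fits in memory is exactly what multi-tenancy takes away; the dataset, flag, and batch size below are assumptions for illustration:

```python
import tensorflow as tf

def build_input_pipeline(dataset: tf.data.Dataset,
                         fits_in_memory: bool,
                         batch_size: int = 64) -> tf.data.Dataset:
    """Illustrative pipeline: cache when the working set fits in RAM,
    otherwise stream from durable storage and rely on prefetch to hide latency."""
    if fits_in_memory:
        dataset = dataset.cache()          # single-tenant luxury: hold it all in memory
    dataset = dataset.shuffle(10_000)      # random-read style access pattern
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(tf.data.AUTOTUNE)  # overlap storage reads with compute
```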
Containers are all the rage. Containerizing workloads certainly makes troubleshooting and deploying models and environments easier. The complexity lies in getting durable storage into the container. Either all the data is copied into each container, the container security model is breached to allow durable storage connections, or a container bridge is deployed so that full-fidelity, secure containers can leverage external file or object systems for durable model data. Our recommended approach is a container bridge: it maintains container security, allows external shared durable storage for model and training data, and maximizes manageability and scale.
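One way such a bridge can look in practice, sketched with the Kubernetes Python client. The server, export path, and image names are invented, and this is only one illustration of mounting shared durable storage into a container rather than copying data into it:

```python
from kubernetes import client

# Mount externally managed durable storage (here, an assumed NFS export) into the
# training container instead of baking data into the image. Names are illustrative.
volume = client.V1Volume(
    name="model-data",
    nfs=client.V1NFSVolumeSource(server="storage.example.internal",
                                 path="/exports/training-data",
                                 read_only=True),
)
container = client.V1Container(
    name="trainer",
    image="example/trainer:latest",
    volume_mounts=[client.V1VolumeMount(name="model-data",
                                        mount_path="/data",
                                        read_only=True)],
)
pod_spec = client.V1PodSpec(containers=[container], volumes=[volume])
```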
What does good look like?
Let’s tie some of the items raised above back together and drill into the specifics.
Let’s assume you have a series of Inception and NLP models running at scale. Based on our internal analysis across several deep learning frameworks, the storage profile for Deep Learning is random IO (i.e., random reads). Even assuming a heavily read-centric workload, training and inference have different levels of parallelism. Training can only go so embarrassingly parallel, bounded by GPU size and how the mini-batch optimizers are organized, while inference of trained models can be deployed in very embarrassingly parallel configurations. That brings different dynamics when both run at once in the same cluster; if training is isolated to one compute/storage cluster and inference is serviced by another, the environment is easier to model mathematically. Serializing your data allows manageable compression and optimized file management, but a single serialized file concentrates far more concurrency on one file than a myriad of small files would. The sweet spot is a storage subsystem whose latency does not degrade (per Little’s Law) as concurrent connections grow, while still allowing the storage and management optimizations that serialization brings. If these overarching factors start to sound like chess versus checkers, then you are starting to appreciate the complexity of Scaling AI.
If you assume a compute farm worthy of embarrassingly parallel workloads, then you can start building the math model for your environment. Little’s Law is a good place to start. How many concurrent models should you expect, and of what types (training vs inference)? What neural-network architectures (e.g., CNNs, RNNs), on what sized images, text bits, or serialized files, and in what configurations? What is the expectation for turnaround time, i.e., wall-clock model completion times? These factors can then be fed into network capability modelling (10G vs 40G, for example) and storage chassis concurrency and bandwidth metrics (say, 1M concurrent reads per second at 15 GB/sec per 4U storage chassis).
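To make that concrete, here is a rough sizing sketch in Python using the example chassis figures above. The workload mix, request rates, and IO sizes are assumptions invented for illustration:

```python
# Rough capacity model for a shared training + inference cluster.
# The workload numbers are illustrative assumptions; only the chassis figures
# (1M concurrent reads/sec, 15 GB/s per 4U) come from the example above.

CHASSIS_READS_PER_SEC = 1_000_000
CHASSIS_BANDWIDTH_GBS = 15

workloads = {
    # name: (concurrent jobs, reads/sec per job, avg read size in KB)
    "training":  (4,   50_000, 256),
    "inference": (200,  1_000,  64),
}

total_reads = sum(jobs * rps for jobs, rps, _ in workloads.values())
total_gbs = sum(jobs * rps * kb / 1e6 for jobs, rps, kb in workloads.values())

chassis_needed = max(total_reads / CHASSIS_READS_PER_SEC,
                     total_gbs / CHASSIS_BANDWIDTH_GBS)
print(f"{total_reads:,} reads/s, {total_gbs:.1f} GB/s -> "
      f"{chassis_needed:.2f} chassis worth of storage (before headroom)")
```

In this made-up mix the environment comes out bandwidth-bound rather than IOPS-bound, which is exactly the kind of conclusion the math model is meant to surface before hardware is ordered.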
Summary
The net of this effort is an engineering math model you can simulate to understand the capabilities of your scaled AI environment. That is a far cry from writing your first trained model to achieve a business outcome and wondering how it will deploy at scale to service your production customers. Questions you will want to answer as you build out your Scaled AI environment:
- What level of concurrency can my storage support before latency ticks up?
- What is my container bridge strategy?
- What level of performance am I expecting and/or getting in my PoCs?
- What is the scale of the durable storage for training vs inference?
- How comfortable am I doing this alone, or do I want to talk to someone who does this for a living?