Deep learning is hard. Between organizing, cleaning, and labeling data, selecting the right neural network topology, picking the right hyperparameters, and then waiting, and hoping, that the resulting model is accurate enough to put into production, it can seem like an impossible puzzle for your data science team to solve.
But the IT aspect of the puzzle is no less complicated, especially when the environment needs to be multi-user and support distributed model training. From choosing an operating system to installing libraries, frameworks, dependencies, and development platforms, building the infrastructure to support your company’s deep learning efforts can be even more challenging than the data science. On top of that, deep learning software and its supporting libraries change rapidly, many of them monthly, which is a recipe for IT headaches.
Containerization helps solve some of the IT complexity. Instead of your IT staff cobbling together dozens of libraries and dependent software packages to make your deep learning framework of choice function, you can download pre-configured containers which handle all of that. Or you can have your data scientists build custom containers to meet their specific needs. However, your IT department must still build and configure infrastructure for orchestrating those containers, while providing a resilient, scalable platform for your data science team to be as productive as possible.
Nauta Deep Learning Platform
Nauta software seeks to solve many of the problems associated with building container orchestration infrastructure for deep learning. Nauta is a containerized deep learning platform that uses Kubernetes for container orchestration. It provides an intuitive command-line interface for building, running, curating, and evaluating experiments, and it includes must-have features such as Jupyter notebooks and TensorBoard.
We’ve been using Nauta in the Dell HPC & AI Innovation Lab, testing its features, functionality, extensibility, and ease of use. We use Nauta to run many of our cutting-edge deep learning research projects, including scalable convolutional neural network (CNN) training on chest X-rays and ultra-scalable multi-head attention network training for language translation. It allows us to go from an early proof of concept in Jupyter notebooks, to high-performance distributed training using the MPI-based Horovod framework for TensorFlow, to wide hyperparameter analysis for producing the most accurate model possible. Best of all, it’s a scalable platform built on top of Kubernetes and Docker, allowing us to easily share and replicate work between team members.
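As a rough illustration of what a wide hyperparameter analysis involves, the sketch below expands a small search space into one configuration per experiment run. It is a minimal, platform-independent example rather than Nauta's own mechanism: the `expand_grid` helper, the parameter names, and the values are all hypothetical.

```python
import itertools


def expand_grid(search_space):
    """Return one dict per combination of the listed hyperparameter values."""
    names = list(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


# Hypothetical search space; these names and values are not from the post.
search_space = {
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [32, 64, 128],
}

# Each resulting dict describes one training run to submit as its own experiment.
configs = expand_grid(search_space)
```

Each entry in `configs` could then be passed to a training script as a separate experiment, which is the kind of fan-out a container orchestration platform makes practical.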
In addition to training neural networks, Nauta also provides a mechanism for testing deployments of trained models. This allows us to evaluate model accuracy, benchmark performance, and test reduced-precision quantization on new hardware, such as the 2nd-Generation Intel® Xeon® Scalable processor with Intel® Deep Learning Boost. Nauta supports both batch inference on stored data and streaming inference through REST APIs. And while Nauta isn’t expressly designed for production model deployment, the ability to evaluate trained models and experiment with reduced precision is an important component of the overall model development and deployment process.
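To make the streaming case concrete, here is a minimal sketch of how a client might prepare a single inference request. It assumes the deployed model is exposed through a TensorFlow Serving-style REST endpoint (`/v1/models/<name>:predict` with an `instances` payload); the post does not specify the request format, and the `build_predict_request` helper, host, port, and model name below are all hypothetical.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, instances):
    """Build (but do not send) a POST request for a TF Serving-style predict call."""
    url = f"{base_url}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and model name, for illustration only.
request = build_predict_request(
    "http://localhost:8501", "chest_xray_cnn", [[0.0, 0.5, 1.0]]
)
# To send it: urllib.request.urlopen(request), then parse the JSON response body.
```

The request is built without being sent so the example runs without a live server; a real client would also need whatever authentication the deployment requires.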
Looking Forward
The Dell HPC & AI Innovation Lab team continues to use and evaluate Nauta, report and resolve issues, and recommend improvements. Select customers are also experimenting with and evaluating Nauta on Dell hardware, and Nauta will be a central component of future Ready Solutions. In the end, your company’s AI efforts are only going to be successful if the infrastructure is ready to support your data science team. Nauta provides an on-ramp for your IT organization and your data science team to get started training in an on-premises containerized environment quickly and easily.