Scalable Deployment Pipeline of Deep Learning-based Recommender Systems with NVIDIA Merlin
When we work on machine learning (ML) or deep learning (DL) models, we tend to focus on building a highly accurate model that scores well on our validation and test data. However, we need to think beyond that if the goal is to put these models into production and derive useful business insights from them. An end-to-end ML/DL pipeline consists of preprocessing and feature engineering (ETL), model training, and model deployment for inference. Model deployment is the critical step of this pipeline: it is where the model starts informing practical business decisions, so the model(s) must be effectively integrated into the production environment.
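The three pipeline stages above can be sketched as plain functions. This is a minimal, hypothetical illustration of the ETL → training → deployment flow, not Merlin's actual API: the `preprocess`, `train`, and `deploy` names and the trivial linear scorer are stand-ins chosen for brevity.

```python
import numpy as np

def preprocess(raw):
    # ETL / feature engineering: standardize numeric features (illustrative only)
    features = np.asarray(raw, dtype=np.float64)
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)

def train(features, labels):
    # stand-in for model training: fit a trivial linear scorer via least squares
    weights, *_ = np.linalg.lstsq(features, labels, rcond=None)
    return weights

def deploy(weights):
    # model deployment: wrap the trained weights in a serving callable
    def predict(features):
        return features @ weights
    return predict

# end-to-end: preprocessing -> training -> deployment -> inference
raw = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]]
labels = np.array([0.0, 1.0, 0.0, 1.0])
X = preprocess(raw)
predict = deploy(train(X, labels))
scores = predict(X)  # one score per candidate item
```

In a real recommender pipeline each stage is a separate system (e.g. NVTabular for ETL, a training framework, and an inference server), but the handoff between stages follows the same shape.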
Why is deployment of deep learning-based recommender systems hard?
Deploying a trained model to production is a significant engineering problem which is often neglected in literature.
The models can be accessed directly or indirectly by hundreds of millions of users, so our production system needs to provide high throughput. In addition, online services often have latency requirements: for example, a request may need to be served to the user in less than a few hundred milliseconds. Our production models therefore have to be scalable and provide low latency for each request. Even if we fulfill these requirements, we still want the production environment to be optimally utilized, with no computational resources sitting idle. And that is not the end; there are still more requirements for a production system…
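Latency requirements like these are usually stated as percentiles rather than averages, since a tail of slow requests hurts users even when the mean looks fine. The sketch below, with a hypothetical `serve_fn` standing in for a deployed model endpoint, shows one simple way to measure median and p99 per-request latency:

```python
import time
import statistics

def measure_latency(serve_fn, requests, warmup=10):
    """Measure per-request latency in milliseconds for a serving callable.

    `serve_fn` and `requests` are stand-ins for a deployed model
    endpoint and its incoming inputs (illustrative only).
    """
    for r in requests[:warmup]:            # warm up caches before timing
        serve_fn(r)
    latencies_ms = []
    for r in requests:
        start = time.perf_counter()
        serve_fn(r)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    p99 = latencies_ms[int(0.99 * (len(latencies_ms) - 1))]
    return {"p50": statistics.median(latencies_ms), "p99": p99}

stats = measure_latency(lambda r: sum(r), [[i, i + 1.0] for i in range(1000)])
```

A production inference server such as NVIDIA Triton reports these percentiles for you, but checking them against the latency budget is the same exercise.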