Some Notes on Machine Learning Engineering for Production

Ninad Madhab
Jul 3, 2021

The name Andrew Ng rings a bell, right? Whether or not you have started with ML, you would recognize him as the Chairman and Co-Founder of Coursera and the Founder of DeepLearning.AI. When I started Machine Learning, his course was, for many of us, the first step of the journey into this field. And believe me, those who went through that course with patience have pretty much stayed loyal to the field of AI. But he alone didn't pique my interest; there were also these megaminds: Robert Crowe (RC), TensorFlow Developer Engineer at Google, and Laurence Moroney (LM), AI Advocate at Google. If you have taken any AI courses on Coursera, you have pretty much stumbled across their lectures, whether as guest appearances or whole courses.

The reason I am mentioning them is that recently they have been using the term MLOps very frequently. To explore it, DeepLearning.AI recently hosted an event, Machine Learning Engineering for Production, which included this expert panel (link at the end). These are my notes, both as a quick recap for my own memory and to give you an insight into the event.

Machine Learning Operations or “MLOps” for short, is really a nascent field. People might have seen some of the articles that are out there comparing and contrasting ML and DevOps and suggesting that MLOps is sort of a mashup of the two. Is that the right way to think about it? And to what extent do the roles of Data Scientist or ML Engineer involve MLOps?

What is MLOps?

A better term could be "production ML", stated RC. The model needs to adapt to change because data isn't static anymore. Understanding machine learning and deep learning concepts is essential, but if you're looking to build an effective AI career, you also need production engineering capabilities, which means taking a software-centric view of ML. How can we take learnings from software engineering to ML systems? It means bringing together people from different areas to solve a common problem. MLOps allows for frequent updates to the "software" through continuous monitoring and automated retraining on new data if and when required.
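That monitor-then-retrain loop can be sketched in a few lines. This is a toy illustration, not any framework's actual API: the `psi_drift` function and the 0.2 threshold are simplified assumptions (production systems typically compute per-feature statistics such as PSI or KL divergence):

```python
import statistics

def psi_drift(baseline, live, threshold=0.2):
    """Crude drift check: how far has the live mean moved from the
    baseline mean, measured in baseline standard deviations?
    (Hypothetical stand-in for a real per-feature drift metric.)"""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline) or 1.0
    shift = abs(statistics.mean(live) - base_mean) / base_std
    return shift > threshold

def monitor_and_retrain(baseline, live, retrain):
    """If incoming data has drifted, trigger automated retraining."""
    if psi_drift(baseline, live):
        retrain()          # kick off a training job on fresh data
        return "retrained"
    return "ok"            # model still matches the data it was trained on
```

In a real pipeline, `retrain` would enqueue a training job rather than run synchronously, and the drift check would run on a schedule over every monitored feature.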

How is MLOps actually implemented in an industry setting?

It depends, said LM. For deployment, the whole ML pipeline needs to be scaled, monitored, and configured, along with many other moving parts, so one goal is to have flexibility around this. This thinking has gone into projects like TFX, frameworks like Flask APIs for deploying non-realtime solutions, and DVC (Data Version Control); tools like Kubeflow and Vertex AI are also coming up to cater to the needs of MLOps. RC added that, compared to the outside world, there are many powerful but bespoke frameworks and tools at Google (e.g., Vertex AI). These could move in the open-source direction, as Google did with TensorFlow and Kubernetes.
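To make the "pipeline" idea concrete, here is a toy sketch of the kind of staged flow a framework like TFX orchestrates (ingest, validate, train, push to serving). Every function here is a hypothetical stand-in, not TFX's actual API; the point is only that each stage is a separate, composable, monitorable step:

```python
def ingest(raw):
    """Stand-in for an ExampleGen-style stage: pull usable records."""
    return [r for r in raw if r is not None]

def validate(examples):
    """Stand-in for a validation stage: reject bad batches early."""
    if not examples:
        raise ValueError("empty batch; refusing to train")
    return examples

def train(examples):
    """Stand-in for a Trainer stage: fit a trivial 'model' that
    centers inputs around the training mean."""
    avg = sum(examples) / len(examples)
    return lambda x: x - avg

def push(model):
    """Stand-in for a Pusher stage: hand the model off to serving."""
    return {"model": model, "status": "deployed"}

def run_pipeline(raw):
    # Each stage's output feeds the next, so any stage can be
    # scaled, monitored, or swapped out independently.
    return push(train(validate(ingest(raw))))
```

A real TFX pipeline would run these as separate components on an orchestrator (e.g., Kubeflow Pipelines), with artifacts and metadata tracked between stages.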

Is MLOps suitable for early-stage startups, or only for teams with enough resources to essentially do what the big tech companies do?

The big tech companies do have an edge in the research and invention of their own tools; however, AN suggested that a startup should just pick and choose existing tools. Don't build everything from scratch wherever possible; focus on getting an end-to-end system in place. It feels like we almost have enough tools, but there is still so much to be invented. It depends on the company, its maturity, and so on, and because of this there is no standard pipeline. New startups and big tech companies can adopt tools easily, but many companies in the middle find it hard to adopt all these tools and techniques. Maybe five years down the road it can be more standard, everyone agreed. But the key takeaway, I think, is that an MLOps pipeline needs to be data-centric, not model-centric, as most companies think of it now.

How can we bridge ML and business interests?

RC: The ML Engineer needs to learn to focus on the end-to-end system and not just experimental model validation. The business person, in turn, should know that AI isn't a magic wand.

It’s a big challenge in raw performance as well as management rigor. Datasets are massive and growing, and they can change in real time. AI models require careful tracking through cycles of experiments, tuning and retraining.

So, MLOps needs a powerful AI infrastructure that can scale as companies grow. Of course, NVIDIA has built ready-to-use integrated systems like NVIDIA DGX, along with CUDA-X and other software components available on NGC, NVIDIA's software hub. Software containers, typically orchestrated with Kubernetes, that simplify running these jobs are already in use to satisfy this growing hunger of business interests.

Where should a future MLOps engineer start in his or her career?

Everyone stated that MLOps still needs to mature. There is a need to formulate roadmaps and "rules" for MLOps to codify good standards, which could then be implemented as software. According to LM, the tooling will keep changing, and quickly. MLOps can look very different based on scale and on whether the data is structured or unstructured.

RC told all the AI enthusiasts that you need both ML knowledge and software engineering, and that it is a very rare combination. What we can do is not focus only on techniques or tools but rather be problem-focused. Try to do a lot of projects, and a variety of them. If choosing between ML and engineering, choose the latter; it can happen that the ML expert is stuck in their ways. So, the best way is to start by being a great engineer, focused on problem-solving.

What will remain relevant 5 years from now?

One principle/focus can be: how to ensure consistently high-quality data through the entire lifecycle of the project. It's the number of iterations (of the whole lifecycle) that matters, so focus on what can make you go faster.
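"Consistently high-quality data" usually starts with something as simple as a schema check that runs at every stage of the lifecycle, not just once at ingestion. A minimal sketch, assuming a hypothetical two-field schema (real deployments use richer tools such as TFX's Data Validation):

```python
# Hypothetical expected schema: field name -> required type.
EXPECTED_SCHEMA = {"age": int, "income": float}

def check_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of data-quality problems for one record.
    An empty list means the record passes the schema check."""
    problems = []
    for field, ftype in schema.items():
        if field not in record:
            problems.append(f"missing: {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"bad type: {field}")
    return problems
```

Running the same check at training time and at serving time is one cheap way to catch training/serving skew before it silently degrades the model.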

In the end, each AI team needs to find the mix of MLOps products and practices that best fits its use cases. They all share a goal of creating an automated way to run AI smoothly as a daily part of a company’s digital life. That is only possible if we focus on the data and the lifecycle.
