MLOps teams face pressure to advance their capabilities to scale AI. In 2022, we saw an explosion of buzz around AI and MLOps inside and outside of organizations. 2023 promises more hype with the success of ChatGPT and the traction of models inside enterprises.
MLOps teams look to proactively expand their capabilities while reactively meeting the pressing needs of the business. These teams start 2023 with a long list of resolutions and initiatives to improve how they industrialize AI. How will we scale the components of MLOps (deployment, monitoring, and governance)? What are the top priorities for our team?
AlignAI teamed up with an automotive organization to write this playbook to guide MLOps teams, based on what we have seen succeed at scale.
To start, we need a working definition of MLOps: an organization’s transition from delivering a few AI models to reliably delivering hundreds of models at scale. This transition requires a repeatable and predictable process. MLOps means more AI in production, and the return on investment that comes with it. Teams win at MLOps when they focus on orchestrating the process, the team, and the tools.
While working with an executive at an automotive organization, we reviewed the usage metrics of a model and had a productive conversation about why usage had dropped. This visibility of the impact and adoption of models is crucial to building trust and reacting to the needs of the business.
A fundamental question for teams leveraging AI and investing in MLOps capabilities is: how do we know if we are progressing?
Teams should quantify performance along two dimensions: the business impact they provide and the operational metrics enabling it. Measuring impact captures the full picture of how the team generates value.
The first hurdle teams face in MLOps is deploying models into production. As the number of models grows, teams must create a standardized process and shared platform to handle the increased volume; managing 20 models deployed in 20 different patterns quickly becomes cumbersome. Enterprise teams typically stand up centralized infrastructure resources once they are supporting more than a handful of models. Choosing the right architecture and infrastructure across models and teams can be an uphill battle, but once established, it provides a strong foundation for the monitoring and governance capabilities built on top of it.
For the organization we worked with, we created a standard deployment function built on Kubernetes and Google Cloud Platform, with a dedicated team to support it.
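To make the idea concrete, here is a minimal sketch of what one standard deployment pattern can look like: a single function renders the same Kubernetes Deployment manifest for every model. The project, image names, labels, and resource sizes are illustrative assumptions, not the organization's actual configuration.

```python
import yaml  # pip install pyyaml

def render_model_deployment(model_name: str, image: str, replicas: int = 2) -> str:
    """Render a standard Kubernetes Deployment manifest for a model-serving
    container, so every model ships with the same pattern (labels, ports,
    resource limits) instead of 20 bespoke ones."""
    manifest = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {
            "name": f"{model_name}-serving",
            "labels": {"app": model_name, "managed-by": "mlops-platform"},
        },
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": model_name}},
            "template": {
                "metadata": {"labels": {"app": model_name}},
                "spec": {
                    "containers": [{
                        "name": model_name,
                        "image": image,
                        "ports": [{"containerPort": 8080}],
                        "resources": {
                            "requests": {"cpu": "500m", "memory": "1Gi"},
                            "limits": {"cpu": "1", "memory": "2Gi"},
                        },
                    }]
                },
            },
        },
    }
    return yaml.safe_dump(manifest, sort_keys=False)

# Every model the team deploys goes through the same function
# (the registry path is a hypothetical example).
print(render_model_deployment("churn-model", "gcr.io/my-project/churn-model:1.4.2"))
```

The payoff is consistency: when all 20 models share one pattern, monitoring and governance tooling only has to understand one shape of deployment.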
A unique and challenging aspect of machine learning is that models drift and change in production. Monitoring is critical to building the trust stakeholders need to use the models. Google’s Rules of Machine Learning advises teams to “practice good alerting hygiene, such as making alerts actionable.” This requires defining the areas to monitor and how alerts are generated, and then, the hard part, making those alerts actionable: there must be an established process to investigate and mitigate issues in production.
At the automotive organization, the Model Operations Center is a centralized room of screens showing, in near real time, whether the models are receiving the data we expect and producing the results we expect.
Here is a simplified example of the kind of check such a dashboard runs: flag when usage or record counts drop below a set threshold.
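A minimal sketch of that check in code, assuming daily scoring volumes land in a pandas DataFrame (the column names, dates, and threshold are hypothetical):

```python
import pandas as pd

def check_volume_alert(daily_counts: pd.DataFrame, threshold: int) -> list[str]:
    """Flag days where the record count fed to the model drops below a
    set threshold -- the simplest form of an actionable alert."""
    low = daily_counts[daily_counts["record_count"] < threshold]
    return [
        f"ALERT {row.date}: {row.record_count} records (threshold {threshold})"
        for row in low.itertuples(index=False)
    ]

# Hypothetical scoring-volume data for one model.
counts = pd.DataFrame({
    "date": ["2023-01-09", "2023-01-10", "2023-01-11"],
    "record_count": [10_412, 9_876, 312],  # the last day looks broken
})

for alert in check_volume_alert(counts, threshold=5_000):
    print(alert)  # route to a channel someone owns, so follow-up is assured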
Beyond simple volume thresholds, monitoring metrics to consider for your models include input data drift, prediction distribution shifts, schema and data-quality violations, latency, error rates, usage, and, where ground truth arrives later, realized accuracy.
Innovation inherently creates risk, especially in the enterprise environment. Therefore, successfully leading innovation requires designing controls into the systems to mitigate risk. Being proactive can save a lot of headaches and time. MLOps teams should proactively anticipate and educate stakeholders on the risks and how to mitigate them.
Developing a proactive approach to governance helps avoid reacting to the needs of the business. Two key pieces of the strategy are controlling access to sensitive data and capturing lineage and metadata for visibility and audit.
Governance provides great opportunities for automation as teams scale. Waiting for data access is a constant momentum killer on data science projects. For example, a model now automatically determines whether a data set contains personally identifiable information, with 97% accuracy. Machine learning models also help with access requests, reducing processing time from weeks to minutes in 90% of cases.
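The organization's PII detector is a trained model; purely for illustration, a rule-based scan shows the shape of such a check. The patterns below are a hypothetical, deliberately incomplete baseline, not the 97%-accurate classifier described above.

```python
import re
import pandas as pd

# Hypothetical, deliberately incomplete PII patterns; a production detector
# would be a trained classifier, not a handful of regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_for_pii(df: pd.DataFrame, sample_size: int = 1000) -> dict[str, list[str]]:
    """Sample each text column and report which PII patterns it matches,
    so access to the data set can be gated before anyone starts waiting."""
    findings: dict[str, list[str]] = {}
    for column in df.select_dtypes(include="object").columns:
        sample = df[column].dropna().astype(str).head(sample_size)
        hits = [name for name, pattern in PII_PATTERNS.items()
                if sample.str.contains(pattern).any()]
        if hits:
            findings[column] = hits
    return findings

df = pd.DataFrame({"notes": ["call 555-123-4567", "ok"], "city": ["Columbus", "Detroit"]})
print(scan_for_pii(df))  # {'notes': ['phone']}
```

Even a crude scan like this, run automatically when a data set is registered, turns a weeks-long manual review into a triage step.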
The other piece is tracking metadata throughout the model’s life cycle. Scaling machine learning means growing the model portfolio while maintaining trust in the models themselves: MLOps at scale requires quality, security, and control built in to avoid issues and bias in production.
Teams can get caught up in the theory and opinions around governance. The best course of action is to start with clear controls over who can access which data and models.
From there, metadata capture and automation are key. Wherever possible, leverage pipelines or other automation to capture this information automatically and avoid manual processing and inconsistencies. For each model, that typically means recording the training data set and its version, the code version or commit, hyperparameters, evaluation metrics, the owner, approval or sign-off status, and deployment history.
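As one illustration of automated capture, assuming MLflow as the tracking store (a common choice, though not necessarily the one used here), a training pipeline might record that metadata in one place; all tag names and values below are hypothetical:

```python
import mlflow

# A minimal sketch of automated metadata capture at training time,
# assuming an MLflow tracking store; tags and values are illustrative.
with mlflow.start_run(run_name="churn-model-2023-01"):
    mlflow.set_tags({
        "owner": "mlops-team@example.com",                 # accountable team
        "training_data": "s3://bucket/churn/2023-01-01",   # data lineage
        "code_version": "git:abc1234",                     # exact code used
        "approved_by": "model-risk-review",                # governance sign-off
    })
    mlflow.log_params({"max_depth": 6, "learning_rate": 0.1})  # hyperparameters
    mlflow.log_metric("validation_auc", 0.91)              # evaluation evidence
```

Because the pipeline writes these fields on every run, an audit becomes a query rather than an archaeology project.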
Many technical teams fall into the trap of thinking: “if we build it, they will come.” Building the solution is only part of the job; increasing organizational impact also means sharing and advocating for it. MLOps teams need to share best practices for the unique problems posed by your organization’s tools, data, models, and stakeholders.
Anyone on the MLOps team can be an evangelist by partnering with business stakeholders to showcase their success stories. Examples from your own organization illustrate the benefits and opportunities most clearly.
People across the organization looking to industrialize AI need education, documentation, and other support. Lunch and Learns, onboarding, and mentorship programs are great places to start. As your organization scales, more formalized learning and onboarding programs with supporting documentation can accelerate your organization’s transformation.
MLOps teams and leaders face a mountain of opportunities while balancing the pressing needs of industrializing models. Each organization faces different challenges, given its data, models, and technologies. If MLOps were easy, we probably would not like working on the problem.
We hope this playbook helps generate new ideas and areas for your team to explore. The first step is to list the opportunities for your team in 2023, then prioritize them by what will have the most significant impact on the business. Teams can also define and measure their maturity against emerging benchmarks; this guide from Google can provide a framework and maturity milestones for your team.