Data is a first-class citizen in machine learning projects. It is a critical dependency for every other component in the system. Hence, getting it right is the top priority for any ML engineer.
But this task isn't as easy as it sounds.
In theory, cleaning, verifying, and validating data is an orchestrated process built from well-defined methods. You can learn techniques such as removing null values and duplicates and standardizing continuous variables.
But in reality, data is the most difficult yet most rewarding challenge in an ML project.
Any experienced data scientist would attest that most of their time is spent on data preparation. It's even more challenging when your ML models are in production.
According to Forbes, about 60% of the effort is spent cleaning and organizing data. Also, 57% of data scientists hate doing it.
This article discusses five significant challenges ML engineers face in production systems. By the end of this post, you'll understand what makes managing data in production systems hard.
Related: Machine Learning Systems in Real-Life Production vs. Research/Academic Settings.
Labeling rapidly growing datasets.
Most supervised systems need labeled data to learn from. But labels are not always inherent in the system.
A customer purchase dataset of an eCommerce site is pretty simple. We don't have to put much effort into determining whether the customer has made a purchase. The label is inherent.
But an image recognition system that detects wildlife activity can be more challenging. The label is not inherent. A human needs to go through the dataset and label each image manually.
It gets trickier for cases like tumor detection. You need a trained healthcare professional to go through your dataset and label each record. Their time is scarce and costly.
The challenge in production is that the system continuously produces new data points that are hard to predict. Someone needs to label them manually and keep doing so until the end of the model's life cycle. Without new labels, the model cannot adapt to concept drift and data drift.
One of the most practical ways to label data at this scale is to use an active learning model. In this approach, a model trained to label datasets automatically labels the straightforward examples. The ambiguous ones are sent to a human labeler, who labels them manually. Those labels are then fed back to the active learning model.
Active learning may sound trivial. But it has proven effective in handling large amounts of live data.
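The routing step described above can be sketched in a few lines. This is only an illustration, not a full active learning setup: the `route_for_labeling` function, the `toy_predictor` stand-in, and the 0.9 confidence threshold are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a model maps an item to {label: probability}.
Predictor = Callable[[str], Dict[str, float]]


def route_for_labeling(
    items: List[str],
    predict_proba: Predictor,
    threshold: float = 0.9,  # assumed cutoff; tune it for your own data
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split items into auto-labeled pairs and a queue for human review."""
    auto_labeled: List[Tuple[str, str]] = []
    needs_human: List[str] = []
    for item in items:
        probs = predict_proba(item)
        # Pick the most likely label and how confident the model is in it.
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            auto_labeled.append((item, label))
        else:
            needs_human.append(item)
    return auto_labeled, needs_human


def toy_predictor(item: str) -> Dict[str, float]:
    """Stand-in for a trained model; confident only on obvious file names."""
    if "deer" in item:
        return {"deer": 0.97, "fox": 0.03}
    if "fox" in item:
        return {"deer": 0.05, "fox": 0.95}
    return {"deer": 0.55, "fox": 0.45}


auto, queue = route_for_labeling(
    ["deer_001.jpg", "fox_007.jpg", "blurry_042.jpg"], toy_predictor
)
```

Here the two clear images are labeled automatically, and the blurry one lands in `queue` for a human. Once the human labels come back, you would add them to the training set and retrain the labeling model.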
Feature space coverage in training and production.
When your model is in production, one thing is sure. Its performance will degrade over time. When it does, you will resort to model retraining.
Sometimes, the retrained model uses a different feature space than the previous version. You may have new variables in your dataset. Or you may use feature engineering techniques to improve model performance.
Either way, changing circumstances demand model retraining. And model retraining may need structural changes to the model. When the model changes, it works on a different feature space.
Unlike in research settings, your production environment differs from the development environment. It's critical to ensure that an updated model receives the same features in production as it received in the dev environment.
Production systems often use feature stores and metadata stores to handle this. When your model changes, you update the features in the store. Because each model version is recorded along with its features, you can easily switch back to an earlier version if the new model doesn't do well.
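To make the idea concrete, here is a toy, in-memory sketch of what such a store keeps track of. It is not the API of any real feature store; the `ModelRegistry` class and the feature names are hypothetical and exist only for this example.

```python
from typing import Dict, List


class ModelRegistry:
    """Toy in-memory stand-in for a feature/metadata store (hypothetical)."""

    def __init__(self) -> None:
        self._schemas: Dict[str, List[str]] = {}
        self._history: List[str] = []

    def register(self, version: str, feature_names: List[str]) -> None:
        """Record the ordered features a version was trained on; make it active."""
        self._schemas[version] = list(feature_names)
        self._history.append(version)

    @property
    def active(self) -> str:
        return self._history[-1]

    def rollback(self) -> str:
        """Reactivate the previous version, e.g. if the new model underperforms."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self.active

    def build_input(self, row: Dict[str, float]) -> List[float]:
        """Order a production row to match the active model's training schema."""
        expected = self._schemas[self.active]
        missing = [name for name in expected if name not in row]
        if missing:
            raise ValueError(f"missing features for {self.active}: {missing}")
        return [row[name] for name in expected]


registry = ModelRegistry()
registry.register("v1", ["age", "basket_value"])
registry.register("v2", ["age", "basket_value", "days_since_last_visit"])

# Production has not started sending the new feature yet.
row = {"age": 34.0, "basket_value": 59.9}
try:
    registry.build_input(row)
    v2_ok = True
except ValueError:
    v2_ok = False

registry.rollback()
v1_input = registry.build_input(row)
```

The check fails loudly for `v2` because production does not supply the new feature, and after the rollback the same row is served to `v1` without any other change.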
Maintaining fewer dimensions.
To learn a complex decision space, you might need more features. Yet more features almost always mean more parameters.
The downside of having more parameters is that we need more computational power both in training and model serving. Each parameter needs to be optimized during the training phase. And during model serving, the input must go through each of them to make predictions.
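A quick back-of-the-envelope calculation shows how fast this grows. The sketch below assumes a network with a single fully connected hidden layer; the layer size of 64 and the feature counts are made-up numbers for illustration.

```python
def dense_network_params(n_features: int, hidden_units: int, n_outputs: int = 1) -> int:
    """Count weights plus biases for one fully connected hidden layer."""
    hidden = (n_features + 1) * hidden_units  # +1 for each hidden unit's bias
    output = (hidden_units + 1) * n_outputs   # +1 for each output's bias
    return hidden + output


small = dense_network_params(n_features=100, hidden_units=64)
large = dense_network_params(n_features=10_000, hidden_units=64)
print(small, large)  # 6529 640129
```

With the same architecture, going from 100 to 10,000 input features takes the model from about 6.5 thousand to about 640 thousand parameters. Every one of them has to be trained, and every one is touched on each prediction.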
For a single prediction, the computing power needed at serving time is rarely a problem. The training phase is what takes so long.
The real challenge appears when scalability matters. For a few hundred users, your model may perform well. But once you grow to thousands of concurrent requests, serving the model needs a ton of processing power.
You'd rarely have this problem in a research or academic setting, because those projects are not meant to serve users in a live system.
So it's better to apply a dimensionality reduction technique before you feed data to your model. For instance, you can use principal component analysis (PCA). PCA can reduce thousands, if not millions, of features to a few hundred variables, often without losing much information.
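To show the mechanics, here is a minimal PCA sketch in plain Python. It finds the principal directions with power iteration and deflation, which is one simple way to do it and an assumption of this example. In a real project you would most likely use a library implementation instead. The helper names and the tiny dataset are made up for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def center(data: Matrix) -> Matrix:
    """Subtract each column's mean so every feature has mean zero."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered: Matrix) -> Matrix:
    """Sample covariance matrix (d x d) of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def mat_vec(matrix: Matrix, vec: List[float]) -> List[float]:
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def top_eigenvector(cov: Matrix, iterations: int = 200) -> List[float]:
    """Power iteration: repeatedly multiply by cov and renormalize."""
    # A fixed start keeps the sketch deterministic; a random start is more robust.
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = mat_vec(cov, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # no variance left in any direction
            return [0.0] * len(cov)
        vec = [v / norm for v in nxt]
    return vec


def pca(data: Matrix, n_components: int) -> Matrix:
    """Project data onto its first n_components principal directions."""
    centered = center(data)
    cov = covariance(centered)
    d = len(cov)
    components: List[List[float]] = []
    for _ in range(n_components):
        comp = top_eigenvector(cov)
        eigenvalue = sum(c * v for c, v in zip(comp, mat_vec(cov, comp)))
        # Deflation: remove the variance already explained by this component.
        cov = [
            [cov[i][j] - eigenvalue * comp[i] * comp[j] for j in range(d)]
            for i in range(d)
        ]
        components.append(comp)
    return [
        [sum(x * c for x, c in zip(row, comp)) for comp in components]
        for row in centered
    ]


# Three features, but the second is twice the first and the third is constant,
# so a single component captures all the variance.
data = [[1.0, 2.0, 5.0], [2.0, 4.0, 5.0], [3.0, 6.0, 5.0], [4.0, 8.0, 5.0]]
reduced = pca(data, n_components=1)
```

Each row of `reduced` now holds one value instead of three, and because the original features were redundant, nothing is lost. Real datasets are rarely this clean, so you would normally check how much variance the kept components explain before settling on a number.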
Fairness in data.
Machine learning took off about a decade ago. But as more companies adopt AI solutions, we have run into many issues that are not straightforward. One of the key challenges here is to build responsible AI. That is, building machine learning models that behave more responsibly.
AI bias can lead to severe consequences. For instance, suppose you develop an ML model to detect a medical condition and train it on a dataset dominated by college students. Once you deploy it in production, many older adults might go undiagnosed.
Responsible AI is challenging because bias is often invisible during development. Most ethical issues remain undiscovered until some users start experiencing unfair treatment. It also takes work to define what's fair in each context. One user may find your predictions entirely fair, while another feels discriminated against.
But the reason for almost all AI bias is bias in the training data. The model learns from what you feed it. It's crucial to ensure that the input data is free from known biases.
A particularly vulnerable area is model retraining. As with many production deployments, the risk is considerably higher when this process is automated.
So ML engineers must ensure that any previously identified biases are correctly handled. This must happen at every stage of an ML project's life cycle.
Eliminating bias entirely may not be possible in a production system. But ensuring that all kinds of users are represented in your dataset is achievable. Still, when you're serving a large customer base, even that can be challenging.
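One simple safeguard is to check group representation before every retraining run, including automated ones. The sketch below is only an illustration: the `underrepresented_groups` function, the age buckets, the population shares, and the 50% tolerance are all hypothetical, and a representation check alone does not make a model fair.

```python
from collections import Counter
from typing import Dict, Iterable, List


def underrepresented_groups(
    group_labels: Iterable[str],
    expected_share: Dict[str, float],
    tolerance: float = 0.5,  # assumed: flag groups below 50% of expected share
) -> List[str]:
    """Return groups whose share of the data falls well below what is expected."""
    counts = Counter(group_labels)
    total = sum(counts.values())
    flagged: List[str] = []
    for group, expected in expected_share.items():
        actual = counts.get(group, 0) / total if total else 0.0
        if actual < expected * tolerance:
            flagged.append(group)
    return flagged


# Made-up training set dominated by young adults, as in the example above.
training_ages = ["18-25"] * 80 + ["26-60"] * 15 + ["60+"] * 5
# Made-up shares of each group among the users the model will serve.
population = {"18-25": 0.2, "26-60": 0.5, "60+": 0.3}

flagged = underrepresented_groups(training_ages, population)
```

In this toy dataset both older groups are flagged, so a retraining pipeline could stop and ask for more data instead of silently shipping a skewed model.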
Deploying machine learning into production systems comes with an array of different issues. Most research or academic projects rarely have to deal with them.
But as ML engineers, we must understand these challenges and take precautions as early as possible, because the model is only a tiny part of a production ML system.
This post aimed to walk you through such challenges. The ones we discussed here relate only to the data aspect of machine learning. We also need to talk about the software-side challenges of production ML. You'll find them in the following article.