3 Ways to Deploy Machine Learning Models in Production

Deploy ML models and make them available to users or other components of your project.

Working with data is one thing, but deploying a machine-learning model to production can be another.

Data engineers are always looking for new ways to deploy their machine-learning models to production. They want the best performance, and they care about how much it costs.

Well, now you can have both!

Let’s take a look at the deployment process and see how we can do it successfully!

Grab your aromatic coffee (or tea) and get ready…!

How to deploy a machine learning model in production?

Most data science projects deploy machine learning models as an on-demand prediction service or in batch prediction mode. Some modern applications deploy embedded models in edge and mobile devices.

Each approach has its own merits. For example, in the batch scenario, optimizations are done to minimize model compute costs. There are fewer dependencies on external data sources and cloud services. The local processing power is sometimes sufficient for computing algorithmically complex models.

It is also easier to debug an offline model when failures occur, or to tune its hyperparameters, since it runs on powerful servers.

On the other hand, web services can provide cheaper and near real-time predictions. Availability of CPU power is less of an issue if the model runs on a cluster or cloud service. The model can be easily made available to other applications through API calls and so on.

One of the main benefits of embedded machine learning is that we can customize it to the requirements of a specific device.

We can easily deploy the model to a device, and its runtime environment cannot be tampered with by an external party. A clear drawback is that the device needs to have enough computing power and storage space.

Deploying machine learning models as web services.

The simplest way to deploy a machine learning model is to create a web service for prediction. In this example, we use the Flask web framework to wrap a simple random forest classifier built with scikit-learn.

To create a machine learning web service, you need at least three steps.

The first step is to create a machine learning model, train it and validate its performance. The following script will train a random forest classifier. Model testing and validation are not included here to keep it simple. But do remember those are an integral part of any machine learning project.

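import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('titanic.csv')

# Use every column except the target as a feature. One-hot encode the
# categorical columns and fill missing values so the forest can train.
x = pd.get_dummies(df[df.columns.difference(['Survived'])]).fillna(0)
y = df['Survived']
classifier = RandomForestClassifier()
classifier.fit(x, y)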
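
The training script above leaves out validation. As a hedged sketch of what that step might look like, the snippet below holds out 20% of the rows with scikit-learn's `train_test_split` and scores the forest with `accuracy_score`. It uses synthetic data from `make_classification` as a stand-in for the Titanic features; that substitution, and the names `X_train`, `X_test`, and `test_accuracy`, are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out 20% of the rows; stratify so both splits keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Score only on rows the model never saw during training.
test_accuracy = accuracy_score(y_test, classifier.predict(X_test))
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Cross-validation (for example scikit-learn's `cross_val_score`) gives a more stable estimate when data is scarce, at the cost of training several models.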

In the next step, we need to persist the model. The environment where we deploy the application is often different from where we train it. Training usually requires a different set of resources. Thus this separation helps organizations optimize their budget and efforts.

Scikit-learn models support Python-specific serialization (pickle, or the joblib library) that makes model persistence and restoration effortless. The following is an example of how we can store the trained model in a pickle file.

# joblib is a standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
import joblib

joblib.dump(classifier, 'classifier.pkl')
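
Because the model is trained in one environment and served in another, it can be worth checking that a saved file restores to an identical model. The sketch below assumes a small synthetic dataset and writes to a temporary directory rather than the article's `classifier.pkl`; it dumps a forest with `joblib`, loads it back, and compares predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
classifier = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifier.pkl')
    joblib.dump(classifier, path)   # serialize the trained model to disk
    restored = joblib.load(path)    # restore it, as the serving side would

# The restored model should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(classifier.predict(X), restored.predict(X)))
print(same_predictions)  # True
```

joblib files are pickles, so load them only from sources you trust, and ideally restore them with the same scikit-learn version that saved them.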

Finally, we can serve the persisted model using a web framework. The following code creates a REST API using Flask. This file is hosted in a different environment, often on a cloud server.

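import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the persisted model once at startup instead of on every request.
classifier = joblib.load('classifier.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    json_ = request.json                  # expects a list of JSON records
    query_df = pd.DataFrame(json_)
    query = pd.get_dummies(query_df).fillna(0)
    # Align the one-hot columns with those seen during training
    # (feature_names_in_ exists when the model was fit on a DataFrame).
    query = query.reindex(columns=classifier.feature_names_in_, fill_value=0)
    prediction = classifier.predict(query)
    # tolist() turns NumPy integers into JSON-serializable Python ints.
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(port=8080)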

The above code takes input in a POST request through http://localhost:8080/predict and returns the prediction in a JSON response.
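
One detail in the API deserves attention: `pd.get_dummies` only creates columns for the categories present in the incoming request, so a single-row query rarely has the same columns the model was trained on. A minimal pandas sketch, assuming hypothetical `Sex` and `Fare` columns, shows the mismatch and one way to fix it with `DataFrame.reindex`:

```python
import pandas as pd

# Columns produced at training time (hypothetical two-feature example).
train_features = pd.get_dummies(
    pd.DataFrame({'Sex': ['male', 'female'], 'Fare': [7.25, 71.3]})
)
training_columns = list(train_features.columns)  # ['Fare', 'Sex_female', 'Sex_male']

# A single request contains only one category, so 'Sex_male' is missing.
request_df = pd.get_dummies(pd.DataFrame([{'Sex': 'female', 'Fare': 30.0}]))
print(list(request_df.columns))  # ['Fare', 'Sex_female']

# Add the missing dummy columns as 0 and reorder them to match training.
aligned = request_df.reindex(columns=training_columns, fill_value=0)
print(aligned)
```

Saving `training_columns` next to the model (or reading `feature_names_in_` from a scikit-learn estimator fitted on a DataFrame) keeps the serving code in sync with training.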

Deploying machine learning models for batch prediction.

While online models serve predictions on demand, batch predictions are sometimes preferable.

Offline models can be optimized to handle a high volume of job instances and run more complex models. In batch prediction mode, you don’t need to worry about scaling or managing servers either.

Batch prediction can be as simple as calling the predict function with a data set of input variables. The following command does it.

prediction = classifier.predict(UNSEEN_DATASET)
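
For data sets too large to load at once, the same `predict` call can be applied chunk by chunk. The sketch below assumes a hypothetical model trained on two numeric columns, `a` and `b`, and uses an in-memory CSV in place of `UNSEEN_DATASET`; `pd.read_csv(..., chunksize=250)` yields DataFrames of at most 250 rows.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical model trained on two numeric features, 'a' and 'b'.
train = pd.DataFrame(rng.normal(size=(200, 2)), columns=['a', 'b'])
labels = (train['a'] + train['b'] > 0).astype(int)
classifier = LogisticRegression().fit(train, labels)

# In-memory stand-in for a large CSV of unseen rows (normally a file path).
unseen = pd.DataFrame(rng.normal(size=(1000, 2)), columns=['a', 'b'])
unseen_csv = io.StringIO(unseen.to_csv(index=False))

# Score the file 250 rows at a time so the input is never loaded all at once.
predictions = []
for chunk in pd.read_csv(unseen_csv, chunksize=250):
    predictions.append(classifier.predict(chunk))

prediction = np.concatenate(predictions)
print(prediction.shape)  # (1000,)
```

In a scheduled job, each chunk's predictions could also be appended to the output file as they are produced instead of being collected in a list.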

Sometimes you will have to schedule the training or prediction in the batch processing method. There are several ways to do this. My favorite is to use either Airflow or Prefect to automate the task.

from datetime import timedelta, datetime

import joblib
import pandas as pd

# This uses the Prefect 1.x API (Flow context manager, prefect.schedules).
from prefect import task, Flow
from prefect.schedules import IntervalSchedule

@task(max_retries=3, retry_delay=timedelta(minutes=5))
def predict(input_data_path: str):
    """Load the saved model and the input data, and return predictions.

    If it fails, this task retries 3 times at 5-minute intervals
    before failing permanently.
    """
    classifier = joblib.load('classifier.pkl')
    df = pd.read_csv(input_data_path)
    prediction = classifier.predict(df)
    return prediction.tolist()

@task(max_retries=3, retry_delay=timedelta(minutes=5))
def save_prediction(data, output_data_path: str):
    """Save the predictions to an output file, one per line.

    If it fails, this task retries 3 times before failing permanently.
    """
    with open(output_data_path, 'w') as f:
        f.write('\n'.join(str(p) for p in data))

# Create a schedule object.
# It starts 5 seconds after the script runs and repeats once a week.
schedule = IntervalSchedule(
    start_date=datetime.utcnow() + timedelta(seconds=5),
    interval=timedelta(weeks=1),
)

# Attach the schedule object and orchestrate the workflow.
with Flow("predictions", schedule=schedule) as flow:
    prediction = predict("./input_data.csv")
    save_prediction(prediction, "./output_data.csv")

# Start the scheduler; the flow runs at every scheduled time.
flow.run()

The above script schedules prediction on a weekly basis starting from 5 seconds after the script execution. Prefect will retry the tasks 3 times if they fail.
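
To make the retry settings concrete, here is a plain-Python sketch of the same policy: one initial attempt plus up to `max_retries` retries, waiting `retry_delay` between them. It illustrates the semantics only and is not Prefect's implementation; `run_with_retries` and `flaky_task` are hypothetical names, and the `sleep` parameter is injectable so the example runs instantly.

```python
import time
from datetime import timedelta


def run_with_retries(fn, max_retries=3, retry_delay=timedelta(minutes=5), sleep=time.sleep):
    """Call fn(); on an exception, retry it up to max_retries more times.

    Hypothetical helper illustrating retry semantics (not Prefect's code):
    fn runs at most 1 + max_retries times, waiting retry_delay between
    attempts, and the last exception is re-raised if every attempt fails.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: fail permanently
            attempt += 1
            sleep(retry_delay.total_seconds())


# Usage: a task that fails twice with a transient error, then succeeds.
calls = []
sleeps = []


def flaky_task():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient failure")
    return "ok"


# Record the requested delays instead of actually sleeping.
result = run_with_retries(flaky_task, sleep=sleeps.append)
print(result, len(calls), sleeps)  # ok 3 [300.0, 300.0]
```

With `max_retries=3`, a task that always fails is attempted four times in total before the error propagates.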

However, building the model may require multiple stages in the batch-processing framework. You need to decide what features are required and how you should construct the model for each stage.

Train the model on a high-performance computing system with an appropriate batch-processing framework.

Usually, you partition the training data into segments that are processed sequentially, one after the other. You can do this by splitting the dataset using a sampling scheme (e.g., balanced sampling, stratified sampling) or via a distributed processing model (e.g., MapReduce).

The partitions can be distributed to multiple machines, but they must all load the same set of features. Feature scaling is recommended. If you used unsupervised pre-training (e.g., autoencoders) for transfer learning, you must apply the same pre-trained transformation to each partition.

After all the stages have been executed, you can predict unseen data with the resulting model by iterating sequentially over the partitions.
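
The article does not name a specific framework for this. As one concrete, hedged illustration of training on partitions processed one after another, the sketch below uses scikit-learn's incremental `partial_fit` API (a swapped-in technique, not necessarily the author's). It creates stratified partitions, fits one shared `StandardScaler` across all of them so every partition is scaled identically, trains an `SGDClassifier` partition by partition, and then predicts by iterating over the partitions. The synthetic data and the choice of four partitions are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a large training set (assumption for illustration).
X, y = make_classification(
    n_samples=2000, n_features=10, n_clusters_per_class=1,
    class_sep=2.0, random_state=0,
)

# Split the rows into 4 disjoint stratified partitions (the test folds).
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
partitions = [test_idx for _, test_idx in skf.split(X, y)]

# Pass 1: fit one shared scaler over every partition, so all partitions
# are transformed with identical parameters.
scaler = StandardScaler()
for idx in partitions:
    scaler.partial_fit(X[idx])

# Pass 2: train incrementally, one partition after the other.
classes = np.unique(y)
model = SGDClassifier(random_state=0)
for idx in partitions:
    model.partial_fit(scaler.transform(X[idx]), y[idx], classes=classes)

# Predict by iterating sequentially over the partitions.
predictions = np.empty_like(y)
for idx in partitions:
    predictions[idx] = model.predict(scaler.transform(X[idx]))

accuracy = float((predictions == y).mean())
print(f"Accuracy over all partitions: {accuracy:.3f}")
```

Only one partition needs to be in memory at a time during each pass, which is what makes this pattern useful when the full data set does not fit on one machine.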

Deploying machine learning models on edge devices as embedded models.

Computing on edge devices such as mobile and IoT has become very popular in recent years. The benefits of deploying a machine learning model on edge devices include, but are not limited to:

  • Reduced latency, as the device is likely to be closer to the user than a distant server.
  • Reduced data bandwidth consumption, as we ship processed results back to the cloud instead of bulky raw data that would need more bandwidth.

Edge devices, such as mobile and IoT devices, have limited computation power and storage capacity due to the nature of their hardware. We cannot simply deploy machine learning models to these devices directly, especially if our model is big or requires extensive computation to run inference on them.

Instead, we should simplify the model using techniques such as quantization and aggregation while preserving as much accuracy as possible. These simplified models can be deployed efficiently on edge devices with limited computation, memory, and storage.
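
To see what quantization buys, consider 8-bit affine quantization: each float weight is stored as an int8 code `q` with a shared `scale` and `zero_point`, and approximately recovered as `scale * (q - zero_point)`. TensorFlow Lite's int8 scheme uses this same affine form. The NumPy sketch below (random weights and hypothetical helper names) quantizes a weight vector, measures the reconstruction error, and compares storage size.

```python
import numpy as np


def quantize_int8(weights):
    """Affine int8 quantization: weights ≈ scale * (q - zero_point)."""
    # Include 0.0 in the range so that zero is represented exactly.
    w_min = min(float(weights.min()), 0.0)
    w_max = max(float(weights.max()), 0.0)
    scale = (w_max - w_min) / 255.0 or 1.0  # guard against all-zero weights
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return scale * (q.astype(np.float64) - zero_point)


rng = np.random.default_rng(0)
weights = rng.normal(size=1000)  # hypothetical layer weights

q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

max_error = float(np.max(np.abs(restored - weights)))
print(f"max error {max_error:.5f} <= scale / 2 = {scale / 2:.5f}")
print(weights.astype(np.float32).nbytes, "bytes as float32 vs", q.nbytes, "bytes as int8")
```

Each weight shrinks from 4 bytes (float32) to 1 byte, and the rounding error stays within half a quantization step (`scale / 2`), which is why accuracy often drops only slightly.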

We can use the TensorFlow Lite library to convert our TensorFlow model into a lighter format for Android and other devices. TensorFlow Lite is an open-source software library for mobile and embedded devices that does what the name says: it runs TensorFlow models on mobile and embedded platforms.

The following example converts a Keras TensorFlow model.

import numpy as np
import tensorflow as tf

# create and train a keras neural network
classifier = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=1, input_shape=[1]),
    tf.keras.layers.Dense(units=28, activation='relu'),
    tf.keras.layers.Dense(units=1),  # single output unit to match the scalar target
])
classifier.compile(optimizer='sgd', loss='mean_squared_error')
# Keras expects 2-D inputs of shape (samples, features)
classifier.fit(x=np.array([[-1.0], [0.0], [1.0]]), y=np.array([-3.0, -1.0, 1.0]), epochs=5)

# Convert the model to a TensorFlow Lite object
converter = tf.lite.TFLiteConverter.from_keras_model(classifier)
tfl_classifier = converter.convert()

# Save the model as a .tflite file
with open('classifier.tflite', 'wb') as f:
    f.write(tfl_classifier)

You can read the newly created .tflite file on any platform that TensorFlow Lite supports: Android, iOS, and Linux (including Raspberry Pi).
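
To check that a converted model behaves like the original, you can run it with TensorFlow Lite's Python `tf.lite.Interpreter`. The sketch below is self-contained: it rebuilds a small Keras model similar to the one above (the architecture and the input value 2.0 are assumptions), converts it in memory, runs one inference, and compares the result with Keras. To load the saved file instead, pass `model_path='classifier.tflite'` to the interpreter.

```python
import numpy as np
import tensorflow as tf

# Rebuild and convert a small model so this sketch runs on its own.
classifier = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(units=28, activation='relu'),
    tf.keras.layers.Dense(units=1),
])
classifier.compile(optimizer='sgd', loss='mean_squared_error')
classifier.fit(np.array([[-1.0], [0.0], [1.0]]), np.array([-3.0, -1.0, 1.0]),
               epochs=5, verbose=0)
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(classifier).convert()

# Load the converted flatbuffer into the TensorFlow Lite interpreter.
interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]['index']
output_index = interpreter.get_output_details()[0]['index']

# Run one inference on a float32 input of shape (1, 1).
x = np.array([[2.0]], dtype=np.float32)
interpreter.set_tensor(input_index, x)
interpreter.invoke()
tflite_output = interpreter.get_tensor(output_index)

# The unquantized TensorFlow Lite model should closely match Keras.
keras_output = classifier.predict(x, verbose=0)
print(tflite_output, keras_output)
```

Setting `converter.optimizations = [tf.lite.Optimize.DEFAULT]` before `convert()` applies post-training quantization; the outputs may then differ slightly from Keras, so compare with a looser tolerance.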

For examples and tutorials on deploying on separate platforms, please do check out the TensorFlow Lite documentation.

Final thoughts

Training a machine learning model is only one aspect of a data science project. Data scientists also put a lot of effort into deploying models in a production environment.

We’ve discussed three different methods to deploy machine learning models and their merits. Depending on your application, you may have to choose one of the options available.

Of course, this post is only the tip of the iceberg. But I trust it has given you a starting point to explore further.

ML models in production need plenty of other aftercare, such as periodic model evaluation. But that is a topic for another post.

Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on LinkedIn, Twitter, and Medium.
