Deep learning models are notorious for their appetite for data. The more data you can give them, the better they perform.
Unfortunately, in many real-life situations, you cannot simply give them more. You may not have enough data, or the data may be too expensive to collect.
For instance, if you're training a model to identify cancer cells, you need trained medical professionals to label the images, and their time is limited and expensive.
Further, some cancers are rare, so there may be very few patients. It would be difficult to find enough of them and take samples.
Thus, at some point you will inevitably have to work with tiny datasets. Yet deep learning on small datasets is especially challenging.
Why does deep learning need so much data?
Machine learning with small datasets often produces overfitted models, meaning there are too many parameters to learn from too few examples. As a result, the algorithm starts to memorize the training data rather than generalize from it.
Often you'd see excellent results on the training dataset. But on the validation dataset, you'd see much larger prediction errors.
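Here is a toy illustration of that gap between training and validation error. It is only a sketch: a 1-nearest-neighbor model stands in for a network that memorizes its inputs, and the ground truth y = 2x, the noise level, and the function names are all assumptions made up for this example.

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """Predict by copying the label of the closest training point."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]

def mean_squared_error(train_x, train_y, xs, ys):
    """Average squared error of the memorizing model on (xs, ys)."""
    errors = [(nearest_neighbor_predict(train_x, train_y, x) - y) ** 2
              for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable

def noisy_label(x):
    # Assumed ground truth y = 2x, plus measurement noise.
    return 2 * x + rng.gauss(0.0, 0.5)

# Ten training points, and ten unseen validation points in between them.
train_x = [i / 10 for i in range(10)]
train_y = [noisy_label(x) for x in train_x]
val_x = [x + 0.05 for x in train_x]
val_y = [noisy_label(x) for x in val_x]

train_mse = mean_squared_error(train_x, train_y, train_x, train_y)
val_mse = mean_squared_error(train_x, train_y, val_x, val_y)
print(f"training MSE:   {train_mse:.3f}")
print(f"validation MSE: {val_mse:.3f}")
```

The model reproduces every training label exactly, noise included, so its training error is zero. On the validation points it can only repeat that memorized noise, so the error is larger.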
The critical point here is that with more parameters, you need more data. Small datasets can only help train smaller models.
Deep learning models are powerful because they can learn complex relationships. They comprise many layers, and each layer learns a progressively more complex representation of the data.
The first layer might learn to detect simple patterns, such as edges. The second layer might learn to see patterns of those edges, such as shapes. The third layer might learn to identify objects made up of those shapes, and so on.
Each layer consists of a series of neurons. In a fully connected layer, every neuron is connected to every neuron in the previous layer.
All these layers and neurons mean there are a ton of parameters to optimize. So deep learning models have a lot of capacity, which is good. But it also means they are prone to overfitting, which is bad. As noted above, overfitting happens when a model captures too much noise in the training data and fails to generalize to new data.
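To get a feel for how quickly parameters add up, here is a small sketch that counts the weights and biases in a fully connected network. The layer sizes are hypothetical, picked only for illustration, and the function name is my own.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron
    return total

# Hypothetical network: 28x28 grayscale input, two hidden layers, 10 classes.
layer_sizes = [784, 128, 64, 10]
n_params = count_dense_parameters(layer_sizes)
print(n_params)  # prints 109386
```

Even this small network has over a hundred thousand parameters, far more than the number of examples in many small datasets.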
Deep learning models can detect very complex relationships with enough data. Yet, if you do not have enough data, the deep learning model will not be able to understand these complex relationships.
We must have enough data so that the deep learning model can learn.
But when collecting more data is not feasible, there are several techniques to work around the problem.
I've included links to some useful books in this article. I may earn a small commission on qualifying purchases when you buy something I recommend. But it never affects your price.
1. Transfer learning can help train deep learning models with small datasets.
Transfer learning is a machine learning technique that takes a model trained on one problem and uses it as a starting point to solve a related but different problem.
Transfer learning has proven successful in many instances. Many machine learning models running in production systems were originally trained for a different purpose.
When training deep learning models with small datasets is inevitable, it's best to start from a pretrained model.
Besides helping with smaller datasets, transfer learning also saves training time and cost. You may need only a few new examples for a model to adapt to your new domain.
Here's an example. You could take a model trained on a large dataset of dog images and use it as a starting point to train a model to identify dog breeds.
The hope is that the features learned by the first model can be reused, saving time and resources.
There is no rule of thumb on how different the two applications can be. But you can use transfer learning even if the original and new datasets differ.
For example, you could take a model trained on images of cats and use it as a starting point to train a model to identify types of camels. The hope here is that the first model's ability to detect four-legged animals may help it recognize camels.
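The core mechanic of transfer learning, keeping the learned features fixed and training only a small new part on your data, can be sketched in plain Python. This is a toy under loud assumptions: `frozen_features` is a hypothetical stand-in for the layers of a pretrained network, the six data points are made up, and the new "head" is a simple logistic regression. With a real framework you would load actual pretrained weights instead.

```python
import math

def frozen_features(point):
    """Stand-in for pretrained layers: a fixed, non-trainable feature map."""
    x0, x1 = point
    return [x0 + x1, x0 - x1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(points, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression head on top of the frozen features."""
    feats = [frozen_features(p) for p in points]
    weights, bias = [0.0, 0.0], 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for f, y in zip(feats, labels):
            p = sigmoid(weights[0] * f[0] + weights[1] * f[1] + bias)
            grad_w[0] += (p - y) * f[0] / n
            grad_w[1] += (p - y) * f[1] / n
            grad_b += (p - y) / n
        # Only the head is updated; frozen_features never changes.
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def predict(point, weights, bias):
    f = frozen_features(point)
    score = sigmoid(weights[0] * f[0] + weights[1] * f[1] + bias)
    return 1 if score >= 0.5 else 0

# A tiny, made-up dataset for the new task: only six labeled examples.
points = [(1, 2), (2, 1), (3, 0.5), (-1, -2), (-2, -1), (-0.5, -3)]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_head(points, labels)
correct = sum(predict(p, weights, bias) == y for p, y in zip(points, labels))
accuracy = correct / len(points)
print(f"accuracy on the six training examples: {accuracy:.2f}")
```

Because only three numbers (two weights and a bias) are trained, six examples are enough here. That is the appeal of reusing features: the part you still have to learn is small.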
You could refer to Transfer Learning for Natural Language Processing to learn more about transfer learning. You may also find Hands-On Transfer Learning with Python helpful if you are a Python programmer.
2. Try data augmentation
Data augmentation is a technique where you take your existing data and generate new, synthetic data from it.
For example, if you have a dataset of images of dogs, you could use data augmentation to generate new pictures of dogs.
You could do this by randomly cropping images, flipping them horizontally, adding noise, and several other techniques.
Data augmentation is beneficial when you have a small dataset.
By generating new data, you can artificially increase the size of your dataset and give your deep learning model more data to work with.
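The transformations mentioned above (random crops, horizontal flips, and added noise) can be sketched with nothing but the standard library. This is a minimal illustration, not a production pipeline: images are assumed to be lists of rows with pixel values between 0 and 1, and the function names, crop size, and noise level are my own choices.

```python
import random

def flip_horizontal(image):
    """Mirror each row of a 2-D image (a list of rows)."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def add_noise(image, sigma, rng):
    """Add Gaussian noise, keeping pixel values in [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]

def augment(image, n_copies, rng):
    """Create n_copies randomly transformed variants of one image."""
    variants = []
    for _ in range(n_copies):
        # Crop one row and one column off at a random offset.
        out = random_crop(image, len(image) - 1, len(image[0]) - 1, rng)
        if rng.random() < 0.5:
            out = flip_horizontal(out)
        variants.append(add_noise(out, 0.05, rng))
    return variants

rng = random.Random(42)  # fixed seed so the run is repeatable
image = [[0.0, 0.2, 0.4, 0.6],
         [0.1, 0.3, 0.5, 0.7],
         [0.2, 0.4, 0.6, 0.8],
         [0.3, 0.5, 0.7, 0.9]]
new_images = augment(image, n_copies=5, rng=rng)
print(len(new_images), "augmented images of size",
      len(new_images[0]), "x", len(new_images[0][0]))
```

In practice you would likely use the augmentation utilities of your deep learning framework, but the idea is the same: each pass over the data sees a slightly different version of every image.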
These lecture notes on deep learning are a great starting point for learning more about data augmentation.
3. Use autoencoders
Autoencoders are a type of deep learning model that learns low-dimensional representations of data.
Autoencoders are beneficial when you have a small dataset because they can learn to compress your data into a lower-dimensional space.
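To make the compression idea concrete, here is a deliberately tiny sketch: a linear autoencoder that squeezes 2-D points into a single number and reconstructs them. It is not a deep model and not a variational autoencoder. The data, the starting weights, the learning rate, and the function names are all assumptions chosen for this illustration.

```python
def encode(x, enc):
    """Compress a 2-D point into a single number."""
    return enc[0] * x[0] + enc[1] * x[1]

def decode(z, dec):
    """Reconstruct a 2-D point from its 1-D code."""
    return [dec[0] * z, dec[1] * z]

def reconstruction_loss(data, enc, dec):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, enc), dec)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

def train_autoencoder(data, steps=500, lr=0.05):
    """Fit encoder and decoder weights with plain gradient descent."""
    enc, dec = [0.5, 0.5], [0.5, 0.5]  # arbitrary starting weights
    n = len(data)
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(x, enc)
            x_hat = decode(z, dec)
            err = [x_hat[0] - x[0], x_hat[1] - x[1]]
            # Error signal passed back through the decoder to the code z.
            back = err[0] * dec[0] + err[1] * dec[1]
            for i in range(2):
                g_dec[i] += 2 * err[i] * z / n
                g_enc[i] += 2 * back * x[i] / n
        enc = [w - lr * g for w, g in zip(enc, g_enc)]
        dec = [w - lr * g for w, g in zip(dec, g_dec)]
    return enc, dec

# Made-up data: 2-D points that all lie on the line y = 2x,
# so one number per point is enough to describe them.
data = [[t, 2 * t] for t in (-1.0, -0.5, 0.5, 1.0)]

initial_loss = reconstruction_loss(data, [0.5, 0.5], [0.5, 0.5])
enc, dec = train_autoencoder(data)
final_loss = reconstruction_loss(data, enc, dec)
print(f"loss before training: {initial_loss:.4f}")
print(f"loss after training:  {final_loss:.6f}")
```

The network learns that the two coordinates are redundant and stores each point as one value. Real autoencoders do the same thing with many layers and nonlinear activations.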
There are many different types of autoencoders. The variational autoencoder (VAE) is a popular one. VAEs are generative models, which means they can generate new data.
This is beneficial because you can use a VAE to generate new data points similar to your training data. This is a great way to increase your dataset size without having to collect more data.
These are just a few techniques to overcome the small data problem.
Of course, the best solution is to collect more data. But, if you're working with a small dataset, these techniques can help you build a deep learning model that can generalize well.
But did you know that traditional models may outperform deep learning models on small datasets?
In some cases, it may be better to use a traditional machine learning model, such as a support vector machine or a decision tree. Experimenting with different models and seeing what works best for your problem is essential.