What’s So Special About Data?
For Machine Learning to take place, and for AI to be effective, it needs high-quality data, and as much of it as it can get. Data is the food that feeds AI: if the food is bad or scarce (or both), the AI will perform very poorly.
Data can be just that: numbers. It can also be words, images, sounds, videos and so on; the model converts it all into numbers anyway. This data is pushed through the training process repeatedly to train the model. But not all of the data is used for training: some is set aside for validation and some for testing.
Training Datasets
All the data is jumbled up and the first 80% is drawn out of the pot to train the model on. This is the key moment. You train until the loss, which is the difference between the model’s own predictions and the target answers, is almost zero (which is what you want). But you may have a problem called overfitting, where the model has simply memorised the answers. So, to find out whether it has overfitted, you test it on the validation dataset.
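The shuffle-and-split step above can be sketched in a few lines. This is a minimal, hypothetical illustration (the function name and the 80/10/10 ratio are a common convention, not something fixed):

```python
import random

def split_dataset(samples, train=0.8, val=0.1, seed=42):
    """Shuffle the data, then carve off train/validation/test slices.
    The 80/10/10 split is a common convention, not a hard rule."""
    data = list(samples)
    random.Random(seed).shuffle(data)       # jumble everything up
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return (data[:n_train],                 # training set (80%)
            data[n_train:n_train + n_val],  # validation set (10%)
            data[n_train + n_val:])         # test set (the rest, 10%)

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Because the shuffle uses a fixed seed, the same split is reproducible from run to run, which matters when you retrain and compare models.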
Validation Datasets
This is half (10%) of the remaining dataset. It acts as a dry run to check that the model hasn’t just memorised the training data, which is especially important if the loss function gives you incredible results.
When the model has overfitted, showing it some brand-new data will reveal the problem very quickly. If it has, you go back to your model and tweak the hyperparameters: things you can change such as the number of hidden layers, the number of nodes on each layer, and other settings as well.
When you have done that (maybe many times), you check and recheck against the validation data to see if the model works. When you are satisfied, you can do a final check with the testing dataset.
Testing Datasets
This is the remaining 10% of the dataset, which you set aside at the start. If the model doesn’t give good results here, you may have a much bigger problem on your hands: not enough data, poor-quality data, or data so inconclusive that the model simply cannot learn anything useful. It cannot make patterns up if there aren’t any.
Energy Efficiency
Training a model on a very large dataset with thousands of images is very energy demanding. There are a huge number of calculations (remember, this is just an algorithm) to perform. Some models can take hours or even days to train, only for you to find that the results were poor.
GPUs
One of the reasons for the surge in the development and use of Machine Learning is the availability of GPUs originally built for gaming. These became more and more powerful as games grew increasingly sophisticated and image-rich. Data scientists realised they could now do far more with the data they had, and do it much faster.
Making Fake Datasets
One way to get the data you need is to either create synthetic datasets or augment what you already have. Creating synthetic data might mean making a model of a car in, say, Blender3D and then taking images of it from different angles, in different colours, or against different backgrounds.
Augmenting data means taking a set of images and rotating them, using only part of each image, or mirroring them, so that you generate more images (data) than you started with. You could easily double or triple the amount of data this way.
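The mirroring and rotating described above can be sketched with an image represented as nested lists of pixel values. The helper names are my own; a real pipeline would use an image library, but the idea is the same:

```python
def flip_horizontal(image):
    """Mirror each row of a 2-D image (nested lists of pixel values)."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(images):
    """Triple the dataset: for each image, keep the original and add
    a mirrored copy and a rotated copy."""
    out = []
    for img in images:
        out.extend([img, flip_horizontal(img), rotate_90(img)])
    return out

tiny = [[1, 2],
        [3, 4]]                 # a 2x2 "image"
augmented = augment([tiny])
print(len(augmented))           # 3 images generated from 1
```

Each transform preserves the label (a mirrored tree is still a tree), which is exactly why augmentation is such a cheap way to grow a dataset.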
Conclusion
Data is king. Lots of it, and of high quality, is the key. It is also the most expensive part, because it often requires humans to check through or annotate it. You might, for example, draw a box round every item in a picture and label it, so that the AI can learn what a tree, a car or a person looks like.
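A single box-and-label annotation of the kind described above might look like the record below. The field names loosely follow a common convention (similar to the COCO format), but this is an illustrative sketch, not a real annotation file:

```python
# One bounding-box annotation: the box is [x, y, width, height] in pixels.
annotation = {
    "image_id": 17,
    "category": "tree",
    "bbox": [34, 120, 60, 200],  # x, y, width, height
}

def box_area(ann):
    """Area of the labelled box, a common sanity check on annotations
    (a zero-area box usually means a labelling mistake)."""
    _, _, w, h = ann["bbox"]
    return w * h

print(box_area(annotation))  # 12000
```

A dataset for object detection is, at heart, just thousands of images each paired with a list of records like this one, and producing them is the human labour that makes data so expensive.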
This is an excellent video showing, in more detail, how a neural network works.