Machine learning is becoming more and more popular in applications: think of the intelligent recommendations on YouTube or Netflix, or live text translation in Google Translate. Combining the power of mobile, artificial intelligence and machine learning leads to a great user experience. However, since training models is a computationally expensive process and smartphones are low-power devices, machine learning for mobile will inevitably require training on a local computer or a server.
Accurate modern object recognition models may contain millions of parameters. For example, Google’s Inception-v3 model, shown in [Fig. 1], where one block represents one layer, is able to distinguish between a spotted salamander and a fire salamander [Fig. 2].
Fig. 1: Inception-v3 diagram
Fig. 2: Photos of spotted and fire salamander
Unfortunately, training such complex models requires huge computing power; for example, Inception-v3 requires two weeks of training on 8 NVIDIA Tesla K40 graphics cards. To speed things up, Google has released a pre-trained version of the Inception model that can be adapted to a new task. This process is called transfer learning: the existing weights of the last layers are retrained to recognize new objects, which is far easier than training the whole model. It’s not as accurate as training from scratch, but it is surprisingly effective for many applications. Best of all, it can achieve satisfactory results in approximately 30 minutes on a laptop, without requiring a GPU.
Inception-v3 is a great model, but it is slow and bulky for mobile devices. It takes up a lot of storage and memory (almost 100 MB), and processing a single 224×224 input image takes 200-300 ms on a decent phone (Nexus 5). Fortunately, Google has also released models optimized for mobile: "MobileNet".
MobileNet models can be compared using two figures:
MACs (multiply-accumulate operations) – proportional to the required computing power,
parameters – proportional to memory usage.
Additionally, every model comes with normal and quantized weights. A quantized model uses 8-bit weights instead of 32-bit ones. As a result, the model is up to 75% smaller (at the cost of slightly worse accuracy), and thanks to the 8-bit computation, the processing time is shorter as well.
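The 75% figure follows directly from the weight width: 8 bits is a quarter of 32 bits. Here is a minimal sketch of that arithmetic; the parameter count is a made-up number used only for illustration, and the function counts the weights alone, not the rest of the model file.

```python
def model_size_mb(num_parameters, bits_per_weight):
    """Approximate size of the weights alone, in megabytes (10^6 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1_000_000


# Hypothetical parameter count, chosen only to keep the numbers round.
num_parameters = 4_000_000

float_size = model_size_mb(num_parameters, 32)     # normal 32-bit weights
quantized_size = model_size_mb(num_parameters, 8)  # quantized 8-bit weights
reduction = 1 - quantized_size / float_size

print(f"normal: {float_size:.1f} MB, quantized: {quantized_size:.1f} MB, "
      f"reduction: {reduction:.0%}")
# prints: normal: 16.0 MB, quantized: 4.0 MB, reduction: 75%
```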
GATHER TRAINING DATA
To get started, we need training data for the objects we want to recognize: at least 1000 images of every object. To speed this up, we can record a movie and split it into frames. To do that, I will use FFmpeg.
If the movie resolution is high, we should reduce it first. With FFmpeg we can call the command below:
If we pass desired_width as 500, it will scale the width down to 500 px, and because the height is passed as -1, FFmpeg will automatically adjust it to maintain the aspect ratio.
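As a rough sketch of the scaling step, the snippet below builds such an FFmpeg call in Python using the scale video filter. The file names are hypothetical, and the helper name build_scale_command is mine, not part of FFmpeg.

```python
import shlex


def build_scale_command(input_path, output_path, desired_width):
    """Build an FFmpeg call that scales a movie to the given width.

    A height of -1 asks the scale filter to keep the aspect ratio.
    """
    return [
        "ffmpeg",
        "-i", input_path,
        "-vf", f"scale={desired_width}:-1",
        output_path,
    ]


# Hypothetical file names, for illustration only.
command = build_scale_command("input.mp4", "output.mp4", 500)
print(shlex.join(command))
# prints: ffmpeg -i input.mp4 -vf scale=500:-1 output.mp4

# To actually run it (FFmpeg must be installed and on the PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Note that some video encoders only accept even dimensions, so if FFmpeg complains about the computed height, passing -2 instead of -1 is worth a try.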
Finally, we can split it with:
If the movie is recorded at 30 fps and we pass an fps value of:
30 – it will return images of every frame,
15 – it will return every second frame,
1 – it will return one frame every second of the movie.
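The splitting step can be sketched the same way, this time with FFmpeg's fps filter. The file names and the output pattern are hypothetical; expected_frame_count is only a rough estimate that helps check whether a movie is long enough to yield the images we need.

```python
import shlex


def build_split_command(input_path, output_pattern, fps):
    """Build an FFmpeg call that saves `fps` frames per second as images."""
    return ["ffmpeg", "-i", input_path, "-vf", f"fps={fps}", output_pattern]


def expected_frame_count(duration_seconds, fps):
    """Rough number of images produced for a movie of the given length."""
    return int(duration_seconds * fps)


# Hypothetical names; %04d is replaced by FFmpeg with the frame number.
command = build_split_command("output.mp4", "frame_%04d.jpg", 1)
print(shlex.join(command))

# For a 60-second movie recorded at 30 fps:
for fps in (30, 15, 1):
    print(fps, "fps ->", expected_frame_count(60, fps), "images")
```

With these numbers, a one-minute movie gives roughly 1800, 900 or 60 images, so a low fps value needs a much longer recording to reach 1000 images.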
This process should be repeated for every object we want to recognize.
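Before training, it may be worth checking that every object really has enough images. The sketch below assumes the frames of each object were saved into a separate subfolder named after that object, inside one common folder; the helper names and the set of accepted file extensions are my own choices.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg"}
MIN_IMAGES_PER_LABEL = 1000  # the minimum suggested earlier in this post


def count_images_per_label(image_dir):
    """Return {label: number of images} for every subfolder of image_dir."""
    counts = {}
    for label_dir in sorted(Path(image_dir).iterdir()):
        if label_dir.is_dir():
            counts[label_dir.name] = sum(
                1
                for f in label_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def labels_needing_more_images(counts, minimum=MIN_IMAGES_PER_LABEL):
    """List the labels that have fewer images than the minimum."""
    return [label for label, n in counts.items() if n < minimum]


# Usage with a hypothetical folder name:
# counts = count_images_per_label("training_images")
# print(labels_needing_more_images(counts))
```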
To start retraining, execute the retrain.py script:
image_dir – a path to the folder with the structure like this:
learning_rate – controls the size of the updates to the final layer during training,
testing_percentage – what percentage of images to use as a test set,
validation_percentage – what percentage of images to use as a validation set,
train_batch_size – how many images to train on at a time,
validation_batch_size – how many images to use in an evaluation batch. This validation set is used much more often than the test set, and is an early indicator of how accurate the model is during the training. A value of -1 causes the entire validation set to be used, which leads to more stable results across training iterations, but may be slower on large training sets,
flip_left_right – whether to randomly flip half of the training images horizontally,
random_scale – percentage determining how much to randomly scale up the size of the training images,
random_brightness – percentage determining how much to randomly multiply the training image input pixels up or down,
eval_step_interval – how often to evaluate the training results,
how_many_training_steps – how many training steps to run before ending,
architecture – name of a model architecture (which will be automatically downloaded).
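To show how these parameters fit together, here is a minimal sketch that assembles a retrain.py call from them. The values are illustrative rather than recommendations, the image folder name is hypothetical, and I am assuming that flip_left_right is a plain on/off switch; check the argument list of your copy of retrain.py before relying on it.

```python
import shlex


def build_retrain_command(image_dir, options, flip_left_right=False):
    """Assemble a retrain.py call from the parameters described above."""
    command = ["python", "retrain.py", f"--image_dir={image_dir}"]
    for name, value in options.items():
        command.append(f"--{name}={value}")
    if flip_left_right:
        # Assumed to be a switch that takes no value.
        command.append("--flip_left_right")
    return command


# Illustrative values only.
options = {
    "learning_rate": 0.01,
    "testing_percentage": 10,
    "validation_percentage": 10,
    "train_batch_size": 100,
    "validation_batch_size": -1,   # use the entire validation set
    "random_scale": 0,
    "random_brightness": 0,
    "eval_step_interval": 10,
    "how_many_training_steps": 4000,
}

# "training_images" is a hypothetical folder name.
command = build_retrain_command("training_images", options)
print(" \\\n  ".join(shlex.quote(part) for part in command))
```

The architecture parameter is left out on purpose here, so the script falls back to its default model.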
At first, I recommend leaving the architecture field blank, so that the Inception-v3 model is selected. This will verify whether the quality of your training data is sufficient. If the accuracy is satisfactory, you can then try the smaller MobileNet architectures.
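When moving on to MobileNet, the architecture value is a name that encodes the model width, the input size and, optionally, the quantized variant. The sketch below assumes the naming scheme mobilenet_<width>_<input size>[_quantized] and the listed widths and sizes; these come from the retrain.py version I have in mind, so verify them against your copy of the script.

```python
# Assumed sets of supported values; verify against your retrain.py.
MOBILENET_WIDTHS = ("1.0", "0.75", "0.50", "0.25")
MOBILENET_INPUT_SIZES = (224, 192, 160, 128)


def mobilenet_architecture(width="1.0", input_size=224, quantized=False):
    """Build an architecture name such as 'mobilenet_0.50_160_quantized'."""
    if width not in MOBILENET_WIDTHS:
        raise ValueError(f"unsupported width: {width!r}")
    if input_size not in MOBILENET_INPUT_SIZES:
        raise ValueError(f"unsupported input size: {input_size!r}")
    name = f"mobilenet_{width}_{input_size}"
    if quantized:
        name += "_quantized"
    return name


# From the largest variant to the smallest one.
print(mobilenet_architecture())
print(mobilenet_architecture("0.50", 160))
print(mobilenet_architecture("0.25", 128, quantized=True))
```

A reasonable approach is to start with the largest variant and step down until the accuracy stops being acceptable.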
We can observe the learning process in the console window or graphically [Fig. 4], in the form of graphs, by calling:
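The graphs come from TensorBoard, which reads the training summaries written by the script. A minimal sketch of the call, assuming the summaries were written to /tmp/retrain_logs; use whatever summaries folder your retrain.py run was configured with.

```python
import shlex


def build_tensorboard_command(log_dir):
    """Build a TensorBoard call that reads summaries from log_dir."""
    return ["tensorboard", f"--logdir={log_dir}"]


# Assumed location of the training summaries.
command = build_tensorboard_command("/tmp/retrain_logs")
print(shlex.join(command))
# prints: tensorboard --logdir=/tmp/retrain_logs
```

Once TensorBoard is running, it prints the local address to open in a browser.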
On completion of the learning process, the model will be saved to /tmp/output_graph.pb and the labels file to /tmp/output_labels.txt.
As you can see, retraining a model to recognize custom objects is pretty easy and takes less than an hour, including learning time, on a decent laptop. In the next article, I will show how to make use of the generated model to visualize results of recognized objects.
Tomek Antkowiak – Android Developer, specialized in the Java & Kotlin programming languages, with knowledge of a few main game engines such as Unity, PlayCanvas or Cocos2D.