Reducing Latency in TensorFlow Lite

Anidh Singh
Dec 23, 2020


Have you designed a neural network model, trained it for days, and perfected it with hyperparameter tuning for weeks, only to add it to your Android/iOS workflow and find that it takes something like 200 ms for a single inference? For some problems this is acceptable, such as image classification where you only need to run inference once and be done with it, but for the majority of use cases like super-resolution, object detection, or low-light image enhancement this is a bummer and needs a workaround.

TensorFlow Lite has come a long way since its initial release, and the team has added a bunch of features like OpenCL support for faster GPU inference and the TensorFlow Model Optimization Toolkit, a suite of tools for optimizing ML models for deployment and execution. But if you have tried those options and still want more, keep reading!
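For reference, the cheapest of those options to try is usually post-training quantization through the TFLite converter. A minimal sketch (the SavedModel path here is just a placeholder) looks roughly like this:

import tensorflow as tf

# Minimal sketch: post-training dynamic-range quantization.
# "saved_model_dir" is a placeholder path for your exported model.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantize weights to shrink and speed up the model
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)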

Using Separable Convolutions -

In simple terms, separable convolutions first perform a depthwise spatial convolution (which acts on each input channel separately) followed by a pointwise convolution that mixes the resulting output channels. This may sound like a small change, but it results in a dramatic improvement in model runtime. So instead of using

net = tf.layers.conv2d(lr_batch, 1, 3, activation=tf.nn.relu, name="conv1", padding="same")

We can do something like

net = tf.layers.separable_conv2d(net, 4, 3, activation=tf.nn.relu, depthwise_initializer=tf.keras.initializers.he_normal(), pointwise_initializer=tf.keras.initializers.he_normal(), padding="same", name="conv1")

Now remember, this will definitely result in some loss of accuracy, but we have seen huge improvements in speed. In our test running a sample model on the GPU backend of a OnePlus 5, this change alone gave us a speedup of around 4x-5x. If this is done for all of the convolution layers, the improvement is dramatic. It can also be applied strategically: the starting layers can use normal convolutions while the in-between layers use separable convolutions, as sketched below. If you want to read more on this topic you can start here.
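As a rough illustration of that placement strategy (the layer counts and filter sizes here are made up, not our actual network), the first layer can stay a regular convolution while the later layers switch to separable ones:

# Illustrative only: regular convolution up front, separable convolutions after.
net = tf.layers.conv2d(lr_batch, 32, 3, activation=tf.nn.relu, padding="same", name="conv1")
for i in range(2, 5):
    net = tf.layers.separable_conv2d(net, 32, 3, activation=tf.nn.relu,
                                     depthwise_initializer=tf.keras.initializers.he_normal(),
                                     pointwise_initializer=tf.keras.initializers.he_normal(),
                                     padding="same", name="conv%d" % i)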

Changing the activations -

It's the simple things that make the most impact, and this is demonstrated again by a simple change of activations. Which activation is fastest depends on the target backend: CPU, GPU, or DSP. For starters, I would advise creating a dummy model with each of the activations and seeing which gives the best timings on your preferred backend.
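A minimal sketch of such a comparison (run here with the Python TFLite interpreter on a desktop, which is only a rough proxy; the numbers that matter come from your actual device backend) could look like this:

import time
import numpy as np
import tensorflow as tf

def time_activation(activation, runs=50):
    # Tiny dummy model with the activation under test.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, padding="same", activation=activation, input_shape=(256, 256, 3)),
        tf.keras.layers.Conv2D(16, 3, padding="same", activation=activation),
    ])
    tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

    interpreter = tf.lite.Interpreter(model_content=tflite_model)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    x = np.random.rand(1, 256, 256, 3).astype(np.float32)

    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()  # warm-up run
    start = time.time()
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], x)
        interpreter.invoke()
    return (time.time() - start) / runs

for act in ["relu", "sigmoid", "tanh"]:
    print(act, time_activation(act))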

This also depends on the kind of problem you are working on. In our use case, when we tested on our GPU backend, we saw that the sigmoid activation performed the worst and the tanh activation performed the best, but in terms of accuracy it was the opposite.

To have the best of both worlds we decided to approximate the sigmoid activation function. Some of the approximations which we used were -

def sigmoid_approx(self, x, name="Sigmoid", alpha=0.5):
    with tf.name_scope(name, "Sigmoid", [x]) as name:
        # Softsign-style approximation; the alternatives we tried are commented out below.
        y = x / (1 + tf.math.abs(x))
        # y = 0.5 + 0.5 * tf.nn.tanh(0.5 * x)
        # y = 1 / (1 + tf.math.pow(0.3678749025, x))
        # y = 0.5 * (x * alpha / (1 + tf.math.abs(x * alpha))) + 0.5
        # y = x / tf.math.sqrt(1 + tf.pow(x, 2))
        return y

Some of these approximations are going to be very fast and some very slow, and they also have a varying delta in terms of accuracy, which depends on your problem. You can also approximate other activations.
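Before committing to one of them, it is worth checking how far each approximation drifts from the real sigmoid on your expected input range. A quick NumPy check (the ranges here are illustrative only) is enough:

import numpy as np

def true_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The candidate approximations from above, rewritten with NumPy.
candidates = {
    "x / (1 + |x|)":        lambda x: x / (1 + np.abs(x)),                        # note: range is (-1, 1)
    "0.5 + 0.5*tanh(0.5x)": lambda x: 0.5 + 0.5 * np.tanh(0.5 * x),               # mathematically exact
    "alpha = 0.5":          lambda x: 0.5 * (0.5 * x / (1 + np.abs(0.5 * x))) + 0.5,
}

x = np.linspace(-6, 6, 1001)
for name, fn in candidates.items():
    print(name, "max abs error:", np.max(np.abs(fn(x) - true_sigmoid(x))))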

In our particular use case, we needed the output to fall between 0 and 1, and we were previously using the relu activation followed by tf.clip_by_value() to clip the output values between 0 and 1. To get around this and save some time here as well, we switched to a sigmoid activation, which directly gives a value between 0 and 1.
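A rough before/after of that change (the layer names and filter counts are just placeholders) would look like:

# Before: relu output followed by an explicit clip to [0, 1].
out = tf.layers.conv2d(net, 3, 3, activation=tf.nn.relu, padding="same", name="out")
out = tf.clip_by_value(out, 0.0, 1.0)

# After: sigmoid (or an approximation rescaled to [0, 1]) already bounds the output,
# so the extra clip op disappears from the graph.
out = tf.layers.conv2d(net, 3, 3, activation=tf.nn.sigmoid, padding="same", name="out")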

Reduce the input dimensions -

We can save more time by making our convolutions work on a smaller input size. There are several ways to accomplish this, such as strided convolutions. We decided to use the space-to-depth approach: it outputs a copy of the input tensor where values from the height and width dimensions are moved to the depth dimension. This reduces the spatial input dimensions without much loss in accuracy.

net = tf.space_to_batch_nd(lr_batch, paddings=[[0, 0], [0, 0]], block_shape=[2, 2], name="s2b")

This also speeds up our inference times by around 1.5x to 1.7x on our target backend.
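For reference, the related tf.nn.space_to_depth op moves the spatial blocks into the channel dimension instead of the batch dimension; an illustrative shape check (placeholder input size) shows how the spatial resolution the convolutions have to process drops by the block factor:

# Illustrative shapes only: a 1x128x128x3 input with block_size=2 becomes 1x64x64x12,
# so the following convolutions run on a quarter of the spatial positions.
x = tf.random.uniform([1, 128, 128, 3])
y = tf.nn.space_to_depth(x, block_size=2)
print(y.shape)  # (1, 64, 64, 12)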

These are some of the most common techniques we used to bring down our inference time from around 110 ms to 18 ms. If you liked what I typed, then a clap would work wonders for me. I will be happy to write about other techniques for reducing time in the TFLite workflow.

Written by Anidh Singh

A deep learning enthusiast passionate about cutting edge technologies to solve real-world problems.
