A few days ago, I started learning TensorFlow, Google's new open-source library for machine learning.
One of the reasons I wanted to try TensorFlow was that I had heard it is heavily parallelized, and I was curious to see how people do that. TensorFlow can also run on a GPU, which made me even more curious.
So I started learning TensorFlow. I installed it with virtualenv; however, I wasn't able to install the GPU version. Soon after that, I went to the tutorials page and studied the hello world (MNIST) program on the TensorFlow website. This program reads images of handwritten digits and predicts which digit each one is. Here is the code.
Then I ran that hello world program. The first run took about 50 seconds, because it had to download the dataset. So I tried it again. This time, with the dataset already downloaded, I got a better result: around 14.62 seconds.
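As an aside, if you want to time a run like this from within Python rather than with the shell's `time` command, `time.perf_counter` is enough. A minimal sketch (the command being timed here is just a stand-in for the real training script):

```python
import subprocess
import sys
import time

start = time.perf_counter()

# stand-in for "python mnist_script.py" -- time any subprocess this way
subprocess.run([sys.executable, "-c", "print('hello world')"])

elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.2f} s")
```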
Around 14 seconds for training on 10,000 images out of the 55,000-image training set and testing on 10,000 images is unbelievable. This code achieves an accuracy of 91%. Amazing!
Maybe "hello world" is not the right way to benchmark. So I tried a different program aimed at the same task, predicting handwritten digits (MNIST) with higher accuracy, called Deep MNIST for Experts on the official website. It uses two additional hidden layers in its neural network. Here is the official code. I tried running this, but unfortunately I ran out of memory: my 4 GB of RAM wasn't enough for training it. So I modified the code by removing one hidden layer from the network. Here is the modified code. This version took about half an hour to run and achieved an accuracy of ~98%. By removing a hidden layer, I dropped the accuracy from 99.2% (according to the TensorFlow website) to 98%.
The most interesting part was that the real time was about half an hour, but the user time was more than an hour. That means it was using multiple cores on my laptop. It couldn't be using Python's threading or multiprocessing libraries: threading can't parallelize CPU-bound work (all thanks to the GIL :P), and multiprocessing creates many processes, whereas I saw only one Python process running. So there must be some other way it parallelizes.
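You can check this real-vs-user gap from inside Python, too. A rough sketch (the busywork below is just a stand-in for training): compare wall-clock time with the CPU time the whole process has accumulated. If CPU time noticeably exceeds wall time, more than one core was busy inside a single process.

```python
import time

start = time.perf_counter()

# CPU-bound busywork standing in for the training loop
total = sum(i * i for i in range(2_000_000))

wall = time.perf_counter() - start  # "real" time
cpu = time.process_time()           # CPU time of this process, summed over all threads

print(f"real: {wall:.3f} s, cpu: {cpu:.3f} s")
# cpu much greater than wall => multiple cores working in one process
```

With pure-Python threads you will never see cpu climb much past wall, because the GIL lets only one thread run Python bytecode at a time; TensorFlow's gap must come from somewhere else.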
Let's take a look at the TensorFlow source code. It looks like the core is completely written in C++. Hmm... so all the parallelism is coming from C++. But how is it wrapped in Python? How do you run parallelized C++ code from Python? Well, my guess was that it uses ctypes.
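For reference, ctypes lets Python load a compiled shared library and call its functions directly. Here is a minimal sketch calling `abs()` from libc; this just illustrates the mechanism, not how TensorFlow actually does its binding:

```python
import ctypes
import ctypes.util

# find and load the C standard library (a compiled shared object)
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# declare the C signature of abs(), then call it like a Python function
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

print(libc.abs(-42))  # prints 42
```

The key point is that once control enters the C side, the GIL no longer constrains what that native code does, so a C or C++ library is free to spin up its own threads across all cores.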