Python is a wonderful language, and one of my favorites. It is so easy to express logic in Python. But it lacks one thing: the ability to use multicore processors, and hence the ability to do things in parallel. One reason is Python's Global Interpreter Lock (GIL).
But Python provides us with the threading library. So why can't we use that?
This library doesn't get rid of the GIL. Its threads run concurrently, not in parallel: they don't make use of multiple cores. Then why use it? Why do we even need it?
Because it is good for programs that are I/O bound or network bound: you don't want to make the user wait while data is loading. Examples: fetching data from a web server, or loading a very large file.
Here is my CPU-intensive program to test the threading library.
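The listing itself isn't reproduced here; a sketch of a CPU-bound threading test in this spirit (the function name `countdown` and the iteration count are my choice, not the original author's) might look like:

```python
import threading

COUNT = 50_000_000  # arbitrary work size, chosen for illustration

def countdown(n):
    # Pure-CPU busy loop; the GIL lets only one thread execute
    # Python bytecode at a time, so these threads can't overlap.
    while n > 0:
        n -= 1

# Split the work across 4 threads.
threads = [threading.Thread(target=countdown, args=(COUNT // 4,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Saved as a script and run under `time python script.py`, this produces timings like the ones below.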
After running this code a couple of times, I got this output:
```
real    0m5.983s
user    0m5.508s
sys     0m0.472s
```
From this output, my conclusion is that the code is not running in parallel: the real (wall-clock) time is approximately equal to the sum of the other two. Had the threads run on multiple cores at once, the CPU time (user) would have exceeded the wall-clock time.
One way to get around Python's GIL is the multiprocessing library. It looks similar to the threading library syntax-wise, but it works differently: it uses the OS's process forking to spawn new processes. Each process holds its own Python interpreter running the code, and therefore its own GIL. It is left to the OS to schedule the processes in parallel on separate cores.
So let us test it with the same program.
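Again the original listing isn't shown; a multiprocessing version of the same sketch (same hypothetical `countdown` and work size as above) could be:

```python
import multiprocessing

COUNT = 50_000_000  # same illustrative work size as the threaded version

def countdown(n):
    # Same pure-CPU loop, but each process has its own interpreter
    # and its own GIL, so the four loops can truly run at once.
    while n > 0:
        n -= 1

def main():
    procs = [multiprocessing.Process(target=countdown, args=(COUNT // 4,))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return procs

if __name__ == "__main__":
    main()
```

The `if __name__ == "__main__"` guard matters here: on platforms that start processes by re-importing the module, omitting it causes an infinite spawn loop.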
The output I got was:
```
real    0m3.415s
user    0m12.636s
sys     0m0.624s
```
As you can see, the real time has decreased from almost 5.9 seconds to 3.4 seconds; it nearly halved, which means the work actually ran in parallel (the 12.6 seconds of user time confirms that multiple cores were busy at once). Ideally, it should have taken about a quarter of the time, because there were 4 processes running in parallel on 4 cores.
The extra time is overhead: creating these processes and joining them back into the parent at the end is costly. And since one process can't access another process's data, sharing state requires shared memory, which creates even more overhead. By default, the multiprocessing library doesn't create shared memory; it gives each process its own copy of the data.
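To illustrate that shared-state overhead, here is a small example of my own (not from the original post) using `multiprocessing.Value`, which allocates an integer in shared memory; every increment must take a cross-process lock, and that synchronization is exactly the extra cost described above:

```python
import multiprocessing

def add(total, lock, n):
    # total.value += 1 is a read-modify-write, so each increment
    # must hold the cross-process lock; this is the overhead of
    # sharing data between processes.
    for _ in range(n):
        with lock:
            total.value += 1

def main():
    total = multiprocessing.Value("i", 0)  # shared int in shared memory
    lock = multiprocessing.Lock()
    procs = [multiprocessing.Process(target=add, args=(total, lock, 1000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return total.value

if __name__ == "__main__":
    print(main())  # 4000
```

Without the lock, increments from different processes would interleave and the final count would come up short.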
So it is not worth parallelizing a Python program unless it does a substantial amount of computation, in which case the multiprocessing library is the tool to use, provided the system has sufficient resources.