# Challenging Cython — the Python Module for High-Performance Computing

April 18, 2023

In a recent post, we’ve discussed how to run Python about 91X faster than its usual speed. In this post, I aim to discuss an unanswered question from the previous one.

How do the solutions provided perform compared to Python’s traditional performance fix technique — Cython?

Although Python is a fantastic programming language, its performance was its major drawback. The elegant language with straightforward syntax was not made for faster computing.

Cython has been bridging this gap for many years by converting Python code into compiled C programs. A range of Scientific computing packages relies on Cython to speed up computation.

Let’s compare its performance with its modern alternative.

We’ll start by counting prime numbers using plain Python. Then, we’ll compare it with its Cython version. We’ll repeat them using Python’s multiprocessing module to find out its benefits.

Finally, we’ll compare the performances with modern ways to speed up Python programs.

## Cython .vs Python for simple calculations.

The most straightforward way to grasp the benefits of Cython is to use it in a rudimentary calculation. Here we use the same measure I used in the previous benchmark — Counting the number of prime numbers below a number.

We use the cProfile module to measure the performance of every computation. cProfile is one of the many standard Python modules to measure code running times.

### Computing using Python.

The following code will count the number of prime numbers below 35,000 and print a detailed performance report.

``````import cProfile

def count_primes(max_num: int):
"""This function counts of prime numbers below the input value.
Input values are in thousands, ie. 40, is 40,000.
"""
count: int = 0
for num in range(max_num * 1000 + 1):
if num > 1:
for i in range(2, num):
if num % i == 0:
break
else:
count += 1
print(count)
return count

cProfile.run("count_primes(35)")``````

When running the above Python code snippet, the output may look like the below. The numbers may differ according to your PC specs. Because there could be many factors affecting the performance, it’s wise to do the experiment several times and use the average for comparison. The average for the above execution in my PC is 18.721 seconds.

### Computing using Cython.

Using Cython is different from using Python. First, we need to convert the Python script into C.

1. Install Cython if it's not already installed.
``pip3 install --upgrade cython``

2. Create a file `count_prime.pyx` with the below content. It’s the same as before.

``````def count_primes(n: int) -> int:
"""Returns how many prime numbers are there less than n"""

count = 0
primes = [False for i in range(n + 1)]
for i in range(2, n):
if primes[i] == False:
count += 1
j = 2
while j * i < n:
primes[j * i] = True
j += 1
print(count)
return count``````

3. Create another file, `setup.py` , in the same directory with the below content. It will compile your code into a C program.

``````from setuptools import setup
from Cython.Build import cythonize

setup(
ext_modules = cythonize("count_prime.pyx")``````

4. Now, running the following command in a terminal window will create the C versions of our prime number counter function.

``python setup.py build_ext --inplace``

5. Also, rename (or move) the pyx file because it may cause import conflicts otherwise.

``mv count_prime.pyx count_prime_dep.pyx``

6. Finally, import the function into our main app and run it the same way we’d call any other python function.

``````import cProfile
from count_prime import count_primes

cProfile.run("count_primes(35)")``````

The average time to run this program is 11.210 seconds. Here is a sample cProfile report from one of the ten trials I did. A reduction from 18 seconds to 11 is a remarkable improvement. Cython does its job as expected. Let’s repeat the steps for the multiprocessing scenario as well.

## Cython .vs Python on multiprocessing.

The multiprocessing module in Python is an excellent way to start many subprocesses. Each process will run an instance of your code. If the processor has idling cores, they can run in parallel too.

### Python multiprocessing.

The following code will compute prime numbers below 20K, 25K, 30K, 35K, and 40K in multiple processes (parallel if possible.)

``````import cProfile
from multiprocessing import Pool

def count_primes(max_num: int):
"""This function counts of prime numbers below the input value.
Input values are in thousands, ie. 40, is 40,000.
"""
count: int = 0
for num in range(max_num * 1000 + 1):
if num > 1:
for i in range(2, num):
if num % i == 0:
break
else:
count += 1
print(count)
return count

if __name__ == "__main__":
with Pool(5) as p:
cProfile.run("p.map(count_primes, [20, 25, 30, 35, 40])")``````

Running the above code will result in an average time of 29.7 seconds. The outputs may look like the below. ### Cython multiprocessing.

The below code will repeat the same process with the C version we compiled earlier.

``````import cProfile
from multiprocessing import Pool
from count_prime import count_primes

if __name__ == "__main__":
with Pool(5) as p:
cProfile.run("p.map(count_primes, [20, 25, 30, 35, 40])")``````

The average time for this is around 18 seconds, and the output may look like the below. Once again, Cython has proved effective in speeding up python processing times significantly. It has cut down the time taken to run the entire loop by 37.9%.

## High-performance computing with Tuplex.

We’ve seen spectacular success in using Cython. In simple prime number counting and multiprocessing application, it reduces execution time by a massive margin.

Could this be faster? Does the Tuplex library we’ve discussed in the previous post perform better than the Cython versions?

If you’re looking for a detailed post on Tuplex, here is one.

Related Post:How to Speed up Python Data Pipelines up to 91X?

### Tuplex for a single process.

Tuplex converts Python scripts into LLVM bitecodes and runs them in parallel. The result is a remarkable improvement in the performance. This new library focuses on data pipelines, but its applications are not limited to data science only.

The below code will run a single computation using Tuplex. Because we pass only a single number in the list argument of the parallelizing method, Tuplex will run it in a single thread.

``````import cProfile
from tuplex import *

c = Context()

def count_primes(max_num: int):
"""This function counts of prime numbers below the input value.
Input values are in thousands, ie. 40, is 40,000.
"""
count: int = 0
for num in range(max_num * 1000 + 1):
if num > 1:
for i in range(2, num):
if num % i == 0:
break
else:
count += 1
print(count)
return count

cProfile.run("c.parallelize().map(count_primes).collect()")``````

The tuplex method results in an average of 9 seconds. Note that we are not considering the time Tuplex takes to set up the context. That’s impressive compared to the 11 seconds of Cython. The results may also vary according to the computer’s architecture. But this number is decent enough to show that Cython has a worthy opponent.

### Tuplex on parallel computation.

One last comparison is pending that could be the central purpose of Tuplex — parallelizing. Let’s compute the same multiprocessing exercise with Tuplex. Tweaking the last line of the Tuplex single process example will do the trick. Here’s how it should look.

``cProfile.run("c.parallelize([20, 25, 30, 35, 40]).map(count_primes).collect()")``

The result of running the Tuplex code in parallel is an astonishing 10 seconds on average. To finish the same task, Cython took 18 seconds, and plain Python took 29. ## Conclusion

Cython had been the primary (and only) performance fix available for Python programmers. The library does well often; thus, many other scientific computing packages use Cython internally.

Yet, a relatively new library Tuplex is performing in a promising way. In this article, we did a prime number counting experiment to compare Tuplex with Cython versions. There seems to be a massive reduction in the execution time as well.

But the results may vary based on your hardware, OS, and system state. Also, since Tuplex is not yet production-ready, you must use it with caution. Cython, on the other hand, is stable, and many apps in production are already using it.

The whole point of this comparison is not to conclude one is better than the other. Tuplex has a long journey ahead to come to this point. Yet, If Cython does not fit your project, it’s not a dead end. Not any more.

Did you like what you read? Consider subscribing to my email newsletter because I post more like this frequently.
Thanks for reading, friend! Say Hi to me on