Speed is always a concern for developers, especially in data-intensive work.
The ability to iterate is the basis of all automation and scaling. The first and foremost choice for all of us is a for-loop. It’s excellent, simple, and flexible. Yet for-loops are not built for scaling up to massive datasets.
This is where vectorization comes in: applying an operation to a whole array at once instead of element by element. When you do extensive data processing in for-loops, consider vectorizing it. NumPy comes in handy there.
This post explains how fast NumPy operations are compared to for-loops.
Comparing For-loops with NumPy
Let’s take a simple summation operation. We have to sum up all the elements in a list.
Python has a built-in sum function that you can use over a list of numbers. But let’s assume there isn’t one, and you need to implement it.
Any programmer would opt to iterate over the list and add the numbers to a variable. But experienced developers know the limitations and go for an optimized version.
Here are both the list and NumPy versions of our summation. We create an array with a million random numbers between 0 and 100. Then we use both methods and record the execution times.
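As a minimal sketch of what such a comparison might look like (the function name loop_sum, the fixed seed, and the use of integer values are assumptions made here so that both totals can be compared exactly):

```python
import time

import numpy as np


def loop_sum(values):
    """Add up the values one at a time with a plain for-loop."""
    total = 0
    for value in values:
        total += value
    return total


# Assumption: a seeded generator and integers in [0, 100), so runs are repeatable.
rng = np.random.default_rng(seed=42)
arr = rng.integers(0, 100, size=1_000_000)
lst = arr.tolist()  # the same numbers as a plain Python list

# Time the for-loop version on the list.
start = time.perf_counter()
list_total = loop_sum(lst)
loop_seconds = time.perf_counter() - start

# Time the vectorized version on the NumPy array.
start = time.perf_counter()
numpy_total = int(np.sum(arr))
numpy_seconds = time.perf_counter() - start

print(f"for-loop: {loop_seconds:.4f} s")
print(f"NumPy:    {numpy_seconds:.4f} s")
```

The exact timings depend on your machine and on the Python and NumPy versions, so treat any single run as a rough indication only.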
Let’s run the program and see what we get. The output may look like the one below.
The NumPy version is faster. It took roughly one-hundredth of the time for-loops took.
More examples of using NumPy to speed up calculations
NumPy is used heavily for numerical computation. If you’re working with colossal datasets, vectorization and the use of NumPy are all but unavoidable.
Most machine learning libraries use NumPy under the hood to optimize algorithms. If you’ve ever created a scikit-learn model, you’ve used NumPy already.
Here are some more examples you’d frequently use when dealing with extensive numerical data.
Sum products in NumPy vs. Lists
The sum of products is a popular numerical computation; Excel even has it built in as SUMPRODUCT. Let’s measure the performance of the list and NumPy versions.
The following code multiplies each element of an array with a corresponding element in another array. Finally, we sum up all the individual products.
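A minimal sketch of such a benchmark could look like this. The helper name loop_sum_product, the seed, and the array size of one million are assumptions for illustration, and np.dot is one of several NumPy calls that compute a sum of products:

```python
import time

import numpy as np


def loop_sum_product(xs, ys):
    """Multiply the paired elements and add up the products in a for-loop."""
    total = 0
    for x, y in zip(xs, ys):
        total += x * y
    return total


# Assumption: two seeded arrays of a million integers in [0, 100).
rng = np.random.default_rng(seed=0)
a = rng.integers(0, 100, size=1_000_000)
b = rng.integers(0, 100, size=1_000_000)
a_list, b_list = a.tolist(), b.tolist()

start = time.perf_counter()
loop_result = loop_sum_product(a_list, b_list)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
numpy_result = int(np.dot(a, b))  # dot product of two 1-D arrays is their sum of products
numpy_seconds = time.perf_counter() - start

print(f"for-loop: {loop_seconds:.4f} s")
print(f"NumPy:    {numpy_seconds:.4f} s")
```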
Here’s the output of the above code:
Once again, the NumPy version was about 100 times faster than iterating over a list.
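Matrix multiplication performance of NumPy and lists
Matrix multiplication is an extended version of sum-product. It involves not a single array but an array of arrays: each entry of the result is the sum-product of a row of the first matrix and a column of the second.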
Matrix multiplication is also very common when implementing algorithms that involve a lot of data. Here’s the benchmark.
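One way to set up such a benchmark is sketched below. The matrix size of 100 x 100 is an assumption chosen so the pure-Python version finishes quickly, and loop_matmul is a hypothetical helper that implements the textbook triple loop:

```python
import time

import numpy as np


def loop_matmul(a, b):
    """Multiply two matrices stored as lists of lists using three nested loops."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            for k in range(inner):
                total += a[i][k] * b[k][j]
            result[i][j] = total
    return result


# Assumption: square integer matrices; increase n to see the gap widen.
n = 100
rng = np.random.default_rng(seed=1)
a = rng.integers(0, 100, size=(n, n))
b = rng.integers(0, 100, size=(n, n))
a_list, b_list = a.tolist(), b.tolist()

start = time.perf_counter()
loop_result = loop_matmul(a_list, b_list)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
numpy_result = a @ b  # the @ operator performs matrix multiplication on 2-D arrays
numpy_seconds = time.perf_counter() - start

print(f"for-loop: {loop_seconds:.4f} s")
print(f"NumPy:    {numpy_seconds:.4f} s")
```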
The results of using NumPy were striking. Our vectorized version ran more than 500 times faster.
NumPy’s benefits are more prominent as the size and dimensions of arrays grow.
Why is NumPy faster than lists?
Simple: they are designed for different purposes.
NumPy’s role is to provide an optimized interface for numerical computation. A Python list, however, is only a collection of objects.
A NumPy array allows only homogeneous data types. Thus NumPy operations don’t have to check types before every step of an algorithm. This is where we gain a lot of speed.
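To see what homogeneous means in practice, here is a small sketch (the variable names are made up for illustration). A list keeps each element’s own type, while a NumPy array stores a single dtype for all of its elements and converts values to fit it:

```python
import numpy as np

# A list happily stores objects of different types side by side.
mixed = [1, 2.5, "three"]
print([type(item).__name__ for item in mixed])

# A NumPy array has exactly one dtype for all of its elements.
ints = np.array([1, 2, 3])
print(ints.dtype)  # an integer dtype; the exact width depends on the platform

# Mixing ints and floats upcasts every element to a common type.
promoted = np.array([1, 2.5, 3])
print(promoted.dtype)  # float64
```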
Also, in NumPy the whole array, not each individual element, is a single object, and its values are densely packed in one contiguous block of memory. Thus it takes much less memory.
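A rough way to check the memory difference yourself is sketched below. It assumes CPython, where sys.getsizeof reports the size of the list’s pointer table and of each integer object separately, so the list figure is an estimate rather than an exact measurement:

```python
import sys

import numpy as np

n = 1_000_000
numbers = list(range(n))
arr = np.arange(n, dtype=np.int64)

# The list holds references; every integer is a separate Python object with its own overhead.
list_bytes = sys.getsizeof(numbers) + sum(sys.getsizeof(x) for x in numbers)

# nbytes counts the packed data buffer: n elements times 8 bytes each for int64.
array_bytes = arr.nbytes

print(f"list:  about {list_bytes / 1e6:.1f} MB")
print(f"array: {array_bytes / 1e6:.1f} MB")
```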
Further, NumPy operations are (primarily) implemented using C, not in Python itself.
Lists in Python are nothing more than an object store. Individual objects take up space, and you’ll quickly need more memory to process them. Lists can also hold objects of different types. But on the downside, every operation has to check the type of each element. This makes it costly.
This post encourages you to convert your lists to NumPy arrays and use vectorized operations to speed up execution.
It’s natural for people to use for-loops over a list because it’s straightforward. But if it involves a lot of numbers, it’s not the optimal way. To understand it better, we’ve compared the performances of trivial operations such as summation, sum-product, and matrix multiplication. In all cases, NumPy performed far better than lists.
For-loops, too, have their place in programming. The rule of thumb is to use them when your data structures are more complex and have fewer items to iterate over.
You may be better off summing a few hundred numbers without NumPy. Also, if each iteration has to do more than numerical computation, NumPy may not be the right tool.
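If you want to check the small-input case on your own machine, a sketch like the following can help. The list size of 300 and the repeat count are arbitrary assumptions, and the NumPy version deliberately includes the cost of converting the list to an array, since that is what you would pay when starting from a list:

```python
import timeit

import numpy as np

small = list(range(300))


def builtin_sum():
    """Sum the short list with Python's built-in sum."""
    return sum(small)


def numpy_sum_with_conversion():
    """Convert the list to an array first, then sum it with NumPy."""
    return int(np.sum(np.array(small)))


# Run each version many times, because a single call is too fast to time reliably.
builtin_seconds = timeit.timeit(builtin_sum, number=10_000)
numpy_seconds = timeit.timeit(numpy_sum_with_conversion, number=10_000)

print(f"built-in sum:          {builtin_seconds:.4f} s")
print(f"NumPy with conversion: {numpy_seconds:.4f} s")
```

Which version comes out ahead will vary with the input size and your setup, so measure before you decide.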