Downloading files from the internet programmatically is a task frequently encountered in Python applications.
I do it a few times a year. Sometimes the number of files we need to fetch from an internet archive is large enough that downloading them manually would take weeks. With programmatic access, we can bring that down to less than a day.
This article covers the best practices for downloading files with Python: handling exceptions, choosing suitable libraries, and adding advanced functionality such as resumable downloads and stream processing.
Join me as we walk through them one by one.
Downloading files using the requests library
The Python ecosystem is rich; we have a package for almost every need.
Requests is one such library: it helps us make HTTP requests programmatically, and it can download files as well. It isn't always the right tool, though. If you can't reach the file directly through a URL, you might have to use Selenium to download it instead.
But for this post's purpose, we'll stick with the requests library.
Let's take a look at a simple example that demonstrates how to download a file using the requests library:
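Here is a minimal sketch. The URL is a placeholder, and I've wrapped the steps in a small function so it's easy to reuse:

```python
import requests

def download_file(url, dest="myfile.txt"):
    """Download `url` and save it to `dest`. Returns True on success."""
    response = requests.get(url)
    if response.status_code == 200:
        # Open the destination in write-binary mode and dump the body.
        with open(dest, "wb") as f:
            f.write(response.content)
        return True
    print(f"Failed to download file. Status code: {response.status_code}")
    return False

# download_file("https://example.com/myfile.txt")  # hypothetical URL
```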
This code uses the requests library to send a GET request to a URL. If the response status code is 200 (indicating a successful request), it opens a file named "myfile.txt" in write-binary mode and writes the content of the response to it. If the status code is not 200, it prints a failure message along with the actual status code.
Handle exceptions proactively when programmatically downloading files.
No matter what, the internet is an ocean of unknowns.
Errors can occur at various stages of a network request: the server may be temporarily down, a change in the source system may break your script's expectations, the connection may be unstable, and so on.
Thus we must address possible issues proactively: enclose your code in a try-except block and handle failures appropriately.
Let me provide you with an example that demonstrates how to handle exceptions gracefully when downloading a file.
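One way to do it looks like this; the wrapper function and the timeout value are my own choices:

```python
import requests

def download_file_safely(url, dest):
    """Download `url` to `dest`, handling network errors gracefully."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses.
        with open(dest, "wb") as f:
            f.write(response.content)
        print(f"Saved {url} to {dest}")
    except requests.exceptions.RequestException as err:
        # RequestException is the base class for every exception the
        # requests module raises: ConnectionError, Timeout, HTTPError, etc.
        print(f"Download failed: {err}")
```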
The above code encloses the main logic in a try-except block and handles all the exception types included in the requests module.
Instead, you can handle different types of errors differently. For example, retry after five minutes on connection issues, and send an email on an invalid-URL error.
You'll quickly realize how useful exception handling is; it makes life much easier when fixing issues in production systems.
Implement download resumption for large files.
Have you noticed that while you're downloading, other network activities get jammed? It's especially common if you try to download files in bulk using Python (or any other programming language).
For large files or unstable network connections, implementing download resumption is a valuable feature. It allows you to resume a download from where it left off rather than starting from scratch.
To implement download resumption, you can utilize the Range header in your requests. Here's an example:
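A minimal sketch of the idea; the helper name and chunk size are my own, and note that a server that ignores the Range header simply replies 200 with the whole file:

```python
import os
import requests

def resume_download(url, dest):
    """Resume a partial download of `url` into `dest` via the Range header."""
    # Bytes already on disk from a previous, interrupted attempt.
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, headers=headers, stream=True) as response:
        if response.status_code == 206:
            mode = "ab"  # Partial Content: append to what we already have.
        elif response.status_code == 200:
            mode = "wb"  # Server ignored the range; start from scratch.
        else:
            raise RuntimeError(f"Unexpected status: {response.status_code}")
        with open(dest, mode) as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
```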
Resumption is definitely useful when downloading large files. Not wasting network bandwidth on restarted downloads is my primary goal in doing it.
Set user-agent headers.
Sending user-agent headers is not a must in most cases.
But some websites require a user-agent header to be set in the request to simulate a web browser. You can set the User-Agent header in the request headers to make your request appear more like a regular browser request.
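For example (the Chrome User-Agent string below is just one typical value; substitute any current browser's):

```python
import requests

# A typical desktop Chrome User-Agent string; any modern browser UA works.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

def download_as_browser(url, dest):
    """Fetch `url` while presenting a browser-like User-Agent header."""
    response = requests.get(url, headers=BROWSER_HEADERS)
    response.raise_for_status()
    with open(dest, "wb") as f:
        f.write(response.content)
```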
As you can see, it's pretty straightforward to send a User-Agent header. Here's a list of popular user agents for your reference.
Iterate over the response.
Large files may arrive over the network just fine, but you still have to process them without exhausting your memory.
When downloading large files, it's advisable to stream the response instead of loading the entire file into memory. Streaming allows you to download and save the file in smaller chunks, reducing memory consumption. Here's an example:
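A sketch of chunked streaming; the 8 KB chunk size is a common choice, not a requirement:

```python
import requests

def download_in_chunks(url, dest, chunk_size=8192):
    """Stream `url` to `dest` without holding the whole body in memory."""
    # stream=True tells requests not to download the body up front.
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(dest, "wb") as f:
            # iter_content yields the body piece by piece.
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
```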
If you chunk the downloaded file as shown in the example, you can process it with limited resources. This is a must if you're developing data pipelines: in such a system, you never know when memory will be under pressure.
Verify the integrity of your downloaded file.
After downloading a file, it's a good practice to verify its integrity using a hash algorithm. This ensures that the file hasn't been corrupted during the download process. Here's an example using the hashlib module to calculate the MD5 hash of a file:
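A small helper along those lines; the expected checksum in the comment is hypothetical and would come from whoever published the file:

```python
import hashlib

def md5_of_file(path, chunk_size=8192):
    """Return the MD5 hex digest of `path`, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() keeps calling f.read(chunk_size) until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# Compare against the checksum published alongside the download
# (hypothetical value):
# assert md5_of_file("myfile.txt") == "expected-md5-from-the-publisher"
```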
There we go: your file's integrity is verified.
We don't normally download files programmatically, but there are situations where it's needed.
We usually don't care about anything beyond the file landing on our hard disks. In truth, there are many other things we need to care for.
This post has discussed some of my suggestions for downloading files from the internet using Python, something I've been doing a few times a year.
Hope it helps.