Best Practices for Downloading Files From the Web Using Python
Downloading files from the internet programmatically is a task frequently encountered in Python applications.
I do it a few times a year. Sometimes, the number of files we need to pull from an internet archive is large enough to take weeks by hand; with programmatic access, we can bring that down to less than a day. I've also built data pipelines in Python that automatically download files from the web.
This article walks through best practices for downloading files with Python, covering exception handling, choosing suitable libraries, and advanced features such as resumable downloads and stream processing. Let's dive in.
Downloading files using the requests library
The Python ecosystem is rich; there's a package for almost every need.
Requests is one such library: it lets us make HTTP requests programmatically, and that includes downloading files. It isn't always the best tool, though. If you can't reach the file directly through a URL, you might have to use Selenium to download it.
But for the purposes of this post, we'll stick with the requests library.
Let's take a look at a simple example that demonstrates how to download a file using the requests library:
import requests

url = "https://example.com/myfile.txt"

response = requests.get(url)

if response.status_code == 200:
    with open("myfile.txt", "wb") as file:
        file.write(response.content)
    print("File downloaded successfully!")
else:
    print(f"Failed to download file. Status code: {response.status_code}")
This code uses the requests library to send a GET request to a URL. If the response status code is 200 (indicating a successful request), it opens a file named "myfile.txt" in write-binary mode and writes the content of the response to it. If the status code is not 200, it prints a failure message along with the actual status code.
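If you'd rather not hardcode the output filename, one option is to derive it from the last path segment of the URL. Here's a minimal sketch of that idea; the URL and the fallback name are placeholders, not anything prescribed by requests.

import os
from urllib.parse import urlparse

import requests

url = "https://example.com/myfile.txt"  # placeholder URL

# Take the last path segment of the URL as the local filename,
# falling back to a generic name if the URL has no usable path
filename = os.path.basename(urlparse(url).path) or "downloaded_file"

response = requests.get(url)
if response.status_code == 200:
    with open(filename, "wb") as file:
        file.write(response.content)
    print(f"File downloaded successfully as {filename}!")
else:
    print(f"Failed to download file. Status code: {response.status_code}")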
Handle exceptions proactively when programmatically downloading files.
No matter what, the internet is an ocean of unknowns.
Errors can occur at various stages when working with network requests. The server may be temporarily down, a change in the source system may break your script's expectations, the connection may be unstable, and so on.
We must therefore address possible issues proactively: enclose your code inside a try-except block and handle failures appropriately.
Let me provide you with an example that demonstrates how to handle exceptions gracefully when downloading a file.
import requests

url = "https://example.com/myfile.txt"

try:
    response = requests.get(url)
    response.raise_for_status()
    with open("myfile.txt", "wb") as file:
        file.write(response.content)
    print("File downloaded successfully!")
except requests.exceptions.RequestException as e:
    print(f"Failed to download file: {e}")
The above code wraps the main logic in a try-except block. It catches requests.exceptions.RequestException, the base class of every exception the requests module raises.
You can also handle different types of errors differently. For example, retry after 5 minutes on connection issues, and send an email once the retries are exhausted (for instance, when the URL is invalid).
import smtplib
import time

import requests


def send_email(subject, message):
    # Code to send an email using an email sending service or library
    pass


url = "https://example.com/myfile.txt"
retry_count = 0
max_retries = 3

while retry_count < max_retries:
    try:
        response = requests.get(url)
        response.raise_for_status()
        with open("myfile.txt", "wb") as file:
            file.write(response.content)
        print("File downloaded successfully!")
        break
    except requests.exceptions.RequestException as e:
        retry_count += 1
        if retry_count == max_retries:
            # Give up and notify someone, e.g. when the URL is invalid
            send_email("Download failed", f"Failed to download file: {e}")
            print(f"Failed to download file: {e}")
        else:
            print("Connection error, retrying in 5 minutes...")
            time.sleep(300)
You can see how useful exception handling is. It makes life much easier when fixing issues in production systems.
Implement download resumption for large files.
Have you noticed that while a download is running, your other network activity gets jammed? It's even more common when you download files in bulk with Python (or any other programming language).
For large files or unstable network connections, implementing download resumption is a valuable feature. It allows you to resume a download from where it left off rather than starting from scratch.
To implement download resumption, you can utilize the Range header in your requests. Here's an example:
import requests

url = "https://example.com/largefile.zip"
headers = {"Range": "bytes=500-"}

response = requests.get(url, headers=headers)

# Append the downloaded content to the existing file
with open("largefile.zip", "ab") as file:
    file.write(response.content)

# Continue appending subsequent partial downloads until complete
Resumption is definitely useful when downloading large files; my primary reason for doing it is so a failed download doesn't force me to burn the same network bandwidth all over again.
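To tie the pieces together, here's a rough sketch of resuming based on how much of the file already exists on disk. The URL is a placeholder, and this only works when the server supports range requests (it then replies with status 206 Partial Content).

import os

import requests

url = "https://example.com/largefile.zip"  # placeholder URL
local_path = "largefile.zip"

# Ask the server to skip the bytes we already have on disk
existing_bytes = os.path.getsize(local_path) if os.path.exists(local_path) else 0
headers = {"Range": f"bytes={existing_bytes}-"} if existing_bytes else {}

response = requests.get(url, headers=headers, stream=True)

if response.status_code == 206:
    mode = "ab"  # server honoured the Range header, so append
else:
    response.raise_for_status()
    mode = "wb"  # server sent the whole file, so start over

with open(local_path, mode) as file:
    for chunk in response.iter_content(chunk_size=8192):
        file.write(chunk)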
Set user-agent headers.
Sending user-agent headers is not a must in most cases.
But some websites expect a user-agent header that makes the request look like it comes from a web browser. You can set it in the request headers so your request appears more like a regular browser request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)
As you see, it's pretty straightforward to send User-Agent headers. Here's a list of popular user agents for your reference.
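If you're making many requests to the same site, you can attach the header to a requests.Session once instead of passing it on every call. A small sketch, reusing the same user-agent string, with the URL as a placeholder:

import requests

session = requests.Session()
# Every request made through this session now carries the header
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'})

response = session.get("https://example.com/myfile.txt")  # placeholder URL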
Iterate over the response.
A large file may make it through the network, but you still have to process it without exhausting your memory.
When downloading large files, it’s advisable to stream the response instead of loading the entire file into memory. Streaming allows you to download and save the file in smaller chunks, reducing memory consumption. Here’s an example:
response = requests.get(url, stream=True)

with open('file.txt', 'wb') as file:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            file.write(chunk)
If you chunk the download in Python as shown in the example, you can process the file with limited resources. This is a must if you're developing data pipelines; in such systems, you never know when resource usage will spike.
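If you also want a sense of progress while streaming, the Content-Length response header (when the server sends it) tells you the total size. Here's a small sketch building on the same iter_content loop; the URL and filename are placeholders:

import requests

url = "https://example.com/largefile.zip"  # placeholder URL

response = requests.get(url, stream=True)
response.raise_for_status()

# Content-Length is optional, so fall back to 0 when the server omits it
total_bytes = int(response.headers.get("Content-Length", 0))
downloaded = 0

with open("largefile.zip", "wb") as file:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            file.write(chunk)
            downloaded += len(chunk)
            if total_bytes:
                print(f"\r{downloaded / total_bytes:.1%} downloaded", end="")
print()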
Verify the integrity of your downloaded file.
After downloading a file, it's a good practice to verify its integrity using a hash algorithm. This ensures that the file hasn't been corrupted during the download process. Here's an example using the hashlib module to calculate the MD5 hash of a file:
import hashlib


def calculate_md5(file_path):
    hash_md5 = hashlib.md5()
    with open(file_path, 'rb') as file:
        for chunk in iter(lambda: file.read(4096), b''):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


downloaded_file = 'file.txt'
expected_md5 = '...'

if calculate_md5(downloaded_file) == expected_md5:
    print('File integrity verified!')
else:
    print('File integrity check failed!')
There we go, your file's integrity is verified.
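The same pattern works with stronger hash algorithms. If the publisher provides a SHA-256 checksum instead, hashlib.sha256 is a drop-in replacement:

import hashlib

def calculate_sha256(file_path):
    hash_sha256 = hashlib.sha256()
    with open(file_path, 'rb') as file:
        # Read in small chunks so large files don't need to fit in memory
        for chunk in iter(lambda: file.read(4096), b''):
            hash_sha256.update(chunk)
    return hash_sha256.hexdigest()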
Conclusion
We don’t normally download files programmatically. But there are situations where it’s needed.
We usually don't care about anything beyond the file landing on our hard disks, but there are many other things worth caring about.
This post has covered some of my suggestions for downloading files from the internet using Python, something I do a few times a year.
Hope it helps.
Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on LinkedIn, Twitter, and Medium.
Not a Medium member yet? Please use this link to become a member, because I earn a commission for referring, at no extra cost to you.