{"id":1165,"date":"2023-06-27T13:50:55","date_gmt":"2023-06-27T13:50:55","guid":{"rendered":"https:\/\/www.the-analytics.club\/?p=1165"},"modified":"2023-06-27T15:58:03","modified_gmt":"2023-06-27T15:58:03","slug":"best-practices-of-downloading-files-from-the-web-using-python","status":"publish","type":"post","link":"https:\/\/www.the-analytics.club\/best-practices-of-downloading-files-from-the-web-using-python\/","title":{"rendered":"Best Practices of Downloading Files From the Web Using Python."},"content":{"rendered":"\n\n\n

Downloading files from the internet programmatically is a task frequently encountered in Python applications. <\/p>\n\n\n\n

I do it a few times a year. Sometimes, the number of files we need to download from an internet archive is large enough that fetching them by hand would take a few weeks. With programmatic access, we can bring it down to less than a day. <\/p>\n\n\n\n

I’ve also built data pipelines with Python<\/a> that automatically download files from the web. <\/p>\n\n\n\n

This article delves into the realm of optimal file-downloading techniques using Python, shedding light on crucial aspects like exception handling, employing suitable libraries, and incorporating advanced functionalities such as resumable downloads and stream processing. <\/p>\n\n\n\n

Join me as we explore the best practices for Python-based file downloads.<\/p>\n\n\n\n

\n
\n
\n

Grab your aromatic coffee <\/a>(or tea<\/a>) and get ready…!<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

Downloading files using the requests library<\/h2><\/div>\n\n\n\n

The Python ecosystem is rich. We have a package for almost every need. <\/p>\n\n\n\n

Requests is one such Python library. It helps us make HTTP requests programmatically, and it can download files as well. It isn’t always the best way to download files, though. If you can’t access the file directly with a URL, you might have to use Selenium to download it. <\/p>\n\n\n\n

But for this post’s purpose, we stick with the requests library. <\/p>\n\n\n\n

Let’s take a look at a simple example that demonstrates how to download a file using the requests<\/code> library:<\/p>\n\n\n\n

import requests\n\nurl = "https:\/\/example.com\/myfile.txt"\nresponse = requests.get(url)\n\nif response.status_code == 200:\n    with open("myfile.txt", "wb") as file:\n        file.write(response.content)\n        print("File downloaded successfully!")\nelse:\n    print(f"Failed to download file. Status code: {response.status_code}")\n<\/code><\/pre>Python<\/span><\/div>\n\n\n\n

This code uses the requests library to send a GET request to a URL. If the response status code is 200 (indicating a successful request), it opens a file named \"myfile.txt\"<\/code> in binary write mode. Then, it writes the content of the response to the file. If the status code is not 200, it prints a failure message along with the actual status code.<\/p>\n\n\n\n

Handle exceptions proactively when programmatically downloading files.<\/h2><\/div>\n\n\n\n

No matter what, the internet is an ocean of unknowns. <\/p>\n\n\n\n

Errors can occur at various stages when working with network requests. The server may be temporarily down, a recent change in the source system<\/a> may be incompatible with what your script expects, the connection may be unstable, and so on. <\/p>\n\n\n\n

Thus we must proactively address possible issues. Enclose your code inside a try-except block and handle each failure appropriately. <\/p>\n\n\n\n

Let me provide you with an example that demonstrates how to handle exceptions gracefully when downloading a file.<\/p>\n\n\n\n

import requests\n\nurl = "https:\/\/example.com\/myfile.txt"\n\ntry:\n    response = requests.get(url)\n    response.raise_for_status()\n    with open("myfile.txt", "wb") as file:\n        file.write(response.content)\n        print("File downloaded successfully!")\nexcept requests.exceptions.RequestException as e:\n    print(f"Failed to download file: {e}")<\/code><\/pre>Python<\/span><\/div>\n\n\n\n

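Alternatively, you can handle different types of errors differently. For example, you could handle connection issues by retrying after 5 minutes and send an email for an invalid URL error. The example below retries any request failure up to three times and sends an email once the retries run out. <\/p>\n\n\n\n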

import requests\nimport time\nimport smtplib\n\nurl = "https:\/\/example.com\/myfile.txt"\n\n\n# Defined before the loop so it already exists when the last retry fails\ndef send_email(subject, message):\n    # Code to send an email using an email sending service or library\n    pass\n\n\nretry_count = 0\nmax_retries = 3\n\nwhile retry_count < max_retries:\n    try:\n        response = requests.get(url)\n        response.raise_for_status()\n        with open("myfile.txt", "wb") as file:\n            file.write(response.content)\n            print("File downloaded successfully!")\n        break\n    except requests.exceptions.RequestException as e:\n        retry_count += 1\n        if retry_count == max_retries:\n            send_email("Download failed", f"Failed to download file: {e}")\n            print(f"Failed to download file: {e}")\n        else:\n            print("Download failed, retrying in 5 minutes...")\n            time.sleep(300)\n<\/code><\/pre>Python<\/span><\/div>\n\n\n\n
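The retry loop above treats every failure the same way. As a minimal sketch of handling error types separately, the function below retries only on connection problems and timeouts, and gives up immediately on a malformed URL. The name `download_with_retry` and the `get`, `sleep`, and `notify` parameters are my own assumptions for illustration (they make the function easy to try without a real network or mail server); the exception classes are the ones the requests library defines.

```python
import time

import requests


def download_with_retry(url, dest, max_retries=3, wait_seconds=300,
                        get=requests.get, sleep=time.sleep, notify=print):
    """Download url to dest, retrying only on transient network errors."""
    for attempt in range(1, max_retries + 1):
        try:
            response = get(url, timeout=30)
            response.raise_for_status()
        except (requests.exceptions.MissingSchema,
                requests.exceptions.InvalidSchema,
                requests.exceptions.InvalidURL) as e:
            # A malformed URL will not fix itself, so report it and stop.
            notify(f"Invalid URL, not retrying: {e}")
            return False
        except (requests.exceptions.ConnectionError,
                requests.exceptions.Timeout) as e:
            # Transient problem: wait and try again unless we are out of attempts.
            if attempt == max_retries:
                notify(f"Giving up after {max_retries} attempts: {e}")
                return False
            sleep(wait_seconds)
        except requests.exceptions.RequestException as e:
            # Anything else (for example an HTTP error status) is reported as-is.
            notify(f"Failed to download file: {e}")
            return False
        else:
            with open(dest, "wb") as file:
                file.write(response.content)
            return True
    return False
```

In a real pipeline, `notify` could be swapped for a function that sends an email, such as the `send_email` placeholder used earlier.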

Implement download resumption for large files.<\/h2><\/div>\n\n\n\n

Have you noticed that other network activities get jammed while you’re downloading? It’s more common if you try to download files in bulk using Python<\/a> (or any other programming language).<\/p>\n\n\n\n

For large files or unstable network connections, implementing download resumption<\/a> is a valuable feature. It allows you to resume a download from where it left off rather than starting from scratch. <\/p>\n\n\n\n

To implement download resumption, you can utilize the Range<\/code> header in your requests. Here’s an example:<\/p>\n\n\n\n

import requests\n\nurl = "https:\/\/example.com\/largefile.zip"\n# Ask the server for everything from byte 500 onwards\nheaders = {"Range": "bytes=500-"}\n\nresponse = requests.get(url, headers=headers)\n\n# Append the downloaded content to the existing file\nwith open("largefile.zip", "ab") as file:\n    file.write(response.content)\n\n# Continue appending subsequent partial downloads until complete<\/code><\/pre>Python<\/span><\/div>\n\n\n\n

Resumption is definitely useful when downloading large files. My primary reason for using it is to avoid hogging network bandwidth: an interrupted download continues from where it stopped instead of transferring everything again. <\/p>\n\n\n\n
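In practice the starting byte is rarely a fixed number like 500. One possible sketch, assuming the partial file on disk is intact, is to use its current size as the offset and to check whether the server actually honoured the `Range` header. The function name `resume_download` and the `get` parameter are hypothetical and exist only for this illustration.

```python
import os

import requests


def resume_download(url, dest, chunk_size=8192, get=requests.get):
    """Continue a partial download of url into dest; return bytes on disk."""
    # The number of bytes already saved tells us where to resume from.
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}

    with get(url, headers=headers, stream=True, timeout=30) as response:
        if offset and response.status_code == 416:
            # Range Not Satisfiable: the offset is at or past the end of the
            # file, which usually means nothing is left to fetch.
            return offset
        response.raise_for_status()
        # 206 Partial Content means the server honoured the Range header.
        # A plain 200 means it sent the whole file again, so start over.
        mode = "ab" if offset and response.status_code == 206 else "wb"
        with open(dest, mode) as file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                file.write(chunk)
    return os.path.getsize(dest)
```

Not every server supports range requests, which is why the sketch falls back to rewriting the file when it receives a 200 response.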

Set user-agent headers<\/strong>.<\/h2><\/div>\n\n\n\n

Sending user-agent headers is not a must in most cases. <\/p>\n\n\n\n

But, some websites may require a user-agent header to be set in the request to simulate a web browser. You can set the user-agent header in the request headers to make your request appear more like a regular browser request. <\/p>\n\n\n\n

import requests\n\nurl = "https:\/\/example.com\/myfile.txt"\nheaders = {'User-Agent': 'Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/58.0.3029.110 Safari\/537.3'}\nresponse = requests.get(url, headers=headers)<\/code><\/pre>Python<\/span><\/div>\n\n\n\n
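If a script makes many requests, repeating the header on every call gets tedious. A small sketch of one alternative is to set it once on a `requests.Session`, which then sends it with every request made through that session. The helper name `make_session` and the user-agent string in the usage lines are placeholders of my own, not values from any real service.

```python
import requests


def make_session(user_agent):
    """Return a Session that sends the given User-Agent with every request."""
    session = requests.Session()
    # Session-level headers are merged into each request made via the session.
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session("my-downloader/1.0")  # hypothetical identifier
# response = session.get("https://example.com/myfile.txt")
```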

Iterate over the response.<\/h2><\/div>\n\n\n\n

Large files may make it through the network, but you still have to process them without exhausting your memory. <\/p>\n\n\n\n

When downloading large files, it’s advisable to stream the response instead of loading the entire file into memory. Streaming allows you to download and save the file in smaller chunks, reducing memory consumption. Here’s an example:<\/p>\n\n\n\n

import requests\n\nurl = "https:\/\/example.com\/myfile.txt"\n# stream=True defers downloading the body until we iterate over it\nresponse = requests.get(url, stream=True)\nwith open('file.txt', 'wb') as file:\n    for chunk in response.iter_content(chunk_size=8192):\n        if chunk:\n            file.write(chunk)<\/code><\/pre>Python<\/span><\/div>\n\n\n\n

Verify the integrity of your downloaded file.<\/h2><\/div>\n\n\n\n

After downloading a file, it’s a good practice to verify its integrity using a hash algorithm. Comparing the hash against the one published by the source helps confirm that the file wasn’t corrupted during the download process. Here’s an example using the hashlib<\/code> module to calculate the MD5 hash of a file:<\/p>\n\n\n\n

import hashlib\n\ndef calculate_md5(file_path):\n    hash_md5 = hashlib.md5()\n    with open(file_path, 'rb') as file:\n        for chunk in iter(lambda: file.read(4096), b''):\n            hash_md5.update(chunk)\n    return hash_md5.hexdigest()\n\ndownloaded_file = 'file.txt'\nexpected_md5 = '...'\nif calculate_md5(downloaded_file) == expected_md5:\n    print('File integrity verified!')\nelse:\n    print('File integrity check failed!')<\/code><\/pre>Python<\/span><\/div>\n\n\n\n

There we go, your file’s integrity is verified. <\/p>\n\n\n\n
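MD5 is fine for catching accidental corruption, but it is no longer considered safe against deliberate tampering, and many sources publish SHA-256 checksums instead. The same chunked approach works with `hashlib.sha256`. The sketch below is one way to write it; the function names `file_sha256` and `verify_sha256` are my own and not part of any library.

```python
import hashlib
import hmac


def file_sha256(file_path, chunk_size=8192):
    """Return the SHA-256 hex digest of a file, read in small chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(file_path, expected_hex):
    """Compare the file's digest with the published one, ignoring case."""
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(file_sha256(file_path), expected)
```

Calling `verify_sha256("file.txt", published_checksum)` returns `True` only when the digests match, where `published_checksum` stands for whatever value the source provides.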

Conclusion<\/h2><\/div>\n\n\n\n

We don’t normally download files programmatically. But there are situations where it’s needed. <\/p>\n\n\n\n

We usually don’t care about anything other than the file being on our hard disks. But truly, there are many other things we need to care for. <\/p>\n\n\n\n

This post has discussed some of my suggestions for downloading files from the internet using Python. I do this a few times a year. <\/p>\n\n\n\n

Hope it helps.<\/p>\n\n\n\n


\n\n\n\n
\n

Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on LinkedIn<\/strong><\/a>, Twitter<\/strong><\/a>, and Medium<\/strong><\/a>. <\/p>\n\n\n\n

Not a Medium member yet? Please use this link to become a member<\/strong><\/a> because I earn a commission for referring at no extra cost for you.<\/p>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"

Downloading files from the internet programmatically is a task frequently encountered in Python applications. I do it a…<\/p>\n","protected":false},"author":2,"featured_media":1323,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kad_blocks_custom_css":"","_kad_blocks_head_custom_js":"","_kad_blocks_body_custom_js":"","_kad_blocks_footer_custom_js":"","_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"footnotes":""},"categories":[3,5],"tags":[24,33,27],"taxonomy_info":{"category":[{"value":3,"label":"Python"},{"value":5,"label":"Programming"}],"post_tag":[{"value":24,"label":"automation"},{"value":33,"label":"best practice"},{"value":27,"label":"python"}]},"featured_image_src_large":["https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/Digital-Intelligence-Images-1-1024x576.jpg",1024,576,true],"author_info":{"display_name":"Thuwarakesh","author_link":"https:\/\/www.the-analytics.club\/author\/thuwarakesh\/"},"comment_info":1,"category_info":[{"term_id":3,"name":"Python","slug":"python","term_group":0,"term_taxonomy_id":3,"taxonomy":"category","description":"","parent":5,"count":52,"filter":"raw","cat_ID":3,"category_count":52,"category_description":"","cat_name":"Python","category_nicename":"python","category_parent":5},{"term_id":5,"name":"Programming","slug":"programming","term_group":0,"term_taxonomy_id":5,"taxonomy":"category","description":"","parent":0,"count":43,"filter":"raw","cat_ID":5,"category_count":43,"category_description":"","cat_name":"Programming","category_nicename":"programming","category_parent":0}],"tag_info":[{"term_id":24,"name":"automation","slug":"automation","term_group":0,"term_taxonomy_id":24,"taxonomy":"post_tag","description":"","parent":0,"count":5,"filter":"
raw"},{"term_id":33,"name":"best practice","slug":"best-practice","term_group":0,"term_taxonomy_id":33,"taxonomy":"post_tag","description":"","parent":0,"count":2,"filter":"raw"},{"term_id":27,"name":"python","slug":"python","term_group":0,"term_taxonomy_id":27,"taxonomy":"post_tag","description":"","parent":0,"count":9,"filter":"raw"}],"_links":{"self":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/1165"}],"collection":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/comments?post=1165"}],"version-history":[{"count":11,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/1165\/revisions"}],"predecessor-version":[{"id":1353,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/1165\/revisions\/1353"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media\/1323"}],"wp:attachment":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media?parent=1165"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/categories?post=1165"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/tags?post=1165"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}

If you’re chunking the downloaded file in Python<\/a> as shown in the example, you can process the file with limited resources. This is a must if you’re developing data pipelines. In such a system, you never know when resource usage will spike. <\/p>\n\n\n\n

As you can see, it’s pretty straightforward to send User-Agent headers. Here’s a list of popular user agents <\/a>for your reference. <\/p>\n\n\n\n

By now you’ll have realized how useful exception handling is. It makes life easier when fixing issues in production systems<\/a>. <\/p>\n\n\n\n

The above code encloses the main code inside a try-except block. It handles all the exception types defined in the requests module<\/a>. <\/p>\n\n\n\n