CSVs are among the most popular file formats for storing data. They are known for their straightforward way of organizing data and their compatibility with many systems.
CSVs don’t need specialized software; any basic text editor can handle them. You can even use editors such as Vim and Nano to edit CSV files.
Spreadsheet software adds a bit more color to the way you handle CSV.
Since CSVs are widespread, most programmers store data in CSV format. But depending on where your CSV lives and the format you need it in, the technique for reading it can change.
Here’s a list of possible ways to read CSVs in Python and write them back to the filesystem.
Read CSV in Python without any helper modules.
Let’s first do it the old-school way. Read CSV as any other text file and process them manually.
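A minimal sketch of that approach (the file name and contents here are made up for illustration):

```python
# Write a small sample file first so the example is self-contained.
with open("people.csv", "w") as f:
    f.write("name,age\nAlice,34\nBob,29\n")

# Read the CSV as plain text and split each line on commas.
with open("people.csv") as f:
    rows = [line.split(",") for line in f]

print(rows)
# [['name', 'age\n'], ['Alice', '34\n'], ['Bob', '29\n']]
```

Note the `\n` still attached to the last cell of every row — we'll come back to that.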
When you run the code above, you get each row back as a list of strings. Are we done? No, not yet!
If you look closely, this method has a lot of flaws. You have to hand-code a lot of post-processing. For instance, the output above still carries the newline characters from the file; we should get rid of them.
To help us with all this post-processing, Python ships with a built-in module called csv.
Likewise, you can write data in CSV format without any helper modules. The following function will write a list of lists as CSV.
Each element in the innermost lists is a cell; the outer elements are the rows.
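A simple sketch of such a function (the function name and file name are my own choices):

```python
def write_csv(path, table):
    """Write a list of lists as CSV: one inner list per row, one element per cell."""
    with open(path, "w") as f:
        for row in table:
            f.write(",".join(str(cell) for cell in row) + "\n")

write_csv("scores.csv", [["name", "score"], ["Alice", 90], ["Bob", 85]])
```

Keep in mind this naive version doesn't quote cells that themselves contain commas — another reason to reach for the csv module.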
Read CSV to a List in Python with the CSV Module.
Python’s CSV module is one of the most popular libraries for processing data in CSV format. It has many handy features that make working with CSV files much easier.
You can also use the csv module to write data back to the filesystem.
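Here's a sketch of both directions — writing with `csv.writer` and reading back with `csv.reader` (file and column names are assumptions for the example):

```python
import csv

# Write a sample file so the example is self-contained.
with open("people.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["name", "age"], ["Alice", "34"], ["Bob", "29"]])

# Read it back; each row comes out as a list of strings.
with open("people.csv", newline="") as f:
    rows = list(csv.reader(f))

print(rows)  # [['name', 'age'], ['Alice', '34'], ['Bob', '29']]
```

Passing `newline=""` to `open` is the documented way to let the csv module manage line endings itself.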
Run the above, and you'll notice the rows come out clean.
As you can see, the csv module takes care of a lot of the heavy lifting for us. For instance, it strips the newlines and handles delimiters and quoting correctly. You don’t have to worry about such things anymore.
The csv.reader method returns all rows as a list of strings. If you want to read CSV into a Python dictionary, you can use the DictReader.
For instance, if you want the data in a dictionary format, you can use the following code.
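A minimal sketch using `csv.DictReader` (sample file contents assumed):

```python
import csv

# Sample file with a header row.
with open("people.csv", "w", newline="") as f:
    f.write("name,age\nAlice,34\nBob,29\n")

# DictReader keys each row by the values in the header row.
with open("people.csv", newline="") as f:
    records = [dict(row) for row in csv.DictReader(f)]

print(records)  # [{'name': 'Alice', 'age': '34'}, {'name': 'Bob', 'age': '29'}]
```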
Note: Similarly, you can use the DictWriter to write a list of dictionaries to a CSV file. Each dictionary represents a row, and its keys become the headers.
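A quick sketch of the DictWriter side (the data and file name are placeholders):

```python
import csv

people = [{"name": "Alice", "age": 34}, {"name": "Bob", "age": 29}]

with open("people_out.csv", "w", newline="") as f:
    # fieldnames decides both the header row and the column order.
    writer = csv.DictWriter(f, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(people)
```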
The DictReader takes the first row of the file as the header row and uses those values as keys. We can do two things to customize the keys of the DictReader output.
1. Pass a fieldnames argument to the DictReader method.
2. Add a header row to the file itself. The following shell command will do it. Note the newline character at the end of the header, which is important.
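One way to do it with `printf` and `cat` (file and column names are assumptions):

```shell
# Sample headerless data file.
printf 'Alice,34\nBob,29\n' > data.csv

# Prepend a header row. The trailing \n matters: without it, the header
# would run straight into the first data row.
printf 'name,age\n' | cat - data.csv > with_header.csv
```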
Both the above will customize the keys of the dictionary used in the DictReader method.
Using Pandas to read CSV files.
Pandas is what most people reach for when they want to read CSV in Python. It is an excellent library for data manipulation, and almost all Python programmers use it.
I like to read CSVs using Pandas because it already puts them in a tabular format. Further, you get many customization options when reading CSVs using Pandas.
You can use the to_csv method of the dataframe to write data as CSV files. However, by default, Pandas will also write the index column. To ignore the index column, you can pass index=False.
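A round-trip sketch with `read_csv` and `to_csv` (the dataframe contents are made up):

```python
import pandas as pd

# A small dataframe to round-trip through CSV.
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 29]})

# index=False keeps the row index out of the file.
df.to_csv("people.csv", index=False)

df2 = pd.read_csv("people.csv")
print(df2)
```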
For instance, if a column has date fields, you can pass the column name to the parse_dates argument instead of converting the values to Python datetime objects yourself.
In the same way, you can use the ‘usecols‘ parameter to select which columns to include when you’re reading. This is a faster way of ignoring unnecessary columns.
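Both options in one sketch (the file name, columns, and dates are assumptions for the example):

```python
import pandas as pd

# Sample file with a date column.
with open("events.csv", "w") as f:
    f.write("event,when,notes\nlaunch,2023-01-15,ok\nreview,2023-02-01,fine\n")

# parse_dates converts the 'when' column to datetime64;
# usecols loads only the columns we actually need.
df = pd.read_csv("events.csv", parse_dates=["when"], usecols=["event", "when"])

print(df.dtypes)  # 'when' comes out as datetime64[ns]
```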
Another helpful technique is to read CSVs directly into NumPy.
NumPy is a numerical computing library for Python, and the Pandas library is built on top of it. Since NumPy’s core is written in C, it’s often much faster than plain Python code. Thus it’s advisable to use NumPy arrays wherever possible in Python.
NumPy has a function to read and convert CSV into arrays. We can use the ‘genfromtxt’ function.
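A minimal sketch with `genfromtxt` (sample file contents assumed):

```python
import numpy as np

# A small numeric CSV to read.
with open("matrix.csv", "w") as f:
    f.write("1.0,2.0\n3.0,4.0\n")

# delimiter must be set explicitly; genfromtxt splits on whitespace by default.
arr = np.genfromtxt("matrix.csv", delimiter=",")

print(arr.shape)  # (2, 2)
```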
Reading CSV files from web URLs.
Often, you might have to read CSVs from external sources. If the file is hosted at a URL, you can use that URL to load the data into Python.
If you are not using Pandas, you might have to use another package, such as urllib or requests, to fetch the web CSV resource first.
But using Pandas makes the code more concise and robust. To load a CSV from a URL into Pandas, you can pass the URL directly to the read_csv function, just as you would pass a local file path.
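A sketch of the idea — to keep it runnable offline, a local `file://` URL stands in here for a real `http(s)://` one:

```python
import os
import pandas as pd

# Stand-in for a remote resource.
with open("remote.csv", "w") as f:
    f.write("name,age\nAlice,34\n")

url = "file://" + os.path.abspath("remote.csv")
df = pd.read_csv(url)  # pass the URL exactly as you would a local path
print(df)
```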
Reading CSVs stored in the cloud to Python.
With the advancement of cloud technology, more companies store their data in cloud storage, such as S3, Azure Storage, etc.
If your storage space is set to host static resources (CSV in our case), you can use its URL to load the data to Python. Please refer to the previous section.
If the storage space is private, you must use the cloud SDK to load data to Pandas. Using an in-memory stream, you can avoid downloading the cloud resource to your local file system.
Reading CSV from S3 buckets.
Amazon S3 is one of the popular cloud storage solutions. As part of the AWS stack, it has gained much traction in the past few years.
S3 is organized into buckets and objects. Buckets are like folders on your computer; objects are anything stored inside a bucket. Thus, a CSV file stored in an S3 bucket is an object.
To access an S3 bucket programmatically, you must install the AWS client library for Python, called boto3. You can install it from the PyPI repository.
Once installed, you need to get the credentials to securely access your S3 buckets and objects. Please follow along with this post to get your keys.
The following code will read CSVs in the S3 bucket.
Reading CSV from Azure blob storage.
Blob storage is the Microsoft Azure stack’s alternative to S3.
You can create an Azure subscription and get credentials for free. You must also install the Azure SDK for Python to programmatically access blob storage.
With the SDK installed, the following code will read a CSV on blob storage into a Pandas dataframe.
Reading CSV from Google Cloud storage.
Besides AWS and Azure, Google is the other popular cloud service provider in the market.
Google Cloud also offers storage options like S3 and Azure. Like the other two, Google Cloud also has a Python SDK to programmatically access the storage objects.
To authenticate Google Cloud SDK, you must download the credentials from Google Cloud and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the downloaded JSON file.
Once the SDK is installed and the keys are in place, the following code will read a CSV into Python from Google Cloud storage.
This post has looked at some fundamental things about CSV and Python. Everything starts with reading the CSV into Python first.
But as we’ve discussed, there are many ways to read a CSV. You can open it as a text file and split it yourself. Or you can use the csv module that ships with Python. A more sophisticated way is to load CSVs directly into Pandas dataframes or NumPy arrays; we covered that too.
Lastly, we’ve looked at how we can read CSVs stored in cloud storage. We’ve opened CSV files from AWS S3 buckets, Azure blob storage, and Google Cloud storage.