
Different Ways to Read CSV in Python (And Write Back!)

CSV is one of the most popular file formats for storing data. It is known for its straightforward way of organizing data and its compatibility with many systems.

CSVs don’t need specialized software. Any basic text editor can handle them. You can even use editors such as Vim and Nano to edit CSV files.

Spreadsheet software adds a bit more color to the way you handle CSV.

Since CSVs are so widespread, many programmers store data in CSV format. But depending on where your CSV lives and what format you need the data in, the technique you use to read it can change.

Here’s a list of possible ways to read CSVs in Python and write them back to the filesystem.

Read CSV in Python without any helper modules.

Let’s first do it the old-school way: read the CSV like any other text file and process it manually.

def read_csv_and_split(file_name):
    """Read a CSV file and split each line on commas."""
    with open(file_name, "r") as f:
        lines = f.readlines()
    return [line.split(",") for line in lines]


print(read_csv_and_split("data.csv"))
Python

When you run the code above, the output will look like this.

$ python read_csv.py 
[['apple', 'red', 'imported\n'], ['apple', 'green', 'imported\n'], ['banana', 'green', 'local\n'], ['pinapple', 'orenge', 'local']]
Bash

Mission accomplished!

No, not yet!

If you look closely, this method has a lot of flaws. You have to hand-code a lot of post-processing. For instance, the above output has newlines from the CSV. We should get rid of them.
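A quick way to handle that by hand, still without any helper modules, is to strip each line (and each cell) before splitting. Here's a minimal sketch:

def read_csv_and_split_clean(file_name):
    """Read a CSV manually, stripping newlines and surrounding whitespace."""
    with open(file_name, "r") as f:
        return [
            [cell.strip() for cell in line.split(",")]
            for line in f
            if line.strip()  # skip blank lines
        ]
Python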

To help us with all this post-processing, Python ships with a built-in module called csv.

In the same manual spirit, you can also write data in CSV format without any helper modules. The following function writes a list of lists as a CSV file.

Each element in the innermost list is a cell. Outer elements will be the rows.

from typing import Any, List


def save_text_as_csv(info: List[List[Any]], filename: str) -> None:
    """
    Save a list of lists as a CSV file.
    """
    with open(filename, "w") as f:
        for row in info:
            f.write(",".join(str(x) for x in row) + "\n")  # write row as a string


save_text_as_csv([[1, 2, 3], [4, 5, 6]], "test.csv")
Python

Read CSV to a List in Python with the CSV Module.

Python’s CSV module is one of the most popular libraries for processing data in CSV format. It has many handy features that make working with CSV files much easier.

You can also use the csv module to write data back to the filesystem.

import csv

def read_csv(filename):
    # newline="" is recommended when passing a file object to csv.reader
    with open(filename, 'r', newline='') as f:
        reader = csv.reader(f)
        return list(reader)

print(read_csv('data.csv'))


# ------------------------------------
# To write back

import csv
from typing import Any, List


def save_text_as_csv(info: List[List[Any]], filename: str) -> None:
    """
    Save a list of rows (each a list of cells) as a CSV file.
    """
    with open(filename, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        for row in info:
            writer.writerow(row)
Python

Running the above code will print the following.

$ python read_csv.py 

[['apple', 'red', 'imported'], ['apple', 'green', 'imported'], ['banana', 'green', 'local'], ['pinapple', 'orenge', 'local']]
Bash

As you can see, the csv module takes care of a lot of the heavy lifting for us. For instance, it strips the trailing newlines and handles quoted fields correctly. You don’t have to worry about such things anymore.

The csv.reader method returns each row as a list of strings. If you want to read the CSV into Python dictionaries instead, you can use DictReader.

For instance, if you want the data in a dictionary format, you can use the following code.

 

Note: Similarly, you can use the DictWriter to write a list of dictionaries to a CSV file. Each dictionary will represent a row. Its keys will be the headers.

import csv

def read_csv_dict(filename):
    with open(filename, 'r') as f:
        reader = csv.DictReader(f)
        return list(reader)


print(read_csv_dict('data.csv'))

# ----------------------------------------
# To write back

import csv
from typing import Any, Dict, List


def save_text_as_csv(info: List[Dict[str, Any]], filename: str) -> None:
    """
    Save a list of dictionaries as a csv file.
    """
    with open(filename, "w", newline="") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=info[0].keys())
        writer.writeheader()
        for row in info:
            writer.writerow(row)
Python

Running the code above will give you the following output.

$ python read_csv.py 
[{'apple': 'apple', 'red': 'green', 'imported': 'imported'}, {'apple': 'banana', 'red': 'green', 'imported': 'local'}, {'apple': 'pinapple', 'red': 'orenge', 'imported': 'local'}]
Bash

The DictReader method takes the first row of the file as the header, so it uses the header values as keys. We can do two things to customize the keys of the DictReader output.

1. Pass a fieldnames argument to the DictReader constructor.

reader = csv.DictReader(f, fieldnames=['fruit', 'color', 'origin'])
Python

2. Create a header row in the file itself. The following shell command will do it (a Python alternative is sketched after the output below). Note the newline character at the end of the inserted text, which is important.

sed -i '1s/^/fruit,color,origin\n/' data.csv
Bash

Both the above will customize the keys of the dictionary used in the DictReader method.

$ python read_csv.py 
[{'fruit': 'apple', 'color': 'red', 'origin': 'imported'}, {'fruit': 'apple', 'color': 'green', 'origin': 'imported'}, {'fruit': 'banana', 'color': 'green', 'origin': 'local'}, {'fruit': 'pinapple', 'color': 'orenge', 'origin': 'local'}]
Bash
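If you'd rather add the header row from Python instead of using sed, a small sketch like the following works, assuming the file fits comfortably in memory (the function name is just for illustration):

def prepend_header(filename, header):
    """Prepend a header row to an existing CSV file."""
    with open(filename, "r") as f:
        content = f.read()
    with open(filename, "w") as f:
        f.write(",".join(header) + "\n" + content)


prepend_header("data.csv", ["fruit", "color", "origin"])
Python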

Using Pandas to read CSV files.

This is what most people do when they want to read a CSV in Python. Pandas is an excellent library for data manipulation, and most Python programmers who work with data use it.

I like to read CSVs using Pandas because it already puts them in a tabular format. Further, you get many customization options when reading CSVs using Pandas.

You can use the dataframe’s to_csv method to write data back as a CSV file. However, by default, Pandas will also write the index column. To omit it, pass index=False.

import pandas as pd

df = pd.read_csv('data.csv')

print(df)

# To write back 
df.to_csv('data.csv', index=False)
Python

Reading CSV using pandas

For instance, if a column has date fields, you can pass the column name to the parse_dates argument, and Pandas will convert the values to datetime objects for you.

df = pd.read_csv('data.csv', parse_dates=['shipped_date'])
Python

 

Reading CSV using Pandas and parsing date columns

 

In the same way, you can use the usecols parameter to select which columns to load when reading. This is faster than loading everything and then dropping the columns you don’t need.

df = pd.read_csv("data.csv", usecols=["fruit", "color"])
Python
Selecting columns to read on Pandas

 

Another helpful technique is to read CSVs directly into NumPy.

NumPy is a numerical computing library for Python, and the Pandas library is built on top of it. Since NumPy’s core is written in C, it’s often much faster than plain Python code, so it’s advisable to use NumPy arrays wherever possible.

NumPy has a function to read and convert CSV into arrays. We can use the ‘genfromtxt’ function.

import numpy as np

# dtype=str keeps text columns as strings; the default float dtype would turn them into nan
array = np.genfromtxt('data.csv', delimiter=',', dtype=str)
Python

 


 

Reading CSV files from web URLs.

Often, you might have to read CSVs from external sources. If the file is hosted at a URL, you can use that URL to load the data into Python.

If you are not using Pandas, you might have to use another package like urllib or requests to fetch the web CSV resource first.

import requests
import csv


def load_csv_from_url(url):
    """Read CSV to list of lists"""
    response = requests.get(url)
    reader = csv.reader(response.text.splitlines())
    return list(reader)


print(
    load_csv_from_url(
        "https://raw.githubusercontent.com/thuwarakeshm/knime-tutorial/main/visits.csv"
    )
)
Python
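If you'd rather use urllib from the standard library instead of the third-party requests package, a roughly equivalent sketch looks like this:

import csv
import urllib.request


def load_csv_from_url(url):
    """Read a remote CSV into a list of lists using only the standard library."""
    with urllib.request.urlopen(url) as response:
        # assumes the CSV is UTF-8 encoded
        text = response.read().decode("utf-8")
    return list(csv.reader(text.splitlines()))


print(
    load_csv_from_url(
        "https://raw.githubusercontent.com/thuwarakeshm/knime-tutorial/main/visits.csv"
    )
)
Python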

But using Pandas makes the code more concise and robust. To load a CSV from a URL into Pandas, you can pass the URL directly to the read_csv function, just as you would pass a local file path.

 

Load CSV to Pandas from URL

import pandas as pd

csv_url = "https://raw.githubusercontent.com/thuwarakeshm/knime-tutorial/main/visits.csv"

df = pd.read_csv(csv_url)

print(df.head())
Python

Reading CSVs stored in the cloud into Python.

With the advancement of cloud technology, more companies store their data in cloud storage, such as S3, Azure Storage, etc.

If your storage space is set up to host static resources (a CSV in our case), you can use its URL to load the data into Python, as described in the previous section.

If the storage space is private, you must use the cloud provider’s SDK to load the data into Pandas. By using an in-memory stream, you can avoid downloading the cloud resource to your local filesystem first.
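The pattern is the same regardless of provider: get the file's raw bytes from the SDK and wrap them in an in-memory buffer from the io module so Pandas can read them like a file. A minimal sketch (the function name is just for illustration):

from io import BytesIO

import pandas as pd


def csv_bytes_to_dataframe(raw_bytes: bytes) -> pd.DataFrame:
    """Wrap raw CSV bytes in an in-memory buffer and load them into Pandas."""
    return pd.read_csv(BytesIO(raw_bytes))
Python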

Reading CSV from S3 buckets.

Amazon S3 is one of the most popular cloud storage solutions. As part of the AWS stack, it has gained much traction in the past few years.

S3 is organized as buckets and objects. Buckets are like folders in your computer; objects are anything in a bucket. Thus the CSV file stored on an S3 bucket is an object.

You must install the AWS client to access the S3 bucket programmatically. The Python client library is called boto3. You can install it from the PyPI repository.

pip install boto3
Bash

Once installed, you need to get the credentials to securely access your S3 buckets and objects. Please follow along with this post to get your keys.

The following code will read CSVs in the S3 bucket.

from io import StringIO

import boto3
import pandas as pd

client = boto3.client(
    's3',
    aws_access_key_id="<ACCESS_KEY>",
    aws_secret_access_key="<SECRET_KEY>",
    aws_session_token="<SESSION_TOKEN>"
)

bucket_name = '<BUCKET_NAME>'
object_key = '<OBJECT_KEY>'

# Read the object from S3
response = client.get_object(Bucket=bucket_name, Key=object_key)
body = response['Body']

# Read the object's content
content = body.read().decode('utf-8')

# Create a dataframe from the CSV content
df = pd.read_csv(StringIO(content))
Python
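If you'd rather not hard-code keys in the script, boto3 can also pick up credentials from environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or the ~/.aws/credentials file; in that case, the client can be created without any arguments:

import boto3

# boto3 falls back to environment variables or ~/.aws/credentials
# when no keys are passed explicitly.
client = boto3.client('s3')
Python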

Reading CSV from Azure blob storage.

Blob storage is the Microsoft Azure stack’s alternative to S3.

You can create an Azure subscription and get credentials for free. You must also install the Azure SDK for Python to access blob storage programmatically.

pip install azure-storage-blob
Bash

With the SDK installed, the following code will read a CSV on blob storage into a Pandas dataframe.

from io import BytesIO

from azure.storage.blob import BlobServiceClient
import pandas as pd


# Create the BlobServiceClient object, which will be used to create blob clients
blob_service_client = BlobServiceClient.from_connection_string(
    conn_str="DefaultEndpointsProtocol=https;AccountName=<ACCOUNT_NAME>;AccountKey=<ACCOUNT_KEY>"
)

# Create a blob client object pointing at the CSV blob
blob_client = blob_service_client.get_blob_client(
    container="<CONTAINER_NAME>", blob="<BLOB_NAME>"
)

# Download the blob's content (raw bytes)
blob_content = blob_client.download_blob().readall()

# Create a dataframe from the CSV content
df = pd.read_csv(BytesIO(blob_content))
Python

Reading CSV from Google Cloud storage.

Besides AWS and Azure, Google is the other popular cloud service provider in the market.

Google Cloud also offers object storage, comparable to S3 and Azure Blob Storage. Like the other two, Google Cloud has a Python SDK for accessing storage objects programmatically.

pip install google-cloud-storage
Bash

To authenticate Google Cloud SDK, you must download the credentials from Google Cloud and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the downloaded JSON file.
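You can set that variable in your shell before running the script, or point the SDK at the key file from within the script itself; a small sketch (the path is a placeholder):

import os

# Point the Google Cloud SDK at the downloaded service-account key file.
# Replace the placeholder path with the location of your own JSON key,
# and set this before creating storage.Client().
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
Python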

Once the SDK is installed and the keys are in place, the following code will read a CSV from Google Cloud Storage into Python.

from io import BytesIO

from google.cloud import storage
import pandas as pd

# Instantiate a client
storage_client = storage.Client()

# The name of the bucket to read from
bucket_name = "<BUCKET_NAME>"

# The name of the object (the CSV file) to read
object_name = "<OBJECT_NAME>"

# Get the bucket
bucket = storage_client.bucket(bucket_name)

# Get the object
blob = bucket.blob(object_name)

# Create a dataframe from the object's content
df = pd.read_csv(BytesIO(blob.download_as_bytes()))
Python

Final thoughts

This post has looked at some fundamental things about CSV and Python. Everything starts with reading CSV to Python first.

We’ve discussed many ways to read a CSV. You can open it as a text file and split it yourself, or you can use the csv module that ships with Python. A more sophisticated way is to load CSVs directly into Pandas dataframes or NumPy arrays, which we covered as well.

Lastly, we’ve looked at how we can read CSVs stored in cloud storage. We’ve opened CSV files from AWS S3 buckets, Azure blob storage, and Google Cloud storage.


Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on LinkedIn, Twitter, and Medium.

Not a Medium member yet? Please use this link to become a member, because I earn a commission for referring at no extra cost to you.
