Pandas is the de facto standard library for data analysis in Python. Yet many data scientists are more comfortable with SQL than with pandas data operations.
At times, things that feel cumbersome in pandas are easy to express in SQL. Wouldn't it be great if we had a way to get the best of both worlds?
A wonderful library in the Python ecosystem makes it possible to do this.
Related: A Better Way to Summarize Pandas Dataframes.
Write your first SQL query on a Pandas dataframe with Pandasql
Pandasql is an open-source package that lets you run SQL code on pandas data frames. You can check out its repository on GitHub.
You can install Pandasql from PyPI as follows.
pip install pandasql
Once you've installed it, you can use its sqldf function to query any dataframe in memory.
from pandasql import sqldf
from sklearn import datasets

# Load the wine recognition dataset as a pandas dataframe
data = datasets.load_wine(as_frame=True)['data']

# Query the dataframe with plain SQL
sqldf("SELECT * FROM data LIMIT 10", globals())
In the example above, we've imported the wine recognition dataset from scikit-learn's toy dataset collection and queried its first ten records.
The first argument to the sqldf function is straightforward: it's a SQL query. The second argument, globals(), tells the function to use the global namespace to find the dataframe. The other possible value for this argument is locals(), which tells it to look only within the scope of the enclosing function.
If we're to do the same using only Pandas operations, it will look as follows.
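A minimal sketch of the pandas equivalent, assuming the same `data` dataframe loaded above, could be:

```python
from sklearn import datasets

# Load the same wine recognition dataframe
data = datasets.load_wine(as_frame=True)['data']

# First ten records, the pandas way
first_ten = data.head(10)
print(first_ten)
```

`data.head(10)` (or equivalently `data.iloc[:10]`) plays the role of `LIMIT 10` in the SQL version.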
Both operations result in the same dataset.
At first, it may seem counterintuitive to add complexity to the trivial task of querying the first ten records. But as our query grows, it becomes easier to express it as a SQL query than the pandas way.
__PRO TIP: Query pandas using SQL without having to mention the global environment.__
We can use the functools.partial utility to create a partial function that doesn't require you to pass the environment every time. Here's how to do it.

from functools import partial

# Bind the global environment once so every call can omit it
gpandsql = partial(sqldf, env=globals())

gpandsql("SELECT * FROM data LIMIT 10")
Related: Pandas Replace: The Faster and Better Approach to Change Values of a Column.
What types of SQL queries can you perform on a Pandas dataframe?
Pandasql uses SQLite as its temporary backend, so you can perform any SQL operation SQLite supports. This includes GROUP BY, WHERE, and the different kinds of JOINs.
Here's an example:
from functools import partial
from pandasql import sqldf
from sklearn import datasets

data = datasets.load_wine(as_frame=True)['data']
target = datasets.load_wine(as_frame=True)['target']

gpandsql = partial(sqldf, env=globals())

gpandsql("""
    SELECT * FROM data d
    LEFT JOIN target t
    WHERE d.ash = 1.36
    LIMIT 5
""").describe()
You can even perform more advanced operations such as window functions.
However, SQLite is not designed for large-scale applications, and it has limitations we need to be aware of.
Also, note that we can only query dataframes with Pandasql; we cannot create, update, or delete records using this library.
Related: Use Pipe Operations in Python for More Readable and Faster Coding
Most data scientists are fluent in SQL databases; for some, pandas operations don't come as easily.
For them, Pandasql is a fantastic tool to enjoy the best of both worlds.
I've also noticed that SQL queries are excellent for readability. They follow a natural-language flow, making them easy for even novice programmers to grasp.
Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on LinkedIn, Twitter, and Medium. I’ll break the ice for you.
Not a Medium member yet? Please use this link to become a member, because I earn a commission for referring you at no extra cost to you.