I remember the good old college days when we spent weeks analyzing survey data in SPSS. It's interesting to see how far we've come since then.
Today, we do all of that and a lot more with a single command, before you even blink.
That's a remarkable improvement!
This short article will share three impressive Python libraries for exploratory data analysis (EDA). Not a Python pro? Don't worry! You can benefit from these tools even if you know nothing about Python.
They could save you weeks of data exploration and improve its quality. You'll also have a lot fewer hair-pulling moments.
The first one is the most popular, the second is my favorite, and the last is the most flexible. Even if you already know these libraries, the CLI wrappers I introduce in this post may help you use them at lightning speed.
The most popular Python exploratory data analysis library.
With over 7.7k stars on GitHub, Pandas-Profiling is the most popular exploratory data analysis tool on our list. It's easy to install, straightforward to use, and impeccable in its results.
You can use either PyPI or Conda to install Pandas-Profiling.
pip install pandas-profiling
# or: conda install -c conda-forge pandas-profiling
The installation allows you to use the pandas-profiling CLI in your terminal window. Within seconds, it generates an HTML report with tons of analysis about your dataset.
The blink moment: here's a demo that shows how it works. We use the popular Titanic survivor dataset for our analysis and store the report in an HTML file, which we then open in our favorite browser. Here is a live version you can play around with.
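Assuming the dataset is saved as titanic.csv and we want the report in titanic-report.html (both filenames are illustrative), the whole analysis is one command. Depending on the installed version, the console script may be spelled pandas_profiling:

```shell
# Generate a full HTML EDA report for the Titanic dataset
pandas_profiling titanic.csv titanic-report.html
```
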
When you open the file or the live link above, it will look like the following.
The variables section is a comprehensive analysis of every variable in your dataset. It includes descriptive statistics, histograms, and the common and extreme values of each variable.
In the interactions section, you can choose any two variables and create a scatterplot.
The generated HTML is a self-contained, dependency-free, single-page web app, so you can host it with any static site hosting provider.
One of my favorite parts of this report is the correlation section. It creates a heatmap of the correlations between variables, and you can choose which type of correlation coefficient to use.
My favorite EDA library.
Though it has only 1.7k stars on GitHub, Sweetviz fascinates me in many ways. The obvious magnet is the library's super-cool interactive HTML output. But I love this tool for other reasons.
You can install the library with the following command:
pip install sweetviz
Sweetviz doesn't ship with a command-line interface, but the code below creates a CLI wrapper around the library. If you want to learn more about creating nifty CLIs for your data science projects, check out my previous article on the topic.
The complete code is available in the GitHub repository. Non-Python users can follow the instructions there to get started quickly.
The primary usage of Sweetviz with the CLI.
import pandas as pd
import sweetviz as sv
import typer

app = typer.Typer()

@app.command()
def report(input_path: str):
    # Read the CSV file from the argument
    df = pd.read_csv(input_path)
    # Generate a report
    report = sv.analyze(df)
    # Render the HTML report in your default browser
    report.show_html()

# -------------------- MORE FEATURES HERE, LATER -----------------

if __name__ == "__main__":
    app()
For the above script to work:
- copy the content to a file called sweet (note that the file doesn't have an extension);
- make the file executable, e.g. with chmod +x sweet; and
- add the current directory to the system path, e.g. with export PATH="$PATH:$(pwd)".
The blink moment: this creates the CLI we need to generate EDAs quicker. Here's its primary usage.
The above example generates a detailed report about the dataset and opens it in the browser. The output may look like the following. A live version is available too.
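Assuming the sweet file is on your PATH and the dataset is saved as titanic.csv (an illustrative filename), the call looks like this. Note that once the script contains more than one command, Typer exposes each function as a subcommand named after it; while it holds only the single report command, Typer runs it directly as `sweet titanic.csv`:

```shell
# Generate and open a Sweetviz report for the dataset
sweet report titanic.csv
```
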
You may notice that Sweetviz gives almost the same information as Pandas-Profiling. Sweetviz, too, generates a self-contained HTML file, so you can host it with static hosting solutions such as GitHub Pages.
Sweetviz is my favorite because of two remarkable features: dataset comparisons and target variables. We'll look at them one by one and then together.
Comparing datasets with Sweetviz in a CLI.
Update the sweet file we created with the below content. You can paste it below the 'MORE FEATURES' line. This function adds an extra capability to your CLI: dataset comparison.
# -------------------- MORE FEATURES HERE, LATER -----------------

@app.command()
def compare(input1: str, input2: str):
    # Read CSV files from the arguments
    df1, df2 = pd.read_csv(input1), pd.read_csv(input2)
    # Generate a comparison report
    report = sv.compare(df1, df2)
    # Render the HTML in your default browser
    report.show_html()
The blink moment: here's how it works. The command takes two files as arguments and generates the report as before. For this example, I created a second file by sampling the Titanic dataset; in a real-life scenario, you may have a different version of the same file.
The generated output now looks different. It now contains a comparison value displayed at every level. You can see it clearly in this live version.
Making such a comparison between two datasets would otherwise take significant effort.
Another cool thing about Sweetviz is its target variable setting. With it, you can generate a report in which every cut is examined against a target variable. The below update to the code will let you do it from the CLI.
@app.command()
def target(input_path: str, target: str):
    # Read the CSV file from the argument
    df = pd.read_csv(input_path)
    # Generate a report against the target variable
    report = sv.analyze(df, target)
    # Render the HTML report in your default browser
    report.show_html()
The blink moment: Now, you can specify the dataset name and a target variable in the CLI. Here's the demo and the output (live version).
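With the target subcommand added, the invocation takes the dataset and the name of the target column (titanic.csv is an illustrative filename):

```shell
# Analyze titanic.csv with 'Survived' as the target variable
sweet target titanic.csv Survived
```
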
I've specified the 'Survived' variable as the target. Now, alongside every variable, you can also study how the target varies.
In many cases, you'll want to see how your target variable has changed across different versions of your dataset. That's only another blink with Sweetviz.
Dataset comparison with a target variable
The below code will update the CLI to accept three arguments. The first is the primary dataset, then the comparison dataset, and the last is the target variable.
@app.command()
def compare_with_target(input1: str, input2: str, target: str):
    # Read CSVs from the arguments
    df1, df2 = pd.read_csv(input1), pd.read_csv(input2)
    # Generate a comparison report against the target variable
    report = sv.compare(df1, df2, target)
    # Render the HTML report in your default browser
    report.show_html()
The blink moment: You can run it with the sample dataset we created earlier for the comparison and the 'Survived' column as the target.
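A sketch of the call, assuming the sampled file from earlier is named titanic_sample.csv. Note that Typer converts underscores in function names to hyphens in command names:

```shell
# Compare the two datasets against the 'Survived' target
sweet compare-with-target titanic.csv titanic_sample.csv Survived
```
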
The output now contains both the dataset comparison and the analysis against a target variable. This could be extremely useful in professional settings where you work on the same dataset, update it with new observations, and focus on a single variable. Here is the live version to test.
The flexible EDA playground.
If you can spare a few more blinks but need more control over your analysis, here is what you need. Pandas GUI creates a graphical wrapper around your data frame: instead of writing code, you can use a convenient interface. Pandas GUI is more of an exploration playground than a quick exploration tool.
You can install it with PyPI:
pip install pandasgui
Like Sweetviz, Pandas GUI doesn't come with a CLI. Although starting it isn't tricky, the CLI wrapper below could help if you aren't a Python user.
#!/usr/bin/python
# If you are using a virtualenv, change the line above to its Python executable.
import pandas as pd
from pandasgui import show
import typer

app = typer.Typer()

@app.command()
def report(filepath: str):
    df = pd.read_csv(filepath)
    show(df)

if __name__ == "__main__":
    app()
As we did for Sweetviz, create a file named pgui with the above content and make it executable with chmod +x pgui. You don't have to add the current directory to the path again, as we already did that. The below command will start the UI.
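Since the pgui script defines only a single command, Typer runs it directly without a subcommand name (titanic.csv is an illustrative filename):

```shell
# Launch the Pandas GUI for the dataset
pgui titanic.csv
```
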
You'll see interactive software pop up. With this tool, you can do analyses that aren't possible with the other two tools I've mentioned.
For example, here is a contour plot of survivors against their age.
We aren't going into more detail about Pandas GUI here, but this video from their official docs will help you learn more about it: https://www.youtube.com/embed/NKXdolMxW2Y
Aside from the interpretation, exploratory data analysis is mostly repetitive. Gone are the days when we struggled with SPSS and Excel to do trivial things. Today, we can do much more than that in the blink of an eye.
In this article, I've discussed three strikingly convenient Python libraries for EDA. Pandas-Profiling is the most popular among them. Sweetviz creates a self-contained HTML application that I find handy. Lastly, we discussed Pandas GUI, a tool that gives you more control over your analysis.
Along with the libraries, we've also discussed creating CLI wrappers to make them more convenient. This allows non-Python users to benefit from these tools too.
Installation and usage are straightforward for all three libraries. With the repetitive tasks of EDA being taken care of, you may focus your attention on the more exciting stuff.
Be armed to surprise your audience before they blink.