
How to Do a Ton of Analysis in Python in the Blink of An Eye.

I remember the good old college days when we spent weeks analyzing survey data in SPSS. It’s interesting to see how far we have come from that point.

Today, we can do all of that analysis, and a lot more, with a single command before you even blink.

That’s a remarkable improvement!

This short article will share three impressive Python libraries for exploratory data analysis (EDA). Not a Python pro? Don’t worry! You can benefit from these tools even if you know nothing about Python.

They could save you weeks of data exploration and improve its quality. You will also have far fewer hair-pulling moments.

The first one is the most popular, the second is my favorite, and the last is the most flexible. Even if you already know these libraries, the CLI wrappers I introduce in this post may help you use them at lightning speed.

The most popular Python exploratory data analysis library.

With over 7.7k stars on GitHub, Pandas-Profiling is our list’s most popular exploratory data analysis tool. It’s easy to install, straightforward to use, and impeccable in its results.

You can use either PyPI or Conda to install Pandas-Profiling.

pip install pandas-profiling
# conda install -c conda-forge pandas-profiling
Bash

The installation allows you to use the pandas-profiling CLI in your terminal window. Within seconds, it generates an HTML report with tons of analysis about your dataset.

The blink moment: Here’s a demo that shows how it works. We use the popular Titanic survivor dataset for our analysis and store the report in an HTML file. We then open it in our favorite browser. Here is a live version you can play around with.
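
If I remember the console entry point correctly, it is pandas_profiling (note the underscore), taking the input CSV and the output HTML path as positional arguments; the file names below are just the ones used in this demo:

pandas_profiling titanic.csv titanic-report.html
Bash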

Auto-generating the analysis report for the Titanic dataset

When you open the file or the live link above, it will look like the following.

The Pandas Profiling analysis report for the Titanic dataset

The variables section is a comprehensive analysis of every variable in your dataset. It includes descriptive statistics, histograms, and common and extreme values of the variable.

In the interactions section, you can choose any two variables and create a scatterplot.

The report itself is a single-page, dependency-free web app. Because the generated HTML is self-contained, you can host it with any static site hosting provider.

One of my favorite parts of this report is the correlation section. It creates a heatmap of the correlations between variables, and you can choose which type of correlation to use in the heatmap.
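
If you'd rather generate the same report from a notebook or script instead of the terminal, the programmatic API looks roughly like this (the title and file names are only illustrative):

import pandas as pd
from pandas_profiling import ProfileReport

# Load the dataset and build the profile report
df = pd.read_csv("titanic.csv")
profile = ProfileReport(df, title="Titanic EDA")

# Write the self-contained HTML report to disk
profile.to_file("titanic-report.html")
Python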

My favorite EDA library.

Though it has only 1.7k stars on GitHub, Sweetviz fascinates me in many ways. The obvious magnet is the library’s super-cool interactive HTML output, but my love for this tool comes from other reasons.

You can install the library using the command below:

pip install sweetviz
Bash

Sweetviz doesn’t ship with a command-line interface. But the below code creates a CLI wrapper around the library. If you want to learn more about creating nifty CLIs for your data science projects, check out my previous article on the topic.

The complete code is available in the GitHub repository. Non-Python users can follow the instructions there to get started quickly.

The primary usage of Sweetviz with the CLI.

#! /usr/bin/python
# If you are using a virtualenv, change the line above to its Python executable.

import pandas as pd
import sweetviz as sv
import typer

app = typer.Typer()


@app.command()
def report(input_path: str):
    # Read the CSV file given as an argument
    df = pd.read_csv(input_path)

    # Generate a report.
    report = sv.analyze(df)

    # Render the HTML report in your default browser.
    report.show_html()

# -------------------- MORE FEATURES HERE, LATER -----------------


if __name__ == "__main__":
    app()
Python

For the above script to work (the full terminal snippet follows the list),

  1. Copy the content to a file called sweet (note that the file doesn’t have an extension);
  2. Make the file executable. You can do it with chmod +x sweet, and;
  3. Add the current directory to the system path with export PATH=$PATH:$PWD.
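
Putting those steps together in a terminal, assuming the file sits in your current working directory:

chmod +x sweet
export PATH=$PATH:$PWD

# Verify the CLI is wired up correctly
sweet --help
Bash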

The blink moment: This gives us the CLI we need to generate EDA reports quicker. Here’s its primary usage.
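
A note on invocation: Typer runs a lone command directly, but once you paste in the extra commands from the later sections it expects the command name as well. With our example file, the calls look roughly like this:

# With only the report command defined:
sweet titanic.csv

# Once the later commands are added as well:
sweet report titanic.csv
Bash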

Generating analysis for the Titanic dataset with the Sweetviz command-line interface (CLI)

The above example generates a detailed report about the dataset and opens it in the browser. The output may look like the one below. A live version is available too.

Sweetviz analysis for the Titanic dataset.

You may notice that Sweetviz gives almost the same information as Pandas-Profiling. Sweetviz, too, generates a self-contained HTML file. You can host it with static hosting solutions such as GitHub Pages.
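
If you want to publish the report rather than just view it, show_html accepts a file path and, if I remember the signature correctly, an open_browser flag; a minimal sketch:

import pandas as pd
import sweetviz as sv

df = pd.read_csv("titanic.csv")
report = sv.analyze(df)

# Write the report to a file you can push to GitHub Pages,
# without opening a browser window.
report.show_html("titanic-report.html", open_browser=False)
Python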

Sweetviz is my favorite because of two remarkable features: dataset comparisons and target variables. We’ll look at them one by one and then together.

Comparing datasets with Sweetviz in a CLI.

Update the sweet file we created with the below content. You can paste it below the ‘MORE FEATURES’ line. This function gives extra capability to your CLI — comparison.

# -------------------- MORE FEATURES HERE, LATER -----------------
@app.command()
def compare(input1: str, input2: str):
    # Read CSV files from the arguments
    df1, df2 = pd.read_csv(input1), pd.read_csv(input2)

    # Generate a comparison report
    report = sv.compare(df1, df2)

    # Render the HTML in your default browser
    report.show_html()
Python

The blink moment: Here’s how it works. It takes two files as arguments and generates the report as it did earlier. For this example, I created a second file by sampling the Titanic dataset. In a real-life scenario, you may have a different version of the same file.
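
Assuming the sampled file is saved as titanic_sample.csv (the name is just a placeholder), the comparison call looks something like this:

sweet compare titanic.csv titanic_sample.csv
Bash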

Comparing datasets in command line using Sweetviz

The generated output looks different now: it contains a comparison value displayed at every level. You can see it clearly in this live version.

Dataset comparison report of Sweetviz

Making such a comparison between two datasets might take significant effort otherwise.

Another cool thing about Sweetviz is its target variable setting. With this, you can generate a report where every cut is examined against a target variable. The below update to the code will let you do it with the CLI.

@app.command()
def target(input_path: str, target: str):
    # Read the CSV file given as an argument
    df = pd.read_csv(input_path)

    # Generate a report against the target variable.
    report = sv.analyze(df, target)

    # Render the HTML report in your default browser.
    report.show_html()
Python

The blink moment: Now, you can specify the dataset name and a target variable in the CLI. Here’s the demo and the output (live version).
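
With the Titanic data, that call would be along these lines (Survived is the target column in this example):

sweet target titanic.csv Survived
Bash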

Analyzing the Titanic dataset with the Sweetviz CLI with 'Survived' as the target variable.
Sweetviz analysis report for the Titanic dataset with 'Survived' as its target variable.

I’ve specified the ‘Survived’ variable as the target variable. Now, alongside every variable, you can also study the variability of the target.

In most cases, you’ll also want to see how your target variable has changed across different versions of your dataset. It’s only another blink with Sweetviz.

Dataset comparison with a target variable

The code below updates the CLI to accept three arguments: the primary dataset, the comparison dataset, and the target variable.

@app.command()
def compare_with_target(input1: str, input2: str, target: str):

    # Read CSVs from arguments
    df1, df2 = pd.read_csv(input1), pd.read_csv(input2)

    # Generate a comparison report against the target variable
    report = sv.compare(df1, df2, target)

    # Render HTML report in your default browser
    report.show_html()
Python

The blink moment: You can run it with the sample dataset we created earlier for the comparison and the ‘Survived’ column as the target.
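
Typer turns the underscores in the function name into dashes, so with the same placeholder file names the call should look roughly like:

sweet compare-with-target titanic.csv titanic_sample.csv Survived
Bash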

Comparing datasets in command prompt with a target variable.

The output now has both the dataset comparison and the analysis against a target variable. This could be extremely useful in professional settings where you keep working on the same dataset, update it with new observations, and focus on a single variable. Here is the live version to test.

Sweetviz dataset comparison report for Titanic dataset with 'Survived' as target variable.

The flexible EDA playground.

If you can spend a few more blinks but need more control over your analysis, here is what you need. Pandas GUI creates a graphical wrapper around your data frame. Instead of writing code, you can use a convenient interface. Pandas GUI is more of an exploration playground than a quick exploration tool.

You can install it with PyPI:

pip install pandasgui
Bash

Like Sweetviz, Pandas GUI doesn’t come with a CLI. Although starting it isn’t tricky, the CLI wrapper below could help if you aren’t a Python user.

#! /usr/bin/python
# If you are using a virtualenv, change the line above to its Python executable.

import pandas as pd
from pandasgui import show
import typer

app = typer.Typer()


@app.command()
def report(filepath: str):
    # Read the CSV file given as an argument
    df = pd.read_csv(filepath)

    # Launch the Pandas GUI window with the data frame loaded
    show(df)


if __name__ == "__main__":
    app()
Python

As we did for Sweetviz, create a file named pgui with the above content and make it executable with chmod +x pgui. You don’t have to add the current directory to the path again, as we already did that. The command below will start the UI.

pgui titanic.csv
Bash
The analytics interface of Pandas GUI for Titanic dataset

An interactive application pops up. With this tool, you can do analyses that aren’t possible with the other two tools I’ve mentioned.
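
If I recall the API correctly, show() also accepts several data frames at once, and keyword arguments become their display names inside the GUI; a hedged sketch:

import pandas as pd
from pandasgui import show

titanic = pd.read_csv("titanic.csv")
sample = titanic.sample(frac=0.5)  # a second frame, purely for illustration

# Both frames should appear as separate, named items in the GUI.
show(titanic=titanic, sample=sample)
Python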

For example, here is a contour plot of survivors against their age.

Heatmap created in Pandas GUI for the Titanic dataset with the 'Survived' variable on the y-axis and Age on the x-axis.

We aren’t going into more detail about Pandas GUI here, but the video below from its official docs will help you learn more about it.

Conclusion

Apart from the interpretation, exploratory data analysis is repetitive for the most part. Gone are the days when we struggled with SPSS and Excel to do trivial things. Today, we can do a lot more than that in the blink of an eye.

In this article, I’ve discussed three strikingly convenient Python libraries for EDA. Pandas-Profiling is the most popular among them. Sweetviz creates a self-contained HTML application that I find handy. Lastly, we discussed Pandas GUI, a tool that gives you more control over your analysis.

Along with the libraries, we’ve also discussed creating CLI wrappers to make them more convenient, which allows non-Python users to benefit from these tools as well.

Installation and usage are straightforward for all three libraries. With the repetitive tasks of EDA being taken care of, you may focus your attention on the more exciting stuff.

Be armed to surprise your audience before they blink.


Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on LinkedIn, Twitter, and Medium.

Not a Medium member yet? Please use this link to become a member because I earn a commission for referring at no extra cost for you.
