*Describe was the first function I tried on any new dataset. But I've found a better one now.*
I replaced it with Skimpy. It’s a small Python package that shows extended summary results for a dataset. You can also run it in a terminal window without entering a Python shell.
You can install it from PyPI using the following command.
pip install skimpy
Why Skimpy?
In a previous post, I shared three Python exploratory data analysis tools. With them, you can generate more complete reports about your datasets in the blink of an eye.
But what if you need a simpler cut?
When I start with a new dataset, I run df.describe() almost every time. It gives you a nice tabular view of the important numbers.
But to study the dataset more closely, I have to create histograms and several other summaries.
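As a rough sketch of that two-step routine, the snippet below runs describe() and then makes a separate pass per column to get a text histogram. The sample data and column names are hypothetical stand-ins, and the binning via value_counts(bins=...) is just one way to do it.

```python
import pandas as pd

# Hypothetical sample data standing in for a real dataset.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 165.0, 170.0, 172.0, 180.0, 185.0, 190.0],
        "weight_kg": [50.0, 58.0, 62.0, 70.0, 68.0, 80.0, 85.0, 95.0],
    }
)

# Step 1: the usual tabular summary (count, mean, std, quartiles, ...).
summary = df.describe()
print(summary)

# Step 2: a separate pass per column to see the distribution.
# value_counts(bins=...) groups the values into equal-width intervals.
for column in df.columns:
    binned = df[column].value_counts(bins=4, sort=False)
    print(f"\n{column}")
    for interval, count in binned.items():
        print(f"{interval}: {'#' * int(count)}")
```

Even for two columns, the summary and the distribution end up in separate outputs that you have to read side by side.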
This is where Skimpy helps us. With a single command, it generates more metrics and histograms about the dataset.
import pandas as pd
from skimpy import skim

df = pd.read_csv("iris.csv")  # any pandas DataFrame works here
skim(df)
The summary above contains more information in a visually organized way.
Each section summarizes variables of the same type. Numerical variables also include histograms. I find the first and last dates and the frequency details for DateTime variables handy.
Summarize datasets in a terminal; you don't need a Python REPL.
You don’t have to open a Python REPL or a Jupyter notebook every time you want to use Skimpy. You can run the Skimpy CLI directly on a dataset file to summarize it.
skimpy iris.csv
Running the above command in a terminal prints the same summary in the window and then exits.
This makes Skimpy a convenient way to generate quick summaries of a dataset, even without writing any code.
Final thought
Skimpy is a new tool in the Python ecosystem that helps us work with data more easily. Yet it already solves a real problem by generating extended summary results.
You can learn more about it on its GitHub page, and you can also contribute to improving the tool.