Let’s go over some of the useful features.
1. Overview of the data
This section at the top of the report gives you a complete overview of the data in a matter of minutes. You can scan through it and pick out important information such as the size, features, and data types, all summarized in a small table, which is extremely useful when you are short on time.
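For comparison, here is a minimal sketch of how you might pull the same kind of overview facts by hand with pandas. The DataFrame and its column names are made up for illustration, and this is not how SweetViz computes its table internally.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "age": [22, 35, 35, 41],
    "fare": [7.25, 71.28, 71.28, 8.05],
    "embarked": ["S", "C", "C", "Q"],
})

n_rows, n_cols = df.shape             # size of the dataset
dtypes = df.dtypes                    # data type of each feature
duplicates = df.duplicated().sum()    # number of fully duplicated rows

print(f"{n_rows} rows x {n_cols} columns, {duplicates} duplicate row(s)")
print(dtypes)
```

That is three separate calls to remember, which is exactly the bookkeeping the report's overview table saves you.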
2. Descriptive Statistics
Below the overview section, we have descriptive statistics for every single variable present in the data. Some statistics available for these variables are max, min, inter-quartile range, sum, coefficient of kurtosis, skewness coefficient, and more.
In Python, we usually use a statistical package like statsmodels or pandas’ DataFrame.describe() to generate these. But when it’s all readily available in the report, why not take advantage of it?
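To make the manual route concrete, here is a small sketch using pandas on a made-up DataFrame (the columns and values are assumptions for illustration). It covers a few of the statistics mentioned above: min, max, inter-quartile range, sum, skewness, and kurtosis.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29],
    "fare": [7.25, 71.28, 26.55, 8.05, 13.00],
})

summary = df.describe()                         # count, mean, std, min, quartiles, max
iqr = summary.loc["75%"] - summary.loc["25%"]   # inter-quartile range per column
totals = df.sum()                               # column sums
skewness = df.skew()                            # sample skewness per column
kurtosis = df.kurt()                            # sample excess kurtosis per column

print(summary)
print(iqr)
```

Note that describe() alone does not give you the IQR, skewness, or kurtosis, so you end up stitching several calls together.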
3. Missing Values
This is an important feature, especially because we tend to forget to do the missing value check. Missing values need to be treated before you proceed to the modeling phase. If not treated appropriately, missing values may drastically change the results we obtain post-modeling.
For every variable, SweetViz lists the percentage/count of missing values.
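If you wanted the same count and percentage per variable without the report, a minimal pandas sketch could look like this. The data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some gaps, for illustration only
df = pd.DataFrame({
    "age": [22, np.nan, 58, 41, np.nan],
    "cabin": ["C85", None, None, None, "E46"],
    "fare": [7.25, 71.28, 26.55, 8.05, 13.00],
})

is_missing = df.isna()                        # True wherever a value is missing
missing = pd.DataFrame({
    "count": is_missing.sum(),                # number of missing values per column
    "percent": is_missing.mean() * 100,       # share of missing values per column
})

print(missing)
```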
4. Histogram Distribution
The generated report has histograms for numerical variables and bar plots for categorical ones. This is extremely useful when you need to inspect outliers and the distributions as a whole, and SweetViz does a great job.
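Under the hood, a histogram is just a set of bin counts and a bar plot is just a set of category counts. The sketch below computes both for a made-up dataset (the columns, values, and the choice of 5 bins are all assumptions) so you can see how a single extreme value shows up in the bins.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only; 512.33 is a deliberate outlier
df = pd.DataFrame({
    "fare": [7.25, 8.05, 13.0, 26.55, 71.28, 512.33],
    "embarked": ["S", "S", "C", "S", "Q", "C"],
})

# Numerical variable: equal-width bins, as a histogram would draw them
counts, bin_edges = np.histogram(df["fare"], bins=5)

# Categorical variable: one bar per category
category_counts = df["embarked"].value_counts()

print(counts)
print(category_counts)
```

Most fares land in the first bin while the outlier sits alone in the last one, which is the pattern you would spot visually in the report.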
5. Associations/Correlations
In my opinion, this is the most useful feature of them all. I keep going back to the associations, and sometimes I even analyze them directly from the report.
Why? Machine learning is all about correlations and associations. Remember? If the features have no association with the target, a model can’t predict much better than a random guess.
Analyzing correlations helps us in the feature engineering and feature selection phases later in the data science lifecycle.
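As a rough illustration of how correlations feed into feature selection, the sketch below ranks numeric features by the strength of their Pearson correlation with a target column. The data, the column names, and the choice of "price" as the target are assumptions, and it covers numeric columns only, so it is not a substitute for the associations view in the report.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "rooms": [2, 3, 4, 5, 6],
    "area": [50, 72, 95, 120, 140],
    "age": [30, 5, 18, 2, 25],
    "price": [100, 150, 195, 250, 290],
})

corr = df.corr()  # pairwise Pearson correlation (the pandas default)

# Rank features by absolute correlation with the target, strongest first
target_corr = (
    corr["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)

print(target_corr)
```

Here "rooms" and "area" correlate strongly with the target while "age" barely does, so "age" would be the first candidate to drop or to re-engineer.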