Data Exploration: Harnessing the Power of Python's Hidden Libraries

Agenda

  • Introduction

  • Drawdata

  • Pyjanitor

  • Sweetviz

  • Autoviz

  • D-Tale

  • Yellowbrick

  • Conclusion


Data Exploration

  • Data exploration is a critical step in the data analysis process.
  • It allows data scientists to:
    • summarise the main characteristics of the data
    • identify underlying patterns, relationships, and trends
    • identify potential outliers or anomalies
    • visualise the data to gain a better understanding of its structure and distribution.

Popular data exploration libraries

  • Pandas
  • Matplotlib
  • Seaborn
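Before turning to the hidden gems, it helps to recall what a baseline exploration with Pandas alone looks like. The sketch below uses a small hand-made DataFrame; the column names are purely illustrative, not from a real dataset.

```python
# Minimal baseline EDA with pandas on a toy DataFrame
import pandas as pd

df = pd.DataFrame({
    "age": [23, 45, 31, 35, 62],
    "credit_amount": [1200, 3400, 2500, 4100, 900],
})

summary = df.describe()        # count, mean, std, min, quartiles, max
print(summary.loc["mean"])     # mean of each numeric column
print(df.isna().sum())         # missing values per column
print(df.shape)                # rows x columns
```

Everything the libraries below do builds on this kind of summary, but automates or enriches it.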


In this article, we will take a look at some libraries you might never have encountered before. We will show you how to leverage these lesser-known data exploration libraries in Python to gain deeper insights into your data, and how they can enhance your workflow.


DrawData : Generating Data

Drawdata is a small but very useful visualisation library that lets you generate datasets by interactively drawing them on a canvas inside a Jupyter notebook.

It supports several chart types, including scatter plots, line charts, and histograms. Once you have drawn your data, you can copy it to the clipboard and load it straight into a Pandas DataFrame for further analysis.

# DrawData: draw points on the interactive canvas, then use the
# "copy csv" button to place the drawn data on the clipboard
import pandas as pd
import drawdata as dd

dd.draw_scatter()
# dd.draw_line()

# load the drawn data from the clipboard into a DataFrame
df = pd.read_clipboard(sep=",")
df.head()



Pyjanitor : Preprocessing data pipelines

Pyjanitor is a library that extends Pandas' functionality and helps you clean and transform messy datasets with ease. It provides a set of functions that can be chained together to perform common data-cleaning operations.

Features

  • Chaining multiple cleaning operations together to form complex data-cleaning pipelines
  • Automatic column-name correction and formatting
  • Easy removal of duplicate or missing values
  • Simplification of data types
import pandas as pd
import janitor  # registers the pyjanitor methods on DataFrame

df = pd.read_csv("credit_customers.csv")

# chaining cleaning steps into a single pipeline
clean_df = (
    df
    .remove_empty()  # drop fully empty rows and columns
    .clean_names(remove_special=True, enforce_string=True, strip_underscores=True)
)
clean_df.head()

# useful for machine learning
clean_df = clean_df.label_encode(column_names="class")
x, y = clean_df.get_features_targets(
    target_column_names=["class"],
    feature_column_names=["age", "existing_credits"],
)


SweetViz : Visualisation and reports

Sweetviz is a Python library for creating visualisations of datasets with just a few lines of code.

It allows users to quickly compare and analyse datasets to gain insights into their structure and distribution.

Features

  • Automated exploratory data analysis (EDA) with interactive report generation. Users simply input their dataset and Sweetviz generates a comprehensive report with visualisations and statistics, making it easy to compare different aspects of the data and identify patterns and outliers.
  • Visualise and compare different datasets, infer variable types, and find mixed-type associations
  • Customise the visualisations and statistics, specify which variables to include in the report, and adjust the level of detail
import sweetviz as sv

# generate a full EDA report for the cleaned dataset
report = sv.analyze(clean_df)
report.show_html("report.html")

# comparing train and test datasets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

comparison = sv.compare([x_train, "Train"], [x_test, "Test"])
comparison.show_html("compare.html")



AutoViz : Exploratory Data Analysis

AutoViz is a powerful library that can automatically generate visualisations of your data with minimal code. It can create a wide variety of charts, including scatter plots, histograms, heatmaps, and box plots, making it a versatile tool for data exploration and analysis.

Features

  • Detects the type of data in each column and generates the most appropriate visualisation for it
  • Identifies patterns and relationships in the data
  • Provides data-cleaning suggestions
# AutoViz
from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()

analysis = AV.AutoViz(
    filename="",            # empty: analyse the DataFrame passed via dfte
    dfte=clean_df,
    lowess=False,
    chart_format="bokeh",
    max_rows_analyzed=150000,
    max_cols_analyzed=30,
)



D-Tale : Exploring Data

D-Tale is a Python library that facilitates the exploration of data in a Jupyter notebook environment. It provides an intuitive graphical interface to visualise and analyse data, making it easy to identify trends, outliers, and relationships within the data.

Features

  • Provides a quick summary of the data, including descriptive statistics, missing values, data types, and data shape
  • Allows users to interactively filter data by selecting specific data points or applying logical conditions
  • Supports data preparation through data manipulation and type conversion
  • Provides several built-in visualisation options, including scatter plots, histograms, and box plots
  • Allows users to export their filtered and transformed data to CSV, Excel, or JSON format
# D-Tale
import dtale as dt

# dt.show(clean_df)  # display the grid inline in a notebook
dt.show(clean_df).open_browser()  # or open it in a browser tab



Yellowbrick : Machine Learning Toolkit

Yellowbrick is an open-source Python library that is used for machine learning visualisation. It provides a unique and interactive way to explore and analyse your machine learning models' performance.

Features

  • Feature Importance: Identify the most important features in your dataset that contribute to the target variable.
  • Model Selection: Compare the performance of different models using visual representations.
  • Class Balance: Visualise the distribution of target classes in your dataset.
  • Hyperparameter Tuning: Visualise how model performance changes across hyperparameter values.
  • Discrimination Threshold: Explore how changes to the classification threshold can impact model performance.
  • ROC/AUC Curve: Visualise the trade-off between true positive and false positive rates.
# Yellowbrick
from sklearn.tree import DecisionTreeClassifier
from yellowbrick.features import Rank2D
from yellowbrick.classifier import ROCAUC, ConfusionMatrix
from yellowbrick.target import ClassBalance
from yellowbrick.model_selection import FeatureImportances

# pairwise feature correlations
visualiser = Rank2D(algorithm="pearson")
visualiser.fit(x, y)
visualiser.transform(x)
visualiser.show()

# train a model
# model = SVC().fit(x, y)  # alternative: from sklearn.svm import SVC
model = DecisionTreeClassifier().fit(x, y)
labels = ["good", "bad"]

# ROC/AUC curve
visualiser = ROCAUC(model, classes=labels)
visualiser.fit(x_train, y_train)
visualiser.score(x_test, y_test)
visualiser.show()

# class balance of the target
visualiser = ClassBalance(labels=["good", "bad"])
visualiser.fit(clean_df["class"])
visualiser.show()

# confusion matrix
cm = ConfusionMatrix(model, classes=labels)
cm.fit(x_train, y_train)
cm.score(x_test, y_test)
cm.show()

# feature importances
viz = FeatureImportances(model)
viz.fit(x, y)
viz.show()



Conclusion

In this article, we have explored several lesser-known visualisation libraries in Python. Each of these libraries offers unique features and functionality that can help with exploratory data analysis, data cleaning, feature engineering, and model calibration.

By leveraging these hidden gems, data scientists can accelerate their workflow, improve their model accuracy, and save time in the long run.

We hope this article has provided you with valuable insights into these hidden visualisation libraries and inspired you to explore and experiment with them in your own data science projects. With their intuitive interfaces and powerful capabilities, these libraries have the potential to transform the way we approach data analysis and modelling.
