As the saying goes, it’s often not the drawing that’s the hard part; it’s deciding what to draw that stumps most of us. Fortunately, once you’ve chosen, the rest isn’t as hard as it might seem. The same holds in data science, and the phase where you’re *deciding* is what we refer to as EDA, or Exploratory Data Analysis.
EDA plays a pivotal role at every step, from acquiring the dataset to figuring out how to proceed with it and get your desired results.
So, in this article, we will be going through a beginner’s guide to EDA. Don’t worry if you’re a complete newbie and just discovered EDA; by the end of the article, you will have a firm grasp on all the major concepts involved in EDA, along with a step-by-step, hands-on coding example. Let’s dig in!
What is Exploratory Data Analysis?
EDA is one of the first and foremost steps of a data science project and sets the whole project into motion. It provides the project a specific direction and plan.
Exploratory data analysis means studying the data to its depth to extract actionable insight from it. It includes analyzing and summarizing massive datasets, often in the form of charts and graphs.
Hence, it’s arguably the most crucial step in a data science project, which is why it takes up almost 70–80% of the time spent on the whole project. The better you know your dataset, the better use you can make of it!
To get a better picture of where EDA fits in the whole data science process, here’s a graphic:
By now, you should have a solid understanding of what EDA is all about. We’re ready to dive into the specifics!
EDA – A Quick Overview
Let’s quickly go through a brief overview of the steps that EDA comprises. After that, we’ll see a practical example where we’ll perform different EDA techniques on a real-world dataset.
While EDA techniques should be applied according to the situation and the data types available, I’ll walk through the main techniques you need to know as a beginner to build your base upon. Let’s see:
What is Univariate Analysis?
As the name suggests, univariate analysis is when we perform the analysis on variables individually. Whether the variable is categorical or continuous, as long as we’re analyzing it independent of other variables, it’s called univariate analysis.
Here are some of the basic univariate analysis techniques:
- Central Tendency
- Dispersion
- Visualizations (box plots, histograms, etc.)
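To make the first two concrete, here’s a minimal sketch of central tendency and dispersion computed with pandas on a hypothetical column of scores (the values are made up purely for illustration):

```python
import pandas as pd

# Hypothetical exam scores, for illustration only
scores = pd.Series([72, 69, 90, 47, 76, 71, 88, 40, 64, 38])

mean = scores.mean()      # central tendency: average value
median = scores.median()  # central tendency: middle value, robust to outliers
std = scores.std()        # dispersion: sample standard deviation
iqr = scores.quantile(0.75) - scores.quantile(0.25)  # dispersion: interquartile range

print(mean, median, std, iqr)
```

When the mean and median diverge noticeably, that’s often a first hint of skew or outliers worth visualizing with a box plot.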
What is Bivariate Analysis?
Bivariate analysis refers to studying the relationship between any two variables in the dataset. It could be the relationship between two predictor variables or between a predictor and the target variable. Such relationships, if they exist, can cause problems during model development, e.g., redundant features or multicollinearity.
Some of the techniques used for bivariate analysis are:
- Scatter Plots
- Regression Analysis
- Correlation Coefficients
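As a quick taste of the last technique, here’s a sketch of computing a Pearson correlation coefficient with pandas, on a hypothetical pair of score columns (names chosen to echo the dataset we’ll use below):

```python
import pandas as pd

# Hypothetical mini-dataset to illustrate a bivariate technique
df = pd.DataFrame({
    'math score':    [62, 88, 47, 75, 91],
    'reading score': [65, 84, 52, 70, 95],
})

# Pearson correlation coefficient between the two variables
# (ranges from -1 to 1; values near 1 indicate a strong positive relationship)
r = df['math score'].corr(df['reading score'])
print(r)
```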
Hands-on Coding Example
We’ve had enough talk; now let’s move on to a real-world example where you can see what EDA is worth when developing a practical machine learning model. I’ll be using the StudentsPerformance dataset, which you can find along with the Jupyter notebook on my GitHub here.
The task is to predict students’ performances based on certain factors that define their background. However, the goal of this tutorial is to perform the EDA while keeping the dataset and model in mind.
Let’s kick things off by importing the required libraries and reading the data.
Importing the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)
Loading the Dataset
data = pd.read_csv("StudentsPerformance.csv")
Here’s what our dataframe looks like:
You can use Pandas’ df.head function to view the first few rows of the dataframe, like this.
# This will output the first N rows as shown in the image above
# Our df is named data, so we call df.head(N) as so
data.head(5)
Evaluating Columns / Finding Missing Values
As we can see, there are 8 columns in total. For a more detailed view, let’s use the df.info function to learn more about the columns we’re dealing with:
data.info()
So, we have three int-type fields, whereas the rest are object type. This information helps a lot when we’re applying calculations at the column level.
The next most important step is to discover the missing values we have. If we do, we will need to act accordingly in further development stages. We don’t necessarily have to deal with missing values at this stage (EDA). But we do need to explore them so we can cater to them later.
data.isnull().sum()
Luckily, we have no missing values in our dataset. It doesn’t happen often, but you’re in real luck when it does!
However, if you did have some missing values, what would you do? Well, you can either drop the rows with missing values if there aren’t many of them, or you can fill the gaps with the mean or median using Pandas’ fillna() method.
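Here’s a small sketch of both options on a hypothetical toy frame (the column name echoes our dataset, but remember the real data has no gaps):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, for illustration only
df = pd.DataFrame({'math score': [70.0, np.nan, 85.0, 60.0]})

# Option 1: drop rows containing missing values
dropped = df.dropna()

# Option 2: fill gaps with the column median (robust to outliers)
filled = df.fillna(df['math score'].median())

print(len(dropped), filled['math score'].tolist())
```

Which option is right depends on how many values are missing and how much data you can afford to lose.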
Univariate Analysis
Now, it’s time for some quick visualizations to see what groups and categories our data comprises.
First, we’ll explore the ratio of male to female students in the dataset.
sns.set_style('darkgrid')
sns.countplot(y='gender', data=data, palette='colorblind')
plt.xlabel('Count')
plt.ylabel('Gender')
plt.show()
The Seaborn plot gives us a pretty accurate picture of the split between males and females. Here’s a more precise version of the count:
female_count = len(data[data['gender']=='female'])
male_count = len(data) - female_count  # avoids hardcoding the dataset size
print(" Total Females:",female_count,"\n","Total Males:",male_count)
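As an aside, Pandas’ value_counts() gives the same tallies in a single call; here’s a sketch on a small hypothetical sample of the gender column:

```python
import pandas as pd

# Hypothetical sample standing in for our 'gender' column
sample = pd.DataFrame({'gender': ['female', 'male', 'female', 'female', 'male']})

# One call tallies every category in the column
counts = sample['gender'].value_counts()
print(counts['female'], counts['male'])
```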
Wondering why we’re so interested in knowing the ratio of genders? Well, now that we know there are almost equal occurrences of both genders in our dataset, we don’t have to worry about any gender bias in the model we develop using this dataset.
Now, let’s check how the data is divided into different races or ethnicities. To do this, we’ll follow a similar procedure that we did in the previous step. Here’s how the graph will look:
The graph can be plotted using the following code:
sns.set_style('whitegrid')
sns.countplot(x='race/ethnicity',data=data,palette='colorblind')
plt.xlabel("Race/Ethnicity")
plt.ylabel("Count")
plt.show()
Next, we’ll do the same to explore the distribution of the parental level of education column. Let’s see what we have there.
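The code for this plot follows the same countplot pattern as before; here’s a sketch, assuming the column is named 'parental level of education' as in the public StudentsPerformance dataset, demonstrated on a small hypothetical sample:

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend so the sketch runs anywhere
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical sample; the real dataset has 1000 rows in this column
data = pd.DataFrame({'parental level of education': [
    "associate's degree", "some college", "high school",
    "associate's degree", "bachelor's degree", "associate's degree",
]})

sns.set_style('whitegrid')
ax = sns.countplot(x='parental level of education', data=data, palette='colorblind')
plt.xticks(rotation=45, ha='right')  # long labels read better at an angle
plt.xlabel('Parental Level of Education')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
```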
This tells us that most parents have at least an associate degree.
Bivariate Analysis
Next up, we will explore if there is any correlation between individual features (columns) that we need to consider. Certain models like Naïve Bayes use the assumption that there’s no correlation between individual features, so this step is crucial.
So, let’s plot some scatterplots for different combinations of subjects.
Code:
sns.set_style('darkgrid')
plt.title('Maths score vs Reading score',size=16)
plt.xlabel('Maths Score',size=12)
plt.ylabel('Reading Score',size=12)
sns.scatterplot(x='math score', y='reading score', data=data, hue='gender', edgecolor='black', palette='cubehelix', hue_order=['male','female'])
plt.show()
Code:
plt.title('Maths score vs Writing score',size=16)
plt.xlabel('Maths score',size=12)
plt.ylabel('Writing score',size=12)
sns.scatterplot(x='math score', y='writing score', data=data, hue='gender', s=90, edgecolor='black', palette='cubehelix', hue_order=['male','female'])
plt.show()
Code:
sns.set_style('whitegrid')
plt.title('Reading score vs Writing score',size=16)
plt.xlabel('Reading score',size=12)
plt.ylabel('Writing score',size=12)
sns.scatterplot(x='reading score', y='writing score', data=data, hue='gender', s=90, edgecolor='black', palette='colorblind',hue_order=['male','female'])
plt.show()
So, the scatterplots suggest a high degree of correlation between students’ scores in different subjects. Scores in maths vs. reading and writing are a little more spread out, but they generally follow an uptrend: a student who scores higher in maths will generally also score higher in the other subjects. The relationship between reading and writing scores, on the other hand, is even more tightly linear.
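To put numbers on what the scatterplots suggest, we could also compute the pairwise correlation matrix; here’s a sketch on hypothetical scores (on the real dataframe the equivalent call would be data[['math score', 'reading score', 'writing score']].corr()):

```python
import pandas as pd

# Hypothetical scores to illustrate the call; not the real dataset's values
scores = pd.DataFrame({
    'math score':    [60, 72, 85, 47, 91],
    'reading score': [58, 75, 88, 50, 94],
    'writing score': [55, 74, 90, 48, 92],
})

# Pairwise Pearson correlations between all three subjects
corr = scores.corr()
print(corr.round(2))
```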
The information above tells us a lot about the importance of EDA; it would have taken hours to arrive at this result if it wasn’t for EDA.
We know that “total marks” is the target variable in this specific dataset. Naturally, a developer wants to know which of the variables affects the target variable the most.
Let’s plot another graph to find this out:
total_marks = ((data['math score'] + data['reading score'] + data['writing score'])/300)*100
data['total_marks'] = total_marks
kde_data = data[['math score','reading score','writing score','total_marks']]
sns.set_style("darkgrid")
sns.kdeplot(data=kde_data, fill=True, palette='colorblind')  # 'fill' replaces the deprecated 'shade' argument
plt.show()
It’s pretty evident that almost all the subjects affect the total score to the same degree. So, we don’t need to consider any specific feature affecting the target variable more than the other.
That’s all for today, folks! EDA doesn’t end here, and there’s a lot more to learn, but that might be a story for another day. Till then, make sure you try to implement this yourself to discover the true essence of EDA.
Wrap-Up
EDA – Exploratory Data Analysis – is among the first and most important steps of a data science project. Not only does it help define the direction of the project, but it also helps us utilize the dataset in the best way possible.
Throughout the article, we have seen all the major concepts involved in a typical EDA process. Moreover, we went through a step-by-step implementation of some basic EDA practices using a practical dataset.
However, this is just the beginning. As you move on, you’ll discover the world of EDA is far more diverse and detailed. The best way to learn is to try doing your own EDA on some of our datasets for machine learning projects.