Exploratory Data Analysis in Python

Data visualization and exploratory data analysis are among the most important steps of any machine learning or data science project. To choose the right machine learning model, you first need to understand the data. Likewise, deciding on a suitable predictive algorithm (e.g. random forest, logistic regression, Bayesian classifier) requires data visualization and exploratory data analysis as a prerequisite.

Data visualization and exploratory data analysis offer a visual way to understand the different characteristics of the data. When we plot the data values in different diagrams, we can identify the most important features or factors and see how different features correlate with each other. Visualizing the data also makes it easier to find the factors that influence the output.

Before we start exploratory data analysis, you should have Python and Jupyter Notebook installed on your system. For installation instructions, please refer to How to install Jupyter Notebook in Ubuntu using VirtualBox or Jupyter Notebook on Windows.

For this tutorial, I am using the Python packages below for exploratory data analysis.

NumPy

NumPy supports fast mathematical computing through its multidimensional array object. It provides basic linear algebra, statistics, shape manipulation, random simulation, etc., and it is very efficient on large volumes of data. To learn more about NumPy, please refer to NumPy.

To install NumPy, please use the command below

pip install numpy
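
Once NumPy is installed, the features mentioned above can be tried out in a few lines. The snippet below is only a small sketch; the array values and variable names are arbitrary choices for illustration.

```python
import numpy as np

# A 2 x 3 multidimensional array built by reshaping the numbers 0..5
a = np.arange(6).reshape(2, 3)

# Statistics along an axis: the mean of each column
col_means = a.mean(axis=0)

# Basic linear algebra: matrix product of a with its transpose
product = a @ a.T

# Random simulation: 1000 draws from a standard normal distribution
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

print(a.shape, col_means, product, samples.mean())
```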

pandas

pandas is built on top of the NumPy package. The most useful feature of pandas is the DataFrame, which can efficiently handle missing data, reshaping, data alignment, etc. It also has robust I/O tools for loading different file formats or databases. To learn more about pandas, please refer to pandas.

To install pandas, please use the command below

pip install pandas
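
To get a feel for data alignment and missing-data handling, here is a small sketch using two pandas Series with made-up values and labels. It is an illustration only, not part of the heart attack dataset.

```python
import pandas as pd

# Two Series whose index labels only partly overlap
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

# pandas aligns on the index labels; "a" has no partner, so the result is NaN there
total = s1 + s2

# Missing values can then be replaced explicitly
filled = total.fillna(0.0)

print(total)
print(filled)
```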

matplotlib

Since 2003, matplotlib has been one of the most popular libraries for creating 2D visualizations in Python. It is written in Python and uses NumPy internally to provide good performance when processing large arrays. To install matplotlib, you can use the commands below

python -m pip install -U pip
python -m pip install -U matplotlib

To learn more about matplotlib, please refer to matplotlib.

seaborn

seaborn is a data visualization library that supports statistical functions. It is built on top of matplotlib and is closely integrated with pandas. To learn more about seaborn, please refer to seaborn: statistical data visualization.

To install seaborn, please use the command below

pip install seaborn

To explain exploratory data analysis, I am using the heart attack classification dataset available on Kaggle. Let me give a brief overview of the columns in this dataset.

Column descriptions:

column - 1 : age = Age of the person
column - 2 : sex = Gender of the person (1=Male and 0=Female)
column - 3 : cp = Chest Pain category
- Value 0: typical angina
- Value 1: atypical angina
- Value 2: non-anginal pain
- Value 3: asymptomatic
column - 4 : trtbps = resting blood pressure
column - 5 : chol = cholesterol in mg/dl
column - 6 : fbs = (fasting blood sugar > 120 mg/dl) (1 = true and 0 = false)
column - 7 : restecg = resting electrocardiographic results
- Value 0: normal
- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
column - 8 : thalachh = maximum heart rate achieved
column - 9 : exng = exercise induced angina (1 = yes; 0 = no)
column - 10 : oldpeak = Previous peak or ST depression induced by exercise relative to rest
column - 11 : slp = The slope of the peak exercise ST segment
- Value 0: upsloping
- Value 1: flat
- Value 2: downsloping
column - 12 : caa = number of major vessels
column - 13 : thall = Thal rate
column - 14 : output = Target variable OR diagnosis of heart disease (angiographic disease status)
- Value 0: < 50% diameter narrowing
- Value 1: > 50% diameter narrowing

Before we start exploring the data and generating useful visualizations, we first need to load the libraries or packages, e.g. numpy, pandas, seaborn and matplotlib.

Figure: Loading the numpy, pandas, seaborn and matplotlib libraries
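
A minimal sketch of this loading step could look like the following. The aliases np, pd, plt and sns are the usual community conventions and are an assumption here, since the text does not say which names are used.

```python
# Conventional aliases (an assumption, not taken from the tutorial text)
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Printing the versions is a quick check that every package imported correctly
for module in (np, pd, matplotlib, sns):
    print(module.__name__, module.__version__)
```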

After all necessary libraries are loaded, we will create a DataFrame from the input CSV file (CSV stands for comma-separated values). A DataFrame is a two-dimensional data structure with labeled rows and columns. Using df.shape, we can check how many rows and columns are present in the DataFrame.

Figure: Loading the CSV data with pandas, which returns a DataFrame
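
As a rough sketch of this step, the snippet below reads CSV text with pd.read_csv and checks df.shape. To keep it runnable without the Kaggle download, it reads three made-up rows from an in-memory string; with the real file you would pass its path instead (the file name heart.csv in the comment is an assumption).

```python
import io

import pandas as pd

# Stand-in for the real file: three made-up rows using the 14 column names
# described above. With the downloaded dataset this would be something like
# pd.read_csv("heart.csv") (hypothetical file name).
sample_csv = io.StringIO(
    "age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output\n"
    "50,1,0,120,200,0,0,150,0,1.0,1,0,2,0\n"
    "61,0,2,135,240,0,1,160,0,0.4,2,0,2,1\n"
    "45,1,1,128,215,1,1,172,1,2.2,0,1,3,1\n"
)

df = pd.read_csv(sample_csv)

# shape is a (rows, columns) tuple
print(df.shape)
```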

df.head() returns the first five rows of the dataset. You can also pass a number, e.g. df.head(10), to see the first 10 rows.

Figure: Previewing the data with head()
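
Here is a small sketch of both calls. The 12-row DataFrame is a made-up stand-in so that head() and head(10) return different numbers of rows.

```python
import pandas as pd

# Hypothetical 12-row frame (values are placeholders, not real patient data)
df = pd.DataFrame({"age": range(40, 52), "chol": range(200, 212)})

first_five = df.head()    # default: first 5 rows
first_ten = df.head(10)   # first 10 rows

print(first_five)
```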

df.info() gives you a summary of the DataFrame. You can see metadata such as column names, non-null counts, data types and memory usage.

Figure: Column information shown by info()
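
The sketch below calls info() on a tiny made-up DataFrame. Because info() prints its summary instead of returning it, the second call passes a text buffer through the buf parameter to capture the same output as a string.

```python
import io

import pandas as pd

# Small made-up frame with integer and float columns
df = pd.DataFrame({
    "age": [50, 61, 45],
    "oldpeak": [1.0, 0.4, 2.2],
    "sex": [1, 0, 1],
})

# Prints column names, non-null counts, dtypes and memory usage
df.info()

# Capture the same summary as text instead of printing it
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
```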

df.isnull() is useful for checking null/NA values in a DataFrame. In a large dataset, many values of a given column can be blank. If most of the values in a column are blank, you should usually not treat that column as important for decision making.

Figure: Checking column values with isnull()
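
One way to turn this check into numbers is to count the missing values per column and flag the columns that are mostly empty. In the sketch below, the data, the notes column and the 50% threshold are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame; "notes" is a hypothetical column that is mostly empty
df = pd.DataFrame({
    "age": [50, 61, 45, 58],
    "chol": [200, np.nan, 215, 236],
    "notes": [np.nan, np.nan, np.nan, "follow-up"],
})

# isnull() gives a True/False mask; summing it counts missing values per column
missing_counts = df.isnull().sum()

# The mean of the mask is the share of missing values per column
missing_share = df.isnull().mean()

# Arbitrary cut-off: flag columns where more than half of the values are missing
threshold = 0.5
mostly_empty = missing_share[missing_share > threshold].index.tolist()

print(missing_counts)
print(mostly_empty)
```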

A scatter plot is a very popular diagram for understanding the correlation between two variables, one on each axis. The correlation can be positive or negative, or there may be no correlation at all. In exploratory data analysis, you can draw a scatter plot using two variables, or sometimes a single variable and its values.

Figure: Data visualization using a matplotlib scatter plot
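
A minimal scatter plot sketch with matplotlib follows. The choice of age and trtbps as the two variables is an assumption, and the values are made up. In a Jupyter notebook the figure renders inline; in a plain script you would add plt.show().

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for two columns of the dataset
df = pd.DataFrame({
    "age": [45, 50, 52, 58, 61, 66],
    "trtbps": [118, 124, 130, 128, 140, 150],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["age"], df["trtbps"])
ax.set_xlabel("age")
ax.set_ylabel("trtbps (resting blood pressure)")
ax.set_title("Age vs. resting blood pressure")

# Pearson correlation coefficient as a numeric companion to the plot
r = df["age"].corr(df["trtbps"])
print(round(r, 2))
```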

You can create a more informative scatter plot by adding a color scale or color bar.

Figure: Data visualization using a matplotlib scatter plot with a color scale
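
As a sketch of the color scale idea, the c argument of scatter can map a third column to colors and fig.colorbar adds the legend for that scale. The columns used (age, chol, thalachh), the viridis colormap and the values are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for three columns of the dataset
df = pd.DataFrame({
    "age": [45, 50, 52, 58, 61, 66],
    "chol": [210, 198, 245, 230, 260, 275],
    "thalachh": [175, 168, 160, 152, 148, 130],
})

fig, ax = plt.subplots(figsize=(6, 4))

# c= maps each point's thalachh value onto the colormap
points = ax.scatter(df["age"], df["chol"], c=df["thalachh"], cmap="viridis")
ax.set_xlabel("age")
ax.set_ylabel("chol")

# The color bar explains which color corresponds to which value
colorbar = fig.colorbar(points, ax=ax)
colorbar.set_label("thalachh (maximum heart rate)")
```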

A correlation heatmap is a graphical representation of the correlations between different variables or columns in the dataset. This matrix is very useful for determining the dependencies across different columns. A correlation is a numerical value in the range -1 to 1.

Figure: Data visualization using a seaborn correlation heatmap
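
A possible sketch: compute the correlation matrix with df.corr() (Pearson by default) and pass it to sns.heatmap. The four columns and their values are made up, and the coolwarm colormap and annotation format are only one reasonable choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for four numeric columns of the dataset
df = pd.DataFrame({
    "age": [45, 50, 52, 58, 61, 66],
    "trtbps": [118, 124, 130, 128, 140, 150],
    "chol": [210, 198, 245, 230, 260, 275],
    "thalachh": [175, 168, 160, 152, 148, 130],
})

# Pairwise correlation between all columns; each value lies between -1 and 1
corr = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# Fixing vmin/vmax to the full correlation range keeps the colors comparable
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
```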

A histogram is a graphical representation of the distribution of numerical data. Generally, the x-axis is divided into buckets (ranges of values) and the y-axis shows the number of occurrences in each bucket. In the histogram below, the x-axis represents the patient's age and the y-axis the number of patients. Here we are using seaborn and matplotlib to generate the histogram.

Figure: Data visualization using a seaborn histogram
