Data visualization and exploratory data analysis (EDA) are among the most important steps of any machine learning or data science project. To choose the right machine learning model, you first need to understand the data. Deciding on a suitable predictive algorithm (e.g. random forest, logistic regression, or a Bayesian classifier) likewise depends on data visualization and exploratory data analysis.
Data visualization and exploratory data analysis give us a visual way to understand the characteristics of the data. When we plot the data values in different diagrams, we can identify the most important features or factors. We also get an understanding of the correlation between different features. By visualizing the data, we can more easily find the factors that influence the output.
Before we start exploratory data analysis, we should have Python and Jupyter Notebook installed on our system. Please refer to How to install Jupyter Notebook in Ubuntu using VirtualBox or Jupyter Notebook on Windows for installation instructions.
For this tutorial, I am using the following Python packages for exploratory data analysis.
- NumPy
- NumPy supports fast mathematical computing through its multidimensional array object. It supports basic linear algebra, statistics, shape manipulation, random simulation, etc. It is very efficient on large volumes of data. To learn more about NumPy, please refer to NumPy.
To install NumPy, please use the command below:
pip install numpy
- pandas
- pandas is built on top of the NumPy package. Its most useful feature is the DataFrame. A DataFrame can efficiently handle missing data, reshaping, data alignment, etc. It has robust I/O tools for loading different file formats or databases. To learn more about pandas, please refer to pandas.
To install pandas, please use the command below:
pip install pandas
- matplotlib
- Since its release in 2003, matplotlib has been one of the most popular libraries for creating 2D visualizations in Python. It is written in Python and uses NumPy internally to provide good performance when processing large arrays.
To install matplotlib, you can use the commands below:
python -m pip install -U pip
python -m pip install -U matplotlib
To learn more about matplotlib, please refer to matplotlib.
- seaborn
- seaborn is a data visualization library that supports statistical functions. It is built on top of matplotlib and is closely integrated with pandas. To learn more about seaborn, please refer to seaborn: statistical data visualization.
To install seaborn, please use the command below:
pip install seaborn
To explain exploratory data analysis, I am using the heart attack classification dataset available on Kaggle. Let me give a brief overview of the columns in this dataset.
Column descriptions:
- column - 1 : age = Age of the person
- column - 2 : sex = Gender of the person (1=Male and 0=Female)
- column - 3 : cp = Chest Pain category
- Value 0: typical angina
- Value 1: atypical angina
- Value 2: non-anginal pain
- Value 3: asymptomatic
- column - 4 : trtbps = resting blood pressure
- column - 5 : chol = cholesterol in mg/dl
- column - 6 : fbs = (fasting blood sugar > 120 mg/dl) (1 = true and 0 = false)
- column - 7 : restecg = resting electrocardiographic results
- Value 0: normal
- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
- column - 8 : thalachh = maximum heart rate achieved
- column - 9 : exng = exercise induced angina (1 = yes; 0 = no)
- column - 10 : oldpeak = previous peak: ST depression induced by exercise relative to rest
- column - 11 : slp = The slope of the peak exercise ST segment
- Value 0: upsloping
- Value 1: flat
- Value 2: downsloping
- column - 12 : caa = number of major vessels
- column - 13 : thall = Thal rate
- column - 14 : output = Target variable OR diagnosis of heart disease (angiographic disease status)
- Value 0: < 50% diameter narrowing
- Value 1: > 50% diameter narrowing
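Several of these columns store category codes rather than measurements. As a small sketch, the codes listed above can be turned into readable labels with a dictionary and `Series.map`. The rows in this example are made up for illustration and are not taken from the Kaggle file; the `*_label` column names are my own choice.

```python
import pandas as pd

# Code-to-label mappings copied from the column descriptions above.
cp_labels = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}
slp_labels = {0: "upsloping", 1: "flat", 2: "downsloping"}

# Made-up rows, only to demonstrate the mapping (not real records).
df = pd.DataFrame({"cp": [0, 2, 3, 1], "slp": [1, 2, 0, 1]})

# Series.map replaces each code with its label; codes missing from the dict become NaN.
df["cp_label"] = df["cp"].map(cp_labels)    # hypothetical helper column
df["slp_label"] = df["slp"].map(slp_labels)  # hypothetical helper column

print(df)
```

Keeping the original numeric columns next to the labels means the codes are still available for modelling, while the labels make plots and tables easier to read.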
Before we start exploring the data and generating useful visualizations, we first need to load the libraries or packages, e.g. numpy, pandas, seaborn, and matplotlib.
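A typical first notebook cell for this step might look like the following. The aliases `np`, `pd`, `plt`, and `sns` are the usual community conventions, not something the dataset requires.

```python
import numpy as np                 # numerical arrays
import pandas as pd                # DataFrame handling
import matplotlib.pyplot as plt    # base plotting API
import seaborn as sns              # statistical plots on top of matplotlib

# Printing the versions is optional, but it helps when reproducing results later.
for module in (np, pd, sns):
    print(module.__name__, module.__version__)
```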
Once all necessary libraries are loaded, we create a DataFrame from the input CSV file (CSV stands for comma-separated values). A DataFrame is a two-dimensional data structure with labeled rows and columns. Using df.shape, we can check how many rows and columns the DataFrame has.
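Here is a minimal sketch of loading the data and checking its shape. To keep the example runnable on its own, it reads a few made-up rows from an in-memory string; with the real download you would pass the file path instead (the name `heart.csv` in the comment is an assumption about how you saved the file).

```python
import io

import pandas as pd

# Made-up sample rows standing in for the Kaggle file (not real records).
# With the downloaded file, something like: df = pd.read_csv("heart.csv")
sample_csv = io.StringIO(
    "age,sex,cp,trtbps,chol,output\n"
    "52,1,0,128,210,0\n"
    "61,0,2,140,245,1\n"
    "45,1,1,118,198,1\n"
    "58,0,3,150,262,0\n"
)

df = pd.read_csv(sample_csv)

# df.shape is a (rows, columns) tuple.
rows, columns = df.shape
print(f"{rows} rows, {columns} columns")
```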
df.head() returns the first five rows of the dataset. You can also pass a number, e.g. df.head(10), to see the first 10 rows.
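A quick sketch of both calls, again on made-up values so that it runs without the CSV file:

```python
import pandas as pd

# Twelve made-up rows, enough to see the difference between head() and head(10).
df = pd.DataFrame({
    "age": list(range(40, 52)),
    "chol": list(range(200, 212)),
})

first_five = df.head()    # default is n=5
first_ten = df.head(10)   # explicit row count

print(first_five)
print(len(first_ten), "rows returned by df.head(10)")
```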
df.info() gives you a summary of the DataFrame. You can see metadata such as column names, non-null counts, data types, and memory usage.
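For example, on a small made-up frame (the values are illustrative only):

```python
import pandas as pd

# Made-up values; oldpeak is a float column, the others are integers.
df = pd.DataFrame({
    "age": [52, 61, 45],
    "sex": [1, 0, 1],
    "oldpeak": [1.2, 0.0, 2.3],
})

# info() prints its summary directly and returns None.
df.info()

# dtypes gives just the column types if that is all you need.
print(df.dtypes)
```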
df.isnull() is useful for checking null/NA values in a DataFrame. It returns a boolean mask of the same shape, so chaining .sum() gives the number of missing values per column. In a large dataset, many values of a given column can be blank. If most of the values of a column are blank, that column is usually a poor candidate for decision making.
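The sketch below counts missing values per column and flags columns that are mostly blank. The data is made up, and the 50% cutoff is only an assumption for illustration; choose a threshold that suits your dataset.

```python
import numpy as np
import pandas as pd

# Made-up values with deliberately missing entries.
df = pd.DataFrame({
    "age": [52, 61, 45, 58],
    "chol": [210.0, np.nan, 245.0, np.nan],
    "thall": [np.nan, np.nan, np.nan, 2.0],
})

missing_count = df.isnull().sum()    # number of NA values per column
missing_share = df.isnull().mean()   # fraction of NA values per column

# Assumed cutoff: treat a column as "mostly blank" above 50% missing.
mostly_blank = missing_share[missing_share > 0.5].index.tolist()

print(missing_count)
print("Columns that are mostly blank:", mostly_blank)
```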
A scatter plot is a very popular diagram for understanding the relationship between two variables, one on each axis. The correlation can be positive or negative, or there may be no correlation at all. In exploratory data analysis, you can draw a scatter plot with two variables, or sometimes with a single variable and its values.
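As a minimal sketch, here is a scatter plot of age against maximum heart rate (`thalachh`). The choice of these two columns is mine, and the values are made up so the example runs without the CSV file.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for two of the dataset's columns.
df = pd.DataFrame({
    "age": [41, 45, 52, 57, 61, 67],
    "thalachh": [172, 165, 158, 150, 141, 129],
})

fig, ax = plt.subplots()
ax.scatter(df["age"], df["thalachh"])
ax.set_xlabel("age")
ax.set_ylabel("thalachh (maximum heart rate achieved)")
ax.set_title("Age vs. maximum heart rate")

# In Jupyter the figure renders after the cell; in a script, call plt.show().
```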
You can make the scatter plot more informative by encoding a third variable with a color scale and adding a color bar.
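One way to do this with matplotlib is to pass a third column to the `c` argument of `scatter` and attach a color bar to the result. Using `chol` as the color variable and `viridis` as the colormap are assumptions for the sketch, and the values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values; chol drives the color of each point.
df = pd.DataFrame({
    "age": [41, 45, 52, 57, 61, 67],
    "thalachh": [172, 165, 158, 150, 141, 129],
    "chol": [204, 230, 212, 250, 268, 241],
})

fig, ax = plt.subplots()
points = ax.scatter(df["age"], df["thalachh"], c=df["chol"], cmap="viridis")
ax.set_xlabel("age")
ax.set_ylabel("thalachh (maximum heart rate achieved)")

# The color bar maps point colors back to cholesterol values.
colorbar = fig.colorbar(points, ax=ax)
colorbar.set_label("chol (mg/dl)")
```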
A correlation heatmap is a graphical representation of the correlations between the variables (columns) in the dataset. This matrix is very useful for spotting dependencies across different columns. A correlation coefficient is a numerical value in the range -1 to 1.
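A heatmap like this can be sketched with `DataFrame.corr()` (Pearson correlation by default) and `seaborn.heatmap`. The values are made up, and the `coolwarm` colormap is just one reasonable choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for three numeric columns.
df = pd.DataFrame({
    "age": [41, 45, 52, 57, 61, 67],
    "thalachh": [172, 165, 158, 150, 141, 129],
    "chol": [204, 230, 212, 250, 268, 241],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

fig, ax = plt.subplots()
# Fixing vmin/vmax to -1 and 1 keeps the color scale comparable across datasets.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
```

Each cell shows the correlation between the row and column variable, so the diagonal is always 1.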
A histogram is a graphical representation of the distribution of numerical data. Generally, the x-axis is divided into ranges (buckets) and the y-axis shows the number of occurrences in each bucket. For this dataset, the histogram shows patient age on the x-axis and the number of patients on the y-axis. Here we are using seaborn and matplotlib to generate the histogram.