Descriptive data measures for machine learning

Srisindhu
4 min read · Oct 21, 2021


This article covers the descriptive data measures used in Exploratory Data Analysis. They fall into two categories: measures of central tendency and measures of dispersion.

But why do we need to know them?

Before we apply machine learning algorithms to a dataset, the first task is to analyze the data: how it is distributed, and whether it contains outliers, null values, and so on. Descriptive measures help us understand the data well.

Measures of central tendency :

Measures of central tendency are used for finding the central point or average in a dataset.

They include the mean, median, and mode.

Mean :

The mean is the average of the data.

Mean = Sum of all elements / Total number of elements

The mean is sensitive to outliers: a single extreme value can shift it noticeably.
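To make the effect of an outlier concrete, here is a minimal sketch using Python's built-in statistics module (not the pandas calls used later in this article). The sample values are made up for illustration.

```python
from statistics import mean

# Hypothetical sample values, made up for illustration.
values = [10, 12, 11, 13, 12]
with_outlier = values + [100]  # same data plus one extreme value

mean_clean = mean(values)          # 58 / 5
mean_outlier = mean(with_outlier)  # 158 / 6

print(mean_clean)    # 11.6
print(mean_outlier)  # roughly 26.33, pulled up by the single outlier
```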

Median :

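The median is the middle value of the elements once they are sorted.

Position of the median = (number of elements + 1) / 2

When the number of elements is even, this position falls between two values, and the median is the average of those two middle values.

The median is resistant to outliers.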
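A small sketch of the median for an odd and an even number of elements, again using the built-in statistics module and made-up numbers. The last line swaps one value for an extreme one to show that the median does not move.

```python
from statistics import median

# Hypothetical values, made up for illustration.
odd = [7, 3, 9, 1, 5]   # sorted: 1 3 5 7 9, position (5 + 1) / 2 = 3
even = [7, 3, 9, 1]     # sorted: 1 3 7 9, middle values are 3 and 7

median_odd = median(odd)     # third sorted value
median_even = median(even)   # average of the two middle values

# Replace the largest value with an extreme one: the middle value is unchanged.
median_with_outlier = median([7, 3, 900, 1, 5])

print(median_odd, median_even, median_with_outlier)  # 5 5.0 5
```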

Mode:

The mode is the value that occurs most often. It is also highly resistant to outliers.
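As a quick illustration with made-up values, the statistics module offers mode() for a single most frequent value and multimode() for the case where several values tie.

```python
from statistics import mode, multimode

# Hypothetical values, made up for illustration.
values = [2, 3, 3, 5, 3, 7, 5]
most_common = mode(values)        # 3 appears three times

# When two values tie, multimode returns all of them.
tied = multimode([1, 1, 2, 2, 3])

print(most_common)  # 3
print(tied)         # [1, 2]
```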

Measures of dispersion :

Central tendency describes the central point in a data set, whereas dispersion describes the spread of data in a data set.

Standard deviation :

Standard deviation is a quantity expressing how much the data points typically differ from the mean (average).

It is the square root of variance.

Now let’s see what variance is.

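Variance is the average of the squared differences from the mean.

Formula :

Variance = Sum of (element - Mean)^2 / Total number of elements

Standard deviation = Square root of Variance

For a sample rather than a whole population, the divisor is the number of elements minus 1, which is what pandas uses by default in df.var() and df.std().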
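Here is a minimal sketch that computes the variance by hand as the average squared difference from the mean, then compares it with the built-in statistics functions. The values are made up, and the sample variant (dividing by n - 1) is shown next to the population one.

```python
from statistics import pstdev, pvariance, variance

# Hypothetical values, made up for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mu = sum(values) / n                                 # mean = 5.0
manual_var = sum((x - mu) ** 2 for x in values) / n  # 32 / 8

pop_var = pvariance(values)   # population variance, divides by n
pop_std = pstdev(values)      # square root of the population variance
sample_var = variance(values) # sample variance, divides by n - 1

print(manual_var, pop_var, pop_std)  # 4.0 4 2.0
print(sample_var)                    # 32 / 7, slightly larger
```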

Range :

The range is the difference between the highest and lowest values in a data set.

Range = Max(X) - Min(X)
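A tiny sketch of the range with made-up values. Because it depends only on the two extremes, a single outlier changes it completely.

```python
# Hypothetical values, made up for illustration.
values = [10, 12, 11, 13, 12]
value_range = max(values) - min(values)                # 13 - 10

with_outlier = values + [100]
range_outlier = max(with_outlier) - min(with_outlier)  # 100 - 10

print(value_range, range_outlier)  # 3 90
```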

Coefficient of variation:

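The coefficient of variation is a relative measure of dispersion: the ratio of the standard deviation to the mean. It lets us compare the variability of distributions that are measured on different scales.

Coefficient of variation = Standard deviation / Mean

The greater the value, the greater the variability in the data relative to its mean, irrespective of scale.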
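The sketch below shows the "irrespective of scale" part: the same made-up heights expressed in centimetres and in metres give the same coefficient of variation. The helper name coefficient_of_variation is hypothetical, and using the sample standard deviation is an assumption (the population version works the same way).

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (mean must be non-zero)."""
    return stdev(values) / mean(values)


# Hypothetical heights, made up for illustration.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]  # same data on a different scale

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(round(cv_cm, 4), round(cv_m, 4))  # both about 0.093
```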

Quartiles :

Quartiles are the three cut points that divide the sorted data into 4 equal parts.

Q1/First Quartile is the value below which the smallest 25% of the data fall.

Q2/Second Quartile is the median of the data set. 50% of the values lie below it and 50% lie above it.

Q3/Third Quartile is the value below which 75% of the data fall, leaving the largest 25% above it.
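As a rough sketch, the three quartiles can be computed with statistics.quantiles from the standard library. The data is made up, and method="inclusive" is a choice made here because it interpolates linearly between data points; other conventions can give slightly different quartile values.

```python
from statistics import median, quantiles

# Hypothetical values, made up for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)          # 3.0 5.0 7.0
print(q2 == median(data))  # True: Q2 is the median
```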

Interquartile Range :

The Interquartile Range (IQR) is the range of the middle 50% of the data, that is, the difference between Q3 and Q1.

IQR = Q3 - Q1

The IQR is often used to find outliers in a dataset.
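One common convention (an assumption here, not something this article prescribes) flags a value as an outlier when it lies more than 1.5 times the IQR below Q1 or above Q3. A minimal sketch with made-up data:

```python
from statistics import quantiles

# Hypothetical values, made up for illustration; 100 is the suspicious one.
data = [10, 12, 11, 13, 12, 14, 11, 100]

q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences from the 1.5 * IQR rule of thumb (a convention, not a fixed law).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(iqr)       # 2.25
print(outliers)  # [100]
```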

Now that we have gone through these descriptive measures, let's see how to use them on our data set.

Below is the sample code for each of these in Python. You can also find the code at

https://github.com/srisindhuk/basic-statistics

First, load the data:

import numpy as np
import pandas as pd
df = pd.read_csv("seeds.csv")

df


To find the mean of each column in the dataset,

df.mean()


To find the median of each column,

df.median()

To find the mode of each column,

df.mode()

To find variance,

df.var()

To find standard deviation,

df.std()
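The range, coefficient of variation, quartiles, and IQR described earlier have no dedicated single call in the snippets above, but they can be put together from standard pandas methods. This is a sketch on a small made-up frame rather than seeds.csv; the name df_demo and the column "length" are hypothetical.

```python
import pandas as pd

# Hypothetical single-column frame; "length" is a made-up column name.
df_demo = pd.DataFrame({"length": [1, 2, 3, 4, 5, 6, 7, 8, 9]})

value_range = df_demo.max() - df_demo.min()      # range per column
cv = df_demo.std() / df_demo.mean()              # coefficient of variation
quartiles = df_demo.quantile([0.25, 0.5, 0.75])  # Q1, Q2, Q3 per column
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]  # interquartile range

print(value_range["length"])  # 8
print(quartiles)
print(iqr["length"])          # 4.0
```

Note that df.std() uses the sample standard deviation (dividing by n - 1) by default, so the coefficient of variation above is the sample version.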

pandas also has an API, describe(), which generates the descriptive statistics for the dataset in one call.

Instead of using the individual APIs, just use df.describe(). For each numeric column it reports the count, mean, standard deviation, minimum, the 25%, 50%, and 75% quartiles, and the maximum, excluding NaN values in the dataset.
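To show what describe() returns without needing seeds.csv, here is a sketch on a small made-up frame. The name df_small and the column names "area" and "perimeter" are hypothetical, and one value is left missing on purpose to show that NaN is skipped.

```python
import pandas as pd

# Hypothetical frame standing in for the real dataset; None becomes NaN.
df_small = pd.DataFrame(
    {
        "area": [15.0, 14.0, None, 16.0],
        "perimeter": [14.8, 14.6, 14.1, 14.9],
    }
)

summary = df_small.describe()
print(summary)

# Individual statistics can be read by row label and column name.
print(summary.loc["count", "area"])  # 3.0, the NaN is not counted
print(summary.loc["mean", "area"])   # 15.0, average of the 3 present values
```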
