Outliers in Machine Learning

Srisindhu
Jul 18, 2022

What are outliers

Outliers are data points that differ markedly from the rest of the values in a dataset. Handling them is part of the data cleaning process, before modeling the data.

Outliers can be caused by human error, corruption of data or some irregularity in the data set.

Removing outliers helps reduce the skew in the data. Leaving outliers in the dataset can lead to less accurate models.

Below is a small data frame containing Id and Age.

As it is a small dataset, a quick look at the data makes it clear that the Age values in the 9th and 11th rows are far higher than those in the rest of the rows. These are termed outliers, as their values don't match the rest.
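A data frame of this shape could be built as in the sketch below. The Age values here are made up for illustration (they are not the article's original data) and are chosen so that the 9th and 11th rows stand out.

```python
import pandas as pd

# Hypothetical values for illustration; not the article's original data.
df = pd.DataFrame({
    "Id": range(1, 13),
    "Age": [21, 23, 22, 25, 24, 22, 23, 24, 60, 21, 75, 23],
})

print(df)
```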

Outlier detection

For a smaller dataset, outliers can be detected by simply inspecting the data.

But for huge datasets, it's not possible to just look at the data and determine if there is an outlier. For such datasets, one of the ways to check for outliers is to use a box plot.

Let us look at the box plot for the above data frame. It can be drawn in Python using the seaborn library.
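A minimal sketch of such a plot, assuming seaborn and matplotlib are installed. The Age values are hypothetical stand-ins, not the article's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical values for illustration; not the article's original data.
df = pd.DataFrame({
    "Id": range(1, 13),
    "Age": [21, 23, 22, 25, 24, 22, 23, 24, 60, 21, 75, 23],
})

# Draw a horizontal box plot of the Age column.
# Points beyond the whiskers are drawn as individual markers.
ax = sns.boxplot(data=df, x="Age")
ax.set_title("Box plot of Age")
```

In an interactive session, calling matplotlib.pyplot.show() afterwards displays the figure.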

In the box plot, two data points are noticeably far away from the rest of the data points.

Another way to see if the data has outliers is the IQR (interquartile range) method. The IQR is the difference between Q3 (the 75th percentile of the data) and Q1 (the 25th percentile of the data).

First, calculate Q1, Q3 and the IQR.
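One way to compute these is with the pandas quantile method, as sketched below. The Age values are hypothetical stand-ins, and quantile uses its default linear interpolation.

```python
import pandas as pd

# Hypothetical values for illustration; not the article's original data.
df = pd.DataFrame({
    "Id": range(1, 13),
    "Age": [21, 23, 22, 25, 24, 22, 23, 24, 60, 21, 75, 23],
})

q1 = df["Age"].quantile(0.25)  # 25th percentile
q3 = df["Age"].quantile(0.75)  # 75th percentile
iqr = q3 - q1                  # interquartile range

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
```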

Next, find the outliers in the dataset. Any value that lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is termed an outlier. In this case, that gives 26 and 50, which exactly matches the box plot values.
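The rule can be sketched as a pair of bounds and a row filter. The Age values below are hypothetical stand-ins, so the flagged values differ from the ones quoted in the article.

```python
import pandas as pd

# Hypothetical values for illustration; not the article's original data.
df = pd.DataFrame({
    "Id": range(1, 13),
    "Age": [21, 23, 22, 25, 24, 22, 23, 24, 60, 21, 75, 23],
})

q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1

# Anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = df[(df["Age"] < lower) | (df["Age"] > upper)]
print(outliers)
```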

Once the outliers are identified, the next step is treating them.

Before that, let's find out how many outliers are present in the data.
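A short sketch of the count, using a boolean mask over hypothetical Age values (not the article's data). The helper name iqr_outlier_mask is made up for this example.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical values for illustration; not the article's original data.
ages = pd.Series([21, 23, 22, 25, 24, 22, 23, 24, 60, 21, 75, 23], name="Age")

mask = iqr_outlier_mask(ages)
n_outliers = int(mask.sum())   # True counts as 1
share = float(mask.mean())     # fraction of rows flagged

print(f"{n_outliers} outliers ({share:.1%} of rows)")
```

Comparing the count against the total number of rows helps decide between dropping the rows and treating them.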

If the dataset is very large and the outliers are very few, the simplest thing to do is to remove those rows from the data, as removing a few rows from a very large dataset won't change the model much.
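Dropping the rows can be sketched as keeping only the values inside the IQR bounds. The data here is a small hypothetical stand-in, used only to show the mechanics.

```python
import pandas as pd

# Hypothetical values for illustration; not the article's original data.
df = pd.DataFrame({
    "Id": range(1, 13),
    "Age": [21, 23, 22, 25, 24, 22, 23, 24, 60, 21, 75, 23],
})

q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1

# Keep only rows whose Age lies within the IQR bounds (inclusive).
cleaned = df[df["Age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
cleaned = cleaned.reset_index(drop=True)

print(len(df), "rows before,", len(cleaned), "rows after")
```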

But when there are quite a number of outliers, they need to be treated rather than dropped.

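A common way to treat outliers is to replace them with the mean or median value. The median is usually the safer choice, since the mean is itself pulled towards the outliers.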
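A minimal sketch of median replacement, assuming the IQR rule is used to flag the outliers. The Age values and the new column name Age_treated are hypothetical.

```python
import pandas as pd

# Hypothetical values for illustration; not the article's original data.
df = pd.DataFrame({
    "Id": range(1, 13),
    "Age": [21, 23, 22, 25, 24, 22, 23, 24, 60, 21, 75, 23],
})

q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["Age"] < q1 - 1.5 * iqr) | (df["Age"] > q3 + 1.5 * iqr)

# Replace flagged values with the column median; other rows are unchanged.
median_age = df["Age"].median()
df["Age_treated"] = df["Age"].mask(is_outlier, median_age)

print(df)
```

Writing the result to a new column keeps the original values available for comparison.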

But this method does not suit every column. For example, the mean is often fractional, which does not make sense for a whole-number column such as Age. In such scenarios, a better option is to divide Age into different ranges using the pandas.cut method.
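Binning with pandas.cut might look like the sketch below. The bin edges, the labels and the column name AgeGroup are assumptions chosen for illustration, as are the Age values.

```python
import pandas as pd

# Hypothetical values for illustration; not the article's original data.
df = pd.DataFrame({
    "Id": range(1, 13),
    "Age": [21, 23, 22, 25, 24, 22, 23, 24, 60, 21, 75, 23],
})

# Bin edges and labels are illustrative assumptions.
# By default each bin includes its right edge: (0, 18], (18, 35], ...
bins = [0, 18, 35, 60, 120]
labels = ["0-18", "19-35", "36-60", "61+"]

df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)

print(df["AgeGroup"].value_counts().sort_index())
```

With ranges in place of raw values, an extreme age simply falls into the top bin and no longer stretches the scale of the column.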
