Life Cycle of Machine Learning Algorithms

Srisindhu
6 min read · Jul 11, 2022


What Exactly is ML?

When you hear ML or AI, what comes to mind? Perhaps that you feed some data or images to ML algorithms and these algorithms predict an output. Yes, algorithms do that, but there is a lot of work to be done with the data before it is passed through the ML algorithms.

As I started learning ML concepts, it became obvious that the whole process is not as simple as it seems from the outside!

In this blog, I am going to talk about what is involved in the process of using ML algorithms and the complete steps required to apply ML models to a dataset.

I have divided the basic steps as follows:

1. EDA — Perform Exploratory Data Analysis on the data, which includes Data Collection, Data Cleaning, and Univariate and Bivariate Analysis.

2. Split Data — Split the data into Train and Test datasets.

3. Data Training — Train the model on the train dataset using an ML algorithm suited to the data (classification for a categorical output label, or regression for a continuous output label).

4. Data Testing — Once the model has been trained, it is applied to the test data.

Life Cycle of ML Algorithms

EDA

EDA is a very important step in Machine Learning, as it helps us to understand the dataset, clean up any irregularities present, and perform analysis using plots, so we get a better understanding of the distribution of and relations between the labels (columns) in the dataset. Understanding the relationships between the labels helps us decide which labels depend on each other and how they determine the output variable.

All the unwanted labels are dropped from the dataset. This helps the ML model train well on the data.

Data Collection:

The first step is Data Collection.

Data is very important for ML models, and having proper data leads to better training and good accuracy.

Data Extraction from websites is usually termed web/data scraping.

ML/DS community websites have tons of open datasets, the most popular among which is Kaggle.

Data files downloaded from these websites usually come in CSV and XLS formats.

pandas.read_csv() is used for getting data from a CSV file and pandas.read_excel() for loading data from an Excel file.

Code to load data from a CSV file:

import pandas as pd

datadf = pd.read_csv('Data.csv')

pandas.read_csv() reads a comma-separated values (CSV) file into a DataFrame.

And the code to load data from an Excel file:

import pandas as pd

datadf = pd.read_excel('Data.xls')

pandas.read_excel() reads an Excel file into a pandas DataFrame.

Data Analysis:

Once the data has been read into a DataFrame, the next step is to analyze it.

pandas.DataFrame.info() prints a concise summary of the DataFrame.

This method prints the data types, the number of columns, and the non-null value counts for the columns present in the dataset.

pandas.DataFrame.shape (an attribute, not a method) gives the number of rows and columns present in the DataFrame, i.e., how many items and how many labels are present.

pandas.DataFrame.describe() prints a summary of the DataFrame: the count, mean, std, min, max, and the 25%, 50% (median), and 75% percentiles of all numeric columns.
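As a quick sketch, here is how these summary methods can be called on the datadf DataFrame loaded earlier:

import pandas as pd

datadf = pd.read_csv('Data.csv')

datadf.info()             # data types, column count, and non-null counts
print(datadf.shape)       # (rows, columns); shape is an attribute, so no parentheses
print(datadf.describe())  # count, mean, std, min, 25%/50%/75% percentiles, max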

Data Cleaning:

Before performing Univariate and Bivariate Analysis, data must be cleaned of any irregularities.

This involves checking whether there are any missing values and removing any unwanted variables and outliers from the dataset. This helps reduce unnecessary skewing of the data.

Removing missing values:

To check whether the dataset has any missing values, use pandas.DataFrame.isna(). It returns a boolean value for each item: False if the item is not null and True if it is.

There are some True values for the Cabin label, which means it has missing values.

Another way to check missing values is to take the sum of missing values per column, which makes it easier to see how many are present in the dataset. If there are only a few, those rows can either be deleted, or the missing values can be replaced with the mean/median, provided the column is numeric (a non-categorical variable).
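A minimal sketch of both approaches, assuming a hypothetical numeric column named Age:

import pandas as pd

datadf = pd.read_csv('Data.csv')

# Count the missing values in each column
print(datadf.isna().sum())

# Option 1: drop the rows that contain any missing values
cleaned = datadf.dropna()

# Option 2: fill a numeric column with its median instead
# ('Age' is a hypothetical column name used for illustration)
datadf['Age'] = datadf['Age'].fillna(datadf['Age'].median())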

Removing Outliers:

There are different ways to see whether there are outliers in the dataset. The easiest way is to draw a boxplot. Another way is the Interquartile Range method; I will write another blog entirely about outliers. The image below is an example of how outliers look in a boxplot.

The dots far away from the actual box represent outliers in the SibSp label.
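A sketch of how such a boxplot can be drawn with seaborn, assuming the dataset contains the SibSp column mentioned above:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

datadf = pd.read_csv('Data.csv')

# Points drawn beyond the whiskers of the box are potential outliers
sns.boxplot(x=datadf['SibSp'])
plt.show()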

Univariate Analysis:

After Data Collection and Data Cleaning, the next step is Univariate Analysis.

Univariate Analysis helps to analyze each label in the dataset individually. It shows the range and the central tendency of the values. It's performed using various plots from the seaborn and matplotlib libraries.

The most popular are the barplot, boxplot, distplot, and countplot.

Distplot is used for continuous variables, whereas countplot is used for categorical variables. Boxplot can be used for both types.

The graph shows how Age is distributed in the dataset; the analysis shows most values fall in the 20–40 age group.
A countplot is used for categorical variables like gender, class, paid, spam, etc.
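As a sketch, assuming the Age column and a gender column (here called Sex, a hypothetical name); note that recent seaborn versions replace distplot with histplot:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

datadf = pd.read_csv('Data.csv')

# Distribution of a continuous variable (histplot is the modern replacement for distplot)
sns.histplot(datadf['Age'].dropna(), kde=True)
plt.show()

# Counts of a categorical variable ('Sex' is an assumed column name)
sns.countplot(x='Sex', data=datadf)
plt.show()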

Bivariate Analysis:

Bivariate Analysis helps to understand the relation between two labels in a dataset. Commonly used plots are the scatter plot and the pairplot.

A countplot used to find the relation between two labels in the dataset.
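A minimal sketch of both plots, assuming hypothetical columns Age and Fare:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

datadf = pd.read_csv('Data.csv')

# Scatter plot between two labels ('Age' and 'Fare' are assumed column names)
sns.scatterplot(x='Age', y='Fare', data=datadf)
plt.show()

# Pairplot draws pairwise relations between all numeric labels at once
sns.pairplot(datadf.select_dtypes('number'))
plt.show()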

Splitting Data:

The next step after EDA is to split data into train and test data.

Once the data has been cleaned and stripped of non-dependent columns, the dataset is split into X and y. X contains the input variables, and y contains the output variable. Here comes another library called sklearn. The sklearn library contains the methods required for splitting data, applying different models to the data, and checking various metrics on it.

This means X will contain all the columns/labels which determine the output/result. X and y are then split into Training and Testing datasets. The usual practice is to divide them 70:30. For a small dataset, divide them 80:20 instead.
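A sketch of the split with sklearn's train_test_split, assuming a hypothetical output label named Survived:

import pandas as pd
from sklearn.model_selection import train_test_split

datadf = pd.read_csv('Data.csv')

# 'Survived' is an assumed output label; every other column goes into X
X = datadf.drop(columns=['Survived'])
y = datadf['Survived']

# test_size=0.3 gives a 70:30 split; use test_size=0.2 for 80:20
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)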

Training Data:

This is the part where the data is given to an ML model for training. There are various models, depending on the type of data and the output label: data with a continuous output variable/label is trained with regression models, while for classification data (a categorical output label), classification models are used.

The code below is a sketch of data training using logistic regression, assuming the X_train and y_train produced by the split above and fully numeric features.
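from sklearn.linear_model import LogisticRegression

# Assumes the X_train and y_train from the split above, with all-numeric features
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the training data
print(model.score(X_train, y_train))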

Testing Data:

Once the model is trained, it is run on the testing data to confirm that it predicts unseen data correctly. There are many metrics used to check how well the model has worked on the data.

Accuracy, precision, and recall are some of the commonly used metrics. The accuracy on the training data and the testing data should be almost equal. Precision tells how many of the model's positive predictions were actually correct, and recall tells how many of the actual positive cases the model was able to identify.
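A sketch of computing these metrics with sklearn, continuing from the model trained above:

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Predict on the unseen test data from the earlier split
y_pred = model.predict(X_test)

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))  # assumes a binary output label
print('Recall   :', recall_score(y_test, y_pred))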

There are many other factors involved, such as the need for a validation set, conversion of columns from object type to numeric type, and scaling or normalization of data before training the model. This blog covered the basic ML life cycle.

Thanks for reading my blog! I am here to write and learn from various other blogs.



Written by Srisindhu

Data Science and Machine Learning enthusiast. I like to blog about what I learn and to read blogs to gain more knowledge!
