What Is Unsupervised Learning, and Which Algorithms Are Used for It?
Unsupervised learning is a machine learning technique used to analyze unlabeled data; put simply, it works with data that has no target (output) variable.
For example, consider a box of Lego blocks with all the colors mixed together, and you decide to sort them. You could group them by color, shape, or even size. That is exactly what unsupervised learning does.
Unsupervised learning is more complex than supervised learning, but it is the better option when there is no labeled data or when unlabeled data is much easier to obtain.
The popular unsupervised algorithms are:
- K-means Clustering
- Hierarchical Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
K-Means Clustering:
It is the most commonly used unsupervised algorithm.
In K-means Clustering, n observations are partitioned into k clusters.
In the first step, k centroids are formed; each centroid is the arithmetic mean of all the data points assigned to its cluster, which is why this is called a centroid-based algorithm.
The algorithm then computes the distance between each data point and the k centroids, assigns each point to its nearest centroid, and forms the clusters. As a result, points assigned to the same centroid are more similar to one another than to points in other clusters.
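As a quick illustration, here is a minimal sketch of this assign-and-average loop using scikit-learn's KMeans; the toy two-blob data and k = 2 are assumptions made for the example.

```python
# Minimal k-means sketch (assumes scikit-learn; data and k are toy choices).
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two loose blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print(kmeans.labels_[:10])      # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)  # each centroid is the mean of its cluster
```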
The question that always arises with k-means is how to choose the optimal k. The most commonly used approach is the elbow method.
The optimal k is chosen at the "elbow" of the distortion curve: the point after which the distortion stops dropping sharply and begins to decrease in a roughly linear fashion. For instance, if the curve flattens after k = 6, choose k = 6.
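A minimal sketch of the elbow method, assuming scikit-learn and matplotlib; the placeholder data stands in for your own feature matrix.

```python
# Elbow method sketch: plot k-means distortion (inertia) for a range of k.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # placeholder data; substitute your own

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("distortion (inertia)")
plt.show()  # pick k at the elbow, where the curve stops dropping sharply
```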
Pros of K-Means:
- It is fast, and assigning a data point to a cluster based on distance is straightforward.
Cons of K-Means:
- Choosing the optimal k is not obvious.
- Categorical data must first be converted to numerical form, because k-means needs a mean value to calculate centroids.
Hierarchical Clustering:
This algorithm groups similar objects into clusters.
Any type of distance measure can be used to quantify similarity, which is why it is called a connectivity-based algorithm.
The output is a tree called a dendrogram, which shows the relationships between similar objects.
Hierarchical clustering can be done in two ways: agglomerative or divisive.
In agglomerative clustering, we start with each point as its own cluster and merge the closest clusters one at a time; this bottom-up approach is the most widely used, as sketched below.
In divisive clustering, we start with one cluster containing all the objects and keep splitting it until each cluster holds a single object.
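Here is a minimal agglomerative sketch using SciPy, which also draws the dendrogram mentioned above; the toy data and Ward linkage are assumptions for the example.

```python
# Agglomerative clustering sketch with SciPy, including the dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

# Ward linkage merges the pair of clusters with the smallest variance increase;
# "single", "complete", and "average" are other common distance-based choices.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()
```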
DBSCAN Clustering:
This unsupervised algorithm is suited to large datasets containing noise and outliers. Clusters are formed by grouping points in high-density regions, while low-density points are separated out as outliers.
DBSCAN has two parameters:
- Epsilon: the distance measure used to locate neighboring points. If the distance between two points is less than or equal to epsilon, they are considered neighbors. A suitable epsilon is typically read off a k-distance graph, as sketched below.
- minPts: the minimum number of points required inside the epsilon radius for a cluster. As a rule of thumb, choose a larger minPts for larger datasets.
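As a sketch of how epsilon is read off a k-distance graph (assuming scikit-learn and a hypothetical minPts of 4): sort every point's distance to its minPts-th nearest neighbor and look for the knee in the curve.

```python
# k-distance graph sketch for choosing epsilon.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))  # placeholder data

min_pts = 4  # assumed minPts for this example
# +1 because each query point is returned as its own zero-distance neighbor.
dists, _ = NearestNeighbors(n_neighbors=min_pts + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])  # sorted distance to the min_pts-th neighbor

plt.plot(k_dist)
plt.ylabel(f"distance to {min_pts}-th nearest neighbor")
plt.show()  # epsilon is roughly the y-value at the knee of this curve
```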
DBSCAN classifies points into three types:
- Core point: a point that has at least minPts points within its epsilon radius.
- Border point: a point that has fewer than minPts points within its epsilon radius but lies within the epsilon neighborhood of a core point.
- Noise: essentially an outlier; it is neither a core point nor a border point and is not assigned to any cluster.
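Putting it together, here is a minimal DBSCAN sketch with scikit-learn; eps and min_samples (scikit-learn's name for minPts) are assumed values tuned to the toy data.

```python
# Minimal DBSCAN sketch: two dense blobs plus scattered noise.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(0, 0.3, (50, 2)),  # first dense cluster
    rng.normal(4, 0.3, (50, 2)),  # second dense cluster
    rng.uniform(-2, 6, (10, 2)),  # sparse points, mostly noise
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", sorted(set(labels) - {-1}))
print("noise points (label -1):", int((labels == -1).sum()))
```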
Silhouette Score:
We use metrics such as recall, precision, and the confusion matrix to evaluate the effectiveness of supervised learning. In the same way, the silhouette score is the metric used to evaluate how well an unsupervised model performed. Its value ranges from -1 to 1: values near 1 mean the clusters are dense and well separated, values near 0 mean the clusters overlap, and values tending toward -1 mean points have likely been assigned to the wrong clusters.
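A minimal sketch of computing the silhouette score for a k-means result, assuming scikit-learn and toy two-blob data:

```python
# Silhouette score sketch: evaluate k-means labels on toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=4).fit_predict(X)
print(silhouette_score(X, labels))  # near 1 here: well-separated clusters
```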
Hope this article gives you insight into Unsupervised Learning and its algorithms. Thank you for reading!