One of the most common clustering algorithms is k-means clustering, though there are several more clustering algorithms to explore. This blog explains the theoretical steps needed to understand the k-means clustering technique.
Machine learning is mainly classified into supervised learning, unsupervised learning (UL) and reinforcement learning.
This is my first blog and I wanted to write about Unsupervised learning since I find it very interesting and even people who are new to machine learning can understand it easily.
Unlike supervised learning algorithms such as KNN or decision trees, unsupervised learning can also be used as an extension of exploratory data analysis (EDA). Its unique feature is that it can find hidden features in a dataset that has no target column, and it can be used to detect outliers too.
What is so unique about unsupervised learning?
Imagine you are working with a dataset and want to form groups of records with similar behavior, identify hidden features in the data, or detect outliers. Using UL would be a good start.
For example, say you are working with a dataset from a supermarket and want to group the customers by their behavior so you can give them offers during a festival, but you don’t know where to start.
Well, try running a basic k-means clustering algorithm on the dataset, and you can group the customers into clusters with common behavior.
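from sklearn.cluster import KMeans

# X is assumed to be a numeric feature matrix of shape (n_samples, n_features)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)  # cluster label for each row of X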
The next question is how many clusters should be formed. The code above uses three clusters, but coming back to our example, we will not know in advance the optimal number of groups to split the customers into.
To identify the optimal number of clusters, there are different techniques such as the elbow method, silhouette analysis and many more.
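Silhouette analysis can be sketched in a few lines. This is a minimal example rather than a definitive recipe: it assumes the iris measurements as stand-in data, and `n_init=10` and `random_state=0` are choices made here only so the run is repeatable. The silhouette score lies between -1 and 1, and a higher average suggests better-separated clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Stand-in data (assumption): the four iris measurements
X_demo = load_iris().data

silhouette = {}
for k in range(2, 7):  # the score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    silhouette[k] = silhouette_score(X_demo, labels)

# k with the highest average silhouette score
best_k = max(silhouette, key=silhouette.get)
for k, score in silhouette.items():
    print(f"k={k}: average silhouette = {score:.3f}")
print("k with the highest average silhouette:", best_k)
```

The highest score is only a suggestion. The final number of clusters should still make sense for the problem at hand.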
Elbow method
The most basic way to identify the number of clusters is to use the elbow method. It plots the sum of squared errors (SSE) against the number of clusters and looks for the point where the curve bends.
From the image above, the optimal number of clusters is 3, since the curve flattens after that point and forming extra clusters won’t make a big difference. (Note that when you are working in the industry, the drop in the curve won’t be as distinct as in the image, and you should use domain knowledge to limit the number of clusters.)
This is sample code to form clusters and plot the elbow curve for the iris dataset.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris['feature_names'])
# Use three of the four features for clustering
data = X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]

sse = {}  # number of clusters -> sum of squared distances to the nearest centroid
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    # The labels are not written back into `data`; an extra "clusters"
    # column would be treated as a feature in the next iteration.
    sse[k] = kmeans.inertia_

plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
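When the bend in the curve is not obvious, one rough aid is to print how much the SSE falls with each extra cluster. The sketch below is an illustration, not a standard rule: it uses all four iris features, and `n_init=10` and `random_state=0` are assumptions made here for repeatable results.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

features = load_iris().data  # all four iris measurements

# SSE (inertia) for k = 1..9
sse_by_k = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).inertia_
    for k in range(1, 10)
}

# Percentage drop in SSE when going from k-1 to k clusters
drops = {
    k: 100 * (sse_by_k[k - 1] - sse_by_k[k]) / sse_by_k[k - 1]
    for k in range(2, 10)
}

for k, drop in drops.items():
    print(f"{k - 1} -> {k} clusters: SSE falls by {drop:.1f}%")
```

Once the drops become small and similar, extra clusters add little, and domain knowledge can settle the final choice.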
How does the clustering algorithm work? I am not going to go into the inner workings and the different parameters of the algorithm, but the major underlying factor in clustering is a distance measure such as Euclidean or cosine distance (there are several more distance measures). You can think of the data points as plotted in an N-dimensional space, where the distance measure is used to find the neighboring points. In some clustering algorithms the distance measure can be given as a hyperparameter too; scikit-learn's KMeans, used above, always works with Euclidean distance.
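To make the role of the distance measure concrete, here is a small NumPy sketch of the two steps k-means repeats: assign each point to its nearest centroid by Euclidean distance, then move each centroid to the mean of its points. It is a simplified illustration, not scikit-learn's implementation. The function name `simple_kmeans`, the toy points and the fixed starting centroids are all made up for this example.

```python
import numpy as np

def simple_kmeans(points, centroids, n_iter=10):
    """Hypothetical helper: plain k-means with Euclidean distance."""
    centroids = centroids.astype(float)  # astype returns a copy
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # index of the nearest centroid
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:  # leave a centroid in place if it has no points
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data: two obvious groups in 2-D
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
start = np.array([[0.0, 0.0], [10.0, 10.0]])  # arbitrary starting centroids

labels, centroids = simple_kmeans(points, start)
print(labels)
print(centroids)
```

Real implementations add smarter initialization and a convergence check, which this sketch leaves out to keep the two steps visible.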
What should we do after we form the clusters?
We should separate the data by cluster, identify the behavior or common features of the data in each cluster, and label the clusters accordingly.
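One way to do this is to attach the cluster labels to a copy of the data and compare the average feature values per cluster. This is a minimal sketch, assuming the iris data as a stand-in and three clusters. The column name `cluster` and the `n_init=10`, `random_state=0` settings are choices made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)

# Keep the labels in a copy so the features used for fitting stay untouched
profiled = df.copy()
profiled["cluster"] = model.labels_

# Mean of every feature and the number of rows in each cluster
summary = profiled.groupby("cluster").mean()
summary["size"] = profiled.groupby("cluster").size()
print(summary.round(2))
```

A cluster whose averages stand out on one feature can then be given a descriptive name based on that trait.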
We can then apply separate supervised algorithms to each cluster, or, when a new data point comes in, add it to the group it belongs to.
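Assigning a new data point to an existing cluster can be sketched with `predict`, which returns the index of the nearest cluster centre. The measurements of the new flower below are invented for illustration, and the iris data stands in for a real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_train = load_iris().data
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Hypothetical new observation:
# sepal length, sepal width, petal length, petal width (cm)
new_point = np.array([[5.0, 3.4, 1.5, 0.2]])

cluster_id = model.predict(new_point)[0]
print("New point belongs to cluster", cluster_id)

# Check: predict picks the centre with the smallest Euclidean distance
nearest = np.linalg.norm(model.cluster_centers_ - new_point, axis=1).argmin()
print("Nearest centre:", nearest)
```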
Conclusion:
Clustering can be used to identify hidden features in a dataset that we might not find just by looking at or visualizing the data.
It is always good practice to do unsupervised learning after we do EDA.
We can use clustering to form groups in the dataset.
There are a lot of clustering methods to explore. You can find them in this link.