K-Means Clustering: Unveiling Data Patterns and Image Compression

Paregi Aanchal
5 min read · Mar 31, 2024


In the realm of unsupervised learning, K-means clustering stands out as a fundamental algorithm for discerning patterns within datasets and compressing images. Let’s delve deeper into the implementation details of K-means clustering, from its core principles to real-world applications.

Understanding K-Means Clustering

At its core, K-means clustering seeks to partition a dataset into K distinct clusters based on the similarity of data points. The algorithm iteratively assigns each data point to the nearest cluster centroid and updates the centroids based on the mean of the assigned points. Through this process, clusters emerge, minimizing the within-cluster sum of squared distances.
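Formally, if c⁽ⁱ⁾ denotes the cluster index assigned to example x⁽ⁱ⁾ and μₖ denotes the k-th centroid, the algorithm tries to minimize the distortion cost, written here in plain notation (some texts omit the 1/m factor):

J = (1/m) Σᵢ ‖ x⁽ⁱ⁾ − μ_{c⁽ⁱ⁾} ‖²

Both the assignment step and the update step can only decrease this cost (or leave it unchanged), which is why the iterations converge.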

Implementation in Python

Implementing K-means clustering in Python involves two key steps: finding the closest centroids and computing centroid means.

1. `find_closest_centroids` Function:

This function is responsible for assigning each data point to the nearest centroid. Here’s a breakdown of its implementation:

Python code:
import numpy as np

def find_closest_centroids(X, centroids):
    # Compute the Euclidean distance between every data point and every centroid
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
    # Assign each data point to the index of its nearest centroid
    return np.argmin(distances, axis=1)

- Input Parameters:
  - `X`: The input data matrix of shape (m, n), where m is the number of data points and n is the number of features.
  - `centroids`: The current centroids of shape (K, n), where K is the number of clusters.

- Distance Calculation:
  - `np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)`: This line calculates the Euclidean distance between each data point in X and each centroid. Inserting a new axis turns X into an (m, 1, n) array, so NumPy broadcasting computes all m × K distances in a single vectorized operation.

- Assigning Data Points:
  - `np.argmin(distances, axis=1)`: This line finds the index of the centroid with the minimum distance for each data point and returns an array of those indices, one per data point. A quick toy example follows.
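
As a quick sanity check, here is a minimal example of calling the function on a toy dataset (the values are made up purely for illustration):

Python code:
import numpy as np

# Toy dataset: five 2-D points and two candidate centroids (illustrative values)
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [1.2, 0.8]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

idx = find_closest_centroids(X, centroids)
print(idx)  # [0 0 1 1 0]: the first, second, and last points sit nearest centroid 0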

2. `compute_centroids` Function:

This function updates the centroids based on the mean of the data points assigned to each centroid. Let’s explore its implementation:

Python code:
def compute_centroids(X, idx, K):
    # Initialize a (K, n) array to hold the new centroid positions
    centroids = np.zeros((K, X.shape[1]))
    for k in range(K):
        # New centroid = mean of all points currently assigned to cluster k
        centroids[k] = np.mean(X[idx == k], axis=0)
    return centroids

- Input Parameters:
  - `X`: The input data matrix, same as in the `find_closest_centroids` function.
  - `idx`: An array containing the index of the closest centroid for each data point; its length equals the number of data points.
  - `K`: The number of clusters.

- Updating Centroids:
  - `centroids[k] = np.mean(X[idx == k], axis=0)`: This line computes the mean of all data points assigned to the k-th centroid. It selects those points with boolean indexing (`idx == k`), takes the mean along each feature dimension (`axis=0`), and stores the result as the new k-th centroid. The toy example continues below.
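
Continuing the toy example from above, the updated centroids are simply the per-cluster means:

Python code:
new_centroids = compute_centroids(X, idx, K=2)
print(new_centroids)
# Approximately [[1.23, 1.27], [8.5, 8.75]]: each row is the mean of its cluster's points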

These functions are essential building blocks of the K-means clustering algorithm, enabling the iterative assignment of data points to centroids and the subsequent update of centroid positions. Understanding their implementation is key to comprehending the inner workings of K-means clustering and its applications.
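
To tie the two functions together, here is a minimal sketch of the full iterative loop (the name `run_kmeans` and its parameters are my own illustration, not from the original code). Note that this sketch does not guard against empty clusters, for which `np.mean` over an empty selection yields NaN; production code would reinitialize such centroids.

Python code:
def run_kmeans(X, initial_centroids, max_iters=10):
    # Alternate between the assignment step and the update step
    centroids = initial_centroids
    for _ in range(max_iters):
        idx = find_closest_centroids(X, centroids)                  # assignment step
        centroids = compute_centroids(X, idx, centroids.shape[0])   # update step
    return centroids, idx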

Exploration with Sample Data

To grasp K-means clustering’s mechanics, we can experiment with a sample dataset. Visualizing the clustering process reveals how centroids converge toward cluster centers, providing insights into the algorithm’s behavior.
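
As a sketch of such an experiment (the synthetic blobs and the explicit initial centroids below are assumptions for illustration, not the article's dataset):

Python code:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Three synthetic 2-D blobs of 100 points each (illustrative sample data)
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

initial_centroids = np.array([[1.0, 1.0], [4.0, 4.0], [1.0, 4.0]])
centroids, idx = run_kmeans(X, initial_centroids, max_iters=10)

# Color each point by its cluster and mark the final centroids
plt.scatter(X[:, 0], X[:, 1], c=idx, s=10)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', c='red', s=100)
plt.title('K-means clusters and final centroids')
plt.show()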

Random Initialization

Random initialization of centroids plays a pivotal role in K-means clustering. By selecting the initial centroids from among the data points themselves, the algorithm can explore various cluster configurations. However, because the final clusters depend on where the search starts, choosing a sensible initialization method is crucial for achieving good results.
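
A common way to select initial centroids from the dataset is to take K distinct training examples at random; here is a minimal sketch (the function name `kmeans_init_centroids` is my own):

Python code:
def kmeans_init_centroids(X, K):
    # Shuffle the row indices and take the first K examples as initial centroids
    rand_indices = np.random.permutation(X.shape[0])
    return X[rand_indices[:K]]

# Each call can yield a different clustering, e.g.:
# centroids, idx = run_kmeans(X, kmeans_init_centroids(X, K=3), max_iters=10)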

You can run K-means again with new random initial centroids to observe how different clusters emerge from different starting points. By setting the number of centroids and the maximum number of iterations, you control the clustering process. Because K-means is sensitive to initial centroid placement, each run may converge to a different cluster formation. Running the algorithm several times with different initializations (and, commonly, keeping the run with the lowest cost) makes the outcome more robust and reveals how stable the clustering is for your dataset. The attached screenshots show the clusters formed in different runs.

Different clusters created based on the initial points

Image Compression: A Practical Application

One fascinating application of K-means clustering is image compression. By reducing the number of colors in an image to a small set of centroid colors, we can represent the image far more compactly: instead of storing a full 24-bit color per pixel, we store one small centroid index per pixel plus a short color table. This significantly reduces file size while preserving visual quality, making the technique useful in storage-constrained environments. The attached screenshots depict the compression process with K-means, illustrating the color reduction and the resulting compressed image, and highlight how well K-means preserves essential image features while reducing color complexity.
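
As a rough sketch of how this looks in code, reusing the functions defined above (the file name `image.png` and the choice of K = 16 are placeholders):

Python code:
import numpy as np
import matplotlib.pyplot as plt

# Load an RGB image; plt.imread returns values in [0, 1] for PNG files
img = plt.imread('image.png')[:, :, :3]
X_img = img.reshape(-1, 3)  # every pixel becomes a 3-D point in RGB space

K = 16  # compress the palette down to 16 colors
centroids, idx = run_kmeans(X_img, kmeans_init_centroids(X_img, K), max_iters=10)

# Replace each pixel with its nearest centroid color and restore the image shape
compressed = centroids[idx].reshape(img.shape)
plt.imshow(compressed)
plt.show()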

K-means clustering offers a versatile approach to uncovering data patterns and compressing images. Whether you’re analyzing customer segments or optimizing storage resources, K-means clustering provides valuable insights and practical solutions. By delving into its principles and detailed implementation, you can leverage K-means clustering to unlock new possibilities in data analysis and image processing.

Embark on your journey with K-means clustering today. Explore the provided code examples, experiment with different datasets, and discover the power of unsupervised learning. Happy clustering!

The full code for this K-means implementation is available in the GitHub repository linked below.

https://github.com/aanchalparegi/K-means-algorithm-implementatiom
