Customer Segmentation in Python using K-means

Suhaib Ali Kamal
Published in Nerd For Tech
4 min read · Apr 9, 2021

Analytics is increasingly used to guide business decisions in a variety of domains. Businesses now have more data than ever on their consumers and the market, so it has become important for them to leverage that data to assist in their decision making.

Marketing is one domain where data analysis is used to provide actionable insights to marketing managers. One of the key tasks of any brand manager is to identify customer segments. A customer segment is a group within the population that shares characteristics such as age, education, preferences and attitudes. Segments can help marketing managers formulate strategies to target customers and hence increase sales. In this example we are going to conduct cluster analysis on a sample of mall customers.

The dataset that we are going to use is a list of mall customers for which we have their gender, age, annual income and spending score. After reading the file into a DataFrame (called cust in the code that follows), it is important to do some exploratory data analysis to get a feel for the data. For this we will use the matplotlib and seaborn libraries.

import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x=cust["Age"], y=cust["Spending Score (1-100)"], hue=cust["Gender"])  # points coloured by gender
plt.title("Spending Score vs Age")
Scatter plot of spending score/age

One of the first plots we can look at is spending score vs age, where we can clearly see that people between the ages of 20 and 40 have higher spending scores than the rest of the sample. There is some correlation between age and the spending score. For further analysis, we can also look at the correlation matrix.
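A correlation matrix of the numeric columns can be computed with pandas and, if desired, drawn as a heatmap. The sketch below is only an illustration: it uses a small made-up frame called `demo` in place of the mall data, the column names are assumptions rather than the names in the actual file, and the spending values are deliberately generated to fall with age.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the mall data: synthetic values, assumed column names.
rng = np.random.default_rng(0)
age = rng.integers(18, 70, size=200)
demo = pd.DataFrame({
    "Age": age,
    "Annual Income": rng.normal(60, 25, size=200),
    # Spending is built to fall with age so that a correlation is visible.
    "Spending Score": 90 - 0.8 * age + rng.normal(0, 10, size=200),
})

# Pairwise Pearson correlations between the numeric columns.
corr = demo.corr()
print(corr.round(2))

# With seaborn installed, the same matrix can be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="Greens")
```

On the real data, only the numeric columns should be passed to `corr()`, since a text column such as gender cannot be correlated directly.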

Correlation plot

As can be seen, age and spending score are more strongly correlated than any other pair of variables. These plots can help us identify relationships and patterns in our data. Now let's move on to writing the code for cluster analysis.

The clustering algorithm that we are going to use is K-means, which is a very popular algorithm. It chooses k centroids and then allocates every point in the data to the cluster with the closest centroid. K-means is an iterative procedure: the centroids are recalculated after every allocation step until the clusters stop improving or the maximum number of iterations has been reached.
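To make the iteration concrete, here is a minimal NumPy sketch of the assign-then-recalculate loop described above. It is not the scikit-learn implementation used later in this article: the function name `kmeans_sketch`, the random initialisation and the synthetic two-blob data are all assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain K-means loop: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Allocation step: each point goes to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (a centroid with no points is left where it is).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs, used only for illustration.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans_sketch(points, k=2)
print(centroids.round(2))
```

In practice a library implementation is preferable, since it adds smarter initialisation and repeated restarts, but the two steps inside the loop are the same idea.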

Before moving on to the cluster analysis it is important to scale the variables, for which I have written the following code.

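import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
cust_transform = pd.DataFrame(index=cust.index)  # holds the scaled columns
# The names on the right must match the columns in cust, e.g. "Spending Score (1-100)" as used in the scatter plot.
cust_transform["Age"] = scaler.fit_transform(cust["Age"].values.reshape(-1, 1)).ravel()
cust_transform["Annual Income"] = scaler.fit_transform(cust["Annual Income"].values.reshape(-1, 1)).ravel()
cust_transform["Spending Score"] = scaler.fit_transform(cust["Spending Score (1-100)"].values.reshape(-1, 1)).ravel()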
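Scaling matters because K-means works with distances, so a variable measured in large units would otherwise dominate the result. The short sketch below, on made-up numbers, shows what StandardScaler does to each column: it subtracts the column mean and divides by the column standard deviation. The array `raw` and its values are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales: age in years, income in dollars.
raw = np.array([
    [23.0, 15000.0],
    [35.0, 54000.0],
    [47.0, 78000.0],
    [61.0, 120000.0],
])

scaled = StandardScaler().fit_transform(raw)

# StandardScaler standardises each column independently: (value - mean) / std.
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled.mean(axis=0).round(6))  # each column is centred on zero
print(scaled.std(axis=0).round(6))   # and has unit standard deviation
```

Because the scaler treats every column separately, passing several columns in one call gives the same result as scaling them one at a time.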

One of the biggest decisions in a k-means cluster analysis is the number of clusters. For this data we are going to use the silhouette coefficient, which is calculated by the following formula:

(x - y) / max(x, y)

where x is the mean distance from a point to the points in the nearest other cluster and y is the mean distance from that point to the other points in its own cluster. The score ranges from -1 to +1, and the higher the score, the better separated the clusters are. We are going to use the scikit-learn library to calculate the silhouette score for a range of cluster counts.
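As a quick check of the formula, the sketch below works out the coefficient by hand for one point of a tiny toy dataset and compares it with scikit-learn's `silhouette_samples`. The four data points and their labels are invented for illustration; `x` and `y` follow the notation of the formula above.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data: two tight clusters on a line, with hand-assigned labels.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# Silhouette for the first point (value 0.0), computed from the formula.
y = np.abs(X[0] - X[1]).item()               # mean intra-cluster distance (one neighbour)
x = np.mean(np.abs(X[0] - X[labels == 1]))   # mean distance to the nearest other cluster
manual = (x - y) / max(x, y)

per_point = silhouette_samples(X, labels)
overall = silhouette_score(X, labels)  # mean of the per-point values
print(manual, per_point[0], overall)
```

For this point, x = 10.5 and y = 1, so the coefficient is 9.5 / 10.5, roughly 0.905. The loop that follows applies the same score, averaged over all customers, to each candidate number of clusters.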

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
silhouette_coefficients = []
# Try every cluster count from 2 to 11 and record the silhouette score of each fit.
for i in range(2, 12):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(cust_transform)
    score = silhouette_score(cust_transform, kmeans.labels_)
    silhouette_coefficients.append(score)

sns.lineplot(x=list(range(2, 12)), y=silhouette_coefficients, color="green")
Optimal number of clusters : 6

As can be seen from the above, the optimal number of clusters is 6, so we segregate the data into 6 clusters. After clustering, I am going to visualise the clusters and try to identify any patterns or insights that can help us in decision making.

From the above diagram we can see that Cluster 0 consists of people who have a higher annual income but a lower spending score. Cluster 2 has a high annual income and also a high spending score. This is the cluster that we can target for expensive purchases. We can further analyse our clusters along different dimensions.

Here the clusters are visualised along spending score and age. From the previous visualisation we know that Cluster 2 consists of high-income people with a high spending score. From this diagram we can also see that they are between 20 and 40 years old. Furthermore, Cluster 4 also has a higher spending score, but its members' annual income is lower.
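Cluster descriptions like these can be cross-checked numerically by averaging each variable within every cluster. The sketch below shows one way to do that. It is an illustration under stated assumptions: the frame `demo` is synthetic, built from three made-up customer groups rather than the mall data, so it is fitted with 3 clusters instead of 6, and the column names are assumed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the mall data: three made-up customer groups.
rng = np.random.default_rng(0)
groups = [
    # (mean age, mean income, mean spending score)
    (30, 90, 80),   # younger, high income, high spending
    (45, 90, 20),   # high income, low spending
    (25, 25, 75),   # younger, low income, high spending
]
demo = pd.DataFrame(
    np.vstack([rng.normal(loc=g, scale=3.0, size=(40, 3)) for g in groups]),
    columns=["Age", "Annual Income", "Spending Score"],
)

# Scale, fit, and attach the cluster label to the unscaled data.
scaled = StandardScaler().fit_transform(demo)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
demo["Cluster"] = model.labels_

# Per-cluster means in the original units make the segments easy to describe.
profile = demo.groupby("Cluster").mean().round(1)
print(profile)
```

Reading the profile row by row gives the same kind of statement as the plots, for example a cluster with high mean income and high mean spending score. Note that cluster numbers are arbitrary and can change between runs unless `random_state` is fixed.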

Cluster analysis has a lot of applications in business decisions, especially in marketing. Here we used cluster analysis to create market segments, which can help marketing managers identify the groups or segments they can target to increase sales.
