Hello Data Experts,
Before I explain what Cluster Analysis (CA) is, think about how each of us, as individuals, helps perform Cluster Analysis. Let me show how we contribute to CA for the Retail or Hospitality industry. We are all charmed by reward points, and why not, if they earn us additional benefits over time for staying in a hotel or doing our shopping. Both retailers and hoteliers have built mechanisms to collect structured data that benefits their business. They analyze patterns and trends to decipher the demographics and behavioral styles of individuals or groups. Based on these interpretations, customized promotions are launched to increase their bottom line. Coming back to the question: we contribute to cluster analysis by unknowingly allowing business entities to collect data (data lakes) with every action associated with reward points. While reading this article, keep correlating the details with the reward-points analogy.
What is Cluster Analysis?
Cluster in simple English means a group, so Cluster Analysis is about executing a structured algorithm to group like-minded objects with similar properties. Objects with similar properties can form different types of groups. Grouping cannot be generalized; a generic grouping is not an acceptable outcome. Transaction data plays an important role in defining the clusters, but the choice of variables and the number of observations are key to making them relevant to the domain. The more we explore, the higher the similarity we will observe within each cluster, which makes this very much an exploratory approach. Clustering is an unsupervised learning approach where Y (the output) is unknown.
Why do we need Cluster Analysis?
Clustering helps label data having the same attributes for future actions, like promotions by retailers or hoteliers. If retailers are unaware of who their segmented customers are, they might not succeed in launching the right commercial. Cluster Analysis algorithms are based on proximity rather than correlations. The outcome of CA is classified data, which is much more manageable than a dataset containing only raw transactions.
Let us wear a statistician's hat now and change gears to better understand how, as Data Scientists, we should proceed with Cluster Analysis.
Few key points to keep in mind while working on Cluster Analysis:
- There are different techniques or approaches for clustering:
- Hierarchical: Dendrogram clustering
- Non-Hierarchical: k-means clustering
- Hybrid: a mix of Hierarchical and Non-Hierarchical clustering
- The number of clusters can be defined upfront in k-means, whereas the optimal number of clusters emerges from the Dendrogram.
- Both techniques compute clusters using standardized data, i.e., z-scores with mean 0 and standard deviation 1. Please refer to my blog on the Z distribution to know more about the z-score.
- Graphical representation of clustering helps visualize similar and dissimilar groups. A Dendrogram depicts a tree structure, whereas k-means depicts a cluster structure.
- Outliers are easy to identify: they form a single-record cluster, or a very small cluster in the case of a set of outliers.
- Distance is the measure that defines the proximity of objects; the smaller the distance, the higher the similarity between objects. Distance can be measured using:
- Euclidean distance (we will use this in this session)
- Manhattan distance
- Mahalanobis distance
- Distance is the key measure; however, which distance to calculate is defined by the link between two points, called linkage. Which two points to pick for the distance calculation is decided by the linkage algorithm, i.e., the type of linkage:
- Single Linkage – nearest neighbor
- Complete Linkage – farthest neighbor
- Average Linkage – average over all pairs of data points
- Centroid Linkage – centers of the clusters
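To see how the linkage choice changes the computed cluster distances, here is a small sketch using base R's hclust on a toy matrix (the data values and object names are made up for illustration):

```r
# Toy data: six made-up points in two dimensions, forming three pairs
toy <- matrix(c(1, 1,
                1.2, 1.1,
                5, 5,
                5.1, 4.9,
                9, 1,
                9.2, 1.2),
              ncol = 2, byrow = TRUE)

d <- dist(toy)  # Euclidean distances by default

# Same distance matrix, different linkage rules:
hc_single   <- hclust(d, method = "single")    # nearest neighbor
hc_complete <- hclust(d, method = "complete")  # farthest neighbor
hc_average  <- hclust(d, method = "average")   # average over all pairs
hc_centroid <- hclust(d, method = "centroid")  # centers of clusters

# Merge heights differ because each linkage measures the distance
# between clusters differently:
round(hc_single$height, 2)
round(hc_complete$height, 2)
```

The points merge in the same general order here, but the heights at which clusters join change with the linkage, which is exactly what reshapes the dendrogram.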
Hierarchical clustering (H-Cluster):
- The Dendrogram is the technique for Hierarchical clustering, where the outcome is reflected as a hierarchy.
- This technique is typically used for small-to-medium populations, anything less than 100 for easy visualization; however, it can run on sizes of 1,000 as well without constraints.
- There are 2 approaches to compute an H-Cluster: either top to bottom (the full cluster breaks down into records or smaller clusters, i.e., a 1-to-n approach) or bottom to top (records group together to form a cluster, i.e., an n-to-1 approach). These are called the Divisive and Agglomerative approaches respectively.
- Quick and easy to apply; the flip side is the negative impact of outliers, so such observations might have to be removed from the dataset.
Non-Hierarchical clustering (k-means):
- k-means is the technique for non-hierarchical clustering, where the visualization is more in the form of clusters.
- Less influenced by outliers; it will still club such observations into one of the clusters.
- Used for large datasets; it works by fitting the model against a predefined number of clusters.
- Requires iterations to get the optimal number of clusters; the seeding point can be defined using a scree plot, leveraging the elbow point. A scree plot is a graphical representation of the within-cluster variance as the cluster count grows. It shows a steep section, an elbow, and a flat section; pick the cluster count at the elbow to get the optimal number of clusters.
- This approach helps refine the clustering over time; an optimal size can be reached by incrementing the cluster count by 1 from the starting seed point.
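The elbow idea above can be sketched in a few lines of base R; the simulated matrix and variable names here are my own, invented for illustration:

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Made-up data: two well-separated groups of 25 points each
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# Scree plot: look for the "elbow" where the curve flattens
plot(1:6, wss, type = "b",
     xlab = "Number of clusters (k)", ylab = "Within-cluster SS")
```

The curve drops steeply up to the true group count and then flattens; the bend is the elbow point the bullet above refers to.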
Time for us to get to R programming and create clusters using the Hierarchical and Non-Hierarchical approaches. Let me first pick up the Hierarchical approach using a Dendrogram.
Let us take an example from the Retail industry. Nowadays we all shop online, irrespective of the tier of the city we live in and the time of day, hence it will be easy for us to group users to understand patterns. E-retailer LALARA is in the business of selling mobile accessories. They have a loyalty program associated with their site, called MoRew. Over the last 1 week, 500 customers shopped, and the information below was collected from those transactions; it is in the Excel file “UP.xlsx”.
| # | Username | Total Purchase | City | Purchase Mode | # of Items | Time of Purchase |
|---|----------|----------------|------|---------------|------------|------------------|
Based on the above table, assume there are 500 transactions for the last 1 week. The objective is to cluster users' buying patterns so that promotions can be launched for the right audience.
Let us use the Dendrogram approach for this; CDS.csv has all the transaction details listed in the above table.
Step: 1, Load csv file having all these data points
CDS <- read.csv("<FilePath>/CDS.csv")
Step: 2, Execute the plot command to understand at a high level how the points are scattered. plot(CDS)
Step: 3, As we can notice, “Total Purchase” and “# of Items” differ greatly in scale, so that the analysis does not get skewed by the high values of Total Purchase, let us first standardize this data. Standardizing data means normalizing it; the formula to standardize a data point is Z = (X − Mean)/Standard Deviation, i.e., Z = (x − μ)/σ. To get normalized data, we could first calculate the mean and standard deviation and then apply the formula to convert each value; however, in R we can apply the scale command to get it done in a single run.
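To see that scale() is just the z-score formula applied column by column, here is a quick check on made-up numbers:

```r
x <- c(10, 20, 30, 40, 50)  # made-up purchase amounts

# Manual standardization: Z = (X - mean) / standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() does the same in one call (it returns a matrix)
z_scaled <- as.numeric(scale(x))

all.equal(z_manual, z_scaled)  # TRUE: both give mean 0, sd 1
```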
Execute below command
SCDS <- scale(CDS)
Step: 4, Once the data is standardized, let us calculate the distance between each pair of points. By default, R calculates Euclidean distance when we use the “dist” function. This function computes and returns the distance matrix obtained by applying the specified distance measure to the rows of a data matrix.
SCDSdistance <- dist(SCDS)
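To confirm what dist() returns, here is the Euclidean formula checked by hand on two made-up points:

```r
# Two made-up points in 2-D
p <- rbind(c(0, 0), c(3, 4))

# Euclidean distance: sqrt((3-0)^2 + (4-0)^2) = 5
dist(p)                          # lower triangle of the distance matrix: 5

# The same value computed manually from the formula
sqrt(sum((p[1, ] - p[2, ])^2))   # 5
```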
Step: 5, Once we have the distance between each pair of transaction points, we are ready to draw the Hierarchical Cluster Dendrogram. We will use the “hclust” command to build the clustering. By default, Complete linkage is used for the calculations.
HCLUSTSCDS <- hclust(SCDSdistance)
Once we have the hclust output, it is time for us to plot the Dendrogram.
# To get the raw dendrogram, execute the plot command: plot(HCLUSTSCDS)
# To draw a symmetrical dendrogram, add hang = -1 to the command: plot(HCLUSTSCDS, hang = -1)
Since the Dendrogram plot will be huge for retail data, shown below is a sample dendrogram.
It is always good to execute Hierarchical clustering with various linkage approaches. By default, it works with Complete linkage; however, if we need to change it to Average linkage, add the attribute method = "average". The Dendrograms show how the clustering changes: Complete linkage above and Average linkage below.
Visually it is difficult to understand the clusters, so R allows us to draw borders based on the number of clusters we would like to see. If we want to look at 3 clusters, execute the command below. Ideally, we should first determine the optimal cluster count and then execute this command.
rect.hclust(HCLUSTSCDS, k = 3, border = "red")
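Besides drawing borders, we usually also want the actual cluster label for each row so the data can be tagged for promotions; base R's cutree() cuts the tree at k clusters. A self-contained toy sketch (the matrix and object names are made up for illustration):

```r
# Made-up data: three obvious pairs of points
toy <- matrix(c(1, 1, 1.2, 1.1, 5, 5, 5.1, 4.9, 9, 1, 9.2, 1.2),
              ncol = 2, byrow = TRUE)
hc <- hclust(dist(scale(toy)))

# Cut the tree into k = 3 clusters; returns one label per row
groups <- cutree(hc, k = 3)
groups

# Labels can be attached back to the data for targeting
cbind(toy, Cluster = groups)
```

In the blog's flow, the same call on HCLUSTSCDS would label each of the 500 transactions.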
Let us draw a scree plot to understand the optimal cluster count:
wss <- (nrow(SCDS) - 1) * sum(apply(SCDS, 2, var))
for (i in 2:5) wss[i] <- sum(kmeans(SCDS, centers = i)$withinss)
plot(1:5, wss, type = "b")
After drawing the scree plot, identify the elbow point to get the optimal cluster count.
Let us use the k-means approach on the same dataset, CDS.csv, which has all the transaction details.
Step: 1, Load csv file having all these data points
CDS <- read.csv("<FilePath>/CDS.csv")
Step: 2, Execute the plot command to understand at a high level how the points are scattered. plot(CDS)
Step: 3, Let us run the kmeans command to create the k-means clusters, where 4 reflects the number of clusters we would like to have; the cluster count itself is identified from the scree plot.
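Putting the k-means steps together on simulated data (the simulated matrix and the object names km, sx are my own; in the blog's flow you would pass scale(CDS) instead):

```r
set.seed(1)  # k-means starts from random centers, so fix the seed

# Made-up data: two groups of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

sx <- scale(x)                        # standardize, as in Step 3 earlier
km <- kmeans(sx, centers = 4, nstart = 10)

km$size                               # observations per cluster
km$centers                            # cluster centers (standardized units)

# Color the scatter plot by cluster membership
plot(x, col = km$cluster)
```

nstart = 10 reruns the algorithm from 10 random seedings and keeps the best fit, which is the iterative refinement mentioned in the bullets above.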
Let me conclude this session here; I hope this blog helped you build a working understanding of cluster analysis.