Simulated data set

  1. In this problem, you will generate simulated data and then perform K-means clustering on the data.

     1.1 Generate a simulated data set with 30 observations in each of two classes (i.e. 60 observations in total) and 2 variables.

     Code hint: the first four lines of code should be:

     set.seed(2)
     x=matrix(rnorm(60*2), ncol=2)
     x[1:30,1]=x[1:30,1]+3
     x[1:30,2]=x[1:30,2]-4

     1.2 Perform K-means clustering of the observations with K = 2 (nstart = 20). Plot the data with each observation colored according to its cluster assignment. Take a screenshot of your plot. What is the total within-cluster sum of squares?

     1.3 Perform K-means clustering with K = 3 (nstart = 20). Plot the data with each observation colored according to its cluster assignment. Take a screenshot of your plot. What is the total within-cluster sum of squares?

     1.4 Now perform K-means clustering with K = 4 (nstart = 20). Plot the data with each observation colored according to its cluster assignment. Take a screenshot of your plot. What is the total within-cluster sum of squares?

     1.5 Using the scale() function, perform K-means clustering with K = 2 on the data after scaling each variable to have standard deviation one. Take a screenshot of your plot. What is the total within-cluster sum of squares now? How do these results compare to those obtained in 1.2?

  2. Consider the USArrests data. We will now perform hierarchical clustering on the states. The USArrests dataset is part of the base R package, so you do not need to load any libraries.

     2.1 Plot the hierarchical clustering dendrogram using complete linkage clustering with Euclidean distance as the dissimilarity measure. Take a screenshot of your plot.

     2.2 Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters? You need to provide state names for each cluster (e.g. Cluster 1 has Alabama, Alaska, ...).

     2.3 Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one. a) Take a screenshot of your plot. b) What effect does scaling the variables have on the hierarchical clustering obtained? c) In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

     2.4 After scaling the variables to have standard deviation one, plot the hierarchical clustering dendrogram using average linkage clustering with Euclidean distance as the dissimilarity measure. Take a screenshot of your plot.

     2.5 After scaling the variables to have standard deviation one, plot the hierarchical clustering dendrogram using single linkage clustering with Euclidean distance as the dissimilarity measure. Take a screenshot of your plot.

Sample Solution

   

1.1 Generate Simulated Data:

Code snippet
set.seed(2)
x <- matrix(rnorm(60*2), ncol = 2)
x[1:30, 1] <- x[1:30, 1] + 3  # Shift first class on variable 1
x[1:30, 2] <- x[1:30, 2] - 4  # Shift first class on variable 2
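
Before clustering, it can help to look at the data colored by the true class labels. This is an optional check rather than part of the required answer, and true.class is just an illustrative helper variable:

Code snippet
true.class <- c(rep(1, 30), rep(2, 30))  # first 30 rows are class 1, last 30 are class 2
plot(x[, 1], x[, 2], col = true.class, pch = 19,
     xlab = "Variable 1", ylab = "Variable 2",
     main = "Simulated Data by True Class")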

1.2 K-means with K=2:

Code snippet
kmeans.fit <- kmeans(x, centers = 2, nstart = 20)
cluster <- kmeans.fit$cluster
 

plot(x[, 1], x[, 2], col = cluster, pch = 19,
     xlab = "Variable 1", ylab = "Variable 2",
     main = "K-means Clustering (K=2)")

# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)
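
The same value is also stored directly in the fitted object as tot.withinss, so the following line should print an identical number:

Code snippet
kmeans.fit$tot.withinss  # equal to sum(kmeans.fit$withinss)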

1.3 K-means with K=3:

Code snippet
kmeans.fit <- kmeans(x, centers = 3, nstart = 20)
cluster <- kmeans.fit$cluster

plot(x[, 1], x[, 2], col = cluster, pch = 19,
     xlab = "Variable 1", ylab = "Variable 2",
     main = "K-means Clustering (K=3)")

# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)

1.4 K-means with K=4:

Code snippet
kmeans.fit <- kmeans(x, centers = 4, nstart = 20)
cluster <- kmeans.fit$cluster

plot(x[, 1], x[, 2], col = cluster, pch = 19,
     xlab = "Variable 1", ylab = "Variable 2",
     main = "K-means Clustering (K=4)")

# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)

1.5 K-means with Scaling:

Code snippet
# Scale data
scaled.x <- scale(x)

kmeans.fit <- kmeans(scaled.x, centers = 2, nstart = 20)
cluster <- kmeans.fit$cluster

plot(scaled.x[, 1], scaled.x[, 2], col = cluster, pch = 19,
     xlab = "Variable 1 (scaled)", ylab = "Variable 2 (scaled)",
     main = "K-means Clustering (Scaled, K=2)")

# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)
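
Note that scaling has a real effect here even though both variables were drawn from rnorm() with standard deviation one: the class shifts (+3 on variable 1, -4 on variable 2) inflate each column's overall standard deviation above one, so scale() shrinks the separation between the classes, and by different amounts for the two variables. One optional way to see this is to compare the column standard deviations before and after scaling:

Code snippet
apply(x, 2, sd)         # larger than 1 because of the between-class shifts
apply(scaled.x, 2, sd)  # exactly 1 after scaling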

Results:

  • With set.seed(2) and nstart = 20, the results are reproducible; nstart = 20 runs the algorithm from 20 random starting assignments and keeps the best one, which protects against poor local optima.
  • The total within-cluster sum of squares decreases as K increases, since adding clusters can only reduce (or leave unchanged) the within-cluster variation. K = 2 matches the true structure of the data; the further drops at K = 3 and K = 4 come from splitting genuine clusters rather than from discovering new ones (see the sketch below).
  • Scaling changes the distances on which K-means operates, so the clustering solution can differ, and the within-cluster sum of squares on scaled data is on a different scale and not directly comparable to the unscaled value.
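
A quick way to check the second point is to refit K-means for each K on the same data and read off tot.withinss. This is a minimal sketch that reuses the x generated in 1.1:

Code snippet
set.seed(2)
for (k in 2:4) {
  fit <- kmeans(x, centers = k, nstart = 20)
  cat("K =", k, " total within-cluster SS =", fit$tot.withinss, "\n")
}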

Explanation:

The code generates two sets of 30 points each, separated by a shift on both variables. K-means with K = 2 recovers this two-class structure and separates the classes cleanly. Increasing K to 3 or 4 forces the algorithm to split one of the true clusters; the total within-cluster sum of squares still decreases as K grows, as it always does, but the additional clusters do not correspond to real structure in the data.

Scaling the data before clustering changes the relative distances between points. This can lead to a different clustering solution compared with the unscaled data, and the within-cluster sum of squares computed on scaled data is not directly comparable to the value obtained from the unscaled data.
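
The sample solution above stops at problem 1. For the hierarchical clustering tasks in problem 2, a minimal base-R sketch along the same lines might look like the following; hc.complete and sd.data are illustrative names, and the cluster memberships asked for in 2.2 should be read from the actual cutree() output rather than assumed:

Code snippet
# 2.1: complete linkage with Euclidean distance on the raw USArrests data
hc.complete <- hclust(dist(USArrests), method = "complete")
plot(hc.complete, main = "Complete Linkage, Unscaled")

# 2.2: cut into three clusters; the row names of USArrests are the state names
split(rownames(USArrests), cutree(hc.complete, k = 3))

# 2.3-2.5: scale each variable to standard deviation one, then repeat with
# complete, average, and single linkage
sd.data <- scale(USArrests)
plot(hclust(dist(sd.data), method = "complete"), main = "Complete Linkage, Scaled")
plot(hclust(dist(sd.data), method = "average"),  main = "Average Linkage, Scaled")
plot(hclust(dist(sd.data), method = "single"),   main = "Single Linkage, Scaled")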
