Simulated data set
Sample Solution
1.1 Generate Simulated Data:
set.seed(2) # For reproducibility of the simulated data
x <- matrix(rnorm(60 * 2), ncol = 2) # 60 observations on 2 variables
x[1:30, 1] <- x[1:30, 1] + 3 # Shift first class on variable 1
x[1:30, 2] <- x[1:30, 2] - 4 # Shift first class on variable 2
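As a quick sanity check (not part of the required solution), the simulated classes can be plotted with their known labels; the true.class vector below is introduced purely for illustration:
true.class <- rep(1:2, each = 30) # First 30 rows were shifted, last 30 were not
plot(x, col = true.class, pch = 19, xlab = "Variable 1", ylab = "Variable 2", main = "Simulated Data (True Classes)")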
1.2 K-means with K=2:
kmeans.fit <- kmeans(x, centers = 2, nstart = 20) # nstart = 20 keeps the best of 20 random starts
cluster <- kmeans.fit$cluster
plot(x[, 1], x[, 2], col = cluster, main = "K-means Clustering (K=2)")
points(kmeans.fit$centers, col = 1:2, pch = 8, cex = 2) # Mark the two cluster centers
# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)
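For K = 2, the actual K-means decision boundary is the perpendicular bisector of the segment joining the two cluster centers, since each point is assigned to its nearest center. A minimal sketch, assuming the two centers do not share the same value on variable 1:
ctr <- kmeans.fit$centers # 2 x 2 matrix: one row per cluster center
mid <- colMeans(ctr) # Midpoint between the two centers
seg.slope <- (ctr[2, 2] - ctr[1, 2]) / (ctr[2, 1] - ctr[1, 1]) # Slope of the segment joining the centers
abline(a = mid[2] + mid[1] / seg.slope, b = -1 / seg.slope, col = "blue", lty = 2) # Perpendicular bisector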
1.3 K-means with K=3:
kmeans.fit <- kmeans(x, centers = 3, nstart = 20)
cluster <- kmeans.fit$cluster
plot(x[, 1], x[, 2], col = cluster, main = "K-means Clustering (K=3)")
points(kmeans.fit$centers, col = 1:3, pch = 8, cex = 2) # Mark the three cluster centers
# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)
1.4 K-means with K=4:
kmeans.fit <- kmeans(x, centers = 4, nstart = 20)
cluster <- kmeans.fit$cluster
plot(x[, 1], x[, 2], col = cluster, main = "K-means Clustering (K=4)")
points(kmeans.fit$centers, col = 1:4, pch = 8, cex = 2) # Mark the four cluster centers
# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)
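Rather than comparing K = 2, 3, 4 by hand, the total within-cluster sum of squares can be traced over a range of K values; the sharp bend ("elbow") at K = 2 reflects the two simulated classes. A sketch, with the range 1:6 chosen arbitrarily:
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters K", ylab = "Total within-cluster sum of squares", main = "Elbow Plot")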
1.5 K-means with Scaling:
# Standardize each variable to mean 0 and standard deviation 1
scaled.x <- scale(x)
kmeans.fit <- kmeans(scaled.x, centers = 2, nstart = 20)
cluster <- kmeans.fit$cluster
plot(scaled.x[, 1], scaled.x[, 2], col = cluster, main = "K-means Clustering (Scaled, K=2)")
points(kmeans.fit$centers, col = 1:2, pch = 8, cex = 2) # Mark the two cluster centers
# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)
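Whether scaling actually changed the assignments can be checked by cross-tabulating the scaled solution against a fresh unscaled fit. This is a sketch; cluster labels are arbitrary, so agreement shows up as one dominant cell per row:
unscaled.cluster <- kmeans(x, centers = 2, nstart = 20)$cluster # Refit on the unscaled data
table(unscaled.cluster, cluster) # 'cluster' currently holds the scaled K = 2 assignments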
Results:
- With random initialization, kmeans can return slightly different solutions from run to run; nstart = 20 keeps the best of 20 starts, which makes the reported within-cluster sum of squares much more stable.
- The total within-cluster sum of squares always decreases as K increases, so K=3 and K=4 report lower values than K=2. The informative comparison is the size of the drop: the decrease from K=2 to K=3 or K=4 is small, which is what signals that two clusters fit the data (see the elbow sketch above).
- Scaling the data generally leads to a different clustering solution because K-means is distance-based, and standardizing each variable changes the relative distances between points. The scaled within-cluster sum of squares is also not directly comparable to the unscaled one, since the measurement units differ.
Explanation:
The code generates two classes of 30 points each, separated by a shift of +3 on variable 1 and -4 on variable 2. K-means clustering with K=2 recovers this structure almost perfectly. With more clusters (K=3 and K=4), the algorithm partitions the data further; the total within-cluster sum of squares continues to fall (as it must when K grows), but the additional clusters split the true groups and capture noise rather than structure, so the improvement is marginal.
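The claim that K=2 recovers the classes can be checked directly by cross-tabulating the assignments against the known labels. A sketch; since cluster numbering is arbitrary, perfect recovery shows up as all 30 points of each class landing in a single cell:
km2 <- kmeans(x, centers = 2, nstart = 20) # Refit on the unscaled data
table(km2$cluster, rep(1:2, each = 30)) # Rows: clusters; columns: true classes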
Scaling the data before clustering standardizes each variable to mean 0 and standard deviation 1, which changes the relative distances between points. This can lead to a different clustering solution compared to the unscaled data; the within-cluster sum of squares also changes units under scaling, so it should not be compared directly with the unscaled results.