Simulated data set

  1. In this problem, you will generate simulated data and then perform K-means clustering on the data.

     1.1 Generate a simulated data set with 30 observations in each of two classes (i.e. 60 observations in total) and 2 variables.

     Code hint: the first four lines of code should be:

     set.seed(2)
     x=matrix(rnorm(60*2), ncol=2)
     x[1:30,1]=x[1:30,1]+3
     x[1:30,2]=x[1:30,2]-4

     1.2 Perform K-means clustering of the observations with K = 2 (nstart = 20). Plot the data with each observation colored according to its cluster assignment. Take a screenshot of your plot. What is the total within-cluster sum of squares?

     1.3 Perform K-means clustering with K = 3 (nstart = 20). Plot the data with each observation colored according to its cluster assignment. Take a screenshot of your plot. What is the total within-cluster sum of squares?

     1.4 Now perform K-means clustering with K = 4 (nstart = 20). Plot the data with each observation colored according to its cluster assignment. Take a screenshot of your plot. What is the total within-cluster sum of squares?

     1.5 Using the scale() function, perform K-means clustering with K = 2 on the data after scaling each variable to have standard deviation one. Take a screenshot of your plot. What is the total within-cluster sum of squares now? How do these results compare to those obtained in 1.2?

  2. Consider the USArrests data. We will now perform hierarchical clustering on the states. The USArrests dataset is part of the base R package, so you do not need to load any libraries.

     2.1 Plot the hierarchical clustering dendrogram using complete linkage clustering with Euclidean distance as the dissimilarity measure. Take a screenshot of your plot.

     2.2 Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters? You need to provide state names for each cluster (e.g. Cluster 1 has Alabama, Alaska, ...).

     2.3 Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one. a) Take a screenshot of your plot. b) What effect does scaling the variables have on the hierarchical clustering obtained? c) In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

     2.4 After scaling the variables to have standard deviation one, plot the hierarchical clustering dendrogram using average linkage clustering with Euclidean distance as the dissimilarity measure. Take a screenshot of your plot.

     2.5 After scaling the variables to have standard deviation one, plot the hierarchical clustering dendrogram using single linkage clustering with Euclidean distance as the dissimilarity measure. Take a screenshot of your plot.

Sample Solution

   

1.1 Generate Simulated Data:

Code snippet
set.seed(2)
x <- matrix(rnorm(60*2), ncol = 2)
x[1:30, 1] <- x[1:30, 1] + 3  # Shift first class on variable 1
x[1:30, 2] <- x[1:30, 2] - 4  # Shift first class on variable 2
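
Before clustering, it can help to look at the data colored by the true class labels. This is an optional check rather than part of the required answer, and true.class is just an illustrative helper variable:

Code snippet
true.class <- c(rep(1, 30), rep(2, 30))  # first 30 rows are class 1, last 30 are class 2
plot(x[, 1], x[, 2], col = true.class, pch = 19,
     xlab = "Variable 1", ylab = "Variable 2",
     main = "Simulated Data by True Class")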

1.2 K-means with K=2:

Code snippet
kmeans.fit <- kmeans(x, centers = 2, nstart = 20)
cluster <- kmeans.fit$cluster
 

plot(x[, 1], x[, 2], col = cluster, pch = 19,
     xlab = "Variable 1", ylab = "Variable 2",
     main = "K-means Clustering (K=2)")

# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)
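
The same value is also stored directly in the fitted object as tot.withinss, so the following line should print an identical number:

Code snippet
kmeans.fit$tot.withinss  # equal to sum(kmeans.fit$withinss)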

1.3 K-means with K=3:

Code snippet
kmeans.fit <- kmeans(x, centers = 3, nstart = 20)
cluster <- kmeans.fit$cluster

plot(x[, 1], x[, 2], col = cluster, pch = 19,
     xlab = "Variable 1", ylab = "Variable 2",
     main = "K-means Clustering (K=3)")

# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)

1.4 K-means with K=4:

Code snippet
kmeans.fit <- kmeans(x, centers = 4, nstart = 20)
cluster <- kmeans.fit$cluster

plot(x[, 1], x[, 2], col = cluster, pch = 19,
     xlab = "Variable 1", ylab = "Variable 2",
     main = "K-means Clustering (K=4)")

# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)

1.5 K-means with Scaling:

Code snippet
# Scale data
scaled.x <- scale(x)

kmeans.fit <- kmeans(scaled.x, centers = 2, nstart = 20)
cluster <- kmeans.fit$cluster

plot(scaled.x[, 1], scaled.x[, 2], col = cluster, pch = 19,
     xlab = "Variable 1 (scaled)", ylab = "Variable 2 (scaled)",
     main = "K-means Clustering (Scaled, K=2)")

# Calculate total within-cluster sum of squares
sum(kmeans.fit$withinss)
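
Note that scaling has a real effect here even though both variables were drawn from rnorm() with standard deviation one: the class shifts (+3 on variable 1, -4 on variable 2) inflate each column's overall standard deviation above one, so scale() shrinks the separation between the classes, and by different amounts for the two variables. One optional way to see this is to compare the column standard deviations before and after scaling:

Code snippet
apply(x, 2, sd)         # larger than 1 because of the between-class shifts
apply(scaled.x, 2, sd)  # exactly 1 after scaling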

Results:

  • With set.seed(2) and nstart = 20, the results are reproducible; nstart = 20 runs the algorithm from 20 random starting assignments and keeps the best one, which protects against poor local optima.
  • The total within-cluster sum of squares decreases as K increases, since adding clusters can only reduce (or leave unchanged) the within-cluster variation. K = 2 matches the true structure of the data; the further drops at K = 3 and K = 4 come from splitting genuine clusters rather than from discovering new ones (see the sketch below).
  • Scaling changes the distances on which K-means operates, so the clustering solution can differ, and the within-cluster sum of squares on scaled data is on a different scale and not directly comparable to the unscaled value.
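
A quick way to check the second point is to refit K-means for each K on the same data and read off tot.withinss. This is a minimal sketch that reuses the x generated in 1.1:

Code snippet
set.seed(2)
for (k in 2:4) {
  fit <- kmeans(x, centers = k, nstart = 20)
  cat("K =", k, " total within-cluster SS =", fit$tot.withinss, "\n")
}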

Explanation:

The code generates two sets of 30 points each, separated by a shift on both variables. K-means with K = 2 recovers this two-class structure and separates the classes cleanly. Increasing K to 3 or 4 forces the algorithm to split one of the true clusters; the total within-cluster sum of squares still decreases as K grows, as it always does, but the additional clusters do not correspond to real structure in the data.

Scaling the data before clustering changes the relative distances between points. This can lead to a different clustering solution compared with the unscaled data, and the within-cluster sum of squares computed on scaled data is not directly comparable to the value obtained from the unscaled data.
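
The sample solution above stops at problem 1. For the hierarchical clustering tasks in problem 2, a minimal base-R sketch along the same lines might look like the following; hc.complete and sd.data are illustrative names, and the cluster memberships asked for in 2.2 should be read from the actual cutree() output rather than assumed:

Code snippet
# 2.1: complete linkage with Euclidean distance on the raw USArrests data
hc.complete <- hclust(dist(USArrests), method = "complete")
plot(hc.complete, main = "Complete Linkage, Unscaled")

# 2.2: cut into three clusters; the row names of USArrests are the state names
split(rownames(USArrests), cutree(hc.complete, k = 3))

# 2.3-2.5: scale each variable to standard deviation one, then repeat with
# complete, average, and single linkage
sd.data <- scale(USArrests)
plot(hclust(dist(sd.data), method = "complete"), main = "Complete Linkage, Scaled")
plot(hclust(dist(sd.data), method = "average"),  main = "Average Linkage, Scaled")
plot(hclust(dist(sd.data), method = "single"),   main = "Single Linkage, Scaled")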
