Collecting Data for a Group of Students in a Statistics Class

  1. Suppose we collect data for a group of students in a statistics class with variables X₁ = hours studied, X₂ = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients β̂₀ = -7, β̂₁ = 0.06, β̂₂ = 1. (You do not need R code to solve this question.)

(1) Estimate the probability that a student who studies for 50 hours and has an undergrad GPA of 3.5 gets an A in the class. (Hint: for logistic regression, p(x) = e^(β₀ + β₁X₁ + β₂X₂) / (1 + e^(β₀ + β₁X₁ + β₂X₂)).)

(2) How many hours would a student with GPA 3.4 need to study to have a 50% chance of getting an A in the class? (Hint: use the equation log(p(x) / (1 - p(x))) = β₀ + β₁X₁ + β₂X₂.)

  2. Questions (3) to (8) should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(3) Use require(ISLR) or library(ISLR) to load the ISLR package.
  a) Use the summary() function to produce numerical summaries of the Weekly data.
  b) Use the pairs() function to produce a scatterplot matrix of the variables in the data.
  c) Do you see a relationship between Year and Volume? What is the pairwise correlation between Year and Volume?
  d) Is the relationship positive or negative?

(4) Use the full data set to perform a logistic regression with Direction as the dependent variable and Lag1, Lag2, Lag3, Lag4, and Volume as independent variables (i.e., predictors). Use the summary() function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones? Take a screenshot of your outputs and then answer the questions.

(5) Based on (4)'s results, compute the confusion matrix and the overall fraction of correct predictions. (Hint: refer to the code from the Chapter 4 lab session in the textbook; use 0.5 as the predicted-probability cutoff for the classifier.) What is the precision rate? What is the recall rate? Take a screenshot of your output and then answer the questions.

(6) Now fit the logistic regression model using a training data period from 1990 to 2009, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held-out data (i.e., the test data from 2010). In addition, calculate the precision rate and recall rate. (Hint: refer to the code from the Chapter 4 lab session in the textbook; use 0.5 as the predicted-probability cutoff for the classifier.) Take a screenshot of your output and then answer the questions.

(7) Repeat (6) using KNN with K = 1. Compute the confusion matrix and the overall fraction of correct predictions for the held-out data. In addition, calculate the precision rate and recall rate. (Hint: refer to the code from the Chapter 4 lab session in the textbook; if you encounter errors such as "dims of 'test' and 'train' differ", try knn(data.frame(train.X), …).) Use set.seed(1).

(8) Repeat (6) using KNN with K = 10. Compute the confusion matrix and the overall fraction of correct predictions for the held-out data. In addition, calculate the precision rate and recall rate.

  3. The quantity p(X) / (1 - p(X)) is called the odds. Answer the following questions (you do not need R code to solve this question):

(9) On average, what fraction of people with odds of 0.35 of defaulting on their credit card payment will in fact default?

(10) Suppose that an individual has a 15% chance of defaulting on her credit card payment. What are the odds that she will default?

  4. The logistic regression model that results from predicting the probability of default from student status can be seen in the following table. We create a dummy variable that takes on a value of 1 for students and 0 for non-students. Answer the following questions (you do not need R code for these questions).

(11) How do you interpret the coefficient on Student[Yes]?

(12) For a non-student, what are the estimated odds? Is the probability of default less than the probability of not defaulting?

Sample Solution


1. Logistic Regression Calculations

(1) Probability of an A with 50 Hours Studied and 3.5 GPA

We can estimate the probability (p(x)) using the logistic regression formula:

p(x) = e^(β₀ + β₁X₁ + β₂X₂) / (1 + e^(β₀ + β₁X₁ + β₂X₂))

where:

  • β₀ = -7 (estimated coefficient)
  • β₁ = 0.06 (estimated coefficient)
  • β₂ = 1 (estimated coefficient)
  • X₁ = 50 (hours studied)
  • X₂ = 3.5 (undergrad GPA)


Plug in the values:

p(x) = e^(-7 + (0.06 * 50) + (1 * 3.5)) / (1 + e^(-7 + (0.06 * 50) + (1 * 3.5))) = e^(-0.5) / (1 + e^(-0.5)) ≈ 0.378

The student has an estimated probability of about 37.8% of getting an A — less than an even chance, despite studying 50 hours with a 3.5 GPA.
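The sample solution uses R, but this particular calculation is pure arithmetic, so it can be double-checked in any language. A minimal sketch in Python:

```python
import math

# Estimated coefficients from the fitted logistic regression
b0, b1, b2 = -7, 0.06, 1

# Student profile: 50 hours studied, 3.5 undergrad GPA
x1, x2 = 50, 3.5

# Linear predictor: -7 + 0.06*50 + 1*3.5 = -0.5
eta = b0 + b1 * x1 + b2 * x2

# Logistic transform: p = e^eta / (1 + e^eta)
p = math.exp(eta) / (1 + math.exp(eta))
print(round(p, 3))  # 0.378
```

Note that the linear predictor is negative (-0.5), which already tells us the probability must be below 0.5.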

(2) Hours Needed for 50% Chance of A with GPA 3.4

We can use the log-odds formula and solve for X₁ (hours studied):

log(p(x) / (1 - p(x))) = β₀ + β₁X₁ + β₂X₂

At p(x) = 0.5, the odds are 0.5 / 0.5 = 1, so the log-odds are log(1) = 0:

0 = -7 + 0.06X₁ + (1 * 3.4)
0.06X₁ = 3.6
X₁ = 60 hours

The student with a GPA of 3.4 would need to study 60 hours to have a 50% chance of getting an A.
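Again, a quick language-agnostic numeric check of the algebra (Python here, though the assignment itself is R-based):

```python
import math

# Estimated coefficients and the student's GPA
b0, b1, b2 = -7, 0.06, 1
gpa = 3.4

# At a 50% chance, the log-odds are 0, so solve 0 = b0 + b1*hours + b2*gpa
hours = -(b0 + b2 * gpa) / b1
print(round(hours, 1))  # 60.0

# Sanity check: the implied probability at that many hours is 0.5
eta = b0 + b1 * hours + b2 * gpa
p = math.exp(eta) / (1 + math.exp(eta))
print(round(p, 2))  # 0.5
```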

2. Weekly Data Analysis (Note: R code snippets are included for reference, but results may vary slightly depending on the software version or environment.)

(3) Summary and Scatterplots

Code snippet
library(ISLR)        # load the ISLR package (require(ISLR) also works)
summary(Weekly)      # numerical summaries of each variable
pairs(Weekly)        # scatterplot matrix of all variables
cor(Weekly$Year, Weekly$Volume)  # pairwise correlation for (3)c
  • Summary: provides basic statistics such as the mean, median, minimum, and maximum for each variable.
  • Scatterplots: visualize the pairwise relationships between the variables.

In the scatterplot matrix, Year and Volume show a clear positive relationship: trading volume increases over time. cor(Weekly$Year, Weekly$Volume) returns roughly 0.84, confirming a strong positive correlation.

(4) Logistic Regression with Lags and Volume

Code snippet
model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Volume,
             data = Weekly, family = binomial)  # family = binomial for logistic regression
summary(model)

Examine the p-values in the summary output. Statistically significant predictors have low p-values (typically < 0.05); for this data, Lag2 turns out to be the only predictor significant at the 5% level.

(5) Confusion Matrix and Evaluation

Code snippet
predictions <- predict(model, type = "response")  # fitted probabilities of "Up"
cut_off <- 0.5  # threshold for classifying Up/Down
pred_class <- ifelse(predictions > cut_off, "Up", "Down")
cm <- table(Actual = Weekly$Direction, Predicted = pred_class)
accuracy <- sum(diag(cm)) / sum(cm)

# Treat "Up" as the positive class:
# precision = TP / (TP + FP), recall = TP / (TP + FN)
precision <- cm["Up", "Up"] / sum(cm[, "Up"])
recall <- cm["Up", "Up"] / sum(cm["Up", ])

print(cm)
cat("Accuracy:", accuracy, "\n")
cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
  • Confusion matrix: shows how well the model classified up/down movements (actual vs. predicted).
  • Accuracy: proportion of correctly predicted observations.
  • Precision: proportion of true positives among predicted positives.
  • Recall: proportion of actual positives that the model identifies.
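These definitions can be sanity-checked on a small hand-made confusion matrix (the numbers below are illustrative, not output from the Weekly data). A sketch in Python, with "Up" as the positive class:

```python
# Toy confusion matrix, rows = actual, columns = predicted:
#               pred Down   pred Up
# actual Down        30        20      (20 false positives)
# actual Up          10        40      (40 true positives, 10 false negatives)
tn, fp = 30, 20
fn, tp = 10, 40

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / total
precision = tp / (tp + fp)                   # of predicted Up, how many were actually Up
recall = tp / (tp + fn)                      # of actual Up, how many the model caught

print(accuracy, precision, recall)  # 0.7 0.6666666666666666 0.8
```

The key point the indexing bug above illustrates: precision divides by a *column* sum (all predicted positives), while recall divides by a *row* sum (all actual positives).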

(6) Logistic Regression with Lag 2 for Held-Out Data (2010)

Code snippet
train_data <- Weekly[Weekly$Year < 2010, ]   # training period 1990-2009
test_data <- Weekly[Weekly$Year == 2010, ]   # held-out 2010 data

model_lag2 <- glm(Direction ~ Lag2, data = train_data, family = binomial)
predictions_lag2 <- predict(model_lag2, newdata = test_data, type = "response")

pred_class_lag2 <- ifelse(predictions_lag2 > 0.5, "Up", "Down")
cm_lag2 <- table(Actual = test_data$Direction, Predicted = pred_class_lag2)
accuracy_lag2 <- sum(diag(cm_lag2)) / sum(cm_lag2)

# As in (5), treat "Up" as the positive class
precision_lag2 <- cm_lag2["Up", "Up"] / sum(cm_lag2[, "Up"])
recall_lag2 <- cm_lag2["Up", "Up"] / sum(cm_lag2["Up", ])

print(cm_lag2)
cat("Accuracy:", accuracy_lag2, "\n")
cat("Precision:", precision_lag2, "\n")
cat("Recall:", recall_lag2, "\n")
