STAT*2040 Statistics I

STAT*2040 Statistics I
G. Umphrey Graded Assignment #3 Fall 2015
This assignment will demonstrate how R can be used to very efficiently obtain analyses of certain
statistical methods that we cover near the end of Stat*2040, namely one-way ANOVA and simple linear
regression analysis. This assignment has been designed to deliver high educational value in a reasonably
short time. A Summary Sheet and instructions on what to hand in will be posted separately.
All of the data sets are stored as csv (comma separated values) in Excel and are ready to use after you
download them.
PART A: A Bit on t Tests
One-sample, two-sample and paired t tests are not that difficult to conduct by hand (i.e. with a
calculator) for small data sets but using a computer reduces the chance of making an error, calculates pvalues,
is certainly easier for larger data sets, and provides very accessible graphical methods. A nice
summary of the t tests in R is provided by Robert I. Kabacoff on his excellent Quick-R website; you can
find this readily by googling “Quick-R t tests” or through the URL
The t-test for two independent samples can analyze data that is structured in either “unstacked” or
“stacked” formats. In the “unstacked” data format the samples are side-by-side, this is the way you most
commonly see a data table in a book. In the “stacked” data format each observation is on a separate line
so that the response variable values are in a single column, and another column codes which group each
observation belongs to. The paired t-test only takes data in the unstacked format.
Load the “Moon Illusion – from OzDASL” into R and conduct the required paired t-test (you have
already seen this data set on Question Set #6 , question 6-10). To load the data set into R from your
computer use the command:
Moon <- read.csv(file.choose())
Of course you can use a data frame name other than “Moon” if you wish. You can check the first six
lines with head(Moon), the last six lines with tail(Moon), or the entire data set by simply entering
“Moon”. I recommend you attach your data frame with attach(Moon); remember to detach when you
are finished with it if you want to move on to another analysis with a different data frame in the same R
Now run the paired t-test analysis in R. Obtain the value of the t test statistic and the p-value, exactly as
they are reported on the R output. Be careful in your analysis; note that this is a one-sided test.
The purpose of a one-way ANOVA is to test a null hypothesis that the population means of two or more
groups (often called “treatments” when the groups are defined by the application of different treatments).
The test statistic to conduct the test has an F distribution if the null hypothesis is true, but the value of
the F test statistic tends to be inflated if the alternative hypothesis is true. The F test statistic is calculated
via an ANOVA table. If the null hypothesis is rejected at the specified significance level and there are
three or more groups being compared, we typically want to know where the differences are. The Tukey
HSD (Honestly Significant Difference) procedure is one such procedure for doing so, and the easiest one
to run in R.
The data sets “Cows1 (stacked)” and “Cows1 (unstacked)” comes from Canadian records of milk
production for five breeds of cattle: Ayrshire, Canadian, Guernsey, Holstein-Friesen, and Jersey.
Butterfat percentages were recorded for milk produced by 10 cows of each breed. You require the
“stacked” data for analysis in R. The “unstacked” data set simply makes it easier to look at the data.
Read the data set into R:
> Cows <- read.csv(file.choose())
Remember, use ““Cows1 (stacked)”. Check out the first six and last six observations with
head(Cows) and tail(Cows) commands. Note there are both “Breed” and “BreedCode”
variables, you can use either but using BreedCode keeps the output a bit neater. Now attach the data
> attach(Cows)
B1. Use R to obtain the ANOVA table for testing the null hypothesis that the populations of the five
breeds have equal mean butterfat percentages. The key commands are:
> CowsMod1 <- aov(Butterfat ~ BreedCode)
> anova(CowsMod1)
B2. What is the value of the Total SS?
B3. What does the p-value for the F test statistic allow you to conclude at the 5% level of significance?
B4. If appropriate (and it will be here), use R to conduct Tukey’s HSD test at the 5% level of
significance. What conclusions can you make based on this analysis? Summarize your results with
an underscore diagram. The key commands are:
> TukeyHSD(CowsMod1, ordered = T)
The “ordered = T” is an option that makes the output a bit easier to interpret. You will want to get
the breed means. Two ways to do this are:
> tapply(Butterfat, Breed, mean)
> model.tables(CowsMod1, “means”)
B5. Obtain side-by side boxplots with
> boxplot(Butterfat ~ BreedCode)
You can enhance the boxplots with options, if you wish.
Part C: Simple Linear Regression Analysis
The last topic we cover in lectures is simple linear regression and correlation analysis. Simple linear
regression analysis obtains a “best fitting” straight line to a set of bivariate data, where one of the
variables is treated as the independent (or predictor) variable, the other variable is treated as the
dependent (or response) variable and the criterion of least squares defines “best fit”. The independent
variable is commonly designated X and the dependent variable Y, reflecting the fact that the (x, y) data
are plotted with the x values on the horizontal (X) axis and the y values are plotted on the vertical (Y)
axis. A graph of the (x, y) observations in a data set is called a scatterplot; on this it is common to plot a
fitted regression model. We restrict our fitted models in this course to straight lines, but you should be
aware that many nonlinear models can also be fit (a topic for Stat*2050, Stat*3240 and beyond).
In this exercise we will fit a couple of simple linear regression equation in R; this will provide an
example for fitting other simple linear regression models.
The data set we will use is “Hand_508 Haematology of paint sprayers”, from Hand et al’s Handbook of
Small Data Sets. Hand et al introduce the data set as follows: “The data in this example are a subset of
data obtained in a health survey of paint sprayers in a car assembly plant.”
The variables are:
HAEMO = haemoglobin concentration
PCV = packed cell volume
WBC = white blood cell count
LYMPHO = lymphocyte count
NEUTRO = neutrophil count
LEAD = serum lead concentration
Open R and read the data frame into R; I would suggest
> Blood <- read.csv(file.choose())
To check your data set enter:
> head(Blood)
> tail(Blood)
Note that the data frame contains the six variable names plus a “Case” ID number. To use these variable
names without specifying the dataframe each time we need to “attach” the dataframe:
> attach(Blood)
We could make a scatterplot of the data for any pair of variables at this point. But if we also want to fit
the least squares regression line, we first need to run a “linear model” analysis. A good pair of variables
to start with is HAEMO and PCV, let’s regress HAEMO on PCV:
> BloodMod1 <- lm(HAEMO ~ PCV)
If we had not attached the dataframe, we would have typed
> BloodMod1 <- lm(HAEMO ~ PCV, data = Blood)
so that the lm function would know where to find the variables HAEMO and PCV.
To get a bare-bones scatterplot, type:
> plot(HAEMO ~ PCV)
Now superimpose the least squares regression line by typing:
> abline(BloodMod1)
Enhancements can be made to the graph, for example we can add a title and better label the axes by
adding options to the plot() statement:
> plot(HAEMO ~ PCV, main = “Paint Sprayer Blood Chemistry”, xlab
+ = “Haemoglobin concentration”, ylab = “Packed cell volume”)
You will want to execute the “abline()” command again.
A summary of the lm analysis is obtained by typing:
> summary(BloodMod1)
From this summary, you can determine the equation of the least squares regression line that was
superimposed on the scatter plot. You also get information for conducting hypothesis tests and
calculating confidence intervals.
The ANOVA table for the regression analysis is obtained by typing:
> anova(BloodMod1)
A good format to save your graph in is pdf, as this can be converted to high quality images in other
graphic formats (such as png or jpg, with png generally being superior to jpg) that can readily be
imbedded in other documents. You can also save to png or jpg formats directly, the resolution is not the
best but it is OK for our purposes.
For the regression analysis of HAEMO on PCV, answer the following questions.
C1. What is the least squares regression equation?
C2. Obtain (individual) 95% confidence intervals for the true slope and true Y-intercept.
C3. What is the predicted value of HAEMO if PCV = 45?
C4. What is the value of the residual for the last observation in the data set? Make sure you know how
to calculate this.
C5. What is the value of the correlation coefficient?
C6. If each residual was squared and then summed, what entry in the ANOVA table would be obtained?
Confirm this by rerunning the analysis in R. You can look at the residuals by typing
> residuals(BloodMod1)
Now sum the squared residuals by typing:
> res <- residuals(BloodMod1)
> sum(res^2)
You should be able to find the sum of the squared residuals on the ANOVA table. What entry in
the ANOVA table does this correspond to: SSTr, SSE, MSTr, or MSE?
C7. Continuing with the R analysis, type
> boxplot(res)
How many outliers do you see and where are they located?
C8. The “summary(BloodMod1)” produces a table with two t-statistic values and their associated pvalues.
What are these values and what do they allow us to conclude at the 5% level of
C9. The ANOVA table produces a value of an F test statistic. It tests one of the same set of hypotheses
as was done in C8. Briefly explain how one of the t-tests in C8 is statistically equivalent to the F
Questions C1 to C9 can be repeated with any other pair of variables. I will have you repeat the exercise,
but this time regressing HAEMO on LEAD. This will allow us to test if the haemoglobin concentration
is linearly associated with the serum lead concentration. This pair of variables gives results which are
quite different from the HAEMO on PVC regression analysis.
You are welcome to run regression analyses on any other pair of variables!