STAT*2040 Statistics I

G. Umphrey
Graded Assignment #3, Fall 2015

This assignment demonstrates how R can be used to obtain, very efficiently, analyses for statistical methods we cover near the end of Stat*2040, namely one-way ANOVA and simple linear regression analysis. It has been designed to deliver high educational value in a reasonably short time. A Summary Sheet and instructions on what to hand in will be posted separately. All of the data sets are stored as csv (comma separated values) files and are ready to use after you download them.

PART A: A Bit on t Tests

One-sample, two-sample and paired t tests are not that difficult to conduct by hand (i.e. with a calculator) for small data sets, but using a computer reduces the chance of making an error, calculates p-values, is certainly easier for larger data sets, and provides very accessible graphical methods. A nice summary of the t tests in R is provided by Robert I. Kabacoff on his excellent Quick-R website; you can find it readily by googling "Quick-R t tests" or through the URL http://www.statmethods.net/stats/ttest.html

The t test for two independent samples can analyze data structured in either "unstacked" or "stacked" format. In the unstacked format the samples are side by side; this is the way you most commonly see a data table in a book. In the stacked format each observation is on a separate line, so that the response variable values are in a single column and another column codes which group each observation belongs to. The paired t test only takes data in the unstacked format.

Load the "Moon Illusion - from OzDASL" data set into R and conduct the required paired t test (you have already seen this data set in Question Set #6, question 6-10). To load the data set into R from your computer, use the command:

> Moon <- read.csv(file.choose())

Of course you can use a data frame name other than "Moon" if you wish.
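Once the data are loaded, the paired test itself is a single call to t.test(). A minimal sketch; the column names Elevated and Level here are assumptions, not necessarily the actual names in the file (check yours with names(Moon)), and you must choose "alternative" to match the direction of the one-sided hypothesis you are testing:

```r
# Sketch of a paired, one-sided t test; column names are hypothetical.
# Replace Elevated/Level with the actual columns in your Moon data frame,
# and set "alternative" to match your alternative hypothesis.
t.test(Moon$Elevated, Moon$Level, paired = TRUE, alternative = "greater")
```

The output reports the t statistic, the degrees of freedom, and the p-value directly.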
You can check the first six lines with head(Moon), the last six lines with tail(Moon), or the entire data set by simply entering "Moon". I recommend you attach your data frame with attach(Moon); remember to detach with detach(Moon) when you are finished with it if you want to move on to an analysis of a different data frame in the same R session.

Now run the paired t-test analysis in R. Obtain the value of the t test statistic and the p-value, exactly as they are reported in the R output. Be careful in your analysis; note that this is a one-sided test.

PART B: One-Way ANOVA

The purpose of a one-way ANOVA is to test the null hypothesis that the population means of two or more groups (often called "treatments" when the groups are defined by the application of different treatments) are all equal. The test statistic has an F distribution if the null hypothesis is true, but its value tends to be inflated if the alternative hypothesis is true. The F test statistic is calculated via an ANOVA table. If the null hypothesis is rejected at the specified significance level and there are three or more groups being compared, we typically want to know where the differences are. The Tukey HSD (Honestly Significant Difference) procedure is one procedure for doing so, and the easiest one to run in R.

The data sets "Cows1 (stacked)" and "Cows1 (unstacked)" come from Canadian records of milk production for five breeds of cattle: Ayrshire, Canadian, Guernsey, Holstein-Friesian, and Jersey. Butterfat percentages were recorded for milk produced by 10 cows of each breed. You require the "stacked" data for analysis in R; the "unstacked" data set simply makes it easier to look at the data.

Read the data set into R:

> Cows <- read.csv(file.choose())

Remember, use "Cows1 (stacked)". Check the first six and last six observations with the head(Cows) and tail(Cows) commands.
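For reference, R can convert an unstacked table to the stacked format itself with the stack() function. A small sketch with made-up numbers (not the actual Cows1 values):

```r
# Sketch: converting an unstacked (side-by-side) table to the stacked format.
# The numbers here are made up, not the actual Cows1 butterfat values.
unstacked <- data.frame(Ayrshire = c(4.1, 4.4), Jersey = c(5.2, 5.0))
stacked <- stack(unstacked)               # columns "values" and "ind"
names(stacked) <- c("Butterfat", "Breed") # rename to match our analysis
stacked
```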
Note that there are both "Breed" and "BreedCode" variables; you can use either, but using BreedCode keeps the output a bit neater. Now attach the data frame:

> attach(Cows)

B1. Use R to obtain the ANOVA table for testing the null hypothesis that the populations of the five breeds have equal mean butterfat percentages. The key commands are:

> CowsMod1 <- aov(Butterfat ~ BreedCode)
> anova(CowsMod1)

B2. What is the value of the Total SS?

B3. What does the p-value for the F test statistic allow you to conclude at the 5% level of significance?

B4. If appropriate (and it will be here), use R to conduct Tukey's HSD test at the 5% level of significance. What conclusions can you make based on this analysis? Summarize your results with an underscore diagram. The key command is:

> TukeyHSD(CowsMod1, ordered = T)

The "ordered = T" option makes the output a bit easier to interpret. You will want to get the breed means. Two ways to do this are:

> tapply(Butterfat, Breed, mean)

or

> model.tables(CowsMod1, "means")

B5. Obtain side-by-side boxplots with

> boxplot(Butterfat ~ BreedCode)

You can enhance the boxplots with options if you wish.

Part C: Simple Linear Regression Analysis

The last topic we cover in lectures is simple linear regression and correlation analysis. Simple linear regression obtains a "best fitting" straight line for a set of bivariate data, where one of the variables is treated as the independent (or predictor) variable, the other is treated as the dependent (or response) variable, and the criterion of least squares defines "best fit". The independent variable is commonly designated X and the dependent variable Y, reflecting the fact that the (x, y) data are plotted with the x values on the horizontal (X) axis and the y values on the vertical (Y) axis. A graph of the (x, y) observations in a data set is called a scatterplot; on this it is common to plot a fitted regression model.
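To make "least squares" concrete: the fitted line y-hat = b0 + b1*x minimizes the sum of squared vertical deviations, sum of (yi - (b0 + b1*xi))^2. A sketch that computes the slope and intercept from the usual summation formulas and checks them against lm(), using made-up data:

```r
# Sketch: least squares slope and intercept from the summation formulas,
# checked against lm() on a small made-up data set.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)        # hand-formula estimates
coef(lm(y ~ x))  # should agree
```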
We restrict our fitted models in this course to straight lines, but you should be aware that many nonlinear models can also be fit (a topic for Stat*2050, Stat*3240 and beyond). In this exercise we will fit a couple of simple linear regression equations in R; this will serve as a template for fitting other simple linear regression models.

The data set we will use is "Hand_508 Haematology of paint sprayers", from Hand et al.'s Handbook of Small Data Sets. Hand et al. introduce the data set as follows: "The data in this example are a subset of data obtained in a health survey of paint sprayers in a car assembly plant." The variables are:

HAEMO = haemoglobin concentration
PCV = packed cell volume
WBC = white blood cell count
LYMPHO = lymphocyte count
NEUTRO = neutrophil count
LEAD = serum lead concentration

Open R and read the data frame into R; I would suggest

> Blood <- read.csv(file.choose())

To check your data set, enter:

> head(Blood)

and

> tail(Blood)

Note that the data frame contains the six variable names plus a "Case" ID number. To use these variable names without specifying the data frame each time, we need to attach the data frame:

> attach(Blood)

We could make a scatterplot of the data for any pair of variables at this point. But if we also want to fit the least squares regression line, we first need to run a "linear model" analysis. A good pair of variables to start with is HAEMO and PCV; let's regress HAEMO on PCV:

> BloodMod1 <- lm(HAEMO ~ PCV)

If we had not attached the data frame, we would have typed

> BloodMod1 <- lm(HAEMO ~ PCV, data = Blood)

so that the lm function would know where to find the variables HAEMO and PCV.
To get a bare-bones scatterplot, type:

> plot(HAEMO ~ PCV)

Now superimpose the least squares regression line by typing:

> abline(BloodMod1)

Enhancements can be made to the graph; for example, we can add a title and better label the axes by adding options to the plot() statement:

> plot(HAEMO ~ PCV, main = "Paint Sprayer Blood Chemistry",
+   xlab = "Packed cell volume", ylab = "Haemoglobin concentration")

(The "+" is the R continuation prompt; do not type it. Note that PCV, the predictor, goes on the horizontal axis.) You will want to execute the abline() command again.

A summary of the lm analysis is obtained by typing:

> summary(BloodMod1)

From this summary you can determine the equation of the least squares regression line that was superimposed on the scatterplot. You also get information for conducting hypothesis tests and calculating confidence intervals. The ANOVA table for the regression analysis is obtained by typing:

> anova(BloodMod1)

A good format to save your graph in is pdf, as this can be converted to high-quality images in other graphic formats (such as png or jpg, with png generally being superior to jpg) that can readily be embedded in other documents. You can also save to png or jpg formats directly; the resolution is not the best, but it is OK for our purposes.

For the regression analysis of HAEMO on PCV, answer the following questions.

C1. What is the least squares regression equation?

C2. Obtain (individual) 95% confidence intervals for the true slope and true Y-intercept.

C3. What is the predicted value of HAEMO if PCV = 45?

C4. What is the value of the residual for the last observation in the data set? Make sure you know how to calculate this.

C5. What is the value of the correlation coefficient?

C6. If each residual were squared and then summed, what entry in the ANOVA table would be obtained? Confirm this by rerunning the analysis in R.
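R has built-in helpers that are useful for questions C2, C3 and C5. A sketch, assuming BloodMod1 has been fit and the Blood data frame attached as above:

```r
# 95% confidence intervals for the intercept and slope (question C2)
confint(BloodMod1, level = 0.95)

# Predicted HAEMO at PCV = 45 (question C3)
predict(BloodMod1, newdata = data.frame(PCV = 45))

# Correlation coefficient (question C5); its sign matches the slope's sign
cor(PCV, HAEMO)
```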
You can look at the residuals by typing

> residuals(BloodMod1)

Now sum the squared residuals by typing:

> res <- residuals(BloodMod1)
> sum(res^2)

You should be able to find the sum of the squared residuals in the ANOVA table. What entry in the ANOVA table does this correspond to: SSTr, SSE, MSTr, or MSE?

C7. Continuing with the R analysis, type

> boxplot(res)

How many outliers do you see, and where are they located?

C8. summary(BloodMod1) produces a table with two t-statistic values and their associated p-values. What are these values, and what do they allow us to conclude at the 5% level of significance?

C9. The ANOVA table produces the value of an F test statistic. It tests one of the same hypotheses as were tested in C8. Briefly explain how one of the t tests in C8 is statistically equivalent to the F test.

Questions C1 to C9 can be repeated with any other pair of variables. I will have you repeat the exercise, this time regressing HAEMO on LEAD. This will allow us to test whether haemoglobin concentration is linearly associated with serum lead concentration. This pair of variables gives results that are quite different from those of the HAEMO on PCV regression analysis. You are welcome to run regression analyses on any other pair of variables!
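The equivalence asked about in C9 can also be checked numerically: in simple linear regression, the slope's t statistic squared equals the regression F statistic. A self-contained sketch using simulated (made-up) data:

```r
# Sketch: for simple linear regression, t^2 for the slope equals the F statistic.
set.seed(1)
x <- rnorm(20)
y <- 2 + 0.5 * x + rnorm(20)
mod <- lm(y ~ x)
tval <- summary(mod)$coefficients["x", "t value"]
Fval <- anova(mod)["x", "F value"]
all.equal(tval^2, Fval)  # TRUE
```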
