Submit only one file, in PDF format, to the link on the Study Desk.
Assume that your report will be read by someone familiar with the data sets but with limited statistical knowledge. Fully explain plots and, when stating statistics or results, explain what they mean statistically AND in the context of the data.
Presentation should be neat, consistent, spell-checked and proofread. All questions should be clearly labelled and all answers should clearly and concisely address the questions.
If you convert a Word document to PDF for submission, check that all symbols, equations, etc. have converted correctly, i.e. proofread your work.
If you do not use knitr to compile your submission, where asked to provide R code, paste the relevant code within the assignment document and italicise it (or otherwise highlight or distinguish it from other content). Do not include code in an appendix.
Do not include an appendix at all. Any work included in an appendix will not be marked.
Please note that referencing textbooks and other resources is not the goal of this assessment. This work requires students to demonstrate their understanding of the analysis and interpretation, not to provide quotes from resources.
When interpreting output, you are expected to do so in the context of the data and the method (i.e. ensure you comment on aspects of the method that affect your interpretation with respect to the variables and sample).
A maximum of 10 marks will be deducted from your total marks for poor presentation.
Question 1 CCA: 25
Question 2 FA: 20
Question 3 MDS: 20
Question 4 DFA: 30
Question 5 reflection: 5
Data: Only one data set will be used for all questions in this assignment.
The data file ‘jobsat.txt’ contains data measuring three job satisfaction variables and three job characteristic variables for 70 employees of a large corporation. The sex of each employee and the years of their employment within the company were also recorded.
Three measures of job satisfaction:
• career: employee satisfaction with career direction and the possibility of future advancement, expressed as a percent
• supervisor: employee satisfaction with supervisor’s communication and management style, expressed as a percent
• finance: employee satisfaction with salary and other benefits, using a scale measurement from 1 to 10 (1=unsatisfied, 10=satisfied)
Three variables associated with job characteristics:
• variety: degree of variety involved in tasks, expressed as a percent
• feedback: degree of feedback required in job tasks, expressed as a percent
• autonomy: degree of autonomy required in job tasks, expressed as a percent
Coding for the sex of each employee:
• male: 1
• female: 2
Coding for the years of employment within the company:
• <5 years: 1
• 5-10 years: 2
• >10 years: 3
Assume all job satisfaction variables and job characteristic variables meet MVN and other test assumptions for the purpose of these exercises.
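Several questions call for standardised variables, and the sex and years codes listed above are category labels rather than measurements. The following is a minimal sketch of how both might be handled in R. It runs on simulated values, not the assignment data, and the column names (career, supervisor, finance, variety, feedback, autonomy, sex, years) are assumptions about how jobsat.txt is laid out.

```r
# Toy stand-in for jobsat.txt: values are simulated, NOT the assignment data.
# Column names are assumed to match the variable names listed above.
set.seed(1)
n <- 70
jobsat <- data.frame(
  career     = runif(n, 0, 100),
  supervisor = runif(n, 0, 100),
  finance    = sample(1:10, n, replace = TRUE),
  variety    = runif(n, 0, 100),
  feedback   = runif(n, 0, 100),
  autonomy   = runif(n, 0, 100),
  sex        = sample(1:2, n, replace = TRUE),
  years      = sample(1:3, n, replace = TRUE)
)

# Treat the numeric codes as labelled categories rather than quantities.
jobsat$sex   <- factor(jobsat$sex, levels = 1:2, labels = c("male", "female"))
jobsat$years <- factor(jobsat$years, levels = 1:3,
                       labels = c("<5", "5-10", ">10"))

# Standardise the six measurement variables (mean 0, sd 1 per column).
measures <- c("career", "supervisor", "finance",
              "variety", "feedback", "autonomy")
z <- scale(jobsat[, measures])

colMeans(z)       # each approximately 0
apply(z, 2, sd)   # each approximately 1
```

Standardising matters here because finance is recorded on a 1 to 10 scale while the other five measures are percentages, so the raw variances are not comparable.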
Question 1 (25 marks):
Explore the relationship between the job satisfaction and job characteristic sets of variables by completing the following:
(a) Based on standardised variables, produce and comment on a pairwise correlation matrix for the six variables of interest. Does this correlation matrix suggest that canonical correlation would be an appropriate form of analysis, and why? (3 marks)
(b) Perform a canonical correlation using the set of job characteristic variables (standardised) as the X variables and the job satisfaction variables (standardised) as the Y variables. Provide appropriate output, definitions and interpretations for: (10 marks)
• canonical correlations (also explain why canonical correlations become successively weaker but do not add up to one).
• chi-square test of significance and Rao’s F approximation significance test
• redundancy coefficients for the variance in the Y set of variables explained by the variance in the X set.
[Note: ‘appropriate’ requires you to select the appropriate parts of the output from your analysis to address each dot-point – do not include all R output].
(c) Provide the equations that describe the first canonical function using your analysis solution. Interpret the canonical loadings and the value of the analysis overall. (6 marks)
(d) Provide the output showing the eigenvalues and interpret. Explain the relationship between eigenvalues and canonical correlations. (2 marks)
(e) Why is canonical correlation an appropriate technique for this analysis and not multiple regression or MANOVA? (2 marks)
(f) What are the limitations associated with canonical correlation analysis? (2 marks)
Question 2 (20 marks):
(a) Perform a PCA on all six of the job satisfaction and job characteristic variables. Include in your answer: (5 marks)
• A brief explanation of your choice between the correlation and covariance matrix as the basis of the PCA
• Justification for your choice of the number of PCs to use in factor analysis
(b) Perform a Factor Analysis on all six variables (apply no rotation) using the number of factors identified in part (a). Interpret the output, in particular: (8 marks)
• Interpret the variable loadings, the variance explained and the chi-square test.
• Discuss the difference in uniqueness values for the variables finance and autonomy.
(c) Repeat the FA with a varimax rotation. In addition, calculate the communalities. Interpret the output, comparing it to the output from part (b) and discussing the communality of the variable career and how it is reflected in the factor loading. (7 marks)
Question 3 (20 marks):
Note: for the plots required in this question you may not have been given example code in the course materials for all aspects of the plot details. You are required to problem solve these plotting code issues for yourself – look for solutions online/google. Some trial and error will be required. Do not ask for solutions on the course forum.
Use metric MDS to determine if employees form clusters based on sex and years by completing the following:
(a) Perform metric 2D MDS ordination based on Euclidean distances for the six standardised measurement variables. Provide the Goodness of Fit (GoF) output only. Plot the results, identifying employees by sex using open circles for males and solid squares for females. Include a legend. Interpret the plot and include interpretation of the GoF output. What happens to the GoF if another dimension is added to the analysis? (6 marks)
(b) Repeat the plot in part (a), but instead identify employees by years employed in the plot space. In addition, colour each number: red for year 1, blue for year 2 and dark green for year 3. Interpret in context of the data and the method. (6 marks)
Hint: Use colors() to find names of colours to use in code.
(c) What is the Euclidean distance between employees 2 and 4 from your original distance matrix from part (a)? Prove this distance mathematically based on the original data. (4 marks)
(d) Why can’t the influential variables along each dimension be identified? Suggest other methods that may achieve this. (4 marks)
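For reference when working on part (c), the usual definition of the Euclidean distance between two employees i and j measured on p variables is given below. The symbols d_ij, z_ik and p are notation introduced here for illustration and may differ from the course materials; with the six standardised measurement variables, p = 6.

```latex
d_{ij} = \sqrt{\sum_{k=1}^{p} \left( z_{ik} - z_{jk} \right)^{2}}
```

Here z_ik is the value of variable k for employee i, taken on whichever scale was used to build the distance matrix.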
Question 4 (30 marks):
Determine if the years of employment can be predicted by an employee’s response to all of the job satisfaction and job characteristic variables.
(a) Produce and interpret pairwise scatter plots for all six of the job satisfaction and job characteristic variables, distinguishing between years using colour. (4 marks)
(b) Training and test sets should be used with a 75/25 split and you must provide the seed value used in your code. Use the table function in R to provide the number of employees in each year for both the training and test sets that you have constructed. (6 marks)
(c) Perform a DFA. Explain why there are only two DFs calculated. Provide output, definition and interpretation (in context of the data and method) for: (10 marks)
• the prior probabilities
• the trace values
• the weightings on LD1 and LD2
(d) Based on the DFA, predict year membership and create and interpret a table showing observed vs predicted for the test set. Create an x-y plot of the two DFs grouped by the original year labels and another by the predicted year labels. Indicate on the second plot the employees who were misclassified. (10 marks)
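The 75/25 split with a reported seed described in part (b) could be set up along the following lines. This is only a sketch under stated assumptions: the years vector is simulated rather than read from the data file, the seed value 2024 is arbitrary, and the names train_idx and test_idx are hypothetical.

```r
# Hypothetical stand-in for the years column: 70 simulated codes, NOT the real data.
set.seed(2024)                    # arbitrary seed; report whichever value you use
n <- 70
years <- factor(sample(1:3, n, replace = TRUE))

# Draw 75% of the row indices at random for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

table(years[train_idx])   # employees per years group in the training set
table(years[test_idx])    # employees per years group in the test set
```

Because floor(0.75 * 70) is 52, this gives 52 training rows and 18 test rows. Setting the seed before sampling is what makes the split reproducible for a marker rerunning the code.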
Question 5 (5 marks):
Write 100 to 300 words explaining whether any of these forms of analysis have helped your understanding of the data. Do not restate results.