Regression analysis

Assume that you still work for Ms. Deanna V. Ashun (aka “Dee”), who is now most concerned with finding the set of variables that truly relate to annual salary (e.g., EDUCATION LEVEL is probably correlated with salary, whereas CITIZENSHIP is probably not). Ms. Ashun has certain suspicions, but is not absolutely sure which variables are the most important in terms of the salaries paid at B&T. She decides to exclude the CEO from all calculations and to use the following notation:

y = Annual salary paid in $1000s
x1 = Age in years
x2 = Years of experience prior to B&T
x3 = Level of education; x3 = 1,2,3,4,5
x4 = 0/1 variable for computer usage
x5 = Job classification; x5 = 1,2,3,4,5,6
x6 = Years of experience at B&T
x7 = Gender; 0 = Male; 1 = Female
x8 = Citizenship (but this is nominal data, so will not be used here)
x9 = Salary adjustor for location; x9 = 1,2,3,4,5

STATS250 PROJECT #3 (Regression Analysis) Page 2

(a) After some thought, she decides that PRIOR, EDUC, GRADE, EXPERI and GENDER are probably the variables that correlate most highly with SALARY. Assuming that all relationships are linear (i.e., of the form E(y) = β0 + β1*xi), she asks you to complete the following table (please put your answers on the Yellow Sheet): (10 points)

Dep. Variable   Ind. Variable   Prediction Equation                               R² value
SALARY          PRIOR           y = _______________ + ________________ * PRIOR
SALARY          EDUC            y = _______________ + ________________ * EDUC
SALARY          GRADE           y = _______________ + ________________ * GRADE
SALARY          EXPERI          y = _______________ + ________________ * EXPERI
SALARY          GENDER          y = _______________ + ________________ * GENDER

(b) 1. Are the signs of the slopes as expected?
    2. Interpret each slope coefficient: (8 points)
       PRIOR:
       EDUC:
       GRADE:
       EXPERI:

(c) Of the five variables, which two have the highest R² values? (1 point)
[ ] PRIOR  [ ] EDUC  [ ] GRADE  [ ] EXPERI  [ ] GENDER
Of the five variables, which two have the lowest R² values?
(1 point)
[ ] PRIOR  [ ] EDUC  [ ] GRADE  [ ] EXPERI  [ ] GENDER

Now aware that GRADE has the single greatest impact on SALARY, Ms. Ashun wonders what variables influence GRADE. She suspects that greater academic credentials are needed to get promoted to the higher ranks at B&T, and further suspects that this relationship is linear. Thus, for all employees in the sample (excluding the CEO), she asks you to investigate the following model:

GRADE = β0 + β1*EDUC

(d) Get the full regression output for this model.
d1. Specify the final prediction equation: (3 points)
d2. What percent of the variance in GRADE is due to factors other than EDUCation? (3 points)
d3. What is the 95% confidence interval for the slope of your model? (3 points)
d4. Assuming your reader is Mr. Pellsize (intelligent non-statistician), explain the numeric values found in part (d3) in one or two sentences. (3 points)

Ms. Ashun also knows that using this regression model to make predictions about GRADE means that at least three assumptions must be satisfied:
• The error terms must follow a normal distribution; and
• The error values must be statistically independent; and
• The variance of the error terms must be relatively constant.

(e) Generate both a residual plot and a normal probability plot. Please attach these two plots to this paper. (6 points)
(f) Based on your plots, do you believe that the error terms follow a normal distribution? (1 point) Justify your answer to this part. (3 points)
(g) Based on your plots, do you believe that the error values are statistically independent? (1 point) Justify your answer to this part. (3 points)
(h) Based on your plots, do you believe that the variance of the error terms is constant? (1 point) Justify your answer to this part. (3 points)

(c) Two highest R² values: [ ] PRIOR  [ ] EDUC  [ ] GRADE  [ ] EXPERI  [ ] GENDER
Two lowest R² values: [ ] PRIOR  [ ] EDUC  [ ] GRADE  [ ] EXPERI  [ ] GENDER
(d) d1.
Prediction equation: GRADE = __________________ + ____________________*EDUC

STATS250 YELLOW SHEET FOR: PROJECT #3 Page 3

d2. Percent variance due to other factors = _________________________
d3. 95% Confidence Interval for slope = ( ____________, ____________ )
d4. Interpretation of (d3): _____________________________________________________
_____________________________________________________
(e) Generate both a residual plot as well as the normal probability plot. Please attach these two plots to this paper.
(f) Error terms follow a normal distribution? [ ] YES  [ ] NO
Justification of your answer: _____________________________________________________
_____________________________________________________
(g) Are error values statistically independent? [ ] YES  [ ] NO
Justification of your answer: _____________________________________________________
_____________________________________________________
(h) Error terms show constant variance? [ ] YES  [ ] NO
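
For readers who want to check the quantities requested in parts (a) and (d) by hand, the following is a minimal Python sketch of a one-predictor least-squares fit. It is an illustration only, not the course's prescribed method. The helper name fit_simple_regression, the ten EDUC/GRADE pairs, and the constant T_CRIT are assumptions introduced here; the data are invented and are not the B&T sample. The critical value 2.306 is the two-sided 95% t-table value for 8 degrees of freedom and must be looked up again for a different sample size.

```python
import math


def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x (hypothetical helper, not from the handout)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)                       # spread of x
    syy = sum((yi - y_bar) ** 2 for yi in y)                       # total sum of squares
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each take at least two distinct values")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)                           # unexplained sum of squares
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1.0 - sse / syy,
        "se_slope": math.sqrt(sse / (n - 2) / sxx),                # standard error of the slope
        "residuals": residuals,
    }


# Invented illustration data: EDUC level (1-5) and GRADE (1-6) for ten made-up employees.
educ = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
grade = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]

# Assumption: two-sided 95% critical value of t with n - 2 = 8 degrees of freedom.
T_CRIT = 2.306

fit = fit_simple_regression(educ, grade)
margin = T_CRIT * fit["se_slope"]
ci_low, ci_high = fit["slope"] - margin, fit["slope"] + margin
pct_unexplained = 100.0 * (1.0 - fit["r_squared"])

print(f"GRADE = {fit['intercept']:.3f} + {fit['slope']:.3f}*EDUC")
print(f"R^2 = {fit['r_squared']:.4f}; variance due to other factors = {pct_unexplained:.2f}%")
print(f"95% CI for slope = ({ci_low:.3f}, {ci_high:.3f})")
```

The quantity asked for in d2 is simply 100*(1 - R²), and the returned residuals are what a residual plot or normal probability plot in part (e) would be built from, using whatever plotting tool the course provides.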

Sample Solution