You work at a credit card company and would like to predict new cardholders' credit card balances based on a number of factors. This dataset only contains information on cardholders who maintain a balance at some point during a month (that is, their balances are not zero). The credit card company does have customers who carry no balance (because they are not using their cards), but this analysis examines only active card users. Your business questions are: (1) What variables effectively contribute to predicting active cardholders' credit card balances? (2) What credit card balance might a new active cardholder hold, depending on certain variables?
Variables: The variables in this dataset include:
Income: Annual income, in dollars
Limit: Credit limit for credit card, in dollars
Rating: A credit rating calculated by the credit card company (not the same as a typical credit score)
Age: Age in years
Education: Number of years of education
Student: Whether or not the cardholder is a student (No = 0, Yes = 1)
Gender: The gender of the cardholder (Male = 0, Female = 1)
Married: Whether or not the cardholder is married (No = 0, Yes = 1)
Balance: The amount of each cardholder's balance, in dollars
Assignment Steps:
Carry out the steps below to complete the assignment, then answer the questions in the Module 3 Assignment Quiz on Brightspace. The quiz questions are included here, with their numbers, if you prefer to answer them as you are doing the assignment and enter them in the Brightspace quiz all at once (multiple choice questions are labeled MC).
Generate summary statistics for the variables in the Credit.csv dataset.
Quiz question #1: How many cardholders in the full dataset are students?
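A minimal sketch of the summary-statistics step in base R. The small data frame here is a made-up stand-in so the snippet runs on its own; in the assignment the data would presumably be loaded with read.csv("Credit.csv"), and the counts below say nothing about the real dataset.

```r
# Made-up stand-in rows; NOT the real Credit.csv data.
credit <- data.frame(
  Income  = c(35000, 82000, 51000, 27000, 64000, 45000),
  Age     = c(34, 51, 27, 22, 45, 38),
  Student = c(0, 1, 0, 1, 0, 0)   # No = 0, Yes = 1, as in the variable list
)

# Min, quartiles, mean, and max for every column.
summary(credit)

# Count of non-students (0) and students (1).
table(credit$Student)

# The same student count as a single number.
n_students <- sum(credit$Student == 1)
n_students
```

Because Student is coded 0/1, its mean in the summary() output is also the proportion of students.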
Partition the dataset into a training set and a validation set (following the method used in the lecture code car_regression_ex.R)
**IMPORTANT #1: Because this dataset is smaller than the one used in the video example, divide the dataset 50-50 rather than 70-30 as was done in the video example.
**IMPORTANT #2: In order to get results that align with the correct answers in the assignment quiz, you MUST set the seed value to 42 using the set.seed() function when you partition your dataset. If you do not do this, you will not be able to reproduce the answers that correspond with the assignment quiz.
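A rough sketch of a 50-50 partition with the seed set to 42. The lecture file car_regression_ex.R is not shown here, so the sample() call below is an assumption about the method; if the lecture code draws its sample differently, the rows selected (and therefore the quiz answers) will differ, and the lecture's approach should be followed. The data frame is a made-up stand-in.

```r
# Made-up stand-in; in the assignment this would be the Credit.csv data.
credit <- data.frame(id = 1:10, Balance = seq(100, 1000, by = 100))

set.seed(42)  # fixes the random number generator so the split is reproducible

n <- nrow(credit)
train_rows <- sample(seq_len(n), size = floor(0.5 * n))  # half of the row indices

train_df <- credit[train_rows, ]   # training set: the sampled rows
valid_df <- credit[-train_rows, ]  # validation set: everything else

nrow(train_df)
nrow(valid_df)
```

set.seed() has to be called immediately before the sampling call; any other random draw in between changes which rows are selected.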
Create a correlation matrix with the quantitative variables in the training dataframe.
Quiz question #2: Looking at the correlation matrix, which pair of variables has the strongest correlation? (MC)
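As an illustration of the correlation-matrix step, the sketch below uses cor() on simulated data. The simulated columns only borrow the variable names; the relationships among them are invented, so the pair reported here is not the answer for the real training set.

```r
# Simulated stand-in for the quantitative columns of the training dataframe.
set.seed(1)
n <- 50
Income <- runif(n, 20000, 120000)
Limit  <- 1000 + 0.06 * Income + rnorm(n, sd = 500)
Rating <- 50 + 0.065 * Limit + rnorm(n, sd = 10)
Age    <- sample(20:80, n, replace = TRUE)
train_df <- data.frame(Income, Limit, Rating, Age)

# Pearson correlations between every pair of columns.
cor_mat <- cor(train_df)
print(round(cor_mat, 3))

# Locate the off-diagonal pair with the largest absolute correlation.
abs_cor <- abs(cor_mat)
diag(abs_cor) <- 0  # ignore each variable's correlation with itself
strongest <- which(abs_cor == max(abs_cor), arr.ind = TRUE)[1, ]
strongest_pair <- rownames(cor_mat)[strongest]
strongest_pair
```

With the real data, only the quantitative columns would be passed to cor(), for example by selecting them by name.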
Conduct a multiple regression analysis using the training dataframe with Balance as the outcome variable and all the other variables in the dataset as predictor variables.
Quiz question #3: What is the slope coefficient for the Rating variable?
Calculate the Variance Inflation Factor (VIF) for all predictor variables.
Quiz question #4: What is the VIF for the Limit variable?
Quiz question #5: What problem does the VIF for Limit suggest that we have with the analysis? (MC)
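The regression and VIF steps can be sketched as follows. The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors; the loop below computes it in base R. The lecture may use a package function instead (car::vif() is a common choice), and the data here are simulated, so none of the printed numbers correspond to the quiz.

```r
# Simulated stand-in for the training dataframe (invented relationships).
set.seed(1)
n <- 200
Income  <- runif(n, 20000, 120000)
Limit   <- 1000 + 0.06 * Income + rnorm(n, sd = 500)
Rating  <- 50 + 0.065 * Limit + rnorm(n, sd = 10)
Age     <- sample(20:80, n, replace = TRUE)
Balance <- -400 + 0.2 * Limit - 0.005 * Income + rnorm(n, sd = 100)
train_df <- data.frame(Balance, Income, Limit, Rating, Age)

# Multiple regression with Balance as the outcome and all other columns as predictors.
fit <- lm(Balance ~ ., data = train_df)
round(coef(fit), 3)  # intercept and slope coefficients

# VIF for each predictor: regress it on the remaining predictors.
predictors <- setdiff(names(train_df), "Balance")
vif_values <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = train_df)
  1 / (1 - summary(aux)$r.squared)
})
round(vif_values, 3)
```

In this simulated example Limit and Rating are built to be nearly collinear, so both receive large VIF values while Age stays close to 1.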
Conduct a new multiple regression analysis using the training dataframe with Balance as the outcome variable and Income, Rating, Age, Education, Student, Gender, and Married as predictor variables.
Quiz question #6: What is the new slope coefficient for the Rating variable?
Create a residual plot and a normal probability plot using the results of the regression analysis in Step (6).
Quiz question #7: What pattern do you see in the residual plot? (MC)
Quiz question #8: What does this pattern tell you? (MC)
Quiz question #9: What pattern do you see in the normal probability plot? (MC)
Quiz question #10: What does this pattern tell you? (MC)
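One way to draw the two diagnostic plots in base R, shown on a simulated single-predictor model. This is a sketch only: the lecture code may produce the plots differently, and the pattern seen here reflects the simulated data, not the credit data.

```r
# Simulated data with a linear relationship and normally distributed errors.
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

res <- resid(fit)  # observed minus fitted values

# Residual plot: residuals against fitted values, with a reference line at zero.
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals", main = "Residual plot")
abline(h = 0, lty = 2)

# Normal probability (Q-Q) plot of the residuals, with a reference line.
qqnorm(res, main = "Normal probability plot")
qqline(res)
```

For the assignment, fit would be the model from Step (6) rather than this toy model.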
Examine the regression output from Step (6).
Quiz question #11: Which predictor variables have statistically significant relationships with the outcome variable, Balance? (MC)
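The p-values in the regression output can also be pulled out programmatically, as sketched below. The 0.05 cutoff is an assumption (the assignment does not state a significance level), and the data are simulated with an invented Noise column that has no built-in relationship to Balance.

```r
# Simulated stand-in: Rating and Income affect Balance, Noise does not.
set.seed(1)
n <- 100
Income  <- runif(n, 20000, 120000)
Rating  <- runif(n, 100, 900)
Noise   <- rnorm(n)
Balance <- 2.5 * Rating - 0.006 * Income + rnorm(n, sd = 100)
train_df <- data.frame(Balance, Income, Rating, Noise)

fit <- lm(Balance ~ ., data = train_df)

# Coefficient table: estimate, standard error, t value, p-value.
coefs <- summary(fit)$coefficients
round(coefs, 4)

# Predictors (intercept dropped) whose p-value falls below the assumed 0.05 cutoff.
p_values <- coefs[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
significant
```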
Conduct a new multiple regression analysis using the training dataframe with Balance as the outcome variable and only the variables with statistically significant relationships with Balance (identified in Step (8)) as predictors.
Quiz question #12: What is the slope coefficient for the Age variable?
Quiz question #13: How would you interpret the slope coefficient for the Rating variable? (MC)
Quiz question #14: How would you interpret the slope coefficient for the Student variable? (MC)
Quiz question #15: What is the adjusted R² for this regression analysis?
Quiz question #16: How can this adjusted R² value be interpreted? (MC)
Quiz question #17: What is the standardized slope coefficient for the Income variable?
Quiz question #18: Looking at the standardized slope coefficients, which variable makes the strongest unique contribution to predicting credit card balance? (MC)
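Standardized slope coefficients can be obtained by rescaling each unstandardized slope by sd(predictor) / sd(outcome), which is equivalent to refitting the model on z-scored variables. The sketch below shows both routes on simulated data; the lecture code may use a helper function instead, and conventions for 0/1 predictors such as Student can differ between tools.

```r
# Simulated stand-in for the training dataframe (invented relationships).
set.seed(1)
n <- 100
Income  <- runif(n, 20000, 120000)
Rating  <- runif(n, 100, 900)
Balance <- 2.5 * Rating - 0.006 * Income + rnorm(n, sd = 100)
train_df <- data.frame(Balance, Income, Rating)

fit <- lm(Balance ~ Income + Rating, data = train_df)

# Route 1: rescale the unstandardized slopes.
b   <- coef(fit)[-1]                   # drop the intercept
sds <- sapply(train_df[names(b)], sd)  # standard deviation of each predictor
beta_std <- b * sds / sd(train_df$Balance)
round(beta_std, 3)

# Route 2: refit on z-scored variables; the slopes match Route 1.
fit_scaled <- lm(Balance ~ Income + Rating, data = as.data.frame(scale(train_df)))
round(coef(fit_scaled)[-1], 3)
```

The predictor with the largest absolute standardized coefficient is the one usually described as making the strongest unique contribution.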
Conduct a final multiple regression analysis using the validation dataframe with Balance as the outcome variable and only the variables with statistically significant relationships with Balance (the same variables as in Step (9)) as predictors.
Quiz question #19: What is the new slope coefficient for the Rating variable?
Using the data contained in the csv file credit_card_prediction.csv, predict the credit card balances for three new cardholders, with 95% prediction intervals.
Quiz question #20: What is the predicted balance for new cardholder #1?
Quiz question #21: What is the 95% prediction interval for the predicted balance for new cardholder #2?
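Predictions with 95% prediction intervals can be produced with predict() and interval = "prediction". In the sketch below, the model and the three new cardholders are simulated stand-ins; in the assignment the new rows would presumably come from read.csv("credit_card_prediction.csv"), and which fitted model to predict from should follow the lecture.

```r
# Simulated stand-in model (invented relationships).
set.seed(1)
n <- 100
Income  <- runif(n, 20000, 120000)
Rating  <- runif(n, 100, 900)
Balance <- 2.5 * Rating - 0.006 * Income + rnorm(n, sd = 100)
train_df <- data.frame(Balance, Income, Rating)
fit <- lm(Balance ~ Income + Rating, data = train_df)

# Made-up new cardholders; column names must match the model's predictors.
new_cardholders <- data.frame(
  Income = c(40000, 75000, 100000),
  Rating = c(300, 500, 700)
)

# fit = predicted balance, lwr/upr = bounds of the 95% prediction interval.
pred <- predict(fit, newdata = new_cardholders,
                interval = "prediction", level = 0.95)
round(pred, 2)
```

interval = "prediction" gives the range for an individual new cardholder, which is wider than the interval = "confidence" range for the mean balance.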
Could you answer these questions?
Q3: What is the slope coefficient for the Rating variable? (Round to 3 decimal places)
Q4: What is the VIF for the Limit variable? (Round to 3 decimal places)