Using RapidMiner/Radoop for Big Data Preparation for Analytics and Basic Data Mining

In this Assignment, you will prepare data for and improve the class recall performance of a logistic regression model.

Scenario
You enjoyed your work as a data scientist for the German bank so much that you have decided to establish your own data science consulting business.

As a data science consultant for another big bank, you will prepare data for and improve the class recall performance of a logistic regression model. Currently, the class recall performance of the model is 71.43% (positive class: Y). The bank would like you to increase this class recall performance to 80%.
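
Class recall for the positive class is the share of actual defaults that the model flags: true positives divided by the sum of true positives and false negatives. The sketch below shows that arithmetic in Python rather than in RapidMiner. The counts 5 and 2 are hypothetical, chosen only because 5 / 7 reproduces the stated 71.43%; the bank's actual confusion matrix may contain different counts.

```python
def class_recall(true_positives: int, false_negatives: int) -> float:
    """Recall for one class: correctly flagged cases / all actual cases of that class."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("Recall is undefined when the class has no actual examples.")
    return true_positives / actual_positives


# Hypothetical counts for the positive class Y (loans that defaulted).
baseline = class_recall(true_positives=5, false_negatives=2)
print(f"Baseline class recall: {baseline:.2%}")

# The bank's target, expressed as a fraction.
TARGET_RECALL = 0.80
print(f"Meets target: {baseline >= TARGET_RECALL}")
```

With these assumed counts the baseline falls short of the 80% target, so at least one more of the actual defaults would have to be predicted as Y.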

The bank uses the model to mitigate the risks of its loan defaults (loans that are not paid in full). The bank uses the logistic regression model to predict which of its current loans are likely to default and then uses this information to focus its attention on rescuing these loans before they default.

The data set for this logistic regression model is the “University of Toronto Credit Scoring Data.” The data set represents archived data of old bank loans and whether these loans defaulted or not. There are two predictor attributes and one label in the data set:

  1. The age of the loan (BUSAGE)
  2. The number of days the loan has been delinquent (DAYSDELQ)
  3. The label of whether the loan defaulted (Y) or not (N) (DEFAULT)

The bank has also developed a RapidMiner process that uses the “University of Toronto Credit Scoring Data” to generate and evaluate the logistic regression prediction model. The bank will provide both the data set and its current RapidMiner process to you for this assessment.
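
Before modeling, it can help to confirm that the three attributes described above parse cleanly. The following is a minimal Python sketch of such a check, not part of the bank's RapidMiner process. It assumes the CSV has a header row with the column names BUSAGE, DAYSDELQ, and DEFAULT; the rows in SAMPLE_CSV are made up for illustration, and load_loans is a hypothetical helper.

```python
import csv
import io

# Made-up rows for illustration only; the real file is
# 05_Regression_5.2_logreg_credit_scores.csv and may be laid out differently.
SAMPLE_CSV = """BUSAGE,DAYSDELQ,DEFAULT
87,2,N
12,48,Y
,5,N
130,0,N
24,31,Y
"""


def load_loans(handle):
    """Parse loan rows, skipping any row with a missing or malformed value."""
    rows, skipped = [], 0
    for record in csv.DictReader(handle):
        label = (record.get("DEFAULT") or "").strip().upper()
        try:
            busage = float(record["BUSAGE"])
            daysdelq = float(record["DAYSDELQ"])
        except (TypeError, ValueError):
            skipped += 1  # a predictor is empty or not numeric
            continue
        if label not in {"Y", "N"}:
            skipped += 1  # the label is missing or unexpected
            continue
        rows.append({"BUSAGE": busage, "DAYSDELQ": daysdelq, "DEFAULT": label})
    return rows, skipped


loans, skipped_rows = load_loans(io.StringIO(SAMPLE_CSV))
defaults = sum(1 for loan in loans if loan["DEFAULT"] == "Y")
print(f"usable rows: {len(loans)}, skipped: {skipped_rows}, defaults: {defaults}")
```

To run the same check on the downloaded file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")`.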

To prepare:
• Review the lessons and Learning Resources for this week.
• Download the Wk5_DataAndProcess.zip file (attached to order) from the Learning Resources for this week. This file contains the bank’s data set and RapidMiner process as follows:
o A CSV (comma-separated-value) file (05_Regression_5.2_logreg_credit_scores.csv) of the “University of Toronto Credit Scoring Data” data set.
o A RapidMiner process file (05_Regression_5.2_LogReg_2attr_cred_scoring.rmp) that generates the bank’s logistic regression model.

Download the submission template (Solution Submission Template, attached to order) from the Learning Resources for this week. You will use it to submit your response to this Assignment.

To complete this Assignment:

  1. Unzip the Wk5_DataAndProcess.zip file in a directory of your choice to retrieve the following files:
    • 05_Regression_5.2_logreg_credit_scores.csv
    • 05_Regression_5.2_LogReg_2attr_cred_scoring.rmp
  2. Create a RapidMiner repository called W5 to hold your work for this Assignment. Create two subfolders (data and processes) under the W5 repository.
  3. Import both the data set and the RapidMiner process into their respective subfolders of your W5 repository. Your Design view should look like this:
  4. Prepare the data set and run the 05_Regression_5.2_LogReg_2attr_cred_scoring RapidMiner process. Examine the resulting confusion matrix in the Results view and notice that the class recall performance of the model is 71.43% (positive class: Y).
  5. Modify the 05_Regression_5.2_LogReg_2attr_cred_scoring RapidMiner process and its operators to increase the class recall performance to 80%.
  6. Take screenshots of the results of your work.
  7. Explain your work, interpret your results, and reflect on your experience:
    • Explain how you prepared the data set and how you modified the RapidMiner process to increase the class recall performance to 80%.
    • Interpret the results of your prediction model (the class recall performance metric and the reason for wanting to increase it).
    • Reflect on your experience with this task and lessons learned.
  8. Prepare your deliverables as follows:
    • Prepare all the required screenshots.
    • Prepare all your explanations, interpretations, and reflections.
    • Zip your entire W5 repository.
    • Complete all four sections (copy of repository and screenshots, explanation of work, interpretation of results, and reflection) of the provided submission template file (Solution Submission Template), as directed in the submission template.
    • Submit the completed submission template.
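
One common way to raise recall for the positive class is to lower the confidence threshold at which a loan is predicted as Y. This is not necessarily the approach the Assignment expects. The Python sketch below illustrates the idea outside RapidMiner. The confidence values and labels in SCORED_LOANS are hypothetical, chosen so that the default threshold of 0.5 gives the 71.43% baseline; both function names are illustrative.

```python
# Hypothetical (confidence that DEFAULT = Y, actual label) pairs; not real bank data.
SCORED_LOANS = [
    (0.93, "Y"), (0.88, "Y"), (0.74, "Y"), (0.66, "Y"),
    (0.58, "N"), (0.52, "Y"), (0.49, "N"), (0.46, "Y"),
    (0.35, "Y"), (0.21, "N"), (0.15, "N"), (0.08, "N"),
]


def recall_and_precision(scored, threshold, positive="Y"):
    """Predict the positive class when confidence >= threshold, then score it."""
    tp = fp = fn = 0
    for confidence, actual in scored:
        predicted_positive = confidence >= threshold
        if predicted_positive and actual == positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif actual == positive:
            fn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def highest_threshold_for_recall(scored, target_recall):
    """Return the largest observed confidence that reaches the target recall, or None."""
    for threshold in sorted({confidence for confidence, _ in scored}, reverse=True):
        recall, _ = recall_and_precision(scored, threshold)
        if recall >= target_recall:
            return threshold
    return None


for threshold in (0.5, highest_threshold_for_recall(SCORED_LOANS, 0.80)):
    recall, precision = recall_and_precision(SCORED_LOANS, threshold)
    print(f"threshold={threshold:.2f} recall={recall:.2%} precision={precision:.2%}")
```

On this made-up data, moving the threshold from 0.50 to 0.46 lifts recall from 71.43% to 85.71% while precision falls from 83.33% to 75.00%. That trade-off is worth addressing when interpreting results: the bank catches more loans that will default, at the cost of reviewing more loans that would not have.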

Sources to be used for citation and reference:

Abbott, D. (2014j). Predictive modeling: Linear regression. In Applied predictive analytics: Principles and techniques for the professional data analyst (pp. 271–280). New York, NY: John Wiley & Sons.

Abbott, D. (2014k). Predictive modeling: Logistic regression. In Applied predictive analytics: Principles and techniques for the professional data analyst (pp. 230–239). New York, NY: John Wiley & Sons.

Kotu, V., & Deshpande, B. (2015q). Regression methods. In Predictive analytics and data mining: Concepts and practice with RapidMiner (pp. 165–192). Amsterdam: Morgan Kaufmann.

RapidMiner. (2019m). Linear regression (RapidMiner Studio core). Retrieved from https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/functions/linear_regression.html

RapidMiner. (2019p). Performance (RapidMiner Studio core). Retrieved from https://docs.rapidminer.com/studio/operators/validation/performance/performance.html

Grading Rubric:

Element 1: Execution of Steps--
Mastery 48 (60%) points
Student demonstrates accurate and successful execution of all steps in Using RapidMiner/Radoop for Big Data Preparation for Analytics and Basic Data Mining by providing all his/her deliverables (template and screenshots of the tasks you perform). There are no errors.
Exceptional 44.64 (55.8%) points
Student demonstrates accurate and successful execution of most steps in Using RapidMiner/Radoop for Big Data Preparation for Analytics and Basic Data Mining by providing all his/her deliverables (template and screenshots of the tasks you perform); however, one or two minor steps are not documented but are evident in the execution of subsequent steps.
Competent 40.8 (51%) points
Student provides some screenshots of the outputs of the steps in Using RapidMiner/Radoop for Big Data Preparation for Analytics and Basic Data Mining by providing his/her deliverables (template and screenshots of the tasks you perform); however, several steps are not documented and/or one step is executed incorrectly.
Developing 36 (45%) points
Student provides some screenshots of the outputs of the steps in Using RapidMiner/Radoop for Big Data Preparation for Analytics and Basic Data Mining by providing his/her deliverables (template and screenshots of the tasks you perform); however, more than one step is executed incorrectly and/or there is little evidence supporting successful completion of activities.
Unacceptable 24 (30%) points
Student provides documentation for an incomplete or cursory attempt that does not directly address this element and/or meet minimal requirements.
Not submitted 0 (0%) points
Student did not submit this element.
Element 2: Summary of Experiences--
Mastery 14 (17.5%) points
Student provides a thorough and detailed summary of his/her experiences Using RapidMiner/Radoop for Big Data Preparation for Analytics and Basic Data Mining. There are no details missing.
Exceptional 13.02 (16.28%) points
Student provides a detailed summary of his/her experiences Using RapidMiner/Radoop for Big Data Preparation for Analytics and Basic Data Mining. There are one or two minor details missing.
Competent 11.9 (14.88%) points
Student provides a summary of his/her experiences in Using RapidMiner/Radoop for Big Data Preparation for Analytics and Basic Data Mining. There are some details missing.
Developing 10.5 (13.12%) points
Student provides a cursory or incomplete summary of his/her experiences Using RapidMiner/Radoop for Big Data Preparation for Analytics and Basic Data Mining. There are many details missing.
Unacceptable 7 (8.75%) points
Student provides an incomplete or cursory summary of his/her experiences that does not directly address this element and/or meet minimal requirements.
Not submitted 0 (0%) points
Student did not submit this element.
Element 3: Content and Technical Knowledge--
Mastery 12 (15%) points
Student demonstrates mastery of content knowledge by referencing or building upon the text when appropriate and using topic-appropriate language and terminology. Technical language and elements (including—but not limited to—program code) are well written and communicated accurately. There are no errors.
Exceptional 11.16 (13.95%) points
Student demonstrates mastery of content knowledge by referencing or building upon the text when appropriate and using topic-appropriate language and terminology. Technical language and elements (including—but not limited to—program code) are well written and communicated accurately. There are one or two minor errors.
Competent 10.2 (12.75%) points
Student demonstrates some application of content knowledge by referencing or building upon the text when appropriate and uses topic-appropriate language and terminology. Technical language and elements (including—but not limited to—program code) are well written and communicated accurately. There are some errors.
Developing 9 (11.25%) points
Student demonstrates minimal application of content knowledge by not referencing or building upon the text when appropriate, and/or does not use topic-appropriate language and terminology. Technical language and elements (including—but not limited to—program code) are cursory or incomplete. There are numerous errors.
Unacceptable 6 (7.5%) points
Student provides an incomplete or cursory description that does not directly address this element and/or meet minimal requirements.
Not submitted 0 (0%) points
Student did not submit this element.
Element 4: Organization and Writing/Form/Style--
Mastery 6 (7.5%) points
Student demonstrates thorough organization and writing skills by consistently applying APA format and style. Writing is well organized and grammatically correct, including complete sentences that are free of spelling errors. A Reference List with a variety of scholarly resources is provided, using APA formatting, and it matches the citations cited within the text.
Exceptional 5.58 (6.98%) points
Student demonstrates thorough organization and writing skills by consistently applying APA format and style. Writing is well organized and grammatically correct, including complete sentences that are free of spelling errors. A Reference List with a variety of scholarly sources is provided, using APA formatting, and it matches the citations cited within the text—but with one or two minor errors.
Competent 5.1 (6.38%) points
Student demonstrates organization and writing skills by mostly applying APA format and style. Writing is well organized and mostly grammatically correct, including complete sentences that are mostly free of spelling errors. While a Reference List is provided and includes a variety of resources, APA formatting may be incorrect, or the list may not have matched the citations cited within the text.
Developing 4.5 (5.62%) points
Student made a cursory attempt to address this element, but there are numerous errors, writing is difficult to read, and/or no Reference List is provided.
Unacceptable 3 (3.75%) points
Student submission does not adhere to the writing expectations.
Not submitted 0 (0%) points
Student did not submit this element.
