Data Analysis

Open diabetes.arff (attached) in a text editor such as Notepad++ and read the description of its attributes. Once you understand what the attributes represent, open the data set in Weka. Run the following classifiers using their default parameter values and 10-fold cross-validation: J48, Nearest Neighbor, Naïve Bayes, and Artificial Neural Network (ANN). Record the accuracy of each classifier in the table below.

Classifier Accuracy
J48
Nearest Neighbor
Naïve Bayes
ANN
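
For reference, the idea behind 10-fold cross-validation can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the function name k_fold_indices is hypothetical, the instances are not shuffled or stratified, and Weka's own procedure may differ in those details. The point is that every instance lands in the test set exactly once.

```python
def k_fold_indices(n_instances, n_folds=10):
    """Yield (train_indices, test_indices) for each fold.

    Illustrative only: no shuffling and no stratification by class.
    """
    indices = list(range(n_instances))
    base, extra = divmod(n_instances, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional instance each.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    # 25 is an arbitrary instance count chosen for the demonstration.
    for fold_number, (train, test) in enumerate(k_fold_indices(25), start=1):
        print(fold_number, len(train), len(test))
```

Each fold trains on roughly nine tenths of the data and tests on the remaining tenth, so the reported accuracy reflects predictions on instances the model did not see during training.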

Discretize the diastolic blood pressure (pres), BMI (mass), and age attributes using the values shown in the tables below. Save the discretized data in a new ARFF file named diabetes_disc.arff. Include a screenshot of each attribute’s distribution in Weka after discretization, and label each screenshot clearly.

Diastolic Blood Pressure
low: < 90
ideal: 90 to 120
prehigh: > 120 to 140
high: > 140

Body Mass Index (BMI)
underweight: < 18.5
normal: 18.5 to 25
overweight: > 25

Age
young: < 40
middle: 40 to 60
elderly: > 60
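
One way to apply the bins in the three tables above is sketched below in plain Python. The function names and the ARFF_ATTRIBUTES list are hypothetical helpers, not part of Weka. The sketch assumes that both endpoints of a range written as "90 to 120" belong to that bin, and that the nominal labels are listed in the order shown in the tables.

```python
def discretize_pres(value):
    """Map a diastolic blood pressure reading to its bin label."""
    if value < 90:
        return "low"
    if value <= 120:
        return "ideal"
    if value <= 140:
        return "prehigh"
    return "high"


def discretize_mass(value):
    """Map a BMI value to its bin label."""
    if value < 18.5:
        return "underweight"
    if value <= 25:
        return "normal"
    return "overweight"


def discretize_age(value):
    """Map an age in years to its bin label."""
    if value < 40:
        return "young"
    if value <= 60:
        return "middle"
    return "elderly"


# Nominal attribute declarations for the header of diabetes_disc.arff.
# The label order is an assumption of this sketch.
ARFF_ATTRIBUTES = [
    "@attribute pres {low,ideal,prehigh,high}",
    "@attribute mass {underweight,normal,overweight}",
    "@attribute age {young,middle,elderly}",
]

if __name__ == "__main__":
    print(discretize_pres(72), discretize_mass(33.6), discretize_age(50))
```

In the new ARFF file, the numeric declarations for pres, mass, and age would be replaced by nominal declarations like those above, and each data row would carry the label in place of the number.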

Using the discretized data set, rerun J48, Nearest Neighbor, Naïve Bayes, and ANN and record their accuracies in the table below. How did the accuracy of each classifier change relative to the original data set? Did discretization improve performance for these classifiers?

Classifier Accuracy
J48
Nearest Neighbor
Naïve Bayes
ANN

Using the Nearest Neighbor classifier on the continuous data set (diabetes.arff), set the k-value to 3, 5, 7, and 9 in turn and record the resulting accuracy for each in the table below. What happens to the classifier’s accuracy as k increases? Why might this happen?

k-value Accuracy
3
5
7
9
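
To make the role of k concrete, here is a minimal sketch of nearest-neighbor majority voting in plain Python. It is not Weka's implementation, which may differ in distance handling and tie-breaking. The function name knn_predict and the toy training points are hypothetical and are not taken from diabetes.arff.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict a label by majority vote among the k nearest training points.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data (made up): a cluster of "neg" points with one nearby "pos" point,
# plus a separate cluster of "pos" points farther away.
TRAIN = [
    ((1.0, 1.0), "neg"),
    ((1.2, 0.9), "neg"),
    ((0.9, 1.1), "neg"),
    ((1.05, 1.0), "pos"),
    ((3.0, 3.0), "pos"),
    ((3.1, 2.9), "pos"),
    ((2.9, 3.2), "pos"),
]

if __name__ == "__main__":
    query = (1.06, 1.0)
    for k in (1, 3, 5, 7):
        print(k, knn_predict(TRAIN, query, k))
```

Running the loop shows that the predicted label for the same query point can change as k changes, because a different set of neighbors takes part in the vote.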
