Using the concepts you have learned thus far in the course, you will design a machine learning method which will be able to identify a flower based on four characteristics:

Sepal length
Sepal width
Petal length
Petal width
Your program will differentiate between three types of iris flowers:

Iris-setosa
Iris-versicolor
Iris-virginica
You will design 9 different functions:

readData (read data from data files)
display (display the loaded data)
mean (calculate the average across an array of values)
stddev (calculate the standard deviation across an array of values)
stats (display mean and standard deviation of each characteristic)
distance (how similar two flowers are based on euclidean distance)
nearestNeighbor (find the flower most similar to another)
accuracy (calculate how accurate your machine learning method is)
main (the main function)
These functions are discussed in more detail in the following sections. Each function will be individually tested. That said, functions can and should make use of each other. So, for example, the stddev function would call mean as a part of calculating the standard deviation.

Read Data
You have been given two files: train.data and test.data. These two file names will be passed as command line arguments to your program. Example:

./a.out train.data test.data
If the number of arguments does not match what is expected, print:

Usage: project4 TRAIN_FILE TEST_FILE OPTION
and exit the program (return 1).

Now we’ll focus on train.data. Your first task will be to read the data in this file. There are 120 lines of entries. The first four columns correspond to the four characteristics mentioned in the overview. The fifth column is the flower type these characteristics describe, called the ‘label’.

You will design a function to read the data in this file into five arrays. The first four arrays are for the four characteristics, and the fifth array stores the flower type. The function definition should be:

int readData(char filename[], double sepal_lengths[], double sepal_widths[], double petal_lengths[], double petal_widths[], int labels[], int length)
You will notice the labels array is an array of integers. This is because in machine learning, it’s common to number each label, as numbers are easier to work with than strings. When reading the file, store Iris-setosa as 0, Iris-versicolor as 1, and Iris-virginica as 2.

The number of lines to read into the arrays is passed as the final parameter. For train.data, for example, it would be 120.

If the file does not exist, return 1; otherwise, return 0.
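The reading step described above can be sketched as follows. This is only one possible approach, and it rests on an assumption the assignment does not state: that each row of the data file is comma-separated, for example 5.1,3.5,1.4,0.2,Iris-setosa. If your files use a different separator, the fscanf format string would need to change.

```c
#include <stdio.h>
#include <string.h>

/* Reads up to `length` rows from `filename` into five parallel arrays.
 * Returns 1 if the file cannot be opened, 0 otherwise.
 * ASSUMPTION: rows look like "5.1,3.5,1.4,0.2,Iris-setosa". */
int readData(char filename[], double sepal_lengths[], double sepal_widths[],
             double petal_lengths[], double petal_widths[], int labels[], int length)
{
    FILE *fp = fopen(filename, "r");
    if (fp == NULL) {
        return 1;
    }

    char name[64];
    for (int i = 0; i < length; i++) {
        int fields = fscanf(fp, " %lf,%lf,%lf,%lf,%63s",
                            &sepal_lengths[i], &sepal_widths[i],
                            &petal_lengths[i], &petal_widths[i], name);
        if (fields != 5) {
            break; /* stop early on a short or malformed file */
        }

        /* Map the flower name to its numeric label. */
        if (strcmp(name, "Iris-setosa") == 0) {
            labels[i] = 0;
        } else if (strcmp(name, "Iris-versicolor") == 0) {
            labels[i] = 1;
        } else {
            labels[i] = 2; /* assumed to be Iris-virginica */
        }
    }

    fclose(fp);
    return 0;
}
```

Treating every unrecognized name as label 2 is a simplification made for this sketch; a stricter version could report unknown names instead.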

In your main function, you should read the data for both files before doing anything else (read training data before testing data). If either call to readData returns 1, immediately print the following error and exit main (return 1):

Unable to open file FILENAME
where FILENAME is the filename passed to the function.

Examples:

./a.out not_a_file.txt another_fake_file.txt
Unable to open file not_a_file.txt
./a.out train.data another_fake_file.txt
Unable to open file another_fake_file.txt
Display Data
To ensure the data was loaded properly, you will design a function to print out all the stored values. The display function will iterate over each flower and print its sepal length, sepal width, petal length, petal width, and label. Each flower is formatted as:

(sepal length, sepal width, petal length, petal width) => label
The function definition should be:

void display(double sepal_lengths[], double sepal_widths[], double petal_lengths[], double petal_widths[], int labels[], int length)
where the last parameter, length, is how many flowers there are (length of the arrays).

Example (first three lines when calling display on the train.data data)

(5.100000, 3.500000, 1.400000, 0.200000) => 0
(4.900000, 3.000000, 1.400000, 0.200000) => 0
(4.700000, 3.200000, 1.300000, 0.200000) => 0
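A minimal sketch of display that produces the format shown above. The void return type is an assumption, since the function prints and has nothing obvious to return. The default %lf precision gives the six decimal places seen in the example output.

```c
#include <stdio.h>

/* Prints one line per flower: (sl, sw, pl, pw) => label */
void display(double sepal_lengths[], double sepal_widths[],
             double petal_lengths[], double petal_widths[],
             int labels[], int length)
{
    for (int i = 0; i < length; i++) {
        printf("(%lf, %lf, %lf, %lf) => %d\n",
               sepal_lengths[i], sepal_widths[i],
               petal_lengths[i], petal_widths[i], labels[i]);
    }
}
```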
Statistics
When working on a machine learning project, it’s always important for the data scientist to become familiar with their data. One way to do this is to look at the statistics of your dataset. In this case, we will be interested in the mean and standard deviation of each characteristic for each flower type.

MEAN
double mean(double values[], int labels[], int filter, int length)
The mean method will take an array of values and an array of labels. However, we want to know the mean for a specific flower type. The desired flower type will be passed as filter. Finally, length describes the length of the values and labels arrays.

For example, let’s say you want to know the mean petal length given the following values:

petal lengths: [4, 10, 5] and labels: [0, 0, 1]

If you called mean(petal_lengths, labels, 0, 3), the expected result would be 7.0.
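The filtered mean from the example above might be written like this. Returning 0.0 when no entry matches the filter is an assumption made for this sketch; the assignment does not say what should happen in that case.

```c
/* Mean of the entries in `values` whose label equals `filter`. */
double mean(double values[], int labels[], int filter, int length)
{
    double sum = 0.0;
    int count = 0;

    for (int i = 0; i < length; i++) {
        if (labels[i] == filter) {
            sum += values[i];
            count++;
        }
    }

    if (count == 0) {
        return 0.0; /* assumption: no matching flowers yields 0.0 */
    }
    return sum / count;
}
```

With petal lengths [4, 10, 5] and labels [0, 0, 1], only 4 and 10 match filter 0, so the result is (4 + 10) / 2 = 7.0.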

STANDARD DEVIATION
This method will be just like the mean method above, except instead of returning the mean, it will return the standard deviation:

double stddev(double values[], int labels[], int filter, int length)
To compute this, you will need the sqrt method in math.h. When compiling, you will need to link the math library by passing -lm:

gcc project4.c -lm
STATS
About 95% of the data will fall within 2 standard deviations of the mean (assuming a normal distribution). As such, to get an idea of how much the characteristics of the different flowers overlap, we will display the mean +/- 2 * standard_deviation for each characteristic of each flower.

Write a function to print these values:

void stats(double sepal_lengths[], double sepal_widths[], double petal_lengths[], double petal_widths[], int labels[], int length)
Expected output is shown below (values should be calculated and not hard coded, as the data used to test your program will be different):

| Sepal length | Sepal width | Petal length | Petal width
0 | 5.04 +/- 0.72 | 3.44 +/- 0.72 | 1.46 +/- 0.34 | 0.23 +/- 0.20
1 | 6.01 +/- 1.03 | 2.78 +/- 0.66 | 4.32 +/- 0.89 | 1.35 +/- 0.41
2 | 6.62 +/- 1.35 | 2.96 +/- 0.66 | 5.61 +/- 1.16 | 1.99 +/- 0.54
Each line is a flower, displayed in order. Each column is a characteristic. Use %.2lf to show each value to 2 decimal places. You can copy and paste the first line with the headers and use that verbatim. Each mean and 2 * standard deviation will be less than 10, so the spacing will stay the same.

Nearest Neighbor
A simple introductory machine learning method is nearest neighbor. When the characteristics of an unknown flower are passed, it loops over all known flowers (train.data) and finds the most similar known flower. It then assumes this unknown flower will be the same type and classifies the unknown flower as such.

To find how similar two flowers are, we will be using the distance formula to find how “far apart” they are.

Just as the distance formula in two dimensions is sqrt((x1-x2)^2 + (y1-y2)^2), the distance formula in four dimensions is sqrt((a1-a2)^2 + (b1-b2)^2 + (c1-c2)^2 + (d1-d2)^2), where a, b, c, and d represent the four characteristics of the flowers.

As an example, let’s calculate the distance between these two flowers given their four characteristics:

1: [5, 4, 3, 2]
2: [8, 9, 3, 7]

distance = sqrt((5 - 8)^2 + (4 - 9)^2 + (3 - 3)^2 + (2 - 7)^2)

which equals 7.68
Write a function to calculate this distance:

double distance(double a1, double b1, double c1, double d1, double a2, double b2, double c2, double d2)
With your distance function, you will be able to implement the nearest neighbor algorithm. It will accept the arrays containing the characteristics of the flowers in train.data and then the four characteristics of the unknown flower.

The method will find the known flower with the minimum distance to the unknown flower. It will then return the label of the known flower as the prediction for the unknown flower.

int nearestNeighbor(
double sepal_lengths[], double sepal_widths[], double petal_lengths[], double petal_widths[], int labels[], int length,
double sepal_length, double sepal_width, double petal_length, double petal_width
)
If there is a tie, the entry that comes first in the list should win the tie.

As a note, this data, like a lot of data in machine learning, is messy. It doesn’t follow clean, well-separated bell curves. So, if you enter a value near the mean for a flower, but it classifies as something else, don’t worry. That just means the nearest neighbor was a different flower than expected.

Testing Accuracy
Now that you have the nearestNeighbor method, which can predict the type of a flower, you can test how accurate it is. Read in test.data and use nearestNeighbor to classify each flower looking only at its characteristics.

You should design a method called accuracy which accepts the arrays of data for known flowers and arrays of data for the testing flowers. You will use the labels included in test.data to determine whether nearestNeighbor made the correct prediction. accuracy should return the fraction of correct predictions as a decimal between 0 and 1.

Hint: the value should be about 0.7.

double accuracy(
double sepal_lengths[], double sepal_widths[], double petal_lengths[], double petal_widths[], int labels[], int length,
double sepal_lengths_test[], double sepal_widths_test[], double petal_lengths_test[], double petal_widths_test[], int labels_test[], int length_test
);
Main method
Your main method should accept two files passed as arguments: train.data and test.data. The final argument should be a string, indicating the option:

display
stats
accuracy
classify
If an option not in this list is passed, your program should print:

Unknown option
and exit (return 1).
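Checking the option string can be kept separate from the rest of main. The helper below, parseOption, is hypothetical and not one of the nine required functions; it only illustrates comparing the argument against the four known options with strcmp.

```c
#include <string.h>

/* Hypothetical helper: returns 0..3 for display, stats, accuracy,
 * classify (in that order), or -1 for an unknown option. */
int parseOption(const char *option)
{
    const char *known[] = { "display", "stats", "accuracy", "classify" };

    for (int i = 0; i < 4; i++) {
        if (strcmp(option, known[i]) == 0) {
            return i;
        }
    }
    return -1;
}
```

In main, a result of -1 would lead to printing Unknown option and returning 1.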

display should call the display function on the loaded data:

Example 1 (the output is truncated; you would see more than three values under each heading):

./a.out train.data test.data display
Training data:
(5.100000, 3.500000, 1.400000, 0.200000) => 0
(4.900000, 3.000000, 1.400000, 0.200000) => 0
(4.700000, 3.200000, 1.300000, 0.200000) => 0

Testing data:
(5.000000, 3.500000, 1.300000, 0.300000) => 0
(4.500000, 2.300000, 1.300000, 0.300000) => 0
(4.400000, 3.200000, 1.300000, 0.200000) => 0
Example 2

./a.out train.data test.data stats
| Sepal length | Sepal width | Petal length | Petal width
0 | 5.04 +/- 0.72 | 3.44 +/- 0.72 | 1.46 +/- 0.34 | 0.23 +/- 0.20
1 | 6.01 +/- 1.03 | 2.78 +/- 0.66 | 4.32 +/- 0.89 | 1.35 +/- 0.41
2 | 6.62 +/- 1.35 | 2.96 +/- 0.66 | 5.61 +/- 1.16 | 1.99 +/- 0.54
Example 3

./a.out train.data test.data accuracy
Test accuracy: 1.00
Example 4

(examples 4 through 6 have the user entering flower characteristics as input)

./a.out train.data test.data classify
Flower characteristics: 5 3 1.5 0.2
Prediction: 0 (Iris-setosa)
Example 5

./a.out train.data test.data classify
Flower characteristics: 6 3 5 1.5
Prediction: 1 (Iris-versicolor)
Example 6

./a.out train.data test.data classify
Flower characteristics: 7 3 6 2.5
Prediction: 2 (Iris-virginica)
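The classify output pairs the numeric label with the flower name. The helpers below, labelName and printPrediction, are hypothetical names used only for illustration; they show one way to turn a label returned by nearestNeighbor into the line format used in examples 4 through 6.

```c
#include <stdio.h>

/* Hypothetical helper: maps a numeric label back to its flower name. */
const char *labelName(int label)
{
    switch (label) {
    case 0:  return "Iris-setosa";
    case 1:  return "Iris-versicolor";
    case 2:  return "Iris-virginica";
    default: return "unknown";
    }
}

/* Hypothetical helper: prints e.g. "Prediction: 0 (Iris-setosa)". */
void printPrediction(int label)
{
    printf("Prediction: %d (%s)\n", label, labelName(label));
}
```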
