Classification training data

Q1. (20 points) Suppose you are given the following multi-class classification training data,
where each input example has three features and the output label takes a value from {good,
bad, ugly}.
• x1=(0, 1, 0) and y1=good
• x2=(1, 0, 1) and y2=bad
• x3=(1, 1, 1) and y3=ugly
• x4=(1, 0, 0) and y4=bad
• x5=(0, 0, 1) and y5=good
Suppose we want to learn a linear classifier using the multi-class perceptron algorithm,
starting from the following weights: wgood=(0,0,0); wbad=(0,0,0); and wugly=(0,0,0). Please
do hand calculations to show how the weights change after processing the examples in the
given order (i.e., one single pass over the five training examples). See slide 88 of the
Perceptron notes.
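
A minimal sketch (not part of the assignment) for checking the Q1 hand calculations. It
assumes the standard mistake-driven multi-class perceptron update with learning rate 1
(predict the label with the highest score; on a mistake, add the input to the true class's
weights and subtract it from the predicted class's weights), with argmax ties broken in
favor of the first label in the order good, bad, ugly:

import numpy as np

labels = ["good", "bad", "ugly"]
X = np.array([[0, 1, 0], [1, 0, 1], [1, 1, 1], [1, 0, 0], [0, 0, 1]])
y = ["good", "bad", "ugly", "bad", "good"]

# All class weight vectors start at zero, as stated in the problem.
w = {c: np.zeros(3) for c in labels}

for x_i, y_i in zip(X, y):
    scores = [w[c] @ x_i for c in labels]
    y_hat = labels[int(np.argmax(scores))]    # predicted label (ties -> first label)
    if y_hat != y_i:                          # mistake-driven update
        w[y_i] += x_i                         # promote the true class
        w[y_hat] -= x_i                       # demote the predicted class
    print({c: w[c].tolist() for c in labels})

Printing after every example makes it easy to compare against each step of the hand
calculation.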
Q2. (20 points) Suppose you are given the following binary classification training data,
where each input example has three features and the output label takes a value of good or bad.
• x1=(0, 1, 0) and y1=good
• x2=(1, 0, 1) and y2=bad
• x3=(1, 1, 1) and y3=good
• x4=(1, 0, 0) and y4=bad
• x5=(0, 0, 1) and y5=good
Suppose we want to learn a classifier using the kernelized perceptron algorithm. Start from
the following dual weights: α1=0; α2=0; α3=0; α4=0; and α5=0. Please do hand calculations to
show how the dual weights change after processing the examples in the given order (i.e., one
single pass over the five training examples). Do this separately for the following kernels: (a) Linear
kernel: K(x, x′) = x · x′; and (b) Polynomial kernel of degree 3: K(x, x′) = (x · x′ + 1)³,
where x · x′ denotes the dot product between the two inputs x and x′. See Algorithm 30 in
http://ciml.info/dl/v0_99/ciml-v0_99-ch11.pdf. You can ignore the bias term b.
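
A minimal sketch for checking the Q2 hand calculations, following the kernelized perceptron
update of CIML Algorithm 30 (activation a = Σm αm K(xm, xn); on a mistake, αn ← αn + yn) with
the bias b ignored. It assumes good is encoded as +1 and bad as −1, and that an activation of
exactly 0 counts as a mistake (yn · a ≤ 0):

import numpy as np

X = np.array([[0, 1, 0], [1, 0, 1], [1, 1, 1], [1, 0, 0], [0, 0, 1]])
y = np.array([+1, -1, +1, -1, +1])            # good = +1, bad = -1

def linear_kernel(u, v):
    return u @ v

def poly3_kernel(u, v):
    return (u @ v + 1) ** 3

for kernel in (linear_kernel, poly3_kernel):
    alpha = np.zeros(len(X))                  # dual weights alpha_1 .. alpha_5, all zero
    for n in range(len(X)):
        # activation on example n: sum_m alpha_m * K(x_m, x_n)
        a = sum(alpha[m] * kernel(X[m], X[n]) for m in range(len(X)))
        if y[n] * a <= 0:                     # mistake-driven update
            alpha[n] += y[n]
    print(kernel.__name__, alpha.tolist())

Running both kernels in the same loop makes it easy to see how the dual weights differ
between parts (a) and (b).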
Q3. (20 points) Let x = (x1, x2, . . . , xd) and z = (z1, z2, . . . , zd) be any two points in
a high-dimensional space (i.e., d is very large). Suppose you are given the following property,
where the right-hand side quantity is the squared standard Euclidean distance between x and z.

( (1/d) Σ_{i=1}^{d} x_i − (1/d) Σ_{i=1}^{d} z_i )² ≤ Σ_{i=1}^{d} (x_i − z_i)²   (1)
We know that the computation of nearest neighbors is very expensive in a high-dimensional
space. Discuss how we can use the above property to make the nearest-neighbor computation
more efficient.
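
A minimal sketch of one way the property can be exploited, assuming we work with the squared
Euclidean distance as in (1): precompute the mean of each stored point once, use the O(1)
left-hand side of (1) as a lower bound on the squared distance, and skip the full O(d)
distance computation for any candidate whose lower bound already exceeds the best distance
found so far. The function name nearest_neighbor and the random data are illustrative only:

import numpy as np

def nearest_neighbor(query, data):
    q_mean = query.mean()
    means = data.mean(axis=1)                    # precomputed once per dataset
    best_idx, best_dist = -1, np.inf
    for i, x in enumerate(data):
        lower_bound = (q_mean - means[i]) ** 2   # left-hand side of (1), O(1) per candidate
        if lower_bound >= best_dist:
            continue                             # prune: cannot beat the current best
        dist = np.sum((query - x) ** 2)          # full O(d) squared Euclidean distance
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 500))
print(nearest_neighbor(rng.normal(size=500), data))

The pruning is exact: because (1) is a lower bound, no candidate that could be the true
nearest neighbor is ever skipped; only candidates that provably cannot win are pruned.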
Q4. (20 points) We know that we can convert any decision tree into a set of if-then rules,
where there is one rule per leaf node. Suppose you are given a set of rules R = {r1, r2, . . . , rk},
where ri corresponds to the i-th rule. Is it possible to convert the rule set R into an equivalent
decision tree? Explain your construction or give a counterexample.
Q5. (20 points) Please read the following two papers and write a brief summary of the
main points in at most THREE pages.
D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner,
Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison: Hidden Technical
Debt in Machine Learning Systems. NIPS 2015: 2503-2511.
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley: The ML test score: A
rubric for ML production readiness and technical debt reduction. BigData 2017: 1123-1132.
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46555.pdf

Sample Solution