Machine Learning (Unsupervised Learning, Tree Based Methods, Support Vector
Machines, Classification, Linear and non-linear Regression, and resampling methods)
The following is an example of the coursework, which is expected to be delivered within 12 hours. This
coursework contains four questions. Answer ALL FOUR. All questions will be given equal weight (25%).
Time allowed: expected writing time is 2 hours (you will have 12 hours to answer).
(a) Suppose that $y_i \sim N(\mu, 1)$ for $i = 1, \dots, n$ and that the $y_i$'s are independent.
i. Show that the sample mean estimator $\hat{\mu}_1 = \frac{1}{n} \sum_{i=1}^{n} y_i$ is obtained from
minimising the least squares criterion
$\hat{\mu}_1 = \arg\min_{\mu} \sum_{i=1}^{n} (y_i - \mu)^2$,
and that $\hat{\mu}_1$ is an unbiased estimator of $\mu$. Also find the variance of $\hat{\mu}_1$. [7 marks]
ii. Consider adding a penalty term to the least squares criterion, and therefore using the estimator
$\hat{\mu}_2 = \arg\min_{\mu} \left\{ \sum_{i=1}^{n} (y_i - \mu)^2 + \lambda \mu^2 \right\}$
for the mean, where $\lambda$ is a non-negative tuning parameter. Derive $\hat{\mu}_2$,
find its bias and show that its variance is lower than that of $\hat{\mu}_1$.
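The behaviour of the two estimators can be checked numerically. The sketch below is only an illustration, not a derivation: it assumes the penalised criterion is minimised at sum(y_i) / (n + lambda), and the true mean, sample size, penalty and number of replications are hypothetical values chosen for the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
mu, n, lam = 2.0, 20, 5.0        # hypothetical true mean, sample size, penalty
reps = 10_000                    # number of simulated data sets (hypothetical)

# Each row is one sample y_1, ..., y_n drawn from N(mu, 1).
y = rng.normal(loc=mu, scale=1.0, size=(reps, n))

mu_hat_1 = y.mean(axis=1)              # least squares estimator (sample mean)
mu_hat_2 = y.sum(axis=1) / (n + lam)   # penalised estimator, assumed closed form

# First-order condition of the penalised criterion, evaluated at mu_hat_2
# for the first simulated sample:
# d/dmu [sum (y_i - mu)^2 + lam * mu^2] = -2 * sum(y_i - mu) + 2 * lam * mu
gradient = -2.0 * (y[0] - mu_hat_2[0]).sum() + 2.0 * lam * mu_hat_2[0]

print("gradient at mu_hat_2:", gradient)
print("empirical bias:      ", mu_hat_1.mean() - mu, mu_hat_2.mean() - mu)
print("empirical variance:  ", mu_hat_1.var(), mu_hat_2.var())
```

A gradient that is zero up to rounding error supports the assumed closed form, and the printed bias and variance show the trade-off introduced by the penalty: the second estimator is shrunk towards zero but varies less across samples.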
(b) Consider the multiple linear regression model $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + e_i$, $i = 1, \dots, n$,
where $\beta = (\beta_1, \dots, \beta_p)^T$ and the error term $e = (e_1, \dots, e_n)^T \sim N(0, \sigma^2 I_n)$.
i. When p is comparable to n, multicollinearity becomes an issue. Describe the effects of multicollinearity on
the estimated coefficients, the associated standard errors and the significance of the coefficients when using the
ordinary maximum likelihood method.
ii. The ridge regression estimate of $\beta$ can be obtained by minimising a particular expression with respect to $\beta$.
Write down this expression as well as an alternative formulation of it.
iii. Explain why ridge regression can potentially correct the problems of multicollinearity. [2 marks]
iv. Give one advantage and one disadvantage of ridge regression compared with standard linear regression.

1. Let $x = (x_1, \dots, x_{100})$, with $\sum_{i=1}^{100} x_i = 20$, be a random sample from the Exponential($\lambda$)
distribution with probability density function given by
$f(x_i \mid \lambda) = \frac{1}{\lambda} \exp(-x_i / \lambda)$, $x_i > 0$, $\lambda > 0$. Note that $E(x_i) = \lambda$.
(a) Assign the IGamma(0.1, 0.1) prior to $\lambda$ and find the corresponding posterior distribution.
(b) Find the Jeffreys prior for $\lambda$. What is the corresponding posterior distribution?
(c) Find a Bayes estimator for $\lambda$ based on the priors of parts (a) and (b).
(d) Let y represent a future observation from the same model. Find the predictive
distribution of y based on the prior of either part (a) or part (b).
(e) Describe how you can calculate the mean of the predictive distribution in
software such as R.
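One way to approach part (e) is Monte Carlo simulation: draw values of lambda from the posterior, draw one future observation for each, and average. The sketch below uses Python with NumPy rather than R, and it rests on two assumptions that are not stated in the question: that IGamma(a, b) is parameterised by shape a and scale b, and that the posterior under the prior of part (a) is therefore IGamma(a + n, b + sum of x_i). The number of draws is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed so the sketch is reproducible

n, sum_x = 100, 20.0   # sample size and sum of the observations from the question
a, b = 0.1, 0.1        # IGamma(shape a, scale b) prior; parameterisation assumed

# Conjugate update under the assumed parameterisation: IGamma(a + n, b + sum_x).
post_shape, post_scale = a + n, b + sum_x

draws = 200_000        # number of Monte Carlo draws (arbitrary)
# If G ~ Gamma(shape, rate = post_scale), then 1 / G ~ IGamma(shape, post_scale).
lam = 1.0 / rng.gamma(shape=post_shape, scale=1.0 / post_scale, size=draws)
# One future observation per posterior draw: y | lambda ~ Exponential(mean lambda).
y_new = rng.exponential(scale=lam)

mc_mean = y_new.mean()
# Since E(y | lambda) = lambda, the predictive mean equals the posterior mean,
# which is scale / (shape - 1) for an inverse gamma with shape > 1.
exact_mean = post_scale / (post_shape - 1.0)
print(mc_mean, exact_mean)
```

The same two-stage sampling scheme carries over to R, for example with `rgamma` and `rexp`, provided the rate and scale conventions are matched to the parameterisation used for the prior.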
2. (a) i. Suppose a non-linear model can be written as $Y = f(X) + e$,
where e has zero mean and variance $\sigma^2$ and is independent of X. Show
that the expected test error, conditional on X, can be decomposed into the
following three parts:
$E[(Y - \hat{f}(X))^2] = \sigma^2 + \mathrm{Bias}[\hat{f}(X)]^2 + \mathrm{Var}[\hat{f}(X)]$,
where $\hat{f}(\cdot)$ is estimated from the training data.
ii. To estimate the test error rate, one can use the 10-fold cross-validation
(CV) approach or an information criterion such as AIC or BIC. What
are the main advantage and disadvantage of using the CV approach
in comparison with AIC or BIC?
iii. State which of AIC and BIC tends to select the smaller model and
explain why.
(b) i. Figure 1 shows a regression tree based on a dataset of patient visits for upper respiratory
infection. The aim is to identify factors associated with a physician's rate of prescribing, which is a
continuous variable. The variables appearing in the regression tree are private: the percent
of privately insured patients a physician has; black: the percent of black
patients a physician has; and fam: whether or not the physician specialises
in family medicine. Provide an interpretation of this tree.
ii. Consider the regression tree of Figure 2, where the response variable is the
log salary of a baseball player, based on the number of years that he has
played in the major leagues (Years) and the number of hits that he made
in the previous year (Hits). Create a diagram that represents the partition
of the predictor space according to this tree.
4. (a) i. Consider the following data: 10, 20, 40, 80, 85, 121, 160, 168, 195.
Use the k-means algorithm with k = 3 to cluster the data set. Use the
Euclidean distance to measure the distance between the data points. Suppose that the points 160, 168 and
195 were selected as the initial cluster means. Work from these initial values to determine the final
clustering for the data. Provide results from each iteration.
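The iterations can be traced with a few lines of code. The sketch below is a minimal one-dimensional version of the standard assign-then-update (Lloyd) procedure, written under the assumption that this is the variant of k-means intended; the function name `kmeans_1d` and the rule that an empty cluster keeps its previous mean are choices made for this illustration.

```python
def kmeans_1d(points, means, max_iter=100):
    """One-dimensional k-means; returns the final clusters and cluster means."""
    for iteration in range(1, max_iter + 1):
        clusters = [[] for _ in means]
        for p in points:
            # Assign each point to the nearest current mean (Euclidean distance).
            nearest = min(range(len(means)), key=lambda k: abs(p - means[k]))
            clusters[nearest].append(p)
        # Recompute each mean; an empty cluster keeps its previous mean.
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        print(f"iteration {iteration}: clusters={clusters}, means={new_means}")
        if new_means == means:   # means unchanged, so the assignment is stable
            break
        means = new_means
    return clusters, means


data = [10, 20, 40, 80, 85, 121, 160, 168, 195]
final_clusters, final_means = kmeans_1d(data, [160.0, 168.0, 195.0])
```

Each printed line corresponds to one iteration, which makes it easy to compare against a hand calculation of the assignments and updated means.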
ii. What are the main disadvantages of k-means clustering? Why might one
want to consider hierarchical clustering as an alternative?
(b) i. Data are available for students taking a BSc degree in Data Science, and
in particular the variables $X_1$: average mark on project coursework, $X_2$:
average hours studied per course, and Y: whether the student gets a degree with distinction. The
estimated coefficients of a logistic regression model were $\beta_0 = -5$, $\beta_1 = 0.02$,
$\beta_2 = 0.1$. Estimate the probability that a student who scores on average
50% on project coursework and studies 30 hours on average for each course
gets a degree with distinction. How many hours would this student
need to study on average to have a 50% chance of getting a degree
with distinction?
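The calculation can be sketched as follows, taking the intercept to be -5 (an assumption of this sketch) and the mark to be entered on the percentage scale, so that 50% is coded as 50. The function name `prob_distinction` is hypothetical.

```python
import math

# Estimated coefficients; the intercept value of -5 is an assumption of this sketch.
beta0, beta1, beta2 = -5.0, 0.02, 0.1


def prob_distinction(mark, hours):
    """Logistic model: P(Y = 1) = 1 / (1 + exp(-(b0 + b1 * mark + b2 * hours)))."""
    eta = beta0 + beta1 * mark + beta2 * hours
    return 1.0 / (1.0 + math.exp(-eta))


p = prob_distinction(mark=50, hours=30)

# A 50% chance corresponds to a linear predictor of zero, so solve
# beta0 + beta1 * 50 + beta2 * hours = 0 for hours.
hours_for_half = -(beta0 + beta1 * 50) / beta2

print(round(p, 4), hours_for_half)
```

Substituting `hours_for_half` back into `prob_distinction` is a quick way to confirm that the resulting probability is one half.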
ii. Suppose that we wish to predict whether a high quality chip produced in
a factory will pass the quality control ('Pass' or 'Fail') based on x, the
measurement of its diameter. Diameter measurements are available for a
large number of chips. After examining them, it turns out that the mean
value of x for chips that passed the quality control was 5 mm, while the
mean for those that did not was 7 mm. Moreover, the variance of x for
these two sets of chips was $\sigma^2 = 1$. Finally, 70% of the produced
chips passed the quality control. Assuming that x follows a normal
distribution, predict the probability that a chip with x = 5.8 will pass the
quality control.
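This is a direct application of Bayes' theorem with normal class densities. A minimal sketch, assuming the stated normal distribution applies within each class (Pass and Fail) with the common variance given; the helper `normal_pdf` is written out here only to keep the example self-contained.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


prior_pass, prior_fail = 0.7, 0.3            # 70% of chips pass
mean_pass, mean_fail, var = 5.0, 7.0, 1.0    # class means (mm) and common variance
x = 5.8

# Bayes' theorem: prior times class density, normalised over both classes.
num_pass = prior_pass * normal_pdf(x, mean_pass, var)
num_fail = prior_fail * normal_pdf(x, mean_fail, var)
posterior_pass = num_pass / (num_pass + num_fail)

print(round(posterior_pass, 4))
```

Because the variance is shared, the normalising constants of the two densities cancel in the ratio, so only the exponential terms and the priors affect the result.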

Sample Solution