Machine Learning

Machine Learning (Unsupervised Learning, Tree-Based Methods, Support Vector
Machines, Classification, Linear and Non-linear Regression, and Resampling Methods)
The following is an example of the coursework, which is expected to be delivered within 12 hours. This
coursework contains four questions. Answer ALL FOUR. All questions will be given equal weight (25%).
Time allowed: expected writing time is 2 hours (you will have 12 hours to answer).
In this exam is
(a) Suppose that $y_i \sim N(\mu, 1)$ for $i = 1, \dots, n$ and that the $y_i$'s are independent.
i. Show that the sample mean estimator $\hat{\mu}_1 = \frac{1}{n} \sum_{i=1}^{n} y_i$ is obtained from
minimising the least squares criterion
$\hat{\mu}_1 = \arg\min_{\mu} \sum_{i=1}^{n} (y_i - \mu)^2$,
and that $\hat{\mu}_1$ is an unbiased estimator of $\mu$. Also find the variance of $\hat{\mu}_1$. [7 marks]
ii. Consider adding a penalty term to the least squares criterion, and therefore using the estimator
$\hat{\mu}_2 = \arg\min_{\mu} \left\{ \sum_{i=1}^{n} (y_i - \mu)^2 + \lambda \mu^2 \right\}$
for the mean, where $\lambda$ is a non-negative tuning parameter. Derive $\hat{\mu}_2$,
find its bias, and show that its variance is lower than that of $\hat{\mu}_1$.
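
The behaviour of the two estimators in part (a) can be illustrated by simulation. The sketch below is only a numerical check, not a derivation: it uses the closed form $\hat{\mu}_2 = \sum y_i / (n + \lambda)$, which follows from setting the derivative of the penalised criterion to zero. The values of `mu`, `n`, `lam`, `reps` and `seed` are arbitrary assumptions chosen for illustration.

```python
import random
import statistics


def simulate(mu=2.0, n=20, lam=5.0, reps=20000, seed=0):
    """Monte Carlo summary of the least squares and penalised mean estimators."""
    rng = random.Random(seed)
    est1, est2 = [], []
    for _ in range(reps):
        # One sample of n independent N(mu, 1) observations.
        y = [rng.gauss(mu, 1.0) for _ in range(n)]
        total = sum(y)
        est1.append(total / n)          # least squares estimator (sample mean)
        est2.append(total / (n + lam))  # penalised estimator, shrunk towards zero
    return {
        "mean1": statistics.fmean(est1),
        "var1": statistics.pvariance(est1),
        "mean2": statistics.fmean(est2),
        "var2": statistics.pvariance(est2),
    }


result = simulate()
print(result)
```

With these settings the first estimator should average close to `mu` with variance near 1/n, while the second should average below `mu` (it is biased towards zero) with a smaller variance.
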
(b) Consider the multiple linear regression model $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + e_i$, $i = 1, \dots, n$,
where $\beta = (\beta_1, \dots, \beta_p)^T$ and the error term $e = (e_1, \dots, e_n)^T \sim N(0, \sigma^2 I_n)$.
i. When p is comparable to n, multicollinearity becomes an issue. Describe the effects of multicollinearity on
the estimated coefficients, the associated standard errors and the significance of the coefficients when using
the ordinary maximum likelihood method.
ii. The ridge regression estimate of β can be obtained by minimising a particular expression with respect to β.
Write down this expression as well as an alternative formulation of it.
iii. Explain why ridge regression can potentially correct the problems of multicollinearity. [2 marks]
iv. Provide an advantage and a disadvantage of ridge regression over the standard linear regression.
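
The effect of the ridge penalty under multicollinearity can be seen on a small synthetic example. The sketch below is an illustration under stated assumptions, not a model answer: it uses two nearly collinear predictors, no intercept, and the closed-form solution of $(X^T X + \lambda I)\beta = X^T y$ for the 2x2 case. The helper `ridge_2d`, the simulated data and the value `lam = 1.0` are all hypothetical choices.

```python
import math
import random


def ridge_2d(x1, x2, y, lam):
    """Solve (X'X + lam*I) beta = X'y for two predictors and no intercept."""
    a = sum(u * u for u in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    # Explicit inverse of the symmetric 2x2 matrix [[a, b], [b, d]].
    return ((d * c1 - b * c2) / det, (a * c2 - b * c1) / det)


rng = random.Random(1)
x1 = [rng.gauss(0.0, 1.0) for _ in range(50)]
x2 = [u + rng.gauss(0.0, 0.01) for u in x1]  # almost a copy of x1
y = [u + v + rng.gauss(0.0, 1.0) for u, v in zip(x1, x2)]  # true beta = (1, 1)

ols = ridge_2d(x1, x2, y, 0.0)    # lam = 0 gives ordinary least squares
ridge = ridge_2d(x1, x2, y, 1.0)  # a small positive penalty

print("OLS coefficients:  ", ols)
print("Ridge coefficients:", ridge)
print("Coefficient norms: ", math.hypot(*ols), math.hypot(*ridge))
```

Because the two predictors are nearly identical, the least squares coefficients are poorly determined, whereas the penalised system is better conditioned and its coefficient vector always has a smaller norm.
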

  1. Let $x = (x_1, \dots, x_{100})$, with $\sum_{i=1}^{100} x_i = 20$, be a random sample from the Exponential($\lambda$)
    distribution with probability density function given by
    $f(x_i \mid \lambda) = \frac{1}{\lambda} \exp(-x_i / \lambda)$, $x_i > 0$, $\lambda > 0$. Note that $E(x_i) = \lambda$.
    (a) Assign the IGamma(0.1, 0.1) prior to λ and find the corresponding posterior distribution.
    (b) Find the Jeffreys prior for λ. What is the corresponding posterior distribution?
    (c) Find a Bayes estimator for λ based on the priors of parts (a) and (b).
    (d) Let y represent a future observation from the same model. Find the predictive
    distribution of y based either on the prior of part (a) or (b).
    (e) Describe how you can calculate the mean of the predictive distribution in
    software such as R.
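
The Monte Carlo idea behind part (e) can be sketched as follows, written here in Python rather than R. It assumes that IGamma(a, b) denotes the inverse-gamma distribution with shape a and scale b, so that the prior of part (a) combined with the exponential likelihood gives an IGamma(a + n, b + Σx_i) posterior. The number of draws and the seed are arbitrary.

```python
import random
import statistics

n, sum_x = 100, 20.0          # sample size and sum of observations from the question
a_prior, b_prior = 0.1, 0.1   # IGamma(shape, scale) prior, as assumed above

# Conjugate update under the assumed parametrisation.
a_post = a_prior + n
b_post = b_prior + sum_x

# The predictive mean equals the posterior mean of lambda, since E(y | lambda) = lambda.
exact_mean = b_post / (a_post - 1)

rng = random.Random(42)
draws = []
for _ in range(50000):
    # If lambda ~ IGamma(a, b) then 1/lambda ~ Gamma(shape=a, scale=1/b).
    lam = 1.0 / rng.gammavariate(a_post, 1.0 / b_post)
    # expovariate takes the rate, which is 1/lambda for an exponential with mean lambda.
    draws.append(rng.expovariate(1.0 / lam))

mc_mean = statistics.fmean(draws)
print(exact_mean, mc_mean)
```

The two printed values should agree to about two decimal places. The same two-stage sampling scheme applies to the posterior from the Jeffreys prior once its parameters are substituted.
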
  2. (a) i. Suppose a non-linear model can be written as $Y = f(X) + e$,
    where $e$ has zero mean and variance $\sigma^2$, and is independent of $X$. Show
    that the expected test error, conditional on $X$, can be decomposed into the
    following three parts:
    $E[(Y - \hat{f}(X))^2] = \sigma^2 + \mathrm{Bias}[\hat{f}(X)]^2 + \mathrm{Var}[\hat{f}(X)]$,
    where $\hat{f}(\cdot)$ is estimated from the training data.
    ii. To estimate the test error rate, one can use the 10-fold Cross Validation
    (CV) approach or the information criterion approach, e.g. AIC, BIC. What
    are the main advantage and disadvantage of using the 5-fold CV approach
    in comparison with AIC or BIC?
    iii. State which one of AIC and BIC tends to select a smaller model, and
    explain the reason.
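
The bias-variance decomposition in part (a) i can be checked numerically at a single test point. The sketch below is a simulation under assumed settings, not a proof: the true function `f(x) = x**2`, the straight-line fit, the design points `xs`, the test point `x0` and the noise level `sigma` are all hypothetical choices made for illustration.

```python
import random
import statistics


def f(x):
    """Assumed true regression function (hypothetical)."""
    return x * x


def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear fit."""
    xbar = statistics.fmean(xs)
    ybar = statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


rng = random.Random(0)
sigma = 1.0
xs = [i / 9 for i in range(10)]  # fixed training inputs on [0, 1]
x0 = 0.9                         # test point

preds, sq_errors = [], []
for _ in range(20000):
    # New training responses, a new fit, and a new test response each time.
    ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
    intercept, slope = fit_line(xs, ys)
    pred = intercept + slope * x0
    y0 = f(x0) + rng.gauss(0.0, sigma)
    preds.append(pred)
    sq_errors.append((y0 - pred) ** 2)

test_error = statistics.fmean(sq_errors)
bias_sq = (statistics.fmean(preds) - f(x0)) ** 2
variance = statistics.pvariance(preds)
decomposed = sigma ** 2 + bias_sq + variance
print(test_error, decomposed)
```

The simulated test error and the sum of the three components should agree up to Monte Carlo error.
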
    (b) i. The tree in Figure 1 is a regression tree based on a dataset of patient visits for upper respiratory
    infection. The aim is to identify factors associated with a physician's rate of prescribing, which is a
    continuous variable. The variables appearing in the regression tree are private: the percent of privately
    insured patients a physician has; black: the percent of black patients a physician has; and fam: whether
    or not the physician specialises in family medicine. Provide an interpretation of this tree.
    ii. Consider the regression tree of Figure 2 where the response variable is the
    log salary of a baseball player, based on the number of years that he has
    played in the major leagues (Years) and the number of hits that he made
    in the previous year (Hits). Create a diagram that represents the partition
    of the predictor space according to this tree.
    4. (a) i. Consider the following data: 10, 20, 40, 80, 85, 121, 160, 168, 195.
    Use the k-means algorithm with k = 3 to cluster the data set, using the
    Euclidean distance to measure the distance between data points. Suppose that the points 160, 168, and
    195 were selected as the initial cluster means. Work from these initial values to determine the final
    clustering for the data. Provide the results from each iteration.
    ii. What are the main disadvantages of k-means clustering? Why might one
    want to consider hierarchical clustering as an alternative?
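
The iterations asked for in part (a) i can be traced with a short script. This is a minimal sketch of the standard assign-then-update loop for one-dimensional data, where the Euclidean distance reduces to the absolute difference. The function name `kmeans_1d` is a hypothetical helper, and the loop stops when the means no longer change.

```python
def kmeans_1d(points, means):
    """Run k-means on 1-D data and return (clusters, new_means) for every iteration."""
    history = []
    while True:
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda j: abs(p - means[j]))
            clusters[nearest].append(p)
        # Update step: recompute each mean (an empty cluster keeps its old mean).
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        history.append((clusters, new_means))
        if new_means == means:
            return history
        means = new_means


data = [10, 20, 40, 80, 85, 121, 160, 168, 195]
history = kmeans_1d(data, [160, 168, 195])

for step, (clusters, means) in enumerate(history, start=1):
    print(f"Iteration {step}: clusters={clusters}, means={means}")
```

Printing every iteration makes it easy to compare the script with a hand calculation; the last iteration repeats the previous assignment, which is how convergence is detected.
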
    (b) i. Data are available for students taking a BSc degree in Data Science, and
    in particular the variables X1: average mark on project coursework, X2:
    average hours studied per course, and Y: get a degree with distinction. The
    estimated coefficients of a logistic regression model were β0 = −5, β1 = 0.02,
    β2 = 0.1. Estimate the probability that a student who scores on average
    50% on project coursework and studies 30 hours on average for each course
    gets a degree with distinction. How many hours would this student
    need to study on average to have a 50% chance of getting a degree
    with distinction?
    ii. Suppose that we wish to predict whether a high quality chip produced in
    a factory will pass the quality control (‘Pass’ or ‘Fail’) based on x, the
    measurement of its diameter. Diameter measurements are available for a
    large number of chips. After examining them it turns out that the mean
    value of x for chips that passed the quality control was 5mm, while the
    mean for those that didn’t was 7mm. Moreover, the variance of x for
    these two sets of chips was σ^2 = 1. Finally, 70% of the produced
    chips passed the quality control. Assuming that x follows the normal
    distribution, predict the probability that a chip with x = 5.8 will pass the
    quality control.
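
The two probability calculations in part (b) can be sketched numerically. The block below reads the logistic intercept as β0 = −5 (the sign is an assumption) and, for the chip question, applies Bayes' theorem with normal class densities sharing the common variance 1. The helper names `logistic` and `normal_pdf` are hypothetical.

```python
import math


def logistic(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Part (b) i: logistic regression, assuming b0 = -5.
b0, b1, b2 = -5.0, 0.02, 0.1
p_distinction = logistic(b0 + b1 * 50 + b2 * 30)
# A 50% chance corresponds to a linear predictor of zero; solve for the hours.
hours_for_half = (0.0 - b0 - b1 * 50) / b2

# Part (b) ii: posterior probability of 'Pass' for a chip with diameter 5.8.
prior_pass = 0.7
num = prior_pass * normal_pdf(5.8, 5.0, 1.0)
den = num + (1.0 - prior_pass) * normal_pdf(5.8, 7.0, 1.0)
p_pass = num / den

print(round(p_distinction, 3), hours_for_half, round(p_pass, 3))
```

Under these assumptions the script gives a probability of distinction of about 0.27, 40 hours of study for a 50% chance, and a pass probability of about 0.78.
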
