Data Mining

Answer the following questions:

  1. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
    Example: Age in years. Answer: Discrete, quantitative, ratio.
    Time in terms of AM or PM.
    Brightness as measured by a light meter.
    Brightness as measured by people's judgments.
    Angles as measured in degrees between 0 and 360.
    Bronze, Silver, and Gold medals as awarded at the Olympics.
    Height above sea level.
    Number of patients in a hospital.
    ISBN numbers for books. (Look up the format on the Web.)
    Ability to pass light in terms of the following values: opaque, translucent, transparent.
    Military rank.
    Distance from the center of campus.
    Density of a substance in grams per cubic centimeter.
    Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives you a number that you can use to claim your coat when you leave.)

  1. Can you think of a situation in which identification numbers would be useful for prediction?
  2. An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each.
    How would you convert this data into a form suitable for association analysis?
    In particular, what type of attributes would you have and how many of them are there?
  3. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?
  4. Many sciences rely on observation instead of (or in addition to) designed experiments. Compare the data quality issues involved in observational science with those of experimental science and data mining.
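
For the association analysis question above (a test with 100 questions and four possible answers each), one possible encoding is to create one asymmetric binary item per (question, answer) pair. The sketch below illustrates that idea only; the answer labels `A` to `D`, the item naming scheme `Q<n>=<label>`, and the sample student are assumptions made for illustration and are not part of the exercise.

```python
from itertools import product

N_QUESTIONS = 100
CHOICES = "ABCD"  # assumed labels for the four possible answers

# One candidate item per (question, answer) pair.
all_items = [f"Q{q}={c}" for q, c in product(range(1, N_QUESTIONS + 1), CHOICES)]


def binarize(answers):
    """Turn one student's answers {question number: label} into a transaction,
    i.e. the set of binary items that are 'present' for that student."""
    return {f"Q{q}={c}" for q, c in answers.items() if c in CHOICES}


# Hypothetical student who answered only the first three questions.
student = {1: "A", 2: "C", 3: "B"}
transaction = binarize(student)

print(len(all_items))       # number of binary attributes under this encoding
print(sorted(transaction))  # items present in this student's transaction
```

Under this encoding each student contributes at most one present item per question, so the resulting transactions are sparse, which is the situation association analysis is usually applied to.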

  1. Discuss the difference between the precision of a measurement and the terms single and double precision, as they are used in computer science, typically to represent floating-point numbers that require 32 and 64 bits, respectively.
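
As a starting point for the discussion above, the following sketch contrasts the two senses of "precision". It assumes that measurement precision is summarized by the spread (standard deviation) of repeated readings, and it shows storage precision by rounding a value to a 32-bit float with Python's `struct` module. The readings are hypothetical values chosen for illustration.

```python
import statistics
import struct


def to_single(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Measurement precision: how closely repeated readings agree with each other.
readings = [1.02, 0.98, 1.01, 0.99]  # hypothetical repeated measurements
spread = statistics.stdev(readings)

# Storage precision: how many bits are used to represent a single value.
value = 0.1
single = to_single(value)

print(spread)                # spread of the hypothetical readings
print(repr(single))          # 0.10000000149011612
print(abs(single - value))   # representation error introduced by 32-bit storage
```

In this example the storage error is many orders of magnitude smaller than the spread of the readings, which suggests that storing a value in double precision does not make the underlying measurement any more precise.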

  1. Give at least two advantages to working with data stored in text files instead of in a binary format.
  2. Distinguish between noise and outliers. Be sure to consider the following questions.
    Is noise ever interesting or desirable? Outliers?
    Can noise objects be outliers?
    Are noise objects always outliers?
    Are outliers always noise objects?
    Can noise make a typical value into an unusual one, or vice versa?
  3. For the following vectors, x and y, calculate the indicated similarity or distance measures.
    (a) x : (0,0,1,1), y : (2,2,2,2) cosine, correlation, Euclidean
    (b) x : (0,1,0,1), y : (0,1,0,1) cosine, correlation, Euclidean, Jaccard
    (c) x : (1,1,0,1), y : (-1,0,-1,0) cosine, correlation, Euclidean
    (d) x : (1,0,0,1,0,1), y : (0,1,1,0,0,1) cosine, correlation, Jaccard
    (e) x : (2,1,0,2,0,3),y : (1,1,1,0,0,1) cosine, correlation
  4. This exercise compares and contrasts some similarity and distance measures. For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.
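
The measures named in the last two exercises can be sketched in a few lines of Python, which may help when checking hand calculations. This is a minimal sketch that assumes the usual textbook definitions: cosine as the normalized dot product, correlation as Pearson's correlation, and Jaccard as the number of 1-1 matches divided by the number of positions that are not 0-0 matches. The vectors `u` and `v` are hypothetical and are not taken from the exercises.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    denom = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    return dot(x, y) / denom if denom else math.nan  # undefined for a zero vector


def correlation(x, y):
    """Pearson correlation; undefined (nan) when either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = math.sqrt(dot(dx, dx) * dot(dy, dy))
    return dot(dx, dy) / denom if denom else math.nan


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def hamming(x, y):
    """Number of positions in which two binary vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def jaccard(x, y):
    """1-1 matches divided by all positions that are not 0-0 matches."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    denom = f11 + hamming(x, y)
    return f11 / denom if denom else math.nan


# Hypothetical binary vectors, used only to exercise the functions.
u = (1, 0, 1, 1)
v = (1, 1, 0, 1)
print(cosine(u, v), correlation(u, v), euclidean(u, v))
print(hamming(u, v), jaccard(u, v))
```

Returning `nan` for the undefined cases is a design choice of this sketch; raising an exception would be an equally reasonable alternative.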

Sample Solution