1. Data-mining functions: Here are three examples of data-mining applications. Match each application to one of the three data-mining functions. Then, for each application, describe the potential variables (features/attributes), techniques (algorithms/models), and evaluation criteria. [15 points]
    A. A credit card company tries to distinguish fraudulent transactions from thousands of normal transactions.
    B. A supermarket analyzes customers’ transaction records to find items that are often purchased together.
    C. A furniture retailer tries to identify its target customers by segmenting the market into groups of similar people.
    Data-mining functions:
    Association mining: ()
    • Variables (features/attributes)?
    • Techniques (algorithms/models)? FP-Growth, Create Association Rules
    • Evaluation criteria? Support, confidence, lift
    Cluster analysis: ()
    • Variables (features/attributes)?
    • Techniques (algorithms/models)? Clustering
    • Evaluation criteria?
    Classification: ()
    • Variables (features/attributes)?
    • Techniques (algorithms/models)? Decision Tree, Naive Bayes (Kernel), Deep Learning
    • Evaluation criteria? Accuracy
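The three association-mining criteria named above (support, confidence, lift) can be illustrated with a minimal sketch. The transactions and the helper names `support`, `confidence`, and `lift` are made up for illustration and are not part of the assignment.

```python
# Toy market-basket data (hypothetical, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline support.

    A value above 1 suggests a positive association, below 1 a negative one.
    """
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


if __name__ == "__main__":
    rule = ({"bread"}, {"milk"})
    print("support   :", support(rule[0] | rule[1], transactions))
    print("confidence:", confidence(*rule, transactions))
    print("lift      :", lift(*rule, transactions))
```

In this toy data the rule bread -> milk has a lift slightly below 1, so a high confidence alone would overstate how interesting the rule is.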
  2. Text crawling and scraping: We learned how to use regular expressions to define web crawling rules and how to use XPath to extract information from web pages. Suppose you are interested in studying trends in data mining techniques. https://www.kdnuggets.com/ is a good website that publishes news and opinions on data mining. You want to collect all news, opinions, tutorials, etc. published on this website in 2020. https://www.kdnuggets.com/2020/index.html is a good starting point. [10 points]
    a. Use a regular expression to define your crawling rules. Please also explain the meaning of your regular expression.
    b. Design two XPath queries: one to extract the titles from the web pages and one to extract the article bodies.
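One possible shape of an answer is sketched below, using only the Python standard library. The URL pattern (`/2020/<two-digit month>/<name>.html`) and the `id` values `title` and `post` are assumptions made for illustration; the real site structure has to be checked in the browser before writing the final rules. The page fragment is hypothetical and well-formed so that `xml.etree.ElementTree`, which supports only a subset of XPath, can parse it; real pages usually need a proper HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Crawling rule (assumed URL layout):
#   ^https://www\.kdnuggets\.com/   literal site prefix, dots escaped
#   2020/                           only pages published in 2020
#   (0[1-9]|1[0-2])/                a two-digit month, 01 through 12
#   [^/]+\.html$                    one path segment ending in .html
ARTICLE_URL = re.compile(
    r"^https://www\.kdnuggets\.com/2020/(0[1-9]|1[0-2])/[^/]+\.html$"
)


def should_follow(url):
    """Return True if the crawler should fetch this URL."""
    return ARTICLE_URL.match(url) is not None


# Hypothetical page fragment; the id values are illustrative assumptions.
PAGE = """<html><body>
<h1 id="title">Sample title</h1>
<div id="post"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

TITLE_XPATH = ".//h1[@id='title']"
BODY_XPATH = ".//div[@id='post']/p"

root = ET.fromstring(PAGE)
title = root.find(TITLE_XPATH).text
body = " ".join(p.text for p in root.findall(BODY_XPATH))

if __name__ == "__main__":
    print(should_follow("https://www.kdnuggets.com/2020/03/some-article.html"))
    print(title)
    print(body)
```

With a full XPath engine the same two queries would typically be written as `//h1[@id='title']/text()` and `//div[@id='post']//p/text()`, again assuming those hypothetical `id` values.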
  3. Text representations: The following questions examine the text processing operations required for different text mining tasks. Consider the following three text-mining tasks. For each task, list the preprocessing operators (tf-idf vs. binary, stemming, stopwords, etc.) you would use and explain why you chose them. [15 points]
    a. Finding hot topics from news articles.
    b. Predicting the categories of news articles.
    c. Extracting biomedical relations (e.g., protein A activates protein B) from scientific literature.
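To make the tf-idf vs. binary choice concrete, here is a minimal sketch that builds both representations for a tiny corpus. The documents, the stopword list, and the function names are hypothetical, and the weighting used is the plain `tf * log(N / df)` variant; libraries often apply smoothing or normalization on top of it.

```python
import math
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "in", "and", "is", "after"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def binary_vector(tokens, vocab):
    """1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


def tfidf_vector(tokens, vocab, docs_tokens):
    """Raw term frequency times log(N / document frequency)."""
    n_docs = len(docs_tokens)
    vec = []
    for term in vocab:
        tf = tokens.count(term)
        df = sum(term in doc for doc in docs_tokens)
        idf = math.log(n_docs / df) if df else 0.0
        vec.append(tf * idf)
    return vec


docs = [
    "The market rallied and the market closed higher",
    "The team won the match in the final minute",
    "The market fell after the match",
]
docs_tokens = [tokenize(d) for d in docs]
vocab = sorted({t for doc in docs_tokens for t in doc})

if __name__ == "__main__":
    print(vocab)
    print(binary_vector(docs_tokens[0], vocab))
    print([round(x, 3) for x in tfidf_vector(docs_tokens[0], vocab, docs_tokens)])
```

The binary vector records only presence, while tf-idf rewards terms that are frequent in one document and rare across the corpus. Note that stopword removal, which helps topic discovery and categorization, may discard words that matter for relation extraction, where function words and word order carry meaning.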
  4. Business applications: Suppose that you work for AT&T, which runs customer discussion groups on its website. Many discussions are active at the same time, too many for the company to monitor them all.
    a. How can the company get a general understanding of what is being discussed and how it changes from week to week? Please describe your text mining solution, including your choices of text preprocessing and data mining techniques (e.g., association rules, k-means, decision trees, etc.). [8 points]
    b. Each discussion page has slots for two ads. The company would like to select ads that are a good match for the page. Assume that there are many ads. How should the best two ads for a given page be selected? Please describe your text mining solution, including your choices of text preprocessing and data mining techniques (e.g., association rules, k-means, decision trees, etc.). [8 points]
    c. After observing the effectiveness of your solution for a while, the company realizes that advertising revenue could be improved if ad selection were tuned for each person based on his or her primary interest in using the website. There are five types of primary interest: “phone hardware,” “phone GUI,” “phone apps,” “coverage,” and “price.” For a particular user, how can you use the person’s profile (e.g., age, gender) and behavior (e.g., posts, comments, reading history) to predict which type of user he or she is? Please describe your text mining solution, including your choices of text preprocessing and data mining techniques (e.g., association rules, k-means, decision trees, etc.). [8 points]
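For part b, one common approach (not necessarily the one the course expects) is to represent the page and every ad as term vectors and pick the ads with the highest cosine similarity to the page. The sketch below uses raw term counts for brevity; the page text, the ad texts, and the function names are hypothetical, and a fuller solution would add stopword removal and tf-idf weighting.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Bag-of-words term counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_ads(page_text, ads, k=2):
    """Return the names of the k ads most similar to the page text."""
    page = term_counts(page_text)
    ranked = sorted(
        ads, key=lambda name: cosine(page, term_counts(ads[name])), reverse=True
    )
    return ranked[:k]


# Hypothetical discussion page and ad inventory.
page = "My phone battery drains fast and the screen cracked"
ads = {
    "battery_pack": "portable battery pack for your phone",
    "screen_repair": "phone screen repair cracked screen fixed fast",
    "data_plan": "unlimited data plan low price",
    "tv_bundle": "tv and internet bundle",
}

if __name__ == "__main__":
    print(top_ads(page, ads))
```

Because no stopwords are removed here, the `tv_bundle` ad gets a small nonzero score purely from the shared word "and", which shows why preprocessing choices matter for this task.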
