Data Mining

Outline
INF6024
● Part I – Data mining: a brief introduction
■ Why data mining
■ What is data mining
■ The four views of data mining
■ Data mining examples
■ Data representation
● Part II – Data mining on social media
■ Data gathering
■ Research and applications on mining social media
■ Issues and challenges
Why Data Mining
Source: https://mjolner.dk/2015/01/14/realizing-the-fourth-industrial-revolution/
Source: http://www.iflscience.com/technology/how-much-data-does-the-world-generate-every-minute/
90% of the world’s data today has been created in the last 2 years alone
● Businesses
● Sensors
● You & I
BIG DATA
“We are drowning in data, but starving for knowledge!” (Eric D. Brown)
Source: http://www.tatewilliams.org/2014/05/20/new-stanford-center-using-meta-research-crack-shoddy-science/
BIG MESSY DATA
What Is Data Mining
● Discover interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
● The core of the Knowledge Discovery in Databases (KDD) process
From data to knowledge
The Knowledge Discovery Process
Data preprocessing: data cleaning (e.g., removing errors, filling in missing values and resolving inconsistent ones), transformation (e.g., scaling values to the [0, 1] range), and reduction (reducing attributes, reducing attribute values, sampling)
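A minimal sketch of two of the preprocessing steps named above: filling missing values and scaling to the [0, 1] range. The data and the choice of the mean as the fill value are assumptions made for illustration; other fill strategies are equally valid.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical transaction amounts with one missing value.
amounts = [10.0, None, 30.0, 50.0]
cleaned = fill_missing(amounts)
scaled = min_max_scale(cleaned)
print(cleaned, scaled)
```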
Data exploration: understand the characteristics of the data, such as its size and volume, and possible relationships among data elements or among files/tables in the data.
We will discuss Data Presentation as a kind of
technique for Data Mining
Four views of Data Mining
Data: what data can I find that knowledge in?
● Relational, transactional, stream, sensor, spatial, time-series, sequence, text, images, videos, network, the Web, etc.
Knowledge: what knowledge do I need?
● Characterisation, association, correlation, classification, prediction, clustering, outlier, trend, interaction
Techniques: what methods do I use to discover it?
● Database oriented, machine learning, statistics, visualisation
Applications: what applications do I want to build?
● Fraud detection, insurance quotation, retail sales pattern, share price prediction, recommendation, knowledge base
The Knowledge View – Details (1)
Characterisation:
● Generalise, summarise, and contrast data characteristics, e.g., posting behaviours of different age groups on Twitter
Association and correlation:
● Association: do two variables depend on each other (e.g., the price of a house is associated with its distance from good schools)?
● Correlation: how do two variables depend on each other (e.g., ice-cream sales are positively correlated with temperature)?
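The ice-cream example can be checked numerically with Pearson's correlation coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). This is a sketch on made-up numbers; the temperatures and sales figures are assumptions, not real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical daily temperatures (°C) and ice-cream sales (units).
temperature_c = [14, 18, 21, 25, 30]
ice_cream_sales = [110, 150, 190, 240, 310]

r = pearson(temperature_c, ice_cream_sales)
print(f"r = {r:.3f}")  # close to +1 means a strong positive correlation
```

A value near +1 only says the two series move together; it does not say that one causes the other.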
The Knowledge View – Details (2)
Classification:
● Distinguish classes or concepts based on data (e.g., developed countries vs developing countries)
Prediction:
● Predict missing or unknown values (e.g., is the share price of Rolls Royce going up or down tomorrow?)
Clustering:
● The class label is unknown
● Group data to form new classes (e.g., help marketing to discover different groups of buyers)
Outlier:
● Data objects that do not comply with the general behaviour of the data (e.g., fraudulent transactions)
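As a sketch of the clustering idea on this slide (grouping data when no class label is available), here is a tiny one-dimensional k-means. The spend figures and the starting centroids are assumptions; real use would involve many features and a library implementation.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Assign each value to its nearest centroid, then move each centroid to its cluster mean."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per buyer; no labels are given.
spend = [12, 15, 14, 90, 95, 100]
centroids, clusters = kmeans_1d(spend, centroids=[12.0, 100.0])
print(centroids, clusters)
```

The two groups that emerge (low and high spenders) are the "new classes" that marketing could then interpret.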
The Knowledge View – Details (3)
Trend:
● Trend and deviation: are the data following the anticipated trend, or are they forming a new trend?
● Sequential pattern: buying patterns and shelving sequence, e.g., if you buy a sandwich you are likely to buy a drink next
Interaction:
● Graph, network and links: who is the authority in a user community formed by Twitter users? Who helps propagate information, and who initiates (creates) information?
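One simple way to approach the interaction questions is to count links in a retweet network. The sketch below assumes a list of (retweeter, original author) pairs; the user names are made up, and counting retweets is only a crude proxy for authority.

```python
from collections import Counter

# Hypothetical retweet records: (user who retweeted, author of the original tweet).
retweets = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "alice"),
    ("carol", "bob"),
    ("carol", "dave"),
]

# How often each user's tweets are retweeted: a proxy for who initiates information.
times_retweeted = Counter(author for _, author in retweets)
# How often each user retweets others: a proxy for who propagates information.
times_retweeting = Counter(user for user, _ in retweets)

top_initiator = times_retweeted.most_common(1)[0][0]
top_propagator = times_retweeting.most_common(1)[0][0]
print(top_initiator, top_propagator)
```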
Example – Fraud Detection
Banks use information associated with credit card transactions (e.g., amount, time, location, frequency) to detect patterns of fraud, which are ‘outliers’ compared with normal transaction patterns, and use the discovered patterns to catch future fraudulent transactions (see Pawar et al., 2014)
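A minimal sketch of the outlier idea, using a z-score on the transaction amount only. The amounts and the threshold of 2 standard deviations are assumptions for illustration; this is not the method surveyed by Pawar et al., and real systems combine many features.

```python
from statistics import mean, pstdev


def flag_outliers(amounts, threshold=2.0):
    """Return the indices of amounts more than `threshold` standard deviations from the mean."""
    mu = mean(amounts)
    sigma = pstdev(amounts)
    if sigma == 0:  # all amounts identical: nothing stands out
        return []
    return [i for i, a in enumerate(amounts) if abs(a - mu) / sigma > threshold]


# Hypothetical card transactions in pounds; one is unusually large.
amounts = [12.5, 20.0, 15.99, 18.0, 22.0, 17.5, 519.0, 19.0]
suspicious = flag_outliers(amounts)
print(suspicious)
```

A single extreme value inflates the mean and standard deviation, so on small samples a robust measure such as the median may work better.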
The same task can also be achieved through classification using machine learning, if we have ‘labelled’ data (see Awoyemi et al., 2017)
Example – Sales Pattern Extraction
The monthly sales pattern of iPhone 6S may correlate with that of Samsung Galaxy S6, so we can use the sales patterns of iPhone X to predict the sales of Samsung Galaxy S9 (see Ferreira et al., 2016)
Example – Share Price Prediction
Previous daily closing prices of a share are used to forecast the trend of the next day's opening price, e.g., with a Simple or Exponential Moving Average. But the real problem is much harder than this: it also depends on, e.g., company financials, sentiment and economic policies (see Milosevic, 2016)
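The two moving averages mentioned here can be sketched in a few lines. The closing prices, the window of 3 days and the smoothing factor of 0.5 are assumptions chosen for illustration, and the result is a naive baseline rather than a usable forecast.

```python
def simple_moving_average(prices, window):
    """Mean of the last `window` prices."""
    if window <= 0 or window > len(prices):
        raise ValueError("window must be between 1 and len(prices)")
    return sum(prices[-window:]) / window


def exponential_moving_average(prices, alpha):
    """Weighted average in which recent prices count more; alpha in (0, 1] is the weight of the newest price."""
    ema = prices[0]
    for price in prices[1:]:
        ema = alpha * price + (1 - alpha) * ema
    return ema


# Hypothetical daily closing prices, oldest first.
closes = [100.0, 102.0, 101.0, 105.0, 107.0]
sma3 = simple_moving_average(closes, window=3)
ema = exponential_moving_average(closes, alpha=0.5)
print(sma3, ema)
```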
Example – Knowledge Base/Graph Creation
In the area of ‘text data mining’ we extract concepts, named entities and relations from text and form facts that are linked to create a ‘knowledge graph/base’. More on this later… (see Mitchell et al., 2015)
Example – Recommendation
Facebook uses your friend network and your interactions with other people to recommend new friends; LinkedIn uses your professional network and your interactions with others to recommend new connections. This adds graph mining to the techniques view. Real algorithms also use your profile and are more complex (see Naruchitparames et al., 2011)
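A much simplified sketch of network-based friend recommendation: suggest people who share the most mutual friends with you. The network is made up, and this common-neighbours heuristic is an assumption for illustration, not the algorithm used by Facebook, LinkedIn or Naruchitparames et al.

```python
from collections import Counter

# Hypothetical undirected friendship network.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}


def recommend(user, network, top_n=3):
    """Rank non-friends by the number of friends they share with `user`."""
    mutual = Counter()
    for friend in network[user]:
        for candidate in network[friend]:
            if candidate != user and candidate not in network[user]:
                mutual[candidate] += 1
    # Most mutual friends first; ties broken alphabetically.
    ranked = sorted(mutual.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


suggestions = recommend("ann", friends)
print(suggestions)
```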
The Fundamental Concept of Data Representation
After Data Preprocessing/Integration, an important goal of Data Exploration is to develop some understanding of how to represent the data for data mining techniques
Ask yourself: if you were to perform the task yourself (e.g., detect fraudulent credit card transactions), which data instances would you need to examine, and what kinds of information would you look for in them in order to make a decision?
Data mining techniques require you to encode that information in a certain form…
E.g., credit card fraud detection using classification (knowledge) and machine learning (technique)
● Dataset: the previous 5,000 transactions from random bank customers, of which 4,000 are known to be legitimate and 1,000 are known to be fraudulent (these are called ‘labels’)
● Data instance: each individual transaction
● Features (attributes of the instance that help you make a decision): hour of the day, amount, hours from last transaction, transaction type, ratio to average transaction amount, country of residence, country of transaction
● Feature values: nominal (transaction type=online), ordinal (hour of the day=2am), interval (amount=£120), ratio (ratio to average transaction amount=1.5)
Feature vector: an n-dimensional vector of features (each feature is a dimension) used to represent a data instance (used by many data mining algorithms)
Label  Instance ID  Hour of the day  Amount  Ratio to avg. per trans.  Hours from last trans.  Trans. Type  …
F      T1           1AM              519.00  10                        0.1                     Online       …
L      T2           5PM              15.99   0.3                       26                      POS          …
…      …            …                …       …                         …                       …            …
L      T5000        6PM              123.00  2                         72                      Online       …
(F = fraud, L = legitimate)
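A sketch of how rows like T1 and T2 might be turned into numeric feature vectors. The dictionary keys, the 24-hour encoding of the hour and the one-hot encoding of the transaction type are all assumptions made for illustration; the set of transaction types is limited to the two that appear in the table.

```python
# Assumed set of transaction types; each becomes its own 0/1 dimension (one-hot encoding).
TRANSACTION_TYPES = ["Online", "POS"]


def to_feature_vector(tx):
    """Encode one transaction as a list of numbers, one entry per dimension."""
    one_hot = [1.0 if tx["type"] == t else 0.0 for t in TRANSACTION_TYPES]
    numeric = [
        float(tx["hour"]),       # hour of the day on a 24-hour clock
        tx["amount"],
        tx["ratio_to_avg"],
        tx["hours_from_last"],
    ]
    return numeric + one_hot


t1 = {"hour": 1, "amount": 519.00, "ratio_to_avg": 10.0, "hours_from_last": 0.1, "type": "Online"}
t2 = {"hour": 17, "amount": 15.99, "ratio_to_avg": 0.3, "hours_from_last": 26.0, "type": "POS"}

vectors = [to_feature_vector(t) for t in (t1, t2)]
labels = ["F", "L"]  # kept separately from the features
print(vectors)
```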
Machine learning algorithms then analyse the features of these data instances to identify patterns that are common to fraudulent transactions but not to legitimate transactions
E.g., all fraudulent transactions are online (trans. type) and are attempted less than half an hour after the last successful transaction (hours from last trans.)
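The example pattern can be written as a rule and checked against labelled data. In this sketch the rule is hand-coded rather than learned, and the four transactions (including the last, a legitimate quick repeat purchase) are assumptions chosen to show that such a rule will not be right every time.

```python
def looks_fraudulent(tx):
    """The example pattern: an online transaction within half an hour of the previous one."""
    return tx["type"] == "Online" and tx["hours_from_last"] < 0.5


# Hypothetical labelled transactions: "F" = fraud, "L" = legitimate.
labelled = [
    ({"type": "Online", "hours_from_last": 0.1}, "F"),
    ({"type": "POS", "hours_from_last": 26.0}, "L"),
    ({"type": "Online", "hours_from_last": 72.0}, "L"),
    ({"type": "Online", "hours_from_last": 0.2}, "L"),  # legitimate, but matches the rule
]

# Count how often the rule's verdict agrees with the known label.
correct = sum(looks_fraudulent(tx) == (label == "F") for tx, label in labelled)
accuracy = correct / len(labelled)
print(accuracy)
```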
In real applications feature vectors often have hundreds of dimensions, and the patterns discovered can be very complex and not human-interpretable! Data mining techniques are typically not 100% accurate!
Part II – Data mining on social media
Before Data Gathering
Before you start collecting data, you should understand your data needs: think about the four views of data mining
● What do you want to study? What is your research question or application?
● What kind of knowledge do you need to mine from your data?
● What data do you need? What kind of data?
● Are there public datasets you can already use? If not, will you be able to gather the data?
■ A practical tip: search on Google; a lot of data is already available from research
● How much data do you need?
■ Qualitative analysis (e.g., studying a single user’s posting behaviour on Twitter): consider gathering data manually
■ Quantitative analysis (where drawing conclusions from an individual or a small population can introduce significant bias, e.g., do people like the new iPhone X?): hundreds of samples or more are needed (the more the better!)
● The key question is: what factors do you expect to influence your research outcome, and does your data include reasonable samples of all those factors?
Data Gathering From Social Media Via APIs
You can gather data from social media automatically using programs and/or tools via an Application Programming Interface (API)
Points to consider
● APIs change frequently
● Not all platforms give you access to their data
● Most platforms don’t give you real-time access
● Some platforms charge for real-time/historical data
● Reliance on a single (and possibly unsuitable) platform (e.g., is Twitter suitable for analysing someone’s friend network?)
● There may not be a tool that can collect the data for you
● Twitter: you can collect real-time and historic data, but you have to pay to get complete coverage. Free data is rate limited, so you may only get a sample.
● YouTube: the API opened up in recent years. You can, for example, collect all comments for a video.
● Instagram: the current API will be replaced by the Instagram Graph API in July 2018
● Facebook: limited for pages, groups and search. More is available, but mainly via paid API access
This is only a summary; you should study the APIs in detail before using them!
There are many others you can use for your assignment, e.g., LinkedIn, Tumblr, etc., but we do not look at them here
For all APIs, you need to:
● Create an account on the corresponding social media platform
● Obtain an ‘authorisation code’ to use with your account
● Use a programming language to call the right endpoint to get different kinds of data
● Respect the rate limit and the T&Cs
○ E.g., Twitter normally does not allow you to distribute Tweet text, only Tweet IDs; otherwise you could be breaching the T&Cs!
More on this in the lab, Weeks 8 & 9
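The steps above can be sketched as a collection loop that pages through results and pauses between requests. No real platform API is called here: `fetch_page` is a hypothetical function you would write around a real endpoint and your authorisation code, and `fake_fetch` stands in for it so the sketch can run offline.

```python
import time


def collect(fetch_page, max_requests, min_interval_s, sleep=time.sleep):
    """Fetch pages until none are left or the request budget is used up.

    `fetch_page(cursor)` is a hypothetical callable wrapping one API request.
    It returns (items, next_cursor), where next_cursor is None on the last page.
    """
    items, cursor = [], None
    for _ in range(max_requests):
        page, cursor = fetch_page(cursor)
        items.extend(page)
        if cursor is None:
            break
        sleep(min_interval_s)  # wait between requests to respect the rate limit
    return items


# Stand-in for a real endpoint: two pages of made-up post IDs.
FAKE_PAGES = {None: (["id1", "id2"], "page2"), "page2": (["id3"], None)}


def fake_fetch(cursor):
    return FAKE_PAGES[cursor]


post_ids = collect(fake_fetch, max_requests=10, min_interval_s=0, sleep=lambda s: None)
print(post_ids)
```

Storing only IDs, as here, is one way to stay on the safe side of terms that restrict redistribution of the content itself.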
Mining Social Media: Research and Applications
FootballWhispers.com – can Twitter conversations predict footballer transfers?
● Tweets mentioning footballers and teams are collected in
real time on a daily basis
● Tweets are parsed to understand if they describe the
‘transfer’ relation
● Tweets mentioning transfers are quantified to compute a
score of ‘confidence’ (i.e., FW-index)
● The score decays following a non-linear function over
days, unless new conversation is generated to strengthen
it
Exclusive partnership with Sky Sports, 2017 Sports
Technology award
THINK: reflect on the four data mining views, what data and
techniques are used and what knowledge is extracted?
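One way a score could decay non-linearly and be strengthened by new conversation is sketched below. The actual FW-index formula is not given here, so the exponential half-life decay, the half-life of 3 days and the per-mention weight are all assumptions made for illustration.

```python
def update_score(score, days_since_update, new_mentions, half_life_days=3.0, weight=1.0):
    """Halve the score every `half_life_days`, then add credit for new transfer mentions."""
    decayed = score * 0.5 ** (days_since_update / half_life_days)
    return decayed + weight * new_mentions


# A rumour scored 4.0 three days ago.
quiet = update_score(4.0, days_since_update=3.0, new_mentions=0)    # no new tweets
revived = update_score(4.0, days_since_update=3.0, new_mentions=5)  # five new mentions
print(quiet, revived)
```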
Mining Social Media: Research and Applications
Facebook introduced suicide prevention and mental health support tools
Source: https://www.nbcnews.com/tech/tech-news/facebook-addresses-growing-issue-livestreamed-suicides-n727671
● Identify profiles that are expressing thoughts of suicide on the platform
● Detect patterns in posts or live videos where suicidal thoughts are being expressed
○ Comments like “Are you OK?” and “Can I help?” can be strong indicators.
THINK: how much data should you gather to build such a detection system, and what are suitable features?
Mining Social Media: Research and Applications
Detecting earthquakes in under a minute – The US Geological Survey and Twitter case
Source: https://www.theverge.com/2015/10/8/9477675/twitter-usgs-earthquake-detection
● Users who are actually experiencing earthquakes tweet very short messages
● After filtering out tweets that have more than seven words or contain links, the remaining tweets proved to be effective at monitoring earthquakes
● Now, when a number of people start tweeting about an earthquake in an area, the USGS gets an alert.
● In 2014, they used tweets to detect an earthquake in Napa, California in 29 seconds
THINK: how much data do you need to gather to make a prediction, and how much data did you need to analyse beforehand to gain such an insight?
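The filtering idea can be sketched as follows: keep only short tweets without links that mention an earthquake, and raise an alert when enough of them arrive. The keyword match, the alert threshold of 3 and the example tweets are assumptions; the USGS system also takes location and timing into account.

```python
def is_candidate(text, max_words=7):
    """Short, link-free tweets mentioning an earthquake are treated as eyewitness reports."""
    words = text.split()
    has_link = any(w.lower().startswith(("http://", "https://")) for w in words)
    return "earthquake" in text.lower() and len(words) <= max_words and not has_link


def should_alert(tweets, min_count=3):
    """Alert when at least `min_count` candidate tweets are seen in the batch."""
    return sum(is_candidate(t) for t in tweets) >= min_count


# Hypothetical tweets arriving within a short time window.
tweets = [
    "Earthquake!",
    "whoa earthquake in napa",
    "Was that an earthquake??",
    "Read about the 1906 earthquake and how the city was rebuilt http://example.com/history",
]
alert = should_alert(tweets)
print(alert)
```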
Mining Social Media: Research and Applications
Not so successful – Google Flu Trends (no longer available)
● Using its massive query logs, Google analysed 50 million common queries entered weekly within the United States from 2003 to 2008
● The correlation between each query and the influenza-like illness (ILI) physician visit data held by the U.S. Centers for Disease Control and Prevention (CDC) was calculated
● The 45 queries with the strongest correlation were monitored to predict future flu outbreaks
● 97% accuracy on historical data, but it failed to predict outbreaks in 2011-2013
● The big problem is that most people who think they have “the flu” do not: many diseases can have ‘flu-like’ symptoms
THINK: what are the strengths and limitations of quantitative analysis vs. qualitative analysis?
Mining Social Media: Research and Applications
Not so successful – Predicting election results: a mixed picture, results are rather inconclusive
● Many studies have shown that sentiment analysis on Twitter during elections can predict election results
● However, an extensive study by Gayo-Avello et al. (2011) shows conflicting results
● Results predicted from social media need to be interpreted ‘with a pinch of salt’
○ User demographics on social media can be biased
○ Spammers and propagators
○ People who express opinions on social media vs. people who have the right to vote vs. people who are likely to vote
Mining Social Media: Your Assignment
What research will you do?
● It does not have to be ground-breaking…
● Start with a general idea, e.g., the use of social media for
personal health management
● Narrow it down to a specific problem, e.g., what are the
demographics of social media users that seek information
about a particular health problem, e.g., weight control,
sleeping, diabetes (or even a particular aspect of the
problem)
● Think about what data you need and how you will collect it (e.g., do you need the APIs? Do they support getting the data you want? If not, what will you do instead?)
● Remember, the process of finding truth is more valuable
than the outcome!
Mining Social Media: Issues and Challenges
Ethics – social media can be a minefield of ethical issues, as there are many misperceptions
● Public ≠ Ethical
○ Do you have ‘informed consent’ from the users that produced the data? This is a very difficult problem
● Anonymity ≠ Ethical
○ Does your research allow individuals to be identified from the data?
● Risk of harm
○ Will your research be used in an unwanted way…
○ … and will that put any individuals or groups in a
vulnerable situation (e.g., online policing)?
Refer to your PGT student resources for details. You must
read them or you could be at risk of failing some modules
(e.g., your dissertation) if you collect data!
Mining Social Media: Issues and Challenges
Scalability – social media data are growing exponentially
● Does your research/method require access to large amounts of data?
● How much data do you deal with today vs. tomorrow, vs. months later, vs. years later (growth)?
● Do you have the infrastructure and capacity to
○ Store these data
○ Query these data efficiently
○ Analyse these data
○ Deal with all the above in real-time – if that’s what you
need to do?
Mining Social Media: Issues and Challenges
Noise – social media data can contain a lot of noise, coming from, e.g.,
● Informal chat-speak-style content, misspellings, slang and slurs, and more…
○ ‘Government confirms blast n nuclear plants n
japan…don’t knw wht s gona happen nw…’
● Spammers and bots
○ 67% of Taylor Swift’s Twitter Followers are Bots
● Fake news
○ The dark fake news business: a journalist can be
discredited for $55,000; a 12-month political campaign
to change people’s opinions can be run for $400,000
○ New national security unit set up to tackle fake news in
UK
Summary
● Part I – Data mining: a brief introduction
■ Why data mining
■ What is data mining
■ The four views of data mining
■ Data mining examples
■ Data representation
● Part II – Data mining on social media
■ Data gathering
■ Research and applications on mining social media
■ Issues and challenges
References

  1. Cadle, J., Yeates, D. (2007). Project Management for Information Systems. 5th ed., Chapter 11. Harlow, UK: Pearson Prentice Hall. ISBN 9786611829780; ISBN 6611829784
  2. Pressman, R.S. (1987). Software Engineering: A Practitioner's Approach. London: McGraw-Hill Int. Editions. ISBN 0071002344
  3. Pawar, A., Kalavadekar, P., Tambe, S. (2014). A Survey on Outlier Detection Techniques for Credit Card Fraud Detection. IOSR Journal of Computer Engineering, Volume 16, Issue 2, Ver. VI, pp. 44-48
  4. Awoyemi, J., Adetunmbi, A., Oluwadare, S. (2017). Credit card fraud detection using machine learning techniques: A comparative analysis. Proceedings of the International Conference on Computing Networking and Informatics (ICCNI)
  5. Ferreira, K.J., Lee, B.H.A., Simchi-Levi, D. (2016). Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing Service Oper. Management, 18(1), pp. 69–88
  6. Milosevic, N. (2016). Equity Forecast: Predicting Long Term Stock Price Movement using Machine Learning. University of Manchester
  7. Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Betteridge, J., Carlson, A., Dalvi, B., Gardner, M., Kisiel, B., Krishnamurthy, J., Lao, N., Mazaitis, K., Mohamed, T., Nakashole, N., Platanios, E., Ritter, A., Samadi, M., Settles, B., Wang, R., Wijaya, D., Gupta, A., Chen, X., Saparov, A., Greaves, M., Welling, J. (2015). Never-ending learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15)
  8. Naruchitparames, J., Gunes, M.H., Louis, S.J. (2011). Friend recommendations in social networks using genetic algorithms and network topology. IEEE Congress on Evolutionary Computation (CEC), pp. 2207–2214
  9. Gayo-Avello, D., Metaxas, P.T., Mustafaraj, E. (2011). Limits of electoral predictions using Twitter. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM) 2011, July 17-21, 2011
