1. Introduction

What is data mining?

Semi-automatic procedures to find general and useful patterns in large data sets.

Applications

Approximate retrieval: Finding similar elements (similar songs, image search, plagiarism detection, copyright protection, etc.) in giant datasets
Supervised learning, such as large scale classification (of user behavior, of images, of text, etc.) and regression
Unsupervised learning, such as large scale clustering (search for groups of similar users, images, songs, articles, etc.) and dimension reduction
Recommender systems (bandit algorithms ($\epsilon$-greedy, UCB1, LinUCB, Hybrid LinUCB, etc.) and their applications in fields such as news article recommendation and adverting)
Others (monitoring transients in astronomy, spam filtering, fraud detection, machine translation, six degrees of ~~Kevin Bacon~~ separation etc.)

Scale

Example: 10-100 TB of data per sky survey (astronomy)
Archive sizes measured in petabytes
Real-time data flows (e.g. computing trends in social network)
Data sources
- science
- commercial/civil/engineering
- security/intelligence/defense

Technical aspects

Want to keep data in main memory as much as possible (faster)
If data don't fit in the main memory, we have to access it in a streaming fashion. Random access would be much too expensive, so we have to adapt our algorithms in order to learn from data streams.
Want real-time analytics
Want real-time synthesis
Want to leverage large-scale parallelism (across entire data centers)
Data quality often sucks (missing elements, missing elements represented as seemingly-present elements (null vs. "" vs. 0 vs. "\0" vs. undefined, etc.), inconsistent schema, etc.)
Need to respect users' privacy (control direct access to data.)

Not covered

Systems issues (databases, data center management, etc.)
Specialized data structures
Domain specific algorithms
- see Information retrieval course for more text-specific elements

MapReduce

Works well with commodity hardware in data centers (DCs)
Failure-tolerant (redundancy over DC)
Works with distributed file systems (e.g. Google GFS, HDFS, etc.), which are optimized for durability, frequent reads and appends, but rare updates
map(key, value) and reduce(key, values) (bread and butter; other operations exist); the default shuffler does a lot of the grunt work!
Partitions the input data, schedules program execution across a set of machines, handles machine failures, and manages inter-machine communication
A job's output is often another job's input; many tools support multi-stage pipelines natively (Cascading, Scalding, Spark, MS DryadLINQ, etc.)
Stragglers (last few remaining reducers) $\implies$ spawn multiple copies of job and take the result of whoever finishes first
Hadoop is the most common MapReduce implementation; relies a lot on disk access $\implies$ slow; Spark offers massive speedups by relying less on disk access

Trick to compute variance in one pass: use formula based on expectation ($\mathbb{V}ar(X) = \hat{\mathbb{E}}[X^2] - \hat{\mathbb{E}}[X]^2$).

GPGPUs can also offer massive speed-ups when used right. They are not covered in this course, but are very widely used for algorithms requiring heavy number-crunching (many vector/matrix operations).

Statistical limits and Bonferroni's Principle

Context

2002, Post 9/11 USA, The Patriot Act $\implies$ the Total Information Awareness project (eventually killed by Congress, allegedly)
Problem: when viewing so much data, almost everything could be seen as suspicious. It depends just on how narrowly you define the activities that you look for.
One can find certain event types or patterns even in completely random data
Avoid this fallacy using Bonferroni's principle
See first chapter in textbook for more details CITE

The principle

first calculate the $\mathbb{E}[\text{nr. occurrences}]$, assuming the data is purely random
if this number is significantly larger than the actual number of instances we hope to find, then any result would be bogus

An example

Assume:
- evil doers exist, and gather periodically at a hotel
- 1 billion suspects
- everyone goes to the hotel once every 100 days
- each hotel has 100 spots; there are 100000 hotels
- will examine hotel records for 1000 days
Seek:
- pairs of people who were both at the same hotel on two different days (the two hotels don't have to be the same on the different days)
Apply principle:
- assume no evil-doers ("random" data)
- $P(\text{visit a hotel}) = 0.01$
- $P(\text{2 people visit a hotel}) = (0.01)^2 = 0.0001 = 10^{-4}$
- $P(\text{2 people same hotel}) = P(\text{2 people visit a hotel}) \times \frac{1}{\text{#hotels}} = 10^{-4} \times 10^{-5} = 10^{-9}$
- $P(\text{2 people same hotel on 2 days}) = \left( 10^{-9} \right)^{2} = 10^{-18}$
- Approximate the number of different events (#hotel pairs times #person pairs). It's $5 \times 10^{17} \times 5 \times 10^5$.
- #suspicious events is the above number times the probability such an event is something we're looking for.
- Our result is 250k suspicious events. Under the assumption that there are no terrorists.
- If there really are 10 pairs of evil-doers, the police would still need to investigate ~500k people first
Conclusion:
- we are limited in our ability to mine data for features that are not sufficiently rare in practice

Miscellaneous

Power law: a linear relationship between the logarithms of the variables
Forms: $\log{y} = b + a\log{x} \iff y = cx^a$, for some constants $a$, $b$, and $c$.

References

TODO make this work nicely.



In [ ]: