# 1. Introduction

## What is data mining?

• Semi-automatic procedures to find general and useful patterns in large data sets.

## Applications

• Approximate retrieval: Finding similar elements (similar songs, image search, plagiarism detection, copyright protection, etc.) in giant datasets
• Supervised learning, such as large scale classification (of user behavior, of images, of text, etc.) and regression
• Unsupervised learning, such as large scale clustering (search for groups of similar users, images, songs, articles, etc.) and dimension reduction
• Recommender systems (bandit algorithms ($\epsilon$-greedy, UCB1, LinUCB, Hybrid LinUCB, etc.) and their applications in fields such as news article recommendation and advertising)
• Others (monitoring transients in astronomy, spam filtering, fraud detection, machine translation, "six degrees of Kevin Bacon" separation, etc.)

## Scale

• Example: 10-100 TB of data per sky survey (astronomy)
• Archive sizes measured in petabytes
• Real-time data flows (e.g. computing trends in a social network)
• Data sources
• science
• commercial/civil/engineering
• security/intelligence/defense

## Technical aspects

• Want to keep data in main memory as much as possible (faster)
• If the data don't fit in main memory, we have to access them in a streaming fashion. Random access would be far too expensive, so we have to adapt our algorithms to learn from data streams.
• Want real-time analytics
• Want real-time synthesis
• Want to leverage large-scale parallelism (across entire data centers)
• Data quality is often poor (missing elements, missing elements represented as seemingly-present ones (null vs. "" vs. 0 vs. "\0" vs. undefined, etc.), inconsistent schemas, etc.)

## Not covered

• Systems issues (databases, data center management, etc.)
• Specialized data structures
• Domain specific algorithms
• see Information retrieval course for more text-specific elements

## MapReduce

• Works well with commodity hardware in data centers (DCs)
• Failure-tolerant (redundancy over DC)
• Works with distributed file systems (e.g. Google GFS, HDFS, etc.), which are optimized for durability, frequent reads and appends, but rare updates
• map(key, value) and reduce(key, values) (bread and butter; other operations exist); the default shuffler does a lot of the grunt work!
• Partitions the input data, schedules program execution across a set of machines, handles machine failures, and manages inter-machine communication
• A job's output is often another job's input; many tools support multi-stage pipelines natively (Cascading, Scalding, Spark, MS DryadLINQ, etc.)
• Stragglers (the last few remaining tasks) $\implies$ speculatively spawn multiple copies of each remaining task and take the result of whichever finishes first
• Hadoop is the most common MapReduce implementation; relies a lot on disk access $\implies$ slow; Spark offers massive speedups by relying less on disk access
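The map/reduce model above can be sketched in plain Python. This is a minimal local simulation (the function names `map_fn`, `reduce_fn`, and `map_reduce` are illustrative, not a real framework's API); the grouping step stands in for the default shuffler, while a real cluster framework additionally handles partitioning, scheduling, and fault tolerance:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document id, value: document text; emit (word, 1) pairs
    for word in value.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # key: word, values: all counts emitted for that word
    yield (key, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    # "Shuffle": group all mapped values by key, which is what the
    # default shuffler does between the map and reduce phases.
    groups = defaultdict(list)
    for k, v in inputs:
        for mk, mv in map_fn(k, v):
            groups[mk].append(mv)
    out = []
    for k, vs in sorted(groups.items()):
        out.extend(reduce_fn(k, vs))
    return out

docs = [(1, "to be or not to be"), (2, "to do")]
print(map_reduce(docs, map_fn, reduce_fn))
# → [('be', 2), ('do', 1), ('not', 1), ('or', 1), ('to', 3)]
```

Note that `reduce_fn` only ever sees the values for a single key, which is exactly what lets the reduce phase run in parallel across machines.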

Trick to compute the variance in one pass: keep running sums of $x$ and $x^2$ and use the formula based on expectations ($\mathbb{V}ar(X) = \hat{\mathbb{E}}[X^2] - \hat{\mathbb{E}}[X]^2$).
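A minimal sketch of this one-pass trick (the function name is illustrative). Note that the $\mathbb{E}[X^2] - \mathbb{E}[X]^2$ form can suffer catastrophic cancellation when the mean is large relative to the spread; Welford's online algorithm is a numerically safer alternative:

```python
def one_pass_variance(stream):
    # Accumulate n, sum(x), and sum(x^2) in a single pass,
    # then apply Var(X) = E[X^2] - E[X]^2.
    n = s = s2 = 0
    for x in stream:
        n += 1
        s += x
        s2 += x * x
    mean = s / n
    return s2 / n - mean * mean

print(one_pass_variance([1.0, 2.0, 3.0, 4.0]))  # → 1.25
```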

GPGPUs can also offer massive speed-ups when used right. They are not covered in this course, but are very widely used for algorithms requiring heavy number-crunching (many vector/matrix operations).

## Statistical limits and Bonferroni's Principle

### Context

• 2002, Post 9/11 USA, The Patriot Act $\implies$ the Total Information Awareness project (eventually killed by Congress, allegedly)
• Problem: when viewing so much data, almost everything could be seen as suspicious. It depends just on how narrowly you define the activities that you look for.
• One can find certain event types or patterns even in completely random data
• Avoid this fallacy using Bonferroni's principle
• See the first chapter of the textbook for more details CITE

### The principle

• first calculate the $\mathbb{E}[\text{nr. occurrences}]$, assuming the data is purely random
• if this number is significantly larger than the actual number of instances we hope to find, then almost any result will be bogus (false positives dominate)

### An example

• Assume:
• evil doers exist, and gather periodically at a hotel
• 1 billion suspects
• everyone goes to the hotel once every 100 days
• each hotel has 100 spots; there are 100000 hotels
• will examine hotel records for 1000 days
• Seek:
• pairs of people who were both at the same hotel on two different days (the two hotels don't have to be the same on the different days)
• Apply principle:
• assume no evil-doers ("random" data)
• $P(\text{visit a hotel}) = 0.01$
• $P(\text{2 people visit a hotel}) = (0.01)^2 = 0.0001 = 10^{-4}$
• $P(\text{2 people same hotel}) = P(\text{2 people visit a hotel}) \times \frac{1}{\text{#hotels}} = 10^{-4} \times 10^{-5} = 10^{-9}$
• $P(\text{2 people same hotel on 2 days}) = \left( 10^{-9} \right)^{2} = 10^{-18}$
• Approximate the number of different events: #person pairs times #day pairs, i.e. $\binom{10^9}{2} \times \binom{1000}{2} \approx 5 \times 10^{17} \times 5 \times 10^{5} = 2.5 \times 10^{23}$.
• #suspicious events is the above number times the probability that such an event occurs by chance ($10^{-18}$).
• Our result is $2.5 \times 10^{23} \times 10^{-18} = 250\text{k}$ suspicious events, under the assumption that there are no terrorists at all.
• If there really are 10 pairs of evil-doers, the police would still need to investigate ~500k people first
• Conclusion:
• we are limited in our ability to mine data for features that are not sufficiently rare in practice
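The arithmetic in the example can be checked directly (variable names are illustrative; the numbers come from the assumptions above):

```python
from math import comb  # exact binomial coefficients (Python 3.8+)

people = 10**9   # number of suspects
hotels = 10**5   # number of hotels
days = 1000      # days of records examined
p_visit = 0.01   # P(a given person visits a hotel on a given day)

# P(two given people are at the same hotel on one given day)
p_same = (p_visit ** 2) / hotels          # 1e-9
# P(that happens on two given days)
p_same_two_days = p_same ** 2             # 1e-18

person_pairs = comb(people, 2)            # ~5e17
day_pairs = comb(days, 2)                 # ~5e5

expected = person_pairs * day_pairs * p_same_two_days
print(round(expected))  # ≈ 250,000 suspicious pairs under pure chance
```

The exact value is slightly below 250,000 because $\binom{n}{2} = n(n-1)/2$ rather than $n^2/2$, but the conclusion is unchanged.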

## Miscellaneous

• Power law: a linear relationship between the logarithms of the variables
• Forms: $\log{y} = b + a\log{x} \iff y = cx^a$, with $c = 10^b$ for base-10 logarithms ($c = e^b$ for natural logarithms).
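A quick numeric sanity check of this equivalence, using illustrative values for the exponent and prefactor and base-10 logarithms (so $c = 10^b$):

```python
import math

a, c = 2.5, 3.0            # exponent and prefactor (illustrative values)
b = math.log10(c)          # intercept in log-log space

for x in [1.0, 10.0, 250.0]:
    y = c * x ** a                         # power-law form
    log_y = b + a * math.log10(x)          # linear form in log-log space
    assert math.isclose(math.log10(y), log_y)
print("both forms agree")
```

This is why power laws show up as straight lines on log-log plots: the slope gives $a$ and the intercept gives $\log c$.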

## References

TODO make this work nicely.


