1. Introduction

What is data mining?

  • Semi-automatic procedures to find general and useful patterns in large data sets.


  • Approximate retrieval: Finding similar elements (similar songs, image search, plagiarism detection, copyright protection, etc.) in giant datasets
  • Supervised learning, such as large scale classification (of user behavior, of images, of text, etc.) and regression
  • Unsupervised learning, such as large scale clustering (search for groups of similar users, images, songs, articles, etc.) and dimension reduction
  • Recommender systems: bandit algorithms ($\epsilon$-greedy, UCB1, LinUCB, Hybrid LinUCB, etc.) and their applications in fields such as news article recommendation and advertising
  • Others (monitoring transients in astronomy, spam filtering, fraud detection, machine translation, six degrees of separation from Kevin Bacon, etc.)


  • Example: 10-100 TB of data per sky survey (astronomy)
  • Archive sizes measured in petabytes
  • Real-time data flows (e.g. computing trends in a social network)
  • Data sources
    • science
    • commercial/civil/engineering
    • security/intelligence/defense

Technical aspects

  • Want to keep data in main memory as much as possible (faster)
  • If the data does not fit in main memory, we have to access it in a streaming fashion. Random access to disk would be much too expensive, so we have to adapt our algorithms to learn from data streams.
  • Want real-time analytics
  • Want real-time synthesis
  • Want to leverage large-scale parallelism (across entire data centers)
  • Data quality is often poor: missing elements, missing elements represented as seemingly-present elements (null vs. "" vs. 0 vs. "\0" vs. undefined, etc.), inconsistent schemas, etc.
  • Need to respect users' privacy (e.g. by controlling direct access to the data)
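
As an illustration of the data quality point above, here is a minimal sketch of normalizing the different spellings of "missing" into a single `None`. The function name `normalize_missing` and the set `MISSING_STRINGS` are hypothetical choices for this example, not something from the course material. Note that `0` is deliberately left alone, since it may be a legitimate value.

```python
import math

# Assumption: these string spellings all mean "missing" in our dataset.
# The right set depends on the data source and has to be checked per dataset.
MISSING_STRINGS = {"", "\0", "null", "none", "undefined", "nan", "n/a"}


def normalize_missing(value):
    """Return None for values that encode 'missing', else the value unchanged."""
    if value is None:
        return None
    # NaN is a common stand-in for missing numeric values.
    if isinstance(value, float) and math.isnan(value):
        return None
    # Compare strings case-insensitively, ignoring surrounding whitespace.
    if isinstance(value, str) and value.strip().lower() in MISSING_STRINGS:
        return None
    # 0 and other "falsy" numbers are kept: they may be real measurements.
    return value


raw_row = ["alice", "", "NULL", 0, "\0", 3.5]
clean_row = [normalize_missing(v) for v in raw_row]
print(clean_row)
```

Whether `0` should count as missing is exactly the kind of ambiguity that can only be resolved by knowing how the data was produced.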

Not covered

  • Systems issues (databases, data center management, etc.)
  • Specialized data structures
  • Domain specific algorithms
    • see Information retrieval course for more text-specific elements


MapReduce

  • Works well with commodity hardware in data centers (DCs)
  • Failure-tolerant (redundancy over DC)
  • Works with distributed file systems (e.g. Google GFS, HDFS, etc.), which are optimized for durability, frequent reads and appends, but rare updates
  • map(key, value) and reduce(key, values) (bread and butter; other operations exist); the default shuffler does a lot of the grunt work!
  • Partitions the input data, schedules program execution across a set of machines, handles machine failures, and manages inter-machine communication
  • A job's output is often another job's input; many tools support multi-stage pipelines natively (Cascading, Scalding, Spark, MS DryadLINQ, etc.)
  • Stragglers (the last few remaining reducers, which hold up the whole job) $\implies$ spawn multiple copies of the remaining tasks and take the result of whichever copy finishes first
  • Hadoop is the most common MapReduce implementation; it relies heavily on disk access $\implies$ slow; Spark offers massive speedups by keeping more of the intermediate data in memory
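
To make the map/shuffle/reduce structure concrete, here is a toy single-machine sketch of word counting. It only mimics the programming model: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and nothing here is the Hadoop or Spark API. A real framework would also partition the input, distribute the work, and handle failures.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document id (unused here); value: document text.
    Emits one (word, 1) pair per word occurrence."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word; values: all counts emitted for that word."""
    yield key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy in-memory driver: map, then shuffle (group by key), then reduce."""
    # Map phase: apply map_fn to every (key, value) input record.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: group all intermediate values by their key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce phase: apply reduce_fn once per distinct key.
    output = {}
    for k in sorted(groups):
        for out_key, out_value in reduce_fn(k, groups[k]):
            output[out_key] = out_value
    return output


docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The user only writes `map_fn` and `reduce_fn`; the grouping step in the middle is the part the framework's shuffler takes care of.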

Trick to compute the variance in one pass: accumulate $\sum_i x_i$ and $\sum_i x_i^2$ while streaming over the data, then use the formula based on empirical expectations ($\mathbb{V}ar(X) = \hat{\mathbb{E}}[X^2] - \hat{\mathbb{E}}[X]^2$).
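
A minimal sketch of the one-pass variance trick, assuming the data arrives as any Python iterable of numbers. The function name `streaming_variance` is made up for this example. It returns the population variance (dividing by $n$), matching the formula above.

```python
def streaming_variance(stream):
    """One-pass population variance: E[X^2] - E[X]^2 from running sums."""
    n = 0
    total = 0.0      # running sum of x
    total_sq = 0.0   # running sum of x^2
    for x in stream:
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("variance of an empty stream is undefined")
    mean = total / n
    return total_sq / n - mean * mean


# Works on a generator too, so the data never has to sit in memory at once.
data = (x for x in [2, 4, 4, 4, 5, 5, 7, 9])
print(streaming_variance(data))
```

One caveat: when the mean is large compared to the spread, subtracting two nearly equal numbers can lose floating-point precision. Welford's online algorithm is a commonly used, numerically more stable alternative.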

GPGPUs can also offer massive speed-ups when used right. They are not covered in this course, but are very widely used for algorithms requiring heavy number-crunching (many vector/matrix operations).

Statistical limits and Bonferroni's Principle


  • 2002, Post 9/11 USA, The Patriot Act $\implies$ the Total Information Awareness project (eventually killed by Congress, allegedly)
  • Problem: when viewing so much data, almost everything could be seen as suspicious. It depends just on how narrowly you define the activities that you look for.
  • One can find certain event types or patterns even in completely random data
  • Avoid this fallacy using Bonferroni's principle
  • See first chapter in textbook for more details CITE

The principle

  • first calculate $\mathbb{E}[\text{nr. occurrences}]$ of the events you are looking for, assuming the data is purely random
  • if this number is significantly larger than the number of real instances we hope to find, then almost anything we find will be bogus (a statistical artifact rather than evidence)

An example

  • Assume:
    • evil doers exist, and gather periodically at a hotel
    • 1 billion suspects
    • everyone goes to a hotel once every 100 days
    • each hotel has 100 spots; there are 100000 hotels
    • will examine hotel records for 1000 days
  • Seek:
    • pairs of people who were both at the same hotel on two different days (the hotel on the first day does not have to be the same as the hotel on the second day)
  • Apply principle:
    • assume no evil-doers ("random" data)
    • $P(\text{visit a hotel}) = 0.01$
    • $P(\text{2 people visit a hotel}) = (0.01)^2 = 0.0001 = 10^{-4}$
    • $P(\text{2 people same hotel}) = P(\text{2 people visit a hotel}) \times \frac{1}{\text{#hotels}} = 10^{-4} \times 10^{-5} = 10^{-9}$
    • $P(\text{2 people same hotel on 2 days}) = \left( 10^{-9} \right)^{2} = 10^{-18}$
    • Approximate the number of different events (#person pairs times #day pairs). It's $\binom{10^9}{2} \times \binom{1000}{2} \approx 5 \times 10^{17} \times 5 \times 10^5$.
    • The expected #suspicious events is the above number times the probability that such an event is something we're looking for: $5 \times 10^{17} \times 5 \times 10^5 \times 10^{-18} = 2.5 \times 10^5$.
    • Our result is 250k suspicious pairs, under the assumption that there are no evil-doers at all.
    • If there really are 10 pairs of evil-doers, the police would still need to investigate ~500k people (250k pairs) to find them
  • Conclusion:
    • we are limited in our ability to mine data for features that are not sufficiently rare in practice
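
The arithmetic of the hotel example can be checked with a few lines of Python. This is only a sketch under the example's own assumptions (independent, uniformly random hotel visits); the variable names are chosen for this illustration. It uses exact binomial coefficients instead of the $n^2/2$ approximation, so the result comes out slightly below 250k.

```python
from math import comb

n_people = 10**9   # suspects
n_days = 1000      # days of hotel records examined
n_hotels = 10**5   # hotels
p_visit = 0.01     # P(a given person is in some hotel on a given day)

# P(two given people are in the same hotel on one given day)
p_same_hotel_one_day = p_visit**2 / n_hotels

# P(that happens on two given days); the days are assumed independent
p_same_hotel_two_days = p_same_hotel_one_day**2

person_pairs = comb(n_people, 2)  # roughly 5e17
day_pairs = comb(n_days, 2)       # roughly 5e5

# Expected number of "suspicious" pairs in purely random data
expected_pairs = person_pairs * day_pairs * p_same_hotel_two_days
print(f"expected suspicious pairs: {expected_pairs:,.0f}")
```

Changing `n_days` or `n_hotels` shows how quickly the number of false positives grows or shrinks with the definition of "suspicious".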


  • Power law: a linear relationship between the logarithms of the variables
  • Forms: $\log{y} = b + a\log{x} \iff y = cx^a$, for some constants $a$, $b$, and $c$, where $c = e^b$ when $\log$ is the natural logarithm.
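
Because a power law is a straight line in log-log space, its parameters can be estimated with ordinary least squares on $(\log x, \log y)$. The sketch below assumes strictly positive data and uses synthetic, noise-free points generated from $y = 3x^2$. The function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**a by least squares on (log x, log y).
    Requires all xs and ys to be strictly positive."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the regression line log y = b + a * log x
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, math.exp(b)  # c = e^b


# Synthetic data from y = 3 * x^2 (an assumption made for this illustration)
xs = [1, 2, 4, 8, 16]
ys = [3 * x**2 for x in xs]
a, c = fit_power_law(xs, ys)
print(f"a = {a:.3f}, c = {c:.3f}")
```

On real, noisy data this simple log-log regression is only a rough first estimate.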


