MapReduce
- Works well with commodity hardware in data centers (DCs)
- Failure-tolerant (redundancy across the DC)
- Works with distributed file systems (e.g. Google GFS, HDFS, etc.), which are optimized for durability, frequent reads and appends, but rare updates
- Programming model: `map(key, value)` and `reduce(key, values)` (the bread and butter; other operations exist); the default shuffler does a lot of the grunt work!
- Partitions the input data, schedules program execution across a set of machines, handles machine failures, and manages inter-machine communication
- A job's output is often another job's input; many tools support multi-stage pipelines natively (Cascading, Scalding, Spark, MS DryadLINQ, etc.)
- Stragglers (the last few slow tasks, e.g. remaining reducers) $\implies$ speculatively spawn backup copies of those tasks and take the result of whichever copy finishes first
- Hadoop is the most common MapReduce implementation; relies a lot on disk access $\implies$ slow; Spark offers massive speedups by relying less on disk access
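The `map(key, value)` / `reduce(key, values)` model above can be illustrated with the classic word-count job. The sketch below is a minimal single-machine toy, not a real distributed implementation; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for illustration, and the grouping dictionary stands in for the shuffle phase.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document id (unused here), value: document text.
    # Emit one (word, 1) pair per occurrence.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word.
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy driver: the shuffle step is simulated by grouping
    # intermediate pairs by key before the reduce phase.
    intermediate = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            intermediate[ik].append(iv)
    out = {}
    for k in sorted(intermediate):
        for rk, rv in reduce_fn(k, intermediate[k]):
            out[rk] = rv
    return out
```

In a real deployment the framework partitions `inputs` across machines, runs mappers and reducers in parallel, and the shuffler sorts/groups intermediate pairs by key between the two phases.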
Trick to compute variance in one pass: use the identity $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$, estimating both expectations empirically while streaming over the data.
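As a concrete sketch of the one-pass trick (the function name is made up for illustration): accumulate the count, the sum, and the sum of squares in a single scan, then apply the identity at the end.

```python
def one_pass_variance(xs):
    # Single pass: accumulate count, sum, and sum of squares.
    n, s, sq = 0, 0.0, 0.0
    for x in xs:
        n += 1
        s += x
        sq += x * x
    mean = s / n
    # Var(X) = E[X^2] - E[X]^2 (population variance).
    return sq / n - mean * mean
```

Note that this formula can suffer catastrophic cancellation when the variance is tiny relative to the mean; numerically safer one-pass alternatives (e.g. Welford's algorithm) exist, but the identity above maps cleanly onto a single MapReduce aggregation of $(n, \sum x, \sum x^2)$.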
GPGPUs can also offer massive speed-ups when used right. They are not covered in this course, but are very widely used for algorithms requiring heavy number-crunching (many vector/matrix operations).