RDD Caching :- https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
RDD Transformations :- https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-transformations.html
RDD Shuffle :- https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-shuffle.html
DAGScheduler :- https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-dagscheduler.html
Have you ever noticed your console during heavy data processing? You might have seen something like this: <br />
[Stage 2:=======> (143 + 20) / 1000]<br />
So what's a stage? Wide dependencies mark stage boundaries, since they require a shuffle, while chains of narrow dependencies are pipelined together into a single stage (the sketch below shows both kinds).
You can inspect the stages for the tasks executed in your Spark session at: <br />
http://localhost:4040/stages/
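To make that concrete, here is a minimal sketch in Scala (the app name and data are made up for illustration). The `map` and `filter` below are narrow and get pipelined into one stage, while `reduceByKey` needs a shuffle and therefore starts a new stage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("StageDemo").setMaster("local[*]"))

    val words = sc.parallelize(Seq("spark", "stage", "spark", "shuffle", "stage"))

    // Narrow dependencies: map and filter are pipelined into a single stage.
    val pairs = words.map(w => (w, 1)).filter(_._1.nonEmpty)

    // Wide dependency: reduceByKey repartitions by key (a shuffle),
    // so the DAGScheduler cuts a new stage here.
    val counts = pairs.reduceByKey(_ + _)

    // The indentation steps in toDebugString mark the shuffle boundaries.
    println(counts.toDebugString)
    counts.collect().foreach(println)

    sc.stop()
  }
}
```

If you open http://localhost:4040/stages/ while the job is still running, you should see those two stages listed there.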
For example, some background reading on partitioning:
Hash Partitioner :- http://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work
Hash Tables :- (mute the intro, roughly the first 10 seconds) https://www.youtube.com/watch?v=h2d9b_nEzoA
Wikipedia :- https://en.wikipedia.org/wiki/Partition_(database)
Custom Partition Example :- http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where
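Tying those links together, here is a hedged sketch of a custom partitioner in Scala. The routing rule is invented purely for illustration: keys whose string form starts with "a" always land in partition 0, and everything else is hash-spread over the remaining partitions.

```scala
import org.apache.spark.Partitioner

// Toy custom partitioner; the routing rule is made up for illustration.
class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {
  require(numPartitions >= 2, "need at least two partitions")

  override def getPartition(key: Any): Int = {
    val k = key.toString
    if (k.startsWith("a")) 0
    else {
      val rest = numPartitions - 1
      1 + ((k.hashCode % rest) + rest) % rest // non-negative modulo
    }
  }

  // Equal partitioners let Spark skip unnecessary shuffles.
  override def equals(other: Any): Boolean = other match {
    case p: FirstLetterPartitioner => p.numPartitions == numPartitions
    case _                         => false
  }
  override def hashCode: Int = numPartitions
}

// Usage (assuming a pair RDD `pairs`):
//   pairs.partitionBy(new FirstLetterPartitioner(4))
```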
http://stackoverflow.com/questions/36636/what-is-a-closure
https://en.wikipedia.org/wiki/Closure_(computer_programming)
A simple example of a closure (shown in Scala rather than C++, to keep one language throughout these notes):
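This is a minimal sketch, plain Scala with no Spark required; `makeCounter` is a made-up helper whose returned function captures the local variable `count`:

```scala
object ClosureDemo extends App {
  // makeCounter returns a function that closes over the local variable `count`:
  // the variable survives the call because the returned closure captures it.
  def makeCounter(): () => Int = {
    var count = 0
    () => { count += 1; count }
  }

  val counter = makeCounter()
  println(counter()) // 1
  println(counter()) // 2
}
```

This is also why Spark's programming guide (linked below) covers closures: the variables your task functions capture are serialized and shipped to executors, so mutating a captured variable on an executor does not update the driver's copy.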
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
Some known errors :- http://stackoverflow.com/questions/21138751/spark-java-lang-outofmemoryerror-java-heap-space
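As a quick sketch of the persistence API (spark-shell style, assuming an existing SparkContext `sc`; the input path is hypothetical): MEMORY_AND_DISK spills partitions that don't fit on the heap instead of dropping them, which is one of the usual first steps when chasing the heap-space error above.

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///logs/app.log") // hypothetical path

// MEMORY_AND_DISK keeps partitions in memory when possible and spills the
// rest to disk, unlike the default MEMORY_ONLY, which recomputes them.
val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_AND_DISK)

println(errors.count()) // first action computes and caches the RDD
println(errors.count()) // second action is served from the cache

errors.unpersist()      // release the storage when done
```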
Remember the DAG in Spark? Refer to page 7 of:
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
When a job is submitted, the DAG scheduler splits it into stages, and you can inspect them visually in the Spark Web UI.
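To see what the DAG scheduler works from, here is a small sketch (again spark-shell style, assuming an existing SparkContext `sc`): every RDD exposes its parent dependencies, and a ShuffleDependency is exactly where a stage boundary gets cut.

```scala
val nums   = sc.parallelize(1 to 100)
val mapped = nums.map(n => (n % 10, n))
val summed = mapped.reduceByKey(_ + _)

println(mapped.dependencies) // OneToOneDependency: narrow, same stage
println(summed.dependencies) // ShuffleDependency: wide, new stage
```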
If you're interested in creating a custom scheduler, you can start from here:
http://stackoverflow.com/questions/39471601/how-to-create-a-custom-apache-spark-scheduler