Previous | Next --- Slide 41 of 49

How are lineages stored and how are all of the nodes' changes compiled into a single lineage (assuming there's a single shared data structure)? Or does each node keep track of its own independent lineage?

sagoyal

@tp I believe this this is another case of the difference between abstraction and actual implementation. I think that as a programmer it helps to think that Spark traces a lineage across the RDD's however during actual compile time there is no code that stores the lineage. Only the movement of functions is stored. I think this Stack Overflow Link does a good job explaining this.

haofeng

Spark logs sequences of read-only transformations as lineages for fault-proof multi-node execution. Because of the read-only nature of the RDD, the data can be reconstructed deterministically when a single machine fails. Because the RDD transformations are batched, the logging overhead becomes minimal. I'm wondering if certain node fails in a wide-dependency graph, will the other relevant node put to halt to wait for the node's results or everything will start from the very beginning?

suninhouse

Lineages allow Spark to be able to recover from errors and restart. Nodes with partition of the data necessary can participate in the recovering work, i.e., to work from the lineage again and recover the steps that previously had errors.

Please log in to leave a comment.