Previous | Next --- Slide 29 of 49

haiyuem

Are RDDs more like sequences than arrays? Seems like we can only get to the next RDD by processing the previous one.

I feel like the point lies in that order is not important - unlike arrays, we don't need to process the records one by one in order.

dishpanda

@haiyuem Rather than an array, a set or collection is probably a better way of thinking about it...

lfu

@haiyuem Kayvon mentioned this in lecture (as a written answer / chat comment) that RDD is the Spark name for a "sequence" from the prior lecture. Also that "Spark transformations are data parallel operators that produce and consume Spark RDDs."

anon33

My understanding is that the word "sequence" may itself be slightly misleading because it's not necessary that only sequential access is required, only that it supports an interface with a certain set of functions like map, reduce, fold, groupByKey, etc. In Java I believe this is called a stream. Perhaps it is best to think of RDDs as the Spark instantiation of a notion of a general data structure whose members are ordered and can only be accessed via these specific data-parallel operations.

felixw17

I agree with @anon33 that RDDs can be thought of abstractly as a data structure with those two properties. However, it seems to me that this is a pretty common abstraction - my understanding is that one of the key advantages of Spark is in the implementation details of this abstraction (namely, that RDDs can be handled in memory in a distributed and fault-tolerant manner, which therefore makes computations a lot faster and avoids going to a DFS like MapReduce might have to).

Please log in to leave a comment.