Persistence and storage levels


Official Website: http://bigdataelearning.com

What does it mean to persist an RDD?
RDD persistence is an important capability of Spark. Persisting an RDD means that its data is stored (in memory, by default) and reused whenever subsequent actions need it. Why the phrase “subsequent actions”? Because persistence is lazy: the first action that computes the RDD creates it and stores it. Subsequent actions read the stored RDD instead of recomputing it.
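The "first action computes, later actions reuse" behaviour can be sketched with a toy model in plain Python. This is not Spark's API: `ToyRDD`, `compute_count`, and `expensive` are hypothetical names made up for illustration, and the sketch only mimics the idea of lazy evaluation plus an optional in-memory copy.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: lazy, recomputed per action unless persisted."""

    def __init__(self, compute):
        self._compute = compute    # function that produces the data
        self._persist = False      # set by persist(); nothing is stored yet
        self._cached = None        # filled in by the first action, if persisted
        self.compute_count = 0     # how many times the data was really computed

    def persist(self):
        # Only marks the dataset; the data is stored when the first action runs.
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached            # reuse the stored copy
        self.compute_count += 1
        data = list(self._compute())
        if self._persist:
            self._cached = data            # keep it for subsequent actions
        return data

    # Two "actions" that force the data to be produced.
    def count(self):
        return len(self._materialize())

    def sum(self):
        return sum(self._materialize())


def expensive():
    # Placeholder for a costly computation.
    return (x * x for x in range(5))


plain = ToyRDD(expensive)
plain.count()
plain.sum()

persisted = ToyRDD(expensive).persist()
persisted.count()
persisted.sum()

print(plain.compute_count, persisted.compute_count)  # prints: 2 1
```

Without `persist()`, each action recomputes the data; with it, only the first action does.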

Why persist an RDD?

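⮚ Since a persisted RDD is not recomputed and can be fetched directly from memory, subsequent actions on it run much faster.
⮚ Persistence is especially useful when an RDD is expensive to compute, since it is computed once and then reused. Note that persisting does not by itself prevent recomputation after a node failure: if a cached partition is lost, Spark recomputes it from the RDD’s lineage. The replicated storage levels (e.g. MEMORY_ONLY_2) keep a copy of each partition on a second node, so a lost partition does not have to be recomputed.
⮚ RDD persistence is well suited to iterative algorithms and interactive use, where the same dataset is accessed repeatedly.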

Persistence levels
An RDD can be persisted at different storage levels.
1. MEMORY_ONLY - The RDD is stored in memory as deserialized objects. If the entire RDD doesn’t fit in memory, the partitions that don’t fit are not cached and are recomputed on the fly each time they are needed. This is the level used by the cache() method. In other words, rdd1.cache() is the same as rdd1.persist(StorageLevel.MEMORY_ONLY)

2. MEMORY_AND_DISK - The RDD is stored in memory as deserialized objects, and the partitions that don’t fit in memory are stored on disk and read from there when needed. E.g. rdd1.persist(StorageLevel.MEMORY_AND_DISK)

3. MEMORY_ONLY_SER - Much like MEMORY_ONLY, but the RDD is stored in memory as serialized objects (available in the Java and Scala APIs). This is generally more space-efficient than deserialized objects, at the cost of extra CPU time to deserialize the data when it is read. E.g. rdd1.persist(StorageLevel.MEMORY_ONLY_SER)

4. MEMORY_AND_DISK_SER - Much like MEMORY_AND_DISK, but the RDD is stored as serialized objects. Partitions that don’t fit in memory are spilled to disk. E.g. rdd1.persist(StorageLevel.MEMORY_AND_DISK_SER)

5. DISK_ONLY - The RDD partitions are stored only on disk. E.g. rdd1.persist(StorageLevel.DISK_ONLY)
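The practical difference between MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY shows up when the data doesn’t fit in memory. The following is a toy simulation in plain Python, not Spark code: the level names are borrowed from Spark, but `ToyBlockStore`, `memory_slots`, `compute_partition`, and `run_two_actions` are hypothetical, and memory is modelled crudely as a fixed number of partition slots (real Spark accounts for memory in bytes and can evict partitions, which is not modelled here). The serialized (_SER) levels are not modelled either.

```python
import pickle
import tempfile
from pathlib import Path

# Names borrowed from Spark's storage levels; everything else is hypothetical.
MEMORY_ONLY = "MEMORY_ONLY"
MEMORY_AND_DISK = "MEMORY_AND_DISK"
DISK_ONLY = "DISK_ONLY"


class ToyBlockStore:
    """Stores partitions under one level, with room for `memory_slots` partitions in memory."""

    def __init__(self, level, memory_slots, disk_dir, compute_partition):
        self.level = level
        self.memory_slots = memory_slots
        self.disk_dir = disk_dir
        self.compute_partition = compute_partition
        self.memory = {}
        self.computed = []  # partition ids, in the order they were (re)computed

    def get(self, pid):
        if pid in self.memory:
            return self.memory[pid]                 # served from memory
        path = self.disk_dir / f"part-{pid}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())  # served from disk
        # Stored nowhere: compute it now.
        self.computed.append(pid)
        data = self.compute_partition(pid)
        if self.level != DISK_ONLY and len(self.memory) < self.memory_slots:
            self.memory[pid] = data
        elif self.level in (MEMORY_AND_DISK, DISK_ONLY):
            path.write_bytes(pickle.dumps(data))
        # MEMORY_ONLY with no free slot: the partition is simply not stored.
        return data


def compute_partition(pid):
    # Placeholder for an expensive per-partition computation.
    return [pid * 10 + i for i in range(3)]


def run_two_actions(level):
    # Two "actions", each reading all 4 partitions; memory holds only 2.
    with tempfile.TemporaryDirectory() as tmp:
        store = ToyBlockStore(level, 2, Path(tmp), compute_partition)
        for _ in range(2):
            for pid in range(4):
                store.get(pid)
        return store.computed


for level in (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY):
    print(level, run_two_actions(level))
# MEMORY_ONLY [0, 1, 2, 3, 2, 3]
# MEMORY_AND_DISK [0, 1, 2, 3]
# DISK_ONLY [0, 1, 2, 3]
```

In this model, MEMORY_ONLY recomputes partitions 2 and 3 during the second action because they never fit in memory, while the other two levels compute each partition only once.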
