cache() – Stores in Memory (Default Storage Level)
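✅ Stores the DataFrame at the default storage level (MEMORY_AND_DISK for DataFrames): partitions are kept in memory for faster access and spill to disk if they do not fit.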
✅ No control over storage level (cache() takes no arguments and always uses the default).
✅ Useful when the dataset fits into memory and needs to be reused multiple times.
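- For a DataFrame, equivalent to .persist(StorageLevel.MEMORY_AND_DISK).
- For an RDD, cache() is equivalent to .persist(StorageLevel.MEMORY_ONLY) (RAM only): if memory is insufficient, partitions that do not fit are recomputed when needed instead of being written to disk.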
- cache() is lazy: the data is materialized by the first action, and on subsequent actions Spark reads the cached partitions instead of recomputing them.
persist() – Customizable Storage Level
✅ Allows different storage levels (memory, disk, or both).
✅ Useful when memory constraints require spilling data to disk.
- If memory is insufficient, data is spilled to disk instead of being recomputed, provided the chosen storage level includes disk (for example MEMORY_AND_DISK).
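- Uses more resources: persisted data occupies executor memory and/or disk, which Spark has to track until the data is evicted or released with unpersist().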
📌 Use cache() when the default storage level is good enough, typically when the dataset fits in memory.
📌 Use persist() when working with large datasets or memory-constrained environments.