cache() – Stores in Memory (Default Storage Level)

Stores the DataFrame at the default storage level so that later actions can read it back instead of recomputing it.
No control over storage level: for DataFrames the default is MEMORY_AND_DISK (RDD.cache() uses MEMORY_ONLY instead).
Useful when the dataset is reused multiple times and the default level is acceptable.

  • Equivalent to calling .persist() with no arguments.
  • If memory is insufficient, DataFrame partitions that do not fit are spilled to disk; with an RDD (MEMORY_ONLY) they are recomputed when needed instead.
  • Caching is lazy: the data is stored when the first action runs, and subsequent actions read the cached partitions instead of recomputing them.

persist() – Customizable Storage Level

Allows different storage levels (memory, disk, or both).
Useful when memory constraints require spilling data to disk.

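  • If memory is insufficient, the outcome depends on the storage level: MEMORY_AND_DISK spills partitions to disk, while MEMORY_ONLY drops them and recomputes them when needed.
  • Some levels cost extra resources: disk-backed levels add disk I/O, and replicated levels (names ending in _2) store every partition on two nodes.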

📌 Use cache() when the default storage level is enough, typically when the dataset fits mostly in memory.
📌 Use persist() when you need to pick the storage level yourself, for example with large datasets or in memory-constrained environments.