A Spark DataFrame is an immutable tabular data structure that is the most common Structured API in Spark.
Spark execution hierarchy: Job > Stage > Task. A slot is not technically part of the hierarchy itself, but rather one of the threads an executor offers for running tasks in parallel within a Spark application. An executor is likewise not part of the hierarchy itself; it is the JVM process running on a worker node.
A low number of executors (each holding a large share of a big DataFrame) is likely to experience delays due to garbage collection, since each executor must manage a larger heap.
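Executor count and sizing are set at submit time. As an illustration (the values below are placeholders, and `--num-executors` applies to YARN deployments), trading a few large executors for more, smaller ones reduces per-executor heap size and can shorten GC pauses:

```shell
# Example values only; sizing depends on cluster resources and workload.
spark-submit \
  --num-executors 8 \
  --executor-memory 8g \
  --executor-cores 4 \
  my_app.py
```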
A partition is a collection of rows of data that fit on a single machine in a cluster.
filter() is a narrow transformation, along with select(), cast(), and union(), whereas distinct(), groupBy(), sort(), and join() are wide transformations that induce shuffles (exchanges).
Garbage collection in the JVM releases the memory of objects that are no longer referenced so that new objects can be allocated.
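To diagnose GC-related delays on executors, Spark lets you pass JVM options via `spark.executor.extraJavaOptions`. A hedged sketch (the GC flags below are the classic JDK 8 ones; JDK 9+ uses `-Xlog:gc*` instead):

```shell
# Illustrative only: surface GC activity in executor logs to spot long pauses.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my_app.py
```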