Structured API

The Structured APIs are a tool for manipulating all sorts of data, from unstructured log files to semi-structured CSV files and highly structured Parquet files. These APIs refer to three core types of distributed collection APIs:

Datasets (JVM-based languages only; not available in PySpark)
DataFrames
SQL tables and views

DataFrames and Datasets are immutable, typed, lazily evaluated plans. Most of their operations apply to both batch and streaming, so migrating between the two takes little to no effort.
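
To make "immutable, lazily evaluated plan" concrete, here is a toy model in plain Python. It is not Spark code and does not use the Spark API: the `LazyPlan` class and its `filter`, `map`, and `collect` methods are hypothetical names, chosen only to mirror the idea that a transformation records a step and returns a new plan, while nothing runs until an action is called.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Tuple


@dataclass(frozen=True)
class LazyPlan:
    """Toy stand-in for a lazily evaluated plan (hypothetical, not Spark's API)."""

    source: Tuple[Any, ...]
    steps: Tuple[Tuple[str, Callable], ...] = ()

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyPlan":
        # Transformation: return a NEW plan with one more recorded step.
        return LazyPlan(self.source, self.steps + (("filter", predicate),))

    def map(self, fn: Callable[[Any], Any]) -> "LazyPlan":
        # Transformation: again, only the plan grows; no data is touched.
        return LazyPlan(self.source, self.steps + (("map", fn),))

    def collect(self) -> list:
        # Action: only now are the recorded steps executed, in order.
        rows: Iterable[Any] = self.source
        for kind, fn in self.steps:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


base = LazyPlan(source=(1, 2, 3, 4, 5, 6))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(len(base.steps))           # 0: the original plan is unchanged
print(len(evens_squared.steps))  # 2: two steps recorded, none executed yet
print(evens_squared.collect())   # [4, 16, 36]
```

Because each transformation returns a new frozen object, `base` can be reused for other plans without side effects, which is the property the note attributes to DataFrames and Datasets.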

Their schemas can be defined manually or inferred via schema-on-read (see Schema on writes & on read). Spark uses Catalyst to maintain its own type information throughout planning and processing, so operations run on Spark types, not Python types.
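
The difference between a manually defined schema and schema-on-read can be sketched without Spark. The example below is a simplified illustration in plain Python, not how Spark or Catalyst implements it: `infer_type` and `read_rows` are hypothetical helpers, the sample data is made up, and inference looks only at the first row.

```python
import csv
import io

# Made-up sample data standing in for a small CSV file.
RAW = "name,count\nspark,3\nflink,5\n"


def infer_type(value: str) -> type:
    # Schema-on-read: guess a type by trying to parse the text.
    for candidate in (int, float):
        try:
            candidate(value)
            return candidate
        except ValueError:
            continue
    return str


def read_rows(text, schema=None):
    rows = list(csv.DictReader(io.StringIO(text)))
    if schema is None:
        # No schema given: infer each column's type from the first row only.
        schema = {col: infer_type(val) for col, val in rows[0].items()}
    # Cast every value with the chosen schema, inferred or manual.
    typed = [{col: schema[col](row[col]) for col in schema} for row in rows]
    return schema, typed


inferred_schema, inferred_rows = read_rows(RAW)
manual_schema, manual_rows = read_rows(RAW, schema={"name": str, "count": float})

print(inferred_schema)  # {'name': <class 'str'>, 'count': <class 'int'>}
print(manual_rows[0])   # {'name': 'spark', 'count': 3.0}
```

Inference is convenient but depends on the data it happens to see, while a manual schema fixes the types up front regardless of the file's contents.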

DataFrame = Dataset of type Row → Row is Spark's optimized in-memory format for computation, with types checked at runtime (vs. at compile time for Datasets)

See:

how the Structured API Execution performs
Structured operations (basic)

Stanley Chan's Note🧠
