The Structured APIs are a tool for manipulating all sorts of data, from unstructured log files to semi-structured CSV files and highly structured Parquet files. These APIs refer to three core types of distributed collection APIs:

  • Datasets (JVM-based languages only, i.e. Scala and Java; not available in PySpark)
  • DataFrames
  • SQL tables and views

DataFrames and Datasets are immutable, typed, lazily evaluated plans. Most of the Structured APIs apply to both batch and streaming computation, so migrating from one to the other takes little to no effort.

Their schemas can be defined manually or inferred from the data source (schema on read; see Schema on writes & on read). Spark uses Catalyst to maintain its own type information throughout planning and processing, so most operations run on Spark types inside Spark rather than on Python types.

A DataFrame is a Dataset of type Row, Spark's optimized in-memory format for computation. Its types are checked at runtime, whereas Dataset types are checked at compile time.

See :