What is Incremental Data Ingestion from Files

  • Loading new data files encountered since the last ingestion
  • Reduces redundant processing
  • Two mechanisms:
    • COPY INTO
    • Auto Loader

COPY INTO

  • Thousands of files; less efficient at scale
  • SQL command that loads data into a Delta table
  • Idempotently and incrementally loads new data files
    • Files that have already been loaded are skipped.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('delimiter' = '|', 'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')  -- schema evolves according to incoming data
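
Because COPY INTO is idempotent, re-running the exact same statement is a safe no-op: already-loaded files are skipped. A minimal sketch (table and path are placeholders); 'force' is the documented copy option for deliberately reloading files regardless of prior state:

-- Safe to re-run: files loaded before are skipped
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV

-- Disable idempotency and reload everything
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV
COPY_OPTIONS ('force' = 'true')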

Auto Loader

  • Millions of files; more efficient at scale
  • Built on Structured Streaming, with a directory of files as the streaming source
  • Can process billions of files, split into multiple batches
  • Supports near real-time ingestion of millions of files per hour
  • Stores metadata about the discovered files (in the checkpoint location), so each file is processed exactly once
  • Assumes all columns are strings when JSON files are read (see the typed-schema sketch after the code below)
(spark.readStream
    .format("cloudFiles")                                     # specifying "cloudFiles" format => Auto Loader
    .option("cloudFiles.format", "json")                      # source file format, e.g. json / csv / parquet
    .option("cloudFiles.schemaLocation", "/path/to/schema")   # optional: where the inferred schema is stored
    .load("/path/to/files")
  .writeStream
    .option("checkpointLocation", "/path/to/checkpoint")      # tracks which files were already ingested
    .option("mergeSchema", "true")                            # optional: evolve schema when it differs
    .table("my_table"))
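
Because JSON columns default to strings, Auto Loader provides options to recover proper types. A minimal sketch, assuming illustrative column names (id, ts): cloudFiles.inferColumnTypes enables type inference, and cloudFiles.schemaHints pins types for specific columns.

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")                # infer ints/doubles/... instead of all strings
    .option("cloudFiles.schemaHints", "id BIGINT, ts TIMESTAMP")  # hypothetical columns: pin known types
    .option("cloudFiles.schemaLocation", "/path/to/schema")
    .load("/path/to/files")
  .writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .table("my_table"))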

Tips / Intuition

Spark Structured Streaming is a flexible tool that can handle various data sources, not only directories of files: for example, reading from pub/sub messaging systems (format: kafka), from Delta tables (format: delta), or from a directory of files (format: cloudFiles). The 'cloudFiles' format is the Auto Loader feature. So Auto Loader builds on Spark Structured Streaming to treat a directory of files as a streaming source.
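
A minimal sketch of the three source formats side by side (broker address, topic, table, and paths are placeholder names):

# Pub/sub messaging system as a streaming source
kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")  # placeholder broker
    .option("subscribe", "my_topic")                 # placeholder topic
    .load())

# Delta table as a streaming source
delta_df = spark.readStream.format("delta").table("my_table")

# Directory of files as a streaming source (Auto Loader)
files_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("/path/to/files"))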