What is Incremental Data Ingestion from Files

  • Loading new data files encountered since the last ingestion
  • Reduces redundant processing
  • Two mechanisms:
    • COPY INTO
    • Auto Loader

COPY INTO

  • Thousands of files; less efficient at scale
  • SQL command that loads data into a Delta table
  • Idempotently and incrementally loads new data files
    • Files that have already been loaded are skipped.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('delimiter' = '|', 'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')  -- schema evolves according to incoming data
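
Because COPY INTO is idempotent, re-running the exact same statement is a safe no-op: already-loaded files are skipped. A minimal sketch (table and path are placeholders); 'force' is the documented copy option for deliberately reloading files regardless of prior state:

-- Safe to re-run: files loaded before are skipped
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV

-- Disable idempotency and reload everything
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV
COPY_OPTIONS ('force' = 'true')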

Auto Loader

  • Millions of files; more efficient at scale
  • Built on Structured Streaming, with a directory of files as the streaming source
  • Can process billions of files, split into multiple batches
  • Supports near real-time ingestion of millions of files per hour
  • Stores metadata about the discovered files (in the checkpoint location), so each file is processed exactly once
  • Assumes all columns are strings when JSON files are read (see the typed-schema sketch after the code below)
(spark.readStream
    .format("cloudFiles")                                     # specifying "cloudFiles" format => Auto Loader
    .option("cloudFiles.format", "json")                      # source file format, e.g. json / csv / parquet
    .option("cloudFiles.schemaLocation", "/path/to/schema")   # optional: where the inferred schema is stored
    .load("/path/to/files")
  .writeStream
    .option("checkpointLocation", "/path/to/checkpoint")      # tracks which files were already ingested
    .option("mergeSchema", "true")                            # optional: evolve schema when it differs
    .table("my_table"))
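
Because JSON columns default to strings, Auto Loader provides options to recover proper types. A minimal sketch, assuming illustrative column names (id, ts): cloudFiles.inferColumnTypes enables type inference, and cloudFiles.schemaHints pins types for specific columns.

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")                # infer ints/doubles/... instead of all strings
    .option("cloudFiles.schemaHints", "id BIGINT, ts TIMESTAMP")  # hypothetical columns: pin known types
    .option("cloudFiles.schemaLocation", "/path/to/schema")
    .load("/path/to/files")
  .writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .table("my_table"))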

Tips / Intuition

Spark Structured Streaming is a flexible tool that can handle various data sources, not only directories of files: for example, reading from pub/sub messaging systems (format: kafka), from Delta tables (format: delta), or from a directory of files (format: cloudFiles). The 'cloudFiles' format is the Auto Loader feature. So Auto Loader builds on Spark Structured Streaming to treat a directory of files as a streaming source.
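
A minimal sketch of the three source formats side by side (broker address, topic, table, and paths are placeholder names):

# Pub/sub messaging system as a streaming source
kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")  # placeholder broker
    .option("subscribe", "my_topic")                 # placeholder topic
    .load())

# Delta table as a streaming source
delta_df = spark.readStream.format("delta").table("my_table")

# Directory of files as a streaming source (Auto Loader)
files_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("/path/to/files"))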