What is Incremental Data Ingestion from Files?
- Loading new data files encountered since the last ingestion
- Reduces redundant processing
- Two mechanisms:
- COPY INTO
- Auto Loader
COPY INTO
- Suited for thousands of files; less efficient at larger scale
- SQL command that loads data into a Delta table
- Idempotently and incrementally loads new data files
- Files that have already been loaded are skipped.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('delimiter' = '|', 'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true') -- schema evolves according to the incoming data
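As a quick illustration of the idempotent behaviour, the same command can be run twice and the second run finds no new files. A minimal PySpark sketch, assuming a Databricks environment and reusing the (hypothetical) table and path from the example above:

# On Databricks, an empty Delta table can be created without a schema
# for COPY INTO to fill in (table and path names are placeholders).
spark.sql("CREATE TABLE IF NOT EXISTS my_table")

copy_stmt = """
    COPY INTO my_table
    FROM '/path/to/files'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('delimiter' = '|', 'header' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
"""

# First run: ingests every file found under the source path.
spark.sql(copy_stmt).show()   # num_inserted_rows > 0

# Second run: already-loaded files are skipped, nothing new is ingested.
spark.sql(copy_stmt).show()   # num_inserted_rows = 0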
Auto Loader
- Millions of files; more efficient at scale
- Built on Structured Streaming, with a directory of files as a streaming source
- Can process billions of files, split into multiple batches
- Supports near real-time ingestion of millions of files per hour
- Stores metadata about the discovered files (in the checkpoint location), so files are processed exactly once
- When reading JSON files, all columns are inferred as strings by default (see the schema-inference sketch after the code below)
(spark.readStream
    .format("cloudFiles")                                    # "cloudFiles" format => Auto Loader
    .option("cloudFiles.format", "json")                     # source file format (placeholder: JSON)
    .option("cloudFiles.schemaLocation", "/path/to/schema")  # optional: where the inferred schema is stored
    .load("/path/to/files")
 .writeStream
    .option("checkpointLocation", "/path/to/checkpoint")     # tracks progress and discovered files
    .option("mergeSchema", "true")                           # optional: used when the schema differs/evolves
    .table("my_table"))                                      # target table (placeholder name)
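Since JSON columns default to strings, Auto Loader also offers options to improve inference. A hedged sketch, assuming the same hypothetical paths and a column named id that should be numeric:

# A minimal sketch (paths, table, and column names are placeholders).
# cloudFiles.inferColumnTypes asks Auto Loader to infer real types
# instead of defaulting everything to string;
# cloudFiles.schemaHints pins the type of specific columns.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/path/to/schema")
    .option("cloudFiles.inferColumnTypes", "true")   # infer int/double/timestamp, not just string
    .option("cloudFiles.schemaHints", "id BIGINT")   # force a known type for one column
    .load("/path/to/files")
 .writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .table("my_table"))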
Tips / Intuition
Spark Structured Streaming is a flexible tool that can handle various data sources, not only directories of files: for example, reading from pub/sub messaging systems (format: kafka), from Delta tables (format: delta), or from a directory of files (format: cloudFiles). The 'cloudFiles' format represents the Auto Loader feature. So Auto Loader is built on top of Spark Structured Streaming to treat a directory of files as a streaming source.
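To make that concrete, here is a hedged side-by-side sketch of the three sources read through the same API (server, topic, table, and path names are all hypothetical placeholders):

# Same readStream API, three different sources (all names are placeholders).

# 1) Pub/sub messaging system (Kafka topic) as a streaming source
kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")
    .option("subscribe", "my_topic")
    .load())

# 2) Delta table as a streaming source
delta_df = (spark.readStream
    .format("delta")
    .table("my_source_table"))

# 3) Directory of files as a streaming source (Auto Loader)
files_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/path/to/schema")
    .load("/path/to/files"))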