• a Databricks ETL framework for building maintainable data processing pipelines
  • simplifies large-scale ETL while managing table dependencies and data quality
    • tables can be declared directly in SQL, instead of first registering the data in PySpark as a Structured Streaming source
    • less coding
    • pipelines are managed through a UI
  • implemented using Databricks notebooks

Declare DLT

  • streaming tables support Auto Loader for incremental ingestion
  • the LIVE keyword marks a table as managed by DLT (CREATE ... LIVE TABLE, referenced as LIVE.<table>)
  • cloud_files() allows Auto Loader to be used natively in SQL
CREATE OR REFRESH STREAMING LIVE TABLE orders_raw
COMMENT "The raw books orders, ingested from orders-raw"
AS SELECT * FROM cloud_files("${datasets.path}/orders-json-raw", "json",
                             map("cloudFiles.inferColumnTypes", "true"))

Quality control can be implemented using the CONSTRAINT keyword, with 3 possible actions on violation (DROP ROW is shown in the example below; the other two are sketched right after it):

  • DROP ROW: discard the violating records
  • FAIL UPDATE: cause the pipeline update to fail
  • (omitted): violating records are kept and reported in the metrics
-- reject records with null order_id:
CREATE OR REFRESH STREAMING LIVE TABLE orders_cleaned (
  CONSTRAINT valid_order_number EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "The cleaned books orders with valid order_id"
AS
.....
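The other two violation actions use the same clause shape; a minimal sketch (the table names orders_strict and orders_tracked are hypothetical, not from the course datasets):

-- FAIL UPDATE: a single null order_id aborts the whole pipeline update
CREATE OR REFRESH STREAMING LIVE TABLE orders_strict (
  CONSTRAINT valid_order_number EXPECT (order_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.orders_raw);

-- action omitted: violating records are kept and only counted in the pipeline metrics
CREATE OR REFRESH STREAMING LIVE TABLE orders_tracked (
  CONSTRAINT valid_order_number EXPECT (order_id IS NOT NULL)
)
AS SELECT * FROM STREAM(LIVE.orders_raw);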

Referencing another DLT table: use the LIVE. prefix, wrapped with STREAM() if the source is a streaming table:

CREATE OR REFRESH STREAMING LIVE TABLE orders_cleaned
COMMENT "The cleaned books orders with valid order_id"
AS
  SELECT *
  FROM STREAM(LIVE.orders_raw) o
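
When the downstream table is not a streaming table (a full recompute rather than incremental), the LIVE. prefix is used on its own, without STREAM(). A minimal sketch, assuming orders_cleaned has an order_timestamp column (hypothetical, not shown in the notes above):

CREATE OR REFRESH LIVE TABLE orders_by_date
COMMENT "Daily order counts, derived from the cleaned orders"
AS
  SELECT date(order_timestamp) AS order_date, count(*) AS total_orders
  FROM LIVE.orders_cleaned
  GROUP BY date(order_timestamp)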

Delta Live Tables expectations

Expectations are optional clauses you add to Delta Live Tables dataset declarations that apply data quality checks on each record passing through a query.

An expectation consists of three things:

  • A description, which acts as a unique identifier and allows you to track metrics for the constraint.
  • A boolean statement that always returns true or false based on some stated condition.
  • An action to take when a record fails the expectation, meaning the boolean returns false.
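
Mapped onto the constraint clause used above, the three parts line up like this (a schematic only, with placeholders rather than real identifiers):

CONSTRAINT <description>                  -- unique identifier, used to track metrics
  EXPECT (<boolean condition>)            -- evaluated per record: true = pass, false = fail
  ON VIOLATION <DROP ROW | FAIL UPDATE>   -- action on failure; omit to keep the record and report it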

Pipeline:

  • DLT is not designed to be run interactively in notebook cells. Executing a cell that contains Delta Live Tables syntax in a Databricks notebook results in an error message
  • To run your queries, you must configure your notebooks as part of a pipeline

Pipeline execution mode:

Triggered: all datasets are updated once with the data available when the update starts, then the update stops automatically.

Continuous: all datasets are updated continuously as data arrives at the configured sources, until the pipeline is manually shut down; the compute resources persist to allow for additional testing.

Triggered vs Continuous:

  • When does the update stop? Triggered: automatically once complete. Continuous: runs continuously until manually stopped.
  • What data is processed? Triggered: data available when the update is started. Continuous: all data as it arrives at configured sources.
  • What data freshness requirements is this best for? Triggered: data updates run every 10 minutes, hourly, or daily. Continuous: data updates desired between every 10 seconds and a few minutes.