- a Databricks ETL framework for building maintainable data processing pipelines
- simplifies large-scale ETL while maintaining table dependencies and data quality
- tables can be declared directly with SQL, instead of first registering the data in PySpark as a Structured Streaming source
- less coding
- comes with a UI for managing the pipeline
- implemented using Databricks notebooks
Declare DLT tables
- STREAMING ⇒ enables Auto Loader / incremental ingestion
- LIVE ⇒ marks the table as managed by DLT
- cloud_files() ⇒ allows Auto Loader to be used natively with SQL
```sql
CREATE OR REFRESH STREAMING LIVE TABLE orders_raw
COMMENT "The raw books orders, ingested from orders-raw"
AS SELECT * FROM cloud_files("${datasets.path}/orders-json-raw", "json",
                             map("cloudFiles.inferColumnTypes", "true"))
```
Quality control can be implemented using the CONSTRAINT keyword, with three possible actions on violation:
- DROP ROW: discard the violating records
- FAIL UPDATE: cause the pipeline to fail
- (omitted): violating records are kept and reported in the metrics
```sql
-- reject records with null order_id:
CREATE OR REFRESH STREAMING LIVE TABLE orders_cleaned (
  CONSTRAINT valid_order_number EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "The cleaned books orders with valid order_id"
AS
.....
```
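The other two violation actions follow the same pattern. A minimal sketch, reusing orders_raw via the STREAM(LIVE.…) reference described in the next point; the table names (orders_strict, orders_tracked), constraint names, and the order_timestamp / quantity columns are illustrative assumptions, not from the course example:

```sql
-- FAIL UPDATE: the first violating record causes the whole pipeline update to fail
-- (orders_strict, valid_timestamp, and order_timestamp are hypothetical)
CREATE OR REFRESH STREAMING LIVE TABLE orders_strict (
  CONSTRAINT valid_timestamp EXPECT (order_timestamp IS NOT NULL) ON VIOLATION FAIL UPDATE
)
COMMENT "Orders where a missing timestamp fails the update"
AS SELECT * FROM STREAM(LIVE.orders_raw);

-- Action omitted: violating records are kept; the violation count is only reported in the metrics
-- (orders_tracked, positive_quantity, and quantity are hypothetical)
CREATE OR REFRESH STREAMING LIVE TABLE orders_tracked (
  CONSTRAINT positive_quantity EXPECT (quantity > 0)
)
COMMENT "Orders kept regardless of quantity; violations appear in the pipeline metrics"
AS SELECT * FROM STREAM(LIVE.orders_raw);
```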
Referring to another DLT table: use the LIVE. prefix, wrapped with STREAM() if the referenced table is a streaming table:
```sql
CREATE OR REFRESH STREAMING LIVE TABLE orders_cleaned
COMMENT "The cleaned books orders with valid order_id"
AS
SELECT *
FROM STREAM(LIVE.orders_raw) o
```
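When the referenced table is not read as a stream, the LIVE. prefix is used on its own. A minimal sketch of a downstream aggregate; the orders_by_date table and the order_timestamp column are assumptions, not part of the original examples:

```sql
-- LIVE. without STREAM(): a complete (non-streaming) read of another DLT table
-- (orders_by_date and order_timestamp are hypothetical)
CREATE OR REFRESH LIVE TABLE orders_by_date
COMMENT "Daily order counts computed from the cleaned orders"
AS
SELECT to_date(order_timestamp) AS order_date, count(*) AS total_orders
FROM LIVE.orders_cleaned
GROUP BY to_date(order_timestamp)
```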
Delta Live Tables expectations
Expectations are optional clauses you add to Delta Live Tables dataset declarations that apply data quality checks on each record passing through a query.
An expectation consists of three things:
- A description, which acts as a unique identifier and allows you to track metrics for the constraint.
- A boolean statement that always returns true or false based on some stated condition.
- An action to take when a record fails the expectation, meaning the boolean returns false.
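Put together, the three parts map onto the CONSTRAINT syntax as follows; a sketch with hypothetical names (sales_raw, sales_cleaned, valid_price, and the price column are not from the examples above):

```sql
-- sales_raw / sales_cleaned, valid_price, and price are hypothetical names for illustration
CREATE OR REFRESH STREAMING LIVE TABLE sales_cleaned (
  CONSTRAINT valid_price            -- 1. description: unique name used to track metrics for this expectation
    EXPECT (price > 0)              -- 2. boolean statement evaluated against each record
    ON VIOLATION DROP ROW           -- 3. action taken when the statement returns false
)
COMMENT "Sales records with a positive price"
AS SELECT * FROM STREAM(LIVE.sales_raw)
```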
Pipeline:
- DLT is not designed to be run interactively in notebook cells. Executing a cell that contains Delta Live Tables syntax in a Databricks notebook results in an error message
- To run your queries, you must configure your notebooks as part of a pipeline
Pipeline execution mode:
- Triggered: all datasets are updated once, then the pipeline shuts down until the next manual or scheduled run.
- Continuous: all datasets are updated continuously as new data arrives, until the pipeline is manually shut down. The compute resources persist while the pipeline runs, which also allows for additional testing.
| | Triggered | Continuous |
|---|---|---|
| When does the update stop? | Automatically once complete. | Runs continuously until manually stopped. |
| What data is processed? | Data available when the update is started. | All data as it arrives at configured sources. |
| What data freshness requirements is this best for? | Data updates run every 10 minutes, hourly, or daily. | Data updates desired between every 10 seconds and a few minutes. |