A higher order function is a function that takes one or more functions as arguments, or returns a function as its result In Databrick, higher order function allow you to work directly with hierarchical data like arrays and map type objects.
Two useful Higher Order Functions
filter()
- filter array by a given lambda function (i just refers to books)
--example books columns: [{"book_id":"B09","quantity":2,"subtotal":48}]
--filter the books column where quantity>=2:
SELECT
order_id,
books,
FILTER (books, i -> i.quantity >= 2
) AS multiple_copies
FROM orders
TRANSFORM()
- transform array by a given lambda function
SELECT
order_id,
books,
TRANSFORM (
books, b -> CAST(b.subtotal * 0.8 AS INT)
) AS subtotal_after_discount
FROM orders;
User-defined functions
- allow you to register a custom combination of SQL logic as function, which can be reused
- leverage spark SQL directly & maintain all optimization
- UDFs are permanent objects that are persisted to the database (ok for different session/ notebooks)
-- 3 parts needed: function name, returned type and logic
CREATE OR REPLACE FUNCTION get_url(email STRING)
RETURNS STRING
RETURN concat("https://www.", split(email, "@")[1])
- see details of the UDFs by:
DESCRIBE FUNCTION EXTENDED get_url
- drop function by:
DROP FUNCTION get_url;