A higher order function is a function that takes one or more functions as arguments, or returns a function as its result In Databrick, higher order function allow you to work directly with hierarchical data like arrays and map type objects.

Two useful Higher Order Functions

filter()

  • filter array by a given lambda function (i just refers to books)
--example books columns: [{"book_id":"B09","quantity":2,"subtotal":48}]
--filter the books column where quantity>=2:

SELECT

  order_id,
  books,
  FILTER (books, i -> i.quantity >= 2
	) AS multiple_copies

FROM orders

TRANSFORM()

  • transform array by a given lambda function
SELECT

  order_id,

  books,

  TRANSFORM (

    books, b -> CAST(b.subtotal * 0.8 AS INT)

  ) AS subtotal_after_discount

FROM orders;

User-defined functions

  • allow you to register a custom combination of SQL logic as function, which can be reused
  • leverage spark SQL directly & maintain all optimization
  • UDFs are permanent objects that are persisted to the database (ok for different session/ notebooks)
-- 3 parts needed: function name, returned type and logic

CREATE OR REPLACE FUNCTION get_url(email STRING)
RETURNS STRING
RETURN concat("https://www.", split(email, "@")[1])
  • see details of the UDFs by:
DESCRIBE FUNCTION EXTENDED get_url
  • drop function by:
DROP FUNCTION get_url;