Direct Query from Files:

SELECT * FROM file_format.`/path/to/file` //note that these are backticks (`), not single quotes (')
  • formats
    • self-describing formats (e.g. JSON/Parquet): schema is embedded in the file, so results come back well structured when querying
    • non-self-describing formats (e.g. CSV/TSV): no embedded schema, so results are not structured when querying (e.g. each row comes back as a single delimited string)
  • path:
    • single file (file_2022.json)
    • multiple files (file_*.json) using a wildcard
    • complete directory (/path/dir), provided all files share the same format and schema (see the sketch below)
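
A minimal sketch of the three path variants against the customers-json directory used below; the wildcard pattern is an assumption based on the export_001.json file name:

//single file
SELECT * FROM json.`${dataset.bookstore}/customers-json/export_001.json`;
//multiple files via wildcard (assumed naming)
SELECT * FROM json.`${dataset.bookstore}/customers-json/export_*.json`;
//whole directory (all files must share the same format and schema)
SELECT * FROM json.`${dataset.bookstore}/customers-json`;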

Raw data:

  • Extract text files as raw strings
    • text-based files (JSON, CSV, TSV, TXT…)
    • SELECT * FROM text.`/path/to/file`
    • useful for corrupted data; extract each line as a raw string and apply custom text parsing functions afterward
    SELECT * FROM json.`${dataset.bookstore}/customers-json/export_001.json`
    SELECT * FROM csv.`${dataset.bookstore}/books-csv`
    
  • Extract files as raw bytes
    • Images or unstructured data
    • SELECT * FROM binaryFile.`/path/to/file`
SELECT * FROM text.`${dataset.bookstore}/customers-json`
SELECT * FROM binaryFile.`${dataset.bookstore}/customers-json`

Table creation

CTAS: Registering Tables from Files

  • CREATE TABLE table_name AS SELECT * FROM file_format.`/path/to/file`
  • Advantages:
    • Automatically infers schema information from the query results
    • Useful for ingesting external data with a well-defined schema
    • Creates a Delta table
  • Disadvantages:
    • Does not support manual schema declaration
    • Does not support file options
CREATE TABLE customers AS 
SELECT * FROM json.`${dataset.bookstore}/customers-json`; 
DESCRIBE EXTENDED customers;
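
To see why the missing file options matter, compare CTAS directly against CSV. A minimal sketch, assuming the books CSV files use a header row and a ";" delimiter (as in the USING CSV example below); the table name books_unparsed is hypothetical:

//CTAS cannot pass header/delimiter options, so the ';'-delimited rows
//land in a single column and the header row shows up as data
CREATE TABLE books_unparsed AS 
SELECT * FROM csv.`${dataset.bookstore}/books-csv`;
SELECT * FROM books_unparsed;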

Registering Tables on External Data Sources (with USING syntax)

CREATE TABLE table_name (col_name1 col_type1, ...) 
USING source_type 
OPTIONS (key1 = val1, key2 = val2, ...) 
LOCATION path
  • Advantage: supports file options and manual schema declaration
  • Disadvantages:
    • Keeps the data in its original format as an external, non-Delta table! The table just points to files in an external location
    • No performance guarantees and none of the features of Delta Lake and the Lakehouse
    • Performance issues when working with huge database tables
//CSV
CREATE TABLE table_name (col_name1 col_type1, ...) 
USING CSV 
OPTIONS (header = "true", delimiter = ";") 
LOCATION path

//database using JDBC connection
CREATE TABLE table_name (col_name1 col_type1, ...) 
USING JDBC 
OPTIONS (url = "jdbc:sqlite://hostname:port", dbtable = "database.table", user = "username", password = "pwd")
  • Solution to overcome these disadvantages: extract the data into a temp view with options, then use CTAS to create a Delta table
//extracting data from external source as view
CREATE TEMP VIEW temp_view_name (col_name1 col_type1, ...) 
USING data_source 
OPTIONS (key1 = "val1", key2 = "val2", ..., path = "/path/to/files")

//create a delta table from the view using CTAS
CREATE TABLE table_name 
AS SELECT * FROM temp_view_name
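
A concrete sketch of the two-step pattern for the books CSV files; the column names and types here are assumptions for illustration:

//step 1: temp view declares the schema and passes the file options
CREATE TEMP VIEW books_tmp_vw (book_id STRING, title STRING, author STRING, price DOUBLE) 
USING CSV 
OPTIONS (path = "${dataset.bookstore}/books-csv", header = "true", delimiter = ";");

//step 2: CTAS materializes the view into a managed Delta table
CREATE TABLE books 
AS SELECT * FROM books_tmp_vw;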

Tips / Intuition

When reading multiple files, it is useful to add the Spark SQL function input_file_name(), which records the source file of each row. This can be useful for troubleshooting. E.g.

SELECT *, input_file_name() AS source_file FROM json.`${dataset.bookstore}/customers-json`;

Tips / Intuition

Adding more data files to the directory will not be reflected when querying the table again, as Spark caches the underlying files. To refresh the table, use:

REFRESH TABLE <table_name>

spark.sql() can be used to execute a SQL query from PySpark.
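
A minimal PySpark sketch; it assumes a Databricks notebook where spark (the SparkSession) is predefined, and reuses the customers table created above:

# refresh the cached file listing, then query the table as a DataFrame
spark.sql("REFRESH TABLE customers")
df = spark.sql("SELECT * FROM customers")
df.show(5)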