Direct Query from file:
SELECT * FROM file_format.`/path/to/file` // note: the path uses backticks (``), not single quotes ('')
- formats
- self-describing formats (e.g. JSON/Parquet) → structured when queried
- non-self-describing formats (e.g. CSV/TSV) → not structured when queried (e.g. one delimited string per row)
- path:
- single file (file_2022.json)
- multiple files (file_*.json), using a wildcard
- complete directory (/path/dir), provided all files share the same format/schema
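The three path forms above can be sketched as follows (the paths are hypothetical, for illustration only):

```sql
-- single file
SELECT * FROM json.`/data/file_2022.json`;

-- multiple files matched by a wildcard
SELECT * FROM json.`/data/file_*.json`;

-- complete directory (all files must share the same format/schema)
SELECT * FROM json.`/data/`;
```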
Raw data:
- Extract text files as raw strings
- text-based files (JSON, CSV, TSV, TXT…)
SELECT * FROM text.`/path/to/file`
- useful for corrupted data: apply a custom text-parsing function afterward
SELECT * FROM json.`${dataset.bookstore}/customers-json/export_001.json`
SELECT * FROM csv.`${dataset.bookstore}/books-csv`
- Extract files as raw bytes
- Images or unstructured data
SELECT * FROM binaryFile.`/path/to/file`
SELECT * FROM text.`${dataset.bookstore}/customers-json`
SELECT * FROM binaryFile.`${dataset.bookstore}/customers-json`
Table creation
CTAS: Registering Tables from Files
- CREATE TABLE table_name AS SELECT * FROM file_format.`/path/to/file`
- Advantages:
- Automatically infer schema information from query results
- Useful for external data ingestion with well-defined schema
- Delta table
- Disadvantages:
- Does not support manual schema declaration
- Does not support file options
CREATE TABLE customers AS
SELECT * FROM json.`${dataset.bookstore}/customers-json`;
DESCRIBE EXTENDED customers;
Registering Tables on External Data Sources (with USING syntax)
CREATE TABLE table_name (col_name1 col_type1, ...)
USING source_type
OPTIONS (key1 = val1, key2 = val2, ...)
LOCATION path
- Support files options and schema declaration
- Disadvantage:
- the table keeps its original format (an external, non-Delta table!), merely pointing to files in an external location
- no performance guarantees and none of the features of Delta Lake and the Lakehouse
- performance issues with huge database tables
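To confirm which case a registered table falls into, DESCRIBE EXTENDED reports the table's Provider and Type fields; for a table registered with USING CSV, the provider will be CSV rather than delta. A quick check (table name is hypothetical):

```sql
-- Provider shows the source format (e.g. CSV, not delta);
-- Type shows EXTERNAL for tables pointing at an external location
DESCRIBE EXTENDED csv_table_name;
```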
//CSV
CREATE TABLE table_name (col_name1 col_type1, ...)
USING CSV
OPTIONS (header = "true", delimiter = ";")
LOCATION path
//database using JDBC connection
CREATE TABLE table_name (col_name1 col_type1, ...)
USING JDBC
OPTIONS (url = "jdbc:sqlite://hostname:port", dbtable = "database.table", user = "username", password = "pwd")
- Solution to overcome these disadvantages:
//extracting data from external source as view
CREATE TEMP VIEW temp_view_name (col_name1 col_type1, ...)
USING data_source
OPTIONS (key1 = "val1", key2 = "val2", ..., path = "/path/to/files")
//create a delta table from the view using CTAS
CREATE TABLE table_name
AS SELECT * FROM temp_view_name
Tips / Intuition
When reading multiple files, it is useful to add the Spark SQL function input_file_name(), which records the source file for each row. This can be useful for troubleshooting. E.g.
SELECT *, input_file_name() source_file
FROM json.`${dataset.bookstore}/customers-json`;
Tips / Intuition
Adding more data files to the directory will not refresh the table being queried, as Spark caches the underlying files. To refresh a table, use:
REFRESH TABLE <table_name>
spark.sql() can be used to execute a SQL query from PySpark.
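A minimal sketch of this, assuming a running Spark session and the customers table registered earlier (in a Databricks notebook the spark object is already provided):

```python
from pyspark.sql import SparkSession

# Get (or create) the active Spark session; Databricks notebooks
# expose this as the predefined variable `spark`.
spark = SparkSession.builder.getOrCreate()

# Execute a SQL query from Python; the result is a DataFrame.
df = spark.sql("SELECT * FROM customers")
df.show()
```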