Direct Query from Files:

SELECT * FROM file_format.`/path/to/file` //note that these are backticks (`), not single quotes (')
  • formats
    • self-describing formats (e.g. JSON/Parquet): schema is embedded in the file, so results come back well structured when querying
    • non-self-describing formats (e.g. CSV/TSV): no embedded schema, so results are not structured when querying (e.g. each row comes back as a single delimited string)
  • path:
    • single file (file_2022.json)
    • multiple files (file_*.json) using a wildcard
    • complete directory (/path/dir), provided all files share the same format and schema (see the sketch below)
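
A minimal sketch of the three path variants against the customers-json directory used below; the wildcard pattern is an assumption based on the export_001.json file name:

//single file
SELECT * FROM json.`${dataset.bookstore}/customers-json/export_001.json`;
//multiple files via wildcard (assumed naming)
SELECT * FROM json.`${dataset.bookstore}/customers-json/export_*.json`;
//whole directory (all files must share the same format and schema)
SELECT * FROM json.`${dataset.bookstore}/customers-json`;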

Raw data:

  • Extract text files as raw strings
    • text-based files (JSON, CSV, TSV, TXT…)
    • SELECT * FROM text.`/path/to/file`
    • useful for corrupted data; extract each line as a raw string and apply custom text parsing functions afterward
    SELECT * FROM json.`${dataset.bookstore}/customers-json/export_001.json`
    SELECT * FROM csv.`${dataset.bookstore}/books-csv`
    
  • Extract files as raw bytes
    • Images or unstructured data
    • SELECT * FROM binaryFile.`/path/to/file`
SELECT * FROM text.`${dataset.bookstore}/customers-json`
SELECT * FROM binaryFile.`${dataset.bookstore}/customers-json`

Table creation

CTAS: Registering Tables from Files

  • CREATE TABLE table_name AS SELECT * FROM file_format.`/path/to/file`
  • Advantages:
    • Automatically infers schema information from the query results
    • Useful for ingesting external data with a well-defined schema
    • Creates a Delta table
  • Disadvantages:
    • Does not support manual schema declaration
    • Does not support file options
CREATE TABLE customers AS 
SELECT * FROM json.`${dataset.bookstore}/customers-json`; 
DESCRIBE EXTENDED customers;
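
To see why the missing file options matter, compare CTAS directly against CSV. A minimal sketch, assuming the books CSV files use a header row and a ";" delimiter (as in the USING CSV example below); the table name books_unparsed is hypothetical:

//CTAS cannot pass header/delimiter options, so the ';'-delimited rows
//land in a single column and the header row shows up as data
CREATE TABLE books_unparsed AS 
SELECT * FROM csv.`${dataset.bookstore}/books-csv`;
SELECT * FROM books_unparsed;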

Registering Tables on External Data Sources (with USING syntax)

CREATE TABLE table_name (col_name1 col_type1, ...) 
USING source_type 
OPTIONS (key1 = val1, key2 = val2, ...) 
LOCATION path
  • Advantage: supports file options and manual schema declaration
  • Disadvantages:
    • Keeps the data in its original format as an external, non-Delta table! The table just points to files in an external location
    • No performance guarantees and none of the features of Delta Lake and the Lakehouse
    • Performance issues when working with huge database tables
//CSV
CREATE TABLE table_name (col_name1 col_type1, ...) 
USING CSV 
OPTIONS (header = "true", delimiter = ";") 
LOCATION path

//database using JDBC connection
CREATE TABLE table_name (col_name1 col_type1, ...) 
USING JDBC 
OPTIONS (url = "jdbc:sqlite://hostname:port", dbtable = "database.table", user = "username", password = "pwd")
  • Solution to overcome these disadvantages: extract the data into a temp view with options, then use CTAS to create a Delta table
//extracting data from external source as view
CREATE TEMP VIEW temp_view_name (col_name1 col_type1, ...) 
USING data_source 
OPTIONS (key1 = "val1", key2 = "val2", ..., path = "/path/to/files")

//create a delta table from the view using CTAS
CREATE TABLE table_name 
AS SELECT * FROM temp_view_name
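
A concrete sketch of the two-step pattern for the books CSV files; the column names and types here are assumptions for illustration:

//step 1: temp view declares the schema and passes the file options
CREATE TEMP VIEW books_tmp_vw (book_id STRING, title STRING, author STRING, price DOUBLE) 
USING CSV 
OPTIONS (path = "${dataset.bookstore}/books-csv", header = "true", delimiter = ";");

//step 2: CTAS materializes the view into a managed Delta table
CREATE TABLE books 
AS SELECT * FROM books_tmp_vw;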

Tips / Intuition

When reading multiple files, it is useful to add the Spark SQL function input_file_name(), which records the source file of each row. This can be useful for troubleshooting. E.g.

SELECT *, input_file_name() AS source_file FROM json.`${dataset.bookstore}/customers-json`;

Tips / Intuition

Adding more data files to the directory will not be reflected when querying the table again, as Spark caches the underlying files. To refresh the table, use:

REFRESH TABLE <table_name>

spark.sql() can be used to execute a SQL query from PySpark.
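
A minimal PySpark sketch; it assumes a Databricks notebook where spark (the SparkSession) is predefined, and reuses the customers table created above:

# refresh the cached file listing, then query the table as a DataFrame
spark.sql("REFRESH TABLE customers")
df = spark.sql("SELECT * FROM customers")
df.show(5)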