databases = schemas in Hive metasore create database db_name = create schema db_name

HIve Metastore

  • Repository of metadata that stores information for data structure (databases, table, partition)
  • table definition, format, path to the storage

Default path when create database: dbfs:/user/hive/warehouse

Every workspace has central hive metastore accessible by all cluster to persist table metadata

  • without specifying, hive metastore is default (managed tables)
  • with LOCATION syntax, we can create a storage in custom location other than default dbfs:/user/hive/warehouse (External tables)

Two types of tables

To create external tables outside the default hive warehouse

use db_x;   //specify the DB
create table table_name
LCOATION 'dbfs:/some/path/table_name'  //specify the storage location

The table is classified as external as long as you use LOCATION syntax, even you specify the location as hive path: dbfs:/user/hive/warehouse/<Table_Name>

CTAS: we can create the table via CTAS with additional options:

CREATE TABLE new_table
COMMENT "<some_comment>"
PARTITIONED BY (col1,col2)
LOCATION '/some/path'
AS SELECT col1, col2, col3 as new_col3, FROM old_table

Noted that CTAS do not support manual schema declaration (e.g. data type), it is automatically inferred

Table Constraints:

  • NOT NULL/ CHECK
  • ALTER TABLE table name ADD CONSTRAINT constraint name constraint details (e.g. ALTER TABLE orders ADD CONSTRAINT valid_date CHECK (date > ‘2020-01-01)

Cloning Delta lake tables:

  • Useful to set up tables for testing in dev
  • Two ways, in either case, data modifications will not affect the source
  1. DEEP CLONE
    • Fully copies data + metadata from a source table to a target
    • can sync changes
    • take quite a while for large dataset
CREATE TABLE table_clone
DEEP CLONE source_table
  1. SHALLOW CLONE
    • just copy the delta transaction logs (no data moving so it’s quick)
CREATE TABLE table_clone
SHALLOW CLONE source_table

View

  • logical query against source tables, execute when querying
  • 3 Types
    1. (stored) Views
      • persisted objects CREATE VIEW view_name AS qeury
      • dopped only by DROP VEIW
    2. Temporary Views
      • session-scoped view, a session is created when:
        • opening a new notebook
        • detaching and reattaching to a cluster
        • installing a python package
        • restarting a cluster
      • `CREATE TEMP VIEW view_name AS qeury
      • dropped when session ends
    3. Global Temporary View
      • Cluster-scoped view, as long as cluster is running, any notebook attached to that cluster can access its global temporary across sessions
      • `CREATE GLOBAL TEMP VIEW view_name AS qeury
      • `SELECT * FROM global_temp.view_name
      • dropped when cluster restarted

See all tables and views

SHOW TABLES;
SHOW TABLES IN global_temp; //for global view