The Dataframe API is part of a Spark library named Spark SQL, which enables data analysts to use SQL expressions to query and manipulate data. The Spark catalog is a metastore for relational data objects such as views and tables.

One of the simplest ways to make data in a dataframe available for querying in the Spark catalog is to create a temporary view, as shown in the following code example:

df.createOrReplaceTempView("products_view")
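
A temporary view is scoped to the current Spark session, but while the session is active you can query it by name like any other catalog object. A minimal sketch, using the products_view view registered above:

# Query the temporary view by name; the result is a new dataframe
view_df = spark.sql("SELECT * FROM products_view")
display(view_df)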

You can create an empty table by using the spark.catalog.createTable method, or you can save a dataframe as a table by using its saveAsTable method, as shown in the following example. Note that deleting a managed table also deletes its underlying data.

df.write.format("delta").saveAsTable("products")
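
If you want to define a table before any data exists, spark.catalog.createTable accepts an explicit schema. A minimal sketch, assuming a hypothetical table name empty_products with illustrative columns, and that Delta is available as a data source in your environment:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DecimalType

# Illustrative schema for an empty managed table (column names are assumptions)
product_schema = StructType([
    StructField("ProductID", IntegerType(), False),
    StructField("ProductName", StringType(), True),
    StructField("ListPrice", DecimalType(10, 2), True),
])

# Create an empty managed table from the schema
spark.catalog.createTable("empty_products", schema=product_schema, source="delta")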

Using the Spark SQL API to query data

The following code uses a SQL query to return data from the **products** table as a dataframe:

bikes_df = spark.sql("SELECT ProductID, ProductName, ListPrice \
                      FROM products \
                      WHERE Category IN ('Mountain Bikes', 'Road Bikes')")
display(bikes_df)
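
Because spark.sql returns an ordinary dataframe, you can continue transforming the result with the Dataframe API before displaying it. A small sketch, filtering the bikes by an arbitrary illustrative price threshold:

from pyspark.sql import functions as F

# Keep only bikes above the (arbitrary) 1000 price threshold
premium_bikes_df = bikes_df.where(F.col("ListPrice") > 1000)
display(premium_bikes_df)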

Using SQL code

The previous example demonstrated how to use the Spark SQL API to embed SQL expressions in Spark code. In a notebook, you can also use the %%sql magic to run SQL code that queries objects in the catalog, like this:

%%sql
SELECT Category, COUNT(ProductID) AS ProductCount
FROM products
GROUP BY Category
ORDER BY Category
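
If you need the results of such a query as a dataframe for further processing, you can run the equivalent statement through the Spark SQL API instead; a sketch, assuming the products table used above:

category_counts_df = spark.sql("SELECT Category, COUNT(ProductID) AS ProductCount \
                                FROM products \
                                GROUP BY Category \
                                ORDER BY Category")
display(category_counts_df)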