Two high-level planes:

  1. Control plane (Databricks account)
    • Web UI
    • Cluster management
    • Workflows
    • Notebooks
  2. Data plane (your own cloud account)
    • Cluster VMs for compute
    • Storage

Spark on Databricks

  • in-memory, distributed data processing
  • supports Scala, Python, SQL, R, and Java
  • allows both batch and stream processing
  • handles structured, semi-structured, and unstructured data

Databricks File System (DBFS)

  • distributed file system
  • pre-installed on Databricks clusters
  • abstraction layer over the underlying cloud storage (e.g. S3):
    • Files created in DBFS from the cluster are persisted in cloud storage
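
A minimal sketch of how the DBFS abstraction shows up in paths. It assumes the cluster exposes DBFS both as a `dbfs:/` URI and as a local mount under `/dbfs`, which is common but should be checked for your workspace. The helper names `dbfs_to_local` and `local_to_dbfs` are hypothetical and are not part of any Databricks library.

```python
from pathlib import PurePosixPath

# Assumption: DBFS is addressed as "dbfs:/..." and mounted locally at "/dbfs".
DBFS_SCHEME = "dbfs:/"
LOCAL_MOUNT = "/dbfs"


def dbfs_to_local(dbfs_uri: str) -> str:
    """Turn 'dbfs:/tmp/notes.txt' into '/dbfs/tmp/notes.txt' (hypothetical helper)."""
    if not dbfs_uri.startswith(DBFS_SCHEME):
        raise ValueError(f"not a DBFS URI: {dbfs_uri!r}")
    relative = dbfs_uri[len(DBFS_SCHEME):].lstrip("/")
    return str(PurePosixPath(LOCAL_MOUNT) / relative)


def local_to_dbfs(local_path: str) -> str:
    """Turn '/dbfs/tmp/notes.txt' into 'dbfs:/tmp/notes.txt' (hypothetical helper)."""
    parts = PurePosixPath(local_path).parts
    if parts[:2] != ("/", "dbfs"):
        raise ValueError(f"not under {LOCAL_MOUNT}: {local_path!r}")
    return DBFS_SCHEME + "/".join(parts[2:])


if __name__ == "__main__":
    print(dbfs_to_local("dbfs:/tmp/notes.txt"))
    print(local_to_dbfs("/dbfs/tmp/notes.txt"))
```

Either form refers to the same file. The bytes themselves live in the cloud storage behind DBFS, not on the cluster VM's disk.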

Compute

  • multi-node - a cluster made up of a driver (master) node, which coordinates the workers for parallel execution of tasks, plus one or more worker nodes
  • single node - no workers; Spark runs on the driver only
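
The difference between the two modes can be sketched with a toy simulation. This is not Spark code: threads stand in for worker nodes, and the names `split_into_partitions`, `worker_task`, and `run_job` are made up for illustration. It only shows the pattern of a driver splitting work, workers computing partial results in parallel, and the driver merging them.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver-side step: cut the data into roughly equal chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker_task(partition):
    """Work done on one partition: here, a sum of squares."""
    return sum(x * x for x in partition)


def run_job(data, num_workers):
    """Run the job in single-node mode (0 workers) or multi-node mode."""
    if num_workers == 0:
        # Single node: no workers, so the driver does all the work itself.
        return worker_task(data)
    # Multi-node: the driver hands one partition to each simulated worker.
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker_task, partitions))
    # The driver merges the partial results into the final answer.
    return sum(partial_results)


if __name__ == "__main__":
    numbers = list(range(1, 11))
    print(run_job(numbers, num_workers=0))
    print(run_job(numbers, num_workers=3))
```

Both calls return the same result. Only where the work runs changes, which is why single-node clusters suit small data and multi-node clusters suit jobs that benefit from parallelism.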