Two high-level planes:
- Control plane (Databricks account)
  - Web UI
  - Cluster management
  - Workflows
  - Notebooks
- Data plane (your own cloud account)
  - Cluster VMs for compute
  - Storage
Spark on Databricks
- In-memory, distributed data processing
- Supports Scala, Python, SQL, R, and Java
- Handles both batch and stream processing
- Works with structured, semi-structured, and unstructured data
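As a rough illustration of the split, process, and combine model behind distributed processing, the sketch below uses plain Python threads rather than Spark itself. The function names (`count_words`, `distributed_word_count`) and the sample lines are made up for this example. Real Spark spreads partitions across worker machines and also handles scheduling and fault tolerance, none of which is modeled here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Work done on one partition (roughly what a single task would do)."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines, num_partitions=2):
    """Split the data, process the partitions in parallel, merge the results."""
    # Round-robin split of the input into partitions held in memory.
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Combine the per-partition counts into one total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return dict(total)


# Hypothetical sample data, only for illustration.
lines = ["spark on databricks", "spark runs in memory", "databricks runs spark"]
result = distributed_word_count(lines)
print(result["spark"])  # 3
```

The threads here stand in for worker nodes only loosely; the point is that each partition is processed independently and the partial results are merged at the end.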
Databricks File System (DBFS)
- Distributed file system
- Pre-installed on Databricks clusters
- Abstraction layer over the underlying cloud storage (e.g. S3):
  - Files created in DBFS from the cluster are persisted in cloud storage
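To make the abstraction-layer idea concrete, here is a small sketch of how one DBFS file can be addressed in two ways. It assumes the common convention that Spark APIs use `dbfs:/...` URIs while local file APIs on the driver see the same files under a `/dbfs/...` mount. The helper names (`to_local_path`, `to_dbfs_uri`) and the example path are hypothetical, and the code only manipulates strings, so it runs without a cluster.

```python
from pathlib import PurePosixPath

DBFS_SCHEME = "dbfs:/"   # URI form, assumed to be what Spark APIs accept
LOCAL_MOUNT = "/dbfs"    # local mount point on the driver (assumption)


def to_local_path(dbfs_uri):
    """Convert 'dbfs:/some/file' to '/dbfs/some/file'."""
    if not dbfs_uri.startswith(DBFS_SCHEME):
        raise ValueError(f"not a DBFS URI: {dbfs_uri!r}")
    relative = dbfs_uri[len(DBFS_SCHEME):].lstrip("/")
    return str(PurePosixPath(LOCAL_MOUNT) / relative)


def to_dbfs_uri(local_path):
    """Convert '/dbfs/some/file' back to 'dbfs:/some/file'."""
    prefix = LOCAL_MOUNT + "/"
    if not local_path.startswith(prefix):
        raise ValueError(f"not under {LOCAL_MOUNT}: {local_path!r}")
    return DBFS_SCHEME + local_path[len(prefix):]


# Hypothetical example path.
print(to_local_path("dbfs:/tmp/notes/example.txt"))  # /dbfs/tmp/notes/example.txt
print(to_dbfs_uri("/dbfs/tmp/notes/example.txt"))    # dbfs:/tmp/notes/example.txt
```

Either form points at the same object, which ultimately lives in the cloud storage behind DBFS rather than on the cluster's disks.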
Compute
- Multi-node cluster: one driver (master) node that coordinates the workers for parallel execution of tasks, plus one or more worker nodes
- Single-node cluster: no workers; Spark runs on the driver
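The difference between the two cluster modes can be sketched as the configuration one might send when creating a cluster. The helper `cluster_spec` is hypothetical, and the field names and single-node settings follow my understanding of the Databricks Clusters API; treat them as assumptions and check the official documentation before use. The `spark_version` and `node_type_id` defaults are placeholders, not real values.

```python
import json


def cluster_spec(name, num_workers,
                 spark_version="<runtime-version>",   # placeholder
                 node_type_id="<node-type>"):         # placeholder
    """Build a cluster definition as a plain dict (no API call is made)."""
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    spec = {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }
    if num_workers == 0:
        # Single node: no workers, Spark runs locally on the driver.
        # These keys are assumed from the Clusters API documentation.
        spec["spark_conf"] = {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        }
        spec["custom_tags"] = {"ResourceClass": "SingleNode"}
    return spec


multi_node = cluster_spec("etl-cluster", num_workers=4)
single_node = cluster_spec("dev-sandbox", num_workers=0)
print(json.dumps(single_node, indent=2))
```

A multi-node spec only needs a positive worker count, while the single-node spec sets the worker count to zero and tells Spark to run in local mode on the driver.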