Spark handles cluster communication during joins in one of two ways. It either incurs (1) a shuffle join, which results in all-to-all communication, or (2) a broadcast join.
Shuffle Join
- used when both tables are big and each is partitioned across the cluster
- every node talks to every other node and they share data according to which node has a certain key or set of keys (on which you are joining)
- these joins are expensive because the network can become congested with traffic, especially if your data is not partitioned well.
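
The shuffle behaviour described above can be sketched in plain Python. This is a toy model, not Spark's implementation or API: the function name `shuffle_join`, the use of `hash(key) % num_nodes` to decide which node owns a key, and the `rows_moved` counter are all assumptions made for illustration.

```python
from collections import defaultdict


def shuffle_join(left_parts, right_parts, num_nodes):
    """Toy shuffle join (hypothetical helper, not a Spark API).

    left_parts / right_parts: one list of (key, value) rows per node.
    Returns the joined rows and the number of rows sent over the "network".
    """
    left_buckets = [defaultdict(list) for _ in range(num_nodes)]
    right_buckets = [defaultdict(list) for _ in range(num_nodes)]
    rows_moved = 0

    # Shuffle phase: every row is routed to the node that owns its key.
    for buckets, parts in ((left_buckets, left_parts), (right_buckets, right_parts)):
        for src, part in enumerate(parts):
            for key, value in part:
                dst = hash(key) % num_nodes  # assumed ownership rule
                if dst != src:
                    rows_moved += 1  # this row crosses the network
                buckets[dst][key].append(value)

    # Join phase: each node joins only the keys it now owns.
    result = []
    for node in range(num_nodes):
        for key, left_values in left_buckets[node].items():
            for lv in left_values:
                for rv in right_buckets[node].get(key, []):
                    result.append((key, lv, rv))
    return result, rows_moved


# Two nodes, integer keys (integer hashes are stable across runs).
left = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
right = [[(2, "x"), (3, "y")], [(1, "z")]]
joined, moved = shuffle_join(left, right, num_nodes=2)
print(sorted(joined), moved)
```

Rows from both tables have to move, and how many move depends on where each key already sits. That is why poorly partitioned data makes this kind of join costly.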
Broadcast Join
- used when a big table is joined with a small table
- provided the small table fits in the memory of a single worker node, it is replicated onto every worker node in the cluster so that it can be joined with each partition of the big table
- without having to wait or communicate with any other worker node, joins will be performed on every single node individually
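- the first step (copying the small table to every worker node) is expensive, but it happens only once, and it avoids all-to-all communication for the rest of the join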
- if you try to broadcast something too large, you can crash your driver node
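
The broadcast steps above can be sketched the same way in plain Python. Again this is a toy model under stated assumptions, not Spark's implementation: `broadcast_join` and the `rows_broadcast` counter are hypothetical names, and a shared dictionary stands in for the copy of the small table that each worker would hold.

```python
def broadcast_join(big_parts, small_table):
    """Toy broadcast join (hypothetical helper, not a Spark API).

    big_parts: one list of (key, value) rows per node (the big table).
    small_table: list of (key, value) rows small enough to copy everywhere.
    Returns the joined rows and the number of rows sent during the broadcast.
    """
    # Broadcast phase: build a lookup table once and send a copy to every node.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    rows_broadcast = len(big_parts) * len(small_table)  # one full copy per node

    # Join phase: each node joins its own partition locally, with no
    # communication between nodes. Rows of the big table never move.
    result = []
    for part in big_parts:
        local_copy = lookup  # stands in for this node's copy of the small table
        for key, value in part:
            for small_value in local_copy.get(key, []):
                result.append((key, value, small_value))
    return result, rows_broadcast


big = [[(1, "a"), (2, "b")], [(3, "c"), (1, "d")]]
small = [(1, "x"), (3, "y")]
joined, sent = broadcast_join(big, small)
print(joined, sent)
```

The only network cost in this model is the up-front copy, which grows with the size of the small table multiplied by the number of nodes. In PySpark itself, this strategy can be requested with the `broadcast` hint from `pyspark.sql.functions`, for example `big_df.join(broadcast(small_df), "key")`, where `big_df`, `small_df`, and `"key"` are placeholder names.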