T3
Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
Large language models increasingly rely on distributed techniques for training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computation, techniques such as tensor parallelism (TP) inherently serialize communication with model execution. One approach to hiding this serialized communication is to interleave it, in a fine-grained manner, with the producer operation (the operation that generates the communicated data). However, implementing this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlap efficiency. To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with computation. T3 transparently fuses producer operations with the subsequent communication through a simple configuration of the producer's output address space, and requires only minor software changes. At the hardware level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the producer's computation and communication. It further uses compute-enhanced memories for the computation that accompanies communication. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by a geometric mean of 30% (up to 47%) and reduces data movement by a geometric mean of 22% (up to 36%). Furthermore, T3's benefits persist as models scale: it achieves a geometric mean speedup of 29% for sublayers in approximately 500-billion-parameter models, PaLM and MT-NLG.
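To make the benefit of fine-grained interleaving concrete, the following is a minimal timing sketch, not T3's actual mechanism. It assumes the producer's output is split into equal-sized chunks, that each chunk's communication may start as soon as that chunk is produced, and that computation and communication do not contend for resources (an idealization that T3 aims to approach in hardware). The function names `serialized_time` and `overlapped_time` and their parameters are hypothetical, introduced only for illustration.

```python
def serialized_time(compute: float, comm: float) -> float:
    """Baseline: the producer finishes entirely before communication starts."""
    return compute + comm


def overlapped_time(compute: float, comm: float, chunks: int) -> float:
    """Idealized fine-grained overlap over equal-sized chunks.

    Communication for a chunk starts once that chunk has been produced and
    the previous chunk's communication has finished. Resource contention is
    ignored; that is an assumption of this sketch only.
    """
    if chunks < 1:
        raise ValueError("chunks must be >= 1")
    chunk_compute = compute / chunks
    chunk_comm = comm / chunks
    compute_done = 0.0  # time at which the latest chunk was produced
    comm_done = 0.0     # time at which the latest chunk was communicated
    for _ in range(chunks):
        compute_done += chunk_compute
        comm_done = max(comm_done, compute_done) + chunk_comm
    return comm_done
```

With one chunk the model reduces to the serialized baseline. As the chunk count grows, the total time approaches the larger of the two costs plus a single chunk of the smaller one, which is why finer interleaving hides more of the serialized communication in this idealized setting.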