MoMask is a model for text-driven 3D human motion generation. It employs a hierarchical quantization scheme that represents human motion as multi-layer discrete motion tokens with high-fidelity detail: a base layer captures the coarse motion, and successive residual layers encode progressively finer information. For generation, MoMask uses two distinct bidirectional Transformers, a masked Transformer that predicts base-layer tokens from the text input and a residual Transformer that predicts the tokens of the remaining layers. The model outperforms existing methods on the text-to-motion generation task and can be seamlessly applied to related tasks such as text-guided temporal inpainting.
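
The hierarchical quantization described above is a residual vector quantization (RVQ) scheme: each layer quantizes the residual left over by the layer before it, so later layers refine the reconstruction. Below is a minimal sketch of that idea, not MoMask's actual implementation; the function name, tensor shapes, and nearest-neighbor lookup are illustrative assumptions.

```python
import torch

def residual_quantize(x, codebooks):
    """Quantize latent vectors into multi-layer discrete tokens (RVQ sketch).

    x:         (batch, dim) latent vectors, e.g. from a motion encoder (assumed).
    codebooks: list of (codebook_size, dim) tensors, one per quantization layer.
    Returns per-layer token indices and the summed reconstruction.
    """
    residual = x
    tokens, recon = [], torch.zeros_like(x)
    for codebook in codebooks:
        # Nearest codebook entry for the current residual (Euclidean distance).
        dists = torch.cdist(residual, codebook)   # (batch, codebook_size)
        idx = dists.argmin(dim=-1)                # (batch,) token ids for this layer
        quantized = codebook[idx]                 # (batch, dim)
        tokens.append(idx)
        recon = recon + quantized
        # The next layer quantizes what this layer failed to capture.
        residual = residual - quantized
    return tokens, recon

# Toy usage: 6 quantization layers over 512-dim latents (sizes are arbitrary).
codebooks = [torch.randn(512, 512) for _ in range(6)]
x = torch.randn(4, 512)
tokens, recon = residual_quantize(x, codebooks)
print(len(tokens), recon.shape)  # 6 token layers; (4, 512) reconstruction
```

Because each layer only encodes what the previous layers missed, the first layer's tokens carry most of the signal, which is why a dedicated masked Transformer can model them separately from the residual layers.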