An international research team has launched MOSEL (Massive Open-source compliant Speech data for European Languages), a project that compiles a comprehensive open-source speech dataset covering the 24 official languages of the European Union. The initiative aims to promote the development of open AI language models in Europe and to challenge the current dominance of English datasets and of proprietary systems from major tech companies.

The MOSEL project aggregates speech data from 18 different sources, including well-known projects such as CommonVoice, LibriSpeech, and VoxPopuli. The collection includes both transcribed speech recordings and unlabeled audio. The 505,000 hours of transcribed data are the most valuable part, since labeled speech can be used directly for supervised training.

However, the data is distributed very unevenly across languages. English accounts for over 437,000 hours of labeled data, while languages such as Maltese and Irish have only a few hours. To improve the situation for these low-resource languages, the research team used OpenAI's Whisper model to automatically transcribe an additional 441,000 hours of unlabeled audio.
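The imbalance in the labeled data can be made concrete with a little arithmetic. The sketch below uses only the rounded figures quoted in this article and treats "over 437,000" as exactly 437,000, so the results are approximations and not official MOSEL statistics. The variable names are illustrative.

```python
# Rounded figures as quoted in the article; "over 437,000" is treated as 437,000.
LABELED_TOTAL_HOURS = 505_000
ENGLISH_LABELED_HOURS = 437_000
EU_OFFICIAL_LANGUAGES = 24

# Share of all labeled hours that are English.
english_share = ENGLISH_LABELED_HOURS / LABELED_TOTAL_HOURS * 100

# Labeled hours left for the other 23 languages combined.
remaining_hours = LABELED_TOTAL_HOURS - ENGLISH_LABELED_HOURS

# Mean labeled hours per non-English language (the real spread is far wider).
mean_other_hours = remaining_hours / (EU_OFFICIAL_LANGUAGES - 1)

print(f"English share of labeled data: {english_share:.1f}%")
print(f"Hours left for the other 23 languages: {remaining_hours:,}")
print(f"Mean per non-English language: {mean_other_hours:,.0f} hours")
```

On these figures English makes up roughly 86.5% of the labeled data, leaving about 68,000 hours for the other 23 languages combined. The mean of roughly 2,957 hours per language hides the real spread, since some languages have only a few hours.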

The research team explained that although automatic transcription is not perfect, it provides a substantial amount of training material for languages lacking manual transcription data. These generated transcriptions are released under a Creative Commons CC-BY license, allowing for free use with proper attribution.

The limits of automatic transcription were particularly evident in the case of Maltese. The Whisper model had a word error rate of over 80% on Maltese, meaning that the transcripts contained roughly four word-level errors (substituted, missing, or inserted words) for every five words actually spoken. This shows how far automatic processing still has to go for some languages.
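To show what a word error rate of 80% means in practice, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between the reference and the automatic transcript, divided by the number of reference words. The function name `word_error_rate` and the example strings are illustrative assumptions. This is not the MOSEL team's evaluation code, and real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion: reference word missing
                curr[j - 1] + 1,                  # insertion: extra hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real Maltese: one of five words is transcribed correctly.
reference = "w1 w2 w3 w4 w5"
hypothesis = "w1 x2 x3 x4 x5"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

Because inserted words also count as errors, the rate can exceed 100%, which is why it is best read as errors per reference word and not strictly as a percentage of words that are wrong.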

Despite this, the research team believes that these automatic transcriptions can serve as a starting point for further improvements. They plan to collect more data for underrepresented languages and continuously refine the MOSEL database.

The entire MOSEL dataset is freely available on GitHub, giving researchers and developers convenient access to speech data for European languages. This open sharing reflects the collaborative spirit of the scientific community and gives new momentum to the development of European AI language models.

The significance of the MOSEL project goes beyond the data itself. It reflects Europe's efforts to pursue technological autonomy in AI and could drive the development of more diverse and inclusive AI language models. By providing multilingual open-source data, MOSEL offers a valuable resource for preserving and developing less widely spoken languages in the AI era, and it can help reduce bias and inequality in how AI technologies process language.

As the MOSEL database continues to be refined and expanded, more AI applications and services built on European languages are likely to follow. This would support the growth of Europe's digital economy and contribute to the diversity of AI language technology worldwide.