Researchers have introduced a new method called T-FREE that substantially improves the efficiency of large language models. The technique was developed jointly by scientists from Aleph Alpha, Technische Universität Darmstadt, hessian.AI, and the German Research Center for Artificial Intelligence (DFKI); its full name is "Tokenizer-Free Sparse Representations for Memory-Efficient Embeddings."
Traditionally, a tokenizer converts text into tokens the model can process, but T-FREE takes a different approach: it splits each word into character trigrams (overlapping triplets of consecutive characters) and embeds the word directly into the model through a sparse activation pattern over the embedding matrix. This cuts the number of parameters in the embedding layer by more than 85%, while performance on tasks such as text classification and question answering remains essentially unchanged.
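To make the idea concrete, the following minimal Python sketch shows how a word could be split into character trigrams and hashed into a sparse, multi-hot activation pattern over a fixed-size embedding matrix. The hashing scheme, vocabulary size, and function names are illustrative assumptions, not the authors' reference implementation:

```python
import hashlib

# Illustrative size of the shared trigram embedding matrix (assumption, not from the paper).
VOCAB_SIZE = 8_000

def char_trigrams(word: str) -> list[str]:
    """Split a word into overlapping character trigrams, padding the word boundaries."""
    padded = f"_{word}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def sparse_activations(word: str) -> set[int]:
    """Hash each trigram to a row index; the union of indices is the word's
    sparse (multi-hot) activation pattern over the embedding matrix."""
    indices = set()
    for trigram in char_trigrams(word):
        digest = hashlib.sha256(trigram.encode("utf-8")).digest()
        indices.add(int.from_bytes(digest[:4], "big") % VOCAB_SIZE)
    return indices

if __name__ == "__main__":
    print(char_trigrams("house"))       # ['_ho', 'hou', 'ous', 'use', 'se_']
    print(sparse_activations("house"))  # only a handful of active rows out of 8,000
```

Because only the rows hit by a word's trigrams are activated, the embedding matrix no longer needs one row per vocabulary entry, which is where the parameter savings come from.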
Another highlight of T-FREE is how it models morphological similarity between words. Words such as "house," "houses," and "domestic" are closely related, and T-FREE can represent them more efficiently within the model: the researchers argue that similar words should lie close to each other in the embedding space, which in turn allows higher compression rates. As a result, T-FREE not only shrinks the embedding layer but also reduces the average encoding length of text by 56%.
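A quick way to see why morphologically related forms end up with similar sparse representations: using the same toy trigram splitting as above, "house" and "houses" share most of their trigrams, so most of their active embedding rows coincide. This is a sketch of the intuition, not the paper's exact similarity measure:

```python
def char_trigrams(word: str) -> set[str]:
    """Overlapping character trigrams with word-boundary padding (same toy scheme as above)."""
    padded = f"_{word}_"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

a, b = char_trigrams("house"), char_trigrams("houses")
print(sorted(a & b))                                       # ['_ho', 'hou', 'ous', 'use']
print(f"Jaccard overlap: {len(a & b) / len(a | b):.0%}")   # ~57% of trigrams shared
```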
T-FREE also performs well in transfer learning across languages. In one experiment, the researchers took a model with 3 billion parameters that was first trained on English and then trained further on German, and found that T-FREE adapted to the new language far better than conventional tokenizer-based approaches.
The researchers are careful not to overstate these results. They note that the experiments so far cover only models of up to 3 billion parameters, and they plan to evaluate the method on larger models and more extensive datasets in future work.