Mark Hamilton, a doctoral student in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT), is a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He aims to leverage machines to understand how animals communicate. To achieve this, he first set out to create a system capable of learning human language "from scratch."
Product entry: https://top.aibase.com/tool/denseav
This algorithm, named DenseAV, learns the meaning of language by correlating audio and video signals. After training DenseAV to play audio-video matching games, Hamilton and his colleagues observed which pixels the model focused on when it heard sounds. For instance, when someone says "dog," the algorithm immediately looks for dogs in the video stream. This pixel selection helps reveal what the algorithm perceives the meaning of a word to be.
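This pixel selection falls naturally out of the architecture: DenseAV produces one feature per audio time step and one per image patch, and their inner products indicate which image regions a given stretch of audio activates. Below is a minimal sketch of that readout with random tensors standing in for the model's encoders; the shapes, the frame range, and the mean-pooling over a word's frames are illustrative assumptions, not the paper's exact API.

```python
import torch

# Illustrative shapes, not DenseAV's real configuration:
# B = batch, T = audio time steps, H x W = visual feature map, D = channels.
B, T, H, W, D = 1, 50, 14, 14, 512
audio_tokens  = torch.randn(B, T, D)      # stand-in for the audio encoder output
visual_tokens = torch.randn(B, H, W, D)   # stand-in for the visual encoder output

# Inner product between every audio token and every visual token:
# sim[b, t, h, w] says how strongly audio frame t activates image region (h, w).
sim = torch.einsum("btd,bhwd->bthw", audio_tokens, visual_tokens)

# To see what the model "looks at" while a word is spoken, average the
# similarity over the frames covering that word and view the result as a heatmap.
word_frames = sim[:, 10:20]        # frames assumed to contain the spoken word "dog"
heatmap = word_frames.mean(dim=1)  # (B, H, W) attention-style map over the image
print(heatmap.shape)               # torch.Size([1, 14, 14])
```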
Interestingly, when DenseAV hears a dog barking, it also looks for dogs in the video stream. This piqued the researchers' interest, leading them to explore whether the algorithm understands the difference between the word "dog" and a dog's bark by giving DenseAV a "dual brain." They found that one side naturally focuses on language, such as the word "dog," while the other focuses on sounds like a dog's bark. This indicates that DenseAV has not only learned the meanings of words and the locations of sounds but also differentiated these cross-modal connections without human intervention or any textual input.
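One way to read the "dual brain" design is to split the similarity computation into two independent heads and let training decide what each one specializes in. The head split and the sum over heads below are a hedged sketch of that idea, not the paper's exact formulation:

```python
import torch

K, D = 2, 512                          # two "brains" (heads) sharing the feature dim
B, T, P = 1, 50, 196                   # batch, audio frames, flattened image patches

audio  = torch.randn(B, T, K, D // K)  # audio tokens split across the two heads
visual = torch.randn(B, P, K, D // K)  # visual tokens split across the two heads

# Each head produces its own audio-visual coupling. Nothing forces a split,
# but in DenseAV one head ends up firing for spoken words ("dog") and the
# other for natural sounds (a bark).
sim_per_head = torch.einsum("btkd,bpkd->bktp", audio, visual)  # (B, K, T, P)

# Summing over heads lets the model use whichever head best explains the match.
sim_total = sim_per_head.sum(dim=1)    # (B, T, P)
```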
Core Features of DenseAV:
1. DenseAV is a dual-encoder grounding architecture that learns high-resolution, semantically meaningful, audio-visually aligned features solely by watching videos.
2. It can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
3. DenseAV automatically distinguishes between these two kinds of association (a word's meaning versus a sound's source) without supervision.
4. It uses audio-video contrastive learning to link sounds with the visual world, enabling unsupervised learning.
5. The model's strong localization comes from a contrastive similarity based on the inner product between local audio and visual representation tokens (see the sketch after this list).
6. DenseAV can naturally organize its features into sound features and language features without knowing what sounds or languages are.
7. With fewer than half the parameters, DenseAV outperforms the previous state-of-the-art model ImageBind in cross-modal retrieval.
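The loss mentioned in item 5 can be sketched as follows: inner products between every local audio token and every local visual token are pooled into one score per clip pair, and an InfoNCE-style objective pushes matching audio-video pairs above mismatched ones. The mean pooling and the temperature value here are illustrative choices; the paper aggregates local similarities more carefully.

```python
import torch
import torch.nn.functional as F

def clip_similarity(audio_tokens, visual_tokens):
    """Pool local inner products into one score per (audio, video) pair.

    audio_tokens: (B, T, D); visual_tokens: (B, P, D). Mean pooling is an
    illustrative simplification of the paper's aggregation.
    """
    local = torch.einsum("atd,vpd->avtp", audio_tokens, visual_tokens)
    return local.mean(dim=(2, 3))  # (B, B) matrix of clip-level similarities

def contrastive_loss(audio_tokens, visual_tokens, temperature=0.07):
    """InfoNCE loss: each clip's true audio-video pairing beats all mismatches."""
    logits = clip_similarity(audio_tokens, visual_tokens) / temperature
    targets = torch.arange(logits.size(0))           # audio i pairs with video i
    return (F.cross_entropy(logits, targets) +       # audio -> video direction
            F.cross_entropy(logits.T, targets)) / 2  # video -> audio direction

# Toy usage with random features standing in for encoder outputs.
B, T, P, D = 4, 50, 196, 512
loss = contrastive_loss(torch.randn(B, T, D), torch.randn(B, P, D))
print(loss.item())
```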
One application of this method is learning from the vast volume of video posted to the internet every day. The researchers hope it can be used to understand new languages that have no written form, such as the communication of dolphins or whales. Ultimately, they aim for the method to discover associations between other pairs of signals, such as the seismic sounds the Earth emits and its geological conditions.
A significant challenge the team faces is learning language without any textual input. Their goal is to avoid using pre-trained language models and instead rediscover the meaning of language from scratch, inspired by how children understand language by observing and listening to their environment.
Paper link: https://arxiv.org/abs/2406.05629