A general perspective on multimodal deep learning

As evidenced using a quick search on Semantic Scholar we can trace back the concept of Multimodal deep learning back to 2011 by the self titled paper from Stanford researchers.

In essence, it’s about improving deep networks to learn features across multiple modalities. The authors show that “better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time”.

Keeping it simple the hypothesis is that humans understand better what’s being said by using both audio and visual cues, a phenomenon also known as McGurk Effect. Can we train neural networks (thus computers) to better understand speech by using multimodal learning modalities?

The authors claim that this is possible and also that we can teach machines to lip read what humans are saying.


The same techniques can be used to teach machines to recognize emotions from faces (Ghayoumi, M., & Bansal, A. K. (2016)). Emotions are, too, multimodal entities (facial expressions are a basis as stated by the well known studies by Paul Ekman, but they can also mix together tone of voice, gestures, etc.).

So the questions the authors were trying to solve is: “can we learn better representations for audio/visual speech recognition?”. The answer seems to be positive.

Others (Cha, M., Gwon, Y., & Kung, H.T. (2015)) are already applying these and derived techniques to better analyze the web and allow machines to have more sophisticated representations of the multimodal web (image / sound, video on YouTube, image and text on Wikipedia). In their article they show that

[…]shared representations enabled by our framework substantially improve the classification performance under both unimodal and multimodal settings. [abstract].

Fig. 5. Network architectures used for feature learning (adopted from [115]). (a) Concatenating audio and video vectors and employing a single input network. (b) Two-input network with separate inputs for audio and video streams.

Given the whole trend on speech recognition (think Siri, Alexa / Echo and company), this will certainly be a growing area of research as these systems will (probably) soon have eyes too.

If you think about it, Echo is already in many households, what avoids them to have eyes and check when we get home, apply home surveillance, track visual changes and act accordingly?


will the next Alexa have eyes?






Khosla, A., Kim, M., Lee, H., Ngiam, J., Nam, J., & Ng, A.Y. (2011). Multimodal Deep Learning. ICML.

Ghayoumi, M., & Bansal, A. K. (2016). Multimodal architecture for emotion in robots using deep learning. In Future Technologies Conference, San Francisco, United States.

Tian, C., Ji, W., & Yuan, Y. (2017). Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading. arXiv preprint arXiv:1701.04224.

Cha, M., Gwon, Y., & Kung, H.T. (2015). Multimodal sparse representation learning and applications. CoRR, abs/1511.06238.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.