ubiai deep learning

Multimodal Learning problems

Nov 06th, 2023

Multimodal machine learning is a dynamic field that presents a multitude of challenges in its quest to unlock new frontiers in artificial intelligence through the integration of data from various sensory modalities. These challenges serve as significant roadblocks that researchers and practitioners must navigate. They are like intricate pieces of a puzzle, shaping the landscape of this discipline.


In this article, we delve into the distinct problems that emerge in the realm of multimodal machine learning and explore how they influence the development of models, data comprehension, and the approach to real world issues. Each problem represents a unique facet of this complex journey.


As we examine these problems one by one, we’ll gain a deeper understanding of the multifaceted nature of this field. Let’s embark on this journey by first defining multimodal machine learning and then delving into the core problems that await exploration in this  fascinating field.

What is multimodal ?

Multimodal machine learning is a cutting-edge field within the realm of artificial intelligence that revolutionizes the way machines understand and process information. It stands in stark contrast to unimodal AI models, which specialize in handling data from a single sensory modality, such as text, images, or speech.


Multimodal AI, on the other hand, excels in processing and integrating data from diverse sensory sources, including written and spoken language, images and videos, and audio signals. The primary objective of multimodal machine learning is to bridge the intricate relationships that exist between these various modalities, leading to a more profound comprehension of complex, real-world data and enabling the execution of a wide array of tasks.

multimodel learning
Screenshot 2023-11-01 at 11.39.02 AM

The Multimodal Deep Learning group stands at the forefront of this field, dedicating its efforts to pioneering advanced deep learning techniques that seamlessly synthesize information from multiple sensory modalities. The group’s research is structured into three core subfields:

Zero-Shot Learning: This subfield focuses on teaching models to recognize and classify objects or concepts they have never encountered during training by leveraging semantic descriptions and attributes associated with novel classes.

Conditional Image Synthesis: Explores the generation of images from scratch using deep networks conditioned on detailed textual descriptions, striving to create images that closely resemble real-world objects.

Deeply Explainable Artificial Intelligence( XAI): Enhances the transparency and interpretability of deep learning models by generating visual explanations and highlighting the evidence behind classification decisions, ultimately making AI more understandable and trustworthy.

The Multimodal Deep Learning group’s pioneering work in narrowing the gap between vision and language in AI brings forth significant core challenges that researchers are actively tackling, underscoring the intricacies of the field.

Core problems

In the domain of multimodal machine learning, there are five fundamental core problems that researchers contend with. These core problems involve fusion, representation, translation, alignment, and co-learning. To delve deeper into these core problems.  Let’s begin by examining the first one:



Screenshot 2023-11-01 at 11.41.49 AM

Multimodal fusion is the process of seamlessly integrating information from various sensory modalities to tackle tasks like classification or regression, thereby enhancing predictive capabilities and providing a more comprehensive data understanding. It offers adaptability, making it effective even when one modality is missing. However, it comes with its own set of challenges, including the risk of overfitting, variations in how well modalities generalize, temporal misalignment, and discrepancies due to noise in multimodal data.



Multimodal machine translation is a transformative approach that redefines the boundaries of translation by incorporating information from diverse sensory modalities, resulting in a richer and more comprehensive understanding of input data. Unlike traditional translation that predominantly deals with text, multimodal machine translation broadens its scope, considering not only written language but also spoken language, images, and videos. This versatility and context-awareness drive its power and potential.



Co-learning is a strategy that tackles the issue of transferring knowledge between different sensory modalities. It’s particularly valuable when working with a modality that has limited resources, such as a shortage of annotated data, noisy inputs, or unreliable labels. In such cases, transferring knowledge from a modality that has more resources can be a smart solution.

● Parallel Approaches: These methods involve training data that includes samples from both modalities, and they require direct links between instances from different modalities. For instance, it could be video and the corresponding speech signal of a person speaking. Transfer learning, in this context, can enhance models, like improving lip-reading by transferring knowledge from a speech-recognition neural network.

Non-parallel Data Approaches: Unlike parallel approaches, non-parallel approaches do not need direct connections between samples from different modalities. For
example, in action recognition, adding an extra modality, like 3D human-skeleton sequences, to the training data can lead to improvements.

Hybrid Approach: In this approach, modalities are connected through a shared modality or dataset. For example, if you have a parallel dataset of images with English captions and another parallel dataset of English and French documents, a model could be constructed to retrieve images based on French captions, bridging the gap between modalities.



Alignment is a fundamental concept in multimodal learning that revolves around the intricate task of establishing correspondences and connections between data originating from two or more sensory modalities, all of which are related to the same event or context. This process is pivotal for enhancing the understanding and utilization of information from diverse modalities. To grasp the essence of alignment, let’s delve into a practical example:

Imagine you have a collection of travel itineraries presented in text form, and alongside these textual descriptions, you possess a gallery of travel photos. In this scenario, multimodal alignment takes center stage as it seeks to match and link specific descriptions in the text with the corresponding images that represent each segment of the journey.


Achieving this alignment is akin to piecing together a complex puzzle. It requires the model to evaluate and measure similarities between these distinct modalities, bridging the gap between them and comprehending the intricate relationship that exists, even when these modalities are scattered across extensive spans of data.


Multimodal representation is a pivotal task in the realm of machine learning, involving the encoding of data from multiple sensory modalities into vector or tensor forms.

Effective representations, capable of encapsulating the semantic essence of raw data, are paramount for the success of machine learning models. Yet, the challenge lies in the extraction of features from diverse and heterogeneous data sources while harnessing the synergies between them and eliminating redundancy.


There are two fundamental approaches to multimodal representations:

Joint Representation: This method entails encoding each individual modality and subsequently placing them in a shared high-dimensional space. It’s the most direct approach and tends to work well when modalities share a similar nature or characteristics.

Coordinated Representation: In this approach, each modality is encoded independently of one another, regardless of their inherent nature. The coordination between modalities is maintained through similarity measures or structural constraints, allowing for flexibility and adaptability.


In conclusion, the field of multimodal machine learning confronts a set of core problems that serve as significant roadblocks, demanding the attention of both researchers and practitioners. These core problems encompass Fusion, Translation, Alignment, Co-learning, and Representation, each contributing to the complexity of this dynamic discipline.


As we’ve delved into these core problems one by one, we’ve gained insights into the multifaceted nature of multimodal machine learning. Researchers are constantly innovating and developing solutions to these challenges, paving the way for new frontiers in artificial intelligence and data analysis.