The submission deadline is approaching!

Authors are encouraged to submit previously unpublished research papers in the scope of one of the accepted special sessions. The submissions are peer-reviewed in a single-blind process. Unless otherwise specified, the same length limits as for regular papers apply.

The deadlines are the same as for regular papers. Please submit your paper to the Special Session category and tick the box for the title of the special session you are submitting your paper to.
The following special sessions are open for submission:

Further information about the remaining individual special sessions will be published shortly.

Cultural Heritage and Multimedia Content

This special session addresses the processing of all types of data related to cultural heritage. As stated by UNESCO, cultural heritage provides societies with a wealth of resources inherited from the past and created in the present for the benefit of future generations. It includes tangible (built and natural environments, artifacts) and intangible (such as traditions, language, and knowledge) heritage. The objective of this session is to bring together the various communities and the latest research dedicated to cultural heritage data on different aspects, from their acquisition up to their restitution, including retrieval, structuring, interactions, interfaces, analysis, etc.

For various applications, we will address the presentation of generic methods and their application to cultural heritage, as well as dedicated approaches designed to deal with such content. Non-exhaustively, we will consider:

  • Content-based multimedia indexing and retrieval
  • Deep representations in adverse conditions
  • Generative models for cultural heritage
  • Ontology and semantic web for cultural heritage
  • Knowledge-driven machine learning
  • Multi-source and multimodal visualization
  • Spatio-temporal analysis
  • Large-scale multimedia database management
  • Benchmarking, open data movement

The range of targeted applications is broad, including:

  • Analysis, archeometry of artifacts
  • Diagnosis and monitoring for restoration and preventive conservation
  • Geosciences / Geomatics for cultural heritage
  • Analysis of the evolution of the territory
  • Education
  • Smart and sustainable tourism
  • Digital Twins


Interactive Video Retrieval for Beginners (IVR4B)

Despite the advances in automated content description using deep learning, and the emergence of joint image-text embedding models, many video retrieval tasks still require a human user in the loop. This is in particular the case when the information needed is fuzzy, or when the underlying dataset is homogeneous, i.e. contains data from one domain, differing in small details and with little or no editorial structure. Interactive video retrieval (IVR) systems address these challenges. In order to assess their performance, multimedia retrieval benchmarks such as Video Browser Showdown (VBS) or Lifelog Search Challenge (LSC) have been established. These benchmarks provide large-scale datasets as well as task settings and evaluation protocols, allowing us to measure progress in research on IVR systems. However, in order to achieve the best possible performance of the participating systems, they are usually operated by members of the development team. This setting does not allow for properly measuring usability aspects of the system, which are important in order to deploy them successfully in a target application context, where they need to be operated by domain experts rather than video retrieval researchers.

This special session thus aims at providing better insights into how usable such systems are
for users with solid IT backgrounds who are not familiar with the details behind the system.
The special session thus calls for papers describing IVR systems, addressing topics such as:

  • Search functionalities supporting non-expert users
  • Browsing and navigation capabilities
  • Approaches to result visualization
  • Usability aspects of the system

The contributions to this session are short papers (4 pages + references), describing the participating IVR system. In contrast to papers such as those submitted to VBS or LSC, these papers should focus on how the system supports users who are not retrieval experts, and on the search and browsing features expected to be of particular interest to these users.

The review process is single-blind. A link to a three-minute video showcasing the usage of the system on the VBS collection must be included with the submission. Prior participation in VBS is not a prerequisite for submission.

The IVR systems will be presented in the demo session, followed by an interactive competition session, in which the systems are used by novice users to solve video retrieval tasks.


Physical Models and AI in Image and in Multi-modality 

Deep neural networks now enable the learning of complex functions from the data to address a variety of difficult problems. However, questions emerge on the relevance and understanding of their learned functions, especially when relating to physical models.
Knowledge of the physical environment can allow for the introduction of constraints on the model in order to reduce the search space and converge to more relevant and simplified solutions that can contribute to higher model confidence. Image synthesis and computer graphics are typical application domains that make use of the known physical constraints on objects and materials to produce realistic images.

Conversely, when the physical model is not perfectly known, AI-based models can help identify relevant solutions to improve the physicist’s knowledge and understanding of a given phenomenon. A typical example is haze removal in images, where the physical haze model is still not well identified.

Finally, when addressing AI and physics problems, the different data modalities need to be fused appropriately. Again, multimodality handling can be guided by knowledge of the physical model or can be learned in order to increase knowledge.  

This special session aims to bring together researchers working on the analysis, indexing, and mining of data related to images and multimodality in various fields involving physical models, remote sensing, astrophysics, mechanics, and computer graphics. It provides them with a venue for sharing novel ideas and discussing their most recent work, and promotes exchanges between computer scientists and astrophysicists.

Topics of interest include (but are not limited to):

  • Supervised learning: classification and regression
  • Unsupervised learning: clustering and dimensionality reduction
  • Real-time, on-site, or on-board processing
  • Learning physical models from data
  • From simulated to actual data
  • Time-series analysis
  • Management of data in physics


Computational Memorability of Imagery

The subject of memorability has seen a surge of interest since the likelihood of images being recognised upon subsequent viewing was found to be consistent across individuals. Driven primarily by the MediaEval Media Memorability task, which has just completed its 5th annual iteration, recent research has extended beyond static images, pivoting to the more dynamic and multi-modal medium of video.

The memorability of a video or an image is an abstract concept and, like other features such as aesthetics and beauty, is an intrinsic feature of imagery. There are many applications for predicting image and video memorability, including in marketing, where some part of a video advertisement should strive to be the most memorable; in education, where key parts of educational content should be memorable; in other areas of content creation, such as video summaries of longer events like movies or wedding photography; and in cinematography, where a director may want to make some parts of a movie or TV program more, or less, memorable than the rest.

For computing video memorability, researchers have used a variety of approaches including video vision transformers as well as more conventional machine learning, text features from text captions, a range of ensemble approaches, and even generating surrogate videos using stable diffusion methods. The performance of these approaches tells us that we are now close to the best performance for memorability prediction for video and for images that we could get using current techniques and that there are many research groups who can achieve such a level of performance.

We believe that image and video memorability is now ready for the spotlight and for researchers to be drawn to using video memorability prediction in creative ways. We invite submissions from researchers who wish to extend their reported techniques and/or apply those techniques to real-world applications like marketing, education, or other areas of content production. We hope that the output from this special session will be a community-wide realization of the potential for video memorability prediction and uptake in research into, and applications of, the topic.

The topics of the special session include, but are not limited to:

  • Development and interpretation of single- or multi-modal models for Computational Memorability
  • Transfer learning and transferability for Computational Memorability
  • Computational Memorability applications
  • Extending work from the MediaEval Predicting Media Memorability task
  • Cross- and multi-lingual aspects in Computational Memorability
  • Evaluation and resources for Computational Memorability
  • Computational Memorability prediction based on physiological data (e.g., EEG data)


P.S.: The contributions to this session are short papers (4 pages + references).

Cross-modal Multimedia Analysis and Retrieval for Well-being Insights

Gaining insights into well-being through the understanding of human mental and physical health, social relations, and connections, from both direct and indirect perspectives, has attracted attention for decades. The direct perspective observes self-data from wearable sensors, medical profiles, lifelog cameras, and personal social networks that reflect people’s health, activities, and behaviors. The indirect perspective captures social data through surrounding sensors, social network interaction, and third-party data to understand how people interact with their surrounding environment and society. By utilizing these perspectives, governments, industries, and citizens can gather intelligence, plan, control, retrieve, and make decisions in well-being and its impact areas efficiently and effectively. Numerous studies have been done on each perspective, but few have focused on analyzing and retrieving cross-data from different perspectives for better human benefit. In this context, developing cross-data multimedia analysis and retrieval is a critical issue for well-being insights.

This special session aims to bring together researchers and practitioners from various research fields in social, life, and natural sciences to discuss the latest developments and challenges in cross-data multimedia analysis and retrieval for well-being insights, including food computing, social activity recommendation, anti-infertility, personal training, stress reduction, mental/physical health improvement, and well-being research, to name a few.

The topics of interest include, but are not limited to:

  • Interpretation of multimedia for well-being
  • Psychological stress and social media use
  • Synthetic health data generation
  • The effect of media use on well-being
  • Lifestyle recommendations based on diverse observations
  • Multimodal personal health lifelog data analysis
  • Multimodal lifelog data analysis and retrieval
  • Food in the media and the health-conscious consumer
  • Training performance indications based on activity lifelogs
  • The effects of air pollution on human health
  • Driving safety improvement using multimedia and sensory data


  • Minh-Son Dao, National Institute of Information and Communications
    Technology, Japan
  • Vincent Nguyen, Orléans University, France
  • Michael Alexander Riegler, Simula Metropolitan Center for Digital Engineering, Norway
  • Duc-Tien Dang-Nguyen, Bergen University, Norway
  • Cathal Gurrin, Dublin City University, Ireland
  • Thanh-Binh Nguyen, Vietnam National University in HCM City, University
    of Science

Explainability in Multimedia Analysis (ExMA)

The rise of machine learning approaches, and in particular deep learning, has led to a
significant increase in the performance of AI systems. However, it has also raised the
question of the reliability and explicability of their predictions for decision-making (e.g., the
black-box issue of the deep models). Such shortcomings also raise many ethical and
political concerns that prevent wider adoption of this potentially highly beneficial technology,
especially in critical areas, such as healthcare, self-driving cars or security.
It is therefore critical to understand how their predictions correlate with information
perception and expert decision-making. The objective of eXplainable AI (XAI) is to open this
black box by proposing methods to understand and explain how these systems produce their
predictions.

Among the multitude of relevant multimedia data, face information is an important feature
when indexing image and video content containing humans. Annotations based on faces
span from the presence of faces (and thus persons), over localizing and tracking them,
and analyzing features (e.g., determining whether a person is speaking) to the identification of
persons from a pool of potential candidates or the verification of assumed identities. Unlike
many other types of metadata or features commonly used in multimedia applications, the
analysis of faces affects sensitive personal information. This raises both legal issues, e.g.
concerning data protection and regulations in the emerging European AI regulation, as well
as ethical issues, related to potential bias in the system or misuse of these technologies.

This special session focuses on AI-based explainability technologies in multimedia analysis,
and in particular on:

  • The analysis of the influencing factors relevant to the final decision as an essential
    step to understand and improve the underlying processes involved;
  • Information visualization for models or their predictions;
  • Interactive applications for XAI;
  • Performance assessment metrics and protocols for explainability;
  • Sample-centric and dataset-centric explanations;
  • Attention mechanisms for XAI;
  • XAI-based pruning;
  • Applications of XAI methods; and
  • Open challenges from industry or emerging legal frameworks.

This special session aims at collecting scientific contributions that will help improve trust and
transparency of multimedia analysis systems with important benefits for society as a whole.

We invite the submission of long papers describing novel methods or their adaptation to
specific applications, or short papers describing emerging work or open challenges. The
review process is single-blind, i.e., submissions do not need to be anonymized.