
Keynote Lectures

Rethinking Multimodal AI Models: Beyond Accuracy, Towards Trust
Elisa Ricci, University of Trento, Italy

Self-Supervised Learning Is Dead, Long Live Self-Supervised Learning (in the Age of MLLMs)
Yuki Asano, University of Technology Nuremberg, Germany

Seeing, Speaking, and Reasoning in a Visual World
Cees G. M. Snoek, University of Amsterdam, Netherlands

 

Rethinking Multimodal AI Models: Beyond Accuracy, Towards Trust

Elisa Ricci
University of Trento, Italy
 

Brief Bio
Elisa Ricci is a Professor at the University of Trento and Head of the Deep Visual Learning research unit at Fondazione Bruno Kessler. Her research lies at the intersection of computer vision, deep learning, and robotics perception. She is interested in developing novel approaches for learning from visual and multi-modal data in an open world. Elisa received her MSc and PhD from the University of Perugia and has held positions at the Idiap Research Institute and Fondazione Bruno Kessler, as well as visiting researcher positions at ETH Zurich and the University of Bristol. She has co-authored over 200 scientific publications with broad impact in vision and learning research. She was the General Co-Chair of ACM ICMR 2025 and Program Co-Chair of ACM MM 2022 and ECCV 2024. Her group has received many best paper awards, including the Best Paper Award at ACM MM 2015. She serves on the editorial boards of several international journals, such as Pattern Recognition and Computer Vision and Image Understanding. Elisa has led and contributed to multiple national and EU research projects, including H2020 and Horizon Europe projects such as SPRING and ELLIOT. She is an ELLIS and an IAPR Fellow.


Abstract
The rapid adoption of Large Multimodal Models is reshaping smart technologies, while simultaneously raising fundamental questions about trust, safety and accountability. Moving beyond traditional performance metrics, real-world deployment demands a deeper understanding of model behavior under uncertainty, bias, and privacy constraints. This talk presents a perspective on Trustworthy AI approaches, with a focus on vision-language tasks, across five interconnected dimensions: (i) eXplainable AI for making multimodal decisions transparent; (ii) bias discovery and mitigation through scalable and automated approaches; (iii) automatic benchmarking frameworks for robust evaluation of Large Multimodal Models; (iv) uncertainty quantification as a foundation for reliable decision making; and (v) privacy-based learning, including privacy leakage analysis, unlearning and federated learning. The talk concludes by outlining open research challenges toward principled and socially aligned AI systems for the next generation of smart technologies.



 

 

Self-Supervised Learning Is Dead, Long Live Self-Supervised Learning (in the Age of MLLMs)

Yuki Asano
University of Technology Nuremberg, Germany
 

Brief Bio
Yuki M. Asano leads the Fundamental AI (FunAI) Lab at the University of Technology Nuremberg as a full Professor, having previously led the QUVA Lab at the University of Amsterdam as an Assistant Professor, where he collaborated with Qualcomm AI Research. He completed his PhD at Oxford's Visual Geometry Group (VGG), working with Andrea Vedaldi and Christian Rupprecht. His research interests include computer vision and machine learning, particularly self-supervised and multimodal learning. He has received an Outstanding Paper Award at ICLR, is an ELLIS Scholar, and has served as area chair and senior area chair for top conferences including NeurIPS, ICLR, and CVPR. He also organizes workshops and PhD schools, including the ELLIS Winter School on Foundation Models.


Abstract
Is self-supervised learning still used and useful, or can we stop thinking about it?
In this talk, we will explore how self-supervised learning has inspired the current prevailing paradigms, how it has become 'just another' tool in the deep learning toolbox, and where it might be headed next. We will cover recent innovations in pre- and post-training for image, video, image-text, and point cloud data.



 

 

Seeing, Speaking, and Reasoning in a Visual World

Cees G. M. Snoek
University of Amsterdam, Netherlands
https://www.ceessnoek.info
 

Brief Bio
Cees G.M. Snoek is a full professor in artificial intelligence at the University of Amsterdam, where he heads the Video & Image Sense Lab and the interdisciplinary Human-Aligned Video AI Lab. He is also a director of three public-private AI research labs with stakeholders such as Qualcomm, TomTom and TNO, and acts as Chief Scientific Officer at university spin-off Kepler Vision Technologies. Professor Snoek is also scientific director of Amsterdam AI, a collaboration between government, academic, medical and other organisations in Amsterdam to study, develop and deploy responsible AI. He was previously an assistant and associate professor at the University of Amsterdam, as well as a Visiting Scientist at Carnegie Mellon University, a Fulbright Junior Scholar at UC Berkeley, head of R&D at university spin-off Euvision Technologies, and managing principal engineer at Qualcomm Research Europe.


Abstract
Vision–language foundation models have made striking progress, yet they still fall short of forming coherent models of the world. Many systems remain linguistically narrow, visually ungrounded, and reliant on shallow, single-step reasoning, limiting their ability to generalize across languages, cultures, and complex visual scenes. In this talk, I will argue that meaningful progress requires treating seeing, speaking, and reasoning as a unified problem, grounded in visual evidence and inclusive by design. I will present recent advances that move toward this goal along two tightly connected dimensions. First, I will discuss how multilingual modeling reshapes text-to-image generation. Moving beyond translation-based pipelines, learning visual concepts directly across languages enables more faithful, culturally aligned generation while retaining strong performance and efficiency. This reframes inclusivity as a catalyst for better representations rather than a trade-off. Second, I will address the limitations of one-shot visual reasoning and introduce an approach to iterative, grounded reasoning. By explicitly linking each reasoning step to image regions and enforcing consistency between global scene understanding and local visual evidence, models can achieve more accurate, interpretable, and spatially precise reasoning across challenging visual tasks. Together, these directions outline a broader vision for visual AI: systems that can perceive the world, communicate across languages, and reason over space and time in a grounded and unified manner—moving beyond surface-level fluency toward deeper visual understanding.


