Keynote Lectures

Available Soon
Elisa Ricci, University of Trento, Italy

Self-Supervised Learning Is Dead, Long Live Self-Supervised Learning (in the Age of MLLMs)
Yuki Asano, University of Technology Nuremberg, Germany

Seeing, Speaking, and Reasoning in a Visual World
Cees G. M. Snoek, University of Amsterdam, Netherlands

 

Available Soon

Elisa Ricci
University of Trento, Italy
 

Brief Bio
Prof. Elisa Ricci (PhD, University of Perugia, 2008) is an Associate Professor at the Department of Information Engineering and Computer Science (DISI) at the University of Trento and the head of the Deep Visual Learning research group at Fondazione Bruno Kessler. She has published over 160 papers in international venues. Her research interests are mainly in the areas of computer vision, robotic perception, and multimedia analysis. At UNITN she is the Coordinator of the Doctoral Programme in Information Engineering and Computer Science. She is an Associate Editor of IEEE Transactions on Multimedia, Computer Vision and Image Understanding, and Pattern Recognition. She was the Program Chair of ACM MM 2020 and the Diversity Chair of ACM MM 2022. She is the recipient of the ACM MM 2015 Best Paper Award and an ICCV 2021 Honorable Mention award.


Abstract
Available Soon



 

 

Self-Supervised Learning Is Dead, Long Live Self-Supervised Learning (in the Age of MLLMs)

Yuki Asano
University of Technology Nuremberg, Germany
 

Brief Bio
Yuki M. Asano leads the Fundamental AI (FunAI) Lab at the University of Technology Nuremberg as a full Professor, having previously led the QUVA lab at the University of Amsterdam as an Assistant Professor, where he collaborated with Qualcomm AI Research. He completed his PhD at Oxford's Visual Geometry Group (VGG), working with Andrea Vedaldi and Christian Rupprecht. His research interests include computer vision and machine learning, particularly self-supervised and multimodal learning. He has won an Outstanding Paper Award at ICLR, is an ELLIS Scholar, has served as area chair and senior area chair for top conferences including NeurIPS, ICLR, and CVPR, and organizes workshops and PhD schools, including the ELLIS winter school on Foundation Models.


Abstract
Is self-supervised learning still used and useful, or can we stop thinking about it?
In this talk, we will explore how self-supervised learning has inspired the current prevailing paradigms, how it has become 'just another' tool in the deep learning toolbox, and where it might be headed next. We will explore recent innovations in pre- and post-training for images, videos, image-text, and point cloud data.



 

 

Seeing, Speaking, and Reasoning in a Visual World

Cees G. M. Snoek
University of Amsterdam, Netherlands
https://www.ceessnoek.info
 

Brief Bio
Cees G.M. Snoek is a full professor in artificial intelligence at the University of Amsterdam, where he heads the Video & Image Sense Lab and the interdisciplinary Human-Aligned Video AI Lab. He is also a director of three public-private AI research labs with stakeholders such as Qualcomm, TomTom, and TNO. At university spin-off Kepler Vision Technologies he acts as Chief Scientific Officer. Professor Snoek is also scientific director of Amsterdam AI, a collaboration between government, academic, medical, and other organisations in Amsterdam to study, develop, and deploy responsible AI. He was previously an assistant and associate professor at the University of Amsterdam, as well as a Visiting Scientist at Carnegie Mellon University, a Fulbright Junior Scholar at UC Berkeley, head of R&D at university spin-off Euvision Technologies, and managing principal engineer at Qualcomm Research Europe.


Abstract
Vision–language foundation models have made striking progress, yet they still fall short of forming coherent models of the world. Many systems remain linguistically narrow, visually ungrounded, and reliant on shallow, single-step reasoning, limiting their ability to generalize across languages, cultures, and complex visual scenes. In this talk, I will argue that meaningful progress requires treating seeing, speaking, and reasoning as a unified problem, grounded in visual evidence and inclusive by design. I will present recent advances that move toward this goal along two tightly connected dimensions. First, I will discuss how multilingual modeling reshapes text-to-image generation. Moving beyond translation-based pipelines, learning visual concepts directly across languages enables more faithful, culturally aligned generation while retaining strong performance and efficiency. This reframes inclusivity as a catalyst for better representations rather than a trade-off. Second, I will address the limitations of one-shot visual reasoning and introduce an approach to iterative, grounded reasoning. By explicitly linking each reasoning step to image regions and enforcing consistency between global scene understanding and local visual evidence, models can achieve more accurate, interpretable, and spatially precise reasoning across challenging visual tasks. Together, these directions outline a broader vision for visual AI: systems that can perceive the world, communicate across languages, and reason over space and time in a grounded and unified manner—moving beyond surface-level fluency toward deeper visual understanding.


