
Tutorials

The role of the tutorials is to provide a platform for a more intensive scientific exchange amongst researchers interested in a particular topic and as a meeting point for the community. Tutorials complement the depth-oriented technical sessions by providing participants with broad overviews of emerging fields. A tutorial can be scheduled for 1.5 or 3 hours.



Tutorial on
Seeing Through the User’s Eyes: Advances in Human-Centric Egocentric Vision


Instructor

Francesco Ragusa
University of Catania
Italy
 
Brief Bio
Francesco Ragusa is a Research Fellow at the University of Catania and has been a member of the IPLAB (University of Catania) research group since 2015. He completed an Industrial Doctorate in Computer Science in 2021; during his PhD studies, he spent a period as a Research Student at the University of Hertfordshire, UK. He received his master's degree in Computer Science (cum laude) from the University of Catania in 2017. Francesco has authored one patent and more than 10 papers in international journals and conference proceedings. He serves as a reviewer for several international conferences in the fields of computer vision and multimedia, such as CVPR, ECCV, BMVC, WACV, ACM Multimedia, ICPR, and ICIAP, and for international journals, including TPAMI, Pattern Recognition Letters, and IET Computer Vision. Francesco Ragusa is a member of IEEE, CVF, and CVPL. He has been involved in several research projects and has focused on human-object interaction anticipation from egocentric videos as a key to analyzing and understanding human behavior in industrial workplaces. Since 2021 he has been co-founder and CEO of NEXT VISION s.r.l., an academic spin-off of the University of Catania. His research interests concern Computer Vision, Pattern Recognition, and Machine Learning, with a focus on First Person Vision.
Abstract

Wearable devices equipped with cameras, sensors, and on-device AI capabilities are rapidly evolving, driven by the growing availability of commercial solutions and the integration of Augmented Reality into everyday workflows. These devices enable natural and continuous user-machine interaction and open the door to intelligent assistants that expand human capabilities. The combination of mobility, contextual awareness, and multimodal sensing makes wearable systems a unique platform for advanced AI and Computer Vision applications.


First-person (egocentric) vision, unlike traditional third-person approaches that observe the scene from an external point of view, captures the world directly from the user's perspective. This viewpoint provides privileged access to users’ actions, intentions, attention, and interactions with objects and the environment. Recent advances in multimodal learning, large-scale datasets, and foundation models have further accelerated research in egocentric understanding, task reasoning, and human-AI collaboration.


This tutorial will present an updated overview of the challenges and opportunities in egocentric vision, discussing its foundations while highlighting recent methodological breakthroughs and emerging applications.


Keywords

Wearable devices, first person vision, egocentric vision, augmented reality, egocentric datasets, action recognition, action anticipation, human-object interaction, procedural understanding

Aims and Learning Objectives

The participants will understand the main advantages of first person (egocentric) vision over third person vision for analyzing the user’s behavior, building personalized applications, and predicting future events. Specifically, the participants will learn about: 1) the main differences between third person and first person (egocentric) vision, including the way in which the data is collected and processed; 2) the devices which can be used to collect data and provide services to the users; 3) the algorithms which can be used to manage first person visual data, for instance to perform action recognition, human-object interaction, procedural understanding, and the prediction of future events.

Target Audience

First year PhD students, graduate students, researchers, practitioners.

Prerequisite Knowledge of Audience

Fundamentals of Computer Vision and Machine Learning (including Deep Learning)

Detailed Outline

The tutorial is divided into two parts and will cover the following topics:
Part I: History and motivation
• Agenda of the tutorial;
• Definitions, motivations, history and research trends of First Person (egocentric) Vision;
• Seminal works in First Person (Egocentric) Vision;
• Differences between Third Person and First Person Vision;
• First Person Vision datasets;
• Wearable devices to acquire/process first person visual data;
• Main research trends in First Person (Egocentric) Vision;
Part II: Fundamental tasks for first person vision systems:
• Visual Localization;
• Attended Object Detection;
• Hand-Object Interaction;
• Procedural Understanding;
• Actions and Objects anticipation;
• Dual-Agent Language Assistance;
• Industrial Applications.
The tutorial will cover the main technological tools (devices and algorithms) that can be used to build first person vision applications, discuss challenges and open problems, and conclude with insights for research in the field.

Secretariat Contacts
e-mail: visapp.secretariat@insticc.org

Tutorial on
Advanced Methods for Visual Information Retrieval and Exploration in Large Multimedia Collections


Instructor

Kai Uwe Barthel
HTW Berlin / vviinn
Germany
 
Brief Bio
Kai Uwe Barthel is a professor at the Institute for Media and Computing at HTW Berlin, where he leads the Visual Computing Group. His work centers on technologies for media retrieval, including image understanding, metric learning, and visual exploration. He earned his PhD from the Technical University of Berlin with highest honors for research on fractal image compression and later led a 3D-video coding project. As Head of R&D at N-Tec Media and LuraTech Inc., he contributed to the development of the JPEG2000 standard. Since 2001, he has taught image analysis, machine learning, and information retrieval at HTW Berlin. In 2009, he founded pixolution, a company specializing in visual image search. Currently, he also heads the AI research team at vviinn, focusing on AI applications for e-commerce. Prof. Barthel holds numerous patents and has published widely on image sorting and visual navigation. More information is available at www.visual-computing.com.
Abstract

This tutorial presents advanced methods for efficiently searching and exploring large visual datasets, addressing the increasing demand for high-performance visual retrieval systems as multimedia content grows exponentially. We will begin by covering the principles of large visual encoders, emphasizing techniques for generating and improving compact, high-quality general-purpose visual descriptors. Crucially, we address the complex challenge of joint embedding misalignment, presenting strategies to optimize image encoders for superior retrieval accuracy while strictly preserving their alignment with text models. Participants will gain insights into approximate nearest neighbor search methods, with a focus on graph-based approaches that maximize search efficiency in dynamic datasets. In addition, the tutorial introduces innovative visualization techniques for high-dimensional data, such as grid-based sorting, which enable intuitive navigation and exploration of extensive image collections. Hands-on exercises provide practical experience with the concepts discussed using Jupyter notebooks and interactive demonstrations.

Keywords

Visual Information Retrieval, Cross-modal Retrieval, Contrastive Language-Image Pairing (CLIP), Approximate Nearest Neighbor Search (ANNS), Dynamic Graphs, Visual Exploration, Grid-based Sorting, Image Sorting.

Aims and Learning Objectives

This tutorial aims to equip participants with essential skills for image retrieval, focusing on advanced encoding techniques like Contrastive Language-Image Pairing (CLIP) and state-of-the-art image encoders to enhance general-purpose and cross-modal retrieval. Participants will gain hands-on experience with graph-based Approximate Nearest Neighbor Search (ANNS) methods, optimized for large, dynamic datasets. Additionally, the tutorial covers organizing and visually presenting extensive image collections using innovative grid-based sorting techniques. Participants will learn to create intuitive layouts that enhance search result visualization and support seamless exploration of large datasets.

Target Audience

Researchers & Practitioners: Interested in computer vision, data science, and multimedia information retrieval.
Students: Seeking to learn about multimedia content organization and visual exploration systems.
Data Scientists: Looking to manage high-dimensional data and visual foundation models.


Prerequisite Knowledge of Audience

Machine Learning/Data Science: Basic familiarity with concepts such as embeddings and search algorithms.
Mathematics: Basic knowledge of vector spaces is helpful.
Programming: Python programming skills are recommended for the hands-on parts (Jupyter notebooks).


Detailed Outline

Part 1: Introduction (approx. 30 mins)
Overview: Goals, schedule, and motivation.
Challenges: Addressing performance bottlenecks in visual information retrieval and the difficulties of visualizing high-dimensional data.
Setup: Accessing code examples via GitHub.

Part 2: Image Encoders for Effective Retrieval (approx. 45 mins)
Encoding Techniques: Representing images as vector embeddings for efficient search and comparison.
CLIP Optimization: Optimizing Contrastive Language-Image Pairing (CLIP) models to produce robust representations and improve generalization in similarity search tasks.
Cross-modal Retrieval: Solving the critical "misalignment" problem in joint embeddings, optimizing image encoders to significantly boost retrieval accuracy while strictly preserving alignment with text models, enabling a single, unified vector space for all search tasks.
Evaluation: Assessing retrieval embedding models using retrieval-specific loss functions.
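
As a concrete illustration of the encoding and cross-modal retrieval steps outlined above, the minimal sketch below (illustrative only, not the tutorial's own code) embeds a handful of images and a text query into a joint CLIP space using the publicly available clip-ViT-B-32 checkpoint from the sentence-transformers library and ranks the images by cosine similarity; the file names are placeholders.

```python
# Minimal, illustrative sketch of CLIP-based cross-modal retrieval
# (not the tutorial's code): images and text queries are mapped into
# one joint embedding space, so a text query can rank images directly.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # public CLIP checkpoint

# Encode a small image collection into L2-normalized embeddings.
image_paths = ["cat.jpg", "beach.jpg", "skyline.jpg"]  # placeholder files
image_embeddings = model.encode(
    [Image.open(p) for p in image_paths],
    convert_to_tensor=True,
    normalize_embeddings=True,
)

# Encode a text query into the same space and rank images by cosine similarity.
query_embedding = model.encode(
    "a photo of a cat", convert_to_tensor=True, normalize_embeddings=True
)
scores = util.cos_sim(query_embedding, image_embeddings)[0]
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```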

Part 3: Approximate Nearest Neighbor Search (ANNS) in Dynamic Datasets (approx. 45 mins)
ANNS Fundamentals: Techniques and challenges in high-dimensional feature spaces.
Graph-based Methods: Introduction to graph-based ANNS for high search efficiency and adaptability.
Dynamic Graphs: Algorithms for constructing and updating novel dynamic graph structures that achieve state-of-the-art performance in evolving datasets.
Exploratory Search: Analyzing graph properties critical for complex applications with variable result lengths and metadata filters.
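
To make the graph-based ANNS discussion more tangible, the sketch below (an illustrative example, not the presenters' implementation) uses the hnswlib package, whose HNSW indices are graph-based and support incremental insertion, making them a reasonable stand-in for search in dynamic datasets; the dimensionality, parameters, and random data are placeholders.

```python
# Illustrative sketch of graph-based ANNS with HNSW (hnswlib),
# not the methods presented in the tutorial. Random vectors stand in
# for real image embeddings.
import numpy as np
import hnswlib

dim = 512  # e.g., the size of a CLIP image embedding
embeddings = np.random.rand(10_000, dim).astype(np.float32)  # placeholder data

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=100_000, ef_construction=200, M=16)
index.add_items(embeddings, np.arange(len(embeddings)))

# The index can grow as new images arrive (dynamic dataset).
new_items = np.random.rand(1_000, dim).astype(np.float32)
new_ids = np.arange(len(embeddings), len(embeddings) + len(new_items))
index.add_items(new_items, new_ids)

# Query: the ef parameter trades accuracy against speed at search time.
index.set_ef(64)
labels, distances = index.knn_query(embeddings[:1], k=10)
print(labels[0], distances[0])
```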

Part 4: Visual Exploration & Conclusion (approx. 60 mins)
Visualization Challenges: Why traditional point-based methods (PCA, t-SNE) fail for image sorting and content visibility.
Grid-based Sorting: Algorithms for arranging images by visual similarity to enable the viewing of hundreds of images simultaneously.
Interactive Exploration: Combining grid-based sorting with dynamic similarity graphs to create navigable 2D maps of large collections.
Conclusion & Q&A: Summary of techniques and final discussion.
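
The simplified sketch below conveys the grid-based sorting idea with off-the-shelf tools (a 2D projection followed by a linear assignment); it is not one of the dedicated sorting algorithms covered in the tutorial, and all data and sizes are placeholders. Image embeddings are projected to two dimensions and then snapped onto a regular grid so that visually similar images end up in neighboring cells.

```python
# Simplified illustration of grid-based image sorting (not the tutorial's
# algorithms): project embeddings to 2D, then assign each image to one
# grid cell while minimizing total displacement.
import numpy as np
from sklearn.decomposition import PCA
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

n_rows, n_cols = 8, 8
embeddings = np.random.rand(n_rows * n_cols, 512)  # placeholder image embeddings

# 1) Reduce to 2D positions that roughly preserve similarity structure.
positions_2d = PCA(n_components=2).fit_transform(embeddings)
pos_min, pos_max = positions_2d.min(0), positions_2d.max(0)
positions_2d = (positions_2d - pos_min) / (pos_max - pos_min + 1e-9)

# 2) Build the target grid coordinates in the unit square.
grid = np.stack(
    np.meshgrid(np.linspace(0, 1, n_cols), np.linspace(0, 1, n_rows)), axis=-1
).reshape(-1, 2)

# 3) Assign each image to exactly one grid cell, minimizing total displacement.
cost = cdist(positions_2d, grid, metric="sqeuclidean")
img_idx, cell_idx = linear_sum_assignment(cost)
layout = np.empty(n_rows * n_cols, dtype=int)
layout[cell_idx] = img_idx  # image id placed in each grid cell
print(layout.reshape(n_rows, n_cols))
```

Note that an exact assignment like this becomes expensive beyond a few thousand images, which is one motivation for the faster sorting algorithms discussed in this part.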

Secretariat Contacts
e-mail: visapp.secretariat@insticc.org
