Tutorial on
Seeing Through the User’s Eyes: Advances in Human-Centric Egocentric Vision
Instructor
Francesco Ragusa
University of Catania
Italy
Brief Bio
Francesco Ragusa is a Research Fellow at the University of Catania. He has been a member of the IPLAB research group (University of Catania) since 2015. He completed an Industrial Doctorate in Computer Science in 2021. During his PhD studies, he spent a period as a Research Student at the University of Hertfordshire, UK. He received his master’s degree in Computer Science (cum laude) from the University of Catania in 2017. Francesco has authored one patent and more than 10 papers in international journals and international conference proceedings. He serves as a reviewer for several international conferences in the fields of computer vision and multimedia, such as CVPR, ECCV, BMVC, WACV, ACM Multimedia, ICPR, and ICIAP, and for international journals, including TPAMI, Pattern Recognition Letters, and IET Computer Vision. Francesco Ragusa is a member of IEEE, CVF, and CVPL. He has been involved in different research projects and has focused on human-object interaction anticipation from egocentric videos as a key to analyzing and understanding human behavior in industrial workplaces. He has been co-founder and CEO of NEXT VISION s.r.l., an academic spin-off of the University of Catania, since 2021. His research interests concern Computer Vision, Pattern Recognition, and Machine Learning, with a focus on First Person Vision.
Abstract
Wearable devices equipped with cameras, sensors, and on-device AI capabilities are rapidly evolving, driven by the growing availability of commercial solutions and the integration of Augmented Reality into everyday workflows. These devices enable natural and continuous user-machine interaction and open the door to intelligent assistants that expand human capabilities. The combination of mobility, contextual awareness, and multimodal sensing makes wearable systems a unique platform for advanced AI and Computer Vision applications.
First-person (egocentric) vision, unlike traditional third-person approaches that observe the scene from an external point of view, captures the world directly from the user's perspective. This viewpoint provides privileged access to users’ actions, intentions, attention, and interactions with objects and the environment. Recent advances in multimodal learning, large-scale datasets, and foundation models have further accelerated research in egocentric understanding, task reasoning, and human-AI collaboration.
This tutorial will present an updated overview of the challenges and opportunities in egocentric vision, discussing its foundations while highlighting recent methodological breakthroughs and emerging applications.
Keywords
Wearable devices, first person vision, egocentric vision, augmented reality, egocentric datasets, action recognition, action anticipation, human-object interaction, procedural understanding
Aims and Learning Objectives
Participants will understand the main advantages of first person (egocentric) vision over third person vision for analyzing the user’s behavior, building personalized applications, and predicting future events. Specifically, participants will learn about: 1) the main differences between third person and first person (egocentric) vision, including the way in which the data is collected and processed; 2) the devices which can be used to collect data and provide services to the users; 3) the algorithms which can be used to manage first person visual data, for instance to perform action recognition, human-object interaction, procedural understanding, and the prediction of future events.
Target Audience
First year PhD students, graduate students, researchers, practitioners.
Prerequisite Knowledge of Audience
Fundamentals of Computer Vision and Machine Learning (including Deep Learning)
Detailed Outline
The tutorial is divided into two parts and will cover the following topics:
Part I: History and motivation
• Agenda of the tutorial;
• Definitions, motivations, history and research trends of First Person (egocentric) Vision;
• Seminal works in First Person (Egocentric) Vision;
• Differences between Third Person and First Person Vision;
• First Person Vision datasets;
• Wearable devices to acquire/process first person visual data;
• Main research trends in First Person (Egocentric) Vision;
Part II: Fundamental tasks for first person vision systems:
• Visual Localization;
• Attended Object Detection;
• Hand-Object Interaction;
• Procedural Understanding;
• Actions and Objects anticipation;
• Dual-Agent Language Assistance;
• Industrial Applications;
The tutorial will cover the main technological tools (devices and algorithms) which can be used to build first person vision applications, discuss challenges and open problems, and conclude with insights for research in the field.
Tutorial on
Advanced Methods for Visual Information Retrieval and Exploration in Large Multimedia Collections
Instructor
Kai Uwe Barthel
HTW Berlin / vviinn
Germany
Brief Bio
Kai Uwe Barthel is a professor at the Institute for Media and Computing at HTW Berlin, where he leads the Visual Computing Group. His work centers on technologies for media retrieval, including image understanding, metric learning, and visual exploration. He earned his PhD from the Technical University of Berlin with highest honors for research on fractal image compression and later led a 3D-video coding project. As Head of R&D at N-Tec Media and LuraTech Inc., he contributed to the development of the JPEG2000 standard. Since 2001, he has taught image analysis, machine learning, and information retrieval at HTW Berlin. In 2009, he founded pixolution, a company specializing in visual image search. Currently, he also heads the AI research team at vviinn, focusing on AI applications for e-commerce. Prof. Barthel holds numerous patents and publications regarding image sorting and visual navigation. More information is available at www.visual-computing.com.
Abstract
This tutorial presents advanced methods for efficiently searching and exploring large visual datasets, addressing the increasing demand for high-performance visual retrieval systems as multimedia content grows exponentially. We will begin by covering the principles of large visual encoders, emphasizing techniques for generating and improving compact, high-quality general-purpose visual descriptors. Crucially, we address the complex challenge of joint embedding misalignment, presenting strategies to optimize image encoders for superior retrieval accuracy while strictly preserving their alignment with text models. Participants will gain insights into approximate nearest neighbor search methods, with a focus on graph-based approaches that maximize search efficiency in dynamic datasets. In addition, the tutorial introduces innovative visualization techniques for high-dimensional data, such as grid-based sorting, which enable intuitive navigation and exploration of extensive image collections. Hands-on exercises provide practical experience with the concepts discussed using Jupyter notebooks and interactive demonstrations.
Keywords
Visual Information Retrieval, Cross-modal Retrieval, Contrastive Language-Image Pre-training (CLIP), Approximate Nearest Neighbor Search (ANNS), Dynamic Graphs, Visual Exploration, Grid-based Sorting, Image Sorting
Aims and Learning Objectives
This tutorial aims to equip participants with essential skills for image retrieval, focusing on advanced encoding techniques like Contrastive Language-Image Pre-training (CLIP) and state-of-the-art image encoders to enhance general-purpose and cross-modal retrieval. Participants will gain hands-on experience with graph-based Approximate Nearest Neighbor Search (ANNS) methods, optimized for large, dynamic datasets. Additionally, the tutorial covers organizing and visually presenting extensive image collections using innovative grid-based sorting techniques. Participants will learn to create intuitive layouts that enhance search result visualization and support seamless exploration of large datasets.
Target Audience
Researchers & Practitioners: Interested in computer vision, data science, and multimedia information retrieval.
Students: Seeking to learn about multimedia content organization and visual exploration systems.
Data Scientists: Looking to manage high-dimensional data and visual foundation models.
Prerequisite Knowledge of Audience
Machine Learning/Data Science: Basic familiarity with concepts such as embeddings and search algorithms.
Mathematics: Basic knowledge of vector spaces is helpful.
Programming: Python programming skills are recommended for the hands-on parts (Jupyter notebooks).
Detailed Outline
Part 1: Introduction (approx. 30 mins)
Overview: Goals, schedule, and motivation.
Challenges: Addressing performance bottlenecks in visual information retrieval and the difficulties of visualizing high-dimensional data.
Setup: Accessing code examples via GitHub.
Part 2: Image Encoders for Effective Retrieval (approx. 45 mins)
Encoding Techniques: Representing images as vector embeddings for efficient search and comparison.
CLIP Optimization: Optimizing Contrastive Language-Image Pre-training (CLIP) models to produce robust representations and improve generalization in similarity search tasks.
Cross-modal Retrieval: Solving the critical "misalignment" problem in joint embeddings by optimizing image encoders to significantly boost retrieval accuracy while strictly preserving alignment with text models, enabling a single, unified vector space for all search tasks (a minimal embedding sketch follows this outline part).
Evaluation: Assessing retrieval embedding models using retrieval-specific loss functions.
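To make the encoding step concrete, the following minimal sketch (referenced above) embeds a few images and a text query into a shared vector space and ranks the images by cosine similarity. It uses the public openai/clip-vit-base-patch32 checkpoint via the Hugging Face transformers library as an illustrative stand-in; the optimized retrieval encoders discussed in this part are not assumed here, and the image paths are placeholders.

```python
# Minimal sketch: encode images and a text query into a shared CLIP embedding
# space and rank the images for the query by cosine similarity.
# Model, library, and file paths are illustrative choices, not the tutorial's
# own optimized encoders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

images = [Image.open(p).convert("RGB") for p in ["cat.jpg", "beach.jpg"]]  # placeholder paths
query = "a photo of a cat"

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=[query], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)

# L2-normalize so dot products are cosine similarities in the shared vector
# space used for both image-to-image and text-to-image retrieval.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

scores = (txt_emb @ img_emb.T).squeeze(0)   # similarity of the query to each image
ranking = scores.argsort(descending=True)   # best-matching images first
print(ranking.tolist(), scores.tolist())
```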
Part 3: Approximate Nearest Neighbor Search (ANNS) in Dynamic Datasets (approx. 45 mins)
ANNS Fundamentals: Techniques and challenges in high-dimensional feature spaces.
Graph-based Methods: Introduction to graph-based ANNS for high search efficiency and adaptability.
Dynamic Graphs: Algorithms for constructing and updating novel dynamic graph structures that achieve state-of-the-art performance in evolving datasets (a minimal incremental-index sketch follows this outline part).
Exploratory Search: Analyzing graph properties critical for complex applications with variable result lengths and metadata filters.
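As a concrete reference point for the graph-based and dynamic aspects above, the sketch below builds an HNSW index with the hnswlib library, inserts an additional batch of vectors into the existing graph, and runs a k-NN query. hnswlib is used only as a representative off-the-shelf graph-based ANNS implementation; it is not the novel dynamic graph structures presented in the tutorial, and all dimensions and parameters are illustrative.

```python
# Minimal sketch of graph-based ANN search with incremental updates, using
# hnswlib (HNSW) as a stand-in for the dynamic graph structures discussed in
# the tutorial. Dimensions, parameters, and random data are illustrative.
import numpy as np
import hnswlib

dim = 512                                  # e.g. a CLIP embedding size
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=100_000, ef_construction=200, M=16)

# Initial batch of embeddings
emb = np.random.rand(10_000, dim).astype(np.float32)
index.add_items(emb, ids=np.arange(10_000))

# The dataset evolves: new items are inserted into the existing graph
new_emb = np.random.rand(1_000, dim).astype(np.float32)
index.add_items(new_emb, ids=np.arange(10_000, 11_000))

# Query: ef trades accuracy against search time
index.set_ef(100)
query = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)
print(labels[0], distances[0])
```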
Part 4: Visual Exploration & Conclusion (approx. 60 mins)
Visualization Challenges: Why traditional point-based methods (PCA, t-SNE) fail for image sorting and content visibility.
Grid-based Sorting: Algorithms for arranging images by visual similarity to enable the viewing of hundreds of images simultaneously (a simplified sorting sketch follows this outline part).
Interactive Exploration: Combining grid-based sorting with dynamic similarity graphs to create navigable 2D maps of large collections.
Conclusion & Q&A: Summary of techniques and final discussion.
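For orientation, here is a deliberately simplified sketch of grid-based sorting: image embeddings are projected to 2D with PCA and then snapped to grid cells with a single linear assignment step, so that similar images land in nearby cells. This is only an illustrative approximation of the idea, not the sorting algorithms presented in the tutorial; the random embeddings stand in for real image features.

```python
# Minimal sketch of grid-based sorting: place N images on a 2D grid so that
# visually similar images (similar embeddings) end up in nearby cells.
# PCA + one linear assignment step is a simplified stand-in for the
# tutorial's sorting algorithms.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(64, 512)).astype(np.float32)  # 64 images, 8x8 grid
rows = cols = 8

# Rough 2D layout of the high-dimensional embeddings
xy = PCA(n_components=2).fit_transform(embeddings)
xy = (xy - xy.min(0)) / (np.ptp(xy, axis=0) + 1e-9)          # normalize to [0, 1]

# Grid cell centers in the same normalized coordinates
gy, gx = np.mgrid[0:rows, 0:cols]
cells = np.stack([gx.ravel() / (cols - 1), gy.ravel() / (rows - 1)], axis=1)

# Assign each image to exactly one cell, minimizing total displacement
cost = ((xy[:, None, :] - cells[None, :, :]) ** 2).sum(-1)   # N x N cost matrix
img_idx, cell_idx = linear_sum_assignment(cost)

grid = np.empty(rows * cols, dtype=int)
grid[cell_idx] = img_idx                                      # image id per grid cell
print(grid.reshape(rows, cols))
```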
Tutorial on
Optimizing 3D Scene Representations: From Gaussian Splatting Compression to Self-Organizing Grids
Instructors
Peter Eisert
Fraunhofer HHI, Humboldt University Berlin
Germany
Brief Bio
Peter Eisert is Professor for Visual Computing at the Humboldt University Berlin and head of the Vision & Imaging Technologies Department of the Fraunhofer Institute for Telecommunications - Heinrich Hertz Institute (HHI), Berlin, Germany. He received the Dr.-Ing. degree "with highest honors" from the University of Erlangen-Nuremberg, Germany, in 2000. In 2001, he worked as a postdoctoral fellow at Stanford University, USA, on 3D image analysis as well as facial animation and computer graphics. In 2002, he joined Fraunhofer HHI, where he has coordinated and initiated numerous national and international third-party funded research projects with a total budget of more than 25 million euros. He has published more than 200 conference and journal papers, is an Associate Editor of the International Journal of Image and Video Processing, and serves on the Editorial Board of the Journal of Visual Communication and Image Representation. His research interests include 3D image/video analysis and synthesis, face and body processing, machine learning, computer vision, and computer graphics, with application areas such as multimedia, security, and medicine.
Kai Uwe Barthel
HTW Berlin / vviinn
Germany
Brief Bio
Kai Uwe Barthel is a professor at the Institute for Media and Computing at HTW Berlin, where he leads the Visual Computing Group. His work centers on technologies for media retrieval, including image understanding, metric learning, and visual exploration. He earned his PhD from the Technical University of Berlin with highest honors for research on fractal image compression and later led a 3D-video coding project. As Head of R&D at N-Tec Media and LuraTech Inc., he contributed to the development of the JPEG2000 standard. Since 2001, he has taught image analysis, machine learning, and information retrieval at HTW Berlin. In 2009, he founded pixolution, a company specializing in visual image search. Currently, he also heads the AI research team at vviinn, focusing on AI applications for e-commerce. Prof. Barthel holds numerous patents and publications regarding image sorting and visual navigation. More information is available at www.visual-computing.com.
Abstract
This tutorial provides a comprehensive guide to state-of-the-art techniques in 3D Gaussian Splatting (3DGS), focusing on efficiency, storage optimization, and structured data organization. We begin with a foundational introduction to 3DGS, establishing how it models scenes using explicit 3D Gaussians. The core of the tutorial explores the concepts of Compaction (reducing the number of primitives) and Compression (reducing bit-rate via quantization and entropy coding), reviewing recent methods such as HAC, Compact3D, Scaffold-GS, and SOG. Bridging computer graphics and data visualization, we introduce Advanced Grid-based Sorting methods, specifically Fast Linear Assignment Sorting (FLAS) and the Distance Preservation Quality (DPQ) metric. Finally, we synthesize these concepts with "Self-Organizing Gaussian Splats" (SOG) and show how the arrangement of 3D Gaussian parameters in sorted 2D grids exploits redundancies for superior compression performance.
Keywords
3D Gaussian Splatting, Neural Rendering, Data Compression, Scene Compaction, Grid Layouts, Linear Assignment Sorting (LAS), Self-Organizing Maps, Distance Preservation Quality (DPQ)
Aims and Learning Objectives
This tutorial aims to bridge the gap between high-fidelity neural rendering and efficient data representation by exploring how organizing unstructured 3D data into structured 2D grids enables superior compression. Participants will learn to distinguish between compaction and compression strategies in the 3D Gaussian Splatting (3DGS) pipeline and understand the mechanics of Fast Linear Assignment Sorting (FLAS) for mapping high-dimensional data. By mastering the Self-Organizing Gaussians (SOG) workflow, attendees will gain the skills to leverage standard image codecs for 3D scenes, effectively evaluating the critical trade-offs between storage size, training time, and rendering fidelity.
Target Audience
* Computer Graphics Researchers & Practitioners: Interested in real-time rendering and neural field representations.
* Machine Learning Engineers: Working on model compression, quantization, and efficient data structures for 3D vision.
* Data Visualization Specialists: Interested in high-dimensional data sorting and grid layout algorithms.
Prerequisite Knowledge of Audience
Mathematics: Linear algebra (matrix operations, eigenvectors), basic probability (Gaussian distributions), and optimization (gradient descent).
Computer Vision/Graphics: Familiarity with the rendering pipeline, structure-from-motion (SfM), point clouds, and basic Neural Radiance Fields (NeRF) concepts.
Programming: Proficiency in Python and familiarity with PyTorch or CUDA (for the practical implementation details of sorting and splatting).
Detailed Outline
Part 1: Foundations of Efficient 3DGS (approx. 60 mins)
● 3DGS Fundamentals: Overview of explicit scene modeling via 3D Gaussians, covariance matrices, and rasterization.
● The Storage Bottleneck: Analyzing the high memory costs of explicit representations.
● Compaction Strategies: Techniques for removing non-contributing splats and merging redundant geometry.
● Attribute Compression: Implementing vector quantization and entropy coding to minimize precision footprints (a minimal quantization sketch follows this outline part).
● Survey Insights: Reviewing trade-offs between rendering quality (PSNR) and file size across methods.
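To illustrate the attribute compression step mentioned above, the sketch below vector-quantizes per-Gaussian color/SH vectors with a small k-means codebook, so that each Gaussian stores only a codebook index. The sizes, the use of scikit-learn's MiniBatchKMeans, and the random data are illustrative assumptions; none of the surveyed methods is reproduced here, and entropy coding of the indices is only indicated in a comment.

```python
# Minimal sketch of attribute compression via vector quantization: replace each
# Gaussian's color/SH vector with an index into a small learned codebook.
# All sizes and the choice of MiniBatchKMeans are illustrative, not taken from
# any specific surveyed method.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
n_gaussians, sh_dim = 50_000, 48            # e.g. degree-3 SH color coefficients
sh = rng.normal(size=(n_gaussians, sh_dim)).astype(np.float32)

codebook_size = 256                          # one 8-bit index instead of 48 floats
kmeans = MiniBatchKMeans(n_clusters=codebook_size, batch_size=2048,
                         n_init=1, random_state=0).fit(sh)

codebook = kmeans.cluster_centers_.astype(np.float32)   # 256 x 48 floats
indices = kmeans.labels_.astype(np.uint8)                # one index per Gaussian

# Decoding is a table lookup; entropy coding of `indices` (e.g. with a
# general-purpose compressor) would reduce the rate further.
sh_reconstructed = codebook[indices]
orig_bytes = sh.nbytes
compressed_bytes = codebook.nbytes + indices.nbytes
print(f"{orig_bytes / compressed_bytes:.1f}x smaller (before entropy coding)")
```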
Part 2: Advanced Sorting and Grid Structures (approx. 60 mins)
● Grid-based Sorting: Introduction to arranging high-dimensional data onto structured 2D grids.
● Linear Assignment Sorting (LAS): Formulating data arrangement as a cost-minimization problem (a simplified sketch follows this outline part).
● Fast LAS (FLAS): Optimizing sorting speed for massive datasets using block-based solvers.
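The following sketch illustrates the cost-minimization view of LAS referenced above: the current grid of feature vectors is low-pass filtered to obtain per-cell targets, and a linear assignment then re-places every vector at the cell whose target it matches best, repeated coarse-to-fine. Grid size, filter widths, and the iteration schedule are illustrative assumptions, and this is not the authors' FLAS implementation (which avoids the full assignment via block-based solving).

```python
# Minimal sketch of Linear Assignment Sorting: repeatedly (a) smooth the
# current grid of feature vectors and (b) re-assign vectors to cells so that
# each cell best matches its smoothed target. Not the authors' FLAS code;
# grid size, sigmas, and random features are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
rows = cols = 16
dim = 32
feats = rng.normal(size=(rows * cols, dim)).astype(np.float32)

order = rng.permutation(rows * cols)             # initial random placement
for sigma in (3.0, 2.0, 1.0):                    # coarse-to-fine smoothing
    grid = feats[order].reshape(rows, cols, dim)
    target = gaussian_filter(grid, sigma=(sigma, sigma, 0))  # smooth over grid axes only
    target = target.reshape(rows * cols, dim)

    # Cost of putting feature i into cell j = distance to that cell's target
    cost = ((feats[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    feat_idx, cell_idx = linear_sum_assignment(cost)
    order = np.empty(rows * cols, dtype=int)
    order[cell_idx] = feat_idx                   # feature index stored per cell

print(order.reshape(rows, cols))
```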
Part 3: Self-Organizing Gaussians & Integration (approx. 60 mins)
● Self-Organizing Gaussians (SOG): Mapping 3D Gaussian attributes onto sorted 2D image grids.
● Unified Pipeline: Leveraging sorting to smooth data signals, enabling highly efficient standard image compression (a minimal packing sketch follows this outline part).
● Q&A and Interactive Demos.
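As a closing illustration of the unified pipeline, the sketch below packs one Gaussian attribute channel onto the sorted 2D grid, quantizes it, and writes it as a PNG, which is where a standard image codec can exploit the smoothness created by sorting. The 8-bit quantization, the placeholder positions, and the identity `order` are simplifying assumptions; real pipelines typically use higher bit depths and an actual sorted order such as one produced by the LAS sketch above.

```python
# Minimal sketch of the Self-Organizing Gaussians idea: after sorting, each
# Gaussian attribute becomes a smooth 2D image that a standard image codec
# compresses well. 8-bit PNG and random placeholder data are simplifications.
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
rows = cols = 256
positions = rng.normal(size=(rows * cols, 3)).astype(np.float32)  # placeholder xyz
order = np.arange(rows * cols)                                     # placeholder sort result

def save_channel(values, path):
    """Quantize one attribute channel, lay it out on the sorted grid, save as PNG."""
    v = values[order].reshape(rows, cols)
    lo, hi = v.min(), v.max()
    q = np.round((v - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)
    Image.fromarray(q).save(path)
    return lo, hi                                 # needed to de-quantize at load time

ranges = [save_channel(positions[:, c], f"xyz_{c}.png") for c in range(3)]
print(ranges)
```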