Robots learn to interact with the world through visual input. But how exactly does the choice of visual representation, the way a robot "sees" its environment, impact its learning ability? This project aims to systematically evaluate several state-of-the-art pretrained encoders (e.g., ResNet-50, DINOv2, R3M, VIP, CLIP) by analyzing their visual attention patterns and correlating these with RL learning speed and success rate in a robotic manipulation task. You will embed images from a simulated WidowX pick-and-place task, generate attention heat maps, and measure how well attention overlaps with relevant objects. Then, you will plug these frozen encoders into an off-policy RL algorithm and empirically evaluate their downstream performance.
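As a rough illustration of what "measuring how well attention overlaps with relevant objects" could look like, here is a minimal NumPy sketch. It assumes the attention heat map and a binary object mask (for example, rendered from the simulator's segmentation) are already available as 2D arrays of equal shape. The function name `attention_overlap`, the `top_fraction` parameter, and the two scores chosen (attention mass inside the mask, and IoU after thresholding) are illustrative assumptions, not the project's prescribed metric.

```python
import numpy as np


def attention_overlap(attn, mask, top_fraction=0.2):
    """Return (mass_in_mask, iou) for a 2D attention map and a boolean object mask.

    Hypothetical helper for illustration only; the metric choice is an assumption.
    """
    attn = np.asarray(attn, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if attn.shape != mask.shape:
        raise ValueError("attention map and mask must have the same shape")

    # Negative attributions are clipped so that the map behaves like a density.
    attn = np.clip(attn, 0.0, None)

    # Score 1: fraction of the total attention that falls on the object.
    total = attn.sum()
    mass = float(attn[mask].sum() / total) if total > 0 else 0.0

    # Score 2: binarize by keeping roughly the top_fraction most-attended
    # pixels (ties at the threshold are kept), then compute IoU with the mask.
    thresh = np.quantile(attn, 1.0 - top_fraction)
    hot = attn >= thresh
    union = np.logical_or(hot, mask).sum()
    iou = float(np.logical_and(hot, mask).sum() / union) if union > 0 else 0.0
    return mass, iou
```

Averaging such scores over a set of frames would give one number per encoder, which could then be compared against that encoder's RL learning speed and success rate.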
This project is a valuable opportunity for students interested in computer vision and robot learning, offering hands-on experience with modern deep learning frameworks, reinforcement learning, and simulation environments. You will directly contribute to understanding how visual representation quality impacts robot learning performance. Where applicable, your results may be published as a short benchmark note or as an appendix to an existing paper.
For more details or to apply, feel free to contact me directly via email or in person.
You should have strong familiarity with PyTorch and a Python-based simulation framework. Experience with OpenCV and basic reinforcement learning concepts (e.g., SAC or similar) will be beneficial. The project provides you with a prebuilt 3D-printed WidowX arm, ready-to-use MuJoCo simulation environments, baseline RL implementations, and all necessary computational resources (though bringing your own GPU is a plus).
Frequency | Day | Time | Format / Location | Period |
---|---|---|---|---|
by arrangement | by arrangement | | | 13.10.2025-06.02.2026 |
Module | Course | Requirements |
---|---|---|
39-M-Inf-P Projekt | Project | ungraded examination |
The binding module descriptions contain further information, including details on the "requirements" and what they entail. If several "forms of assessment" are possible, the respective instructors decide which one applies.
Analyze and compare how different pretrained visual encoders affect robotic reinforcement learning performance, efficiency, and interpretability for a simulated manipulation task.