Current AI systems can generate images and videos from text prompts. However, generating visual content directly from speech remains a challenging problem, as speech contains not only linguistic information but also tone, emotion, and prosody.
This project explores how semantic representations extracted from speech can drive visual generation using generative AI models.
The goal is to design and implement a prototype pipeline that maps speech features to semantic embeddings compatible with visual generative models. The speech-to-image or speech-to-video generation pipeline will be trained and evaluated using multimodal datasets.
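As a rough illustration of the mapping step described above, the following minimal sketch pools frame-level speech features into one utterance vector and projects it into the dimensionality a visual generator might expect as conditioning input. It is a toy under stated assumptions, not the project's method: the function names, the dimensions, and the random weights are all hypothetical, and in practice the frames would come from a pretrained speech encoder and the projection would be learned.

```python
import math
import random


def mean_pool(frames):
    """Average frame-level feature vectors into one utterance-level vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


def project(vec, weights):
    """Apply a linear map; `weights` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def speech_to_condition(frames, weights):
    """Hypothetical adapter: speech frames -> unit-length conditioning vector."""
    return l2_normalize(project(mean_pool(frames), weights))


# Toy data standing in for real encoder output and learned weights (assumptions).
rng = random.Random(0)
SPEECH_DIM, COND_DIM, N_FRAMES = 8, 4, 50
frames = [[rng.gauss(0, 1) for _ in range(SPEECH_DIM)] for _ in range(N_FRAMES)]
weights = [[rng.gauss(0, 1) for _ in range(SPEECH_DIM)] for _ in range(COND_DIM)]

condition = speech_to_condition(frames, weights)
print(len(condition))  # number of conditioning dimensions
```

Mean pooling discards timing, so prosodic cues would largely be lost in this form; a sequence-preserving adapter is one alternative the project could explore.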
Depending on the number of students and project scope, the project can also include:
• Evaluation of the alignment between generated visuals and spoken input
• Analysis of the influence of prosody and emotion
• Comparison of direct speech-based vs. speech-to-text-based pipelines
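One possible way to quantify the alignment between generated visuals and spoken input is a retrieval-style score in a shared embedding space. The sketch below assumes that some encoder (not specified here) has already embedded each utterance and each generated image into the same space. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_1(speech_embs, image_embs):
    """Fraction of utterances whose most similar image is the one generated
    from that utterance (pairs are matched by list index)."""
    hits = 0
    for i, speech in enumerate(speech_embs):
        scores = [cosine(speech, image) for image in image_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == i:
            hits += 1
    return hits / len(speech_embs)


# Toy embeddings (assumptions): image 1 also resembles utterance 2.
speech_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.1, 0.6, 0.6], [0.7, 0.1, 0.3]]

score = recall_at_1(speech_embs, image_embs)
print(f"recall@1 = {score:.2f}")
```

The same score could be computed once for a direct speech-based pipeline and once for a speech-to-text-based one to compare the two on identical inputs.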
Requirements:
• Good programming skills in Python
• Basic knowledge of machine learning/deep learning
• Interest in generative AI
• Preferably experience with, or a strong interest in, speech processing (e.g., speech recognition, speech-to-text)
After the project is completed, we will work together to publish the results at a well-established conference or in a journal in Human–Computer Interaction (HCI) or Computer Vision (CV).
| Frequency | Day | Time | Format / Location | Period |
|---|---|---|---|---|
| by arrangement | by arrangement | | By arrangement, online, CITEC or R.1 | 13.04.2026 to 24.07.2026 |
| Module | Course | Requirements |
|---|---|---|
| 39-M-Inf-AI-app-foc_a Applied Artificial Intelligence (focus) | Applied Artificial Intelligence (focus): Project | Study requirement |
| 39-M-Inf-INT-app-foc_a Applied Interaction Technology (focus) | Applied Interaction Technology (focus): Project | Study requirement |
The binding module descriptions contain further information, including details on the "requirements" and what they entail. Where several forms of assessment are possible, the respective teaching staff decide which one applies.