
 

30 hp – Leveraging Vision-Language Foundation Model for Reasoning and Prediction in Autonomous Driving 

 

Introduction 
The concept of autonomous driving has rapidly transitioned from a futuristic vision of robotics and artificial intelligence to a present-day reality. The department for Autonomous Driving Perception announces a thesis project suitable for a master's student.

 

Background  
The increasing complexity of real-world driving environments necessitates the integration of Vision-Language Models (VLMs) to enhance scene understanding and reasoning capabilities. VLMs can generate detailed descriptions of the environment, identifying critical elements like a parked police car that may signal potential risks. These models prioritize relevant objects, such as a distant cyclist approaching an intersection, and predict their future movements. By reasoning over temporal data, VLMs anticipate hazards, while techniques like space-aware pre-training sharpen spatial localization. Additionally, VLMs trained on diverse datasets can effectively generalize to handle rare, long-tail traffic events, adapting to unpredictable scenarios. 

 

Objective  
The aim of this thesis is to develop effective algorithms that leverage VLMs to improve the robustness of reasoning and prediction for long-tail scenarios in autonomous driving. Inspired by existing methods [1,2,3], the student will propose a new method or pipeline for motion prediction with VLMs. Open-source state-of-the-art methods and libraries may be used as a foundation and modified. The results will be compared against other methods.

 

References 

[1] Zhou, Yunsong, et al. "Embodied understanding of driving scenarios." arXiv preprint arXiv:2403.04593 (2024). 

[2] Tian, Ran, et al. "Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving." arXiv preprint arXiv:2407.00959 (2024). 

[3] Pan, Chenbin, et al. "VLP: Vision Language Planning for Autonomous Driving." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

 

Job description  

The assignment is divided into sub-tasks:     

  • Literature study on perception, motion prediction, and vision-language models (VLMs) 

  • Develop a new method for improving prediction performance using different VLMs 

  • Benchmark different concepts 

 

Education/program/focus 

Master (civilingenjör) in machine learning, robotics, computer science, engineering physics, electrical engineering, or applied mathematics, preferably with a specialization in deep learning. Knowledge of programming and training deep neural networks is a plus.  

  • Number of students: 1-2    

  • Start date: January 2025      

  • Estimated time needed: 20 weeks   

 

Contact persons and supervisors 
Carol Yi Yang, Industrial PhD in Perception, carol-yi.yang@scania.com    

Truls Nyberg, Industrial PhD in Situational Awareness, truls.nyberg@scania.com 

Magnus Granström, Manager, Autonomous Driving Perception, magnus.granstrom@scania.com 

 

Application 
Your application must include a CV, a personal letter, and a transcript of grades. 

A background check might be conducted for this position. We are conducting interviews continuously and may close the recruitment earlier than the date specified. 

 

 

 

Requisition ID:  10967
Number of Openings:  2
Part-time / Full-time:  Full-time
Regular / Temporary:  Temporary
Country / Region:  SE
Location(s): 

Södertälje, SE, 151 38

Required Travel:  0%
Workplace:  Hybrid