30 hp – Leveraging Vision-Language Foundation Model for Reasoning and Prediction in Autonomous Driving
Introduction
The concept of autonomous driving has rapidly transitioned from a futuristic vision of robotics and artificial intelligence to a present-day reality. The Autonomous Driving Perception department announces a thesis project suitable for a master's student.
Background
The increasing complexity of real-world driving environments necessitates the integration of Vision-Language Models (VLMs) to enhance scene understanding and reasoning capabilities. VLMs can generate detailed descriptions of the environment, identifying critical elements like a parked police car that may signal potential risks. These models prioritize relevant objects, such as a distant cyclist approaching an intersection, and predict their future movements. By reasoning over temporal data, VLMs anticipate hazards, while techniques like space-aware pre-training sharpen spatial localization. Additionally, VLMs trained on diverse datasets can effectively generalize to handle rare, long-tail traffic events, adapting to unpredictable scenarios.
Objective
The aim of this thesis is to develop effective algorithms that leverage VLMs to improve the robustness of reasoning and prediction for long-tail scenarios in autonomous driving. Inspired by existing methods [1,2,3], the student will propose a new method or pipeline for motion prediction with a VLM. Open-source state-of-the-art methods and libraries may be used as a foundation and extended. The results will be compared against other methods.
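As an illustration only (not part of the assignment or any of the cited methods), the sketch below shows one way VLM-derived scene context could modulate a simple motion predictor. The `vlm_scene_risk` function is a hypothetical stub standing in for a real VLM query, and the constant-velocity predictor is a deliberately minimal baseline:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Agent:
    history: List[Tuple[float, float]]  # past (x, y) positions, oldest first


def vlm_scene_risk(caption: str) -> float:
    """Stand-in for a VLM call: map a scene caption to a risk score in [0, 1].

    A real pipeline would query a vision-language model on camera frames;
    here we simply count risk-related keywords in a text caption.
    """
    risky_cues = ("police", "cyclist", "intersection", "pedestrian")
    hits = sum(cue in caption.lower() for cue in risky_cues)
    return min(1.0, hits / len(risky_cues) * 2)


def predict(agent: Agent, caption: str, horizon: int = 3) -> List[Tuple[float, float]]:
    """Constant-velocity rollout, damped by VLM-derived scene risk.

    Higher perceived risk -> assume the agent slows down. This is only a
    toy coupling; the thesis would explore learned, principled variants.
    """
    (x0, y0), (x1, y1) = agent.history[-2], agent.history[-1]
    vx, vy = x1 - x0, y1 - y0  # last observed velocity
    damping = 1.0 - 0.5 * vlm_scene_risk(caption)
    out, x, y = [], x1, y1
    for _ in range(horizon):
        x, y = x + vx * damping, y + vy * damping
        out.append((x, y))
    return out
```

For example, an agent moving one unit per step keeps full speed under an "empty road" caption, but the rollout is damped when the caption mentions a cyclist near an intersection.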
References
[1] Zhou, Yunsong, et al. "Embodied understanding of driving scenarios." arXiv preprint arXiv:2403.04593 (2024).
[2] Tian, Ran, et al. "Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving." arXiv preprint arXiv:2407.00959 (2024).
[3] Pan, Chenbin, et al. "VLP: Vision Language Planning for Autonomous Driving." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Job description
The assignment is divided into sub-tasks:
- Literature study on perception, motion prediction, and vision-language models (VLMs)
- Develop a new method for improving performance using different VLMs
- Benchmark different concepts
Education/program/focus
Master (civilingenjör) in machine learning, robotics, computer science, engineering physics, electrical engineering, or applied mathematics, preferably with a specialization in deep learning. Knowledge of programming and training deep neural networks is a plus.
- Number of students: 1-2
- Start date: January 2025
- Estimated time needed: 20 weeks
Contact persons and supervisors
Carol Yi Yang, Industrial PhD in Perception, carol-yi.yang@scania.com
Truls Nyberg, Industrial PhD in Situational Awareness, truls.nyberg@scania.com
Magnus Granström, Manager, Autonomous Driving Perception, magnus.granstrom@scania.com
Application:
Your application must include a CV, a personal letter, and a transcript of grades.
A background check might be conducted for this position. We are conducting interviews continuously and may close the recruitment earlier than the date specified.
Södertälje, SE, 151 38