Abstract
Accurately and safely predicting the trajectories of surrounding vehicles is essential for fully realizing autonomous driving (AD). This paper presents the Human-Like Trajectory Prediction model (HLTP++), which emulates human cognitive processes to improve trajectory prediction in AD. HLTP++ incorporates a novel teacher-student knowledge distillation framework. The ``teacher'' model, equipped with an adaptive visual sector, mimics the dynamic allocation of attention that human drivers exhibit based on factors such as spatial orientation, proximity, and driving speed. The ``student'' model, in turn, focuses on real-time interaction and human decision-making, drawing parallels to the human memory storage mechanism. Furthermore, we improve the model's efficiency by introducing a new Fourier Adaptive Spike Neural Network (FA-SNN), allowing for faster and more precise predictions with fewer parameters. Evaluated on the NGSIM, HighD, and MoCAD benchmarks, HLTP++ outperforms existing models, reducing the predicted trajectory error by over 11% on the NGSIM dataset and by 25% on the HighD dataset. Moreover, HLTP++ demonstrates strong adaptability in challenging environments with incomplete input data. This marks a significant stride towards fully autonomous driving systems.
Model Structure
Overview of the HLTP++ Model:
1. Teacher-Student Knowledge Distillation Framework:
- The Teacher Model: This model is designed to learn rich and complex spatio-temporal patterns from driving scenarios. It uses a deep neural network to process the intricate relationships between elements of the driving environment, such as other vehicles, pedestrians, and road signs. Through extensive training on large datasets, the teacher model learns to predict future trajectories with high accuracy.
- The Student Model: The teacher model is computationally expensive, which makes it less suitable for real-time applications. The student model is a lightweight version that learns from the teacher model through knowledge distillation: it is trained to replicate the teacher's predictions while using far fewer computational resources. The student model is thus optimized for real-time trajectory prediction on embedded systems.
2. Fourier Adaptive Spike Neural Network (FA-SNN): FA-SNN is inspired by how the human brain processes information. It incorporates spike-based neural processing, where information is encoded in spikes or pulses, similar to how neurons communicate in the brain. The model integrates Fourier transforms to efficiently capture frequency-domain information, which helps in understanding periodic patterns in the data, such as the regular motion of vehicles or pedestrians.
3. Adaptive Visual Pooling Mechanism: This mechanism mimics the human driver's ability to focus on the most relevant parts of the visual field while driving. The model dynamically adjusts its attention based on the current driving context. For example, if the vehicle is moving at high speed, the model may focus more on distant objects, while at lower speeds, it might pay more attention to nearby pedestrians or cyclists.
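The speed-dependent attention described in item 3 can be illustrated with a small sketch. This is not the HLTP++ implementation: the function name `sector_weights`, the linear narrowing of the sector with speed, the linear distance decay, and every default parameter value are assumptions chosen only to make the idea concrete.

```python
import math


def sector_weights(ego_speed, neighbors, base_half_angle_deg=60.0,
                   min_half_angle_deg=20.0, base_range_m=30.0,
                   range_per_mps=2.0, ref_speed=30.0):
    """Return one attention weight in [0, 1] per neighbor.

    neighbors: list of (dx, dy) offsets in metres in the ego frame,
    with +x pointing along the ego heading. All defaults are hypothetical.
    """
    # Assumption: a faster ego vehicle uses a narrower sector that reaches further.
    ratio = min(max(ego_speed / ref_speed, 0.0), 1.0)
    half_angle = math.radians(
        base_half_angle_deg - ratio * (base_half_angle_deg - min_half_angle_deg)
    )
    max_range = base_range_m + range_per_mps * ego_speed

    weights = []
    for dx, dy in neighbors:
        dist = math.hypot(dx, dy)
        bearing = abs(math.atan2(dy, dx))  # 0 means straight ahead
        if dist > max_range or bearing > half_angle:
            weights.append(0.0)  # outside the visual sector
        else:
            # Inside the sector: closer vehicles get larger weights.
            weights.append(1.0 - dist / max_range)
    return weights


if __name__ == "__main__":
    scene = [(10.0, 5.0), (60.0, 0.0)]  # one nearby, one distant vehicle
    print("slow:", sector_weights(5.0, scene))
    print("fast:", sector_weights(30.0, scene))
```

With these toy values, the nearby off-axis vehicle is attended to only at low speed, and the distant vehicle straight ahead only at high speed. A learned model would replace the hand-written rules with trainable components.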
Experimental Results
Our comprehensive evaluation demonstrates HLTP++'s superior performance compared to state-of-the-art baselines. Notably, it achieves gains of 11.2% for long-term (5s) predictions and 11.4% for average predictions on the NGSIM dataset. Similarly strong performance is evident on the HighD and MoCAD datasets. It is noteworthy that our model HLTP++(h), despite using only 1.5 seconds of input data (half the input of the other baselines), achieves comparable prediction accuracy. This highlights the adaptability and robustness of HLTP++.
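As a rough illustration of how per-horizon errors and percentage gains of this kind are typically computed, the sketch below evaluates the root mean squared error (RMSE) at fixed horizons and the relative reduction against a baseline. The choice of RMSE, the 5 Hz sampling rate, and the function names `rmse_at_horizons` and `relative_gain` are assumptions for illustration, and the data in the usage lines are toy values, not results from the paper.

```python
import math


def rmse_at_horizons(pred, truth, hz=5, horizons_s=(1, 2, 3, 4, 5)):
    """RMSE over all trajectories at each prediction horizon.

    pred, truth: lists of trajectories; each trajectory is a list of
    (x, y) points sampled at `hz` samples per second (assumed 5 Hz).
    """
    result = {}
    for h in horizons_s:
        idx = h * hz - 1  # index of the sample at h seconds
        sq_errors = [
            (p[idx][0] - t[idx][0]) ** 2 + (p[idx][1] - t[idx][1]) ** 2
            for p, t in zip(pred, truth)
        ]
        result[h] = math.sqrt(sum(sq_errors) / len(sq_errors))
    return result


def relative_gain(baseline_err, model_err):
    """Percentage error reduction relative to a baseline."""
    return 100.0 * (baseline_err - model_err) / baseline_err


if __name__ == "__main__":
    # Toy data: the prediction is offset laterally by 1 m at every step.
    truth = [[(float(i), 0.0) for i in range(25)]]
    pred = [[(float(i), 1.0) for i in range(25)]]
    print(rmse_at_horizons(pred, truth))
    print(relative_gain(2.0, 1.5))
```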
Our benchmarking against SOTA baselines shows that the HLTP++ models outperform them on all metrics while maintaining a minimal parameter count. Specifically, HLTP++ reduces the parameter count by 56.91% and 33.51% compared to WSiP and CS-LSTM, respectively. The ``teacher'' model of HLTP++, HLTP++(TM), achieves the second-best score on the three datasets, but it has more parameters and slower inference than HLTP++(SM). In contrast, HLTP++ has the lowest inference time while achieving the best trajectory prediction accuracy. Using the Knowledge Distillation Module (KDM), HLTP++ retains the lightweight advantages of HLTP++(SM) while enhancing its predictive capabilities with knowledge assimilated from the teacher model, thereby surpassing the performance of the teacher model itself. This highlights the efficiency and adaptability of our lightweight ``teacher-student'' knowledge distillation framework, which offers a balance between accuracy and computational resources.
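The distillation idea can be sketched as a loss that mixes a ground-truth term with a teacher-imitation term. This is a generic formulation, not the KDM objective from the paper: the function name `distillation_loss`, the use of mean squared error, and the mixing weight `alpha` are assumptions.

```python
def distillation_loss(student_pred, teacher_pred, truth, alpha=0.5):
    """Weighted sum of ground-truth error and teacher-imitation error.

    Each argument is a list of (x, y) points of equal length.
    alpha = 0 ignores the teacher; alpha = 1 ignores the ground truth.
    """
    def mse(a, b):
        # Mean squared Euclidean distance between matching points.
        return sum(
            (ax - bx) ** 2 + (ay - by) ** 2
            for (ax, ay), (bx, by) in zip(a, b)
        ) / len(a)

    task_term = mse(student_pred, truth)            # fit the real trajectory
    imitation_term = mse(student_pred, teacher_pred)  # follow the teacher
    return (1.0 - alpha) * task_term + alpha * imitation_term


if __name__ == "__main__":
    truth = [(0.0, 0.0), (1.0, 0.0)]
    teacher = [(0.0, 0.0), (1.0, 1.0)]  # teacher is off by 1 m at the last step
    student = [(0.0, 0.0), (1.0, 0.0)]  # student matches the ground truth
    print(distillation_loss(student, teacher, truth, alpha=0.5))
```

Because a loss of this shape keeps a ground-truth term, the student is not forced to copy the teacher's mistakes, which is one plausible reason a distilled student can end up more accurate than its teacher.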
The figure showcases the multimodal probabilistic prediction performance of HLTP++ on the NGSIM dataset. The heat maps represent the Gaussian Mixture Model of predictions in challenging scenes. These visualizations show that the highest-probability predictions of our model lie very close to the ground truth. The figure also demonstrates our model's ability to accurately predict complex scenarios such as merging and lane changing, supporting its effectiveness and safety in various traffic situations. Interestingly, in certain complex scenarios, the trajectory predictions of the ``student'' model are more accurate than those of the ``teacher'' model. This result illustrates the ability of the ``student'' model to selectively assimilate and refine the knowledge acquired from the ``teacher'' model, effectively ``extracting the essence and discarding the dross''. Moreover, we observed that vehicles closer to the target vehicle received higher attention values. This observation aligns with the behavior of human drivers, who primarily focus on the vehicle ahead because it has the most significant influence on the driving trajectory. It also substantiates the utility of vision pooling in reducing the perturbations caused by neighboring vehicles, thereby prioritizing the leading vehicle.
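A heat map of this kind can be produced by evaluating a mixture density over a grid of positions. The sketch below assumes each mode is an axis-aligned bivariate Gaussian with a mixture weight. The tuple layout and the function names `gmm_density` and `most_likely_mode` are hypothetical, and the parameterization used in HLTP++ may differ.

```python
import math


def gmm_density(x, y, modes):
    """Density of a 2-D Gaussian mixture at position (x, y).

    modes: list of (weight, mean_x, mean_y, sigma_x, sigma_y) tuples,
    assuming axis-aligned (uncorrelated) Gaussians whose weights sum to 1.
    """
    total = 0.0
    for w, mx, my, sx, sy in modes:
        z = ((x - mx) / sx) ** 2 + ((y - my) / sy) ** 2
        total += w * math.exp(-0.5 * z) / (2.0 * math.pi * sx * sy)
    return total


def most_likely_mode(modes):
    """Return the mode with the largest mixture weight."""
    return max(modes, key=lambda m: m[0])


if __name__ == "__main__":
    # Toy example: keep lane (weight 0.7) versus change lane (weight 0.3).
    modes = [(0.7, 0.0, 0.0, 1.0, 1.0), (0.3, 3.5, 0.0, 1.0, 1.0)]
    grid = [[gmm_density(x * 0.5, y * 0.5, modes) for x in range(-4, 12)]
            for y in range(-4, 5)]
    print(len(grid), len(grid[0]))
    print(most_likely_mode(modes))
```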