Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors

Abstract

In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3, and Llama 3.1 in solving selected undergraduate-level transportation engineering problems. We introduce TransportBench, a benchmark dataset that includes a sample of transportation engineering problems on a wide range of subjects in the context of planning, design, management, and control of transportation systems. Human experts use this dataset to evaluate the capabilities of various commercial and open-source LLMs, especially their accuracy and consistency, in solving transportation engineering problems. Our comprehensive analysis also uncovers the unique strengths and limitations of each LLM. For example, it shows both the impressive accuracy and some unexpected reasoning breakdowns of Claude 3.5 Sonnet and Claude 3 Opus in solving TransportBench problems. Our study marks a thrilling first step toward harnessing artificial general intelligence in transportation engineering, setting the stage for a future of more effective, LLM-based solutions to complex transportation challenges.

Publication
Under review at Transportation Research Part C: Emerging Technologies