Web Agent Version-Robust Benchmark

Duration: Jan 2026 - Present
Context: EECS 545 (Machine Learning) Course Project
Advisor: Prof. Honglak Lee

Project Overview

Modern web agents often break when websites update their layouts or DOM structures. This project benchmarks how well LLM-based web agents hold up across historical versions of the same website.

Key Contributions

Built a reproducible Docker-based testing pipeline that deploys baseline agents such as Qwen3-VL-30B across historical snapshots of open-source websites (such as SimpleWiki).
Formulated a taxonomy for web version variations and designed controlled experiments to isolate version-induced failure modes in LLM-based web agents.
Currently implementing a knowledge-graph-based method that automatically generates benchmark tasks for each website version.
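
As a rough illustration of what "isolating version-induced failure modes" can mean in practice, the sketch below counts a task as a version-induced failure only if the agent solves it on a baseline snapshot and then fails it on another snapshot. This is a minimal sketch under stated assumptions, not the project's actual code: the `Result` record, the function names, and the snapshot labels are all hypothetical, and it assumes every task is run on every snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One agent run. All field names here are hypothetical."""
    task_id: str
    version: str   # snapshot label, e.g. "v1"
    success: bool


def baseline_passes(results, baseline):
    """Task ids the agent solved on the baseline snapshot."""
    return {r.task_id for r in results if r.version == baseline and r.success}


def version_induced_failures(results, baseline):
    """Map each non-baseline version to the tasks that pass on the
    baseline but fail on that version."""
    passed = baseline_passes(results, baseline)
    failures = {}
    for r in results:
        if r.version == baseline:
            continue
        # Register the version even if it ends up with no failures.
        failures.setdefault(r.version, [])
        if not r.success and r.task_id in passed:
            failures[r.version].append(r.task_id)
    return {version: sorted(tasks) for version, tasks in failures.items()}


def robustness(results, baseline):
    """Fraction of baseline-solved tasks still solved on each other version.
    Assumes every baseline-solved task was also run on every other version."""
    passed = baseline_passes(results, baseline)
    if not passed:
        return {}
    return {
        version: 1 - len(failed) / len(passed)
        for version, failed in version_induced_failures(results, baseline).items()
    }


# Toy data: t3 already fails on the baseline, so it is not counted
# as a version-induced failure on v2.
results = [
    Result("t1", "v1", True),
    Result("t2", "v1", True),
    Result("t3", "v1", False),
    Result("t1", "v2", True),
    Result("t2", "v2", False),
    Result("t3", "v2", False),
]
print(version_induced_failures(results, baseline="v1"))
print(robustness(results, baseline="v1"))
```

Conditioning on baseline success keeps tasks the agent could never solve from being blamed on the website update.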

Chenming Ge