Web Agent Version-Robust Benchmark

  • Duration: Jan 2026 - Present
  • Context: EECS 545 (Machine Learning) Course Project
  • Advisor: Prof. Honglak Lee

Project Overview

Modern web agents often break when websites update their layouts or DOM structures. This project benchmarks how well LLM-based web agents hold up across historical versions of the same website.

Key Contributions

  • Built a reproducible, Docker-based testing pipeline that deploys baseline agents such as Qwen3-VL-30B across historical snapshots of open-source websites (e.g., SimpleWiki).
  • Formulated a taxonomy for web version variations and designed controlled experiments to isolate version-induced failure modes in LLM-based web agents.
  • Implementing a knowledge graph-based method that dynamically generates benchmark tasks across different website versions.
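
One way to picture the "isolate version-induced failure modes" step is the minimal sketch below. It is illustrative only and not the project's actual code: the record layout `(task_id, site_version, success)`, the function names, and the sample data are all assumptions. The idea is to count a failure as version-induced only when the same task succeeds on a chosen baseline snapshot, so that tasks the agent cannot solve on any version are not attributed to the site update.

```python
from collections import defaultdict

# Hypothetical run records: (task_id, site_version, success).
# The layout and the names below are assumptions made for illustration.


def version_induced_failures(runs, baseline):
    """Per version, return the tasks that pass on `baseline` but fail there."""
    passed_on_baseline = {task for task, version, ok in runs
                          if version == baseline and ok}
    failures = defaultdict(set)
    for task, version, ok in runs:
        if version != baseline and not ok and task in passed_on_baseline:
            failures[version].add(task)
    return dict(failures)


def success_rate_by_version(runs):
    """Fraction of successful runs for each website version."""
    totals, wins = defaultdict(int), defaultdict(int)
    for _, version, ok in runs:
        totals[version] += 1
        wins[version] += int(ok)
    return {version: wins[version] / totals[version] for version in totals}


# Made-up sample data, not experimental results.
runs = [
    ("search_article", "v1", True),
    ("search_article", "v2", False),
    ("edit_page", "v1", True),
    ("edit_page", "v2", True),
    ("login", "v1", False),
    ("login", "v2", False),
]

print(version_induced_failures(runs, baseline="v1"))
print(success_rate_by_version(runs))
```

In this toy data, `login` fails on both versions, so it is excluded from the version-induced set; only `search_article` is attributed to the change from `v1` to `v2`.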