Jia Li

Ph.D. Student · Computer Science

Jia Li Linsay Li

The Chinese University of Hong Kong
Advised by Prof. Michael R. Lyu

I work on agentic software engineering — building and rigorously evaluating code agents that reason over entire repositories, diagnose failures, repair programs, and learn from real execution feedback.

My research connects code large language models with classical program analysis and reinforcement learning, spanning benchmarks, agentic systems, and training methods for real, repository-level, industrial-scale software. I worked as a research intern at WeChat, where several of these systems were deployed in production.

LLM agents look capable on toy benchmarks, yet fail in real software systems at three successive stages. My work follows this causal chain and resolves each stage in turn.

1

Real Code

Can the agent see?

Real, repository-level program information rarely enters the model's context. I inject whitebox program slices so the agent sees cross-function causal chains.

MioHint

2

Real Reasoning

Can the agent reason?

Even with full context, agents fail to aggregate cross-file evidence — an aggregation deficit. I diagnose it and converge scattered runtime evidence in production.

RepoReasonHolmes

3

Real Reward

Can the agent fix & learn?

Patches may compile yet be wrong, and weak execution feedback is hard to learn from. I reshape it into comparable, creditable, and learnable training signal.

ComBenchVulKeyGRPO Reshaping

Every stage of the software lifecycle leaves observations and feedback. My long-term goal is a closed-loop, self-evolving software system — where agents turn these observations into learnable signals and keep improving, cycle after cycle.

DevOps infinity loop: plan, code, build, test, release, deploy, operate, monitor
The DevOps lifecycle is one continuous feedback loop.
  1. Plan / Code Change context & program structure
  2. Build / Test Compilation errors & test feedback
  3. Operate / Monitor Crash scenes & runtime logs
  4. Feedback / Learn Repair outcomes & learnable reward

These observations become the three signals my work grounds agents in — Real Code (authentic artifacts), Real Reasoning (attributable evidence chains), and Real Reward (learnable execution feedback). I aim to generalize this signal reshaping to broad weak-feedback, long-horizon code tasks, so agents continuously self-evolve from real feedback.

First author on all listed works.

  1. GRPO signal reshaping architecture

    Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

    Preprint

    Reframes weak-feedback multi-turn code repair as a GRPO trainability problem, using layered rewards, step-level process weighting, and failure-cause-aware rollout governance to lift two-stage repair from 0.385 to 0.535.

  2. Holmes retrieve-explore-reason architecture

    Holmes: Multimodal Agentic Diagnosis for Mixed-Language Mobile Crashes at Industrial Scale

    FSE 2026, Industry Track

    A retrieve–explore–reason multi-agent system for post-mortem mobile crash diagnosis over WeChat's mixed-language codebase, reaching 87.6% function-level localization accuracy and running ~40k times in production.

  3. RepoReason benchmark architecture

    From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level

    ACL 2026 Main Oral

    Introduces RepoReason, a repository-level benchmark built on abductive assertion verification and execution-driven mutation, with ESV / MCL / DFI metrics that expose aggregation deficits in frontier code agents.

  4. VulKey pattern-guided repair architecture

    VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns

    FSE 2026

    Guides vulnerability repair with a three-level abstraction of CWE/NVD expert knowledge, improving PrimeVul and Vul4J accuracy across languages and models.

  5. MioHint LLM-assisted mutation architecture

    MioHint: LLM-Assisted Request Mutation for Whitebox REST API Testing

    ICSE 2026

    Combines statement-level def-use analysis with LLM-generated mutation hints to escape fitness plateaus in whitebox REST API testing.

  6. ComBench benchmark construction overview

    ComBench: A Repository-level Real-world Benchmark for Compilation Error Repair

    Preprint

    Mines and faithfully reproduces real C/C++ compilation failures from GitHub CI histories, exposing the compile-but-incorrect gap between compilable and semantically correct LLM patches.