LLM-as-Judge: Prone to bias and lacks "executable proof," leading to a false sense of security.
Web-Specific Complexity
Beyond "Crashes": The Silent Failure. Unlike older software that crashes when a bug occurs, web apps often keep running normally even while being hacked. The vulnerability is hidden, and there are no obvious error messages or "smoke."
Deep Logic Chains: It's Not Just One Click. Modern bugs are rarely found in a single input. They often require a specific sequence across pages or APIs (e.g., Register -> Change Settings -> Trigger Action) to actually show up.
Agentic AVR · Thesis Defense§ 2.1
Part II · Problems
04 / 15
Limitations of Single-Agent Systems
The Limitation of Single-Agent Systems
Cognitive Overload
Security repair is a multi-step process. A single prompt cannot effectively manage file discovery, root-cause analysis, and regression testing at the same time. Large context will overwhelm the model, leading to lost details and reduced precision.
Hallucination Risk
Single agents suffer from "self-grading" bias, where the model hallucinates a successful outcome while reviewing its own patch, failing to objectively identify its own logic flaws.
Based on these insights, we propose a Multi-Agent Workflow that decomposes the complex repair task into specialized roles, ensuring higher precision and reliability.
Agentic AVR · Thesis Defense§ 2.2
Part Three
III
Supported Workflows
IBackground
IIProblems
IIISupported Workflows
IVMethodology
VEvaluation
VIEthics
Part III · Supported Workflows
05 / 15
Four practical security review scenarios.
I
Issue Review
Analyze security-related issue reports and produce structured findings.
II
Pull Request Review
Review code changes in PRs and provide risk analysis with mitigation suggestions.
III
Repo-wide Full Scan
Run baseline security scanning across the whole repository.
IV
Repo-wide Incremental Scan
Re-scan only the changed code file between last scan and current codebase for continuous security assurance.
Agentic AVR · Thesis Defense§ 3.1
Part Four
IV
Methodology & Case Study
IBackground
IIProblems
IIISupported Workflows
IVMethodology & Case Study
VEvaluation
VIEthics
Part IV · Methodology & Case Study
06 / 15
Issue Review Example
Starlette CVE-2023-29159
Directory traversal in static file serving
Issue description: Directory traversal vulnerability in Starlette versions 0.13.5 and later and prior to 0.27.0 allows a remote unauthenticated attacker to view files in a web service which was built using Starlette.
Analyze, mitigate, and verify the Starlette path traversal fix
Agent Analysis Procedure
1
Analyze
Confirmed a real path traversal risk, traced attacker-controlled request paths into StaticFiles.lookup_path, and identified os.path.commonprefix() as the vulnerable security boundary.
2
Mitigate
Generated a minimal root-cause patch by replacing string-prefix validation with path-component validation: commonprefix -> commonpath.
3
Verify
Checked that the patch was applied at the exact vulnerable location and marked it effective, while flagging missing runtime regression coverage and Windows edge cases.
- if os.path.commonprefix([full_path, directory]) != directory:
+ if os.path.commonpath([full_path, directory]) != directory:
Agentic AVR · Thesis Defense§ 4.2
Part IV · Methodology & Case Study
08 / 15
Mechanism
Evidence + Execution
Grounding agent reasoning in code facts and runnable checks.
Cross-Cutting Mechanism
Whether the workflow is Issue Review or Repo-wide Scan, agents combine repository evidence with lightweight execution to stay grounded.
Evidence-Grounded Reasoning
Inspect source files, diffs, tests, and configuration to gather supporting evidence.
Tie findings back to concrete code locations instead of free-form assumptions.
Execution-Guided Validation
Run lightweight command-line checks, targeted tests, or reproduction scripts when available.
Use execution results to confirm assumptions, expose false positives, and reduce self-grading bias.
Agentic AVR · Thesis Defense§ 4.3
Part IV · Methodology & Case Study
09 / 15
Repo-wide Full Scan
Workflow
Monitor, analyze, mitigate, verify, and merge across the entire repository.
STAGE 01
Monitor
Scan the entire codebase.
Discovery: identify potential vulnerability candidates and security hotspots.
Triage: remove obvious false positives first and merge similar items.
STAGE 02
Analysis
The analyzer audits input signals together with repository evidence to decide whether a case is confirmed.
Analyzer Gate: If the case does not pass by the analyzer, workflow stops before mitigation; if it passes, the system will then assigns **CVSS v4** scores.
STAGE 03
Mitigation
Generates "Minimal Patches" that remove vulnerability sinks while preserving business logic.
STAGE 04
Verification
Audits original input, analyzer conclusions, and patch content together with **skeptical view**.
Verifier Gate: If verification does not pass, the workflow is blocked and won't be passed to merge.
STAGE 05
Merge
Correlates related fixes and merge them into unified delivery bundles.
Agentic AVR · Thesis Defense§ 4.4
Part IV · Methodology & Case Study
10 / 15
Repo-wide Full Scan Example
Vulnerable-Book-Shop
Real-world Vulnerability Discovery & Mitigation in a Simulated E-commerce Environment
Testbed: Vulnerable-Book-Shop
Architecture: A complete full-stack application featuring complex business logic: User Auth, Search, Cart Management, and Payment flows.
Real-world Complexity: Built with actual database connections, JWT authentication, and mock payment gateways.
This gives reviewers a clearer severity and priority signal to make decisions.
Agentic AVR · Thesis Defense§ 4.8
Part Five
V
Evaluation
IBackground
IIProblems
IIISupported Workflows
IVMethodology
VEvaluation
VIEthics
Part V · Evaluation
14 / 15
Evaluation
Evaluate Issue Scenario with PatchEval Dataset
The Benchmarking Dataset: PatchEval
PatchEval transforms selected CVEs into 100+ reproducible cases. Each case contains a Docker image, a CVE description, and test scripts covering both attack PoCs and unit tests.
Original Codebase
The baseline environment containing the actual vulnerability.
Vulnerability Description
Acts as the analysis heuristic and task directive for the Agent.
Test Scripts
Pre-defined automation logic serving as the Gold Standard for success verification.
Agentic AVR · Thesis Defense§ 5.1
Part Six
VI
Ethics
IBackground
IIProblems
IIISupported Workflows
IVMethodology
VEvaluation
VIEthics
Part VI · Ethics
15 / 15
Ethics & Governance
Ethics & Governance
The Dual-Use Reality
Short-term Risk: Like any automated security tool, there is potential for misuse. Attackers could use the system to identify vulnerabilities.
Long-term Benefit: The long-term benefits of improving web security at scale outweigh these risks. It helps developers fix vulnerabilities faster and improve overall security levels.
Limited Access: Rather than an immediate public release, access is restricted to audited companies and trusted institutions, then expanded gradually as misuse is monitored.
Ethical Positioning of This Project
Blue-Team Focus: Purely optimized for Detection & Mitigation.
No Exploit Capabilities: Intentionally excludes "Exploit Agents" to prevent malicious use.
Technical Safeguards
Sandboxed Execution: All agent actions are strictly isolated; zero real-world impact.