MASTER THESIS DEFENSE · APRIL 2026

Agentic AI Framework for Web Vulnerability Detection, Mitigation and Patching

Programme: MPCSN

Defense: April 2026

Part One

Background

I. Background
II. Problems
III. Supported Workflows
IV. Methodology
V. Evaluation
VI. Ethics

Part I · Background

01 / 15

The Landscape

Modern web applications have a massive attack surface; vulnerabilities are appearing faster than they can be patched.

The Pipeline Imbalance

Detection: Highly automated through SAST/DAST/IAST and signature-based tools.
Mitigation: Remains a severe, human-dependent bottleneck.

The Failure of Traditional APR

Deep-learning systems (e.g., VulRepair) require massive, labeled vulnerability datasets.
They struggle to generalize across diverse, real-world web frameworks (React, Django, Spring).

[Figure: Comparison of SAST, DAST, and IAST approaches]

Agentic AVR · Thesis Defense · § 1.1

02 / 15

The Promise

Large Language Models (LLMs) offer a "Reasoning-on-the-fly" capability without task-specific fine-tuning.

The Reliability Wall

Semantic Drift

LLMs often propose "hallucinated" fixes in the wrong files or break existing business logic.

False Safety

Models confidently claim a vulnerability is fixed, but the fix is only superficial (e.g., adding a comment or a useless check).

The Goal

We need a system that moves from "Generation" to "Execution-Verified Mitigation."

Part Two

Problems

Part II · Problems

03 / 15

Domain Misalignment & The "Silent Failure" Paradigm

APR vs. AVR

APR: Automated Program Repair.

AVR: Automated Vulnerability Repair.

General coding agents (e.g., SWE-agent) optimize for passing unit tests.

Automated Vulnerability Repair (AVR) must optimize for eliminating sinks.

The Inadequacy of Static Validation

Structural Matching: Penalizes innovative, valid patches.

LLM-as-Judge: Prone to bias and lacks "executable proof," leading to a false sense of security.

Web-Specific Complexity

Beyond "Crashes": The Silent Failure. Unlike traditional software that crashes when a bug is triggered, web apps often keep running normally even while being exploited. The vulnerability stays hidden, with no obvious error messages or "smoke."

Deep Logic Chains: It's Not Just One Click. Modern bugs are rarely found in a single input. They often require a specific sequence across pages or APIs (e.g., Register -> Change Settings -> Trigger Action) to actually show up.

04 / 15

Limitations of Single-Agent Systems

Cognitive Overload

Security repair is a multi-step process. A single prompt cannot effectively manage file discovery, root-cause analysis, and regression testing at the same time. A large context overwhelms the model, leading to lost details and reduced precision.

Hallucination Risk

Single agents suffer from "self-grading" bias, where the model hallucinates a successful outcome while reviewing its own patch, failing to objectively identify its own logic flaws.

Based on these insights, we propose a Multi-Agent Workflow that decomposes the complex repair task into specialized roles, ensuring higher precision and reliability.

[Figure: Single-agent ReAct loop compared with a four-agent sequential workflow]

Part Three

Supported Workflows

Part III · Supported Workflows

05 / 15

Four practical security review scenarios.

Issue Review

Analyze security-related issue reports and produce structured findings.

Pull Request Review

Review code changes in PRs and provide risk analysis with mitigation suggestions.

Repo-wide Full Scan

Run baseline security scanning across the whole repository.

Repo-wide Incremental Scan

Re-scan only the code files that changed between the last scan and the current codebase, for continuous security assurance.
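One way to select the files for an incremental scan is to compare content hashes recorded at the last scan with hashes of the current codebase. The sketch below is illustrative only: the slides do not say how changes are detected, and the helper names (`digest`, `changed_files`) and the file paths are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a stable fingerprint of a file's content."""
    return hashlib.sha256(content).hexdigest()


def changed_files(last_scan: dict, current: dict) -> list:
    """Return paths that are new or whose hash differs from the last scan."""
    return sorted(p for p, h in current.items() if last_scan.get(p) != h)


# Hypothetical manifests: path -> content hash.
last_scan = {
    "app/auth.py": digest(b"v1"),
    "app/cart.py": digest(b"v1"),
}
current = {
    "app/auth.py": digest(b"v1"),  # unchanged, skipped
    "app/cart.py": digest(b"v2"),  # modified, re-scanned
    "app/pay.py": digest(b"v1"),   # new, scanned for the first time
}

to_scan = changed_files(last_scan, current)
print(to_scan)  # ['app/cart.py', 'app/pay.py']
```

A real deployment might instead ask the version control system for the changed paths; the hash manifest is used here only to keep the example self-contained.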

[Figure: Repo-wide incremental scan workflow example]

Part Four

Methodology & Case Study

Part IV · Methodology & Case Study

06 / 15

Issue Review Example

Starlette CVE-2023-29159

Directory traversal in static file serving

Issue description: Directory traversal vulnerability in Starlette versions 0.13.5 and later and prior to 0.27.0 allows a remote unauthenticated attacker to view files in a web service which was built using Starlette.

Path Diagram

/app
├── static
│   └── logo.png
└── static_secrets
    └── config.json

Allowed Directory

/app/static

Attacker Request

/app/static_secrets/config.json

Why commonprefix Fails

commonprefix(["/app/static_secrets/config.json", "/app/static"]) = "/app/static"

Looks valid as text, but is outside the allowed directory in the real filesystem.

Correct Fix

commonpath(["/app/static_secrets/config.json", "/app/static"]) = "/app"

Compare path components, not string prefixes.
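The difference between the two functions can be reproduced in a few lines. This is a minimal sketch using the paths from the diagram above; it uses `posixpath` (the POSIX flavour of `os.path`) so the result does not depend on the operating system running it.

```python
import posixpath  # POSIX flavour of os.path, same behaviour on any OS

allowed = "/app/static"
requested = "/app/static_secrets/config.json"

# Character-by-character comparison: the sibling directory shares the text prefix.
prefix = posixpath.commonprefix([requested, allowed])

# Component-by-component comparison: only whole path segments count.
common = posixpath.commonpath([requested, allowed])

print(prefix)             # /app/static
print(common)             # /app
print(prefix == allowed)  # True  -> the string check wrongly accepts the request
print(common == allowed)  # False -> the component check rejects it
```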

07 / 15

Issue Review Example

Agent Analysis Procedure

Analyze, mitigate, and verify the Starlette path traversal fix

1. Analyze: Confirmed a real path traversal risk, traced attacker-controlled request paths into StaticFiles.lookup_path, and identified os.path.commonprefix() as the vulnerable security boundary.
2. Mitigate: Generated a minimal root-cause patch by replacing string-prefix validation with path-component validation: commonprefix -> commonpath.
3. Verify: Checked that the patch was applied at the exact vulnerable location and marked it effective, while flagging missing runtime regression coverage and Windows edge cases.

- if os.path.commonprefix([full_path, directory]) != directory:
+ if os.path.commonpath([full_path, directory]) != directory:
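The verifier flagged missing runtime regression coverage. A regression check for the patched comparison could look roughly like the sketch below. The helper `is_inside` is hypothetical and only mirrors the patched condition; it normalizes paths with `normpath` so the example never touches the filesystem, which is a simplification of what a real static file handler would do.

```python
import posixpath


def is_inside(directory: str, full_path: str) -> bool:
    """Hypothetical helper: True only if full_path lies within directory."""
    directory = posixpath.normpath(directory)
    full_path = posixpath.normpath(full_path)
    # Same comparison as the patched line, on whole path components.
    return posixpath.commonpath([full_path, directory]) == directory


# Illustrative regression cases: (requested path, expected verdict).
cases = [
    ("/app/static/logo.png", True),                           # legitimate asset
    ("/app/static_secrets/config.json", False),               # sibling with shared prefix
    ("/app/static/../static_secrets/config.json", False),     # dot-dot traversal
]

for path, expected in cases:
    assert is_inside("/app/static", path) is expected, path

print("all regression cases behave as expected")
```

Windows paths (drive letters, backslashes) are outside this sketch and would need their own cases.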

08 / 15

Mechanism

Evidence + Execution

Grounding agent reasoning in code facts and runnable checks.

Cross-Cutting Mechanism

Whether the workflow is Issue Review or Repo-wide Scan, agents combine repository evidence with lightweight execution to stay grounded.

Evidence-Grounded Reasoning

Inspect source files, diffs, tests, and configuration to gather supporting evidence.
Tie findings back to concrete code locations instead of free-form assumptions.

Execution-Guided Validation

Run lightweight command-line checks, targeted tests, or reproduction scripts when available.
Use execution results to confirm assumptions, expose false positives, and reduce self-grading bias.
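Execution-guided validation can be as simple as running a short check in a subprocess and trusting its exit code rather than the model's own claim. The following is a sketch under assumptions: `run_check` and the inline reproduction command are hypothetical, and a real setup would run inside the sandbox described later in the deck.

```python
import subprocess
import sys


def run_check(command: list, timeout_s: float = 30.0) -> dict:
    """Run one command and report pass/fail based on its exit code."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return {"passed": False, "reason": "timeout", "output": ""}
    return {
        "passed": proc.returncode == 0,
        "reason": f"exit code {proc.returncode}",
        "output": proc.stdout + proc.stderr,
    }


# Hypothetical reproduction script: fails if the sibling directory is accepted.
repro = [
    sys.executable,
    "-c",
    "import posixpath; "
    "assert posixpath.commonpath("
    "['/app/static_secrets/x', '/app/static']) != '/app/static'",
]

result = run_check(repro)
print(result["passed"], result["reason"])
```

The verdict comes from the process, so an agent that "believes" its patch works still has to get a zero exit code.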

09 / 15

Repo-wide Full Scan

Workflow

Monitor, analyze, mitigate, verify, and merge across the entire repository.

STAGE 01

Monitor

Scan the entire codebase.

Discovery: identify potential vulnerability candidates and security hotspots.

Triage: remove obvious false positives first and merge similar items.

STAGE 02

Analysis

The analyzer audits input signals together with repository evidence to decide whether a case is confirmed.

Analyzer Gate: If the analyzer does not confirm the case, the workflow stops before mitigation; if the case passes, the system assigns **CVSS v4** scores.

STAGE 03

Mitigation

Generates "Minimal Patches" that remove vulnerability sinks while preserving business logic.

STAGE 04

Verification

Audits the original input, analyzer conclusions, and patch content together, with a **skeptical view**.

Verifier Gate: If verification does not pass, the workflow is blocked and the case is not passed on to merge.

STAGE 05

Merge

Correlates related fixes and merges them into unified delivery bundles.
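The two gates in the stages above can be sketched as early returns in a per-case pipeline. This is not the thesis implementation: the `Case` fields, the `run_case` function, and the stub agents are hypothetical stand-ins for the analyzer, mitigation, and verifier agents.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Case:
    case_id: str
    confirmed: bool = False
    patch: Optional[str] = None
    verified: bool = False


def run_case(
    case: Case,
    analyze: Callable[[Case], bool],
    mitigate: Callable[[Case], str],
    verify: Callable[[Case], bool],
) -> str:
    """Run one case through the gated stages and report where it ended."""
    case.confirmed = analyze(case)
    if not case.confirmed:
        return "stopped_at_analyzer"  # Analyzer Gate: no mitigation attempted
    case.patch = mitigate(case)
    case.verified = verify(case)
    if not case.verified:
        return "blocked_at_verifier"  # Verifier Gate: never reaches merge
    return "ready_for_merge"


# Stub agents, for illustration only.
outcome_ok = run_case(
    Case("case-1"), lambda c: True, lambda c: "diff", lambda c: True
)
outcome_fp = run_case(
    Case("case-2"), lambda c: False, lambda c: "diff", lambda c: True
)
outcome_bad = run_case(
    Case("case-3"), lambda c: True, lambda c: "diff", lambda c: False
)
print(outcome_ok, outcome_fp, outcome_bad)
```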

10 / 15

Repo-wide Full Scan Example

Vulnerable-Book-Shop

Real-world Vulnerability Discovery & Mitigation in a Simulated E-commerce Environment

Testbed: Vulnerable-Book-Shop

Architecture: A complete full-stack application featuring complex business logic: User Auth, Search, Cart Management, and Payment flows.
Real-world Complexity: Built with actual database connections, JWT authentication, and mock payment gateways.

The Number Story (End-to-End)

79 files traversed
-> 68 scanned
-> (discovery) 26 candidates
-> (triage) 11 focused cases
-> (analyzer) 11 confirmed cases
-> (mitigate & verifier) 11 verified cases
-> (merge) 8 publishable deliveries (3 combined + 5 single)
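The reductions in the number story can be checked with a few lines of arithmetic. The counts below are taken from the slide; the last step assumes that each "single" delivery carries exactly one verified case, which the slide implies but does not state.

```python
candidates = 26
focused = 11
verified = 11
combined, single = 3, 5

deliveries = combined + single
triage_reduction = (candidates - focused) / candidates
merge_reduction = (verified - deliveries) / verified

# Assumption: every single delivery holds one case, so the rest sit in bundles.
cases_in_bundles = verified - single

print(deliveries)                  # 8
print(f"{triage_reduction:.0%}")   # 58%
print(f"{merge_reduction:.0%}")    # 27%
print(cases_in_bundles)            # 6
```

Under that assumption, the 3 combined deliveries cover 6 cases between them, which is consistent with the totals on the slide.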

11 / 15

Case Evidence · Summary Overview

One-page view of both workflow-level summary evidence and PR-generation breadth.

[Figure: Summary overview screenshot of security findings and mitigation workflow]

[Figure: Generated pull request list showing multiple security fix proposals]

12 / 15

Case Evidence · PR Detail & Case Report Detail

One-page view of patch-level detail and final case report evidence.

[Figure: Detailed pull request view with patch changes and discussion]

13 / 15

Reviewer Load Reduction

Two Merge Points, Two Different Goals

AFTER MONITOR · BEFORE ANALYZER

Merge, Before Analyzer

Reduce Analysis Load

Remove obvious false positives first.
Merge items only when they describe the same underlying explanation.
Goal: reduce the number of analysis targets without losing meaningful risk signals.

AFTER VERIFIER

Merge, After Verifier

Reduce Human Reviewer Overload

Group outputs only when delivery execution is strongly coupled.
Rule: merge only if fixes must be applied together to be complete, or to avoid execution conflicts.
Goal: keep human review volume manageable while preserving delivery correctness.
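The post-verifier merge rule can be read as grouping cases that are transitively coupled. A minimal sketch, assuming coupling is available as pairs of case IDs: the `bundle` function and the case IDs are hypothetical, and how the system actually decides that two fixes are coupled is not shown here.

```python
from collections import defaultdict


def bundle(case_ids: list, coupled_pairs: list) -> list:
    """Group cases into delivery bundles using a small union-find."""
    parent = {c: c for c in case_ids}

    def find(x: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in coupled_pairs:
        parent[find(a)] = find(b)  # fixes that must ship together share a group

    groups = defaultdict(list)
    for c in case_ids:
        groups[find(c)].append(c)
    return sorted(groups.values(), key=lambda g: g[0])


cases = ["c1", "c2", "c3", "c4", "c5"]
coupled = [("c1", "c2"), ("c4", "c5")]  # hypothetical coupling decisions

bundles = bundle(cases, coupled)
print(bundles)  # [['c1', 'c2'], ['c3'], ['c4', 'c5']]
```

Uncoupled cases stay as single deliveries, so merging never hides an independent fix inside a larger bundle.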

CVSS v4

CVSS v4 Scoring

Prioritize Reviewer Attention

Confirmed vulnerabilities receive standardized CVSS v4 scoring.
This gives reviewers a clearer severity and priority signal for deciding what to review first.
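As an illustration of how scores turn into a review order, the sketch below maps a CVSS score to the standard qualitative severity bands (None, Low, Medium, High, Critical) and sorts findings by score. The finding names and scores are invented for the example and are not results from the thesis.

```python
def severity(score: float) -> str:
    """Map a CVSS score (0.0 to 10.0) to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


# Hypothetical findings: (identifier, CVSS v4 score).
findings = [
    ("verbose-error-page", 5.3),
    ("sql-injection-search", 9.3),
    ("missing-rate-limit", 6.9),
]

# Highest score first, so reviewers see the most severe case at the top.
queue = sorted(findings, key=lambda f: f[1], reverse=True)
for name, score in queue:
    print(f"{score:>4} {severity(score):<8} {name}")
```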

Part Five

Evaluation

Part V · Evaluation

14 / 15

Evaluation

Evaluate Issue Scenario with PatchEval Dataset

The Benchmarking Dataset: PatchEval

PatchEval transforms selected CVEs into 100+ reproducible cases. Each case contains a Docker image, a CVE description, and test scripts covering both attack PoCs and unit tests.

Original Codebase: The baseline environment containing the actual vulnerability.
Vulnerability Description: Acts as the analysis heuristic and task directive for the Agent.
Test Scripts: Pre-defined automation logic serving as the Gold Standard for success verification.
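One plausible reading of the success criterion is that a patch counts only if the attack PoC no longer succeeds and the unit tests still pass. The sketch below encodes that reading; the `CaseResult` type and `is_fixed` function are hypothetical, and the actual PatchEval harness may define success differently.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    poc_exploit_succeeded: bool  # did the attack PoC still work after the patch?
    unit_tests_passed: bool      # does the existing functionality still work?


def is_fixed(result: CaseResult) -> bool:
    """Assumed criterion: exploit blocked AND functionality preserved."""
    return (not result.poc_exploit_succeeded) and result.unit_tests_passed


secure_and_working = CaseResult(poc_exploit_succeeded=False, unit_tests_passed=True)
superficial_fix = CaseResult(poc_exploit_succeeded=True, unit_tests_passed=True)
breaks_logic = CaseResult(poc_exploit_succeeded=False, unit_tests_passed=False)

print(is_fixed(secure_and_working))  # True
print(is_fixed(superficial_fix))     # False
print(is_fixed(breaks_logic))        # False
```

The second and third cases correspond to the "False Safety" and "Semantic Drift" failure modes named earlier in the deck.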

Part Six

Ethics

Part VI · Ethics

15 / 15

Ethics & Governance

The Dual-Use Reality

Short-term Risk: Like any automated security tool, there is potential for misuse. Attackers could use the system to identify vulnerabilities.
Long-term Benefit: The long-term benefits of improving web security at scale outweigh these risks. It helps developers fix vulnerabilities faster and improve overall security levels.
Limited Access: Rather than an immediate public release, access is restricted to audited companies and trusted institutions, then expanded gradually as misuse is monitored.

Ethical Positioning of This Project

Blue-Team Focus: Purely optimized for Detection & Mitigation.
No Exploit Capabilities: Intentionally excludes "Exploit Agents" to prevent malicious use.

Technical Safeguards

Sandboxed Execution: All agent actions are strictly isolated; zero real-world impact.

Master Thesis Defense

fin.

In Closing

Thank you.
