MASTER THESIS DEFENSE · APRIL 2026

Agentic AI Framework for Web Vulnerability Detection, Mitigation and Patching

Candidate: Your Name
Programme: MSc Computer Science
Defense: April 2026
I. Background
II. Problems
III. Supported Workflows
IV. Methodology
V. Evaluation
VI. Ethics

Part I · Background
The Landscape

Modern web applications have a massive attack surface; vulnerabilities are appearing faster than they can be patched.

The Pipeline Imbalance
Detection
Highly automated by SAST/DAST/IAST scanners and signature-based tools.
Mitigation
Remains a severe bottleneck that still depends on human effort.
The Failure of Traditional Automated Program Repair (APR)
  • Deep-learning systems (e.g., VulRepair) require large, labeled vulnerability datasets.
  • They struggle to generalize across diverse, real-world web frameworks (React, Django, Spring).
Agentic AVR · Thesis Defense · § 1.1
The Promise

Large Language Models (LLMs) offer a "Reasoning-on-the-fly" capability without task-specific fine-tuning.

The Reliability Wall

Semantic Drift

LLMs often propose "hallucinated" fixes in the wrong files or break existing business logic.

False Safety

Models confidently claim a vulnerability is fixed, but the fix is only superficial (e.g., adding a comment or a useless check).

The Goal

We need a system that moves from "Generation" to "Execution-Verified Mitigation."

Part II · Problems

Domain Misalignment & The "Silent Failure" Paradigm

APR vs. AVR

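General-purpose coding agents (e.g., SWE-agent) pursue the APR objective: make the unit tests pass.

Automated Vulnerability Repair (AVR) must instead optimize for eliminating sinks, the code locations where untrusted input reaches a dangerous operation.

The Inadequacy of Static Validation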

Structural Matching: Penalizes innovative, valid patches.

LLM-as-Judge: Prone to bias and lacks "executable proof," leading to a false sense of security.

Web-Specific Complexity

Unlike C/C++ programs, where crashes and segfaults provide clear signals, web vulnerabilities are often "silent failures": the app runs fine but remains exploitable.
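A minimal sketch of a silent failure, using Python's built-in `sqlite3` module. The `users` table, the two login functions, and the payload are illustrative assumptions, not code from the thesis testbed. Both versions pass an ordinary functional check; only an exploit attempt tells them apart.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with one demo user (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")
    return conn


def login_vulnerable(conn: sqlite3.Connection, name: str, password: str) -> bool:
    # Sink: untrusted input is concatenated into the SQL text.
    query = f"SELECT 1 FROM users WHERE name = '{name}' AND password = '{password}'"
    return conn.execute(query).fetchone() is not None


def login_patched(conn: sqlite3.Connection, name: str, password: str) -> bool:
    # Parameterized query: input is bound as data and never parsed as SQL.
    query = "SELECT 1 FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (name, password)).fetchone() is not None


conn = make_db()
payload = "' OR '1'='1"

# Functional behavior is identical, so a unit test sees no difference.
functional_ok = (
    login_vulnerable(conn, "alice", "s3cret")
    and login_patched(conn, "alice", "s3cret")
    and not login_vulnerable(conn, "alice", "wrong")
    and not login_patched(conn, "alice", "wrong")
)

# Only the exploit attempt reveals the vulnerable version.
exploit_vulnerable = login_vulnerable(conn, "alice", payload)
exploit_patched = login_patched(conn, "alice", payload)
```

No crash or error occurs in either case, which is why validation for AVR needs an explicit exploit-oriented check rather than functional tests alone.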

Limitations of Single-Agent Systems


Cognitive Overload

Security repair is a high-stakes, multi-step process. A single prompt cannot effectively manage file discovery, root-cause analysis, and regression testing simultaneously.

Hallucination Risk

Single agents often exhibit "confirmation bias": they assume their own patch works without any external verification.

Based on these insights, we propose a Multi-Agent Workflow that decomposes the complex repair task into specialized roles to improve precision and reliability.

Part III · Supported Workflows

Four practical security review scenarios.

I. Issue Review

Analyze security-related issue reports and produce structured findings.

II. Pull Request Review

Review code changes in PRs and provide risk analysis with mitigation suggestions.

III. Repo-wide Full Scan

Run baseline security scanning across the whole repository.

IV. Repo-wide Incremental Scan

Re-scan only the scope that changed between the base and head commits, for continuous security assurance.
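One way the changed scope for an incremental scan could be derived is from `git diff --name-status`. The sketch below is an assumption about how such a step might look, not the framework's actual implementation; `SCANNABLE`, `parse_changed_files`, and `changed_scope` are hypothetical names.

```python
import subprocess
from pathlib import PurePosixPath

# Assumption: only these file types are worth re-scanning.
SCANNABLE = {".py", ".js", ".ts", ".java"}


def parse_changed_files(name_status: str) -> list[str]:
    """Parse `git diff --name-status` output into a list of files to re-scan.

    Deleted files are skipped; for renames the new path (last field) is used.
    """
    files = []
    for line in name_status.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        status, path = parts[0], parts[-1]
        if status.startswith("D"):
            continue
        if PurePosixPath(path).suffix in SCANNABLE:
            files.append(path)
    return files


def changed_scope(base: str, head: str, repo: str = ".") -> list[str]:
    """Return scannable files changed on `head` since it diverged from `base`."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-status", f"{base}...{head}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_changed_files(out)


sample = "M\tapp/views.py\nD\told.py\nR100\ta.js\tb.js\nA\tREADME.md\n"
scope = parse_changed_files(sample)
```

The three-dot form `base...head` compares `head` against the merge base, which matches the usual pull-request view of "what this branch changed".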

Part IV · Methodology
The Agentic Security Lifecycle

A Multi-Stage Workflow for Autonomous Threat Mitigation

The methodology is shared across all workflows; the repo-wide scan is used here as the most complete example.

STAGE 01
Monitor

Systematic scanning of the entire codebase or specific change sets (Diffs).

Identification of potential vulnerability candidates and security hotspots.

STAGE 02
Analysis

The analyzer audits input signals together with repository evidence to decide whether a case is confirmed.

Analyzer Gate: If the case does not pass this gate, the workflow stops before mitigation; if it passes, the agent verifies the root cause, models exploit chains, and assigns dynamic CVSS v4 scores.

STAGE 03
Mitigation & Verification

Mitigate: Generates minimal patches that remove vulnerability sinks while preserving business logic.

Verify: Audits original input, analyzer conclusions, and patch content together with regression checks.

Verifier Gate: If verification does not pass, mitigation output is blocked while validated analysis remains available for reporting.

STAGE 04
Merge & Controlled Delivery

Correlates related fixes and consolidates them into unified delivery bundles.

Produces traceable artifacts for evidence, reasoning, and patch status, then automates report and GitHub Pull Request generation under explicit human approval.
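The two gates in the lifecycle above can be sketched as plain control flow. This is an illustrative outline under stated assumptions: `Case`, `run_case`, and the three callables standing in for the analyzer, mitigator, and verifier agents are hypothetical, and the real agents are LLM-driven rather than simple functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Case:
    candidate: str
    confirmed: bool = False
    patch: Optional[str] = None  # set only when the verifier gate passes
    verified: bool = False
    log: list[str] = field(default_factory=list)


def run_case(
    candidate: str,
    analyze: Callable[[str], bool],
    mitigate: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> Case:
    """Run one candidate through the analyzer gate and the verifier gate."""
    case = Case(candidate)

    # Analyzer gate: unconfirmed cases never reach mitigation.
    case.confirmed = analyze(candidate)
    if not case.confirmed:
        case.log.append("analyzer gate: rejected, stopped before mitigation")
        return case
    case.log.append("analyzer gate: confirmed")

    # Verifier gate: a failing patch is blocked, the analysis is kept.
    patch = mitigate(candidate)
    case.verified = verify(candidate, patch)
    if case.verified:
        case.patch = patch
        case.log.append("verifier gate: passed")
    else:
        case.log.append("verifier gate: failed, patch blocked, analysis kept")
    return case


accepted = run_case("sqli", lambda c: True, lambda c: "patch", lambda c, p: True)
rejected = run_case("noise", lambda c: False, lambda c: "patch", lambda c, p: True)
blocked = run_case("xss", lambda c: True, lambda c: "bad", lambda c, p: False)
```

In the blocked case the record still carries `confirmed=True`, mirroring the rule that validated analysis remains available for reporting even when the patch is withheld.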

Advanced Agent Implementation

Structured Context & Evidence-Oriented Exploration

To move beyond prompt-only reasoning, each stage operates on grounded repository context.

Structured Context Injection
  • The agent receives rich repository context, task constraints, and relevant security signals.
  • This keeps stage behavior aligned with the current review scope and reduces ambiguity.
Evidence-Oriented Exploration
  • Agents can inspect the codebase directly to gather supporting and contradicting evidence.
  • Agents can run simple command-line checks to validate their assumptions.
Reviewer Load Reduction

Two Merge Points, Two Different Goals

BEFORE ANALYZER

Merge Before Analyzer (Reduce Analysis Load)

  • Remove obvious false positives first.
  • Merge items only when they describe the same underlying explanation.
  • Goal: reduce the number of analysis targets without losing meaningful risk signals.
AFTER VERIFIER

Merge After Verifier (Reduce Reviewer Overload)

  • Group outputs only when delivery execution is strongly coupled.
  • Rule: merge only if fixes must be applied together to be complete, or to avoid execution conflicts.
  • Goal: keep human review volume manageable while preserving delivery correctness.
CVSS v4

CVSS v4 Scoring (Prioritize Reviewer Attention)

  • Confirmed vulnerabilities receive standardized CVSS v4 scoring.
  • This gives reviewers a clearer severity and priority signal to make decisions.
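The post-verifier merge rule can be illustrated with a small grouping routine. The sketch assumes that "execution conflict" means two patches touching the same file; that criterion, the `bundle_fixes` name, and the fix IDs are assumptions for illustration, and the framework's actual coupling test may be richer.

```python
from collections import defaultdict


def bundle_fixes(fixes: dict[str, set[str]]) -> list[list[str]]:
    """Group fix IDs whose patches touch at least one common file.

    `fixes` maps a fix ID to the set of files its patch modifies.
    Uses a simple union-find so that coupling is transitive.
    """
    parent = {fid: fid for fid in fixes}

    def find(x: str) -> str:
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    owner: dict[str, str] = {}  # file -> first fix seen touching it
    for fid, files in fixes.items():
        for f in files:
            if f in owner:
                parent[find(fid)] = find(owner[f])
            else:
                owner[f] = fid

    groups: dict[str, list[str]] = defaultdict(list)
    for fid in fixes:
        groups[find(fid)].append(fid)
    return sorted(sorted(g) for g in groups.values())


bundles = bundle_fixes(
    {
        "F1": {"db.py"},
        "F2": {"db.py", "auth.py"},
        "F3": {"views.py"},
        "F4": {"auth.py"},
    }
)
```

Here F1, F2 and F4 end up in one combined delivery because they are linked through shared files, while F3 stays a single delivery.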
Part V · Evaluation
Closed-Loop Benchmarking with PatchEval
The Benchmarking Dataset: PatchEval

High-Fidelity Samples: Each test case serves as an independent "miniature repository" representing real-world vulnerability scenarios.

Triad Data Structure:

Original Codebase
The baseline environment containing the actual vulnerability.
Vulnerability Description
Acts as the analysis heuristic and task directive for the Agent.
Test Scripts
Pre-defined automation logic serving as the Gold Standard for success verification.
Automated Evaluation Pipeline
  1. Scenario Simulation: Transforms each PatchEval case into a dedicated Agent Workspace.

  2. End-to-End Mitigation: The Agent autonomously performs path discovery, analysis, and patch generation based on the provided descriptions.

  3. Execution-Based Verification: The system automatically executes the accompanying test scripts; a fix is marked successful only if all tests pass without regressing original functionality.

Metrics & Strategic Value
  • Repair Success Rate: The fraction of cases in which the Agent's patch passes all accompanying test scripts.
  • Zero-Shot Automation: The entire process runs without human intervention, which keeps the evaluation results repeatable and objective.
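The execution-based verification step and the success-rate metric can be sketched as follows. This is not PatchEval's own harness; `run_tests` and `repair_success_rate` are hypothetical helpers, and the test commands are stand-ins for a case's real test scripts.

```python
import subprocess
import sys


def run_tests(commands: list[list[str]], cwd: str = ".", timeout: int = 60) -> bool:
    """Return True only if every test command exits with status 0."""
    for cmd in commands:
        try:
            result = subprocess.run(
                cmd, cwd=cwd, capture_output=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            # A hanging test counts as a failed verification.
            return False
        if result.returncode != 0:
            return False
    return True


def repair_success_rate(outcomes: list[bool]) -> float:
    """Fraction of cases whose patch passed all of its test commands."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Stand-in test commands: one that passes and one that fails.
passing = [sys.executable, "-c", "pass"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

outcomes = [
    run_tests([passing]),
    run_tests([passing, failing]),
    run_tests([passing, passing]),
    run_tests([passing]),
]
rate = repair_success_rate(outcomes)
```

Treating a single failing or timed-out command as a failed repair reflects the all-tests-must-pass rule stated in the pipeline.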

Case Study: Full-Stack System Sanitization

Real-world Vulnerability Discovery & Mitigation in a Simulated E-commerce Environment

The Number Story (End-to-End)
  • 79 files traversed -> 69 scanned -> 38 candidates -> 30 focused cases.
  • 30 analyzed -> 23 confirmed (CVSS scored) -> 21 verification-recognized.
  • 23 retained -> 11 publishable deliveries (5 combined + 6 single).
Testbed: Vulnerable-Book-Shop
  • Architecture: A complete full-stack application featuring complex business logic: User Auth, Search, Cart Management, and Payment flows.
  • Real-world Complexity: Built with actual database connections, JWT authentication, and mock payment gateways.
Vulnerability Instrumentation
  • SQL Injection: Intentionally embedded non-parameterized queries within the User Registration API.
  • Stored XSS: Introduced unsafe rendering logic in the user reviews section.
Agent Performance & Outcomes
  • Autonomous Discovery: The Agent successfully identified high-risk vulnerabilities hidden deep within the authentication and payment modules during a Repo-wide scan.
  • Closed-Loop Mitigation: For SQLi, the Agent generated patches that switch to parameterized queries; for XSS, it introduced HTML escaping logic. It then submitted the fixes as automated Pull Requests.
  • Deliverables: Produced security reports covering impact analysis, fix rationale, and verification results, demonstrating the workflow end to end on a realistic full-stack testbed.
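The XSS mitigation described above amounts to escaping user-supplied text before it is rendered. A minimal illustration using Python's standard `html.escape` follows; the rendering functions are hypothetical and the testbed's actual language and template layer are not stated in these slides.

```python
import html


def render_review_unsafe(author: str, body: str) -> str:
    # Sink: user-controlled text is placed into HTML unmodified.
    return f"<li><b>{author}</b>: {body}</li>"


def render_review_escaped(author: str, body: str) -> str:
    # html.escape converts <, >, & and quotes into HTML entities.
    return f"<li><b>{html.escape(author)}</b>: {html.escape(body)}</li>"


payload = "<script>alert(1)</script>"
unsafe_html = render_review_unsafe("mallory", payload)
escaped_html = render_review_escaped("mallory", payload)
```

In the escaped output the payload is displayed as text instead of being executed by the browser.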
Part VI · Ethics

Ethical Awareness & Governance: Balancing Defense and Risk

The Dual-Use Dilemma (Insights from Anthropic's Mythos)

Long-term Defensive Utility: Similar to the evolution of fuzzers, autonomous agents are expected to become essential defensive tools. While they accelerate vulnerability discovery, their long-term value lies in empowering defenders through automation.

The Case for Staged Deployment: To mitigate potential short-term shifts in the threat landscape, these systems should initially be limited to organizations with robust governance and auditing capabilities, ensuring a responsible path to public release.

Ethical Positioning of This Project
  • Blue-Team Orientation: The framework is fundamentally designed for Detection and Mitigation rather than Exploitation and Offense.
  • Intentional Capability Limitation: Our architecture excludes dedicated Exploit Agents. By focusing on defensive workflows, we minimize the ethical concerns associated with red-team-oriented tools.
Safety Safeguards
  • Sandboxed Execution: All agent operations and code executions are confined to isolated sandbox environments, ensuring no real-world impact.
  • Governance Perspective: The ethical challenge is not just the utility of the tool, but the transparency of how it is released, to whom, and under what safeguards.
In Closing

Thank you.
Questions welcome.

Contribution

An agentic AVR framework with explicit gates and closed-loop verification.

Evaluation

PatchEval benchmark · Vulnerable-Book-Shop case study.

Ethical Stance

Blue-team only · sandboxed · staged release.
