Cascaded Surveillance Anomaly Detection with Vision–Language Foundation Model Reasoning and Semantic Label Stabilization
- Overview
- Key Contributions
- System Architecture
- Model Details
- Results
- Installation
- Usage
- Project Structure
- Methodology
- Citation
- Acknowledgments
- Contact
This repository contains the official implementation of "Cascaded Surveillance Anomaly Detection with Vision–Language Foundation Model Reasoning and Semantic Label Stabilization".

The project targets large-scale camera networks (e.g., smart-city CCTV, industrial plants, critical infrastructure), where operators must monitor many streams in real time but can only escalate a small subset of truly suspicious events to expensive vision–language models (VLMs). The codebase implements the full cascaded system, evaluation pipeline, and multi-agent coordination logic described in the paper, so that reviewers and practitioners can:

- Reproduce the frame-level detection metrics on UCF-Crime.
- Validate cross-dataset generalisation on ShanghaiTech Campus and XD-Violence.
- Measure per-stage latency and early-exit behaviour under different loads.
- Benchmark the multi-agent, multi-stream scheduling strategy for latency-constrained VLM calls.
We propose a three-stage cascaded pipeline for real-time surveillance anomaly detection that combines:
| Stage | Component | Purpose | Latency |
|---|---|---|---|
| I | Convolutional Autoencoder | Fast anomaly gating via reconstruction error | 6.5 ms |
| II | YOLOv8 Object Detection | Person/object semantic classification | 8.5 ms |
| III | Vision-Language Model | Human-interpretable explanations | ~2.3 s |
Traditional approaches process every frame through expensive models. Our cascade exits early for normal frames:
Efficiency gains:

- 72% of frames exit at Stage I (never reach YOLO)
- 95% of frames exit by Stage II (never reach the VLM)
- Only true anomalies trigger the full pipeline
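As a back-of-the-envelope sanity check, the exit rates above can be combined with the per-stage latencies reported later in this README into an expected per-frame cost. This sketch assumes the exit percentages are cumulative (72% stop after Stage I, 95% stop by Stage II), as the parentheticals suggest:

```python
# Expected per-frame cost of the cascade, using the stage latencies
# and early-exit rates reported in this README.
stage_latency_ms = {"ae": 6.5, "yolo": 8.5, "vlm": 2300.0}

p_reach_yolo = 1.0 - 0.72   # 28% of frames pass the AE gate
p_reach_vlm = 1.0 - 0.95    # 5% of frames trigger the VLM

expected_ms = (
    stage_latency_ms["ae"]                     # every frame pays Stage I
    + p_reach_yolo * stage_latency_ms["yolo"]  # 28% also pay Stage II
    + p_reach_vlm * stage_latency_ms["vlm"]    # 5% also pay Stage III
)
print(f"expected per-frame cost: {expected_ms:.2f} ms")  # 123.88 ms
```

Even with the VLM's ~2.3 s latency, selective invocation keeps the average cost roughly an order of magnitude below running the VLM on every frame.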
At a high level, the project is structured around four interacting components:

- **Stage I – Autoencoder Gate (`ConvAutoencoder`)**
  - Learns to reconstruct only normal surveillance frames at 128 × 128 resolution.
  - Computes a per-frame MSE reconstruction error that becomes the primary anomaly score.
  - Implements a tunable threshold τ that decides whether a frame is normal (early exit) or suspicious (forwarded to Stage II).
- **Stage II – Semantic Object Detector (YOLOv8-nano)**
  - Runs only on frames that pass the AE gate.
  - Detects persons and contextual objects (e.g. vehicles, bags) using Ultralytics YOLOv8.
  - Encodes whether a suspicious frame contains people, enabling a split between texture/environmental anomalies and person-centric events.
- **Stage III – Vision-Language Model (VLM)**
  - Invoked only for person-containing anomalies, drastically reducing the number of expensive calls.
  - Given a frame crop and a prompt ("Describe the anomalous activity in this surveillance frame"), maps events into stable semantic categories such as `person_intrusion`, `violent_activity`, `theft_attempt`, etc.
  - This stage is model-agnostic: the code treats the VLM as an external black box with configurable latency, which is important for the multi-agent latency analysis.
- **Multi-Agent Coordination Layer**
  - Models each camera stream as an independent "agent" feeding frames into a shared processing queue.
  - Implements event-driven vs. cyclical agent modes, and measures throughput, queue depth, dropped frames, and scaling efficiency as the number of streams grows (1, 2, 4, 8, 16).
  - Provides detailed system-level metrics so reviewers can see how the cascade behaves under realistic multi-stream load, not just on a single offline video.
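The "streams as agents feeding a shared queue" idea can be sketched with standard-library threading. This is an assumed minimal design for illustration, not the repository's actual classes:

```python
import queue
import threading

# Shared queue that all camera "agents" feed; a worker drains it,
# standing in for the cascade pipeline.
frames = queue.Queue(maxsize=64)
processed = []

def agent(stream_id: int, n_frames: int) -> None:
    """Event-driven agent: pushes each new frame into the shared queue."""
    for i in range(n_frames):
        frames.put((stream_id, i))

def worker() -> None:
    """Drains the queue until it sees the None sentinel."""
    while True:
        item = frames.get()
        if item is None:
            break
        processed.append(item)  # stand-in for running the cascade on a frame

w = threading.Thread(target=worker)
w.start()
agents = [threading.Thread(target=agent, args=(s, 10)) for s in range(4)]
for t in agents:
    t.start()
for t in agents:
    t.join()
frames.put(None)  # signal shutdown after all agents finish
w.join()
print(len(processed))  # 40 frames from 4 simulated streams
```

The real benchmark additionally tracks queue depth, dropped frames, and scaling efficiency as the stream count grows.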
On top of these core components, the repository includes:

- Dataset loaders for UCF-Crime, ShanghaiTech Campus, and XD-Violence.
- A shared metrics module (`metrics.py`) implementing frame- and video-level AUC, AP, F1, and optimal threshold search.
- Utility scripts to generate the tables and plots used in the paper (without shipping any LaTeX or figure-generation code).

Taken together, the repo is intended to serve both as a reproducible research artifact for CVC 2026 and as a reference implementation for engineers building cascaded, multi-agent, VLM-based surveillance systems.
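The three-stage early-exit control flow described above can be sketched as a single function. Here `ae_score`, `detect_persons`, and `vlm_label` are hypothetical stubs used for illustration, not the repository's actual APIs:

```python
from dataclasses import dataclass

@dataclass
class CascadeResult:
    is_anomaly: bool
    label: str
    exit_stage: int

def cascade(frame, tau, ae_score, detect_persons, vlm_label) -> CascadeResult:
    score = ae_score(frame)               # Stage I: reconstruction error
    if score <= tau:
        return CascadeResult(False, "normal", 1)           # early exit
    if not detect_persons(frame):         # Stage II: YOLO person check
        return CascadeResult(True, "generic_anomaly", 2)
    return CascadeResult(True, vlm_label(frame), 3)        # Stage III: VLM

# Usage with toy stubs (tau set to the optimal threshold reported below):
r = cascade(
    frame=None, tau=0.001104,
    ae_score=lambda f: 0.0005,
    detect_persons=lambda f: False,
    vlm_label=lambda f: "person_intrusion",
)
print(r)  # CascadeResult(is_anomaly=False, label='normal', exit_stage=1)
```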
- Cascaded Architecture: Multi-stage pipeline with early-exit mechanism for computational efficiency
- Reconstruction-based Gating: Lightweight autoencoder filters normal frames before expensive detection
- Semantic Label Stabilization: VLM reasoning provides consistent, human-interpretable anomaly categories
- Real-time Performance: ~152 FPS for Stage I, enabling deployment on edge devices
```
Surveillance Camera (video stream)
        │
        ▼
PREPROCESSING
  frame grab → resize to 128×128 → normalize [0, 1] → tensor (B, 3, H, W)
        │
        ▼
STAGE I: AUTOENCODER GATE (6.5 ms/frame)
  Encoder:    Conv2d 3→32   (k=3×3, s=2) + ReLU → 64×64
              Conv2d 32→64  (k=3×3, s=2) + ReLU → 32×32
              Conv2d 64→128 (k=3×3, s=2) + ReLU → 16×16
  Bottleneck: 16×16×128 = 32,768 dims
  Decoder:    ConvTranspose2d 128→64 (k=3×3, s=2) + ReLU    → 32×32
              ConvTranspose2d 64→32  (k=3×3, s=2) + ReLU    → 64×64
              ConvTranspose2d 32→3   (k=3×3, s=2) + Sigmoid → 128×128
  Anomaly score = MSE(input, reconstruction)
        │
        ├── score ≤ τ → NORMAL: early exit (~72% of frames exit here)
        │
        ▼   score > τ
STAGE II: YOLO DETECTION (8.5 ms/frame)
  YOLOv8-nano: person detection, object classification,
               bounding boxes, confidence scores
        │
        ├── no person → generic anomaly (motion/texture)
        │
        ▼   person found
STAGE III: VLM REASONING (~2.3 s/event)
  Prompt: "Describe this scene and identify the anomalous activity"
  Output: semantic label
    • person_intrusion
    • suspicious_behavior
    • violent_activity
    • theft_attempt
```
| Category | Description | Trigger Condition |
|---|---|---|
| `camera_blur` | Camera lens obstruction or defocus | High reconstruction error, no person |
| `person_intrusion` | Unauthorized person detected | Person in restricted zone |
| `suspicious_behavior` | Unusual movement patterns | Person + abnormal pose/motion |
| `violent_activity` | Physical altercation | Multiple persons + rapid motion |
| `theft_attempt` | Suspicious object interaction | Person + reaching/grabbing motion |
| `environmental` | Lighting/weather anomaly | High error, scene-wide change |
```
ConvAutoencoder(
  (encoder): Sequential(
    (0): Conv2d(3, 32, kernel_size=3, stride=2, padding=1)    # 128→64
    (1): ReLU(inplace=True)
    (2): Conv2d(32, 64, kernel_size=3, stride=2, padding=1)   # 64→32
    (3): ReLU(inplace=True)
    (4): Conv2d(64, 128, kernel_size=3, stride=2, padding=1)  # 32→16
    (5): ReLU(inplace=True)
  )
  (decoder): Sequential(
    (0): ConvTranspose2d(128, 64, kernel_size=3, stride=2, padding=1, output_padding=1)
    (1): ReLU(inplace=True)
    (2): ConvTranspose2d(64, 32, kernel_size=3, stride=2, padding=1, output_padding=1)
    (3): ReLU(inplace=True)
    (4): ConvTranspose2d(32, 3, kernel_size=3, stride=2, padding=1, output_padding=1)
    (5): Sigmoid()
  )
)
```

| Property | Value |
|---|---|
| Parameters | ~115K |
| Model Size | ~460 KB |
| Input Shape | (B, 3, 128, 128) |
| Bottleneck | (B, 128, 16, 16) |
| Compression Ratio | 1.5:1 |
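The shape figures in the table above are internally consistent, as a quick arithmetic check shows:

```python
# Sanity check of the shapes in the table: input vs bottleneck dimensions.
input_dims = 3 * 128 * 128       # (3, 128, 128) input  -> 49,152 values
bottleneck_dims = 128 * 16 * 16  # (128, 16, 16) latent -> 32,768 values

ratio = input_dims / bottleneck_dims
print(ratio)  # 1.5 -> the reported 1.5:1 compression ratio
```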
| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| Learning Rate | 1e-3 |
| Batch Size | 32 |
| Epochs | 100 |
| Loss Function | MSE |
| Training Data | Normal frames only |
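The MSE objective from the table (optimized with Adam in the actual PyTorch code) can be written framework-free to make the per-pixel averaging explicit. This is a didactic sketch, not the training code:

```python
# Framework-free sketch of the per-batch MSE loss used for training.
# The real code uses PyTorch (Adam, lr=1e-3, batch size 32, 100 epochs,
# normal frames only); frames here are flat lists of pixel values.
def mse_loss(batch, recon):
    n = sum(len(frame) for frame in batch)  # total number of pixel values
    return sum(
        (x - y) ** 2
        for frame, rec in zip(batch, recon)
        for x, y in zip(frame, rec)
    ) / n

print(mse_loss([[0.0, 1.0]], [[0.0, 0.5]]))  # 0.125
```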
| Metric | Value | Description |
|---|---|---|
| Frame-level AUC | 72.37% | Area under ROC curve |
| Average Precision | 91.67% | Area under PR curve |
| Accuracy @ Optimal τ | 75.21% | At threshold = 0.001104 |
| F1-Score @ Optimal τ | 81.04% | Harmonic mean of P/R |
| Optimal Threshold | 0.001104 | MSE threshold value |
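Frame-level AUC treats the raw MSE scores as a ranking of anomalous over normal frames. The repo's `metrics.py` presumably wraps a library implementation; the rank-based definition can be sketched directly:

```python
# Rank-based ROC AUC: the probability that a randomly chosen anomalous
# frame scores higher than a randomly chosen normal one (ties count 0.5).
def roc_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: anomalous frames reconstruct worse, so AUC is perfect.
print(roc_auc([0.001, 0.002, 0.004, 0.008], [0, 0, 1, 1]))  # 1.0
```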
| Metric | Normal Frames | Anomaly Frames |
|---|---|---|
| PSNR | 28-32 dB | 18-24 dB |
| SSIM | 0.90-0.95 | 0.70-0.85 |
| MSE | 0.001-0.002 | 0.004-0.010 |
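The PSNR and MSE columns above are linked by the standard relationship for images normalized to [0, 1], PSNR = 10·log₁₀(1/MSE), which makes the two rows roughly consistent:

```python
import math

# PSNR/MSE relationship for [0, 1]-normalized images:
# PSNR (dB) = 10 * log10(MAX^2 / MSE), with MAX = 1.
def psnr_from_mse(mse: float) -> float:
    return 10.0 * math.log10(1.0 / mse)

print(round(psnr_from_mse(0.001), 1))  # 30.0 dB -> within the normal range
print(round(psnr_from_mse(0.010), 1))  # 20.0 dB -> within the anomaly range
```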
| Stage | Mean Latency | Std Dev | FPS | GPU Memory |
|---|---|---|---|---|
| Preprocessing | 0.5 ms | 0.1 ms | 2000 | - |
| Stage I (AE) | 6.55 ms | 0.8 ms | 152.7 | 500 MB |
| Stage II (YOLO) | 8.45 ms | 1.2 ms | 118.3 | 1 GB |
| Stage III (VLM) | 2300 ms | 200 ms | 0.4 | 4 GB |
| Total (early exit) | 7.05 ms | - | 141.8 | - |
| Total (full pipeline) | 2315 ms | - | 0.4 | - |
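The FPS column above is simply 1000 divided by the mean latency in milliseconds, which a one-liner confirms:

```python
# Recompute the FPS column of the latency table from the mean latencies.
for name, ms in [("Stage I (AE)", 6.55),
                 ("Stage II (YOLO)", 8.45),
                 ("Stage III (VLM)", 2300.0)]:
    print(name, round(1000.0 / ms, 1), "FPS")  # 152.7, 118.3, 0.4
```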
| Dataset | Exit @ Stage I (AE gate) |
|---|---|
| UCF-Crime | 72.1% |
Additional datasets (ShanghaiTech, XD-Violence) can be evaluated using `cross_dataset_evaluation.py`.
| Method | Year | AUC (%) | Notes |
|---|---|---|---|
| C3D + MIL | 2018 | 75.41 | 3D CNN |
| RTFM | 2021 | 84.30 | Temporal features |
| MGFN | 2023 | 86.67 | Multi-granularity |
| ProDisc-VAD | 2025 | 87.31 | Prototype + discriminative |
| Ex-VAD | 2024 | 86.92 | Explainable VLM |
| VadCLIP | 2024 | 88.02 | CLIP-based |
| Ours (Cascade) | 2026 | 74.47 | Efficiency + interpretability |
The cascade trades some detection accuracy for a threefold reduction in latency and the ability to provide real-time semantic explanations via selective VLM invocation.
| Configuration | AUC (%) | Exit Rate (%) |
|---|---|---|
| YOLO Only | 72.15 | 0.0 |
| AE Only (Ours) | 72.37 | 0.0 |
| AE + YOLO Cascade | 73.87 | 68.3 |
| Full Cascade (AE+YOLO+VLM) | 74.47 | 72.1 |
- Python 3.10+
- CUDA 11.8+ (recommended for GPU acceleration)
- 8 GB RAM minimum
- NVIDIA GPU with 4+ GB VRAM (optional)
```bash
# Clone repository
git clone https://github.com/speesrl/CVC26-Multiagent-anomaly.git
cd CVC26-Multiagent-anomaly

# Create virtual environment
python -m venv .venv

# Activate environment
# Windows:
.\.venv\Scripts\activate
# Linux/Mac:
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

`requirements.txt`:

```
torch>=2.0.0
torchvision>=0.15.0
ultralytics>=8.0.0
opencv-python>=4.8.0
numpy>=1.24.0
scikit-learn>=1.3.0
scikit-image>=0.21.0
matplotlib>=3.7.0
Pillow>=10.0.0
tqdm>=4.65.0
```
1. **Download the UCF-Crime dataset**
   - Official: UCF-Crime
   - Extract frames to the `Test/` folder

2. **Expected structure:**

```
Test/
├── Arrest/
│   ├── frame_0001.png
│   ├── frame_0002.png
│   └── ...
├── Arson/
├── Assault/
├── Burglary/
├── Explosion/
├── Fighting/
└── NormalVideos/
```
```bash
python run_evaluation.py --data-dir ./Test --model ./autoencoder_model.pth
```

Options:

- `--data-dir`: Path to UCF-Crime test frames
- `--model`: Path to autoencoder weights
- `--output-dir`: Results directory (default: `./evaluation_results`)
- `--skip-ablation`: Skip ablation study
- `--skip-latency`: Skip latency benchmark
- `--demo`: Run with synthetic data
```bash
python anomaly_dashboard.py
```

Features:
- Real-time frame processing
- Reconstruction visualization
- Anomaly score display
- Semantic label output
```python
import torch
from anomaly_dashboard import ConvAutoencoder
from metrics import evaluate_anomaly_detection

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ConvAutoencoder()
model.load_state_dict(torch.load("autoencoder_model.pth", weights_only=True))
model.to(device).eval()

# Process frame
frame = preprocess(image)  # (1, 3, 128, 128)
with torch.no_grad():
    reconstruction = model(frame)
anomaly_score = torch.mean((frame - reconstruction) ** 2).item()

# Threshold decision (the optimal MSE threshold reported on UCF-Crime is 0.001104)
threshold = 0.0035
is_anomaly = anomaly_score > threshold
```

Full-video evaluation (standard UCF-Crime protocol):

```bash
python full_video_evaluation.py --ucf-crime /path/to/UCF-Crime \
    --ae-model autoencoder_model.pth --sample-rate 5
```

Cross-dataset evaluation (UCF-Crime / ShanghaiTech / XD-Violence):

```bash
python cross_dataset_evaluation.py \
    --ucf-crime /path/to/UCF-Crime \
    --shanghaitech /path/to/ShanghaiTech \
    --xd-violence /path/to/XD-Violence \
    --ae-model autoencoder_model.pth
```

Multi-agent scaling benchmark:

```bash
python multi_agent_benchmark.py --ae-model autoencoder_model.pth \
    --streams 1,2,4,8,16 --duration 5.0
```

Ablation study:

```bash
python ablation_study.py --ae-model autoencoder_model.pth --demo
```

Latency benchmark:

```bash
python latency_benchmark.py --ae-model autoencoder_model.pth --n-frames 100
```

```
CVC26-Multiagent-anomaly/
│
├── README.md                       # This file
├── LICENSE                         # MIT License
├── requirements.txt                # Python dependencies
│
├── autoencoder_model.pth           # Trained model weights (240 KB)
│
├── AAMS.ipynb                      # Training notebook
│
├── anomaly_dashboard.py            # GUI dashboard application
│
├── Evaluation scripts
│   ├── run_evaluation.py           # Main evaluation pipeline
│   ├── full_video_evaluation.py    # Standard full-video UCF-Crime protocol
│   ├── cross_dataset_evaluation.py # Multi-dataset eval (UCF/SHT/XD)
│   ├── multi_agent_benchmark.py    # Scaling, queueing, resource metrics
│   ├── metrics.py                  # AUC/ROC/PR computation
│   ├── ablation_study.py           # Ablation experiments
│   ├── latency_benchmark.py        # Timing measurements
│   ├── ucf_crime_loader.py         # UCF-Crime dataset loader
│   ├── shanghaitech_loader.py      # ShanghaiTech Campus loader
│   └── xd_violence_loader.py       # XD-Violence dataset loader
│
├── Results generation
│   ├── generate_results.py         # Generate metric tables
│   └── final_metrics.py            # Final paper metrics
│
├── docs/
│   ├── EVALUATION.md               # Evaluation methodology
│   └── ARCHITECTURE.md             # Detailed architecture specs
│
└── Test/                           # UCF-Crime frames (not in repo)
    ├── Arrest/
    ├── Arson/
    └── ...
```
Given a surveillance video V = {f₁, f₂, ..., f_N}, predict frame-level anomaly labels Y = {y₁, y₂, ..., y_N} where yᵢ ∈ {0, 1}.

The autoencoder is trained on normal frames only:

L_AE = (1/N) Σᵢ ‖fᵢ − f̂ᵢ‖²

Anomaly score for frame f:

s(f) = (1/HW) Σ_{h,w} (f_hw − f̂_hw)²

Decision rule:

ŷ = 1 if s(f) > τ, else ŷ = 0

For frames with s(f) > τ, apply YOLOv8:

boxes, classes, conf = YOLO(f)
person_detected = ∃ c ∈ classes : c = "person"

For person-containing anomalies:

label = VLM(prompt, f)

where prompt = "Describe the anomalous activity in this surveillance frame."
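The decision rule and the Stage II person predicate translate directly into code (τ here is set to the optimal threshold reported in the results):

```python
# Decision rule: y_hat = 1 if s(f) > tau else 0.
def gate(score: float, tau: float) -> int:
    return 1 if score > tau else 0

# Stage II predicate: person_detected = ∃ c ∈ classes : c == "person".
def person_detected(classes) -> bool:
    return any(c == "person" for c in classes)

print(gate(0.004, 0.001104))            # 1 (anomalous frame)
print(person_detected(["car", "person"]))  # True
```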
| Task | Metric | Method |
|---|---|---|
| Detection | AUC-ROC | Reconstruction error thresholding |
| Identification | Human evaluation | VLM semantic labeling |
If you use this code in your research, please cite:
```bibtex
@article{rehman2026cascaded,
  title={Cascaded Surveillance Anomaly Detection with Vision--Language Foundation Model Reasoning and Semantic Label Stabilization},
  author={Rehman, Tayyab and De Gasperis, Giovanni and Shmahell, Aly},
  year={2026}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Spee s.r.l – Industry partner and sponsor
- University of L'Aquila – Academic support
- UCF-Crime Dataset – Real-world Anomaly Detection in Surveillance Videos
- Ultralytics YOLOv8 – Object detection
- Baseline methods: C3D-MIL, RTFM, MGFN, VadCLIP
For questions, issues, or collaboration:
- Email: tayyab.rehman@graduate.univaq.it
- GitHub Issues: Open an issue
- Organization: Spee s.r.l
Tayyab Rehman – University of L'Aquila / SPEE S.R.L.