
CVC26-Multiagent-anomaly

Cascaded Surveillance Anomaly Detection with Vision–Language Foundation Model Reasoning and Semantic Label Stabilization



🎯 Overview

This repository contains the official implementation of "Cascaded Surveillance Anomaly Detection with Vision–Language Foundation Model Reasoning and Semantic Label Stabilization".

The project targets large‑scale camera networks (e.g., smart‑city CCTV, industrial plants, critical infrastructure), where operators must monitor many streams in real time but can only escalate a small subset of truly suspicious events to expensive vision–language models (VLMs). The codebase implements the full cascaded system, evaluation pipeline, and multi‑agent coordination logic described in the paper, so that reviewers and practitioners can:

  • Reproduce the frame‑level detection metrics on UCF‑Crime.
  • Validate cross‑dataset generalisation on ShanghaiTech Campus and XD‑Violence.
  • Measure per‑stage latency and early‑exit behaviour under different loads.
  • Benchmark the multi‑agent, multi‑stream scheduling strategy for latency‑constrained VLM calls.

We propose a three-stage cascaded pipeline for real-time surveillance anomaly detection that combines:

| Stage | Component | Purpose | Latency |
|-------|-----------|---------|---------|
| I | Convolutional Autoencoder | Fast anomaly gating via reconstruction error | 6.5 ms |
| II | YOLOv8 Object Detection | Person/object semantic classification | 8.5 ms |
| III | Vision-Language Model | Human-interpretable explanations | ~2.3 s |

Why Cascaded?

Traditional approaches process every frame through expensive models. Our cascade exits early for normal frames:

πŸ“Š Efficiency Gains:
β”œβ”€β”€ 72% of frames exit at Stage I (never reach YOLO)
β”œβ”€β”€ 95% of frames exit by the end of Stage II (never reach the VLM)
└── Only true anomalies trigger full pipeline
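The early-exit routing can be sketched in a few lines. This is a minimal illustration of the control flow, not the repo's API: the stage functions (`ae_score`, `detect_persons`, `query_vlm`) are injected stand-ins, and Ο„ defaults to the optimal threshold reported below.

```python
def run_cascade(frame, ae_score, detect_persons, query_vlm, tau=0.001104):
    """Route one frame through the three-stage cascade with early exits.

    ae_score, detect_persons, and query_vlm are injected callables so the
    sketch stays model-agnostic; tau is the Stage I MSE threshold.
    """
    score = ae_score(frame)
    if score <= tau:                       # Stage I early exit (~72% of frames)
        return {"label": "normal", "stage": 1, "score": score}
    if not detect_persons(frame):          # Stage II early exit (no person)
        return {"label": "generic_anomaly", "stage": 2, "score": score}
    return {"label": query_vlm(frame), "stage": 3, "score": score}

# Toy usage with stub stages: a low score exits at Stage I.
result = run_cascade(
    frame=None,
    ae_score=lambda f: 0.0005,
    detect_persons=lambda f: False,
    query_vlm=lambda f: "person_intrusion",
)
```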

πŸ“š Project Description

At a high level, the project is structured around four interacting components:

  • Stage I – Autoencoder Gate (ConvAutoencoder)

    • Learns to reconstruct only normal surveillance frames at (128 \times 128) resolution.
    • Computes a per‑frame MSE reconstruction error that becomes the primary anomaly score.
    • Implements a tunable threshold ( \tau ) that decides whether a frame is normal (early exit) or suspicious (forwarded to Stage II).
  • Stage II – Semantic Object Detector (YOLOv8‑nano)

    • Runs only on frames that pass the AE gate.
    • Detects persons and contextual objects (e.g. vehicles, bags) using Ultralytics YOLOv8.
    • Encodes whether a suspicious frame contains people, enabling a split between texture / environmental anomalies and person‑centric events.
  • Stage III – Vision–Language Model (VLM)

    • Invoked only for person‑containing anomalies, drastically reducing the number of expensive calls.
    • Given a frame crop and a prompt (β€œDescribe the anomalous activity in this surveillance frame”), maps events into stable semantic categories such as person_intrusion, violent_activity, theft_attempt, etc.
    • This stage is model‑agnostic: the code treats the VLM as an external black box with configurable latency, which is important for the multi‑agent latency analysis.
  • Multi‑Agent Coordination Layer

    • Models each camera stream as an independent β€œagent” feeding frames into a shared processing queue.
    • Implements event‑driven vs. cyclical agent modes, and measures throughput, queue depth, dropped frames, and scaling efficiency as the number of streams grows (1, 2, 4, 8, 16).
    • Provides detailed system‑level metrics so reviewers can see how the cascade behaves under realistic multi‑stream load, not just on a single offline video.

On top of these core components, the repository includes:

  • Dataset loaders for UCF‑Crime, ShanghaiTech Campus, and XD‑Violence.
  • A shared metrics module (metrics.py) implementing frame‑ and video‑level AUC, AP, F1, and optimal threshold search.
  • Utility scripts to generate the tables and plots used in the paper (without shipping any LaTeX or figure‑generation code).

Taken together, the repo is intended to serve both as a reproducible research artifact for CVC 2026 and as a reference implementation for engineers building cascaded, multi‑agent VLM‑based surveillance systems.


🌟 Key Contributions

  1. Cascaded Architecture: Multi-stage pipeline with early-exit mechanism for computational efficiency
  2. Reconstruction-based Gating: Lightweight autoencoder filters normal frames before expensive detection
  3. Semantic Label Stabilization: VLM reasoning provides consistent, human-interpretable anomaly categories
  4. Real-time Performance: ~152 FPS for Stage I, enabling deployment on edge devices

πŸ—οΈ System Architecture

Dual-Stage Perception Pipeline

Dual Stage Perception Architecture

High-Level Pipeline

                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚         SURVEILLANCE CAMERA             β”‚
                              β”‚            (Video Stream)               β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                 β”‚
                                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                    PREPROCESSING                                        β”‚
β”‚                                                                                         β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”‚
β”‚   β”‚ Frame Grab   β”‚ β†’ β”‚  Resize to   β”‚ β†’ β”‚  Normalize   β”‚ β†’ β”‚  To Tensor   β”‚        β”‚
β”‚   β”‚              β”‚    β”‚   128Γ—128    β”‚    β”‚   [0, 1]     β”‚    β”‚  (B,3,H,W)   β”‚        β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                 β”‚
                                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           STAGE I: AUTOENCODER GATE                                     β”‚
β”‚                                  (6.5 ms/frame)                                         β”‚
β”‚                                                                                         β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚   β”‚                              ENCODER                                             β”‚  β”‚
β”‚   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚  β”‚
β”‚   β”‚  β”‚  Input  β”‚    β”‚ Conv2D  β”‚    β”‚ Conv2D  β”‚    β”‚        Conv2D               β”‚   β”‚  β”‚
β”‚   β”‚  β”‚ 128Γ—128 β”‚ β†’ β”‚ 3β†’32 ch β”‚ β†’ β”‚ 32β†’64ch β”‚ β†’ β”‚       64β†’128 ch             β”‚   β”‚  β”‚
β”‚   β”‚  β”‚  Γ—3 ch  β”‚    β”‚  k=3Γ—3  β”‚    β”‚  k=3Γ—3  β”‚    β”‚        k=3Γ—3               β”‚   β”‚  β”‚
β”‚   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚  s=2    β”‚    β”‚  s=2    β”‚    β”‚         s=2                β”‚   β”‚  β”‚
β”‚   β”‚                 β”‚ ReLU    β”‚    β”‚ ReLU    β”‚    β”‚        ReLU                β”‚   β”‚  β”‚
β”‚   β”‚                 β”‚ 64Γ—64   β”‚    β”‚ 32Γ—32   β”‚    β”‚       16Γ—16                β”‚   β”‚  β”‚
β”‚   β”‚                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚  β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                        β”‚                                                β”‚
β”‚                            Bottleneck: 16Γ—16Γ—128 = 32,768 dims                         β”‚
β”‚                                        β”‚                                                β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚   β”‚                              DECODER                                             β”‚  β”‚
β”‚   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚  β”‚
β”‚   β”‚  β”‚      ConvTranspose2D        β”‚    β”‚ConvT2D  β”‚    β”‚ConvT2D  β”‚    β”‚ Output  β”‚   β”‚  β”‚
β”‚   β”‚  β”‚        128β†’64 ch            β”‚ β†’ β”‚ 64β†’32ch β”‚ β†’ β”‚ 32β†’3 ch β”‚ β†’ β”‚ 128Γ—128 β”‚   β”‚  β”‚
β”‚   β”‚  β”‚          k=3Γ—3              β”‚    β”‚  k=3Γ—3  β”‚    β”‚  k=3Γ—3  β”‚    β”‚  Γ—3 ch  β”‚   β”‚  β”‚
β”‚   β”‚  β”‚           s=2               β”‚    β”‚  s=2    β”‚    β”‚  s=2    β”‚    β”‚         β”‚   β”‚  β”‚
β”‚   β”‚  β”‚          ReLU               β”‚    β”‚ ReLU    β”‚    β”‚ Sigmoid β”‚    β”‚         β”‚   β”‚  β”‚
β”‚   β”‚  β”‚         32Γ—32               β”‚    β”‚ 64Γ—64   β”‚    β”‚ 128Γ—128 β”‚    β”‚         β”‚   β”‚  β”‚
β”‚   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚  β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                                                                         β”‚
β”‚                    Anomaly Score = MSE(Input, Reconstruction)                          β”‚
β”‚                                                                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                 β”‚
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β”‚                                               β”‚
                   score ≀ Ο„                                       score > Ο„
                   (Normal)                                        (Anomaly)
                         β”‚                                               β”‚
                         β–Ό                                               β–Ό
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚   βœ… NORMAL      β”‚                    β”‚      STAGE II: YOLO DETECTION    β”‚
              β”‚   Early Exit     β”‚                    β”‚           (8.5 ms/frame)         β”‚
              β”‚                  β”‚                    β”‚                                   β”‚
              β”‚  ~72% of frames  β”‚                    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
              β”‚  exit here       β”‚                    β”‚  β”‚      YOLOv8-nano          β”‚   β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚  β”‚                           β”‚   β”‚
                                                      β”‚  β”‚  β€’ Person Detection       β”‚   β”‚
                                                      β”‚  β”‚  β€’ Object Classification  β”‚   β”‚
                                                      β”‚  β”‚  β€’ Bounding Boxes         β”‚   β”‚
                                                      β”‚  β”‚  β€’ Confidence Scores      β”‚   β”‚
                                                      β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
                                                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                                       β”‚
                                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                              β”‚                                                  β”‚
                                        No Person                                          Person Found
                                              β”‚                                                  β”‚
                                              β–Ό                                                  β–Ό
                               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                               β”‚  ⚠️ Generic Anomaly  β”‚                       β”‚    STAGE III: VLM REASONING     β”‚
                               β”‚                      β”‚                       β”‚          (~2.3 s/event)         β”‚
                               β”‚  Motion/Texture      β”‚                       β”‚                                  β”‚
                               β”‚  Anomaly             β”‚                       β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
                               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚  β”‚   Vision-Language Model   β”‚  β”‚
                                                                              β”‚  β”‚                           β”‚  β”‚
                                                                              β”‚  β”‚  "Describe this scene     β”‚  β”‚
                                                                              β”‚  β”‚   and identify the        β”‚  β”‚
                                                                              β”‚  β”‚   anomalous activity"     β”‚  β”‚
                                                                              β”‚  β”‚                           β”‚  β”‚
                                                                              β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
                                                                              β”‚                                  β”‚
                                                                              β”‚  Output: Semantic Label          β”‚
                                                                              β”‚  β€’ person_intrusion              β”‚
                                                                              β”‚  β€’ suspicious_behavior           β”‚
                                                                              β”‚  β€’ violent_activity              β”‚
                                                                              β”‚  β€’ theft_attempt                 β”‚
                                                                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Semantic Label Categories

| Category | Description | Trigger Condition |
|----------|-------------|-------------------|
| camera_blur | Camera lens obstruction or defocus | High reconstruction error, no person |
| person_intrusion | Unauthorized person detected | Person in restricted zone |
| suspicious_behavior | Unusual movement patterns | Person + abnormal pose/motion |
| violent_activity | Physical altercation | Multiple persons + rapid motion |
| theft_attempt | Suspicious object interaction | Person + reaching/grabbing motion |
| environmental | Lighting/weather anomaly | High error, scene-wide change |
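Mapping free-form VLM text onto these stable categories is the "label stabilization" step. One plausible sketch is ordered keyword matching; the rule order and keyword lists below are illustrative assumptions, not the repo's actual mapping:

```python
# Ordered keyword rules: first match wins; "environmental" is the fallback
# for scene-wide anomalies with no matching person cue. Lists are illustrative.
STABLE_LABEL_RULES = [
    ("violent_activity", ("fight", "punch", "altercation", "attack")),
    ("theft_attempt", ("steal", "theft", "grab", "snatch")),
    ("person_intrusion", ("intru", "trespass", "restricted", "unauthorized")),
    ("camera_blur", ("blur", "obstruct", "defocus")),
    ("suspicious_behavior", ("loiter", "suspicious", "unusual")),
]

def stabilize_label(vlm_text):
    """Collapse a free-form VLM description into one stable category."""
    text = vlm_text.lower()
    for label, keywords in STABLE_LABEL_RULES:
        if any(k in text for k in keywords):
            return label
    return "environmental"

label = stabilize_label("Two people appear to be in a physical altercation.")
```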

πŸ”§ Model Details

Autoencoder Architecture

ConvAutoencoder(
  (encoder): Sequential(
    (0): Conv2d(3, 32, kernel_size=3, stride=2, padding=1)    # 128β†’64
    (1): ReLU(inplace=True)
    (2): Conv2d(32, 64, kernel_size=3, stride=2, padding=1)   # 64β†’32
    (3): ReLU(inplace=True)
    (4): Conv2d(64, 128, kernel_size=3, stride=2, padding=1)  # 32β†’16
    (5): ReLU(inplace=True)
  )
  (decoder): Sequential(
    (0): ConvTranspose2d(128, 64, kernel_size=3, stride=2, padding=1, output_padding=1)
    (1): ReLU(inplace=True)
    (2): ConvTranspose2d(64, 32, kernel_size=3, stride=2, padding=1, output_padding=1)
    (3): ReLU(inplace=True)
    (4): ConvTranspose2d(32, 3, kernel_size=3, stride=2, padding=1, output_padding=1)
    (5): Sigmoid()
  )
)

| Property | Value |
|----------|-------|
| Parameters | ~115K |
| Model Size | ~460 KB |
| Input Shape | (B, 3, 128, 128) |
| Bottleneck | (B, 128, 16, 16) |
| Compression Ratio | 1.5:1 |
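The compression ratio follows directly from the shapes: three 128Γ—128 input channels against a 128-channel 16Γ—16 bottleneck. A quick arithmetic check:

```python
# Input: 3 channels at 128x128; bottleneck: 128 channels at 16x16
# (the 32,768-dim bottleneck shown in the pipeline diagram).
input_dims = 3 * 128 * 128         # values per input frame
bottleneck_dims = 128 * 16 * 16    # values at the bottleneck
ratio = input_dims / bottleneck_dims
```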

Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| Optimizer | Adam |
| Learning Rate | 1e-3 |
| Batch Size | 32 |
| Epochs | 100 |
| Loss Function | MSE |
| Training Data | Normal frames only |

πŸ“Š Results

Detection Performance (UCF-Crime Dataset)

| Metric | Value | Description |
|--------|-------|-------------|
| Frame-level AUC | 72.37% | Area under ROC curve |
| Average Precision | 91.67% | Area under PR curve |
| Accuracy @ Optimal Ο„ | 75.21% | At threshold = 0.001104 |
| F1-Score @ Optimal Ο„ | 81.04% | Harmonic mean of P/R |
| Optimal Threshold | 0.001104 | MSE threshold value |

Reconstruction Quality

| Metric | Normal Frames | Anomaly Frames |
|--------|---------------|----------------|
| PSNR | 28-32 dB | 18-24 dB |
| SSIM | 0.90-0.95 | 0.70-0.85 |
| MSE | 0.001-0.002 | 0.004-0.010 |
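For frames normalised to [0, 1], PSNR follows directly from MSE as PSNR = βˆ’10Β·log₁₀(MSE), so the PSNR and MSE rows above are two views of the same quantity. A quick consistency check on representative values from each range:

```python
import math

def psnr_from_mse(mse, peak=1.0):
    """PSNR in dB for frames normalised to [0, peak]."""
    return 10 * math.log10(peak ** 2 / mse)

normal = psnr_from_mse(0.001)    # ~30 dB, inside the reported normal range
anomaly = psnr_from_mse(0.005)   # ~23 dB, inside the reported anomaly range
```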

Efficiency Benchmarks

| Stage | Mean Latency | Std Dev | FPS | GPU Memory |
|-------|--------------|---------|-----|------------|
| Preprocessing | 0.5 ms | 0.1 ms | 2000 | - |
| Stage I (AE) | 6.55 ms | 0.8 ms | 152.7 | 500 MB |
| Stage II (YOLO) | 8.45 ms | 1.2 ms | 118.3 | 1 GB |
| Stage III (VLM) | 2300 ms | 200 ms | 0.4 | 4 GB |
| Total (early exit) | 7.05 ms | - | 141.8 | - |
| Total (full pipeline) | 2315 ms | - | 0.4 | - |
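The average per-frame cost under load is the exit-rate-weighted mix of the per-path latencies above. A back-of-envelope computation, assuming 72% of frames exit at Stage I, 23% at Stage II, and 5% reach the VLM (the 23%/5% split is inferred from the 72% and 95% exit figures, not a measured quantity):

```python
# Stage latencies in ms; preprocessing is paid on every path.
pre, ae, yolo, vlm = 0.5, 6.55, 8.45, 2300.0
exit1, exit2, full = 0.72, 0.23, 0.05   # assumed per-path fractions

expected_ms = (
    exit1 * (pre + ae)                  # Stage I early exit
    + exit2 * (pre + ae + yolo)         # Stage II early exit
    + full * (pre + ae + yolo + vlm)    # full pipeline
)
# The VLM term dominates even at a 5% invocation rate, which is why
# the cascade gates VLM calls so aggressively.
```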

Early Exit Statistics

| Dataset | Exit @ Stage I (AE gate) |
|---------|--------------------------|
| UCF-Crime | 72.1% |

Additional datasets (ShanghaiTech, XD-Violence) can be evaluated using cross_dataset_evaluation.py.

Baseline Comparison

| Method | Year | AUC (%) | Notes |
|--------|------|---------|-------|
| C3D + MIL | 2018 | 75.41 | 3D CNN |
| RTFM | 2021 | 84.30 | Temporal features |
| MGFN | 2023 | 86.67 | Multi-granularity |
| ProDisc-VAD | 2025 | 87.31 | Prototype + discriminative |
| Ex-VAD | 2024 | 86.92 | Explainable VLM |
| VadCLIP | 2024 | 88.02 | CLIP-based |
| Ours (Cascade) | 2026 | 74.47 | Efficiency + interpretability |

The cascade trades some detection accuracy for a threefold reduction in latency and the ability to provide real-time semantic explanations via selective VLM invocation.

Ablation Study

| Configuration | AUC (%) | Exit Rate (%) |
|---------------|---------|---------------|
| YOLO Only | 72.15 | 0.0 |
| AE Only (Ours) | 72.37 | 0.0 |
| AE + YOLO Cascade | 73.87 | 68.3 |
| Full Cascade (AE+YOLO+VLM) | 74.47 | 72.1 |

πŸš€ Installation

Prerequisites

  • Python 3.10+
  • CUDA 11.8+ (recommended for GPU acceleration)
  • 8 GB RAM minimum
  • NVIDIA GPU with 4+ GB VRAM (optional)

Quick Start

# Clone repository
git clone https://github.com/speesrl/CVC26-Multiagent-anomaly.git
cd CVC26-Multiagent-anomaly

# Create virtual environment
python -m venv .venv

# Activate environment
# Windows:
.\.venv\Scripts\activate
# Linux/Mac:
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Dependencies

torch>=2.0.0
torchvision>=0.15.0
ultralytics>=8.0.0
opencv-python>=4.8.0
numpy>=1.24.0
scikit-learn>=1.3.0
scikit-image>=0.21.0
matplotlib>=3.7.0
Pillow>=10.0.0
tqdm>=4.65.0

Dataset Setup

  1. Download UCF-Crime Dataset:

    • Official: UCF-Crime
    • Extract frames to Test/ folder
  2. Expected Structure:

Test/
β”œβ”€β”€ Arrest/
β”‚   β”œβ”€β”€ frame_0001.png
β”‚   β”œβ”€β”€ frame_0002.png
β”‚   └── ...
β”œβ”€β”€ Arson/
β”œβ”€β”€ Assault/
β”œβ”€β”€ Burglary/
β”œβ”€β”€ Explosion/
β”œβ”€β”€ Fighting/
└── NormalVideos/
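A loader over this layout typically derives binary frame labels from the class folder name (NormalVideos β†’ 0, every other class β†’ 1). A self-contained sketch of that convention; the repo's ucf_crime_loader.py may differ in details:

```python
import tempfile
from pathlib import Path

def index_frames(root):
    """List (path, label) pairs for every .png frame under root,
    labelling frames in NormalVideos/ as 0 and all other class
    folders as 1 (anomalous)."""
    pairs = []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        label = 0 if class_dir.name == "NormalVideos" else 1
        for frame in sorted(class_dir.glob("*.png")):
            pairs.append((frame, label))
    return pairs

# Build a tiny fake Test/ tree just to show the labelling.
root = Path(tempfile.mkdtemp())
for cls in ("Arrest", "NormalVideos"):
    (root / cls).mkdir()
    (root / cls / "frame_0001.png").touch()
labels = [label for _, label in index_frames(root)]
```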

πŸ“– Usage

Run Complete Evaluation

python run_evaluation.py --data-dir ./Test --model ./autoencoder_model.pth

Options:

  • --data-dir: Path to UCF-Crime test frames
  • --model: Path to autoencoder weights
  • --output-dir: Results directory (default: ./evaluation_results)
  • --skip-ablation: Skip ablation study
  • --skip-latency: Skip latency benchmark
  • --demo: Run with synthetic data

Run Interactive Dashboard

python anomaly_dashboard.py

Features:

  • Real-time frame processing
  • Reconstruction visualization
  • Anomaly score display
  • Semantic label output

Python API

```python
import torch
from anomaly_dashboard import ConvAutoencoder
from metrics import evaluate_anomaly_detection

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ConvAutoencoder()
model.load_state_dict(torch.load("autoencoder_model.pth", weights_only=True))
model.to(device).eval()

# Process frame (move it to the same device as the model)
frame = preprocess(image).to(device)  # (1, 3, 128, 128)
with torch.no_grad():
    reconstruction = model(frame)
anomaly_score = torch.mean((frame - reconstruction) ** 2).item()

# Threshold decision
threshold = 0.0035
is_anomaly = anomaly_score > threshold
```

Standard Full-Video UCF-Crime Evaluation

python full_video_evaluation.py --ucf-crime /path/to/UCF-Crime \
    --ae-model autoencoder_model.pth --sample-rate 5

Cross-Dataset Evaluation (UCF-Crime + ShanghaiTech + XD-Violence)

python cross_dataset_evaluation.py \
    --ucf-crime /path/to/UCF-Crime \
    --shanghaitech /path/to/ShanghaiTech \
    --xd-violence /path/to/XD-Violence \
    --ae-model autoencoder_model.pth

Multi-Agent System Benchmark

python multi_agent_benchmark.py --ae-model autoencoder_model.pth \
    --streams 1,2,4,8,16 --duration 5.0

Run Ablation Study

python ablation_study.py --ae-model autoencoder_model.pth --demo

Run Latency Benchmark

python latency_benchmark.py --ae-model autoencoder_model.pth --n-frames 100

πŸ“ Project Structure

CVC26-Multiagent-anomaly/
β”‚
β”œβ”€β”€ πŸ“„ README.md                    # This file
β”œβ”€β”€ πŸ“„ LICENSE                      # MIT License
β”œβ”€β”€ πŸ“„ requirements.txt             # Python dependencies
β”‚
β”œβ”€β”€ 🧠 autoencoder_model.pth        # Trained model weights (240 KB)
β”‚
β”œβ”€β”€ πŸ““ AAMS.ipynb                   # Training notebook
β”‚
β”œβ”€β”€ πŸ–₯️ anomaly_dashboard.py         # GUI dashboard application
β”‚
β”œβ”€β”€ πŸ”¬ Evaluation Scripts
β”‚   β”œβ”€β”€ run_evaluation.py           # Main evaluation pipeline
β”‚   β”œβ”€β”€ full_video_evaluation.py    # Standard full-video UCF-Crime protocol
β”‚   β”œβ”€β”€ cross_dataset_evaluation.py # Multi-dataset eval (UCF/SHT/XD)
β”‚   β”œβ”€β”€ multi_agent_benchmark.py    # Scaling, queueing, resource metrics
β”‚   β”œβ”€β”€ metrics.py                  # AUC/ROC/PR computation
β”‚   β”œβ”€β”€ ablation_study.py           # Ablation experiments
β”‚   β”œβ”€β”€ latency_benchmark.py        # Timing measurements
β”‚   β”œβ”€β”€ ucf_crime_loader.py         # UCF-Crime dataset loader
β”‚   β”œβ”€β”€ shanghaitech_loader.py      # ShanghaiTech Campus loader
β”‚   └── xd_violence_loader.py       # XD-Violence dataset loader
β”‚
β”œβ”€β”€ πŸ“Š Results Generation
β”‚   β”œβ”€β”€ generate_results.py         # Generate metric tables
β”‚   └── final_metrics.py            # Final paper metrics
β”‚
β”œβ”€β”€ πŸ“š docs/
β”‚   β”œβ”€β”€ EVALUATION.md               # Evaluation methodology
β”‚   └── ARCHITECTURE.md             # Detailed architecture specs
β”‚
└── πŸ—‚οΈ Test/                        # UCF-Crime frames (not in repo)
    β”œβ”€β”€ Arrest/
    β”œβ”€β”€ Arson/
    └── ...

πŸ”¬ Methodology

Problem Formulation

Given surveillance video V = {f₁, fβ‚‚, ..., fβ‚œ}, predict frame-level anomaly labels Y = {y₁, yβ‚‚, ..., yβ‚œ} where yβ‚œ ∈ {0, 1}.

Stage I: Reconstruction Gate

The autoencoder is trained on normal frames only:

L_AE = (1/N) Ξ£ ||fα΅’ - fΜ‚α΅’||Β²

Anomaly score for frame f:

s(f) = (1/HW) Ξ£ (f_hw - fΜ‚_hw)Β²

Decision rule:

Ε· = 1  if s(f) > Ο„
    0  otherwise
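The score and decision rule above translate directly into numpy. Here the mean is taken over all channels and pixels, which is the standard per-frame MSE:

```python
import numpy as np

def anomaly_score(frame, reconstruction):
    """s(f): mean squared reconstruction error over the whole frame."""
    return float(np.mean((frame - reconstruction) ** 2))

def decide(score, tau):
    """y_hat = 1 iff s(f) > tau."""
    return int(score > tau)

# Toy frame: uniform 0.1 reconstruction error at every pixel.
f = np.zeros((3, 4, 4))
f_hat = np.full((3, 4, 4), 0.1)
s = anomaly_score(f, f_hat)   # 0.1**2 = 0.01
```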

Stage II: Semantic Classification

For frames with s(f) > Ο„, apply YOLOv8:

boxes, classes, conf = YOLO(f)
person_detected = βˆƒ c ∈ classes : c = "person"

Stage III: VLM Reasoning

For person-containing anomalies:

label = VLM(prompt, f)

Where prompt = "Describe the anomalous activity in this surveillance frame."

Detection vs Identification

| Task | Metric | Method |
|------|--------|--------|
| Detection | AUC-ROC | Reconstruction error thresholding |
| Identification | Human evaluation | VLM semantic labeling |

πŸ“œ Citation

If you use this code in your research, please cite:

@article{rehman2026cascaded,
  title={Cascaded Surveillance Anomaly Detection with Vision--Language Foundation Model Reasoning and Semantic Label Stabilization},
  author={Rehman, Tayyab and De Gasperis, Giovanni and Shmahell, Aly},
  year={2026}
}

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • Spee s.r.l β€” Industry partner and sponsor
  • University of L'Aquila β€” Academic support
  • UCF-Crime Dataset β€” Real-world Anomaly Detection in Surveillance Videos
  • Ultralytics YOLOv8 β€” Object detection
  • Baseline methods: C3D-MIL, RTFM, MGFN, VadCLIP

πŸ“§ Contact

For questions, issues, or collaboration:


Tayyab Rehman β€” University of L'Aquila / SPEE S.R.L.
