Hybrid

2025

Multimodal VAD: Visual Anomaly Detection in Intelligent Monitoring System via Audio-Vision-Language

20 June 2025·8913 words·42 mins

Dicong Wang , Qilong Wang , Qinghua Hu , Kaijun Wu

The paper proposes a dual-stream multimodal video anomaly detection network that leverages video, audio, and text modalities to achieve reliable and precise anomaly detection. It introduces effective multimodal fusion, abnormal-aware context prompts (ACPs), and a coarse-support-fine strategy to enhance anomaly discrimination and description, demonstrating superior performance on large-scale datasets.

Networking Systems for Video Anomaly Detection: A Tutorial and Survey

1 April 2025·21983 words·104 mins

Jing Liu , Yang Liu , Jieyu Lin , Jielin Li , Liang Cao , Peng Sun , Bo Hu , Liang Song , Azzedine Boukerche , Victor C.M. Leung

Cuhk-Avenue Shanghaitech Xd-Violence Ubnormal Ucf-Crime Ucsd-Ped Hybrid Survey

A comprehensive survey and tutorial exploring the assumptions, frameworks, recent advances, applications, and future trends of Networking Systems for Video Anomaly Detection (NSVAD), emphasizing the integration of AI, IoVT, and computing for real-world deployable systems.

AADC-Net: A Multimodal Deep Learning Framework for Automatic Anomaly Detection in Real-Time Surveillance

31 March 2025·10163 words·48 mins

Duc Tri Phan , Vu Hoang Minh Doan , Jaeyeop Choi , Byeongil Lee , Junghwan Oh

Ucf-Crime Xd-Violence Hybrid Other

Introduces AADC-Net, a multimodal deep neural network leveraging pretrained vision-language models, large language models, and object detection (DETR) for real-time anomaly detection and categorization in surveillance videos. The framework addresses data scarcity, imbalance, and computational challenges, demonstrating state-of-the-art performance on multiple datasets, with practical deployment in smart gyms and healthcare settings.

Personalizing Vision-Language Models With Hybrid Prompts for Zero-Shot Anomaly Detection

13 February 2025·8885 words·42 mins

Yunkang Cao , Xiaohao Xu , Yuqi Cheng , Chen Sun , Zongwei Du , Liang Gao , Weiming Shen

Cuhk-Avenue Shanghaitech Xd-Violence Ubnormal Ucf-Crime Ucsd-Ped Other Weakly Supervised Semi Supervised Training Free Instruction Tuning Unsupervised Hybrid Other

Introduces AnomalyVLM, a framework that personalizes vision-language models for zero-shot anomaly detection. It derives hybrid prompts from prior knowledge for category-specific customization and incorporates an anomaly region generator and refiner to improve detection performance.

PLOVAD: Prompting Vision-Language Models for Open Vocabulary Video Anomaly Detection

10 January 2025·10371 words·49 mins

Chenting Xu , Ke Xu , Xinghao Jiang , Tanfeng Sun

Ucf-Crime Shanghaitech Xd-Violence Ubnormal Weakly Supervised Instruction Tuning Unsupervised Hybrid Method

A novel framework (PLOVAD) that applies prompt tuning to large-scale pretrained image-based vision-language models for open vocabulary video anomaly detection. It combines domain-specific and anomaly-specific prompts with a temporal module to detect and categorize both seen and unseen anomalies with limited parameters.

Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models

1 January 2025·6657 words·32 mins

Chao Huang , Yushu Shi , Jie Wen , Wei Wang , Yong Xu , Xiaochun Cao

Ucf-Crime Xd-Violence Hybrid Method

The paper introduces Ex-VAD, a comprehensive framework for fine-grained and explainable video anomaly detection that leverages visual-language models (VLMs) and large language models (LLMs). It features modules for generating anomaly explanations, fusing multimodal features for coarse detection, and expanding/aligning labels for fine-grained classification, with improved interpretability and accuracy demonstrated on UCF-Crime and XD-Violence datasets.

2024

Text-Driven Traffic Anomaly Detection With Temporal High-Frequency Modeling in Driving Videos

17 April 2024·10204 words·48 mins

Rongqin Liang , Yuanman Li , Jiantao Zhou , Xia Li

Cuhk-Avenue Shanghaitech Xd-Violence Ubnormal Ucf-Crime Ucsd-Ped Other Hybrid Other

The paper introduces TTHF, a novel single-stage method that aligns video clips with text prompts for traffic anomaly detection. It models high-frequency components in the temporal domain to capture dynamic changes in driving scenes and proposes an attentive anomaly focusing mechanism to improve detection of various traffic anomalies, achieving superior performance on benchmark datasets.

VLAVAD: Vision-Language Models Assisted Unsupervised Video Anomaly Detection

1 January 2024·6374 words·30 mins

Changkang Li , Yalong Jiang

Shanghaitech Unsupervised Instruction Tuning Hybrid Method

Proposes VLAVAD, an unsupervised video anomaly detection method built on vision-language pre-trained models. It uses semantic features, a Selective Prompt Adapter, and a Sequence State Space Module to improve interpretability and transferability, achieving state-of-the-art performance on the ShanghaiTech dataset.

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

1 January 2024·6817 words·33 mins

Peng Wu , Xuerong Zhou , Guansong Pang , Lingru Zhou , Qingsen Yan , Peng Wang , Yanning Zhang

Xd-Violence Ucf-Crime Hybrid Method

A novel paradigm for weakly supervised video anomaly detection that leverages a frozen CLIP model with a dual-branch architecture, temporal modeling modules, and prompt mechanisms to apply vision-language knowledge to both coarse- and fine-grained detection tasks, achieving state-of-the-art performance on benchmarks.

2023

AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM

1 December 2023·8129 words·39 mins

Sunghyun Ahn , Youngwan Jo , Kijung Lee , Sein Kwon , Inpyo Hong , Sanghyun Park

Ubnormal Hybrid Method

Proposes the AnyAnomaly model utilizing large vision language models (LVLMs) for zero-shot, customizable video anomaly detection that detects user-defined anomalies without additional training, incorporating segment-level processing and context-aware visual question answering (VQA). The approach enhances generalization across diverse environments and achieves state-of-the-art results on benchmark datasets, demonstrating practical potential for real-world applications.

Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead

31 October 2023·12272 words·58 mins

Yunkang Cao , Xiaohao Xu , Chen Sun , Xiaonan Huang , Weiming Shen

Hybrid Survey

This study explores the use of GPT-4V, a large visual-linguistic model, for generic anomaly detection across multiple modalities and domains, demonstrating its ability to understand global and fine-grained semantics, reason automatically, and improve with prompts. It evaluates GPT-4V on diverse tasks including industrial, medical, logical, video, 3D, and time series anomaly detection, discussing its promising performance and future directions for enhancement, such as quantitative metrics, expanded benchmarks, multi-round interactions, human feedback, and real-time application.

SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models

5 October 2023·10616 words·50 mins

Xinyi Zhang , John Doe , Jane Smith

Other Hybrid Benchmark

The paper introduces SmartHome-Bench, the first large-scale dataset and benchmark designed specifically for video anomaly detection (VAD) within smart home environments, incorporating 1,203 annotated videos across seven categories such as Wildlife, Senior Care, and Baby Monitoring. The dataset includes detailed annotations with anomaly tags, descriptions, and rationales, facilitating research on multi-modal large language models (MLLMs) for explainable VAD. It evaluates various adaptation methods, including prompting strategies and a novel taxonomy-driven reflective LLM chain (TRLC), demonstrating significant performance improvements and highlighting current model limitations. The study aims to advance smart home security by providing a dedicated benchmark and novel framework for enhancing MLLM-based anomaly detection and reasoning.

VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation

1 October 2023·7867 words·37 mins

Hao Wang , Ashish Bastola , Jiayou Qin , Xiwen Chen , John Suchanek , Zihao Gong , Abolfazl Razi

Other Hybrid Application

A framework combining lightweight object detection and large language models for real-time visual navigation safety and anomaly detection, with dynamic scenario switching and prompt engineering.

Video Anomaly Detection in 10 Years: A Survey and Outlook

1 October 2023·18854 words·89 mins

Moshira Abdalla , Sajid Javed , Muaz Al Radi , Anwaar Ulhaq , Naoufel Werghi

Shanghaitech Xd-Violence Ucf-Crime Ucsd-Ped Other Hybrid Survey

A comprehensive survey exploring deep learning-based video anomaly detection, including emerging paradigms such as weakly supervised, self-supervised, and unsupervised approaches, with a focus on core challenges, feature extraction, supervision schemes, loss functions, regularization techniques, and the potential of vision-language models (VLMs) for enhanced anomaly detection.

VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

1 October 2023·8183 words·39 mins

Muchao Ye , Weiyang Liu , Pan He

Ucf-Crime Xd-Violence Hybrid Method

Introduces VERA, a framework that enables frozen vision-language models to perform explainable video anomaly detection by learning detailed anomaly-characterization questions from coarsely labeled data, without model parameter modifications. The method decomposes complex reasoning into reflections on guiding questions, optimizes them via verbal interactions, and guides VLMs to generate segment- and frame-level anomaly scores with improved explainability and performance.

VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning

1 October 2023·9520 words·45 mins

Liyun Zhu , Qixiang Chen , Xi Shen , Xiaodong Cun

Ucf-Crime Shanghaitech Other Hybrid Method

Introduces VAU-R1, a reinforcement fine-tuning framework leveraging Group Relative Policy Optimization (GRPO) to enhance multimodal large language models’ (MLLMs) reasoning capabilities in video anomaly understanding (VAU). Develops VAUBench, a comprehensive Chain-of-Thought benchmark with rich annotations across perception, grounding, reasoning, and classification tasks, supported by multiple evaluation metrics including VAU-Eval, QA accuracy, temporal IoU, and Factual Consistency. Demonstrates significant improvements over supervised fine-tuning in question answering accuracy, temporal localization, and interpretability, thereby establishing a scalable, interpretable, and reasoning-aware VAU framework.

Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought

1 October 2023·11169 words·53 mins

Chao Huang , Benfeng Wang , Jie Wen , Chengliang Liu , Wei Wang , Li Shen , Xiaochun Cao

Shanghaitech Xd-Violence Ubnormal Ucf-Crime Ucsd-Ped Other Hybrid Method

Proposes a structured Perception-to-Cognition Chain-of-Thought and introduces the Vad-Reasoning dataset, along with an improved reinforcement learning algorithm, AVA-GRPO, to enhance the deep reasoning capabilities of Multimodal Large Language Models in video anomaly detection and understanding tasks.

Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models

1 October 2023·11640 words·55 mins

Jiacong Xu , Shao-Yuan Lo , Bardia Safaei , Vishal M. Patel , Isht Dwivedi

Hybrid Other

Introduces a specialist visual assistant, Anomaly-OV, leveraging an anomaly expert and visual token selection mechanism to improve zero-shot anomaly detection and reasoning, establishing new datasets and benchmarks in the domain.

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

1 October 2023·8709 words·41 mins

Zhiwei Yang , Jing Liu , Peng Wu

Ucf-Crime Xd-Violence Hybrid Method

Proposes a novel pseudo-label generation and self-training framework incorporating CLIP for text-image alignment, learnable text prompts, normality visual prompts, a pseudo-label generation module guided by normality clues, and a self-adaptive temporal dependence learning module, achieving state-of-the-art performance on benchmark datasets.
