[{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/benchmarks/","section":"Benchmarks","summary":"","title":"Benchmarks","type":"benchmarks"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/dicong-wang/","section":"Authors","summary":"","title":"Dicong Wang","type":"authors"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/categories/hybrid/","section":"Categories","summary":"","title":"Hybrid","type":"categories"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/kaijun-wu/","section":"Authors","summary":"","title":"Kaijun Wu","type":"authors"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/type/method/","section":"Type","summary":"","title":"Method","type":"type"},{"content":" Multimodal VAD: Visual Anomaly Detection in Intelligent Monitoring System via Audio-Vision-Language # Dicong Wang , Graduate Student Member, IEEE, Qilong Wang , Senior Member, IEEE , Qinghua Hu , Senior Member, IEEE, and Kaijun Wu , Member, IEEE\nAbstract—The deep learning-based anomaly detection methods using visual sensors generally rely on a single modality or variants as raw signal inputs, which severely limits expressiveness and adaptability. The evolution of multimodal and visual-language pretrained models is shaping new possibilities in video anomaly detection (VAD). So, how to efficiently leverage them to achieve reliable multimodal VAD presents a significant challenge worth investigating. 
In this work, we propose a novel dual-stream multimodal VAD network, which integrates coarse-grained and fine-grained streams combining video, audio, and text modalities. First, in the coarse-grained stream, we perform cross-modal fusion of audio features with temporally modeled visual features, using contrastive optimization to achieve more accurate coarse-grained results. Second, in the fine-grained stream, we construct abnormal-aware context prompts (ACPs) by integrating visual information and prior knowledge related to anomalous events into the text modality. Through the \u0026ldquo;coarse-support-fine\u0026rdquo; strategy, we further enhance the model\u0026rsquo;s ability to discriminate fine-grained anomalies. Our method achieves state-of-the-art performance in experiments on two large-scale anomaly datasets, demonstrating its effectiveness. It supports the development of highly robust intelligent monitoring systems and promotes the potential applications of multimodal VAD in industrial monitoring, public safety, smart cities, and related areas.\nIndex Terms—Abnormal-aware context prompts (ACPs), multimodal, video anomaly detection (VAD), weak supervision.\nReceived 18 November 2024; revised 25 March 2025; accepted 28 May 2025. Date of publication 16 June 2025; date of current version 20 June 2025. This work was supported in part by the National Natural Science Foundation of China under Grant 62276186 and Grant U23B2049; in part by the Natural Science Foundation Key Project of Gansu Province under Grant 23JRRA860; in part by the Inner Mongolia Key Research and Development and Achievement Transformation Project under Grant 2023YFDZ0043, Grant 2023YFDZ0054, and Grant 2023YFSH0043; in part by the Key Research and Development Project of Lanzhou Jiaotong University under Grant ZDYF2304; and in part by the Key Talent Project of Gansu Province. The Associate Editor coordinating the review process was Dr. Jianbo Yu. (Corresponding authors: Qilong Wang; Kaijun Wu.)\nDicong Wang is with the College of Intelligence and Computing, Tianjin University, Tianjin 300350, China, and also with the School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China (e-mail: wangdc 2021@tju.edu.cn).\nQilong Wang and Qinghua Hu are with the College of Intelligence and Computing, Tianjin University, Tianjin 300350, China (e-mail: qlwang@tju.edu.cn; huqinghua@tju.edu.cn).\nKaijun Wu is with the School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China (e-mail: wkj@mail.lzjtu.cn).\nDigital Object Identifier 10.1109/TIM.2025.3578702\nI. INTRODUCTION # The arrival of Industry 4.0 marks the transformation of manufacturing, driving the development of \u0026ldquo;intelligent detection/monitoring systems\u0026rdquo; by integrating digital technologies, such as machine learning (ML), the Internet of Things (IoT), and cyber-physical systems, into visual anomaly detection. Following this, Industry 5.0 further advanced the application of digital technologies and intelligent devices, enabling comprehensive monitoring and refined management of production processes. This effectively enhanced production efficiency and reduced costs for enterprises, further propelling the development of intelligent detection and monitoring systems. 
Surveillance cameras are increasingly used in industrial fault detection [1], [2], intelligent security monitoring [3], [4], [5], and production line defect detection [6], [7], [8]. Nevertheless, present automated anomaly detection technologies lag behind, resulting in suboptimal utilization of surveillance equipment and limiting its potential. Drawing inspiration from the natural way humans watch videos, video anomaly detection (VAD) relies not only on visual information but also on audio and other signals (such as captions or text), which are indispensable for creating a complete perceptual experience. In smart cities, the precision and comprehensiveness required of video surveillance systems make multimodal learning especially necessary. In intelligent security systems, the ability to identify and respond to various anomalous behaviors and emergencies is critical, especially in areas such as public safety, traffic management, and emergency response. Introducing multimodal information can significantly reduce reliance on unimodal data, overcoming issues such as information loss and incomplete representation that may arise from using a single modality, thereby enabling the model to analyze anomalies from multiple perspectives [9]. Therefore, constructing a multimodal VAD network is not only essential for enhancing the intelligence of security systems but also plays a vital role in advancing the development of smart cities. It assists urban managers in real-time, comprehensive monitoring and management of urban conditions, improving city safety and quality of life.\nAccording to the industry definition of anomalous events, VAD methods can be roughly categorized into unsupervised VAD and weakly supervised VAD. In the unsupervised paradigm, VAD models are trained solely on normal data, treating any data that deviates from the normal distribution as anomalous. 
Numerous excellent studies have emerged under the unsupervised paradigm [10], [11], [12], [13], [14], [15], [16], [17], including those based on reconstruction [10], [15], prediction [12], [13], and so on. However, the main limitation of these methods is that they rely on a single modality and lack anomaly diversity. Although various modalities exist, such as optical flow, depth maps, and semantic maps, these are still variants or derivatives of RGB images and, broadly speaking, can be considered the same modality [as shown in Fig. 1(a)].\n1557-9662 © 2025 IEEE. All rights reserved, including rights for text and data mining, and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission.\nSee https://www.ieee.org/publications/rights/index.html for more information.\nFig. 1. Distribution of modalities in existing methods. (a) Focused on using unimodal images or variants for VAD. (b) Some multimodal methods combine images with audio or text. (c) Ours integrates image, audio, and text modalities, capturing subtle features of anomalies and improving model adaptability and reliability.\nWeakly supervised VAD (WSVAD) has garnered increasing attention in recent years due to its broad application prospects. Compared to unsupervised methods, weakly supervised methods have several notable advantages: 1) due to the simultaneous presence of both anomalous and normal data, the model can learn richer discriminative representations; 2) WSVAD leverages coarse-grained video-level labels to achieve a good tradeoff between performance and efficiency at a lower cost; and 3) the production of large-scale anomaly datasets becomes feasible. Currently, most weakly supervised methods formalize the task as multiple instance learning (MIL) [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29]. 
As a mainstream research paradigm in WSVAD, MIL has led to the emergence of various innovative ideas and methods in this field. Sevetlidis et al. [19] proposed a method that integrates anomaly information through weighted training, aimed at enhancing detection performance under weak supervision, and demonstrated noteworthy results particularly in scenarios with sparse labeled data. Lv et al. [23] proposed a bias-free anomaly detector through invariant learning of confident and ambiguous segments with different contextual biases, effectively reducing bias caused by contextual changes during detection. Given the scarcity of anomalous data and high labeling costs, continued research has shown that incorporating multimodal information effectively reduces dependence on single-modal data, mitigates potential information loss, and makes VAD more comprehensive [2], [7]. Meanwhile, with the notable achievements of vision-language pretraining (VLP) models across various tasks, an increasing number of researchers are exploring how to transfer the semantic knowledge learned by these models to VAD. The strong ability of VLP models to capture the complex multimodal relationships between vision and language offers new possibilities for multimodal VAD [24], [25], [27]. However, most current methods primarily focus on combining video with either audio or text [as shown in Fig. 1(b)], failing to fully leverage the potential advantages of multiple modalities. In our work, we integrate video, audio, and textual information [as shown in Fig. 1(c)], leveraging the synergistic and complementary effects of multimodal data to capture and describe the characteristics of anomalous events more comprehensively.\nBased on the above analysis, we propose a multimodal two-stream VAD network capable of processing visual, audio, and textual modalities. 
Specifically, we first perform temporal modeling on visual information to capture dynamic changes and spatial features in the video. Subsequently, we conduct cross-modal fusion of the visual features and the encoded audio signals to enhance the perception and understanding of anomalous events. The fused features are then used for coarse-grained frame-level classification to preliminarily distinguish between anomalous and normal events. At the same time, we introduce a contrastive constraint to strengthen the model\u0026rsquo;s discriminative capacity and resilience, which further benefits subsequent fine-grained anomaly detection. For the fine-grained stream, we design abnormal-aware context prompts (ACPs) and feed them into a text encoder to generate the corresponding text embeddings. By calculating the similarity matrix between the text embeddings and the visual embeddings, we further improve the model\u0026rsquo;s performance in fine-grained anomaly detection.\nIn summary, the main contributions of this article are as follows.\n1) Effective Fusion of Multimodal Data: By effectively integrating heterogeneous source information (visual, audio, and textual), we achieve mutual supplementation of information, capturing richer, more comprehensive, and diverse feature representations.\n2) Dual-Stream Network With Coarse and Fine Granularity: The coarse-grained stream, built on audiovisual fusion and optimized contrastively, yields a global perspective. The fine-grained stream, built on ACPs and the \u0026ldquo;coarse-support-fine\u0026rdquo; strategy, captures anomalous detail.\n3) Abnormal-Aware Context Prompt (ACP): Incorporating anomalous visual information and prior knowledge into the textual modality improves the recognition and analysis of fine-grained anomalous behaviors.\n4) Extensive Experimental Validation and SOTA Performance: We systematically evaluated the method\u0026rsquo;s effectiveness and adaptability on two large benchmarks, achieving optimal performance across multiple metrics.\n
The organization of this article is as follows. In Section II, we review the research progress related to VAD. In Section III, we elaborate on the proposed multimodal VAD method. In Section IV, we conduct a series of experiments to validate and analyze the proposed method. Finally, Section V concludes our work and provides an outlook on future research.\nII. RELATED WORK # A. Video Anomaly Detection # Early VAD was predominantly based on unsupervised methods [11], [29], [30]; however, these approaches are limited in capturing subtle behavioral changes and in thoroughly understanding anomalous events, thus underscoring the urgent need to incorporate anomaly-related knowledge [26], [31]. The continuous advancements in deep learning have driven the evolution of VAD, gradually shifting research toward weakly supervised anomaly detection. Current weakly supervised methods can generally be categorized into two main types: one adopts a two-stage self-training strategy, while the other is a single-stage method based on MIL. Since its introduction, the MIL method has been continuously developed and optimized, attracting widespread attention. The related research focuses on leveraging spatiotemporal contextual information and motion features, emphasizing the overall temporal continuity of anomalous events and accurately capturing the subtle boundary changes between anomalies and normal instances [18], [29], [32]. With the continuous advancement of VAD, multimodal approaches have gradually emerged. These methods not only build on traditional visual information but also integrate other modalities, such as audio and text. The complementarity of these modalities compensates for the limitations of any single modality, making it possible to capture anomalous behaviors and interpret scenes better through cross-modal fusion [24], [26], [33], [34]. 
Meanwhile, anomaly detection methods based on large language models (LLMs) and vision-language models (VLMs) have also begun to attract the attention of researchers [35], [36]. These cutting-edge technologies and innovative ideas open novel opportunities and challenges in visual anomaly detection.\nB. LLMs and VLMs on VAD # In recent years, vision-language pretraining models have not only demonstrated remarkable progress in tasks such as image captioning [37], visual question answering [38], and image-text matching [39], [40], but have also played a crucial role in industrial monitoring and measurement. By integrating visual sensor data with textual descriptions, these models enhance the accuracy of anomaly detection [19], fault diagnosis [1], and intelligent decision-making systems, providing critical technological support for intelligent industrial production and quality control. Kim et al. [41] used ChatGPT to generate textual descriptions of normal and abnormal elements, thus providing the textual data required for training CLIP. However, this approach has a human-in-the-loop issue, necessitating a certain degree of manual intervention to optimize output according to specific application contexts. Zanella et al. [35] exploited an existing VLM to generate textual descriptions for each test frame and designed specific prompt mechanisms to unlock the capabilities of LLMs in temporal aggregation and anomaly score estimation, making them effective detectors. Du et al. [9] employed visual language large models (VLLMs) to uncover and capture the key clues of anomalous behavior and establish a logical chain of causality, thereby accurately identifying and inferring the occurrence process of anomalous events. 
In this article, we further explore how to effectively transfer the semantic understanding and cross-modal matching advantages that the large-scale VLM CLIP demonstrates in processing image and text information to the VAD task, providing new insights and directions for subsequent research.\nC. Multimodality on VAD # With the rapid development in areas such as image/video understanding [42], pattern recognition [30], [31], text-to-image generation [43], and speech recognition [44], we are closer than ever to achieving the integration and unification of multimodal learning [45]. Compared to unimodal methods, multimodal VAD integrates heterogeneous source data, such as visual, audio, and textual information, that exhibit significant differences in representation and distribution. This enhances capabilities in information fusion and representation while offering unique research mechanisms and challenges to the VAD community due to the latent connections and interactions among modalities [20], [25], [27], [34], [46], [47]. Yu et al. [46] proposed a modality-aware contrastive multi-instance learning network with self-distillation, offering improved parameter efficiency, scalability, and effectiveness while resolving inconsistencies between audio and video events. Wu et al. [20], [21] tackled the issues of limited scene diversity and modality insufficiency in prior datasets by constructing a large-scale audiovisual multimodal dataset called XD-Violence. This work enriches the existing VAD data resources and greatly enhances the reliability and expansibility of VAD through the fusion of multimodal information. Zhang et al. [47] effectively corrected and refined the detection boundaries in the anomaly space by introducing multiple constraints and optimized the definition of the anomaly boundaries to better align with the requirements of practical applications. 
In contrast, our method integrates different types of data, including visual, audio, and textual information, to construct a richer and more multidimensional feature representation. This enhances the understanding and analysis of complex video content and aids in more comprehensive identification and interpretation of anomalous events.\nIII. METHODOLOGY # A. Preliminaries # First, given an untrimmed audiovisual sequence X = (x^v, x^a), where x^v represents video information and x^a represents audio information, we segment the entire sequence into N segments x = {x_i^v, x_i^a}_{i=1}^N and assign corresponding coarse-grained video-level labels y^c ∈ {0, 1}, where y^c = 1 indicates the presence of anomalous events in x. We use off-the-shelf pretrained networks as feature extractors to separately extract visual and audio information, obtaining visual features F^v and audio features F^a. Specifically, F^v = {f_i^v}_{i=1}^N ∈ R^{L×d_v} and F^a = {f_i^a}_{i=1}^N ∈ R^{L×d_a}, where f_i^v and f_i^a denote the video and audio features of the i-th segment, d_v and d_a represent the dimensions of the video and audio features, respectively, and L denotes the length of the video sequence X. In the MIL framework, we treat a video sequence as a bag and the audiovisual segments {x_i^v, x_i^a}_{i=1}^N as instances.\nB. Two-Stream Network # In this article, we propose a dual-stream network for multimodal VAD, enhancing precision and generalization through the synergy of coarse-grained and fine-grained streams (refer to Fig. 2). The coarse-grained stream fuses visual and audio signals to generate multimodal representations, facilitating effective detection of anomalies. A contrastive learning strategy further improves sensitivity by focusing on anomalous samples through positive-negative sample comparisons. In the fine-grained stream, we construct ACPs using learnable parameters, category information, and weighted coarse-grained fused features, enabling reliable identification of anomalies. This collaboration ensures comprehensive detection at both levels, showcasing robust generalization and sensitivity.\nFig. 2. Overview. We have constructed a dual-stream network capable of simultaneously processing image, audio, and text modalities, utilizing a \u0026ldquo;coarse-support-fine\u0026rdquo; strategy to establish their interconnections. Through cross-modal audio-video fusion and contrastive constraints, we perform coarse-grained anomaly detection; concurrently, we introduce ACPs that are sensitive to anomalies to enhance the discriminative capability for fine-grained anomalous events.\nC. Coarse-Grained Stream # Temporal Modeling and Cross-Modal Fusion: During audio signal processing, we perform feature encoding to obtain audio features F\nFig. 3. Diagram of contrast constraint.\nfeatures\nIn the coarse-grained stream, the audio-visual features are processed through a binary classifier to obtain the coarse-grained anomaly confidence P^c\nwhere σ denotes the sigmoid function.\nContrastive Constraint: To further explore the feature differences and interrelations between normal and abnormal videos, we propose an optimization method based on contrastive constraints, as shown in Fig. 3. Specifically, to accurately capture the feature differences between normal and abnormal videos, we introduce a video sequence partitioning strategy based on the anomaly confidence p^c. By conducting a comparative analysis of feature segments in different score intervals, we reveal subtle abnormal patterns. First, we differentiate between normal and abnormal videos based on the anomaly confidence p^c of each video sequence. 
When p^c exceeds a threshold ε, the video sequence is classified as abnormal, denoted as X_a; when p^c is below the threshold, it is classified as normal, denoted as X_n. For abnormal sequences X_a (i.e., p^c \u0026gt; ε), we select the top-k feature segments with the highest scores, forming an abnormal mini-bag B_min^{top-a} ⊂ {D_a(m)}_{m=1}^k.\nThese feature segments generally represent more significant abnormal patterns, reflecting the model\u0026rsquo;s high-confidence predictions of anomalous events. Simultaneously, we also select the k lowest-scoring feature segments, forming another abnormal mini-bag B_min^{bot-a} ⊂ {D_a(m)}_{m=1}^k. These segments typically manifest as uncertain abnormal samples, which help the model capture potential fine-grained anomalies.\nSimilarly, for normal sequences X_n (i.e., p^c \u0026lt; ε), we adopt the same partition strategy. First, we select the top-k feature segments with the highest scores to form the normal mini-bag B_min^{top-n} ⊂ {D_n(m)}_{m=1}^k, which typically represent the characteristic features of normal videos. Then, we select the bottom-k feature segments with the lowest scores to form the normal mini-bag B_min^{bot-n} ⊂ {D_n(m)}_{m=1}^k to help the model recognize common or marginal features in normal videos.\nBased on these partitions, we use the InfoNCE loss function as the optimization objective, constraining the model to focus more on the detailed contrast between anomalous and normal video sequences during training. This enables the model not only to learn the latent structure within the data but also to enhance its perception and understanding of subtle yet important feature differences in video sequences, thereby improving its discriminative capability.\nD. 
Fine-Grained Stream # Abnormal-Aware Context Prompt (ACP): ACP aims to enhance visual representation by introducing anomaly-related semantic information, enabling more effective modeling of various anomaly patterns, as shown in Fig. 4. Specifically, discrete textual labels (such as robbery, fighting, and shooting) are regarded as category identifiers. These labels serve as descriptive information for the target events and need to be encoded. We use the Tokenizer to process the textual labels, generating a high-dimensional embedding representation t_embed = Tokenizer(y_t), where y_t represents the specific textual label, which contains key information related to the event. Next, the generated textual embedding t_embed is combined with M learnable parameters v_i. After concatenating the textual embedding with these learnable parameters, we obtain a new sentence embedding representation, as shown in the following:\nFig. 4. Diagram of ACP construction.\nWe believe that anomaly-conditional context embeddings can better adapt to the task requirements of multimodal VAD. To achieve this, we introduce a lightweight, dynamically aligned visual prompt network (composed of a simple FFN and a skip connection) to incorporate anomaly knowledge into each learnable prompt, endowing it with anomaly awareness. To be exact, we introduce v_ϑ(∗) = ξ, an alignment visual prompt network parameterized by ϑ, which fuses the anomaly-weighted visual information with textual information. The output is then processed by v_ϑ(∗), and by adding this result to the existing context prompt, we obtain a context prompt with anomaly awareness\nImage-Text Registration: To further explore the potential of fine-grained anomaly detection, the anomaly scores generated by the coarse-grained stream are used as weighting factors, which are then combined with the fused multimodal audio-visual features to perform fine-grained detection. 
The focus is on regions or frames in the video where anomalies may exist, thereby enhancing the model\u0026rsquo;s sensitivity to anomaly details and its ability to discriminate them. Finally, we compute the similarity between the fused visual features F and the category embeddings T, resulting in a registration map M\nE. Objective Function # Coarse-Grained Stream: In the coarse-grained stream, in addition to using the contrastive loss to enhance the discriminative ability between different feature representations, we further introduce a video-level classification loss. We calculate the binary cross-entropy loss L_bce^c between the video prediction scores and their corresponding ground truth (GT) labels to optimize the model\nFine-Grained Stream: In the fine-grained stream, we adopt an attention-based top-k mechanism to extend the vanilla MIL into a generalized MIL suitable for multiclass tasks. In particular, in the image-text alignment mapping matrix M, which represents the similarity between visual features and all class embeddings, we select the top-k similarity values for each category and average the similarities of the selected frames to quantify the alignment degree between the video and the current category. Through this process, we obtain a vector u = {u_1, ..., u_k}, which represents the similarity between the video and all categories. We expect the video and its paired text label to have the highest similarity score. The multiclass prediction is then carried out as follows:\nwhere p_i^f represents the predicted confidence for the i-th class and τ is the temperature scaling hyperparameter. It is important to note that due to the dual-stream nature of the network architecture, each network branch with a different granularity generates its own anomaly score. We take the larger of the two as the anomaly score for the entire network. Finally, we employ the fine-grained aligned binary cross-entropy loss L_bce^f. 
Additionally, we introduce an embedding constraint to ensure semantic consistency, thereby obtaining semantically rich anomaly-aware prompts. We quantify the differences in the feature space by calculating the cosine similarity between the normal embeddings and the embeddings of each anomaly class. Based on this similarity, we further define an embedding constraint loss L_embed to regulate the organization of the embedding space, ensuring that normal embeddings are closely clustered, while anomaly embeddings maintain an appropriate distance from the normal embeddings\nwhere t_embed^n represents the normal embeddings and t_embed^a represents the abnormal embeddings.\nThe final multimodal VAD loss function is the sum of the above loss functions\nIV. EXPERIMENTS # A. Datasets and Evaluation Metrics # Datasets: Our experiments were conducted on two large-scale datasets widely recognized in the VAD community: UCF-Crime [28] and XD-Violence [20]. UCF-Crime consists of 1900 untrimmed real-world surveillance videos with a total duration of 128 h, where the number of abnormal and normal videos is roughly equal. Among these, 810 abnormal and 800 normal videos are used for training. These videos cover 13 types of real-world anomalies. XD-Violence mainly consists of YouTube videos from movies and outdoor scenes. It contains a total of 4754 untrimmed videos with audio and weak labels, amounting to 217 h, with 2405 abnormal and 2349 normal videos, of which 1905 abnormal and 2049 normal videos are used for training. These videos are captured in diverse settings, such as handheld cameras, CCTV, movies, and sports, covering six types of anomalous behaviors.\nEvaluation Metrics: Following previous works [21], [24], [27], [46], we use frame-level average precision (AP) for XD-Violence and frame-level AUC, as well as AUC under only anomalous videos (AUC_{no-n}), for UCF-Crime. 
In addition, to ensure a comprehensive and standardized evaluation, we adhere to the standard evaluation protocols in the field of video action recognition [50], using mean AP (mAP) values under different intersection over union (IoU) thresholds, along with the average of overall mAP. These evaluation metrics not only make our results comparable and consistent with prior work but also provide a more comprehensive reflection of the model\u0026rsquo;s performance across various complex scenarios.\nB. Implementation Details # Following previous works [20], [24], [60], we employed the pretrained CLIP (ViT-B/16) model to extract video and text features. For audio features, we adopted the same feature extraction method as in prior works, using the VGGish audio extractor, which was pretrained on the YouTube dataset. Additionally, both the visual and audio features were aligned at a coarse-grained level. For hyperparameter settings, the batch sizes were set to 64 and 128, and the total epochs to 40 and 60, for UCF-Crime and XD-Violence, respectively. In (13), λ_1 was set to 0.1 and 0.15, and λ_2 to 1e-1 and 2e-4, for the two datasets, respectively, and Adam was used for optimization. Our network was implemented using PyTorch [51] and trained end-to-end on an NVIDIA GeForce RTX 3090 GPU.\nC. Comparisons With State-of-the-arts # Result on UCF-Crime: Our evaluation results on UCF-Crime are presented in Tables I and II. Overall, our method markedly outperforms unsupervised methods as well as other weakly supervised methods, regardless of whether these methods are based on I3D, VideoSwinTransformer, or other extractors. Specifically, in the coarse-grained aspect, as shown in Table I, our method exhibits a significant advantage, surpassing multiple existing methods, including UR-DMU [52]. 
Although UR-DMU introduces two additional memory modules, which separately store abnormal and normal patterns to enrich feature representation, our method still achieves superior performance. Moreover, compared to VadCLIP [24], our method attains comparable results, with a gap of only 0.06%. This minimal difference further underscores the effectiveness and robustness of our approach in the challenging task of VAD. In the fine-grained aspect, as depicted in Table II, our method outperforms the state-of-the-art works we compared against. It is worth noting that although our coarse-grained performance does not match that of VadCLIP, our fine-grained performance is superior. This indicates that our method achieves a more precise alignment with the anomalous time intervals than VadCLIP, which is mainly attributed to the incorporation of more anomaly-related information and knowledge into the model, allowing it to more accurately capture and identify subtle changes in abnormal events.\nTABLE I COMPARISON WITH OTHER SOTA METHODS ON UCF-CRIME. BOLD INDICATES THE PERFORMANCE ACHIEVED BY OUR METHOD\nTABLE II FINE-GRAINED COMPARISON WITH OTHER SOTA METHODS ON UCF-CRIME. BOLD INDICATES THE PERFORMANCE ACHIEVED BY OUR METHOD\nResult on XD-Violence: Table III presents a comparison of the AP values between our method and the current state-of-the-art methods. It is evident from Table III that our method outperforms both unsupervised and weakly supervised methods and achieves the best performance. Specifically, our method outperforms TPWNG [27], VadCLIP [24], and PEL4VAD [25] by 2.64%, 1.81%, and 0.73%, respectively. The key to VadCLIP\u0026rsquo;s performance enhancement lies in its use of finer-grained class labels and the full utilization of the semantic representation capabilities of textual information. However, our method goes further by introducing an efficient anomaly-aware contextual prompt to deeply model anomalous scenes. 
Moreover, we integrate audio information, effectively assisting the model\u0026rsquo;s ability to capture anomalies. Although PEL4VAD also utilizes textual information and achieves results close to ours, its primary strength lies in the fine processing of visual features, which still shows some gaps compared to our method.\nTABLE III COMPARISON WITH OTHER SOTA METHODS ON XD-VIOLENCE. BOLD INDICATES THE PERFORMANCE ACHIEVED BY OUR METHOD\nTABLE IV COMPARISON OF SOTA METHODS UNDER DIFFERENT MODALITIES ON XD-VIOLENCE. BOLD INDICATES THE PERFORMANCE ACHIEVED BY OUR METHOD\nIn Table IV, we present a detailed comparative analysis of the state-of-the-art methods under different modality inputs. The results show that methods relying solely on the visual modality exhibit relatively average performance. In contrast, bimodal methods that incorporate audio signals (such as environmental sounds and event-related audio cues) or higher level semantic text information substantially improve performance. This indicates that by integrating additional modality information, the model is able to more comprehensively understand and analyze video content, thus enhancing the accuracy of anomaly detection. Our method integrates information from three different modalities (visual, audio, and text) by constructing more expressive multimodal representations that fully exploit the intrinsic semantics of the data, enabling the model to outperform both visual-only methods and bimodal methods across various metrics.\nTABLE V FINE-GRAINED COMPARISON WITH OTHER SOTA METHODS ON XD-VIOLENCE. BOLD INDICATES THE PERFORMANCE ACHIEVED BY OUR METHOD\nTABLE VI IMPACT OF DIFFERENT INPUT MODALITIES ON THE MODEL\nIn Table V, we provide a detailed comparison of the fine-grained performance on the XD-Violence dataset. 
Compared to coarse-grained analysis in Table III, fine-grained analysis can more accurately reflect the anomaly categories as well as the continuity and completeness of anomalous events, thus posing greater challenges and offering stronger representativeness. Specifically, compared to the state-of-the-art methods AVVD [20] and VadCLIP [24], our method improves the average performance by 7.23% and 4.46%, respectively. This substantial improvement confirms both the efficacy and strength of our method while emphasizing that fine-grained analysis captures subtle details and the full scope of anomalous events. In complex VAD tasks, such analysis offers more precise feedback and broader opportunities for optimizing anomaly detection models.\nD. Ablation Study # Analysis of the Contribution of Different Modalities to the Model: We conduct an in-depth exploration of the impact of different input modalities and their combinations on the model, as detailed in Table VI. The video, audio, and text modalities are represented as V, A, and T, respectively. The results show that multimodal combinations significantly outperform single-modal inputs, a finding that is also validated in Table IV. For instance, combining video with audio (V + A) and combining text with video (T + V) improved performance by 1.08% and 4.10%, respectively, compared to using only the video modality. The larger improvement from T + V can be attributed to the text modality, which provides more direct and structured information than the audio modality, enhancing the model\u0026rsquo;s ability to understand and reason semantically. While audio contains some directional information, it is more susceptible to environmental noise, which affects its effectiveness.\nTABLE VII ANALYSIS OF THE CONTRIBUTION OF EACH COMPONENT TO THE MODEL\nTABLE VIII COMPARISON BEFORE AND AFTER THE CONSTRUCTION OF ACPS\n
Furthermore, when all three modalities (video, audio, and text, V + A + T) are combined, the model achieves the best result. The underlying reason is that audio helps the model capture significant anomalous features in the video, while text further provides clear and accurate semantic direction, improving the model\u0026rsquo;s perception and discrimination capabilities.\nAnalysis of the Contribution of Each Component to the Model: In Table VII, we conducted ablation experiments to evaluate the contributions of each component. The results reveal that multiple core components of the model have a positive impact. First, the temporal unit effectively models the temporal information of videos. Compared to raw visual representations, the temporal unit captures the relationships between frames, allowing the model to learn dynamic changes over time, as anomalies often involve gradual or abrupt changes. Second, the ACP integrates textual information while concurrently augmenting the model\u0026rsquo;s capacity to perceive visual anomalies. After the cross-modal fusion, the model can combine anomaly-related visual information with textual prompts, thereby improving the discrimination of anomalies. Additionally, the contrastive constraint effectively strengthens the model\u0026rsquo;s ability to differentiate between normal and abnormal samples, particularly optimizing the performance on hard samples (i.e., ambiguous samples). Overall, as each component is progressively introduced, the model\u0026rsquo;s performance exhibits certain fluctuations. Although increasing the number of components leads to a moderate decline in execution efficiency, this reduction remains within a reasonable and controllable range, posing no substantial threat to the whole system\u0026rsquo;s real-time performance and stability. Ultimately, when all components work in unison, the model achieves its best performance. Fig. 5. 
Analysis of the number of learnable parameters.\nFig. 6. Visualization of the differences before and after ACP construction.\nAnalysis of Learnable Parameter Count and ACPs: In the abnormal-aware module, we performed an in-depth analysis of the quantity of learnable parameters and systematically compared the changes before and after introducing anomaly knowledge. The results are illustrated in Figs. 5 and 6 and Table VIII. In Fig. 5, considering the symmetry of learnable parameters, we chose to represent half the quantity to simplify the analysis of total parameters. We found that the sensitivity of performance to the number of learnable parameters varies across different datasets. This highlights the necessity of appropriately tuning the quantity of learnable parameters to improve network performance on particular datasets. Table VIII illustrates the variations before and after the introduction of anomaly knowledge. By integrating anomaly knowledge, all performance metrics of the model are markedly improved. In the visualization difference map of Fig. 6, it is also clear that the boundaries of anomalous events are delineated more precisely, and the distinction between anomalous and normal states is enhanced. This phenomenon can be attributed to the introduction of anomaly knowledge, which provides the model with richer representations and effectively enhances its sensitivity and recognition ability for anomalous events.\nEffectiveness of Different Losses and the Selection of Loss Weights During Optimization: We conducted a systematic evaluation of the effectiveness of different loss functions and their weight selection in the overall optimization process, with the relevant results summarized in Tables IX and X. 
The results in Table IX indicate that a foundation of coarse-grained optimization supplemented by fine-grained tuning can consistently enhance performance, and the introduction of contrastive loss further reinforces this improvement, thereby validating the complementarity between loss components at different levels. To verify the rationality of the weight configuration for each loss term in the objective function, we adopted the parameter settings of previous methods and designed controlled variable experiments (detailed in Table X). The results show that, with the coarse-grained contrast weight fixed (i.e., λ1 = 0.15), appropriately adjusting the fine-grained weight gradually improves performance, reaching the best result under the optimal configuration; similarly, when the fine-grained contrast weight is fixed, tuning the coarse-grained contrast weight also enhances performance. These findings demonstrate that the complementary interactions among the various loss terms in the multiloss joint optimization strategy positively contribute to the overall performance.\nTABLE IX VALIDATION OF THE EFFECTIVENESS OF DIFFERENT LOSS FUNCTIONS DURING THE OPTIMIZATION PROCESS\nTABLE X EFFECT OF DIFFERENT λi ON THE OPTIMIZATION FUNCTION\nTABLE XI IMPACT OF VISUAL SEGMENTS AND LEARNING RATE ON THE MODEL\nEffective Partitioning of Visual Segments and Selection of Batch Size: Given that videos exhibit notable temporal continuity, segmenting the video for input is indispensable to capture its dynamic information. At the same time, properly configuring key hyperparameters, such as batch size, is critical for ensuring that the model achieves optimal convergence and efficient performance during training. In Table XI, we gradually increased the number of input visual segments from 64 to 512. The results indicate that as the amount of visual information per input increases, the model performance improves significantly. However, once the input information exceeds a certain threshold, the negative effects of information redundancy lead to a decline in performance. Furthermore, Table XII reveals that the optimal batch size varies considerably across different application scenarios. Therefore, batch size and related hyperparameters should be adjusted according to the specific environment to achieve optimal performance.\nFig. 7. Qualitative results analysis on XD-Violence and UCF-Crime.\nTABLE XII IMPACT OF DIFFERENT BATCH SIZES ON PERFORMANCE\nE. Qualitative Comparison # Fig. 7 presents the analysis of both abnormal and normal videos. For the anomalous videos, when abnormal events occur, the anomaly score curve experiences a substantial increase and maintains a high level for the duration of these events. This reflects the model\u0026rsquo;s ability to capture the occurrence of abnormal events in real time and effectively monitor their duration. The anomaly score curve steadily drops to lower values as the anomalous events come to an end, with minor oscillations within this range. Conversely, the anomaly score curve for normal videos remains consistently low, with occasional minor fluctuations that, while resembling abnormalities, do not reach the threshold required to trigger an anomaly warning. This further validates the model\u0026rsquo;s ability to maintain high sensitivity to abnormal events while effectively avoiding false positives for normal behaviors.\nV. CONCLUSION # In this article, we propose a multimodal dual-stream VAD network that can utilize video, audio, and textual information. In the coarse-grained stream, we first perform cross-modal fusion of temporally modeled visual and audio features and then introduce contrastive learning to enhance the discriminative ability, which enables preliminary identification of anomalies and supports subsequent fine-grained anomaly detection. 
The fine-grained stream integrates textual information by constructing ACPs, fully leveraging the semantic information carried by the text, and combining the cross-modal fusion features and corresponding anomaly scores from the coarse-grained stream. By employing a \u0026ldquo;coarse-support-fine\u0026rdquo; strategy, the model is able to effectively identify anomalies at different levels. Finally, we conducted systematic experiments, detailed visual analysis, and ablation studies on two large-scale datasets, on which the proposed method achieves optimal performance across multiple metrics. The method can efficiently identify anomalous behaviors in industrial monitoring, driving the development of intelligent surveillance systems and supporting the realization of big data analysis and intelligent decision-making in industry.\nAlthough we have explored relevant aspects of VLMs, there remains extensive research potential in utilizing LLMs or VLMs for anomalous behavior or event reasoning and analysis. Therefore, future studies will focus on how to more effectively tap into and utilize their emergent capabilities to advance deep reasoning and precise descriptive analysis of abnormalities. Meanwhile, although the emergent capabilities of general-purpose large models have already yielded preliminary success in VAD, they still face significant limitations when analyzing highly specialized and complex anomalous visual content. Thus, designing and training LLMs specifically for VAD in vertical domains will be an important direction.\nREFERENCES # [1] Y. Liu and H. Liang, \u0026ldquo;Review on the application of the nonlinear output frequency response functions to mechanical fault diagnosis,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 72, pp. 1–12, 2023.\n[2] P. P. Liang, A. Zadeh, and L.-P. Morency, \u0026ldquo;Foundations \u0026amp; trends in multimodal machine learning: Principles, challenges, and open questions,\u0026rdquo; ACM Comput. Surveys, vol. 56, no. 
10, pp. 1–42, Oct. 2024.\n[3] F. Harrou, M. M. Hittawe, Y. Sun, and O. Beya, \u0026ldquo;Malicious attacks detection in crowded areas using deep learning-based approach,\u0026rdquo; IEEE Instrum. Meas. Mag., vol. 23, no. 5, pp. 57–62, Aug. 2020.\n[4] T. Li and H. Yu, \u0026ldquo;Visual–inertial fusion-based human pose estimation: A review,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 72, pp. 1–16, 2023.\n[5] L. Wang, X. Wang, F. Liu, M. Li, X. Hao, and N. Zhao, \u0026ldquo;Attentionguided MIL weakly supervised visual anomaly detection,\u0026rdquo; Measurement , vol. 209, Mar. 2023, Art. no. 112500.\n[6] M. A. Abou-Khousa, M. S. U. Rahman, K. M. Donnell, and M. T. A. Qaseer, \u0026ldquo;Detection of surface cracks in metals using microwave and millimeter-wave nondestructive testing techniques—A review,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 72, pp. 1–18, 2023.\n[7] H. Qi, X. Kong, Z. Shen, Z. Liu, and J. Gu, \u0026ldquo;Progressively learning dynamic level set for weakly supervised industrial defect segmentation,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 72, pp. 1–14, 2023.\n[8] Y. Li, X. Wu, P. Li, and Y. Liu, \u0026ldquo;Ferrite beads surface defect detection based on spatial attention under weakly supervised learning,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 72, pp. 1–12, 2023.\n[9] H. Du et al., \u0026ldquo;Uncovering what, why and how: A comprehensive benchmark for causation understanding of video anomaly,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 18793–18803.\n[10] H. Park, J. Noh, and B. Ham, \u0026ldquo;Learning memory-guided normality for anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 14372–14381.\n[11] M. Liu, Y. Jiao, J. Lu, and H. Chen, \u0026ldquo;Anomaly detection for medical images using teacher–student model with skip connections and multi-scale anomaly consistency,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 73, pp. 
1–15, 2024.\n[12] D. Wang, Q. Hu, and K. Wu, \u0026ldquo;Dual-branch network with memory for video anomaly detection,\u0026rdquo; Multimedia Syst., vol. 29, no. 1, pp. 247–259, Feb. 2023.\n[13] R. Liu, W. Liu, H. Li, H. Wang, Q. Geng, and Y. Dai, \u0026ldquo;Metro anomaly detection based on light strip inductive key frame extraction and MAGAN network,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 71, pp. 1–14, 2022.\n[14] J. Jiang, S. Wei, X. Xu, Y. Cui, and X. Liu, \u0026ldquo;Unsupervised anomaly detection and localization based on two-hierarchy normalizing flow,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 73, pp. 1–11, 2024.\n[15] Y. Yan, D. Wang, G. Zhou, and Q. Chen, \u0026ldquo;Unsupervised anomaly segmentation via multilevel image reconstruction and adaptive attentionlevel transition,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 70, pp. 1–12, 2021.\n[16] G. Yu, S. Wang, Z. Cai, X. Liu, E. Zhu, and J. Yin, \u0026ldquo;Video anomaly detection via visual cloze tests,\u0026rdquo; IEEE Trans. Inf. Forensics Security , vol. 18, pp. 4955–4969, 2023.\n[17] C. Zhang, Y. Wang, and W. Tan, \u0026ldquo;MTHM: Self-supervised multi-task anomaly detection with hard example mining,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 72, pp. 1–13, 2023.\n[18] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, \u0026ldquo;Weakly-supervised video anomaly detection with robust temporal feature magnitude learning,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 4955–4966.\n[19] V. Sevetlidis, G. Pavlidis, V. Balaska, A. Psomoulis, S. G. Mouroutsos, and A. Gasteratos, \u0026ldquo;Enhancing weakly supervised defect detection through anomaly-informed weighted training,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 73, pp. 1–10, 2024.\n[20] P. Wu, X. Liu, and J. Liu, \u0026ldquo;Weakly supervised audio-visual violence detection,\u0026rdquo; IEEE Trans. Multimedia, vol. 25, pp. 1674–1685, 2023.\n[21] P. 
Wu et al., \u0026ldquo;Not only look, but also listen: Learning multimodal violence detection under weak supervision,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis., Jan. 2020, pp. 322–339.\n[22] Y. Pu and X. Wu, \u0026ldquo;Audio-guided attention network for weakly supervised violence detection,\u0026rdquo; in Proc. 2nd Int. Conf. Consum. Electron. Comput. Eng. (ICCECE), Jan. 2022, pp. 219–223.\n[23] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, \u0026ldquo;Unbiased multiple instance learning for weakly supervised video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 8022–8031.\n[24] P. Wu et al., \u0026ldquo;VadCLIP: Adapting vision-language models for weakly supervised video anomaly detection,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell. , vol. 38, Mar. 2024, pp. 6074–6082.\n[25] Y. Pu, X. Wu, L. Yang, and S. Wang, \u0026ldquo;Learning prompt-enhanced context features for weakly-supervised video anomaly detection,\u0026rdquo; IEEE Trans. Image Process., vol. 33, pp. 4923–4936, 2024.\n[26] P. Wu, J. Liu, X. He, Y. Peng, P. Wang, and Y. Zhang, \u0026ldquo;Toward video anomaly retrieval from video anomaly detection: New benchmarks and model,\u0026rdquo; IEEE Trans. Image Process., vol. 33, pp. 2213–2225, 2024.\n[27] Z. Yang, J. Liu, and P. Wu, \u0026ldquo;Text prompt with normality guidance for weakly supervised video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 32, Jun. 2024, pp. 18899–18908.\n[28] P. Wu et al., \u0026ldquo;Open-vocabulary video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 18297–18307.\n[29] W. Sultani, C. Chen, and M. Shah, \u0026ldquo;Real-world anomaly detection in surveillance videos,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6479–6488.\n[30] X. Wu, T. Wang, Y. Li, P. Li, and Y. 
Liu, \u0026ldquo;A CAM-based weakly supervised method for surface defect inspection,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 71, pp. 1–10, 2022.\n[31] C. Cao, Y. Lu, and Y. Zhang, \u0026ldquo;Context recovery and knowledge retrieval: A novel two-stream framework for video anomaly detection,\u0026rdquo; IEEE Trans. Image Process., vol. 33, pp. 1810–1825, 2024.\n[32] S. Li, F. Liu, and L. Jiao, \u0026ldquo;Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell., 2022, vol. 36, no. 2, pp. 1395–1403.\n[33] D. Wei, Y. Liu, X. Zhu, J. Liu, and X. Zeng, \u0026ldquo;MSAF: Multimodal supervise-attention enhanced fusion for video anomaly detection,\u0026rdquo; IEEE Signal Process. Lett., vol. 29, pp. 2178–2182, 2022.\n[34] C. Feng, Z. Chen, and A. Owens, \u0026ldquo;Self-supervised video forensics by audio-visual anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 10491–10503.\n[35] L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, \u0026ldquo;Harnessing large language models for training-free video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 18527–18536.\n[36] H. Zhang et al., \u0026ldquo;Holmes-VAD: Towards unbiased and explainable video anomaly detection via multi-modal LLM,\u0026rdquo; 2024, arXiv:2406.12235 .\n[37] X. Hu et al., \u0026ldquo;Scaling up vision-language pretraining for image captioning,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 17959–17968.\n[38] Z. Jiang, J. Araki, H. Ding, and G. Neubig, \u0026ldquo;How can we know when language models know? On the calibration of language models for question answering,\u0026rdquo; Trans. Assoc. Comput. Linguistics, vol. 9, pp. 962–977, Sep. 2021.\n[39] K. Zhou, J. Yang, C. C. Loy, and Z. 
Liu, \u0026ldquo;Learning to prompt for visionlanguage models,\u0026rdquo; Int. J. Comput. Vis., vol. 130, no. 9, pp. 2337–2348, Sep. 2022.\n[40] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, \u0026ldquo;Conditional prompt learning for vision-language models,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2022, pp. 16795–16804.\n[41] J. Kim, S. Yoon, T. Choi, and S. Sull, \u0026ldquo;Unsupervised video anomaly detection based on similarity with predefined text descriptions,\u0026rdquo; Sensors , vol. 23, no. 14, p. 6256, Jul. 2023.\n[42] P. Jin, R. Takanobu, W. Zhang, X. Cao, and L. Yuan, \u0026ldquo;Chat-UniVi: Unified visual representation empowers large language models with image and video understanding,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 13700–13710.\n[43] Q. Wang et al., \u0026ldquo;StableIdentity: Inserting anybody into anywhere at first sight,\u0026rdquo; 2024, arXiv:2401.15975 .\n[44] J. Li, \u0026ldquo;Recent advances in end-to-end automatic speech recognition,\u0026rdquo; APSIPA Trans. Signal Inf. Process., vol. 11, no. 1, pp. 1–27, 2022.\n[45] P. Pu Liang, A. Zadeh, and L.-P. Morency, \u0026ldquo;Foundations and trends in multimodal machine learning: Principles, challenges, and open questions,\u0026rdquo; 2022, arXiv:2209.03430 .\n[46] J. Yu, J. Liu, Y. Cheng, R. Feng, and Y. Zhang, \u0026ldquo;Modality-aware contrastive instance learning with self-distillation for weakly-supervised audio-visual violence detection,\u0026rdquo; in Proc. 30th ACM Int. Conf. MultiMedia, Oct. 2022, pp. 6278–6287.\n[47] Z. Zhang, B. Yang, and J. Ma, \u0026ldquo;Multiple constraints flow for weakly observable defect detection based on defect-free samples,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 73, pp. 1–13, 2024.\n[48] Z. Liu et al., \u0026ldquo;Swin transformer: Hierarchical vision transformer using shifted windows,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV) , Oct. 2021, pp. 
9992–10002.\n[49] A. Vaswani et al., \u0026ldquo;Attention is all you need,\u0026rdquo; in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 30, 2017, pp. 1–11.\n[50] P. Lee, J. Wang, Y. Lu, and H. Byun, \u0026ldquo;Weakly-supervised temporal action localization by uncertainty modeling,\u0026rdquo; Proc. AAAI Conf. Artif. Intell., vol. 35, no. 3, pp. 1854–1862, May 2021.\n[51] A. Paszke et al., \u0026ldquo;PyTorch: An imperative style, high-performance deep learning library,\u0026rdquo; in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Jan. 2019, p. 32.\n[52] H. Zhou, J. Yu, and W. Yang, \u0026ldquo;Dual memory units with uncertainty regulation for weakly supervised video anomaly detection,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell., Jun. 2023, vol. 37, no. 3, pp. 3769–3777.\n[53] J. Wang and A. Cherian, \u0026ldquo;GODS: Generalized one-class discriminative subspaces for anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8200–8210.\n[54] P. Wu, J. Liu, M. Li, Y. Sun, and F. Shen, \u0026ldquo;Fast sparse coding networks for anomaly detection in videos,\u0026rdquo; Pattern Recognit., vol. 107, Nov. 2020, Art. no. 107515.\n[55] M. Z. Zaheer, A. Mahmood, M. H. Khan, M. Segu, F. Yu, and S.I. Lee, \u0026ldquo;Generative cooperative learning for unsupervised video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 14724–14734.\n[56] A. Al-Lahham, M. Z. Zaheer, N. Tastan, and K. Nandakumar, \u0026ldquo;Collaborative learning of anomalies with privacy (CLAP) for unsupervised video anomaly detection: A new baseline,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 12416–12425.\n[57] M. Zhang et al., \u0026ldquo;Multi-scale video anomaly detection by multi-grained spatio-temporal representation learning,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 17385–17394.\n[58] H. Lv, C. Zhou, Z. 
Cui, C. Xu, Y. Li, and J. Yang, \u0026ldquo;Localizing anomalies from weakly-labeled videos,\u0026rdquo; IEEE Trans. Image Process., vol. 30, pp. 4505–4515, 2021.\n[59] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. Wu, \u0026ldquo;MGFN: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell., 2023, vol. 37, no. 1, pp. 387–395.\n[60] H. K. Joo, K. Vo, K. Yamazaki, and N. Le, \u0026ldquo;CLIP-TSA: CLIP-assisted temporal self-attention for weakly-supervised video anomaly detection,\u0026rdquo; in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2023, pp. 3230–3234.\n[61] Y. Su, Y. Tan, S. An, and M. Xing, \u0026ldquo;Anomalies cannot materialize or vanish out of thin air: A hierarchical multiple instance learning with position-scale awareness for video anomaly detection,\u0026rdquo; Exp. Syst. Appl., vol. 254, Nov. 2024, Art. no. 124392.\n[62] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, \u0026ldquo;Support vector method for novelty detection,\u0026rdquo; in Proc. Adv. Neural Inf. Process. Syst., vol. 12, 1999, pp. 582–588.\n[63] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, \u0026ldquo;Learning temporal regularity in video sequences,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 733–742.\n[64] J.-C. Wu, H.-Y. Hsieh, D.-J. Chen, C.-S. Fuh, and T.-L. Liu, \u0026ldquo;Self-supervised sparse representation for video anomaly detection,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 729–745.\n[65] P. Wu and J. Liu, \u0026ldquo;Learning causal temporal relation and feature discrimination for anomaly detection,\u0026rdquo; IEEE Trans. Image Process., vol. 30, pp. 3513–3527, 2021.\n[66] T. Liu, C. Zhang, K.-M. Lam, and J. Kong, \u0026ldquo;Decouple and resolve: Transformer-based models for online anomaly detection from weakly labeled videos,\u0026rdquo; IEEE Trans. Inf. 
Forensics Security, vol. 18, pp. 15–28, 2023.\n[67] C. Zhang et al., \u0026ldquo;Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 16271–16280.\n[68] S. Paul, S. Roy, and K. R.-C. Amit, \u0026ldquo;W-TALC: Weakly-supervised temporal activity localization and classification,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis., Jan. 2018, pp. 563–579.\n[69] S. Narayan, H. Cholakkal, F. S. Khan, and L. Shao, \u0026ldquo;3C-net: Category count and center loss for weakly-supervised action localization,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8678–8686.\nDicong Wang (Graduate Student Member, IEEE) is currently pursuing the joint Ph.D. degree with the College of Intelligence and Computing, Tianjin University, Tianjin, China, and the School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou, China.\nHis research interests include computer vision and video anomaly detection.\nQilong Wang (Senior Member, IEEE) received the Ph.D. degree from the School of Information and Communication Engineering, Dalian University of Technology, Dalian, China, in 2018.\nHe is currently a Professor with Tianjin University, Tianjin, China. He has authored or co-authored more than 40 academic papers in top conferences and refereed journals, including ICCV, CVPR, NeurIPS, ECCV, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE TRANSACTIONS ON IMAGE PROCESSING, and IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY. His research interests include visual understanding and deep learning, particularly deep models with high-order statistical modeling, and self-attention mechanism.\nProf. Wang served as the Area Chair for CVPR 2024 and 2025.\nQinghua Hu (Senior Member, IEEE) received the B.S., M.S., and Ph.D. 
degrees from Harbin Institute of Technology, Harbin, China, in 1999, 2002, and 2008, respectively.\nHe was a Post-Doctoral Fellow with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong, from 2009 to 2011. He is currently the Chair Professor with the College of Intelligence and Computing, Tianjin University, Tianjin, China, and the Director of the SIG Granular Computing and Knowledge Discovery and Chinese\nAssociation of Artificial Intelligence. He was supported by the Key Program and the National Natural Science Foundation of China. He has authored or coauthored over 300 peer-reviewed articles. His current research interests include uncertainty modeling in big data, machine learning with multimodality data, and intelligent unmanned systems.\nProf. Hu is an Associate Editor of IEEE TRANSACTIONS ON FUZZY SYSTEMS , Acta Automatica Sinica, and Acta Electronica Sinica .\nKaijun Wu (Member, IEEE) received the Ph.D. degree from Lanzhou Jiaotong University, Lanzhou, China, in 2017.\nHe is currently a Professor with the School of Electronics and Information Engineering, Lanzhou Jiaotong University. His research interests include intelligent algorithm optimization and image processing.\n","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/multimodal_vad_visual_anomaly_detection_in_intelligent_monitoring_system_via_audio-vision-language/","section":"Papers","summary":"The paper proposes a dual-stream multimodal video anomaly detection network that leverages video, audio, and text modalities to achieve reliable and precise anomaly detection. 
It introduces effective multimodal fusion, abnormal-aware context prompts (ACPs), and a coarse-support-fine strategy to enhance anomaly discrimination and description, demonstrating superior performance on large-scale datasets.","title":"Multimodal VAD: Visual Anomaly Detection in Intelligent Monitoring System via Audio-Vision-Language","type":"method"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/","section":"Papers","summary":"","title":"Papers","type":"papers"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/qilong-wang/","section":"Authors","summary":"","title":"Qilong Wang","type":"authors"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/qinghua-hu/","section":"Authors","summary":"","title":"Qinghua Hu","type":"authors"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/","section":"sis-arxiv-vad-papers","summary":"","title":"sis-arxiv-vad-papers","type":"page"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/type/","section":"Type","summary":"","title":"Type","type":"type"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/benchmarks/ucf-crime/","section":"Benchmarks","summary":"","title":"Ucf-Crime","type":"benchmarks"},{"content":"","date":"20 June 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/benchmarks/xd-violence/","section":"Benchmarks","summary":"","title":"Xd-Violence","type":"benchmarks"},{"content":"","date":"28 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/guodong-zhou/","section":"Authors","summary":"","title":"Guodong Zhou","type":"authors"},{"content":"","date":"28 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jiamin-luo/","section":"Authors","summary":"","title":"Jiamin 
Luo","type":"authors"},{"content":"","date":"28 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jingjing-wang/","section":"Authors","summary":"","title":"Jingjing Wang","type":"authors"},{"content":"","date":"28 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/junxiao-ma/","section":"Authors","summary":"","title":"Junxiao Ma","type":"authors"},{"content":"","date":"28 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/benchmarks/other/","section":"Benchmarks","summary":"","title":"Other","type":"benchmarks"},{"content":"","date":"28 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/type/other/","section":"Type","summary":"","title":"Other","type":"type"},{"content":"","date":"28 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/peiying-yu/","section":"Authors","summary":"","title":"Peiying Yu","type":"authors"},{"content":"","date":"28 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/categories/semi-supervised/","section":"Categories","summary":"","title":"Semi Supervised","type":"categories"},{"content":"","date":"28 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/benchmarks/shanghaitech/","section":"Benchmarks","summary":"","title":"Shanghaitech","type":"benchmarks"},{"content":" Sherlock: Towards Multi-scene Video Abnormal Event Extraction and Localization via a Global-local Spatial-sensitive LLM # Junxiao Ma # jxma0711@stu.suda.edu.cn School of Computer Science and Technology, Soochow University Suzhou, China\nJingjing Wang ∗ # djingwang@suda.edu.cn School of Computer Science and Technology, Soochow University Suzhou, China\nPeiying Yu # 20244227007@stu.suda.edu.cn School of Computer Science and Technology, Soochow University Suzhou, China\nJiamin Luo # 20204027003@stu.suda.edu.cn School of Computer Science and Technology, Soochow University Suzhou, China\nGuodong Zhou # gdzhou@suda.edu.cn School of Computer Science and 
Technology, Soochow University Suzhou, China\nFigure 1: (a) and (b) illustrate two surveillance video examples for our M-VAE task and Sherlock model in two scenes (Street and Residence). Sherlock precisely generates the abnormal event quadruples and their corresponding timestamps. (c) presents a circular ratio diagram illustrating different spatial information. From (c), we observe that the global spatial information and the local spatial information (i.e., action, object relation, and background) in our M-VAE dataset are imbalanced.\nAbstract # Prior studies on Video Anomaly Detection (VAD) mainly focus on detecting whether each video frame is abnormal or not, and largely ignore the structured video semantic information (i.e., what, when, and where the abnormal event happens). With this in mind, we propose a new chat-paradigm Multi-scene Video Abnormal Event Extraction and Localization (M-VAE) task, aiming to extract the abnormal event quadruples (i.e., subject, event type, object, scene) and localize such events. Further, this paper believes that this new task faces two key challenges, i.e., global-local spatial modeling and global-local spatial balancing. To this end, this paper proposes a Global-local Spatial-sensitive Large Language Model (LLM) named Sherlock, i.e., acting like Sherlock Holmes to track down the criminal events, for this M-VAE task. Specifically, this model designs a Global-local Spatial-enhanced MoE (GSM) module and a Spatial Imbalance Regulator (SIR) to address the two challenges respectively. Extensive experiments on our M-VAE instruction dataset show the significant advantages of Sherlock over several advanced Video-LLMs. This justifies the importance of global-local spatial information for the M-VAE task and the effectiveness of Sherlock in capturing such information.\n∗ Corresponding Author: Jingjing Wang.\nWWW ’25, April 28–May 2, 2025, Sydney, NSW, Australia.\n© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 979-8-4007-1274-6/25/04\nhttps://doi.org/10.1145/3696410.3714617\nCCS Concepts # Computing methodologies → Artificial intelligence. Keywords # Multi-scene Video, Video Abnormal Event, Spatial-sensitive LLM\nACM Reference Format: # Junxiao Ma, Jingjing Wang, Jiamin Luo, Peiying Yu, and Guodong Zhou. 2025. Sherlock: Towards Multi-scene Video Abnormal Event Extraction and Localization via a Global-local Spatial-sensitive LLM. In Proceedings of the ACM Web Conference 2025 (WWW \u0026lsquo;25), April 28–May 2, 2025, Sydney, NSW, Australia. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3696410.3714617\n1 Introduction # Video Understanding is a foundational task in artificial intelligence, which focuses on analyzing and interpreting the content of videos to enable various applications, including video classification, activity recognition, and scene understanding [40 , 58 , 59]. As a critical branch of video understanding, Video Anomaly Detection (VAD) [20], which aims to automatically detect abnormal videos, has garnered significant research attention due to its wide range of applications in criminal activity detection and disaster response [56]. 
Prior studies on VAD mainly focus on detecting whether each video frame is abnormal or not [20 , 29 , 41 , 56]. However, these studies overlook the underlying video semantic structure, i.e., \u0026ldquo;what is the abnormal type, where has it occurred, and which people or things are involved\u0026rdquo; in a given video.\nMotivated by this, this paper proposes a novel Multi-scene Video Abnormal Event Extraction and Localization (M-VAE) task, aiming at localizing abnormal events (i.e., starting and ending times of the anomaly) and extracting event quadruples (i.e., [subject of the event, event type, object of the event, scene of the event]) through a chat paradigm. Take the Street scene in Figure 1 (a) as an example: within 23s to 25s, a man bends down and pries the lock, then drives away from the street, and the abnormal event quadruple is [people, steal, car, street]. A different scene (i.e., the Residence scene) is shown in Figure 1 (b). Within 15s to 17s, a man vandalizes a sculpture at one\u0026rsquo;s residence and the quadruple is [people, Vandalism, Sculpture, Residence]. This structured processing for abnormal videos can significantly improve the practicality and efficiency of video anomaly localization systems. In fields such as real-time abnormal event monitoring, which require high reliability and precision, such structured processing makes it possible to quickly search and screen for the required abnormal elements, providing more convenient and intuitive evidence for further processing. Therefore, it is worthwhile to address this new task. Nevertheless, we believe that this new task faces two key challenges.\nFor one thing, it is challenging to model the global-local spatial information (named global-local spatial modeling challenge). Existing video understanding models [34 , 38 , 54] mainly focus on modeling general global information. 
However, local spatial information in our M-VAE task is often more crucial than general global information, since it is highly discriminative and essential for precise identification. Taking Figure 1 (a) as an example, the local spatial information, such as action (bend down), object relations (\u0026lt;man, near, car\u0026gt;), and background (street), can help better identify abnormal events. However, these types of local spatial information (e.g., actions, object relations, backgrounds) have heterogeneous representations (i.e., different model structures and encoders). Therefore, a single, fixed-capacity transformer-based model often struggles to capture this critical local spatial information in videos. Recently, the Mixture of Expert (MoE) [18 , 23] paradigm has demonstrated scalability in multi-modal heterogeneous representation fusion tasks [18 , 23 , 24]. Inspired by this, a well-behaved model for our task should adopt the MoE paradigm to not only consider global spatial information but also emphasize the importance of local spatial information.\nFor another, a straightforward approach is to employ a basic Mixture of Expert (MoE) mechanism [18 , 23 , 24] that treats global spatial information (i.e., general representations of videos) and local spatial information (e.g., actions) as the global expert and local experts, respectively, and integrates this information. However, the data imbalance among local spatial information may bias the basic MoE experts towards the more frequently occurring spatial information in the dataset. The statistics in Figure 1 (c) illustrate this imbalance. Certain frequently appearing local information (i.e., action at 45%) can lead to a higher weight for the corresponding expert. However, in Figure 1 (a), the object relation information, despite having the smallest proportion (25%), is the most discriminative for extracting and localizing Theft events. 
More seriously, global spatial information is the most frequent, and our preliminary experiments in Figure 7 (a) reveal that the global expert is often more thoroughly trained and often has the highest weight. Therefore, a better-behaved MoE expert fusion mechanism should mitigate this data imbalance (named global-local spatial balancing challenge), ensuring all experts are sufficiently trained to highlight their importance.\nTo tackle the above challenges, we propose a Global-local Spatial-sensitive LLM named Sherlock, i.e., acting like Sherlock Holmes to track down criminal events, for M-VAE. Specifically, this model designs a Global-local Spatial-enhanced MoE (GSM) module to address the global-local spatial modeling challenge, which includes four spatial experts to extract spatial information and an expert gate to weigh global and local spatial information. Furthermore, this model designs a Spatial Imbalance Regulator (SIR) to address the global-local spatial balancing challenge, which includes a Gated Spatial Balancing Loss (GSB) to further balance global and local experts. Particularly, we construct an M-VAE instruction dataset to better evaluate the effectiveness of our model. Detailed experiments show Sherlock can effectively extract and localize abnormal events and surpass advanced Video-LLMs in multiple evaluation metrics.\n2 Related Work # · Video Anomaly Detection. Video Understanding is a rapidly evolving research field which encompasses several tasks, including video grounding [40 , 58 , 59], spatial-temporal detection [13] and so on. As an important branch of video understanding, previous studies on Video Anomaly Detection (VAD) can be categorized into unsupervised, weakly-supervised, and fully-supervised categories. Unsupervised approaches focus on leveraging reconstruction techniques to identify anomalies [15 , 20 , 62 , 64]. Weakly-supervised methods have shown promising results in identifying abnormal frames [11 , 36 , 57 , 60 , 72]. 
Fully-supervised methods are scarce due to the expensive frame-level annotations required [7 , 10 , 12 , 19 , 53 , 55 , 70]. Different from the above studies, our Sherlock model aims to determine the underlying video semantic structure, providing a structured quadruple that goes beyond previous methods and facilitating the rapid detection and early warning of abnormal events in real time.\n· Event Extraction (EE) focuses on extracting structured information from given types of information. Traditional EE methods mainly extract from text documents [21 , 25 , 35 , 37 , 52]. Recently, many studies [2 , 44 , 66 – 68] generate similar event structures from visual image data. Different from all the above studies, we are the first to focus on extracting the abnormal event from videos and constructing a quadruple dataset, incorporating multiple types of spatial information, enriching the task of event extraction, and making it more practical for real-world applications.\nFigure 2: The overall framework of Sherlock. It consists of a Global-local Spatial-enhanced MoE (GSM) Module and a Spatial Imbalance Regulator (SIR). The SIR exerts a direct influence on the output weights of the expert gate.\n· Video-oriented Large Language Models. The rise of ChatGPT [49] has stimulated the prosperity of Video Large Language Models which can be categorized into four major types: firstly, Video Chat [34] and Video LLaMA [69], which utilize BLIP-2 [33] and Q-Former to map visual representations onto Vicuna; secondly, models like Video ChatGPT [47], Otter [31], Valley [46], mPLUG-Owl [65], and Chat-UniVi [26], which leverage CLIP [51] to encode visual features; thirdly, PandaGPT [54], which adopts ImageBind [14] as its core architecture for video understanding; and fourthly, Video-LLaVA [38], which aligns image and video features into a linguistic feature space using LanguageBind [73]. 
Recently, a few studies [27 , 63] consider incorporating spatial information in models. Besides, some studies [18 , 23 , 24] introduce the concept of MoE into LLMs, but they only focus on efficiency, without considering the balance between different information. Different from all the above studies, we design a new Sherlock model to address our M-VAE task, which includes a Global-local Spatial-enhanced MoE module and a Spatial Imbalance Regulator to address the challenges of global-local modeling and balancing.\n3 Our Sherlock Model # In this paper, we propose a Sherlock model to address the M-VAE task. Figure 2 illustrates the framework of Sherlock, which is composed of two core components (i.e., the Global-local Spatial-enhanced MoE (GSM) module (sec 3.1) for the global-local spatial modeling challenge and the Spatial Imbalance Regulator (SIR) (sec 3.2) for the global-local spatial balancing challenge). Subsequently, we present our training strategies to enhance the ability of understanding spatial information (sec 3.3).\nBackbone. We choose Video-LLaVA [38] and its visual encoder LanguageBind [73] as the core framework. Video-LLaVA, which is optimized with a mixed dataset of images and videos, demonstrates leading performance across most image and video benchmarks. We employ Video-LLaVA as the backbone to explore the potential of Video-LLMs in extracting and localizing abnormal events.\nTask Formulation. Given a video 𝑉 of 𝑀 frames, each frame is labeled with 1 or 0, where 1 indicates that the frame conveys an abnormal event and 0 that it does not. The goal of M-VAE is to interactively generate the quadruple (𝑠𝑢𝑏, 𝑡𝑦𝑝𝑒, 𝑜𝑏𝑗, 𝑠𝑐𝑒) for each event along with the corresponding timestamps 𝑠𝑡𝑎 and 𝑒𝑛𝑑, where 𝑠𝑢𝑏, 𝑡𝑦𝑝𝑒, 𝑜𝑏𝑗, 𝑠𝑐𝑒, 𝑠𝑡𝑎 and 𝑒𝑛𝑑 are the subject, event type, object, scene, start time and end time of the abnormal event. As shown in Figure 1 (a), a man steals a car on the street from 23s to 25s. 
Therefore, the output of our M-VAE task is {23s, 25s, (people, steal, car, street)}.\n3.1 Global-local Spatial-enhanced MoE Module # As shown in Figure 2, we design a Global-local Spatial-enhanced MoE (GSM) Module for the global-local spatial modeling challenge. Inspired by Mixture-of-Experts (MoE) [24], we design three Local Spatial Experts (i.e., Local Action Expert, Local Object Relation Expert and Local Background Expert) and a Global Spatial Expert to extract spatial information, detailed as follows.\nLocal Spatial Experts contain three local spatial experts (i.e., action, object relation, and background), detailed as follows.\n· Local Action Expert (Action Expert, AE). We leverage HigherHRNet [6], a well-adopted bottom-up human pose estimation network, to extract local spatial action information. HigherHRNet can generate local spatial action tokens T 𝒂 = {𝒕 𝒂 1 , \u0026hellip;, 𝒕 𝒂 𝒊 , \u0026hellip;, 𝒕 𝒂 𝒎 }, and each token consists of 17 human joint nodes for each individual in every frame of a video sequence. Here, 𝑖 denotes the 𝑖-th frame. Next, we apply Action Graph Attention to integrate T 𝒂 with the video tokens T 𝒗 = {𝒕 𝒗 1 , \u0026hellip;, 𝒕 𝒗 𝒊 , \u0026hellip;, 𝒕 𝒗 𝒎 } generated by the Video Encoder in Video-LLMs. We start by calculating the attention weights 𝛼𝑘𝑗 for each node 𝑒𝑘 in 𝒕 𝒂 𝒊 relative to its neighboring node 𝑒𝑗:\nwhere ℎ𝑘 and ℎ𝑗 are the features of 𝑒𝑘 and 𝑒𝑗 respectively. Wa denotes the learnable weight matrix, and 𝑑 is the feature dimension. Then we aggregate the feature ĥ𝑘 of node 𝑒𝑘: ĥ𝑘 = ∑ 𝑗∈N(𝑒𝑘) 𝛼𝑘𝑗 · ℎ𝑗, where N(𝑒𝑘) is the set of neighboring nodes of 𝑒𝑘. 
Finally, the feature of 𝑒𝑘 is calculated by ℎ′𝑘 = ReLU(Wk[ĥ𝑘, ℎ𝑘]), where Wk denotes the weight matrix and [ĥ𝑘, ℎ𝑘] is the concatenation of ĥ𝑘 and ℎ𝑘.\nAfter the graph attention operation, we enhance T 𝒂 using the attention mechanism with query Q𝒗, key K𝒂, and value V𝒂 to obtain the final action tokens: T′𝒂 = softmax(Q𝒗⊤ · K𝒂) · V𝒂.\n· Local Object Relation Expert (Object Relation Expert, ORE). We leverage RelTR [9], a well-studied one-stage object relation graph generation method, to extract local spatial object relation information. RelTR can generate an object relation token 𝒕 𝒐 𝒊 = (𝑅𝑖 where 𝜎 is the activation function on the graph. Ã is the adjacency matrix of the object-relation graph, derived from 𝐸𝑖, and D̃ is its degree matrix, with D̃𝑖𝑖 = ∑𝑗 Ã𝑖𝑗. W(ℓ) is a trainable weight matrix.\n· Local Background Expert (Background Expert, BE). We leverage SAM2 [28], an advanced model for visual segmentation, to extract local spatial background information from videos. SAM2 can generate a background image for each frame of video. Then we leverage InternViT [5], a large vision encoder that extends the vision transformer (ViT) [4] to 6B parameters, to encode the local spatial background information, formally represented as:\nwhere 𝑣𝑖 is the 𝑖-th frame of video 𝑉. This process results in the local spatial background tokens T𝒃 = {𝒕 𝒃 1 , \u0026hellip;, 𝒕 𝒃 𝒊 , \u0026hellip;, 𝒕 𝒃 𝒎 } for the entire video sequence, with 𝑚 representing the total number of frames.\nGlobal Spatial Expert has a comprehensive understanding of the training data. It collaborates with the local spatial experts to bring both specialization and generalization capabilities to the M-VAE task.\n· Global Spatial Expert (Global Expert, GE). The weight assigned to the global spatial expert complements that of the local spatial experts. 
Consequently, the local spatial experts acquire specialized skills for specific tasks, whereas the global spatial expert develops a comprehensive understanding of the entire training corpus. The collaboration between these two types of experts provides both specialization and generalization for our M-VAE task. Accordingly, we leverage LanguageBind [73] in Video-LLaVA [38], which inherits the ViT-L/14 structure from CLIP and is equipped with powerful and universal visual encoding capabilities, to extract global spatial information for our task. We subsequently leverage a pre-trained FFN layer by [38] to align the dimension with other spatial information, formally represented as:\nwhere 𝑣𝑖 is the 𝑖-th frame of video 𝑉. This process yields the full set of global tokens T 𝒈 = {𝒕 𝒈 1 , \u0026hellip;, 𝒕 𝒈 𝒊 , \u0026hellip;, 𝒕 𝒈 𝒎 } for the entire video sequence, with 𝑚 representing the total number of frames.\nAfter designing the four experts, we ensure that they can dynamically adjust the weights of the four heterogeneous types of spatial information, inspired by Mixture-of-Experts (MoE) [18]. As shown in Figure 2, unlike methods that embed several FFNs within LLMs, our GSM places the four experts outside the LLM to adjust weights for global and local spatial information. Based on this, we introduce a dynamic Expert Gate (EG) [50], which controls the contribution of each expert by calculating gating weights as a soft gate. Finally, the output O of GSM, based on the four spatial experts and EG, is formally represented as:\nwhere LayerNorm(·) indicates layer normalization [1]. 𝑔𝑖 (the 𝑖-th entry in 𝒈) represents the weight of the 𝑖-th expert. Si represents the output of the 𝑖-th spatial expert. 
𝑁 is the total number of spatial experts, and W𝑔 is the trainable weight matrix.\n3.2 Spatial Imbalance Regulator # After modeling the spatial information, we design a Spatial Imbalance Regulator (SIR) including a Gated Spatial Balancing Loss (GSB) for the global-local spatial balancing challenge, detailed as follows.\nTable 1: The statistics of the number of events and the duration in seconds (s) of events for each scene.\nSplit | School | Shop | Underwater | Street | Road | Boat | Wild | Forest | Residence | Bank | Commercial | Factory | Lawn | Other | Total\nTrain | 55 (2136s) | 107 (4130s) | 78 (3022s) | 113 (7076s) | 114 (5586s) | 115 (5203s) | 111 (4681s) | 102 (3918s) | 117 (4914s) | 89 (3380s) | 105 (5011s) | 82 (3173s) | 104 (5943s) | 56 (1497s) | 1348 (59670s)\nInference | 13 (534s) | 26 (1032s) | 19 (755s) | 28 (1769s) | 28 (1396s) | 29 (1300s) | 27 (1170s) | 25 (979s) | 29 (1228s) | 22 (845s) | 26 (1252s) | 20 (793s) | 26 (1485s) | 14 (374s) | 332 (14912s)\nFigure 3: The word cloud distribution of quadruple elements in the M-VAE dataset, which reveals the spatial imbalance (e.g., the proportion of people is the highest).\nGated Spatial Balancing (GSB) Loss. Previous studies employ a basic Mixture of Experts (MoE) [18 , 23] to model global and local spatial information. When faced with an imbalance between these two types of information, the weights assigned to experts tend to be biased toward those that appear more frequently. As shown in Figure 1 (c), most spatial elements in the event quadruples (e.g., People) are related to local spatial action information. This implies that performance will deteriorate when faced with real-world data that is not handled by the action expert (e.g., object relations). More seriously, as shown in Figure 1 (c), global information holds significant weight in all data, which will lead to excessive training of the global expert and weaken the abilities of local experts with lower weights. This imbalance phenomenon will greatly affect the performance of our model. 
Based on this, we should keep the weights of all spatial experts from differing too much and reach a state of relative balance in which every expert is fully trained. Inspired by MoELoRA [42], we propose a Gated Spatial Balancing (GSB) Loss to balance spatial weights, as follows:\nwhere 𝑁local is the number of local experts and 𝑔global is the weight of the global expert. The first term of Eq.(7) balances the local experts against each other, and the second term balances the local experts against the global expert. The weights of the four experts are balanced when the loss reaches its minimum. This regulation achieves a better balance among all experts, reducing the impact of data imbalance, which effectively addresses the global-local balancing challenge. Finally, the overall loss of Sherlock can be represented as:\nwhere 𝛼 is the hyper-parameter that controls the strength of Lgate, and LD is the next-token prediction loss of Video-LLMs.\n3.3 Training Strategies for Sherlock # In order to enhance the ability of understanding spatial information, we design a two-stage training process. Stage 1 is to enhance the ability of understanding spatial information and Stage 2 is to address the M-VAE task, detailed as follows.\nStage 1. Pre-Tuning for spatial understanding. As shown in Figure 2, we first pre-tune Video-LLaVA using four high-quality datasets. We aim for Video-LLaVA to have a good spatial understanding ability. Specifically, we select four high-quality datasets: HumanML3D [16], Ref-L4 [3], RSI-CB [32], and COCO-Caption [39], as described in sec 4.1. For each pre-tuning dataset, we enable the model to understand the corresponding spatial information.\nFigure 4: Data composition for training and inference. Stage 1 (the dataset of pre-tuning for spatial understanding): 20K frames from each pre-tuning dataset (Ref-L4, HumanML3D, RSI-CB, and COCO). Stage 2 (our constructed dataset for the M-VAE task): Train 80163s (640K frames), Inference 20053s (160K frames).\nStage 2. 
Instruction Tuning for M-VAE task. We aim to enable the model to localize abnormal events and extract quadruples through the chat paradigm. We construct an instruction tuning dataset described in sec 4.1 and instruct the pre-tuned Video-LLaVA to extract quadruples and localize abnormal events. The quadruple includes subject, event type, object, and scene in abnormal events. The instruction will undergo text embedding to obtain the textual tokens T𝒕. Finally, the input of the LLM is \u0026ldquo;O from Eq.(5) + T𝒕\u0026rdquo;.\n4 Experimental Settings # 4.1 Instruction Data Construction # The training pipeline of Sherlock contains two stages. As shown in Figure 4, for each stage, we construct the corresponding instruction dataset for better tuning.\nFor Stage 1. We construct a spatial understanding dataset based on Ref-L4 [3], HumanML3D [16], RSI-CB [32] and COCO [39]. Specifically, we manually design an instruction for each type of spatial information, for instance: Instruction: \u0026ldquo;Judge the action of the characters in the image. Describe the image region \u0026lt;objs\u0026gt; in the image. Judge the background of the image. Describe the image\u0026rdquo;. HumanML3D has 25K videos with an average duration of 1 second, from which we take 8 frames per second. For data balance, we randomly select 20K images or frames from each dataset.\nFor Stage 2. We construct an M-VAE instruction dataset based on CUVA [10], which primarily consists of surveillance videos, with an average duration of 80 seconds per video. As this dataset includes five detailed video Q-A tasks (i.e., timestamp, classification, reason, result, and description tasks), it is highly beneficial for constructing our M-VAE dataset. 1) For abnormal event quadruples, constructing quadruples involves two steps. First, we collect answers from the reason, result, and description tasks in CUVA for each video. 
Subsequently, we construct initial quadruples through ChatGPT [49] based on the answers to these tasks, with the instruction: \u0026ldquo;Please extract the subject, object, and scene of the event based on the responses below\u0026rdquo;. Second, we create multiple candidate sets for the subjects, objects, and scenes in the quadruples. Specifically, for subject and object elements, we manually construct a set of around 40 candidates and filter elements based on this set. For event type elements, we adopt the 11 categories (i.e., Fighting, Animals, Water, Vandalism, Accidents, Robbery, Theft, Pedestrian, Fire, Violations, and Forbidden) from CUVA as the event types. For scene elements, we assign two annotators to classify scenes for each abnormal event. If they cannot reach an agreement, an expert will make the final decision to ensure annotation quality. The Kappa consistency check value of the annotation is 0.87. 2) For the localization task, we use the timestamps in CUVA as labels for localization. Furthermore, we adhere to the split of CUVA for training and inference videos and take 8 frames per second, resulting in 800K frames from 1k videos, where each video contains 1.68 abnormal events on average. The statistics of the number of events and the duration in seconds (s) of events for each scene are shown in Table 1. Finally, we obtain our M-VAE instruction dataset. Our instruction for the M-VAE task is: \u0026ldquo;Generate a quadruple and localize an abnormal event in the video. The quadruple includes subject, event type, object, and scene in abnormal events.\u0026rdquo;. Figure 1 (c) and Figure 3 show the top 20 quadruple elements, revealing the spatial imbalance.\nTable 2: Comparison of several Video-LLMs and Sherlock on our instruction dataset. The ↓ beside FNRs indicates the lower the metric, the better the performance. AE, ORE, BE, GE, and EG represent four Spatial Experts and Expert Gate respectively. Sub, Type, Obj, and Sce represent Subject, Event type, Object, and Scene respectively. For each task, Blue and Green denote the first and second place respectively. Columns are grouped as Event Extraction (Single F1, Pair F1, Quadruple, Average), Event Location (mAP@tIoU), and Anomaly Cls. (FNRs, F2).\nModels | Single F1: Subject | Single F1: Type | Single F1: Object | Single F1: Scene | Pair F1: Sub-Type | Pair F1: Obj-Type | Pair F1: Sub-Sce | Pair F1: Obj-Sce | Quadruple: F1 | Quadruple: T5-based | Quadruple: GPT-based | Extraction Average | mAP@tIoU 0.1 | mAP@tIoU 0.2 | mAP@tIoU 0.3 | mAP@tIoU Average | FNRs ↓ | F2\nVideo Chat | 73.14 | 71.35 | 64.28 | 71.76 | 70.12 | 58.69 | 71.55 | 61.18 | 40.95 | 61.68 | 53.94 | 62.6 | 77.28 | 74.93 | 66.26 | 72.82 | 38.79 | 65.88\nVideo ChatGPT | 61.87 | 59.51 | 54.82 | 46.39 | 54.23 | 49.68 | 43.26 | 41.38 | 39.63 | 57.36 | 50.38 | 49.86 | 74.65 | 70.91 | 67.03 | 70.86 | 41.47 | 61.35\nValley | 64.64 | 62.27 | 58.94 | 52.26 | 58.36 | 51.64 | 49.68 | 46.42 | 42.38 | 63.34 | 56.67 | 54.23 | 69.34 | 62.26 | 57.66 | 63.08 | 43.49 | 59.42\nPanda GPT | 73.09 | 75.45 | 68.42 | 61.93 | 71.96 | 59.92 | 59.79 | 59.45 | 41.17 | 54.36 | 48.55 | 60.37 | 76.64 | 62.69 | 57.21 | 65.51 | 35.62 | 69.16\nmPLUG-Owl | 52.86 | 37.54 | 40.24 | 37.68 | 31.97 | 28.89 | 33.9 | 27.87 | 22.12 | 30.68 | 32.41 | 34.1 | 61.42 | 53.21 | 46.46 | 53.69 | 56.98 | 51.66\nChat-UniVi | 59.71 | 57.26 | 55.28 | 44.23 | 52.43 | 50.62 | 41.24 | 40.96 | 37.68 | 55.34 | 48.84 | 43.59 | 65.89 | 58.62 | 40.02 | 54.84 | 52.52 | 53.78\nVideo-LLaVA | 77.85 | 73.68 | 65.67 | 75.91 | 69.32 | 59.21 | 73.25 | 62.24 | 41.32 | 52.94 | 56.74 | 64.37 | 78.31 | 74.79 | 64.92 | 72.67 | 41.34 | 64.96\nSherlock | 87.97 | 82.12 | 74.99 | 92.15 | 77.06 | 66.28 | 85.16 | 73.17 | 57.57 | 75.46 | 67.52 | 75.22 | 94.03 | 82.59 | 76.12 | 84.24 | 17.24 | 83.59\nw/o AE | 83.15 | 77.64 | 71.28 | 90.16 | 72.36 | 63.47 | 80.52 | 70.39 | 52.48 | 60.61 | 62.02 | 71.18 | 92.24 | 81.21 | 75.38 | 82.94 | 21.82 | 80.45\nw/o ORE | 83.96 | 78.25 | 72.37 | 90.01 | 74.24 | 64.46 | 81.56 | 70.97 | 54.35 | 72.28 | 65.08 | 72.5 | 91.13 | 82.08 | 74.62 | 82.61 | 22.97 | 78.83\nw/o BE | 81.16 | 74.65 | 67.88 | 88.07 | 69.29 | 61.12 | 77.64 | 66.64 | 48.63 | 53.04 | 55.94 | 67.71 | 88.62 | 79.09 | 72.24 | 79.98 | 25.36 | 73.51\nw/o GE | 79.2 | 74.09 | 66.71 | 84.11 | 70.38 | 60.77 | 75.44 | 66.28 | 46.34 | 63.97 | 57.06 | 66.75 | 86.18 | 78.37 | 69.28 | 77.94 | 28.97 | 71.28\nw/o EG | 78.83 | 73.96 | 65.02 | 83.15 | 70.15 | 60.26 | 74.15 | 63.37 | 43.64 | 59.14 | 51.82 | 64.86 | 81.31 | 77.68 | 67.88 | 75.62 | 32.58 | 67.07\nw/o SIR | 84.47 | 80.14 | 71.94 | 92.34 | 75.58 | 64.84 | 83.21 | 70.06 | 55.73 | 72.87 | 65.18 | 73.3 | 83.41 | 78.49 | 68.37 | 76.75 | 30.64 | 70.97\nw/o pre-tuning | 78.24 | 74.44 | 64.22 | 82.21 | 68.55 | 57.74 | 72.62 | 62.91 | 42.51 | 57.22 | 50.54 | 63.74 | 79.58 | 75.32 | 65.07 | 73.32 | 34.87 | 66.64\n4.2 Baselines # In this paper, we select several advanced Video-LLMs as baselines, which are introduced as follows. VideoChat [34] employs Q-Former [33] to map visual representations to Vicuna [8]. VideoChatGPT [47] integrates LLMs with CLIP [51] for video representations. Valley [46] employs a temporal modeling module to bridge visual and textual modes. PandaGPT [54] utilizes ImageBind [14] to demonstrate cross-modal capabilities. mPLUG-Owl [65] introduces a visual abstractor module to align different modes. Chat-UniVi [26] merges visual tokens with semantic meanings. Video-LLaVA [38] conducts joint training on images and videos. 
To ensure a fair comparison, we re-implement these models using their released code in our experiments, with all LLMs sized at 7B.\n4.3 Evaluation Metrics # M-VAE focuses on extracting event quadruples and locating abnormal events from videos, requiring evaluation metrics in three aspects (i.e., extracting event quadruples, locating abnormal events, and classifying abnormal events). For the extraction performance, we measure our model from three perspectives. 1) Single: performance of generating each individual element. 2) Pair: performance of generating the element pair, i.e., Subject-Type pair, Object-Type pair, Subject-Scene pair, Object-Scene pair. 3) Quadruple Generation: performance of generating the complete event quadruple. Following prior work [30], the performance is evaluated with Macro-F1. Furthermore, we use the T5-based and GPT-based metrics from Video-bench [48], which target LLMs. For localization performance, we use the mAP@tIoU metric [71], calculated as the mean Average Precision (mAP) at IoU thresholds from 0.1 to 0.3 with 0.1 intervals. For classification performance, we refer to the traditional anomaly classification task [17 , 45 , 61], which mainly determines whether each video frame is abnormal or not. We prefer Recall over Precision and report F2 [71] as another classification metric. Furthermore, our model focuses on accurately distinguishing abnormal events. As shown in Figure 1, it\u0026rsquo;s better to mark all timestamps as abnormal than to miss any. So we prioritize false negative rates (FNRs): FNRs = (number of false-negative frames) / (number of positive frames), which is the rate of mislabeling an abnormal event frame as normal. In addition, a 𝑡-test is used to evaluate the significance of the performance.\n4.4 Implementation Details # In our experiments, we utilize open-source code to obtain experimental results of all the baselines in Table 2. 
The hyper-parameters of these baselines keep the settings reported in their papers. For both Stage 1 and Stage 2, we use a batch size of 16 and train for 1 epoch with the AdamW [43] optimizer and a cosine learning rate decay schedule with a warm-up period. The initial learning rate is 2e-5. The hyper-parameter 𝛼 in L is set to 0.4. We tune the Video-LLaVA model using LoRA [22]. The LoRA matrix dimension, scaling factor, and dropout rate are 16, 64, and 0.05, respectively. Experiments are run on a single NVIDIA A100 GPU with 40GB memory. Stage 1 training takes about 16 hours, Stage 2 takes 60 hours, and inference takes about 8 hours.\n5 Results and Discussions # 5.1 Experimental Results # Table 2 and Table 4 show the performance comparison of different models on our M-VAE task, and we can see that: For extraction performance, our Sherlock model outperforms all baselines, with an average improvement of 10.85 (𝑝-value \u0026lt; 0.05) over the second-best model. Specifically, our Sherlock model surpasses the second-best model by an average of 9.9 (𝑝-value \u0026lt; 0.05), 8.59 (𝑝-value \u0026lt; 0.05), and 9.52 (𝑝-value \u0026lt; 0.05) in the average Single, Pair, and Quadruple metrics, justifying the effectiveness of Sherlock on the extraction task. For localization performance, our Sherlock model exceeds the second-best model by 11.42 (𝑝-value \u0026lt; 0.01) in the average mAP@tIoU metric, justifying the effectiveness of Sherlock on the localization task. Furthermore, for classification performance, Sherlock improves over the second-best model by 18.38 (𝑝-value \u0026lt; 0.01) in FNRs and 14.43 (𝑝-value \u0026lt; 0.01) in F2. This implies the importance of our global and local information and justifies the effectiveness of our Sherlock model on our task.\nTable 3: Comparison of several advanced Video-LLMs and Sherlock on the 14 scenes of the M-VAE dataset with FNRs.\n| Models | School | Shop | Underwater | Street | Road | Boat | Wild | Forest | Residence | Bank | Commercial | Factory | Lawn | Other |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| VideoChat | 39.57 | 39.47 | 37.3 | 36.81 | 27.41 | 35.32 | 33.27 | 33.36 | 35.95 | 40.59 | 38.97 | 45.52 | 35.26 | 49.04 |\n| VideoChatGPT | 45.91 | 41.98 | 39.36 | 41.41 | 30.11 | 38.19 | 36.32 | 37.73 | 37.54 | 44.5 | 42.96 | 40.78 | 36.28 | 52.33 |\n| Valley | 46.68 | 43.76 | 41.37 | 44.24 | 35.66 | 42.15 | 46.78 | 39.25 | 42.15 | 48.35 | 48.31 | 47.21 | 37.11 | 53.09 |\n| PandaGPT | 34.56 | 35.65 | 34.47 | 36.48 | 24.42 | 35.85 | 31.78 | 32.37 | 34.18 | 38.55 | 37.89 | 41.46 | 31.17 | 44.24 |\n| mPLUG-Owl | 54.13 | 54.41 | 53.21 | 47.34 | 36.51 | 45.02 | 58.37 | 46.31 | 45.63 | 57.94 | 56.88 | 53.14 | 54.74 | 59.56 |\n| ChatUniVi | 52.51 | 48.82 | 47.52 | 48.68 | 35.53 | 44.41 | 59.88 | 45.96 | 44.34 | 54.92 | 55.66 | 51.12 | 52.22 | 55.48 |\n| Video-LLaVA | 45.27 | 37.43 | 34.63 | 38.84 | 27.76 | 32.54 | 26.41 | 30.29 | 31.45 | 21.19 | 29.84 | 20.08 | 30.72 | 28.31 |\n| Sherlock | 16.35 | 21.91 | 15.16 | 24.24 | 14.63 | 20.96 | 17.29 | 18.48 | 20.43 | 11.21 | 23.43 | 8.96 | 21.44 | 13.6 |\nFigure 5: Convergence analysis of other baselines, Sherlock, and its variant without specific components.\nFigure 6: The visualization of balanced spatial expert weights calculated in Eq.(5). The length of the bar in each color represents the weight of the corresponding expert (AE, ORE, BE, GE). 𝐶1 to 𝐶11 are different event types in quadruples.\n5.2 Contributions of Each Key Component # In order to further investigate the contributions of different modules of Sherlock, we conduct an ablation study on our Sherlock model. As shown in Table 2, w/o AE, w/o ORE, w/o BE, w/o GE, w/o EG, and w/o pre-tuning denote Sherlock without each of the four Spatial Experts, without the Expert Gate, and without the pre-tuning stage in Sec. 3.2, respectively.\nFigure 7: (a) is the visual comparison of our SIR and (b) is the comparison of the average inference time for a one-minute video between Sherlock and other Video-LLMs.\nTable 4: Comparison of localization and anomaly classification task with several well-performing non-LLM models.\n| Models | mAP@tIoU 0.1 | mAP@tIoU 0.2 | mAP@tIoU 0.3 | mAP@tIoU Average | FNRs | F2 |\n| --- | --- | --- | --- | --- | --- | --- |\n| BiConvLSTM [19] | 52.74 | 37.31 | 31.12 | 40.39 | 68.05 | 44.48 |\n| SPIL [55] | 53.28 | 38.89 | 32.91 | 41.69 | 67.84 | 46.87 |\n| FlowGatedNet [7] | 53.64 | 39.64 | 33.18 | 42.15 | 67.24 | 46.55 |\n| X3D [53] | 54.52 | 40.05 | 34.96 | 43.17 | 65.08 | 48.65 |\n| HSCD [12] | 56.14 | 42.87 | 35.28 | 44.76 | 60.36 | 52.28 |\n| Sherlock | 94.03 | 82.59 | 76.12 | 84.24 | 17.24 | 83.59 |\nEffectiveness Study of Global and Local Spatial Expert. From Table 2, we can see that: The performance of w/o AE, w/o ORE, w/o BE and w/o GE degrades in all metrics, with average degradations of 7.54 (𝑝-value \u0026lt; 0.01), 7.57 (𝑝-value \u0026lt; 0.01), 4.37 (𝑝-value \u0026lt; 0.01), and 5.68 (𝑝-value \u0026lt; 0.01) in FNRs, F2, average mAP@tIoU, and average event extraction metrics, respectively. This confirms the importance of global and local information in extracting and localizing abnormal events, and shows that Sherlock models this information well.\nEffectiveness Study of Spatial Imbalance Regulator. From Table 2, we can see that: 1) Compared with Sherlock, w/o EG shows poorer performance in all metrics, with FNRs, F2, average mAP@tIoU, and average extraction performance degrading by 15.34 (𝑝-value \u0026lt; 0.01), 16.52 (𝑝-value \u0026lt; 0.01), 8.62 (𝑝-value \u0026lt; 0.05) and 10.36 (𝑝-value \u0026lt; 0.01), respectively. This demonstrates the effectiveness of GSM in global-local spatial modeling and encourages us to consider handling heterogeneity issues between spatial information in the manner of MoE. 
2) From Table 2, we can see that compared to the performance of w/o SIR, the performance of w/o MG is poorer, with FNRs, F2, average mAP@tIoU, and average event extraction metrics degrading by 1.94 (𝑝-value \u0026lt; 0.05), 3.9 (𝑝-value \u0026lt; 0.05), 1.13 (𝑝-value \u0026lt; 0.05) and 4.84 (𝑝-value \u0026lt; 0.05), respectively. This further demonstrates the effectiveness of Lgate in global-local spatial balancing and encourages us to consider using SIR to better balance spatial information. 3) In addition, we record the weights of four spatial experts after training in Figure 6 and Figure 7 (a). We can see that the weights of all experts are relatively balanced, and each expert demonstrates outstanding professional abilities when facing different types of abnormal videos.\nFigure 8: Two visualized samples to compare Sherlock with other Video-LLMs.\nEffectiveness Study of Pre-tuning. From Table 2, we can see that w/o pre-tuning, the performance is inferior to Sherlock. FNRs, F2, average mAP@tIoU, and average event extraction metrics degrade by 17.63 (𝑝-value \u0026lt; 0.01), 16.95 (𝑝-value \u0026lt; 0.01), 10.92 (𝑝-value \u0026lt; 0.01) and 11.48 (𝑝-value \u0026lt; 0.01), respectively. This further justifies the effectiveness of pre-tuning, and encourages us to use more high-quality datasets to enhance the spatial understanding ability of Video-LLMs before instruction-tuning.\n5.3 Convergence Analysis and Practical Assessment for Sherlock # In order to analyze the convergence of Sherlock, we record the loss of baseline Video-LLMs, Sherlock, and its variant without specific components over various training steps. The results are shown in Figure 5 and we can see that: 1) Sherlock demonstrates the fastest convergence compared to other Video-LLMs. At the convergence point, the loss of Sherlock is 1.05, while that of Video-LLaVA is 2.06. This underscores the high efficiency of Sherlock over other advanced Video-LLMs. 
2) Sherlock converges faster than its variant without specific components in Figure 5. This indicates that the spatial information along with GSM and SIR can accelerate the convergence process, which further encourages us to consider the spatial information in the M-VAE task.\nTo assess practicality, we analyze the FNRs of Sherlock for each scene. As shown in Table 3, Sherlock achieves the lowest FNRs among all Video-LLMs in every scene. This indicates that the possibility of misclassifying abnormal events as normal events is minimized, thereby demonstrating the importance of global and local spatial modeling of Sherlock. We also analyze the average inference time in seconds for a one-minute video. As shown in Figure 7 (b), Sherlock does not perform much differently from the other models in terms of inference time. This is reasonable, as some studies confirm that the MoE architecture can improve efficiency [11, 28]. This suggests that introducing more information along with a MoE module for the M-VAE task does not noticeably increase the inference time and Sherlock can maintain good inference efficiency.\n5.4 Qualitative Analysis for Sherlock # As shown in Figure 8, we visualize and compare Sherlock with other Video-LLMs. We randomly select two samples from our dataset and ask these models to \u0026ldquo;Analyze the following video and localize the timestamp and extract the quadruple of the abnormal events.\u0026rdquo; From the figure, we can see that: 1) Accurately localizing abnormal events and extracting correct quadruples is a huge challenge. For instance, example 2 captures a segment from 9s to 15s, where identifying the collision of the truck on the road is challenging. 2) Compared with other advanced Video-LLMs, Sherlock shows excellent performance in localizing abnormal events. In example 1, Sherlock outperforms other models in terms of accuracy. In example 2, it outperforms PandaGPT in terms of accuracy and can generate a correct quadruple. 
This further demonstrates the effectiveness of Sherlock in precisely extracting and localizing abnormal events.\n6 Conclusion # In this paper, we first propose a new M-VAE task and construct an instruction dataset, making a significant contribution to future research on abnormal events. Second, we propose a Global-local Spatial-sensitive LLM named Sherlock to assist in localizing and extracting abnormal event quadruples. This model includes a Global-local Spatial-enhanced MoE module and a Spatial Imbalance Regulator to model and balance spatial information. Finally, our experimental results demonstrate the outstanding performance of Sherlock. In future work, we hope to consider the relationships between events and enrich our tasks with event inference to improve the performance of extraction. In addition, we also hope to improve the interpretability of our model by providing explanations for each abnormal event.\nAcknowledgments # We thank our anonymous reviewers for their helpful comments. This work was supported by three NSFC grants, i.e., No.62006166, No.62376178 and No.62076175. This work was also supported by a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).\nReferences # [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).\n[2] Antoine Bosselut, Jianfu Chen, David Scott Warren, Hannaneh Hajishirzi, and Yejin Choi. 2016. Learning Prototypical Event Structure from Photo Albums. In Proceedings of ACL 2016.\n[3] Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S.-H. Gary Chan, and Hongyang Zhang. 2024. Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models. CoRR abs/2406.16866 (2024).\n[4] Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. 2022. When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations. 
In Proceedings of ICLR 2022. 2.\n[5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. 2023. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. CoRR abs/2312.14238 (2023).\n[6] Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S. Huang, and Lei Zhang. 2020. HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation. In Proceedings of CVPR 2020. 5385–5394.\n[7] Ming Cheng, Kunjing Cai, and Ming Li. 2020. RWF-2000: An Open Large Scale Video Database for Violence Detection. In Proceedings of ICPR 2020. 4183–4190.\n[8] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/\n[9] Yuren Cong, Michael Ying Yang, and Bodo Rosenhahn. 2023. RelTR: Relation Transformer for Scene Graph Generation. IEEE Trans. Pattern Anal. Mach. Intell. 45, 9 (2023), 11169–11183.\n[10] Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jianhang Chen, Qimei Cui, and Xiaofeng Tao. 2024. Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly. CoRR abs/2405.00181 (2024).\n[11] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. 2021. MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection. In Proceedings of CVPR 2021. 14009–14018.\n[12] Guillermo Garcia-Cobo and Juan C. SanMiguel. 2023. Human skeletons and change detection for efficient violence detection in surveillance videos. Comput. Vis. Image Underst. 
233 (2023).\n[13] Rohit Girdhar, João Carreira, Carl Doersch, and Andrew Zisserman. 2019. Video Action Transformer Network. In Proceedings of CVPR 2019. 244–253.\n[14] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. ImageBind One Embedding Space to Bind Them All. In Proceedings of CVPR 2023. 15180–15190.\n[15] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. 2019. Memorizing Normality to Detect Anomaly: Memory-Augmented Deep Autoencoder for Unsupervised Anomaly Detection. In Proceedings of ICCV 2019. 1705–1714.\n[16] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. 2022. Generating Diverse and Natural 3D Human Motions From Text. In Proceedings of CVPR 2022. 5142–5151.\n[17] Huiwen Guo, Xinyu Wu, Nannan Li, Ruiqing Fu, Guoyuan Liang, and Wei Feng. 2013. Anomaly detection and localization in crowded scenes using short-term trajectories. In Proceedings of ROBIO 2013. 245–249.\n[18] Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. 2023. OneLLM: One Framework to Align All Modalities with Language. CoRR abs/2312.03700 (2023).\n[19] Krishnagopal Sanjukta Davis Larry Hanson Alex, PNVR Koutilya. 2019. Bidirectional Convolutional LSTM for the Detection of Violence in Videos. In Proceedings of ECCV 2018. 280–295.\n[20] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K. Roy-Chowdhury, and Larry S. Davis. 2016. Learning Temporal Regularity in Video Sequences. In Proceedings of CVPR 2016. 733–742.\n[21] Yu Hong, Jianfeng Zhang, Bin Ma, Jian-Min Yao, Guodong Zhou, and Qiaoming Zhu. 2011. Using Cross-Entity Inference to Improve Event Extraction. In Proceedings of ACL 2011. 1127–1136.\n[22] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. 
LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of ICLR 2022. 2.\n[23] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive Mixtures of Local Experts. Neural Comput. 3, 1 (1991), 79–87.\n[24] Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, and Sujoy Paul. 2024. Mixture of Nested Experts: Adaptive Processing of Visual Tokens. CoRR abs/2407.19985 (2024).\n[25] Heng Ji and Ralph Grishman. 2008. Refining Event Extraction through CrossDocument Inference. In Proceedings of ACL 2008. 254–262.\n[26] Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. 2023. Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding. CoRR abs/2311.08046 (2023).\n[27] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. 2024. Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation. In Proceedings of CVPR 2024. 9492–9502.\n[28] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. 2023. Segment Anything. In Proceedings of ICCV 2023. 3992–4003.\n[29] Federico Landi, Cees G. M. Snoek, and Rita Cucchiara. 2019. Anomaly Locality in Video Surveillance. CoRR abs/1901.10364 (2019).\n[30] Bobo Li, Hao Fei, Fei Li, Yuhan Wu, Jinsong Zhang, Shengqiong Wu, Jingye Li, Yijiang Liu, Lizi Liao, Tat-Seng Chua, and Donghong Ji. 2023. DiaASQ: A Benchmark of Conversational Aspect-based Sentiment Quadruple Analysis. In Proceedings of ACL 2023. 13449–13467.\n[31] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023. Otter: A Multi-Modal Model with In-Context Instruction Tuning. CoRR abs/2305.03726 (2023).\n[32] Haifeng Li, Xin Dou, Chao Tao, Zhixiang Wu, Jie Chen, Jian Peng, Min Deng, and Ling Zhao. 2020. 
RSI-CB: A Large-Scale Remote Sensing Image Classification Benchmark Using Crowdsourced Data. Sensors 20 (2020).\n[33] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of ICML 2023. 19730–19742.\n[34] Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. 2023. VideoChat: Chat-Centric Video Understanding. CoRR abs/2305.06355 (2023).\n[35] Qi Li, Heng Ji, and Liang Huang. 2013. Joint Event Extraction via Structured Prediction with Global Features. In Proceedings of ACL 2013. 73–82.\n[36] Shuo Li, Fang Liu, and Licheng Jiao. 2022. Self-Training Multi-Sequence Learning with Transformer for Weakly Supervised Video Anomaly Detection. In Proceedings of AAAI 2022. 1395–1403.\n[37] Shasha Liao and Ralph Grishman. 2010. Using Document Level Cross-Event Inference to Improve Event Extraction. In Proceedings of ACL 2010. 789–797.\n[38] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. 2023. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. CoRR abs/2311.10122 (2023).\n[39] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Proceedings of ECCV 2014. 740–755.\n[40] Zihang Lin, Chaolei Tan, Jian-Fang Hu, Zhi Jin, Tiancai Ye, and Wei-Shi Zheng. 2023. Collaborative Static and Dynamic Vision-Language Streams for SpatioTemporal Video Grounding. In Proceedings of CVPR 2023. 23100–23109.\n[41] Kun Liu and Huadong Ma. 2019. Exploring Background-bias for Anomaly Detection in Surveillance Videos. In Proceedings of MM 2019. 1490–1499.\n[42] Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Derong Xu, Feng Tian, and Yefeng Zheng. 2023. 
MOELoRA: An MOE-based Parameter Efficient Fine-Tuning Method for Multi-task Medical Applications. CoRR abs/2310.18339 (2023).\n[43] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In Proceedings of ICLR 2019. 9.\n[44] Cewu Lu, Ranjay Krishna, Michael S. Bernstein, and Li Fei-Fei. 2016. Visual Relationship Detection with Language Priors. In Proceedings of ECCV 2016. 852– 869.\n[45] Cewu Lu, Jianping Shi, and Jiaya Jia. 2013. Abnormal Event Detection at 150 FPS in MATLAB. In Proceedings of ICCV 2013. 2720–2727.\n[46] Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. 2023. Valley: Video Assistant with Large Language model Enhanced abilitY. CoRR abs/2306.07207 (2023).\n[47] Muhammad Maaz, Hanoona Abdul Rasheed, Salman H. Khan, and Fahad Shahbaz Khan. 2023. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. CoRR abs/2306.05424 (2023).\n[48] Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. 2023. Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models. CoRR abs/2311.16103 (2023).\n[49] OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023).\n[50] Joan Puigcerver, Carlos Riquelme Ruiz, Basil Mustafa, and Neil Houlsby. 2024. From Sparse to Soft Mixtures of Experts. In Proceedings of ICLR 2024 .\n[51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al . 2021. Learning transferable visual models from natural language supervision. In Proceedings of ICML 2021. 8748–8763.\n[52] Zhiyi Song, Ann Bies, Stephanie M. Strassel, Tom Riese, Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick, Neville Ryant, and Xiaoyi Ma. 2015. From Light to Rich ERE: Annotation of Entities, Relations, and Events. In Proceedings of EVENTS 2015. 
89–98.\n[53] Jiayi Su, Paris Her, Erik Clemens, Edwin E. Yaz, Susan C. Schneider, and Henry Medeiros. 2022. Violence Detection using 3D Convolutional Neural Networks. In Proceedings of AVSS 2022. 1–8.\n[54] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. 2023. PandaGPT: One Model To Instruction-Follow Them All. CoRR abs/2305.16355 (2023).\n[55] Yukun Su, Guosheng Lin, Jin-Hui Zhu, and Qingyao Wu. 2020. Human Interaction Learning on 3D Skeleton Point Clouds for Video Violence Recognition. In Proceedings of ECCV 2020. 74–90.\n[56] Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-World Anomaly Detection in Surveillance Videos. In Proceedings of CVPR 2018. 6479–6488.\n[57] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W. Verjans, and Gustavo Carneiro. 2021. Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning. In Proceedings of ICCV 2021 . 4955–4966.\n[58] Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. 2024. Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding. CoRR abs/2401.00901 (2024).\n[59] Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, and Ping Luo. 2023. UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces. arXiv preprint arXiv:2312.15715 (2023).\n[60] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. 2020. Not only Look, But Also Listen: Learning Multimodal Violence Detection Under Weak Supervision. In Proceedings of ECCV 2020. 322–339.\n[61] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. 2024. VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection. In Proceedings of AAAI 2023 . 6074–6082.\n[62] Dan Xu, Yan Yan, Elisa Ricci, and Nicu Sebe. 2017. Detecting anomalous events in videos by learning deep representations of appearance and motion. Comput. Vis. Image Underst. 
156 (2017), 117–127.\n[63] Zhixuan Xu, Chongkai Gao, Zixuan Liu, Gang Yang, Chenrui Tie, Haozhuo Zheng, Haoyu Zhou, Weikun Peng, Debang Wang, Tianyi Chen, Zhouliang Yu, and Lin Shao. 2024. ManiFoundation Model for General-Purpose Robotic Manipulation of Contact Synthesis with Arbitrary Objects and Robots. CoRR (2024).\n[64] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. 2023. Video Event Restoration Based on Keyframes for Video Anomaly Detection. In Proceedings of CVPR 2023. 14592–14601.\n[65] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. 2023. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. CoRR abs/2304.14178 (2023).\n[66] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguistics 2 (2014), 67–78.\n[67] Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. 2017. Visual Translation Embedding Network for Visual Relation Detection. In Proceedings of CVPR 2017. 3107–3115.\n[68] Hanwang Zhang, Zawlin Kyaw, Jinyang Yu, and Shih-Fu Chang. 2017. PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN. In Proceedings of CVPR 2017. 4243–4251.\n[69] Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In Proceedings of EMNLP 2023. 543–553.\n[70] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, and Nong Sang. 2024. Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM. CoRR abs/2406.12235 (2024).\n[71] Zhicheng Zhang and Jufeng Yang. 2022. Temporal Sentiment Localization: Listen and Look in Untrimmed Videos. 
In Proceedings of MM 2022. 199–208. [72] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H. Li, and Ge Li. 2019. Graph Convolutional Label Noise Cleaner: Train a Plug-And-Play Action Classifier for Anomaly Detection. In Proceedings of CVPR 2019. 1237–1246. [73] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Hongfa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Caiwan Zhang, Zhifeng Li, Wei Liu, and Li Yuan. 2024. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. In Proceedings of ICLR 2024 . ","date":"28 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/sherlock-towards-multi-scene-video-abnormal-event-extraction-and-localization-via-a-global-local-spatial-sensitive-llm/","section":"Papers","summary":"Proposes a new task (M-VAE) for structured extraction and localization of abnormal events in videos, introduces Sherlock model with a Global-local Spatial-sensitive MoE module and a Spatial Imbalance Regulator, and demonstrates its effectiveness through extensive experiments.","title":"Sherlock: Towards Multi-scene Video Abnormal Event Extraction and Localization via a Global-local Spatial-sensitive LLM","type":"other"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/azzedine-boukerche/","section":"Authors","summary":"","title":"Azzedine Boukerche","type":"authors"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/bo-hu/","section":"Authors","summary":"","title":"Bo Hu","type":"authors"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/benchmarks/cuhk-avenue/","section":"Benchmarks","summary":"","title":"Cuhk-Avenue","type":"benchmarks"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jielin-li/","section":"Authors","summary":"","title":"Jielin 
Li","type":"authors"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jieyu-lin/","section":"Authors","summary":"","title":"Jieyu Lin","type":"authors"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jing-liu/","section":"Authors","summary":"","title":"Jing Liu","type":"authors"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/liang-cao/","section":"Authors","summary":"","title":"Liang Cao","type":"authors"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/liang-song/","section":"Authors","summary":"","title":"Liang Song","type":"authors"},{"content":" Networking Systems for Video Anomaly Detection: A Tutorial and Survey # JING LIU , Fudan University, China, The University of British Columbia, Canada, and Duke Kunshan University,\nChina # YANG LIU ∗\n, Soochow University, China\nJIEYU LIN , University of Toronto, Canada\nJIELIN LI , The University of Hong Kong, Hong Kong SAR\nLIANG CAO , Massachusetts Institute of Technology, United States\nPENG SUN ∗ , Duke Kunshan University, China\nBO HU ∗ , Fudan University, China\nLIANG SONG ∗\n, Fudan University, China\nAZZEDINE BOUKERCHE , University of Ottawa, Canada\nVICTOR C.M. LEUNG , Shenzhen MSU-BIT University, China, Shenzhen University, China, and The University of\nBritish Columbia, Canada # The increasing utilization of surveillance cameras in smart cities, coupled with the surge of online video applications, has heightened concerns regarding public security and privacy protection, which propelled automated Video Anomaly Detection (VAD) into a fundamental research task within the Artificial Intelligence (AI) community. 
With advances in deep learning and edge computing, VAD has made significant progress in synergy with emerging applications in smart cities and the video internet, moving beyond the conventional research scope of algorithm engineering toward deployable Networking Systems for VAD (NSVAD), a practical hotspot at the intersection of the AI, IoVT, and computing fields. In this article, we delineate the foundational assumptions, learning frameworks, and applicable scenarios of various deep learning-driven VAD routes, offering an exhaustive tutorial for novices in NSVAD. In addition, this article elucidates core concepts by reviewing recent advances and typical solutions and aggregating available research resources accessible at https://github.com/fdjingliu/NSVAD. Lastly, this article projects future development trends and discusses how the integration of AI and computing technologies can address existing research challenges and promote open opportunities, serving as an insightful guide for prospective researchers and engineers.\nCCS Concepts: • General and reference → Surveys and overviews; • Information systems → Multimedia information systems\nAdditional Key Words and Phrases: Video anomaly detection, intelligent surveillance, representation learning, normality learning\nACM Reference Format: # Jing Liu, Yang Liu, Jieyu Lin, Jielin Li, Liang Cao, Peng Sun, Bo Hu, Liang Song, Azzedine Boukerche, and Victor C.M. Leung. 2025. Networking Systems for Video Anomaly Detection: A Tutorial and Survey. 1, 1 (April 2025), 36 pages. https://doi.org/XXXXXXX.XXXXXXX\n∗ Corresponding authors.\nAuthors\u0026rsquo; addresses: Jing Liu, jingliu19@fudan.edu.cn, Fudan University, School of Information Science and Technology, 220 Handan Road, Shanghai, 200433, China and The University of British Columbia, Department of Electrical and Computer Engineering, 2329 West Mall, Vancouver, British Columbia, V6T 1Z4, Canada and Duke Kunshan University, Division of Natural and Applied Sciences, 8 Duke Avenue, Kunshan, Jiangsu Province, 215316, China; Yang Liu, yangliu@cs.toronto.edu, Soochow University, School of Future Science and Engineering, 1 Jiuyong West Road, Wujiang District, Suzhou, Jiangsu Province, 215222, China; Jieyu Lin, jieyu.lin@mail.utoronto.ca, University of Toronto, Department of Electrical and Computer Engineering, 27 King\u0026rsquo;s College Circle, Toronto, Ontario, M5S 1A1, Canada; Jielin Li, jielinli@connect.hku.hk, The University of Hong Kong, Department of Computer Science, Pokfulam Road, Hong Kong, Hong Kong SAR; Liang Cao, liangcao@mit.edu, Massachusetts Institute of Technology, Department of Chemical Engineering, 77 Massachusetts Avenue, Cambridge, Massachusetts, 02139, United States; Peng Sun, peng.sun568@duke.edu, Duke Kunshan University, Division of Natural and Applied Sciences, 8 Duke Avenue, Kunshan, Jiangsu Province, 215316, China; Bo Hu, bohu@fudan.edu.cn, Fudan University, School of Information Science and Technology, 220 Handan Road, Shanghai, 200433, China; Liang Song, songl@fudan.edu.cn, Fudan University, Academy for Engineering \u0026amp; Technology, 220 Handan Road, Shanghai, 200433, China; Azzedine Boukerche, aboukerc@uOttawa.ca, University of Ottawa, School of Electrical Engineering and Computer Science, 75 Laurier Avenue East, Ottawa, Ontario, K1N 6N5, Canada; Victor C.M. Leung, vleung@ece.ubc.ca, Shenzhen MSU-BIT University, Artificial Intelligence Research Institute, 1 International University Park Road, Dayun New Town, Shenzhen, Guangdong Province, 518172, China and Shenzhen University, College of Computer Science and Software Engineering, 3688 Nanhai Avenue, Shenzhen, Guangdong Province, 518060, China and The University of British Columbia, Department of Electrical and Computer Engineering, 2329 West Mall, Vancouver, British Columbia, V6T 1Z4, Canada.\n© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.\nManuscript submitted to ACM\n1 INTRODUCTION # As one of the core technologies of the ubiquitous Internet of Video Things (IoVT), Video Anomaly Detection (VAD) aims to use video sensors to automatically discover unexpected spatial-temporal patterns and detect unusual events that may cause security problems or economic losses, such as traffic accidents, violent behaviors, and offending content [118]. 
With the widespread use of surveillance cameras in smart cities [149] and the boom of online video applications powered by 4G/5G communication technologies, traditional human inspection can no longer accurately monitor the video data generated around the clock; it is not only time-consuming and labor-intensive but also poses the risk of leaking important information (e.g., biometrics and sensitive speech). 
In contrast, VAD-empowered IoVT applications [67], such as Intelligent Video Surveillance Systems (IVSS) and automated content analysis platforms, can process massive video streams online and detect events of interest in real time, sending only noteworthy anomalous segments for human review. This significantly reduces data storage and communication costs and helps to ease public concerns about data security and privacy protection. As a result, VAD has gained widespread attention in academia and industry over the last decade and has been used in emerging fields [11, 13, 142] such as information forensics [211], industrial manufacturing [106, 185] in smart cities, as well as online content analysis in mobile video applications [210].\nVAD extends the data scope of conventional Anomaly Detection (AD) from time series, images, and graphs to video, which not only needs to cope with the endogenous data complexity but also needs to take into account the computational and communication costs on resource-limited devices [68]. Specifically, the inherent high-dimensional structure of video data, high information density and redundancy, heterogeneity of temporal and spatial patterns, and feature entanglement between foreground targets and background scenes make VAD more challenging than traditional AD tasks at the levels of representation learning and anomaly discrimination [128]. Existing studies [4, 77, 100, 111] have shown that high-performance VAD models need to explicitly model appearance and motion information, i.e., the difference between regular events and anomalous examples in both spatial and temporal dimensions. 
In contrast to time series AD, which mainly measures periodic temporal patterns of variables, and image AD, which only focuses on spatial contextual deviations, VAD needs to extract discriminative spatial and temporal features from a large amount of redundant information (e.g., repetitive temporal contexts and label-independent data shifts), as well as to learn the differences between normal and anomalous events in terms of local appearances and global motions [91, 99, 145].\nFig. 1. Topology diagram of the research scope of NSVAD (left) and its key sequential steps (right).\nHowever, video anomalies are ambiguous and subjective [88, 89]. The same driving behavior can be classified differently depending on road conditions and contextual environments. For example, riding a horse in a grassland is usually normal, whereas a horse appearing on a highway would be considered an anomaly. On the one hand, compared to regular events, anomalies in the real world are difficult to predefine comprehensively and occur far less frequently, making them difficult to collect. Labeling a sufficient number of all possible abnormal samples for model training is almost impossible. As a result, traditional supervised learning-based classification models are usually ineffective in dealing with AD tasks [126]. On the other hand, since video storage and transmission costs are significantly higher than those of other data modalities, engineers favor processing such data on the end or edge side to reduce communication overhead. Such devices, including surveillance cameras, smartphones, and local servers, have limited computational resources. 
Therefore, it is highly practical to develop deployable VAD systems for real-world applications, which requires concerted efforts from the AI community and beyond.\nIn this article, we extend the conventional scope of VAD from algorithm engineering on spatial-temporal anomaly detection to practical research towards real-world applications, termed Networking Systems for Video Anomaly Detection (NSVAD), to engage a broader readership from the IoT and computing communities. According to research objectives and involved domains, NSVAD is delineated into the hierarchical architecture shown in Fig. 1, encompassing: 1) Hardware Layer, consisting of various video sensors, communication units, computing servers, etc., responsible for data acquisition, transmission, processing, and result reporting, as well as device networking; 2) System Layer, targeting resource optimization and algorithm deployment platforms for large-scale IoT applications, linking the Hardware Layer and the Algorithm Layer, supporting the configurable deployment of VAD tasks on various terminals; 3) Algorithm Layer, focusing on the development of detection algorithms and scene-specific models driven by artificial intelligence, especially deep learning; 4) Application Layer, encompassing IVSS in modern factories, agriculture, and smart cities, as well as various online video applications powered by the mobile internet. Most existing works belong to the Algorithm Layer and concentrate solely on VAD model design, overlooking resource costs and challenges in real-world scenarios. For large-scale IoT and the mobile video internet, an NSVAD with stable performance and reasonable overheads that supports online detection necessitates sensor networking research and support for resource optimization from computing communities. 
We review the recent advancements and typical methods in the Algorithm Layer and provide our latest explorations in the System Layer to inspire readers to develop NSVAD toward real-world scenarios.\nThanks to the development of edge AI [177] and artificial neural networks [85], deep learning-driven NSVAD algorithms have made significant progress in recent years and derived Unsupervised (UVAD), Weakly-supervised (WsVAD), and Fully-unsupervised (FuVAD) routes [118]. They have liberated human beings from massive video analysis work and alleviated public information security concerns. Compared to early manual feature engineering, deep architectures such as convolutional neural networks and attention mechanisms can extract spatial-temporal representations from video sequences end-to-end without relying on human priors, empowering IVSS to process videos in different resolutions and scenarios. Therefore, researchers in this field are currently focusing on deep structure design and optimization learning strategies. They have creatively proposed multimodal VAD [174, 192, 193], Open-Set VAD (OSVAD) [1, 233], Open-Vocabulary VAD (OVVAD) [194], video anomaly segmentation [173], and anomaly retrieval [190] tasks, as well as integrated detection systems that can be deployed in practical scenes, such as modern manufacturing [111], smart cities [116], and automated driving [180]. In addition to algorithm design, researchers from Networking Systems of AI (NSAI, i.e., research on the deep convergence of communication and AI) [164] and IoVT (i.e., the subfield of IoT focusing on video sensor design, networking, and data processing) [90] have begun to explore the design of deployment-oriented VAD systems to collaboratively deal with multiple challenges (e.g., multi-view cross-scene heterogeneous videos and communication overheads) that come from the dynamic scenarios at the application layer and the limited resources at the hardware layer. 
These explorations and progress greatly expand the research boundaries and application scenarios of VAD, promoting it as an intelligent system science, i.e., NSVAD.\nAlthough some reviews [30, 139, 159, 160] have focused on AD and surveyed its related work, these early works, owing to their limited research horizon, usually regard VAD as a fringe research task in the AD community. They focus on time series or images but lack an illuminating analysis of the AD task in video data. Recent survey papers [15, 131, 149] continue to focus on unsupervised NSVAD routes in the same vein as the conventional AD task, i.e., using only normal samples to train generative models to learn the prototypical patterns of regular events and to discriminate uncharacterizable test samples as anomalies [122]. Such reviews provide a comprehensive taxonomy of VAD research from the outlier detection perspective and were informative over the last decade, when unsupervised methods dominated NSVAD algorithm research. However, they ignore the emerging weakly-supervised [42, 174] and fully unsupervised [140, 216] routes and are therefore of limited value in guiding further research.\nConsidering the differences in knowledge bases and orientations of readers, this article provides an in-depth analysis of the basic concepts, related knowledge, and recent advances involved in NSVAD research and summarizes the available research resources. We systematically analyze the assumptions, frameworks, scenarios, advantages, and disadvantages of unsupervised, weakly supervised, and fully unsupervised VAD routes and explain in detail the relevant domain knowledge involved in each route. In addition, we introduce our NSVAD systems designed for dynamic environments in complex scenarios such as industrial IoT [12] and smart cities [115] to guide the NSVAD research in specific applications. 
Finally, we forecast the research challenges, trends, and possible opportunities to inspire future exploration.\n1.1 Attention Analysis # We searched the number of publications and citations with the topic of \u0026ldquo;video anomaly detection\u0026rdquo; in various mainstream academic databases (e.g., IEEE Xplore, ACM Digital Library, SpringerLink, ScienceDirect, and DBLP) to quantitatively present the research hotness, as shown in Fig. 2(a). Early VAD works were limited by hand-crafted features, which cannot handle complicated videos and rely on human priors, and thus had a slow start. Encouraged by the remarkable success of deep learning in video understanding tasks (e.g., action recognition, scene understanding, expression recognition [182], and multimodal perception [201]), VAD research saw a boom after 2010, with an explosive growth in the number of publications and citations that continues to the present. On the one hand, the spread of surveillance cameras and streaming media platforms has provided sufficient data support for NSVAD research, making it possible to train large-scale deep models with high-performance GPUs. On the other hand, the increasing demand for offending video content detection in various scenarios drives many researchers and engineers from the AI and IoT fields to devote themselves to NSVAD research.\nFig. 2. Research hotness analysis. We count (a) the number of NSVAD-related publications and their citations in the past 23 years and organize (b) the AD-related workshops in conferences on Artificial Intelligence (AI), Data Mining (DM), and Computer Vision (CV).\nTable 1. Comparison with related survey papers and conference tutorials.\nColumns: Year, Ref., Perspective \u0026amp; Main Focus, Type (Survey, Tutorial), Research Routes \u0026amp; Open Task (UVAD, WsVAD, FuVAD, OSVAD, OVVAD), Content Analysis (Advances Review, Basic Knowledge, Practical Cases).\n2018 [74] Un- and semi-supervised VAD ! # G# # # # #\n2019 [30] Time series AD in IoT ! ⊙ ⊙ ⊙ ⊙ ⊙ #\n2020 [160] VAD in traffic scene ! G# # # # # #\n2022 [149] Unsupervised VAD in single-scene ! G# G# # # #\n2021 [131] Deep learning-based unsupervised VAD ! # # # # # #\n2021 [152] Unsupervised VAD in crowd scene ! # # # # # #\n2021 [139] Deep learning-based unsupervised AD ! ⊙ ⊙ ⊙ ⊙ ⊙ #\n2021 [7] Time series AD ! ⊙ ⊙ ⊙ ⊙ ⊙ # #\n2022 [15] Unsupervised and supervised VAD ! G# # # # # #\n2022 [148] Unsupervised VAD ! # # # # #\n2023 - Un- and weakly-supervised VAD ! # # # G# #\n2024 [118] Generalized VAD ! G# # # #\n2024 Ours NSVAD routes in video IoT !\n: Systematic compendium and presentation of the routes/tasks/contents. G#: Briefly mentioned. #: Not presented. ⊙: Not applicable.\nFig. 2(b) presents the AD-related workshops in the top computer science conferences in the past four years. The changes in data types and application scenarios in these workshops show that AD tasks on visual data, especially videos, have dominated the community. For example, the \u0026ldquo;DeeperAction\u0026rdquo; workshop explicitly identified anomalous behavior recognition in surveillance videos as the next research hotspot in behavior analysis. The first \u0026ldquo;ASTAD\u0026rdquo; workshop at WACV'24 centered on anomaly detection in spatial-temporal data and its application to computer vision tasks. In addition, the latest workshops, \u0026ldquo;BRAVO\u0026rdquo; and \u0026ldquo;VISION\u0026rdquo;, explored the application of AD technology in areas such as autonomous driving and modern manufacturing, further demonstrating the high research hotness and broad application prospects of VAD. 
To make it easier for beginners to navigate these workshops, we have categorized all the available resources; please see our public GitHub repository at https://github.com/fdjingliu/NSVAD.\n1.2 Related Work # To our knowledge, this is the first tutorial-type paper on NSVAD, providing a systematic overview of the basics, recent advances, and practical applications of various VAD routes. Previous papers [15, 74, 118, 149] primarily focus on literature reviews, while conference tutorials lack systematic content, as shown in Table 1. Considering non-specialists\u0026rsquo; limited background knowledge, this article emphasizes clear explanations of basic concepts and models. We introduce task definitions and learning frameworks for unsupervised, weakly supervised, and fully unsupervised VAD, as well as emerging tasks like open-set [233], open-vocabulary [194], and glance VAD [218].\nFig. 3. Content navigation of this article.\nInitially, VAD was considered a fringe topic in the broader AD community and only briefly mentioned in AD overviews [30, 139], lacking comprehensive surveys. Recent efforts have started organizing the state of VAD research [15, 133, 148, 149], but they focus mainly on unsupervised methods and overlook rising research routes like WsVAD [167] and FuVAD [192]. These routes offer reliable performance in real-world applications, and their importance is increasingly recognized. Fully unsupervised methods, in particular, allow efficient VAD model learning from large-scale video streams. Recent works, such as [118], summarize these developments but do not cover emerging tasks like OSVAD [1, 233] and OVVAD [194], which are of crucial value for IoT applications.\nResearchers in the AD community have paid attention to the continued progress and application prospects of VAD and introduced it to attendees through conference tutorials. For example, Pang et al. 
organized a tutorial titled \u0026ldquo;Recent Advances in Anomaly Detection\u0026rdquo; at CVPR 2023 (https://cvpr.thecvf.com/virtual/2023/tutorial/18560), focusing on recent work on deep learning-driven unsupervised and weakly supervised VAD. We think this is a good start and highly appreciate the contribution of the organizers to the field. However, a more unified conceptual statement, more comprehensive documentation, and more systematic analysis of challenges and opportunities are necessary to inspire a wider community of readers.\n1.3 Contribution Summary # Given that NSVAD research has become an explicit hotspot in the AI, computing, and IoT communities and shows great potential for applications in emerging scenarios such as smart cities and mobile internet, we aim to provide systematic and inspiring guidance. 
This article is aimed at researchers and engineers who understand the main concepts and basic knowledge of AI but have no experience in NSVAD. We provide a comprehensive statement of unsupervised, weakly supervised, and fully unsupervised VAD routes, as well as various types of emerging tasks to satisfy readers with different backgrounds and needs. The contributions of this article can be summarized in the following four points:\n1) To the best of our knowledge, this article is the first tutorial-type paper focusing on Networking Systems for Video Anomaly Detection, which not only provides a well-structured guide for non-specialized readers but also promises to bring together researchers from AI, IoT, and computing societies to promote NSVAD research.\n2) Focusing on AD in IoVT from the NSAI perspective, we comprehensively sort out the UVAD, WsVAD, and FuVAD routes and state the basic assumptions, learning paradigms, and applicable scenarios of each scheme.\n3) We open-source available resources (e.g., benchmark datasets, code bases, literature, workshops, and tutorials) and provide our studies on NSVAD in industry and smart cities.\n4) We analyze the development sequence between various research routes by empirically reviewing the recent advances and discuss the future vision of NSVAD in the context of trends and concerns in NSAI and IoVT.\n1.4 Section Navigation # Based on the relationship between the content of the individual sections and the NSVAD architecture shown in Fig. 1, we organize the remainder of this article and provide an intuitive navigation map shown in Fig. 3. Specifically, Section 2 states the general basics of VAD, including the task definition, type of anomalies, and application areas. Sections 3 to 5 elaborate on the learning paradigms and typical models of UVAD, WsVAD, and FuVAD, respectively. We provide detailed explanations of classic methods so that readers can better understand the core ideas and implementations. To help readers start their own research quickly, we provide a comprehensive introduction to existing datasets and evaluation metrics involved in the current work in Section 6. Finally, Section 7 discusses the future vision of NSVAD, providing an in-depth analysis of its existing challenges, development trends, and possible opportunities. This tutorial paper is intended for non-specialists, so we prioritize conceptual clarification and research horizon construction in the main paper. Content that may overlap with existing work (e.g., itemized introductions to reviewed papers, detailed explanations of classical methods, and comprehensive presentations of research cases) is relegated to the Supplement.\n2 GENERALIZED FOUNDATIONS OF NSVAD # As a cross-cutting topic, NSVAD has attracted researchers from deep learning, video surveillance, mobile internet, and edge computing communities. Initially, VAD followed the conventional setting of AD problems, where anomalies were treated as outliers with different distributions [149]. 
Corresponding unsupervised methods [10, 53, 208] aimed to learn prototypical representations of regular events, considering test videos outside the distribution as anomalies. To address the high dimensionality and complex backgrounds of video, AD researchers introduced efficient video representation learning techniques like Auto-Encoders (AEs) [141], Generative Adversarial Networks (GANs) [65, 170, 189], Transformers [43], Mamba [86], and diffusion models [5, 83, 95]. With large-scale video datasets and high-performance GPUs, deep learning-driven Video Understanding (VU) techniques have advanced, shifting VAD research to cross-cutting topics of VU and AD. New VAD routes emerged, such as WsVAD with multiple instance learning [167] and FuVAD with iterative learning [140]. These methods challenge the open-world assumption under unsupervised AD, where real-world anomalies are varied and unbounded. WsVAD incorporates anomaly instances in training to differentiate between negative and positive samples. While WsVAD requires collecting and labeling anomalies, it outputs more reliable results for specific types of abnormal events. FuVAD avoids the data constraints of UVAD and WsVAD by learning anomaly detectors directly from raw videos, reducing data preparation costs and preventing mislabeling issues.\nFig. 4. Illustration of AD-related terms’ connection with NSVAD. The categorization is inspired by [204].\nIn recent years, the rise of multimodal learning [40, 198] and large language models [147] has brought a new windfall for VAD. Researchers have proposed new tasks such as open-set VAD [1, 233], open-vocabulary VAD [194], and Video Anomaly Retrieval (VAR) [190], indicating the trend of integration between VAD and generative AI research. 
We understand the cognitive differences resulting from the research backgrounds of these fields and view the emergence of new routes and tasks as a positive signal to promote VAD to systematic NSVAD research. In addition, NSVAD systems deployed in real-world scenarios must face the challenge of domain bias due to multi-view, cross-scenario videos and consider the limited storage and communication resources of end devices. Recent NSVAD advances [111, 192] have begun considering both algorithm optimization and model deployment to balance performance and overhead.\n2.1 Related Terms # We introduce key AD-related terms, including Anomaly Detection (AD), Novelty Detection (ND), Open Set Recognition (OSR), and Outlier Detection (OD). These terms often confuse researchers from the computer vision community [98, 184], and AD practitioners struggle with inconsistent definitions [204]. We follow community consensus and our experience to clarify these terms, as shown in Fig. 4.\nAD detects samples that deviate from normality as defined by training data [92]. Such deviations include: 1) covariate shift, i.e., label-independent distributional differences due to factors like image style or equipment, and 2) semantic shift, i.e., samples from different categories. The former is addressed by Sensory Anomaly Detection (SenAD), which includes tasks like domain adaptation. UVAD, by contrast, focuses on detecting anomalous samples with different semantic labels, termed Semantic AD (SemAD).\nND is often confused with AD since its goal is also to detect samples from unknown categories [153]. ND is modeled as a binary classification problem, identifying unknown categories without concern for secondary labels [204]. Unlike SemAD, ND views unknown data positively, making UVAD similar to video novelty detection.\nOSR trains a Multi-Class Classifier (MCC) to categorize in-distribution data while detecting unknown data during testing [45]. 
VAD generally does not categorize normal events but focuses on identifying anomalies. However, complex systems like autonomous cars require both anomaly detection and fine-grained categorization, leading to the integration of OSR and VAD in Open Vocabulary VAD (OVVAD).\nOD detects outliers, i.e., samples significantly different from others [8]. Unlike AD, ND, and OSR, which detect out-of-distribution samples only during testing, OD accepts all data types during training, similar to FuVAD. FuVAD handles unfiltered videos to learn anomaly classifiers, making it superior for large-scale real-time video streams in IoVT systems.\nFig. 5. Illustration of general learning framework of (a) UVAD, (b) WsVAD, and (c) FuVAD research routes.\n2.2 Definition and Type of Anomaly # Defining anomalies and understanding their types is essential for real-world NSVAD applications. UVAD and FuVAD follow the setups of SemAD and OD tasks, where anomalies are relative, meaning anything differing from common data is considered an anomaly [26]. In contrast, WsVAD focuses on specific pre-defined anomalies.\nSpecifically, anomalies are usually categorized as sensory (raw data deviations) or semantic (label differences). VAD targets semantic anomalies, ignoring irrelevant factors like scene or camera-angle changes. Anomalies in UVAD and FuVAD fall into appearance-only, motion-only, or appearance-motion categories, corresponding to deviations in spatial, temporal, or spatial-temporal interactions. For example, in the CUHK Avenue dataset [120], a red bag on a lawn is an appearance-only anomaly. More complex anomalies often involve misalignments in appearance-motion interactions, making them harder to detect with single-dimensional models. Therefore, effective UVAD models must understand regular event patterns in appearance, motion, and spatial-temporal contexts. 
Multi-proxy task-based models address this by improving the model\u0026rsquo;s ability to distinguish between normal and anomalous events across different dimensions. WsVAD focuses on real-world hazardous events, such as crimes in the UCF-Crime dataset [167] and violent incidents in XD-Violence [192]. Although WsVAD cannot detect arbitrary anomalies, its results are more reliable. WsVAD anomalies are often categorized as short-term, long-term, or crowd anomalies, aligning with real-world concerns.\n2.3 Definition of Various NSVAD Routes # UVAD refers to NSVAD schemes that use only easily collected routine events to train models to learn the spatial-temporal pattern boundaries of normal samples [149, 220]. UVAD dominated early VAD research because it follows the open-world assumptions in the same vein as the AD community, circumventing the need to predefine and collect anomalous instances [64, 109]. In Fig. 5, we show the general learning framework of UVAD and compare it with WsVAD and FuVAD. Specifically, UVAD assumes that models trained on regular events will only describe normal spatial-temporal patterns and will exhibit significant deviations when confronted with unseen anomalous examples, measured by, e.g., probability distributions [6, 27, 155, 158], distances [32, 33, 157], and proxy task errors [50, 55, 111, 113]. Early approaches first used local binary operators [61, 129, 221], spatial-temporal points of interest [35], etc. to characterize the spatial-temporal features of normal events, then employed One-Class (OC) classifiers (e.g., OC support vector machines and OC neural networks) [66, 168] to learn the pattern boundaries, and considered test samples whose features fell outside the boundaries as anomalous. 
Such methods rely on manual features and are prone to the curse of dimensionality.\nIn recent years, deep learning-driven UVAD has integrated feature extraction and normality learning into a unified framework with two phases, training and testing, which correspond to normality learning on negative samples and anomaly detection via the detection of out-of-distribution samples [118]. In the normality learning phase, the learnable network parameters are optimized by minimizing a loss function over all negative samples. In the testing phase, the degree of anomaly is measured by quantifying the distance between the test samples and the learned normality. Among them, reconstruction-based methods have dominated UVAD research in recent years [50, 100, 111, 113, 141]. On the one hand, most of the challenges faced by such methods, such as global motion modeling, temporal normality learning, and spatial detail inference, have been intensively studied in video self-supervised learning. As a result, many methods have driven the development of UVAD by drawing inspiration from existing methods, such as video prediction [138, 183, 224]. On the other hand, since reconstruction/prediction methods aim to learn a generative model that can reason about regular events, their basic settings and optimization goals are clear and unambiguous, and thus easy to implement and follow. Essentially, UVAD is inductive learning, i.e., model training and testing are relatively independent. Numerous studies have shown that the performance of models on the proxy task in the normality learning phase does not show a positive correlation with downstream anomaly detection. Due to the diversity of events, the spatial-temporal features of normal and abnormal samples overlap, and UVAD usually fails to actively recognize discriminative features and incorrectly learns some shared patterns when only regular events are available for model training [116]. In addition, Park et al. 
[141] pointed out that overly powerful deep neural networks may effectively reason about unseen anomalous events during the testing phase because of overgeneralization, which may lead to missed detections. In response, the researchers proposed a memory network enhancement approach to weaken the model\u0026rsquo;s ability to generalize to anomalies by recording prototypical features.\nWsVAD uses weak video-level labels to supervise a ranking model that outputs strong frame-level labels, i.e., frame-by-frame anomaly scores, thus enabling temporal localization of anomalous events [71, 112]. The first weakly-supervised approach is the multiple instance ranking framework proposed by Sultani et al. [167] in 2018, which lays out the basic Multiple Instance Learning (MIL) architecture of the WsVAD route; the UCF-Crime dataset released with it has become the most widely used weakly-supervised benchmark. They consider a video as a bag of multiple instances (video clips), where clips containing abnormal frames are positive instances, while clips with all normal frames are negative instances. A normal video with label 0 thus forms a negative bag, whose instance-level labels are all 0. An abnormal video with label 1 constitutes a positive bag, which contains both positive and negative instances. Inspired by multiple instance learning, WsVAD aims to train a scoring model to output the anomaly score of each instance using video-level labels. The authors introduce a MIL ranking loss inspired by the hinge loss, which encourages the model to output high anomaly scores close to 1 for anomalous clips by maximizing the difference between the anomaly scores of the highest-scoring instances in the positive and negative bags while scoring regular clips as close to 0 as possible. 
In fact, WsVAD does not belong to any of the classes of out-of-distribution detection tasks introduced in Section 2.1 but is rather a type of multiple instance learning under weak semantic label supervision.\nFig. 6. Illustration of the objectives of (a) WsVAD, (b) Open-Set VAD, and (c) Open-Vocabulary VAD. WsVAD only detects pre-defined types of anomalies in the training set, whereas OSVAD has the open-set detection capability, which can recognize anomalies that have not been seen in the training phase. In contrast to (a) and (b), which treat anomalies as a single class, OVVAD can output specific semantic labels for both pre-defined and unseen anomalies.\nCompared to UVAD, the weakly supervised approach introduces anomalous videos in the training set and provides video-level labels for all training samples. Although anomalous examples are diverse and unenumerable in the real world, noteworthy anomalous events in specific scenarios are usually limited and easy to collect, such as thefts, robberies, and traffic accidents, which can be obtained in large quantities from surveillance IoT systems and online video platforms. In addition, the labor cost of video-level labeling, which only requires marking whether a video contains an anomaly without worrying about the specific temporal location (frame-level labeling) or detailed anomaly category, is usually affordable. For example, the UCF-Crime dataset is much larger than the datasets used by UVAD, but only 1,900 discrete {0, 1} labels need to be provided. Due to the introduction of an additional human prior and the fact that the model has seen anomalous events during the training phase, WsVAD results are typically more reliable than those of UVAD, achieving excellent and consistent performance in detecting specific anomalous events. As a result, WsVAD has become a mainstream VAD scheme and is considered the most promising research route for deployment in intelligent surveillance systems. 
Recent research attempts to mine anomaly-related clues from audio or subtitle text accompanying video frames, giving rise to multimodal WsVAD [206].\nFuVAD attempts to learn anomaly classifiers directly from large-scale raw videos without any editing and labeling [140 , 216]. Specifically, FuVAD\u0026rsquo;s training data contains both positive and negative samples, and due to the low frequency of anomalous examples compared to regular events, the anomalous samples to be detected can be regarded as outliers with different patterns from the main data. In essence, FuVAD is a form of transductive learning and does not follow the training-testing process of UVAD. Fully unsupervised methods have become a research hotspot in the internet era, where data preparation is costly, because they impose no constraints on the training data and can train models directly on huge amounts of real-world videos [118].\n2.4 Emerging Research Tasks # In Section 2.1, we mentioned one of the inherent drawbacks of WsVAD: it violates the open-set property of the AD task by only detecting specific anomalies predefined in the training set, and it cannot cope with diverse and arbitrary anomalous events in the open world. Action Recognition (AR) [76 , 169] and VAD both aim to understand specific behaviors. They can learn from each other\u0026rsquo;s research ideas in data modality and feature learning, and the pre-trained AR models on large-scale video datasets are expected to be directly used for feature extraction and anomaly semantic cue mining in VAD. However, AR follows the closed-set task setting, which can only model the spatial-temporal patterns of known categories of behaviors but cannot empower the model to recognize out-of-distribution samples. In response, researchers have proposed various open-set VAD [1 , 233] schemes to break through the above barriers, as shown in Fig. 
6(b). Zhang et al. [218] introduced the concept of glance annotation, where a single frame from an anomalous event is randomly labeled and used as an enhanced supervision signal for training weakly supervised VAD models. They provided glance annotations for the UCF-Crime and XD-Violence datasets, achieving a 5% improvement in frame-level AUC compared to the state-of-the-art, demonstrating this setting\u0026rsquo;s outstanding potential for balancing annotation costs and model performance.\nOpenVAD in [233] aims to integrate the advantages of UVAD, which can handle arbitrary anomalous events, and WsVAD, which has a low false alarm rate in detecting specific anomalies. The proposed method integrates evidential deep learning and normalizing flows into MIL to equip WsVAD with the ability to identify unknown anomalies by quantifying uncertainty. Acsintoae et al. [1] propose a dataset for supervised OSVAD, named UBnormal, that maintains the task\u0026rsquo;s open-set properties. Since the anomalous behavior in this dataset is synthetically generated, it comes with fine pixel-level labels, making it possible to train VAD models in a supervised manner. In short, this dataset attempts to bridge closed supervised learning and open anomaly detection, and experiments show that it can improve performance without compromising the open-set properties of existing VAD models. OVVAD is closest in setting to the OSR task and aims to learn a multi-class classifier capable of detecting and classifying all known and unknown anomalous events. Compared to OSVAD, OVVAD is more in line with the practical requirements of scenarios such as autonomous driving. The first OVVAD model proposed by Wu et al. 
[194] splits the task into two complementary tasks, i.e., AD and anomaly classification, and jointly optimizes them using the knowledge from the large models.\n2.5 VAD with Multimodal Large Language Models # Large Language Models (LLMs) like the Generative Pre-trained Transformer (GPT) [146] exhibit outstanding zero-shot learning and multimodal information processing abilities, showing great potential in VAD. Research has shown that multimodal LLMs can learn prototype patterns of normal events without training and describe any anomalies in open-set settings, significantly improving the generality and adaptability of VAD models. Zanella et al. [217] proposed Language-based VAD (LAVAD), a training-free paradigm that leverages pre-trained LLMs and existing VLMs for video anomaly detection. They used VLMs to generate textual descriptions of video frames and designed a prompt mechanism to unlock LLMs\u0026rsquo; potential in temporal aggregation and anomaly score estimation, enabling direct VAD execution. Lv et al. [124] introduced Video-LLaMA into VAD, aiming to break threshold limitations and improve model interpretability. The authors proposed a three-stage training method to improve the training efficiency of VLLM. In AnomalyRuler, during the induction stage, a small set of normal reference videos is provided to the LLM, enabling it to summarize normal patterns and induce rules for anomaly detection. In the deduction stage, these induced rules are applied to detect anomalous frames in test videos. Hawk in [171] uses an interactive VLM to accurately interpret video anomalies, answering VAD-related questions. It constructs an auxiliary consistency loss within the motion and video space, guiding the video branch to focus on motion modalities and establishing explicit supervision between actions and language to improve the accuracy of interpretation. Zhang et al. 
[219] developed a large-scale multimodal VAD instruction tuning benchmark called VAD-Instruct50k, used to build unbiased and interpretable VAD systems, as well as Holmes-VAD for anomaly event localization and interpretation.\n3 UNSUPERVISED VIDEO ANOMALY DETECTION # UVAD follows the general setup of semantic anomaly detection tasks, where only easily collectible regular instances are used to train models describing the normality of videos, detecting anomalies by measuring deviations between test samples and the learned model [24 , 25 , 178]. From traditional machine learning to deep representation learning [186 , 202 , 205], UVAD has undergone multiple advancements in feature extraction and normality learning, leading VAD to become a key issue in the AD and CV communities. With its close ties to the AD community and its long-standing development history, UVAD has long been regarded by researchers as the mainstream research route of NSVAD algorithms. As a result, existing surveys [131 , 149] typically focus on reviewing UVAD literature, lacking in-depth discussion of emerging weakly supervised [42 , 174] and fully unsupervised [140 , 216] routes, and overlooking novel tasks such as open-set and open-vocabulary detection. This article not only systematically reviews the latest developments in these new routes and tasks but also provides foundational knowledge and classic methods of UVAD in this section.\nBased on the means of normality learning and deviation calculation principles, existing deep learning methods are generally divided into three categories: distance-based [32 , 33 , 157], probability-based [6 , 27 , 155 , 158], and reconstruction-based [4 , 111 , 117 , 121]. From our perspective, distance-based UVAD methods are a more general form of probability-based and reconstruction-based methods because probability deviation and reconstruction error calculation are fundamentally just different distance measurement methods. 
Specifically, distance-based methods include using one-class classifiers, such as OC-SVM and OC-NN [66 , 168], to learn spatial-temporal representations of video sequences in the deep feature space. The drawback of such methods is that the trained models cannot be incrementally updated with new data, so classifiers must be retrained from scratch when new data arrive, such as in scenarios involving scene transitions. Another type of distance-based method uses Gaussian mixture models [195] to model the feature vectors of normal videos and measures deviations using the Mahalanobis distance. In contrast, probability-based methods attempt to map the spatial-temporal representations of regular events into a probability framework and discriminate anomaly instances by measuring differences in probability distributions. Such methods tend to use traditional models such as Markov random fields [6] to build the probability space. Deep learning-based attempts incur significantly higher computational costs and are noticeably slower at inference.\nIn fact, the most prevailing deep UVAD approach is reconstruction-based [149]. On the one hand, reconstruction-based methods [100 , 111 , 141] aim to train models to represent the general spatial-temporal patterns of regular events through self-supervised proxy tasks, benefiting from advances in video self-supervised learning [69] and deep neural networks. On the other hand, such methods avoid complex mathematical computations, are easy to implement, and exhibit excellent performance, and have thus been widely adopted by researchers. The premise of reconstruction-based methods is that generative models trained on massive normal samples can effectively infer the spatial-temporal patterns of regular events. For anomalous events, the performance of proxy tasks will significantly decrease, and the resulting error can be used as a quantitative basis for measuring deviations to calculate anomaly scores. 
Common generative models include deep autoencoders [28 , 55 , 110 , 113 , 141], variational autoencoders [41 , 200], and generative adversarial networks [20 , 99 , 134]. Proxy tasks include reconstructing input sequences and predicting future frames, which belong to pixel-level image generation. However, most reconstruction-based methods only calculate frame-level errors as anomaly scores without performing spatial localization, as the spatial contribution of anomalies is typically not significant. Spatial localization can only serve as a qualitative visualization result.\nTable 2. Systematic Taxonomy of UVAD.\n3.1 Taxonomy and Advances # This article also systematically reviews deep learning-driven UVAD methods, providing a taxonomy that can aid in understanding the current state of research and inspire further exploration. However, due to space constraints and the emphasis of the tutorial paper on guiding beginners, we only present the underlying logic of the proposed taxonomy and summarize the research trends of the latest advances in the main text. For a more detailed description of the existing UVAD methods, please refer to Section 1.1 of the Supplement.\nThe methods based on distance and probability described earlier focus more on handcrafted features [27 , 161] and traditional classification [176] models, which have been largely surpassed by reconstruction-based methods in the era of deep learning. Therefore, existing classification systems appear outdated in delineating recent advancements and reflecting research trends, failing to highlight the latest challenges and directions. 
To address this, inspired by data preprocessing techniques and forms of deep learning modeling, we categorize UVAD into two main classes: Global Normality Learning (GNL) and Local Prototype Modeling (LPM).\nGNL utilizes the entire video sequence as input, often requiring no additional preprocessing such as spatial-temporal cube partitioning or foreground object extraction, and employs end-to-end deep neural networks to directly learn video normality [17 , 50 , 63 , 111 , 116 , 141]. Over the years, researchers have believed that videos possess two informational dimensions, namely spatial and temporal, corresponding to appearance and motion, requiring different approaches to capture their normality. Therefore, within our UVAD taxonomy, GNL is further subdivided into Single-Proxy Task [50 , 100 , 141] and Multi-Proxy Task methods [17 , 111 , 113]. In contrast, LPM methods argue that video data contain a great deal of redundant information in addition to the clues related to normality. Thus, they opt to use spatial-temporal cubes [109 , 125 , 154 , 156] or foreground objects [4 , 105] containing dense and effective information as network inputs instead of the entire video sequence, focusing on prototype feature learning of local patches. We classify such methods into Spatial-Temporal Patch-based and Object-driven methods.\nUVAD methods can be grouped into single-task, multi-task, spatial-temporal patch-based, and foreground object-driven approaches. The first two categories, which input either full RGB frames or optical flow sequences, are considered global normality learning methods. In contrast, the latter two approaches focus on modeling local image patches or salient foreground objects, which are categorized under local prototype modeling. A taxonomy of these methods is presented in Table 2, showcasing their latest developments and interrelations. 
Specifically, single-task methods treat the spatial and temporal patterns in video as entangled, typically employing a single network structure to execute a unified task for learning spatiotemporal normality. These methods are easy to design and train but may perform suboptimally in handling diverse anomalies in complex scenes. In comparison, multi-task methods regard appearance and motion as distinct information dimensions, using multi-branch networks (e.g., parallel autoencoders or encoder-decoder architectures) to perform different tasks for learning spatial and motion normality separately. This approach effectively handles anomalies involving appearance, motion, or a combination of both and has demonstrated outstanding performance in industrial, traffic, and medical applications. LPM methods address the redundant information in raw image sequences, which increases data processing costs and introduces noise that can degrade model performance. These methods first identify relevant spatiotemporal regions through preprocessing techniques before modeling them. Specifically, spatial-temporal patch (STP) methods assume that anomalies occupy small spatiotemporal regions in the video, and thus, modeling local cubes of data allows for precise spatiotemporal anomaly localization. Foreground object-driven (FOD) methods focus on analyzing patterns of foreground objects, leveraging pre-trained object detection models to extract the regions of interest for subsequent modeling.\n3.1.1 Global Normality Learning. Convolutional neural network-driven deep representation models can directly learn task-relevant spatial-temporal representations from raw video sequences and can adapt to different scale inputs and feature dimension requirements through simple structural adjustments. Global normality learning aims to learn video normality directly from complete RGB videos or optical flow sequences [108]. 
Compared to local prototype modeling, GNL requires no additional data preparation and is easy to optimize.\nThe earliest methods did not distinguish between spatial and temporal information, typically using only RGB videos as input and employing a single self-supervised proxy task (e.g., sequence reconstruction or future frame prediction [223]) to broadly learn spatial-temporal normality. Generally, methods based on a single proxy task focus on designing more efficient single-stream end-to-end deep structures. Recent efforts include the introduction of 3D convolutional networks [50] and convolutional long short-term memory networks [121] to enhance the representation capability of spatial-temporal features. Subsequent researchers discovered that spatial and temporal information have different characteristics, with the former focusing on local pixel inference and the latter on modeling global dynamics. Moreover, the addition of proxy tasks such as reconstruction and prediction losses often brings additional performance gains without significantly increasing training costs. Thus, they proposed introducing additional proxy tasks within the GNL framework. In addition to the separation of spatial-temporal normality learning [16 , 113] and simple proxy task stacking [19 , 225], recent work has also explored novel tasks such as appearance-motion consistency [9], spatial-temporal coherence [23 , 136], and correlation [222].\n3.1.2 Local Prototype Modeling. In contrast, local prototype modeling [4 , 105 , 109 , 125 , 154 , 156] treats a video as an information cube with dimensions ℎ × 𝑤 × 𝑙, where ℎ and 𝑤 denote the spatial height and width, and 𝑙 represents the number of frames. It is observed that background information repeats across frames. On one hand, anomalies of interest typically occupy only a small portion of the information volume within the entire cube, and direct learning from the complete sequence often entails high computational costs. 
On the other hand, separating regions with different information densities and modeling their relationships with each other is beneficial for understanding the interaction between events. To address this, researchers have proposed local prototype modeling, which aims to reduce the processing of repetitive information to lower training costs and to model the relationship between foreground targets and background scenes to enhance anomaly detection performance. According to the data preprocessing methods, we categorize such methods into Spatial-Temporal Patch-based (STP) and Foreground Object-Driven (FOD) methods. The former employs simple spatial-temporal segmentation to divide the video into several information bodies, while the latter relies on pre-trained object detection models (e.g., RCNN [48], FPN [197], and YOLO [151]) to selectively learn the spatial-temporal normality of specific subjects.\n3.2 Classic UVAD Models # Considering the limited guiding value of a mere progress review for beginners, this article selects representative UVAD methods and introduces their research motivations and core ideas in detail. The selected methods include: 1) the Future Frame Prediction (FFP) framework [100], which introduces video prediction into the VAD task for the first time, and 2) Memory-Guided Normality Learning (MGNL) [141], the first memory network for VAD. We elaborate on the implementation process and comprehensively review the related basic knowledge in Section 2.1 of the Supplement.\n4 WEAKLY-SUPERVISED VIDEO ANOMALY DETECTION # Inspired by multiple instance learning (MIL) [57], WsVAD organizes videos into bags containing several instances. All segment instances from the same bag share a video-level label [62]. Under this setting, 𝑌 = 0 indicates negative bags while 𝑌 = 1 indicates positive ones containing at least one anomalous instance. 
WsVAD strikes a balance between performance and data preparation cost, bridging the gap between UVAD with unreliable results and supervised learning with fine labels. Specifically, unlike unsupervised methods that train models only on regular events and simply consider all unseen samples as anomalies, WsVAD is trained with positive samples, enabling it to effectively understand the inherent differences between normal and anomalous instances. Consequently, existing research indicates that WsVAD yields more reliable results and outperforms UVAD schemes with lower false alarm rates in real-world tasks, such as crime behavior identification and traffic accident detection [93]. Annotating long video sequences frame by frame for supervised learning is often impractical. For instance, the training set of the UCF-Crime dataset includes 1,610 long videos containing 13,741,393 frames in total. In contrast, WsVAD only requires coarse semantic video-level annotations, with each video needing only a discrete binary label. Various types of anomalies are roughly labeled as 1, requiring minimal expert involvement in data annotation and significantly reducing labor costs compared to fine-grained labeling. Thus, although WsVAD no longer adheres to the open-world task setting of the AD community and can only detect specific anomalies, it has garnered widespread attention in recent years due to its stable performance and reliable results. Subsequent works typically adopt the MIL regression task setting proposed by Sultani et al. [167] and utilize their concurrently open-sourced UCF-Crime dataset as a benchmark. 
To further validate the generalization ability of weakly-supervised models, some researchers relocate positive samples from the test set of UVAD datasets to the training sets and provide video-level labels, proposing reconfigured datasets such as UCSD Ped2 and ShanghaiTech Weakly for WsVAD validation.\n4.1 Taxonomy and Advances # The superior performance of WsVAD on real-world videos has inspired researchers to develop VAD models tailored to the complexities of real-life scenarios. In 2020, Wu et al. [192] extended the research scope of VAD from single-modal video pattern analysis to multimodal learning to leverage heterogeneous data in real-world scenes for enhanced anomaly event detection. Their collected XD-Violence dataset is the first multimodal VAD benchmark, comprising RGB images and audio modalities, focusing on violent behavior detection in complex scenes. This work not only provides the first multimodal VAD dataset and solution but also motivates researchers in the community to explore anomaly clues from multimodal data such as audio and text, leading to a new wave of VAD development: multimodal WsVAD. In this section, we comprehensively review the latest advancements in both single-modal and multimodal WsVAD, and illustrate the foundational knowledge and specific implementations required for WsVAD using the Multi-Instance Ranking (MIR) framework and the Holistic and Localized Network (HL-Net) as examples.\nUnimodal methods take only visual data as input and attempt to learn the pattern differences between normal and anomalous events based on the appearance and motion semantics reflected in the RGB sequences. In contrast, multimodal methods explore additional data modalities beyond visuals, such as audio and text, using them as complementary semantics to improve the model\u0026rsquo;s anomaly detection capabilities. 
Although the inclusion of additional modalities increases the data processing and model training costs, existing studies show that audio and text can significantly enhance WsVAD\u0026rsquo;s ability to detect anomalies, especially in cases where visual information alone cannot distinguish anomalies effectively. Since multimodal methods still need to model spatiotemporal patterns in the visual modality and continue to use the same weak supervision task setup and multi-instance learning strategies as unimodal approaches, certain strategies already explored and validated in unimodal WsVAD (such as dataset bias correction, label noise reduction, and hinge loss optimization) can be incorporated into multimodal WsVAD methods, particularly for the visual data processing branch.\n4.1.1 Unimodal Methods. Unimodal WsVAD (UWsVAD) methods focus on extracting anomaly-related cues from RGB image sequences [114]. These methods generally follow a three-step process: 1) preprocessing videos into several non-overlapping segments, 2) extracting spatial-temporal features using models like Convolutional 3D Networks [175] and Inflated 3D Networks [14], and 3) computing anomaly scores using a multi-instance ranking loss to differentiate between normal and anomalous instances. The Multi-Instance Ranking (MIR) framework [167] first introduced MIL to WsVAD with a focus on predicting higher anomaly scores for anomalous segments while minimizing score fluctuations. For a more detailed description of the existing UWsVAD methods, please refer to Section 1.2.1 of the Supplement. Moreover, given that WsVAD datasets are typically collected from real-world scenarios and contain identity-sensitive information such as faces, Fioresi et al. [44] proposed a privacy-preserving VAD framework named TeD-SPAD. They first anonymized video frames using a UNet to eliminate privacy information before using the I3D network to extract spatial-temporal features. 
Results showed that TeD-SPAD successfully prevented 32% of visual information leakage.\n4.1.2 Multimodal Methods. Multimodal approaches in WsVAD integrate various data types, primarily focusing on the fusion of video with audio [188] and text [190]. These methods face significant challenges due to the scarcity of benchmark datasets and standard comparison metrics, which has hindered their widespread adoption.\nOne of the hallmark advancements in multimodal VAD is the effective fusion of video and audio data. Existing research [143 , 187 , 214] primarily uses the XD-Violence [192] dataset as the evaluation benchmark, which introduces audio into video violence detection. More information about the video-audio-based VAD methods is provided in Section 1.2.2 of the Supplement.\nMoreover, with the rise of Visual-Language Learning (VLL), particularly through the emergence of LLMs, researchers have begun to leverage textual information to further enhance VAD models\u0026rsquo; performance. Pre-trained vision LLMs can describe appearance and motion information in videos without prior samples, embedding such text as prompts into visual representations. This integration represents a significant advancement in the fusion of video and text, which improves the model\u0026rsquo;s ability to express complex anomalies. The incorporation of textual information not only enhances generalizability and interpretability but also gives rise to new research tasks of practical value, such as video anomaly retrieval [190] and open vocabulary VAD [194], as illustrated in Section 2.4.\nSpecifically, VLL models [203] can provide accurate textual descriptions of video frames, enhancing the semantic mining capability of existing vision-based VAD models. For instance, Chen et al. 
[21] used a language model-based captioning network to obtain textual descriptions of video sequences, which, after being embedded in a text embedding network, were fused with visual features as inputs to the anomaly detector. Their proposed Text Empowered Video Anomaly Detection (TEVAD) efficiently captures abstract semantics of anomaly events and enhances the interpretability of VAD models. Pu et al. [144] introduced a Prompt-Enhanced Learning (PEL) module, using knowledge-based prompts to incorporate semantic priors, improving the discriminative power of visual features in weakly supervised VAD, while ensuring separability between anomalous subcategories. Wu et al. [190] proposed VAR, which efficiently detects specific anomalies based on cross-modal learning (e.g., language descriptions and synchronized audio). They designed an Anomaly-Led Alignment Network (ALAN) using BERT [72] to process text information and incorporate a pretext task to enhance semantic alignment between video-text fine-grained representations.\nIn addition, visual-language associations can serve as effective cues for detecting video anomalies, and the pre-learned visual-text consistency in large VLL models can be efficiently transferred to VAD. Kim et al. [73] used large language models to generate textual descriptions of video frames and detected anomalous frames by calculating the cosine similarity between input frames and their textual descriptions using CLIP. Text Prompt with Normality Guidance [206] leverages the language-visual knowledge of the CLIP model to align video frames with textual descriptions of events, generating more accurate pseudo-labels for WsVAD, thus improving model performance.\nIn surveillance videos, it is often difficult to capture synchronized audio or text data, making unimodal methods the mainstream approach. 
However, with the rise of live streaming platforms and the film industry, multimodal methods will play a crucial role in online video moderation and content detection for TV shows, films, and animations. On the one hand, the context of such content is often highly varied, with anomalies taking many forms, such as visual violence or non-compliant audio or text, making it difficult for unimodal methods relying solely on RGB sequences to handle effectively. On the other hand, these types of videos typically capture synchronized audio and provide subtitles or other multimodal data, allowing multimodal methods to function without additional data preparation costs.\n4.2 Classic WsVAD Models # We select two representative methods from unimodal and multimodal WsVAD, i.e., MIR [167] and HL-Net [192], to elaborate on the concepts of MIL and multimodal information processing in WsVAD research, respectively. Specifically, MIR has laid the foundation for the MIL-based solution, marking a milestone in VAD research. However, its optimization objectives and detailed implementation are challenging for many researchers in the AD and CV communities to understand. Therefore, we provide a comprehensive exposition of the motivation, theoretical logic, and related knowledge. In contrast, HL-Net is the first model for multimodal violence detection. The simultaneously released XD-Violence dataset has inspired many researchers from the AD and multimodal understanding fields to delve into the emerging hotspot of multimodal VAD. 
We choose this method as a case study to disseminate knowledge of multimodal understanding to the AD community, aiming to propel multimodal VAD from simple modality fusion towards systematic exploration of multimodal anomaly clues. Please see Section 2.2 of the Supplement for more information.\n5 FULLY UNSUPERVISED VIDEO ANOMALY DETECTION # FuVAD follows the transductive learning setup of traditional outlier detection tasks, aiming to directly learn an anomaly classifier from all unfiltered observations to detect samples significantly different from the primary data. Existing methods, partly inspired by time series outlier detection research, utilize deep clustering to identify pattern centers of the data and consider samples far from the learned centers as anomalies. However, the pattern dimension of video data is much higher than that of time series, and due to reasons such as similar environmental backgrounds, the spatial-temporal patterns of normal and abnormal events often overlap, making it infeasible to determine pattern boundaries through clustering when dealing with complex datasets. In recent years, researchers have proposed FuVAD schemes based on iterative learning, gradually amplifying the pattern differences between anomalous samples and dominant normal data through the cooperation of feature extraction modules and anomaly models. Compared to other VAD research routes, FuVAD does not require filtering and labeling training data, directly utilizing untrimmed, unlabeled surveillance videos to train models, which aligns with the data state-agnostic condition in online learning.\n5.1 Recent Advances # Inspired by unmasking, Liu et al. [107] connect heuristic unmasking with multiple classifier two-sample tests, introducing a history sampling method to enhance testing capabilities in video anomaly detection and a motion feature calculation method for better representation and generalization. Li et al. 
[81] use distribution clustering to identify groups of anomalous examples, then train an autoencoder with normal data subsets to learn representations of normalcy, iterating this process to refine the encoder\u0026rsquo;s ability to describe regular events.\nDrawing from the Masked Autoencoder (MAE) [56], the Temporal Masked Auto-Encoder (TMAE) [59] aims to learn high-quality representations for anomaly detection by employing a vision transformer for completion tasks on spatial-temporal cubes, recognizing the significance of the temporal dimension in video anomalies. This approach is designed to complete regular events efficiently, so that anomalies stand out through their large completion loss.\nAn end-to-end self-training deep ordinal regression (SDOR) framework [140] iteratively learns pseudo-normal and anomaly scores from raw sequences, starting with identifying potential anomaly frames using existing algorithms and employing ResNet50 and neural networks for score computation, leveraging self-training for simultaneous optimization of feature learning and anomaly scoring. Generative Cooperative Learning (GCL) [216] learns anomaly detectors from unlabeled mixed data by exploiting anomaly events\u0026rsquo; low frequency, featuring a generator G and a discriminator D working cooperatively. G focuses on regular event representations and uses negative learning for anomalies, generating pseudo-labels for D, which estimates anomaly probabilities to further refine G.\n5.2 Classic FuVAD Models # We introduce two typical deep learning-based FuVAD schemes in Section 2.3 of the Supplement, including 1) Self-trained Deep Ordinal Regression (SDOR) [140], which utilizes self-training to jointly optimize feature learning and the anomaly scorer, and 2) Generative Cooperative Learning [216], which leverages the low-frequency nature of real-world anomalies to construct pseudo-labels for FuVAD. 
SDOR [140] is the first deep NSVAD method designed for unfiltered and unlabeled videos, which explicitly points out the limited applicability of UVAD and WsVAD in realistic scenarios due to the constraints on data and the performance reliance on feature learning. GCL [216] introduces negative learning to increase the contrast between regular sequences and potential anomalies, directly learning the differences between anomalies and the majority of samples (normal) through the interplay of generator and discriminator.\n6 MODEL EVALUATION # Table 3. Statistical results of the NSVAD datasets.\n| Labeling | Year | Dataset | #Videos (Total) | #Videos (Training) | #Videos (Testing) | #Frames (Total) | #Frames (Training) | #Frames (Testing) | #Frames (Normal) | #Frames (Abnormal) | #Scenes | #Classes | #Anomalies |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Unsupervised | 2008 | Subway Entrance | - | - | - | 144,250 | 76,543 | 67,797 | 132,138 | 12,112 | 1 | 5 | 51 |\n| Unsupervised | 2008 | Subway Exit | - | - | - | 64,901 | 22,500 | 42,401 | 60,410 | 4,491 | 1 | 3 | 14 |\n| Unsupervised | 2011 | UMN† | - | - | - | 7,741 | - | - | 6,165 | 1,576 | 3 | 3 | 11 |\n| Unsupervised | 2013 | UCSD Ped1 | 70 | 34 | 36 | 14,000 | 6,800 | 7,200 | 9,995 | 4,005 | 1 | 5 | 61 |\n| Unsupervised | 2013 | UCSD Ped2 | 28 | 16 | 12 | 4,560 | 2,550 | 2,010 | 2,924 | 1,636 | 1 | 5 | 21 |\n| Unsupervised | 2013 | CUHK Avenue | 37 | 16 | 21 | 30,652 | 15,328 | 15,324 | 26,832 | 3,820 | 1 | 5 | 77 |\n| Unsupervised | 2018 | ShanghaiTech | - | - | - | 317,398 | 274,515 | 42,883 | 300,308 | 17,090 | 13 | 11 | 158 |\n| Unsupervised | 2020 | Street Scene | 81 | 46 | 35 | 203,257 | 56,847 | 146,410 | 159,341 | 43,916 | 205 | 17 | 17 |\n| Unsupervised | 2023 | NWPU Campus | 547 | 305 | 242 | 1,466,073 | 1,082,014 | 384,059 | 1,400,807 | 65,266 | 43 | 28 | - |\n| Weakly Supervised | 2018 | UCF-Crime | 1,900 | 1,610 | 290 | 13,741,393 | 12,631,211 | 1,110,182 | - | - | - | 13 | 950 |\n| Weakly Supervised | 2019 | ShanghaiTech Weakly | 437 | 330 | 107 | - | - | - | - | - | - | 11 | - |\n| Weakly Supervised | 2020 | XD-Violence | 4,754 | 3,954 | 800 | - | - | - | - | - | - | 6 | - |\n| Weakly Supervised | 2020 | TAD | 500 | 400 | 100 | 540,272 | - | - | - | - | - | 7 | 250 |\n| Supervised | 2022 | Ubnormal‡ | 543 | 268 | 211 | 236,902 | 116,087 | 92,640 | 147,887 | 89,015 | 29 | - | 660 |\n† The frame rate is set to 15 fps. 
‡ The Ubnormal contains a validation set with 64 videos totaling 14,237 normal and 13,938 abnormal frames.\nFig. 7. Examples of classical UVAD (Subway [2] and UMN [31]) and WsVAD (UCF-Crime [167] and XD-Violence [192]) datasets.\nIn this section, we elaborate on the characteristics of prevailing datasets and common evaluation metrics. Based on the annotations, we categorize existing datasets into unsupervised, weakly supervised, and supervised, as presented in Table 3. The examples of UVAD (e.g., Subway [2] and UMN [31]) and WsVAD (e.g., UCF-Crime [167] and XD-Violence [192]) datasets are illustrated in Fig. 7. Since such datasets have been extensively surveyed, we only present their details in Section 4 of Supplement. We provide a systematic NSVAD evaluation system in this section, categorizing existing metrics into accuracy-oriented and cost-oriented as well as introducing system-level performance indicators.\n6.1 Conventional Evaluation # The existing works typically evaluate the proposed method from two perspectives: detection accuracy and model cost. On the one hand, models are expected to detect noteworthy anomalies as accurately as possible. Considering the varying influences of false positives and false negatives, along with the highly imbalanced data, researchers have proposed multiple quantitative metrics such as Area Under the Receiver Operating Characteristic curve (AUROC), Area Under Precision-Recall curve (AUPR), and detection rate, to assess model accuracy. Additionally, anomaly score curves are commonly used to qualitatively demonstrate the sensitivity to abnormal intervals, while prediction error maps are widely employed in reconstruction-based UVAD methods to visualize the performance of spatial localization.\nFig. 8. Conventional evaluation metrics.\nOn the other hand, metrics like parameter size and inference speed determine whether the model can be deployed on resource-constrained devices. 
Thus, we categorize conventional metrics based on the orientation and evaluation dimension, as illustrated in Fig. 8 .\n6.1.1 Accuracy-oriented metrics. Accuracy-oriented metrics aim to evaluate a model\u0026rsquo;s ability to distinguish between normal and abnormal events, including quantitative metrics such as AUROC, AUPR, false alarm rate, and detection rate, as well as qualitative metrics like anomaly score curves and prediction error maps. While the task definitions vary across different VAD routes, they all aim to learn an anomaly detector capable of quantitatively measuring the abnormality level of test samples. Specifically, UVAD is a one-class classification task to train a model using regular events to describe normal patterns while considering all uncharacterizable samples as anomalies. The abnormality degree is computed by measuring the deviation to the learned normality model, which is typically normalized to the range of [0 , 1] as anomaly scores. In contrast, WsVAD treats VAD as a regression task, using video-level labels to supervise fully connected networks directly outputting instance-level anomaly scores, similar to the FuVAD model. Therefore, despite differences in task settings and anomaly discrimination processes, metrics from binary classification tasks can be used to evaluate VAD models.\nIn most cases, the anomaly scores computed by NSVAD models are continuous values in the range [0 , 1], while the given data labels are binary discrete values, where 0 denotes negative (normal events) and 1 represents positive (anomalous events). Therefore, it\u0026rsquo;s necessary to select a threshold to convert relative abnormality scores into definitive binary labels for comparison. For example, with a threshold of 0.5, samples with scores lower than 0.5 are considered negative by the model, while those greater than or equal to 0.5 are considered positive. 
Thus, we can compute the True Positive Rate (𝑇𝑃𝑅), False Positive Rate (𝐹𝑃𝑅), True Negative Rate (𝑇𝑁𝑅), and False Negative Rate (𝐹𝑁𝑅) as follows:\n𝑇𝑃𝑅 = 𝑇𝑃 / (𝑇𝑃 + 𝐹𝑁), 𝐹𝑃𝑅 = 𝐹𝑃 / (𝐹𝑃 + 𝑇𝑁), 𝑇𝑁𝑅 = 𝑇𝑁 / (𝑇𝑁 + 𝐹𝑃), 𝐹𝑁𝑅 = 𝐹𝑁 / (𝐹𝑁 + 𝑇𝑃),\nwhere 𝑇𝑃, 𝐹𝑃, 𝑇𝑁, and 𝐹𝑁 represent correctly detected positive samples, negative samples misclassified as positive, correctly detected negative samples, and positive samples misclassified as negative, respectively.\nDue to the highly imbalanced nature of positive and negative samples in NSVAD, some common evaluation metrics for classification tasks, such as accuracy, are not applicable. For example, a model biased towards outputting label 0, and thus missing anomalous events, would still be rated highly under such metrics. Using a single threshold to assess a model\u0026rsquo;s ability to differentiate between normal and abnormal patterns is also unreliable. For instance, with a threshold of 0.5, a model that consistently outputs scores slightly below 0.5 for regular events and slightly above 0.5 for anomalies would be considered perfect because it would show optimal performance across various metrics. However, such a model may not have learned the inherent differences between normal and abnormal patterns well, resulting in minimal score gaps between the two, which could lead to failure in detecting subtle anomalies and normal instances with data bias in complex scenarios. Therefore, researchers have introduced Receiver Operating Characteristic (ROC) curves, which measure VAD models more comprehensively by selecting multiple thresholds. Specifically, this curve plots the TPR against the FPR at various thresholds. The area under the curve, known as AUROC, has been the most widely used VAD evaluation metric. An ideal AUROC value of 1 indicates that every positive sample is scored higher than every negative sample, as with a model that outputs a score of 0 for all negative samples and 1 for all positive samples, aligning with our expectations. 
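As a minimal sketch of the threshold-based rates and the multi-threshold ROC sweep described above, the following pure-Python snippet may help. The function names `confusion_rates` and `auroc` and the toy scores are illustrative assumptions, not code or data from the surveyed works, and the sketch assumes both classes are present in the labels.

```python
# Illustrative sketch only: function names and toy data are assumptions,
# not taken from the survey or from any particular library.

def confusion_rates(scores, labels, threshold):
    '''Return (TPR, FPR, TNR, FNR); scores >= threshold are flagged as anomalous.

    Assumes labels contain at least one positive (1) and one negative (0).
    '''
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr, 1.0 - fpr, 1.0 - tpr


def auroc(scores, labels):
    '''Area under the ROC curve (TPR against FPR) via the trapezoidal rule.'''
    # Use every distinct score as a threshold, from strictest to loosest.
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]  # a threshold above all scores flags nothing
    for t in thresholds:
        tpr, fpr, _, _ = confusion_rates(scores, labels, t)
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


scores = [0.1, 0.4, 0.35, 0.8]  # toy anomaly scores in [0, 1]
labels = [0, 0, 1, 1]           # 0 = normal, 1 = anomalous
print(confusion_rates(scores, labels, threshold=0.5))  # (0.5, 0.0, 1.0, 0.5)
print(auroc(scores, labels))                           # 0.75
```

At the single threshold of 0.5 the toy model raises no false alarms but misses half of the anomalies, whereas the sweep summarizes its behavior over all thresholds in one number. In practice, a tested library routine would normally be preferred over this hand-rolled sweep.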
Considering that TN usually exceeds TP, researchers argue that Average Precision (AP), i.e., the Area Under the Precision-Recall (AUPR) curve, is more suitable for evaluating anomaly detection tasks. The PR curve depicts precision and recall (i.e., TPR) at various thresholds. The point on this curve where Precision equals Recall is the balance point. Currently, multimodal anomaly detection models primarily use AP for quantitative evaluation. An anomaly score curve is commonly used to intuitively demonstrate the model\u0026rsquo;s response to anomalous events, presenting the temporal localization capability. In contrast, prediction error maps are often used to assess a UVAD model\u0026rsquo;s spatial localization capability.\n6.1.2 Cost-oriented metrics. Current research primarily focuses on developing high-performance detection models while neglecting to evaluate the models\u0026rsquo; deployment potential. A lightweight model is crucial in driving the application of NSVAD. Thus, we compile deployment-oriented metrics, including parameter size, Floating-Point Operations (FLOPs), and average inference speed. Specifically, parameter size indicates the number of learnable parameters, reflecting the complexity and storage cost of the model. In the real world, while complex models may offer performance gains, the resulting increase in memory and computational resource requirements may be unacceptable. Therefore, NSVAD should balance detection performance with model parameters. FLOPs represent the number of floating-point operations the model needs to perform during inference. This metric is crucial for end devices, as excessive FLOPs may lead to performance bottlenecks and reduced hardware lifespan. Some existing works report average inference speed, i.e., the number of frames the model can process per second during testing, to quantitatively reflect the model\u0026rsquo;s run time. 
However, due to inconsistent experimental environments, it cannot serve as an instructive and convincing metric in most cases. Due to space constraints and fairness of the comparison, we have collected the performance reported by existing methods but only organized these results in our GitHub repository for reference.\n6.2 System-level Evaluation # While conventional metrics focus on the detection accuracy and cost of individual models, evaluating the performance of NSVAD in real-world deployments requires a more holistic, system-level perspective. Beyond model-specific metrics, the entire system\u0026rsquo;s effectiveness depends on various factors such as latency, communication cost, bandwidth efficiency, data security, user privacy, and system robustness [94]. These metrics collectively assess the performance and feasibility of deploying NSVAD in large-scale, distributed, and resource-constrained environments. To this end, we categorize system-level metrics into three primary groups: efficiency, privacy, and robustness.\n6.2.1 Efficiency-oriented metrics. Efficiency-oriented system metrics assess the efficiency and scalability of the NSVAD system in a distributed setting. One key metric is latency, which measures the end-to-end delay from the moment video data is captured to the final detection output. For real-time anomaly detection, minimizing latency is critical, particularly in scenarios such as public safety monitoring or autonomous driving, where any detection delay could result in catastrophic outcomes. Latency can be broken down into communication latency, processing latency, and system response time, each reflecting different aspects of delay within the system. Reducing these latencies often requires optimizing the placement of inference tasks across edge and cloud nodes. Communication cost evaluates the data transmission overhead between distributed nodes, especially in edge-cloud architectures. 
Given the high bandwidth demands of video data, optimizing communication efficiency becomes crucial for deploying NSVAD at scale. Common metrics used here include the total data transmitted (measured in megabytes or gigabytes) and the number of communication rounds required for model updates in federated learning-based systems. To address these challenges, techniques such as video compression, parameter pruning, and model quantization are often employed to reduce communication overhead. Another essential metric is bandwidth utilization, which measures the efficiency of the network resources. High bandwidth usage may congest the network, causing delays and performance degradation, particularly in multi-client or large-scale environments. Methods such as asynchronous communication and bandwidth allocation prioritization can be leveraged to optimize utilization without sacrificing detection performance.\n6.2.2 Privacy-oriented metrics. In NSVAD systems, maintaining data security and user privacy is paramount, particularly in applications involving sensitive environments such as healthcare [78] or public surveillance. Data security is typically evaluated using metrics such as encryption overhead, which measures the computational cost introduced by encryption techniques, and key management efficiency, which assesses how well the system handles the distribution and renewal of cryptographic keys in a large-scale deployment. Robust encryption algorithms, such as AES or homomorphic encryption, are commonly used to ensure that video data remains secure during transmission and processing. User privacy is another critical concern in video anomaly detection systems. Metrics such as the privacy leakage rate evaluate how much sensitive information (e.g., identities, personal activities) can be inferred from the system\u0026rsquo;s outputs. Differential privacy, federated learning, and encrypted video coding are among the techniques employed to minimize privacy risks. 
Privacy-preserving methods are evaluated based on the degree of anonymization they provide, often measured in terms of the privacy budget, which balances privacy protection against utility loss.\n6.2.3 Robustness-oriented metrics. Robustness metrics aim to assess the system\u0026rsquo;s ability to maintain reliable performance in the presence of adversarial conditions, such as network disruptions, data corruption, or malicious attacks. One critical metric is fault tolerance, which measures the system\u0026rsquo;s capacity to continue operating when certain components fail [51 , 52]. This is especially important in distributed settings where failures in edge devices or communication links can affect the overall detection pipeline. Techniques such as redundancy, dynamic task migration, and edge-cloud coordination can enhance the system\u0026rsquo;s fault tolerance. Adversarial robustness evaluates the system\u0026rsquo;s resilience to attacks designed to manipulate or mislead the anomaly detection model. Adversarial attacks may involve injecting malicious data, such as perturbing video frames or manipulating model parameters. The robustness against such attacks is typically quantified by the system\u0026rsquo;s ability to maintain high detection accuracy even when exposed to adversarial perturbations. Metrics such as adversarial success rate and robust accuracy are often used. Lastly, scalability measures the system\u0026rsquo;s ability to handle increasing workloads, including the number of video streams and distributed clients. This is typically evaluated through stress testing, where system performance is analyzed under different load conditions to ensure it can maintain efficiency and reliability as deployment scales up. 
A scalable NSVAD system should efficiently distribute workloads across edge devices and cloud servers without degrading performance.\n7 DISCUSSION AND SUMMARY # 7.1 Research Challenges # Sections 3 -5 introduced the key challenges addressed by various NSVAD algorithms. Here, we further elaborate on unresolved problems from the perspectives of data, labels, models, and systems. In contrast to previous works that emphasize only algorithm design, we also explore the NSVAD-specialized bottlenecks encountered in real-world deployments, such as communication and computing overhead, large-scale detection demand, and privacy concerns.\n7.1.1 Data. Real-world videos exhibit label-independent domain shifts due to variations in scenes, camera angles, and device configurations [39]. These subtle differences in spatial-temporal patterns, while easily comprehended by humans, often lead to high false positive rates in NSVAD models [116]. Existing methods typically validate their models on datasets from a single scene to avoid this issue [149]. For instance, the UCSD dataset [82] includes two distinct perspectives, but they are treated as separate datasets. The ShanghaiTech dataset [100], despite spanning 13 scenes, is often treated as a single scene, leading to performance drops when compared to simpler datasets like UCSD Ped2 [82] and CUHK Avenue [120]. In real-world applications, such scene and device variations are inevitable, making it impractical to develop specialized models for every setup. To address this, some researchers [1] have proposed using virtual engines to simulate anomaly events and generate richer positive samples. However, datasets like XD-Violence [192], which include movie and game scenes, differ from real-world anomalies, limiting the effectiveness of models trained on them. Bridging this domain gap between virtual and real anomalies is essential for deployable NSVAD systems. 
Additionally, multimodal NSVAD, while a growing field, remains confined to the fusion of RGB images and synchronized audio, neglecting novel modalities like language [190] and texts [144], limiting its application in media streaming and broadcasting.\n7.1.2 Label. Despite the acknowledged rarity and diversity of anomalies, most models are trained unsupervised [181 , 199 , 207], using only normal samples. However, as discussed in Section 2.1, unsupervised models still require anomaly-free samples during training. Challenges include the impact of pixel noise on model performance and the cost of labeling large-scale datasets for supervised methods. Moreover, UVAD methods rely on data filtering to avoid contaminated training sets, preventing online learning directly from raw video streams, as seen in FuVAD [140]. Conversely, WsVAD utilizes video-level labels to reduce labor costs [167], but improving the stability of FuVAD/UVAD in complex environments or creating hybrid models that can mitigate label noise remain future directions.\n7.1.3 Model. Unsupervised NSVAD models benefit from self-supervised learning for spatial-temporal feature representation but face challenges with over-generalization, which can lead to missed anomalies [141]. The key challenge is balancing representation and generalization to reduce both false positives and negatives [111]. Approaches like memory networks and causal representation learning [116] have shown promise, but performance on complex datasets remains inconsistent. In contrast, WsVAD can only detect predefined anomaly events, limiting its adaptability and failing to meet open-world requirements. While WsVAD models generally offer more reliable results than UVAD and FuVAD, their reliance on ranking loss is still debated. To advance NSVAD, there is a need for interpretable models and systems that integrate data encryption to address privacy concerns.\n7.1.4 System. 
Current NSVAD research primarily focuses on improving algorithmic performance on existing datasets, overlooking practical deployment challenges like device heterogeneity and resource constraints. Different applications, such as surveillance analysis, content monitoring, and non-real-time video detection, require distinct task setups and data modalities. However, most work has concentrated on surveillance video detection, neglecting applications like short video analysis and live streaming. Additionally, mobile devices, with their limited storage, computing, and communication resources, necessitate lightweight model designs. Moving forward, NSVAD should focus on the synchronous optimization of detection performance and system cost.\n7.1.5 NSVAD-specialized. For real-world applications, NSVAD needs to handle vast data streams from urban surveillance cameras while addressing long-term anomaly detection. These systems also face public concerns regarding data overhead and privacy security. Unlike algorithmic research, NSVAD for large-scale applications presents three distinct challenges: (i) Efficient data exchange and task offloading [37] between millions of cameras and distributed servers. Current research focuses on enhancing algorithms but neglects the costs of data acquisition and transmission, which are critical in large-scale deployments. NSVAD must adopt new video compression [49] and transmission protocols to handle the expanding network of cameras and increasing video resolution. Distributed machine learning techniques can also facilitate global model training without aggregating all data. (ii) Detecting anomalies across large spatial and temporal scales. Video IoT systems collect information from entire cities, but current NSVAD models [165 , 232], designed for discrete scenes, struggle with large-scale data. 
For example, traffic anomalies like congestion may be easily detected at intersections, but large-scale crowd movements might be misinterpreted as anomalous gatherings. (iii) Privacy protection and ethical concerns. NSVAD systems collect identifiable information such as faces and clothing, raising concerns about privacy breaches and biases related to race, gender, and skin color [137 , 166]. While identity-agnostic data (e.g., encrypted video [22]) and privacy-preserving techniques (e.g., federated learning [3]) are promising, ensuring transparency and fairness remains a significant challenge for NSVAD development.\n7.2 Trends and Opportunities # Based on the sequential steps of NSVAD depicted in Fig. 1, we summarize the trends and open opportunities across the hardware, system, algorithm, and application layers. This summary incorporates the development trends of artificial intelligence and communication technologies, as well as the application requirements of NSVAD in smart cities and the mobile internet, and aims to guide researchers in various fields to engage in relevant work.\n7.2.1 Hardware Layer. The deployment of billions of cameras in roads, factories, and public places not only provides diverse application scenarios but also offers ample data support for training large-scale NSVAD models. Next-generation communication devices and image processing units can provide communication and computational support for building large-scale NSVAD systems, making it feasible for cloud-based global model learning in end-cloud collaborative architecture [70 , 172]. Additionally, emerging sensing devices such as thermal imaging, motion cameras, and event cameras will expand the application potential of NSVAD in scenarios like military and sports. 
Therefore, we believe that the advancement and innovation of hardware devices strongly drive the development and application of NSVAD technology, and developing deployable NSVAD systems for new types of sensors and large-scale IoT systems will become a trend.\n7.2.2 System Layer. The system layer aims to bridge the hardware and algorithm layers, providing interfaces for model deployment on terminal devices and supporting the collaborative optimization of computation and communication execution in NSVAD systems. Previous NSVAD research focused on algorithm design while neglecting system layer development for a long time. With the further development of edge artificial intelligence and mobile communication networks, new solutions will be sought for communication strategy optimization, computation offloading, distributed model aggregation, and interface flexibility faced by the system layer, promoting NSVAD towards integration and intelligence.\n7.2.3 Algorithm Layer. Combining the latest advancements and domain concerns, we believe the development opportunities for algorithms include: 1) the introduction of large-scale heterogeneous datasets; 2) the adaptability transfer of efficient representation learning and reinforcement learning methods [87 , 96 , 97]; 3) innovative combinations of emerging artificial intelligence tasks with NSVAD; and 4) assistance from implicit knowledge in large models [54]. Specifically, expanding business scenarios from smart transportation and modern factories provide rich data sources for NSVAD, making it possible to train large-scale models. The rise of online video networks and streaming platforms provides additional data modalities beyond images, such as audio, subtitles, and language, aiding in exploring complex anomaly clues. Furthermore, the development of virtual data engines allows easy emulation of rare anomaly events and provides pixel-level annotations. 
The rarity and diversity of anomalies and the difficulty of collecting cross-scene videos will no longer be bottlenecks for NSVAD model development. Rich data modalities and volume will drive the development of multimodal and cross-scene NSVAD models and are expected to foster supervised interpretable research routes. The development of deep learning will free NSVAD algorithms from manual features and enable them to extract spatial-temporal features end-to-end and model video normality [101 , 102 , 230 , 231], while the continuous advancement of deep neural networks (such as attention, Transformer, masked autoencoders) [103 , 104 , 228 , 229] and emerging representation learning methods (such as video self-supervised learning and causal representation learning) will further improve NSVAD\u0026rsquo;s learning capability. The advancement of emerging artificial intelligence technologies will provide references for addressing specific concerns of NSVAD. For example, unsupervised NSVAD suffers from label-agnostic data biases in complex environments, often encountering significant performance degradation when dealing with diverse common events, a problem long studied in domain generalization tasks. We believe the combination of domain generalization and NSVAD research will help address the negative impact of data bias in unsupervised solutions. Data security and privacy protection have always been concerns for users and researchers when building deployable NSVAD systems for large-scale IoT. Federated learning will provide feasible solutions. The rise of large models will propel generative artificial intelligence research to new heights and give birth to numerous phenomenon-level applications, reshaping industries, and NSVAD is no exception. 
We believe that large models contain implicit knowledge related to anomalous events, which is crucial for understanding the fundamental differences between normal and abnormal and developing interpretable NSVAD models.\n7.2.4 Application Layer. On the one hand, applications such as smart transportation and live content monitoring provide ample validation scenarios for NSVAD models, driving the development of cross-scene perspective robust models for real-world deployment. On the other hand, business requirements in specific scenarios will spur new research tasks. For example, thermal imaging, as a completely passive sensing method, has been widely used in autonomous driving and the modern military to overcome the limitations of optical cameras at night. NSVAD model design based on thermal sensing devices will address texture loss and boundary entanglement caused by blackbody radiation, facilitating all-weather anomaly event detection. Additionally, the limited resources of terminal devices and the perceptual range of local systems indicate that lightweight models and the development of end-cloud collaborative NSVAD systems for large-scale applications are worth exploring.\n7.3 Summary # NSVAD has emerged as a significant area of research with broad implications for smart cities and the mobile internet, exerting essential influence across various domains such as traffic management, industrial manufacturing, and the operations of online video platforms. This influence is critical for maintaining urban safety and ensuring a clear cyberspace. Originating from the confluence of anomaly detection and video understanding, NSVAD has expanded beyond mere algorithm design, transforming into a multifaceted subject of interest that spans AI, IoT, and computing. As the pioneering tutorial-type paper on NSVAD, this article comprehensively outlines its research landscape, clarifying the foundational concepts and developmental trajectories across various research avenues. 
In particular, we examine recent advancements in unsupervised, weakly supervised, and fully unsupervised methods, providing detailed explanations of classical solutions. Remarkably, this article centrally presents our latest explorations to NSVAD in modern industry, smart cities, and complex systems. Finally, leveraging our experiences, we analyze the challenges, trends, and opportunities within the future vision of NSVAD, aiming to spark inspiration for following engineers and researchers.\nACKNOWLEDGMENTS # This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62250410368; the Kunshan Municipal Special Project under Grant 24KKSGR024; the Guangdong Pearl River Talent Recruitment Program under Grant 2019ZT08X603; the Guangdong Pearl River Talent Plan under Grant 2019JC01X235; and the National Key Research and Development Program of China under Project No. 2024YFE0200700 (Subject No. 2024YFE0200703). Additional support was provided in part by the Specific Research Fund of the Innovation Platform for Academicians of Hainan Province under Grant YSPTZX202314; the Shanghai Key Research Laboratory of NSAI and the Joint Laboratory on Networked AI Edge Computing, Fudan University-Changan; the China Mobile Research Fund of MOE under Grant KEH2310029; and the NSFC under Grant 62250410368. The authors sincerely thank Liangyu Teng, Yuntian Shi, and Hao Yang from Fudan University for their help in revising this article. Finally, the authors would like to express their gratitude to the anonymous reviewers for their insightful comments and valuable suggestions.\nREFERENCES # [1] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. 2022. UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 
20143–20153.\n[2] Amit Adam, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz. 2008. Robust real-time unusual event detection using multiple fixed-location monitors. IEEE transactions on pattern analysis and machine intelligence 30, 3 (2008), 555–560.\n[3] Anas Al-Lahham, Muhammad Zaigham Zaheer, Nurbek Tastan, and Karthik Nandakumar. 2024. Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12416–12425.\n[4] Qianyue Bao, Fang Liu, Yang Liu, Licheng Jiao, Xu Liu, and Lingling Li. 2022. Hierarchical scene normality-binding modeling for anomaly detection in surveillance videos. In Proceedings of the 30th ACM International Conference on Multimedia. 6103–6112.\n[5] Suvramalya Basak and Anjali Gautam. 2024. Diffusion-based normality pre-training for weakly supervised video anomaly detection. Expert Systems with Applications (2024), 124013.\n[6] Yannick Benezeth, P-M Jodoin, Venkatesh Saligrama, and Christophe Rosenberger. 2009. Abnormal events detection based on spatio-temporal co-occurences. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2458–2465.\n[7] Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A Lozano. 2021. A review on outlier/anomaly detection in time series data. ACM Computing Surveys (CSUR) 54, 3 (2021), 1–33.\n[8] Azzedine Boukerche, Lining Zheng, and Omar Alfandi. 2020. Outlier detection: Methods, models, and classification. ACM Computing Surveys (CSUR) 53, 3 (2020), 1–37.\n[9] Ruichu Cai, Hao Zhang, Wen Liu, Shenghua Gao, and Zhifeng Hao. 2021. Appearance-motion memory consistency network for video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 938–946.\n[10] Congqi Cao, Yue Lu, and Yanning Zhang. 2024. Context recovery and knowledge retrieval: A novel two-stream framework for video anomaly detection. 
IEEE Transactions on Image Processing (2024).\n[11] Liang Cao, Xiaolu Ji, Yankai Cao, and Bhushan Gopaluni. 2025. Adaptive Process Monitoring for Multimode Industrial Processes through Machine Learning. IEEE Journal of Emerging and Selected Topics in Industrial Electronics (2025).\n[12] Liang Cao, Jianping Su, Emilio Conde, Lim C Siang, Yankai Cao, and Bhushan Gopaluni. 2025. A novel automated soft sensor design tool for industrial applications based on machine learning. Control Engineering Practice 160 (2025), 106322.\n[13] Liang Cao, Jianping Su, Jack Saddler, Yankai Cao, Yixiu Wang, Gary Lee, Lim C Siang, Robert Pinchuk, Jin Li, and R Bhushan Gopaluni. 2024. Real-time tracking of renewable carbon content with AI-aided approaches during co-processing of biofeedstocks. Applied Energy 360 (2024), 122815.\n[14] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.\n[15] S Chandrakala, K Deepak, and G Revathy. 2022. Anomaly detection in surveillance videos: a thematic taxonomy of deep models, review and performance analysis. Artificial Intelligence Review (2022), 1–50.\n[16] Yunpeng Chang, Zhigang Tu, Wei Xie, Bin Luo, Shifu Zhang, Haigang Sui, and Junsong Yuan. 2021. Video anomaly detection with spatio-temporal dissociation. Pattern Recognition 122 (2021), 108213.\n[17] Yunpeng Chang, Zhigang Tu, Wei Xie, and Junsong Yuan. 2020. Clustering driven deep autoencoder for video anomaly detection. In European Conference on Computer Vision. Springer, 329–345.\n[18] Chengwei Chen, Yuan Xie, Shaohui Lin, Angela Yao, Guannan Jiang, Wei Zhang, Yanyun Qu, Ruizhi Qiao, Bo Ren, and Lizhuang Ma. 2022. Comprehensive Regularization in a Bi-directional Predictive Network for Video Anomaly Detection. In Proceedings of the American association for artificial intelligence. 
1–9.\n[19] Dongyue Chen, Pengtao Wang, Lingyi Yue, Yuxin Zhang, and Tong Jia. 2020. Anomaly detection in surveillance video based on bidirectional prediction. Image and Vision Computing 98 (2020), 103915.\n[20] Dongyue Chen, Lingyi Yue, Xingya Chang, Ming Xu, and Tong Jia. 2021. NM-GAN: Noise-modulated generative adversarial network for video anomaly detection. Pattern Recognition 116 (2021), 107969.\n[21] Weiling Chen, Keng Teck Ma, Zi Jian Yew, Minhoe Hur, and David Aik-Aun Khoo. 2023. TEVAD: Improved video anomaly detection with captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5548–5558.\n[22] Hang Cheng, Ximeng Liu, Huaxiong Wang, Yan Fang, Meiqing Wang, and Xiaopeng Zhao. 2020. SecureAD: A secure video anomaly detection framework on convolutional neural network in edge computing environment. IEEE Transactions on Cloud Computing 10, 2 (2020), 1413–1427.\n[23] Kai Cheng, Yang Liu, and Xinhua Zeng. 2023. Learning graph enhanced spatial-temporal coherence for video anomaly detection. IEEE Signal Processing Letters 30 (2023), 314–318.\n[24] Kai Cheng, Yaning Pan, Yang Liu, Xinhua Zeng, and Rui Feng. 2024. Denoising Diffusion-Augmented Hybrid Video Anomaly Detection via Reconstructing Noised Frames. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24. 695–703.\n[25] Kai Cheng, Xinhua Zeng, Yang Liu, Yaning Pan, and Xinzhe Li. 2024. Normality learning reinforcement for anomaly detection in surveillance videos. Knowledge-Based Systems 297 (2024), 111942.\n[26] Kai Cheng, Xinhua Zeng, Yang Liu, Mengyang Zhao, Chengxin Pang, and Xing Hu. 2023. Spatial-temporal graph convolutional network boosted flow-frame prediction for video anomaly detection. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.\n[27] Kai-Wen Cheng, Yie-Tarng Chen, and Wen-Hsien Fang. 2015. 
Video anomaly detection and localization using hierarchical feature representation and Gaussian process regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2909–2917.\n[28] Yong Shean Chong and Yong Haur Tay. 2017. Abnormal event detection in videos using spatiotemporal autoencoder. In International symposium on neural networks. Springer, 189–196.\n[29] Peter Christiansen, Lars N Nielsen, Kim A Steen, Rasmus N Jørgensen, and Henrik Karstoft. 2016. DeepAnomaly: Combining background subtraction and deep learning for detecting obstacles and anomalies in an agricultural field. Sensors 16, 11 (2016), 1904.\n[30] Andrew A Cook, Göksel Mısırlı, and Zhong Fan. 2019. Anomaly detection for IoT time-series data: A survey. IEEE Internet of Things Journal 7, 7 (2019), 6481–6494.\n[31] Xinyi Cui, Qingshan Liu, Mingchen Gao, and Dimitris N Metaxas. 2011. Abnormal detection using interaction energy potentials. In CVPR 2011. IEEE, 3161–3167.\n[32] Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), Vol. 1. IEEE, 886–893.\n[33] Navneet Dalal, Bill Triggs, and Cordelia Schmid. 2006. Human detection using oriented histograms of flow and appearance. In European conference on computer vision. Springer, 428–441.\n[34] K Deepak, S Chandrakala, and C Krishna Mohan. 2021. Residual spatiotemporal autoencoder for unsupervised video anomaly detection. Signal, Image and Video Processing 15, 1 (2021), 215–222.\n[35] Piotr Dollár, Vincent Rabaud, Garrison Cottrell, and Serge Belongie. 2005. Behavior recognition via sparse spatio-temporal features. In 2005 IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance. IEEE, 65–72.\n[36] Fei Dong, Yu Zhang, and Xiushan Nie. 2020. Dual discriminator generative adversarial network for video anomaly detection. 
IEEE Access 8 (2020), 88170–88176.\n[37] Shi Dong, Junxiao Tang, Khushnood Abbas, Ruizhe Hou, Joarder Kamruzzaman, Leszek Rutkowski, and Rajkumar Buyya. 2024. Task offloading strategies for mobile edge computing: A survey. Computer Networks (2024), 110791.\n[38] Keval Doshi and Yasin Yilmaz. 2021. Online anomaly detection in surveillance videos with asymptotic bound on false alarm rate. Pattern Recognition 114 (2021), 107865.\n[39] Huu-Thanh Duong, Viet-Tuan Le, and Vinh Truong Hoang. 2023. Deep learning-based anomaly detection in video surveillance: A survey. Sensors 23, 11 (2023), 5024.\n[40] Yasha Ektefaie, George Dasoulas, Ayush Noori, Maha Farhat, and Marinka Zitnik. 2023. Multimodal learning with graphs. Nature Machine Intelligence 5, 4 (2023), 340–350.\n[41] Yaxiang Fan, Gongjian Wen, Deren Li, Shaohua Qiu, Martin D Levine, and Fei Xiao. 2020. Video anomaly detection and localization via gaussian mixture fully convolutional variational autoencoder. Computer Vision and Image Understanding 195 (2020), 102920.\n[42] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. 2021. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 14009–14018.\n[43] Xinyang Feng, Dongjin Song, Yuncong Chen, Zhengzhang Chen, Jingchao Ni, and Haifeng Chen. 2021. Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection. In Proceedings of the 29th ACM International Conference on Multimedia. 5546–5554.\n[44] Joseph Fioresi, Ishan Rajendrakumar Dave, and Mubarak Shah. 2023. Ted-spad: Temporal distinctiveness for self-supervised privacy-preservation for video anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13598–13609.\n[45] Chuanxing Geng, Sheng-jun Huang, and Songcan Chen. 2020. Recent advances in open set recognition: A survey. 
IEEE transactions on pattern analysis and machine intelligence 43, 10 (2020), 3614–3631.\n[46] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. 2021. Anomaly detection in video via self-supervised and multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 12742–12752.\n[47] Mariana Iuliana Georgescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. 2021. A background-agnostic framework with adversarial training for abnormal event detection in video. IEEE transactions on pattern analysis and machine intelligence 44, 9 (2021), 4505–4523.\n[48] Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision. 1440–1448.\n[49] Carlos Gomes, Roberto Azevedo, and Christopher Schroers. 2023. Video compression with entropy-constrained neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18497–18506.\n[50] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. 2019. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1705–1714.\n[51] Juncen Guo, Xiaoguang Zhu, Lianlong Sun, Liangyu Teng, Di Li, Yang Liu, and Liang Song. 2025. Feature Calibration enhanced Parameter Synthesis for CLIP-based Class-incremental Learning. arXiv preprint (2025).\n[52] Juncen Guo, Xiaoguang Zhu, Liangyu Teng, Hao Yang, Jing Liu, Yang Liu, and Liang Song. 2025. Adaptive Weighted Parameter Fusion with CLIP for Class-Incremental Learning. arXiv preprint (2025).\n[53] Xingshuo Han, Xiao Wang, Kui Jiang, Wei Liu, Ruimin Hu, Xuefeng Pan, and Xin Xu. 2024. Mutuality Attribute Makes Better Video Anomaly Detection. 
In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2670–2674.\n[54] Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. 2024. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. arXiv preprint arXiv:2403.14608 (2024).\n[55] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. 2016. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition. 733–742.\n[56] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16000–16009.\n[57] Francisco Herrera, Sebastián Ventura, Rafael Bello, Chris Cornelis, Amelia Zafra, Dánel Sánchez-Tarragó, and Sarah Vluymans. 2016. Multiple instance learning. Springer.\n[58] Ryota Hinami, Tao Mei, and Shin\u0026rsquo;ichi Satoh. 2017. Joint detection and recounting of abnormal events by learning deep generic knowledge. In Proceedings of the IEEE international conference on computer vision. 3619–3627.\n[59] Jingtao Hu, Guang Yu, Siqi Wang, En Zhu, Zhiping Cai, and Xinzhong Zhu. 2022. Detecting Anomalous Events from Unlabeled Videos via Temporal Masked Auto-Encoding. In 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6.\n[60] Xing Hu, Shiqiang Hu, Yingping Huang, Huanlong Zhang, and Hanbing Wu. 2016. Video anomaly detection using deep incremental slow feature analysis network. IET Computer Vision 10, 4 (2016), 258–267.\n[61] Xing Hu, Yingping Huang, Xiumin Gao, Lingkun Luo, and Qianqian Duan. 2018. Squirrel-cage local binary pattern and its application in video anomaly detection. 
IEEE Transactions on Information Forensics and Security 14, 4 (2018), 1007–1022.\n[62] Chao Huang, Chengliang Liu, Jie Wen, Lian Wu, Yong Xu, Qiuping Jiang, and Yaowei Wang. 2024. Weakly Supervised Video Anomaly Detection via Self-Guided Temporal Discriminative Transformer. IEEE Transactions on Cybernetics (2024).\n[63] Chao Huang, Yabo Liu, Zheng Zhang, Chengliang Liu, Jie Wen, Yong Xu, and Yaowei Wang. 2022. Hierarchical Graph Embedded Pose Regularity Learning via Spatio-Temporal Transformer for Abnormal Behavior Detection. In 30th ACM International Conference on Multimedia. 307–315.\n[64] Chao Huang, Jie Wen, Chengliang Liu, and Yabo Liu. 2024. Long Short-Term Dynamic Prototype Alignment Learning for Video Anomaly Detection. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24) (April 2024).\n[65] Chao Huang, Jie Wen, Yong Xu, Qiuping Jiang, Jian Yang, Yaowei Wang, and David Zhang. 2023. Self-Supervised Attentive Generative Adversarial Networks for Video Anomaly Detection. IEEE Transactions on Neural Networks and Learning Systems 34, 11 (Nov. 2023), 9389–9403.\n[66] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. 2019. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7842–7851.\n[67] Behrouz Jedari, Gopika Premsankar, Gazi Illahi, Mario Di Francesco, Abbas Mehrabi, and Antti Ylä-Jääski. 2020. Video caching, analytics, and delivery at the wireless edge: A survey and future directions. IEEE Communications Surveys \u0026amp; Tutorials 23, 1 (2020), 431–471.\n[68] Xiantao Jiang, F Richard Yu, Tian Song, and Victor CM Leung. 2021. A survey on multi-access edge computing applied to video streaming: Some research issues and challenges. IEEE Communications Surveys \u0026amp; Tutorials 23, 2 (2021), 871–903.\n[69] Longlong Jing and Yingli Tian. 2020. 
Self-supervised visual feature learning with deep neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence 43, 11 (2020), 4037–4058.\n[70] Bobo Ju, Kun Yang, Yang Liu, Jing Liu, Peng Sun, Wei Ni, and Liang Song. 2024. Open Service for Networking Systems of AI: A Case Study in Adaptation Optimization. In 2024 IEEE 10th World Forum on Internet of Things (WF-IoT). IEEE, 1–6.\n[71] Hamza Karim, Keval Doshi, and Yasin Yilmaz. 2024. Real-time weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 6848–6856.\n[72] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, Vol. 1. Minneapolis, Minnesota, 2.\n[73] Jaehyun Kim, Seongwook Yoon, Taehyeon Choi, and Sanghoon Sull. 2023. Unsupervised video anomaly detection based on similarity with predefined text descriptions. Sensors 23, 14 (2023), 6256.\n[74] B Ravi Kiran, Dilip Mathew Thomas, and Ranjith Parakkal. 2018. An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. Journal of Imaging 4, 2 (2018), 36.\n[75] Kwang-Eun Ko and Kwee-Bo Sim. 2018. Deep convolutional framework for abnormal behavior detection in a smart surveillance system. Engineering Applications of Artificial Intelligence 67 (2018), 226–234.\n[76] Yu Kong and Yun Fu. 2022. Human action recognition and prediction: A survey. International Journal of Computer Vision 130, 5 (2022), 1366–1401.\n[77] Chaobo Li, Hongjun Li, and Guoan Zhang. 2024. Cross-modality integration framework with prediction, perception and discrimination for video anomaly detection. Neural Networks 172 (2024), 106138.\n[78] Chengfang Li, Hanqi Wang, Yang Liu, Xiaoguang Zhu, and Liang Song. 2024. 
Silent EEG classification using cross-fusion adaptive graph convolution network for multilingual neurolinguistic signal decoding. Biomedical Signal Processing and Control 87 (2024), 105524.\n[79] Nanjun Li, Faliang Chang, and Chunsheng Liu. 2020. Spatial-temporal cascade autoencoder for video anomaly detection in crowded scenes. IEEE Transactions on Multimedia 23 (2020), 203–215.\n[80] Tong Li, Xinyue Chen, Fushun Zhu, Zhengyu Zhang, and Hua Yan. 2021. Two-stream deep spatial-temporal auto-encoder for surveillance video abnormal event detection. Neurocomputing 439 (2021), 256–270.\n[81] Tangqing Li, Zheng Wang, Siying Liu, and Wen-Yan Lin. 2021. Deep unsupervised anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 3636–3645.\n[82] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. 2013. Anomaly detection and localization in crowded scenes. IEEE transactions on pattern analysis and machine intelligence 36, 1 (2013), 18–32.\n[83] Wenhao Li, Xiu Su, Shan You, Fei Wang, Chen Qian, and Chang Xu. 2023. Diffnas: Bootstrapping diffusion models by prompting for better architectures. In 2023 IEEE International Conference on Data Mining (ICDM). IEEE, 1121–1126.\n[84] Yuanyuan Li, Yiheng Cai, Jiaqi Liu, Shinan Lang, and Xinfeng Zhang. 2019. Spatio-Temporal Unity Networking for Video Anomaly Detection. IEEE Access 7 (2019), 172425–172432.\n[85] Zewen Li, Fan Liu, Wenjie Yang, Shouheng Peng, and Jun Zhou. 2021. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE transactions on neural networks and learning systems (2021).\n[86] Zhangxun Li, Mengyang Zhao, Xuan Yang, Yang Liu, Jiamu Sheng, Xinhua Zeng, Tian Wang, Kewei Wu, and Yu-Gang Jiang. 2024. STNMamba: Mamba-based Spatial-Temporal Normality Learning for Video Anomaly Detection. arXiv preprint arXiv:2412.20084 (2024).\n[87] Jinmei Liu, Wenbin Li, Xiangyu Yue, Shilin Zhang, Chunlin Chen, and Zhi Wang. 2024. 
Continual offline reinforcement learning via diffusion-based dual generative replay. arXiv preprint arXiv:2404.10662 (2024).\n[88] Jing Liu, Yang Liu, Di Li, Hanqi Wang, Xiaohong Huang, and Liang Song. 2023. DSDCLA: Driving style detection via hybrid CNN-LSTM with multi-level attention fusion. Applied Intelligence 53, 16 (2023), 19237–19254.\n[89] Jing Liu, Yang Liu, Jieyu Lin, Donglai Wei, Xu Xia, Wei Ni, Xiaohong Huang, and Liang Song. 2021. One-Dimensional Convolutional Neural Network Model for Abnormal Driving Behaviors Detection Using Smartphone Sensors. In 2021 International Conference on Networking Systems of AI (INSAI). 143–150. https://doi.org/10.1109/insai54028.2021.00035\n[90] Jing Liu, Yang Liu, Chengwen Tian, Donglai Wei, Mengyang Zhao, Wei Ni, Xinhua Zeng, and Liang Song. 2021. A Survey of Recent Advances in Driving Behavior Analysis. In 2021 3rd International Symposium on Smart and Healthy Cities (ISHC). 145–157. https://doi.org/10.1109/ishc54333.2021.00035\n[91] Jing Liu, Yang Liu, Chengwen Tian, Mengyang Zhao, Xinhua Zeng, and Liang Song. 2022. Multi-Level Attention Fusion for Multimodal Driving Maneuver Recognition. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS). 2609–2613. https://doi.org/10.1109/iscas48785.2022.9937710\n[92] Jing Liu, Yang Liu, Donglai Wei, Wei Ni, Xinhua Zeng, and Liang Song. 2022. Attention-Based Auto-Encoder Framework for Abnormal Driving Detection. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS). 3150–3154. https://doi.org/10.1109/iscas48785.2022.9937548\n[93] Jing Liu, Yang Liu, Wei Zhu, Xiaoguang Zhu, and Liang Song. 2023. Distributional and spatial-temporal robust representation learning for transportation activity recognition. Pattern Recognition 140 (2023), 109568.\n[94] Jing Liu, Yang Liu, and Xiaoguang Zhu. 2024. Privacy-Preserving Video Anomaly Detection: A Survey. 
arXiv preprint arXiv:2411.14565 (2024).\n[95] Jing Liu, Zhenchao Ma, Zepu Wang, Chenxuanyin Zou, Jiayang Ren, Zehua Wang, Liang Song, Bo Hu, Yang Liu, and Victor Leung. 2025. A Survey on Diffusion Models for Anomaly Detection. arXiv preprint arXiv:2501.11430 (2025).\n[96] Jinmei Liu, Zhi Wang, and Chunlin Chen. 2022. Fast probabilistic policy reuse via reward function fitting. In 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–7.\n[97] Jinmei Liu, Zhi Wang, Chunlin Chen, and Daoyi Dong. 2023. Efficient Bayesian policy reuse with a scalable observation model in deep reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems (2023).\n[98] Jing Liu, Donglai Wei, Yang Liu, Sipeng Zhang, Tong Yang, and Victor C. M. Leung. 2024. SCMM: Calibrating Cross-modal Representations for Text-Based Person Search. arXiv:2304.02278\n[99] Jing Liu, Wei Zhu, Di Li, Xing Hu, and Liang Song. 2025. Domain Generalization with Semi-Supervised Learning for People-Centric Activity Recognition. Science China Information Sciences 68, 1 (Jan. 2025), 112103. https://doi.org/10.1007/s11432-022-3860-y\n[100] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. 2018. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6536–6545.\n[101] Weide Liu, Zhonghua Wu, Yiming Wang, Henghui Ding, Fayao Liu, Jie Lin, and Guosheng Lin. 2024. LCReg: Long-tailed image classification with latent categories based recognition. Pattern Recognition 145 (2024), 109971.\n[102] Weide Liu, Zhonghua Wu, Yang Zhao, Yuming Fang, Chuan-Sheng Foo, Jun Cheng, and Guosheng Lin. 2024. Harmonizing base and novel classes: A class-contrastive approach for generalized few-shot segmentation. International Journal of Computer Vision 132, 4 (2024), 1277–1291.\n[103] Weide Liu, Chi Zhang, Henghui Ding, Tzu-Yi Hung, and Guosheng Lin. 2022. 
Few-shot segmentation with optimal transport matching and message flow. IEEE Transactions on Multimedia 25 (2022), 5130–5141.\n[104] Weide Liu, Chi Zhang, Guosheng Lin, and Fayao Liu. 2022. Crcnet: Few-shot segmentation with cross-reference and region–global conditional networks. International Journal of Computer Vision 130, 12 (2022), 3140–3157.\n[105] Yang Liu, Zhengliang Guo, Jing Liu, Chengfang Li, and Liang Song. 2023. Osin: Object-centric scene inference network for unsupervised video anomaly detection. IEEE Signal Processing Letters 30 (2023), 359–363.\n[106] Yang Liu, Bobo Ju, Dingkang Yang, Liyuan Peng, Di Li, Peng Sun, Chengfang Li, Hao Yang, Jing Liu, and Liang Song. 2024. Memory-enhanced spatial-temporal encoding framework for industrial anomaly detection system. Expert Systems with Applications (2024), 123718.\n[107] Yusha Liu, Chun-Liang Li, and Barnabás Póczos. 2018. Classifier Two Sample Test for Video Anomaly Detections. In BMVC. 71.\n[108] Yang Liu, Di Li, Wei Zhu, Dingkang Yang, Jing Liu, and Liang Song. 2023. MSN-net: Multi-scale normality network for video anomaly detection. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.\n[109] Yang Liu, Shuang Li, Jing Liu, Hao Yang, Mengyang Zhao, Xinhua Zeng, Wei Ni, and Liang Song. 2021. Learning Attention Augmented Spatial-temporal Normality for Video Anomaly Detection. In 2021 3rd International Symposium on Smart and Healthy Cities (ISHC). IEEE, 137–144.\n[110] Yang Liu, Jing Liu, Jieyu Lin, Mengyang Zhao, and Liang Song. 2022. Appearance-Motion United Auto-Encoder Framework for Video Anomaly Detection. IEEE Transactions on Circuits and Systems II: Express Briefs 69, 5 (2022), 2498–2502.\n[111] Yang Liu, Jing Liu, Kun Yang, Bobo Ju, Siao Liu, Yuzheng Wang, Dingkang Yang, Peng Sun, and Liang Song. 2024. AMP-Net: Appearance-Motion Prototype Network Assisted Automatic Video Anomaly Detection System. 
IEEE Transactions on Industrial Informatics 20, 2 (2024), 2843–2855.\n[112] Yang Liu, Jing Liu, Mengyang Zhao, Shuang Li, and Liang Song. 2022. Collaborative Normality Learning Framework for Weakly Supervised Video Anomaly Detection. IEEE Transactions on Circuits and Systems II: Express Briefs 69, 5 (2022), 2508–2512.\n[113] Yang Liu, Jing Liu, Mengyang Zhao, Dingkang Yang, Xiaoguang Zhu, and Liang Song. 2022. Learning Appearance-Motion Normality for Video Anomaly Detection. In 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6.\n[114] Yang Liu, Jing Liu, Xiaoguang Zhu, Donglai Wei, Xiaohong Huang, and Liang Song. 2022. Learning Task-Specific Representation for Video Anomaly Detection with Spatial-Temporal Attention. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2190–2194.\n[115] Yang Liu, Hongjin Wang, Zepu Wang, Xiaoguang Zhu, Jing Liu, Peng Sun, Rui Tang, Jianwei Du, Victor C.M. Leung, and Liang Song. 2025. CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos. arXiv preprint (2025).\n[116] Yang Liu, Zhaoyang Xia, Mengyang Zhao, Donglai Wei, Yuzheng Wang, Liu Siao, Bobo Ju, Gaoyun Fang, Jing Liu, and Liang Song. 2023. Learning Causality-inspired Representation Consistency for Video Anomaly Detection. In 31st ACM International Conference on Multimedia. 203–212.\n[117] Yang Liu, Dingkang Yang, Gaoyun Fang, Yuzheng Wang, Donglai Wei, Mengyang Zhao, Kai Cheng, Jing Liu, and Liang Song. 2023. Stochastic video normality network for abnormal event detection in surveillance videos. Knowledge-Based Systems 280 (2023), 110986.\n[118] Yang Liu, Dingkang Yang, Yan Wang, Jing Liu, Jun Liu, Azzedine Boukerche, Peng Sun, and Liang Song. 2024. Generalized video anomaly event detection: Systematic taxonomy and comparison of deep models. ACM Computing Surveys 56, 7 (2024), 1–38.\n[119] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. 2021. 
A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13588–13597.\n[120] Cewu Lu, Jianping Shi, and Jiaya Jia. 2013. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision. 2720–2727.\n[121] Weixin Luo, Wen Liu, and Shenghua Gao. 2017. Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 439–444.\n[122] Weixin Luo, Wen Liu, Dongze Lian, and Shenghua Gao. 2021. Future frame prediction network for video anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).\n[123] Weixin Luo, Wen Liu, Dongze Lian, Jinhui Tang, Lixin Duan, Xi Peng, and Shenghua Gao. 2019. Video anomaly detection with sparse coding inspired deep neural networks. IEEE transactions on pattern analysis and machine intelligence 43, 3 (2019), 1070–1084.\n[124] Hui Lv and Qianru Sun. 2024. Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702 (2024).\n[125] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. 2010. Anomaly detection in crowded scenes. In 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 1975–1981.\n[126] Michaela Mašková, Matěj Zorek, Tomáš Pevny, and Václav Šmídl. 2024. Deep anomaly detection on set data: Survey and comparison. Pattern Recognition (2024), 110381.\n[127] Jefferson Ryan Medel and Andreas Savakis. 2016. Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390 (2016).\n[128] Pratik K Mishra, Alex Mihailidis, and Shehroz S Khan. 2024. Skeletal Video Anomaly Detection Using Deep Learning: Survey, Challenges, and Future Directions. 
IEEE Transactions on Emerging Topics in Computational Intelligence (2024).\n[129] Ruwan Nawarathna, JungHwan Oh, Jayantha Muthukudage, Wallapak Tavanapong, Johnny Wong, Piet C De Groen, and Shou Jiang Tang. 2014. Abnormal image detection in endoscopy videos using a filter bank and local binary patterns. Neurocomputing 144 (2014), 70–91.\n[130] Rashmika Nawaratne, Damminda Alahakoon, Daswin De Silva, and Xinghuo Yu. 2019. Spatiotemporal anomaly detection using deep learning for real-time video surveillance. IEEE Transactions on Industrial Informatics 16, 1 (2019), 393–402.\n[131] Rashmiranjan Nayak, Umesh Chandra Pati, and Santos Kumar Das. 2021. A comprehensive review on deep learning-based methods for video anomaly detection. Image and Vision Computing 106 (2021), 104078.\n[132] Rashmiranjan Nayak, Umesh Chandra Pati, and Santos Kumar Das. 2023. Video Anomaly Detection Using Self-Attention-Enabled Convolutional Spatiotemporal Autoencoder. In 2023 22nd International Symposium on Communications and Information Technologies (ISCIT). IEEE, 70–75.\n[133] Rashmiranjan Nayak, Umesh Chandra Pati, and Santos Kumar Das. 2024. A comprehensive review of datasets for detection and localization of video anomalies: a step towards data-centric artificial intelligence-based video anomaly detection. Multimedia Tools and Applications 83, 21 (2024), 59617–59674.\n[134] Khac-Tuan Nguyen, Dat-Thanh Dinh, Minh N Do, and Minh-Triet Tran. 2020. Anomaly detection in traffic surveillance videos with gan-based future frame prediction. In Proceedings of the 2020 International Conference on Multimedia Retrieval. 457–463.\n[135] Trong-Nguyen Nguyen and Jean Meunier. 2019. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the IEEE/CVF international conference on computer vision. 1273–1283.\n[136] Zhiyuan Ning, Zile Wang, Yang Liu, Jing Liu, and Liang Song. 2024. Memory-enhanced appearance-motion consistency framework for video anomaly detection. 
Computer Communications 216 (2024), 159–167.\n[137] Ghazal Alinezhad Noghre, Shanle Yao, Armin Danesh Pazho, Babak Rahimi Ardabili, Vinit Katariya, and Hamed Tabkhi. 2024. PHEVA: A Privacy-preserving Human-centric Video Anomaly Detection Dataset. arXiv preprint arXiv:2408.14329 (2024).\n[138] Sergiu Oprea, Pablo Martinez-Gonzalez, Alberto Garcia-Garcia, John Alejandro Castro-Vargas, Sergio Orts-Escolano, Jose Garcia-Rodriguez, and Antonis Argyros. 2020. A review on deep learning techniques for video prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 6 (2020), 2806–2826.\n[139] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. 2021. Deep learning for anomaly detection: A review. ACM Computing Surveys (CSUR) 54, 2 (2021), 1–38.\n[140] Guansong Pang, Cheng Yan, Chunhua Shen, Anton van den Hengel, and Xiao Bai. 2020. Self-trained deep ordinal regression for end-to-end video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 12173–12182.\n[141] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. 2020. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14372–14381.\n[142] Liyuan Peng, Yujie Huang, Mingyu Wang, Wenhong Li, Minge Jing, and Xiaoyang Zeng. 2025. Defective Pixel Corrector for Line Scan and Area Scan Image Sensors. IEEE Transactions on Circuits and Systems I: Regular Papers (2025).\n[143] Yujiang Pu and Xiaoyu Wu. 2022. Audio-Guided Attention Network for Weakly Supervised Violence Detection. In 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE). IEEE, 219–223.\n[144] Yujiang Pu, Xiaoyu Wu, Lulu Yang, and Shengjin Wang. 2024. Learning prompt-enhanced context features for weakly-supervised video anomaly detection. 
IEEE Transactions on Image Processing (2024).\n[145] Shaoming Qiu, Jingfeng Ye, Jiancheng Zhao, Lei He, Liangyu Liu, E Bicong, and Xinchen Huang. 2024. Video anomaly detection guided by clustering learning. Pattern Recognition (2024), 110550.\n[146] Alec Radford. 2018. Improving language understanding by generative pre-training. (2018).\n[147] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.\n[148] Rohit Raja, Prakash Chandra Sharma, Md Rashid Mahmood, and Dinesh Kumar Saini. 2022. Analysis of anomaly detection in surveillance video: recent trends and future vision. Multimedia Tools and Applications (2022), 1–17.\n[149] Bharathkumar Ramachandra, Michael J Jones, and Ranga Raju Vatsavai. 2020. A survey of single-scene video anomaly detection. IEEE transactions on pattern analysis and machine intelligence 44, 5 (2020), 2293–2312.\n[150] Mahdyar Ravanbakhsh, Enver Sangineto, Moin Nabi, and Nicu Sebe. 2019. Training adversarial discriminators for cross-channel abnormal event detection in crowds. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1896–1904.\n[151] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779–788.\n[152] Khosro Rezaee, Sara Mohammad Rezakhani, Mohammad R Khosravi, and Mohammad Kazem Moghimi. 2021. A survey on deep learning-based real-time crowd anomaly detection for secure distributed video surveillance. Personal and Ubiquitous Computing (2021), 1–17.\n[153] Ryne Roady, Tyler L Hayes, Hitesh Vaidya, and Christopher Kanan. 2020. Stream-51: Streaming classification and novelty detection from videos. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 228–229.\n[154] Mehrsan Javan Roshtkhari and Martin D Levine. 2013. An on-line, real-time learning method for detecting anomalies in videos using spatio-temporal compositions. Computer vision and image understanding 117, 10 (2013), 1436–1452.\n[155] Mohammad Sabokrou, Mahmood Fathy, Mojtaba Hoseini, and Reinhard Klette. 2015. Real-time anomaly detection and localization in crowded scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 56–62.\n[156] Mohammad Sabokrou, Mohsen Fayyaz, Mahmood Fathy, and Reinhard Klette. 2017. Deep-cascade: Cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Transactions on Image Processing 26, 4 (2017), 1992–2004.\n[157] Venkatesh Saligrama and Zhu Chen. 2012. Video anomaly detection based on local statistical aggregates. In 2012 IEEE Conference on computer vision and pattern recognition. IEEE, 2112–2119.\n[158] Venkatesh Saligrama, Janusz Konrad, and Pierre-Marc Jodoin. 2010. Video anomaly identification. IEEE Signal Processing Magazine 27, 5 (2010), 18–33.\n[159] Yau Alhaji Samaila, Patrick Sebastian, Narinderjit Singh Sawaran Singh, Aliyu Nuhu Shuaibu, Syed Saad Azhar Ali, Temitope Ibrahim Amosa, Ghulam E Mustafa Abro, and Isiaka Shuaibu. 2024. Video Anomaly Detection: A Systematic Review of Issues and Prospects. Neurocomputing (2024), 127726.\n[160] Kelathodi Kumaran Santhosh, Debi Prosad Dogra, and Partha Pratim Roy. 2020. Anomaly detection in road traffic using visual surveillance: A survey. ACM Computing Surveys (CSUR) 53, 6 (2020), 1–26.\n[161] Ting Shi, Wu Yang, and Junfei Qiao. 2021. Research on Nonlinear Systems Modeling Methods Based on Neural Networks. In Journal of Physics: Conference Series, Vol. 2095. IOP Publishing, 012037.\n[162] Prakhar Singh and Vinod Pankajakshan. 2018. 
A Deep Learning Based Technique for Anomaly Detection in Surveillance Videos. In 2018 Twenty Fourth National Conference on Communications (NCC). 1–6.\n[163] Sorina Smeureanu, Radu Tudor Ionescu, Marius Popescu, and Bogdan Alexe. 2017. Deep appearance features for abnormal behavior detection in video. In International Conference on Image Analysis and Processing. Springer, 779–789.\n[164] Liang Song, Xing Hu, Guanhua Zhang, Petros Spachos, Konstantinos N Plataniotis, and Hequan Wu. 2022. Networking systems of AI: on the convergence of computing and communications. IEEE Internet of Things Journal 9, 20 (2022), 20352–20381.\n[165] Yong Su, Yuyu Tan, Simin An, Meng Xing, and Zhiyong Feng. 2025. Semantic-driven dual consistency learning for weakly supervised video anomaly detection. Pattern Recognition 157 (2025), 110898.\n[166] Yong Su, Haohao Zhu, Yuyu Tan, Simin An, and Meng Xing. 2023. Prime: privacy-preserving video anomaly detection via motion exemplar guidance. Knowledge-Based Systems 278 (2023), 110872.\n[167] Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6479–6488.\n[168] Jiayu Sun, Jie Shao, and Chengkun He. 2019. Abnormal event detection for video surveillance using deep one-class learning. Multimedia Tools and Applications 78, 3 (2019), 3633–3647.\n[169] Zehua Sun, Qiuhong Ke, Hossein Rahmani, Mohammed Bennamoun, Gang Wang, and Jun Liu. 2022. Human action recognition from various data modalities: A review. IEEE transactions on pattern analysis and machine intelligence 45, 3 (2022), 3200–3225.\n[170] Zhe Sun, Panpan Wang, Wang Zheng, and Meng Zhang. 2024. Dual GroupGAN: An unsupervised Four-Competitor (2V2) approach for video anomaly detection. Pattern Recognition (2024), 110500.\n[171] Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and Ying-Cong Chen. 2024. 
Hawk: Learning to Understand Open-World Video Anomalies. arXiv preprint arXiv:2405.16886 (2024).\n[172] Liangyu Teng, Yang Liu, Jing Liu, and Liang Song. 2024. End-Cloud Collaboration Framework for Advanced AI Customer Service in E-commerce. arXiv preprint arXiv:2410.07122 (2024).\n[173] Beiwen Tian, Huan-ang Gao, Leiyao Cui, Yupeng Zheng, Lan Luo, Baofeng Wang, Rong Zhi, Guyue Zhou, and Hao Zhao. 2024. Latency-aware Road Anomaly Segmentation in Videos: A Photorealistic Dataset and New Metrics. arXiv preprint arXiv:2401.04942 (2024).\n[174] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. 2021. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4975–4986.\n[175] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision. 4489–4497.\n[176] Hung Vu, Dinh Phung, Tu Dinh Nguyen, Anthony Trevors, and Svetha Venkatesh. 2017. Energy-based models for video anomaly detection. arXiv preprint arXiv:1708.05211 (2017).\n[177] Guneet Kaur Walia, Mohit Kumar, and Sukhpal Singh Gill. 2023. AI-empowered fog/edge resource management for IoT applications: A comprehensive review, research challenges and future perspectives. IEEE Communications Surveys \u0026amp; Tutorials (2023).\n[178] Hanqi Wang, Xing Hu, Liang Song, Guanhua Zhang, Yang Liu, Jing Liu, and Linhua Jiang. 2021. Stack Multiple Shallow Autoencoders into a Strong One: A New Reconstruction-Based Method to Detect Anomaly. In International Conference on Neural Information Processing. Springer, 103–115.\n[179] Tian Wang, Meina Qiao, Zhiwei Lin, Ce Li, Hichem Snoussi, Zhe Liu, and Chang Choi. 2018. Generative neural networks for anomaly detection in crowded scenes. 
IEEE Transactions on Information Forensics and Security 14, 5 (2018), 1390–1399.\n[180] Xuanzhao Wang, Zhengping Che, Bo Jiang, Ning Xiao, Ke Yang, Jian Tang, Jieping Ye, Jingyu Wang, and Qi Qi. 2021. Robust unsupervised video anomaly detection by multipath frame prediction. IEEE transactions on neural networks and learning systems (2021).\n[181] Yang Wang, Tianying Liu, Jiaogen Zhou, and Jihong Guan. 2023. Video anomaly detection based on spatio-temporal relationships among objects. Neurocomputing 532 (2023), 141–151.\n[182] Yan Wang, Yixuan Sun, Wei Song, Shuyong Gao, Yiwen Huang, Zhaoyu Chen, Weifeng Ge, and Wenqiang Zhang. 2022. DPCNet: Dual Path Multi-Excitation Collaborative Network for Facial Expression Representation Learning in Videos. In Proceedings of the 30th ACM International Conference on Multimedia. 101–110.\n[183] Yunbo Wang, Haixu Wu, Jianjin Zhang, Zhifeng Gao, Jianmin Wang, S Yu Philip, and Mingsheng Long. 2022. Predrnn: A recurrent neural network for spatiotemporal predictive learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 2 (2022), 2208–2225.\n[184] Yan Wang, Shaoqi Yan, Yang Liu, Wei Song, Jing Liu, Yang Chang, Xinji Mai, Xiping Hu, Wenqiang Zhang, and Zhongxue Gan. 2024. A Survey on Facial Expression Recognition of Static and Dynamic Emotions. arXiv preprint arXiv:2408.15777 (2024).\n[185] Yan Wang, Shaoqi Yan, Wei Song, Antonio Liotta, Jing Liu, Dingkang Yang, Shuyong Gao, and Wenqiang Zhang. 2024. MGR3Net: Multigranularity Region Relation Representation Network for Facial Expression Recognition in Affective Robots. IEEE Transactions on Industrial Informatics (2024), 1–11.\n[186] Yuzheng Wang, Dingkang Yang, Zhaoyu Chen, Yang Liu, Siao Liu, Wenqiang Zhang, Lihua Zhang, and Lizhe Qi. 2024. De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 
12615–12625.\n[187] Donglai Wei, Yang Liu, Xiaoguang Zhu, Jing Liu, and Xinhua Zeng. 2022. MSAF: Multimodal Supervise-Attention Enhanced Fusion for Video Anomaly Detection. IEEE Signal Processing Letters 29 (2022), 2178–2182.\n[188] Dong-Lai Wei, Chen-Geng Liu, Yang Liu, Jing Liu, Xiao-Guang Zhu, and Xin-Hua Zeng. 2022. Look, Listen and Pay More Attention: Fusing Multi-Modal Information for Video Violence Detection. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1980–1984.\n[189] Lian Wu, Chao Huang, Lunke Fei, Shuping Zhao, Jianchuan Zhao, Zhongwei Cui, and Yong Xu. 2024. Video-Based Fall Detection Using Human Pose and Constrained Generative Adversarial Network. IEEE Transactions on Circuits and Systems for Video Technology 34, 4 (April 2024), 2179–2194.\n[190] Peng Wu, Jing Liu, Xiangteng He, Yuxin Peng, Peng Wang, and Yanning Zhang. 2024. Toward Video Anomaly Retrieval From Video Anomaly Detection: New Benchmarks and Model. IEEE Transactions on Image Processing 33 (2024), 2213–2225.\n[191] Peng Wu, Jing Liu, and Fang Shen. 2019. A deep one-class neural network for anomalous event detection in complex scenes. IEEE transactions on neural networks and learning systems 31, 7 (2019), 2609–2622.\n[192] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. 2020. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In European conference on computer vision. Springer, 322–339.\n[193] Peng Wu, Xiaotao Liu, and Jing Liu. 2022. Weakly supervised audio-visual violence detection. IEEE Transactions on Multimedia (2022).\n[194] Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. 2024. Open-Vocabulary Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition .\n[195] Shandong Wu, Brian E Moore, and Mubarak Shah. 2010. 
Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes. In 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2054–2060.\n[196] Dan Xu, Yan Yan, Elisa Ricci, and Nicu Sebe. 2017. Detecting anomalous events in videos by learning deep representations of appearance and motion. Computer Vision and Image Understanding 156 (2017), 117–127.\n[197] Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. 2019. Auto-fpn: Automatic network architecture adaptation for object detection beyond classification. In Proceedings of the IEEE/CVF international conference on computer vision. 6649–6658.\n[198] Peng Xu, Xiatian Zhu, and David A Clifton. 2023. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).\n[199] Cheng Yan, Shiyu Zhang, Yang Liu, Guansong Pang, and Wenjun Wang. 2023. Feature prediction diffusion model for video anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5527–5537.\n[200] Shiyang Yan, Jeremy S Smith, Wenjin Lu, and Bailing Zhang. 2018. Abnormal event detection from videos using a two-stream recurrent variational autoencoder. IEEE Transactions on Cognitive and Developmental Systems 12, 1 (2018), 30–42.\n[201] Dingkang Yang, Shuai Huang, Shunli Wang, Yang Liu, Peng Zhai, Liuzhen Su, Mingcheng Li, and Lihua Zhang. 2022. Emotion Recognition for Multiple Context Awareness. In European Conference on Computer Vision. Springer, 144–162.\n[202] Dingkang Yang, Mingcheng Li, Dongling Xiao, Yang Liu, Kun Yang, Zhaoyu Chen, Yuzheng Wang, Peng Zhai, Ke Li, and Lihua Zhang. 2024. Towards multimodal sentiment analysis debiasing via bias purification. arXiv preprint arXiv:2403.05023 (2024).\n[203] Hao Yang, Hongyuan Lu, Xinhua Zeng, Yang Liu, Xiang Zhang, Haoran Yang, Yumeng Zhang, Yiran Wei, and Wai Lam. 2024. 
Stephanie: Step-by-Step Dialogues for Mimicking Human Interactions in Social Conversations. arXiv preprint arXiv:2407.04093 (2024).\n[204] Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. 2021. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334 (2021).\n[205] Kun Yang, Dingkang Yang, Jingyu Zhang, Mingcheng Li, Yang Liu, Jing Liu, Hanqi Wang, Peng Sun, and Liang Song. 2023. Spatio-temporal domain awareness for multi-agent collaborative perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 23383-23392. 2.\n[206] Zhiwei Yang, Jing Liu, and Peng Wu. 2024. Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection. arXiv preprint arXiv:2404.08531 (2024).\n[207] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. 2023. Video event restoration based on keyframes for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14592–14601.\n[208] Hongyu Ye, Ke Xu, Xinghao Jiang, and Tanfeng Sun. 2024. Learning Spatio-Temporal Relations with Multi-Scale Integrated Perception for Video Anomaly Detection. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4020–4024.\n[209] Muchao Ye, Xiaojiang Peng, Weihao Gan, Wei Wu, and Yu Qiao. 2019. Anopcn: Video anomaly detection via deep predictive coding network. In Proceedings of the 27th ACM International Conference on Multimedia. 1805–1813.\n[210] Yafeng Yin, Lei Xie, Zhiwei Jiang, Fu Xiao, Jiannong Cao, and Sanglu Lu. 2024. A Systematic Review of Human Activity Recognition Based On Mobile Devices: Overview, Progress and Trends. IEEE Communications Surveys \u0026amp; Tutorials (2024).\n[211] Guang Yu, Siqi Wang, Zhiping Cai, Xinwang Liu, En Zhu, and Jianping Yin. 2023. Video Anomaly Detection via Visual Cloze Tests. 
IEEE Transactions on Information Forensics and Security (2023).\n[212] Guang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin, and Marius Kloft. 2020. Cloze test helps: Effective video anomaly detection via learning to complete video events. In Proceedings of the 28th ACM International Conference on Multimedia. 583–591.\n[213] Jongmin Yu, Younkwan Lee, Kin Choong Yow, Moongu Jeon, and Witold Pedrycz. 2021. Abnormal event detection and localization via adversarial event prediction. IEEE Transactions on Neural Networks and Learning Systems (2021).\n[214] Jiashuo Yu, Jinyu Liu, Ying Cheng, Rui Feng, and Yuejie Zhang. 2022. Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection. In Proceedings of the 30th ACM International Conference on Multimedia. 6278–6287.\n[215] Muhammad Zaigham Zaheer, Jin-ha Lee, Marcella Astrid, and Seung-Ik Lee. 2020. Old is gold: Redefining the adversarially learned one-class classifier training paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14183–14193.\n[216] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. 2022. Generative Cooperative Learning for Unsupervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14744–14754.\n[217] Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. 2024. Harnessing Large Language Models for Training-free Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18527–18536.\n[218] Huaxin Zhang, Xiang Wang, Xiaohao Xu, Xiaonan Huang, Chuchu Han, Yuehuan Wang, Changxin Gao, Shanjun Zhang, and Nong Sang. 2024. GlanceVAD: Exploring Glance Supervision for Label-efficient Video Anomaly Detection. 
arXiv preprint arXiv:2403.06154 (2024).\n[219] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, and Nong Sang. 2024. Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM. arXiv preprint arXiv:2406.12235 (2024).\n[220] Menghao Zhang, Jingyu Wang, Qi Qi, Haifeng Sun, Zirui Zhuang, Pengfei Ren, Ruilong Ma, and Jianxin Liao. 2024. Multi-Scale Video Anomaly Detection by Multi-Grained Spatio-Temporal Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17385–17394.\n[221] Zhenzhen Zhang, Jianjun Hou, Qinglong Ma, and Zhaohong Li. 2015. Efficient video frame insertion and deletion detection based on inconsistency of correlations between local binary pattern coded frames. Security and Communication networks 8, 2 (2015), 311–320.\n[222] Mengyang Zhao, Yang Liu, Jing Liu, and Xinhua Zeng. 2022. Exploiting Spatial-temporal Correlations for Video Anomaly Detection. In 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 1727–1733.\n[223] Mengyang Zhao, Xinhua Zeng, Yang Liu, Jing Liu, and Chengxin Pang. 2024. Rethinking prediction-based video anomaly detection from local-global normality perspective. Expert Systems with Applications (2024), 125581.\n[224] Mengyang Zhao, Xinhua Zeng, Yang Liu, Jing Liu, and Chengxin Pang. 2025. Rethinking prediction-based video anomaly detection from local–global normality perspective. Expert Systems with Applications 262 (2025), 125581.\n[225] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. 2017. Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia. 1933–1941.\n[226] Joey Tianyi Zhou, Jiawei Du, Hongyuan Zhu, Xi Peng, Yong Liu, and Rick Siow Mong Goh. 2019. Anomalynet: An anomaly detection network for video surveillance. 
IEEE Transactions on Information Forensics and Security 14, 10 (2019), 2537–2550.\n[227] Shifu Zhou, Wei Shen, Dan Zeng, Mei Fang, Yuanwang Wei, and Zhijiang Zhang. 2016. Spatial–temporal convolutional neural networks for anomaly detection and localization in crowded scenes. Signal Processing: Image Communication 47 (2016), 358–368.\n[228] Wei Zhou, Zhibo Chen, and Weiping Li. 2019. Dual-stream interactive networks for no-reference stereoscopic image quality assessment. IEEE Transactions on Image Processing 28, 8 (2019), 3946–3958.\n[229] Wei Zhou and Zhou Wang. 2022. Quality assessment of image super-resolution: Balancing deterministic and statistical fidelity. In Proceedings of the 30th ACM international conference on multimedia. 934–942.\n[230] Wei Zhou and Zhou Wang. 2024. Perceptual depth quality assessment of stereoscopic omnidirectional images. IEEE Transactions on Circuits and Systems for Video Technology (2024).\n[231] Wei Zhou, Qi Yang, Wu Chen, Qiuping Jiang, Guangtao Zhai, and Weisi Lin. 2024. Blind quality assessment of dense 3D point clouds with structure guided resampling. ACM Transactions on Multimedia Computing, Communications and Applications 20, 8 (2024), 1–21.\n[232] Yixuan Zhou, Yi Qu, Xing Xu, Fumin Shen, Jingkuan Song, and Heng Tao Shen. 2024. Batchnorm-based weakly supervised video anomaly detection. IEEE Transactions on Circuits and Systems for Video Technology (2024).\n[233] Yuansheng Zhu, Wentao Bao, and Qi Yu. 2022. Towards open set video anomaly detection. In European Conference on Computer Vision. 
Springer, 395–412.\n","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/survey-4/","section":"Papers","summary":"A comprehensive survey and tutorial exploring the assumptions, frameworks, recent advances, applications, and future trends of Networking Systems for Video Anomaly Detection (NSVAD), emphasizing the integration of AI, IoVT, and computing for real-world deployable systems.","title":"Networking Systems for Video Anomaly Detection: A Tutorial and Survey","type":"survey"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/peng-sun/","section":"Authors","summary":"","title":"Peng Sun","type":"authors"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/type/survey/","section":"Type","summary":"","title":"Survey","type":"type"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/benchmarks/ubnormal/","section":"Benchmarks","summary":"","title":"Ubnormal","type":"benchmarks"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/benchmarks/ucsd-ped/","section":"Benchmarks","summary":"","title":"Ucsd-Ped","type":"benchmarks"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/victor-c.m.-leung/","section":"Authors","summary":"","title":"Victor C.M. 
Leung","type":"authors"},{"content":"","date":"1 April 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yang-liu/","section":"Authors","summary":"","title":"Yang Liu","type":"authors"},{"content":" AADC-Net: A Multimodal Deep Learning Framework for Automatic Anomaly Detection in Real-Time Surveillance # Duc Tri Phan, Member, IEEE, Vu Hoang Minh Doan, Jaeyeop Choi, Byeongil Lee, and Junghwan Oh, Senior Member, IEEE\nAbstract— Automatic anomaly detection (AAD) has emerged as an advanced vision-based measurement method with diverse applications in healthcare and security. However, current AAD methods still face challenges related to data limitations and labeled-data imbalances, which limit the accuracy and reliability of AAD in real-life applications. Additionally, labeling and training large datasets for video anomaly detection (VAD) is computationally demanding and time-consuming. To address these challenges, this work introduces AADC-Net, a multimodal deep neural network for automated abnormal event detection and categorization. The key contributions of this research are as follows: 1) AADC-Net leverages pretrained large language models (LLMs) and vision-language models (VLMs) to mitigate VAD dataset limitations and imbalances; 2) a pretrained object detection model [DEtection TRansformer (DETR)] is integrated for visual feature extraction, eliminating the need for bounding box supervision; 3) the experimental results demonstrate the state-of-the-art (SOTA) performance of the proposed AADC-Net with an area under the curve (AUC) of 83.2% and an average precision (AP) of 83.8% on the public UCF-Crime and XD-Violence datasets, respectively; and 4) AADC-Net can be integrated into existing video surveillance systems, such as those in smart gyms and healthcare facilities, to automatically detect anomalies in real time with minimal supervision, enhancing security, monitoring, and reducing labor costs while minimizing human error. 
In summary, our results demonstrate that AADC-Net not only achieves high accuracy in anomaly detection but also provides a practical solution for real-world surveillance applications.\nReceived 30 December 2024; revised 17 February 2025; accepted 25 February 2025. Date of publication 31 March 2025; date of current version 16 April 2025. This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIT) under Grant 2022R1A5A8023404. The Associate Editor coordinating the review process was Dr. Sudao He. (Duc Tri Phan and Vu Hoang Minh Doan contributed equally to this work.) (Corresponding authors: Byeongil Lee; Junghwan Oh.)\nDuc Tri Phan was with the Department of Biomedical Engineering, Pukyong National University, Busan 48513, South Korea. He is now with the Early Mental Potential and Wellbeing Research Centre (EMPOWER), Nanyang Technological University, Singapore 639798 (e-mail: ductri.phan@ntu.edu.sg).\nVu Hoang Minh Doan and Jaeyeop Choi are with the Smart Gym-Based Translational Research Center for Active Senior\u0026rsquo;s Healthcare, Pukyong National University, Busan 48513, South Korea (e-mail: doanvuhoangminh@gmail.com; jaeyeopchoi@pknu.ac.kr).\nByeongil Lee is with the Digital Healthcare Research Center, Institute of Information Technology and Convergence, Pukyong National University, Busan 48513, South Korea (e-mail: bilee@pknu.ac.kr).\nJunghwan Oh is with Ohlabs Corporation, Busan 48513, South Korea, and also with Industry 4.0 Convergence Bionics Engineering and the Department of Biomedical Engineering, Pukyong National University, Busan 48513, South Korea (e-mail: jungoh@pknu.ac.kr).\nDigital Object Identifier 10.1109/TIM.2025.3551832\nIndex Terms— Automatic anomaly detection (AAD), real-time surveillance, video anomaly detection (VAD), vision-based methods, vision-language model (VLM).\nI. 
INTRODUCTION # Video anomaly detection (VAD) is a computational tool used to detect unusual or abnormal events within video sequences, contributing to automated monitoring systems [1]. These abnormal patterns could include unusual movements, behaviors, or any irregular and unexpected activities [2]. More specifically, VAD aims to determine whether each given video frame contains any anomalies [3]. In real life, monitoring abnormal activities manually is a challenging task because they occur rarely compared with normal ones. Therefore, developing automated VAD techniques is essential for minimizing unnecessary workload and increasing the efficiency of real-time surveillance and security systems.\nVAD has been extensively studied and employed in real-world scenarios due to its practical benefits [4], [5], [6], [7]. In surveillance and security, VAD can help detect intruders in restricted areas and identify suspicious activities in public places, such as violent or terrorist acts [8], [9], [10]. In healthcare, it can monitor patients for abnormal movements or behaviors and detect falls or medical emergencies in patient care services, thereby reducing the workload of nurses and doctors [11], [12]. VAD is also utilized in industrial and manufacturing settings for detecting equipment malfunctions and identifying safety violations [13], [14], [15], [16]. Current research on VAD in smart gyms primarily focuses on monitoring workout routines, such as weight lifting, to ensure proper form [17]. Although its adoption in workout settings remains limited, VAD shows promise for enhancing personalized training and feedback, safety monitoring, and the identification of abnormal activities.\nOver the past few years, deep learning approaches have been widely used for VAD applications. 
Convolutional neural networks (CNNs) and long short-term memory networks (LSTMs) excel in learning complex patterns and hierarchical features from raw video data, making them ideal for detecting anomalies [18], [19]. Their ability to describe strongly nonlinear spatial and temporal relationships without manual feature engineering is a significant advantage, allowing them to adapt to various scenarios [20]. However, their limitations include the requirement of large datasets for training and significant computational resources [21]. They can be challenging to interpret, often acting as black boxes without clear insight into how decisions are made [22]. Spatiotemporal feature-based methods, such as dynamic texture models, treat videos as a linear dynamic system over time [23], [24]. These methods are well-known for detecting anomalies in dynamic scenes, such as abnormal crowd activities [25]. However, dynamic texture models can only effectively handle videos with smooth movements, are sensitive to noise, and require precise feature extraction [26].\n1557-9662 © 2025 IEEE. All rights reserved, including rights for text and data mining, and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission.\nSee https://www.ieee.org/publications/rights/index.html for more information.\nFig. 1. Prediction scores from baseline VAD models and video descriptions using VLM and LLM models are employed for workout categorization and abnormal detection. On the score curve, the red dashed lines denote anomaly thresholds. The bottom shows the answers from our proposed video-LLM model for the VAD video description.\nDespite various meaningful and innovative studies on VAD, several technical challenges remain in this field. The effectiveness of VAD is highly dependent on the complexity of real-world scenes, such as weather conditions, lighting variations, and background activities [27], [28]. 
Data scarcity is another challenge for VAD systems. Training data might not cover all possible abnormal events in social behaviors, leading to potential mispredictions by classical supervised models for detecting anomalies [29]. Moreover, the rarity of anomalies in daily life compared to normal activities causes an imbalance between positive and negative classes in the training dataset, resulting in low precision for minority class predictions [30]. In addition, the data collection and labeling process in VAD is time-consuming, and deploying trained models for real-time video data prediction can be computationally intensive [31]. To address these challenges, we propose a deep learning model that integrates large language models (LLMs) and vision-language models (VLMs) for VAD applications.\nLLMs and VLMs are renowned for their ability to process and comprehend complex data across a wide range of applications. LLMs provide transformative capabilities by enhancing how systems interact with video data in VAD [32]. VLMs enable a more comprehensive analysis of videos by integrating visual data from video frames with contextual information from associated textual data [33]. This multimodal approach enables the detection of context-specific anomalies that might be missed by visual data alone. The large-scale pretraining of VLMs helps mitigate the data scarcity and imbalance challenges of VAD applications. The proposed unsupervised models eliminate the need for time-consuming data labeling processes. LLMs and VLMs thus support more intuitive and effective anomaly detection by incorporating both visual and textual data to analyze and respond to complex situations in real time. 
Furthermore, incorporating advanced object detection models and the Top-K selection mechanism minimizes the impact of environmental changes on detection performance.\nIn summary, we propose a multimodal network based on LLM and VLM models for automatic abnormal event detection and categorization. As illustrated in Fig. 1, our anomaly detector aims to generate frame-level anomaly confidences, categories, localization, and video descriptions with only video-level annotations provided. The main contributions and advantages of the research are outlined as follows.\n1) We present a novel unsupervised framework, namely AADC-Net, for abnormal event detection and categorization. The AADC-Net is based on LLM and VLM models to address the data limitations of VAD, resulting in enhanced accuracy for abnormal detection and classification.\n2) The pretrained object detection model [DEtection TRansformer (DETR)] was implemented for visual feature extraction and bounding box detection without supervision. Moreover, we employed a Top-K selection mechanism for abnormal detection in video frames and utilized the multi-instance learning (MIL)-Align mechanism to extend our approach beyond binary classification to handle multiple classes.\n3) We leveraged a pretrained LLM model to provide detailed video descriptions.\n4) The performance of our AADC-Net model was evaluated using two widely recognized metrics and compared against current state-of-the-art (SOTA) methods. Our model achieved an area under the curve (AUC) of 83.2% on the UCF-Crime dataset and an average precision (AP) of 83.8% on the XD-Violence dataset, demonstrating its effective classification capabilities.\n5) Finally, we present a framework utilizing our AADC-Net model as a video processing node. This framework is applied to a smart gym surveillance system to demonstrate the AADC-Net model\u0026rsquo;s applicability in real-world scenarios.\nThe rest of this article is structured as follows. 
Section II provides a review of related work on VAD, focusing on the video-based VLM (VVLM) and video-based LLM (VLLM). Section III introduces the AADC network. Section IV presents experiments, results, an ablation study on public benchmark datasets, and the deployment of AADC-Net. Finally, Section V concludes and discusses the study.\nII. RELATED WORK # A. Video Anomaly Detection # VAD is a promising research area with various applications in security, surveillance, and beyond [3], [6]. However, a major challenge remains in the limited availability of anomalous data and labels [30]. Recent reviews and surveys have highlighted advancements in VAD methodologies, emphasizing the evolution of techniques to address these challenges [34], [35], [36]. Researchers have recently explored weakly supervised VAD (WSVAD) techniques to overcome the limitations of traditional VAD methods. These WSVAD methods leverage videos with only normal or abnormal labels, relying on weak annotations at the video level. Sultani et al. [1] first established a large benchmark and introduced a lightweight network utilizing MIL mechanisms. A higher order context model combined with a margin-based MIL loss function proposed by Lv et al. [37] has further improved anomaly localization. Zhong et al. [38] employed graph convolutional networks (GCNs) to capture frame-level similarities and temporal relationships, while self-attention mechanisms utilized by Tian et al. [39] have demonstrated effectiveness in modeling global temporal contexts. Zhang et al. [40] explored the completeness and uncertainty of pseudo-labels. Li et al. [41] proposed a transformer-based multisequence learning framework, while Huang et al. [42] introduced a transformer-based framework for temporal representation aggregation. 
Fine-grained anomaly detection techniques have emerged, distinguishing between various types of anomalous frames [43], and multihead network models have been designed to disentangle anomaly representations, with each component specializing in specific anomaly types [44].\nB. Video-Based Large Language Models # LLMs are making a remarkable impact in the field of video understanding [45]. VLLMs explore how LLMs can tackle multimodal problems, where information comes in various forms such as text, images, and video [45]. One key work is WebVid, a large dataset of short videos with corresponding text descriptions, introduced by Bain et al. [46]. Based on this dataset, Li et al. [47] improved image encoders to enable large models to grasp the visual content within videos. Su et al. [48] proposed multimodal encoders that allow models to handle multiple modalities. Zhang et al. [49] took a different approach, focusing on training fundamental models to comprehend both the visual and auditory aspects of videos. Ning et al. [50] proposed a benchmark system called Video-Bench to evaluate the capabilities of video-LLMs. Additionally, Wang et al. [51] introduced VidIL, a new model for creating VLMs that can handle various video-to-text tasks with minimal training data. For processing long videos, Weng et al. [52] offered a novel approach called LongVLM, which breaks down long videos into smaller segments, enabling the LLM to analyze the details of each part. Maaz et al. [53] introduced Video-ChatGPT, a conversational AI that can understand and discuss video content. Jin et al. [54] tackled the challenge of integrating image and video understanding into conversational LLMs with Chat-UniVi. This method utilizes a dynamic visual token system and multiscale representation for efficient comprehension of both broad concepts and fine-grained details within videos.\nC. Video-Based Vision Language Models # VVLMs have emerged as a trend in recent years [55]. 
This technique focuses on learning the connection between visual information and language by pretraining models on large-scale datasets [55]. One recent model attracting significant attention in VAD applications is CLIP, a contrastive language-image pretraining model [56]. CLIP4Clip by Luo et al. [57] demonstrates its effectiveness in video-text retrieval. Several studies, including those by Wang et al. [58], Lin et al. [59], and Ni et al. [60], explore adopting CLIP for video recognition tasks. Lv et al. [37] built upon CLIP\u0026rsquo;s visual features to develop a new framework, unbiased multiple instance learning (UMIL), for unbiased anomaly detection, leading to improved WSVAD performance. Joo et al. [61] propose using CLIP\u0026rsquo;s visual features to extract key representations from videos. They then leverage temporal self-attention (TSA) to analyze both short-term and long-term temporal dependencies and identify relevant video snippets. The application of CLIP extends further to the more complex tasks of video action localization, as demonstrated by Nag et al. [62] and Ju et al. [63]. Ju et al. [63] even propose a foundational approach for efficiently adapting pretrained image-based CLIP models to general video understanding, highlighting its potential for broader video analysis applications.\nD. Vision-Language Models in Anomaly Detection # Recent advancements in VLMs have enhanced VAD by integrating visual and textual representations. Methods such as CLIP-TSA [61] and UMIL-CLIP [37] utilize CLIP-based embeddings for weakly supervised anomaly detection, while approaches like Video-ChatGPT [53] and VidIL [51] employ LLM-guided reasoning for video understanding. Despite the recent success of VLM-based approaches, these existing methods often lack fine-grained frame-level supervision or require extensive video-text annotations. 
Transformer-based models such as UniVi [54] and LongVLM [52] further improve temporal anomaly detection but require extensive computational resources. In contrast, AADC-Net introduces a novel multimodal fusion approach that integrates DETR-based object detection, CLIP features, and LLM-generated descriptions, enabling fine-grained anomaly categorization beyond binary classification. Unlike existing methods, AADC-Net utilizes LLMs for context-aware anomaly reasoning, reducing false positives and enhancing interpretability. Its efficient fusion mechanism ensures scalability, making it more suitable for real-time surveillance applications than resource-intensive transformer-based models.\nIII. METHOD # A. Problem Statement # The problem can be formally described as follows. We are given a collection of training videos V = {V_n, V_a}, where V_n = {v_n^i | i = 1, ..., N_n} represents the set of normal videos and N_n is the total number of normal videos in the dataset. Similarly, V_a = {v_a^j | j = 1, ..., M_a} represents the set of abnormal videos and M_a is the number of abnormal videos. Each video v ∈ V_n ∪ V_a has a corresponding video-level category label c, where c ∈ L_nor ∪ L_abn. Here, L_nor represents the set of normal categories, and L_abn is the set of abnormal categories. Specifically, the model aims to predict an anomaly confidence score for each frame and identify the anomaly category along with workout activities in the input videos.\nB. Overall Framework # AADC-Net leverages the capabilities of a pretrained VLM to excel at learning rich, high-dimensional representations of data [55]. The CLIP models are pretrained on vast multimodal datasets containing diverse visual and textual information, enabling them to develop robust feature representations [56]. These representations are capable of capturing subtle patterns that distinguish normal from anomalous instances, making them particularly effective for VAD tasks [62]. 
Furthermore, traditional vision-based methods often rely on pixel-level features or handcrafted rules, which struggle to capture the complex semantic context of events [61]. In contrast, AADC-Net incorporates contextual learning from VLM, which allows it to associate visual features with textual concepts. This integration enables the system to infer whether an activity is anomalous based on a broader understanding of normal behavior patterns [51]. As a result, AADC-Net can detect subtle anomalies, such as unusual social interactions or contextual inconsistencies, which are often missed by traditional approaches.\nFig. 2 illustrates the AADC-Net framework for abnormal detection and categorization. Given a set of training videos V, AADC-Net first processes each video v ∈ V by splitting it into frames. Each input frame f ∈ R^(C×H×W), where C represents the number of channels, H the height, and W the width of the input frame, is then fed into a DETR encoder to extract DETR features (F_DETR) and perform novel bounding box detection. Subsequently, the input frames are cropped into smaller patches (P_bbox) based on the bounding box coordinates. These cropped patches are then processed by the CLIP image encoder Φ_CLIP-V to obtain visual CLIP features (F_v).\nNext, the AADC-Net framework combines the DETR features with the visual CLIP features to obtain frame-level features f_c with dimensions n × c, where n represents the number of video frames and c is the feature dimension. This combined feature representation is then passed through a multiscale temporal network (MTN) to effectively capture temporal dependencies at different scales. The resulting multiscale visual features are fed into an anomaly predictor to generate frame-level anomaly confidence scores. 
This pipeline is primarily used for the abnormal detection task.\nFor workout categorization, the visual CLIP features are fused with the anomaly confidence features to create video-level features. AADC-Net then leverages the CLIP text encoder to generate textual features. By calculating the alignments between the video-level features and the textual features, the framework estimates the anomaly category. Additionally, AADC-Net incorporates a pretrained LLM model to generate informative video descriptions.\nC. Multiscale Temporal Network # While CLIP is trained on massive image–text pairs, its direct application to videos presents challenges due to the temporal gap between the image and video domains. To address this, we introduce the MTN, designed to effectively capture long-range and short-range temporal dependencies for VAD.\nMTN consists of two key components: 1) a three-layer pyramid dilated convolution (PDC) block for multiresolution feature extraction and 2) a TSA module for long-range temporal dependency modeling. The learned multiscale temporal features are then integrated with DETR and CLIP-extracted features, ensuring robust spatiotemporal anomaly detection.\nThe PDC block employs dilated convolutions to capture temporal variations at different scales, ensuring that both short-term and long-term dependencies are considered.\nGiven the feature f_δ ∈ R^T, the 1-D dilated convolution operation with kernel W_(κ,δ)^(λ) ∈ R^W is defined as follows: f_κ^(λ) = Σ_(δ=1..D) W_(κ,δ)^(λ) ∗^(λ) f_δ, where κ ∈ {1, ..., D/4}, δ ∈ {1, ..., D}, λ ∈ {PDC1, PDC2, PDC3}, and W denotes the filter size.\nFig. 2. AADC-Net framework analyzes abnormal videos by identifying anomalies using DETR and CLIP features. It further categorizes workout types and describes abnormal events through text generated by LLMs.\nHere, ∗^(λ) denotes the dilated convolution operator indexed by λ. The term f_κ^(λ) ∈ R^T represents the output features obtained after applying the dilated convolution along the temporal dimension. 
The dilation factors that correspond to {PDC1, PDC2, PDC3} are {1, 2, 4}.\nWe modified the TSA block to analyze the video sequence along the time dimension. This allows us to produce an attention map M ∈ R^(T×T) to capture long-range interactions and relationships within the video sequence. The attention map is derived from a self-attention mechanism within the TSA module. The TSA module starts by applying a 1 × 1 convolution to reduce the feature dimension of X ∈ R^(T×D) to X^(c) ∈ R^(T×D/4), where X^(c) = Conv_1×1(X), with T being the number of frames in the video sequence and D the feature dimension.\nNext, three distinct 1 × 1 convolutional layers are applied to X^(c) to generate X^(c1), X^(c2), and X^(c3) ∈ R^(T×D/4), where X^(ci) = Conv_1×1(X^(c)) for i ∈ {1, 2, 3}. An attention map is then constructed as M = (X^(c1))(X^(c2))^T, which is used to produce X^(c4) = Conv_1×1(M X^(c3)). A skip connection is added after this final 1 × 1 convolutional layer, as follows: X^(TSA) = X^(c4) + X^(c).\nThe concatenated output of the MTN is obtained by concatenating the outputs from both the PDC and TSA modules, resulting in X̄ = [X^(λ)]_(λ∈L) ∈ R^(T×D), where L = {PDC1, PDC2, PDC3, TSA}. X̄ is then combined with the original input features (denoted as X) via a skip connection to obtain F: F = s_θ(X) = X̄ + X.\nHere, s_θ(X) denotes the function that combines the concatenated outputs of the PDC and TSA modules (i.e., X̄) with the original input features X. F is the output of the function s_θ, which includes both the MTN module\u0026rsquo;s processed features and the original features.\nIn anomaly detection, we input F into a binary classifier comprising a feed-forward network (FFN) layer, a fully connected (FC) layer, and a sigmoid activation function to compute the anomaly confidence score S_a ∈ R^(n×1).\nD. 
Learnable Prompt # In standard VLMs like CLIP, text labels typically consist of single words or short phrases (e.g., \u0026ldquo;treadmill workout,\u0026rdquo; \u0026ldquo;bike workout,\u0026rdquo; and \u0026ldquo;bench press\u0026rdquo;). However, such labels often lack the contextual depth needed to describe complex anomalous events in surveillance video data.\nTo bridge this gap and enhance the transferability of text embeddings, AADC-Net incorporates a learnable prompt mechanism that dynamically refines text embeddings by adding adaptive context tokens.\nFirst, AADC-Net transforms the original text labels (e.g., \u0026ldquo;treadmill workout,\u0026rdquo; \u0026ldquo;bike workout,\u0026rdquo; and \u0026ldquo;bench press\u0026rdquo;) into class tokens using the CLIP tokenizer. This can be expressed as t_init = Tokenizer(Label), where Label denotes the discrete text label. Second, a learnable prompt, denoted by {c_1, ..., c_m}, is created. This prompt comprises m context tokens that provide additional context for the class token. In sentence token formation, the class token t_init is strategically placed in the middle of the prompt sequence, forming a complete sentence token t_p = {c_1, ..., t_init, ..., c_m}. This positioning aims to leverage the surrounding context for a more comprehensive representation.\nThe sentence token t_p is further enhanced by adding positional embeddings. These embeddings encode the order of words within the sentence, providing crucial information for the text encoder. Finally, the CLIP text encoder takes the enriched sentence token t_p as input and generates a more robust text embedding t_out ∈ R^d.\nTraditional CLIP-based models use fixed text embeddings, which may struggle to distinguish between normal and anomalous activities in complex video scenes. 
AADC-Net\u0026rsquo;s learnable prompt mechanism enhances text embeddings by providing richer context, improving alignment with anomalies, bridging the text-video domain gap, and reducing reliance on manually labeled descriptions, enabling the model to learn anomalous characteristics autonomously.\nE. Detection Module # DETR is an end-to-end object detector based on a transformer encoder–decoder architecture [64]. Unlike traditional detection frameworks, DETR eliminates the need for anchor boxes and nonmaximum suppression. Instead, it directly predicts the set of object-bounding boxes and class labels by utilizing a transformer-based model that processes the image as a whole. The pretrained object detection network processes the input image and generates a set of object queries corresponding to potential objects in the image. A key advantage of DETR is its use of a bipartite matching algorithm that associates the object queries with ground-truth objects during training. The query-based detection approach easily obtains the features corresponding to each detected instance, enabling efficient extraction of pairwise representations for interaction recognition.\nThe detection process in the DETR framework begins by passing the input image through the visual DETR encoder, which serves as the backbone for extracting feature representations. The encoder processes the input image and outputs two key types of features: bounding box features, b = (x, y, w, h), where (x, y) represent the center coordinates of the bounding box and (w, h) represent its width and height, as well as the DETR features F_DETR.\nUsing the bounding box coordinates, we extract smaller image patches, denoted as P_bbox. These image patches are then passed through a pretrained VLM CLIP to generate image embeddings, as shown in Fig. 
2.\nThe list of image patches is denoted as L_bbox, and the computation for extracting the image embeddings is summarized as follows:\nNext, we fuse the extracted visual features F_vis with the DETR features. This is done by performing an elementwise addition between the CLIP image features and the DETR embedding features. An FC layer, denoted by f_FC, is applied to reduce the dimension of the visual features to match the dimensionality of the DETR features, and the two are then fused: X_c = f_FC(F_vis; δ) + F_DETR.\nHere, X_c ∈ R^(d_txt) represents the final fused features, where d_txt is the target dimensionality of the combined feature vector, typically aligning with the text feature dimensionality in multimodal fusion tasks. The parameter δ refers to the weights of the FC layer f_FC.\nThe fused features X_c are then used in subsequent downstream tasks, such as interaction recognition, activity detection, and other multimodal recognition tasks that require both visual and semantic understanding of the input scene. This fusion process enables the system to harness the complementary strengths of both DETR\u0026rsquo;s spatial reasoning capabilities and CLIP\u0026rsquo;s rich, semantic image understanding, making it highly effective for tasks requiring both detailed visual information and semantic context.\nF. Large Language Model # This work explores the use of LLaMA, an LLM, for generating video descriptions [65]. These descriptions typically follow a question-and-answer format with a template structure. 
Here is an example demonstrating how AADC-Net leverages LLaMA to create video descriptions.\nQuestion:\n### Human: # \u0026lt;Video\u0026gt; [Video Tokens]\n\u0026lt;Video\u0026gt; [Can you describe this video?]\n### AADC-Net: # \u0026lt;Normal/Abnormal\u0026gt;\n\u0026lt;Workout/Abnormal Activities\u0026gt;\nAnswer: # ### Assistant: # \u0026lt;Yes, the video shows normal activities at the gym from 00:00 to 02:18 with participants in a dumbbell press session.\u0026gt;\nThe question is first converted into textual embeddings using a pretrained LLM. These textual embeddings are then combined with the features extracted by AADC-Net. This combined representation provides LLaMA with a richer understanding of the video content. LLaMA utilizes this combined input to generate a textual description as the answer. This description conveys information about whether the video is normal or abnormal, the specific activities it showcases, and their duration. By incorporating LLaMA, AADC-Net gains the ability to generate informative video descriptions that complement its core functionalities of abnormal classification and workout activity categorization.\nG. Objective Function # For abnormal detection, we build upon prior VAD works by employing a Top-K selection mechanism. This method identifies the K frames with the highest anomaly scores in both normal and abnormal videos. We then calculate the average anomaly score for both sets, denoted by A_normal and A_abnormal, respectively. These averages are then fed into a sigmoid function σ to obtain video-level anomaly predictions ŷ = σ(A), with A ∈ {A_normal, A_abnormal}.\nFinally, binary cross-entropy loss (L_bce) is computed between these predictions and the ground-truth labels (y) for classification: L_bce = −(1/N) Σ_(i=1..N) [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)], where N represents the total number of videos.\nIn regard to class-specific categorization, we propose the MIL-Align mechanism to extend our approach beyond binary classification to handle multiple classes. 
This method utilizes aggregated video-level features (X_v) and textual category embeddings (E_c = {e_1, ..., e_m}) to determine video-level classification. For each video, we choose the Top-K similarities between the video features and category embeddings using a distance metric. The average of these Top-K similarities represents the alignment score (s_i) between the video and the ith class. This process results in a vector S = {s_1, ..., s_m}, where each element represents the similarity between the video and a specific class.\nOur goal is to maximize the similarity score between the video and its corresponding textual label (E_(y*), where y* denotes the ground-truth class) compared to other classes. To accomplish this, the multiclass prediction (φ_i) with respect to the ith class is first computed using a softmax function\nwhere γ represents the hyperparameter used for scaling. Finally, the multiclass classification loss L_nce is computed using cross-entropy\nDuring training, we combine both loss functions L_bce and L_nce to optimize the overall model performance. The total loss function (L) is simply the sum of these individual losses as follows:\nwhere λ is a hyperparameter that controls the weight of the classification loss.\nH. Compared Methods # We compared our AADC-Net with various advanced VAD approaches for a comprehensive analysis. Supervised and weakly supervised methods include Sultani et al. [1], who employ MIL for anomaly detection, MIST [66] with multiple-instance self-training, and Bayesian network + spatial-visual pattern (BN-SVP) [67] using submodular video partitioning for anomaly localization. Graph-based and self-attention methods such as GCN [38] utilize GCNs for spatial–temporal modeling, while RTFM [39] enhances anomaly detection through Robust Temporal Feature Magnitude learning. 
Contrastive and transformer-based approaches include generative one-shot detection of anomalies (GODS) [68] with One-Class Discriminative Subspaces, fully self-contained network (FSCN) [69] using sparse coding networks for robust feature learning, and dual memory units for weakly-supervised anomaly detection (DMU) [70] integrating dual memory units for uncertainty regulation. Graph temporal learning for anomaly detection (GTL) [71] employs a generative Transformer model for long-range dependency modeling without labeled data. A diffusion-based feature prediction model, future-prediction-based dual memory network (FPDM) [72], refines video representations to learn normal distributions. Finally, vision-language and multimodal methods such as CLIP-TSA [61] incorporate CLIP-assisted TSA for anomaly detection, while consistency-based self-supervised learning for temporal anomaly localization (CSL-TAL) [73] refines anomaly scores via self-supervised contrastive learning.\nAuthorized licensed use limited to: University of South Florida. Downloaded on September 01,2025 at 19:58:57 UTC from IEEE Xplore. Restrictions apply.\nIV. RESULTS # A. Comparison With SOTA Methods # This section presents frame-level AUC results on the UCF-Crime [74] and XD-Violence [75] datasets. The UCF-Crime dataset is a real-world surveillance video dataset with nearly 1900 video clips collected from various online sources, encompassing 13 different types of real-world anomalies alongside some normal videos. The XD-Violence dataset is a large-scale collection containing 2800 video clips that focus specifically on violent actions and violence-related scenarios. We introduced a new VAD dataset, AN-Workout, constructed specifically to capture abnormal actions within gym scenarios. AN-Workout comprises 830 real-world surveillance videos from smart gyms, featuring 11 types of realistic anomalies, including actions like fighting, using a phone, smoking, talking, and drinking water. It also includes 14 normal workout activities such as treadmill use, dumbbell press, and deadlifting. The results are detailed in Tables I and II, respectively.\nTABLE I FRAME-LEVEL AUC RESULTS ON UCF-CRIME DATASET\nTABLE II FRAME-LEVEL AP RESULTS ON XD-VIOLENCE DATASET\nTABLE III mAP UNDER IOU RESULTS ON XD-VIOLENCE DATASET\nTABLE IV mAP UNDER IOU RESULTS ON UCF-CRIME DATASET\nThe UCF-Crime dataset was initially designed for weakly supervised anomaly detection tasks. Our proposed AADC-Net method addresses this challenge and surpasses all unsupervised and supervised methods on UCF-Crime. Specifically, AADC-Net achieves an improvement of 24.39% in AUC compared to the unsupervised baselines (GODS and FSCN). While AADC-Net demonstrates slight improvements across all evaluation metrics when compared to existing VAD methods on both datasets, it surpasses them in achieving new SOTA performance. AADC-Net reaches 83.8% AP and 83.2% AUC on the XD-Violence and UCF-Crime datasets, respectively. Notably, AADC-Net outperforms the best competing methods, CLIP-TSA and DMU, by an increase of 2.1% and 1.7% in AP on XD-Violence. Similarly, it demonstrates better performance in AUC, achieving increases of 0.8% and 1.7% over CLIP-TSA and DMU on XD-Violence and UCF-Crime, respectively.\nThese results highlight the effectiveness of AADC-Net in frame-level anomaly detection and classification tasks on the UCF-Crime and XD-Violence datasets.\nTables III and IV explore AADC-Net\u0026rsquo;s performance in anomaly classification tasks. Anomaly classification is more challenging compared to anomaly detection. It requires not only the identification of anomalies but also their accurate categorization while maintaining continuity in the detected segments. This requirement for both detection and classification introduces additional complexity.\nDespite this challenge, AADC-Net surpasses previous SOTA works on both the XD-Violence and UCF-Crime datasets. 
AADC-Net achieves an improvement of 23.8% and 113.6% in AP on XD-Violence compared to AVVD and Sultani et al., respectively. This trend is consistent on the UCF-Crime dataset, where AADC-Net outperforms AVVD and Sultani et al. by 11.4% and 112.5% in AP, respectively.\nTABLE V ABLATION STUDIES WITH DIFFERENT DESIGNED MODULES ON UCF-CRIME FOR ABNORMAL DETECTION\nTABLE VI ABLATION STUDIES WITH DIFFERENT DESIGNED MODULES ON XD-VIOLENCE FOR ABNORMAL DETECTION\nTABLE VII ABLATION STUDIES WITH DIFFERENT DESIGNED MODULES ON AN-WORKOUT FOR ABNORMAL DETECTION\nThese results demonstrate AADC-Net\u0026rsquo;s effectiveness in handling the complexities of anomaly classification tasks. AADC-Net achieves superior performance in both anomaly detection and classification tasks, surpassing existing methods on the XD-Violence and UCF-Crime benchmark datasets.\nB. Ablation Studies # Effectiveness of DETR Features: This section investigates the effectiveness of DETR features in improving anomaly detection performance. As detailed in Tables V–VII, DETR features consistently enhance detection across all datasets, regardless of whether the multi-scale temporal network (MSTN) module is included in the model. Compared to models without DETR features, we observe an increase of 2.2% in AUC for the UCF-Crime dataset, 2.3% in AP for the XD-Violence dataset, and 3% in AUC and 2.3% in AP for the AN-Workout dataset. Furthermore, incorporating DETR features within the visual CLIP features used by AADC-Net leads to a consistent improvement in performance. This integration strengthens the AADC-Net model\u0026rsquo;s overall anomaly detection capabilities.\nEffectiveness of Multiscale Temporal Network: As discussed earlier, the MSTN module is designed to capture temporal relationships within video data, thereby enhancing class-agnostic anomaly detection capabilities. 
To evaluate its effectiveness, we conducted experiments and presented ablation studies in Tables V–VII. The results demonstrate that incorporating the MSTN module improves performance across various datasets (UCF-Crime, XD-Violence, and AN-Workout) and evaluation metrics (AUC and AP). Without the temporal modeling offered by MSTN, the baseline model achieves only 80.4% AP on the XD-Violence dataset and 86.2% AUC on the UCF-Crime dataset. Incorporating the MSTN module results in clear improvements: a 2.3% increase in AUC for UCF-Crime, a 4.2% increase in AP for XD-Violence, and a 3.79% increase in AUC and a 3.94% increase in AP for AN-Workout.\nFig. 3. Qualitative results of our AADC-Net model on the AN-Workout dataset. The blue curves represent the anomaly scores, while the red regions indicate detected abnormal temporal events.\nOverall, the combination of DETR features and the MSTN module boosts performance in terms of AP and AUC. This improvement can be attributed to two key factors. First, DETR features capture object-level details and relationships within a frame, while the MSTN module focuses on capturing temporal information across video sequences. Second, by incorporating both DETR features and the MSTN module, the model learns a more comprehensive feature representation for anomaly detection. This richer representation allows the model to draw finer distinctions between normal and abnormal events.\nAnalysis of Cross-Dataset Ability: To assess the zero-shot learning capabilities of AADC-Net, we conducted experiments using a cross-dataset setup. We trained the model on one dataset and evaluated its performance on another. Specifically, we employed UCF-Crime and XD-Violence datasets, which share some categories but originate from entirely different sources. The evaluation results in Table VIII reveal two key findings. 
First, the model achieves better performance when trained on all available data within the source dataset. Second, the AADC-Net model achieves competitive performance even when evaluated on a dataset it was not trained on. This demonstrates the model\u0026rsquo;s ability to learn patterns from one dataset and effectively apply them to categorize events from a different source.\nTABLE VIII CROSS-DATA RESULTS ON UCF-CRIME AND XD-VIOLENCE\nEffectiveness of Hyperparameters K and γ: We conducted experiments by varying the K value among 3, 5, and 7, evaluating its impact on performance metrics across the UCF-Crime, XD-Violence, and AN-Workout datasets. The results in Table IX indicate that the model achieves the best performance at K = 5, where the UCF-Crime AUC is 88.2%, XD-Violence AP is 83.8%, and the AN-Workout AUC and AP reach 79.4% and 81.7%, respectively. Increasing K beyond 7 leads to marginal performance drops, likely due to the inclusion of less relevant frames, which introduce noise. Conversely, selecting a lower K value may exclude important anomaly frames, leading to reduced alignment accuracy and potentially missing key discriminative features.\nWe tested γ values of 0.5, 1.0, and 2.0 to observe their effects on video-level predictions. The optimal performance was achieved at γ = 1.0, where all datasets reached their peak performance metrics. Lower values (γ = 0.5) resulted in smoother probability distributions, reducing prediction confidence and promoting more uniform probabilities across classes. This smoothness could encourage more generalization but reduces the ability to distinguish between classes. Higher values (γ = 2.0) sharpened the distribution, increasing confidence for top predictions but potentially causing numerical instability or overfitting by making the model overly confident in its predictions.\nFor this work, optimal results were observed with K = 5 and γ = 1.0. This configuration ensures a balanced trade-off among prediction confidence, computational efficiency, and model stability, providing effective performance for practical deployment.\nTABLE IX ANALYSIS OF HYPERPARAMETERS K AND γ ACROSS DATASETS\nC. Qualitative Results # Fig. 3 presents a qualitative visualization of the AADC-Net model\u0026rsquo;s performance on the AN-Workout dataset. The blue curves represent the anomaly scores, with red regions indicating detected abnormal temporal events. AADC-Net effectively identifies and classifies anomalies with high confidence while generating minimal false positives for normal events. Furthermore, the distinction between normal and abnormal regions is relatively clear.\nFig. 4. Example of AADC-Net with video description. The orange boxes are the questions from humans. The blue boxes are the answers from the Video-LLaMA and our AADC-Net.\nFig. 4 showcases an example of AADC-Net integrated with LLaMA for identifying normal and abnormal events alongside textual descriptions. This integration enhances AADC-Net\u0026rsquo;s capabilities, enabling a richer understanding of video content beyond anomaly detection and classification. AADC-Net leverages the pretrained LLaMA model to analyze video content in conjunction with feature extraction from the anomaly detection and classification branches. This combined approach facilitates a better comprehension of the video. As illustrated in Fig. 4, AADC-Net-LLaMA effectively detects anomalies (e.g., \u0026ldquo;drinking water\u0026rdquo;) and precisely locates their occurrences (e.g., \u0026ldquo;from 00:15 to 00:40\u0026rdquo;). Additionally, it provides detailed descriptions with durations. The integration of LLaMA enables AADC-Net to generate informative video descriptions, complementing its core functionalities of anomaly classification and workout activity categorization. 
This capability supports various applications, including video summarization, activity understanding, and interactive human–computer dialogs.\nFig. 5. Confusion matrices of normal/abnormal categorization.\nFig. 5 presents confusion matrices for normal/abnormal event classification on the AN-Workout dataset. While AADC-Net\u0026rsquo;s overall performance is promising, as demonstrated in Section IV, these confusion matrices highlight some limitations in identifying certain anomaly categories. This observation emphasizes the inherent challenges associated with one-class versus one-class anomaly detection (OVVAD) approaches. OVVAD models typically learn a representation of normal data and classify any significant deviation from this norm as an anomaly. While this strategy can be effective for general anomaly detection, it may struggle with accurately classifying specific types of anomalies, especially when normal and abnormal events share some characteristics. Further investigation into these limitations and potential improvements for fine-grained anomaly classification within AADC-Net is an ongoing focus of our research efforts.\nD. Model Deployment # We have deployed our AADC-Net in an automatic surveillance system for the smart gym application. The deployment framework is illustrated in Fig. 6. In this setup, a camera serves as the video source, streaming data through WebRTC clients to a Python-based WebRTC server that manages peer-to-peer connections. The video streams are processed by our AADC-Net model to detect anomalies. Detected anomalies are logged with timestamps and detailed descriptions generated by our model. The entire system is integrated with a user-friendly Streamlit frontend, which provides a web-based interface for real-time monitoring and interaction, ensuring efficient and accurate anomaly detection for gym administrators.\nFig. 6. 
Model deployment in an automatic surveillance system for smart gym applications.\nTo assess the practical feasibility of AADC-Net, we evaluated its inference speed, memory consumption, and computational resource requirements on both GPU and CPU setups. On an NVIDIA RTX 3090 (24-GB VRAM, CUDA 11.8), AADC-Net achieves 35 frames per second (frames/s) in batch processing, making it suitable for real-time surveillance at 30 frames/s. Processing takes approximately 28.6 ms per frame and uses around 6 GB of GPU memory, ensuring deployability on consumer-grade GPUs. On a CPU setup (Intel Core i9-12900K, 16 cores, 24 threads, 32-MB L3 cache), the model achieves 8 frames/s, which remains practical for offline batch analysis. Processing takes approximately 125 ms per frame and uses around 14 GB of RAM due to the absence of GPU acceleration. These analyses demonstrate that AADC-Net is highly feasible for real-time anomaly detection on GPU-powered systems and remains practical for offline analysis on high-performance CPUs, making it adaptable for surveillance and monitoring applications.\nV. CONCLUSION # One of the fundamental challenges in VAD is the scarcity of labeled data, as anomalies are rare and labor-intensive to annotate in real-world scenarios. VLMs help mitigate this issue by leveraging contrastive learning objectives, such as those used in CLIP. AADC-Net utilizes these objectives to discover meaningful features without explicit supervision, reducing its reliance on labeled training data. Additionally, in real-world detection tasks, the number of normal samples far exceeds the number of abnormal ones, leading to class imbalance issues that can bias supervised learning approaches. Pretrained VLMs act as a regularization mechanism, preventing the model from overfitting to the majority class (normal events). 
By aligning video frames with pretrained textual embeddings, the model learns robust, class-agnostic representations of anomalies.\nThis research introduces AADC-Net, a novel multimodal deep neural framework that addresses the challenges of data scarcity, imbalance, and computational intensity inherent in video anomaly detection (VAD). By integrating pretrained VLMs and LLMs, AADC-Net leverages their rich, pretrained knowledge to extract meaningful feature representations, reducing reliance on extensively labeled VAD data. This integration of VLMs and LLMs enhances the model\u0026rsquo;s ability to discriminate between normal and anomalous events, improving both anomaly detection and categorization, especially in imbalanced datasets. Furthermore, AADC-Net incorporates the pretrained object detection model DETR to efficiently extract visual features without requiring manual bounding box annotations, making it robust in data-scarce scenarios. To refine anomaly detection, a Top-K selection mechanism identifies the most relevant video frames, while the MIL-Align mechanism enables multiclass anomaly categorization. Experimental results on public benchmarks (UCF-Crime and XD-Violence) and our AN-Workout dataset demonstrate AADC-Net\u0026rsquo;s superior performance compared to SOTA methods. The model generates frame-level anomaly confidences, categories, localization, and video descriptions with minimal supervision.\nOne challenge is that the AADC-Net model is trained on datasets captured in controlled environments with fixed camera alignments, limiting its applicability in real-world scenarios with varying camera placements and angles. To address this, future work will focus on training and evaluating AADC-Net on datasets that include diverse camera angles, lighting conditions, and occlusions. 
We will also investigate domain adaptation techniques and incorporate spatial attention mechanisms to enhance the model\u0026rsquo;s robustness and real-time performance in real-world settings. Additionally, we will evaluate its performance using metrics like recall at low intersection over union (IoU) thresholds and detection consistency across multiple views. In the long-term vision, the integration of vision-based methods into automatic anomaly detection (AAD) has the potential to develop intelligent monitoring systems by increasing efficiency and reducing the burden on human operators. In summary, AADC-Net provides a robust solution for overcoming the current challenges in VAD, showcasing superior performance with promising applications in smart gym and healthcare monitoring in the future.\nREFERENCES # [1] W. Sultani, C. Chen, and M. Shah, \u0026ldquo;Real-world anomaly detection in surveillance videos,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6479–6488.\n[2] M. Zhang, T. Li, Y. Yu, Y. Li, P. Hui, and Y. Zheng, \u0026ldquo;Urban anomaly analytics: Description, detection, and prediction,\u0026rdquo; IEEE Trans. Big Data , vol. 8, no. 3, pp. 809–826, Jun. 2022.\n[3] R. Nayak, U. C. Pati, and S. K. Das, \u0026ldquo;A comprehensive review on deep learning-based methods for video anomaly detection,\u0026rdquo; Image Vis. Comput., vol. 106, Jan. 2021, Art. no. 104078.\n[4] U. A. Usmani, A. Happonen, and J. Watada, \u0026ldquo;A review of unsupervised machine learning frameworks for anomaly detection in industrial applications,\u0026rdquo; in Proc. Sci. Inf. Conf. Cham, Switzerland: Springer, Jan. 2022, pp. 158–189.\n[5] X. Li, J. Jing, J. Bao, P. Lu, Y. Xie, and Y. An, \u0026ldquo;OTB-AAE: Semisupervised anomaly detection on industrial images based on adversarial autoencoder with output-turn-back structure,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 72, pp. 1–14, 2023.\n[6] Y. Cui, Z. Liu, and S. 
Lian, \u0026ldquo;A survey on unsupervised anomaly detection algorithms for industrial images,\u0026rdquo; IEEE Access, vol. 11, pp. 55297–55315, 2023.\n[7] A. Yang, X. Xu, Y. Wu, and H. Liu, \u0026ldquo;Reverse distillation for continuous anomaly detection,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 73, pp. 1–13, 2024.\n[8] H. Yao and X. Hu, \u0026ldquo;A survey of video violence detection,\u0026rdquo; Cyber-Phys. Syst., vol. 9, no. 1, pp. 1–24, Jan. 2023.\n[9] T. Saba, \u0026ldquo;Real time anomalies detection in crowd using convolutional long short-term memory network,\u0026rdquo; J. Inf. Sci., vol. 49, no. 5, pp. 1145–1152, Oct. 2023.\n[10] K. Rezaee, S. M. Rezakhani, M. R. Khosravi, and M. K. Moghimi, \u0026ldquo;A survey on deep learning-based real-time crowd anomaly detection for secure distributed video surveillance,\u0026rdquo; Pers. Ubiquitous Comput. , vol. 28, no. 1, pp. 135–151, Feb. 2024.\n[11] M. Kavitha, P. V. V. S. Srinivas, P. S. L. Kalyampudi, and S. Srinivasulu, \u0026ldquo;Machine learning techniques for anomaly detection in smart healthcare,\u0026rdquo; in Proc. 3rd Int. Conf. Inventive Res. Comput. Appl. (ICIRCA) , Sep. 2021, pp. 1350–1356.\n[12] Y. Yang, Y. Xian, Z. Fu, and S. M. Naqvi, \u0026ldquo;Video anomaly detection for surveillance based on effective frame area,\u0026rdquo; in Proc. IEEE 24th Int. Conf. Inf. Fusion (FUSION), Nov. 2021, pp. 1–5.\n[13] H. Yao et al., \u0026ldquo;Scalable industrial visual anomaly detection with partial semantics aggregation vision transformer,\u0026rdquo; IEEE Trans. Instrum. Meas. , vol. 73, pp. 1–17, 2024.\n[14] K. Xiao, J. Cao, Z. Zeng, and W.-K. Ling, \u0026ldquo;Graph-based active learning with uncertainty and representativeness for industrial anomaly detection,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 72, pp. 1–14, 2023.\n[15] J. Zhu, P. Yan, J. Jiang, Y. Cui, and X. Xu, \u0026ldquo;Asymmetric teacher–student feature pyramid matching for industrial anomaly detection,\u0026rdquo; IEEE Trans. 
Instrum. Meas., vol. 73, pp. 1–13, 2024.\n[16] M. Carratù et al., \u0026ldquo;A novel methodology for unsupervised anomaly detection in industrial electrical systems,\u0026rdquo; IEEE Trans. Instrum. Meas. , vol. 72, pp. 1–12, 2023.\n[17] M. S. Sapwan, Z. Ibrahim, Z. Mabni, and N. L. Adam, \u0026ldquo;Detection and classification of weightlifting form anomalies using deep learning,\u0026rdquo; J. Positive School Psychol., vol. 6, no. 3, pp. 8530–8537, 2022.\n[18] Y. He, H. Yang, and Z. Yin, \u0026ldquo;Adaptive context-aware distillation for industrial image anomaly detection,\u0026rdquo; IEEE Trans. Instrum. Meas. , vol. 73, pp. 1–15, 2024.\n[19] R. Sharma and A. Sungheetha, \u0026ldquo;An efficient dimension reduction based fusion of CNN and SVM model for detection of abnormal incident in video surveillance,\u0026rdquo; J. Soft Comput. Paradigm, vol. 3, no. 2, pp. 55–69, May 2021.\n[20] S. Fadl, Q. Han, and Q. Li, \u0026ldquo;CNN spatiotemporal features and fusion for surveillance video forgery detection,\u0026rdquo; Signal Process., Image Commun. , vol. 90, Jan. 2021, Art. no. 116066.\n[21] L. Alzubaidi et al., \u0026ldquo;Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions,\u0026rdquo; J. Big Data, vol. 8, no. 1, pp. 1–74, Mar. 2021.\n[22] Y. Liang, S. Li, C. Yan, M. Li, and C. Jiang, \u0026ldquo;Explaining the blackbox model: A survey of local interpretation methods for deep neural networks,\u0026rdquo; Neurocomputing, vol. 419, pp. 168–182, Jan. 2021.\n[23] P. Wu, W. Wang, F. Chang, C. Liu, and B. Wang, \u0026ldquo;DSS-Net: Dynamic self-supervised network for video anomaly detection,\u0026rdquo; IEEE Trans. Multimedia, vol. 26, pp. 1–13, 2023.\n[24] M. Lovanshi and V. Tiwari, \u0026ldquo;Human skeleton pose and spatio-temporal feature-based activity recognition using ST-GCN,\u0026rdquo; Multimedia Tools Appl., vol. 83, no. 5, pp. 12705–12730, Jun. 2023.\n[25] M. George, B. R. Jose, and J. 
Mathew, \u0026ldquo;Abnormal activity detection using shear transformed spatio-temporal regions at the surveillance network edge,\u0026rdquo; Multimedia Tools Appl., vol. 79, nos. 37–38, pp. 27511–27532, Oct. 2020.\n[26] J. Wang, Y. Zhao, K. Zhang, Q. Wang, and X. Li, \u0026ldquo;Spatio-temporal online matrix factorization for multi-scale moving objects detection,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 2, pp. 743–757, Feb. 2022.\n[27] M. Mathieu, C. Couprie, and Y. LeCun, \u0026ldquo;Deep multi-scale video prediction beyond mean square error,\u0026rdquo; 2015, arXiv:1511.05440 .\n[28] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, \u0026ldquo;Autoencoding beyond pixels using a learned similarity metric,\u0026rdquo; in Proc. 33rd Int. Conf. Mach. Learn., vol. 48, Feb. 2016, pp. 1558–1566.\n[29] N. C. Tay, T. Connie, T. S. Ong, A. B. J. Teoh, and P. S. Teh, \u0026ldquo;A review of abnormal behavior detection in activities of daily living,\u0026rdquo; IEEE Access, vol. 11, pp. 5069–5088, 2023.\n[30] B. Ramachandra, M. J. Jones, and R. R. Vatsavai, \u0026ldquo;A survey of singlescene video anomaly detection,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell. , vol. 44, no. 5, pp. 2293–2312, May 2022.\n[31] H. Wang, X. Jiang, H. Ren, Y. Hu, and S. Bai, \u0026ldquo;SwiftNet: Real-time video object segmentation,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 1296–1305.\n[32] H. Lv and Q. Sun, \u0026ldquo;Video anomaly detection and explanation via large language models,\u0026rdquo; 2024, arXiv:2401.05702 .\n[33] Z. Yang, J. Liu, and P. Wu, \u0026ldquo;Text prompt with normality guidance for weakly supervised video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 18899–18908.\n[34] H.-T. Duong, V.-T. Le, and V. T. 
Hoang, \u0026ldquo;Deep learning-based anomaly detection in video surveillance: A survey,\u0026rdquo; Sensors, vol. 23, no. 11, p. 5024, May 2023.\n[35] R. Raja, P. C. Sharma, M. R. Mahmood, and D. K. Saini, \u0026ldquo;Analysis of anomaly detection in surveillance video: Recent trends and future vision,\u0026rdquo; Multimedia Tools Appl., vol. 82, no. 8, pp. 12635–12651, Mar. 2023.\n[36] M. Baradaran and R. Bergevin, \u0026ldquo;A critical study on the recent deep learning based semi-supervised video anomaly detection methods,\u0026rdquo; Multimedia Tools Appl., vol. 83, no. 9, pp. 27761–27807, Aug. 2023.\n[37] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, \u0026ldquo;Unbiased multiple instance learning for weakly supervised video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , Jun. 2023, pp. 8022–8031.\n[38] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, \u0026ldquo;Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1237–1246.\n[39] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, \u0026ldquo;Weakly-supervised video anomaly detection with robust temporal feature magnitude learning,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 4975–4986.\n[40] C. Zhang et al., \u0026ldquo;Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 16271–16280.\n[41] S. Li, F. Liu, and L. Jiao, \u0026ldquo;Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell., vol. 36, no. 2, 2022, pp. 1395–1403.\n[42] C. 
Huang et al., \u0026ldquo;Weakly supervised video anomaly detection via self-guided temporal discriminative transformer,\u0026rdquo; IEEE Trans. Cybern. , vol. 54, no. 5, pp. 3197–3210, May 2022.\n[43] P. Wu, X. Liu, and J. Liu, \u0026ldquo;Weakly supervised audio-visual violence detection,\u0026rdquo; IEEE Trans. Multimedia, vol. 25, pp. 1674–1685, 2022.\n[44] C. Ding, G. Pang, and C. Shen, \u0026ldquo;Catching both gray and black swans: Open-set supervised anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 7388–7398.\n[45] Y. Zhao, I. Misra, P. Krähenbühl, and R. Girdhar, \u0026ldquo;Learning video representations from large language models,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 6586–6597.\n[46] M. Bain, A. Nagrani, G. Varol, and A. Zisserman, \u0026ldquo;Frozen in time: A joint video and image encoder for end-to-end retrieval,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 1728–1738.\n[47] K. Li et al., \u0026ldquo;VideoChat: Chat-centric video understanding,\u0026rdquo; 2023, arXiv:2305.06355 .\n[48] Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai, \u0026ldquo;PandaGPT: One model to instruction-follow them all,\u0026rdquo; 2023, arXiv:2305.16355 .\n[49] H. Zhang, X. Li, and L. Bing, \u0026ldquo;Video-LLaMA: An instructiontuned audio-visual language model for video understanding,\u0026rdquo; 2023, arXiv:2306.02858 .\n[50] M. Ning et al., \u0026ldquo;Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models,\u0026rdquo; 2023, arXiv:2311.16103 .\n[51] Z. Wang et al., \u0026ldquo;Language models with image descriptors are strong few-shot video-language learners,\u0026rdquo; in Proc. Adv. Neural Inf. Process. Syst., vol. 35, Jan. 2022, pp. 8483–8497.\n[52] Y. Weng, M. Han, H. He, X. Chang, and B. 
Zhuang, \u0026ldquo;LongVLM: Efficient long video understanding via large language models,\u0026rdquo; 2024, arXiv:2404.03384 .\n[53] M. Maaz, H. Rasheed, S. Khan, and F. Khan, \u0026ldquo;VideoGPT+: Integrating image and video encoders for enhanced video understanding,\u0026rdquo; 2024, arXiv:2406.09418 .\n[54] P. Jin, R. Takanobu, W. Zhang, X. Cao, and L. Yuan, \u0026ldquo;Chat-UniVi: Unified visual representation empowers large language models with image and video understanding,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 13700–13710.\n[55] S. Uppal et al., \u0026ldquo;Multimodal research in vision and language: A review of current and emerging trends,\u0026rdquo; Inf. Fusion, vol. 77, pp. 149–171, Jan. 2022.\n[56] A. Radford et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in Proc. Int. Conf. Mach. Learn., vol. 139, 2021, pp. 8748–8763.\n[57] H. Luo et al., \u0026ldquo;CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning,\u0026rdquo; Neurocomputing, vol. 508, pp. 293–304, Oct. 2022.\n[58] J. Wang, H. Wang, J. Deng, W. Wu, and D. Zhang, \u0026ldquo;EfficientCLIP: Efficient cross-modal pre-training by ensemble confident learning and language modeling,\u0026rdquo; 2021, arXiv:2109.04699 .\n[59] Z. Lin et al., \u0026ldquo;Frozen clip models are efficient video learners,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2022, pp. 388–404.\n[60] B. Ni et al., \u0026ldquo;Expanding language-image pretrained models for general video recognition,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. (ECCV). Cham, Switzerland: Springer, 2022, pp. 1–18.\n[61] H. K. Joo, K. Vo, K. Yamazaki, and N. Le, \u0026ldquo;CLIP-TSA: Clipassisted temporal self-attention for weakly-supervised video anomaly detection,\u0026rdquo; in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2023, pp. 3230–3234.\n[62] S. Nag, X. Zhu, Y.-Z. 
Song, and T. Xiang, \u0026ldquo;Zero-shot temporal action detection via vision-language prompting,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, Jan. 2022, pp. 681–697.\n[63] C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, \u0026ldquo;Prompting visual-language models for efficient video understanding,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2022, pp. 105–124.\n[64] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, \u0026ldquo;End-to-end object detection with transformers,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 213–229.\n[65] H. Touvron et al., \u0026ldquo;LLaMA: Open and efficient foundation language models,\u0026rdquo; 2023, arXiv:2302.13971 .\n[66] J.-C. Feng, F.-T. Hong, and W.-S. Zheng, \u0026ldquo;MIST: Multiple instance self-training framework for video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 14009–14018.\n[67] H. Sapkota and Q. Yu, \u0026ldquo;Bayesian nonparametric submodular video partition for robust anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2022, pp. 3212–3221.\n[68] J. Wang and A. Cherian, \u0026ldquo;GODS: Generalized one-class discriminative subspaces for anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8200–8210.\n[69] P. Wu, J. Liu, M. Li, Y. Sun, and F. Shen, \u0026ldquo;Fast sparse coding networks for anomaly detection in videos,\u0026rdquo; Pattern Recognit., vol. 107, Jun. 2020, Art. no. 107515.\n[70] H. Zhou, J. Yu, and W. Yang, \u0026ldquo;Dual memory units with uncertainty regulation for weakly supervised video anomaly detection,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell., vol. 37, Jun. 2023, pp. 3769–3777.\n[71] M. A. Hafeez, S. Javed, M. Madden, and I. 
Ullah, \u0026ldquo;Unsupervised end-to-end transformer based approach for video anomaly detection,\u0026rdquo; in Proc. 38th Int. Conf. Image Vis. Comput. New Zealand (IVCNZ), Nov. 2023, pp. 1–7.\n[72] C. Yan, S. Zhang, Y. Liu, G. Pang, and W. Wang, \u0026ldquo;Feature prediction diffusion model for video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2023, pp. 5527–5537.\n[73] A. Panariello, A. Porrello, S. Calderara, and R. Cucchiara, \u0026ldquo;Consistency-based self-supervised learning for temporal anomaly localization,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, Jan. 2022, pp. 338–349.\n[74] T. Yuan et al., \u0026ldquo;Towards surveillance video-and-language understanding: New dataset, baselines, and challenges,\u0026rdquo; 2023, arXiv:2309.13925.\n[75] P. Wu et al., \u0026ldquo;Not only look, but also listen: Learning multimodal violence detection under weak supervision,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. (ECCV), Glasgow, U.K. Cham, Switzerland: Springer, Aug. 2020, pp. 322–339.\n[76] P. Wu and J. Liu, \u0026ldquo;Learning causal temporal relation and feature discrimination for anomaly detection,\u0026rdquo; IEEE Trans. Image Process., vol. 30, pp. 3513–3527, 2021.\n[77] G. Pang, C. Yan, C. Shen, A. Van Den Hengel, and X. Bai, \u0026ldquo;Self-trained deep ordinal regression for end-to-end video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12173–12182.\nDuc Tri Phan (Member, IEEE) received the B.S. degree from the Mechanics Department, Bach Khoa University, Ho Chi Minh City, Vietnam, in 2017, and the M.S. and Ph.D. degrees from the Biomedical Engineering Department, Pukyong National University, Busan, South Korea, in 2020 and 2023, respectively.\nHe is currently a Post-Doctoral Fellow with Nanyang Technological University, Singapore. 
His research interests include smart healthcare and artificial intelligence applications.\nVu Hoang Minh Doan received the B.S. degree from the Mechatronics Engineering Department, Bach Khoa University, Ho Chi Minh City, Vietnam, in 2016, and the M.S. and Ph.D. degrees from the Biomedical Engineering Department, Pukyong National University, Busan, South Korea, in 2019 and 2022, respectively.\nHe is currently a Research Professor with the Smart Gym-Based Translational Research Center for Active Senior\u0026rsquo;s Healthcare, Pukyong National University. His research interests include artificial intelligence applications in biomedical engineering and automation control systems.\nJaeyeop Choi received the B.S. degree in biomedical engineering and the M.S. and Ph.D. degrees from Pukyong National University, Busan, South Korea, in 2017, 2019, and 2022, respectively.\nHe is currently a Research Professor with the Smart Gym-Based Translational Research Center for Active Senior\u0026rsquo;s Healthcare, Pukyong National University. His research interests include scanning acoustic microscopy and fabrication of high-frequency transducers.\nByeongil Lee received the master\u0026rsquo;s and Ph.D. degrees in computer engineering, with a specialization in medical image analysis, from Inje University, Gimhae, South Korea, in 1999 and 2004, respectively.\nIn 2005, he commenced his career at the Department of Nuclear Medicine, Chonnam National University Hospital, Gwangju, South Korea. He expanded his research horizons in 2011 by joining Korea Photonics Technology Institute, Gwangju, focusing on medical photonics.\nHis academic journey led him to Pukyong National University, Busan, South Korea, in 2021, where he currently holds a professorship with the Department of Smart Healthcare. His research interests are currently centered on medical photonics, molecular imaging, and the advancement of digital healthcare systems.\nJunghwan Oh (Senior Member, IEEE) received the B.S. 
degree in mechanical engineering from Pukyong National University, Busan, South Korea, in 1992, and the M.S. and Ph.D. degrees in biomedical engineering from The University of Texas at Austin, Austin, TX, USA, in 2003 and 2007, respectively.\nHe is currently a Professor with Industry 4.0 Convergence Bionics Engineering, Department of Biomedical Engineering, Pukyong National University. His research interests include the development of ultrasonic-based diagnostic imaging modalities for biomedical engineering applications.\n","date":"31 March 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/aadc-net_a_multimodal_deep_learning_framework_for_automatic_anomaly_detection_in_real-time_surveillance/","section":"Papers","summary":"Introduces AADC-Net, a multimodal deep neural network leveraging pretrained vision-language models, large language models, and object detection (DETR) for real-time anomaly detection and categorization in surveillance videos. The framework addresses data scarcity, imbalance, and computational challenges, demonstrating state-of-the-art performance on multiple datasets, with practical deployment in smart gyms and healthcare settings.","title":"AADC-Net: A Multimodal Deep Learning Framework for Automatic Anomaly Detection in Real-Time Surveillance","type":"other"},{"content":"","date":"31 March 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/byeongil-lee/","section":"Authors","summary":"","title":"Byeongil Lee","type":"authors"},{"content":"","date":"31 March 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/duc-tri-phan/","section":"Authors","summary":"","title":"Duc Tri Phan","type":"authors"},{"content":"","date":"31 March 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jaeyeop-choi/","section":"Authors","summary":"","title":"Jaeyeop Choi","type":"authors"},{"content":"","date":"31 March 
2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/junghwan-oh/","section":"Authors","summary":"","title":"Junghwan Oh","type":"authors"},{"content":"","date":"31 March 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/vu-hoang-minh-doan/","section":"Authors","summary":"","title":"Vu Hoang Minh Doan","type":"authors"},{"content":"","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chen-sun/","section":"Authors","summary":"","title":"Chen Sun","type":"authors"},{"content":"","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/categories/instruction-tuning/","section":"Categories","summary":"","title":"Instruction Tuning","type":"categories"},{"content":"","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/liang-gao/","section":"Authors","summary":"","title":"Liang Gao","type":"authors"},{"content":" Personalizing Vision-Language Models With Hybrid Prompts for Zero-Shot Anomaly Detection # Yunkang Cao , Graduate Student Member, IEEE, Xiaohao Xu, Yuqi Cheng , Student Member, IEEE , Chen Sun, Zongwei Du , Liang Gao , Senior Member, IEEE, and Weiming Shen , Fellow, IEEE\nAbstract—Zero-shot anomaly detection (ZSAD) aims to develop a foundational model capable of detecting anomalies across arbitrary categories without relying on reference images. However, since \u0026ldquo;abnormality\u0026rdquo; is inherently defined in relation to \u0026ldquo;normality\u0026rdquo; within specific categories, detecting anomalies without reference images describing the corresponding normal context remains a significant challenge. As an alternative to reference images, this study explores the use of widely available product standards to characterize normal contexts and potential abnormal states. 
Specifically, this study introduces AnomalyVLM, which leverages generalized pretrained vision-language models (VLMs) to interpret these standards and detect anomalies. Given the current limitations of VLMs in comprehending complex textual information, AnomalyVLM generates hybrid prompts—comprising prompts for abnormal regions, symbolic rules, and region numbers—from the standards to facilitate more effective understanding. These hybrid prompts are incorporated into various stages of the anomaly detection process within the selected VLMs, including an anomaly region generator and an anomaly region refiner. By utilizing hybrid prompts, VLMs are personalized as anomaly detectors for specific categories, offering users flexibility and control in detecting anomalies across novel categories without the need for training data. Experimental results on four public industrial anomaly detection datasets, as well as a practical automotive part inspection task, highlight the superior performance and enhanced generalization capability of AnomalyVLM, especially in texture categories. An online demo of AnomalyVLM is available at https://github.com/caoyunkang/Segment-Any-Anomaly.\nIndex Terms—Anomaly detection, vision-language model (VLM), zero-shot learning.\nReceived 3 December 2024; revised 19 January 2025; accepted 22 January 2025. Date of publication 13 February 2025; date of current version 18 March 2025. This work was supported in part by the Ministry of Industry and Information Technology of the People\u0026rsquo;s Republic of China under Grant 2023ZY01089; in part by the China Scholarship Council (CSC) under Grant 202306160078; and in part by the HPC Platform of Huazhong University of Science and Technology where the computation is completed. This article was recommended by Associate Editor T. Xiang. 
(Corresponding author: Weiming Shen.)\nYunkang Cao, Yuqi Cheng, Zongwei Du, Liang Gao, and Weiming Shen are with the State Key Laboratory of Intelligent Manufacturing Equipment and Technology, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: cyk_hust@hust.edu.cn; yuqicheng@hust.edu.cn; duzongwei@hust.edu.cn; gaoliang@hust.edu.cn; wshen@ieee.org).\nXiaohao Xu is with the Michigan Robotics, University of Michigan at Ann Arbor, Ann Arbor, MI 48109 USA (e-mail: xiaohaox@umich.edu).\nChen Sun is with the Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON M5S 3G8, Canada (e-mail: chrn.sun@mail.utoronto.ca).\nColor versions of one or more figures in this article are available at https://doi.org/10.1109/TCYB.2025.3536165.\nDigital Object Identifier 10.1109/TCYB.2025.3536165\n2168-2267\nI. INTRODUCTION # ANOMALY detection for images plays a crucial role in industrial applications, including tasks such as defective product identification [1], [2], [3] and industrial process monitoring [4], [5]. Most existing methods follow a \u0026ldquo;closed-set\u0026rdquo; paradigm, relying on training data from specific product categories [6], [7], [8]. However, detecting anomalies in previously unseen categories is equally important, as collecting training samples for every category can be impractical [2], [9]. For example, during early production stages, samples may not be available, yet accurate anomaly detection is still essential [2]. Furthermore, inspecting millions of product categories [9] makes data collection both costly and often infeasible. This article, therefore, investigates zero-shot anomaly detection (ZSAD), which aims to identify defects in unseen categories without relying on prior training data.\nZSAD faces significant challenges due to the context-dependent nature of anomalies. For example, white prints may be acceptable on capsules but indicate defects on hazelnuts (Fig. 1). 
This dependence on specific normal contexts complicates the development of generic ZSAD models, underscoring the importance of incorporating prior knowledge of product standards. Such knowledge often includes preproduction guidelines that describe normal conditions and potential defects. For instance, CAD models define normal product conditions, while experts predict likely defects based on production processes (e.g., painting may cause color inconsistencies). Notably, these standards are not derived from data but from expert insights established before production.\nTypically presented in textual formats, these standards offer detailed descriptions of both normal and abnormal conditions. However, conventional anomaly detection methods [6], [7] primarily rely on visual models, limiting their ability to interpret textual information. To address this limitation, some approaches [9], [10] incorporate vision-language models (VLMs) [11], [12] to leverage prior knowledge for ZSAD. These VLMs, extensively pretrained on visual and textual data, exhibit strong generalization capabilities and multimodal understanding [13], [14]. Nevertheless, even state-of-the-art (SOTA) VLMs, such as ChatGPT [15], often struggle with complex, domain-specific standards, as evidenced by the suboptimal ZSAD performance of WinCLIP [9].\nTo enhance VLMs\u0026rsquo; understanding of prior knowledge, this study introduces AnomalyVLM, a framework that personalizes VLMs for improved ZSAD performance by integrating hybrid prompts derived from prior knowledge. Specifically, AnomalyVLM employs two pretrained VLMs: 1) Grounding DINO [13] and 2) the Segment Anything Model (SAM) [16] as the anomaly region generator (ARG) and anomaly region refiner (ARR), respectively. ARG can identify potential anomaly regions under the guidance of textual prompts, while ARR can refine these regions for precise detection. The synergy between ARG and ARR enables effective ZSAD based on textual anomaly descriptions. To optimize the use of prior knowledge, AnomalyVLM derives three distinct prompts. The first prompt includes textual descriptions of potential anomaly regions, guiding VLMs to identify anomalies comprehensively. However, nonspecialized VLMs may misinterpret prompts, leading to false alarms. To mitigate this, AnomalyVLM incorporates two additional prompts: 1) symbolic rules to filter unlikely candidates based on anomaly characteristics and 2) an estimated maximum anomaly count to select the most confident detections. Additionally, given that VLM confidence scores may not align with anomaly severity, AnomalyVLM refines these scores using visual saliency, leveraging the distinct visual features of abnormal regions [17].\nFig. 1. Motivation: \u0026ldquo;Abnormality\u0026rdquo; depends on the corresponding \u0026ldquo;normality\u0026rdquo; of given categories and varies across different categories.\n© 2025 IEEE. All rights reserved, including rights for text and data mining, and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission.\n
Through these hybrid prompts, AnomalyVLM integrates prior knowledge into the detection process, achieving superior ZSAD performance. Unlike traditional models with fixed posttraining functionalities [6], [7], AnomalyVLM is customizable and user-centric, enabling users to adapt the framework to diverse categories by adjusting prompts based on specific prior knowledge. Additionally, compared to existing generic ZSAD methods, the proposed approach more effectively exploits prior knowledge through the designed hybrid prompts.
Experiments on four industrial anomaly detection datasets and a real-world automotive part inspection scenario validate AnomalyVLM\u0026rsquo;s flexibility, generalization capabilities, and enhanced ZSAD performance. The contributions of this study are summarized as follows.\n1) This study addresses the dependence of anomalies on normal contexts in ZSAD by incorporating prior knowledge through VLMs. It highlights the importance of integrating preproduction standards to overcome the challenge of ZSAD without reference images.\n2) This study introduces AnomalyVLM, a novel framework that enhances ZSAD performance by deriving three distinct hybrid prompts from prior knowledge. These prompts include textual descriptions of potential anomaly regions, symbolic rules to filter unlikely candidates, and an estimated maximum anomaly count, which help improve detection accuracy and reduce false positives.\n3) AnomalyVLM demonstrates superior ZSAD performance, particularly for texture anomalies, and in certain categories can even outperform PatchCore [6], which relies on corresponding normal images for training, as detailed in Section IV-B.\nII. RELATED WORK # A. Anomaly Detection # Anomaly detection methods aim to accurately pinpoint irregular patterns that deviate from normal patterns in given scenarios/categories. Existing anomaly detection methods can be categorized based on the composition of their training data [1], [2] into semi-supervised [18], unsupervised [7], [19], [20], and few-shot methods [21], [22].\nSemi-supervised anomaly detection methods require both normal and abnormal samples from target categories for training [23], [24]. As abnormal samples are typically fewer than normal ones, these methods focus on modeling the normal data distribution, using abnormal samples to refine the decision boundary [25]. Techniques such as residual learning [18] and contrastive learning [26] have been explored to enhance performance.
Unsupervised anomaly detection methods, in contrast, rely solely on normal samples for training and have seen significant advancements in recent years. These methods heavily depend on the utilized embeddings, employing self-supervised learning [27], [28], [29], [30] or pretrained neural networks [6], [7] to derive descriptive embeddings for normal samples. Subsequently, approaches like reconstruction [31], [32], [33], [34], knowledge distillation [7], [8], [20], memory banks [6], and flow models [19], [35], [36] are utilized to model normal embedding distributions. During inference, the distances between the test sample embeddings and the modeled normal distributions serve as anomaly scores. While these methods [35], [36] achieve promising results, they still require large numbers of normal training samples.\nFew-shot anomaly detection methods address scenarios with limited normal samples for training. For instance, RegAD [21] improves the compactness of normal embeddings by spatially aligning samples from the same categories, enabling reasonable detection performance with fewer samples. Similarly, PCSNet [37] promotes feature compactness through contrastive learning. More recently, AnomalyGPT [38] achieves superior few-shot anomaly detection performance by prompting the pretrained CLIP [12].\nAlthough the paradigms [18], [35], [38] mentioned above have demonstrated promising performance in anomaly detection, they operate in a closed-set manner, limiting their applicability to categories present in the training sets. Consequently, they fail to detect anomalies in novel categories lacking reference samples. To address this limitation, some ZSAD methods [9], [39], [40] were proposed. Early ZSAD methods [39], [41] construct a knowledge graph and compute similarities between support and query samples for anomaly detection.
In contrast, MAEDAY [42] employs a pretrained masked autoencoder to reconstruct the normal appearances of test samples, using reconstruction errors between the test and recovered samples to identify anomalies. More recently, WinCLIP [9] leverages CLIP [12] to compute similarities between image patches and manually defined textual prompts describing normal and abnormal states. Higher similarities to abnormal states are interpreted as increased abnormality. However, since prompts related to \u0026ldquo;normality\u0026rdquo; and \u0026ldquo;abnormality\u0026rdquo; are rarely present in the pretrained data [43] of CLIP, the pretrained CLIP may struggle to effectively distinguish between normal and abnormal patches. To address this issue, several methods [44] , [45] have been proposed to augment the given textual prompts. Nonetheless, these approaches are limited to short descriptions and fail to leverage complex prior knowledge. In contrast, this study aims to better utilize available prior knowledge to personalize off-the-shelf VLMs for anomaly detection in unseen categories.\nB. Vision Language Models # Numerous VLMs have emerged in the past several years, distinguished by their extensive parameters and typically trained on massive datasets like Laion-400M [43], demonstrating promising generalization capacity. An early milestone, CLIP [12], is trained on large-scale image–text pairs using contrastive learning and can compute similarities between images and texts, showcasing admirable zero-shot classification capabilities. Subsequent works further extend CLIP into other downstream tasks like video segmentation [46] and anomaly detection [9]. More recently, VLMs have been equipped with stronger vision-language understanding capacities, thanks to advanced training strategies [47] and more fine-grained annotated data. 
For example, BLIP [14] proposes utilizing noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones, thereby simultaneously achieving image–text retrieval and image captioning tasks. GroundingDINO [13] achieves referring object detection and can detect arbitrary objects through textual prompts. Furthermore, SAM [16] is trained for open-set segmentation and can accept point, rectangle, and mask prompts. With prompts derived from prior knowledge, SAM can effectively segment objects of interest in given images and has inspired many follow-up works [48]. The availability of these off-the-shelf VLMs and their integration has made a substantial contribution to the advancement of downstream tasks like ZSAD [9] .\nC. Prompt Engineering # Despite the generalization capabilities of current VLMs, their effectiveness remains limited in contexts with substantial domain disparities between target and training data, particularly in industrial anomaly detection. The collaborative integration of prior knowledge and VLMs has emerged as a standard solution to enhance VLM performance in such scenarios. This prior knowledge is typically incorporated into VLMs through prompt engineering [49] .\nVLMs generally accept textual prompts [12] , [14] to adapt their functionality for image understanding. Users can thus leverage these VLMs for specific tasks by expressing prior knowledge in textual form. However, since prior knowledge cannot always be accurately conveyed through text alone, some VLMs are designed to accept more flexible prompts, such as point and mask prompts [16] used by SAM. Additionally, one of the most well-known VLMs, ChatGPT [15], accepts both image and text prompts. ChatGPT also supports multiround interactions, making it highly customizable for specific tasks.\nOverall, prompt engineering is gaining popularity due to its user-centric nature. 
Users are not required to build models from scratch but instead adjust prompts to suit specific functionalities. In the context of ZSAD, this study derives hybrid prompts from prior knowledge, providing context about normalities and abnormalities in given categories, and addressing the dependence of anomalies on normal contexts. This approach enables anomaly detection across arbitrary categories without requiring references.\nIII. METHOD # A. Problem Definition # In the context of ZSAD, a model should be capable of detecting anomalies in an image I ∈ ℝ^(h×w×3) from novel categories and generating the corresponding anomaly map A ∈ [0, 1]^(h×w×1). Given the dependence of anomalies on normal contexts, detecting anomalies without any reference to normal conditions is a challenging task. Therefore, this study leverages prior knowledge, such as predefined standards, as sources of normal contexts.\nB. Pipeline Overview # As shown in Fig. 2, this study employs two off-the-shelf VLMs for text-based anomaly region retrieval. Additionally, three prompts derived from prior knowledge are introduced to guide the process. These prompts enable the customization of the VLMs, allowing users to adapt them to specific categories. In this way, the approach enhances ZSAD performance without requiring additional training. The details of AnomalyVLM are provided below.\nFig. 2. Pipeline overview of AnomalyVLM. AnomalyVLM integrates three prompts. First, ARG identifies potential abnormal regions at the bounding-box level within the testing image, guided by prompt 1. Subsequently, ARR enhances the bounding-box-level predictions to pixel-level masks. Prompt 2 facilitates the filtering of abnormal regions that do not adhere to specific symbolic rules. Next, the scores of the remaining masks are refined based on visual saliency. Finally, guided by prompt 3, candidates with the highest K scores are selected and amalgamated into the final predictions.\n
C. Anomaly Region Generator # Certain VLMs [10], [13] have gained the capability to identify objects within images based on textual prompts T. These user-defined prompts can guide VLMs in retrieving regions of interest within a given image I. This study employs a recently introduced VLM named GroundingDINO [13] as ARG to generate anomaly regions using textual prompts, such as the generic prompt \u0026ldquo;anomaly.\u0026rdquo; In particular, GroundingDINO is characterized as a text-guided open-set object detection model, pretrained on extensive language-vision datasets [43], and equipped with robust open-world detection capabilities. Thus, ARG can retrieve abnormal regions with textual prompts. ARG comprises three submodules: a prompt encoder E_P^ARG, an image encoder E_I^ARG, and a decoder D_{P,I}^ARG. First, ARG encodes the given textual prompts and the testing image via E_P^ARG and E_I^ARG, respectively. Then the encoded features are delivered to D_{P,I}^ARG, which utilizes cross-attention for region retrieval. The detection process of ARG is formulated as\nR^B, S = D_{P,I}^ARG(E_P^ARG(T), E_I^ARG(I)) (1)\nwhere R^B denotes the resulting set of bounding boxes, and S denotes the corresponding confidence scores, which measure the similarities to the given textual prompts.\nD. Anomaly Region Refiner # Since ARG can only produce bounding-box-level predictions, this study further introduces ARR to refine the bounding-box-level anomaly region candidates R^B into a set of pixel-level masks, represented as R. Specifically, an open-world visual segmentation model called SAM [16] is employed as ARR. SAM is trained on an extensive image segmentation dataset [16], consisting of one billion fine-grained masks, which equips SAM with the capability to generate high-quality masks in open-set segmentation scenarios. Similar to ARG, ARR also comprises three submodules: a prompt encoder E_P^ARR, an image encoder E_I^ARR, and a decoder D_{P,I}^ARR, where E_P^ARR accepts bounding boxes as prompts.
In the region refinement process, ARR encodes the predicted bounding boxes and the testing image via E_P^ARR and E_I^ARR, respectively. Then the encoded features are delivered to D_{P,I}^ARR for mask prediction. The process is formulated as\nR = D_{P,I}^ARR(E_P^ARR(R^B), E_I^ARR(I)) (2)\nwhere R denotes pixel-level masks for candidates of abnormal regions. By combining ARG and ARR, users can input textual prompts to retrieve potential abnormal regions and obtain a set of pixel-level candidates R along with their associated confidence scores S. However, both ARR and ARG may encounter difficulties in interpreting complex textual prompts, which could limit the effective use of prior knowledge. To address this, this study derives three prompts from prior knowledge that can be more effectively integrated into the anomaly detection process.\nE. Prompt 1: Abnormal Regions # Existing ZSAD methods [9] typically employ generic textual prompts, such as \u0026ldquo;anomaly\u0026rdquo; or \u0026ldquo;defect,\u0026rdquo; to instruct VLMs to detect anomalies in arbitrary categories. However, these generic prompts cannot accurately describe the candidates that need to be queried, since the underlying meanings of \u0026ldquo;anomaly\u0026rdquo; may vary from category to category, i.e., the definition of abnormalities depends on corresponding normal contexts. Instead of generic prompts, the proposed AnomalyVLM allows users to input specific descriptions for potential abnormal regions within the testing category. For instance, users can enter \u0026ldquo;white prints, cracks, holes, cuts.\u0026rdquo; for the hazelnut category, thereby translating the task of retrieving anomalies into retrieving regions with clearer meanings. These prompts are more intuitive than generic prompts. This way, VLMs can effectively retrieve all potential anomalies within an image.
However, while these precise descriptions can enhance the detection rate of abnormal regions, they can also result in false alarms, as the utilized VLMs may inadequately comprehend the prompts. Two additional prompts are introduced to mitigate these false alarms.\nF. Prompt 2: Symbolic Rules # Prior knowledge can also provide more specific descriptions regarding abnormalities, such as their areas, positions, and colors, typically in the form of accurate numerical expressions. However, existing VLMs [13] exhibit limitations in their ability to query regions based on the aforementioned specific anomaly property descriptions, which may be crucial for retrieving more faithful candidates. Hence, this study opts to express these descriptions as symbolic rules rather than textual prompts. In particular, this study develops predefined functions to compute the properties of abnormal region candidates. Then, AnomalyVLM can filter out candidates that do not meet user-given thresholds. Denoting these symbolic rules as {Rule 1, . . . , Rule N}, only those candidates that meet all rules are retained. This study implements a symbolic rule concerning areas for a simple evaluation, primarily focusing on the relative ratio between abnormal candidates and the inspected object, such as \u0026ldquo;Anomalies are smaller than 5% (of the object area).\u0026rdquo; Abnormal region candidates that do not conform to user-defined thresholds will be filtered out.\nG. Prompt 3: Region Numbers # Although symbolic rules have significantly aided in the reduction of false alarms, there might still be an abundance of potential candidates. Drawing from prior knowledge, the quantity of anomaly regions within an examined object is constrained, with regions exhibiting higher anomaly scores being more probable genuine anomalies. Hence, this study introduces a prompt about the estimated maximum number of abnormal regions within given categories.
This way, the candidates with the K highest confidence scores based on the image content are retained as final predictions. However, the confidence scores produced by ARG and ARR only reflect the similarities between the selected regions and the given textual prompts and cannot faithfully reveal the anomaly degrees. This study introduces a visual saliency-based confidence refinement strategy to make the confidence scores more representative of anomaly degrees.\nVisual Saliency-Based Confidence Refinement: Visual saliency refers to the degree to which an object or region captures human observers\u0026rsquo; attention [50], [51]. Typically, abnormal regions differ from their neighbors and exhibit greater visual saliency than normal regions [17]. Based on this concept, this study proposes computing visual saliency by measuring the distances between a query region and other regions, and then using the saliency map to refine confidence scores. Specifically, this study computes a saliency map (V) for the input image by calculating the average distances between the pixel features (F) and their N most similar features\nHere, (i, j) represents the pixel location, P(F_ij) refers to the N most similar features of the corresponding pixel, and ⟨·, ·⟩ denotes cosine similarity. Pretrained convolutional neural networks (CNNs) are used to extract image features to ensure feature descriptiveness. The saliency map indicates how distinct a region is from other regions. Then, this study utilizes the exponential average saliency values within the corresponding region masks to refine individual confidence scores\nwhere R denotes individual masks for candidates (1 for valid and 0 for invalid), and S^r denotes the refined confidence scores that comprehensively consider both the confidence derived from the VLMs and the saliency of the region candidate.\nH. Anomaly Detection # The average values of the final retained candidates are then fused to detect anomalies.
Formally, the anomaly map A is computed as follows:\nBy incorporating two VLMs (ARG and ARR) along with three prompts derived from prior knowledge, AnomalyVLM computes anomaly maps for testing images from novel categories, effectively indicating the abnormality level of individual pixels. The user-centric design of AnomalyVLM offers flexibility and generality, enabling reliable ZSAD.\nIV. EXPERIMENTS # In this section, the performance of AnomalyVLM is evaluated on four widely used anomaly detection datasets. The impact of hybrid prompts is also assessed. Subsequently, the practical applicability of the proposed method is demonstrated using a real-world automotive plastic part inspection dataset. Finally, this study explores the advantages and limitations of AnomalyVLM and provides insights into potential avenues for future research.\nA. Experimental Setup # Datasets: This study leverages four anomaly detection datasets to assess the performance of ZSAD. In particular, this study selected MVTec AD [52], VisA [53], KSDD2 [23] , and MTD [54] considering their diverse product categories and comprehensive coverage of various anomaly types. All these datasets offer pixel-level annotations. It is noteworthy that MVTec AD [52] and VisA datasets [53] provide detailed descriptions of abnormal regions, which can be valuable resources for prior knowledge, particularly for describing potential abnormal regions within individual categories. Within these datasets, some categories are about specific objects, while others consist of texture images. This study categorizes MVTec AD into MVTec AD (Object) and MVTec AD (Texture), which contain ten and five categories, respectively. In total, the utilized datasets collectively comprise 4470 normal samples and 3092 abnormal samples for evaluations. 
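As a rough illustration of the post-processing described in Sections III-F to III-H, the handling of mask candidates (area rule, saliency-based score refinement, top-K selection, and fusion into an anomaly map) might be sketched in NumPy as follows. The function names, the use of one minus cosine similarity as the distance, the multiplicative form of the refinement, and the per-pixel averaging in the fusion step are all assumptions made for this sketch; they are not taken from the authors' implementation.

```python
import numpy as np


def saliency_map(features, n_neighbors):
    # features: (h, w, c) per-pixel features, e.g. from a pretrained CNN.
    # Assumption: distance = 1 - cosine similarity, averaged over the
    # n_neighbors most similar other pixels (n_neighbors must be < h * w).
    # Dense pairwise similarity: only practical for small feature maps.
    h, w, c = features.shape
    f = features.reshape(-1, c).astype(np.float64)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)
    sim = f @ f.T                    # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)   # a pixel is not its own neighbour
    top = np.sort(sim, axis=1)[:, -n_neighbors:]
    return (1.0 - top).mean(axis=1).reshape(h, w)


def filter_by_area(masks, scores, object_area, max_ratio):
    # Prompt 2 (area rule): keep candidates smaller than max_ratio of the object.
    keep = [i for i, m in enumerate(masks) if m.sum() / object_area < max_ratio]
    return [masks[i] for i in keep], [scores[i] for i in keep]


def refine_scores(masks, scores, saliency):
    # Assumption: score times the mean of exp(saliency) inside each mask.
    refined = []
    for m, s in zip(masks, scores):
        weight = (m * np.exp(saliency)).sum() / max(m.sum(), 1)
        refined.append(s * weight)
    return refined


def select_top_k(masks, scores, k):
    # Prompt 3: retain the k candidates with the highest scores.
    order = np.argsort(scores)[::-1][:k]
    return [masks[i] for i in order], [scores[i] for i in order]


def fuse(masks, scores, shape):
    # Assumption: each pixel receives the mean score of the masks covering it
    # and 0 where no mask applies. Rescaling to [0, 1] is omitted here.
    num = np.zeros(shape)
    den = np.zeros(shape)
    for m, s in zip(masks, scores):
        num += m * s
        den += m
    return np.divide(num, den, out=np.zeros(shape), where=den > 0)
```

In this sketch, masks are binary arrays of the image size and the four steps are chained in the order given by the pipeline overview: `filter_by_area`, `refine_scores`, `select_top_k`, then `fuse`.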
Evaluation Metrics: To comprehensively evaluate ZSAD performance, three key metrics have been employed: 1) average precision (AP); 2) maximum F1 score (max-F1) [9]; and 3) maximum Intersection over Union (max-IoU). Specifically, different thresholds are applied to the computed anomaly maps, discretizing anomaly maps into binary values (0 and 1). Subsequently, the F1 score and IoU score under different thresholds are computed, and the maximum F1 and IoU scores under different thresholds are selected as max-F1 and max-IoU, respectively.\nImplementation Details: The proposed AnomalyVLM model incorporates the lighter architectures of VLMs by default, i.e., GroundingDINO with Swin-T¹ and SAM with ViT-T² as ARG and ARR, respectively. Input images are consistently resized to a resolution of 256 × 256 for evaluation. Visual saliency calculation employs WideResNet50 [55] for feature extraction and N = 400 for (3).\nPrior Knowledge: This study employs various sources of prior knowledge tailored to different categories. Texture categories are generally easier for anomaly detection [7]; thus, simple prompts are utilized for all texture categories in MVTec AD, MTD, and KSDD2. Specifically, for prompts 1, 2, and 3, this study utilizes \u0026ldquo;defect,\u0026rdquo; \u0026ldquo;Anomalies are smaller than 50%,\u0026rdquo; and \u0026ldquo;At most five abnormal regions,\u0026rdquo; respectively. Conversely, object categories pose more difficulties for anomaly detection. This study exploits the originally provided names of abnormal types by MVTec AD and VisA as prior knowledge. These provided names of anomaly types are paraphrased into nouns as inputs for prompt 1, such as \u0026ldquo;broken_large\u0026rdquo; in the bottle category transformed into \u0026ldquo;large breakage,\u0026rdquo; to facilitate easier region retrieval for VLMs.
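One way to picture the per-category prior knowledge is as a small record holding the three prompts. The sketch below is purely illustrative: the class name, field names, and the period-separated joining (modeled on the cable prompt quoted in the ablation study) are assumptions, and the thresholds in the bottle entry are placeholders rather than values reported for that category.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HybridPrompts:
    # Prompt 1: noun phrases naming potential abnormal regions.
    regions: List[str]
    # Prompt 2: area rule, maximum mask area relative to the object area.
    max_area_ratio: float
    # Prompt 3: estimated maximum number of abnormal regions (top-K).
    max_regions: int

    def text_prompt(self) -> str:
        # Join the phrases with periods into a single detector query.
        return '. '.join(self.regions)


# Defaults described for texture categories: 'defect', 50 %, at most five regions.
TEXTURE_DEFAULT = HybridPrompts(['defect'], max_area_ratio=0.5, max_regions=5)

# Hypothetical object-category entry: only the paraphrase of 'broken_large'
# comes from the text; both thresholds are placeholders for illustration.
bottle = HybridPrompts(['large breakage'], max_area_ratio=0.3, max_regions=1)
```

Keeping the three prompts in one record reflects the user-centric design: adapting to a new category only means writing a new entry, with no retraining involved.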
For prompts 2 and 3, different thresholds are introduced empirically from prior knowledge for enhanced detection performance.\nComparison Methods: This study conducts a comparative analysis of the proposed AnomalyVLM with several ZSAD alternatives, including WinCLIP [9], UTAD [17], and ClipSeg [10]. Within these ZSAD methods, the proposed AnomalyVLM, WinCLIP, and ClipSeg require textual prompts provided by prior knowledge, while UTAD detects anomalies based on visual saliency. The implementation of WinCLIP³ strictly follows the methodology outlined in its original paper, resulting in detection performance comparable to the reported results. It is worth noting that ClipSeg was not originally designed for ZSAD; therefore, this study utilizes its pretrained weights and provides \u0026ldquo;defect\u0026rdquo; as textual prompts to detect anomalies in a zero-shot manner. Additionally, an unsupervised anomaly detection method, PatchCore [6], is also evaluated.\nB. Main Results # The comparison results between AnomalyVLM and other ZSAD alternatives are presented in Table I. While WinCLIP [9] and UTAD [17] are specifically designed for ZSAD, the simple implementation of ClipSeg [10] for ZSAD achieves comparable performance. Moreover, AnomalyVLM achieves the highest detection performance among all ZSAD methods, with an average of 35.8% max-F1, 27.3% AP, and 28.0% max-IoU across all datasets, surpassing UTAD by a significant margin of 11.1% max-F1. It is worth noting that ClipSeg, WinCLIP, and AnomalyVLM all utilize textual prompts for guidance. The superior detection performance of AnomalyVLM demonstrates its ability to effectively integrate prior knowledge.\n¹ https://github.com/IDEA-Research/GroundingDINO\n² https://github.com/facebookresearch/segment-anything\n³ https://github.com/zqhang/Accurate-WinCLIP-pytorch\nAdditionally, Table I reveals that almost all methods achieve higher detection performance for texture categories compared to object categories.
This discrepancy arises because the normal and abnormal contexts within object categories prove to be more complex, posing obstacles to detecting anomalies.\nIn comparison to the unsupervised anomaly detection method PatchCore, it is evident that ClipSeg, UTAD, and WinCLIP perform weaker across nearly all categories. Conversely, AnomalyVLM even outperforms PatchCore [6] by a large margin for some categories, such as leather, tile, and wood, while requiring no training data. It is also notable that PatchCore fails to operate effectively on the MTD dataset due to the absence of available normal training samples, whereas AnomalyVLM achieves promising detection performance on categories like MT_Blowhole and MT_Fray.\nFig. 3 presents qualitative comparisons between AnomalyVLM and other alternatives. It is evident that UTAD and WinCLIP exhibit limited efficacy in ZSAD, while ClipSeg [10] emerges as a strong competitor, successfully detecting anomalies in most categories but showing significant false alarms in the background. In contrast, the proposed AnomalyVLM achieves superior anomaly detection performance, accurately identifying anomalies in these novel categories. Compared to PatchCore, AnomalyVLM demonstrates comparable detection performance and can even yield more accurate results, particularly in texture categories. Notably, AnomalyVLM requires no training data, whereas PatchCore necessitates large amounts of training data from target categories.\nC. Ablation Study # This study conducts comprehensive ablation studies to assess the impact of prior knowledge. To this end, we replace hybrid prompts derived from prior knowledge with generic prompts. Additionally, Fig. 4 visually showcases several cases to illustrate the detection process of AnomalyVLM and the influence of prior knowledge. In Fig. 
4, the masks and their scores after integrating each of prompts 1–3 are aggregated with (5) for better visualization.\nTABLE I\nQUANTITATIVE COMPARISONS OF ANOMALYVLM WITH ALTERNATIVE ZSAD METHODS. RESULTS ARE PRESENTED AS (MAX-F1, AP, MAX-IOU). BEST SCORES ARE HIGHLIGHTED IN BOLD, WHILE THE SECOND-BEST SCORES ARE UNDERLINED. PATCHCORE IS A SOTA UNSUPERVISED ANOMALY DETECTION METHOD AND IS EXCLUDED FROM RANKING\nTABLE II COMPARISON BETWEEN QUANTITATIVE RESULTS WITH PROMPT 1 AND ALTERNATIVE GENERIC INPUTS \u0026ldquo;DEFECT.\u0026rdquo; RESULTS ARE PRESENTED AS (MAX-F1, AP, MAX-IOU).\nAblation on Prompt 1 (Abnormal Regions): This study utilizes the generic prompt \u0026ldquo;defect\u0026rdquo; to replace prompt 1 about abnormal regions to investigate the influence of prior knowledge. Since generic prompts are utilized for texture categories by default, Table II only presents the comparison results on object categories. It is clear that the detection performance of AnomalyVLM remains promising compared to other ZSAD alternatives when prior knowledge regarding abnormal regions is unavailable, i.e., employing only generic prompts. With a simple generic prompt \u0026ldquo;defect,\u0026rdquo; AnomalyVLM still achieves an average max-F1 of 21.6% on VisA. However, it has to be admitted that the anomaly detection performance of AnomalyVLM undergoes a slight decline across the object categories when lacking prior knowledge. The most significant decrease is observed in MVTec AD (Object), with a 4.1% lower max-F1 score. This decline is attributed to the complexity of these object categories, as generic names of abnormal regions may not accurately depict the comprised anomalies. As illustrated in Fig. 4(a), the presence of a \u0026ldquo;missing cable\u0026rdquo; within the cable constitutes an anomaly.
However, GroundingDINO fails to detect such regions via generic textual prompts \u0026ldquo;defect.\u0026rdquo; In contrast, by introducing more specific names of abnormal regions, i.e., \u0026ldquo;Crack. Bent wire. Missing cable,\u0026rdquo; the proposed AnomalyVLM successfully identifies the abnormal region. This emphasizes the importance of prior knowledge in identifying all potential anomalies.\nAblation on Prompt 2 (Symbolic Rules): This study implements a symbolic rule regarding the area of abnormal regions for prompt 2. To better understand the influence of prompt 2, this study employs different generic area thresholds for all categories to replace the specific prompts, as depicted in Fig. 5. The figure clearly illustrates that area thresholds have a significant impact on anomaly detection performance. Particularly, the detection performance of AnomalyVLM tends to improve and then decrease with larger area thresholds. This is mainly because a small area threshold may wrongly filter out faithful abnormal candidates, while a large threshold could result in more false alarms. For instance, Fig. 4(b) illustrates how a single area threshold effectively filters out the false alarm associated with the entire bottle. Specifically, employing a threshold of 0.3, derived from prior knowledge, substantially mitigates false alarms, whereas using a generic threshold of 0.9 introduces severe false alarms. Generally, Fig. 5 shows that the optimal detection performance without prior knowledge is attained with a generic threshold of 0.5, with which AnomalyVLM still outperforms other ZSAD alternatives. This demonstrates the superiority of AnomalyVLM even in the absence of prior knowledge. Conversely, optimized thresholds derived from prior knowledge for individual categories within VisA and MVTec AD (Object) lead to significantly improved detection performance with prompts, as shown in Fig. 5. This Fig. 3. 
Qualitative comparison between the proposed AnomalyVLM and alternative ZSAD methods. From top to bottom: (a) testing image, (b) corresponding ground truth, followed by anomaly maps generated by (c) PatchCore [6], (d) ClipSeg [10], (e) UTAD [17], (f) WinCLIP [9], and (g) proposed AnomalyVLM.\nFig. 4. Qualitative analysis of hybrid prompts. The red rectangle denotes the prompt to be replaced with generic prompts. For (a)–(c), generic prompts are \u0026ldquo;Defect,\u0026rdquo; \u0026ldquo;Anomalies are smaller than 50%,\u0026rdquo; and \u0026ldquo;At most five abnormal regions.\u0026rdquo; The top and bottom rows display results obtained with specific and generic inputs, respectively. From left to right: visualized anomaly maps after integrating prompts 1–3.\nunderscores the crucial role of prior knowledge in ZSAD. While this study primarily focuses on symbolic rules based on area, looking forward, the integration of additional symbolic rules, such as location and color, holds the potential to yield even more favorable results.\nAblation on Prompt 3 (Region Numbers): Prompt 3 is about the estimated maximum number of abnormal regions for the testing category. This study replaces the prompts with generic estimated numbers for all categories to assess the performance without prior knowledge. As depicted in Fig. 6, the anomaly detection performance exhibits an increasing and then decreasing trend with increasing estimated numbers of abnormal regions. This trend arises because the probability of retaining faithful abnormal candidates improves with a larger estimated number, while simultaneously introducing more false alarms. For instance, Fig. 4(c) illustrates the comparison Fig. 6. Analysis on prompt 3: region numbers. X-axis: different generic region number thresholds. Y-axis: detection performance. For VisA and MVTec AD (Object), thresholds for individual categories are empirically selected by default, and the resulting detection performance is in dashed lines.\nFig. 
7. Visualization of visual saliency maps. From top to bottom: (a) testing image, (b) corresponding ground truth, and (c) computed visual saliency map.\nbetween one (from prior knowledge) and five (from generic prompts) retained abnormal regions for the candle category. It demonstrates that retaining more candidates leads to false alarms. Hence, it is crucial for users to determine a suitable number threshold according to practical applications for optimal anomaly detection performance.\nInfluence of the Visual Saliency-Based Confidence Refinement: This study introduces a refinement strategy that utilizes visual saliency to calibrate the confidence scores of individual anomalies. Visualizations of some visual saliency maps are provided in Fig. 7, illustrating that these maps yield notably higher values for abnormal regions, rendering visual saliency suitable for refinement purposes. To further elucidate the impact of the refinement strategy, Table III presents the ZSAD performance with and without the strategy. It is evident that the detection performance declines when the refinement strategy is not applied, such as a decrease of 5.6% in max-F1 on MTD. This underscores the effectiveness of the refinement strategy.\nTABLE III COMPARISON BETWEEN QUANTITATIVE RESULTS WITH AND WITHOUT THE VISUAL SALIENCY-BASED REFINEMENT STRATEGY. RESULTS ARE PRESENTED AS (MAX-F1, AP, MAX-IOU)\nTABLE IV COMPLEXITY COMPARISONS. THE EFFICIENCY OF EXECUTION IS QUANTIFIED IN FPS. FOR THE PROPOSED ANOMALYVLM, WE LIST THE PERFORMANCE FOR ARG+ARR WITH DIFFERENT BACKBONES\nD. Complexity Analysis # This section evaluates the complexity of AnomalyVLM in comparison with alternative methods. For AnomalyVLM, this study analyzes its complexity by varying the backbone models of ARG (GroundingDINO [13]) and ARR (SAM [16]), as detailed in Table IV. 
Notably, the original SAM model supports only ViT-H/L/B backbones, while ViT-T for SAM is implemented using MobileSAM [56] through knowledge distillation, which significantly reduces computational complexity compared to the original SAM.\nAs shown in Table IV, the choice of backbones has a minimal impact on detection performance, with only a marginal 1.9% improvement in max-F1 when upgrading from Swin-T+ViT-T to Swin-B+ViT-H. This suggests that even lightweight backbones, such as ViT-T for SAM, possess sufficient generic knowledge and can be effectively personalized for ZSAD using the derived hybrid prompts.\nTo quantify computational complexity, both the comparison methods and AnomalyVLM were implemented on a single NVIDIA-3090Ti GPU with a batch size of one. The results in Table IV indicate that AnomalyVLM achieves its lowest complexity with Swin-T for GroundingDINO and ViT-T for SAM, requiring 192.3 MB of parameters and operating at 3.7 FPS, slightly faster than WinCLIP, which is slower due to the use of a sliding window. While AnomalyVLM incurs a higher computational burden than other alternatives because of the two VLMs employed, it significantly outperforms these methods in detection performance.\nWe further examined the average frames per second (FPS) of AnomalyVLM for texture and object images, finding no significant difference in efficiency between these scenarios.\nFig. 8. Real-World application setup. (a) Established image acquisition device for inspection. (b) Collected normal samples. (c) Collected abnormal samples.\nThis consistency suggests that scenario complexity has a negligible impact on the efficiency of AnomalyVLM. 
However, in industrial settings, higher-resolution images may be required, potentially increasing computational costs and limiting the current version\u0026rsquo;s deployability in practical systems.\nThis study also analyzes AnomalyVLM (Swin-T+ViT-T) and identifies ARG as the primary computational bottleneck, accounting for 82.4% of the computational time, followed by ARR (9.3%), the Visual Saliency Extractor (6.1%), and other operations (2.1%). To facilitate practical deployments, future work could explore techniques such as knowledge distillation, as employed in MobileSAM, to optimize GroundingDINO. For this purpose, collecting a large industrial object detection dataset could enable fine-tuning and distillation of SAM and GroundingDINO, enhancing efficiency and specificity for industrial anomaly detection applications.\nE. Real-World Evaluation # This study applies AnomalyVLM to a real-world automotive plastic parts inspection task to evaluate its practicability. Specifically, as illustrated in Fig. 8, an image acquisition device consisting of four light sources and three cameras was constructed to collect data. Using this device, images from 50 plastic parts were obtained. These images were divided into small patches with a resolution of 256 × 256, resulting in 3980 normal patches and 81 abnormal patches. This collected dataset presents greater challenges than existing anomaly detection datasets, such as MVTec AD [52], due to the higher variability in normal patterns and the minute size of anomalies. These unique characteristics render other ZSAD methods (ClipSeg, UTAD, and WinCLIP) ineffective in detecting anomalies within the collected dataset, as shown in Fig. 9. In contrast, with straightforward hybrid prompts such as \u0026ldquo;Dot. Scratch.,\u0026rdquo; \u0026ldquo;Anomalies are smaller than 5%,\u0026rdquo; and \u0026ldquo;At most one abnormal region,\u0026rdquo; the proposed AnomalyVLM effectively detects these subtle anomalies. 
Table V highlights the performance of AnomalyVLM, achieving a significant 17.8% improvement in max-IoU on this challenging real-world scenario compared to alternative methods. These results underscore the superiority\nFig. 9. Qualitative comparison for real-world automotive part inspection. From top to bottom: (a) testing image, (b) corresponding ground truth, followed by anomaly maps generated by (c) ClipSeg [10], (d) UTAD [17] , (e) WinCLIP [9], and (f) proposed AnomalyVLM.\nFig. 10. Failure cases. (a) Unclear boundary. (b) Complex components.\nTABLE V QUALITATIVE COMPARISONS OF ANOMALYVLM WITH ALTERNATIVE ZSAD METHODS FOR THE REAL-WORLD AUTOMOTIVE PLASTIC PARTS INSPECTION\nof AnomalyVLM in addressing the complexities of real-world anomaly detection tasks. To promote evaluations on novel categories, this study provides an online demo, available at https://github.com/caoyunkang/Segment-Any-Anomaly.\nF. Discussion # The proposed AnomalyVLM is user-centric and can seamlessly incorporate prior knowledge across specific categories without requiring additional training. Within the zero-shot detection paradigm, AnomalyVLM demonstrates exceptional detection performance across four publicly available datasets and real-world applications, even surpassing the unsupervised method, PatchCore, in certain categories. Moreover, even in the absence of specific prior knowledge, i.e., when only generic prompts are available, AnomalyVLM consistently outperforms other ZSAD alternatives. This combination of superior performance, flexibility, and adaptability establishes AnomalyVLM as a compelling solution for ZSAD.\nHowever, AnomalyVLM is not without limitations, which primarily stem from inherent drawbacks in the utilized VLMs, as illustrated in Fig. 10. Specifically, current VLMs heavily\ndepend on object boundaries for region retrieval. In cases where abnormal regions lack clear boundaries, as shown in Fig. 10(a), AnomalyVLM may struggle to detect such anomalies. 
Additionally, certain categories contain complex components, such as PCB Pins and USB Sockets depicted in Fig. 10(b), which existing VLMs find challenging to distinguish accurately in a zero-shot manner, potentially resulting in false alarms.\nLooking forward, advancements in more specialized VLMs tailored to industrial scenarios offer promising avenues for addressing these challenges. For instance, adapting CLIP for few-shot anomaly detection via target data, as demonstrated in AnomalyGPT [38], or supervised training of CLIP on auxiliary annotated anomaly detection data, as explored in AdaCLIP [57], provides potential pathways for enhancing performance and mitigating these limitations.\nV. CONCLUSION # In conclusion, this study proposes AnomalyVLM for the challenging ZSAD task. Considering the dependence of abnormalities on normal contexts within individual categories, the study suggests leveraging available prior knowledge to provide insights into normal and abnormal states within testing categories. To this end, AnomalyVLM introduces an ARG and an ARR with off-the-shelf VLMs, personalized by hybrid prompts derived from prior knowledge. These prompts enhance detection performance and afford flexibility and control in identifying anomalies across novel categories without any training. Experimental results and the real-world evaluation attest to the superior detection performance, generalization capacity, and flexibility of the proposed AnomalyVLM.\nFuture efforts will focus on refining VLMs tailored specifically to industrial applications, thereby enhancing both detection performance and efficiency.\nREFERENCES # [1] Y. Cao et al., \u0026ldquo;A survey on visual anomaly detection: Challenge, approach, and prospect,\u0026rdquo; 2024, arXiv:2401.16402 .\n[2] G. Xie et al., \u0026ldquo;IM-IAD: Industrial image anomaly detection benchmark in manufacturing,\u0026rdquo; IEEE Trans. Cybern., vol. 54, no. 5, pp. 2720–2733, May 2024.\n[3] Z. Zhang, Z. Zhao, X. Zhang, C. 
Sun, and X. Chen, \u0026ldquo;Industrial anomaly detection with domain shift: A real-world dataset and masked multi-scale reconstruction,\u0026rdquo; Comput. Ind., vol. 151, Oct. 2023, Art. no. 103990.\n[4] A. Voulodimos et al., \u0026ldquo;A dataset for workflow recognition in industrial scenes,\u0026rdquo; in Proc. 18th IEEE Int. Conf. Image Process., 2011, pp. 3249–3252.\n[5] M. Wang, D. Zhou, and M. Chen, \u0026ldquo;Hybrid variable monitoring mixture model for anomaly detection in industrial processes,\u0026rdquo; IEEE Trans. Cybern., vol. 54, no. 1, pp. 319–331, Jan. 2024.\n[6] K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. Gehler, \u0026ldquo;Towards total recall in industrial anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 14318–14328.\n[7] Y. Cao, X. Xu, Z. Liu, and W. Shen, \u0026ldquo;Collaborative discrepancy optimization for reliable image anomaly localization,\u0026rdquo; IEEE Trans. Ind. Informat., vol. 19, no. 11, pp. 10674–10683, Nov. 2023.\n[8] Y. Cai, D. Liang, D. Luo, X. He, X. Yang, and X. Bai, \u0026ldquo;A discrepancy aware framework for robust anomaly detection,\u0026rdquo; IEEE Trans. Ind. Informat., vol. 20, no. 3, pp. 3986–3995, Mar. 2024.\n[9] J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, \u0026ldquo;WinCLIP: Zero-/few-shot anomaly classification and segmentation,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 19606–19616.\n[10] T. Lüddecke and A. Ecker, \u0026ldquo;Image segmentation using text and image prompts,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. , 2022, pp. 7086–7096.\n[11] X. Liu, Y. He, Y.-M. Cheung, X. Xu, and N. Wang, \u0026ldquo;Learning relationship-enhanced semantic graph for fine-grained image–text matching,\u0026rdquo; IEEE Trans. Cybern., vol. 54, no. 2, pp. 948–961, Feb. 2024.\n[12] A. 
Radford et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.\n[13] S. Liu et al., \u0026ldquo;Grounding DINO: Marrying DINO with grounded pretraining for open-set object detection,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis., 2025, pp. 38–55.\n[14] J. Li, D. Li, C. Xiong, and S. Hoi, \u0026ldquo;BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,\u0026rdquo; in Proc. Int. Conf. Mach. Learn., 2022, pp. 1–13.\n[15] Z. Yang et al., \u0026ldquo;The Dawn of LMMs: Preliminary explorations with GPT-4V(ision),\u0026rdquo; 2023, arXiv:2309.17421.\n[16] A. Kirillov et al., \u0026ldquo;Segment anything,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 4015–4026.\n[17] T. Aota, L. T. T. Tong, and T. Okatani, \u0026ldquo;Zero-shot versus many-shot: Unsupervised texture anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2023, pp. 5564–5572.\n[18] Y. Cao, X. Xu, C. Sun, L. Gao, and W. Shen, \u0026ldquo;BiaS: Incorporating biased knowledge to boost unsupervised image anomaly localization,\u0026rdquo; IEEE Trans. Syst., Man, Cybern., Syst., vol. 54, no. 4, pp. 2342–2353, Apr. 2024.\n[19] H. Yao et al., \u0026ldquo;Dual-attention transformer and discriminative flow for industrial visual anomaly detection,\u0026rdquo; IEEE Trans. Autom. Sci. Eng., vol. 21, no. 4, pp. 6126–6140, Oct. 2024.\n[20] Y. Cao, Q. Wan, W. Shen, and L. Gao, \u0026ldquo;Informative knowledge distillation for image anomaly segmentation,\u0026rdquo; Knowl. Based Syst., vol. 248, Jul. 2022, Art. no. 108846.\n[21] C. Huang, H. Guan, A. Jiang, Y. Zhang, M. Spratling, and Y.-F. Wang, \u0026ldquo;Registration based few-shot anomaly detection,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis., 2022, pp. 303–319.\n[22] S. 
Kwak et al., \u0026ldquo;Few-shot anomaly detection via personalization,\u0026rdquo; IEEE Access, vol. 12, pp. 11035–11051, 2024.\n[23] J. Božič, D. Tabernik, and D. Skočaj, \u0026ldquo;Mixed supervision for surface-defect detection: From weakly to fully supervised learning,\u0026rdquo; Comput. Ind., vol. 129, Aug. 2021, Art. no. 103459.\n[24] B. Hu et al., \u0026ldquo;A lightweight spatial and temporal multi-feature fusion network for defect detection,\u0026rdquo; IEEE Trans. Image Process., vol. 30, pp. 472–486, 2020.\n[25] X. Yao, R. Li, J. Zhang, J. Sun, and C. Zhang, \u0026ldquo;Explicit boundary guided semi-push-pull contrastive learning for supervised anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 24490–24499.\n[26] Q. Wan, Y. Cao, L. Gao, X. Li, and Y. Gao, \u0026ldquo;Deep feature contrasting for industrial image anomaly segmentation,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 73, pp. 1–11, Jan. 2024.\n[27] C. Huang et al., \u0026ldquo;Self-supervision-augmented deep autoencoder for unsupervised visual anomaly detection,\u0026rdquo; IEEE Trans. Cybern., vol. 52, no. 12, pp. 13834–13847, Dec. 2022.\n[28] Y. Liang, J. Zhang, S. Zhao, R. Wu, Y. Liu, and S. Pan, \u0026ldquo;Omni-frequency channel-selection representations for unsupervised anomaly detection,\u0026rdquo; IEEE Trans. Image Process., vol. 32, pp. 4327–4340, 2023.\n[29] C. Huang, Q. Xu, Y. Wang, Y. Wang, and Y. Zhang, \u0026ldquo;Self-supervised masking for unsupervised anomaly detection and localization,\u0026rdquo; IEEE Trans. Multimedia, vol. 25, pp. 4426–4438, 2023.\n[30] C. Huang et al., \u0026ldquo;Weakly supervised video anomaly detection via self-guided temporal discriminative transformer,\u0026rdquo; IEEE Trans. Cybern., vol. 54, no. 5, pp. 3197–3210, May 2024.\n[31] Y.-H. Yoo, U.-H. Kim, and J.-H. 
Kim, \u0026ldquo;Convolutional recurrent reconstructive network for spatiotemporal anomaly detection in solder paste inspection,\u0026rdquo; IEEE Trans. Cybern., vol. 52, no. 6, pp. 4688–4700, Jun. 2022.\n[32] W. Luo, H. Yao, W. Yu, and Z. Li, \u0026ldquo;AMI-Net: Adaptive mask inpainting network for industrial anomaly detection and localization,\u0026rdquo; IEEE Trans. Autom. Sci. Eng., vol. 22, pp. 1591–1605, 2025, doi: 10.1109/TASE.2024.3368142 .\n[33] H. Yao, W. Yu, and X. Wang, \u0026ldquo;A feature memory rearrangement network for visual inspection of textured surface defects toward edge intelligent manufacturing,\u0026rdquo; IEEE Trans. Autom. Sci. Eng., vol. 20, no. 4, pp. 2616–2635, Oct. 2023.\n[34] C. Hu, J. Wu, C. Sun, X. Chen, A. K. Nandi, and R. Yan, \u0026ldquo;Unified flowing normality learning for rotating machinery anomaly detection in continuous time-varying conditions,\u0026rdquo; IEEE Trans. Cybern., vol. 55, no. 1, pp. 221–233, Jan. 2025.\n[35] Y. Zhou, X. Xu, J. Song, F. Shen, and H. T. Shen, \u0026ldquo;MSFlow: Multiscale flow-based framework for unsupervised anomaly detection,\u0026rdquo; IEEE Trans. Neural Netw. Learn. Syst., early access, Jan. 9, 2024, doi: 10.1109/TNNLS.2023.3344118 .\n[36] W. Cui et al., \u0026ldquo;A rapid screening method for suspected defects in steel pipe welds by combining correspondence mechanism and normalizing flow,\u0026rdquo; IEEE Trans. Ind. Informat., vol. 20, no. 9, pp. 11171–11180, Sep. 2024.\n[37] Y. Jiang, Y. Cao, and W. Shen, \u0026ldquo;Prototypical learning guided context-aware segmentation network for few-shot anomaly detection,\u0026rdquo; IEEE Trans. Neural Netw. Learn. Syst., early access, Oct. 1, 2024, doi: 10.1109/TNNLS.2024.3463495 .\n[38] Z. Gu, B. Zhu, G. Zhu, Y. Chen, M. Tang, and J. Wang, \u0026ldquo;AnomalyGPT: Detecting industrial anomalies using large vision-language models,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell., 2024, pp. 1932–1940.\n[39] Z. Li, L. Gao, Y. Gao, X. Li, and H. 
Li, \u0026ldquo;Zero-shot surface defect recognition with class knowledge graph,\u0026rdquo; Adv. Eng. Informat., vol. 54, Oct. 2022, Art. no. 101813.\n[40] X. Chen et al., \u0026ldquo;CLIP-AD: A language-guided staged dual-path model for zero-shot anomaly detection,\u0026rdquo; in Proc. Int. Joint Conf. Artif. Intell., 2024, pp. 17–33.\n[41] Y. Dong, C. Xie, L. Xu, H. Cai, W. Shen, and H. Tang, \u0026ldquo;Generative and contrastive combined support sample synthesis model for few-/zero-shot surface defect recognition,\u0026rdquo; IEEE Trans. Instrum. Meas., vol. 73, pp. 1–15, 2024, doi: 10.1109/TIM.2023.3329163.\n[42] E. Schwartz et al., \u0026ldquo;MAEDAY: MAE for few- and zero-shot anomaly-detection,\u0026rdquo; Comput. Vis. Image Understand., vol. 241, Apr. 2024, Art. no. 103958.\n[43] C. Schuhmann et al., \u0026ldquo;LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs,\u0026rdquo; in Proc. Neural Inf. Process. Syst., 2021, pp. 1–5.\n[44] Y. Li, A. Goodge, F. Liu, and C.-S. Foo, \u0026ldquo;PromptAD: Zero-shot anomaly detection using text prompts,\u0026rdquo; in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2024, pp. 1093–1102.\n[45] Q. Zhou, G. Pang, Y. Tian, S. He, and J. Chen, \u0026ldquo;AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection,\u0026rdquo; in Proc. Int. Conf. Learn. Represent., 2024, pp. 1–31.\n[46] T. Hui et al., \u0026ldquo;Language-aware spatial-temporal collaboration for referring video segmentation,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 7, pp. 8646–8659, Jul. 2023.\n[47] D. Chen et al., \u0026ldquo;Protoclip: Prototypical contrastive language image pretraining,\u0026rdquo; IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 1, pp. 610–624, Jan. 2025.\n[48] A. Maalouf et al., \u0026ldquo;Follow anything: Open-set detection, tracking, and following in real-time,\u0026rdquo; IEEE Robot. Autom. Lett., vol. 9, no. 4, pp. 3283–3290, Apr. 2024.\n[49] J. 
Wang et al., \u0026ldquo;Review of large vision models and visual prompt engineering,\u0026rdquo; Meta-Radiol., vol. 1, no. 3, 2023, Art. no. 100047.\n[50] W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, and R. Yang, \u0026ldquo;Salient object detection in the deep learning era: An in-depth survey,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 6, pp. 3239–3259, Jun. 2022.\n[51] Q. Lai, T. Zhou, S. Khan, H. Sun, J. Shen, and L. Shao, \u0026ldquo;Weakly supervised visual saliency prediction,\u0026rdquo; IEEE Trans. Image Process., vol. 31, pp. 3111–3124, 2022.\n[52] P. Bergmann, K. Batzner, M. Fauser, D. Sattlegger, and C. Steger, \u0026ldquo;The MVTec anomaly detection dataset: A comprehensive real-world dataset for unsupervised anomaly detection,\u0026rdquo; Int. J. Comput. Vis., vol. 129, no. 4, pp. 1038–1059, 2021.\n[53] Y. Zou, J. Jeong, L. Pemula, D. Zhang, and O. Dabeer, \u0026ldquo;SPot-the-difference self-supervised pre-training for anomaly detection and segmentation,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis., 2022, pp. 392–408.\n[54] Y. Huang, C. Qiu, Y. Guo, X. Wang, and K. Yuan, \u0026ldquo;Surface defect saliency of magnetic tile,\u0026rdquo; in Proc. Int. Conf. Autom. Sci. Eng., 2018, pp. 612–617.\n[55] S. Zagoruyko and N. Komodakis, \u0026ldquo;Wide residual networks,\u0026rdquo; in Proc. Brit. Mach. Vis. Conf., 2016, pp. 1–12.\n[56] C. Zhang et al., \u0026ldquo;Faster segment anything: Towards lightweight SAM for mobile applications,\u0026rdquo; 2023, arXiv:2306.14289.\n[57] Y. Cao, J. Zhang, L. Frittoli, Y. Cheng, W. Shen, and G. Boracchi, \u0026ldquo;AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis., 2024, pp. 55–72.\nYunkang Cao (Graduate Student Member, IEEE) received the B.S. degree from the Huazhong University of Science and Technology, Wuhan, China, in 2020, where he is currently pursuing the Ph.D. 
degree in mechanical engineering.\nHis current research interests include machine vision, visual anomaly detection, and industrial foundation models.\nXiaohao Xu received the B.S. degree in mechanical design, manufacturing and automation from the Huazhong University of Science and Technology, Wuhan, China, in 2022. He is currently pursuing the Ph.D. degree with the Robotics Department, University of Michigan at Ann Arbor, Ann Arbor, MI, USA.\nHis current research interests include the fundamental theory and real-world applications of robotics, computer vision, and video understanding.\nYuqi Cheng (Student Member, IEEE) received the B.S. degree in mechanical design, manufacturing and automation and the M.S. degree in mechanical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2020 and 2023, respectively, where he is currently pursuing the Ph.D. degree.\nHis research interests include point cloud processing, 3-D measurement, and anomaly detection.\nChen Sun received the B.S. degree in mechanical design, manufacturing and automation from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2020, and the M.S. degree in mechanical engineering from the State Key Laboratory of Digital Manufacturing Equipment and Technology, HUST in 2023. He is currently pursuing the Ph.D. degree in mechanical engineering with the University of Toronto, Toronto, ON, Canada.\nHis research interests include deep learning, computer vision, and medical image analysis.\nZongwei Du received the M.S. degree in mechanical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2024.\nHis current research interests include defect recognition, image generation, and limited data learning.\nLiang Gao (Senior Member, IEEE) received the Ph.D. 
degree in mechatronic engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2002.\nHe is currently a Professor with the Department of Industrial and Manufacturing System Engineering, State Key Laboratory of Intelligent Manufacturing Equipment and Technology, School of Mechanical Science and Engineering, HUST. He has published more than 400 refereed articles. His research interests include operations research and optimization, big data, and machine learning.\nProf. Gao serves as the Co-Editor-in-Chief for IET Collaborative Intelligent Manufacturing and an Associate Editor for Swarm and Evolutionary Computation and Journal of Industrial and Production Engineering .\nWeiming Shen (Fellow, IEEE) received the B.E. and M.S. degrees in mechanical engineering from Northern Jiaotong University, Beijing, China, in 1983 and 1986, respectively, and the Ph.D. degree in system control from the University of Technology of Compiègne, Compiègne, France, in 1996.\nHe is currently a Professor with the Huazhong University of Science and Technology (HUST), Wuhan, China, and an Adjunct Professor with the University of Western Ontario, London, ON, Canada. Before joining HUST in 2019, he was a Principal Research Officer with the National Research Council Canada, Ottawa, ON, Canada. His work has been cited more than 24 000 times with an H-index of 76. He authored or co-authored several books and more than 560 articles in scientific journals and international conferences in related areas. His research interests include agent-based collaboration technologies and applications, collaborative intelligent manufacturing, the Internet of Things, and big data analytics.\nProf. 
Shen is a Fellow of the Canadian Academy of Engineering and the Engineering Institute of Canada.\n","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/personalizing_vision-language_models_with_hybrid_prompts_for_zero-shot_anomaly_detection/","section":"Papers","summary":"Introduces AnomalyVLM, a framework leveraging hybrid prompts derived from prior knowledge to enhance zero-shot anomaly detection by personalizing vision-language models, incorporating an anomaly region generator and refiner, and utilizing hybrid prompts for category-specific customization and improved detection performance.","title":"Personalizing Vision-Language Models With Hybrid Prompts for Zero-Shot Anomaly Detection","type":"other"},{"content":"","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/categories/training-free/","section":"Categories","summary":"","title":"Training Free","type":"categories"},{"content":"","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/categories/unsupervised/","section":"Categories","summary":"","title":"Unsupervised","type":"categories"},{"content":"","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/categories/weakly-supervised/","section":"Categories","summary":"","title":"Weakly Supervised","type":"categories"},{"content":"","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/weiming-shen/","section":"Authors","summary":"","title":"Weiming Shen","type":"authors"},{"content":"","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiaohao-xu/","section":"Authors","summary":"","title":"Xiaohao Xu","type":"authors"},{"content":"","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yunkang-cao/","section":"Authors","summary":"","title":"Yunkang Cao","type":"authors"},{"content":"","date":"13 February 
2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yuqi-cheng/","section":"Authors","summary":"","title":"Yuqi Cheng","type":"authors"},{"content":"","date":"13 February 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/zongwei-du/","section":"Authors","summary":"","title":"Zongwei Du","type":"authors"},{"content":"","date":"10 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chenting-xu/","section":"Authors","summary":"","title":"Chenting Xu","type":"authors"},{"content":"","date":"10 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/ke-xu/","section":"Authors","summary":"","title":"Ke Xu","type":"authors"},{"content":" PLOVAD: Prompting Vision-Language Models for Open Vocabulary Video Anomaly Detection # Chenting Xu , Ke Xu , Member, IEEE, Xinghao Jiang , Senior Member, IEEE , and Tanfeng Sun , Senior Member, IEEE\nAbstract— Video anomaly detection (VAD) confronts significant challenges arising from data scarcity in real-world open scenarios, encompassing sparse annotations, labeling costs, and limitations on closed-set class definitions, particularly when scene diversity surpasses available training data. Although current weakly-supervised VAD methods offer partial alleviation, their inherent confinement to closed-set paradigms renders them inadequate in open-world contexts. Therefore, this paper explores open vocabulary video anomaly detection (OVVAD), leveraging abundant vision-related language data to detect and categorize both seen and unseen anomalies. To this end, we propose a robust framework, PLOVAD, designed to prompt tuning large-scale pretrained image-based vision-language models (I-VLMs) for the OVVAD task. 
PLOVAD consists of two main modules: the Prompting Module, featuring a learnable prompt to capture domain-specific knowledge and an anomaly-specific prompt crafted by a large language model (LLM) to capture semantic nuances and enhance generalization; and the Temporal Module, which integrates temporal information using graph attention network (GAT) stacking atop frame-wise visual features to address the transition from static images to videos. Extensive experiments on four benchmarks demonstrate the superior detection and categorization performance of our approach in the OVVAD task without bringing excessive parameters.\nIndex Terms— Anomaly detection, open vocabulary learning, weakly-supervised, prompt tuning.\nI. INTRODUCTION # VIDEO anomaly detection (VAD) plays a pivotal role in various domains including intelligent video surveillance, industrial monitoring, and healthcare. Its primary aim is to detect irregularities within video streams [1], thereby enhancing accident prevention, recognizing security threats, and preserving system integrity. Over recent years, VAD has experienced significant advancements, with numerous scholarly contributions continually enriching its methodologies and applications [2]. To address the scarcity of labeled data, traditional VAD methods can be broadly classified into two categories: semi-supervised VAD [3] and weakly-supervised\nReceived 10 July 2024; revised 16 October 2024, 10 November 2024, and 11 December 2024; accepted 7 January 2025. Date of publication 10 January 2025; date of current version 6 June 2025. This work was supported by the National Natural Science Foundation of China under Grant 62372295 and Grant 62272299. This article was recommended by Associate Editor G. Jeon. 
(Corresponding author: Ke Xu.)\nThe authors are with the National Engineering Laboratory on Information Content Analysis Techniques, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: xuchenting@sjtu.edu.cn; l13025816@sjtu.edu.cn; xhjiang@ sjtu.edu.cn; tfsun@sjtu.edu.cn).\nDigital Object Identifier 10.1109/TCSVT.2025.3528108\nVAD [4]. In the case of semi-supervised VAD [3] , [5] , [6] , [7] , [8] , [9] , [10] , [11] , [12] , [13] , [14] , [15], models are developed by learning the patterns of normal samples and identifying outlier samples as abnormal using solely normal training data. However, it is impossible to gather all normal samples within a dataset, thereby limiting these methods to primarily detecting low-level features such as disparities in appearance and speed. As a result, semi-supervised VAD methods often exhibit a high false alarm rate for normal events that have not been previously observed. To address the incorrect recognition of video anomalies, some researchers propose weakly-supervised methods that incorporate anomalous data and use only video-level labels for training, thereby reducing the cost of manual labeling. While weakly-supervised VAD methods [1] , [4] , [16] , [17] , [18] , [19] , [20] , [21] , [22] , [23] , [24] , [25] , [26] , [27] , [28] has shown notable performance by partially alleviating data labeling challenges and maintaining balanced detection capabilities, it inherently operates within closed-set environments. The constrained class definitions restrict their utility in openworld scenarios, where the spectrum of real-world anomaly types or concepts surpasses the coverage of the available training dataset.\nRecent advancements in vision-language pre-training have led to the development of open vocabulary learning, which seeks to recognize categories beyond annotated label spaces by utilizing vision-related language data as auxiliary supervision [31]. 
The motivations to incorporate language data as auxiliary supervision are: 1) Language data necessitates less labeling effort, rendering it more cost-effective, with vision-related language being readily accessible. 2) Language data provides a broader vocabulary, enhancing scalability and generalizability. Open vocabulary settings are therefore more general, practical, and effective than weakly supervised settings (see Fig. 1 for a comparison of different VAD tasks). Additionally, image-based vision-language models (I-VLMs), exemplified by CLIP [32] and ALIGN [33], which are pre-trained on large-scale image-text pairs, demonstrate remarkable zero-shot performance on various vision tasks [34]. These models align images and language vocabularies within the same feature space, bridging the gap between visual and language data. Given the scarcity of VAD-related datasets and the extensive range of potential anomaly categories beyond the training sets currently accessible, language data presents itself as a more widely available and less costly alternative to video data. Wu et al. [30] first explored VAD in an open-vocabulary setting, inspiring our work. However, their approach relies solely on a pre-trained CLIP model [32], without discussing I-VLMs for a more general framework, and requires additional fine-tuning with pseudo-anomaly video samples generated by large generative models. Consequently, this paper aims to investigate the open vocabulary video anomaly detection (OVVAD) problem and work on the detection and categorization of unseen anomaly classes by harnessing the capabilities of I-VLMs and scalable textual data.\n1051-8215 © 2025 IEEE. All rights reserved, including rights for text and data mining, and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission.\nSee https://www.ieee.org/publications/rights/index.html for more information.\nFig. 1. Comparison of different VAD tasks. The left side of the dashed line shows the training phase, and the right side shows the inference phase. (a) Semi-supervised methods [3] identify outlier samples as abnormal using solely normal training data. (b) In the weakly-supervised setting [4], the model requires only video-level annotations and uses both normal and abnormal data during training, detecting seen anomalies in the closed set. 
(c) In the open-set setting [29], the model only needs to detect unseen anomalies and mark them as \u0026ldquo;unknown\u0026rdquo;. (d) In the open vocabulary setting [30], the model trained with seen anomalies can detect and categorize both seen and unseen anomalies.\nTherefore, a question is posed: how can we best exploit the capabilities of powerful I-VLMs and effectively adapt them to solve the OVVAD task? Inspired by existing works on the adaptation of VLMs to downstream tasks [35], [36], [37], we focus on prompt tuning [38]. To this end, we propose a framework, named PLOVAD, to effectively leverage prompt tuning to mine the generalized knowledge from I-VLMs for OVVAD. Following previous works [30], [39], the OVVAD task is deconstructed into two sub-tasks: detection and categorization. A learnable prompt, designed to encapsulate domain-specific knowledge, is proposed. This prompt is optimized during training, enabling a single instance of the visual backbone to effectively execute the detection sub-task within the specified domain dataset with minimal training parameters. To bridge the gap between image and video data, we further incorporate temporal information by appending a Temporal Module on top of frame-wise visual representation in the detection sub-task. 
Additionally, to enhance the generalization of the categorization sub-task, an anomaly-specific prompt crafted by a Large Language Model (LLM) with strong knowledge retention capabilities is employed. Our model effectively handles challenges in open vocabulary scenarios while efficiently conserving data resources. In summary, the main contributions of this paper are as follows:\nTo address the challenges presented by data scarcity in real-world open scenarios, this study investigates the VAD problem within an open vocabulary setting. Scalable and cost-effective vision-related language data are utilized to enhance the proposed approach, thereby facilitating efficient detection and recognition of previously unseen anomalies while optimizing data utilization.\nWe propose PLOVAD, a prompt tuning-based method designed to adapt large-scale pretrained I-VLMs for the OVVAD task. This approach enables the detection and categorization of both seen and previously unseen anomalies.\nA Prompting Module is devised to acquire both domain-specific and anomaly-specific knowledge tailored to the OVVAD task. Additionally, a GAT-based Temporal Module is designed to effectively integrate temporal information and address the transition from static images to video sequences.\nExtensive experiments on four challenging benchmark datasets (UCF-Crime, ShanghaiTech, XD-Violence, and UBnormal) confirm that the proposed method excels in the OVVAD task, achieving superior detection and categorization performance without introducing excessive parameters.\nII. RELATED WORK # A. Video Anomaly Detection # Weakly-supervised VAD [1], [4], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28] requires only video-level annotations, substantially reducing the need for manual annotation and associated costs. Tian et al. [1] integrated self-attention blocks and pyramid dilated convolution layers to capture multi-scale temporal relations. Zhou et al. 
[19] incorporated dual memory units to simultaneously learn representations of normal and abnormal data, effectively regulating uncertainty. Fan et al. [24] introduced a snippet-level attention mechanism and proposed a multi-branch supervision module to more effectively explore the abnormal segments within the entire video. While many of the weakly-supervised approaches utilized video inputs encoded by pre-trained models such as C3D [40] and I3D [41], a few works [20], [21], [26] incorporated CLIP [32]. Despite achieving commendable performance, these approaches primarily exploited CLIP\u0026rsquo;s potent visual representation capability while overlooking its zero-shot capability. Although weakly-supervised VAD has exhibited impressive performance and effectively mitigates data labeling challenges, it inherently operates within closed-set frameworks.\nVAD inherently possesses an open-world requirement. Traditional weakly-supervised methods, while effective in detecting known anomalies, may falter when confronted with unseen anomalies. Zhu et al. [29] proposed an approach that integrates evidential deep learning (EDL) and normalizing flows (NFs) within a multiple instance learning (MIL) framework to address open-set VAD challenges. However, this method is not explicitly designed to identify specific anomaly types, which could offer more informative and actionable insights.\nB. Open Vocabulary Learning # Open vocabulary learning aims to recognize categories beyond annotated label spaces by leveraging vision-related language vocabulary data [31]. Many open vocabulary methods [35], [42], [43] effectively blur the distinction between closed-set and open-set scenarios through alignment learned in I-VLMs, making them highly suitable for practical applications. 
Although open vocabulary learning has been extensively applied to diverse downstream tasks, including video understanding [35], [42], [43], [44], object detection [45], and semantic segmentation [46], its application in VAD remains relatively understudied. Wu et al. [30] explored VAD in the open vocabulary setting, but the categorization performance of their method, even after additional fine-tuning with generated pseudo-anomaly video samples, remains suboptimal. While open vocabulary VAD shares certain similarities with open vocabulary video understanding, it presents unique challenges, including a broader range of anomalies, limitations stemming from restricted datasets, and specific scenarios.\nOpen vocabulary action recognition [35], [42], [44], [47], [48], [49], [50] aims to classify unseen actions and is closely related to open vocabulary VAD. ActionCLIP [47] introduces a new perspective by emphasizing the semantic information of label texts instead of simply mapping labels to numerical values. Ju et al. [35] employ continuous vectors to prompt-tune pre-trained I-VLMs, bridging the domain gap from images to videos, while Rasheed et al. [42] introduce a video fine-tuned CLIP baseline. While anomalous actions constitute a subset of anomalies in VAD, the broader definition of anomalies is context-dependent.\nC. Prompt Tuning # Prompt tuning [38] is a strategy initially developed in the realm of NLP to tailor pre-trained language models (PLMs) [51] for specific downstream tasks. With meticulously designed prompt templates, PLMs exhibit robust few-shot or zero-shot generalization capabilities. Nonetheless, the creation of such templates necessitates substantial expertise, thereby restricting their adaptability. Recent research has expanded the scope of prompt tuning to encompass computer vision applications [35], [45], [46]. For example, CoOp [36] integrates learnable vectors by prepending them to category words, enabling the adaptation of CLIP to various recognition tasks. 
In this study, we explore prompt tuning techniques to address the OVVAD task.\nIII. PRELIMINARY # A. Image-Based Vision-Language Model # I-VLMs are typically pretrained using a large-scale corpus of image-text pairs. In this study, we leverage CLIP [32] as a foundational example for our research, with the understanding that the same techniques explored should be applicable to other I-VLMs as well. During the pretraining phase, each batch consists of image-text pairs. The image encoder f(.) and text encoder g(.) compute feature embeddings individually for images and texts. A cosine similarity matrix is then computed for all pairs. The training objective is to jointly optimize the image and text encoders by maximizing the similarity between correct pairs of associations and minimizing the similarity between incorrect pairs. During inference, CLIP classifies images in a zero-shot manner based on the similarity between image features f(x) and text features g(t). The text t is manually produced by filling class names into a predefined template, such as \u0026ldquo;a photo of a [CLASS]\u0026rdquo;. Finally, given an image x and text t, CLIP outputs prediction results for the given classes by:\nwhere ⊗ is cosine similarity. The sensitivity of I-VLMs to handcrafted prompt templates presents clear limitations on their efficient adaptation for novel downstream tasks, particularly in scenarios where expert knowledge may be challenging to condense or unavailable. Therefore, we consider automating such prompt design procedures, exploring efficient approaches to adapt the pre-trained I-VLMs for OVVAD tasks with minimal training requirements.\nB. Weakly-Supervised Video Anomaly Detection # Consider a set of weakly-labeled training videos V = {(vi, yi)}, i = 1, ..., |V|, where each video vi comprises a sequence of ni frames, and yi ∈ {0, 1} is the video-level label of video vi. 
If none of the frames in vi contain abnormal events, vi is defined as normal and yi = 0; otherwise, if at least one frame contains abnormal events, vi is defined as abnormal and yi = 1. The objective of the weakly-supervised video anomaly detection (WSVAD) task is to develop a detection model capable of predicting frame-level anomaly confidences using only video-level annotations.\nIn this paper, the Open Vocabulary Video Anomaly Detection (OVVAD) problem is explored under weak supervision, with weakly labeled datasets utilized.\nC. Open Vocabulary Video Anomaly Detection # Given a video dataset denoted as V = {vi ∈ R^(T×H×W×3)}, i = 1, ..., N, the objective of open vocabulary video anomaly detection (OVVAD) is to develop a unified model ΦOV capable of detecting and categorizing both base (seen) and novel (unseen) anomalies. Specifically, the model aims to predict the anomaly confidence for each frame and discern the video-level anomaly category for videos in which anomalies are detected. These categories are represented as textual descriptions T.\nDuring the training phase, the model is trained on pairs of (video, category label) belonging to the normal category C^n and the base (seen) anomaly categories C^a_base, denoted as {(Vtrain, Ytrain) ∼ {C^n, C^a_base}}. Conversely, during the testing phase, the model is evaluated on videos belonging to the normal category as well as both the base (seen) and novel (unseen) anomaly categories, represented as {(Vtest, Ytest) ∼ {C^n, C^a_base, C^a_novel}}. It is important to emphasize that C^a_base and C^a_novel are mutually exclusive and together comprise the entire anomaly category set C^a, where C^a_base ∪ C^a_novel = C^a and C^a_base ∩ C^a_novel = ∅.\nFig. 2. Overview of the proposed framework (PLOVAD). PLOVAD comprises two primary modules: the Temporal Module and the Prompting Module, catering to two sub-tasks: the detection sub-task and the categorization sub-task. The Temporal Module integrates temporal information by stacking a GAT atop frame-wise visual features to address the transition from static images to videos. 
The Prompting Module is employed to formulate a domain-specific prompt (DP) to capture domain-specific knowledge and an anomaly-specific prompt (AP) crafted by an LLM to capture semantic nuances and enhance generalization.\nIV. METHODOLOGY # In this paper, we propose a framework, PLOVAD, designed to prompt-tune large-scale pretrained I-VLMs for the OVVAD task. The formulation of the OVVAD problem and the I-VLM is presented in Section III. In the subsequent sections, an overview of the framework will first be provided, followed by a detailed description of each module and process.\nA. Overall Framework # As illustrated in Fig. 2, PLOVAD comprises two primary modules: the Temporal Module and the Prompting Module, catering to two sub-tasks: the detection sub-task and the categorization sub-task. Our methodology leverages both visual and textual data extracted from the training dataset. Initially, we segment the videos (visual data) into frame snippets, which are subsequently processed through the image encoder of the I-VLM to derive feature representations. Following this, these visual features are subjected to temporal modeling via the Temporal Module, an essential step in adapting the I-VLMs. The resultant features are then processed by a detector to generate frame-level anomaly confidence scores, primarily addressing the detection sub-task. Concurrently, the Prompting Module is employed to formulate language prompts designed to encapsulate domain-specific and anomaly-related knowledge. Corresponding text embeddings are then extracted after being fed into the text encoder of the I-VLM. Notably, we augment our semantic space by utilizing additional readily available language data, not limited to the training dataset, to conceptualize novel (unseen) anomalies within an open vocabulary setting. 
The enriched text embeddings and the feature representations obtained from the Temporal Module are aligned within the same feature space through the cross-modal alignment mechanism. Subsequently, category similarities are computed to facilitate the categorization sub-task.\nB. Prompting Module # The objective is to guide the pretrained I-VLM to undertake the OVVAD task with minimal training, effectively detecting and recognizing both base (seen) and novel (unseen) anomalies. In this module, we develop two prompts: a domain-specific learnable prompt (DP) and an anomaly-specific prompt (AP) to improve the generalization of the categorization sub-task. Both the class labels of the given training dataset and potential novel categories of anomalies are fed into the module to generate prompts.\nFig. 3. Illustration of the anomaly-specific prompt (AP) generation workflow. On the left, the process of defining the attribute set is visualized. The middle section depicts the querying process with LLMs, transforming anomaly-related categories and attributes into APs. On the right, we present sample snippets of the prompts generated.\nDomain-Specific Learnable Prompt: During training, both the image and text encoders of the I-VLM are kept frozen, while the gradients flow solely through the text encoder to update the learnable prompt vectors. These learnable vectors ultimately construct domain-specific prompt templates, comprehensible to the text encoder, and generate desired query embeddings. Inspired by prior work [35], [36] in the vision understanding field, we construct the DP by feeding the tokenized category-related text into the pretrained text encoder g(.):\nwhere Tc = {ti}, i = 1, ..., |C|, with ti representing the text associated with category ci (e.g., \u0026ldquo;fighting\u0026rdquo;); for simplicity, only the category name is used. Φtoken denotes the tokenizer of the I-VLM, and Ed ∈ R^(|C|×F) represents the corresponding generated text embedding. 
ek ∈ R^D signifies the k-th prompt vector that learns domain information and is updated during training, with D as its vector dimension. ek is shared for all anomaly categories and is specific to the current task.\nAnomaly-Specific Prompt: Obtaining expert annotations is both cost-prohibitive and labor-intensive, and it is also susceptible to individual biases, resulting in inconsistent outcomes. Therefore, we leverage the capabilities of LLMs, known for their extensive knowledge and versatility, to construct APs containing distinctive features for identifying specific anomaly categories. Inspired by prior LLM-prompting research [52], [53], we construct anomaly-specific prompts. The illustration of the generation workflow is shown in Fig. 3. Initially, the components of an anomaly are categorized into four fundamental aspects: Anomaly Specific Attributes, Scene, Actor, and Body, and the four most representative attributes are selected for each aspect, resulting in a set of 16 core attributes. For anomalies that are not behavioral, i.e., those involving appearance changes in the environment or the sudden appearance of objects, only the 8 core attributes from the first two aspects are considered. Building on the defined attributes, LLMs are employed to generate knowledge-rich descriptive sentences for each anomaly category, serving as the foundation for APs. Specifically, an LLM prompt template is devised: What are the primary characteristics of {category name} in term of its {attribute}. 
Subsequently, for each anomaly-related category, the template is populated with the category name and attributes, and then the LLM is queried to generate a suite of 16 (or 8) distinct anomaly-specific prompts.\nFurthermore, for each anomaly-related category, we derive text embeddings from its distinct set of APs utilizing the frozen text encoder of the I-VLM, and the average of these embeddings is then computed to serve as the final text embedding.\nwhere Tai denotes the set of APs belonging to anomaly-related category ci. The extracted text embeddings for all anomaly-related categories are denoted as Ea = {Eai}, i = 1, ..., |C|.\nC. Temporal Module # The Temporal Module is designed to capture temporal dependencies on top of the frame-wise visual representation extracted by the frozen image encoder. This module serves as a critical intermediary step in adapting the I-VLM to video-related tasks, bridging the gap between image and video data. We build the Temporal Module upon the graph attention network (GAT) [54]. GAT incorporates masked self-attentional layers to overcome the limitations of prior graph modeling approaches that rely solely on graph convolution, and it has been demonstrated to be more effective in addressing challenges in Anomalous Node Detection [55]. Mathematically, the formulation can be expressed as follows:\nX denotes the input visual features. Norm(.) denotes the normalization function, which in this paper is a combination of power normalization [56] and L2 normalization. Following the capture of temporal information via the GAT Module ΦGAT(.), the resulting output is normalized. Then a linear layer fl(.) is applied, followed by a residual connection and layer normalization LN, yielding the context feature Xg.\nGAT Module: The specifics of our GAT Module are delineated as follows. 
Firstly, to capture long-range dependencies based on the positional distance between every two frames, the distance adjacency matrix A is computed as:\nwhere the proximity relation between the i-th and j-th frames is determined only by their relative temporal position. A closer temporal distance between two frames corresponds to a higher proximity relation score. α and β are hyperparameters to control the temporal distance. Additionally, we employ a masking strategy on A to constrain attention to frames exhibiting large feature magnitudes:\nr is a hyperparameter to control the mask rate. Given the adjacency matrix A′, we input X = {x1, x2, ..., xn} into the GATConv layer to aggregate pertinent information from the relevant neighbors (according to A′) for xi. Here, xi ∈ R^(1×d) represents the feature vector of the i-th frame, and the output is denoted as Xo = {x′1, x′2, ..., x′n}, x′i ∈ R^F. To ensure generalizability, only a single GATConv layer is utilized. The GATConv layer [54] leverages multi-head attention; the detailed formulation is as follows:\nHere, xi undergoes a shared linear transformation via a weight matrix W and engages the self-attention mechanism at xj based on a shared self-attention function a, which is a single-layer feedforward neural network parameterized by weight vectors. The attention coefficient aij is computed after normalization using softmax. Subsequently, xi is updated to x′i through a linear combination of neighboring node features followed by a nonlinear activation function σ. We employ a K-head attention mechanism to calculate x′i by:\nD. Objective Function # In the detection pipeline, following prior weakly-supervised VAD works [1], [19], the Top-K mechanism is employed to implement the MIL-based training. Specifically, the video-level prediction pi is determined by averaging the top-k anomaly scores of the snippets in vi. 
By default, we set k = ⌊n/16⌋ + 1 for abnormal videos and k = 1 for normal videos, where n denotes the length of processed snippets used for batch training. The binary cross entropy loss lbce is then computed between video-level predictions and binary labels.\nHere, N is the number of videos, and yi is the binary label for vi.\nIn the categorization pipeline, cross-modal alignment serves as a pivotal component. Leveraging video features Xg extracted from the Temporal Module and text embeddings E = {Ed, Ea} from the Prompting Module, we conduct cross-modal alignment to ascertain the similarity logits S between snippets of each video vi and enriched categories.\nwhere f(.) is the image encoder of the I-VLM. To facilitate video-level categorization within weakly-supervised settings, we compute the mean value of the top-k similarity logits of snippets from vi as the video-level category similarity logits, with k consistent with the detection pipeline. The video-level similarity logits are denoted as Sv. Subsequently, a cross-entropy loss function is employed to determine the video-level categorization loss lce.\nHere, Yij is the ground truth label for vi and category cj, Svij is the video-level similarity logit for vi and cj, N is the number of videos, and |C| is the number of categories.\nDuring training, the total loss function is defined as:\nHere, the coefficient λ is utilized to modulate the alignment loss. By optimizing this objective function, our model effectively captures nuanced anomaly semantics, enhancing its performance in both detection and categorization.\nE. Inference # For the detection task, frame-wise anomaly scores are computed based on the input videos. A threshold τ can be set to activate alarms according to predefined sensitivity levels. For the categorization task, our model computes video-level similarity logits across a provided category list via alignment. 
Specifically, we align the extracted visual features of input videos Xg with the text embeddings Ed from the trained DP and Ea from the predefined AP to derive two distinct sets of similarity logits. For each category, the higher of the two logits is retained. Subsequently, for each video, the category with the highest similarity logit across all provided categories is recognized as its most probable prediction category.\nV. EXPERIMENT # A. Experiment Setup # Experiments are conducted on the UCF-Crime [4], ShanghaiTech [57], XD-Violence [58], and UBnormal [59] datasets, with a primary focus on the open vocabulary setting under weak supervision, hereafter referred to simply as the open vocabulary setting for brevity.\nDataset: For the weakly-labeled datasets, video-level annotations are used for training, while frame-level annotations are employed for testing. The UCF-Crime [4] dataset comprises 1,900 surveillance videos with 13 real-world anomalies. In the weakly-supervised setting, it is divided into 1,610 training and 290 testing videos, utilizing video-level annotations for training and frame-level annotations for testing. In the open vocabulary setting, the 13 anomaly categories are further categorized into 6 base and 7 novel classes, with only the base classes included in the training set.\nThe ShanghaiTech [57] dataset includes 437 videos from university campus surveillance cameras, depicting 130 abnormal events across 17 anomaly classes in 13 scenes. In the weakly-supervised setting, we follow the dataset configuration by Zhong et al. 
[16], with 238 training and 199 testing videos.\nFor the open vocabulary setting, we assigned names to 11 previously undefined classes in the original dataset based on text information from [39], splitting them into 5 base and 6 novel classes and restructuring the dataset accordingly.\nXD-Violence [58] dataset is a weakly-supervised dataset consisting of 4,754 untrimmed videos across six anomaly categories, with 3,954 videos for training and 800 for testing. For the open vocabulary setting, the six anomaly categories are split into three base and three novel classes. To align with our model, which supports single-category identification, videos containing multiple categories are excluded.\nUBnormal dataset [59] is a synthesized open-set benchmark comprising 543 videos across multiple virtual scenes. It defines seven types of normal events and 22 types of abnormal events. In its original structure, the anomalous event types in the test set differ from those in the training set. For the open vocabulary setting, we follow the original splits, classifying the 12 abnormal categories in the test set as novel classes and the remaining anomalies as base classes. As the UBnormal dataset lacks category labels, we manually annotate the anomaly videos to assess and report categorization performance.\nFor the OVVAD task, abnormal categories are systematically divided into two distinct groups: base categories and novel categories. During the training phase, only samples belonging to the base categories are utilized. Consistent with prevailing methodologies in open vocabulary learning, base categories predominantly comprise frequent and commonly occurring classes, while novel categories encompass the less frequent or rare classes. For the UCF-Crime dataset, the following categories are designated as base categories: abuse, assault, burglary, road accident, robbery, and stealing . 
Conversely, the remaining categories (arrest, arson, fighting, explosion, shooting, shoplifting, and vandalism) are categorized as novel categories, aligning with the approach delineated by Wu et al. [30]. For the ShanghaiTech dataset, categorization is based on the order of sample counts within the dataset. Specifically, the base categories include vehicle, skateboard, running, robbery, and fighting. Conversely, the novel categories comprise chasing, fall, car, throwing object, vaudeville, and monocycle. For the XD-Violence dataset, fighting, shooting, and car accident are considered as base categories, while the remaining three are classified as novel categories. UBnormal is used for open-set VAD, with test anomalies classified as novel categories and training anomalies as base categories. Test anomalies include running, having a seizure, lying down, shuffling, walking drunk, people and car accident, car crash, jumping, fire, smoke, jaywalking, and driving outside the lane.\nEvaluation Metrics: Both detection and categorization performance are evaluated.\nFor detection, following previous works [1], [4], the frame-level area under the ROC curve (AUC) is adopted as the evaluation metric for UCF-Crime, ShanghaiTech and UBnormal. A higher AUC indicates superior detection performance. For XD-Violence, we utilize the AUC of the frame-level precision-recall curve (AP), following [58].\nFor categorization, the multi-class AUC for individual classes is computed, and their macro mean, termed mAUC, is derived as the evaluation metric. Multi-class AUC [60] is computed using the one-vs-rest approach, treating each class in turn as positive and others as negative.\nmAUC = (1/K) · Σ AUCk, summed over k = 1, ..., K,\nwhere K is the number of classes, AUCk is AUC for class k, considering class k as positive and others as negative. 
Additionally, Top-1 accuracy and Top-5 accuracy are utilized for the evaluation of categorization performance, in accordance with the standard evaluation protocol for video action classification [41].\nImplementation Details: In this study, CLIP [32] is leveraged as the foundational model, with the understanding that the techniques explored are applicable to other I-VLMs as well. The frozen image encoder and text encoder stem from the pre-trained CLIP (ViT-B/16) model. The detector comprises two Conv1d layers and GeLU. Following [21], videos are segmented into 32 non-overlapping snippets, each with a length of 16 frames, and the middle frames are sampled. To accommodate the diverse durations of the videos, they are uniformly partitioned into T segments at equal intervals during training, as described in [16]. T is set to 200 for UCF-Crime, 120 for ShanghaiTech and 200 for XD-Violence. For the UBnormal dataset, each video is quite short and of approximately uniform length. We extract all frames and set T = 450. As for inference, all non-overlapping segments are employed. By default, we set α = 0.6 and β = 0.2 in Eq. 6. We set mask rate r in Eq. 7 to 0.9 for ShanghaiTech, UBnormal and XD-Violence, and to 0.55 for UCF-Crime. The GATConv layer employs 4-head attention. The prompt length of the learnable DP is 32. The hyperparameter λ in Eq. 14 defaults to 1. Training is conducted end-to-end using the Adam optimizer [61] with a batch size of 128. For the ShanghaiTech dataset, the learning rate is set to 5 × 10^-4, with a total of 60 epochs. For UCF-Crime, the learning rate is set to 1 × 10^-3 with 50 epochs. The learning rate for XD-Violence is set to 7 × 10^-4, with 50 total epochs, and to 1 × 10^-3 for UBnormal, with a total of 200 epochs. The experiments are conducted on a single RTX 3090 GPU.\nB. 
Comparison With the State-of-the-Art # Table I to Table VI present comparative results on detection and categorization performance against existing methods on four public benchmarks. Given that prior methods are tailored for closed-set VAD, our primary emphasis lies on open vocabulary setting comparisons.\nFor comparisons in the open vocabulary setting, several baselines are employed. CLIP: We employ a softmax function on the cosine similarities of the input frame feature with vectors aligned to the embedding of the textual prompt \u0026ldquo;a video from a CCTV camera of a {class}\u0026rdquo; using pretrained CLIP. For the binary detection task, any prediction associated with an anomaly class is considered abnormal. Random Baseline: This replicates our model\u0026rsquo;s implementation but uses randomly initialized parameters. Specifically, the input tensor is populated using a Xavier uniform distribution [62] . We conducted five trials using different random seeds, with\nTABLE I DETECTION PERFORMANCE COMPARISONS(AUC(%)) UNDER DIFFERENT SETTINGS ON UCF-CRIME AND SHANGHAITECH . † MEANS THE METHOD IS RE-IMPLEMENTED WITH CLIP FEATURE\nthe average performance across these trials reported as the final result. Methods with † are re-implemented with the CLIP features for fair comparison.\nComparison on Detection Performance: As shown in Table I, it is evident that existing methods exhibit superior performance in the weakly-supervised setting compared to the open vocabulary setting trained without the inclusion of novel anomaly samples, which underscore the significant challenges posed by open vocabulary settings in detection. Nevertheless, our model, trained exclusively on base samples in the open vocabulary setting, shows only a modest performance loss and outperforms most methods in the weakly-supervised setting, suggesting some generalizability of our approach. 
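The zero-shot CLIP baseline described above (softmax over cosine similarities between a frame feature and per-class text embeddings, with any anomaly-class prediction counted as abnormal) can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional vectors stand in for real CLIP embeddings, the class name 'normal' and the temperature of 100 are assumptions, and none of the function names come from the authors' code.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    '''Cosine similarity between two non-zero vectors.'''
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits: List[float]) -> List[float]:
    '''Numerically stable softmax.'''
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_frame(frame_feat: List[float],
                    text_feats: List[List[float]],
                    class_names: List[str],
                    temperature: float = 100.0) -> Dict[str, object]:
    '''Classify one frame; the temperature value is an assumed scale.'''
    sims = [cosine(frame_feat, t) for t in text_feats]
    probs = softmax([temperature * s for s in sims])
    best = max(range(len(probs)), key=probs.__getitem__)
    label = class_names[best]
    # Binary detection: any non-normal prediction is treated as abnormal.
    return {'probs': probs, 'label': label, 'abnormal': label != 'normal'}


# Placeholder embeddings (illustrative only, not real CLIP outputs).
class_names = ['normal', 'fighting', 'explosion']
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_feat = [0.2, 0.9, 0.1]
pred = zero_shot_frame(frame_feat, text_feats, class_names)
```

In practice the text embeddings would be obtained by encoding one prompt per class with the frozen text encoder; here they are hard-coded so the sketch stays self-contained.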
As shown in Table I and Table II, our proposed approach exhibits notable advantages over state-of-the-art methods on both UCF-Crime and ShanghaiTech. It demonstrates robust capabilities in detecting both base and novel samples while being trained solely on base samples. Specifically, our method excels in the open vocabulary setting, rivaling top-performing models trained on the entire dataset. For example, our approach surpasses the leading weakly-supervised model, UR-DMU [19], by a margin of 1.97% in terms of AUC on the UCF-Crime dataset. In addition, our method improves AUC by 0.38% compared with Wu et al. [30] on the UCF-Crime dataset, with the harmonic mean of AUC on base and novel samples being 0.28% higher. Furthermore, as shown in Table II, our proposed method demonstrates superior performance compared to RTFM [1] and UR-DMU [19]. Specifically, it achieves improvements of 2.68% and 3.18% in AUC on the UCF-Crime dataset, and of 0.35% and 0.21% on the ShanghaiTech dataset for novel samples, as evaluated using feature representations derived from CLIP. As shown in Table III, compared to the leading weakly-supervised model UR-DMU [19] on the XD-Violence dataset, our proposed method achieves a 9.53% improvement in APn (AP for novel samples), with only a minor performance reduction on base samples. Overall, our method demonstrates improvements of 4.48% and 4.23% in HM (the harmonic mean of APb and APn), respectively, compared to Wu et al. [30] and UR-DMU [19] on the XD-Violence dataset. As shown in Table IV, our method achieves improvements of 1.50% and 1.41% in AUC compared to UR-DMU [19] and Wu et al. [30] on the UBnormal dataset. 
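The HM figures quoted above combine base and novel scores through the harmonic mean, which penalizes imbalance between the two more strongly than an arithmetic mean would. The short sketch below illustrates this; the input values are made up for illustration and are not results from the tables.

```python
def harmonic_mean(base_score: float, novel_score: float) -> float:
    '''Harmonic mean of two non-negative scores; 0.0 if both are zero.'''
    total = base_score + novel_score
    if total == 0:
        return 0.0
    return 2.0 * base_score * novel_score / total


# Two hypothetical models with the same arithmetic mean (80.0).
hm_balanced = harmonic_mean(80.0, 80.0)  # equal base and novel scores
hm_skewed = harmonic_mean(95.0, 65.0)    # strong on base, weak on novel
```

Both pairs average to 80 arithmetically, yet the skewed pair receives a lower harmonic mean, which is why HM is a reasonable summary when novel-category performance must not be masked by base-category performance.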
These results demonstrate the robust detection performance of our approach across all four public benchmarks.\nComparison on Categorization Performance: Given the limited prior research on the categorization problem within OVVAD, we contrast our model\u0026rsquo;s categorization performance with two baseline methods across four benchmarks, as detailed in Table V and Table VI. As the UBnormal dataset includes only novel categories in its test set, we report the AUC for the entire test set. Our approach demonstrates marked improvement over the baselines across all three metrics, encompassing both base and novel categories. For instance, our method trained on base categories surpasses the zero-shot CLIP baseline by 6.21% on UCF-Crime, by 25.77% on ShanghaiTech and by 4.61% on XD-Violence as measured by mAUC on the test set, which includes both base and novel samples (denoted as \u0026ldquo;All\u0026rdquo;). Additionally, on UBnormal, our method surpasses the zero-shot CLIP baseline by 11.72% in mAUC on the test set. Furthermore, our method demonstrates significant improvements over the zero-shot CLIP baseline in both Top-1 and Top-5 Accuracy across base and novel categories in these datasets.\nTABLE II DETECTION PERFORMANCE COMPARISONS IN THE OPEN VOCABULARY SETTING UNDER WEAKLY-SUPERVISED ON UCF-CRIME AND SHANGHAITECH. ALL THESE METHODS USE CLIP FEATURE FOR FAIRNESS. AUCb(%) IS EVALUATED ONLY ON THE BASE SAMPLES AND AUCn(%) IS EVALUATED ONLY ON THE NOVEL SAMPLES. HM IS THE HARMONIC MEAN OF AUCb AND AUCn\nTABLE III DETECTION PERFORMANCE COMPARISONS IN THE OPEN VOCABULARY SETTING ON XD-VIOLENCE. APb(%) IS EVALUATED ONLY ON THE BASE SAMPLES AND APn(%) IS EVALUATED ONLY ON THE NOVEL SAMPLES. HM IS THE HARMONIC MEAN OF APb AND APn\nTABLE IV DETECTION PERFORMANCE COMPARISONS IN THE OPEN VOCABULARY SETTING ON UBNORMAL\nC. 
Ablation Studies # In this section, extensive ablation experiments are conducted to reveal the contributions of each component.\nContribution of the Temporal Module: As shown in Table VII, the integration of the Temporal Module results in a significant enhancement of detection performance across UCF-Crime and ShanghaiTech. Notably, it demonstrates marked improvement for novel anomaly categories without compromising generalization; for instance, an increase of 1.68% in AUC n on UCF-Crime and 6.36% on ShanghaiTech. We further investigate the impact of the number of multi-heads in the GAT module within the Temporal Module. Fig. 4 indicates that performance peaks when utilizing four heads, after which it diminishes on UCF-Crime. Given the complexity of UCF-Crime, we suggest that a four-head attention mechanism effectively captures abundant temporal information. Accordingly, a four-head configuration is adopted for all datasets.\nTo evaluate the effectiveness of the GAT Module within the Temporal Module, we replaced it with a transformer, following the approach in [35]. As shown in Table VIII, our GAT Module\nFig. 4. Evaluation results of multiheads in the Temporal Module on UCF-Crime.\ndemonstrates superior generalization in the open vocabulary setting.\nContribution of the Prompting Module: In our experiment, the non-use of the Prompting Module (in Table VII) indicates that it does not participate in training (λ in Eq. 14 is set to 0). During inference, DP is randomly initialized and AP is utilized. As presented in Table VII, the integration of the Prompting Module markedly improves both the detection and categorization capabilities of our model. Specifically, the Prompting Module yields substantial gains in categorization performance, manifesting as increases of 12.95% in mAUC and 42.76% in top-5 accuracy for UCF-Crime, and 39.93% in mAUC and 83.75% in top-5 accuracy for ShanghaiTech, compared to our model that includes only the Temporal Module. 
The Prompting Module proves to be a foundational element in the categorization of both base and novel classes, as evaluated by three distinct metrics. Thus, we further investigate the influence of individual components within the Prompting Module on categorization performance. The module comprises two types of prompts: DP and AP. As depicted in Table IX , the combined use of DP and AP yields superior performance compared to their individual implementations. Experiments were also conducted to evaluate the generation methods for DP and AP. Table X presents the categorization performance of CLIP using various manually crafted prompt templates, highlighting considerable variability in their efficacy. This underscores the importance of refining this process to achieve consistent and superior performance. In Table XI ,\nTABLE V CATEGORIZATION PERFORMANCE COMPARISONS ON UCF-CRIME, SHANGHAITECH AND XD-VIOLENCE IN THE OPEN VOCABULARY SETTING\nTABLE VI CATEGORIZATION PERFORMANCE COMPARISONS ON UBNORMAL IN THE OPEN VOCABULARY SETTING\nFig. 5. Evaluation results of prompt length of DP in the Prompting Module on UCF-Crime.\nwe replaced our learnable DP with the manually crafted template \u0026ldquo;p3\u0026rdquo; (identified as the top-performing template in Table X) and incorporated LLM-constructed anomaly-specific prompts with GPT-3.5. 1 Our method, which incorporates DP to focus on learning domain-specific knowledge, excels in both detection and categorization performance.\nAs illustrated in Fig. 5, the influence of prompt length of DP is examined. AUC improves with increasing length, peaking at 32 before gradually declining for UCF-Crime. Our model utilizes a prefix and postfix length of 16 when constructing DP. The optimal performance at a length of 32 suggests that this length provides a balanced and informative context for detection and categorization. 
However, extending beyond this length may introduce noise and complexity that outweigh the benefits of additional information, subsequently reducing detection performance.\nIn Table XII, we compare the performance of APs constructed using manually crafted prompts (p3 in Table X) and those generated by two distinct LLMs. The results demonstrate that our approach, constructed using LLMs, is both consistent and superior to manually crafted prompts. Moreover, the quality of LLM-generated prompts significantly influences categorization performance. Our model uses GLM-4 (footnote 2), which offers strong long-text capabilities across summarization, information extraction, complex reasoning, coding, and other application scenarios. As LLMs continue to evolve, an increasing number of improved options become available, which can further improve the performance of this framework.\n1 GPT-3.5 turbo: https://platform.openai.com/docs/models/gpt-3-5-turbo\nAnalysis of the Execution Requirements: To evaluate the execution requirements, we conducted an experiment comparing the number of trainable parameters of our model with that of other methods. The number of trainable parameters of a model remains constant regardless of the execution environment. Fewer trainable parameters lead to lower resource consumption and more efficient computation. In Table XIII, we present the number of trainable parameters and the detection performance under both the weakly-supervised and open vocabulary settings. The results using the CLIP feature are re-implemented for a fair comparison. It is observed that our open vocabulary model (PLOVAD), trained solely on the base set and benefiting from the prompt tuning paradigm, uses significantly fewer trainable parameters (4.044M) while demonstrating superior performance.\nAnalysis of Cross-Dataset Ability: To assess the generalization ability of the proposed method, we conduct experiments in a cross-dataset setup using UCF-Crime, XD-Violence, and UBnormal as examples. 
These datasets originate from distinct sources: UCF-Crime from surveillance videos, XD-Violence from movies and online videos, and UBnormal from synthesized virtual scenes with a wider range of abnormal events. As shown in Table XIV, our model exhibits only moderate performance loss in both detection and categorization when evaluated on models trained on other datasets (refer to each column), thereby validating the generalization capacity of our approach. Among the crossdataset results, UCF-Crime and XD-Violence demonstrate better mutual inference performance compared to UBnormal, possibly due to the overlap in their categories.\nD. Qualitative Results # To qualitatively validate the detection performance of the proposed method, as shown in Fig. 6, we visualize\n2 GLM-4: https://open.bigmodel.cn/dev/howuse/glm4\nTABLE VII ABLATIONS STUDIES WITH DIFFERENT DESIGNED MODULE ON UCF-CRIME AND SHANGHAITECH FOR DETECTION AND CATEGORIZATION\nTABLE VIII ABLATION STUDY OF THE GAT MODULE: COMPARISON WITH TRANSFORMER REPLACEMENT\nTABLE IX ABLATIONS STUDIES WITH DIFFERENT COMPONENTS IN PROMPTING MODULE ON UCF-CRIME FOR CATEGORIZATION\nTABLE X CATEGORIZATION PERFORMANCE ON CLIP [32] WITH DIFFERENT MANUALLY CRAFTED PROMPT TEMPLATES\nTABLE XI\nABLATION STUDY ON DP IN PROMPTING MODULE: COMPARING DETECTION AND CATEGORIZATION PERFORMANCE ON UCF-CRIME WHEN REPLACING DP WITH A FIXED PROMPT\nTABLE XII ABLATION STUDY ON AP IN PROMPTING MODULE FOR CATEGORIZATION PERFORMANCE ON UCF-CRIME\nthe frame-level anomaly scores predicted by our model (PLOVAD) in the open vocabulary setting on the UCF-Crime and ShanghaiTech dataset. The top row and the first panel of the second row correspond to UCF-Crime, with the remaining panels representing ShanghaiTech. 
Our model trained with normal and base anomalies, consistently produces high\nTABLE XIII ABLATION STUDY ON THE EXECUTION REQUIREMENTS ON THE UCF-CRIME DATASET\nTABLE XIV CROSS-DATASET RESULTS (%) ON UCF-CRIME , XD-VIOLENCE , AND UBNORMAL\nanomaly scores for anomalous frames on both base and novel anomalies, even when videos contain multiple disjointed abnormal segments (as shown in the sample of \u0026ldquo;Explosion\u0026rdquo;). Moreover, our model assigns low anomaly scores to normal videos, underscoring its ability to maintain a low false alarm rate.\nFurther, we compare our method with UR-DMU [19] , as shown in Fig. 7. The second row presents the results of UR-DMU when trained exclusively on base anomalies, while the third row shows results from training on the full set (including both base and novel anomalies). A comparison of these rows reveals that the well-performing weakly-supervised method struggles to effectively identify novel anomalies to a certain extent, as evidenced by the ineffective detection of videos belonging to the novel category \u0026ldquo;explosion\u0026rdquo; in Fig. 7 when trained without the corresponding visual data. Meanwhile, our model (as seen in the first row) demonstrates superior detection of novel anomalies without requiring training on the corresponding visual data, exhibiting enhanced performance in the open vocabulary setting compared to UR-DMU [19] .\nWe also present the qualitative results of the categorization performance. Fig. 8 and Fig. 9 present the heatmap of multi-class AUC values [60] for each class in the testing set of the UCF-Crime and ShanghaiTech datasets, respectively. The random baseline is our model initialized using the Xavier initialization method [62]. The second and third\nFig. 6. Qualitative results of the detection performance on UCF-Crime and ShanghaiTech. Pink region denotes ground-truth. 
\u0026ldquo;Base\u0026rdquo; denotes the kind of anomalies are seen when training, while the \u0026ldquo;Novel\u0026rdquo; are unseen.\nFig. 7. Qualitative results of the detection performance on UCF-Crime when compared with UR-DMU [19] method. Pink region denotes ground-truth. The left is the visualization result of a base sample, and the right is a novel sample. The first row represents our model (PLOVAD), which was trained exclusively on base anomalies. The second row represents UR-DMU trained under the same conditions as our model (PLOVAD). The third row depicts UR-DMU trained on the complete dataset, including both base and novel samples, where novel anomalies are included during training.\nrows present the results of the CLIP baseline using the manually crafted prompts p3 and p1 (see Table X), respectively. \u0026ldquo;PLOVAD (full)\u0026rdquo; represents an additional experiment conducted to obtain a comprehensive analysis. This model is trained with both base and novel anomalies (denoted as \u0026ldquo;full\u0026rdquo;). In contrast, \u0026ldquo;PLOVAD (base)\u0026rdquo; refers to our proposed model, which is trained solely with base anomalies available in the open vocabulary setting. Our proposed method exhibits a pronounced advantage over the zero-shot CLIP baseline when applied to both UCF-Crime and ShanghaiTech datasets. Our method demonstrates proficiency in recognizing normal\nFig. 8. Qualitative results of categorization: visualization on the multi-class AUC values of each category on UCF-Crime. Cells shaded in darker blue indicate superior performance. The first column represents the normal samples, the 2nd to 7th columns correspond to the base anomalies, and the remaining columns pertain to the novel anomalies.\nFig. 9. Qualitative results of categorization: visualization on the multi-class AUC values of each category on ShanghaiTech. Cells shaded in darker blue indicate superior performance. 
The first column represents the normal samples, the 2nd to 6th columns correspond to the base anomalies, and the remaining columns pertain to the novel anomalies.\nsamples, along with the majority of base and novel categories. However, it encounters challenges with specific anomalies, such as \u0026ldquo;shooting\u0026rdquo; in the UCF-Crime dataset, which persists despite performance improvements for \u0026ldquo;fighting\u0026rdquo; when trained on the complete dataset. Additionally, identifying specific base anomalies like \u0026ldquo;abuse\u0026rdquo; and \u0026ldquo;assault\u0026rdquo; shows decreased performance with the inclusion of additional novel visual data, possibly due to increased visual noise. These observations direct our attention towards refining visual representations in future investigations of the open vocabulary VAD task.\nVI. CONCLUSION # Video anomaly detection faces significant challenges in open vocabulary scenarios due to data scarcity and a broad spectrum of potential anomaly categories not represented in existing training sets. This study addresses these challenges by leveraging pretrained image-based vision-language models (I-VLMs) and scalable language data to detect and recognize previously unseen anomaly classes. We propose a framework named PLOVAD, which adapts pretrained I-VLMs for open vocabulary video anomaly detection (OVVAD) using a prompt tuning paradigm, transferring knowledge from known (base) classes to evaluate unseen (novel) classes. Additionally, we introduce a Temporal Module to bridge the image-video gap and a Prompting Module to generate domain-specific and anomaly-specific prompts. Extensive experiments on four public datasets demonstrate that the proposed model excels in the OVVAD task, achieving superior detection and categorization performance on both seen and unseen anomalies while training a limited number of free parameters. Our framework exhibits flexibility in the use of I-VLMs and LLMs. 
With the rapid advancement of vision-language pre-training and LLMs, increasingly powerful VLMs and LLMs are becoming available, thereby enhancing the potential of our framework.\nIn future work, we will explore the following directions:\n1) During the experiments, it was observed that categorization performance on specific anomalies was ineffective and sometimes deteriorated when more visual data was introduced. This observation directs our attention toward refining visual representations in future investigations of the open vocabulary VAD task.\n2) Investigating more complex scenarios in the open world, such as the occurrence of multiple types of anomalies within a single video, which presents a greater challenge.\nREFERENCES # [1] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, \u0026ldquo;Weakly-supervised video anomaly detection with robust temporal feature magnitude learning,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 4975–4986.\n[2] P. Wu, C. Pan, Y. Yan, G. Pang, P. Wang, and Y. Zhang, \u0026ldquo;Deep learning for video anomaly detection: A review,\u0026rdquo; 2024, arXiv:2409.05383.\n[3] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection—A new baseline,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6536–6545.\n[4] W. Sultani, C. Chen, and M. Shah, \u0026ldquo;Real-world anomaly detection in surveillance videos,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6479–6488.\n[5] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, \u0026ldquo;Learning temporal regularity in video sequences,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 733–742.\n[6] J. Wang and A. Cherian, \u0026ldquo;GODS: Generalized one-class discriminative subspaces for anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 
8200–8210.\n[7] M. Z. Zaheer, A. Mahmood, M. H. Khan, M. Segu, F. Yu, and S.-I. Lee, \u0026ldquo;Generative cooperative learning for unsupervised video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , Jun. 2022, pp. 14744–14754.\n[8] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. Li, \u0026ldquo;A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 13588–13597.\n[9] J. T. Zhou, L. Zhang, Z. Fang, J. Du, X. Peng, and Y. Xiao, \u0026ldquo;Attentiondriven loss for anomaly detection in video surveillance,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12, pp. 4639–4647, Dec. 2020.\n[10] Y. Zhang, X. Nie, R. He, M. Chen, and Y. Yin, \u0026ldquo;Normality learning in multispace for video anomaly detection,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 9, pp. 3694–3706, Sep. 2021.\n[11] Y. Lu, C. Cao, Y. Zhang, and Y. Zhang, \u0026ldquo;Learnable locality-sensitive hashing for video anomaly detection,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 2, pp. 963–976, Feb. 2023.\n[12] S. Zhang et al., \u0026ldquo;Influence-aware attention networks for anomaly detection in surveillance videos,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol. , vol. 32, no. 8, pp. 5427–5437, Aug. 2022.\n[13] Y. Zhong, X. Chen, Y. Hu, P. Tang, and F. Ren, \u0026ldquo;Bidirectional spatiotemporal feature learning with multiscale evaluation for video anomaly detection,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 12, pp. 8285–8296, Dec. 2022.\n[14] D. Li, X. Nie, R. Gong, X. Lin, and H. Yu, \u0026ldquo;Multi-branch GAN-based abnormal events detection via context learning in surveillance videos,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 5, pp. 3439–3450, May 2024.\n[15] H. Liu, L. He, M. Zhang, and F. 
Li, \u0026ldquo;VADiffusion: Compressed domain information guided conditional diffusion for video anomaly detection,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 9, pp. 8398–8411, Sep. 2024.\n[16] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, \u0026ldquo;Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1237–1246.\n[17] S. Li, F. Liu, and L. Jiao, \u0026ldquo;Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell., vol. 36, no. 2, 2022, pp. 1395–1403.\n[18] J.-C. Wu, H.-Y. Hsieh, D.-J. Chen, C.-S. Fuh, and T.-L. Liu, \u0026ldquo;Self-supervised sparse representation for video anomaly detection,\u0026rdquo; in Proc. 17th Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2022, pp. 729–745.\n[19] H. Zhou, J. Yu, and W. Yang, \u0026ldquo;Dual memory units with uncertainty regulation for weakly supervised video anomaly detection,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell., Jun. 2023, vol. 37, no. 3, pp. 3769–3777.\n[20] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, \u0026ldquo;Unbiased multiple instance learning for weakly supervised video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 8022–8031.\n[21] H. K. Joo, K. Vo, K. Yamazaki, and N. Le, \u0026ldquo;CLIP-TSA: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection,\u0026rdquo; in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2023, pp. 3230–3234.\n[22] T. Liu, C. Zhang, K.-M. Lam, and J. Kong, \u0026ldquo;Decouple and resolve: Transformer-based models for online anomaly detection from weakly labeled videos,\u0026rdquo; IEEE Trans. Inf. Forensics Security, vol. 18, pp. 15–28, 2023.\n[23] Z. Yang, Y. Guo, J. Wang, D. Huang, X. Bao, and Y. 
Wang, \u0026ldquo;Towards video anomaly detection in the real world: A binarization embedded weakly-supervised network,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol. , vol. 34, no. 5, pp. 4135–4140, May 2024.\n[24] Y. Fan, Y. Yu, W. Lu, and Y. Han, \u0026ldquo;Weakly-supervised video anomaly detection with snippet anomalous attention,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 7, pp. 5480–5492, Jul. 2024.\n[25] H. Shi, L. Wang, S. Zhou, G. Hua, and W. Tang, \u0026ldquo;Abnormal ratios guided multi-phase self-training for weakly-supervised video anomaly detection,\u0026rdquo; IEEE Trans. Multimedia, vol. 26, pp. 5575–5587, 2023.\n[26] P. Wu et al., \u0026ldquo;VadCLIP: Adapting vision-language models for weakly supervised video anomaly detection,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell. , Mar. 2024, vol. 38, no. 6, pp. 6074–6082.\n[27] Z. Yang, J. Liu, and P. Wu, \u0026ldquo;Text prompt with normality guidance for weakly supervised video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 18899–18908.\n[28] P. Wu et al., \u0026ldquo;Weakly supervised video anomaly detection and localization with spatio-temporal prompts,\u0026rdquo; in Proc. 32nd ACM Int. Conf. Multimedia, Oct. 2024, pp. 9301–9310.\n[29] Y. Zhu, W. Bao, and Q. Yu, \u0026ldquo;Towards open set video anomaly detection,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2022, pp. 395–412.\n[30] P. Wu et al., \u0026ldquo;Open-vocabulary video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 18297–18307.\n[31] J. Wu et al., \u0026ldquo;Towards open vocabulary learning: A survey,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 7, pp. 5092–5113, Jul. 2024.\n[32] A. Radford et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in Proc. Int. Conf. Mach. Learn., 2021, pp. 
8748–8763.\n[33] C. Jia et al., \u0026ldquo;Scaling up visual and vision-language representation learning with noisy text supervision,\u0026rdquo; in Proc. 38th Int. Conf. Mach. Learn., vol. 139, 2021, pp. 4904–4916.\n[34] Q. Zhou, G. Pang, Y. Tian, S. He, and J. Chen, \u0026ldquo;AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection,\u0026rdquo; in Proc. 12th Int. Conf. Learn. Represent., Jan. 2023.\n[35] C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, \u0026ldquo;Prompting visual-language models for efficient video understanding,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. Springer, 2022, pp. 105–124.\n[36] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, \u0026ldquo;Learning to prompt for vision-language models,\u0026rdquo; Int. J. Comput. Vis., vol. 130, no. 9, pp. 2337–2348, Sep. 2022.\n[37] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, \u0026ldquo;Conditional prompt learning for vision-language models,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 16816–16825.\n[38] P. Liu, \u0026ldquo;Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,\u0026rdquo; ACM Comput. Surv., vol. 55, no. 9, pp. 1–35, 2023.\n[39] L. Zanella, B. Liberatori, W. Menapace, F. Poiesi, Y. Wang, and E. Ricci, \u0026ldquo;Delving into CLIP latent space for video anomaly recognition,\u0026rdquo; 2023, arXiv:2310.02835.\n[40] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, \u0026ldquo;Learning spatiotemporal features with 3D convolutional networks,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis., Aug. 2015, pp. 4489–4497.\n[41] J. Carreira and A. Zisserman, \u0026ldquo;Quo vadis, action recognition? A new model and the kinetics dataset,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6299–6308.\n[42] H. Rasheed, M. U. Khattak, M. Maaz, S. Khan, and F. S. 
Khan, \u0026ldquo;Fine-tuned CLIP models are efficient video learners,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 6545–6554.\n[43] Z. Weng, X. Yang, A. Li, Z. Wu, and Y. Jiang, \u0026ldquo;Open-VCLIP: Transforming CLIP to an open-vocabulary video model via interpolated weight optimization,\u0026rdquo; in Proc. Int. Conf. Mach. Learn., Jan. 2023, pp. 36978–36989.\n[44] W. Wu, Z. Sun, and W. Ouyang, \u0026ldquo;Revisiting classifier: Transferring vision-language models for video recognition,\u0026rdquo; in Proc. AAAI Conf. Artif. Intell., Jun. 2023, vol. 37, no. 3, pp. 2847–2855.\n[45] Y. Du, F. Wei, Z. Zhang, M. Shi, Y. Gao, and G. Li, \u0026ldquo;Learning to prompt for open-vocabulary object detection with vision-language model,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 14084–14093.\n[46] M. Xu et al., \u0026ldquo;A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model,\u0026rdquo; in Proc. 17th Eur. Conf. Comput. Vis., Tel Aviv, Israel, Oct. 2022, pp. 736–753.\n[47] M. Wang, J. Xing, and Y. Liu, \u0026ldquo;ActionCLIP: A new paradigm for video action recognition,\u0026rdquo; 2021, arXiv:2109.08472 .\n[48] H. Cheng et al., \u0026ldquo;DENOISER: Rethinking the robustness for openvocabulary action recognition,\u0026rdquo; 2024, arXiv:2404.14890 .\n[49] T. Wu, S. Ge, J. Qin, G. Wu, and L. Wang, \u0026ldquo;Open-vocabulary spatiotemporal action detection,\u0026rdquo; 2024, arXiv:2405.10832 .\n[50] K.-Y. Lin et al., \u0026ldquo;Rethinking CLIP-based video learners in cross-domain open-vocabulary action recognition,\u0026rdquo; 2024, arXiv:2403.01560 .\n[51] T. Brown et al., \u0026ldquo;Language models are few-shot learners,\u0026rdquo; in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 1877–1901.\n[52] M. Maniparambil, C. Vorster, D. Molloy, N. Murphy, K. McGuinness, and N. E. 
O\u0026rsquo;Connor, \u0026ldquo;Enhancing CLIP with GPT-4: Harnessing visual descriptions as prompts,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2023, pp. 262–271.\n[53] C. Jia et al., \u0026ldquo;Generating action-conditioned prompts for open-vocabulary video action recognition,\u0026rdquo; 2023, arXiv:2312.02226.\n[54] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, \u0026ldquo;Graph attention networks,\u0026rdquo; in Proc. Int. Conf. Learn. Represent., Jan. 2017.\n[55] X. Ma et al., \u0026ldquo;A comprehensive survey on graph anomaly detection with deep learning,\u0026rdquo; IEEE Trans. Knowl. Data Eng., vol. 35, no. 12, pp. 12012–12038, Dec. 2021.\n[56] Z. Yu, J. Yu, J. Fan, and D. Tao, \u0026ldquo;Multi-modal factorized bilinear pooling with co-attention learning for visual question answering,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1821–1830.\n[57] W. Luo, W. Liu, and S. Gao, \u0026ldquo;A revisit of sparse coding based anomaly detection in stacked RNN framework,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 341–349.\n[58] P. Wu et al., \u0026ldquo;Not only look, but also listen: Learning multimodal violence detection under weak supervision,\u0026rdquo; in Proc. 16th Eur. Conf. Comput. Vis. (ECCV), Glasgow, U.K. Cham, Switzerland: Springer, Aug. 2020, pp. 322–339.\n[59] A. Acsintoae et al., \u0026ldquo;UBnormal: New benchmark for supervised open-set video anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 20143–20153.\n[60] Z. Yang, Q. Xu, S. Bao, X. Cao, and Q. Huang, \u0026ldquo;Learning with multiclass AUC: Theory and algorithms,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 7747–7763, Nov. 2022.\n[61] D. P. Kingma and J. Ba, \u0026ldquo;Adam: A method for stochastic optimization,\u0026rdquo; in Proc. Int. Conf. Learn. 
Represent., Dec. 2014.\n[62] X. Glorot and Y. Bengio, \u0026ldquo;Understanding the difficulty of training deep feedforward neural networks,\u0026rdquo; in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256.\nChenting Xu received the B.S. degree in cyber science and engineering from Sichuan University, Chengdu, China, in 2023. She is currently pursuing the master\u0026rsquo;s degree with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China. Her research interests include video understanding and abnormal events detection.\nKe Xu (Member, IEEE) received the Ph.D. degree in cyber space security from Shanghai Jiao Tong University, Shanghai, China, in 2019. He is currently an Associate Professor with the Institute of Cyber Space Security, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University. His research interests include action recognition, gait recognition, and abnormal events detection.\nXinghao Jiang (Senior Member, IEEE) received the Ph.D. degree in electronic science and technology from Zhejiang University, Hangzhou, China, in 2003. He was a Visiting Scholar with New Jersey Institute of Technology, Newark, NJ, USA, from 2011 to 2012. He is currently a Professor with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China. His research interests include multimedia security, intelligent information processing, cyber information security, information hiding, and watermarking.\nTanfeng Sun (Senior Member, IEEE) received the Ph.D. degree in information and communication system from Jilin University, Changchun, China, in 2003. He had cooperated with Prof. Y. Q. Shi at New Jersey Institute of Technology, USA, as a Visiting Scholar, from July 2012 to December 2013. He is currently a Professor with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China. 
His research interests include digital forensics on video forgery, digital video steganography and steganalysis, and watermarking.\n","date":"10 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/plovad_prompting_vision-language_models_for_open_vocabulary_video_anomaly_detection/","section":"Papers","summary":"A novel framework (PLOVAD) leveraging prompt tuning on large-scale pretrained image-based vision-language models for open vocabulary video anomaly detection, incorporating domain-specific and anomaly-specific prompts, and a temporal module to detect and categorize both seen and unseen anomalies with limited parameters.","title":"PLOVAD: Prompting Vision-Language Models for Open Vocabulary Video Anomaly Detection","type":"method"},{"content":"","date":"10 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/tanfeng-sun/","section":"Authors","summary":"","title":"Tanfeng Sun","type":"authors"},{"content":"","date":"10 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xinghao-jiang/","section":"Authors","summary":"","title":"Xinghao Jiang","type":"authors"},{"content":"","date":"1 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chao-huang/","section":"Authors","summary":"","title":"Chao Huang","type":"authors"},{"content":" Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models # Chao Huang 1 Yushu Shi 1 Jie Wen 2 Wei Wang 1 Yong Xu 2 Xiaochun Cao 1 *\nAbstract # With advancements in visual language models (VLMs) and large language models (LLMs), video anomaly detection (VAD) has progressed beyond binary classification to fine-grained categorization and multidimensional analysis. However, existing methods focus mainly on coarsegrained detection, lacking anomaly explanations. 
To address these challenges, we propose Ex-VAD, an Explainable Fine-grained Video Anomaly Detection approach that combines fine-grained classification with detailed explanations of anomalies. First, we use a VLM to extract frame-level captions, and an LLM converts them to video-level explanations, enhancing the model\u0026rsquo;s explainability. Second, integrating textual explanations of anomalies with visual information greatly enhances the model\u0026rsquo;s anomaly detection capability. Finally, we apply label-enhanced alignment to optimize feature fusion, enabling precise fine-grained detection. Extensive experimental results on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing state-of-the-art methods.\n1 Shenzhen Campus of Sun Yat-Sen University, School of Cyber Science and Technology, Shenzhen, China 2 Harbin Institute of Technology, School of Computer Science and Technology, Shenzhen, China. Correspondence to: Xiaochun Cao \u0026lt;caoxiaochun@mail.sysu.edu.cn\u0026gt;.\nProceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).\nFigure 1. Recent research in VAD can be categorized into three types: a) Traditional binary classification VAD, b) Multiclassification VAD, and c) Training-free VAD. Building on the optimization of these approaches, our Ex-VAD is presented as: d) Explainable VAD based on VLMs and LLMs.\n1. Introduction # Video Anomaly Detection (VAD) is an important technology with a wide range of applications that cover areas such as security surveillance, healthcare, autonomous driving, and content auditing (Zhao et al., 2017; Wang et al., 2019; Samaila et al., 2024). It aims to improve the safety and efficiency of systems by automatically identifying anomalous events or behaviors through the analysis of video data (Ren et al., 2021; Nawaratne et al., 2020). 
Examples include rapid detection of dangerous behaviors for timely intervention in surveillance, detection of abnormal road conditions to avoid accidents in autonomous driving, and identification of abnormal vital signs to provide timely assistance in healthcare monitoring.\nTraditional VAD methods (Huang et al., 2024; Wang et al., 2025; Huang et al., 2023; 2022; Ramachandra et al., 2022; Liu et al., 2024; Zaigham Zaheer et al., 2020; Yan et al., 2023) typically analyze videos at a coarse granularity, determining only whether a video contains abnormal behavior and categorizing it as normal or anomalous. However, such approaches (Nguyen \u0026amp; Meunier, 2019) face significant limitations in practical applications. First, coarse-grained detection fails to provide detailed descriptions of specific types of abnormal behavior, which is inadequate in scenarios that require tailored responses to distinct anomalies, for example, addressing varying security threats in surveillance systems or diagnosing multiple abnormal conditions in medical monitoring. Second, coarse-grained methods are easily influenced by the complexity of video backgrounds and the diversity of scenes, making it challenging to pinpoint the time and type of anomalies accurately. This deficiency compromises both the detection accuracy of the system and its response efficiency.\nFine-grained video anomaly detection is therefore particularly important for distinguishing different types of anomalous behavior and providing more targeted, interpretable detection results. In recent years, visual-language pre-training (VLP) models such as CLIP (Radford et al., 2021) have significantly improved the semantic representation of images and text through contrastive learning, driving advances in visual representation. CLIP-based task-specific models have excelled in various visual tasks, achieving unprecedented performance breakthroughs. In VAD, some researchers (Wu et al. 
, 2024c;a) have used CLIP\u0026rsquo;s image-text alignment to achieve fine-grained anomaly detection.\nDespite advances, existing methods still struggle to explain anomalous behavior effectively. Even when anomalous events are successfully detected, models often fail to provide clear explanations for the causes of the anomalies, posing significant challenges to decision-makers. For example, in security monitoring, the detection of abnormal behavior in a specific area without a clear explanation can complicate subsequent response efforts, leading to inefficiency and delays. Consequently, enhancing the interpretability of VAD has become a crucial focus in the field\u0026rsquo;s development. Recently, the rapid progress in LLMs has introduced new possibilities for VAD. Some researchers (Zanella et al., 2024; Ye et al., 2024) have proposed training-free anomaly detection methods by generating descriptive text explanations of anomalies using VLMs and LLMs. However, these methods primarily rely on the generated text for anomaly detection, often neglecting the full potential of the visual modality. Other researchers (Lv \u0026amp; Sun, 2024; Kim et al., 2023a; Tang et al., 2024b) have achieved interpretable anomaly detection by fine-tuning large models. While effective, these approaches often result in complex models that may be challenging to deploy and maintain.\nTo address these challenges, we propose a novel method called Ex-VAD, which is designed to overcome the limitations of traditional VAD methods, particularly in fine-grained classification and anomaly explanation. Specifically, we first propose an Anomaly Explanation Generation Module (AEGM), which extracts frame-level captions from videos using VLMs, followed by a cleaning step to refine the captions. The cleaned captions are then integrated by an LLM to generate video-level anomaly explanations through specific prompts, which enable the model to detect abnormal behavior in the video and analyze its cause. 
Second, we develop a Multimodal Anomaly Detection Module (MADM), which encodes the text from AEGM and extracts both temporal and spatial features between video frames. These features are then fed into a coarse-grained anomaly classifier to determine whether the video contains anomalies. Finally, we employ a Label Augment and Alignment Module (LAAM), which uses an LLM to expand anomaly category labels into phrases, selects the top-k phrases semantically most similar to the original labels, and aligns them with the fused multimodal features to obtain fine-grained anomaly categories. In summary, Ex-VAD effectively integrates multimodal features, fine-grained classification, and anomaly explanations, providing a comprehensive solution to video anomaly detection with enhanced interpretability and accuracy.\nOur main contributions are summarized as follows.\nWe develop an Anomaly Explanation Generation Module (AEGM), which utilizes a VLM and an LLM to generate explanations for video anomalies, allowing the model to detect abnormal behavior and analyze its cause, thereby enhancing its semantic interpretation.\nWe propose a Label Augment and Alignment Module (LAAM) that enhances label semantics, enabling the model to better align videos with anomaly categories, thereby improving fine-grained anomaly classification, particularly for complex categories.\nExtensive experimental results show that our method outperforms existing approaches in both coarse-grained and fine-grained accuracy, improving overall anomaly detection and classification precision.\n2. Related Work # 2.1. Video Anomaly Detection # Existing VAD methods can be divided by output type into binary-classification VAD (Ramachandra et al., 2022; Liu et al., 2024), multi-classification VAD (Sultani et al., 2019; Wu et al., 2024a;c), and interpretable VAD (Lv \u0026amp; Sun, 2024). Traditional VAD methods classify videos as normal or abnormal. They typically adopt a classification paradigm. 
First, pre-trained visual models are used to extract frame-level features. Then, these features are fed into a binary classifier based on Multiple Instance Learning (MIL) for training. Finally, abnormal events are detected based on the predicted anomaly confidences.\nFollowing the introduction of CLIP, several methods have sought to improve on this paradigm. VadCLIP (Wu et al., 2024c) proposed a fine-grained Weakly Supervised Video Anomaly Detection (WSVAD) method that can distinguish different types of abnormal frames. VadCLIP encodes text labels into class embeddings and calculates the matching similarities between class embeddings and frame-level visual features to obtain an alignment map. Each input text label represents a class of abnormal events, thus achieving fine-grained WSVAD.\nInterpretability is of utmost importance in VAD, especially in sensitive or high-stakes applications. Early methods often relied on black-box models, and their prediction results were difficult to trust. Recently, some methods have utilized Large Language Models (LLMs) and Vision-Language Models (VLMs) to generate understandable reasoning through semantic insights and textual explanations. For example, VADor (Lv \u0026amp; Sun, 2024) fine-tunes the projection layer of VideoLLaMA to integrate anomaly detection with semantic reasoning. HAWK (Tang et al., 2024a) enhances interpretability by integrating motion-based reasoning through interactive VLMs. However, there are still challenges in balancing the granularity of explanations and computational efficiency.\n2.2. Visual Language Model in VAD # Vision-language models (VLMs) offer a new perspective for video anomaly detection (VAD), especially in fine-grained classification and explanation of anomalous behaviors. Traditional VAD methods (Tian et al., 2021a; Li et al., 2022b;a) mainly focus on identifying anomalous behaviors in videos but lack detailed classification of these behaviors. (Wu et al. 
, 2024c) leverages the pre-trained CLIP model to align video frames with labels in VAD, enabling fine-grained anomaly classification. Meanwhile, the use of LLMs in VAD is still in its infancy (Kim et al., 2023b). LAVAD (Zanella et al., 2024) implemented training-free VAD using pre-trained LLMs and VLMs. This method efficiently transforms LLMs into video anomaly detectors by generating textual descriptions of each frame in the test video, which are combined with prompting to activate LLMs for time series aggregation and anomaly score estimation. Additionally, by drawing on VLMs, we establish a strong complementary relationship between visual and textual modalities. This approach not only enables the detection of anomalous behaviors but also provides clear explanations for each behavior, enhancing the explainability of anomaly detection.\n2.3. Prompt Learning # Prompt learning, a technique for adapting prompts to fit a specific task, was initially applied mainly in the field of Natural Language Processing (NLP) and has gradually been extended to the visual domain. CLIP (Radford et al., 2021) relies on fixed hand-designed prompts (e.g., a photo of a [class]), which are suitable for open domains but not flexible enough. CLIP-COOP (Zhou et al., 2022a) introduces learnable context vectors, enhancing performance with limited samples but struggling with generalization. These advances refine prompt adaptation, improving vision-language models across diverse tasks. In VAD, VADCLIP leverages trainable textual templates to generate precise anomaly descriptions. However, manually designing prompts remains time-consuming and highly sensitive to template content. To address this challenge and reduce the dependence on hand-crafted language designs, PEL4VAD (Pu et al., 2024) used ConceptNet definitions to create prompt templates and expanded class labels through a conceptual dictionary, significantly improving open-vocabulary object detection. 
Based on this approach, this paper uses GPT-4 (OpenAI \u0026amp; etc, 2024) to generate rich semantics for simple labels, and uses CLIP image-text alignment to allow the VAD model to achieve better performance in fine-grained anomaly classification.\n3. Approach # 3.1. Architecture # As shown in Figure 2, the proposed Ex-VAD consists of three components: an Anomaly Explanation Generation Module (AEGM), a Multimodal Anomaly Detection Module (MADM), and a Label Augment and Alignment Module (LAAM). Ex-VAD processes an input video V by first utilizing the AEGM to generate anomaly explanation text E. This text serves two purposes: providing interpretative explanations for video anomalies and acting as the text modality input for the MADM, where it is fused with visual features for coarse-grained anomaly detection. Finally, the LAAM refines the detection by expanding and aligning labels to achieve fine-grained anomaly classification, ensuring both interpretability and accuracy in video anomaly detection. The implementation details are introduced as follows.\n3.2. Anomaly Explanation Generation Module # LAVAD (Zanella et al., 2024) demonstrated the feasibility of achieving anomaly detection by prompting VLMs and LLMs to generate text descriptions. Inspired by this approach, our AEGM improves the prompting mechanism to guide LLMs in time series aggregation and the generation of anomaly explanations. This not only helps the visual module enhance the performance of VAD but also serves as an explanation for the causes of anomalies, further enhancing the interpretability of detection. As shown in Figure 3, AEGM consists of two sub-modules: the Caption Extraction and Cleaning Module, and the Explainable Modules Based on LLM.\nCaption Extraction and Cleaning Module. With the rapid development of VLMs, the ability to generate captions from videos has become increasingly powerful. First, we uniformly sample n frames from the video V. 
For each frame Ii ∈ V, we use a state-of-the-art captioning model ΦC, i.e., BLIP-2 (Li et al., 2023), with an appropriate prompt PC to generate frame-level text descriptions:\nTi = ΦC(Ii, PC).\nFigure 2. Our Ex-VAD includes three components: an Anomaly Explanation Generation Module using VLM and LLM to generate anomaly explanation text, a Multimodal Anomaly Detection Module combining enriched visual and textual features for coarse anomaly classification, and a Label Augment and Alignment Module that refines the detection by expanding and aligning labels to achieve fine-grained anomaly classification.\nDue to the randomness of VLMs, some irrelevant captions may be generated, which may harm training. Since the scenes in the video are captured by a static camera at a high frame rate, the semantic content between frames overlaps to some extent. From this perspective, we alleviate the above problems by designing an image-text alignment mechanism. Specifically, we use a vision-language encoder to encode the captions of each frame. For each frame Ii ∈ V, we calculate its closest caption:\nT̂i = arg max over Tj ∈ T of ⟨EI(Ii), ET(Tj)⟩,\nwhere ⟨·, ·⟩ is the cosine similarity, EI is the image encoder of the VLM, ET is the text encoder, and T = {T1, \u0026hellip;, TN}. This module allows us to generate fairly accurate text descriptions for each video frame.\nExplainable Modules Based on LLM. The cleaned captions can describe frame information more accurately than the initial captions, but they are only simple descriptions and cannot describe abnormal phenomena in detail. Therefore, we prompt an LLM, i.e., LLaMA-3 (Touvron et al., 2023), to generate the required anomaly explanations. Specifically, we input the collection T̂ of cleaned frame captions and the prompt PS into the LLM ΦLLM to obtain the explanation E for video V:\nE = ΦLLM(PS, T̂),\nwhere T̂ = {T̂1, T̂2, \u0026hellip;, T̂N}. Through the above methods, we can obtain an anomaly description E that is more accurate semantically and temporally than the captions T̂.\n3.3. 
Multi-Modal Feature Fusion # This component primarily performs coarse-grained anomaly detection by feeding the fused visual and text features into an anomaly classifier. For visual features, we follow prior work (Wu et al., 2024c) and uniformly sample video frames from the input video at 16-frame intervals, obtaining a video frame sequence V. The video frames are then encoded by the frozen visual encoder EI in CLIP to produce frame features FI. To bridge the gap between the image and video domains in CLIP, we adopt the approach from (Wu et al., 2024c), modeling the temporal dependencies of the video frame sequence using the Local and Global Temporal Adapter (LGT-Adapter):\nFV = LGT-Adapter(FI).\nFigure 3. The Anomaly Explanation model first generates video frame captions via VLM, then cleans up the frame captions using the image-text module, and finally generates a detailed video interpretation using LLM. This interpretation is later used as a textual modality to enhance the performance of anomaly detection in conjunction with the visual modality, in addition to being used as an anomaly interpretation for the video.\nWe use the explanation E generated by AEGM as textual information. This textual information is encoded by the frozen text encoder ET in CLIP to produce textual features FT = ET(E). Subsequently, the textual and visual features are fused as FF = FV + FT, which is then fed into a binary classifier consisting of a feed-forward network (FFN) layer, an FC layer, and a Sigmoid activation to obtain the anomaly scores s ∈ R^{n×1}.\n3.4. Label Augment and Alignment Module # This part mainly includes the following two steps: label augmentation set construction and fine-grained classification.\nLabel Augmentation Set Construction. We utilize a pretrained LLM (OpenAI \u0026amp; etc, 2024) to generate m descriptive sentences related to each category label. 
To retain the sentences most semantically related to each category label, we compute the cosine similarity between the label and the generated sentences. Specifically, the category label L and the related descriptive sentences {S1, \u0026hellip;, Sm} generated by the LLM are first encoded into vectors. Then, the cosine similarity between the label vector and each sentence vector is calculated as:\nsim(vL, vSi) = (vL · vSi) / (‖vL‖ ‖vSi‖),\nwhere vL and vSi represent the embedding vectors of the category label and the i-th generated sentence, respectively. According to the similarity score, the top-k sentences with the highest similarity are selected from the generated sentences. The features of the selected sentences are integrated with the original label embeddings to form the final enhanced label embeddings FL.\nFine-grained Classification. We calculate the matching similarity between the enhanced label embeddings FL and the fused features FF to obtain an alignment map M ∈ R^{n×m}, where m is the number of text labels. In this alignment map, each input text label represents a class of abnormal events. By analyzing the similarity between the video and different category labels, a more detailed classification of abnormal events is achieved, which realizes fine-grained classification.\n3.5. Loss Function # Binary Classification Loss. We follow previous work (Wu et al., 2020) and use the Top-k mechanism to select the K highest anomaly confidences in both anomalous and normal videos as the video-level prediction. The classification loss LBCE is then computed as the binary cross-entropy between the video-level prediction and the ground truth:\nLBCE = -[y log(s) + (1 - y) log(1 - s)],\nwhere s denotes the predicted score and y is the true label (0 represents normal, and 1 represents abnormal).\nMultiple Class Loss. For multi-classification tasks, we propose the MIL-Align mechanism to align the frame-level fused feature FF and all label embeddings FL. 
Specifically, for each video, we select the top-k similarity values and compute their average to measure how well this video is aligned with the current class. Then, we obtain a vector V = {v1, \u0026hellip;, vm} that represents the similarity between this video and all classes. We expect the video and its paired textual label to yield the highest similarity score among all classes. To achieve this, the multi-class prediction is first computed as follows:\npi = exp(vi / τ) / Σj exp(vj / τ),\nwhere pi is the prediction with respect to the i-th class, and τ refers to the temperature hyper-parameter for scaling. Finally, the alignment loss LMCE can be computed by the cross-entropy:\nLMCE = -Σi yi log(pi), i = 1, \u0026hellip;, m,\nwhere yi is the ground truth label and m is the total number of classes.\nContrastive Loss. To pull apart the normal class embeddings from the anomaly class embeddings, we introduce the contrastive loss. Specifically, we first calculate the cosine similarity between the normal class embedding and the abnormal class embeddings, and then compute the contrastive loss Lcts from these similarities, where LN is the normal class embedding and LAj is the j-th abnormal class embedding.\nOverall, the final objective of Ex-VAD combines LBCE, LMCE, and Lcts.\nTable 1. Fine-grained comparisons on UCF-Crime (mAP@IoU, %).\n| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | AVG |\n| --- | --- | --- | --- | --- | --- | --- |\n| Random Baseline | 0.21 | 0.14 | 0.04 | 0.02 | 0.01 | 0.08 |\n| RealAD (2018) | 5.73 | 4.41 | 2.69 | 1.93 | 1.44 | 3.24 |\n| RTFM (2021) | 12.59 | 7.54 | 6.44 | 5.42 | 1.54 | 6.71 |\n| AVVD (2022) | 10.27 | 7.01 | 6.25 | 3.42 | 3.29 | 6.05 |\n| DMU (2023) | 11.32 | 7.62 | 5.97 | 4.33 | 2.36 | 6.32 |\n| CLIP-TSA (2023) | 12.62 | 8.13 | 6.66 | 4.28 | 1.91 | 6.72 |\n| UMIL (2024) | 11.84 | 7.85 | 6.52 | 3.97 | 2.84 | 6.60 |\n| VadCLIP (2024) | 11.72 | 7.83 | 6.4 | 4.53 | 2.93 | 6.68 |\n| STPrompt (2024) | 11.56 | 7.49 | 6.13 | 5.11 | 2.11 | 6.48 |\n| Ex-VAD (Ours) | 16.51 | 12.35 | 9.41 | 7.82 | 4.65 | 10.15 |\n3.6. Inference # Ex-VAD contains three branches that enable it to handle fine-grained and coarse-grained WSVAD tasks and anomaly interpretation. 
Regarding fine-grained WSVAD, we follow previous work (Wu et al., 2023) and use a thresholding strategy on the alignment map M to predict anomalous events. For coarse-grained WSVAD, we follow previous work (Wu et al., 2024c) in employing two methods to compute the frame-level anomaly degree. The first method directly uses the anomaly scores from the coarse-grained classification, while the second method uses the alignment map from the fine-grained classification, i.e., one minus the similarity between the video and the normal class is the anomaly degree. Finally, we choose the better of these two methods to compute the frame-level anomaly degree.\n4. Experiments # In this section, we perform experiments on the UCF-Crime (Sultani et al., 2019) and XD-Violence (Wu et al., 2020) datasets. Ex-VAD focuses on fine-grained anomaly detection and explainability. We compare it with other methods designed for fine-grained anomaly detection and explore novel approaches for explainable coarse-grained anomaly detection using LLMs and VLMs. Furthermore, we conduct comprehensive ablation studies to validate the effectiveness of each module in the proposed model.\nTable 2. Fine-grained comparisons on XD-Violence (mAP@IoU, %).\n| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | AVG |\n| --- | --- | --- | --- | --- | --- | --- |\n| Random Baseline | 1.82 | 0.92 | 0.48 | 0.23 | 0.09 | 0.71 |\n| RealAD (2018) | 22.72 | 15.57 | 9.98 | 6.2 | 3.78 | 11.65 |\n| RTFM (2021) | 31.25 | 26.85 | 21.94 | 13.56 | 12.54 | 21.23 |\n| AVVD (2022) | 30.51 | 25.75 | 20.18 | 14.83 | 9.79 | 20.21 |\n| DMU (2023) | 32.33 | 28.88 | 22.57 | 14.33 | 13.68 | 22.36 |\n| CLIP-TSA (2023) | 34.53 | 32.88 | 28.11 | 13.65 | 10.01 | 23.84 |\n| UMIL (2024) | 34.44 | 27.13 | 22.63 | 19.85 | 13.24 | 23.46 |\n| VadCLIP (2024) | 37.03 | 30.84 | 23.38 | 17.09 | 14.31 | 24.70 |\n| STPrompt (2024) | 38.21 | 25.63 | 28.66 | 13.11 | 11.63 | 23.44 |\n| Ex-VAD (Ours) | 40.14 | 32.75 | 28.78 | 20.15 | 18.35 | 28.23 |\n4.1. Experimental Setups # Datasets. We perform experiments on the UCF-Crime and XD-Violence datasets. 
UCF-Crime consists of 1,900 untrimmed surveillance videos with a total duration of 128 hours, covering 13 real-world anomalies (e.g., abuse, robbery, explosion) and normal activities. In the WSVAD setting, 1,610 videos are used for training with video-level annotations, while 290 videos are used for testing with frame-level annotations. XD-Violence contains 4,754 untrimmed videos totaling 217 hours, making it one of the largest multimodal violence detection datasets. It includes six types of violence (e.g., abuse, car accidents, explosions) across diverse sources such as surveillance, films, and games. The dataset is divided into 3,954 training videos and 800 testing videos, with video-level labels.\nEvaluation Metrics. For coarse-grained WSVAD, the evaluation uses frame-level Average Precision (AP) and frame-level AUC for XD-Violence, and only frame-level AUC for UCF-Crime. For fine-grained WSVAD, mean Average Precision (mAP) values are calculated under different Intersection over Union (IoU) thresholds (ranging from 0.1 to 0.5 with a stride of 0.1). The average mAP (AVG) is also reported, and mAP is computed only for abnormal videos in the test set.\nImplementation Details. All experiments are conducted on a single NVIDIA RTX A100 GPU using PyTorch. The network employs frozen image and text encoders from pretrained CLIP (ViT-B/16) with a Transformer-based FFN layer and GELU activation. BLIP-2 is used for caption generation, while Llama-3.1 generates anomaly explanations. Visual and text features are fused via concatenation. Key hyperparameters include σ = 1, τ = 0.07, context length l = 20, window length in the LGT-Adapter (64 for XD-Violence, 8 for UCF-Crime), and λ (1 × 10^-4 for XD-Violence, 1 for UCF-Crime).
\nTable 3. Coarse-grained comparisons of methods on XD-Violence and UCF-Crime datasets.\n| Category | Method | Fine-grained | Explainability | XD-Violence (AP) | UCF-Crime (AUC) |\n| --- | --- | --- | --- | --- | --- |\n| Training-Free | LAVAD (Zanella et al., 2024) | × | ✓ | 62.01 | 80.28 |\n| Training-Free | VERA (Ye et al., 2024) | × | ✓ | 88.2 | 86.6 |\n| Fine-tuning LLMs | VADOr (Lv \u0026amp; Sun, 2024) | × | ✓ | - | 88.1 |\n| Weakly supervised | RealAD (Sultani et al., 2019) | ✓ | × | 75.18 | 84.14 |\n| Weakly supervised | RTFM (Tian et al., 2021b) | ✓ | × | 78.27 | 85.66 |\n| Weakly supervised | AVVD (Zhou et al., 2022b) | ✓ | × | 78.10 | 84.57 |\n| Weakly supervised | TEVAD (Chen et al., 2023) | × | × | 79.80 | 84.9 |\n| Weakly supervised | DMU (Zhou et al., 2023) | ✓ | × | 82.41 | 86.75 |\n| Weakly supervised | CLIP-TSA (Joo et al., 2023) | ✓ | × | 82.17 | 87.58 |\n| Weakly supervised | UMIL (Sanchez-Macián et al., 2024) | ✓ | × | - | 86.75 |\n| Weakly supervised | VADCLIP (Wu et al., 2024c) | ✓ | × | 84.51 | 88.02 |\n| Weakly supervised | STPrompt (Wu et al., 2024b) | ✓ | × | 83.97 | 88.08 |\n| Weakly supervised | Ex-VAD (Ours) | ✓ | ✓ | 86.52 | 88.29 |\nTable 4. Effectiveness of each module for coarse-grained anomaly detection.\nVisual AEGM AEGM LAAM AUC (%) Caption Explainable text ✓ ✓ ✓ 86.76 ✓ ✓ ✓ 86.33 ✓ ✓ ✓ 87.86 ✓ ✓ ✓ 88.29\nTable 5. Effectiveness of the Anomaly Explanation Generation Module for fine-grained anomaly detection (mAP@IoU, %).\n| AEGM | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | AVG |\n| --- | --- | --- | --- | --- | --- | --- |\n| Captions | 17.74 | 13.27 | 10.25 | 7.01 | 6.1 | 10.88 |\n| Explainable Text | 16.51 | 12.35 | 9.41 | 7.82 | 4.65 | 10.15 |\n4.2. Comparison Results # Fine-grained WSVAD Results. The fine-grained detection task is more challenging as it involves detecting the presence of anomalous events while also accurately identifying their specific categories. To demonstrate the superiority of our proposed Ex-VAD, we conduct comparisons with several VAD methods, including RealAD (Sultani et al., 2019), RTFM (Tian et al., 2021b), AVVD (Zhou et al., 2022b), DMU (Zhou et al., 2023), CLIP-TSA (Joo et al., 2023), UMIL (Sanchez-Macián et al., 2024), VADCLIP (Wu et al. 
, 2024c), and STPrompt (Wu et al., 2024b). For fairness, CLIP (ViT-B/16) is used as the feature extractor for all methods.\nTables 1 and 2 present the fine-grained detection results on the UCF-Crime and XD-Violence datasets, evaluated using mean average precision (mAP) at IoU thresholds from 0.1 to 0.5 and their average (AVG). Our Ex-VAD achieves the best average mAP on both datasets, highlighting its superior performance. Specifically, Ex-VAD achieves an AVG of 9.00 on UCF-Crime, outperforming VADCLIP, STPrompt, and TCVADS by 1.32, 1.52, and 7.24, respectively. On XD-Violence, Ex-VAD achieves 28.23 AVG, exceeding these methods by 3.53, 4.79, and 11.28, respectively. Unlike methods such as VADCLIP, STPrompt, and TCVADS, which align visual features with text embeddings from CLIP or LLMs, Ex-VAD introduces a novel approach. Using AEGM, it prompts VLMs and LLMs to generate textual information, fuses this with visual features, and aligns the representation with labels. This generated textual information enriches semantics and enhances detection performance. Additionally, LAAM expands label semantics by converting single labels (e.g., \u0026ldquo;Abuse\u0026rdquo;) into descriptive phrases (e.g., \u0026ldquo;Someone is being mistreated\u0026rdquo;), better aligning with visual-text features.\nTable 6. Effectiveness of the Label Augment Alignment Module for fine-grained anomaly detection (mAP@IoU, %).\n| LAAM | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | AVG |\n| --- | --- | --- | --- | --- | --- | --- |\n| [CLS] | 14.38 | 10.54 | 6.92 | 5.03 | 2.51 | 7.87 |\n| a video of [CLS] | 14.77 | 10.68 | 6.69 | 4.78 | 3.73 | 8.13 |\n| Learnable-Prompt | 15.18 | 12.03 | 6.65 | 4.96 | 3.2 | 8.4 |\n| Label-Augment Prompt | 16.51 | 12.35 | 9.41 | 7.82 | 4.65 | 10.15 |\nCoarse-grained WSVAD Results. Additionally, we compare against state-of-the-art methods for coarse-grained anomaly detection, including the training-free methods LAVAD (Zanella et al., 2024) and VERA (Ye et al. 
, 2024); the fine-tuned, interpretable VADOr (Lv \u0026amp; Sun, 2024); and the fine-grained anomaly detection methods listed above.\nTable 3 shows that while LAVAD and VERA are simple and interpretable because they require no training, they do not support fine-grained detection. Our method, Ex-VAD, performs best on the UCF-Crime dataset and second best on the XD-Violence dataset. VADOr achieves explainability through fine-tuning but lacks fine-grained detection support. For methods supporting fine-grained detection, older approaches like RealAD underperform (75.18 AP on XD-Violence), while recent methods, including AVVD, DMU, and STPrompt, show consistent improvement. VADCLIP and TCVADS push the state of the art, with TCVADS achieving 85.58 AP on XD-Violence and 88.58 AUC on UCF-Crime. Ex-VAD uniquely combines fine-grained detection and interpretability, excelling in both. Although its performance on UCF-Crime (88.29 AUC) is marginally below TCVADS (88.58), it leads on XD-Violence, highlighting its versatility. This dual capability makes Ex-VAD well suited to practical applications requiring both precision and insight into detection results.\n4.3. Model Analysis # Ablation Study. To evaluate the impact of the two key components, AEGM and LAAM, we conducted ablation experiments on the UCF-Crime dataset by removing one or both components from Ex-VAD, with results summarized in Table 4. The findings reveal that generating captions solely for videos degrades performance, whereas cleaning these captions and generating anomaly explanations significantly enhances it. This highlights the negative impact of low-quality captions, which often contain redundant or erroneous information, and the complementary role of high-quality anomaly explanations in improving visual performance. While AEGM is primarily designed for fine-grained anomaly detection, it also contributes to coarse-grained detection improvements.\nEffectiveness of the AEGM. 
We evaluate the effectiveness of AEGM for fine-grained anomaly detection, with results shown in Table 5. The analysis shows that Captions alone outperform Explainable Text in fine-grained anomaly detection, as Captions provide frame-level semantic details, while Explainable Text offers a concise video-level summary. However, Explainable Text remains competitive (10.15 vs. 10.88 AVG) while also providing transparent, summarized explanations of anomalies at the video level. Therefore, we choose Explainable Text for the final model to balance performance and interpretability.\nEffectiveness of LAAM. We evaluate the effectiveness of LAAM in fine-grained VAD, as summarized in Table 6. The results demonstrate that LAAM-augmented labels significantly enhance detection accuracy compared to manually defined cue words and learnable prompt-based approaches. This improvement highlights the value of leveraging LAAM to generate semantically rich and contextually relevant labels that align more effectively with the visual and textual features used for fine-grained anomaly detection.\nFigure 4. Sensitivity analysis of the number of templates K for (a) coarse-grained detection and (b) fine-grained detection.\nTable 7. Comparison of Trainable Parameters, Inference Time, and Multiply-Add Operations (MACs). The best and second-best values are highlighted with bold text and underlined text, respectively.\nMethod | Trainable Params | Inference Time | MACs\nRTFM | 24.72M | 8.28ms | 126.59G\nDMU | 6.49M | 16.60ms | 21.00G\nCLIP-TSA | 16.41M | 18.33ms | 102.63G\nVADCLIP | 35.17M | 22.30ms | 29.17G\nEx-VAD | 9.97M | 15.37ms | 12.04G\nEffectiveness of Top-k. Figure 4 presents the impact of different top-k values in the LAAM module on coarse-grained and fine-grained detection results, respectively. The trend graphs reveal that selecting the top 4 phrases (k = 4) with the highest similarity to the original labels achieves optimal label enhancement for video anomaly detection. 
In this setting, the AUC for coarse-grained detection peaks at 88.28%, while the average mAP@IOU for fine-grained detection reaches its highest value of 10.15%, demonstrating the best detection performance. However, excessive enhancement (k \u0026gt; 5) may introduce noise, resulting in performance degradation. These results indicate that moderate label enhancement improves the model\u0026rsquo;s overall detection capability and anomaly localization accuracy.\nAnalysis of Computational Efficiency. We evaluate the number of trainable parameters (Trainable Params), the per-frame inference time (Inference Time), and multiply-add operations (MACs). Table 7 shows that our method, Ex-VAD, has the lowest MACs (12.04G) and the second-lowest trainable parameter count (9.97M) and inference time (15.37 ms) among the compared methods, giving a well-balanced trade-off between model size and computational cost.\n5. Conclusion # In this paper, we propose Ex-VAD, an explainable approach for fine-grained video anomaly detection. First, the Anomaly Explanation Generation Module (AEGM) extracts and refines frame-level captions using VLMs, and then generates video-level anomaly explanations with an LLM. Second, the Multimodal Anomaly Detection Module (MADM) encodes the text and extracts temporal and spatial features to detect coarse-grained anomalies. Finally, the Label Augment and Alignment Module (LAAM) expands and aligns anomaly category labels with multimodal features to achieve fine-grained anomaly detection. Experiments show that Ex-VAD outperforms existing methods in fine- and coarse-grained anomaly detection, providing a more transparent and effective solution.\nAcknowledgements # This work was supported in part by the National Natural Science Foundation of China (No.62301621), in part by Shenzhen Science and Technology Program (No. 20231121172359002, No. KQTD20221101093559018), in part by Shenzhen General Research Project (No. JCYJ20241202125904007), in part by Guangdong Basic and Applied Basic Research Foundation (No. 
2025A1515011398, No.2023B0303000010), in part by the CIE-Smartchip research fund (No.2024-08), in part by the Fundamental Research Funds for the Central Universities, Sun Yat-sen University under Grants No. 23xkjc010.\nImpact Statement # This paper presents work whose goal is to advance the field of Computer Vision. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.\nReferences # Chen, W., Ma, K. T., Jian Yew, Z., Hur, M., and Khoo, D. A.-A. Tevad: Improved video anomaly detection with captions. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 5549–5559, 2023.\nHuang, C., Yang, Z., Wen, J., Xu, Y., Jiang, Q., Yang, J., and Wang, Y. Self-supervision-augmented deep autoencoder for unsupervised visual anomaly detection. IEEE Transactions on Cybernetics, 52(12):13834–13847, 2022.\nHuang, C., Wen, J., Xu, Y., Jiang, Q., Yang, J., Wang, Y., and Zhang, D. Self-supervised attentive generative adversarial networks for video anomaly detection. IEEE Transactions on Neural Networks and Learning Systems , 34(11):9389–9403, 2023.\nHuang, C., Wen, J., Liu, C., and Liu, Y. Long short-term dynamic prototype alignment learning for video anomaly detection. In Larson, K. (ed.), Proceedings of the ThirtyThird International Joint Conference on Artificial Intelligence, IJCAI-24, pp. 866–874. International Joint Conferences on Artificial Intelligence Organization, 8 2024. doi: 10.24963/ijcai.2024/96. Main Track.\nJoo, H. K., Vo, K., Yamazaki, K., and Le, N. Clip-tsa: Clipassisted temporal self-attention for weakly-supervised video anomaly detection. In 2023 IEEE International Conference on Image Processing (ICIP), pp. 3230–3234. IEEE, 2023.\nKim, J., Yoon, S., Choi, T., and Sull, S. Unsupervised video anomaly detection based on similarity with predefined text descriptions. Sensors (Basel, Switzerland), 23, 2023a.\nKim, J., Yoon, S., Choi, T., and Sull, S. 
Unsupervised video anomaly detection based on similarity with predefined text descriptions. Sensors, 23(14):6256, 2023b.\nLi, G., Cai, G., Zeng, X., and Zhao, R. Scale-aware spatiotemporal relation learning for video anomaly detection. In European Conference on Computer Vision, pp. 333–350. Springer, 2022a.\nLi, J., Li, D., Savarese, S., and Hoi, S. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.\nLi, S., Liu, F., and Jiao, L. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 1395–1403, 2022b.\nLiu, Y., Yang, D., Wang, Y., Liu, J., Liu, J., Boukerche, A., Sun, P., and Song, L. Generalized video anomaly event detection: Systematic taxonomy and comparison of deep models. 56(7), 2024. ISSN 0360-0300.\nLv, H. and Sun, Q. Video anomaly detection and explanation via large language models, 2024.\nNawaratne, R., Alahakoon, D., De Silva, D., and Yu, X. Spatiotemporal anomaly detection using deep learning for real-time video surveillance. IEEE Transactions on Industrial Informatics, 16(1):393–402, 2020.\nNguyen, T. N. and Meunier, J. Anomaly detection in video sequence with appearance-motion correspondence. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1273–1283, 2019.\nOpenAI and etc. Gpt-4 technical report, 2024. URL https: //arxiv.org/abs/2303.08774 .\nPu, Y., Wu, X., Yang, L., and Wang, S. Learning promptenhanced context features for weakly-supervised video anomaly detection. IEEE Transactions on Image Processing, 33:4923–4936, 2024.\nRadford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. 
In International conference on machine learning, pp. 8748–8763. PMLR, 2021.\nRamachandra, B., Jones, M. J., and Vatsavai, R. R. A survey of single-scene video anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2293–2312, 2022. doi: 10.1109/TPAMI.2020.3040591.\nRen, J., Xia, F., Liu, Y., and Lee, I. Deep video anomaly detection: Opportunities and challenges. In 2021 International Conference on Data Mining Workshops (ICDMW), pp. 959–966, 2021.\nSamaila, Y. A., Sebastian, P., Singh, N. S. S., Shuaibu, A. N., Ali, S. S. A., Amosa, T. I., Mustafa Abro, G. E., and Shuaibu, I. Video anomaly detection: A systematic review of issues and prospects. Neurocomputing, 591:127726, 2024. ISSN 0925-2312.\nSánchez-Macián, A., Martínez, J., Reviriego, P., Liu, S., and Lombardi, F. On the privacy of the count-min sketch: Extracting the top-k elements. IEEE Transactions on Emerging Topics in Computing, 2024.\nSultani, W., Chen, C., and Shah, M. Real-world anomaly detection in surveillance videos, 2019.\nTang, J., Lu, H., Wu, R., Xu, X., Ma, K., Fang, C., Guo, B., Lu, J., Chen, Q., and Chen, Y.-C. Hawk: Learning to understand open-world video anomalies, 2024a.\nTang, Y., Guo, J., Hua, H., Liang, S., Feng, M., Li, X., Mao, R., Huang, C., Bi, J., Zhang, Z., Fazli, P., and Xu, C. Vidcomposition: Can mllms analyze compositions in compiled videos? arXiv preprint arXiv:2411.10979, 2024b.\nTian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J. W., and Carneiro, G. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4975–4986, 2021a.\nTian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J. W., and Carneiro, G. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 
4975–4986, 2021b.\nTouvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023.\nWang, B., Huang, C., Wen, J., Wang, W., Liu, Y., and Xu, Y. Federated weakly supervised video anomaly detection with multimodal prompt. Proceedings of the AAAI Conference on Artificial Intelligence, 39(20):21017–21025, Apr. 2025.\nWang, L., Huynh, D. Q., and Mansour, M. R. Loss switching fusion with similarity search for video classification. In 2019 IEEE International Conference on Image Processing (ICIP), 2019.\nWu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., and Yang, Z. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pp. 322–339. Springer, 2020.\nWu, P., Liu, X., and Liu, J. Weakly supervised audio-visual violence detection. IEEE Transactions on Multimedia , 25:1674–1685, 2023.\nWu, P., Liu, J., He, X., Peng, Y., Wang, P., and Zhang, Y. Toward video anomaly retrieval from video anomaly detection: New benchmarks and model. IEEE Transactions on Image Processing, 33:2213–2225, 2024a.\nWu, P., Zhou, X., Pang, G., Yang, Z., Yan, Q., Wang, P., and Zhang, Y. Weakly supervised video anomaly detection and localization with spatio-temporal prompts. MM \u0026lsquo;24, New York, NY, USA, 2024b. Association for Computing Machinery. ISBN 9798400706868. doi: 10.1145/3664647.3681442.\nWu, P., Zhou, X., Pang, G., Zhou, L., Yan, Q., Wang, P., and Zhang, Y. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 38, pp. 6074–6082, 2024c.\nYan, C., Zhang, S., Liu, Y., Pang, G., and Wang, W. 
Feature prediction diffusion model for video anomaly detection. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5504–5514, 2023.\nYe, M., Liu, W., and He, P. Vera: Explainable video anomaly detection via verbalized learning of vision-language models. arXiv preprint arXiv:2412.01095, 2024.\nZaigham Zaheer, M., Lee, J.-H., Astrid, M., and Lee, S.-I. Old is gold: Redefining the adversarially learned one-class classifier training paradigm. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14171–14181, 2020.\nZanella, L., Menapace, W., Mancini, M., Wang, Y., and Ricci, E. Harnessing large language models for training-free video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18527–18536, 2024.\nZhao, Y., Deng, B., Shen, C., Liu, Y., Lu, H., and Hua, X.-S. Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 1933–1941, 2017.\nZhou, H., Yu, J., and Yang, W. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 3769–3777, 2023.\nZhou, K., Yang, J., Loy, C. C., and Liu, Z. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2022a. ISSN 1573-1405. doi: 10.1007/s11263-022-01653-1.\nZhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., and Misra, I. Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022b.\nA. Appendix # Figure 5. Confidence visualization on the UCF-Crime dataset.\nQualitative Analyses. Figure 5 illustrates the qualitative visualization of Ex-VAD. The blue curve represents the anomaly prediction score, while the grey area highlights the ground truth anomaly time positions. 
The figure also showcases fine-grained anomaly categories and anomaly explanations, which are generated by querying the LLM. As shown, Ex-VAD effectively detects unused anomaly categories, describes anomalous phenomena, and accurately differentiates between normal and abnormal clips in anomalous videos.\n","date":"1 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/3552_ex_vad_explainable_fine_g/","section":"Papers","summary":"The paper introduces Ex-VAD, a comprehensive framework for fine-grained and explainable video anomaly detection that leverages visual-language models (VLMs) and large language models (LLMs). It features modules for generating anomaly explanations, fusing multimodal features for coarse detection, and expanding/aligning labels for fine-grained classification, with improved interpretability and accuracy demonstrated on UCF-Crime and XD-Violence datasets.","title":"Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models","type":"method"},{"content":"","date":"1 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jie-wen/","section":"Authors","summary":"","title":"Jie Wen","type":"authors"},{"content":"","date":"1 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/wei-wang/","section":"Authors","summary":"","title":"Wei Wang","type":"authors"},{"content":"","date":"1 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiaochun-cao/","section":"Authors","summary":"","title":"Xiaochun Cao","type":"authors"},{"content":"","date":"1 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yong-xu/","section":"Authors","summary":"","title":"Yong Xu","type":"authors"},{"content":"","date":"1 January 2025","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yushu-shi/","section":"Authors","summary":"","title":"Yushu Shi","type":"authors"},{"content":"","date":"17 April 
2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jiantao-zhou/","section":"Authors","summary":"","title":"Jiantao Zhou","type":"authors"},{"content":"","date":"17 April 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/rongqin-liang/","section":"Authors","summary":"","title":"Rongqin Liang","type":"authors"},{"content":" Text-Driven Traffic Anomaly Detection With Temporal High-Frequency Modeling in Driving Videos # Rongqin Liang , Student Member, IEEE, Yuanman Li , Senior Member, IEEE , Jiantao Zhou , Senior Member, IEEE, and Xia Li , Member, IEEE\nAbstract— Traffic anomaly detection (TAD) in driving videos is critical for ensuring the safety of autonomous driving and advanced driver assistance systems. Previous single-stage TAD methods primarily rely on frame prediction, making them vulnerable to interference from dynamic backgrounds induced by the rapid movement of the dashboard camera. While two-stage TAD methods appear to be a natural solution to mitigate such interference by pre-extracting background-independent features (such as bounding boxes and optical flow) using perceptual algorithms, they are susceptible to the performance of first-stage perceptual algorithms and may result in error propagation. In this paper, we introduce TTHF, a novel single-stage method aligning video clips with text prompts, offering a new perspective on traffic anomaly detection. Unlike previous approaches, the supervised signal of our method is derived from languages rather than orthogonal one-hot vectors, providing a more comprehensive representation. Further, concerning visual representation, we propose to model the high frequency of driving videos in the temporal domain. This modeling captures the dynamic changes of driving scenes, enhances the perception of driving behavior, and significantly improves the detection of traffic anomalies. 
In addition, to better perceive various types of traffic anomalies, we carefully design an attentive anomaly focusing mechanism that visually and linguistically guides the model to adaptively focus on the visual context of interest, thereby facilitating the detection of traffic anomalies. Experiments show that our proposed TTHF achieves promising performance, outperforming state-of-the-art competitors by +5.4% AUC on the DoTA dataset and achieving high generalization on the DADA dataset.\nIndex Terms— Traffic anomaly detection, multi-modality learning, high frequency, attention.\nManuscript received 8 January 2024; revised 2 April 2024; accepted 12 April 2024. Date of publication 17 April 2024; date of current version 30 September 2024. This work was supported in part by the Key Project of Shenzhen Science and Technology Plan under Grant 20220810180617001, in part by the Foundation for Science and Technology Innovation of Shenzhen under Grant RCBS20210609103708014, in part by Guangdong Basic and Applied Basic Research Foundation under Grant 2022A1515010645, and in part by the Open Research Project Programme of the State Key Laboratory of Internet of Things for Smart City (University of Macau) under Grant SKLIoTSC(UM)-20212023/ORP/GA04/2022. This article was recommended by Associate Editor R. He. 
(Corresponding author: Yuanman Li.)\nRongqin Liang, Yuanman Li, and Xia Li are with Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, China (e-mail: 1810262064@email.szu.edu.cn; yuanmanli@szu.edu.cn; lixia@szu.edu.cn).\nJiantao Zhou is with the State Key Laboratory of Internet of Things for Smart City and the Department of Computer and Information Science, University of Macau, Macau (e-mail: jtzhou@um.edu.mo).\nColor versions of one or more figures in this article are available at https://doi.org/10.1109/TCSVT.2024.3390173.\nDigital Object Identifier 10.1109/TCSVT.2024.3390173\nI. INTRODUCTION # T RAFFIC anomaly detection (TAD) in driving videos is a crucial component of automated driving systems [1] , [2] and advanced driver assistance systems [3] , [4]. It is designed to detect anomalous traffic behavior from the first-person driving perspective. Accurate detection of traffic anomalies helps improve road safety, shorten traffic recovery times, and reduce the number of regrettable daily traffic accidents.\nGiven the significance of traffic anomaly detection, scholars are actively involved in this field and have proposed constructive research [5] , [6] , [7] , [8] , [9]. We observe that these works on TAD can be mainly divided into the single-stage paradigm [6] , [10] , [11] and the two-stage paradigm [8] , [9] , [12]. As shown in Fig. 1, previous TAD methods mainly embrace a single-stage paradigm, exemplified by frame prediction [6] and reconstruction-based [11] TAD approaches. Nevertheless, these methods are subject to the dynamic backgrounds caused by the rapid movement of the dashboard camera and have limited accuracy in detecting traffic anomalies. To confront the challenges posed by dynamic backgrounds, researchers have advocated for TAD methods [8] , [9] , [12] that utilize a two-stage paradigm. 
These two-stage approaches first extract features such as optical flow, bounding boxes, or tracking IDs from video frames using existing visual perception algorithms, and then apply a TAD model to detect traffic anomalies. While these approaches have laid the foundation for TAD in driving videos, they are susceptible to the performance of the first-stage visual perception algorithm, which may cause error propagation and result in false or missed detections of traffic anomalies. Therefore, in this paper, we strive to explore an effective single-stage paradigm-based approach for traffic anomaly detection in driving videos.\n1051-8215 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.\nFig. 1. Existing TAD approaches of single-stage paradigm (a) and two-stage paradigm (b) vs. the proposed TTHF framework (c). Existing single-stage approaches mainly rely on frame prediction, which is difficult to adapt to detecting traffic anomalies with a dynamic background, while the two-stage TAD approaches are vulnerable to the performance of the first-stage perceptual algorithms. The proposed TTHF framework is text-driven and focuses on capturing dynamic changes in driving scenes through modeling temporal high frequency to facilitate traffic anomaly detection.\nRecently, large-scale visual language pre-training models [13], [14], [15] have achieved remarkable results by utilizing language knowledge to assist with visual tasks. Among them, CLIP [13] stands out for its exceptional transferability through the alignment of image-text semantics and has demonstrated outstanding capabilities across various computer vision tasks such as object detection [16], semantic segmentation [17], and video retrieval [18]. The success of image-text alignment techniques can be attributed to their ability to map the natural languages associated with an image into high-dimensional non-orthogonal vectors. This is in contrast to traditional supervised methods that map predefined labels to low-dimensional one-hot vectors. Compared to the low-dimensional one-hot vectors, these high-dimensional vectors offer more comprehensive representations to guide the network training. Motivated by this, we endeavor to investigate a language-guided approach for detecting traffic anomalies in driving videos. Intuitively, the transition of CLIP from image-text alignment to video-text alignment primarily involves the consideration of modeling temporal dimensions. Despite the exploration of various methods [19], [20], [21], [22] for temporal modeling, encompassing techniques such as Average Pooling, Conv1D, LSTM, and Transformer, the existing approaches predominantly concentrate on aggregating visual context along the temporal dimension. In the context of traffic anomaly detection for driving videos, we emphasize that beyond the visual context, characterizing dynamic changes in the driving scene along the temporal dimension proves advantageous in determining abnormal driving behavior. For instance, traffic events such as vehicle collisions or loss of control often result in significant and rapid alterations in the driving scene. Therefore, how to effectively characterize the dynamic changes of driving scenes holds paramount importance for traffic anomaly detection in driving videos.\nAdditionally, considering that different types of traffic anomalies exhibit unique characteristics, a straightforward encoding of the entire driving scene may diminish the discriminability of driving events and impede the detection of diverse traffic anomalies. 
For instance, traffic anomalies involving the ego-vehicle are often accompanied by global jittering of the dashboard camera, while anomalies involving non-ego vehicles often lead to local anomalies in the driving scene. Consequently, how to better perceive various types of traffic anomalies proves crucial for traffic anomaly detection .\nIn this work, we propose a novel traffic anomaly detection approach: Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling (TTHF), as shown in Fig. 2. To represent driving videos comprehensively, our fundamental idea is to not only capture the spatial visual context but also emphasize the depiction of dynamic changes in the driving scenes, thereby enhancing the visual representation of driving videos. Specifically, we initially leverage the pre-trained visual encoder of CLIP, endowed with rich prior knowledge of visual language semantics, to encode the visual context of driving videos. Then, to capture the dynamic changes in driving scenes, we innovatively introduce temporal high-frequency modeling (THFM) to obtain temporal high frequency representations of driving videos along the temporal dimension. Subsequently, the visual context and temporal high-frequency representations are fused to enhance the overall visual representation of driving videos. To better perceive various types of traffic anomalies, we propose an attentive anomaly focusing mechanism (AAFM) to guide the model to adaptively focus both visually and linguistically on the visual context of interest, thereby facilitating the detection of traffic anomalies.\nIt is shown that our proposed TTHF model exhibits promising performance on the DoTA dataset [9], outperforming state-of-the-art competitors by +5.4% AUC. Furthermore, without any fine tuning, the AUC performance of TTHF on the DADA dataset [23] demonstrates its generalization capability. 
The main contributions of our work can be summarized as follows:\nWe introduce a simple yet effective single-stage traffic anomaly detection method that aligns the visual semantics of driving videos with matched textual semantics to identify traffic anomalies. In contrast to previous TAD methods, the supervised signals in our approach are derived from text, offering a more comprehensive representation in high-dimensional space.\nWe emphasize the modeling of high frequency in the temporal domain for driving videos. In contrast to previous approaches that solely aggregate visual context along the temporal dimension, we place additional emphasis on modeling high frequency in the temporal domain. This enables us to characterize dynamic changes in the driving scene over time, thereby significantly enhancing the performance of traffic anomaly detection.\nWe further propose an attentive anomaly focusing mechanism to enhance the perception of various traffic anomalies. Our proposed mechanism guides the model both visually and linguistically to adaptively focus on the visual contexts of interest, facilitating the detection of traffic anomalies.\nComprehensive experimental results on public benchmark datasets demonstrate the superiority and robustness of the proposed method. Compared to existing state-of-the-art methods, the proposed TTHF improves AUC by +5.4% on the DoTA dataset and also achieves state-of-the-art AUC on the DADA dataset without any fine-tuning.\nThe remainder of this paper is organized as follows. Section II gives a brief review of related works. Section III details our proposed TTHF for traffic anomaly detection in driving videos. Extensive experimental results are presented in Section IV, and we finally draw a conclusion in Section V.\nII. RELATED WORKS # A. 
Traffic Anomaly Detection (TAD) in Driving Videos # Traffic anomaly detection (TAD) in driving videos aims to identify abnormal traffic events from the perspective of driving, such as collisions with other vehicles or obstacles, being out of control, and so on. Such events can be classified into two categories: ego-involved anomalies (i.e., traffic events involving the ego-vehicle) and non-ego anomalies (i.e. , traffic events involving observed objects but not the egovehicle). A closely related topic to TAD in driving videos is anomaly detection in surveillance videos (VAD), which involves identifying abnormal events such as fights, assaults, thefts, arson, and so forth from a surveillance viewpoint. In recent years, various VAD methods [24] , [25] , [26] , [27] , [28] , [29] have been proposed for surveillance videos, which have greatly contributed to the development of this field. However, in contrast to the static background in surveillance videos, the background in driving videos is dynamically changing due to the fast movement of the ego vehicle, which makes the VAD methods prone to failure in the TAD task [9] , [12]. Recently, Wang et al. [30] proposed a method for detecting crowd flow anomalies by comparing anomalous samples with normal samples that were estimated based on prototypes. However, crowd flow anomaly detection methods are difficult to apply to the TAD task due to the differences in tasks and the data processed. In this paper, we work on the task of traffic anomaly detection in driving videos to provide a new solution for this community.\nEarly TAD methods [5] , [31] mainly extracted features in a handcrafted manner and utilized a Bayesian model for classification. However, these methods are sensitive to well-designed features and generally lack robustness in dealing with a wide variety of traffic scenarios. 
With the advances of deep neural networks in computer vision, researchers have proposed deep learning-based approaches for TAD, laying the foundation for this task. Based on our observations, the existing TAD methods can be basically classified into single-stage paradigm [6] , [10] , [11] and two-stage paradigm [12] , [32] , [33] , [34] .\nPrevious single-stage paradigm-based TAD approaches mainly comprise frame reconstruction-based and frame prediction-based TAD approaches [6] , [10] , [11]. These methods used reconstruction or prediction errors of video frames to evaluate traffic anomalies. For instance, Liu et al. [6] predicted video frames of normal traffic events through appearance and motion constraints, thereby helping to identify traffic anomalies that do not conform to expectations. Unfortunately, these methods tend to detect ego-involved anomalies (e.g., out of control) and perform poorly on non-ego traffic anomalies. This is primarily attributed to ego-involved anomalies causing significant shaking of the dashboard camera, leading to substantial global errors in frame reconstruction or prediction. Such errors undoubtedly facilitate anomaly detection. However, the methods based on frame reconstruction or prediction have difficulty distinguishing the local errors caused by the traffic anomalies of other road participants because of the interference of the dynamic background from the fastmoving ego-vehicle. This impairs their ability to detect traffic anomalies.\nIn recent years, to address the challenges posed by dynamic backgrounds, researchers have proposed applying a two-stage paradigm to the traffic anomaly detection task. In this paradigm, the perception algorithm is initially applied to extract visual features in the first stage. Then, the TAD model utilizes these features to detect traffic anomalies. For instance, Yao et al. 
[9], [32] applied Mask-RCNN [35], FlowNet [36], DeepSort [37], and ORBSLAM [38] to extract bounding boxes (bboxes), optical flow, tracking IDs, and ego motion, respectively. Then, they used these visual features to predict the future locations of objects over a short horizon and detected traffic anomalies based on the deviation of the predicted locations. Along this line, Fang et al. [12] used optical flow and bboxes as visual features. They combined frame prediction and future object localization tasks [39] to detect traffic anomalies by analyzing inconsistencies in predicted frames, object locations, and the spatial relation structure of the scene. Zhou et al. [8] obtained bboxes of objects in the scene from potentially abnormal frames as visual features. They then encoded the spatial relationships of the detected objects to determine the abnormality of these frames. Despite the success of two-stage TAD methods, they rely on the perception algorithms in the first stage, which may cause error propagation and lead to missed or false detections of traffic anomalies. Different from existing TAD methods, we propose a text-driven single-stage traffic anomaly detection approach that provides a promising solution for this task.\nB. Vision-Text Multi-Modality Learning # Recently, vision-text multi-modal learning has attracted growing attention. In particular, contrastive language-image pre-training methods have achieved remarkable results in many computer vision tasks such as image classification [13], [14], object detection [16], [40], semantic segmentation [17], [41], and image retrieval [42], [43]. At present, CLIP [13] has become a mainstream visual learning method, which connects visual signals and language semantics by contrasting large-scale image-language pairs. 
Essentially, compared to traditional supervised methods that convert labels into orthogonal one-hot vectors, CLIP provides richer and more comprehensive supervision information by collecting large-scale image-text pairs from web data and mapping the text into high-dimensional supervision signals (usually non-orthogonal). Following this idea, many scholars have applied CLIP to various tasks in the video domain, including video action recognition [19], [44], video retrieval [18], [20], [45], video recognition [46], [47], and so on. For example, ActionCLIP [19] modeled the video action detection task as a video-text matching problem in a multi-modal learning framework and strengthened the video representation with more semantic language supervision to enable the model to perform zero-shot action recognition.\nFig. 2. Overview of our proposed TTHF. It is a CLIP-like framework for traffic anomaly detection. In this framework, we first apply a visual encoder to extract visual representations of driving video clips. Then, we propose Temporal High-Frequency Modeling (THFM) to characterize the dynamic changes of driving scenes and thus construct a more comprehensive representation of driving videos. Finally, we introduce an attentive anomaly focusing mechanism (AAFM) to enhance the perception of various types of traffic anomalies. For brevity, we denote the cross-attention as CA, the visually focused representation as VFR, and the linguistically focused representation as LFR.\nMore recently, Wu et al. [48] proposed a vision-language model for anomaly detection in surveillance videos. However, as mentioned earlier, traffic anomaly detection faces the problem of dynamic changes in the driving scene, which often makes VAD methods prone to fail in TAD tasks. To the best of our knowledge, there is no effective approach that models the traffic anomaly detection task from the perspective of vision-text multi-modal learning. 
In this paper, we preliminarily explore an effective text-driven method for traffic anomaly detection, which we hope can provide a new perspective on this task.\nIII. THE PROPOSED APPROACH: TTHF # In this section, we introduce the proposed TTHF framework. First, we describe the overall framework of TTHF. Then, we explain its two key modules, i.e., temporal high-frequency modeling (THFM) and the attentive anomaly focusing mechanism (AAFM). Moreover, we describe the contrastive learning strategy for cross-modal learning of video-text pairs, and finally show how to perform traffic anomaly detection with TTHF.\nA. Overview of Our TTHF Framework # The overall framework of TTHF is illustrated in Fig. 2. It is a CLIP-like two-stream framework for traffic anomaly detection. For the visual context representation, considerable research [49], [50], [51] has demonstrated that CLIP possesses a robust foundation of vision-language prior knowledge. Leveraging this acquired semantic knowledge for anomaly detection in driving videos facilitates the perception and comprehension of driving behavior. Therefore, we apply the pretrained visual encoder of CLIP to extract visual representations from driving video clips of two consecutive frames. After obtaining the frame representations, we employ Average Pooling along the temporal dimension, as in previous works [19], [20], [21], to aggregate these representations and characterize the visual context of the video clip. For the text representation, we first describe normal and abnormal traffic events as text prompts (i.e., a1 and a2 in Table I), and then apply the pretrained textual encoder in CLIP to extract text representations.\nIntuitively, after extracting the visual and textual representations of driving video clips, we can directly leverage contrastive learning to align them for traffic anomaly detection. 
However, in our task, solely modeling the visual representation from visual context is insufficient to capture the dynamic changes in the driving scene. Therefore, we introduce temporal high-frequency modeling (THFM) to characterize the dynamic changes and provide a more comprehensive representation of the driving video clips. Additionally, to better perceive various types of traffic anomalies, we further propose an attentive anomaly focusing mechanism (AAFM) to adaptively focus on the visual context of interest in the driving scene, thereby facilitating the detection of traffic anomalies. In the following sections, we will introduce these two key modules in detail.\nB. Temporal High-Frequency Modeling (THFM) # Video-text alignment differs from image-text alignment in that it must account for temporal characteristics. Numerous methods [19], [20], [21] have effectively employed CLIP in addressing downstream tasks within the video domain. The temporal modeling strategies adopted in these approaches include Average Pooling, Conv1D, LSTM, and Transformer. These strategies primarily emphasize aggregating visual context from distinct video frames along the temporal dimension. Nevertheless, for the anomaly detection task in driving videos, we contend that not only the visual context but also the temporal dynamic changes in the driving scene hold significant importance in modeling driving behavior. For instance, a collision or loss of vehicle control often induces substantial changes in the driving scene within a brief timeframe. Therefore, in our work, we propose to model the visual representation of driving videos in two aspects, i.e., the visual context of video frames in the spatial domain and the dynamic changes of driving scenes in the temporal domain. We observe that the high frequency of the driving video in the temporal domain reflects the dynamic changes of the driving scene; several cases are presented in Fig. 4 for illustration. Based on the above observations, we introduce Temporal High-Frequency Modeling (THFM) to enhance the visual representation of the driving video within the spatio-temporal domain.\nFig. 3. An illustration of the AAFM. The original video frames are displayed in column (a). In column (b), we visualize the attention of the visual representation to the deep features of a video clip under the visually focused strategy (VFS). In column (c), we visualize the attention of the soft text representation to the deep features of a video clip under the linguistically focused strategy (LFS). We present two types of traffic anomaly scenarios. Specifically, case 1 illustrates an instance where the ego-vehicle experiences loss of control while executing a turn. In case 2, the ego-vehicle observes a collision between the car turning ahead and the motorcycle traveling straight on the right.\nFig. 4. An illustration of the high frequency. We show 3 cases as examples. The first and second columns correspond to the original consecutive video frames, and the last column is the high-frequency component extracted along the temporal dimension.\nOur fundamental idea involves utilizing the high frequency present in the temporal domain of the driving video to characterize dynamic changes. Specifically, we first extract the high frequency of the driving video clip in the temporal dimension, which is formulated as:\nwhere $HP(·)$ is the difference operation that extracts the high frequency $I^n_{hp}$ along the temporal dimension from two consecutive frames $t-1$ and $t$ of the $n$-th driving video clip.\nFurther, we encode $I^n_{hp}$ to the high-frequency representation by\nwhere $F_{hf}(·)$ represents the high-frequency encoder, sharing the same architecture as the visual encoder (i.e., ResNet50 unless specified otherwise). 
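As a minimal sketch of the difference operation described above (not the authors' implementation), the snippet below extracts a temporal high-frequency map from two consecutive frames. The helper name `temporal_high_frequency`, the nested-list grayscale frames, and the sign convention (frame t minus frame t-1) are assumptions made purely for illustration.

```python
# Hypothetical helper: each frame is a nested list [H][W] of grayscale values.
def temporal_high_frequency(frame_prev, frame_curr):
    # Element-wise difference between consecutive frames t-1 and t.
    # Assumption: the sign convention (t minus t-1) is chosen for illustration.
    return [
        [curr - prev for prev, curr in zip(row_prev, row_curr)]
        for row_prev, row_curr in zip(frame_prev, frame_curr)
    ]


# Toy clip: a static background in which a single pixel changes.
frame_t_minus_1 = [[10, 10, 10], [10, 10, 10]]
frame_t = [[10, 10, 10], [10, 90, 10]]

high_freq = temporal_high_frequency(frame_t_minus_1, frame_t)
print(high_freq)  # [[0, 0, 0], [0, 80, 0]]
```

In this toy clip the static background cancels out and only the changed pixel survives, which is the property the high-frequency branch relies on. In the full framework such a map would then be passed to the high-frequency encoder.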
The resultant high-frequency representation is denoted as $H^n_t$. Finally, to obtain the visual representation of the driving video clip in the spatio-temporal domain, we fuse the spatial visual context representation with the temporal high-frequency representation $H^n_t$, which is expressed as follows:\nwhere $F_{ve}$ is the visual encoder with frozen pre-trained parameters $ξ_{ve}$, $I^n_t$ and $I^n_{t-1}$ represent visual representations of frames $t$ and $t-1$, respectively, and $V^n_t$ denotes the spatial visual context representation after Average Pooling. Here, $F_n ∈ R^{1×C}$ is the fused visual representation, where $C$ denotes the feature dimension. The fused visual representation $F_n$ not only models the visual context of driving video clips, but also characterizes the dynamic changes in the temporal dimension, which is beneficial for perceiving and understanding driving behaviors.\nC. Attentive Anomaly Focusing Mechanism # Different types of traffic anomalies tend to exhibit distinct characteristics. For instance, anomalies involving the ego-vehicle are often accompanied by global jitter from the dashboard camera, whereas anomalies involving non-ego vehicles typically cause anomalies in local regions of the driving scene. Blindly encoding the entire driving scene may reduce the discriminability of driving events and impede the ability to detect various types of traffic anomalies. Therefore, adaptively focusing on the visual context of interest is critical to perceiving different types of traffic anomalies.\nIn our work, we propose an attentive anomaly focusing mechanism (AAFM). The fundamental idea is to decouple the visual context visually and linguistically, to guide the model to adaptively focus on the visual content of interest.\nSpecifically, we carefully design two focusing strategies: the visually focused strategy (VFS) and the linguistically focused strategy (LFS). 
The former utilizes visual representations with global context to concentrate on the most semantically relevant visual context, while the latter adaptively focuses on visual contexts that are most relevant to text prompts through the guidance of language.\nVisually Focused Strategy (VFS): The spatial visual representation inherently captures the global context. Utilizing the attention of the visual representation towards the deep features of various regions in the driving scene enables a focus on the most semantically relevant visual content. Specifically, as shown in Fig. 2, we focus on and weight the deep features of interest by using cross-attention (CA) on the spatial visual context representation $V^n_t$ and the deep features of the video clip, which can be written as:\nwhere $Q$, $K$, and $V$ are linear transformations, $P ∈ R^{(h·w)×C}$ is the deep feature map of the video clip, $(h, w)$ represents the size of the feature map, and $c$ is the scaling factor, i.e., the square root of the feature dimension. Note that, for transformer-based visual encoders, $V^n_t$ is represented by the class token, and $P$ is represented by the patch tokens. $VFR_n ∈ R^{1×C}$ denotes the visually focused representation of the $n$-th video clip. Since the spatial visual representation encodes global context, focusing on its most relevant visual content helps guide the model to perceive the semantics of the driving scene. As shown in Fig. 3(b), our VFS can adaptively focus on the crucial scene semantics in the driving scene. Such attention helps to detect traffic anomalies involving the ego-vehicle, especially the loss of control of the ego-vehicle (case 1 in Fig. 3).\nLinguistically Focused Strategy (LFS): Intuitively, the fine-grained text prompts clearly define the subjects, objects, and traffic types involved in the traffic events. 
In contrast to general text prompts (listed as a1 and a2 in Table I), utilizing fine-grained text prompts helps guide the model to focus on relevant visual contexts, thereby improving the comprehension of various traffic anomalies. Therefore, to facilitate the model\u0026rsquo;s adaptive perception of relevant visual context, we further design a linguistically focused strategy. The core idea is to utilize the carefully designed fine-grained text prompts (listed as b1 to b4 in Table I) to guide the model to adaptively focus on the visual context of interest, thereby enhancing the understanding of traffic anomalies. Specifically, first, we categorize traffic events into four groups based on their types. Second, we further categorize each type of traffic event according to the different subjects (i.e., ego or non-ego vehicle) and objects (i.e., vehicle, pedestrian, or obstacle) involved. Finally, we define a total of 11 types of fine-grained text prompts, as summarized in Table I from b1 to b4. Note that the DoTA dataset used in our experiments is annotated with 9 types of traffic anomalies, as shown in Table II, with each anomaly encompassing both ego-involved and non-ego traffic anomalies. With the defined fine-grained text prompts, we apply the textual encoder in CLIP to extract the fine-grained text representation as follows:\nwhere $F_{te}$ is the textual encoder with parameters $ξ_{te}$, $t_m$ ($m ∈ [1, 11] ∩ Z$) denotes the $m$-th fine-grained text prompt, and $T′_m$ represents the corresponding text representation. As we can see, the fine-grained text prompts describe the subjects and objects involved in a traffic event in a video frame, as well as the event type, which helps to focus on the visual regions in the driving scene where the traffic event occurred. Therefore, we further propose to leverage the similarity of the fine-grained text representation with each deep feature of the video clip to focus on the most relevant visual context of the text prompt. 
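The focusing step shared by VFS and LFS can be sketched as a single-query attention over the deep features of a clip. This is an illustrative simplification rather than the authors' code: the learned linear transformations Q, K, and V are replaced by identity maps, and the function names `softmax` and `focus` are hypothetical.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def focus(query, features):
    # query: list of C floats (a visual or soft text representation).
    # features: h*w rows of C floats (deep features of the clip).
    # Assumption: identity projections stand in for the learned Q, K, V.
    scale = math.sqrt(len(query))  # square root of the feature dimension
    scores = [
        sum(q * f for q, f in zip(query, row)) / scale for row in features
    ]
    weights = softmax(scores)
    focused = [
        sum(w * row[c] for w, row in zip(weights, features))
        for c in range(len(query))
    ]
    return focused, weights


# Toy example: three regions, the first one aligned with the query.
query = [1.0, 0.0]
regions = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
focused, weights = focus(query, regions)
print(weights)
```

The region most similar to the query receives the largest weight, so the focused representation is dominated by it. Swapping the query between a visual representation and a text-derived one is what distinguishes the two strategies in this sketch.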
Note that in the driving scenario, we do not have direct access to realistic text prompts that match the driving video. To solve this problem, we leverage the similarity between the visual representation $F_n$ and the fine-grained text representations to weight the text representations, and obtain the soft text representation as follows:\nwhere $A^m_n$ is the cosine similarity between the $n$-th visual representation $F_n$ and the $m$-th fine-grained text representation $T′_m ∈ R^{1×C}$. After obtaining the soft text representation $T_{soft} ∈ R^{1×C}$, similar to Section III-C.1, we can further focus on the most semantically relevant visual context of the text description based on the cross-attention (CA) on the soft text representation $T_{soft}$ and deep features $P$, which is denoted as:\n$LFR_n ∈ R^{1×C}$ represents the linguistically focused representation of the $n$-th video clip, which focuses on the visual context that is most relevant to the soft text representation $T_{soft}$. Moreover, Fig. 3(c) shows that our LFS can indeed adaptively concentrate on road participants potentially linked to anomalies. This capability is crucial for identifying local anomalies in driving scenarios arising from non-ego vehicles (case 2 in Fig. 3).\nFinally, we enhance the visual representation $F_n$ of driving videos by fusing it with the visually and linguistically focused representations. Formally, it can be expressed as:\nwhere $F_{fusion}$ is the fusion layer composed of multi-layer perceptrons with parameters $ξ_f$. $F′_n$ is an enhanced visual representation that not only adaptively focuses on the visual contexts of interest but also more comprehensively characterizes the driving video clip in the spatio-temporal domain. Moreover, such representations facilitate the alignment of visual representations with general text prompts, thus improving the detection of traffic anomalies.\nTABLE I SUMMARY OF WELL-DESIGNED TEXT PROMPTS\nD. Contrastive Learning Strategy and Inference Process # In this section, we introduce the contrastive learning strategy of the proposed TTHF framework for cross-modal learning and present how to perform traffic anomaly detection.\nSuppose that there are $N$ video clips in the batch; we denote:\nwhere $F$ is the visual representation of the $N$ video clips and $F′$ represents the enhanced visual representation. For text prompts, we denote:\nwhere $T$ denotes the matched general text representations of the $N$ video clips and $T′$ the matched fine-grained text representations. Note that $T_n$ and $T′_n$ denote the high-dimensional representations of one of the $D$ predefined text prompts. In our case, $D = 2$ for general text prompts and $D = 11$ for fine-grained text prompts. To better understand abstract concepts of traffic anomalies, we first perform contrastive learning to align visual representations $F$ with fine-grained text representations $T′$. Formally, the objective loss along the visual axis can be expressed as:\nFor the $j$-th trained text representation $T_j$, it may actually match more than one visual representation. Symmetrically, we can calculate the loss along the text axis by:\nwhere $τ$ is a learned temperature parameter [13]. Similarly, we further apply contrastive learning to align the enhanced visual representations with the general text representations. The calculations along the visual and textual axes are as follows:\nThe overall loss then becomes:\nThe inference procedure is similar to the training procedure. For the $i$-th testing driving video clip, our TTHF first extracts the visual representation $F_i$ and the enhanced visual representation $F′_i$. For text prompts, the text encoder constructs 11 fine-grained text representations $T′$ = {$T′_1$, $T′_2$, ..., $T′_{11}$} and 2 general text representations $T$ = {$T_1$, $T_2$}. We then compute the cosine similarity between $F_i$ and $T′$ and between $F′_i$ and $T$, respectively. 
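The similarity step of inference can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the temperature value 0.07 is an assumed placeholder for the learned temperature, and the last helper anticipates the scoring rule stated next in the text (the complement of the average probability assigned to the normal-traffic prompts).

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prompt_probabilities(visual, text_reps, temperature=0.07):
    # Softmax over temperature-scaled cosine similarities to each prompt.
    # Assumption: 0.07 stands in for the learned temperature.
    logits = [cosine(visual, t) / temperature for t in text_reps]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anomaly_score(p_normal_fine, p_normal_general):
    # Complement of the average probability of the two normal prompts.
    return 1.0 - 0.5 * (p_normal_fine + p_normal_general)


# Toy 2-D example with two prompts per level; the normal prompt is
# assumed to be the last entry, mirroring T'_11 and T_2 in the text.
fine_probs = prompt_probabilities([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
general_probs = prompt_probabilities([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
score = anomaly_score(fine_probs[-1], general_probs[-1])
print(score)
```

In this toy case both visual vectors align with the first (abnormal) prompt, so the normal prompts receive little probability and the score is close to 1. In the full framework the fine-grained level uses 11 prompts and the general level uses 2.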
Finally, we calculate the anomaly score for the $i$-th driving video clip as:\n$Score_i = 1 - (S^{11}_f + S^2_g) / 2$\nwhere $S^{11}_f$ represents the cosine similarity after softmax between $F_i$ and $T′_{11}$, and $S^2_g$ denotes the cosine similarity after softmax between $F′_i$ and $T_2$. By taking the complement of the average over the prompts corresponding to normal traffic at different levels, we obtain the final anomaly score $Score_i$.\nIV. EXPERIMENTS AND DISCUSSIONS # In this section, we evaluate the performance of our proposed method on a platform with one NVIDIA 3090 GPU. All experiments were implemented using the PyTorch framework. Our source code and trained models will be publicly available upon acceptance.\nA. Implementation Details # In the experiments, we resize the driving video frames to 224 × 224 and take every two consecutive frames as the input video clip. Unless otherwise noted, in all experimental settings, we adopt ResNet-50 [52] for the visual and high-frequency encoders and Text Transformer [53] for the textual encoder. All of them are initialized with the parameters of CLIP\u0026rsquo;s pre-trained model. Note that during the training phase, we freeze the pre-trained parameters of the visual encoder to prevent the model from overfitting to a specific dataset (e.g., DoTA) while enhancing the generalization of the visual representation. Besides, we optimize the loss functions using the Adam algorithm with batch size 128, learning rate 5e-6, and weight decay 1e-4, and train the framework for 10 epochs. During inference, we evaluate the traffic anomaly score by taking the complement of the similarity score of normal traffic prompts on both fine-grained and general text prompts.\nB. Dataset and Metrics # Dataset: For the sake of fairness, we evaluate our method on two challenging datasets, namely, DoTA [9] and DADA-2000 [23], following prior works [8], [9], [12]. 
DoTA is the first traffic anomaly video dataset that provides detailed spatio-temporal annotations of anomalous objects for traffic anomaly detection in driving scenarios. The dataset contains 4677 dashcam video clips with a resolution of 1280 × 720 pixels, captured under various weather and lighting conditions. Each video is annotated with the start and end time of the anomaly and assigned to one of nine categories, which we summarize in Table II. The DADA-2000 dataset consists of 2000 dashcam videos with a resolution of 1584 × 660 pixels, each annotated with driver attention and one of 54 anomaly categories. In our experiments, we use the standard train-test split as used in [9] and [23] and other previous works.\nTABLE II TRAFFIC ANOMALY CATEGORY IN THE DOTA DATASET\nMetrics: Following prior works [8], [9], [54], we use the area under the ROC curve (AUC) metric to evaluate the performance of different TAD approaches. The AUC metric is calculated by computing the area under a standard frame-level receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR). A larger AUC indicates better performance.\nC. Competitors # To verify the superiority of the proposed framework, we compare with the following state-of-the-art TAD approaches: ConvAE [10], ConvLSTMAE [11], AnoPred [6], FOL-STD [32], FOL-Ensemble [9], DMMNet [55], SSC-TAD [12], and STFE [8]. Among them, the ConvAE [10] and ConvLSTMAE [11] methods each contain two variants. The variant utilizing the grayscale image as input belongs to the single-stage paradigm, while the variant using optical flow as input belongs to the two-stage paradigm. The AnoPred method [6] also contains two variants. The variant employing the full video frame as input falls within the single-stage paradigm, whereas the variant utilizing pixels of foreground objects belongs to the two-stage paradigm. 
Besides, the DMMNet method [55] follows the single-stage paradigm, while the methods FOL-STD [32], FOL-Ensemble [9], SSC-TAD [12], and STFE [8] fall under the two-stage paradigm. Note that the experimental results for all these methods and their variants are obtained from the published papers [8], [9], and [12]. In addition, we consider a CLIP-like TAD framework, denoted as TTHF-Base, as our baseline approach. This baseline lacks temporal high-frequency modeling and the attentive anomaly focusing mechanism and utilizes only general text prompts for alignment.\nTABLE III THE AUC ↑ (%) OF DIFFERENT APPROACHES ON THE DOTA DATASET\nD. Quantitative Results # Overall Results: We conduct a comparative analysis of TTHF with a wide range of competitors and their variants in terms of the AUC metric. Table III presents the AUC performance of various competitors, along with labels indicating their respective variants (i.e., different inputs) and paradigms employed. Overall, our framework demonstrates superior performance on the DoTA dataset in terms of AUC. Specifically, our method outperforms the previous leading two-stage TAD method, STFE [8], by +5.4% AUC. Although two-stage methods employ a perception algorithm in the first stage to mitigate the impact of the dynamic background resulting from the ego-vehicle movement, and generally outperform single-stage TAD methods [6], [10], [11], such approaches are susceptible to the performance of the perception algorithm in the first stage, potentially leading to error propagation. In contrast, our proposed single-stage TAD method explicitly characterizes dynamic changes by modeling high frequency in the temporal domain, achieving a significant performance improvement over all previous methods and establishing a new state-of-the-art in traffic anomaly detection. Note that our baseline method outperforms all previous single-stage paradigm-based methods by at least +8.3% AUC. 
This is mainly attributed to our introduction of text prompts and the alignment of driving videos with text representations in a high-dimensional space, which facilitates the detection of traffic anomalies.\nPer-Class Results: To investigate the ability of our proposed method to detect traffic anomalies in different categories, we compare the detection performance of different methods for ego-involved and non-ego traffic anomalies. Based on the nine traffic anomaly categories of the DoTA dataset, detailed in Table II, we summarize the AUC performance of the different methods as well as the average AUC in Table IV. Our method achieves significant improvements in all categories of traffic anomalies except ST*, and in particular, achieves an average AUC improvement of at least +9.9% on ego-involved traffic anomalies. This further validates our idea that characterizing dynamic changes in driving scenarios is important for traffic anomaly detection. It also demonstrates the effectiveness of modeling the temporal high frequency of driving videos to characterize the dynamic changes of driving scenes.\nTABLE IV THE AUC ↑ (%) OF DIFFERENT METHODS FOR EACH INDIVIDUAL ANOMALY CLASS ON THE DOTA DATASET. THE ∗ INDICATES NON-EGO ANOMALIES, WHILE EGO-INVOLVED ANOMALIES ARE SHOWN WITHOUT ∗. N/A INDICATES THAT THE AUC PERFORMANCE FOR THE CORRESPONDING CATEGORY IS NOT AVAILABLE. WE BOLD THE BEST PERFORMANCE\nTABLE V THE AUC ↑ (%) OF DIFFERENT METHODS ON THE DADA-2000 DATASET\nGeneralization Performance: To explore the generalization performance of our method for unseen types of traffic anomalies, we perform a generalization experiment on the DADA-2000 dataset. Specifically, we compare the AUC performance of our TTHF and TTHF-Base, without any fine-tuning on the DADA-2000 dataset, with previously trained models, as summarized in Table V. 
As we can see, our proposed TTHF-Base and TTHF methods outperform previously trained TAD methods, bringing at least +0.8% and +4.2% improvement in AUC, respectively, indicating the strong generalization performance of the proposed approach. This is mainly attributed to our introduction of a text-driven video-text alignment strategy for traffic anomaly detection from a new perspective, as well as the proposed attentive anomaly focusing mechanism and temporal high-frequency modeling for traffic anomaly detection.\nE. Qualitative Results # In this subsection, we visualize some examples to further illustrate the detection capability of our TTHF across various types of traffic anomalies and the feasibility of soft text representation in our framework.\nVisualization of Various Types of Traffic Anomalies: As presented in Fig. 5, we show five representative traffic anomalies from top to bottom as examples: a) Another vehicle collides with a vehicle that turns into or crosses a road. b) The ego-vehicle collides with an oncoming vehicle. c) The ego-vehicle collides with another vehicle moving laterally in the same direction. d) The ego-vehicle collides with a waiting vehicle. e) The ego-vehicle is out of control and leaves the roadway to the left. From the above visualization results of different types of traffic anomalies, we can summarize as follows. First, our TTHF exhibits superior detection performance on various types of traffic anomalies. Second, while the most intuitive classification-based approach (it has the same network architecture as the visual encoder of TTHF but directly classifies the visual representation; denoted as Classifier in Fig. 5) also follows a single-stage paradigm, our proposed text-driven TAD approach offers a more comprehensive representation in high-dimensional space than orthogonal one-hot vectors. Consequently, both our proposed TTHF and its variants outperform the Classifier. 
Third, incorporating AAFM allows our method to better perceive different types of traffic anomalies, as evident in Fig. 5 when comparing the Base and AAFM variants across various traffic anomalies. Finally, capturing dynamic changes in driving scenarios significantly enhances traffic anomaly detection. This highlights the effectiveness of our approach in characterizing dynamic changes in driving scenarios by modeling high frequency in the temporal domain.\nVisualization of the Weights Used for Soft Text Representation: We further investigate the feasibility of soft text representations. Specifically, as shown in Fig. 6, we use three cases from the test set as examples. For video frames captured at different moments in driving videos, we visualize the weights employed to compute the soft text representation and compare them with the real fine-grained text representation. From the visualization results, we observe that the text representation associated with the maximum weight (indicated by the darkest red) consistently aligns with the real fine-grained text representation. These results indicate that the way we calculate the soft text representation is effective and reflects the real anomaly category well.\nF. Ablation Investigation # In this subsection, we conduct ablation studies by analyzing how different components of TTHF contribute to traffic anomaly detection on the DoTA dataset.\nFig. 5. The visualization of anomaly score curves for traffic anomaly detection of different variants on the DoTA dataset. The first row of each case shows the extracted video frames of the driving video, where the red boxes mark the object involved in or causing the anomaly. The second row shows the anomaly score curves of different methods on the corresponding whole videos. For brevity, we label the TTHF-Base variant as Base and TTHF-Base with AAFM as AAFM, while Classifier denotes the classification-based TAD method. 
Better viewed in color.\nVariants of Our Architecture: We first evaluate the effectiveness of different components in our TTHF framework, including the visual encoder, the textual encoder, the attentive anomaly focusing mechanism (AAFM), and the temporal high-frequency modeling (THFM). The ablation results are summarized in Table VI. Note that when only the visual encoder is applied, we add a linear classification head after the visual representation. This adaptation formulates the traffic anomaly detection task as a straightforward binary classification task. The results presented in Table VI demonstrate that introducing the linguistic modality and aligning visual and text representations in high-dimensional space greatly facilitates anomaly detection in driving videos compared to the classifier, achieving an AUC improvement of +14.8%. Based on this, the designed AAFM helps guide the model to adaptively focus on the visual context of interest and thus enhances the perception of various types of traffic anomalies. Lastly, incorporating temporal high-frequency modeling to capture the dynamic background during driving significantly improves traffic anomaly detection, resulting in an AUC improvement of +7.9%.\nFig. 6. Visualization of the weights used for computing soft text representations. We present three illustrative cases, each involving video frames captured at different times. These frames are accompanied by the corresponding weight values used in the computation of soft text representations. Notably, we employ a blue-to-red color scale, where increasing redness signifies higher weights. Additionally, we label the ground-truth fine-grained text representations (denoted as T_i) associated with specific frames. Among them, T_1 corresponds to the text \u0026ldquo;The ego vehicle collision with another vehicle\u0026rdquo; (as described in Table I), T_4 corresponds to the text \u0026ldquo;The non-ego vehicle collision with another vehicle\u0026rdquo;, T_7 corresponds to the text \u0026ldquo;The ego vehicle out-of-control and leaving the roadway\u0026rdquo;, and T_11 corresponds to the text \u0026ldquo;The vehicle is running normally on the road\u0026rdquo;.\nTABLE VI ABLATION RESULTS OF DIFFERENT COMPONENTS ON THE DOTA DATASET. NOTE THAT FOR FAIR COMPARISON, IN THE EXPERIMENTS WITHOUT THFM, WE FINE-TUNE THE PARAMETERS OF THE VISUAL ENCODER. A LARGER AUC INDICATES BETTER PERFORMANCE\nTABLE VII ABLATION RESULTS ON HOW AAFM CONTRIBUTES TO TRAFFIC ANOMALY DETECTION ON THE DOTA DATASET. A LARGER AUC INDICATES BETTER PERFORMANCE\nAnalysis of the AAFM: To investigate how the proposed attentive anomaly focusing mechanism (AAFM) contributes to traffic anomaly detection, we perform ablation on each component in the AAFM. The ablation results are presented in Table VII. We can conclude that both the Visually Focused Strategy (VFS) and the Linguistically Focused Strategy (LFS) explicitly guide the model to pay attention to the visual context most relevant to the representations of the visual and linguistic modalities, respectively. This enhances the ability to perceive traffic anomalies with different characteristics, thereby improving traffic anomaly detection in driving videos. Our AAFM achieves the best detection performance when both VFS and LFS are applied.\nTABLE VIII ABLATION RESULTS OF DIFFERENT BACKBONES ON THE DOTA DATASET. A LARGER AUC INDICATES BETTER PERFORMANCE\nNetwork Architecture: Different network architectures of the visual encoder may exhibit different representation capabilities. We now evaluate the performance of traffic anomaly detection when ResNet50 [52], ResNet50×64 [13], ViT-B-32 [56], and ViT-L-14 [56] are used. The results of these visual encoders can be found in Table VIII. 
For the task of traffic anomaly detection in driving videos, we observe that the ResNet-based network achieves comparable performance to the Transformer-based network. Larger models perform slightly better, with ViT-L-14 achieving an AUC of 85.0%. Therefore, considering both computing resources and performance gains, we ultimately chose ResNet50 as our visual encoder in all other experiments.\nFig. 7. Visualization of some bad cases of the proposed TTHF. The first row of each case shows the extracted video frames of the driving video, where the red boxes mark the objects involved in the anomaly. The second row shows the anomaly score curves of different methods on the corresponding whole videos. Better viewed in color.\nG. Discussion # In this subsection, we discuss the limitations of the proposed framework. We experimentally found that the detection accuracy of our proposed method needs improvement for two specific cases: 1) long-distance observation of traffic anomalies; and 2) subtle traffic anomalies involving other vehicles when the ego-vehicle is stationary. Fig. 7 shows several cases where the accuracy of our method needs to be further improved. In the first scenario, another vehicle at a distance collides with a turning or crossing vehicle. The second scenario depicts a distant vehicle losing control and veering to the left side of the road. The third scenario involves a slowly reversing vehicle scraping against other stationary vehicles. By analyzing the anomaly score curves in Fig. 7, we can conclude that our method faces challenges primarily because the traffic anomalies in these scenarios involve non-ego vehicles and occupy only small anomalous regions. These anomalies include small local anomalies caused by abnormal non-ego vehicles at a distance, and slow and slight traffic anomalies observed for other vehicles when the ego-vehicle is at rest. 
For these slight traffic anomalies, modeling the dynamic changes of the driving scene and using text guidance may not focus the model well on the corresponding abnormal regions. This also explains why our method detects non-ego-involved traffic anomalies less accurately than ego-involved ones, especially ST* in Table IV. Despite the significant improvement of our approach over previous TAD methods, addressing these more challenging traffic anomalies undoubtedly requires a greater effort from the community.\nV. CONCLUSION # This paper has proposed an accurate single-stage TAD framework. For the first time, this framework introduces visual-text alignment to address the traffic anomaly detection task for driving videos. Notably, we verified that modeling the high frequency of driving videos in the temporal domain helps to characterize the dynamic changes of the driving scene and enhance the visual representation, thereby greatly facilitating the detection of traffic anomalies. In addition, the experimental results demonstrated that the proposed attentive anomaly focusing mechanism is indeed effective in guiding the model to adaptively focus on the visual content of interest, thereby enhancing the ability to perceive different types of traffic anomalies. Although extensive experiments have demonstrated that the proposed TTHF substantially outperforms state-of-the-art competitors, more effort is required to accurately detect the more challenging slight traffic anomalies.\nREFERENCES # [1] Z. Yuan, X. Song, L. Bai, Z. Wang, and W. Ouyang, \u0026ldquo;Temporal-channel transformer for 3D LiDAR-based video object detection for autonomous driving,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 4, pp. 2068–2078, Apr. 2022.\n[2] L. Claussmann, M. Revilloud, D. Gruyer, and S. Glaser, \u0026ldquo;A review of motion planning for highway autonomous driving,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 
21, no. 5, pp. 1826–1848, May 2020.\n[3] M. Jeong, B. C. Ko, and J.-Y. Nam, \u0026ldquo;Early detection of sudden pedestrian crossing for safe driving during summer nights,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 6, pp. 1368–1380, Jun. 2017.\n[4] L. Yue, M. A. Abdel-Aty, Y. Wu, and A. Farid, \u0026ldquo;The practical effectiveness of advanced driver assistance systems at different roadway facilities: System limitation, adoption, and usage,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 21, no. 9, pp. 3859–3870, Sep. 2020.\n[5] Y. Yuan, D. Wang, and Q. Wang, \u0026ldquo;Anomaly detection in traffic scenes via spatial-aware motion reconstruction,\u0026rdquo; IEEE Trans. Intell. Transp. Syst. , vol. 18, no. 5, pp. 1198–1209, May 2017.\n[6] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection—A new baseline,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6536–6545.\n[7] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. Li, \u0026ldquo;A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 13588–13597.\n[8] Z. Zhou, X. Dong, Z. Li, K. Yu, C. Ding, and Y. Yang, \u0026ldquo;Spatio-temporal feature encoding for traffic accident detection in VANET environment,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 23, no. 10, pp. 19772–19781, Oct. 2022.\n[9] Y. Yao et al., \u0026ldquo;DoTA: Unsupervised detection of traffic anomaly in driving videos,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 444–459, Jan. 2023.\n[10] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, \u0026ldquo;Learning temporal regularity in video sequences,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 733–742.\n[11] Y. S. Chong and Y. H. 
Tay, \u0026ldquo;Abnormal event detection in videos using spatiotemporal autoencoder,\u0026rdquo; in Proc. Adv. Neural Netw., 2017, pp. 189–196.\n[12] J. Fang, J. Qiao, J. Bai, H. Yu, and J. Xue, \u0026ldquo;Traffic accident detection via self-supervised consistency learning in driving scenarios,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 23, no. 7, pp. 9601–9614, Jul. 2022.\n[13] A. Radford et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in Proc. Int. Conf. Mach. Learn., vol. 139, 2021, pp. 8748–8763.\n[14] C. Jia et al., \u0026ldquo;Scaling up visual and vision-language representation learning with noisy text supervision,\u0026rdquo; in Proc. Int. conf. mach. learn. , vol. 139, 2021, pp. 4904–4916.\n[15] Y. Yang et al., \u0026ldquo;Attentive mask CLIP,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 2771–2781.\n[16] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, \u0026ldquo;Open-vocabulary object detection via vision and language knowledge distillation,\u0026rdquo; in Proc. Int. Conf. Learn. Represent., 2022, pp. 1–21.\n[17] J. Xu et al., \u0026ldquo;GroupViT: Semantic segmentation emerges from text supervision,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 18113–18123.\n[18] S. Chen, Q. Xu, Y. Ma, Y. Qiao, and Y. Wang, \u0026ldquo;Attentive snippet prompting for video retrieval,\u0026rdquo; IEEE Trans. Multimedia, vol. 26, pp. 4348–4359, 2024.\n[19] M. Wang, J. Xing, and Y. Liu, \u0026ldquo;ActionCLIP: A new paradigm for video action recognition,\u0026rdquo; 2021, arXiv:2109.08472 .\n[20] H. Luo et al., \u0026ldquo;CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning,\u0026rdquo; Neurocomputing, vol. 508, pp. 293–304, Oct. 2022.\n[21] H. Rasheed, M. U. Khattak, M. Maaz, S. Khan, and F. S. Khan, \u0026ldquo;Fine-tuned CLIP models are efficient video learners,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. 
Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 6545–6554.\n[22] Y. Li et al., \u0026ldquo;Learning hierarchical fingerprints via multi-level fusion for video integrity and source analysis,\u0026rdquo; IEEE Trans. Consum. Electron. , early access, pp. 1–11, 2024, doi: 10.1109/TCE.2024.3357977.\n[23] J. Fang, D. Yan, J. Qiao, J. Xue, and H. Yu, \u0026ldquo;DADA: Driver attention prediction in driving accident scenarios,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 4959–4971, Jun. 2022.\n[24] Y. Zhong, X. Chen, Y. Hu, P. Tang, and F. Ren, \u0026ldquo;Bidirectional spatiotemporal feature learning with multiscale evaluation for video anomaly detection,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 12, pp. 8285–8296, Dec. 2022.\n[25] M. I. Georgescu, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, \u0026ldquo;A background-agnostic framework with adversarial training for abnormal event detection in video,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell. , vol. 44, no. 9, pp. 4505–4523, Sep. 2022.\n[26] S. Zhang et al., \u0026ldquo;Influence-aware attention networks for anomaly detection in surveillance videos,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol. , vol. 32, no. 8, pp. 5427–5437, Aug. 2022.\n[27] X. Zeng, Y. Jiang, W. Ding, H. Li, Y. Hao, and Z. Qiu, \u0026ldquo;A hierarchical spatio-temporal graph convolutional neural network for anomaly detection in videos,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 200–212, Jan. 2023.\n[28] C. Huang et al., \u0026ldquo;Self-supervised attentive generative adversarial networks for video anomaly detection,\u0026rdquo; IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 11, pp. 9389–9403, Nov. 2023.\n[29] Y. Gong, C. Wang, X. Dai, S. Yu, L. Xiang, and J. Wu, \u0026ldquo;Multiscale continuity-aware refinement network for weakly supervised video anomaly detection,\u0026rdquo; in Proc. IEEE Int. Conf. Multimedia Expo. (ICME) , Jul. 2022, pp. 
1–6.\n[30] Y. Wang, X. Luo, and Z. Zhou, \u0026ldquo;Contrasting estimation of pattern prototypes for anomaly detection in urban crowd flow,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., early access, Jan. 31, 2024, doi: 10.1109/TITS.2024.3355143.\n[31] Y. Yuan, J. Fang, and Q. Wang, \u0026ldquo;Incrementally perceiving hazards in driving,\u0026rdquo; Neurocomputing, vol. 282, pp. 202–217, Mar. 2018.\n[32] Y. Yao, M. Xu, Y. Wang, D. J. Crandall, and E. M. Atkins, \u0026ldquo;Unsupervised traffic accident detection in first-person videos,\u0026rdquo; in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 273–280.\n[33] G. Sun, Z. Liu, L. Wen, J. Shi, and C. Xu, \u0026ldquo;Anomaly crossing: New horizons for video anomaly detection as cross-domain few-shot learning,\u0026rdquo; 2021, arXiv:2112.06320 .\n[34] R. Liang, Y. Li, Y. Yi, J. Zhou, and X. Li, \u0026ldquo;A memory-augmented multitask collaborative framework for unsupervised traffic accident detection in driving videos,\u0026rdquo; 2023, arXiv:2307.14575 .\n[35] K. He, G. Gkioxari, P. Dollár, and R. Girshick, \u0026ldquo;Mask R-CNN,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.\n[36] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, \u0026ldquo;FlowNet 2.0: Evolution of optical flow estimation with deep networks,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1647–1655.\n[37] N. Wojke, A. Bewley, and D. Paulus, \u0026ldquo;Simple online and realtime tracking with a deep association metric,\u0026rdquo; in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 3645–3649.\n[38] R. Mur-Artal and J. D. Tardós, \u0026ldquo;ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,\u0026rdquo; IEEE Trans. Robot. , vol. 33, no. 5, pp. 1255–1262, Oct. 2017.\n[39] R. Liang, Y. Li, J. Zhou, and X. 
Li, \u0026ldquo;STGlow: A flow-based generative framework with dual-graphormer for pedestrian trajectory prediction,\u0026rdquo; IEEE Trans. Neural Netw. Learn. Syst., early access, pp. 1–14, 2024, doi: 10.1109/TNNLS.2023.3294998.\n[40] L. Yao et al., \u0026ldquo;DetCLIPv2: Scalable open-vocabulary object detection pre-training via word-region alignment,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2023, pp. 23497–23506.\n[41] Z. Zhou, Y. Lei, B. Zhang, L. Liu, and Y. Liu, \u0026ldquo;ZegCLIP: Towards adapting CLIP for zero-shot semantic segmentation,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 11175–11185.\n[42] A. Baldrati, L. Agnolucci, M. Bertini, and A. Del Bimbo, \u0026ldquo;Zero-shot composed image retrieval with textual inversion,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 15338–15347.\n[43] M. Tschannen, B. Mustafa, and N. Houlsby, \u0026ldquo;CLIPPO: Image-and-language understanding from pixels only,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 11006–11017.\n[44] S. Nag, X. Zhu, Y.-Z. Song, and T. Xiang, \u0026ldquo;Zero-shot temporal action detection via vision-language prompting,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis., 2022, pp. 681–697.\n[45] Y. Ma, G. Xu, X. Sun, M. Yan, J. Zhang, and R. Ji, \u0026ldquo;X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval,\u0026rdquo; in Proc. 30th ACM Int. Conf. Multimedia, Oct. 2022, pp. 638–647.\n[46] W. Wu, Z. Sun, and W. Ouyang, \u0026ldquo;Revisiting classifier: Transferring vision-language models for video recognition,\u0026rdquo; in Proc. AAAI Conf. Art. Intel., vol. 37, 2023, pp. 2847–2855.\n[47] B. Ni et al., \u0026ldquo;Expanding language-image pretrained models for general video recognition,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 1–18.\n[48] P. 
Wu et al., \u0026ldquo;VadCLIP: Adapting vision-language models for weakly supervised video anomaly detection,\u0026rdquo; 2023, arXiv:2308.11681.\n[49] R. Zhang, Z. Zeng, Z. Guo, and Y. Li, \u0026ldquo;Can language understand depth?\u0026rdquo; in Proc. 30th ACM Int. Conf. Multimedia, Oct. 2022, pp. 6868–6874.\n[50] Z. Liang, C. Li, S. Zhou, R. Feng, and C. C. Loy, \u0026ldquo;Iterative prompt learning for unsupervised backlit image enhancement,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 8094–8103.\n[51] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, \u0026ldquo;Conditional prompt learning for vision-language models,\u0026rdquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jul. 2022, pp. 16816–16825.\n[52] K. He, X. Zhang, S. Ren, and J. Sun, \u0026ldquo;Deep residual learning for image recognition,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2016, pp. 770–778.\n[53] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, \u0026ldquo;Language models are unsupervised multitask learners,\u0026rdquo; OpenAI Blog, vol. 1, no. 8, pp. 1–9, 2019.\n[54] D. Gong et al., \u0026ldquo;Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,\u0026rdquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1705–1714.\n[55] S. Li, J. Fang, H. Xu, and J. Xue, \u0026ldquo;Video frame prediction by deep multi-branch mask network,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 4, pp. 1283–1295, Apr. 2021.\n[56] A. Dosovitskiy et al., \u0026ldquo;An image is worth 16×16 words: Transformers for image recognition at scale,\u0026rdquo; in Proc. Int. Conf. Learn. Represent., 2021, pp. 1–22.\nRongqin Liang (Student Member, IEEE) received the B.Eng. degree in communication engineering from Wuyi University, Guangdong, China, in 2018, and the M.S. 
degree in information and communication engineering from Shenzhen University, Shenzhen, China, in 2021, where he is currently pursuing the Ph.D. degree with the College of Electronics and Information Engineering. His current research interests include trajectory prediction, anomaly detection, computer vision, and deep learning.\nYuanman Li (Senior Member, IEEE) received the B.Eng. degree in software engineering from Chongqing University, Chongqing, China, in 2012, and the Ph.D. degree in computer science from the University of Macau, Macau, in 2018. From 2018 to 2019, he was a Post-Doctoral Fellow with the State Key Laboratory of Internet of Things for Smart City, University of Macau. He is currently an Assistant Professor with the College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China. His current research interests include multimedia security and forensics, data representation, computer vision, and machine learning.\nJiantao Zhou (Senior Member, IEEE) received the B.Eng. degree from the Department of Electronic Engineering, Dalian University of Technology, Dalian, China, in 2002, the M.Phil. degree from the Department of Radio Engineering, Southeast University, Nanjing, China, in 2005, and the Ph.D. degree from the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, in 2009. He held various research positions at the University of Illinois at Urbana–Champaign, Champaign, IL, USA; The\nHong Kong University of Science and Technology; and McMaster University, Hamilton, ON, Canada. He is currently an Associate Professor with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, and also the Interim Head of the newly established Centre for Artificial Intelligence and Robotics. He holds four granted U.S. patents and two granted Chinese patents. 
His research interests include multimedia security and forensics, multimedia signal processing, artificial intelligence, and big data. He has coauthored two papers that received the Best Paper Award from the IEEE Pacific-Rim Conference on Multimedia in 2007 and the Best Student Paper Award from the IEEE International Conference on Multimedia and Expo in 2016. He is serving as an Associate Editor for IEEE TRANSACTIONS ON IMAGE PROCESSING and IEEE TRANSACTIONS ON MULTIMEDIA .\nXia Li (Member, IEEE) received the B.S. and M.S. degrees in electronic engineering and signal and information processing (SIP) from Xidian University, Xi\u0026rsquo;an, China, in 1989 and 1992, respectively, and the Ph.D. degree from the Department of Information Engineering, The Chinese University of Hong Kong, in 1997. She is currently a member of Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University. Her research interests include intelligent computing and its applications, image processing, and pattern recognition.\n","date":"17 April 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/text-driven_traffic_anomaly_detection_with_temporal_high-frequency_modeling_in_driving_videos/","section":"Papers","summary":"The paper introduces TTHF, a novel single-stage method aligning video clips with text prompts for traffic anomaly detection. It emphasizes modeling high frequency in the temporal domain to capture dynamic changes in driving scenes, and proposes an attentive anomaly focusing mechanism to enhance detection of various traffic anomalies. 
The approach leverages visual-text semantic alignment, modeling temporal high frequency, and guided attention mechanisms, achieving superior performance on benchmark datasets.","title":"Text-Driven Traffic Anomaly Detection With Temporal High-Frequency Modeling in Driving Videos","type":"other"},{"content":"","date":"17 April 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xia-li/","section":"Authors","summary":"","title":"Xia Li","type":"authors"},{"content":"","date":"17 April 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yuanman-li/","section":"Authors","summary":"","title":"Yuanman Li","type":"authors"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/apostolos-ntelopoulos/","section":"Authors","summary":"","title":"Apostolos Ntelopoulos","type":"authors"},{"content":" CALLM: Cascading Autoencoder and Large Language Model for Video Anomaly Detection # 1 st Apostolos Ntelopoulos Dept. of Electronic Systems Aalborg University Aalborg, Denmark ntelopoulosa@gmail.com\nAbstract—This paper introduces a new approach using a 3D Deep Autoencoder and a Large Visual Language Model (LVLM) to bridge the gap between video data and multi-modal models for Video Anomaly Detection. The study explores the limitations of previous architectures, particularly their lack of expertise when encountering out-of-distribution instances. By integrating an autoencoder and an LVLM in the same pipeline, this method predicts an abnormality\u0026rsquo;s presence and provides a detailed explanation. Moreover, this can be achieved by employing binary classification and automatically prompting a new query. Testing reveals that the inference capability of the system offers a promising solution to the shortcomings of industrial models. However, the lack of high-quality instruction-follow video data for anomaly detection necessitates a weakly supervised method. 
Current limitations from the LLM domain, such as object hallucination and low physics perception, are acknowledged, highlighting the need for further research to improve model design and data quality for the video anomaly detection domain.\nKeywords—cascade, video, anomaly, detection, autoencoder, multimodal, LLM\n2 nd Kamal Nasrollahi Visual Analysis \u0026amp; Perception Lab Aalborg University \u0026amp; Milestone Systems Copenhagen, Denmark kn@create.aau.dk\n979-8-3315-4184-2/24/$31.00 ©2024 IEEE\nI. INTRODUCTION # In the past few years, large vision-language models have attempted to connect multiple modalities and progressively leverage existing LLMs as their semantic backbone in textual form. Moreover, recent video-language models such as Video-LLaMA [1], VideoBERT [2], Video-ChatGPT [3], Video-LLaVA [4] and MiniGPT4Video [5] aim to solve complex tasks, such as text-to-video retrieval and video question answering, by utilizing a pretrained visual backbone and aligning it with the frozen weights of an LLM, typically through the integration of video-caption data. On the other hand, video anomaly detection is a well-researched topic which involves identifying events that deviate from what is commonly observed. Classic video anomaly detection algorithms rely on motion detection [6] [7] [8], temporal-spatial feature outliers [9] [10] [11], or skeleton trajectories [12] [13] [14]. However, although these techniques achieve adequate accuracy and can quantify uncertainty, they often falter in detecting events that lie beyond the scope of their training data, resulting in out-of-distribution errors. 
Thus, this paper explores a new way to address this limitation by testing the possible application of video-language models in connection with a baseline system.\nThe main contributions of this work are as follows:\nA novel architecture that integrates a cascade system combining a simple one-class classification (OCC) network with a Large Video-Language Model.\nA weakly supervised method for extracting video caption data from a pre-trained model, along with a technique for injecting pseudo-instructions to improve performance in downstream tasks.\nA comprehensive investigation into the capabilities of a state-of-the-art (SoTA) Video-Language Model for video anomaly detection, providing new insights and benchmarks for future research.\nII. RELATED WORK # VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection\nRecently, a work named VadCLIP [15] investigated the application of the pre-trained language-image model CLIP [16] to exploit the fine-grained dependencies between text and images under weakly supervised learning with video-level labels. In particular, it outperformed prior weakly supervised models by introducing two core modules for video anomaly detection: a local module, which captures local features using a transformer encoder layer on top of the frame-level features coming from the frozen CLIP image encoder, applied to equal-size windows that overlap along the temporal dimension; and a global module, which computes the relation between video frames. The former module extracts the local features without aggregating information across windows, whereas the latter module captures the global information between the video frames. 
It consists of a lightweight Graph Convolutional Network (GCN) which uses the frame-wise cosine similarity to build its adjacency matrix and capture the global temporal dependencies between the frame-level video features.\nThe frame-wise cosine similarity used to calculate the adjacency matrix H_sim is presented as follows:\nH_sim = (X_l X_l^T) / (‖X_l‖_2 · ‖X_l‖_2)\nwhere H_sim is the adjacency matrix of the GCN and X_l is the frame-level video feature obtained from the local module.\nVideo Anomaly Detection and Explanation via Large Language Models\nAnother research paper on Video Anomaly Detection proposed the addition of a Long-Term Context (LTC) module along with a pre-trained LLM to improve anomaly detection in long videos, as opposed to short or normal videos, while also enabling the generation of textual explanations for events [17]. The process begins by dividing the input video into m segments, followed by clip-level feature extraction using a Video Encoder (VE) based on a foundation model, i.e., BLIP-2 [18], which is trained on a large and diverse corpus of images and videos. Next, a simple anomaly prediction model is co-trained with the LTC module. The clip-level features and their anomaly scores are then used to build the anomaly prompt module, which generates text that is conditioned on the video content and reports the anomaly status accordingly. The LTC module collects these clip-level features from the VE and updates them online while stacking them into concatenated lists. The list of normal features is denoted as N = {n_j}, whereas the list of features with high anomaly scores is denoted as A = {a_j}, where j = 1, 2, ..., K indexes the clips. These lists are simultaneously updated based on each clip's anomaly score and aggregated via cross-attention with the pre-trained VE feature to prevent catastrophic forgetting after the fine-tuning stage. 
Therefore, when a new video is introduced into the system, the VE handles the feature extraction so that the LTC can immediately update the content of the lists with the new anomaly scores. The final system is able to describe a video and pinpoint the timestamp of the abnormal incident.\nIII. METHOD # The cascade system aims to deploy a 3D autoencoder and empower it with video-language understanding as an auxiliary mechanism. As depicted in Figure 1, the cascade consists of two separate stages, the 3D autoencoder and the fine-tuned Video-LLaMA [19]. In this section, the overall architecture is described, as well as the training goals through the different stages.\nFig. 1. The video is divided into individual frames. These frames are processed by the AE to decide whether to proceed to generate the final prediction with Video-LLaMA.\nA. 1st Stage. 3D Autoencoder # After processing an input video into individual frames, a 3D Autoencoder is implemented to classify each individual frame based on a manually selected threshold. It has been shown in various studies [20] [21] [22] [23] [24] [25] that deep autoencoders are able to learn the patterns of the training data in a low-dimensional space and reconstruct an approximation of a new image using the pre-trained weights. This capability is leveraged in the anomaly detection domain, since evaluating the reconstruction error makes it feasible to classify an event as anomalous if it deviates from the normal pattern. Furthermore, in the video domain, it is necessary to preserve the temporal information that is included. Thus, a 3D Autoencoder is adopted [26] [21], employing time as the third dimension. The AE is trained separately, learning the distribution of the training dataset and minimizing the Mean Squared Error (MSE) loss as follows:\nL_MSE = (1/N) Σ_{i=1}^{N} (x_i − x̂_i)^2\nwhere N is the total number of pixels of the video frame, x_i is the true value of the i-th pixel, and x̂_i is the predicted value of the i-th pixel. 
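As a minimal sketch of the reconstruction-error idea described above, the following Python snippet computes the MSE between a frame and its reconstruction and compares it with a threshold. The function names, the flat lists of pixel values, and the threshold values are illustrative assumptions and are not taken from the paper, which does not publish its implementation here.

```python
def mse(frame, reconstruction):
    """Mean squared error over all pixels of two equally sized flat sequences."""
    if len(frame) != len(reconstruction):
        raise ValueError("frame and reconstruction must have the same number of pixels")
    n = len(frame)  # N: total number of pixels
    return sum((x - x_hat) ** 2 for x, x_hat in zip(frame, reconstruction)) / n


def exceeds_threshold(frame, reconstruction, threshold):
    """Flag a frame when its reconstruction error is above a hand-picked threshold (assumed rule)."""
    return mse(frame, reconstruction) > threshold


# Toy example with four pixel values in [0, 1]; a real frame would be flattened first.
frame = [0.0, 1.0, 0.5, 0.5]
reconstruction = [0.0, 0.0, 0.5, 0.5]
print(mse(frame, reconstruction))                     # 0.25
print(exceeds_threshold(frame, reconstruction, 0.1))  # True
```

In this sketch a higher error means a poorer reconstruction, so frames the autoencoder has not learned to reproduce are the ones flagged for further processing.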
Essentially, the autoencoder has a bottleneck architecture which consists of the same number of layers in the encoder and decoder respectively [27]. For the video anomaly detection task, the implemented autoencoder applies 3D convolution over the input video frame and 3D batch normalization to prevent vanishing/exploding gradient issues at the beginning of the training. To capture higher-level patterns during reconstruction, the Tanh activation function is used instead of ReLU in the last layer of the decoder module, since it is able to extract more complex patterns in the latent space.\nB. 2nd Stage. Video-Language Branch # A video language model, adapted from Video-LLaMA [1], was chosen to process instances that exceeded the anomaly score threshold from the previous stage. Since the cost of pretraining and fine-tuning an LLM is indeed in many cases prohibitive, this method utilizes the alignment of the two frozen modules namely a Visual Encoder [18] and a pre-trained LLM [28] [29]. Specifically, the Visual Encoder consists of a ViT [30] and a Query-Transformer to align the text and image domain [18]. The share of communication between the text and visual transformer that the Q-former consists of, will be able to give a new set of queries to the model and align the frozen modules, as shown in Figure 2.\nFig. 2. Flow Diagram. A pre-trained Image Encoder processes an input image, whereas the Q-former bootstraps the vision-language learning by connecting the two frozen models.\nFig. 3. Video-Language Branch. The video frames are processed from a frozen Visual Encoder which extracts the video frame features. Then, positional embeddings are injected to add temporal information into video frames.\nFig. 4. Following the training of linear layers for frame-level embeddings match the text embeddings\u0026rsquo; dimension. Finally, a binary answer of yes/no is given from the frozen LLM module conditioned on the video input.\nIV. 
EXPERIMENT # The performance comparison follows for each stage accordingly. A baseline is formulated as a standalone 3D autoencoder trained on each downstream dataset. The ensemble method is investigated afterwards, presenting both frame-level and video-level performance.\nA. Baseline Method # This approach consists of the 3D Autoencoder trained only on normal samples of datasets targeted for video surveillance anomaly detection, including CUHK Avenue Dataset [31], Ped2 [32], and ShanghaiTech [33]. The testing split of these datasets includes abnormal moving patterns and behaviour, whereas CUHK Avenue is known to be harder for the network to classify correctly due to the vague video content. The input sequence X for the autoencoder network has a shape of 16 x 1 x 256 x 256, corresponding to the number of frames, number of channels, height, and width of the video frames of the input sequence. Thereafter, the decoder produces a reconstructed sequence of the same shape. The final anomaly score is derived from the Peak Signal-to-Noise Ratio (PSNR), which measures the reconstructed frame quality compared to the true image. For training purposes, the L2 loss is calculated across all 16 frames, but only the 9th frame of the sequence is stored for the PSNR and anomaly score calculation.\nFig. 5. Initial system architecture. A VAD framework that outputs the anomaly score and, depending on the threshold, determines whether an anomaly is present was selected as the baseline.\nTABLE I 3D DEEP AUTOENCODER\nDataset Baseline MULDE [34] SSTML [35] MSMA [36]\nCUHK Avenue 80.46 % 94.3 % 91.5 % 90.2 %\nPed2 91.19 % 99.7 % 97.5 % 99.5 %\nShanghaiTech 70.84 % 86.7 % 82.4 % 84.1 %\na Comparative results of the baseline on the frame-level AUCF. The best results were chosen.\nB. Cascading Autoencoder and Video-LLaMA # A weakly supervised method is adopted by creating captions for the video-text alignment, which are fed to the fine-tuning process. 
BLIP-2 was utilized to extract the frame representation in a zero-shot manner (Fig. 6), and pseudo-instructions were injected afterwards to fine-tune Video-LLaMA on the downstream task of video anomaly detection.\nFig. 6. Selected examples of zero-shot caption generation with BLIP-2. The images are derived from datasets including ShanghaiTech, UCSD Ped2, CUHK Avenue.\nParticularly, phrases for the instruction-follow alignment from the CUHK Avenue Dataset include:\nCaption: List of injected instruction-follow data:\n\u0026ldquo;This is abnormal. Someone is moving strangely in the wrong direction.\u0026rdquo;\n\u0026ldquo;This is abnormal. Someone is running.\u0026rdquo;\n\u0026ldquo;This is abnormal. A cyclist should not be there.\u0026rdquo;\nThese pseudo-instructions, combined only with the testing videos that are labelled as abnormal, are finally added for the 2nd training stage along with the general-purpose datasets LLaVA [37] [38] [39], MiniGPT-4 [40] [41] and VideoChat [42] [43]. Finally, the accuracy comparison in Table II shows adequate performance of the cascade system in classifying a video without having been previously trained on the selected dataset, i.e. Ped2 and ShanghaiTech. The results describe the video-level accuracy score, as Video-LLaMA uniformly samples 8 frames from a video and extracts the video-level representation.\nTABLE II CASCADE ACCURACY COMPARISON\nDataset 0-shot 1-shot\nCUHK Avenue 9.52 33.33\nPed2 8.33 66.66\nShanghaiTech 22.43 64.86\na Accuracy comparison between zero-shot and fine-tuning inference. This table describes the video-level performance.\nV. DISCUSSION # Video-language models are a relatively recent topic and can serve as a general-purpose tool in multiple different application areas. High-quality instruction-follow data plays a major role in reaching the capabilities of other models. 
It is remarkable that even in a zero-shot scenario, numerous actions observed in the testing data, such as throwing paper or navigating a university campus, can be accurately distinguished by the cascade system, but still not predict the true label correctly. Nevertheless, future research should address object hallucination in the frozen LLM and work on enhancing the model\u0026rsquo;s physics perception by incorporating more robust datasets and refining model architectures.\nVI. FUTURE WORK # To address the acknowledged problems in this paper, extensive research should be done on multimodal LLMs, training in a new manner the pretrained language-visual model while also fine-tuning the application task\u0026rsquo;s domain knowledge effectively. Utilizing keywords extracted from actions and objects could potentially enhance the efficiency of the cascade method, while objective metrics such as CHAIR [44] can objectively measure the LLM hallucination.\nVII. CONCLUSION # In this study, experimentation was conducted on the implementation of the cascade system alongside the recent breakthrough of Large Visual Language Models. Although the performance is currently limited within the Video Anomaly Detection domain, there is significant potential for improvement. Acquiring high-quality video-to-caption data could equip the model with the necessary knowledge to fine-tune according to the downstream task by learning the correlations between video-level representations and textual embeddings. This could not only have the potential to boost the model performance, as evidenced in related works but also to offer valuable insights to users analyzing surveillance footage.\nREFERENCES # [1] H. Zhang, X. Li, and L. Bing, \u0026ldquo;Video-llama: An instruction-tuned audio-visual language model for video understanding,\u0026rdquo; arXiv preprint arXiv:2306.02858, 2023.\n[2] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. 
Schmid, \u0026ldquo;Videobert: A joint model for video and language representation learning,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision , 2019, pp. 7464–7473.\n[3] M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, \u0026ldquo;Video-chatgpt: Towards detailed video understanding via large vision and language models,\u0026rdquo; arXiv preprint arXiv:2306.05424, 2023.\n[4] B. Lin, B. Zhu, Y. Ye, M. Ning, P. Jin, and L. Yuan, \u0026ldquo;Video-llava: Learning united visual representation by alignment before projection,\u0026rdquo; arXiv preprint arXiv:2311.10122, 2023.\n[5] K. Ataallah, X. Shen, E. Abdelrahman, et al. , \u0026ldquo;Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens,\u0026rdquo; arXiv preprint arXiv:2404.03413, 2024.\n[6] D. Samariya and A. Thakkar, \u0026ldquo;A comprehensive survey of anomaly detection algorithms,\u0026rdquo; Annals of Data Science, vol. 10, no. 3, pp. 829–850, 2023.\n[7] T.-N. Nguyen and J. Meunier, \u0026ldquo;Anomaly detection in video sequence with appearance-motion correspondence,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1273–1283.\n[8] S. Wang, E. Zhu, J. Yin, and F. Porikli, \u0026ldquo;Video anomaly detection and localization by local motion based joint video representation and ocelm,\u0026rdquo; Neurocomputing, vol. 277, pp. 161–175, 2018.\n[9] Y. Chang, Z. Tu, W. Xie, et al., \u0026ldquo;Video anomaly detection with spatio-temporal dissociation,\u0026rdquo; Pattern Recognition, vol. 122, p. 108 213, 2022.\n[10] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X.-S. Hua, \u0026ldquo;Spatio-temporal autoencoder for video anomaly detection,\u0026rdquo; in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1933–1941.\n[11] R. Tudor Ionescu, S. Smeureanu, B. Alexe, and M. 
Popescu, \u0026ldquo;Unmasking the abnormal events in video,\u0026rdquo; in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2895–2903.\n[12] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, \u0026ldquo;Learning regularity in skeleton trajectories for anomaly detection in videos,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 996–12 004.\n[13] A. Flaborea, L. Collorone, G. M. D. Di Melendugno, S. D\u0026rsquo;Arrigo, B. Prenkaj, and F. Galasso, \u0026ldquo;Multimodal motion conditioned diffusion model for skeletonbased video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 318–10 329.\n[14] A. Flaborea, G. D\u0026rsquo;Amely, S. D\u0026rsquo;Arrigo, M. A. Sterpa, A. Sampieri, and F. Galasso, \u0026ldquo;Contracting skeletal kinematics for human-related video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2301.09489, 2023.\n[15] P. Wu, X. Zhou, G. Pang, et al., \u0026ldquo;Vadclip: Adapting vision-language models for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 6074– 6082.\n[16] A. Radford, J. W. Kim, C. Hallacy, et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in International conference on machine learning, PMLR, 2021, pp. 8748–8763.\n[17] H. Lv and Q. Sun, \u0026ldquo;Video anomaly detection and explanation via large language models,\u0026rdquo; arXiv preprint arXiv:2401.05702, 2024.\n[18] J. Li, D. Li, S. Savarese, and S. Hoi, \u0026ldquo;Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,\u0026rdquo; in International conference on machine learning, PMLR, 2023, pp. 19 730–19 742.\n[19] H. Zhang, X. Li, and L. 
Bing, \u0026ldquo;Video-llama: An instruction-tuned audio-visual language model for video understanding,\u0026rdquo; arXiv preprint arXiv:2306.02858, 2023. [Online]. Available: https://arxiv.org/abs/2306.02858.\n[20] E. T. Nalisnick, A. Matsukawa, Y. W. Teh, D. Görür, and B. Lakshminarayanan, \u0026ldquo;Do deep generative models know what they don\u0026rsquo;t know?\u0026rdquo; ArXiv, vol. abs/1810.09136, 2018.\n[21] D. Gong, L. Liu, V. Le, et al., \u0026ldquo;Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,\u0026rdquo; 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1705–1714, 2019.\n[22] Z. Chen, C. K. Yeo, B. S. Lee, and C. T. Lau, \u0026ldquo;Autoencoder-based network anomaly detection,\u0026rdquo; in 2018 Wireless telecommunications symposium (WTS), IEEE, 2018, pp. 1–5.\n[23] J. An and S. Cho, \u0026ldquo;Variational autoencoder based anomaly detection using reconstruction probability,\u0026rdquo; Special lecture on IE, vol. 2, no. 1, pp. 1–18, 2015.\n[24] C. Zhou and R. C. Paffenroth, \u0026ldquo;Anomaly detection with robust deep autoencoders,\u0026rdquo; in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 665–674.\n[25] Z. Cheng, S. Wang, P. Zhang, S. Wang, X. Liu, and E. Zhu, \u0026ldquo;Improved autoencoder for unsupervised anomaly detection,\u0026rdquo; International Journal of Intelligent Systems, vol. 36, no. 12, pp. 7103–7125, 2021.\n[26] M. Astrid, M. Z. Zaheer, J.-Y. Lee, and S.-I. Lee, \u0026ldquo;Learning not to reconstruct anomalies,\u0026rdquo; in BMVC, 2021.\n[27] G. E. Hinton and R. R. Salakhutdinov, \u0026ldquo;Reducing the dimensionality of data with neural networks,\u0026rdquo; Science, vol. 313, no. 5786, pp. 504–507, 2006.\n[28] W.-L. Chiang, Z. Li, Z. Lin, et al., Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.\n[29] F. Bordes, R. Y. Pang, A. 
Ajay, et al., \u0026ldquo;An introduction to vision-language modeling,\u0026rdquo; arXiv preprint arXiv:2405.17247, 2024.\n[30] Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao, \u0026ldquo;Evaclip: Improved training techniques for clip at scale,\u0026rdquo; arXiv preprint arXiv:2303.15389, 2023.\n[31] C. Lu, J. Shi, and J. Jia, \u0026ldquo;Abnormal event detection at 150 fps in matlab,\u0026rdquo; in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2720–2727.\n[32] W. Li, V. Mahadevan, and N. Vasconcelos, \u0026ldquo;Anomaly detection and localization in crowded scenes,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 1, pp. 18–32, 2013.\n[33] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection–a new baseline,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6536–6545.\n[34] J. Micorek, H. Possegger, D. Narnhofer, H. Bischof, and M. Kozinski, \u0026ldquo;Mulde: Multiscale log-density estimation via denoising score matching for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 868–18 877.\n[35] M.-I. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, \u0026ldquo;Anomaly detection in video via self-supervised and multi-task learning,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 742–12 752.\n[36] A. Mahmood, J. Oliva, and M. Styner, \u0026ldquo;Localizing anomalies via multiscale score matching analysis,\u0026rdquo; arXiv preprint arXiv:2407.00148, 2024.\n[37] H. Liu, C. Li, Q. Wu, and Y. J. Lee, Visual instruction tuning, 2023.\n[38] H. Liu, C. Li, Y. Li, and Y. J. Lee, Improved baselines with visual instruction tuning, 2023.\n[39] H. Liu, C. Li, Y. Li, et al. , Llava-next: Improved reasoning, ocr, and world knowledge, 2024. 
[Online]. Available: https://llava-vl.github.io/blog/2024-01-30llava-next/.\n[40] J. Chen, D. Zhu, X. Shen, et al., \u0026ldquo;Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning,\u0026rdquo; arXiv preprint arXiv:2310.09478 , 2023.\n[41] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, \u0026ldquo;Minigpt-4: Enhancing vision-language understanding with advanced large language models,\u0026rdquo; arXiv preprint arXiv:2304.10592, 2023.\n[42] K. Li, Y. He, Y. Wang, et al., \u0026ldquo;Videochat: Chat-centric video understanding,\u0026rdquo; arXiv preprint arXiv:2305.06355 , 2023.\n[43] Y. Wang, K. Li, Y. Li, et al., \u0026ldquo;Internvideo: General video foundation models via generative and discriminative learning,\u0026rdquo; arXiv preprint arXiv:2212.03191, 2022.\n[44] P. Kaul, Z. Li, H. Yang, et al., \u0026ldquo;Throne: An object-based hallucination benchmark for the free-form generations of large vision-language models,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 
27 228–27 238.\n","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/callm_cascading_autoencoder_and_large_language_model_for_video_anomaly_detection/","section":"Papers","summary":"This paper introduces a novel cascade system combining a 3D Autoencoder with a Large Visual Language Model (LVLM) for video anomaly detection, leveraging weak supervision and multimodal capabilities to improve detection and explanation of abnormalities.","title":"CALLM: Cascading Autoencoder and Large Language Model for Video Anomaly Detection","type":"method"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/changkang-li/","section":"Authors","summary":"","title":"Changkang Li","type":"authors"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/guansong-pang/","section":"Authors","summary":"","title":"Guansong Pang","type":"authors"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/kamal-nasrollahi/","section":"Authors","summary":"","title":"Kamal Nasrollahi","type":"authors"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/lingru-zhou/","section":"Authors","summary":"","title":"Lingru Zhou","type":"authors"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/peng-wang/","section":"Authors","summary":"","title":"Peng Wang","type":"authors"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/peng-wu/","section":"Authors","summary":"","title":"Peng Wu","type":"authors"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/qingsen-yan/","section":"Authors","summary":"","title":"Qingsen Yan","type":"authors"},{"content":" VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection # Peng 
Wu 1, Xuerong Zhou 1, Guansong Pang 2*, Lingru Zhou 1, Qingsen Yan 1, Peng Wang 1*, Yanning Zhang 1\n1 ASGO, School of Computer Science, Northwestern Polytechnical University, China\n2 School of Computing and Information Systems, Singapore Management University, Singapore\n{xdwupeng, zxr2333}@gmail.com, gspang@smu.edu.sg, {lingruzhou, yqs}@mail.nwpu.edu.cn, {peng.wang, ynzhang}@nwpu.edu.cn\nAbstract # The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide range of image-level tasks, revealing remarkable ability for learning powerful visual representations with rich semantics. An open and worthwhile problem is efficiently adapting such a strong model to the video domain and designing a robust video anomaly detector. In this work, we propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) by leveraging the frozen CLIP model directly without any pre-training and fine-tuning process. Unlike current works that directly feed extracted features into the weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP and involves dual branch. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of dual branch, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pretrained knowledge from CLIP to WSVAD task. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. 
Code and features are released at https://github.com/nwpu-zxr/VadCLIP .\nIntroduction # In recent years, weakly supervised video anomaly detection (WSVAD, VAD) has received growing concerns due to its broad application prospects. For instance, with the aid of WSVAD, it is convenient to develop more powerful intelligent video surveillance systems and video content review systems. In WSVAD, the anomaly detector is expected to generate frame-level anomaly confidences with only videolevel annotations provided. The majority of current research in this field follows a systematic process, wherein the initial step is to extract frame-level features using pre-trained visual models, e.g., C3D (Tran et al. 2015; Sultani, Chen, and Shah 2018), I3D (Carreira and Zisserman 2017; Wu\nCorresponding Authors\nCopyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.\nFigure 1: Comparisons of different paradigms for WSVAD.\net al. 2020), and ViT (Dosovitskiy et al. 2020; Li, Liu, and Jiao 2022), followed by feeding these features into multiple instance learning (MIL) based binary classifiers for the purpose of model training, and the final step is to detect abnormal events based on predicted anomaly confidences. Despite their simple schemes and promising results, such a classification-based paradigm fails to take full advantage of cross-modal relationships, e.g, vision-language associations.\nDuring the past two years, we have witnessed great progress in the development of vision-language pre-training (VLP) models (Kim, Son, and Kim 2021; Jia et al. 2021; Wang et al. 2021; Chen et al. 2023a), e.g., CLIP (Radford et al. 2021), for learning more generalized visual representations with semantic concepts. The main idea of CLIP is to align images and texts by contrastive learning, that is, pull together images and matched textual descriptions while pushing away unmatched pairs in the joint embedding space. 
Thanks to hundreds of million noisy image-text pairs crawled from the web, such models pre-trained at a large scale really demonstrate their strong representation learning as well as associations between vision and language. In view of the breakthrough performance of CLIP, recently, building task-specific models on top of CLIP is becoming emerging research topics and applied to a broad range of vision tasks, and these models achieve unprecedented performance.\nAlthough CLIP and its affiliated models demonstrate the great potential on various vision tasks, these methods mainly focus on the image domain. Therefore, how to efficiently adapt such a model learned from image-text pairs to more complex video anomaly detection task under weak supervision deserves a thorough exploration. Recently, a few works (Joo et al. 2023; Lv et al. 2023) attempt to make use of the learned knowledge of CLIP, however, these methods limit their scope to directly using visual features extracted from the image encoder of CLIP, and neglect to exploit semantic relationships between vision and language.\nIn order to make effective use of generalized knowledge and enable CLIP to reach its full potential on WSVAD task, based on the characteristics of WSVAD, there are several critical challenges that need to be addressed. First, it is vital to explore ways to capture contextual dependencies across time. Second, it is essential to determine how to harness learned knowledge and the visual-language connections. Third, it is crucial to maintain optimal CLIP performance under weak supervision.\nIn this work, we propose a novel paradigm based on CLIP for WSVAD, which is dubbed as VadCLIP. VadCLIP consists of several components to overcome the above challenges. Specifically, for the first challenge, we present a local-global temporal adapter (LGT-Adapter), which is a lightweight module for video temporal relation modeling. 
LGT-Adapter involves two components, i.e., local temporal adapter and global temporal adapter, wherein the former mainly captures local temporal dependencies with high efficiency, since in most cases the current events are highly related to the adjacent events, and the latter smooths feature information in a more holistic view with less parameters. For the second challenge, unlike current methods (Joo et al. 2023; Lv et al. 2023) that solely use visual features, we encourage VadCLIP to also leverage textual features to preserve learned knowledge as much as possible. As shown in Figure 1, VadCLIP is devised as a dual-branch fashion, where one simply and directly utilizes visual features for binary classification (C-branch), while the other employs both visual and textual features for language-image alignment (A-branch). Moreover, such dual branch seamlessly achieves coarse-grained and fine-grained WSVAD (Wu, Liu, and Liu 2022). For A-branch, we build bridge between videos and video-level textual labels. Moreover, we propose two prompt mechanisms (Wu et al. 2023), i.e., learnable prompt and visual prompt, to specify that the succinct text is about the video. Learnable prompt does not require extensive expert knowledge compared to the handcrafted prompt, effectively transfers pre-trained knowledge into the downstream WSVAD task. Visual prompt is inspired by that visual contexts can make the text more accurate and discriminate. Imagine that if there is a car in the video, two types of abnormal events of \u0026ldquo;car accident\u0026rdquo; and \u0026ldquo;fighting\u0026rdquo; would be more easily distinguished. Hence, In the visual prompt, we focus on anomaly information in videos and integrate these anomaly-focus visual contents from C-branch with textual labels from A-branch for automatic prompt engineering. Such a practice seamlessly creates connections between dual branch. 
For the third challenge, multiple instance learning (MIL) (Sultani, Chen, and Shah 2018; Wu et al. 2020) is the most commonly used method. For the language-visual alignments in A-branch, we introduce a MIL-Align mechanism, whose core idea is to select the most matched video frames for each label to represent the whole video.\nNote that during training, the weights of CLIP image and text encoders are kept fixed, and the gradients are backpropagated to optimise only the learnable parameters of the devised adapter and prompt modules.\nOverall, the contributions of our work are threefold:\n(1) We present a novel paradigm, i.e., VadCLIP, which involves dual branch to detect video anomaly in visual classification and language-visual alignment manners, respectively. With the benefit of dual branch, VadCLIP achieves both coarse-grained and fine-grained WSVAD. To our knowledge, VadCLIP is the first work to efficiently transfer pretrained language-visual knowledge to WSVAD.\n(2) We propose three non-vital components to address new challenges led by the new paradigm. LGT-Adapter is used to capture temporal dependencies from different perspectives; two prompt mechanisms are devised to effectively adapt the frozen pre-trained model to WSVAD task; MIL-Align realizes the optimization of alignment paradigm under weak supervision, so as to preserve the pre-trained knowledge as much as possible.\n(3) We show the strength and effectiveness of VadCLIP on two large-scale popular benchmarks, and VadCLIP achieves state-of-the-art performance, e.g., it obtains unprecedented results of 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime respectively, surpassing current classification based methods by a large margin.\nRelated Work # Weakly Supervised Video Anomaly Detection # Recently, some researchers (Zaheer et al. 2020; Feng, Hong, and Zheng 2021; Wu et al. 2021; Chen et al. 2023b) have proposed weakly supervised methods for VAD. Sultani et al. 
(Sultani, Chen, and Shah 2018) firstly proposed a deep multiple instance learning model, which considers a video as a bag and its multiple segments as instances. Then several follow-up works made effort to model temporal relations based on self-attention models and transformers. For example, Zhong et al. (Zhong et al. 2019) proposed a graph convolutional network (GCN) based method to model the feature similarity and temporal consistency between video segments. Tian et al. (Tian et al. 2021) used a self-attention network to capture the global temporal context relationship of videos. Li et al. (Li, Liu, and Jiao 2022) proposed a transformer based multi-sequence learning framework, and Huang et al. (Huang et al. 2022) proposed a transformer based temporal representation aggregation framework. Zhou et al. (Zhou, Yu, and Yang 2023) presented a global and local multi-head self attention module for the transformer layer to obtain more expressive embeddings for capturing temporal dependencies in videos. The above methods only detect whether video frames are anomalous, on the contrary, Wu et al. (Wu, Liu, and Liu 2022) proposed a fine-grained WSVAD method, which distinguishes between different types of anomalous frames. More recently, the CLIP model has also attracted great attentions in the VAD community. Based on visual features of CLIP, Lv et al. (Lv et al. 2023) proposed a new MIL framework called Unbiased MIL (UMIL)\nto learn unbiased anomaly features that improve WSVAD performance. Joo et al. (Joo et al. 2023) proposed to employ visual features from CLIP to efficiently extract discriminative representations, and then model long- and short-range temporal dependencies and nominate the snippets of interest by leveraging temporal self-attention. All the above methods are based on the classification paradigm, which detect anomalous events by predicting the probability of anomalous frames. 
However, this classification paradigm does not fully utilize the semantic information of textual labels.\nVision-Language Pre-training # Vision-language pre-training has achieved impressive progress over the past few years, which aims to learn the semantic correspondence between vision and language through pre-training on large-scale data. As one of the most representative works, CLIP has shown impressive performance on a range of vision-language downstream tasks, including image classification (Zhou et al. 2022a), image captioning (Mokady, Hertz, and Bermano 2021), object detection (Zhou et al. 2022b), scene text detection (Yu et al. 2023), dense prediction (Zhou et al. 2022c; Rao et al. 2022), and so on. Recently, some follow-up works attempted to leverage the pre-trained models for video domains. For example, CLIP4Clip (Luo et al. 2022) transferred the knowledge of CLIP model to the video-text retrieval, some works (Wang, Xing, and Liu 2021; Lin et al. 2022; Ni et al. 2022) attempted to take advantages of CLIP for video recognition, furthermore, CLIP is used to tackle the more complex video action localization task (Nag et al. 2022; Ju et al. 2022). More generally, Ju et al. (Ju et al. 2022) presented a simple yet strong baseline to efficiently adapt the pre-trained image-based visual-language model, and exploited its powerful ability for general video understanding. In this work, we deeply explore how to adapt pre-trained vision-language knowledge of CLIP from image-level into video-level downstream WSVAD efficiently.\nMethod # Problem Definition # The WSVAD task supposes that only video-level labels are available during the training stage. Given a video v, if all frames of this video do not contain abnormal events, this video is defined as normal with the label y = 0; Otherwise, if there is at least one frame contains abnormal events, this video is labeled as abnormal with the label y = 1. 
The goal of WSVAD task is to train a detection model that is able to predict frame-level anomaly confidences while only video-level annotations are provided.\nPrevious works generally make use of pre-trained 3D convolutional models, e.g., C3D (Tran et al. 2015) and I3D (Carreira and Zisserman 2017), to extract video features, and then feed these features into MIL-based binary classifiers; such paradigms are referred to as the classification-based paradigm in this paper. Recently, CLIP, as a large-scale language-vision pre-trained model, has revolutionized many fields in computer vision, and has shown great generalization capabilities across a wide range of downstream tasks. Inspired by CLIP, our work not only uses the image encoder of CLIP as the backbone to extract video features, but also attempts to utilize the text encoder of CLIP to take full advantage of the powerful associations between visual contents and textual concepts. Our work is demonstrated in Figure 2.\nLocal and Global Temporal Adapter # As we know, CLIP is pre-trained on large-scale image-text pairs crawled from the web. In this section, we investigate how to model temporal dependencies and bridge the gap between the image domain and video domain for CLIP. Meanwhile, it is also significant to learn long-range and short-range temporal dependencies for WSVAD task (Zhou, Yu, and Yang 2023; Wu and Liu 2021). From the perspective of the efficiency and receptive field, we design a new temporal modeling method compatible with local and global receptive field.\nLocal Module. To capture local temporal dependencies, we introduce a transformer encoder layer on top of frame-level features X_clip ∈ R^{n×d} from the frozen image encoder of CLIP, where n is the length of video, d is the dimension size, which is set as 512 in this work. Note that this layer differs from the ordinary transformer encoder layer since it limits self-attention computation to local windows (Liu et al. 2021) instead of the global scope. 
Specifically, we split the frame-level features into equal-length, overlapping windows along the temporal dimension; self-attention is computed within each window, and no information is exchanged among windows. Such an operation has a local receptive field, like convolution, and lowers the computational complexity.\nGlobal Module. To further capture global temporal dependencies, we introduce a lightweight GCN module after the local module. We adopt GCN because of its widespread adoption and proven performance in VAD (Zhong et al. 2019; Wu et al. 2020; Wu and Liu 2021). Following the setup in (Zhong et al. 2019; Wu et al. 2020), we use GCN to model global temporal dependencies from the perspective of feature similarity and relative distance, which can be summarized as follows,\nwhere H_sim and H_dis are the adjacency matrices, and the Softmax normalization is used to ensure that each row of H_sim and H_dis sums to one. X_l is the frame-level video feature obtained from the local module, and W is the only learnable weight, which is used to transform the feature space; this setup keeps the global module lightweight.\nThe feature similarity branch is designed to generate a similarity adjacency matrix for GCN. We use the frame-wise cosine similarity to calculate the adjacency matrix H_sim, which is presented as follows,\nwe also use a thresholding operation to filter weak relations (Wu et al. 2020).\nFigure 2: The framework of our proposed VadCLIP.\nThe position distance branch is used to capture long-range dependencies based on the positional distance between each pair of frames. The proximity adjacency matrix is shown as follows:\nthe proximity relation between the i-th and j-th frames is determined only by their relative temporal position. σ is a hyperparameter that controls the range of influence of the distance relation. 
Both the local transformer and the GCN layer employ residual connections to prevent feature over-smoothing.\nDual Branch and Prompt # Dual Branch. Unlike previous WSVAD works, our VadCLIP contains a dual branch: in addition to the traditional binary classification branch (C-Branch), we also introduce a novel video-text alignment branch, dubbed A-Branch. Specifically, after temporal modeling, the video feature X_g is fed into a fully connected (FC) layer to obtain the final video feature X ∈ R^(n×d). In C-Branch, we feed X into a binary classifier that contains a feed-forward network (FFN) layer, an FC layer and a Sigmoid activation to obtain the anomaly confidence A ∈ R^(n×1).\nIn A-Branch, textual labels, e.g., abuse, riot, fighting, etc., are no longer encoded as one-hot vectors; instead, they are encoded into class embeddings using the text encoder of CLIP. We keep the pre-trained text encoder of CLIP frozen throughout, as the text encoder can provide a language knowledge prior for video anomaly detection. Then we calculate the match similarities between class embeddings and frame-level visual features to obtain the alignment map M ∈ R^(n×m), where m is the number of text labels; such a setup is similar to that of CLIP. In A-Branch, each input text label represents a class of abnormal events, thus naturally achieving fine-grained WSVAD.\nLearnable Prompt. In WSVAD, text labels are words or phrases, which are too succinct to summarize abnormal events very well. To learn robust transferability of the text embedding, we take inspiration from CoOp (Zhou et al. 2022a) and add a learnable prompt to the original class embeddings. Concretely, the original text labels are first transformed into class tokens through the CLIP tokenizer, i.e., t_init = Tokenizer(Label), where Label is the discrete text label, e.g., fighting, shooting, road accident, etc. 
Then we concatenate t_init with the learnable prompt {c_1, \u0026hellip;, c_l}, which contains l context tokens, to form a complete sentence token; thus the input of the text encoder is presented as follows:\nhere we place the class token in the middle of the sequence. This sequence token is then added to the positional embedding to obtain positional information, and finally, the text encoder of CLIP takes t_p as input and generates the class embedding t_out ∈ R^d.\nAnomaly-Focus Visual Prompt. In order to further improve the representation ability of text labels for abnormal events, we investigate how to use visual contexts to refine the class embedding, since visual contexts can make the succinct text labels more accurate. To this end, we propose an anomaly-focus visual prompt, which focuses on the visual embeddings in abnormal segments and aggregates these embeddings as the video-level prompt for class embeddings. We first use the anomaly confidence A obtained from C-Branch as the anomaly attention, then compute the video-level prompt by the dot product of the anomaly attention and the video feature X, which is presented as follows,\nwhere Norm is the normalization, and V ∈ R^d is the anomaly-focus visual prompt. We then add V to the class embedding t_out and obtain the final instance-specific class embedding T by a simple FFN layer and a skip connection.\nwhere ADD is the element-wise addition. Such an implementation allows class embeddings to extract the related visual context from videos.\nWith X and T in hand, we calculate the match similarities between all class embeddings and frame-level visual features to obtain the alignment map M.\nObjective Function # For C-Branch, we follow previous works (Wu et al. 2020) and use the Top-K mechanism to select the K highest anomaly confidences in both abnormal and normal videos as the video-level predictions. 
Then we use the binary cross entropy between video-level predictions and ground-truth to compute the classification loss L_bce.\nFor A-Branch, we are confronted with new challenges: 1) there is no anomaly confidence; 2) we face multiple classes instead of binary classes. To address this dilemma, we propose the MIL-Align mechanism, which is similar to vanilla MIL. Specifically, we consider the alignment map M since it expresses the similarity between frame-level video features and all class embeddings. For each class, i.e., each column of M ∈ R^(n×m), we select the top K similarities and compute their average to measure the alignment degree between this video and the current class. Then we can obtain a vector S = {s_1, \u0026hellip;, s_m} that represents the similarity between this video and all classes. We expect the video and its paired textual label to yield the highest similarity score among all classes. To achieve this, the multi-class prediction is first computed as follows,\nwhere p_i is the prediction with respect to the i-th class, and τ refers to the temperature hyper-parameter for scaling. Finally, the alignment loss L_nce can be computed by the cross entropy.\nIn addition to the classification loss L_bce and the alignment loss L_nce, we also introduce a contrastive loss to slightly push the normal class embedding and the abnormal class embeddings apart. Here we first calculate the cosine similarity between the normal class embedding and each abnormal class embedding, and then compute the contrastive loss L_cts as follows,\nwhere t_n is the normal class embedding, and t_a denotes the abnormal class embeddings.\nOverall, the final total objective of VadCLIP is given by:\nInference # VadCLIP contains a dual branch that enables it to address both fine-grained and coarse-grained WSVAD tasks.\nFor fine-grained WSVAD, we follow previous works (Wu, Liu, and Liu 2022) and utilize a thresholding strategy on the alignment map M to predict anomalous events. 
For coarse-grained WSVAD, there are two ways to compute the frame-level anomaly degree. The first is to directly use the anomaly confidences from C-Branch; the second is to use the alignment map from A-Branch, where the anomaly degree is one minus the similarity between the video and the normal class. Finally, we select the better of these two ways for computing the frame-level anomaly degree.\nExperiments # Datasets and Evaluation Metrics # Datasets. We conduct experiments on two popular WSVAD datasets, i.e., UCF-Crime and XD-Violence. Notably, training videos only have video-level labels on both datasets.\nEvaluation Metrics. For coarse-grained WSVAD, we follow previous works and utilize the frame-level Average Precision (AP) for XD-Violence, and the frame-level AUC and the AUC of anomaly videos (termed AnoAUC) for UCF-Crime. For fine-grained WSVAD, we follow the standard evaluation protocol in video action detection and use the mean Average Precision (mAP) values under different intersection over union (IoU) thresholds. In this work, we use IoU thresholds ranging from 0.1 to 0.5 with a stride of 0.1 to compute mAP values. Meanwhile, we also report the average of the mAP values (AVG). Note that we only compute mAP on the abnormal videos in the test set.\nImplementation Details # For the network structure, the frozen image and text encoders are adopted from pre-trained CLIP (ViT-B/16). FFN is the standard layer from Transformer, with ReLU replaced by GELU. For hyper-parameters, we set σ in Eq.3 to 1, τ in Eq.8 to 0.07, and the context length l to 20. The window length in LGT-Adapter is set to 64 and 8 on XD-Violence and UCF-Crime, respectively. λ in Eq.10 is set to 1 × 10^-4 and 1 × 10^-1 on XD-Violence and UCF-Crime, respectively. For model training, VadCLIP is trained on a single NVIDIA RTX 3090 GPU using PyTorch. We use AdamW as the optimizer with a batch size of 64. 
On XD-Violence, the learning rate and total number of epochs are set to 2 × 10^-5 and 20, respectively, and on UCF-Crime, they are set to 1 × 10^-5 and 10, respectively.\nComparison with State-of-the-Art Methods # VadCLIP can simultaneously realize coarse-grained and fine-grained WSVAD; therefore we present the performance of VadCLIP and compare it with several state-of-the-art methods on both coarse-grained and fine-grained WSVAD tasks. For the sake of fairness, all comparison methods use the same visual features extracted from CLIP as VadCLIP.\nCoarse-grained WSVAD Results. We show comparison results in Tables 1 and 2. Here Ju et al. (2022) is a CLIP-based work for action recognition, which is significantly inferior to our method. Such results demonstrate the challenges of the WSVAD task, and also show the strength of our method with respect to Ju et al. (2022) for the specific WSVAD task. Besides, we found that VadCLIP significantly outperforms both semi-supervised methods and classification-based weakly supervised methods on two commonly-used benchmarks and across all evaluation metrics. More precisely, VadCLIP attains 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively, a new state-of-the-art on both datasets. By comparison, VadCLIP achieves an absolute gain of 2.3% and 2.1% in terms of AP over the best competitors CLIP-TSA (Joo et al. 2023) and DMU (Zhou, Yu, and Yang 2023) on XD-Violence, and on UCF-Crime, VadCLIP also outperforms them by 0.4% and 1.3% in terms of AUC. More importantly, among all comparison methods, AVVD (Wu, Liu, and Liu 2022) uses fine-grained class labels directly, and it only achieves 78.10% AP and 82.45% AUC on XD-Violence and UCF-Crime, respectively, which lags behind VadCLIP by a large margin. Such a result shows that simply using fine-grained labels cannot lead to performance gains, since excessive label inputs increase the difficulty of binary classification. 
The performance advantage of VadCLIP is partially attributable to the vision-language associations, since all comparison baselines use the same visual features as VadCLIP.\nFine-grained WSVAD Results. For the fine-grained WSVAD task, we compare VadCLIP with the previous works AVVD (Wu, Liu, and Liu 2022) and Sultani et al. (Sultani, Chen, and Shah 2018) in Tables 3 and 4. AVVD is the first work to propose fine-grained WSVAD, and we reimplement it with the visual features of CLIP; we also fine-tune Sultani et al. based on the setup in AVVD to adapt it to fine-grained WSVAD. As we can see, fine-grained WSVAD is a more challenging task than coarse-grained WSVAD, since the former needs to consider both multi-category classification accuracy and detection segment continuity. On this task, VadCLIP is also clearly superior to these comparison methods on both the XD-Violence and UCF-Crime datasets. For instance, on XD-Violence, VadCLIP achieves a performance improvement of 13.1% and 4.5% in terms of AVG compared to Sultani et al. and AVVD, respectively.\nTable 1: Coarse-grained comparisons on XD-Violence.\n| Category | Method | AP(%) |\n| --- | --- | --- |\n| Semi | SVM baseline | 50.8 |\n| Semi | OCSVM (Schölkopf et al. 1999) | 28.63 |\n| Semi | Hasan et al. (Hasan et al. 2016) | 31.25 |\n| Weak | Ju et al. (Ju et al. 2022) | 76.57 |\n| Weak | Sultani et al. (Sultani, Chen, and Shah 2018) | 75.18 |\n| Weak | Wu et al. (Wu et al. 2020) | 80 |\n| Weak | RTFM (Tian et al. 2021) | 78.27 |\n| Weak | AVVD (Wu, Liu, and Liu 2022) | 78.1 |\n| Weak | DMU (Zhou, Yu, and Yang 2023) | 82.41 |\n| Weak | CLIP-TSA (Joo et al. 2023) | 82.17 |\n| Weak | VadCLIP (Ours) | 84.51 |\nTable 2: Coarse-grained comparisons on UCF-Crime.\n| Category | Method | AUC(%) | Ano-AUC(%) |\n| --- | --- | --- | --- |\n| Semi | SVM baseline | 50.1 | 50.00 |\n| Semi | OCSVM (1999) | 63.2 | 51.06 |\n| Semi | Hasan et al. (2016) | 51.2 | 39.43 |\n| Weak | Ju et al. (2022) | 84.72 | 62.60 |\n| Weak | Sultani et al. (2018) | 84.14 | 63.29 |\n| Weak | Wu et al. (2020) | 84.57 | 62.21 |\n| Weak | AVVD (2022) | 82.45 | 60.27 |\n| Weak | RTFM (2021) | 85.66 | 63.86 |\n| Weak | DMU (2023) | 86.75 | 68.62 |\n| Weak | UMIL (2023) | 86.75 | 68.68 |\n| Weak | CLIP-TSA (2023) | 87.58 | N/A |\n| Weak | VadCLIP (Ours) | 88.02 | 70.23 |\nTable 3: Fine-grained comparisons on XD-Violence (mAP@IoU, %).\n| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | AVG |\n| --- | --- | --- | --- | --- | --- | --- |\n| Random Baseline | 1.82 | 0.92 | 0.48 | 0.23 | 0.09 | 0.71 |\n| Sultani et al. (2018) | 22.72 | 15.57 | 9.98 | 6.2 | 3.78 | 11.65 |\n| AVVD (2022) | 30.51 | 25.75 | 20.18 | 14.83 | 9.79 | 20.21 |\n| VadCLIP (Ours) | 37.03 | 30.84 | 23.38 | 17.9 | 14.31 | 24.70 |\nTable 4: Fine-grained comparisons on UCF-Crime (mAP@IoU, %).\n| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | AVG |\n| --- | --- | --- | --- | --- | --- | --- |\n| Random Baseline | 0.21 | 0.14 | 0.04 | 0.02 | 0.01 | 0.08 |\n| Sultani et al. (2018) | 5.73 | 4.41 | 2.69 | 1.93 | 1.44 | 3.24 |\n| AVVD (2022) | 10.27 | 7.01 | 6.25 | 3.42 | 3.29 | 6.05 |\n| VadCLIP (Ours) | 11.72 | 7.83 | 6.4 | 4.53 | 2.93 | 6.68 |\nAblation Studies # Extensive ablations are carried out on the XD-Violence dataset. Here we choose the similarity map to compute the frame-level anomaly degree for coarse-grained WSVAD.\nEffectiveness of LGT-Adapter. As shown in Table 5, firstly, without the assistance of LGT-Adapter for temporal modeling, the baseline model only achieves 72.22% AP and 15.64% AVG, a considerable drop of 12.3% AP and 9.1% AVG. Secondly, using only the global transformer encoder layer, the local transformer encoder layer, or the GCN layer yields clear performance boosts, especially in terms of AP, which convincingly indicates that both the transformer encoder and GCN can efficiently capture temporal dependencies by means of the self-attention mechanism across video frames. Thirdly, the combination of global transformer encoder and GCN yields slightly improved performance in terms of AP (+0.4%) over the combination of local transformer encoder and GCN, while the latter achieves significantly better performance in terms of AVG (+3.9%). 
We also attempt a combination of the local transformer encoder and the global transformer encoder, which results in significant performance degradation in terms of AP, as listed in the 5th row. A possible reason is that, compared to the transformer, GCN can be regarded as a lightweight variant, and its fewer parameters prevent the learned knowledge of CLIP from being affected during the transfer process. Therefore, the local transformer encoder and GCN are the optimal combination, which can capture temporal dependencies at different ranges.\nTable 5: Effectiveness of LGT-Adapter.\n| Method | AP(%) | AVG(%) |\n| --- | --- | --- |\n| Baseline (w/o temporal modeling) | 72.22 | 15.64 |\n| Global TF-Encoder | 82.54 | 16.76 |\n| Local TF-Encoder | 81.18 | 18.41 |\n| Only GCN | 81.56 | 23.31 |\n| Local TF-Encoder + Global TF-Encoder | 79.91 | 19.78 |\n| Global TF-Encoder + GCN | 84.87 | 20.84 |\n| Local TF-Encoder + GCN (LGT-Adapter) | 84.51 | 24.7 |\nTable 6: Effectiveness of dual branch.\nColumns: C-Branch, A-Branch, L-Prompt, V-Prompt, AP(%). AP(%) per row: 80.53, 68.15, 75.03, 78.27, 82.35, 84.51.\nTable 7: Effectiveness of prompt.\n| Prompt | AP(%) | AVG(%) |\n| --- | --- | --- |\n| Hand-crafted Prompt | 81.06 (-3.46) | 22.46 (-2.24) |\n| Learnable-Prompt | 84.51 | 24.70 |\n| Average-Frame Visual Prompt | 81.34 (-3.17) | 21.57 (-3.13) |\n| Anomaly-Focus Visual Prompt | 84.51 | 24.70 |\nEffectiveness of Dual Branch. As shown in Table 6, our method with only C-Branch belongs to the classification-based paradigm, and can compete with current state-of-the-art methods on XD-Violence. On the other hand, our method with only A-Branch achieves unsatisfactory performance in terms of AP since it mainly focuses on fine-grained WSVAD. With the assistance of the coarse-grained classification in C-Branch for feature optimization, A-Branch obtains a leap of about 7% AP. By further adding the learnable prompt and visual prompt, which are designed specifically for A-Branch, we observe a consistent performance improvement, leading to a new state-of-the-art. 
These results clearly show that the dual branch, which contains the coarse-grained classification paradigm and the fine-grained alignment paradigm, can boost performance by leveraging the complementarity of different granularities.\nEffectiveness of Prompt. As shown in Table 7, using a hand-crafted prompt results in a drop of 3.5% AP and 2.2% AVG, demonstrating that the learnable prompt has better potential for adapting pre-trained knowledge from the large language-vision model to the WSVAD task. Furthermore, simply using the average of frame-level features as the visual prompt (Ni et al. 2022) produces a drop of 3.2% AP and 3.1% AVG; such results show that focusing on abnormal snippets in the video helps VadCLIP obtain more accurate instance-specific text representations, which boosts the video-language alignment ability that is useful for the WSVAD task. We refer readers to the supplementary materials (https://arxiv.org/abs/2308.11681) for more ablation studies and qualitative visualizations.\nFigure 3: t-SNE visualizations for XD-Violence. Left: Raw CLIP features; Right: VadCLIP features.\nFigure 4: Qualitative results of coarse-grained WSVAD.\nQualitative Analyses # Feature Discrimination Visualization. We visualize the feature distribution by using t-SNE for XD-Violence, and present the results in Figure 3, where star icons denote textual label features. As we can see, although CLIP has learned generalized capacities based on image-text pairs, such capacities still cannot allow it to effectively distinguish different categories for WSVAD due to intrinsic problems of the WSVAD task. After specialized optimization by VadCLIP, these visual features have more distinguishable boundaries and also surround the corresponding text class features.\nCoarse-grained Qualitative Visualization. We illustrate the qualitative visualizations of coarse-grained WSVAD in Figure 4, where the blue curves represent the anomaly predictions, and the pink regions correspond to the ground-truth abnormal temporal locations. 
As we can see, VadCLIP precisely detects abnormal regions of different categories on the two benchmarks; meanwhile, it also produces considerably low anomaly predictions on normal videos.\nConclusion # In this work, we propose a new paradigm named VadCLIP for weakly supervised video anomaly detection. To efficiently adapt the pre-trained knowledge and vision-language associations from frozen CLIP to the WSVAD task, we first devise an LGT-Adapter to enhance the ability of temporal modeling, and then we design a series of prompt mechanisms to improve the adaptation of general knowledge to the specific task. Finally, we introduce the MIL-Align operation to facilitate the optimization of vision-language alignment under weak supervision. We empirically verify the effectiveness of VadCLIP through state-of-the-art performance and sufficient ablations on two WSVAD benchmarks. In the future, we will continue to explore vision-language pre-trained knowledge and further devote ourselves to the open-set VAD task.\nAcknowledgments # This work is supported by the National Natural Science Foundation of China (No. 62306240, U23B2013, U19B2037, 62301432, 62101453), China Postdoctoral Science Foundation (No. 2023TQ0272), National Key R\u0026amp;D Program of China (No. 2020AAA0106900), Shaanxi Provincial Key R\u0026amp;D Program (No. 2021KWZ-03), Natural Science Basic Research Program of Shaanxi (No. 2021JCW-03, 2023-JC-QN-0685), and the Fundamental Research Funds for the Central Universities (No. D5000220431).\nReferences # Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.\nChen, F.-L.; Zhang, D.-Z.; Han, M.-L.; Chen, X.-Y.; Shi, J.; Xu, S.; and Xu, B. 2023a. Vlp: A survey on vision-language pre-training. Machine Intelligence Research, 20(1): 38–56.\nChen, Y.; Liu, Z.; Zhang, B.; Fok, W.; Qi, X.; and Wu, Y.-C. 2023b. 
MGFN: Magnitude-Contrastive Glance-and-Focus Network for Weakly-Supervised Video Anomaly Detection. volume 37, 387–395.\nDosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 .\nFeng, J.-C.; Hong, F.-T.; and Zheng, W.-S. 2021. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 14009–14018.\nHasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A. K.; and Davis, L. S. 2016. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, 733–742.\nHuang, C.; Liu, C.; Wen, J.; Wu, L.; Xu, Y.; Jiang, Q.; and Wang, Y. 2022. Weakly Supervised Video Anomaly Detection via Self-Guided Temporal Discriminative Transformer. IEEE Transactions on Cybernetics .\nJia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; and Duerig, T. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 4904–4916. PMLR.\nJoo, H. K.; Vo, K.; Yamazaki, K.; and Le, N. 2023. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In 2023 IEEE International Conference on Image Processing (ICIP), 3230–3234. IEEE.\nJu, C.; Han, T.; Zheng, K.; Zhang, Y.; and Xie, W. 2022. Prompting visual-language models for efficient video understanding. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, 105–124. Springer.\nKim, W.; Son, B.; and Kim, I. 2021. Vilt: Vision-andlanguage transformer without convolution or region supervision. In International Conference on Machine Learning , 5583–5594. 
PMLR.\nLi, S.; Liu, F.; and Jiao, L. 2022. Self-training multisequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 1395–1403.\nLin, Z.; Geng, S.; Zhang, R.; Gao, P.; de Melo, G.; Wang, X.; Dai, J.; Qiao, Y.; and Li, H. 2022. Frozen clip models are efficient video learners. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, 388–404. Springer.\nLiu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision , 10012–10022.\nLuo, H.; Ji, L.; Zhong, M.; Chen, Y.; Lei, W.; Duan, N.; and Li, T. 2022. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing , 508: 293–304.\nLv, H.; Yue, Z.; Sun, Q.; Luo, B.; Cui, Z.; and Zhang, H. 2023. Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection. arXiv preprint arXiv:2303.12369 .\nMokady, R.; Hertz, A.; and Bermano, A. H. 2021. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734 .\nNag, S.; Zhu, X.; Song, Y.-Z.; and Xiang, T. 2022. Zero-shot temporal action detection via vision-language prompting. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III , 681–697. Springer.\nNi, B.; Peng, H.; Chen, M.; Zhang, S.; Meng, G.; Fu, J.; Xiang, S.; and Ling, H. 2022. Expanding language-image pretrained models for general video recognition. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV, 1– 18. Springer.\nRadford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. 
Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.\nRao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; and Lu, J. 2022. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18082–18091.\nSchölkopf, B.; Williamson, R. C.; Smola, A.; Shawe-Taylor, J.; and Platt, J. 1999. Support vector method for novelty detection. Advances in neural information processing systems, 12.\nSultani, W.; Chen, C.; and Shah, M. 2018. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6479–6488.\nTian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J. W.; and Carneiro, G. 2021. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, 4975–4986.\nTran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, 4489–4497.\nWang, M.; Xing, J.; and Liu, Y. 2021. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472.\nWang, Z.; Yu, J.; Yu, A. W.; Dai, Z.; Tsvetkov, Y.; and Cao, Y. 2021. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.\nWu, J.; Zhang, W.; Li, G.; Wu, W.; Tan, X.; Li, Y.; Ding, E.; and Lin, L. 2021. Weakly-supervised spatio-temporal anomaly detection in surveillance video. arXiv preprint arXiv:2108.03825.\nWu, P.; and Liu, J. 2021. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing, 30: 3513–3527.\nWu, P.; Liu, J.; He, X.; Peng, Y.; Wang, P.; and Zhang, Y. 2023. 
Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model. arXiv preprint arXiv:2307.12545.\nWu, P.; Liu, J.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; and Yang, Z. 2020. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, 322–339. Springer.\nWu, P.; Liu, X.; and Liu, J. 2022. Weakly supervised audiovisual violence detection. IEEE Transactions on Multimedia, 1674–1685.\nYu, W.; Liu, Y.; Hua, W.; Jiang, D.; Ren, B.; and Bai, X. 2023. Turning a CLIP Model into a Scene Text Detector. arXiv preprint arXiv:2302.14338.\nZaheer, M. Z.; Mahmood, A.; Astrid, M.; and Lee, S.-I. 2020. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, 358–376. Springer.\nZhong, J.-X.; Li, N.; Kong, W.; Liu, S.; Li, T. H.; and Li, G. 2019. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 1237–1246.\nZhou, H.; Yu, J.; and Yang, W. 2023. Dual Memory Units with Uncertainty Regulation for Weakly Supervised Video Anomaly Detection. In Proceedings of the AAAI Conference on Artificial Intelligence.\nZhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2022a. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9): 2337–2348.\nZhou, X.; Girdhar, R.; Joulin, A.; Krähenbühl, P.; and Misra, I. 2022b. Detecting twenty-thousand classes using image-level supervision. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX, 350–368. Springer.\nZhou, Z.; Zhang, B.; Lei, Y.; Liu, L.; and Liu, Y. 2022c. 
ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation. arXiv preprint arXiv:2212.03588 .\n","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/vadclip-adapting-vision-language-models-for-weakly-supervised-video-anomaly-detection/","section":"Papers","summary":"A novel paradigm for weakly supervised video anomaly detection leveraging frozen CLIP model with dual-branch architecture, temporal modeling modules, and prompt mechanisms to utilize vision-language knowledge for both coarse- and fine-grained detection tasks, achieving state-of-the-art performance on benchmarks.","title":"VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection","type":"method"},{"content":" VLAVAD: Vision-Language Models Assisted Unsupervised Video Anomaly Detection # Changkang Li 1 lichangkang@buaa.edu.cn\nThe School of Electrical and Information Engineering, Beihang University, Beijing 100191, China\nYalong Jiang2 † allenyljiang@buaa.edu.cn\nInstitute of Unmanned System, Beihang University, Beijing 100191, China\nAbstract # Video anomaly detection is a subject of great interest across industrial and academic domains because of its crucial role in computer vision applications. However, the inherent unpredictability of anomalies and the scarcity of anomaly samples present significant challenges for unsupervised learning methods. To overcome the limitations of unsupervised learning, which stem from a lack of comprehensive prior knowledge about anomalies, we propose VLAVAD (Video-Language Models Assisted Anomaly Detection). Our method employs a cross-modal pre-trained model that leverages the inferential capabilities of large language models (LLMs) in conjunction with a Selective-Prompt Adapter (SPA) for selecting semantic space. Additionally, we introduce a Sequence State Space Module (S3M) that detects temporal inconsistencies in semantic features. 
By mapping high-dimensional visual features to low-dimensional semantic ones, our method significantly enhances the interpretability of unsupervised anomaly detection. Our proposed approach effectively tackles the challenge of detecting elusive anomalies that are hard to discern over periods, achieving SOTA on the challenging ShanghaiTech dataset.\n1 Introduction # Video anomaly detection (VAD) is a task of considerable practical value in various situations, such as detecting abnormal behaviors like theft, fighting, or falls, as well as anomalous objects like vehicles entering pedestrian zones. The necessity of achieving this task increases significantly in the context of security and intelligent cities [27, 44, 53, 60, 66]. However, due to the sudden and often unclear nature of such events, identifying their time and location is highly challenging.\nAbnormal occurrences in the real world are infrequent and can be classified into an extensive array of categories. Consequently, conventional supervised VAD [25, 41, 46, 62] may not be suitable for this task, as it is often impractical to gather a substantial dataset with labeled abnormal samples. To address the limitations of data annotation, some researchers have turned to weakly supervised VAD that does not necessitate frame-by-frame annotations but instead relies on video-level labels. In weakly supervised VAD, a video is deemed anomalous if any part of it is labeled as such. On the other hand, a video is labeled as normal only if all of its frames are normal. However, this approach is inefficient in pinpointing the abnormal section of the video, especially when the video is long. The application of unsupervised learning methodologies [1, 23, 42, 48], which involve training representations solely on regular samples, allows for the separation of anomalous samples without the need for prior knowledge about anomalies, thereby eliminating constraints imposed by the process of collecting data.\n† Indicates corresponding author.\n© 2024. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.\nFigure 1: Comparison between previous methods (left) and our method (right). Our proposed VLAVAD shifts from visual to semantic analysis, identifying shared attributes between normal and anomalous data while ignoring unique visual traits. Unlike traditional methods focused on specific visual cues like pose or motion, our approach is more adaptable across different scenes, facilitated by task-related semantic feature selection. Additionally, we introduce the Sequence State Space Module (S3M) to learn the temporal correlation of normal samples, thereby detecting anomalies that deviate from the normal temporal pattern.\nThe spatial and temporal complexities of anomalous features make it difficult to identify and categorize all anomalies. Anomalous samples may not always exhibit clear differences from normal samples; instead, they may sometimes closely resemble them in certain feature dimensions. Methods that rely on visual features often make judgments based on a single observation that defines anomalies [16, 23, 45, 65], resulting in the mapping of all normal samples into the same feature space and neglecting the variety of normal samples. Therefore, referencing human understanding for anomaly discrimination necessitates a multidimensional assessment, combining various factors such as human posture, optical flow, background changes, etc., for judgment. The multi-task learning paradigm that incorporates diverse types of features has shown potential to enhance accuracy [4, 7, 18, 50]. 
However, such multi-task-based algorithms incur high transfer costs across scenes and categories, implying that achieving the desired detection performance requires fine-tuning each sub-task to strike a balance.\nIn recent times, Vision-Language Models (VLMs) have enhanced accuracy in visual downstream tasks and also offer a reasonable level of interpretability [5, 31, 32, 33, 56]. To make use of the advancements in Vision-Language Pre-training models, we present Vision-Language Model Assisted Anomaly Detection (VLAVAD). This technique makes use of VLMs to transform images into high-level semantic representations. We replace visual features with semantic features and utilize the Selective Prompt Adapter to focus on learning effective semantics from normal samples, thereby enabling smooth adaptation to cross-scene, cross-category anomaly detection without the need for additional model training. Given the significance of accounting for temporal information in videos for effective VAD, it is essential to consider the correlation of feature information across time. Methods that only take into account the current frame when identifying anomalies are insufficient, as they fail to capture correlations along the temporal dimension. To harness the temporal variations in semantic features, we propose the Sequence State Space Module (S3M) to learn the temporal correlation of normal samples. S3M outperforms convolution-based and transformer-based networks by capturing long-range temporal context dependencies with reduced computational costs.\nOur proposed method, VLAVAD, eliminates the need for collecting and labeling anomalous data, making it suitable for real-world applications. 
By utilizing the Selective Prompt Adapter (SPA) and employing a lightweight S3M trained on normal data, our approach effectively harnesses the deep semantic information in images, allowing for precise and interpretable spatiotemporal localization of anomaly events. The method has been successfully validated across multiple datasets, showcasing its cost-effective transferability and superior performance.\nIn summary, our contributions are as follows:\nWe present an unsupervised video anomaly detection framework called VLAVAD, which utilizes semantic features rather than visual features for anomaly detection. This framework capitalizes on the comprehension and reasoning skills of a pretrained Vision-Language model to enhance performance in VAD. Consequently, our method extends anomaly detection from a single dimension to open-world settings.\nWe introduce the pioneering use of the Sequence State Space Module (S3M) to tackle temporal variation in anomaly detection, further mitigating the limitation of single-frame anomaly assessment that overlooks time-related anomalies.\nOur method allows for cost-effective universal anomaly event discrimination across scenes, achieving a 2.7% improvement in performance on the challenging cross-scene, cross-category ShanghaiTech dataset. We also validate the superiority of our approach across multiple datasets.\n2 Related Work # 2.1 Video Anomaly Detection # In unsupervised Video Anomaly Detection tasks, two primary categories emerge: feature reconstruction and video frame interpolation. Feature reconstruction methods typically employ an Auto Encoder (AE) [22, 51, 52] or a Generative Adversarial Network (GAN) [10, 26] to project normal data into a low-dimensional space for reconstruction in either temporal or spatial dimensions. 
Reconstruction methods assume that a neural network trained exclusively on normal samples can reconstruct normal samples from low-dimensional features but cannot reconstruct anomalous samples [58]. Conversely, video frame interpolation methods entail training a prediction network to forecast the state of an object with missing input frames. By comparing the prediction results with actual outcomes, deviations are assessed to identify anomalies. This method assumes that a network trained on a dataset of normal samples cannot predict frames of anomalous events, thereby effectively differentiating between normal and anomalous events [22, 68].\n2.2 Vision-Language Pre-training # In recent years, the domain of vision-language pre-training has witnessed significant progress, primarily aimed at discerning the semantic interplay between visual and linguistic modalities through extensive pre-training on diverse datasets. A quintessential illustration of this paradigm is CLIP [49], which employs an image-text contrastive learning strategy. This method involves aligning paired images and texts in the embedding space, bringing similar pairs closer together and pushing dissimilar pairs further apart. By utilizing this approach, pre-trained Vision-Language Models (VLMs) are able to acquire extensive knowledge of vision-language correspondence. This enables VLMs to make zero-shot predictions by matching the embeddings of any given images and texts.\nVLMs have shown outstanding performance in diverse vision-language downstream tasks, such as image classification [49], object detection [13, 14, 21], scene text detection [63], image captioning [31, 70], and semantic segmentation [12, 19]. In recent times, a number of studies have endeavored to employ pre-trained models in the domain of video. 
For example, CLIP4Clip [37] applied CLIP\u0026rsquo;s expertise to video-text retrieval, while other works [34, 47, 59] applied CLIP to video recognition. VisualGPT [11] highlights the advantages of utilizing pretrained language models to initialize models for more efficient training with less data. Furthermore, Tsimpoukelli et al. [55] enhance performance by fine-tuning a vision encoder and aligning it with a frozen Large Language Model (LLM). Models such as BEiT3 [57] and BLIP [31] employ unified transformer architectures for pretraining, and Flamingo [2] introduces a cross-attention design to align visual and language modalities. Additionally, BLIP-2 [32] introduces a lightweight Q-Former that converts visual features into tokens directly interpretable by a frozen LLM, achieving impressive results in both image captioning and visual question answering (VQA) tasks. Our research leverages the VQA capabilities of BLIP-2 through our automatic questioning mechanism to extract additional image information and enhance image captions beyond the original BLIP-2 captions.\n3 Method # 3.1 Overview # Our main objective is to develop an unsupervised learning methodology to effectively handle scenarios with unpredictable and unobtainable anomalous data samples. Our approach involves transitioning from vision to semantic features, identifying common attributes between normal and anomalous data in the semantic space while excluding non-shared visual features. In contrast to conventional methods that heavily rely on specific aspects of visual features such as pose or optical flow data, our approach offers a significant advantage in its seamless adaptability across diverse cross-scene datasets, facilitated by the incorporation of a Prompt Adapter. 
Additionally, we introduce the Sequence State Space Module (S3M) to detect temporal variations in semantics, complementing single-frame detection results and addressing the limitation of underutilizing temporal information in anomaly detection.\n3.2 Obtain Multi-object Trajectories # Our anomaly detection architecture receives a series of object-level temporal image sequences as input. To achieve object detection, we employ a pre-trained YOLOx network. Additionally, we utilize the ByteTrack algorithm for object tracking to train the S3M. Consequently, we acquire object-level trajectories $T = (O_{f_{begin}}, ..., O_{f_{end}})$, where $O$ denotes the image of the detected object, and $f_{begin}$ and $f_{end}$ denote the frame indices of the object\u0026rsquo;s appearance and disappearance, respectively. Finally, we obtain a set of object-level trajectories $T_1, ..., T_N$, where $N$ is the total number of objects detected in the video, which facilitates the segmentation of each object into clips during both training and testing phases.\nFigure 2: Overview of our proposed VLAVAD. In the preprocessing stage, object-level sequences $T_1, ..., T_N$ are obtained through detection and tracking. During training, the Selective Prompt Adapter (SPA) selects the most suitable prompt from the prompt pool to describe the dataset scene samples. Subsequently, the Sequence State Space Module (S3M) takes clip-level semantic features $E(t)$ as input and is trained using the Mean Squared Error (MSE) loss between the predicted feature output and the expected feature to learn the deviations in temporal patterns. During testing, we utilize the prompt selected by SPA from the training set to generate the answer sequence. We then calculate $A_s$ and $A_t$, which represent the static caption anomaly score and the time inconsistency anomaly score, respectively.\n3.3 Algorithm Description # Illustrated in the right half of Figure 2, our network comprises three components. 
The first component, the Selective Prompt Adapter, employs the frequency distribution of the LLM output to compute anomaly scores for individual objects detected within a single frame. It selects the most salient score among multiple objects within the same frame and designates it as the anomaly score for that frame, denoted as $A_k = max(A_{O_1}, ..., A_{O_n})$, where $A_k$ represents the anomaly score for the k-th frame and $A_{O_i}$ represents the anomaly score for the i-th object within that frame. The second component, the Sequence State Space Module (S3M), takes as input the object-level text embedding sequence generated by the VLM. It undergoes unsupervised training solely on the normal samples within the training set and computes anomaly scores based on the temporal inconsistency of features during the test phase. Finally, we integrate the static anomaly scores with the dynamic ones and apply Gaussian smoothing to obtain the final score.\n3.3.1 Selective Prompt Adapter # To promote the utilization of Vision-Language Models (VLMs) in anomaly detection, we introduce the Selective Prompt Adapter (SPA) module. This component aids the VLM in selecting appropriate prompts by evaluating the statistical properties of common text features in typical data. Anomaly detection typically entails mapping the input data to a low-dimensional space, and its efficacy hinges on the ability to compress input images into a low-dimensional feature space. Leveraging the dual capabilities of image and text inputs in VLMs, we are able to identify the common features of normal samples by utilizing multiple text inputs. This process effectively distinguishes them from anomalous samples and enhances the precision of anomaly detection. Specifically, the SPA module selects the most appropriate prompt for dimensionality reduction of normal samples by examining the frequency of text features. 
By concentrating normal samples in a more compact low-dimensional space, the final input prompt text $P_{selected}$ can be represented as:\n\nHere, $I$ denotes the object-level image inputs obtained from the training set, and $G_{VLM}$ denotes the vision-language model. $F_{topk}$ denotes the top-k frequency statistics, and $P_{selected}$ denotes the prompt selected from the prompt pool to maximize the concentration of output features from the training set. We choose the prompt with the highest $F_{topk}$ statistics on normal samples as the optimal input for compressing common features. During the testing phase, the same $P_{selected}$ is used, and the anomaly score for each object is calculated as the reciprocal of the frequency of occurrence of the object\u0026rsquo;s text in the training dataset, as anomalies are less frequent in the selected semantic space.\n3.3.2 Sequence State Space Module # We present a Sequence State Space Module (S3M) designed to identify changes in semantic features over extended periods. The S3M extracts persistent patterns of state transitions within lengthy sequences in normal events and encodes them for predicting future states based on past observations. The model also identifies anomalies by leveraging disparities between predicted and observed states. Moreover, the S3M\u0026rsquo;s ability to capture long-range dependencies enhances its capacity to uncover comprehensive anomaly clues.\nThe input to the S3M includes embeddings obtained from the answer text of the VLM, combined with object-level trajectories. The embedding sequences of objects appearing in all frames of the video are segmented into a set of clips. The input is the text sequence output by the text encoder $E$, denoted as $E_i(t), E_i(t+1), ..., E_i(t+L_c)$ for $i = 1, ..., N$, where $E(t) ∈ R^{512}$, $N$ is the total number of objects, and $L_c$ represents the length of each clip. 
The S3M function is defined as follows:\n\nHere, $W(t, L)$ represents the window function, which retains the input from the previous $L_p$ moments. $Ê_i$ denotes the output obtained from the S3M. The objective function of the S3M network is to reduce the divergence between ground truth sequences and predictions.\nHere, $||·||_2$ denotes the mean squared error. The S3M is trained solely on normal samples, with the aim of learning the normal motion patterns. Therefore, when abnormal samples from the test set are utilized as input, the module\u0026rsquo;s prediction, which is derived from normal patterns, diverges from observations. The anomaly score at the testing stage is calculated as:\n\n3.3.3 Computation of Anomaly Scores # After obtaining the object-level anomaly scores $A_s(t)$ and $A_t(t)$, we compute the final anomaly score $A(t)$ as follows: the maximum score among all objects $O_1, ..., O_n$ within the current frame is selected for both $A_s(t)$ and $A_t(t)$. To reduce the impact of noise, we apply a 1-D Gaussian filter to smooth the scores. The expression can be written as:\n\nIn this formula, Gauss represents the 1-D Gaussian filter. $A_s(t)$ denotes the static anomaly score obtained by SPA, which includes only the information at the moment $t$. $A_t(t)$ denotes the dynamic anomaly score obtained by S3M, which incorporates information from a period of $L_c$. $λ$ is a hyperparameter that adjusts the weight between the two.\n4 Experiments # 4.1 Dataset and Evaluation Metrics # Dataset: The study presented in this article employs several benchmark datasets that depict complex anomalous events occurring in diverse settings captured from various vantage points. The UCSD dataset [40] is a collection of videos captured in different crowd scenarios. The \u0026ldquo;Pedestrian 2\u0026rdquo; (Ped2) subset we used includes 16 training video samples and 12 testing video samples. The Avenue dataset [36] consists of 21 testing videos of anomalous events and 16 training videos of normal events. 
The dataset contains a total of 47 anomalous events, including behaviors like walking in the wrong direction, running, dancing, and object throwing. The ShanghaiTech dataset [35] comprises 330 training videos and 107 testing videos. With 13 scenes characterized by complex lighting and camera angles, this dataset includes 42,883 testing frames and 274,515 training frames. The ShanghaiTech dataset is the most extensive and intricate, presenting the greatest challenges for VAD due to its semantic complexity and cross-scene detection requirements.\nMetrics: Performance metrics in anomaly detection research are typically assessed using ground truth annotations at either the frame or video level within datasets. When an anomalous event is identified within a frame, the entire frame is classified as anomalous for frame-level evaluation. Due to the inherent imbalance between normal and anomalous samples in anomaly detection datasets, we evaluate the performance of VAD using the area under the curve (AUC) of the frame-level receiver operating characteristic (ROC), which is independent of the decision threshold used in the detection task.\n4.2 Implementation Details # For the network structure, we utilized the ByteTrack model [67] pretrained on the MOT17 dataset [43], with its backbone derived from YOLOx [17] pretrained on MS-COCO.\n\nTable 1: Comparison of the AUC on the UCSD Ped2, Avenue, and ShanghaiTech.\n\n| Pub. Year | Method | UCSD Ped2 | Avenue | ShanghaiTech |\n| --- | --- | --- | --- | --- |\n| 2018 and before | MPPC+SFA [40] | 61.3% | - | - |\n| 2018 and before | Conv-AE [22] | 90.0% | 70.2% | - |\n| 2018 and before | ConvLSTM-AE [38] | 88.1% | 77.0% | - |\n| 2018 and before | StackRNN [39] | 92.21% | 81.71% | 68.0% |\n| 2018 and before | Frame-Pred [35] | 95.4% | 85.1% | 72.8% |\n| 2020 | Mem-AE [20] | 94.1% | 83.3% | 71.2% |\n| 2020 | AnoPCN [64] | 96.8% | 86.2% | 73.6% |\n| 2020 | Deep-OC [61] | 96.9% | 86.6% | - |\n| 2020 | ClusterAE [8] | 96.5% | 86.0% | 73.3% |\n| 2020 | IPR [54] | 96.20% | 83.70% | 71.50% |\n| 2020 | MNAD-Recon [48] | 90.2% | 82.8% | 69.8% |\n| 2021 | CT-D2GAN [15] | 97.2% | 85.9% | 77.7% |\n| 2021 | LNRA [3] | 96.5% | 84.7% | 76.0% |\n| 2022 | ARAE [29] | 97.4% | 86.7% | 73.6% |\n| 2022 | CR-BPN [9] | 98.3% | 90.3% | 78.1% |\n| 2023 | MGME [69] | 97.8% | 87.6% | 73.5% |\n| 2023 | SPTD [28] | - | - | 84.5% |\n| 2023 | OFR-E [24] | 97.7% | 89.7% | 75.8% |\n| 2024 | STM [30] | 97.0% | 87.7% | 76.1% |\n| 2024 | CR-KR [6] | 97.1% | 90.8% | 83.7% |\n| 2024 | Ours | 99.0% | 87.6% | 87.2% |\nThe tracking confidence threshold parameter was set to 0.5 for both training and testing sets, with an NMS threshold of 0.3, filtering out tiny boxes with an area less than 300. We used BLIP-2 [32], pretrained on a combined dataset of 129 million images from COCO, Visual Genome, CC3M, etc.; its image encoder is a pretrained ViT-g, and its large language model is the lighter pretrained OPT-2.7B. 
Three input prompt texts were selected for pose and behavior, Question 1: \u0026ldquo;What is the pose of the person in the picture?\u0026rdquo; Question 2: \u0026ldquo;What is the behavior or action of the person in the picture?\u0026rdquo; Question 3: \u0026ldquo;What does the person in the picture look like?\u0026rdquo; The S3M was configured with 3 layers, with 10 input frames and 2 predicted frames for Avenue and Ped2, and 20 input frames and 4 predicted frames for ShanghaiTech. The learning rate was set to 5e-1 for Avenue and Ped2, and 5e-2 for ShanghaiTech. Finally, for the anomaly scoring, λ was set to 0.1, and GMM was used for Gaussian smoothing, with sigma values of 6, 6, and 12 for Ped2, Avenue, and ShanghaiTech.\n4.3 Comparison with state-of-the-art methods # Our VLAVAD is compared with other unsupervised anomaly detection methods in Table 1. On the UCSD Ped2 and ShanghaiTech datasets, the combined results demonstrate a significant lead over the state of the art, achieving AUC scores of 99.0% and 87.2% respectively. Notably, the lead on ShanghaiTech is 2.7%; this dataset is a demanding benchmark spanning 130 complex anomalous events, both human-related and unrelated. Our AUC scores on this dataset exceed those of other methods, confirming that our model is better suited for universal anomaly detection scenarios.\n\nTable 2: Ablation study results on Ped2 and ShanghaiTech datasets.\n\n| kNN ($A_s$) | SPA ($A_s$) | Trans. ($A_t$) | RNN ($A_t$) | S3M ($A_t$) | Ped2 | ShanghaiTech |\n| --- | --- | --- | --- | --- | --- | --- |\n| ✓ | - | - | - | - | 96.3% | 72.3% |\n| - | ✓ | - | - | - | 98.2% | 86.5% |\n| - | - | ✓ | - | - | 93.2% | 81.2% |\n| - | - | - | ✓ | - | 92.3% | 80.7% |\n| - | - | - | - | ✓ | 96.6% | 82.6% |\n| - | ✓ | - | - | ✓ | 99.0% | 87.2% |\nNevertheless, our experimental outcomes on the CUHK Avenue dataset did not achieve parity with the SOTA benchmarks. 
This divergence can be principally attributed to the dataset\u0026rsquo;s unconventional anomaly definition criteria, which treat variables such as the direction of human locomotion as anomaly indicators, whereas our model\u0026rsquo;s captions do not describe pedestrian walking direction. Consequently, methods that focus on velocity, for example by using optical flow for discrimination, perform better on this dataset.\n4.4 Ablation Study # To assess the usefulness of mining text features generated by the VLM for anomaly detection, we compared kNN classification on the 512-D visual features output by the CLIP image encoder with the scores obtained from SPA. Furthermore, in order to verify the effectiveness of both input pathways, we conducted an ablation study by adjusting λ. The AUC achieved by kNN classification using only the visual features produced by the visual encoder is lower than that obtained when utilizing SPA for feature mining on both the Ped2 and ShanghaiTech datasets, highlighting the effectiveness of semantic features over visual features for anomaly detection. Additionally, we replaced S3M with transformer and RNN structures for experimentation, and S3M outperformed these two models because it is less prone to overfitting, which makes it more suitable for this prediction task. Finally, incorporating S3M on both datasets shows a certain degree of improvement. This improvement is attributed to the presence of short-duration anomaly events that may be intermittent in time, with S3M aiding in the detection of anomalies with longer durations. The experimental results are shown in Table 2.\n5 Conclusions # Previous efforts in video anomaly detection have typically relied on visual representations, which has limited the ability to generalize across diverse situations. For instance, behaviors that are considered typical in one context may be deemed anomalous in another. 
Our method addresses this challenge by employing the Selective Prompt Adapter (SPA) to enable pretrained VLMs to perform cross-scenario, interpretable anomaly detection more effectively. The advancement of cross-modal large models, as well as the progress in cross-modal matching models and Large Language Models (LLMs), has made it possible to extend this technique to enhance the interpretability and generalization of VAD.\n6 Acknowledgements # This work was supported in part by the National Natural Science Foundation of China under Grant 62301020 and in part by Beijing Natural Science Foundation under Grant 4234085.\nReferences # [1] Samet Akcay, Amir Atapour-Abarghouei, and Toby P Breckon. Ganomaly: Semi-supervised anomaly detection via adversarial training. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, pages 622–637. Springer, 2019.\n[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.\n[3] Marcella Astrid, Muhammad Zaigham Zaheer, Jae-Yeong Lee, and Seung-Ik Lee. Learning not to reconstruct anomalies. arXiv preprint arXiv:2110.09742, 2021.\n[4] Mohammad Baradaran and Robert Bergevin. Multi-task learning based video anomaly detection with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2885–2895, 2023.\n[5] Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Silvia Terragni, Gabriele Sarti, and Sri Lakshmi. Contrastive language-image pre-training for the italian language. arXiv preprint arXiv:2108.08688, 2021.\n[6] Congqi Cao, Yue Lu, and Yanning Zhang. Context recovery and knowledge retrieval: A novel two-stream framework for video anomaly detection. 
IEEE Transactions on Image Processing, 2024.\n[7] Xingya Chang, Yuxin Zhang, Dingyu Xue, and Dongyue Chen. Multi-task learning for video anomaly detection. Journal of Visual Communication and Image Representation, 87:103547, 2022.\n[8] Yunpeng Chang, Zhigang Tu, Wei Xie, and Junsong Yuan. Clustering driven deep autoencoder for video anomaly detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16, pages 329–345. Springer, 2020.\n[9] Chengwei Chen, Yuan Xie, Shaohui Lin, Angela Yao, Guannan Jiang, Wei Zhang, Yanyun Qu, Ruizhi Qiao, Bo Ren, and Lizhuang Ma. Comprehensive regularization in a bi-directional predictive network for video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 230–238, 2022.\n[10] Dongyue Chen, Lingyi Yue, Xingya Chang, Ming Xu, and Tong Jia. Nm-gan: Noise-modulated generative adversarial network for video anomaly detection. Pattern Recognition, 116:107969, 2021.\n[11] Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18030–18040, 2022.\n[12] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11583–11592, 2022.\n[13] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022.\n[14] Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, and Lin Ma. Promptdet: Towards open-vocabulary detection using uncurated images. 
In European Conference on Computer Vision, pages 701–717. Springer, 2022.\n[15] Xinyang Feng, Dongjin Song, Yuncong Chen, Zhengzhang Chen, Jingchao Ni, and Haifeng Chen. Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection. In Proceedings of the 29th ACM International Conference on Multimedia, pages 5546–5554, 2021.\n[16] Alessandro Flaborea, Luca Collorone, Guido Maria D\u0026rsquo;Amely Di Melendugno, Stefano D\u0026rsquo;Arrigo, Bardh Prenkaj, and Fabio Galasso. Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10318–10329, 2023.\n[17] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.\n[18] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. Anomaly detection in video via self-supervised and multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12742–12752, 2021.\n[19] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022.\n[20] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1705–1714, 2019.\n[21] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.\n[22] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. 
Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016.\n[23] Or Hirschorn and Shai Avidan. Normalizing flows for human pose anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13545–13554, 2023.\n[24] Heqing Huang, Bing Zhao, Fei Gao, Penghui Chen, Jun Wang, and Amir Hussain. A novel unsupervised video anomaly detection framework based on optical flow reconstruction and erased frame prediction. Sensors, 23(10):4828, 2023.\n[25] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7842–7851, 2019.\n[26] Samuel D Jackson and Fabio Cuzzolin. Svd-gan for real-time unsupervised video anomaly detection. In Proceedings of the British Machine Vision Conference (BMVC), Virtual, pages 22–25, 2021.\n[27] Shunsuke Kamijo, Yasuyuki Matsushita, Katsushi Ikeuchi, and Masao Sakauchi. Traffic monitoring and accident detection at intersections. IEEE transactions on Intelligent transportation systems, 1(2):108–118, 2000.\n[28] Jaehyun Kim, Seongwook Yoon, Taehyeon Choi, and Sanghoon Sull. Unsupervised video anomaly detection based on similarity with predefined text descriptions. Sensors, 23(14):6256, 2023.\n[29] Viet-Tuan Le and Yong-Guk Kim. Attention-based residual autoencoder for video anomaly detection. Applied Intelligence, 53(3):3240–3254, 2023.\n[30] Hongjun Li and Mingyi Chen. A novel spatio-temporal memory network for video anomaly detection. Multimedia Tools and Applications, pages 1–22, 2024.\n[31] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. 
PMLR, 2022.\n[32] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.\n[33] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.\n[34] Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, and Thomas H Li. Revisiting temporal modeling for clip-based image-to-video knowledge transferring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6555–6564, 2023.\n[35] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018.\n[36] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013.\n[37] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 508:293–304, 2022.\n[38] Weixin Luo, Wen Liu, and Shenghua Gao. Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International conference on multimedia and expo (ICME), pages 439–444. IEEE, 2017.\n[39] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE international conference on computer vision, pages 341–349, 2017.\n[40] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. 
In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1975–1981, 2010. doi: 10.1109/CVPR.2010. 5539872.\n[41] Romany F Mansour, José Escorcia-Gutierrez, Margarita Gamarra, Jair A Villanueva, and Nallig Leal. Intelligent video anomaly detection and classification using faster rcnn with deep reinforcement learning model. Image and Vision Computing, 112:104229, 2021.\n[42] Jefferson Ryan Medel and Andreas Savakis. Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390, 2016.\n[43] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.\n[44] Sadegh Mohammadi, Alessandro Perina, Hamed Kiani, and Vittorio Murino. Angry crowds: Detecting violent events in videos. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pages 3–18. Springer, 2016.\n[45] Romero Morais, Vuong Le, Truyen Tran, Budhaditya Saha, Moussa Mansour, and Svetha Venkatesh. Learning regularity in skeleton trajectories for anomaly detection in videos. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11996–12004, 2019.\n[46] Medhini G Narasimhan and Sowmya Kamath S. Dynamic video anomaly detection and localization using sparse denoising autoencoders. Multimedia Tools and Applications , 77:13173–13195, 2018.\n[47] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In European Conference on Computer Vision, pages 1–18. Springer, 2022.\n[48] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14372–14381, 2020.\n[49] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.\n[50] Tal Reiss and Yedid Hoshen. Attribute-based representations for accurate and interpretable video anomaly detection. arXiv preprint arXiv:2212.00789, 2022.\n[51] Manassés Ribeiro, André Eugênio Lazzaretti, and Heitor Silvério Lopes. A study of deep convolutional auto-encoders for anomaly detection in videos. Pattern Recognition Letters, 105:13–22, 2018.\n[52] Mohammad Sabokrou, Mahmood Fathy, and Mojtaba Hoseini. Video anomaly detection and localisation based on the sparsity and reconstruction error of auto-encoder. Electronics Letters, 52(13):1122–1124, 2016.\n[53] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.\n[54] Yao Tang, Lin Zhao, Shanshan Zhang, Chen Gong, Guangyu Li, and Jian Yang. Integrating prediction and reconstruction for anomaly detection. Pattern Recognition Letters, 129:123–130, 2020.\n[55] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021.\n[56] Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021.\n[57] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. 
Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.\n[58] Yuzheng Wang, Zhaoyu Chen, Dingkang Yang, Yang Liu, Siao Liu, Wenqiang Zhang, and Lizhe Qi. Adversarial contrastive distillation with adaptive denoising. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.\n[59] Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23034–23044, 2023.\n[60] Donglai Wei, Yang Liu, Xiaoguang Zhu, Jing Liu, and Xinhua Zeng. Msaf: Multimodal supervise-attention enhanced fusion for video anomaly detection. IEEE Signal Processing Letters, 29:2178–2182, 2022.\n[61] Peng Wu, Jing Liu, and Fang Shen. A deep one-class neural network for anomalous event detection in complex scenes. IEEE transactions on neural networks and learning systems, 31(7):2609–2622, 2019.\n[62] Ke Xu, Tanfeng Sun, and Xinghao Jiang. Video anomaly detection and localization based on an adaptive intra-frame classification network. IEEE Transactions on Multimedia, 22(2):394–406, 2019.\n[63] Chuhui Xue, Wenqing Zhang, Yu Hao, Shijian Lu, Philip HS Torr, and Song Bai. Language matters: A weakly supervised vision-language pre-training approach for scene text detection and spotting. In European Conference on Computer Vision, pages 284– 302. Springer, 2022.\n[64] Muchao Ye, Xiaojiang Peng, Weihao Gan, Wei Wu, and Yu Qiao. Anopcn: Video anomaly detection via deep predictive coding network. In Proceedings of the 27th ACM international conference on multimedia, pages 1805–1813, 2019.\n[65] Yuan Yuan, Yachuang Feng, and Xiaoqiang Lu. Structured dictionary learning for abnormal event detection in crowded scenes. 
Pattern Recognition, 73:99–110, 2018.\n[66] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, pages 358–376. Springer, 2020.\n[67] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. Bytetrack: Multi-object tracking by associating every detection box. In European conference on computer vision, pages 1–21. Springer, 2022.\n[68] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. Spatiotemporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia, pages 1933–1941, 2017.\n[69] Liuping Zhou and Jing Yang. Video anomaly detection with memory-guided multilevel embedding. International Journal of Multimedia Information Retrieval, 12(1):6, 2023.\n[70] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. 
In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13041–13049, 2020.\n","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/vlavad-vision-language-models-assisted-unsupervised-vad/","section":"Papers","summary":"Proposes VLAVAD, an unsupervised video anomaly detection method leveraging vision-language pre-trained models, utilizing semantic features, Selective Prompt Adapter, and Sequence State Space Module to improve interpretability and transferability, achieving state-of-the-art performance on the ShanghaiTech dataset.","title":"VLAVAD: Vision-Language Models Assisted Unsupervised Video Anomaly Detection","type":"method"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xuerong-zhou/","section":"Authors","summary":"","title":"Xuerong Zhou","type":"authors"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yalong-jiang/","section":"Authors","summary":"","title":"Yalong Jiang","type":"authors"},{"content":"","date":"1 January 2024","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yanning-zhang/","section":"Authors","summary":"","title":"Yanning Zhang","type":"authors"},{"content":" AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM # Sunghyun Ahn * *, Youngwan Jo * *, Kijung Lee, Sein Kwon, Inpyo Hong, Sanghyun Park † Yonsei University, Seoul, Korea\n{skd, jyy1551, rlwjd4177, seinkwon97, hip9863, sanghyun}@yonsei.ac.kr\nAbstract # Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. 
Consequently, users must retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes the customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD considers user-defined text as an abnormal event and detects frames containing a specified event in a video. We effectively implemented AnyAnomaly using context-aware visual question answering without fine-tuning the large vision language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach showed competitive performance on VAD benchmark datasets, achieving state-of-the-art results on the UBnormal dataset and outperforming other methods in generalization across all datasets. Our code is available online at github.com/SkiddieAhn/Paper-AnyAnomaly.\n1. Introduction # Video anomaly detection (VAD) aims to detect abnormal events in video streams. Abnormal events include the actions of objects that are inappropriate for the environment (e.g., climbing over a fence) or objects with unusual appearances (e.g., a bicycle on a walkway). However, abnormal events are rare and diverse, making it difficult to construct large-scale datasets for VAD. Therefore, VAD is recognized as a highly challenging problem.\n* Equal contribution † Corresponding author\nFigure 1. Comparison of traditional video anomaly detection (VAD) and customizable video anomaly detection (C-VAD). 
Traditional VAD models struggle with generalization, making them hard to apply in diverse environments, while C-VAD can handle various video environments.\nTo overcome these limitations, previous studies primarily used one-class classification (OCC) methods that learn only from normal data. In the OCC approach, the model learns normal patterns and classifies the cases that deviate from them as abnormal. Representative OCC methods include classification-based [14, 30, 39], distance-based [2, 27, 29], and prediction-based [10, 15, 17, 37] methods, all of which have demonstrated excellent performance in VAD tasks.\nHowever, because normal and abnormal classes can be defined differently depending on the environment, OCC methods cannot always guarantee generalized performance. For example, as shown on the left side of Figure 1, a model trained in a campus environment learns the characteristics of a \u0026lsquo;person\u0026rsquo; as a normal pattern and classifies a \u0026lsquo;car\u0026rsquo; as abnormal. However, when this model is applied to a road environment, a \u0026lsquo;car\u0026rsquo; is still detected as abnormal, which can increase the number of false positives. Therefore, OCC methods require the retraining of normal patterns for each new environment, which entails additional costs, such as data collection, expert intervention, and high-performance equipment. Because of these limitations, the application of VAD models to real-world scenarios is challenging.\nTo address this issue, we propose a novel technique called customizable video anomaly detection (C-VAD). C-VAD considers user-defined text as abnormal events and detects the frames containing these events in the video. For instance, in campus videos, \u0026lsquo;car\u0026rsquo; can be set as an abnormal event, while in road videos, \u0026lsquo;person\u0026rsquo; can be set as an abnormal event. In contrast to existing VAD models, which judge abnormalities based on learned normal patterns, C-VAD dynamically detects abnormal patterns based on the text provided. 
This implies that as the generalizability of visual text analysis improves, anomaly detection becomes more effective in various environments. Consequently, we introduce a zero-shot capable C-VAD approach, as shown on the right side of Figure 1, and propose the AnyAnomaly model, which allows for VAD in various environments without the need for additional training.\nAn effective method to implement zero-shot C-VAD is to leverage large vision language models (LVLMs). Recently, LVLMs have demonstrated outstanding generalization performance in visual text analysis. By leveraging this capability, C-VAD can be performed effectively across various environments. The most intuitive method involves performing visual question answering (VQA) [4] on each frame to estimate the anomaly score. For instance, one could provide the model with the prompt: \u0026ldquo;Return a value between 0 (no) and 1 (yes) indicating how well the input image represents the text provided by the user\u0026rdquo;. This was used as the baseline model. However, through experiments, we observed the following limitations of the baseline model: (1) Latency is high because of the large computational cost of LVLMs. (2) Specific objects are difficult to analyze because of the characteristics of surveillance videos, such as foreground-background imbalance and object congestion. (3) Action-related anomalies are difficult to detect because temporal information cannot be utilized.\nTo overcome these limitations, we designed the AnyAnomaly model with the structure shown in Figure 2. First, to reduce the latency, we adopted a segment-level approach that groups consecutive frames into a single segment for processing. For this purpose, we introduced a key frames selection module (KSM) that selects key frames representing the segment and performed VQA per segment. Second, instead of performing simple image-text matching, we introduced a context-aware VQA approach to enable a deeper understanding of the scene. 
To this end, we additionally utilized two types of information: position context (PC) and temporal context (TC). PC is a context that emphasizes important locations within a frame, enhancing the object analysis capability of the LVLM. TC is a context that structures scene changes over time into a grid format, improving the action analysis capability of the LVLM. Notably, the proposed KSM and context generation modules operate in a training-free manner, allowing for easy application of C-VAD without additional data training.\nTo evaluate the performance of C-VAD, we classified existing VAD benchmark datasets based on anomaly types to create the C-VAD datasets. Through this process, we demonstrated the superiority of AnyAnomaly. Furthermore, despite being a zero-shot approach, AnyAnomaly achieved competitive performance on VAD datasets compared to traditional OCC-based VAD models. It achieved state-of-the-art (SOTA) results on the UBnormal dataset [1] and showed superior generalization across all datasets. The proposed approach is expected to be an effective solution for deploying VAD technology in real-world applications. The contributions of this study are as follows:\n- We propose the C-VAD technique for anomaly detection in diverse environments. To the best of our knowledge, it is the first to perform VAD based on user-defined anomalies.\n- We develop the AnyAnomaly model, which applies context-aware VQA to perform C-VAD effectively.\n- To evaluate the performance of C-VAD, we construct new C-VAD datasets and experimentally verify the superiority of AnyAnomaly.\n- AnyAnomaly achieves SOTA performance on the UBnormal dataset and outperforms other methods in generalization across all datasets.\nFigure 2. The architecture of AnyAnomaly\n2. Related Work # Video Anomaly Detection. Most VAD models adopt the OCC approach to detect anomalies by learning normal patterns. 
Among them, prediction-based methods train models to predict future or past frames based on normal frames, assuming that abnormal frames exhibit larger prediction errors. Liu et al. [17] proposed a method that utilizes FlowNet [11] and GANs [23] to predict the (t + 1)-th frame given t input frames. Yang et al. [37] introduced an approach that selects key frames from t input frames to predict the entire sequence. However, because the definitions of normal and abnormal patterns may vary depending on the environment, the OCC approach, which relies on learned normal patterns, has limited generalization performance. To mitigate this limitation, recent studies have explored cross-domain VAD (xVAD). Notably, zxVAD [3] enhances the adaptability to new environments by synthesizing abnormal patterns using the CutMix technique on images from auxiliary datasets. However, these approaches depend on fixed data transformations, making it difficult to fully capture the diverse abnormal patterns that may occur in real-world scenarios. Therefore, we propose a novel VAD method that uses textual information to dynamically detect abnormal patterns that vary depending on the environment.\nLarge Vision Language Models. Large language models have primarily been used in natural language processing; however, they have recently been applied to multimodal tasks such as image captioning and VQA. For example, MiniGPT-4 [41] processes multimodal inputs by connecting a pre-trained vision encoder to the Vicuna [7] model through a linear layer. Recent LVLMs have employed novel visual encoding techniques to better understand images. Chat-UniVi [13] generates dynamic tokens for images, thereby reducing unnecessary information and effectively extracting key visual features. This model enables flexible analysis by applying dynamic tokens across various resolutions. 
MiniCPM-V [38] applies the best partition technique according to the image resolution and generates tokens optimized for each segment, thereby improving the memory efficiency. However, despite the advancements in LVLMs, they are trained for general purposes, making their direct application in VAD challenging. To handle VAD effectively, it is essential to consider the characteristics of surveillance videos and leverage temporal information. Therefore, we propose a training-free approach to minimize the domain gap between the LVLMs and VAD tasks.\n3. Method # 3.1. Overview # Figure 2 illustrates the structure of the AnyAnomaly model, which performs context-aware VQA. The input is a video segment S = {s_0, ..., s_{N−1}} comprising N frames, where N is a multiple of 4. The KSM selects key frames K = {k_0, ..., k_3} from S. Among the selected key frames, the representative frame k̂ is used to generate PC, whereas K is used to create TC. Subsequently, k̂, PC, and TC are utilized as image inputs for the LVLM, whereas the user-provided text X is combined with a prompt and used as the text input. Finally, the LVLM\u0026rsquo;s response results are integrated to compute an anomaly score.\nThe user-defined text X refers to a natural language description of the anomaly the user wishes to detect. It can be a single word (e.g., \u0026ldquo;bicycle\u0026rdquo;), diverse events (e.g., \u0026ldquo;jumping-falling\u0026rdquo;), or a complex behavior (e.g., \u0026ldquo;driving outside the lane\u0026rdquo;). In the case of diverse events, each event keyword is processed individually as a single word.\n3.2. Key frames Selection Module # Figure 3a shows KSM, a key component of the segment-level approach. 
For this purpose, we selected four frames representing the segment as K and utilized the CLIP [25] model, which was trained to match images and text.\nSpecifically, S and X are inputs to the image encoder E_I and text encoder E_T, respectively, and the similarity is calculated using the dot product between each of the N image embeddings and the text embedding. The frame with the highest similarity is selected as the representative frame k̂.\nThe index of the representative frame k̂, denoted as î, is used to select the other key frames. We divide the segment into four groups of equal size and select the (î mod N/4)-th frame from each group. For example, when N = 8 and î = 4, the 0-th frame from each group is selected and the final set is K = {s_0, s_2, s_4, s_6}. This process is defined as follows:\nî = argmax_{i ∈ {0, ..., N−1}} E_I(s_i) · E_T(X), K = {s_{j·N/4 + (î mod N/4)} | j = 0, 1, 2, 3}\nUsing the KSM, K is generated by considering both text alignment and temporal uniformity, thereby enabling effective context generation. A comparative analysis of the key frames selection method is presented in Section 4.5.\n3.3. Context Generation # PC and TC are key elements of context-aware VQA, serving as additional information that complements the input image. PC enhances the object analysis capability of the LVLM and is generated through WinCLIP-based attention (WA). TC strengthens the action analysis ability of the LVLM and is created through grid image generation (GIG).\nFigure 3. Architecture of the proposed modules. KSM is essential for the segment-level approach, and WA and GIG are crucial for context generation.\nWinCLIP-based Attention. Figure 3b illustrates the WA method. We emphasize the regions related to X in k̂ based on WinCLIP, as proposed by Jeong et al. [12]. First, k̂ is divided into multiple windows, and individual embeddings are generated from each window using E_I. For example, when the image size is 240 × 240, it is divided into 25 windows of size 48 × 48, and the embeddings of each window are collected to create a small-scale window embedding map W_s ∈ R^{25×D}. 
By adjusting the window size, a middle-scale window embedding map W_m and a large-scale window embedding map W_l are also generated, and the similarity between these embedding maps and the text embedding z ∈ R^D is calculated. The final similarity map M is generated by averaging the similarities calculated on three scales:\nWe combined the template proposed by Jeong et al. [12] with X and passed it through E_T to generate z. Finally, we multiplied M and k̂ to create PC:\nPC = f_norm(M) ⊙ k̂\nHere, f_norm represents min-max normalization, and ⊙ denotes element-wise multiplication. M was used after interpolation and reshaping to match the resolution of k̂. Because PC is created by integrating similarities from multiple scales, it is robust to object size and location, and operates effectively even in situations with multiple objects.\nGrid Image Generation. Figure 3c illustrates the GIG method, which comprises two stages. In the multiscale grid generation stage, K is used to create grid images at different scales. Similar to the process described in WA, each frame of K is divided into multiple windows, and the windows at the same position are connected in a 2×2 grid format to create a single grid image. This process is defined as follows:\nHere, u^i_j refers to the i-th window created from k_j, and g^i refers to the i-th grid image. We defined the sets of grid images generated using small-, middle-, and large-scale windows as G_s, G_m, and G_l, respectively.\nIn the grid image selection stage, the previously created sets are aggregated to generate G_all. Then, using the same method as in KSM to select k̂, the grid image with the highest similarity to the text is chosen to generate TC:\nTC = argmax_{g ∈ G_all} E_I(g) · E_T(X)\nThe TC generated through this process represents the object movement over time within the same background, making it advantageous for action analysis and robust to various object sizes. An analysis of the window sizes used in the WA and GIG is presented in Section 4.5.\n3.4. 
Anomaly Detection # Instead of tuning the LVLM, we propose a new prompt and context for performing context-aware VQA. The VQA results were used as anomaly scores to enable training-free zero-shot anomaly detection.\nPrompt Design. Figure 4 illustrates the proposed prompt P. The prompt comprises three main components: \u0026lsquo;task\u0026rsquo;, \u0026lsquo;consideration\u0026rsquo;, and \u0026lsquo;output\u0026rsquo;. First, \u0026lsquo;task\u0026rsquo; defines the operation the LVLM should perform, specifically evaluating whether X is present in the image. Next, \u0026lsquo;consideration\u0026rsquo; specifies factors to be taken into account during evaluation, while \u0026lsquo;output\u0026rsquo; defines the format for presenting the evaluation results. To leverage the chain-of-thought [33] effect, we instructed the model to provide brief reasoning along with the anomaly score, rounded to one decimal place. When conducting VQA using TC, an additional element, \u0026lsquo;context\u0026rsquo;, is inserted between \u0026lsquo;task\u0026rsquo; and \u0026lsquo;consideration\u0026rsquo; in the prompt. This context element conveys the meaning of the rows and columns of TC to the LVLM. We define the modified prompt as P*. A comparative analysis of prompts is presented in the supplementary materials.\nFigure 4. Proposed prompt for VQA\nAnomaly Scoring. The context serves as supplementary information to the image. However, because the LVLM accepts only a single image as input, it is challenging to utilize both the original and additional information simultaneously. To address this issue, we adopt a late fusion approach. Specifically, k̂, PC, and TC were used as the image inputs for the LVLM. The LVLM returns an anomaly score for each input, and these three scores are combined to compute the final ascore:\nHere, γ is a hyperparameter that adjusts the proportion of context reflected in ascore. 
A performance comparison for different values of this hyperparameter is presented in the supplementary materials.\nConsequently, even if the abnormal frame k̂ receives a low score, the final anomaly score will be high if the additional information, PC or TC, is assigned a high score. This enabled accurate anomaly detection. Finally, to create frame-level anomaly scores, we duplicated the ascore for the length of each segment and then applied a temporal 1-D Gaussian filter for smoothing, following prior works [27, 32].\nFigure 5. Comparison between the VAD and C-VAD datasets\n4. Experiments # 4.1. Datasets # Figure 5 illustrates the composition of the VAD and proposed C-VAD datasets. In conventional VAD datasets, videos are not categorized by abnormal class type. In contrast, the proposed C-VAD datasets are organized by abnormal event type, with videos classified as positive or negative based on the presence of each abnormality. This categorization enables a precise evaluation of detection performance for specific types of abnormalities (e.g., bicycle). In this study, we validated the effectiveness of the proposed method on three VAD datasets, CUHK Avenue (Ave) [18], ShanghaiTech Campus (ShT) [20], and UBnormal (UB) [1], as well as two C-VAD datasets, Customizable-ShT (C-ShT) and Customizable-Ave (C-Ave). Further details of the datasets are provided in the supplementary materials.\n4.2. Evaluation Criteria # To ensure consistency with previous VAD studies, the performance of the proposed model was evaluated using the micro-averaged area under the receiver operating characteristic curve (micro AUROC) metric. Specifically, the anomaly scores of all the frames in the dataset were aggregated, and the threshold of the anomaly score was progressively adjusted to compute the final evaluation.\n4.3. Results # Tables 1 and 2 present the evaluation results on the C-VAD datasets. The baseline, as described in Section 1, performs VQA at the frame level to compute anomaly scores. 
The proposed model achieved performance improvements of 9.88% and 13.65% compared to the baseline on the C-ShT and C-Ave datasets, respectively. Specifically, it showed improvements of 14.34% and 8.2% in the action class, and 3.25% and 21.98% in the appearance class, respectively.\nTable 1. Performance comparison on C-ShT dataset\n| Category | Class | Baseline | +KSM | +KSM/PC | +KSM/TC | Proposed | Improvement (%) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| Action | Skateboarding | 61.3 | 57.06 | 57.79 | 73.66 | 73.66 | 20.16 |\n| Action | Throwing | 91.41 | 72.82 | 88.74 | 82.53 | 90.67 | -0.81 |\n| Action | Running | 53.13 | 51.93 | 53.68 | 59.77 | 60.11 | 13.14 |\n| Action | Loitering | 61.98 | 51.96 | 81.27 | 76.94 | 81.27 | 31.12 |\n| Action | Jumping | 82.84 | 92.89 | 92.91 | 95.31 | 95.31 | 15.05 |\n| Action | Falling | 78.31 | 78.95 | 79.24 | 88.01 | 88.01 | 12.39 |\n| Action | Fighting | 84.48 | 91.18 | 91.18 | 98.06 | 98.06 | 16.07 |\n| Action | Average | 73.35 | 72 | 77.83 | 82.04 | 83.87 | 14.34 |\n| Appearance | Car | 88.72 | 90.96 | 91.46 | 90.96 | 91.46 | 3.09 |\n| Appearance | Hand truck | 95.5 | 98.2 | 98.91 | 99.2 | 99.2 | 3.87 |\n| Appearance | Bicycle | 72.36 | 72.46 | 78.47 | 72.46 | 78.47 | 8.44 |\n| Appearance | Motorcycle | 88.04 | 86.72 | 86.72 | 86.72 | 86.72 | -1.5 |\n| Appearance | Average | 86.16 | 87.09 | 88.89 | 87.34 | 88.95 | 3.25 |\n| Overall | Average | 78.01 | 77.48 | 81.85 | 83.97 | 85.72 | 9.88 |\nTable 2. Performance comparison on C-Ave dataset\n| Category | Class | Baseline | +KSM | +KSM/PC | +KSM/TC | Proposed | Improvement (%) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| Action | Throwing | 78.44 | 80.13 | 89.77 | 82.4 | 89.77 | 14.44 |\n| Action | Running | 75.82 | 77.67 | 77.67 | 77.9 | 77.9 | 2.74 |\n| Action | Dancing | 85.65 | 72.28 | 76.64 | 91.92 | 91.92 | 7.32 |\n| Action | Average | 79.97 | 76.69 | 81.36 | 84.07 | 86.53 | 8.2 |\n| Appearance | Too close | 57.23 | 61.48 | 61.48 | 91.78 | 91.78 | 60.37 |\n| Appearance | Bicycle | 99.99 | 99.84 | 99.99 | 99.93 | 100 | 0.01 |\n| Appearance | Average | 78.61 | 80.66 | 80.74 | 95.86 | 95.89 | 21.98 |\n| Overall | Average | 79.43 | 78.28 | 81.11 | 88.79 | 90.27 | 13.65 |\nWhen only KSM was applied to the baseline, the execution time decreased in proportion to the segment length, whereas the average performance remained similar to that of the baseline. This is because CLIP effectively selects representative frames for each segment, thereby compensating for the loss of temporal information. 
However, because it does not fully capture fine-grained spatio-temporal details, its performance significantly decreases for certain classes. Therefore, we address these issues using the proposed contextual information. First, using PC resulted in performance improvements of 5.64% and 3.62% compared to the KSM, as the LVLM focused on analyzing objects related to X. Additionally, applying TC led to performance improvements of 8.38% and 14.43% over the KSM, respectively, with particularly notable enhancements observed in the action class. This indicates that utilizing the temporal information provided in the grid image is essential for action analysis. Additional validations, such as FPS comparisons by segment length and performance evaluations across various LVLMs, are presented in the supplementary materials.\n4.4. Qualitative Analysis # To analyze the effect of context-aware VQA, we present the visualization results for the anomaly scores and input frames in Figure 6. When PC was not applied, the bicycle object appeared smaller than the other objects, leading to lower detection performance. Once PC is applied, the bicycle region is emphasized, thereby enhancing the object recognition capability of the LVLM. Similarly, without TC, the model misinterpreted fighting as standing, resulting in lower detection performance. Incorporating temporal information through TC improves the action recognition capability of the LVLM. These results demonstrate that context-aware VQA is more effective than conventional VQA.\nFigure 6. Anomaly score comparison and context visualization\n4.5. Ablation Study # Key frames Selection. We conducted an ablation study on key frames selection from two perspectives: temporal uniformity and text alignment. The random method considers neither of these aspects, whereas the CLIP-based approach considers only text alignment. 
Selecting key frames using CLIP after grouping ensures text alignment but does not guarantee temporal uniformity. Applying grouping after CLIP resulted in evenly distributed key frames, thereby considering both temporal uniformity and text alignment. As shown in Table 3, incorporating both factors yielded the best performance for C-VAD, highlighting the critical role of temporal uniformity in action recognition. Furthermore, RD* and CP*, which do not utilize the contextual information, perform worse than the random method with context (RD), which disregards both temporal uniformity and text alignment. This demonstrates the importance of leveraging the contextual information.\nTable 3. Comparison of key frames selection methods. RD, CP, and Gr. indicate random, CLIP, and grouping, respectively. * indicates testing without context. Act. and App. indicate the action and appearance classes, respectively.\n| Key frames | C-ShT Act. | C-ShT App. | C-ShT Total | C-Ave Act. | C-Ave App. | C-Ave Total |\n| --- | --- | --- | --- | --- | --- | --- |\n| RD* | 69.9 | 84.0 | 75.0 | 66.4 | 78.8 | 71.3 |\n| CP* | 72.0 | 87.1 | 77.5 | 76.7 | 80.7 | 78.3 |\n| RD | 80.0 | 89.1 | 83.3 | 79.1 | 92.3 | 84.4 |\n| CP | 81.2 | 88.9 | 84.0 | 84.3 | 81.2 | 83.1 |\n| Gr. → CP | 82.2 | 88.8 | 84.7 | 83.9 | 92.2 | 87.2 |\n| CP → Gr. | 83.9 | 89.0 | 85.7 | 86.5 | 95.9 | 90.3 |\nWindow Size. Table 4 presents the experimental results based on the window sizes used in PC and TC. For the action classes, the best performance was achieved with the large window size in C-ShT and the middle window size in C-Ave. This indicates that middle or large window sizes are more effective in capturing temporal movements and interactions between multiple objects. For appearance classes, the optimal performance was observed with the small window size in C-ShT and the middle window size in C-Ave, suggesting that the appropriate window size varies depending on the dataset owing to differences in object sizes. 
To enhance the generalization performance of the model, we adopted an approach that utilized all three window sizes and found that incorporating them yielded the best overall performance.\nTable 4. Comparison on window size.\n| Window Size | C-ShT Act. | C-ShT App. | C-ShT Total | C-Ave Act. | C-Ave App. | C-Ave Total |\n| --- | --- | --- | --- | --- | --- | --- |\n| small | 78.8 | 90.6 | 83.1 | 84.7 | 87.1 | 85.7 |\n| middle | 81.2 | 89.0 | 84.1 | 87.5 | 92.0 | 89.3 |\n| large | 82.1 | 89.7 | 84.9 | 86.8 | 86.4 | 86.6 |\n| all | 83.9 | 89.0 | 85.7 | 86.5 | 95.9 | 90.3 |\n4.6. Comparison with SOTA # To assess the effectiveness of AnyAnomaly in handling multiple text inputs, we conducted experiments on the VAD benchmark datasets. For performance evaluation, each anomaly class in the dataset was treated as X, and the maximum anomaly score among all computed scores was assigned to the corresponding segment. Table 5 presents a performance comparison with frame-centric VAD methods. Despite not being trained on VAD datasets, AnyAnomaly demonstrated a performance comparable to that of SOTA methods. Notably, it achieved a new SOTA performance of 74.5% on the UB dataset, which contains 29 diverse backgrounds and 22 abnormal event types, demonstrating the effectiveness of the proposed model in various environments. Furthermore, while LLM-based methods (e.g., AnomalyRuler) require rule generation and aggregation using a few normal samples, the proposed method achieves competitive performance solely through zero-shot inference, highlighting its practical applicability.\nTable 5. Comparison with state-of-the-art VAD methods. * indicates testing without context.\n| Method | Venue | Zero-shot | Ave | ShT | UB |\n| --- | --- | --- | --- | --- | --- |\n| AMMC-Net[6] | AAAI 21 | ✗ | 86.6 | 73.7 | - |\n| STEAL-Net[5] | ICCV 21 | ✗ | 87.1 | 73.7 | - |\n| MPN[21] | CVPR 21 | ✗ | 89.5 | 73.8 | - |\n| DLAN-AC[36] | ECCV 22 | ✗ | 89.9 | 74.7 | - |\n| UBnormal[1] | CVPR 22 | ✗ | - | - | 68.5 |\n| FPDM[34] | ICCV 23 | ✗ | 90.1 | 78.6 | 62.7 |\n| SLM[31] | ICCV 23 | ✗ | 90.9 | 78.8 | - |\n| USTN-DSC[37] | CVPR 23 | ✗ | 89.9 | 73.8 | - |\n| AnomalyRuler[35] | ECCV 24 | ✗ | 89.7 | 85.2 | 71.9 |\n| MULDE[24] | CVPR 24 | ✗ | - | 81.3 | 72.8 |\n| AED-MAE[28] | CVPR 24 | ✗ | 91.3 | 79.1 | 58.5 |\n| MA-PDM[40] | AAAI 25 | ✗ | 91.3 | 79.2 | 63.4 |\n| AccI-VAD[27] | TMLR 25 | ✗ | - | - | 66.8 |\n| AnyAnomaly* | - | ✓ | 81.4 | 77.2 | 73.1 |\n| AnyAnomaly | - | ✓ | 87.3 | 79.7 | 74.5 |\n4.7. Generalization Performance Comparison # Table 6 presents a comparison of the generalization performance of AnyAnomaly. Although STEAL-Net [5] and Jigsaw [32] achieved high accuracy in same-domain testing, their performance was significantly degraded in cross-domain settings. Specifically, on the Ave dataset, the performances of STEAL-Net and Jigsaw dropped from 87.1% to 54.3% and from 92.2% to 62.9%, respectively. Similarly, on the ShT dataset, their performances dropped from 73.7% to 51.7% and from 84.3% to 59.3%, respectively. This suggests that the existing OCC-based VAD models tend to overfit the training data, making them less effective when applied to new environments. For instance, \u0026lsquo;Too close\u0026rsquo;, where an object is in close proximity to the camera, is considered anomalous in the Ave dataset but normal in the ShT dataset. Consequently, OCC-based models trained on ShT struggle to detect such anomalies.\nTable 6. Generalization performance comparison. Tr.: cross-domain training, where models trained on one VAD dataset are evaluated on another. Few.: methods that adapt to the target domain using only a few training samples. Aux.: methods that utilize auxiliary datasets. *: since competitors did not perform cross-domain evaluations on ShT, we present their same-domain results instead.\n| Method | Tr. | Few. | Aux. | Ave | ShT |\n| --- | --- | --- | --- | --- | --- |\n| STEAL-Net[5] | ✓ | ✗ | ✗ | 54.3 | 51.7 |\n| Jigsaw[32] | ✓ | ✗ | ✗ | 62.9 | 59.3 |\n| rGAN[19] | ✓ | ✓ | ✗ | 76.6 | 77.9* |\n| MPN[21] | ✓ | ✓ | ✗ | 78.9 | 73.8* |\n| zxVAD[3] | ✓ | ✗ | ✓ | 82.2 | 71.6* |\n| Shibao et al.[8] | ✓ | ✗ | ✓ | 86.2 | 78.7 |\n| ZS CLIP[25] | ✗ | ✗ | ✗ | 62.3 | 60.9 |\n| ZS ImageBind[9] | ✗ | ✗ | ✗ | 64.5 | 61.3 |\n| LLaVA-1.5[16] | ✗ | ✗ | ✗ | 67.4 | 59.6 |\n| Video-ChatGPT[22] | ✗ | ✗ | ✗ | 76.9 | 69.1 |\n| AnyAnomaly | ✗ | ✗ | ✗ | 87.3 | 79.7 |\nThe zero- and few-shot VAD models designed for xVAD exhibited better generalization performance than the OCC-based models. 
However, few-shot models depend heavily on the number of K-shot samples, whereas zero-shot models require auxiliary datasets. Training-free methods using VLMs, such as ZS-CLIP and Video-ChatGPT, leverage strong image and video understanding capabilities, outperforming some VAD models. Nevertheless, their performance is still limited by domain gaps. In contrast, AnyAnomaly effectively overcomes these gaps by incorporating contextual information, achieving superior performance.\n5. Conclusion # We propose AnyAnomaly, a novel approach that leverages the LVLM for universal VAD. AnyAnomaly effectively performs C-VAD by incorporating a segment-level approach and context-aware VQA. This design reduces latency when processing large videos and minimizes the domain gap between the LVLM and the VAD task. Despite being a zero-shot method, AnyAnomaly demonstrates competitive performance on benchmark datasets and holds promise for real-world VAD. Furthermore, because it operates without any training and enables anomaly detection in any video, it significantly improves accessibility in the VAD domain. We anticipate that AnyAnomaly will contribute substantially to VAD research and practical deployment.\nReferences # [1] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20143–20153, 2022.\n[2] Sunghyun Ahn, Youngwan Jo, Kijung Lee, and Sanghyun Park. Videopatchcore: An effective method to memorize normality for video anomaly detection. In Proceedings of the Asian Conference on Computer Vision, pages 2179–2195, 2024.\n[3] Abhishek Aich, Kuan-Chuan Peng, and Amit K Roy-Chowdhury. Cross-domain video anomaly detection without target domain adaptation. 
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2579–2591, 2023.\n[4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.\n[5] Marcella Astrid, Muhammad Zaigham Zaheer, and Seung-Ik Lee. Synthetic temporal anomaly guided end-to-end video anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 207–214, 2021.\n[6] Ruichu Cai, Hao Zhang, Wen Liu, Shenghua Gao, and Zhifeng Hao. Appearance-motion memory consistency network for video anomaly detection. In Proceedings of the AAAI conference on artificial intelligence, pages 938–946, 2021.\n[7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.\n[8] Shibo Gao, Peipei Yang, and Linlin Huang. Scene-adaptive svad based on multi-modal action-based feature extraction. In Proceedings of the Asian Conference on Computer Vision, pages 2471–2488, 2024.\n[9] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.\n[10] Seungkyun Hong, Sunghyun Ahn, Youngwan Jo, and Sanghyun Park. Making anomalies more anomalous: Video anomaly detection using a novel generator and destroyer. IEEE Access, 2024.\n[11] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2462–2470, 2017.\n[12] Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. Winclip: Zero/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023.\n[13] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024.\n[14] Dongha Lee, Sehun Yu, and Hwanjo Yu. Multi-class data description for out-of-distribution detection. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery \u0026amp; Data Mining, pages 1362–1370, 2020.\n[15] Kijung Lee, Youngwan Jo, Sunghyun Ahn, and Sanghyun Park. Mdvad: Multimodal diffusion for video anomaly detection. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 121–133. Springer, 2025.\n[16] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.\n[17] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018.\n[18] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013.\n[19] Yiwei Lu, Frank Yu, Mahesh Kumar Krishna Reddy, and Yang Wang. Few-shot scene-adaptive anomaly detection. 
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 125–141. Springer, 2020.\n[20] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE international conference on computer vision, pages 341–349, 2017.\n[21] Hui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Learning normal dynamics in videos with meta prototype network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15425–15434, 2021.\n[22] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024.\n[23] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2794–2802, 2017.\n[24] Jakub Micorek, Horst Possegger, Dominik Narnhofer, Horst Bischof, and Mateusz Kozinski. Mulde: Multiscale log-density estimation via denoising score matching for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18868–18877, 2024.\n[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.\n[26] Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad S. Khan. Llava++: Extending visual capabilities with llama-3 and phi-3, 2024.\n[27] Tal Reiss and Yedid Hoshen. 
An attribute-based method for video anomaly detection. Transactions on Machine Learning Research.\n[28] Nicolae-C Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, Mubarak Shah, et al. Self-distilled masked auto-encoders are efficient video anomaly detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15984–15995, 2024.\n[29] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14318–14328, 2022.\n[30] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In International conference on machine learning, pages 4393–4402. PMLR, 2018.\n[31] Chenrui Shi, Che Sun, Yuwei Wu, and Yunde Jia. Video anomaly detection via sequentially learning multiple pretext tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10330–10340, 2023.\n[32] Guodong Wang, Yunhong Wang, Jie Qin, Dongming Zhang, Xiuguo Bao, and Di Huang. Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles. In European Conference on Computer Vision, pages 494–511. Springer, 2022.\n[33] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.\n[34] Cheng Yan, Shiyu Zhang, Yang Liu, Guansong Pang, and Wenjun Wang. Feature prediction diffusion model for video anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5527–5537, 2023. 
\n[35] Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. Follow the rules: Reasoning for video anomaly detection with large language models. ArXiv, abs/2407.10299, 2024.\n[36] Zhiwei Yang, Peng Wu, Jing Liu, and Xiaotao Liu. Dynamic local aggregation network with adaptive clusterer for anomaly detection. In European Conference on Computer Vision, pages 404–421. Springer, 2022.\n[37] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. Video event restoration based on keyframes for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14592–14601, 2023.\n[38] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.\n[39] Jihun Yi and Sungroh Yoon. Patch svdd: Patch-level svdd for anomaly detection and segmentation. In Proceedings of the Asian conference on computer vision, 2020.\n[40] Hang Zhou, Jiale Cai, Yuteng Ye, Yonghui Feng, Chenxing Gao, Junqing Yu, Zikai Song, and Wei Yang. Video anomaly detection with motion and appearance guided patch diffusion model. arXiv preprint arXiv:2412.09026, 2024.\n[41] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.\nAnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM # Supplementary Material #\nA. Experiment Details # A.1. Dataset Details # VAD Dataset. We used the CUHK Avenue (Ave) [18], ShanghaiTech Campus (ShT) [20], and UBnormal (UB) [1] datasets. Ave comprises videos captured by a single camera on a university campus and contains five types of abnormal events: throwing paper, running, dancing, approaching the camera (Too close), and bicycle. ShT is a campus CCTV dataset that includes 13 different background scenes and 11 types of abnormal events, such as bicycles, cars, fighting, and jumping. UB is a synthetic dataset generated using the Cinema4D software, encompassing 29 diverse background scenes, including indoor environments and sidewalks. It provides a total of 22 abnormal events, including not only challenging-to-detect events such as smoking and stealing but also complex scenarios such as driving outside the lane and people-car accidents.\nC-VAD Dataset. We constructed the Customizable-ShT (C-ShT) and Customizable-Ave (C-Ave) datasets. C-ShT reorganizes the test data of ShT into 11 abnormal event types and assigns new labels to each type. For example, in the bicycle category, videos containing bicycles were labeled positive, whereas all other videos were labeled negative. The frame-level labels were set to 1 only for frames in which a bicycle appeared in the positive videos. C-Ave was constructed by reorganizing the test data of Ave into 5 abnormal event types, following the same labeling methodology as C-ShT.\nA.2. Implementation Details # In the key experiments using the C-VAD datasets, we employed the efficient Chat-UniVi [13] 7B model, considering the balance between performance and speed. For the VAD dataset experiment, we utilized the effective MiniCPM-V [38] 8B model to achieve optimal performance and compared it with state-of-the-art (SOTA) models. The CLIP model used for key frames selection and context generation was ViT-B/32. For context generation, we adopted large, middle, and small window sizes of (120,120), (80,80), and (48,48), respectively. For C-Ave and Ave, the large window size was set to (240,240). All the experiments were conducted on a single NVIDIA GeForce RTX 3090 GPU.\nA.3. Prompt Details # Figure S1 shows the detailed prompts used in the experiments. First, a reasoning prompt is designed to obtain the chain-of-thought [33] effect by requiring a simple reason along with the anomaly score. This helps to break down the problem step-by-step, guiding the model to resolve complex issues more systematically. For example, the question \u0026ldquo;Does the image include jumping?\u0026rdquo; can be divided into two steps: 1. \u0026ldquo;Is there an object related to jumping (e.g., a person)?\u0026rdquo; and 2. \u0026ldquo;Is the object performing a jumping action?\u0026rdquo; This allows object-level image analysis, leading to more refined predictions. The consideration prompt encourages the assignment of a high score even when X is not central within the image. This prompt was introduced to address the issue where low scores are assigned simply because X exists but is not the central element. The effectiveness of this prompt tuning is compared and analyzed in Table S1.\nThe simple prompt instructs the LVLM to output only the anomaly score, adding the reasoning prompt leads the model to perform reasoning during the score calculation process, and applying the consideration prompt encourages the model to focus on the given text. Experimental results showed that using both the reasoning and consideration prompts achieved the best performance, suggesting that when the LVLM includes reasoning in the process, it produces more accurate results and can respond more precisely to user instructions through the consideration prompt.\nTable S1. Comparison on prompt tuning\n| Prompt Tuning | C-ShT | C-Ave |\n| --- | --- | --- |\n| Baseline (simple) | 70.38 | 67.58 |\n| Baseline (+reasoning) | 71.58 | 72.79 |\n| Baseline (+reasoning, consideration) | 78.01 | 79.43 |\n| Proposed (simple) | 79.29 | 74.01 |\n| Proposed (+reasoning) | 79.79 | 82.09 |\n| Proposed (+reasoning, consideration) | 85.72 | 90.27 |\nTable S2. Comparison on segment length\n| Segment length | C-ShT | C-Ave | FPS |\n| --- | --- | --- | --- |\n| Baseline | 78.01 | 79.43 | 0.96 |\n| 8 | 83.83 | 83.96 | 2.67 |\n| 16 | 83.45 | 87.45 | 4.49 |\n| 24 | 85.72 | 90.27 | 6.67 |\n| 32 | 82.5 | 85.94 | 8.45 |\nFigure S1. Prompt details. 
The content written in the simple version is not utilized when applying reasoning.\nTable S3. Comparison of different methods on various datasets\n| Dataset | Method | Value (γ1, γ2, γ3) | AUC |\n| --- | --- | --- | --- |\n| Ave | w/o context | - | 81.4 |\n| Ave | w/o tuning | 1.0, 1.0, 1.0 | 84.4 |\n| Ave | w/ tuning | 0.6, 0.3, 0.1 | 87.3 |\n| ShT | w/o context | - | 77.2 |\n| ShT | w/o tuning | 1.0, 1.0, 1.0 | 79.4 |\n| ShT | w/ tuning | 0.5, 0.3, 0.2 | 79.7 |\n| UB | w/o context | - | 73.1 |\n| UB | w/o tuning | 1.0, 1.0, 1.0 | 73.8 |\n| UB | w/ tuning | | 74.5 |\nB. Additional Quantitative Evaluation # B.1. Segment length and FPS # Table S2 presents the performance comparison and FPS based on different segment lengths. The baseline segment length was set to 1. Deriving anomaly scores at the segment level yielded superior performance compared to the baseline, which relies on a single frame. The highest AUC was achieved when the segment length was set to 24, reaching 85.72% and 90.27% for C-ShT and C-Ave, respectively. However, an excessively long segment introduces irrelevant information into the temporal context, leading to a decrease in accuracy. Furthermore, performing VAD at the segment level with a segment length of 24 resulted in a 594% improvement in FPS compared with the baseline (from 0.96 to 6.67).\nB.2. Hyperparameter Tuning # We tuned the three hyperparameters γ1, γ2, and γ3 used for the final anomaly score calculation for each VAD dataset. Each hyperparameter controls the influence of the anomaly score derived from the frame, position, and temporal contexts. As shown in Table S3, the optimal hyperparameter values vary across datasets owing to differences in object sizes and abnormal events. Additionally, comparing w/o context, which does not utilize context information, and w/o tuning, where all hyperparameters were set to the same value, we observed performance improvements of 3.0%, 2.2%, and 0.7% on Ave, ShT, and UB, respectively, even without hyperparameter tuning. 
In contrast, the performance differences owing to hyperparameter tuning were 2.9%, 0.3%, and 0.7%, respectively. This demonstrates the effectiveness of our proposed approach in utilizing context information in VAD and shows that it achieves strong generalization performance even without hyperparameter tuning.\nB.3. Diverse LVLM Comparison # Table S4 presents the results for C-ShT and C-Ave when using various LVLMs. We evaluated the performances of four SOTA LVLMs: Chat-UniVi [13], MiniGPT-4 [41], MiniCPM-V [38], and LLAVA++ [26].\nTable S4. Comparison of diverse LVLMs. The model highlighted in blue represents the most efficient model for the C-VAD task, while the one highlighted in purple indicates the most effective model.\n| LVLM | Pre-trained | C-ShT w/o context | C-ShT Proposed | C-Ave w/o context | C-Ave Proposed | FPS |\n| --- | --- | --- | --- | --- | --- | --- |\n| Chat-UniVi[13] | Chat-UniVi-7B | 77.5 | 85.7 | 78.3 | 90.3 | 6.67 |\n| MiniGPT-4[41] | LLaMA-2 Chat 7B | 54.0 | 67.4 | 53.9 | 55.3 | 1.26 |\n| MiniCPM-V[38] | MiniCPM-Llama3-V-2_5 (8B) | 87.7 | 90.1 | 86.3 | 91.0 | 1.36 |\n| LLAVA++[26] | LLaVA-Meta-Llama-3-8B-Instruct-FT | 73.3 | 82.8 | 59.0 | 69.4 | 7.25 |\nFigure S2. Example of complementarity between position and temporal context. The first example highlights the importance of position context and the second example emphasizes the importance of temporal context.\nAll experiments were conducted using the default settings, and \u0026lsquo;Pre-trained\u0026rsquo; refers to the names of the pre-trained model weights. The experimental results demonstrate that incorporating the proposed context-aware VQA improves the performance of all LVLMs. Specifically, context-aware VQA yields relative improvements ranging from 2.6% to 24.8%. Notably, even MiniCPM-V, which achieved the best performance without context-aware VQA, showed additional relative improvements of 2.7% and 5.4% for C-ShT and C-Ave, respectively, when context-aware VQA was applied. This confirms that leveraging the proposed context-aware VQA is effective for C-VAD. 
Additionally, we observed that Chat-UniVi, with an FPS of 6.67, was the most efficient model, whereas MiniCPM-V achieved the highest performance on both datasets, scoring 90.1% and 91.0%, respectively. Therefore, as mentioned in Appendix A.2, Chat-UniVi was used for the C-VAD experiments and MiniCPM-V was used for the VAD dataset experiments.\nC. Additional Qualitative Evaluation # C.1. Context Complementarity # In this section, we explain the complementarity between P C and T C in context-aware VQA. Figure S2 visualizes the key frame of a specific segment along with the images generated using WA and GIG for P C and T C. We also present the results of a context-aware VQA that utilizes these contexts.\nIn the first row, when the text input was \u0026lsquo;bicycle\u0026rsquo;, P C successfully identified the bicycle via WA, yielding a score of 1.0. However, T C suffered from a cropping effect due to motion over time, resulting in a lower score of 0.5. In the second row, when the text input was \u0026lsquo;jumping\u0026rsquo;, the attention result from WA failed to accurately locate the \u0026lsquo;jumping\u0026rsquo; person. Additionally, because of the lack of temporal information, P C was unable to recognize the jumping action, resulting in a score of 0.0. In contrast, T C captured the entire jumping action over time, achieving a score of 0.9.\nThese results demonstrate that the proposed P C, which focuses on the object appearance, and T C, which leverages temporal information, are complementary. By integrating both approaches, we enable effective generalization in VAD.\nFigure S3. Anomaly detection in diverse scenarios. Various abnormal events can emerge over time.\nC.2. Anomaly Detection in Diverse Scenarios # Figure S3 visualizes the results of VAD performed on videos containing multiple abnormal classes. The captions in each figure indicate the abnormal classes used in the corresponding video. 
We input each user-defined abnormal keyword as text individually to obtain its score and assigned the highest score as the anomaly score for the corresponding segment. As shown in the visualization results, the proposed AnyAnomaly enables VAD across various types of abnormal events. This demonstrates that AnyAnomaly can be effectively utilized even when the user aims to simultaneously detect multiple abnormal types.\nC.3. Anomaly Detection in Complex Scenarios # Figure S4 presents the visualization results of AnyAnomaly on complex scenarios. \u0026lsquo;Key Frame\u0026rsquo;, \u0026lsquo;Position Context\u0026rsquo;, and \u0026lsquo;Temporal Context\u0026rsquo; visualize k̂, P C, and T C, respectively. The text below each figure represents the LVLM output. These visualization results demonstrate that the proposed context-aware VQA, which utilizes P C and T C, is effective and contributes to improving VAD performance.\nAdditionally, in Figure S4d, we observe that the model can detect certain frames of \u0026ldquo;walking drunk\u0026rdquo; even without utilizing context information. This suggests that the strong visual reasoning capabilities of the LVLM enable VAD in complex scenarios. However, as shown in Figures S4a–S4c, relying solely on individual frames is insufficient for fully leveraging these reasoning abilities. Therefore, the proposed context-aware VQA approach is essential for effective VAD.\nD. Discussion # D.1. Comparison with traditional VAD # Traditional VAD methods and our zero-shot C-VAD each have distinct strengths and limitations. Traditional VAD detects anomalies as deviations from learned normal patterns, requiring no prior knowledge of specific anomaly types and delivering strong performance within the trained environment. However, it often exhibits poor generalization to unseen environments and typically necessitates retraining. 
In contrast, C-VAD requires prior knowledge of anomaly types but removes the need for retraining or additional data collection even when the definition of \u0026ldquo;normal\u0026rdquo; varies across users or environments. This makes it a practical and cost-effective solution for real-world applications. We anticipate that, with continued advances in LVLM technology, the proposed C-VAD will become even more effective in the future.\nD.2. Limitation # Efficiency is crucial in VAD; therefore, we utilized the most lightweight model among the SOTA LVLMs and adopted a segment-level approach to significantly reduce the latency. However, our method still requires three inputs per segment (key frame, position context, and temporal context) and involves a reasoning process, which makes real-time analysis more challenging. Furthermore, when multiple abnormal events occur simultaneously, each event must be processed independently, which leads to a substantial increase in latency. Hence, our future studies will aim to enhance the efficiency of C-VAD in handling multiple abnormal events simultaneously.\nFigure S4. Anomaly detection in complex scenarios. Results with and without the inclusion of context are presented. Panels: (a) Anomaly event: jaywalking; (b) Anomaly event: driving outside lane; (c) Anomaly event: people and car accident; (d) Anomaly event: walking drunk.\n","date":"1 December 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/anyanomaly-zero-shot-customizable-video-anomaly-detection-with-lvlm/","section":"Papers","summary":"Proposes the AnyAnomaly model utilizing large vision language models (LVLMs) for zero-shot, customizable video anomaly detection that detects user-defined anomalies without additional training, incorporating segment-level processing and context-aware visual question answering (VQA). 
The approach enhances generalization across diverse environments and achieves state-of-the-art results on benchmark datasets, demonstrating practical potential for real-world applications.","title":"AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM","type":"method"},{"content":"","date":"1 December 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/inpyo-hong/","section":"Authors","summary":"","title":"Inpyo Hong","type":"authors"},{"content":"","date":"1 December 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/kijung-lee/","section":"Authors","summary":"","title":"Kijung Lee","type":"authors"},{"content":"","date":"1 December 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/sanghyun-park/","section":"Authors","summary":"","title":"Sanghyun Park","type":"authors"},{"content":"","date":"1 December 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/sein-kwon/","section":"Authors","summary":"","title":"Sein Kwon","type":"authors"},{"content":"","date":"1 December 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/sunghyun-ahn/","section":"Authors","summary":"","title":"Sunghyun Ahn","type":"authors"},{"content":"","date":"1 December 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/youngwan-jo/","section":"Authors","summary":"","title":"Youngwan Jo","type":"authors"},{"content":" Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead # Yunkang Cao 1∗ , Xiaohao Xu 2∗ , Chen Sun 3∗ , Xiaonan Huang 2 and Weiming Shen 1\n1 Huazhong University of Science and Technology 2 University of Michigan, Ann Arbor 3 University of Toronto\nAbstract # Anomaly detection is a crucial task across different domains and data types. However, existing anomaly detection models are often designed for specific domains and modalities. 
This study explores the use of GPT-4V(ision), a powerful visual-linguistic model, to address anomaly detection tasks in a generic manner. We investigate the application of GPT-4V in multi-modality, multi-domain anomaly detection tasks, including image, video, point cloud, and time series data, across multiple application areas, such as industrial, medical, logical, video, 3D anomaly detection, and localization tasks. To enhance GPT-4V\u0026rsquo;s performance, we incorporate different kinds of additional cues such as class information, human expertise, and reference images as prompts. Based on our experiments, GPT-4V proves to be highly effective in detecting and explaining global and fine-grained semantic patterns in zero/one-shot anomaly detection. This enables accurate differentiation between normal and abnormal instances. Although we conducted extensive evaluations in this study, there is still room for future evaluation to further exploit GPT-4V\u0026rsquo;s generic anomaly detection capacity from different aspects. These include exploring quantitative metrics, expanding evaluation benchmarks, incorporating multi-round interactions, and incorporating human feedback loops. Nevertheless, GPT-4V exhibits promising performance in generic anomaly detection and understanding, thus opening up a new avenue for anomaly detection.\nAll evaluation samples, including image and text prompts, will be available at https://github.com/caoyunkang/GPT4V-for-Generic-Anomaly-Detection.\n∗ Authors contribute equally. Email: cyk_hust@hust.edu.cn, xiaohaox@umich.edu, chrn.sun@mail.utoronto.ca, xiaonanh@umich.edu, wshen@ieee.org\nContents # 1 Introduction\n1.1 Motivation and Overview\n1.2 Our Approach: Prompting GPT-4V for Anomaly Detection\n1.2.1 Prompt Designs\n1.2.2 Evaluation Scope: Modalities and Domains\n1.3 Limitations in Anomaly Detection Evaluation Based on GPT-4V\n2 Observations of GPT-4V on Multi-modal Multi-domain Anomaly Detection\n2.1 GPT-4V can address multi-modality and multi-field anomaly detection tasks in zero/one-shot regime\n2.2 GPT-4V can understand both global and fine-grained semantics for anomaly detection\n2.3 GPT-4V can automatically reason for anomaly detection\n2.4 GPT-4V can be enhanced with increasing prompts\n2.5 GPT-4V can be constrained in real-world application but still promising\n3 Industrial Image Anomaly Detection\n3.1 Task Introduction\n3.2 Testing philosophy\n3.3 Case Demonstration\n4 Industrial Image Anomaly Localization\n4.1 Task Introduction\n4.2 Testing philosophy\n4.3 Case Demonstration\n5 Point Cloud Anomaly Detection\n5.1 Task Introduction\n5.2 Testing philosophy\n5.3 Case Demonstration\n6 Logical Anomaly Detection\n6.1 Task Introduction\n6.2 Testing philosophy\n6.3 Case Demonstration\n7 Medical Image Anomaly Detection\n7.1 Task Introduction\n7.2 Testing philosophy\n7.3 Case Demonstration\n8 Medical Image Anomaly Localization\n8.1 Task Introduction\n8.2 Testing philosophy\n8.3 Case Demonstration\n9 Pedestrian Anomaly Detection\n9.1 Task Introduction\n9.2 Testing philosophy\n9.3 Case Demonstration\n10 Traffic Anomaly Detection\n10.1 Task Introduction\n10.2 Testing philosophy\n10.3 Case Demonstration\n11 Time Series Anomaly Detection\n11.1 Task Introduction\n11.2 Testing philosophy\n11.3 Case Demonstration\n12 Prospect\n13 Conclusion\nList of Figures # 1 Diagram of Evaluation GPT-4V on Multi-modality/fields Anomaly Detection\n2 Industrial Image Anomaly Detection: Case 1\n3 Industrial Image Anomaly Detection: Case 1\n4 Industrial Image Anomaly Detection: Case 2\n5 Industrial Image Anomaly Detection: Case 2\n6 Industrial Image Anomaly Detection: Case 3\n7 Industrial Image Anomaly Detection: Case 3\n8 Industrial Image Anomaly Localization: Case 1\n9 Industrial Image Anomaly Localization: Case 2\n10 Industrial Image Anomaly Localization: Case 3\n11 Point Cloud Anomaly Detection: Case 1\n12 Point Cloud Anomaly Detection: Case 1\n13 Point Cloud Anomaly Detection: Case 2\n14 Point Cloud Anomaly Detection: Case 2\n15 Point Cloud Anomaly Detection: Case\n16 Point Cloud Anomaly Detection: Case\n17 Logical Anomaly Detection: Case 1\n18 Logical Anomaly Detection: Case 2\n19 Logical Anomaly Detection: Case 3\n20 Logical Anomaly Detection: Case 4\n21 Medical Anomaly Detection: Case 1\n22 Medical Anomaly Detection: Case 1\n23 Medical Anomaly Detection: Case 2\n24 Medical Anomaly Detection: Case 2\n25 Medical Anomaly Detection: Case 3\n26 Medical Anomaly Detection: Case 3\n27 Medical Anomaly Detection: Case 4\n28 Medical Anomaly Detection: Case 4\n29 Medical Anomaly Localization: Case 1\n30 Medical Anomaly Localization: Case 2\n31 Medical Anomaly Localization: Case 3\n32 Medical Anomaly Localization: Case 4\n33 Pedestrian Anomaly Detection\n34 Traffic Anomaly Detection: Case 1\n35 Traffic Anomaly Detection: Case 2\n36 Time Series Anomaly Detection: Case\n37 Time Series Anomaly Detection: Case 2\n1 Introduction # 1.1 Motivation and Overview # Anomaly detection [20 , 19 , 72 , 10 , 78] involves identifying data patterns or data points that significantly deviate from normality. 
These anomalies or outliers are rare, unusual, or inconsistent data points that deviate from the majority of the data. The primary objective of anomaly detection is to automatically detect and pinpoint these irregularities, which may signify errors, fraud, unusual events, or other noteworthy phenomena, facilitating further investigation or necessary action. Anomaly detection techniques have been widely employed in diverse domains, such as industrial inspection [29 , 98], medical diagnosis [107], video surveillance [84], fraud detection [30] and many other areas where identifying unusual instances is crucial.\nDespite the existence of numerous techniques [14 , 3 , 69 , 41 , 38 , 79 , 110 , 16 , 103] for anomaly detection, many existing approaches predominantly rely on methods that describe the normal data distribution. They often overlook high-level perception and primarily treat it as a low-level task. However, practical applications of anomaly detection frequently necessitate a more comprehensive, high-level understanding of the data. 
Achieving this understanding entails at least three crucial steps:\nUnderstanding the Data Types and Categories: The first step involves a thorough comprehension of the data types and categories present in the dataset. Data can take various forms, including images, videos, point clouds, time-series data, etc. Each data type may require specific methods and considerations for anomaly detection. Furthermore, different categories may have distinct definitions of normal states. Determining Standards for Normal States: After obtaining the data types and categories, it would be feasible to further reason the standards for normal states, which requires a high-level understanding of the data. Evaluating Data Conformance: The final step is to assess whether the provided data conforms to the established standards for normality. Any deviation from these standards can be categorized as an anomaly. Recent advancements in large multimodal models (LMMs) [25 , 4 , 36 , 113 , 58 , 27 , 52] have shown robust reasoning capacity [55 , 57] and created new opportunities for improving anomaly detection. LMMs are typically trained on extensive multimodal datasets [80], enabling them to effectively analyze various data types, including natural language and visual information. They hold the potential to address the challenges associated with high-level anomaly detection [37 , 17 , 22 , 112].\nMoreover, OpenAI recently introduced GPT-4V(ision) [101], a state-of-the-art LMM that has exhibited remarkable performance across various practical applications. However, it remains uncertain whether GPT-4V can also exhibit robust capabilities for anomaly detection. 
The objective of this study is to bridge this knowledge gap by assessing the anomaly detection capabilities of GPT-4V.\n1.2 Our Approach: Prompting GPT-4V for Anomaly Detection # 1.2.1 Prompt Designs # The design of prompts plays a crucial role in effectively directing GPT-4V\u0026rsquo;s attention toward the specific aspects of the anomaly detection task. In this study, we primarily consider four types of prompts:\nTask Information Prompt: To prompt GPT-4V effectively for anomaly detection, it is essential to provide clear task information. This study formulates the prompt as follows: \u0026ldquo;Please determine whether the image contains anomalies or outlier points.\u0026rdquo; Class Information Prompt: The understanding of data types and categories is critical. In cases where GPT-4V may struggle to recognize the data class, explicit class information may be provided. For instance, \u0026ldquo;Please determine whether the image, which is related to the {CLS}, contains anomalies or defects.\u0026rdquo; Figure 1 | Comprehensive Evaluation of GPT-4V for Multi-modality Multi-task Anomaly Detection. In this study, we conduct a thorough evaluation of GPT-4V in the context of multi-modality anomaly detection. We consider four modalities: image, video, point cloud, and time series, and explore nine specific tasks, including industrial image anomaly detection/localization, point cloud anomaly detection, medical image anomaly detection/localization, logical anomaly detection, pedestrian anomaly detection, traffic anomaly detection, and time series anomaly detection. Our evaluation encompasses a diverse range of 15 datasets.\nNormal Standard Prompt: GPT-4V may encounter difficulties in answering questions related to determining normal standards, and sometimes the standards cannot even be established without human expertise. Hence, this study also explicitly provides the normal standards. 
For example, normal standards for the breakfast box in MVTec-LOCO [7] could be expressed as follows: \u0026ldquo;1. It should contain two oranges, one peach, and some cereal, nuts, and banana slices; 2. The fruit should be on the left side of the lunchbox, the cereal on the upper right, and the nuts and banana slices on the lower right of the lunchbox.\u0026rdquo; Reference Image Prompt: To ensure better alignment between normal standards and images, a normal reference image is provided alongside language prompts. For example, \u0026ldquo;The first image is normal. Please determine whether the second image contains anomalies or defects.\u0026rdquo; The study aims to explore how the use of these prompts, either individually or in different combinations depending on certain cases, impacts GPT-4V\u0026rsquo;s capacity for anomaly detection.\n1.2.2 Evaluation Scope: Modalities and Domains # Extensive evaluations are conducted in this study to assess the capabilities of GPT-4V in anomaly detection, as Fig. 1 shows. From the perspective of modalities, we evaluate image (Section 3 , 4 , 6 , 7 , 8), point cloud (Section 5), video (Section 9 , 10), and time series (Section 11). From the perspective of fields, industrial inspection (Section 3 , 4 , 6 , 5), medical diagnosis (Section 7 , 8), and video surveillance (Section 9 , 10) are evaluated. To the best of our knowledge, this is the first study to investigate such a wide range of modalities and fields for anomaly detection.\n1.3 Limitations in Anomaly Detection Evaluation Based on GPT-4V # The analysis of this study is subject to certain limitations:\nPredominance of Qualitative Results: The analysis primarily relies on qualitative assessment, lacking quantitative metrics that could offer a more objective evaluation of the model\u0026rsquo;s performance in anomaly detection. 
Incorporating quantitative measures would provide a more robust basis for assessment.\nScope of Evaluated Cases: The evaluation is confined to a limited scope of cases or scenarios. This narrow focus may not fully capture the diverse challenges encountered in real-world anomaly detection tasks. Expanding the range of evaluated cases would yield a more comprehensive understanding of the model\u0026rsquo;s capabilities. Single Interaction Evaluation: The study mainly concentrates on a single-round conversation. In contrast, multi-round conversations, as observed in the in-context learning capacity of GPT-4V [101], can stimulate deeper interaction. The single-round conversation approach restricts the depth of interaction and may constrain the model\u0026rsquo;s comprehension and its effectiveness in responding to anomaly detection tasks. Exploring multi-round interactions could reveal a more nuanced perspective of the model\u0026rsquo;s performance. 2 Observations of GPT-4V on Multi-modal Multi-domain Anomaly Detection # Following a thorough evaluation of GPT-4V\u0026rsquo;s performance across various multi-modality and multi-field anomaly detection tasks, it becomes apparent that GPT-4V possesses robust anomaly detection capabilities. More precisely, GPT-4V consistently excels in addressing the three previously mentioned challenges: comprehending image context, discerning normal standards, and effectively comparing the provided image against these standards. In addition to these fundamental findings, our assessments have yielded valuable insights.\n2.1 GPT-4V can address multi-modality and multi-field anomaly detection tasks in zero/one-shot regime: # Anomaly detection for multi-modality: GPT-4V\u0026rsquo;s ability to handle diverse data modalities is demonstrated by its consistent performance across various domains. 
For instance, it exhibits proficiency in identifying anomalies in images, point clouds, X-rays, etc., underscoring its adaptability to multi-modal tasks. This versatility allows it to transcend the limitations of single-modal anomaly detectors.\nAnomaly detection for multi-field: GPT-4V\u0026rsquo;s performance across multiple fields, including industrial, medical, pedestrian, traffic, and time series anomaly detection, showcases its ability to seamlessly adapt to the distinct characteristics of each domain. Its consistent results affirm its broad applicability and versatility, making it a valuable tool for anomaly detection in a variety of real-world contexts.\nAnomaly detection in zero/one-shot regime: GPT-4V\u0026rsquo;s evaluation in both zero-shot and one-shot settings highlights its adaptability to different inference scenarios. In the absence of reference images, the model effectively relies on language prompts to detect anomalies. However, when provided with normal reference images, its anomaly detection accuracy is further enhanced. This flexibility enables GPT-4V to cater to a wide range of anomaly detection applications, whether with or without prior knowledge.\n2.2 GPT-4V can understand both global and fine-grained semantics for anomaly detection: # GPT-4V\u0026rsquo;s understanding of global semantics: GPT-4V\u0026rsquo;s capacity to comprehend global semantics is demonstrated in its ability to recognize overarching abnormal patterns or behaviors. For example, in traffic anomaly detection, it can discern the distinction between typical traffic flow and irregular events, providing a holistic interpretation of the data. 
This global understanding makes it well-suited for identifying anomalies that deviate from expected norms in a broader context.\nGPT-4V\u0026rsquo;s understanding of fine-grained semantics: GPT-4V\u0026rsquo;s fine-grained anomaly detection capabilities shine in cases where it not only detects anomalies but also precisely localizes them within complex data. For instance, in industrial image anomaly detection, it can pinpoint intricate details like slightly tilted wicks on candles or minor scratches or residues around the top rim of the bottle. This fine-grained understanding enhances its ability to detect subtle anomalies within complex data, contributing to its overall effectiveness.\n2.3 GPT-4V can automatically reason for anomaly detection: # The model\u0026rsquo;s strength in automatically reasoning the given complex normal standards and generating explanations for detected anomalies is a valuable feature. In logical anomaly detection, for example, GPT-4V excels at dissecting complex rules and providing detailed analyses of why an image deviates from the expected standards. This inherent reasoning ability adds a layer of interpretability to its anomaly detection results, making it a valuable tool for understanding and addressing irregularities in various domains.\n2.4 GPT-4V can be enhanced with increasing prompts: # The results of the evaluation highlight the positive impact of additional prompts on GPT-4V\u0026rsquo;s anomaly detection performance. The model\u0026rsquo;s response to class information, human expertise, and reference images suggests that providing it with more context and information significantly improves its ability to detect anomalies accurately. 
This feature allows users to fine-tune and enhance the model\u0026rsquo;s performance by providing relevant and supplementary information.\n2.5 GPT-4V can be constrained in real-world application but still promising: # From the cases we tested, we find that several gaps remain before GPT-4V can be applied to real-world anomaly detection. For example, GPT-4V may face challenges in handling highly complex scenarios in industrial applications. Ethical constraints in the medical field also make it conservative and hesitant to give confident answers. Nevertheless, we believe it remains promising in a wide range of anomaly detection tasks. To address these challenges effectively, further enhancements, specialized fine-tuning, or complementary techniques may be required. GPT-4V\u0026rsquo;s potential for anomaly detection is evident, and ongoing research may continue to unlock its capabilities in even more complex scenarios.\n3 Industrial Image Anomaly Detection # 3.1 Task Introduction # Industrial image anomaly detection is a critical component of manufacturing processes aimed at upholding product quality [6 , 98 , 14]. Following the establishment of the MVTec AD dataset [6], various methods [15 , 45 , 22 , 17 , 15 , 46 , 92] have thrived in this field. These methods focus on determining whether testing images contain anomalies, typically represented as local structural variants. Early methods [91 , 95 , 102 , 13 , 54 , 94] concentrated on developing specific models for given categories, while recent approaches [45 , 22 , 17 , 112] target a more general but challenging solution, i.e., developing a unified model for arbitrary product categories, which usually operates in the few-shot [99 , 40] or even zero-shot [45 , 17 , 22] regime. As highlighted in [101], GPT-4V, equipped with extensive world knowledge, presents a promising solution for arbitrary category inspection.\n3.2 Testing philosophy # Different prompts [101 , 56] could lead to different responses from GPT-4V. 
We aim to investigate the influence of different information on prompting GPT-4V for industrial anomaly detection. Following the previously discussed problems, this study further develops three prompts: a) class information: the names of the products to be inspected, such as \u0026ldquo;bottle\u0026rdquo; and \u0026ldquo;candle\u0026rdquo;; b) human expertise: a natural-language description of the normal appearance and potential abnormal states, e.g., \u0026ldquo;Normally, the image given should show a clean and well-structured printed circuit board (PCB) with clear traces, soldered components, and distinct labels. It may have defects such as bent pins, cold solder joints, missing components, or smudged labels\u0026rdquo;; c) reference image: a normal reference image that gives GPT-4V a better understanding of normality. We propose to evaluate GPT-4V in either a zero-shot setting, with only language prompts, or a one-shot setting, with one reference image provided along with the language prompts. For each setting, we test three different variants: a) a naive prompt like \u0026ldquo;Please determine whether the image contains anomalies or defects,\u0026rdquo; b) with class information, and c) with human expertise.\nFigure 2 | Industrial Image Anomaly Detection: Case 1, zero-shot, the Bottle category of MVTec AD [6]. Yellow highlights the given class information and normal and abnormal state descriptions. Green, red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\n3.3 Case Demonstration # Fig. 2 , 3 , 4 , 5 , 6 , 7 qualitatively demonstrate the effectiveness of GPT-4V for industrial image anomaly detection. Even with a simple language prompt, GPT-4V effectively identifies anomalies in examined bottle and candle images, showcasing its capacity and versatility. Moreover, GPT-4V excels not only in detecting desired anomalies but also in identifying fine-grained structural anomalies. As evident in Fig. 
4, GPT-4V noticed a slightly tilted wick on the bottom left candle, demonstrating its nuanced understanding. In complex cases like Fig. 6, GPT-4V recognizes the PCB in images and provides in-depth reasoning about anomalies, such as examining the proper seating of the ultrasonic sensor. However, GPT-4V overlooks the bent pin, resulting in an incorrect result. Nevertheless, GPT-4V showcases a strong grasp of image context and category-specific anomaly understanding.\n4 Industrial Image Anomaly Localization # 4.1 Task Introduction # Industrial image anomaly localization entails a more intricate process than mere image anomaly detection [76 , 93 , 12 , 13 , 65]. It goes beyond recognizing the abnormality within an image and extends to precisely identifying the location of these anomalies. While GPT-4V has exhibited localization capabilities in various domains [101 , 97 , 100], its potential for image anomaly localization warrants further exploration.\nRegrettably, GPT-4V does not currently have the capability to directly produce prediction masks. Some methods have attempted to leverage GPT-4V by prompting it to provide bounding boxes [101 , 97]. However, this approach appears to be imprecise and poses challenges for GPT-4V. In contrast, the approach presented by SoM [100] involves utilizing SAM [50] to generate visual prompts [81 , 50], which are presented in numbered markers. This visual prompting technique shifts the localization task from a pixel-level mask prediction task to a mask-level classification task, effectively reducing the associated complexities and increasing localization precision.\n4.2 Testing philosophy # To harness the fine-grained localization capability of GPT-4V, we adopt the approach outlined in SoM [100]. This involves generating a set of image-mask pairs for prompting GPT-4V. In addition to the image-mask pairs, we employ a straightforward language prompt that instructs the model, as follows: \u0026ldquo;The first image needs to be inspected. 
The second image contains its corresponding marks. Please determine whether the image contains anomalies or defects. If yes, give a specific reason\u0026rdquo;.\n4.3 Case Demonstration # Fig. 8 , 9, and 10 provide a visual representation of GPT-4V\u0026rsquo;s performance in industrial anomaly localization. These illustrations clearly illustrate GPT-4V\u0026rsquo;s ability to accurately identify the second mask in Fig. 8 as a twisted wire and the second mask in Fig. 9 as holes. These results serve as compelling evidence of GPT-4V\u0026rsquo;s proficiency in localizing anomalies when guided by visual prompts.\nIt is important to acknowledge that GPT-4V does exhibit certain limitations when confronted with more complex scenarios, as evidenced in Fig. 10. However, the combination of visual prompting techniques and GPT-4V remains a promising approach for industrial anomaly localization.\n5 Point Cloud Anomaly Detection # 5.1 Task Introduction # Geometrical information, as discussed in references such as PAD [111], Real3D [59], and MVTec-3D [8], holds a crucial role in fields like industrial anomaly detection, especially when dealing with categories lacking textual information. Recently, MVTec 3D [8] and Real3D [59] have recognized the growing need for such information and have introduced a point cloud anomaly detection task. This task focuses on the identification of anomalies\nwithin the provided point clouds [32].\nIt is important to note that the success achieved in industrial image anomaly detection is not fully mirrored in point cloud anomaly detection. This disparity is primarily attributed to the reliance of industrial image anomaly detection on robust pre-trained networks [12 , 75 , 39]. 
Conversely, due to the scarcity of extensive point cloud data, the capabilities of pre-trained networks for point clouds currently fall short, leading to suboptimal performance for some methods [96 , 21 , 9 , 77].\nIn contrast, CPMF [16] proposes a novel approach by transforming point clouds into depth images, thereby opening up the possibility of leveraging image-based foundation models for point cloud anomaly detection. This innovative method has shown the potential to deliver significantly improved results in point cloud anomaly detection.\n5.2 Testing philosophy # To employ GPT-4V in the context of point cloud anomaly detection, we adopt the methodology presented in CPMF [16] to transform point clouds into multi-view depth images. In our evaluation, we adhere to the principles commonly used in industrial image anomaly detection, specifically the zero/one-shot approach, with the inclusion of three distinct variations of language prompts.\n5.3 Case Demonstration # Fig. 11 , 12 , 13 , 14 , 15 , 16 provide a visual representation of the performance of GPT-4V in point cloud anomaly detection. These illustrations serve to qualitatively illustrate the proficiency of GPT-4V in comprehending multi-modality data.\nSpecifically, GPT-4V demonstrates its capability to accurately identify the presence of a small protrusion or bump on the top left part of the torus in the bagel (Fig. 11). Moreover, the introduction of additional information, such as class information and human expertise, enhances the performance of GPT-4V in point cloud anomaly detection, allowing it to effectively detect anomalies in the rope (Fig. 15 and 16).\nHowever, it is noteworthy that GPT-4V may occasionally misidentify artificially introduced elements during the rendering process as anomalies, as observed in Fig. 14. 
It is possible that improvements in rendering quality could further enhance the capacity of GPT-4V in this context.\n6 Logical Anomaly Detection # 6.1 Task Introduction # In addition to structural anomalies, there exists another type of anomaly, named logical anomalies [7]. Logical anomalies generally refer to incorrect combinations of components, commonly encountered in the context of anomaly detection in assemblies. For instance, a screw bag should contain matched screws, nuts, and washers. This necessitates that the model is capable of understanding fine-grained information in images and determining attributes of the components within the image, such as component type, length, color, quantity, and so forth. This places higher demands on the model. Existing logical anomaly detection methods [63 , 103 , 5 , 106] typically relied on solely visual context and have achieved promising detection performance. However, these approaches do not genuinely comprehend the content of images; instead, they rely on global-local correspondences [98] for logical anomaly detection. This does not effectively address logical anomaly detection. In contrast, GPT-4V possesses robust image understanding capabilities, allowing for a better comprehension of image content. By providing predefined normal rules manually, GPT-4V might be capable of determining whether an image adheres to normal rules, thereby enabling a more rational approach to logical anomaly detection.\n6.2 Testing philosophy # To ensure an effective assessment of testing images, it is crucial to provide clear guidelines defining the expected normal state for GPT-4V. This enables GPT-4V to evaluate the conformity of testing images with\nthe established standards, relying on an analysis of image content in relation to these norms. Consequently, our approach involves presenting GPT-4V with both testing images and descriptive language articulating the expected normal standards. 
However, it is worth noting that GPT-4V might encounter difficulties in comprehending the nuances of normal standards when presented with language alone. To enhance its understanding and alignment of normal standards with the context of normal images, we propose the inclusion of a reference image illustrating the desired normal state. Therefore, our experimental design encompasses both zero-shot and one-shot settings to assess the effectiveness of this approach.\n6.3 Case Demonstration # The evaluation results, as depicted in Fig. 17 , 18 , 19 , 20, unequivocally highlight the robust image comprehension and logical reasoning capabilities of GPT-4V. For instance, in Fig. 17, GPT-4V demonstrates its proficiency in interpreting intricate standards, encompassing criteria such as the presence of \u0026ldquo;1. It should contain two oranges, one peach, and some cereal, nuts, and banana slices; 2. The fruit should be on the left side of the lunch box, the cereal on the upper right, and the nuts and banana slices on the lower right of the lunch box\u0026rdquo;. GPT-4V adeptly breaks down this complex task into subcomponents, identifying and localizing the various items before calculating their quantities and positions. Ultimately, GPT-4V accurately concludes that the provided breakfast box does not adhere to the stipulated standards.\nMoreover, visual references play a pivotal role in enhancing GPT-4V\u0026rsquo;s performance. In Fig. 18, without the aid of a visual reference, GPT-4V erroneously classifies the juice bottle as a normal one. However, when presented with a referenced image, GPT-4V effectively comprehends the rule \u0026ldquo;2. To prevent bottle explosions, ensure the juice is filled to about 3cm below the bottle\u0026rsquo;s opening\u0026rdquo; and delivers a correct analysis.\nNonetheless, GPT-4V may encounter challenges in scenarios where its ability to contextualize images is constrained. Notably, GPT-4V fails to detect a broken cable in Fig. 
19 and inaccurately quantifies washers in Fig. 20. The limitations of GPT-4V, particularly in matters of fine-grained details like counting, have been discussed in prior research [101]. Furthermore, it is worth noting that multi-round conversations and specific language prompts can significantly impact GPT-4V\u0026rsquo;s performance in such cases.\n7 Medical Image Anomaly Detection # 7.1 Task Introduction # Anomaly detection, also known as outlier detection, is a pivotal task in the domain of medical imaging, aimed at identifying abnormal patterns that do not conform to expected behavior [31]. These abnormalities or anomalies could be indicative of a wide range of medical conditions or diseases [citation]. The primary goal of anomaly detection is to accurately discern these irregularities from a plethora of medical imaging data, thereby aiding in early diagnosis and effective treatment planning. Current medical anomaly detection methods can be categorized into reconstruction-based methods [23 , 35 , 90], GAN-based methods [61], self-supervised methods [82 , 88 , 87], and pre-training-based methods [75 , 28 , 60 , 62]. Although these methods have achieved great improvements, a unified anomaly detection model across different diseases and modalities remains an unsolved challenge. As highlighted in [71] and [97], GPT-4V, equipped with extensive multi-modal knowledge, shows promise for enhancing the performance of anomaly detection tasks in various medical imaging modalities.\n7.2 Testing philosophy # We aim to investigate the generalization abilities of GPT-4V on medical anomaly detection. Thus, medical images across different diseases and modalities are used, including Head MRI, Head CT, Retinal OCT, Chest X-ray and so on. For the text prompt, we also adopt the previous multi-step prompt to test its zero-shot and one-shot abilities. 
There are generally three types of prompts: a) general medical information: the disease and modality of the medical images, such as \u0026ldquo;Chest X-ray Image\u0026rdquo; or \u0026ldquo;Head CT Image\u0026rdquo;; b) human expertise: based on the general medical information, we further give the possible disease name in the medical image, e.g., \u0026ldquo;The image should be classified as normal or hemorrhage\u0026rdquo;; c) reference image: a normal reference image that gives GPT-4V a better understanding of normality.\nWe propose to evaluate GPT-4V in either a zero-shot setting, with only language prompts, or a one-shot setting, with one reference image provided along with the language prompts. For each setting, we test three different variants: a) a naive prompt like \u0026ldquo;Please determine whether the image contains anomalies\u0026rdquo;, b) with general medical information, and c) with human expertise.\n7.3 Case Demonstration # Fig. 21 , 23 , 25 and 27 show GPT-4V\u0026rsquo;s zero-shot inference ability. GPT-4V is capable of automatically recognizing medical image modalities and anatomical structures, even without general medical information prompts. Its strong image captioning ability enables GPT-4V to describe the spatial and textural anomalies in the image. However, due to ethical restrictions, GPT-4V tends to give conservative answers when it lacks sufficient information. The introduction of both general medical information and human expertise successfully leads GPT-4V to generate more concrete and accurate answers, as shown in Fig. 21 , 23 and 25. However, GPT-4V fails to recognize anomalies in Fig. 27, even with enough information provided. The abnormal area is not obvious in the image, which suggests that GPT-4V places high demands on medical image quality. 
When a visual reference is added, GPT-4V\u0026rsquo;s image captioning ability successfully describes the difference between normal and abnormal images, as shown in Figs. 22, 24, 26, and 28.\n8 Medical Image Anomaly Localization # 8.1 Task Introduction # Following the detection of a medical anomaly, the subsequent critical task is anomaly localization, which entails pinpointing the exact spatial location of the identified anomaly within the medical image [88] [104]. Accurate localization is imperative for clinicians to understand the extent and nature of the pathology, which in turn informs the course of clinical intervention. However, real-world clinical scenarios, such as tumor anomaly localization, are more complex, since either normal or abnormal cases can have multiple types of tumors. Establishing a direct relationship between image pixels and a large number of semantic categories (types of tumors) is difficult for real-world medical image anomaly localization. Several methods, including a self-supervised method [88] and a cluster-based method [104], have been proposed to deal with the medical image anomaly localization task. Inspired by [100], we examine the localization ability of GPT-4V under visual prompts.\n8.2 Testing philosophy # To test GPT-4V\u0026rsquo;s ability on medical image localization, we utilize several disease categories and modalities, including abdominal CT, endoscopy, head MRI, and skin lesion images. Both real diseased areas and manually synthesized anomalies are taken into consideration to test its robustness. The visual prompts proposed by [100] are also used to harness the fine-grained localization abilities of GPT-4V, including a set of image-mask pairs and a corresponding index number for each mask. Thus, the input consists of the raw image together with its augmented version annotated with masks and numbers. 
We also adopt a straightforward text prompt to introduce the relationship between the two input images, as follows: \u0026ldquo;The first image needs to be inspected. The second image contains its corresponding marks. Please determine whether the image contains anomalies or defects. If yes, give a specific reason\u0026rdquo;\n8.3 Case Demonstration # The qualitative results are shown in Figs. 29, 30, 31, and 32. Guided by the visual prompts in the images, GPT-4V tends to focus on and caption the areas around the marks. For easily recognized and located cases, such as Figs. 30, 31, and 32, GPT-4V can clearly tell the difference between the anomalous areas and the background. However, GPT-4V fails in Fig. 29, a synthetic case where the region of interest shares a similar texture and shape with the background. This indicates that the model still needs to improve its detection and localization abilities under adversarial attacks and complex backgrounds.\n9 Pedestrian Anomaly Detection # 9.1 Task Introduction # Pedestrian anomaly detection, a subset of video anomaly detection, is dedicated to recognizing irregular activities within pedestrian interactions captured in video streams. Traditional methodologies [1, 69, 33, 109, 64, 86, 105, 44] primarily rely on rule-based approaches and manually engineered features. Recently, there has been a noticeable shift towards deep learning techniques [38, 24, 66, 74, 73, 53, 42, 43, 41] for pedestrian anomaly detection. The complexity of pedestrian anomaly detection arises from the need to accurately identify abnormal behaviors within the context of diverse and dynamic pedestrian interactions. This is further compounded by the varying environmental conditions in which these interactions take place. To ensure precise analysis, a substantial contextual understanding is essential. 
While existing methods have demonstrated promising performance in pedestrian anomaly detection, GPT-4V, with its advanced contextual comprehension capabilities, has the potential to significantly enhance performance on this task.\n9.2 Testing philosophy # We utilize the GPT-4V model, which currently only accepts visual input in image format, for pedestrian anomaly detection. To prompt the model, we select two images from the video dataset. In addition to the image prompt, we include a simple text prompt asking the model to determine whether the video frames contain anomalies or outlier points and to provide a specific reason if so.\n9.3 Case Demonstration # In Fig. 33, we illustrate a scenario (from the UCF-Crime dataset [85]) where one pedestrian attacks another on the road. The GPT-4V model recognizes the aggressive behavior as an anomaly when compared to typical interactions. Additionally, it suggests caution due to the \u0026ldquo;LiveLeak\u0026rdquo; watermark, implying a need for further analysis with sufficient contextual information before drawing conclusions. The model\u0026rsquo;s adeptness at discerning aggressive behavior, even in the absence of technical anomalies, demonstrates its potential to identify social anomalies within visual data.\n10 Traffic Anomaly Detection # 10.1 Task Introduction # Traffic anomaly detection primarily aims at identifying the start and end of abnormal events, with less emphasis on spatial localization. Various methodologies [38, 70, 24, 67, 68, 35] have been devised to model normalcy and discern regular patterns in video frames. The prevailing challenge for anomaly detection in traffic scenarios is the development of robust algorithms that can effectively differentiate between normal and abnormal vehicles and driving behaviors, thereby ensuring the safety and reliability of autonomous vehicle systems. 
Integrating GPT-4V into traffic anomaly detection promises to refine the precision and speed of current systems. GPT-4V, with its capacity for high-level understanding, is adept at parsing the intricacies of traffic data, thereby sharpening the distinction between normal variations and true anomalies. This precision is critical for developing real-time monitoring systems that deliver accurate alerts while minimizing false positives.\n10.2 Testing philosophy # We employ GPT-4V, which as of now only accepts visual input in image format, for traffic anomaly detection. To engage the model, we select a representative image from the traffic scene, accompanied by a succinct text prompt. This prompt asks the model to determine whether the image frames contain anomalies or outlier points and, if so, to explain the specific reasons for such irregularities.\n10.3 Case Demonstration # As depicted in Figs. 34 and 35, by examining the spatial-temporal dynamics within traffic scenes from a traffic anomaly detection dataset [48], GPT-4V proficiently differentiates between standard traffic flow and anomalous events. Beyond merely identifying outliers in traffic patterns, the model also explains why the scenarios are abnormal. For instance, in Fig. 34, the model effectively explains an abnormal maneuver in which a vehicle collides with the roadside barrier and deviates from typical driving behavior. 
Harnessing its deep comprehension of the underlying patterns and relationships within the traffic data, the model provides interpretable explanations of the factors contributing to the anomaly, offering a nuanced understanding that could be pivotal for enhancing the safety and reliability of autonomous driving systems.\n11 Time Series Anomaly Detection # 11.1 Task Introduction # Time series anomaly detection refers to the task of identifying unusual or abnormal patterns, events, or behaviors in sequential data that deviate significantly from the expected or normal behavior. Time series anomaly detection models can be categorized as supervised or unsupervised algorithms. Supervised methods, such as AutoEncoder [79] and RobustTAD [34], perform well when anomaly labels are available. Unsupervised algorithms are suitable when obtaining anomaly labels is challenging. This has led to the development of new unsupervised methods, including DAGMM [115] and OmniAnomaly [83]. Unsupervised deep learning methods excel in time series anomaly detection, leveraging representation learning and a reconstruction approach to accurately identify anomalies without the need for labeled data [110, 47, 108].\n11.2 Testing philosophy # To exploit GPT-4V for time series anomaly detection, we plot the time series as images and then pass these images to GPT-4V. Specifically, we select two instances [2, 89] along with a simple text prompt asking the model to determine whether the image contains anomalies or outlier points and to provide a specific reason if so.\n11.3 Case Demonstration # As illustrated in Figs. 36 and 37, by examining the temporal dependencies and trends within the time series, GPT-4V adeptly differentiates between normal fluctuations and anomalous behavior. Beyond merely detecting outliers in the time series curves, the model also offers insightful explanations regarding the abnormal nature of the data. For instance, in Fig. 
37, the model effectively elucidates the abnormal peak in the time series. Drawing upon its profound understanding of the underlying patterns and relationships within the data, the model employs interpretability techniques to illuminate the factors contributing to the anomaly.\n12 Prospect # The future evaluation and utilization of GPT-4V for anomaly detection hold significant promise in addressing complex challenges across various domains. As a versatile language model, GPT-4V demonstrates its potential in anomaly detection, and the following prospects aim to refine its capabilities, foster integration, and elevate its performance.\nQuantitative Analysis: Incorporating quantitative metrics, such as Precision, Recall, and F1-score, alongside AUC-ROC and MAP, in future evaluations will provide a more comprehensive understanding of GPT-4V\u0026rsquo;s anomaly detection performance. This quantification will empower a more objective assessment of the model\u0026rsquo;s capabilities and its adaptation to diverse anomaly detection tasks.\nExpanding Evaluation Scope: Expanding the scope to include real-world challenges, such as varying lighting conditions and occlusions in image-based anomaly detection, and different types of anomalies in time-series data, offers a more realistic view of GPT-4V\u0026rsquo;s adaptability and limitations. The inclusion of synthetic and real-world anomalies adds depth to the evaluation process.\nMulti-round Interaction Evaluation: The potential of multi-round conversations for GPT-4V\u0026rsquo;s iterative learning and adaptation to feedback provides a dynamic framework for enhancing its performance in anomaly detection. It is a promising avenue for scenarios where ongoing refinement is crucial, such as cybersecurity.\nIncorporation of Human Feedback: Utilizing human feedback loops presents the opportunity for domain experts to refine GPT-4V\u0026rsquo;s understanding of complex or nuanced anomalies. 
The collaboration between the model and experts promises to address real-world challenges effectively.\nIntegration of Auxiliary Data: Exploring the impact of integrating auxiliary data, such as additional sensor readings or metadata, is instrumental in enhancing GPT-4V\u0026rsquo;s understanding and accuracy in identifying anomalies across various domains. This comprehensive approach aligns with real-world data scenarios.\nComparison with Specialized Models: Comparative evaluations against specialized anomaly detection models are essential to identify the specific strengths and weaknesses of GPT-4V. These assessments will clarify the domains and use cases where GPT-4V\u0026rsquo;s versatility excels or where specialized models remain superior.\nReal-Time Performance Assessment: Evaluating GPT-4V\u0026rsquo;s real-time performance is crucial for applications requiring rapid anomaly detection. This prospect ensures the model\u0026rsquo;s suitability for time-critical or online anomaly detection tasks.\nTransfer Learning Evaluation: Assessing the effectiveness of transfer learning in fine-tuning GPT-4V for specific anomaly detection tasks can pave the way for broader generalization. It enhances the model\u0026rsquo;s adaptability in diverse anomaly detection scenarios.\nHybrid Model Development: The development of hybrid models combining GPT-4V with other machine learning or deep learning approaches offers an innovative approach to address anomaly detection challenges. These hybrids aim to leverage GPT-4V\u0026rsquo;s linguistic capabilities while enhancing its performance in specialized scenarios.\nIn summation, these prospects set the stage for a comprehensive and multifaceted exploration of GPT-4V\u0026rsquo;s anomaly detection potential. 
By combining quantitative metrics, real-world challenges, human feedback, auxiliary data integration, comparative assessments, and real-time capabilities, we can unlock the full scope of GPT-4V\u0026rsquo;s utility in addressing anomalies across diverse fields. The journey towards improved anomaly detection with GPT-4V is one of collaboration, adaptation, and innovation, promising exciting developments in the years to come.\n13 Conclusion # In conclusion, the assessment of GPT-4V\u0026rsquo;s capabilities in anomaly detection signifies a notable advancement in the realm of versatile and adaptable AI models. GPT-4V demonstrates exceptional proficiency in identifying anomalies across diverse modalities and fields, offering both comprehensive and nuanced semantic comprehension. Its ability to deduce anomalies and its responsiveness to an expanding array of prompts underscore its versatility and potential. Nevertheless, like any technology, there remains room for further enhancement, particularly in intricate and subtle scenarios.\nThe opportunities delineated in this evaluation propose promising avenues for future research and development. The inclusion of quantitative metrics, broadening the spectrum of evaluations, embracing human input, and integrating supplementary data all contribute to augmenting the performance of GPT-4V. Comparative assessments against specialized models and the exploration of hybrid models further enrich the landscape of anomaly detection. Real-time assessment and the incorporation of transfer learning hold the promise of addressing time-sensitive situations and generalizing anomaly detection across diverse domains.\nAs we embark on this journey to unlock the full potential of GPT-4V, collaboration, adaptability, and innovation will serve as the foundational pillars of our success. 
The evaluation and utilization of GPT-4V for anomaly detection do not merely signify an exploration of technology but also serve as a testament to the ongoing evolution of AI and its transformative impact on real-world applications. Keeping these prospects in mind, the future of anomaly detection holds significant promise, and GPT-4V stands at the forefront of this captivating evolution.\nFigure 3 | Industrial Image Anomaly Detection: Case 1, one-shot, the Bottle category of MVTec AD [6] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 4 | Industrial Image Anomaly Detection: Case 2, zero-shot, the Candle category of VisA [116] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 5 | Industrial Image Anomaly Detection: Case 2, one-shot, the Candle category of VisA [116] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 6 | Industrial Image Anomaly Detection: Case 3, zero-shot, the PCB2 category of VisA [116] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 7 | Industrial Image Anomaly Detection: Case 3, one-shot, the PCB2 category of VisA [116] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 8 | Industrial Image Anomaly Localization: Case 1, zero-shot, the Bottle category of MVTec AD [6] . 
Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 9 | Industrial Image Anomaly Localization: Case 2, the Hazelnut category of MVTec AD [6] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 10 | Industrial Image Anomaly Localization: Case 3, the Capsule category of VisA [116] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 11 | Point Cloud Anomaly Detection: Case 1, zero-shot, the Bagel category of MVTec 3D [8] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 12 | Point Cloud Anomaly Detection: Case 1, one-shot, the Bagel category of MVTec 3D [8] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 13 | Point Cloud Anomaly Detection: Case 2, zero-shot, the Peach category of MVTec 3D [8] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 14 | Point Cloud Anomaly Detection: Case 2, one-shot, the Peach category of MVTec 3D [8] . Yellow highlights the given class information and normal and abnormal state descriptions. 
Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 15 | Point Cloud Anomaly Detection: Case 3, zero-shot, the Rope category of MVTec 3D [8] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 16 | Point Cloud Anomaly Detection: Case 3, one-shot, the Rope category of MVTec 3D [8] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 17 | Logical Anomaly Detection: Case 1, the Breakfast Box category of MVTec LOCO [7] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 18 | Logical Anomaly Detection: Case 2, the Juice Bottle category of MVTec LOCO [7] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 19 | Logical Anomaly Detection: Case 3, the Splicing Connector category of MVTec LOCO [7] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 20 | Logical Anomaly Detection: Case 4, the Screw Bag category of MVTec LOCO [7] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 21 | Medical Anomaly Detection: Case 1, the Chest X-ray [49] . 
Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 22 | Medical Anomaly Detection: Case 1, the Chest X-ray [49] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 23 | Medical Anomaly Detection: Case 2, the Retinal OCT [49] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 24 | Medical Anomaly Detection: Case 2, the Retinal OCT [49] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 25 | Medical Anomaly Detection: Case 3, the Head CT [51] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 26 | Medical Anomaly Detection: Case 3, the Head CT [51] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 27 | Medical Anomaly Detection: Case 4, Head MRI Image [18] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 28 | Medical Anomaly Detection: Case 4, Head MRI Image [18] . Yellow highlights the given class information and normal and abnormal state descriptions. 
Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 29 | Medical Anomaly Localization: Case 1, Abdominal CT Localization [114] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 30 | Medical Anomaly Localization: Case 2, Head MRI Localization [114] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 31 | Medical Anomaly Localization: Case 3, Skin Lesion Localization [26] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 32 | Medical Anomaly Localization: Case 4, Endoscopy Localization [11] . Yellow highlights the given class information and normal and abnormal state descriptions. Green , red, and blue highlight the expected, incorrect, and additional information outputted by GPT-4V.\nFigure 33 | Pedestrian Anomaly Detection: Case 1, from UCF-Crime Dataset [85] . Green highlights the expected information outputted by GPT-4V.\nFigure 34 | Traffic Anomaly Detection: Case 1, from Kaggle Accident Detection [48] . Green highlights the expected information outputted by GPT-4V.\nFigure 35 | Traffic Anomaly Detection: Case 2, from Kaggle Accident Detection [48] . Green highlights the expected information outputted by GPT-4V.\nFigure 36 | Time Series Anomaly Detection: Case 1, from Outlier Detection Dataset [89] . Green highlights the expected information outputted by GPT-4V.\nFigure 37 | Time Series Anomaly Detection: Case 2, from Catfish Sales Dataset [2] . 
Green highlights the expected information outputted by GPT-4V.\nReferences # [1] Amit Adam, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz. Robust real-time unusual event detection using multiple fixed-location monitors. IEEE Transactions on Pattern Analysis and Machine Intelligence , 30(3):555–560, 2008.\n[2] Neptune AI. Anomaly detection in time series. https://neptune.ai/blog/anomaly-detection-in-time-series, 2023. Accessed: 2023-11-04.\n[3] Samet Akçay, Amir Atapour-Abarghouei, and T. Breckon. Ganomaly: Semi-supervised anomaly detection via adversarial training. In Asian Conference on Computer Vision, 2018.\n[4] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.\n[5] Kilian Batzner, Lars Heckler, and Rebecca König. Efficientad: Accurate visual anomaly detection at millisecond-level latencies. arXiv preprint arXiv:2303.14535, 2023.\n[6] Paul Bergmann, Kilian Batzner, Michael Fauser, David Sattlegger, and Carsten Steger. The MVTec anomaly detection dataset: A comprehensive real-world dataset for unsupervised anomaly detection. International Journal of Computer Vision, 129(4):1038–1059, 2021.\n[7] Paul Bergmann, Kilian Batzner, Michael Fauser, David Sattlegger, and Carsten Steger. Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization. International Journal of Computer Vision, 130(4):947–969, 2022.\n[8] Paul Bergmann, Xin Jin, David Sattlegger, and Carsten Steger. The MVTec 3d-AD dataset for unsupervised 3d anomaly detection and localization. In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pages 202–213, 2022.\n[9] Paul Bergmann and David Sattlegger. Anomaly detection in 3d point clouds using deep geometric descriptors. 
arXiv preprint arXiv:2202.11660, 2022.\n[10] Ane Blázquez-García, Angel Conde, Usue Mori, and José Antonio Lozano. A review on outlier/anomaly detection in time series data. ACM Computing Surveys, 54:1–33, 2020.\n[11] Hanna Borgli, Vajira Thambawita, Pia H Smedsrud, Steven Hicks, Debesh Jha, Sigrun L Eskeland, Kristin Ranheim Randel, Konstantin Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, et al. Hyperkvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Scientific data, 7(1):283, 2020.\n[12] Yuxuan Cai, Dingkang Liang, Dongliang Luo, Xinwei He, Xin Yang, and Xiang Bai. A discrepancy aware framework for robust anomaly detection. IEEE Transactions on Industrial Informatics, pages 1–10, 2023.\n[13] Yunkang Cao, Yanan Song, Xiaohao Xu, Shuya Li, Yuhao Yu, Yifeng Zhang, and Weiming Shen. Semi-supervised knowledge distillation for tiny defect detection. In 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pages 1010–1015. IEEE, 2022.\n[14] Yunkang Cao, Qian Wan, Weiming Shen, and Liang Gao. Informative knowledge distillation for image anomaly segmentation. Knowledge-Based Systems, 248:108846, 2022.\n[15] Yunkang Cao, Xiaohao Xu, Zhaoge Liu, and Weiming Shen. Collaborative discrepancy optimization for reliable image anomaly localization. IEEE Transactions on Industrial Informatics, pages 1–10, 2023.\n[16] Yunkang Cao, Xiaohao Xu, and Weiming Shen. Complementary pseudo multimodal feature for point cloud anomaly detection. arXiv preprint arXiv:2303.13194, 2023.\n[17] Yunkang Cao, Xiaohao Xu, Chen Sun, Yuqi Cheng, Zongwei Du, Liang Gao, and Weiming Shen. Segment any anomaly without training via hybrid prompt regularization. arXiv preprint arXiv:2305.10724, 2023.\n[18] Navoneel Chakrabarty. Brain mri images for brain tumor detection, 2019.\n[19] Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detection: A survey. 
arXiv preprint arXiv:1901.03407, 2019.\n[20] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3), jul 2009.\n[21] Rui Chen, Guoyang Xie, Jiaqi Liu, Jinbao Wang, Ziqi Luo, Jinfan Wang, and Feng Zheng. Easynet: An easy network for 3d industrial anomaly detection. Proceedings of the 31st ACM International Conference on Multimedia, 2023.\n[22] Xuhai Chen, Yue Han, and Jiangning Zhang. A zero-/few-shot anomaly classification and segmentation method for CVPR 2023 VAND workshop challenge tracks 1\u0026amp;2: 1st place on zero-shot AD and 4th place on few-shot AD. arXiv preprint arXiv:2305.17382, 2023.\n[23] Xiaoran Chen, Suhang You, Kerem Can Tezcan, and Ender Konukoglu. Unsupervised lesion detection via image restoration with a normative prior. Medical image analysis, 64:101713, 2020.\n[24] Yong Shean Chong and Yong Haur Tay. Abnormal event detection in videos using spatiotemporal autoencoder. In International symposium on neural networks, pages 189–196. Springer, 2017.\n[25] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.\n[26] Noel CF Codella, David Gutman, M Emre Celebi, Brian Helba, Michael A Marchetti, Stephen W Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). In 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pages 168–172. IEEE, 2018.\n[27] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. 
arXiv preprint arXiv:2305.06500, 2023.\n[28] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. Padim: a patch distribution modeling framework for anomaly detection and localization. In International Conference on Pattern Recognition, pages 475–489. Springer, 2021.\n[29] Jan Diers and Christian Pigorsch. A survey of methods for automated quality control based on images. International Journal of Computer Vision, 2023.\n[30] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017.\n[31] Tharindu Fernando, Harshala Gammulle, Simon Denman, Sridha Sridharan, and Clinton Fookes. Deep learning for medical anomaly detection–a survey. ACM Computing Surveys (CSUR), 54(7):1–37, 2021.\n[32] Alberto Floris, Luca Frittoli, Diego Carrera, and Giacomo Boracchi. Composite layers for deep anomaly detection on 3d point clouds. arXiv preprint arXiv:2209.11796, 2022.\n[33] Harrou Fouzi and Ying Sun. Enhanced anomaly detection via pls regression models and information entropy theory. In IEEE Symposium Series on Computational Intelligence (SSCI), pages 383–388, 2015.\n[34] Jingkun Gao, Xiaomin Song, Qingsong Wen, Pichao Wang, Liang Sun, and Huan Xu. Robusttad: Robust time series anomaly detection via decomposition and convolutional neural networks. arXiv preprint arXiv:2002.09545, 2020.\n[35] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1705–1714, 2019.\n[36] Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 
Multimodal-gpt: A vision and language model for dialogue with humans, 2023.\n[37] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anomalygpt: Detecting industrial anomalies using large vision-language models. arXiv preprint arXiv:2308.15366, 2023.\n[38] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016.\n[39] Lars Heckler, Rebecca König, and Paul Bergmann. Exploring the importance of pretrained feature extractors for unsupervised anomaly detection and localization. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2917–2926, 2023.\n[40] Chaoqin Huang, Haoyan Guan, Aofan Jiang, Ya Zhang, Michael Spratling, and Yan-Feng Wang. Registration based few-shot anomaly detection. In European Conference on Computer Vision, pages 303–319. Springer, 2022.\n[41] Chao Huang, Chengliang Liu, Jie Wen, Lian Wu, Yong Xu, Qiuping Jiang, and Yaowei Wang. Weakly supervised video anomaly detection via self-guided temporal discriminative transformer. IEEE Transactions on Cybernetics, pages 1–14, 2022.\n[42] Chao Huang, Jie Wen, Yong Xu, Qiuping Jiang, Jian Yang, Yaowei Wang, and David Zhang. Self-supervised attentive generative adversarial networks for video anomaly detection. IEEE Transactions on Neural Networks and Learning Systems, pages 1–15, 2022.\n[43] Chao Huang, Zehua Yang, Jie Wen, Yong Xu, Qiuping Jiang, Jian Yang, and Yaowei Wang. Self-supervision-augmented deep autoencoder for unsupervised visual anomaly detection. IEEE Transactions on Cybernetics, 52(12):13834–13847, 2022-12.\n[44] Tsuyoshi Idé, Ankush Khandelwal, and Jayant Kalagnanam. Sparse gaussian markov random field mixtures for anomaly detection. 
In IEEE 16th International Conference on Data Mining (ICDM), pages 955–960, 2016.\n[45] Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. Winclip: Zero-/few-shot anomaly classification and segmentation. arXiv preprint arXiv:2303.14814 , 2023.\n[46] Yuxin Jiang, Yunkang Cao, and Weiming Shen. A masked reverse knowledge distillation method incorporating global and local information for image anomaly detection. Knowledge-Based Systems , 280:110982, 2023.\n[47] Yang Jiao, Kai Yang, Dongjing Song, and Dacheng Tao. Timeautoad: Autonomous anomaly detection with self-supervised contrastive loss for multivariate time series. IEEE Transactions on Network Science and Engineering, 9(3):1604–1619, 2022.\n[48] C. Kay. Accident detection from cctv footage. https://www.kaggle.com/datasets/ckay16/accident-detection-fromcctv-footage, 2022. Kaggle dataset.\n[49] Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiying Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. cell, 172(5):1122–1131, 2018.\n[50] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages 4015–4026, October 2023.\n[51] Felipe Campos Kitamura. Head ct - hemorrhage, 2018.\n[52] Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023.\n[53] Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. Cutpaste: Self-supervised learning for anomaly detection and localization. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9664–9674, 2021.\n[54] Yufei Liang, Jiangning Zhang, Shiwei Zhao, Runze Wu, Yong Liu, and Shuwen Pan. Omni-frequency channel-selection representations for unsupervised anomaly detection. arXiv preprint arXiv:2203.00259 , 2022.\n[55] Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023.\n[56] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.\n[57] Fuxiao Liu, Yaser Yacoob, and Abhinav Shrivastava. Covid-vts: Fact extraction and verification on short video platforms. arXiv preprint arXiv:2302.07919, 2023.\n[58] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.\n[59] Jiaqi Liu, Guoyang Xie, Rui Chen, Xinpeng Li, Jinbao Wang, Yong Liu, Chengjie Wang, and Feng Zheng. Real3d-ad: A dataset of point cloud anomaly detection. arXiv preprint arXiv:2309.13226, 2023.\n[60] Mingxuan Liu, Yunrui Jiao, and Hong Chen. Skip-st: Anomaly detection for medical images using student-teacher network with skip connections. In 2023 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5, 2023.\n[61] Mingxuan Liu, Yunrui Jiao, Hongyu Gu, Jingqiao Lu, and Hong Chen. Data augmentation using image-to-image translation for tongue coating thickness classification with imbalanced data. In 2022 IEEE Biomedical Circuits and Systems Conference (BioCAS), pages 90–94, 2022.\n[62] Mingxuan Liu, Yunrui Jiao, Jingqiao Lu, and Hong Chen. 
Anomaly detection for medical images using teacher-student model with skip connections and multi-scale anomaly consistency. TechRxiv, 2023.\n[63] Tongkun Liu, Bing Li, Xiao Du, Bingke Jiang, Xiao Jin, Liuyi Jin, and Zhu Zhao. Component-aware anomaly detection framework for adjustable and logical industrial visual inspection. arXiv preprint arXiv:2305.08509, 2023.\n[64] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In IEEE International Conference on Computer Vision, pages 2720–2727, 2013.\n[65] Ruiying Lu, YuJie Wu, Long Tian, Dongsheng Wang, Bo Chen, Xiyang Liu, and Ruimin Hu. Hierarchical vector quantized transformer for multi-class unsupervised anomaly detection. arXiv preprint arXiv:2310.14228, 2023.\n[66] Weixin Luo, Wen Liu, and Shenghua Gao. Remembering history with convolutional lstm for anomaly detection. In IEEE International Conference on Multimedia and Expo (ICME), pages 439–444, 2017.\n[67] Weixin Luo, Wen Liu, and Shenghua Gao. Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International conference on multimedia and expo (ICME), pages 439–444. IEEE, 2017.\n[68] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE International Conference on Computer Vision , pages 341–349, 2017.\n[69] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1975–1981, 2010.\n[70] Jefferson Ryan Medel and Andreas Savakis. Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390, 2016.\n[71] OpenAI. Gpt-4v(ision) system card. 2023.\n[72] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton van den Hengel. Deep learning for anomaly detection. 
ACM Computing Surveys, 54:1 – 38, 2020.\n[73] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 14372–14381, 2020.\n[74] Mahdyar Ravanbakhsh, Moin Nabi, Enver Sangineto, Lucio Marcenaro, Carlo Regazzoni, and Nicu Sebe. Abnormal event detection in videos using generative adversarial nets. In IEEE International Conference on Image Processing (ICIP), pages 1577–1581, 2017.\n[75] Tal Reiss, Niv Cohen, Liron Bergman, and Yedid Hoshen. Panda: Adapting pretrained features for anomaly detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2806–2814, 2021.\n[76] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14318–14328, 2022.\n[77] Marco Rudolph, Tom Wehrbein, Bodo Rosenhahn, and Bastian Wandt. Asymmetric student-teacher networks for industrial anomaly detection. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2591–2601, 2022.\n[78] Lukas Ruff, Jacob R. Kauffmann, Robert A. Vandermeulen, Gregoire Montavon, Wojciech Samek, Marius Kloft, Thomas G. Dietterich, and Klaus-Robert Muller. A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE, 109(5):756–795, 2021.\n[79] Mayu Sakurada and Takehisa Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, MLSDA'14, page 4–11, New York, NY, USA, 2014. Association for Computing Machinery.\n[80] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 
Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.\n[81] Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11987–11997, October 2023.\n[82] Kihyuk Sohn, Chun-Liang Li, Jinsung Yoon, Minho Jin, and Tomas Pfister. Learning and evaluating representations for deep one-class classification. In International Conference on Learning Representations , 2020.\n[83] Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery \u0026amp; data mining, pages 2828–2837, 2019.\n[84] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.\n[85] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.\n[86] Hanlin Tan, Yongping Zhai, Yu Liu, and Maojun Zhang. Fast anomaly detection in traffic surveillance video based on robust sparse optical flow. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 1976–1980, 2016.\n[87] Yu Tian, Fengbei Liu, Guansong Pang, Yuanhong Chen, Yuyuan Liu, Johan W Verjans, Rajvinder Singh, and Gustavo Carneiro. Self-supervised pseudo multi-class pre-training for unsupervised anomaly detection and segmentation in medical images. 
Medical Image Analysis, 90:102930, 2023.\n[88] Yu Tian, Guansong Pang, Fengbei Liu, Yuanhong Chen, Seon Ho Shin, Johan W Verjans, Rajvinder Singh, and Gustavo Carneiro. Constrained contrastive distribution learning for unsupervised anomaly detection and localisation in medical images. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24, pages 128–140. Springer, 2021.\n[89] Stack Exchange User. Simple outlier detection for time series. https://stats.stackexchange.com/questions/ 427327/simple-outlier-detection-for-time-series, 2021. Accessed: 2023-11-04.\n[90] Shashanka Venkataramanan, Kuan-Chuan Peng, Rajat Vikram Singh, and Abhijit Mahalanobis. Attention guided anomaly localization in images. In European Conference on Computer Vision, pages 485–503. Springer, 2020.\n[91] Qian Wan, Yunkang Cao, Liang Gao, Weiming Shen, and Xinyu Li. Position encoding enhanced feature mapping for image anomaly detection. In 2022 IEEE 18th International Conference on Automation Science and Engineering (CASE), pages 876–881. IEEE, 2022-08-20.\n[92] Qian Wan, Liang Gao, and Xinyu Li. Logit inducing with abnormality capturing for semi-supervised image anomaly detection. IEEE Transactions on Instrumentation and Measurement, 71:1–12, 2022.\n[93] Qian Wan, Liang Gao, Xinyu Li, and Long Wen. Industrial image anomaly localization based on gaussian clustering of pretrained feature. IEEE Transactions on Industrial Electronics, 69(6):6182–6192.\n[94] Qian Wan, Liang Gao, Xinyu Li, and Long Wen. Unsupervised image anomaly detection and segmentation based on pretrained feature mapping. 19(3):2330–2339, 2023-03.\n[95] Guodong Wang, Shumin Han, Errui Ding, and Di Huang. Student-teacher feature pyramid matching for anomaly detection. In British Machine Vision Conference, 2021.\n[96] Yue Wang, Jinlong Peng, Jiangning Zhang, Ran Yi, Yabiao Wang, and Chengjie Wang. 
Multimodal industrial anomaly detection via hybrid fusion. arXiv preprint arXiv:2303.00601, 2023.\n[97] Chaoyi Wu, Jiayu Lei, Qiaoyu Zheng, Weike Zhao, Weixiong Lin, Xiaoman Zhang, Xiao Zhou, Ziheng Zhao, Ya Zhang, Yanfeng Wang, and Weidi Xie. Can GPT-4v(ision) serve medical applications? case studies on GPT-4v for multimodal medical diagnosis. arXiv preprint arXiv:2310.09909, 2023.\n[98] Guoyang Xie, Jinbao Wang, Jiaqi Liu, Jiayi Lyu, Yong Liu, Chengjie Wang, Feng Zheng, and Yaochu Jin. IM-IAD: Industrial image anomaly detection benchmark in manufacturing. arXiv preprint arXiv:2301.13359, 2023.\n[99] Guoyang Xie, Jingbao Wang, Jiaqi Liu, Feng Zheng, and Yaochu Jin. Pushing the limits of few-shot anomaly detection in industry vision: Graphcore. In International Conference on Learning Representations, 2023.\n[100] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4v. arXiv preprint arXiv:2310.11441, 2023.\n[101] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 2023.\n[102] Haiming Yao, Wei Luo, Wenyong Yu, Xiaotian Zhang, Zhenfeng Qiang, Donghao Luo, and Hui Shi. Dual-attention transformer and discriminative flow for industrial visual anomaly detection. IEEE Transactions on Automation Science and Engineering, pages 1–15, 2023.\n[103] Haiming Yao, Wenyong Yu, Wei Luo, Zhenfeng Qiang, Donghao Luo, and Xiaotian Zhang. Learning global-local correspondence with semantic bottleneck for logical anomaly detection. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2023.\n[104] Mingze Yuan, Yingda Xia, Hexin Dong, Zifan Chen, Jiawen Yao, Mingyan Qiu, Ke Yan, Xiaoli Yin, Yu Shi, Xin Chen, et al. 
Devil is in the queries: Advancing mask transformers for real-world medical image segmentation and out-of-distribution localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23879–23889, 2023.\n[105] Andrei Zaharescu and Richard Wildes. Anomalous behaviour detection using spatiotemporal oriented energies, subset inclusion histogram comparison and event-driven processing. In European Conference on Computer Vision, pages 563–576. Springer, 2010.\n[106] J. Zhang, Masanori Suganuma, and Takayuki Okatani. Contextual affinity distillation for image anomaly detection. arXiv preprint arXiv:2307.03101, 2023.\n[107] Jianpeng Zhang, Yutong Xie, Yi Li, Chunhua Shen, and Yong Xia. Covid-19 screening on chest x-ray images using deep learning based anomaly detection. arXiv preprint arXiv:2003.12338, 2020.\n[108] Yuxin Zhang, Jindong Wang, Yiqiang Chen, Han Yu, and Tao Qin. Adaptive memory networks with self-supervised learning for unsupervised anomaly detection. IEEE Transactions on Knowledge and Data Engineering, 2022.\n[109] Bin Zhao, Li Fei-Fei, and Eric P. Xing. Online detection of unusual events in videos via dynamic sparse coding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3313–3320, 2011.\n[110] Hang Zhao, Yujing Wang, Juanyong Duan, Congrui Huang, Defu Cao, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. Multivariate time-series anomaly detection via graph attention network. In 2020 IEEE International Conference on Data Mining (ICDM), pages 841–850. IEEE, 2020.\n[111] Qiang Zhou, Weize Li, Lihan Jiang, Guoliang Wang, Guyue Zhou, Shanghang Zhang, and Hao Zhao. Pad: A dataset and benchmark for pose-agnostic anomaly detection. arXiv preprint arXiv:2310.07716, 2023.\n[112] Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen. Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection. 
arXiv preprint arXiv:2310.18961, 2023.\n[113] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 , 2023.\n[114] David Zimmerer, Peter M. Full, Fabian Isensee, Paul Jäger, Tim Adler, Jens Petersen, Gregor Köhler, Tobias Ross, Annika Reinke, Antanas Kascenas, Bjørn Sand Jensen, Alison Q. O\u0026rsquo;Neil, Jeremy Tan, Benjamin Hou, James Batten, Huaqi Qiu, Bernhard Kainz, Nina Shvetsova, Irina Fedulova, Dmitry V. Dylov, Baolun Yu, Jianyang Zhai, Jingtao Hu, Runxuan Si, Sihang Zhou, Siqi Wang, Xinyang Li, Xuerun Chen, Yang Zhao, Sergio Naval Marimont, Giacomo Tarroni, Victor Saase, Lena Maier-Hein, and Klaus Maier-Hein. Mood 2020: A public benchmark for out-of-distribution detection and localization on medical images. IEEE Transactions on Medical Imaging, 41(10):2728–2738, 2022.\n[115] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International conference on learning representations, 2018.\n[116] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference selfsupervised pre-training for anomaly detection and segmentation. In European Conference on Computer Vision, pages 392–408. Springer, 2022.\n","date":"31 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/towards-generic-anomaly-detection-and-understanding/","section":"Papers","summary":"This study explores the use of GPT-4V, a large visual-linguistic model, for generic anomaly detection across multiple modalities and domains, demonstrating its ability to understand global and fine-grained semantics, reason automatically, and improve with prompts. 
It evaluates GPT-4V on diverse tasks including industrial, medical, logical, video, 3D, and time series anomaly detection, discussing its promising performance and future directions for enhancement, such as quantitative metrics, expanded benchmarks, multi-round interactions, human feedback, and real-time application.","title":"Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead","type":"survey"},{"content":"","date":"31 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiaonan-huang/","section":"Authors","summary":"","title":"Xiaonan Huang","type":"authors"},{"content":"","date":"5 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/type/benchmark/","section":"Type","summary":"","title":"Benchmark","type":"type"},{"content":"","date":"5 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jane-smith/","section":"Authors","summary":"","title":"Jane Smith","type":"authors"},{"content":"","date":"5 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/john-doe/","section":"Authors","summary":"","title":"John Doe","type":"authors"},{"content":" SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models # Xinyi Zhao 1∗ , Congjing Zhang 1∗ , Pei Guo 2 , Wei Li 2 , Lin Chen 2† , Chaoyue Zhao 1 , Shuai Huang 1 1 University of Washington 2 Wyze Labs, Inc.\n{xyzhao24, congjing}@uw.edu, {pguo, wei.li, lchen}@wyze.com, {cyzhao, shuaih}@uw.edu\nAbstract # Video anomaly detection (VAD) is essential for enhancing safety and security by identifying unusual events across different environments. Existing VAD benchmarks, however, are primarily designed for general-purpose scenarios, neglecting the specific characteristics of smart home applications. 
To bridge this gap, we introduce SmartHome-Bench, the first comprehensive benchmark specially designed for evaluating VAD in smart home scenarios, focusing on the capabilities of multi-modal large language models (MLLMs). Our newly proposed benchmark consists of 1,203 videos recorded by smart home cameras, organized according to a novel anomaly taxonomy that includes seven categories, such as Wildlife, Senior Care, and Baby Monitoring. Each video is meticulously annotated with anomaly tags, detailed descriptions, and reasoning. We further investigate adaptation methods for MLLMs in VAD, assessing state-of-the-art closed-source and open-source models with various prompting techniques. Results reveal significant limitations in current models\u0026rsquo; ability to detect video anomalies accurately. To address these limitations, we introduce the Taxonomy-Driven Reflective LLM Chain (TRLC), a new LLM chaining framework that achieves a notable 11.62% improvement in detection accuracy. The benchmark dataset and code are publicly available at https://github.com/Xinyi-0724/SmartHome-Bench-LLM.\n1. Introduction # Video anomaly detection (VAD) identifies unexpected events to monitor and mitigate risks, thus improving security across diverse public spaces, including campuses, pedestrian zones, and crowded scenes [11, 12, 15, 37, 39]. A range of supervised, weakly-supervised, one-class classification, and unsupervised methods has been proposed to generate anomaly scores for videos [20, 29, 43, 46, 54].\n∗ Equal contribution. Work done during the authors\u0026rsquo; internship at Wyze. † Corresponding Author.\nFigure 1. (a) Statistics for event categories and anomaly tags in the SmartHome-Bench dataset. (b) Overall anomaly detection accuracy of various adaptation methods across seven event categories, using Gemini-1.5-pro.\nHowever, most of these methods cannot provide descriptive rationales to support their predictions. 
Offering clear rationales can help users understand which behaviors or events are flagged as anomalies and why, fostering trust in the system\u0026rsquo;s assessments. Multi-modal large language models (MLLMs), with their substantial model size and capability to learn from extensive training data [2, 9, 17, 18, 41], demonstrate exceptional performance in multimodal tasks. Additionally, their generative nature enables them to make anomaly predictions and generate rationales, improving the transparency and trustworthiness of VAD [32, 35].\nResearchers have assessed MLLMs for VAD in various domains [7, 28, 50, 56]. For example, LAVAD [55] focused on detecting crimes and violent behaviors using the UCF-Crime [39] and XD-Violence [47] datasets, while AnomalyRuler [53] focused on pedestrian anomalies related to biking or jumping using the ShanghaiTech [25], UCSD Ped2 [21], and CUHK Avenue [27] datasets. However, these studies focus on public spaces, overlooking anomalies within private environments like smart home scenarios. Unlike the goals of VAD in public environments, VAD in smart homes centers on more personal concerns, such as minimizing property damage, protecting vulnerable residents (e.g., young children and elderly family members), and monitoring pets and wildlife [3, 38, 58]. While anomalies in smart homes may overlap with incidents in public spaces, such as crimes, they also involve many unique events rarely seen in public, like a baby climbing out of a crib or a bear entering a backyard. It remains unclear whether existing methods can effectively handle VAD in smart home scenarios. This study aims to fill the gap by evaluating the feasibility of MLLMs for VAD in smart home scenarios.\nIn particular, we identify two major research gaps: (1) the absence of a dedicated benchmark for VAD in smart home scenarios, and (2) the under-exploration of adaptation strategies for MLLMs in VAD. 
To address the first gap, we propose SmartHome-Bench, a benchmark dataset of 1,203 videos featuring distinct anomaly events, such as wildlife encounters, senior care incidents, and baby monitoring issues, all collected from smart home cameras. Each video is manually annotated with anomaly tags, detailed descriptions, and reasoning, positioning SmartHome-Bench as an ideal instructional dataset for advancing MLLM research and development in VAD. Dataset statistics are provided in Figure 1a.\nTo address the second gap, we conduct experiments focused on two key aspects: adaptation methods and base MLLMs. We implement a diverse set of adaptation techniques for MLLMs, including standard prompting (zero-shot, chain-of-thought, and few-shot), contextual strategies (in-context learning), and our proposed Taxonomy-Driven Reflective LLM Chain (TRLC). These adaptations are applied across both state-of-the-art open-source and proprietary MLLMs. By evaluating these off-the-shelf models, we aim to harness their instruction-following capabilities, assessing both their anomaly detection performance and the quality of model-generated descriptions and rationales.\nOur findings indicate that current MLLMs often struggle to deliver satisfactory performance using basic prompting alone. In contrast, the TRLC framework, which integrates taxonomy-driven rules and self-reflection modules into MLLM chains, significantly enhances MLLM capabilities for VAD in smart home scenarios. This method achieves a remarkable 11.62% improvement in anomaly detection accuracy over zero-shot prompting and outperforms all standalone prompting approaches across five out of seven event categories, as shown in Figure 1b.\nIn summary, our contributions are threefold:\nWe introduce SmartHome-Bench, the first benchmark for VAD in smart home scenarios, featuring a dataset of 1,203 videos annotated across seven event categories. 
We evaluate both closed-source and open-source MLLMs using various adaptation methods, offering insights for optimizing model performance and prompt design.\nWe propose the TRLC, a novel LLM chaining framework that improves overall VAD accuracy by 11.62% compared to the zero-shot prompt approach.\n2. Related Work # Video Anomaly Detection. MLLMs have been extensively applied in VAD recently. For instance, Holmes-VAD [56] processes untrimmed video with user prompts to produce frame-level anomaly scores and explanations for detected anomalies. CALLM [34] integrates a 3D autoencoder and a visual language model into a cascade system to predict anomalies. However, MLLMs have rarely been tested in VAD for smart home scenarios, where most methods primarily rely on motion detection algorithms, statistical models, or basic machine learning techniques to detect unusual behaviors or patterns [31, 38, 51]. For example, Withanage et al. [45] used depth cuboid similarity features with RGB-D imaging to detect falls, aiming to support in-situ assistance for fall incidents in the context of independent living for the elderly. Liu et al. [24] transformed fall detection into a sparse recognition problem of the signal, incorporating visual shielding for enhanced privacy protection and recognition accuracy. Despite the potential of MLLMs, there remains a lack of benchmark datasets for smart home scenarios, preventing comprehensive evaluation and adaptation of these models. Our work addresses this gap by introducing SmartHome-Bench, a benchmark specifically designed for VAD in smart home scenarios.\nBenchmark for MLLMs. Recent advancements in MLLMs [1, 2, 9, 17, 23, 42] have opened new avenues for processing diverse data types, including video, audio, and text. As a result, benchmarks designed to assess MLLM performance on video-related tasks have become increasingly important. 
Existing benchmarks like Flamingo [2] and VideoVista [22] demonstrate the effectiveness of MLLMs in video understanding and reasoning for fine-grained video tasks across broad domains. To explore specific task capabilities, benchmarks such as MVBench [19] and NExT-QA [49] evaluate temporal understanding in visual language models for temporally-sensitive videos, while Video-ChatGPT [30] quantifies video dialogue capabilities for benchmarking video conversation models. VANE-Bench [7] uses question-answer pairs to evaluate VAD on both real-world and AI-generated videos. Other benchmarks, such as Video-MME [10] and TempCompass [26], focus on categorizing video datasets for specific evaluation needs, like trending topics on YouTube (Video-MME [10]) or temporal aspects (TempCompass [26]). However, these benchmarks primarily address general video domains and overlook the unique characteristics of smart home scenarios. In contrast, SmartHome-Bench is the first benchmark specifically tailored for smart home scenarios, offering a dataset with detailed video descriptions and reasoning for detected anomalies.\nFigure 2. Example of video annotation from the SmartHome-Bench dataset.\n3. SmartHome-Bench Dataset # This section presents the raw video collection and annotation process for SmartHome-Bench, with an emphasis on the proposed taxonomy used to categorize video anomalies in smart home scenarios.\n3.1. Video Collection # We crawl videos from public sources, such as YouTube, to create SmartHome-Bench. To identify keywords associated with common anomalies, we review the literature on home security [8], family care [57], and pet monitoring [16], creating an initial keyword set that was refined by smart home experts. Additionally, we develop a separate keyword set to capture typical, non-anomalous events in smart homes. Using these keywords, we identify 8,611 videos on YouTube. 
After manual filtering, we finalize a set of 1,203 videos captured by both indoor and outdoor smart home cameras. Details on the collection and filtering process are provided in Appendix A.\n3.2. Video Annotation # In SmartHome-Bench, each video is manually annotated with (1) the event category; (2) the anomaly tag indicating whether the video event is normal, abnormal, or vague abnormal; (3) textual descriptions of the events; and (4) rationales explaining the reasoning behind the assigned anomaly tag. An example of an annotated video is shown in Figure 2.\nFigure 3. Overview of the video anomaly taxonomy.\nDefining anomalies is a key challenge in VAD [33], especially in smart home scenarios where interpretations of what constitutes an anomaly can vary widely among users. To streamline the annotation process, we develop an anomaly taxonomy to guide the labeling of event categories and anomaly tags, as illustrated in Figure 3. This taxonomy defines seven primary categories: security, baby monitoring, kid monitoring, senior care, pet monitoring, wildlife, and other. Each category is further divided into specific second-level event types, covering both normal and abnormal events. For example, the senior care category includes one normal event, routine activity, and three abnormal events: distress signal, senior fall, and elder abuse.\nThe complete video anomaly taxonomy is provided in Appendix B and serves as a structured guideline for annotators to ensure consistency and accuracy in labeling event categories and anomaly tags. Guided by the taxonomy, annotators label each video with a normal or abnormal tag for well-defined scenarios. If annotators cannot reach a consensus on a video\u0026rsquo;s anomaly classification due to limited context, it is labeled as vague abnormal. 
The distribution of categories and anomaly tags across the dataset is shown in Figure 1a, with further details on the video annotation process available in Appendix C.\nIn addition to categorizing events and tagging anomalies, human annotators provide detailed descriptions of video events and articulate the reasoning behind each anomaly judgment. Video descriptions are limited to 200 words, while reasoning explanations are limited to 100 words, promoting concise and precise justification. To ensure annotation quality, a human review process is used to reduce annotator bias. These high-quality textual annotations serve as a benchmark for validating MLLMs\u0026rsquo; video understanding and reasoning processes, as demonstrated in our case analysis in Section 5.5.\nFigure 4. Overview of adaptation methods and TRLC pipeline: the upper section shows vanilla adaptations, ICL methods, and the TRLC; the lower section presents the TRLC output from Gemini-1.5-pro on a SmartHome-Bench video.\n4. Methods # For smart home scenarios, users are often interested in receiving a clear alert about whether a video contains an anomalous event [5, 52]. By leveraging MLLMs, we aim to go beyond anomaly detection by also generating detailed descriptions and reasoning, thereby enriching the interpretability of detection outputs. We evaluate MLLMs\u0026rsquo; performance for VAD in smart home scenarios across multiple adaptation methods. As illustrated in Figure 4, we begin with vanilla adaptations, such as zero-shot prompting, chain-of-thought (CoT) prompting, and few-shot CoT prompting, to gauge MLLMs\u0026rsquo; baseline capabilities in recognizing video anomalies. Then, we further utilize an in-context learning (ICL) approach that incorporates the complete anomaly taxonomy, embedding expert knowledge to enhance MLLM anomaly understanding. 
Building on insights that MLLMs often struggle to follow complex instructions or capture nuanced details in a single pass, we develop the TRLC, a novel LLM chaining framework, to systematically address these challenges.\n4.1. Vanilla Adaptations # All prompts used for the following three vanilla adaptation methods are provided in Appendix D.1 .\nZero-Shot Prompting: In this setup, MLLM is prompted directly to return a binary anomaly label, where 0 indicates no anomaly detected and 1 indicates an anomaly detected.\nCoT Prompting: CoT prompting enhances complex reasoning by incorporating intermediate reasoning steps [44]. In this setup, we prompt MLLMs with the task instructions, smart home anomaly definitions, and video input, guiding them to complete the task in three steps: generating video descriptions, providing reasoning, and predicting the anomaly label.\nFew-Shot CoT Prompting: To enhance MLLMs\u0026rsquo; understanding of smart home video anomalies, we add a few representative anomaly examples at the end of the CoT prompt. Each example includes a video description, anomaly reasoning, and the corresponding ground-truth anomaly label.\n4.2. In-Context Learning # We further integrate the smart home anomaly taxonomy from Section 3.2 into the ICL prompts, building on a similar approach that effectively guides LLMs in conversation safety assessments using a safety risk taxonomy [14]. Building upon the CoT prompt, we include the complete anomaly taxonomy as a reference, allowing MLLMs to justify anomalies based on the taxonomy, and utilize their own knowledge if the video does not fit any predefined taxonomy category (see prompt in Appendix D.2). This integration provides MLLMs with structured guidelines and examples of both abnormal and normal events in smart home scenarios.\n4.3. Taxonomy-Driven Reflective LLM Chain # LLM chaining refers to a pipeline that decomposes the task into multiple steps, each solved by a unique LLM call [48]. 
In our proposed TRLC framework, the VAD task is divided into three smaller subtasks: (a) Taxonomy-Driven Rule Generation, (b) Initial Prediction, and (c) Self-Reflection (see prompts for each subtask in Appendix D.3). An example of the process in our TRLC is illustrated in Figure 4.\nStep (a): Taxonomy-Driven Rule Generation. MLLMs often struggle to follow long instructions accurately and capture all detailed information in prompts. Therefore, at the first step, we make an MLLM call to condense the full taxonomy from Section 3.2 into a list of concise, actionable rules. This rule set is then incorporated as expert knowledge in the subsequent prompting steps. The complete set of summarized rules is provided in Appendix D.3.\nStep (b): Initial Prediction. Using the summarized rules, input videos, and a CoT prompt, we call an MLLM to generate the initial VAD prediction, which includes a video description, reasoning, and an anomaly label. This output then serves as the input for Step (c).\nStep (c): Self-Reflection. We observe that a single MLLM call often misclassifies certain events due to the model\u0026rsquo;s limited contextual understanding. A notable example is the misclassification of an unattended cat left alone outside as a normal event, as shown in Step (b) of Figure 4. The model\u0026rsquo;s reasoning focuses solely on typical pet behavior, overlooking potential risks a pet may face when left alone outside, such as getting lost, encountering diseases, sustaining injuries, or facing dangerous wildlife. Adding a self-reflection step can help correct these types of initial misclassifications.\nIn Step (c), we reintroduce the generated rules from Step (a) and the results from Step (b) to the MLLM, prompting it to refine the initial predictions. For instance, an unattended outdoor pet is highlighted as a common smart home anomaly in Rule #2. 
With this additional context, the model successfully applies this rule to refine the initial VAD results, correcting the original classification.\nIn summary, our TRLC framework enhances MLLMs\u0026rsquo; contextual understanding through taxonomy-driven rules and significantly improves reasoning abilities via self-reflection. Additionally, the TRLC framework\u0026rsquo;s support for configurable video anomaly taxonomies enables broader applications, such as adapting VAD for diverse public and private environments. Furthermore, TRLC enables personalized VAD by allowing users to define tailored taxonomies that align with individual standards for anomalies.\n5. Experiments # In this section, we present the experimental results of the adaptation methods outlined in Section 4 across open-source and closed-source MLLMs. We convert the video\u0026rsquo;s anomaly tags to binary labels: normal(0), abnormal(1), and vague abnormal(1). The MLLM predictions, also in binary format, are then compared against these ground-truth labels.\n5.1. Experiment Setup # There are two ways to perform VAD: (1) asking MLLMs if the video is abnormal, referred to as abnormal detection, and (2) asking MLLMs if the video is normal, referred to as normal detection. We opt for abnormal detection because we observe that abnormal detection prompts yield better results. This is likely because MLLMs are predominantly trained on normal videos and may struggle to detect anomalies without additional instructions (see results in Appendix E).\nWe include six MLLMs in our experiments: five closed-source models, Gemini-1.5-flash-001 [42], Gemini-1.5-pro-001 [42], GPT-4o-2024-08-06 [13], GPT-4o-mini-2024-07-18 [36], and Claude-3.5-sonnet@20240229 [4], as well as one open-source model, VILA-13b [23]. For zero-shot, CoT, few-shot CoT, and ICL methods, we test all six models, while the TRLC is evaluated only with the five closed-source models, as VILA-13b struggles to follow long, complex instructions. 
Overall, these models offer a comprehensive comparison and serve as the most representative benchmarks for state-of-the-art MLLM performance in anomaly detection within smart home scenarios.\n5.2. Benchmarking on Vanilla Adaptations # Zero-Shot Prompting: Table 1 presents VAD performance results under zero-shot prompting, showcasing each model\u0026rsquo;s inherent understanding of smart home anomalies without additional guidance. Claude-3.5-sonnet achieves the highest accuracy, recall, and F1-score, while Gemini-1.5-pro leads in precision. The accuracy of the closed-source MLLMs ranges from 57.36% to 70.82%, only modestly above random chance (50%), indicating limited baseline performance. Notably, VILA-13b classifies all videos as normal, underscoring its difficulty with zero-shot VAD tasks in detecting anomalies. These low VAD performance results suggest that, without guidance, these MLLMs have limited inherent understanding of smart home anomalies or may not fully utilize their capability to detect anomalies effectively.\nTable 1. Anomaly detection performance of different MLLMs with the zero-shot prompting (Bold values indicate the highest score for each metric; applies to all tables in this paper).\n| Model | Accuracy | Precision | Recall | F1-score |\n| --- | --- | --- | --- | --- |\n| Gemini-1.5-flash | 58.44 | 79.22 | 31.12 | 44.69 |\n| Gemini-1.5-pro | 57.36 | 84.34 | 25.73 | 39.43 |\n| GPT-4o | 68.41 | 80.09 | 55.16 | 65.33 |\n| GPT-4o-mini | 69.91 | 76.52 | 63.79 | 69.58 |\n| Claude-3.5-sonnet | 70.82 | 69.66 | 81.36 | 75.05 |\n| VILA-13b | 46.05 | 0 | 0 | 0 |\nChain-of-Thought Prompting: CoT prompting improves VAD accuracy over zero-shot prompting for five of the six tested MLLMs (see Table 2 vs. Table 1); the exception is GPT-4o-mini, whose accuracy dips slightly from 69.91% to 68.83%. This underscores the effectiveness of more granular anomaly definitions and step-by-step guidance. Among the models, Gemini-1.5-pro achieves the highest accuracy and precision. Notably, GPT-4o-mini outperforms GPT-4o in recall, albeit with reduced precision. 
For all closed-source MLLMs except Claude-3.5-sonnet, the gap between precision and recall narrows, resulting in a significantly improved F1-score compared to Table 1. VILA-13b also demonstrates substantial improvement across all metrics, highlighting the positive impact of CoT prompting on its performance.\nTable 2. Anomaly detection performance of different MLLMs with the CoT prompting.\n| Model | Accuracy | Precision | Recall | F1-score |\n| --- | --- | --- | --- | --- |\n| Gemini-1.5-flash | 69.58 | 74.44 | 66.41 | 70.2 |\n| Gemini-1.5-pro | 74.06 | 83.77 | 64.41 | 72.82 |\n| GPT-4o | 72.57 | 83.02 | 61.79 | 70.85 |\n| GPT-4o-mini | 68.83 | 68.07 | 79.51 | 73.35 |\n| Claude-3.5-sonnet | 71.9 | 83.44 | 59.78 | 69.66 |\n| VILA-13b | 68.41 | 68.45 | 76.89 | 72.42 |\nFew-Shot CoT Prompting: In the few-shot CoT setup, we extend the CoT prompt by adding three representative examples of anomaly videos. Due to MLLMs\u0026rsquo; processing limitations on the number of images or videos per request, these examples are provided as text tuples. As shown in Table 3, Gemini-1.5-pro achieves the highest accuracy, surpassing the previous CoT best of 74.06% and leading in precision and F1-score, while GPT-4o-mini performs best in recall. However, for models like Gemini-1.5-flash, GPT-4o, GPT-4o-mini, and VILA-13b, accuracy is slightly lower than in Table 2, suggesting that few-shot CoT does not fundamentally enhance CoT performance. This may be because the three examples provided in the prompt do not fully capture the range of anomalies and may distort the MLLMs\u0026rsquo; inherent knowledge, leading to misclassification.\nTable 3. Anomaly detection performance of different MLLMs with the few-shot CoT prompting.\n| Model | Accuracy | Precision | Recall | F1-score |\n| --- | --- | --- | --- | --- |\n| Gemini-1.5-flash | 68.41 | 79.43 | 55.93 | 65.64 |\n| Gemini-1.5-pro | 76.39 | 86.87 | 66.26 | 75.17 |\n| GPT-4o | 71.65 | 83.19 | 59.48 | 69.36 |\n| GPT-4o-mini | 68 | 66.3 | 82.74 | 73.61 |\n| Claude-3.5-sonnet | 72.98 | 77.65 | 70.11 | 73.68 |\n| VILA-13b | 67.17 | 69.18 | 70.57 | 69.87 |\n5.3. 
Benchmarking on ICL # In CoT and few-shot CoT experiments, we find that adding more informative and precise anomaly definitions to the prompt improves VAD performance. With this insight, we utilize an ICL approach that incorporates the complete anomaly taxonomy in the prompt, providing MLLMs with structured categories and anomaly definitions specific to diverse smart home scenarios.\nTable 4. Anomaly detection performance of different MLLMs with the ICL method.\n| Model | Accuracy | Precision | Recall | F1-score |\n| --- | --- | --- | --- | --- |\n| Gemini-1.5-flash | 67.08 | 80.78 | 51.16 | 62.64 |\n| Gemini-1.5-pro | 74.4 | 86.2 | 62.56 | 72.5 |\n| GPT-4o | 72.65 | 89.41 | 55.93 | 68.82 |\n| GPT-4o-mini | 71.74 | 83.96 | 58.86 | 69.2 |\n| Claude-3.5-sonnet | 73.82 | 84.22 | 63.33 | 72.3 |\n| VILA-13b | 65.59 | 75.82 | 53.16 | 62.5 |\nTable 4 shows each MLLM’s ability to directly apply the anomaly taxonomy in VAD with the ICL method. While half of the models (i.e., GPT-4o, GPT-4o-mini, and Claude-3.5-sonnet) demonstrate improved accuracy over few-shot CoT, the other half do not, suggesting this approach does not consistently enhance few-shot CoT performance. Except for a slight decrease in precision for Gemini-1.5-pro, all other MLLMs show increased precision, indicating that the taxonomy helps MLLMs identify anomalies more accurately.\n5.4. Benchmarking on TRLC # ICL experiment results indicate that directly integrating the full anomaly taxonomy does not significantly improve MLLMs\u0026rsquo; VAD performance. Additionally, lengthy prompts in a single call tend to dilute the primary task, making it challenging for MLLMs to stay focused on VAD. To address this, our TRLC approach uses anomaly-specific rules generated from the taxonomy rather than the full taxonomy, providing targeted guidance and avoiding the excess detail that can lead to confusion in ICL.\nAs shown in Table 5, applying this approach to MLLMs achieves better accuracy than all other adaptation methods in Tables 1–4, with Claude-3.5-sonnet reaching 79.05%. Figure 5 further illustrates the accuracy results for all adaptation methods. 
Notably, our TRLC approach significantly boosts performance across all tested MLLMs, outperforming all other methods in four of the five models. The exception is GPT-4o-mini, where the TRLC ranks second, just slightly below its ICL result. On average, the TRLC method increases accuracy by 11.62 percentage points over the zero-shot prompting across all five closed-source models. These results demonstrate that our TRLC approach provides MLLMs with an improved contextual understanding of smart home anomalies and enhances their reasoning abilities compared to non-chaining methods.\nTable 5. Anomaly detection performance of different MLLMs with the TRLC method.\n| Model | Accuracy | Precision | Recall | F1-score |\n| --- | --- | --- | --- | --- |\n| Gemini-1.5-flash | 77.14 | 77.74 | 80.74 | 79.21 |\n| Gemini-1.5-pro | 78.47 | 82.18 | 76.73 | 79.36 |\n| GPT-4o | 77.47 | 79.35 | 78.74 | 79.04 |\n| GPT-4o-mini | 70.82 | 67.74 | 87.67 | 76.43 |\n| Claude-3.5-sonnet | 79.05 | 79.67 | 82.13 | 80.88 |\nMajority Voting: To assess the peak performance achievable with the TRLC, we combine TRLC results from the top three MLLMs (Gemini-1.5-pro, GPT-4o, and Claude-3.5-sonnet), using majority voting to determine the final anomaly prediction for each video. There are two possible voting outcomes: unanimous agreement and absolute majority. When all three MLLMs produce the same anomaly prediction, such as Gemini-1.5-pro: 0, GPT-4o: 0, and Claude-3.5-sonnet: 0, the result is classified as unanimous, and that prediction (normal(0)) becomes the final label. In all other cases, the majority prediction is used as the final classification.\nFigure 5. Overall VAD accuracy of all tested adaptation methods across different MLLMs.\nAs shown in Figure 6, this approach increases accuracy to 81.63%, surpassing the individual performance of each model in Table 5. Specifically, the numbers of videos with unanimous agreement and absolute majority outcomes are 781 and 422, with corresponding accuracies of 91.2% and 64.0%, respectively. 
The high VAD accuracy in cases of unanimous agreement suggests potential applications, such as leveraging unanimous MLLM votes to create reliable ground-truth anomaly labels for large smart home video datasets.\nFigure 6. Majority voting outcomes on VAD using TRLC results across the top three MLLMs (Gemini-1.5-pro, GPT-4o, and Claude-3.5-sonnet) with video distribution by ground-truth anomaly categories.\n5.5. In-Depth Analysis # Hard Case Analysis: As introduced in Section 3.2, our dataset includes a category of videos with ambiguous anomalies, labeled as vague abnormal. These videos present challenges even for human annotators, making them a useful subset for assessing the limits of MLLMs in VAD prediction. To explore this, we analyze the accuracy of MLLMs on all 91 vague abnormal videos, as shown in Table 6. Generally, the accuracy for vague cases is significantly lower than the other cases across all MLLMs. Notably, with the exception of Claude-3.5-sonnet, MLLMs achieve their highest vague accuracy using the TRLC, underscoring its effectiveness in improving VAD performance, even in challenging smart home scenarios.\nTable 6. VAD accuracy on 91 vague abnormal videos across different MLLMs with all adaptation methods (ZS: zero-shot, CoT: chain-of-thought, FS: few-shot chain-of-thought, ICL: in-context learning, TRLC: taxonomy-driven reflective LLM chain).\n| Model | ZS | CoT | FS | ICL | TRLC |\n| --- | --- | --- | --- | --- | --- |\n| Gemini-1.5-flash | 35.16 | 48.35 | 38.46 | 28.57 | 60.44 |\n| Gemini-1.5-pro | 16.48 | 37.36 | 37.36 | 35.16 | 56.04 |\n| GPT-4o | 47.25 | 37.36 | 30.77 | 23.08 | 50.55 |\n| GPT-4o-mini | 52.75 | 69.23 | 71.43 | 34.07 | 81.32 |\n| Claude-3.5-sonnet | 67.03 | 30.77 | 41.76 | 34.07 | 59.34 |\nError Diagnosis: To understand which aspects MLLMs struggle with in anomaly detection within our dataset, we evaluate their video descriptions and reasoning against annotated ground truth. 
To capture all possible outcomes, we manually analyze MLLM outputs for 100 videos and identify five types of failure outcomes: (1) Misinterpretation: misdescribing or misunderstanding video events; (2) Event Omission: missing key abnormal events; (3) Hallucination: adding content that is not present; (4) Context Lack: failing to grasp details like the identity of the people and the emotions of the participants; and (5) Technical Error: failing to generate a response. Using these identified failure types, we then employ GPT-4 to evaluate the description and reasoning for all videos (see the prompts in Appendix D.4). The results are presented in Figures 7 and 8. Overall, MLLMs make more mistakes in video descriptions than in reasoning, likely due to the longer length of descriptions (see examples in Figure 2).\nSince a single video may exhibit multiple failure types, the total count of categorized types exceeds the dataset size of 1,203. Among the failure types, Context Lack is more prominent in reasoning than in descriptions. This occurs when MLLMs fail to grasp smart home context beyond basic descriptions, such as the identities of individuals in the video, leading to misinterpretation of normal events as anomalies or overlooking true anomalies. For instance, a description of a dog engaging with a person could have two possible interpretations: (1) playing with its owner, which is normal, or (2) attempting to fend off an intruder, which would be an anomaly, depending on whether the person is a resident. Incorporating additional context, such as a customized anomaly taxonomy and recognition of familiar faces, may help address this limitation.\nFigure 7. Distribution of video outcomes for the top three MLLMs\u0026rsquo; descriptions compared to human-annotated description.\nFigure 8. Distribution of video outcomes for the top three MLLMs\u0026rsquo; reasoning compared to human-annotated reasoning.\n6. 
Conclusion # In this paper, we introduce SmartHome-Bench, the first benchmark specifically designed for detecting anomalies in smart home scenarios. The dataset comprises 1,203 video clips, each annotated with an event category, anomaly tag, and high-quality video description and reasoning. We assess the performance of state-of-the-art closed-source and open-source MLLMs using various prompting techniques. Notably, we propose the TRLC, a novel LLM chaining framework tailored for VAD tasks, which outperforms other methods and achieves the highest accuracy of 79.05% with Claude-3.5-sonnet.\n7. Acknowledgment # This work is supported by Wyze Labs, Inc. and the University of Washington. We thank Kevin Beussman for donating the videos. We also thank the annotators Lina Liu, Vincent Nguyen, Pengfei Gao, Yunyun Xi, Liting Jia, and Xiaoya Hu for their hard work on data annotation.\nReferences # [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 , 2023. 2 [2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022. 1 , 2 [3] Manal Mostafa Ali. Real-time video anomaly detection for smart surveillance. IET Image Processing, 17(5):1375–1388, 2023. 2 [4] Anthropic. Claude 3.5 sonnet, 2024. Accessed: 2025-04-05. 5 [5] UABUA Bakar, Hemant Ghayvat, SF Hasanm, and Subhas Chandra Mukhopadhyay. Activity and anomaly detection in smart home: A survey. Next generation sensors and systems, pages 191–220, 2015. 4 [6] Yassir Bendou, Giulia Lioi, Bastien Pasdeloup, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux, and Vincent Gripon. 
Llm meets vision-language models for zero-shot one-class classification. arXiv preprint arXiv:2404.00675, 2024. 24 [7] Rohit Bharadwaj, Hanan Gani, Muzammal Naseer, Fahad Shahbaz Khan, and Salman Khan. Vane-bench: Video anomaly evaluation benchmark for conversational lmms. arXiv preprint arXiv:2406.10326, 2024. 2 , 3 [8] Kellie Corona, Katie Osterdahl, Roderic Collins, and Anthony Hoogs. Meva: A large-scale multiview, multimodal video dataset for activity detection. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1060–1068, 2021. 3 , 12 [9] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palme: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 1 , 2 [10] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 3 [11] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1705–1714, 2019. 1 [12] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016. 1\n[13] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 
5\n[14] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llmbased input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023. 5\n[15] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7842–7851, 2019. 1\n[16] Jinah Kim and Nammee Moon. Dog behavior recognition based on multimodal data from a camera and wearable device. Applied sciences, 12(6):3199, 2022. 3 , 12\n[17] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730– 19742. PMLR, 2023. 1 , 2\n[18] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 1\n[19] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2\n[20] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multisequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1395–1403, 2022. 1\n[21] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE transactions on pattern analysis and machine intelligence , 36(1):18–32, 2013. 2\n[22] Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, and Min Zhang. 
Videovista: A versatile benchmark for video understanding and reasoning. arXiv preprint arXiv:2406.11303, 2024. 2\n[23] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024. 2 , 5\n[24] Jixin Liu, Yinyun Xia, and Zheng Tang. Privacy-preserving video fall detection using visual shielding information. The Visual Computer, 37(2):359–370, 2021. 2\n[25] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018. 2\n[26] Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476, 2024. 3\n[27] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013. 2\n[28] Hui Lv and Qianru Sun. Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702, 2024. 2\n[29] Hui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Learning normal dynamics in videos with meta prototype network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15425–15434, 2021. 1\n[30] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 3\n[31] Amir Markovitz, Gilad Sharir, Itamar Friedman, Lihi ZelnikManor, and Shai Avidan. Graph embedded pose clustering for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 10539–10547, 2020. 
2\n[32] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024. 2\n[33] Rashmiranjan Nayak, Umesh Chandra Pati, and Santos Kumar Das. A comprehensive review on deep learning-based methods for video anomaly detection. Image and Vision Computing, 106:104078, 2021. 3\n[34] Apostolos Ntelopoulos and Kamal Nasrollahi. Callm: Cascading autoencoder and large language model for video anomaly detection. In International Conference on Image Processing Theory, Tools and Applications. IEEE, 2024. 2\n[35] Richard Oelschlager. Evaluating the impact of hallucinations on user trust and satisfaction in llm-based systems, 2024. 2\n[36] OpenAI. Gpt-4o-mini: Advancing cost-efficient intelligence, 2024. Accessed: 2025-04-05. 5\n[37] Sharnil Pandya, Hemant Ghayvat, Ketan Kotecha, Mohammed Awais, Saeed Akbarzadeh, Prosanta Gope, Subhas Chandra Mukhopadhyay, and Wei Chen. Smart home anti-theft system: a novel approach for near real-time monitoring and smart home security for wellness protocol. Applied System Innovation, 1(4):42, 2018. 1\n[38] Jing Ren, Feng Xia, Yemeng Liu, and Ivan Lee. Deep video anomaly detection: Opportunities and challenges. In\n2021 international conference on data mining workshops (ICDMW), pages 959–966. IEEE, 2021. 2 [39] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018. 1 , 2\n[40] Jiayu Sun, Jie Shao, and Chengkun He. Abnormal event detection for video surveillance using deep one-class learning. Multimedia Tools and Applications, 78(3):3633–3647, 2019. 24\n[41] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 
Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 1\n[42] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 2 , 5\n[43] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4975–4986, 2021. 1\n[44] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 5\n[45] Kalana Ishara Withanage, Ivan Lee, Russell Brinkworth, Shylie Mackintosh, and Dominic Thewlis. Fall recovery subactivity recognition with rgb-d cameras. IEEE transactions on industrial informatics, 12(6):2312–2320, 2016. 2\n[46] Peng Wu and Jing Liu. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing, 30:3513–3527, 2021. 1\n[47] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 322–339. Springer, 2020. 2\n[48] Tongshuang Wu, Michael Terry, and Carrie Jun Cai. Ai chains: Transparent and controllable human-ai interaction by chaining large language model prompts. In Proceedings of the 2022 CHI conference on human factors in computing systems, pages 1–22, 2022. 5\n[49] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. 
Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021. 2\n[50] Xiaohao Xu, Yunkang Cao, Yongqi Chen, Weiming Shen, and Xiaonan Huang. Customizing visual-language foundation models for multi-modal anomaly detection and reasoning. arXiv preprint arXiv:2403.11083, 2024. 2\n[51] Salisu Wada Yahaya, Ahmad Lotfi, and Mufti Mahmud. Towards a data-driven adaptive anomaly detection system for human activity. Pattern Recognition Letters, 145:200–207, 2021. 2\n[52] Masaaki Yamauchi, Yuichi Ohsita, Masayuki Murata, Kensuke Ueda, and Yoshiaki Kato. Anomaly detection in smart home operation from user behaviors and home conditions. IEEE Transactions on Consumer Electronics, 66(2):183– 192, 2020. 4\n[53] Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. Follow the rules: Reasoning for video anomaly detection with large language models. arXiv preprint arXiv:2407.10299, 2024. 2\n[54] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14744–14754, 2022. 1\n[55] Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18527–18536, 2024. 2\n[56] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, and Nong Sang. Holmes-vad: Towards unbiased and explainable video anomaly detection via multi-modal llm. arXiv preprint arXiv:2406.12235, 2024. 2\n[57] Junge Zhang, Yanhu Shan, and Kaiqi Huang. Isee smart home (ish): Smart video analysis for home security. Neurocomputing, 149:752–766, 2015. 
3 , 12\n[58] Sijie Zhu, Chen Chen, and Waqas Sultani. Video anomaly detection for smart surveillance. In Computer Vision: A Reference Guide, pages 1315–1322. Springer, 2021. 2\nA. Video Collection # To curate our SmartHome-Bench dataset, we collect videos from public sources, such as YouTube. We craft a keyword set to crawl and identify videos with anomalies in smart homes. To achieve this, we survey the literature on different aspects, such as home security [8], family care [57], and pet monitoring [16]. Additionally, we develop a separate keyword set to capture typical, normal events in smart homes. These keywords are then refined with input from smart home experts. Table 7 shows examples of keywords used in the search process. For each keyword, we collect approximately 20 videos from YouTube, resulting in an initial pool of 8,611 videos. We then filter out irrelevant footage, such as edited content and videos not captured by smart home cameras. For relevant videos that contain advertisements, we trim these segments to ensure the videos are clean. This curation process results in the final SmartHome-Bench dataset, comprising 1,203 videos recorded by both indoor and outdoor smart home cameras.\nTable 7. Example keywords for searching normal and abnormal videos.\nType | Example Keywords\nNormal Videos | sleeping crib, kid surveillance, elderly resting safe, senior camera m…, visitor arrival video, vehicle arriving home, scheduled delivery h…\nAbnormal Videos | vomiting home cam, child wandering outside, kid sharp objects, child sudden fall, se…r unexpected fall, senior physical distress, elderly rough caregiver, unauthorized entry …empt, package theft, car theft driveway, broken window home, suspicious person home, …ere weather property, fire damage home, earthquake home safety, severe wind backyard, …nderstorm backyard, flood property risk\nB. 
Smart Home Anomaly Taxonomy # We present a comprehensive taxonomy for video anomalies in the smart home domain. This taxonomy is developed based on user study, focusing on seven areas like security, senior care, and pet monitoring, and is further refined by smart home experts. Each category is further divided into normal and abnormal videos, with detailed descriptions provided for both.\n1. Wildlife # Normal Videos: # – Harmless Wildlife: Harmless wildlife sightings, such as squirrels, birds, or rabbits, moving through the yard. – Common Pests: Common pest activity that doesn\u0026rsquo;t pose immediate danger (e.g., bugs in the garden). Abnormal Videos: # – Dangerous Wildlife: Presence of dangerous wildlife like snakes, spiders, or raccoons that may pose a health risk. – Wildlife Damage: Any wildlife activity that causes or potentially causes damage to property or threatens human or pet safety. – Indoor Wildlife: Any wildlife (dangerous or not) that enters a home without clear containment. 2. Pet Monitoring # Normal Videos: # – Routine Pet Activity: Pets engaging in regular play, resting or moving around within designated safe areas. – Safe Interaction: Pets interacting with known family members or other pets. – Supervised Pets: Pets accompanied by their guardian without interacting with property or people in harmful ways. Abnormal Videos: – Unattended Pets: Pets left outside alone for extended periods. – Escape Attempts: Pets attempting to escape, leaving the designated area, or exhibiting behaviors indicating escape attempts. – Destructive Behavior: Pets causing property damage by actions like chewing, scratching, or digging. – Distress Signals: Behaviors that indicate illness or distress, like vomiting, excessive scratching, or erratic movements. – Conflict or Injury Risk: Any interaction with others that could lead to conflict or injury. 3. Baby Monitoring # Normal Videos: # – Safe Play: Baby engaging in play or sleep within safe zones or under supervision. 
– Caregiver Interaction: Harmless interactions between the baby and caregivers. Abnormal Videos: # – Near Danger: Baby nearing dangerous zones (e.g., staircases, swimming pools) without adult supervision. – Unattended Baby: Baby wandering outside a crib, stroller, or designated play area without adult presence. – Injury Risk: Sudden, unexpected falls that may lead to injury. – Baby Abuse: Any abusive behavior toward the baby, such as hitting, or forcing them to act against their will. 4. Kid Monitoring # Normal Videos: # – Safe Play: Kids playing or moving around indoors or outdoors within designated areas. – Routine Activities: Regular daily activities under adult supervision. Abnormal Videos: # – Wandering: Kids found wandering outdoors or in dangerous locations without adult supervision. – Dangerous Actions: Dangerous actions indoors (e.g., playing with sharp objects, accessing restricted areas) or significant health/safety concerns (e.g., choking hazards). – Injury Risk: Sudden, unexpected falls that may lead to injury. 5. Senior Care # Normal Videos: # – Routine Activity: Seniors engaging in routine activities like walking, resting, or interacting with caregivers or family. Abnormal Videos: – Senior Falls: Sudden, unexpected falls that may lead to injury. – Distress Signals: Signs of distress or calls for help through hand gestures or unusual body language. – Elder Abuse: Any abusive or rough behavior by caregivers toward seniors, including verbal and physical abuse. 6. Security # Normal Videos: # – Routine Activity: Routine activity of homeowners, known visitors, or vehicles arriving and leaving. – Scheduled Delivery: Scheduled package deliveries or pickups without interference. Abnormal Videos: # – Unauthorized Entry: Motion or presence indicating potential break-ins, or trespassing. – Suspicious Loitering: Loitering individuals or those wearing unusual attire that deviates from the norm. 
– Forced Entry: Forced entry attempts, such as fiddling with locks, tampering with doors or windows, or trying to enter a home or vehicle through unconventional means. – Theft or Vandalism: Unauthorized removal of packages, vehicles, or other items. – Property Damage: Acts of property damage like graffiti, broken windows, car crashes, or other forms of vandalism. – Violence or Threats: Actions that might cause harm, such as kidnapping, aggressive confrontations, or any threatening behavior. – Disturbing Behavior: Unusual or eccentric behavior by individuals that could alarm or frighten viewers. 7. Other Category # Normal Videos: # – Everyday Activity: Videos that do not fit any of the above categories but show harmless, everyday activities, such as trees waving, normal weather events, or background motion. Abnormal Videos: # – Severe Weather: Severe weather conditions or natural disasters like fires, earthquakes, floods, or storms causing property damage or safety hazards. – Unexplained Phenomena: Unexplained phenomena of inanimate objects. – Falling Objects: Sudden, unexpected falls of inanimate objects that may cause damage or injury. – Risky activities: Irregular activities that do not fit into other categories but may pose risks or concerns. C. Video Annotation # During the video annotation process, we assign unique IDs to the downloaded videos to prevent annotators from being influenced by the original titles or metadata. The annotators classify each video into one or more of the seven categories in the taxonomy outlined in Appendix B, as real-world events in a single video may span multiple categories. Each video is then assigned an anomaly tag of normal , abnormal, or vague abnormal, based on the definitions outlined in the\ntaxonomy. The vague abnormal category is created for videos where annotators cannot reach a consensus on whether the content is normal or abnormal. 
This category is specifically introduced to challenge the video anomaly detection (VAD) capabilities of multi-modal large language models (MLLMs) with videos that are difficult for even humans to classify. A vague normal category is not included, as any ambiguity regarding the presence of an anomaly is classified under vague abnormal.\nWe instruct annotators to write high-quality video descriptions and provide detailed reasoning for the assignment of each video\u0026rsquo;s anomaly tag. These annotations establish a strong foundation for future research by enabling the generation of diverse question-answer pairs to assess the video understanding and reasoning capabilities of MLLMs. Additionally, the inclusion of ground-truth reasoning ensures a transparent inference process for classifying normal and abnormal videos, which can be leveraged to fine-tune MLLMs and improve anomaly detection accuracy in smart home scenarios. To maintain consistency and quality across video descriptions and reasoning annotations, we use the Gemini-1.5-pro model to generate initial drafts. Annotators then review each video and refine or rewrite these drafts according to three main criteria: (1) clarity and precision of language, (2) alignment of descriptions and reasoning with the video content, and (3) accuracy in identifying key elements such as objects triggering anomalies, abnormal movements, participants, and environmental conditions.\nFigure 9. The UI enables annotators to label videos by selecting event categories, assigning anomaly tags, and providing detailed video descriptions along with the reasoning behind the observed anomalies or normality.\nTo streamline the annotation process and maximize efficiency, annotators use a customized user interface (UI), shown in Figure 9, to label each video\u0026rsquo;s event category and anomaly tag, as well as to manually write the description and reasoning. 
To ensure the quality and consistency of the annotations, we conduct a human review of 200 randomly selected videos after the initial round of annotation.\nFollowing the annotation process for all 1,203 videos, the statistics of the SmartHome-Bench dataset are presented in Figure 1a of the main paper. The dataset shows a balanced distribution between abnormal and normal videos, with the security category containing the largest number of videos among the seven event categories. Additionally, Figure 10 illustrates the distribution of video durations and word counts for descriptions and reasoning annotations. The average video length is approximately 20 seconds, with most clips being shorter than 80 seconds. This duration aligns well with the frame-processing limitations of some existing MLLMs, enabling relatively comprehensive predictions in VAD tasks. The word count distribution reveals that reasoning annotations are typically more concise than descriptions, as they focus solely on the key event leading to the assigned anomaly tag. In contrast, descriptions provide a detailed account of all events within the video.\nFigure 10. Distribution of video durations and word counts for human-annotated video descriptions and reasoning.\nD. Prompts for Adaptation Methods and In-Depth Analysis # We provide all prompts used for adaptation methods and error diagnosis in in-depth analysis as follows.\nD.1. System Prompt for Vanilla Adaptations # Figure 11 shows the prompts used in zero-shot prompting for the VAD task.\nZero-Shot Prompting for Video Normality Detection\nPlease watch the video carefully and determine whether the situation captured in the video is normal.\nFigure 11. System prompts adopted in zero-shot prompting for VAD. MLLMs are prompted directly to return a binary anomaly label.\n
Figure 12 shows the prompts used in chain-of-thought (CoT) prompting for the VAD task.\nFigure 12. System prompts adopted in CoT prompting include task instructions, smart home anomaly definitions, and video input, guiding MLLMs to complete the task in three steps: generating video descriptions, providing reasoning, and predicting the anomaly label.\nChain-of-Thought Prompting for Video Normality Detection\nYou are an excellent smart video vigilance expert agent in the smart home security domain.\nTask Instruction:\nYou are given a smart home video clip, and your job is to carefully identify normal, expected, or ordinary situations captured by the surveillance cameras. These cameras are set up by users to enhance their safety and security. 
Keep in mind that the people in the video may or may not be the camera owners.\nNormality Definition:\nIn this context, normality refers to behaviors or events that are typical and do not raise concerns related to security, personal safety, child safety, wildlife activity, pet behavior, senior monitoring, or any other situations that seem usual.\nResponse Format:\nPlease think step by step and respond using the format below:\n{\n\u0026ldquo;video_description\u0026rdquo;: \u0026ldquo;A concise description of the video content, including objects, movements, and environmental conditions (max 200 words)\u0026rdquo;,\n\u0026ldquo;reasoning\u0026rdquo;: \u0026ldquo;Detailed reasoning for why the situation is considered normal and not concerning, if applicable (max 100 words)\u0026rdquo;,\n\u0026ldquo;normality\u0026rdquo;: 0 or 1 // 0 for the video is normal, 1 for anomaly detected\n}\nFigure 13 shows the prompts used in the few-shot CoT prompting for the VAD task.\nFigure 13. System prompts adopted in few-shot CoT prompting for VAD. Each example provided includes a video description, anomaly reasoning, and the corresponding ground-truth anomaly label.\nD.2. System Prompt for In-Context Learning # The prompts used in in-context learning (ICL) for the VAD task are shown in Figure 14.\nFigure 14. System prompts adopted in ICL for VAD. Building upon the CoT prompt, we include the complete anomaly taxonomy as a reference.\nD.3. System Prompt for Taxonomy-Driven Reflective LLM Chain # The prompts used in the taxonomy-driven reflective LLM chain (TRLC) framework for the VAD task are detailed as follows. First, the prompts in Figure 15 are used in step (a) of the TRLC to generate rules from the complete video anomaly taxonomy, with the resulting rules from step (a) shown in Figure 16. Next, the prompts in Figure 17 are employed to predict the initial detection for the VAD task. Finally, the self-reflection step is carried out using the prompts provided in Figure 18.\nFigure 15. 
System prompts adopted in step (a) of the TRLC for VAD: taxonomy-driven rule generation.\n\u0026ldquo;Rule 10: Identify individuals tampering with locks, attempting forced entry, or wearing unusual attire.\u0026rdquo;\nFigure 16. 10 rules generated from the full video anomaly taxonomy in step (a) of TRLC by GPT-4o.\nTRLC for Video Anomaly Detection: Initial Prediction (Step b)\nPrompt:\nYou are an excellent smart video vigilance expert agent in the smart home security domain.\nFigure 17. System prompts adopted in step (b) of the TRLC for VAD: initial prediction. (These prompts are identical to the CoT prompts shown in Figure 12).\nFigure 18. System prompts adopted in step (c) of the TRLC for VAD: self-reflection.\nD.4. System Prompt for Error Diagnosis in In-Depth Analysis # We use the prompts in Figure 19 and Figure 20 to evaluate MLLM-generated video descriptions and reasoning against human-annotated counterparts, respectively.\nFigure 19. System prompts adopted in evaluating the MLLM-generated video description for VAD.\nFigure 20. System prompts adopted in evaluating the MLLM-generated video reasoning for VAD.\nTable 8. Performance of MLLMs with two prompt frames: accuracy, precision, recall (%), and processing time (s) compared across different MLLMs using zero-shot prompting (AD: anomaly detection, ND: normality detection).\nModel | Accuracy (AD) | Accuracy (ND) | Precision (AD) | Precision (ND) | Recall (AD) | Recall (ND) | Video Processing Time (AD) | Video Processing Time (ND)\nGemini-1.5-flash | 58.44 | 72.90 | 79.22 | 81.36 | 31.12 | 64.56 | 3.43 | 3.26\nGemini-1.5-pro | 57.36 | 74.15 | 84.34 | 86.58 | 25.73 | 61.63 | 4.14 | 4.02\nGPT-4o | 68.41 | 70.74 | 80.09 | 82.07 | 55.16 | 58.55 | 10.15 | 9.79\nGPT-4o-mini | 69.91 | 73.07 | 76.52 | 78.66 | 63.79 | 68.72 | 10.09 | 10.39\nClaude-3.5-sonnet | 70.82 | 74.06 | 69.66 | 82.97 | 81.36 | 65.33 | 20.87 | 21.51\nVILA-13b | 46.05 | 55.28 | 0.00 | 78.46 | 0.00 | 23.57 | 1.38 | 1.28\nE. Additional Experiments # E.1. 
Comparison between Anomaly Detection and Normality Detection # Anomaly detection is a classical binary classification task [6]. In the context of VAD, we employ two distinct prompt frames to evaluate the accuracy of MLLMs in this classification task. First, we prompt the MLLMs to identify abnormal events within a sequence of normal activities, targeting the anomaly detection task. Conversely, given that \u0026ldquo;normal videos\u0026rdquo; constitute the majority of training data [40], we also frame the task as a normality detection issue, prompting MLLMs to justify whether a video is normal. This bidirectional approach allows for a comprehensive evaluation of the MLLMs\u0026rsquo; capabilities in understanding and reasoning about smart home video clips, highlighting performance differences across different task frames in MLLM-based VAD.\nZero-Shot Prompting The zero-shot prompt for anomaly detection is illustrated in Figure 11, while the prompt for normality detection is provided in Figure 21. Table 8 presents the VAD results for both anomaly detection and normality detection tasks using zero-shot prompting. All MLLMs, except Claude-3.5-sonnet, achieve higher accuracy, precision, and recall in the normality detection task. VILA-13b classifies all videos as normal when tasked with anomaly detection, emphasizing its limitations in zero-shot VAD tasks, despite being the fastest model in processing videos. Given that VAD is a binary classification task, the random guess accuracy is 50%. Even the best-performing MLLMs achieve accuracy close to this threshold, highlighting their limited understanding of anomalies in smart home contexts. These results likely reflect the models\u0026rsquo; training on datasets primarily composed of normal videos, leading to stronger prior knowledge of normal events in smart home scenarios.\nZero-Shot Prompting for Video Anomaly Detection\nPlease watch the video carefully and determine whether it contains any anomalies.\nResponse Format:\nReply using the following format:\n{\n\u0026ldquo;anomaly\u0026rdquo;: 0 or 1 // 0 for no anomaly detected, 1 for anomaly detected\n}\nFigure 21. System prompts adopted in zero-shot prompting for video normality detection.\nCoT Prompting Given that all MLLMs perform better on normality detection than anomaly detection with zero-shot prompting, an important question arises: does this trend continue with CoT prompting?\nChain-of-Thought Prompting for Video Anomaly Detection\nYou are an excellent smart video vigilance expert agent in the smart home security domain.\nTask Instruction:\nYou are given a smart home video clip, and your job is to carefully identify potentially risky, suspicious, or anomalous situations captured by the surveillance cameras. These cameras are set up by users to enhance their safety and security. Keep in mind that the people in the video may or may not be the camera owners.\nAnomaly Definition:\nIn this context, anomalies refer to behaviors or events that raise concerns related to security, personal safety, child safety, wildlife alerts, unusual pet behavior, senior monitoring, or any other situations that seem out of the ordinary.\nResponse Format:\nPlease think step by step and respond using the format below:\n{\n\u0026ldquo;video_description\u0026rdquo;: \u0026ldquo;A concise description of the video content, including objects, movements, and environmental conditions (max 200 words)\u0026rdquo;,\n\u0026ldquo;reasoning\u0026rdquo;: \u0026ldquo;Detailed reasoning for why the situation is considered abnormal or concerning, if applicable (max 100 words)\u0026rdquo;,\n\u0026ldquo;anomaly\u0026rdquo;: 0 or 1 // 0 for no anomaly detected, 1 for anomaly detected\n}\nFigure 22. System prompts adopted in CoT prompting for video normality detection.\nTable 9. Performance of MLLMs with two prompt frames: accuracy, precision, recall (%), and processing time (s) compared across different MLLMs using CoT prompting (AD: anomaly detection, ND: normality detection).\nModel | Accuracy (AD) | Accuracy (ND) | Precision (AD) | Precision (ND) | Recall (AD) | Recall (ND) | Video Processing Time (AD) | Video Processing Time (ND)\nGemini-1.5-flash | 69.58 | 45.47 | 74.44 | 40.00 | 66.41 | 2.16 | 4.61 | 4.57\nGemini-1.5-pro | 74.06 | 61.60 | 83.77 | 93.90 | 64.41 | 30.82 | 7.05 | 6.83\nGPT-4o | 72.57 | 57.94 | 83.02 | 100.00 | 61.79 | 22.03 | 12.55 | 14.27\nGPT-4o-mini | 68.83 | 49.46 | 68.07 | 100.00 | 79.51 | 6.32 | 12.28 | 13.39\nClaude-3.5-sonnet | 71.90 | 54.20 | 83.44 | 95.37 | 59.78 | 15.87 | 24.49 | 24.09\nVILA-13b | 68.41 | 43.39 | 68.45 | 13.64 | 76.89 | 0.92 | 6.74 | 11.56\nTo investigate, we evaluate CoT performance for both anomaly detection and normality detection. The prompts used are detailed in Figure 12 and Figure 22, respectively. As shown in Table 9, for the AD results, CoT prompting improves accuracy and recall compared to the zero-shot prompting in Table 8, meeting expectations for CoT\u0026rsquo;s effectiveness. However, performance in normality detection declines with CoT prompting. While four MLLMs achieve over 90% precision in the normality detection task, the overall accuracy drops significantly compared to the ND results in Table 8. This suggests that while MLLMs have a solid grasp of normality, CoT prompting reinforces their existing strengths without addressing their weaknesses in anomaly detection, resulting in a decrease in overall VAD accuracy. 
In terms of efficiency, Gemini-1.5-flash emerges as the fastest model with CoT prompting, whereas VILA-13b, previously the fastest, likely loses this advantage due to difficulties in processing longer prompts.\nFrom the comparison between two prompt frames under zero-shot and CoT prompting, we observe that a feasible way to stably enhance MLLM VAD performance is to focus on anomaly detection while enriching the prompt with contextual information about anomalies in smart home scenarios. This strategy helps compensate for the models\u0026rsquo; inherent limited understanding of anomalies.\nE.2. Evaluation on Video Understanding of MLLMs # From Figure 7 and Figure 8 in the main paper, we analyze the five failure types where MLLMs failed to generate correct video description and reasoning. Additionally, we examine the distribution of MLLM outcomes for video description and reasoning across three ground-truth anomaly tags, i.e., Normal , Abnormal, and Vague Abnormal, as shown in Figures 23 and 24, respectively. The possible outcomes are defined as follows: (1) Correct: the MLLM\u0026rsquo;s response matches the annotated description or reasoning; (2) Error: the MLLM generates \u0026ldquo;nan\u0026rdquo; or nonsensical information; (3) Incorrect: there is at least one mismatch between the MLLM output and human annotation.\nFor video description, over 1000 MLLM outputs are incorrect from the top three MLLMs, whereas over half of the reasoning outputs are correct. 
This discrepancy is likely because the description tends to include more detailed information compared to the reasoning, as illustrated in Figure 10, making it more challenging for MLLMs to match every detail in the descriptions. The error rates for the three models follow the same ranking for both description and reasoning: Gemini-1.5-pro exhibits the highest error rate, followed by Claude-3.5-sonnet, with GPT-4o showing the least, indicating the relative stability of GPT-4o in response generation. The proportion of videos with correct descriptions across MLLMs remains consistent between normal and abnormal videos. However, the proportion of correct reasoning decreases progressively from normal to abnormal and further to vague abnormal. This trend highlights the limited understanding MLLMs have of smart home anomalies in our dataset, particularly for more ambiguous cases.\nFigure 23. Distribution of video outcomes for the top three MLLMs\u0026rsquo; description compared to human-annotated description across different video anomaly tags.\nFigure 24. 
Distribution of video outcomes for the top three MLLMs\u0026rsquo; reasoning compared to human-annotated reasoning across different video anomaly tags.\n","date":"5 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/smarthome-bench-a-comprehensive-benchmark-for-video-anomaly-detection-in-smart-homes-using-multi-modal-large-language-models/","section":"Papers","summary":"The paper introduces SmartHome-Bench, the first large-scale dataset and benchmark designed specifically for video anomaly detection (VAD) within smart home environments, incorporating 1,203 annotated videos across seven categories such as Wildlife, Senior Care, and Baby Monitoring. The dataset includes detailed annotations with anomaly tags, descriptions, and rationales, facilitating research on multi-modal large language models (MLLMs) for explainable VAD. It evaluates various adaptation methods, including prompting strategies and a novel taxonomy-driven reflective LLM chain (TRLC), demonstrating significant performance improvements and highlighting current model limitations. The study aims to advance smart home security by providing a dedicated benchmark and novel framework for enhancing MLLM-based anomaly detection and reasoning.","title":"SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models","type":"benchmark"},{"content":"","date":"5 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xinyi-zhang/","section":"Authors","summary":"","title":"Xinyi Zhang","type":"authors"},{"content":" A Survey on Video Anomaly Detection via Deep Learning: Human, Vehicle, and Environment # Ghazal Alinezhad Noghre 1 , Armin Danesh Pazho 1 , Hamed Tabkhi 1\nAbstract—Video Anomaly Detection (VAD) has emerged as a pivotal task in computer vision, with broad relevance across multiple fields. 
Recent advances in deep learning have driven significant progress in this area, yet the field remains fragmented across domains and learning paradigms. This survey offers a comprehensive perspective on VAD, systematically organizing the literature across various supervision levels, as well as adaptive learning methods such as online, active, and continual learning. We examine the state of VAD across three major application categories: human-centric, vehicle-centric, and environment-centric scenarios, each with distinct challenges and design considerations. In doing so, we identify fundamental contributions and limitations of current methodologies. By consolidating insights from subfields, we aim to provide the community with a structured foundation for advancing both theoretical understanding and real-world applicability of VAD systems. This survey aims to support researchers by providing a useful reference, while also drawing attention to the broader set of open challenges in anomaly detection, including both fundamental research questions and practical obstacles to real-world deployment.\nIndex Terms—video anomaly detection, deep learning, computer vision\nI. INTRODUCTION # V IDEO Anomaly Detection (VAD), also known as outlier detection, abnormal event detection, and abnormal activity detection, has emerged as a crucial technology across a range of applications [1]–[5], from public safety [6]–[10] to healthcare monitoring [11]–[14], autonomous driving [15]– [18], road surveillance [19]–[21], and environmental disaster detection and response [22]–[27]. In this age, where thousands of cameras continuously capture data, automated systems for detecting unusual events offer transformative potential [28]– [31]. For example, surveillance VAD (see Supplementary Materials for list of abbreviations) can automatically flag crimes or accidents, relieving human operators of the impossible task of watching hours of mostly uneventful footage. 
Another example is in healthcare, where VAD can monitor patients or older adults for sudden falls or distress. The growing importance of VAD in such domains stems from its ability to consistently watch for anomalies that could signify security threats, medical emergencies, or catastrophic events.\n1 Electrical and Computer Engineering Department, UNC Charlotte (galinezh, adaneshp, htabkhiv@charlotte.edu)\nVAD confronts unique challenges inherent to video data. A video is a high-dimensional spatiotemporal signal: each anomaly may involve not just an unusual appearance in a single frame, but an irregular motion pattern unfolding over time [32]–[34]. An anomaly in video can be formally defined as \u0026ldquo;The manifestation of atypical visual or motion characteristics, or the presence of typical visual or motion patterns occurring in spatiotemporal contexts that deviate from established norms\u0026rdquo;. An example of an abnormal pattern can be a car accident, which represents a deviation from expected vehicular operation. On the other hand, a normal pattern occurring in an inappropriate context is exemplified by riding a bicycle on a pedestrian-only sidewalk. Moreover, regardless of the specific domain or application area, anomalous events are inherently rare, often occurring with low frequency and unpredictability [8], [35]–[39]. VAD may encounter novel, unforeseen abnormal events that were never observed. Even new patterns of normal activity may continually emerge, especially in open-world environments [40]–[43].\nTraditionally, VAD relied on statistical models and handcrafted features to identify unusual patterns [44], [45]. These methods often struggled with the complexity and variability inherent in video data and with the general challenges mentioned above, leading to limited accuracy and adaptability. 
Deep learning has been a driving force behind recent progress in VAD, enabling models to automatically learn rich representations of normal and abnormal patterns. A wide spectrum of learning paradigms, from fully supervised [21], [46]–[51] to unsupervised [32], [52]–[55], has been explored in the literature. Beyond the training data regime, researchers have also looked at adaptive learning paradigms for VAD [43], [56], [57]. The abundance of these paradigms reflects the community\u0026rsquo;s efforts to tackle VAD\u0026rsquo;s challenges from different angles. Each paradigm comes with its own assumptions, strengths, and failure modes, and part of the goal of our survey is to clarify how these pieces fit into the larger picture.\nConsidering the breadth of applications and methods, there is a clear need for a unifying, structured perspective on VAD. Past surveys have typically focused on a subset of this space [2], [4], [6]–[9], [11]–[20], [23], [58]–[62], for instance on algorithms for a single domain (e.g., autonomous vehicles) or on specific training paradigms (e.g., unsupervised anomaly detection). However, VAD research has now grown to encompass diverse domains and a wide array of deep learning techniques. Researchers in one domain may not be fully aware of relevant techniques developed in another domain, even though the underlying problems share similarities. We aim to bridge this gap by providing a comprehensive survey that treats VAD holistically. In particular, we bring together human-centric, vehicle-centric, and environment-centric VAD under one umbrella (see Figure 1). By comparing and contrasting the problem formulations, data characteristics, and successful techniques across these domains, our survey highlights common principles as well as domain-specific nuances. This unified viewpoint is intended to help transfer knowledge across application domains. 
Moreover, we organize the growing literature on deep learning for VAD into a coherent taxonomy, which makes it easier to understand how different approaches relate to each other. Rather than seeing the field as a collection of disjoint research focuses, readers will gain a structured map of VAD research: the key problem settings, the algorithmic families, and the connections between them. We aim to help the research community identify open problems and practical barriers that must be addressed to advance VAD towards widespread deployment. This survey aims to bring clarity to what has been accomplished and what remains to be done. We summarize not only the State-of-The-Art (SOTA) techniques but also their limitations, and we pinpoint areas ready for new exploration. To this end, the main contributions of this paper are as follows:\nWe identify and critically analyze key challenges and open problems in VAD. By highlighting these gaps, the survey outlines practical considerations necessary for building reliable, adaptive, and deployable VAD systems.\nWe present a structured taxonomy of VAD approaches categorized by supervision levels and learning paradigms, including supervised, unsupervised, weakly supervised, self-supervised, and adaptive learning. This taxonomy clarifies the underlying assumptions, strengths, and limitations of each paradigm and guides readers in selecting appropriate methods for different problem settings.\nWe provide a comprehensive and unified survey of VAD via deep learning, encompassing human-centric, vehicle-centric, and environment-centric domains. This work bridges the gap between fragmented subfields by systematically comparing problem formulations, data characteristics, and methods across application areas, enabling knowledge transfer and cross-domain insights.\nFig. 1. Overview of the paper structure. The advancements in vehicle, human, and environmental VAD are explored.\nII. 
VAD CHALLENGES # VAD presents unique challenges, summarized in Table I, which are explained in detail in this section.\nA. Data Scarcity and Annotation Challenges # C1: Rarity of Anomalies and Class Imbalance: By definition, anomalies are rare events compared to normal ones. For instance, traffic accidents in autonomous driving are infrequent compared to normal driving scenarios. Deep learning models typically thrive on abundant data, but the scarcity of anomalous examples means they struggle to learn generalizable patterns.\nC2: Limited and Difficult Labeling: Not only are anomalies rare, but they are also inherently difficult to label. Annotating frame-by-frame or pixel-level ground truth is labor-intensive. In many instances, expert knowledge is also essential to perform correct labeling. For example, in healthcare applications like monitoring Parkinson\u0026rsquo;s disease, identifying the exact onset and offset of anomalous behavior necessitates the involvement of domain specialists.\nC3: Ambiguity in Defining Anomalies and Context Specificity: Unlike standard vision tasks, anomaly detection is highly context-dependent. The same behavior may be normal in one setting but anomalous in another. For instance, in public safety, punching signals violence unless occurring in a boxing gym. Such contextual ambiguity complicates defining anomalies. Modeling such context is difficult and requires auxiliary inputs or learning multiple modes of normality. While deep models must be robust to these ambiguities, current methods often struggle with subtle or context-sensitive anomalies.\nB. Spatiotemporal Modeling Challenges # C4: Complex Temporal Patterns and Long-Term Dependencies: Anomalies unfold as irregular motion patterns or unusual events over time. Capturing temporal dynamics is a core difficulty. For instance, in autonomous driving, an accident might be inferred from a vehicle\u0026rsquo;s erratic trajectory over several seconds. 
Some anomalies have a slow temporal build-up (e.g., a person slowly loitering in a restricted area). Detecting these requires integrating information over long durations. On the other hand, anomalies can be instantaneous (a sudden explosion). Balancing responsiveness to quick events with the ability to analyze extended sequences is non-trivial.\nC5: Multi-Agent Interactions and Crowded Scenes: Many anomalies involve multiple entities interacting with each other. Detecting these anomalies requires modeling collective behavior patterns. However, modeling them is difficult due to occlusions and complex dynamics. In some events the anomaly is evident in the group\u0026rsquo;s joint configuration (e.g., a group of people suddenly running away) even if each individual\u0026rsquo;s motion by itself might appear normal.\nC6: Feature Abstraction Level: Deep video anomaly detectors traditionally operate on raw pixel data, but this raises feature redundancy issues. Raw pixel-based models must contend with background clutter, illumination changes, and camera motion that can obscure the relevant pattern. An emerging approach is to use other modalities such as pose, optical flows, object landmarks, etc. However, these approaches rely on accurate preprocessing steps. Additionally, detecting certain anomalies requires detailed visual cues that may be lost in higher levels of abstraction (e.g., detecting someone carrying a weapon would be more challenging).\nTABLE I COMPREHENSIVE SUMMARY OF KEY CHALLENGES IN VIDEO ANOMALY DETECTION. THE TABLE CATEGORIZES THE CHALLENGES INTO SIX BROAD THEMES. THIS CATEGORIZATION AIMS TO GUIDE FUTURE RESEARCH AND DEVELOPMENT DIRECTIONS IN VIDEO ANOMALY DETECTION SCENARIOS.\n| Category | Challenge | Short Description | References |\n| --- | --- | --- | --- |\n| Data Scarcity and Annotation Challenges | C1: Rarity of Anomalies and Class Imbalance | | [8], [35], [36] |\n| Data Scarcity and Annotation Challenges | C2: Limited and Difficult Labeling | Annotation is labor-intensive and often requires domain experts. | [40], [63], [64] |\n| Data Scarcity and Annotation Challenges | C3: Ambiguity in Defining Anomalies and Context Specificity | Anomalies are context-dependent; similar actions can be normal or abnormal depending on scenario. | [64]–[66] |\n| Spatiotemporal Modeling Challenges | C4: Complex Temporal Patterns and Long-Term Dependencies | Many anomalies manifest over time. Instantaneous and prolonged anomalies must be detected with appropriate temporal context. | [32]–[34] |\n| Spatiotemporal Modeling Challenges | C5: Multi-Agent Interactions and Crowded Scenes | Interactions are complex, often involving crowd behavior or occlusion. | [6], [67], [68] |\n| Spatiotemporal Modeling Challenges | C6: Feature Abstraction Level | Pixel-based models are affected by visual noise. Higher abstraction may lose contextual cues. | [9], [41], [43], [69], [70] |\n| Robustness and Generalization | C7: Environmental Variations and Noise | Real-world conditions (e.g., weather, lighting) degrade model performance. | [65], [71], [72] |\n| Robustness and Generalization | C8: Domain Shift and Cross-Scene Generalization | Models often fail when deployed in new environments. | [73]–[75] |\n| Robustness and Generalization | C9: Open-Set Nature of Anomalies and Novelty | Not all types of normal/anomaly can be seen during training or validation. | [40]–[42] |\n| Robustness and Generalization | C10: Handling Concept Drift and Evolving Normality | Normal behavior may evolve over time; failure to adapt causes false alarms, while over-adaptation risks misclassifying anomalies. | [56], [57], [64] |\n| Evaluation and Benchmarking Challenges | C11: Scarcity of Comprehensive Benchmark Datasets | Benchmark datasets are limited in diversity and detail. | [37]–[39] |\n| Evaluation and Benchmarking Challenges | C12: Limitations of Current Evaluation Metrics | Common metrics often fail to reflect deployment performance. | [8], [39], [41], [42] |\n| Evaluation and Benchmarking Challenges | C13: Gap Between Offline Evaluation and Deployment Performance | Real-time scenarios require new protocols for accurate assessment. | [5], [41], [42] |\n| Real-Time and Deployment Challenges | C14: Real-Time Processing and Low Latency | Timely detection is essential in safety-critical domains. | [76]–[78] |\n| Real-Time and Deployment Challenges | C15: Resource Constraints and Scalability | Models require significant computational resources. | [8], [56], [79] |\n| Real-Time and Deployment Challenges | C16: Calibration and Thresholding (False Alarms vs. Misses) | Choosing the right anomaly threshold is critical to balance false positives and false negatives. | [80]–[82] |\n| Adaptive Learning Challenges | C17: Catastrophic Forgetting or Stability-Plasticity Dilemma | Models may lose previously learned information when updated with new data. | [57], [83], [84] |\n| Adaptive Learning Challenges | C18: Efficient Label Utilization During Adaptation | Labels are scarce in streaming settings. | [43], [56], [57] |\nC. Robustness and Generalization Challenges # C7: Environmental Variations and Noise: VAD methods must operate in diverse real-world conditions that can affect their inputs. 
Models may face day/night cycles, various weather conditions, and lighting changes. These factors can introduce visual noise that is unrelated to anomalies but can confuse deep models. Another aspect is highly dynamic backgrounds that could lead to high false alarm rates when the model interprets normal background changes as abnormal. Robustness to these perturbations is crucial.\nC8: Domain Shift and Cross-Scene Generalization: Related to C7 is the domain shift problem: an anomaly detection model trained in one setting often fails when deployed in a new setting. This is because deep models internalize the statistics of their training data\u0026rsquo;s environment. Domain adaptation and generalization techniques are actively researched. 
This is a critical issue for scalability as well: a city-wide deployment across hundreds of street cameras would require per-camera calibration if the model cannot generalize.\nC9: Open-Set Nature of Anomalies and Novelty: VAD is an open-set problem: a model can never see examples of all possible anomalies in training, since by definition anomalies encompass anything that deviates from normal, including novel events that have never occurred before. On the same note, capturing all normal behaviors is also not feasible. VAD must be prepared for the unforeseen. This translates to the \u0026ldquo;unknown unknowns\u0026rdquo; problem: an AI may handle known rare events but fail to recognize a truly odd hazard as an anomaly. The open-set challenge also complicates evaluation. A model could correctly detect all anomalies in a test set and still be unreliable in practice if a new kind of anomaly occurs.\nC10: Handling Concept Drift and Evolving Normality: Over time, what is considered normal or even anomalous may evolve. This phenomenon is known as concept drift. In a traffic monitoring scenario, seasonal differences may cause normal behavior patterns to shift. In healthcare, a patient\u0026rsquo;s baseline behavior might gradually change due to therapy or disease progression. If a model is not updated, it may raise false alarms on these evolving behaviors.\nD. Evaluation and Benchmarking Challenges # C11: Scarcity of Comprehensive Benchmark Datasets: Most current VAD models are trained and tested on a limited set of benchmarks. While useful for initial development, these datasets often lack diversity in scenes, environmental conditions, and anomaly categories.\nC12: Limitations of Current Evaluation Metrics: The dominant metrics used in VAD fail to fully capture the real-world effectiveness of a model. These metrics often abstract away threshold selection and ignore the impact of false alarms. 
Additionally, metrics rarely account for operational concerns such as alert fatigue, latency, or the cost of misclassification.\nC13: Gap Between Offline Evaluation and Deployment Performance: Many VAD methods are evaluated in offline settings using pre-recorded video clips. Offline evaluation may overstate model accuracy. Bridging the gap between offline benchmarks and online performance requires new evaluation protocols that account for temporal causality, resource constraints, and continuous learning needs.\nE. Real-Time and Deployment Challenges # C14: Real-Time Processing and Low Latency: For many applications, detecting anomalies promptly is crucial. For example, autonomous vehicles must detect and react to road anomalies within a very short time to avoid accidents. Such scenarios demand that deep learning models operate in real-time on video streams. Even if accuracy is high, a method that triggers an alert too late is often unacceptable in practice. Achieving real-time anomaly detection without sacrificing detection quality is an active challenge.\nC15: Resource Constraints and Scalability: Deep learning models for video require significant memory and computation. In a real-world deployment like city-wide surveillance, running a deep anomaly detector on all feeds simultaneously is a massive scalability challenge. Likewise, an autonomous vehicle has a power and hardware budget. Thus, anomaly detection methods must be efficient in terms of computation, memory, and energy. Another aspect of scalability is handling long durations and continuous monitoring: a model might need to run 24/7. Storing and analyzing such long video sequences can be difficult. There is also a data management challenge: if anomalies are flagged often, the flagged events must still be stored and reviewed efficiently. Ensuring that a solution scales from a small benchmark to a real-world deployment is a non-trivial jump.\nC16: Calibration and Thresholding (False Alarms vs. 
Misses): Deploying an anomaly detector in the real world requires choosing how sensitive it should be, that is, setting thresholds or decision criteria for what is flagged as anomalous. This leads to a classic precision-recall trade-off: a very sensitive system will catch nearly all true anomalies (high recall) but at the cost of many false alarms (low precision), whereas a strict system will raise fewer false alerts but might miss anomalies. Finding the right balance is extremely challenging and often application-specific.\nF. Adaptive Learning Challenges # C17: Catastrophic Forgetting or Stability-Plasticity Dilemma: Catastrophic forgetting is the tendency of models to overwrite previously learned knowledge when updated with new data. If a model is updated incrementally to learn from new scenes or behaviors, it may degrade in performance on previously seen data. This is critical in safety or surveillance settings, where remembering rare but significant events is essential. This challenge is closely related to the Stability-Plasticity Dilemma, which describes the trade-off between retaining existing knowledge (stability) and acquiring new knowledge (plasticity) without interference.\nC18: Efficient Label Utilization During Adaptation: Obtaining labels in a streaming setting is expensive and time-consuming. Therefore, continual learning must proceed with minimal supervision. Designing models that can effectively leverage sparse and noisy labels, or self-supervise their adaptation process, is a key challenge.\nIII. DEFINITIONS AND SOLUTIONS # VAD focuses on identifying patterns or events in video sequences that deviate significantly from expected or normal behavior [5]. As discussed in Section I, the complexity of VAD has led researchers to adopt varying levels of supervision (see Figure 6). This section identifies and discusses the definition and solutions within each supervision level. 
Table II summarizes the solutions and their weaknesses and strengths.\nA. Supervised VAD # Supervised VAD involves training models on labeled datasets where both normal and anomalous events are explicitly annotated. This approach is particularly effective in domains where anomalies are well-defined and annotated data is available, such as in healthcare. By learning from labeled examples, supervised methods can achieve high accuracy in detecting known types of anomalies. However, as outlined in Section II, due to challenges such as Rarity of Anomalies and Class Imbalance, Limited and Difficult Labeling, Ambiguity in Defining Anomalies and Context Specificity, and Open-Set Nature of Anomalies and Novelty (challenges C1, C2, C3, and C9), supervised approaches exhibit limited applicability [5], [8], [35], [85], [86]. The main solution in this supervision level is treating VAD as a classification problem.\nSupervised Classification (S1): The most common formulation of supervised VAD is as a classification task, where models are trained to distinguish between predefined normal and anomalous events. This setup leverages well-established classification algorithms to learn discriminative features.\nB. Weakly-Supervised VAD # To address the challenge of obtaining accurately labeled data for supervised solutions (challenges C2 and C3), weakly-supervised approaches offer greater flexibility. In these methods, labels may be incomplete, noisy, or ambiguous. Weakly-supervised solutions mostly take advantage of Multiple Instance Learning (MIL) and try to improve it for better efficacy.\nMultiple Instance Learning (S2): MIL treats each video as a bag of instances, labeling it anomalous if at least one instance is abnormal (see Figure 2). During training, the model learns to identify which instances within positive bags are anomalous, without needing fine-grained labels. One-stage MIL often focuses on the most prominent anomaly, risking missed detections of subtle instances. Two-stage self-training methods instead use MIL to generate pseudo-labels iteratively, refining both the model and the labels, which enables more robust and comprehensive detection of both obvious and subtle anomalies.\nFig. 2. VAD formulated as a weakly supervised problem, commonly addressed using MIL (S2).\nFig. 3. Self/semi-supervised VAD achieved through reconstruction (S3) or prediction (S4). The top figure illustrates the training phase using only normal data, while the bottom shows the inference phase, where elevated loss indicates abnormal behavior.\nC. Self/Semi-supervised VAD # Semi-supervised solutions bridge supervised and unsupervised learning paradigms by using only normal videos during training to learn the characteristics of normal behavior. 
Previous literature often classified these methods as unsupervised. However, recent works [54], [55] have reclassified them as semi-supervised due to the inherent supervision involved: normal and abnormal sequences are distinguished, and only normal sequences are utilized during training. This shift in terminology acknowledges the partial labeling and guidance provided, which differentiates these methods from truly unsupervised approaches. In general, most of these methods also fall under the self-supervised paradigm, where supervisory signals are derived from inherent characteristics of the normal data. 
Depending on the learning objective, these solutions can be categorized into four main groups: Reconstruction-based, Prediction-based, Jigsaw Puzzle, and Distribution Estimation.\nReconstruction-based (S3): This strategy employs autoencoders to reconstruct normal data; anomalies are indicated by high reconstruction loss when the model fails to reconstruct anomalous snippets accurately, as seen in Figure 3.\nPrediction-based (S4): In these approaches, models are trained to predict the future normal behavior, with anomalies identified through higher prediction loss on abnormal sequences, as seen in Figure 3.\nJigsaw Puzzle (S5): A supervisory signal is generated by formulating a jigsaw puzzle task, which may be spatial, temporal, or a combination of both (see Figure 4). The model is trained exclusively on normal data, learning to reassemble shuffled video segments. During inference, its ability to correctly reconstruct these sequences is used as a measure for computing anomaly scores.\nDistribution Estimation (S6): This category employs either non-deep learning or deep learning methods to model the distribution of normal samples during training. At inference, instances with low likelihood are identified as anomalies.\nD. Unsupervised # In unsupervised training, no labels are available to distinguish between normal and anomalous instances. However, the literature frequently misclassifies certain self-supervised or semi-supervised approaches as unsupervised. A critical observation is that many of these methods are trained exclusively on normal data. This implicitly introduces label information, violating the core principle of unsupervised learning [54]. Consequently, such models should not be considered unsupervised. The degree of supervision must be evaluated not only based on the methodology but also in relation to the informational content embedded in the training data. 
Despite its fundamental nature, fully unsupervised anomaly detection remains relatively underexplored compared to its self-supervised and semi-supervised counterparts, indicating a significant opportunity for advancement and application in real-world scenarios.\nTruly unsupervised methods operate without any access to ground-truth labels. These approaches aim to exploit the normality advantage: the observation that anomalies represent rare and irregular events, whereas the majority of the data corresponds to normal behavior [87], [88]. The core strategy behind these methods is to leverage this statistical imbalance: given that normal samples dominate the dataset, the global structure and distributional trends of the data are expected to reflect normal characteristics. Unsupervised models are therefore trained to capture these prevailing patterns, under the assumption that deviations from the learned representation will correspond to anomalous instances.\nClustering (S7): Clustering methods assume that normal data form dense clusters in feature space, while outliers in low-density regions are potential anomalies, as illustrated in Figure 5. Approaches range from classical algorithms like k-means to deep clustering methods that jointly learn features and clusters. Despite their effectiveness, clustering methods are sensitive to hyperparameters such as the number of clusters, and they rely on clear structural differences between normal and anomalous data.\nPseudo-label Induction (S8): This strategy leverages the normality assumption: normal data dominate the input distribution. While conceptually related to self or semi-supervised approaches, a key distinction is that the training data includes unknown anomalies. These methods use reconstruction errors or prediction inconsistencies to assign pseudo-labels, guiding anomaly filtering or classifier training. 
As they rely on self-generated signals without ground truth, they are considered unsupervised approaches driven by self-supervision. However, unreliable pseudo-labels and feedback loops can undermine robustness and generalizability, especially in noisy or complex data.\nFig. 4. Self/semi-supervised VAD achieved through jigsaw puzzle task (S5). The puzzle can be spatial, temporal, or a combination. The left figure illustrates the training phase using only normal data, while the right shows the inference phase, where wrong permutation prediction indicates abnormal behavior.\nFig. 5. Unsupervised anomaly detection through clustering.\nIV. ADAPTIVE LEARNING IN VAD # As discussed in Section I and Section II, VAD is a dynamic and complex problem, ever evolving and heavily affected by spatio-temporal changes. This includes but is not limited to Environmental Variations and Noise, Domain Shift, Open-Set Nature, Concept Drift, and Calibration and Thresholding (challenges C7, C8, C9, C10, and C16). To address these challenges, adaptive learning methods such as meta-learning, online learning, continual learning, and active learning have become essential [89]–[92]. In this survey, the term \u0026ldquo;adaptive learning\u0026rdquo; encompasses a range of general adaptation methods. These techniques enable models to update and adjust to new data, trying to manage the aforementioned challenges.\nFig. 6. Percentage distribution of supervision levels within each domain.\nTABLE II OVERALL CLASSIFICATION OF VAD SOLUTIONS.\n| Supervision Level | Solution | Definition | Main Strength | Main Limitation |\n| --- | --- | --- | --- | --- |\n| Supervised | S1: Supervised Classification | Frames VAD as a classification task using labeled datasets. | High accuracy and reliability in detecting predefined, labeled anomalies. | |\n| Weakly-Supervised | S2: Multiple Instance Learning | Labels video bags; identifies anomalous instances. | Can handle weak labels where only bag-level annotation is provided, reducing labeling efforts. | |\n| Self/Semi-Supervised | S3: Reconstruction | Uses autoencoders to reconstruct normal data; anomalies have high reconstruction loss. | Effective for capturing the structure of normal behaviors. | Struggles with generalization when normal patterns exhibit high variability. Suffers in scenarios with non[…] |\n| Self/Semi-Supervised | S4: Prediction | Predicts future behavior; high prediction loss indicates anomalies. | Effective for capturing the structure of normal behaviors. | Struggles with generalization when normal patterns exhibit high variability. Suffers in scenarios with non[…] |\n| Self/Semi-Supervised | S5: Jigsaw Puzzle | Challenges models to reassemble shuffled video segments. | Effective for capturing the structure of normal behaviors. | Complexity of solving permutations in jigsaw puzzles, affecting real-time VAD. |\n| Self/Semi-Supervised | S6: Distribution Estimation | Uses generative models or statistical methods to learn normal behavior distributions. | Effective for capturing the structure of normal behaviors. | Sensitive to noise. |\n| Unsupervised | S7: Clustering | | No reliance on labeled data and simple implementation. | Limited generalizability, sensitive to hyperparameters. |\n| Unsupervised | S8: Pseudo-Label Induction | Leverages error magnitude to do pseudo-labeling for filtering anomalies. | No reliance on labeled data. | Pseudo labels are uncertain and can potentially reinforce false patterns. |\nMeta-learning, also known as \u0026ldquo;learning to learn\u0026rdquo; [93], focuses on designing models capable of rapidly adapting to new tasks by leveraging knowledge acquired from previous tasks. 
This approach involves training across a variety of tasks to develop a general learning strategy, enabling the model to perform effectively on novel tasks with minimal data, which is particularly useful for solving the domain shift problems in VAD models. One significant weakness of meta-learning, particularly in real-time VAD, is the high computational expense associated with training across multiple tasks. However, integrating meta-learning with few-shot training methods can help mitigate this issue by enabling the model to learn from a limited number of examples, thereby reducing the computational burden while maintaining adaptability and performance. The work in [94] introduces a meta-learning framework using the Model-Agnostic Meta-Learning (MAML) [95] algorithm to enhance semi/self-supervised anomaly detection in surveillance videos. This approach involves training the model on various scenes, creating tasks that simulate few-shot learning scenarios.\nOnline Learning is a paradigm where the model is updated incrementally as it receives new data points [96], [97]. This approach allows the model to adapt continuously to new information. Online learning is particularly advantageous when dealing with large datasets or streaming data, as it can handle data efficiently without requiring access to the entire dataset simultaneously. Online learning has been explored for anomaly detection on other types of data, such as time series [90], text [98], and medical images [99]. In VAD, Yao et al. [100] introduced a framework optimized for real-world deployment, integrating inference and training in a pipeline to enhance public safety applications. While effective under conditions of minimal distributional shift, online learning faces notable limitations. 
These include susceptibility to noisy or unrepresentative data (challenges C1 and C7), as well as challenges such as the stability-plasticity dilemma and catastrophic forgetting (challenge C17), where frequent updates may overwrite prior knowledge.\nContinual Learning is a strategy in machine learning where a model is designed to continually acquire, fine-tune, and retain knowledge from a stream of data over an extended period. This approach addresses the challenge of catastrophic forgetting (challenge C17), where learning new information can lead to a loss of previously acquired knowledge [101], [102]. This enables models to adapt to new tasks and changes in data distribution without sacrificing performance on previously learned tasks, making it particularly valuable in dynamic environments where the data evolves over time. Continual learning encounters challenges related to managing the high volume of streaming data and maintaining the efficiency of continuous model training. For this reason, most works in this area move toward few-shot learning, which keeps the training process manageable while still allowing real-time decisions. The authors of [56] proposed a two-step method for anomaly detection using deep learning-based feature extractors combined with k-Nearest Neighbor (kNN) search and a memory module, enhanced by two continual learning approaches. The first approach involves exact kNN distance computation, effective for incrementally learning nominal behaviors when the training data size is manageable, updating the memory module with kNN distances from each training split. To address the computational expense as the training set grows, the second approach employs a fully connected deep neural network (kDNN) to estimate kNN distances, ensuring scalability and efficiency.\nActive Learning [103] is a technique where the algorithm selectively queries the user (or domain experts) to label new data points to improve the learning efficiency and model performance [104], [105]. 
In scenarios where labeling data is costly or time-consuming, active learning is particularly valuable because it allows the model to focus on acquiring labels for the most informative data points. This is achieved through various strategies that prioritize data points based on criteria such as uncertainty, representativeness, or expected model change [105]. By enabling the model to query the most useful data points for annotation, active learning reduces the need for large pre-labeled datasets and enhances the model\u0026rsquo;s ability to generalize from fewer labeled instances (challenges C1, C2, and C18). Incorporating human feedback on selected samples within an active learning framework establishes a few-shot learning paradigm that improves the efficacy of anomaly detection systems and the efficiency of training the model. A significant challenge associated with this technique is the requisite involvement of a human or domain expert (challenges C2 and C18). This requirement can introduce complexities related to scalability and efficiency, as the continuous need for expert input can limit the speed and autonomy of the learning process. [52] proposes an active learning framework using YOLO v3 [106] and FlowNet 2 [107] for feature extraction and kNN for anomaly detection. The model constructs a statistical baseline of normal behaviors using kNN distances and continually updates it with new nominal data. Anomalies trigger human feedback for labeling, so we categorize this work as an active learning framework rather than as continual learning, the term used in the original paper. Several other works propose more advanced methods for selecting the data queries. 
[108] proposes an adaptive weighting scheme for dynamically selecting between various criteria such as the likelihood criterion, which selects samples with low likelihood according to the current model to discover new classes, and the uncertainty criterion, which selects samples that cause the most disagreement among committee members to refine the decision boundary. [109] utilizes a Bayesian nonparametric model, specifically the Pitman-Yor Process (PYP), to manage imbalanced class distributions (challenge C1) and to model probabilities for both known and unknown classes.\nV. HUMAN-CENTRIC VAD # A. Healthcare # In healthcare VAD, the goal is to detect deviations in physiological or behavioral patterns that may signal disease, injury, or other medical conditions, enabling early diagnosis and intervention to improve outcomes and reduce costs. These systems might process various data types, but in this work, we focus on methods that use video as their primary data. As shown in Figure 1, most approaches adopt a supervised learning paradigm, reflecting the domain\u0026rsquo;s need for precise, reliable detection. Research primarily targets events with strong visual cues, such as falls, Parkinson\u0026rsquo;s episodes, autistic behaviors, and seizures, which exhibit distinctive motion or posture patterns amenable to visual analysis (see Table III).\nTABLE III REVIEWED WORKS IN HEALTHCARE: ALL STUDIES EMPLOY SUPERVISED LEARNING; * DENOTES STUDIES THAT EVALUATE MULTIPLE ARCHITECTURES.\nTask | Approach | Architecture | Distinct Characteristics / Novel Contributions | Modality\nFall Detection | [110] | CNN, LSTM | Performs person detection and contour-based feature extraction, followed by attention-guided LSTM | RGB\nFall Detection | [111] | CNN | Mitigates feature loss through multi-task learning and leverages la… features for decision-making | RGB\nFall Detection | [112] | MLP | Proposes enhanced optical dynamic flow for improved temporal motion estimation in fall scenarios | Optical Flow\nFall Detection | [113] | Heuristic Rule-based Model | Uses pose-based features to compute fall index based on body posture changes | Pose\nFall Detection | [114] | GCN | Constructs a spatiotemporal graph of human poses and applies graph convolution | Pose\nFall Detection | [46] | Random Forest, MLP | Divides falls into dynamic/static states; uses fusion of vision-based data … | RGB, Pose\nFall Detection | [115] | CNN, Logistic Regression | Models body dynamics with an inverted pendulum and analyzes motion stability to extract features | RGB, Pose\nFall Detection | [116] | CNN | A multi-stream CNN where each stream processes different features | RGB, Depth, Optical Flow\nParkinson’s Detection | [117] | Deep Residual Network | A multimodal system using facial features and expression-specific act… for effective detection | RGB (Facial Video)\nParkinson’s Detection | [118] | CNN, SVM | Analyzes evoked facial expressions using domain adaptation from face recognition | RGB (Facial Video)\nParkinson’s Detection | [47] | | Uses gait energy images to classify Parkinson’s gait, leveraging a one-class SVM | \nParkinson’s Detection | [119] | * | Predicts gait dysfunction by extracting 2D poses, reconstructing 3D gait from multiple views, and analyzing features using classical and deep learning models | Gait\nParkinson’s Detection | [120] | Random Forest | Analyzes stride variability and cadence using pose-based features for effective detection | Pose\nParkinson’s Detection | [48] | * | Analyzes Parkinson’s symptoms via jitter and amplitude of small muscle groups in face videos | Facial Landmarks\nParkinson’s Detection | [121] | * | Performs remote assessment using webcam video by extracting hand landmarks | Hand Landmarks\nParkinson’s Detection | [122] | CNN, Random Forest | Combines eye-tracking and gait data using covariance descriptors for Parkinson progression quantification | RGB (Eye Video), Gait\nAutism Detection | [123] | 3D CNN, LSTM | Proposes spatial attentional bilinear pooling to capture fine-grained spatial features and dynamic attention on discriminative regions | RGB\nAutism Detection | [124] | CNN, LSTM | Integrates photo-taking and image-viewing modalities through multi-modal knowledge distillation, enabling accurate detection using temporal and attentional behavioral features | RGB\nAutism Detection | [125] | CNN, SVM | Analyzes attention pattern differences using discriminative image selection and fixation maps, followed by linear SVM classification | RGB (Eye Video)\nAutism Detection | [126] | CNN, LSTM | Extracts visual and temporal features from gaze scanpaths using saliency-guided patch extraction for sequence-based prediction | Scanpath\nAutism Detection | [127] | 3D-CNN | Utilizes 3D-CNN for spatiotemporal analysis, focusing on action recognition to detect symptoms | RGB, Optical Flow\nAutism Detection | [128] | CNN, MLP | Processes facial expressions for autism screening | RGB, Facial\nAutism Detection | [129] | LSTM | Uses posture and movement data in social interactions for detection | Pose\nSeizure Detection | [130] | CNN-LSTM | Analyzes spatial vs. spatiotemporal features for detection, showing the latter performs better | RGB\nSeizure Detection | [131] | Transformer | Applies BART-inspired self-supervised training on hospital videos to learn context, followed by classification for seizure detection | RGB\nSeizure Detection | [132] | * | Emotion detection used as a feature extractor | \nSeizure Detection | [133] | | Reconstructs 3D facial geometry to capture mouth and cheek motions, … temporal dynamics for seizure classification | RGB (Facial Video)\nSeizure Detection | [134] | CNN | Transforms EEG into second-order Poincaré plots and uses pre-trained CNNs for classification | RGB (EEG Video)\nSeizure Detection | [135] | SVM | Applies dimensionality reduction techniques (PCA and ICA), and defines handcrafted features for an SVM classifier | Optical Flow\nSeizure Detection | SETR [49] | Transformer | Uses pretrained networks for spatial features, a transformer for temporal modeling, and Progressive Knowledge Distillation for early detection | Optical Flow\nSeizure Detection | [136] | CNN | Generates a compact image representation capturing the location variance and periodicity of semiology | RGB, Optical Flow\nSeizure Detection | [137], [138] | GCN, TCN | A multistream framework leveraging GCN, spatio-temporal feature extraction, and late fusion | RGB, Pose, Facial Landmarks\nFall Detection: Fall detection is a key task in healthcare-related video analysis due to its distinct visual patterns and practical significance. Early methods typically use RGB video and leverage pre-trained models, applying object detection and temporal modeling to track human motion. For instance, [110] uses an LSTM to distinguish fall-like behaviors over time, while [111] proposes a two-stage approach with a convolutional autoencoder for feature extraction, followed by a lightweight classifier for final prediction. To address the limitations of RGB-only approaches, particularly under challenging conditions such as poor lighting, occlusions, or background clutter, recent works have incorporated additional modalities to improve fall detection performance. Optical flow captures pixel-level motion between frames, offering a richer representation of dynamic events; for instance, [112] uses optical flow with a fine-tuned VGG16 network [139] to enhance motion-specific feature learning. Human pose estimation further improves robustness by abstracting subjects into skeletal representations, which are less sensitive to visual noise. 
Pose-based features such as centroid velocity and rotational energy have been applied using both deep learning and traditional classifiers, including logistic regression [115], and hybrid models like the Multi-Layer Perceptron (MLP) combined with random forest in [46]. Advancing beyond handcrafted descriptors, [114] introduces a spatiotemporal graph convolutional network (ST-GCN) for end-to-end learning of pose dynamics. To further enhance robustness and capture complementary information, some studies combine multiple modalities such as RGB, depth maps, optical flow, and motion history images, processed through specialized network branches [116].\nParkinson Detection: Parkinson\u0026rsquo;s disease (PD) exhibits both motor and non-motor symptoms, with motor manifestations such as tremor, rigidity, bradykinesia, postural instability, and shuffling gait being the most visually detectable and thus well-suited for computer vision analysis. Leveraging this visual accessibility, recent research has focused on facial and body movement analysis to identify Parkinsonian signs. A key facial symptom, hypomimia (reduced expressiveness), has been widely studied. For instance, [48] applies facial landmark detection to extract handcrafted features classified with traditional algorithms, while [117] enhances facial analysis through segmentation and hybrid learning strategies. More recent end-to-end approaches, such as [118], repurpose pretrained face recognition models via transfer learning to detect PD and assess motor impairment severity using multiple Support Vector Machine (SVM) classifiers. Another line of research focuses on gait and pose-based analysis, targeting motor irregularities such as bradykinesia (slowness and reduced movement amplitude) common in PD. For example, [121] uses hand keypoint trajectories during motor tasks, classifying temporal patterns with conventional models like logistic regression and random forests. 
Other studies analyze full-body motion through silhouettes or skeletal poses; [47] creates Gait Energy Images (GEIs), while [120] applies pose estimation followed by SVM classification. A more advanced approach by [119] combines multiview RGB video with 3D skeletal reconstruction and deep models, including multi-scale residual networks, achieving strong generalization. Some studies have sought to combine multiple modalities, such as facial and body movement cues, to enhance detection robustness. For instance, [122] proposes a multimodal framework that integrates facial expressions and skeletal motion features, aiming to capture complementary signals associated with Parkinsonian motor deficits.\nAutism Detection: Autism Spectrum Disorder (ASD) is a common neurodevelopmental condition in children, marked by social communication deficits and atypical attention patterns. Clinical assessment relies on repeated, time-intensive behavioral evaluations by trained professionals, which are prone to subjective variability. As a result, developing automated, objective tools for ASD detection is critical to enable early, consistent, and scalable diagnosis. One research direction leverages eye gaze patterns as behavioral biomarkers for ASD, given their link to impaired social engagement and disruptions in the social brain network. Jiang et al. [125] used VGG-16 to analyze fixation difference maps and classified visual attention features with a linear SVM. Chen and Zhao [124] combined ResNet-50 [140] with LSTM layers to model spatial-temporal gaze dynamics. Tao et al. [126] proposed SP-ASDNet, using saliency maps from neurotypical individuals to guide patch selection, followed by a CNN-LSTM network to detect deviations indicative of ASD.\nBeyond gaze, many studies focus on general behavioral patterns, especially stereotypical behaviors like clapping, arm flapping, and repetitive movements, common indicators in ASD diagnosis. Ali et al. 
[127] use 3D CNNs to detect such actions, supporting clinical assessments without providing a final diagnosis. Wu et al. [128] offer a more integrated pipeline, combining deep models on RGB and facial landmarks with statistical features (e.g., behavior frequency and duration), fed into a neural network for classification, linking low-level behavior detection with high-level diagnostic inference.\nA more recent, data-driven approach eliminates manual feature engineering by using end-to-end deep learning models that learn discriminative patterns directly from video. Sun et al. [123] combine CNNs with LSTMs to extract spatial-temporal features from pixel data, while Kojovic et al. [129] use a similar architecture with human pose inputs, offering a more abstract and potentially robust representation.\nTogether, these studies form a continuum from explicit behavior modeling to implicit feature learning, highlighting the progression toward more generalizable and efficient systems. Each category of methods, whether based on gaze analysis, stereotyped motor behavior, or end-to-end learning, addresses different aspects of the complex behavioral phenotype associated with ASD, and collectively, they underscore the potential of machine learning in revolutionizing autism diagnosis.\nSeizure Detection: Seizure detection has traditionally relied on Electroencephalography (EEG), often paired with video (VEEG) to link motor behaviors with brain activity. Some works, such as [134], convert EEG data into visual forms like Poincaré plots for classification via pre-trained CNNs. While effective, EEG remains intrusive and impractical for long-term or ambulatory use. As a result, recent efforts have focused on video-only systems that analyze visible cues such as facial expressions and body movements, offering noninvasive, scalable, and more comfortable alternatives. 
Building on this shift, recent work has explored methods focusing on facial features and expressions, particularly facial semiology (e.g., involuntary movements). Pothula et al. [132] use standard facial recognition pipelines to extract features for classification, while Ahmedt-Aristizabal et al. [133] enhance this by modeling 3D facial dynamics, especially mouth motion, using LSTM networks.\nAnother research direction focuses on full-body movement, which is more pronounced in generalized seizures. Yang et al. [130] use CNNs and LSTMs to capture spatial and temporal motion features. More recent work, such as Hou et al. [131], introduces transformer-based models with BART-style self-supervised pretraining, enabling effective seizure classification with reduced dependence on large labeled datasets.\nTo address privacy concerns in video-based seizure monitoring, recent studies use de-identified features such as optical flow, which captures motion without revealing identity. Garção et al. [135] apply dimensionality reduction and SVMs to optical flow, while Mehta et al. [49] propose a CNN-transformer hybrid with Progressive Knowledge Distillation for early prediction. Complementing this, multimodal fusion strategies have been explored to improve detection robustness. Ahmedt-Aristizabal et al. [136] integrate facial and hand movements to create compact semiology descriptors, while Hou et al. [137], [138] fuse RGB, optical flow, body pose, and facial landmarks via multi-branch networks to produce richer representations for seizure classification.\nB. Public Safety # In public safety, video anomaly detection focuses on identifying risky behaviors like violence or rule violations by analyzing external cues. Pixel-based methods capture rich context but are sensitive to environment changes and privacy issues, while pose-based approaches improve robustness and privacy at the cost of visual detail. Some studies combine both in multimodal frameworks. 
Real-time applications demand efficient trade-offs between accuracy, privacy, and speed.\nPixel-Based Methods: In the context of public safety, pixel-based methods for VAD continue to play a central role due to their ability to capture fine-grained visual details, including both environmental context and object appearance. Table IV summarizes these methods. Some works pursue task-specific anomaly detection, focusing on particular threats such as shoplifting [50], [141], [142], weapon detection [143], [144], [156], or vandalism [145]. While these methods offer high precision for well-defined scenarios, their generalizability remains limited, which motivates the mainstream research direction in VAD: detecting a broad range of anomalies without pre-defining their nature. Recent progress in weakly supervised VAD has shown that it is possible to achieve fine-grained temporal localization using only video-level labels. A prominent direction in this field involves pseudo-label refinement: Tian et al. [154] propose a two-stage strategy using a multi-head classifier with diversity loss and Monte Carlo Dropout-based uncertainty filtering to generate high-quality pseudo labels. Similarly, Wang et al. [152] introduce ARMS, a multi-phase training framework that incrementally increases the assumed ratio of abnormal segments to progressively discover harder anomalies, supported by temporal convolution and attention. Complementary to these efforts, RTFM [32] avoids over-reliance on classifier outputs by focusing on feature magnitudes, selecting top-k high-magnitude snippets to separate normal and abnormal segments using a multi-scale temporal architecture. In parallel, vision-language models have emerged as powerful tools for semantic alignment: Li et al. [155] utilize CLIP-based feature-text alignment combined with temporal context learning, while An et al. 
[153] adopt ViLBERT features in an MIL framework for snippet-level classification from coarse labels.\nIn self-supervised learning, models define proxy tasks to learn representations of normal behavior without requiring labeled anomalies, as mentioned in Section III. Among these, reconstruction-based methods have long been popular. To enhance reconstruction accuracy and enforce better anomaly separation, adversarial training has been widely adopted. For instance, Yang et al. [146] use a discriminator to distinguish between real and reconstructed patches, pushing the generator (autoencoder) to reconstruct more accurately. Chen et al. [147] instead use the discriminator to differentiate between real reconstruction error maps and synthetic noise, penalizing abnormality through structural deviations. In another approach, Georgescu et al. [150] use irrelevant pseudo-anomalies (e.g., flowers, anime images) to train a discriminator to separate pseudo-abnormal and normal samples, encouraging the generator to focus specifically on human behavioral features.\nPrediction-based models have also evolved to integrate optical flow for more accurate future frame prediction. Luo et al. [151] replace basic MSE loss with a combination of flow, intensity, and gradient-based losses, alongside adversarial training for sharper predictions. Huang et al. [53] employ separate encoders for appearance and flow, feeding both into a unified decoder with skip connections and memory modules to compare current behavior with learned normal prototypes for better suppression of anomalies.\nOther self-supervised tasks, such as jigsaw puzzle-based learning, aim to improve generalization by encouraging spatiotemporal reasoning. Wang et al. [149] decouple spatial and temporal dimensions to form dual puzzles, solved via a 3D CNN trained to predict permutations, thereby learning both visual structure and motion patterns. Further extending generalization, some methods employ multiple proxy tasks. Georgescu et al. 
[64] use a suite of four self-supervised tasks: arrow-of-time prediction, motion shuffling, irregularity localization, and knowledge distillation, while its successor, SSMTL++ [148], adds jigsaw puzzles and adversarial pseudo-anomalies for broader robustness. Beyond reconstruction and prediction, Doshi et al. [52] use deep learning for feature extraction and statistical modeling to estimate the distribution of normal data, enabling adaptive decision-making through continual learning. Trained solely on normal data, the approach falls under semi-supervised learning and focuses on dynamic thresholds in evolving environments.\nWhile these approaches reduce dependence on labeled data, even self-supervised methods often assume that training videos are purely normal. Recent research aims to relax this assumption. ESSL [55] builds on puzzle-based learning but incorporates a self-selective module to identify and exclude suspected anomalies during training, enabling learning from mixed datasets. Similarly, Zaheer et al. [54] propose a Generative Cooperative Learning framework, where a generator reconstructs input features and a discriminator classifies them as normal or anomalous using pseudo-labels derived from reconstruction errors. A negative learning strategy intentionally trains the generator to reconstruct anomalous samples poorly, reinforcing clear distinctions between normal and abnormal patterns, achieving truly unsupervised anomaly detection.\nTABLE IV OVERVIEW OF PIXEL-BASED APPROACHES IN VAD FOR PUBLIC SAFETY. * DENOTES THAT MULTIPLE ALTERNATIVE ARCHITECTURES HAVE BEEN USED.\nApproach | Supervision | Strategy | Architecture | Distinct Characteristics / Novel Contributions | Modality\n[50] | Supervised | S1 | CNN, RNN | Uses a CNN as spatial feature extractor and an RNN for temporal pattern detection and final classification | Pixel\n[141] | Supervised | S1 | 3D CNN | Uses a 3D CNN for simultaneous spatiotemporal … | Pixel\n[142] | Supervised | S1 | CNN, LSTM | Uses Inception V3 blocks and LSTM for feature extraction | Pixel\nADOS [143] | Supervised | S1 | CNN | Minimizes multi-object detection errors by segmenting frames and applying a saliency-aware classification | Pixel\n[51] | Supervised | S1 | * | Two-stage gun detection using fine-tuned spatial classifiers and temporal sequence models | Pixel\n[144] | Supervised | S1 | CNN | Uses off-the-shelf object detectors and reduces false positives by incorporating confusion classes | Pixel\n[145] | Supervised | S1 | CNN, LSTM | CNN-LSTM deep learning model that combines spatial feature extraction from convolutional layers with temporal sequence modeling from LSTM | Pixel\n[146] | Self/Semi-supervised | S4 | CNN | … reconstruction and object-focused scoring based on likelihood, position, and confidence | Pixel\nNM-GAN [147] | Self/Semi-supervised | S4 | CNN | Uses noise-modulated adversarial learning, where a discriminator trained on noise-injected reconstruction errors distinguishes normal from anomalous patterns | Pixel\n[64] | Self/Semi-supervised | S5, S6 | CNN, 3D CNN | Defines multiple tasks: arrow of time prediction, motion shuffling, irregularity prediction (viewed as various jigsaw puzzles), and knowledge distillation | Pixel\n[148] | Self/Semi-supervised | S5, S6 | CNN, 3D CNN | Adds adversarial pseudo anomalies, segmentation, jigsaw, pose estimation, and inpainting to multi-task training | Pixel\n[149] | Self/Semi-supervised | S6 | 3D CNN | Decoupled spatial and temporal jigsaw puzzles and employed a multi-label paradigm for more accurate VAD | Pixel\n[150] | Self/Semi-supervised | S4 | CNN | Uses pseudo-abnormal examples to guide the autoencoders and binary classifiers for each branch | Pixel, Flow\n[53] | Self/Semi-supervised | S5 | CNN | A dual-encoder single-decoder model that aligns appearance and flow features and uses memory of normal prototypes to enhance detection accuracy | Pixel, Flow\n[151] | Self/Semi-supervised | S5 | CNN | Future prediction is guided by flow, intensity, and gradient losses, with a discriminator improving frame realism | \n[52] | Self/Semi-supervised | | | Combines flow and object detections to form feature vectors … | \nARMS [152] | Weakly-supervised | S2 | CNN | Trained through bootstrapped pseudo labeling, hard anomaly mining, and adaptive self-training with dynamic abnormal ratios to capture both easy and subtle anomalies | Pixel\nRTFM [32] | Weakly-supervised | S2 | CNN | Assumes abnormal snippets have higher feature magnitude; selects top-k segments per video to maximize abnormal-normal separation | Pixel\n[153] | Weakly-supervised | S2 | Transformer | … followed by a fully connected network trained with a soft-margin ranking loss on mean anomaly scores of positive and negative bags | Pixel, Flow\n[154] | Weakly-supervised | S2 | Transformer | Improves pseudo labels through completeness modeling and diversity-enhanced multi-head classification, followed by uncertainty-aware self-training that selects reliable clips using Monte Carlo Dropout | Pixel, Flow\nTPWNG [155] | Weakly-supervised | S2 | Transformer | Uses CLIP for pseudo-labeling, then trains a classifier with a Temporal Context Self-Adaptive Learning module that adjusts attention spans based on event duration | Pixel, Flow\nESSL [55] | Unsupervised | S8 | 3D CNN | Extends the jigsaw puzzle concept with a self-selective module to filter potential anomalies, enabling truly unsupervised training | Pixel\n[54] | Unsupervised | S8 | 3D CNN | | Pixel\nPose-Based Methods: Pose-based VAD has emerged as a powerful alternative to appearance-based methods, 
particularly in applications where privacy, robustness to environmental variation, and focus on human motion are essential (see Table V for a summary). The dominant paradigm in this area is semi-supervised or self-supervised learning, where models aim to learn the regular patterns of human skeletal motion using only normal data. A central challenge lies in capturing the complexity of human movement while ensuring effective generalization to unseen abnormal patterns. To this end, researchers have explored increasingly sophisticated architectures that improve reconstruction or prediction quality by modeling the temporal and spatial dynamics of the human skeleton. Given the inherent graph structure of the human pose, where joints are nodes and limbs form edges, many works [42], [69], [158], [159], [161], [163]–[166], [169] naturally adopt graph-based architectures, particularly graph convolutional networks (GCNs), to model both temporal sequences and body structure. Traditional reconstruction frameworks are extended by integrating more powerful sequence modeling mechanisms, such as transformers, which excel at capturing long-range dependencies. For instance, Yu et al. [70] propose a tokenization scheme based on the first-order difference between pose frames and introduce a motion prior derived from training statistics to explicitly model the distribution of joint displacements, enhancing anomaly detection sensitivity.\nA notable trend in recent years is the combination of reconstruction-based learning with distribution modeling. Several works [163], [164], [169] adopt a two-stage framework: first, training an autoencoder on normal data and then performing latent space clustering at test time to detect anomalies as outliers. Jain et al. [162] utilize a variational autoencoder (VAE) to impose a probabilistic structure on the latent space, enabling more principled distribution estimation. Extending this idea further, Hirschorn et al. 
[69] propose a purely probabilistic model using normalizing flows, where input pose sequences are transformed into a standard Gaussian distribution, and anomaly scores are computed via log-likelihood.\nIn parallel, prediction-based methods have evolved to leverage both sequential modeling and skeletal structure. Prior to the widespread adoption of GCNs, Fan et al. [160] used a combination of feedforward and recurrent (GRU) networks for future pose prediction. More recent works incorporate GCNs to simultaneously capture spatial (joint connectivity) and temporal (movement trajectory) patterns [161]. To further enrich the input representation, some researchers propose decomposing the pose into local (individual motion) and global (interpersonal interaction) components, as seen in [158], [159], leading to a better understanding of both individual and group behavior. Alternatively, Rodrigues et al. [157] introduce a multi-timescale approach, predicting both past and future frames at varying temporal resolutions to effectively capture both short-term and long-term anomalies.\nTo boost overall performance, several studies adopt multi-branch architectures that combine reconstruction and prediction tasks. These systems benefit from complementary perspectives: reconstruction captures spatial structure while prediction leverages temporal dynamics. For instance, GRU-based [167], LSTM-based [166], and transformer-based [168], [170] multi-branch models all report improved performance by sharing an encoder while diverging into task-specific decoders. Li et al. [165] enhance this design by incorporating adversarial training, aligning with trends in pixel-based VAD to improve the quality of generated sequences. Additionally, Noghre et al. 
[42] propose a hybrid model that combines variational autoencoding for distribution-based scoring with a trajectory prediction branch, demonstrating the advantage of unifying multiple learning objectives under a coherent architecture.\nOverall, pose-based VAD methods are evolving toward architectures that jointly model structure, motion, and probability, offering a privacy-aware and semantically rich alternative to pixel-level approaches. The integration of reconstruction, prediction, and distribution modeling, along with architectural\nTABLE V OVERVIEW OF POSE-BASED APPROACHES IN VAD FOR PUBLIC SAFETY. * DENOTES THAT MULTIPLE ALTERNATIVE ARCHITECTURES HAVE BEEN USED.\nApproach | Supervision | Strategy | Architecture | Distinct Characteristics / Novel Contributions\nMoPRL [70] | Self/Semi-supervised | S3 | Transformer | Uses a motion embedder followed by a spatio-temporal transformer for reconstruction, leveraging motion priors extracted through first-order difference statistics\n[157] | Self/Semi-supervised | S4 | CNN | Uses 1D convolutions to predict past and future poses at multiple timescales, capturing short- and long-term anomalies without relying on fixed observation windows\nSTGformer [158] | Self/Semi-supervised | S4 | GCN, Transformer | Combines hierarchical spatio-temporal graphs with a two-branch architecture (local and global prediction) using spatial and temporal Transformers alongside GCNs\nHSTGCNN [159] | Self/Semi-supervised | S4 | GCN, CNN | Uses hierarchical spatio-temporal graphs with local and global prediction branches, applying 2D temporal followed by 2D spatial graph convolutions\n[160] | Self/Semi-supervised | S4 | CNN, GRU | CNN extracts spatial features, while GRU captures temporal dependencies\nNormal Graph [161] | Self/Semi-supervised | S4 | GCN | Uses spatio-temporal graph convolution for prediction and derives anomaly scores from the prediction loss\nPoseCVAE [162] | Self/Semi-supervised | S4 | CNN | Simulates anomalies via latent Gaussian mixtures, and is trained through a three-stage process combining reconstruction, KL-divergence, and binary cross-entropy losses\n[163] | Self/Semi-supervised | S6 | GCN | An autoencoder is used for feature extraction, with latent space clustering for final detection\nSTG-NF [69] | Self/Semi-supervised | S6 | GCN | Uses normalizing flow to map inputs to a latent normal distribution, computing normality scores via likelihood and minimizing negative log-likelihood during training\nGEPC [164] | Self/Semi-supervised | S6 | GCN | An autoencoder is used for feature extraction, with latent space clustering applied for VAD\nTSGAD [42] | Self/Semi-supervised | S6 | GCN | Leverages spatio-temporal graph convolution in a VAE structure, using the distance from the latent mean and variance to score anomalies based on deviation from the learned normal distribution\nMemWGAN-GP [165] | Self/Semi-supervised | S3, S4 | CNN | A single-encoder dual-decoder generator with a critic, reconstructing past and predicting future sequences via memory-augmented branches\nSTGCAE-LSTM [166] | Self/Semi-supervised | S3, S4 | GCN, LSTM | Single-encoder, dual-decoder architecture with LSTM in the latent space for enhanced temporal analysis\nMPED-RNN [167] | Self/Semi-supervised | S3, S4 | GRU | Uses global-local decomposition with a single encoder and dual GRU-based decoders\nSPARTA [168] | Self/Semi-supervised | S3, S4 | Transformer | Features a single-encoder, dual-decoder transformer design and introduces a novel pose tokenization method by incorporating relative movement to emphasize motion dynamics\nMSTA-GCN [169] | Self/Semi-supervised | S3, S6 | GCN | Spatio-temporal GCN and attention are used for reconstruction, with both reconstruction and latent space clustering\nTABLE VI OVERVIEW OF VEHICLE-CENTRIC VAD APPROACHES.\nTask | Approach | Supervision | Strategy | Architecture | Distinct Characteristics / Novel Contributions\nSurveillance | [171] | Supervised | S1 | - | The system detects five types of traffic anomalies (speeding, one-way violations, overtaking, illegal parking, and improper drop-offs) by combining deep learning for object detection and tracking with handcrafted algorithms\nSurveillance | [172] | Supervised | S1 | Decision Tree | YOLOv5 is used for object detection, while anomalies are detected using decision trees\nSurveillance | [21] | Supervised | S1 | CNN | Classifies frames as accident or non-accident using a rolling average prediction algorithm\n | DiffTAD [173] | Self/Semi-supervised | S3 | Transformer | Models anomalies as a noisy-to-normal reconstruction process using Denoising Diffusion Probabilistic Models (DDPM), integrating Transformer-based temporal and spatial encoders to capture inter-vehicle dynamics\n | VegaEdge [174] | Self/Semi-supervised | S4 | GIN | A pipeline from object detection to trajectory prediction detects anomalies by comparing expected and actual trajectories\n | [175] | Self/Semi-supervised | S6 | SOM | Trajectories are encoded into smoothed feature vectors with first- and second-order motion information, allowing the SOM to detect unusual behavior by learning the distribution of normal trajectories\n | [176] | Unsupervised | S7 | - | A clustering method based on K-means, modeling each motion pattern as a chain of Gaussian distributions, and enabling both anomaly detection and behavior prediction\n | [177] | Unsupervised | S8 | Bayesian Model | A hierarchical Bayesian framework that uses LDA and HDP to jointly model atomic activities and multi-agent interactions without requiring labeled data\n | [178] | Self/Semi-supervised | S3, S6 | CNN | Converts vehicle trajectories into gradient images, leverages a CNN to classify normal trajectories via unsupervised clustering, and uses VAE to detect unseen 
anomalies through reconstruction loss Autonomous Driving [179] Supervised S1 CNN, MLP SVM Temporal features are clustered after MLP processing to identif potential accidents, which are then combined with CNN-based spatial features and classified using an SVM Autonomous Driving [180] Supervised S1 CNN Combines YOLOv5, lane alignment, and motion tracking to det td hil Autonomous Driving TempoLearn [181] Supervised S1 CNN, LSTM, Transformer pp ses CNN and LSTM for spatiotemporal feature extraction and a Transformer classifier for accident detection Autonomous Driving [182] Self/Semi-supervised S4 CNN, LSTM ollaborative multi-task framework for jointly predicting fut frames, object locations, and scene context fd jid hl j Autonomous Driving FOL [183] Self/Semi-supervised S4 CNN e expected trajectory is compared to the actu for detecting abnormal behaviors Autonomous Driving [184] Self/Semi-supervised S4 Transformer A dual GAN framework with a Swin-Unet-based generator to predi intermediate frames using both optical flow and cropped inputs Combines memoryaugmented autoencoders for reconstruction an Autonomous Driving HF2-VAD [185] Self/Semi-supervised S3, S4 CNN Combines memoryaugmented autoencoders for reconstruction and onditional VAEs for future frame prediction, enabling fine-grained dense anomaly localization Autonomous Driving [186] Self/Semi-supervised S3, S6 3D CNN, GCN Proposes two models: one based on manifold learning to identify out-of-distribution anomalies, and another using reconstruction to detect deviations from normal data innovations such as GCNs and transformer positions posebased methods as a robust and scalable direction in VAD.\nVI. VEHICLE-CENTRIC VAD # A. Road Surveillance # In the context of vehicle VAD for road surveillance, systems are primarily utilized by traffic monitoring authorities and urban infrastructure. 
These systems demand high-resolution spatial coverage, real-time or near-real-time processing, and robustness under diverse environmental conditions. The corresponding responses are typically passive and retrospective, including alert generation, traffic violation reporting, or data archiving for forensic purposes.\nEarly methods in vehicle-focused VAD followed similar trajectories to general VAD research, relying on handcrafted features and rule-based logic (see Table VI). These approaches often combine object detection and tracking with non-deep learning models for classification. Pramanik et al. [171] employed five distinct algorithms to identify specific behaviors such as speed violations and illegal parking. Zhou et al. [179] utilized Support Vector Machines (SVMs) for accident detection based on extracted spatial-temporal features. Although effective for narrowly defined tasks, these approaches lack flexibility and generalization to unforeseen behaviors. While early deep learning-based models such as [21] extended detection capabilities, they still largely operated within constrained anomaly categories.\nTo mitigate these limitations, Aboah et al. [172] proposed a decision-tree-based method that evaluates foreground and background object detections using spatial thresholds and Intersection-over-Union (IoU) metrics, offering a more adaptable and interpretable rule-based framework.\nA more flexible perspective involves casting anomaly detection as an unsupervised clustering problem, where anomalies are treated as statistical outliers. These methods utilize features such as motion trajectories, foreground activity, or background dynamics to learn normal patterns without requiring explicit labels. Hu et al. [176] applied k-means clustering to vehicle trajectories, while Niebles et al. 
[177] adopted hierarchical Bayesian models (originally developed for language modeling) to cluster interactions and motions, thereby learning the structure of normal behavior in a probabilistic manner. Such clustering-based methods offer greater adaptability in complex or evolving environments.\nMore recent work has shifted toward deep learning models that emphasize generalization. Many of these methods fall into the prediction/reconstruction-based paradigms discussed in Section III. Santhosh et al. [178] employed a variational autoencoder to reconstruct trajectory data, while Li et al. [173] leveraged diffusion models to learn the data distribution. In the prediction domain, Katariya et al. [174] used a graph isomorphism network with attention mechanisms to model interactions and forecast future trajectories, and Fang et al. [182] proposed a multi-task learning framework that predicts future frames, object locations, and scene context simultaneously.\nTABLE VII OVERVIEW OF REVIEWED WORK IN FIRE AND FLOOD DETECTION. ARCHITECTURES ARE INFERRED FROM REPORTED METHODOLOGIES WHEN POSSIBLE.\nTask | Approach | Architecture | Distinct Characteristics / Novel Contributions\nFire Detection | [187] | GMM | Uses GMM motion detection and a region growth tracking, enabling accurate fire segmentation, growth rate estimation, and significant false alarm reduction\nFire Detection | [22] | CNN | Integrates 2D Haar wavelet transforms with convolutional neural networks to combine spatial and spectral features, achieving higher detection accuracy and significantly reducing false alarms and computational complexity\nFire Detection | Fire-Det [24] | CNN | A two-stage framework combining motion detection with specialized Fire-Det and lightweight Fire-Det Nano models, enabling fast, accurate early fire detection\nFire Detection | [188] | CNN | Based on EfficientNetB0 enhanced with stacked autoencoders and dense connections, achieving high accuracy, reduced false alarms, and efficient real-time inferencing\nFire Detection | [189] | CNN | Combining EfficientNet and YOLOv5, leveraging compound scaling and real-time object detection\nFire Detection | [190] | CNN | A modified YOLOv5 model within an edge computing framework using Jetson Nano, featuring a dropout-enhanced architecture for improved accuracy and speed, and integration with cloud services for real-time alerting\nFire Detection | [191] | CNN | An improved YOLOv5s-based fire detection model that enhances detection accuracy and efficiency by integrating CBAM, BiFPN, and transposed convolution\nFire Detection | [192] | CNN | Proposes a Learning Without Forgetting (LWF) framework to enable transfer learning …\nFire Detection | [193] | CNN | Analyzes the efficacy of famous networks such as AlexNet, GoogLeNet, and VGG\nFire Detection | [194] | CNN | … models YOLOv8 and VQ-VAE, achieving high precision and robustness\nFire Detection | [195] | CNN | Combines transfer learning with YOLOv8 and the TranSDet model, incorporating …\nFire Detection | [196] | CNN | Integrates a dehazing algorithm with a fine-tuned YOLO-v10 for ship fire detection\nFire Detection | [197] | 3D CNN | A modified MobileNetV3 integrated with a 3D CNN and a novel soft attention mechanism, enhancing spatial awareness and reducing model complexity\nFire Detection | FWSRNet [198] | Transformer | A Vision Transformer-based model incorporating self-attention and contrastive feature learning for fine-grained wildfire and smoke recognition\nFire Detection | [199] | Transformer | A modified vision transformer architecture for fire detection that enables learning from scratch on small to medium-sized datasets by integrating shifted patch tokenisation and locality self-attention\nFlood Detection | [26] | - | Combines background subtraction, morphological operations, color probability modeling across, and spatial features like edge density and boundary roughness\nFlood Detection | [200] | Bayesian Classifier | A probabilistic model for flood detection by combining spatial features with temporal variation and a non-central chi-square-based positional prior, using Bayesian classification and patch-level scoring\nFlood Detection | V-FloodNet [25] | CNN | A video segmentation system that uses template-matching-based w…\nFlood Detection | FRAD [201] | CNN | Applies a CNN network to high-resolution multispectral remote sensing images (SPOT-5) for supervised classification of urban flood risk\nFlood Detection | [202] | CNN | A YOLOv4-based deep learning method for urban flood depth estimation using traffic images, leveraging submerged reference objects\nFlood Detection | [203] | CNN, GRU | Flood severity classification from videos, combining spatial features extracted by a modified VGG16-based CNN with temporal dependencies captured by GRUs\nFlood Detection | [204] | CNN, LSTM | Flood detection in social media by combining visual features using a fine-tuned InceptionV3 CNN with semantic features from metadata using a bidirectional LSTM\nB. Autonomous Driving # In autonomous driving, vehicle VAD serves as a real-time, safety-critical component integrated into the vehicle\u0026rsquo;s decision-making pipeline. 
These systems demand low-latency processing, precise detection of complex motion patterns, and seamless integration with multi-sensor fusion modules including LiDAR, radar, and cameras.\nEarly research in this domain primarily focused on detecting well-defined types of anomalies, often formulated as supervised classification (see Table VI). Park et al. [180] addressed the detection of stopped vehicles using dense optical flow to estimate host vehicle motion and bounding-box analysis to track surrounding vehicles. Similarly, some works have narrowed their scope to the detection and categorization of different types of collisions. Htun et al. [181] proposed a deep learning architecture that uses CNNs and LSTMs to extract spatial and temporal features, respectively, followed by a region proposal module and a classification head to detect and categorize collision types.\nBuilding upon these constrained approaches, more flexible systems have emerged. Zhou et al. [179] introduced a two-stage coarse-to-fine framework: the first stage performs clustering of encoded temporal features to identify outlier frames as potential anomaly candidates, while the second stage applies object-level spatial feature extraction and a trained SVM classifier to confirm accident frames.\nFig. 7. Severity of Each Challenge Across Different VAD Domains.\nReconstruction-based methods have also gained traction due to their generalization capacity to unseen anomalies. Haresh et al. [186] enhanced traditional autoencoder architectures by incorporating region proposal networks for object detection and graph convolutional networks (GCNs) to model object interactions, improving the semantic richness of reconstructions.\nOne of the most adopted modalities is optical flow, which provides dense motion information. Optical flow enables the detection of sudden or abnormal motion patterns, making it useful across both prediction and reconstruction-based paradigms. Bogdoll et al. 
[185] proposed a convolutional variational autoencoder that fuses features from both RGB and optical flow, improving anomaly reconstruction. In prediction-based frameworks, Yao et al. [183] leveraged optical flow for future object localization and ego-motion prediction, detecting anomalies based on deviations from expected motion trajectories. Ru et al. [184] extended this idea through a dual-GAN framework that jointly predicts both optical flow and appearance features in regions of interest using a Swin-Unet backbone, achieving high accuracy at the cost of computational efficiency. Similarly, Fang et al. [182] proposed a multi-task framework incorporating future frame prediction, motion trajectory consistency, and visual context modeling, with optical flow as a core feature to enhance anomaly detection performance.\nDespite the demonstrated success of prediction-based models, their performance can degrade in highly dynamic environments involving complex multi-agent interactions or unexpected environmental changes. These limitations are particularly pronounced in ego-centric settings, where the camera is mounted on a moving vehicle, increasing the risk of false positives due to background motion or occlusions.\nVII. ENVIRONMENTAL-CENTRIC VAD # Environmental VAD is critical for enabling rapid response and minimizing harm, particularly during the real-time detection stage of disaster management. Unlike prediction and post-event assessment, which rely on early indicators or support recovery and fall outside the scope of this work, real-time detection has been primarily approached through supervised methods that treat specific disasters as classification tasks. Video-based fire and flood detection have received the most attention in computer vision due to their structured visual signatures, whereas disasters like hurricanes and tsunamis are better suited to satellite imagery, and events like earthquakes and droughts often lack distinct visual cues. 
This survey focuses on video-based methods for detecting floods and fires (see Table VII), where vision remains a central and effective modality.\nA. Fire Detection # CNN-based methods have been explored for fire detection due to their ability to capture spatial features. Early works utilized pretrained CNN models such as MobileNetV2 [205], AlexNet [206], and GoogLeNet [207], fine-tuning them for fire detection tasks. Some studies combined these models with additional techniques to enhance performance. For instance, wavelet transforms were used to extract critical spectral features [22], while transfer learning with \u0026ldquo;learning without forgetting\u0026rdquo; ensured models retained prior knowledge when adapting to new environments [193]. Advanced approaches integrated 3D CNNs with modified attention mechanisms to improve accuracy and employed Grad-CAM for visual interpretability of model decisions [197].\nAnother group of studies [189], [190], [195], [196] defined fire detection as a subset of object detection, leveraging and adapting well-known algorithms such as Faster R-CNN [208] and YOLO [106] for this purpose. For instance, [191] extended YOLOv5s by incorporating Convolutional Block Attention Modules (CBAM) [209] to improve feature fusion and by replacing nearest neighbor interpolation with transposed convolution, introducing a fast, compact model and a more complex, accurate version tailored for fire detection. Similarly, [194] utilized YOLOv8 as a feature extractor to identify regions likely to contain fire. While this approach alone can serve as a fire detection method, they augmented it with a Vector Quantized Variational Autoencoder (VQ-VAE) [210] to model the distribution of fire patterns, thereby providing an additional layer of analysis to reduce false positives and enhance detection reliability.\nWith the growing popularity of transformers, several studies have begun employing Vision Transformers (ViT) [211] for fire detection. 
[199] introduces a specialized input tokenization method designed for transformers. Additionally, [198] utilizes a contrastive feature learning mechanism to enhance the model\u0026rsquo;s discriminative capabilities.\nEarlier works treated video as independent frames, ignoring temporal dynamics. Recent studies, however, emphasize the importance of motion. For example, [187] uses segmentation and GMMs to identify flame-like motion and estimate fire growth, while [24] applies GMMs for motion filtering before fire classification. These approaches demonstrate how incorporating temporal cues enhances accuracy and context awareness.\nB. Flood Detection # Earlier works [26], [200] rely on fundamental probabilistic models and heuristic approaches. These methods focus on extracting visual features like color, texture, and motion to identify flood regions. [200] integrates color, texture, and dynamic features within a probabilistic framework, leveraging spatial distributions to enhance detection accuracy. [26] employs background subtraction, morphological processing, and boundary roughness analysis for improved efficacy.\nDeep learning brought strong advancements [201], [203], [204]. [203] utilizes a hybrid CNN-GRU model to classify flood severity in videos, combining spatial feature extraction with temporal modeling of sequential frames. [201] adopts a CNN-based Flood-Risk Assessment and Detection (FRAD) method for processing multispectral satellite images to identify flood-risk zones, emphasizing urban planning applications. [204] combines visual CNN-based analysis with a BiLSTM network for textual metadata processing, creating a multimodal approach to flood detection. 
These approaches exemplify the power of deep learning in capturing both spatial and temporal intricacies, demonstrating significant improvements over traditional models in accuracy and versatility.\nMore advanced architectures [25], [202], [212] incorporate state-of-the-art techniques to enhance flood detection capabilities. [25] proposes V-FloodNet, a system integrating video segmentation (AFB-URR) and image segmentation (EfficientNet-B4 and LinkNet) with novel template-matching for depth estimation. [212] introduces DX-FloodLine, combining VGG16-LSTM for flood classification and Faster R-CNN with Mask R-CNN for object detection. [202] applies YOLOv4 for urban flood detection, utilizing traffic images with submerged reference objects and achieving real-time performance.\nVIII. CONCLUSIONS AND FUTURE DIRECTIONS # This survey provides a comprehensive and structured overview of deep learning-based Video Anomaly Detection (VAD), examining major challenges, learning paradigms, and a range of application domains. By incorporating human-, vehicle-, and environment-centric perspectives, it reveals both shared foundations and domain-specific characteristics, facilitating meaningful cross-domain insights. The proposed taxonomy of supervision levels and adaptive strategies clarifies the strengths and limitations of existing methods, offering actionable guidance for designing effective VAD systems. In identifying critical research gaps, this work outlines promising directions for future exploration and serves as both a primer for newcomers and a valuable reference for researchers seeking to build robust, scalable solutions for real-world applications. Building on the challenges and trends observed across different VAD domains, we further evaluate the severity of open problems, as visualized in Figure 7, to support strategic research planning. Environment-centric VAD tends to be more manageable due to its structured, constrained settings. 
In contrast, autonomous driving remains highly challenging due to issues like domain shift, real-time performance, and sensor calibration (C8, C14, C16). Large-scale deployment in road surveillance and public safety introduces major scalability concerns (C15), driving the development of resource-efficient models, along with alternative data modalities. In healthcare, annotation remains a significant bottleneck (C2) due to the dependence on expert knowledge, underscoring the importance of label-efficient approaches such as few-shot and weakly supervised learning. Moreover, data scarcity (C1) persists across nearly all domains, prompting increased interest in synthetic data generation, especially with generative AI to simulate anomalies and boost model robustness.\nACKNOWLEDGMENTS # This research is funded by the United States National Science Foundation (NSF) under award number 2329816.\nREFERENCES # [1] K. Shaukat et al., \u0026ldquo;A review of time-series anomaly detection techniques: A step to future perspectives,\u0026rdquo; in Advances in information and communication: proceedings of the 2021 future of information and communication conference (FICC), volume 1. Springer, 2021, pp. 865–877.\n[2] J. Liu et al., \u0026ldquo;Deep industrial image anomaly detection: A survey,\u0026rdquo; Machine Intelligence Research, vol. 21, pp. 104–135, 2024.\n[3] P. Mishra et al., \u0026ldquo;Vt-adl: A vision transformer network for image anomaly detection and localization,\u0026rdquo; in 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE). IEEE, 2021, pp. 01–06.\n[4] J. Yang et al., \u0026ldquo;Visual anomaly detection for images: A systematic survey,\u0026rdquo; Procedia computer science, vol. 199, pp. 471–478, 2022.\n[5] A. D. Pazho et al., \u0026ldquo;A survey of graph-based deep learning for anomaly detection in distributed systems,\u0026rdquo; IEEE Trans. Knowl. Data Eng. , vol. 36, pp. 1–20, 2023.\n[6] K. 
Rezaee et al., \u0026ldquo;A survey on deep learning-based real-time crowd anomaly detection for secure distributed video surveillance,\u0026rdquo; Personal and Ubiquitous Computing, vol. 28, pp. 135–151, 2024.\n[7] D. Fährmann et al., \u0026ldquo;Anomaly detection in smart environments: a comprehensive survey,\u0026rdquo; IEEE Access, 2024.\n[8] Y. A. Samaila et al., \u0026ldquo;Video anomaly detection: A systematic review of issues and prospects,\u0026rdquo; Neurocomputing, p. 127726, 2024.\n[9] P. K. Mishra et al., \u0026ldquo;Skeletal video anomaly detection using deep learning: Survey, challenges, and future directions,\u0026rdquo; IEEE Trans. Emerg. Topics Comput., 2024.\n[10] A. D. Pazho et al., \u0026ldquo;Ancilia: Scalable intelligent video surveillance for the artificial intelligence of things,\u0026rdquo; IEEE Internet Things J., vol. 10, pp. 14 940–14 951, 2023.\n[11] X. Yang et al., \u0026ldquo;Deep learning technologies for time series anomaly detection in healthcare: A review,\u0026rdquo; IEEE Access, vol. 11, pp. 117 788–117 799, 2023.\n[12] A. A. Ali et al., \u0026ldquo;Anomaly detection in healthcare monitoring survey,\u0026rdquo; in Advanced Research Trends in Sustainable Solutions, Data Analytics, and Security. IGI Global Scientific Publishing, 2025, pp. 29–56.\n[13] T. Fernando et al., \u0026ldquo;Deep learning for medical anomaly detection–a survey,\u0026rdquo; ACM Computing Surveys (CSUR), vol. 54, pp. 1–37, 2021.\n[14] Y. M. Galvão et al., \u0026ldquo;Anomaly detection in smart houses for healthcare: Recent advances, and future perspectives,\u0026rdquo; SN Computer Science, vol. 5, p. 136, 2024.\n[15] D. Bogdoll et al., \u0026ldquo;Anomaly detection in autonomous driving: A survey,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 4488–4499.\n[16] S. 
Baccari et al., \u0026ldquo;Anomaly detection in connected and autonomous vehicles: A survey, analysis, and research challenges,\u0026rdquo; IEEE Access, vol. 12, pp. 19 250–19 276, 2024.\n[17] D. Bogdoll et al., \u0026ldquo;Perception datasets for anomaly detection in autonomous driving: A survey,\u0026rdquo; in 2023 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2023, pp. 1–8.\n[18] J. R. V. Solaas et al., \u0026ldquo;Systematic literature review: Anomaly detection in connected and autonomous vehicles,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., 2024.\n[19] K. K. Santhosh et al., \u0026ldquo;Anomaly detection in road traffic using visual surveillance: A survey,\u0026rdquo; ACM Computing Surveys (CSUR), vol. 53, pp. 1–26, 2020.\n[20] M. Rathee et al., \u0026ldquo;Automated road defect and anomaly detection for traffic safety: a systematic review,\u0026rdquo; Sensors, vol. 23, p. 5656, 2023.\n[21] S. W. Khan et al., \u0026ldquo;Anomaly detection in traffic surveillance videos using deep learning,\u0026rdquo; Sensors, vol. 22, p. 6563, 2022.\n[22] L. Huang et al., \u0026ldquo;Fire detection in video surveillances using convolutional neural networks and wavelet transform,\u0026rdquo; Engineering Applications of Artificial Intelligence, vol. 110, p. 104737, 2022.\n[23] A. Saleh et al., \u0026ldquo;Forest fire surveillance systems: A review of deep learning methods,\u0026rdquo; Heliyon, vol. 10, 2024.\n[24] S. Gao et al., \u0026ldquo;Two-stage deep learning-based video image recognition of early fires in heritage buildings,\u0026rdquo; Engineering Applications of Artificial Intelligence, vol. 129, p. 107598, 2024.\n[25] Y. Liang et al., \u0026ldquo;V-floodnet: A video segmentation system for urban flood detection and quantification,\u0026rdquo; Environmental Modelling \u0026amp; Software, vol. 160, p. 105586, 2023.\n[26] A. 
Filonenko et al., \u0026ldquo;Real-time flood detection for video surveillance,\u0026rdquo; in IECON 2015-41st annual conference of the IEEE industrial electronics society. IEEE, 2015, pp. 004 082–004 085.\n[27] L. Lopez-Fuentes et al., \u0026ldquo;Review on computer vision techniques in emergency situations,\u0026rdquo; Multimedia Tools and Applications, vol. 77, pp. 17 069–17 107, 2018.\n[28] B. R. Ardabili et al., \u0026ldquo;Understanding policy and technical aspects of ai-enabled smart video surveillance to address public safety,\u0026rdquo; Computational Urban Science, vol. 3, p. 21, 2023.\n[29] B. Rahimi Ardabili et al., \u0026ldquo;Understanding ethics, privacy, and regulations in smart video surveillance for public safety,\u0026rdquo; arXiv preprint arXiv:2212.12936, 2022.\n[30] A. D. Pazho et al., \u0026ldquo;Vt-former: An exploratory study on vehicle trajectory prediction for highway surveillance through graph isomorphism and transformer,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5651–5662.\n[31] B. R. Ardabili et al., \u0026ldquo;Exploring public\u0026rsquo;s perception of safety and video surveillance technology: A survey approach,\u0026rdquo; Technology in Society , vol. 78, p. 102641, 2024.\n[32] Y. Tian et al., \u0026ldquo;Weakly-supervised video anomaly detection with robust temporal feature magnitude learning,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 4975–4986.\n[33] P. Wu et al., \u0026ldquo;Vadclip: Adapting vision-language models for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 6074– 6082.\n[34] X. Wang et al., \u0026ldquo;Robust unsupervised video anomaly detection by multipath frame prediction,\u0026rdquo; IEEE Trans. Neural Netw. Learn. Syst. , vol. 33, pp. 2301–2312, 2021.\n[35] B. 
Ramachandra et al., \u0026ldquo;A survey of single-scene video anomaly detection,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, pp. 2293– 2312, 2020.\n[36] Y. Yang et al., \u0026ldquo;Follow the rules: reasoning for video anomaly detection with large language models,\u0026rdquo; in European Conference on Computer Vision. Springer, 2024, pp. 304–322.\n[37] W. Liu et al., \u0026ldquo;Future frame prediction for anomaly detection – a new baseline,\u0026rdquo; in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.\n[38] R. Rodrigues et al., \u0026ldquo;Multi-timescale trajectory prediction for abnormal human activity detection,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020.\n[39] A. Danesh Pazho et al., \u0026ldquo;Chad: Charlotte anomaly dataset,\u0026rdquo; in Scandinavian Conference on Image Analysis. Springer, 2023, pp. 50–66.\n[40] Y. Zhu et al., \u0026ldquo;Towards open set video anomaly detection,\u0026rdquo; in European Conference on Computer Vision. Springer, 2022, pp. 395–412.\n[41] G. Alinezhad Noghre et al., \u0026ldquo;Understanding the challenges and opportunities of pose-based anomaly detection,\u0026rdquo; in Proceedings of the 8th International Workshop on Sensor-Based Activity Recognition and Artificial Intelligence, 2023, pp. 1–9.\n[42] G. A. Noghre et al., \u0026ldquo;An exploratory study on human-centric video anomaly detection through variational autoencoders and trajectory prediction,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 995–1004.\n[43] S. Yao et al., \u0026ldquo;Evaluating the effectiveness of video anomaly detection in the wild: Online learning and inference for real-world deployment,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4832–4841.\n[44] Y. 
Zhu et al., \u0026ldquo;Context-aware activity recognition and anomaly detection in video,\u0026rdquo; IEEE J. Sel. Topics Signal Process., vol. 7, pp. 91–101, 2012.\n[45] Y. Zhou et al., \u0026ldquo;Detecting anomaly in videos from trajectory similarity analysis,\u0026rdquo; in 2007 IEEE international conference on multimedia and expo. IEEE, 2007, pp. 1087–1090.\n[46] B.-H. Wang et al., \u0026ldquo;Fall detection based on dual-channel feature integration,\u0026rdquo; IEEE Access, vol. 8, pp. 103 443–103 453, 2020.\n[47] L. Gong et al., \u0026ldquo;A novel computer vision based gait analysis technique for normal and parkinson\u0026rsquo;s gaits classification,\u0026rdquo; in 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech). IEEE, 2020, pp. 209–215.\n[48] B. Jin et al., \u0026ldquo;Diagnosing parkinson disease through facial expression recognition: video analysis,\u0026rdquo; Journal of medical Internet research , vol. 22, p. e18697, 2020.\n[49] D. Mehta et al., \u0026ldquo;Privacy-preserving early detection of epileptic seizures in videos,\u0026rdquo; in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 210–219.\n[50] L. Kirichenko et al., \u0026ldquo;Detection of shoplifting on video using a hybrid network,\u0026rdquo; Computation, vol. 10, p. 199, 2022.\n[51] B. C. Das et al., \u0026ldquo;Efficient gun detection in real-world videos: Challenges and solutions,\u0026rdquo; 2025.\n[52] K. Doshi et al., \u0026ldquo;Continual learning for anomaly detection in surveillance videos,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 254–255.\n[53] X. 
Huang et al., \u0026ldquo;Multi-level memory-augmented appearance-motion correspondence framework for video anomaly detection,\u0026rdquo; in 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2023, pp. 2717–2722.\n[54] M. Z. Zaheer et al., \u0026ldquo;Generative cooperative learning for unsupervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 14 744–14 754.\n[55] Q. Li et al., \u0026ldquo;Essl: Enhanced spatio-temporal self-selective learning framework for unsupervised video anomaly detection.\u0026rdquo; in ECAI, 2023, pp. 1398–1405.\n[56] K. Doshi et al., \u0026ldquo;Rethinking video anomaly detection-a continual learning approach,\u0026rdquo; in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 3961–3970.\n[57] K. Faber et al., \u0026ldquo;Lifelong continual learning for anomaly detection: New challenges, perspectives, and insights,\u0026rdquo; IEEE Access, vol. 12, pp. 41 364–41 380, 2024.\n[58] R. Jiao et al., \u0026ldquo;Survey on video anomaly detection in dynamic scenes with moving cameras,\u0026rdquo; Artificial Intelligence Review, vol. 56, pp. 3515– 3570, 2023.\n[59] Z. Zamanzadeh Darban et al., \u0026ldquo;Deep learning for time series anomaly detection: A survey,\u0026rdquo; ACM Computing Surveys, vol. 57, pp. 1–42, 2024.\n[60] Y. Lin et al., \u0026ldquo;A survey on rgb, 3d, and multimodal approaches for unsupervised industrial image anomaly detection,\u0026rdquo; Information Fusion , p. 103139, 2025.\n[61] S. Olugbade et al., \u0026ldquo;A review of artificial intelligence and machine learning for incident detectors in road transport systems,\u0026rdquo; Mathematical and Computational Applications, vol. 27, p. 77, 2022.\n[62] S. A. Ahmed et al., \u0026ldquo;Trajectory-based surveillance analysis: A survey,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 29, pp. 1985–1997, 2018.\n[63] A. 
Al-Lahham et al., \u0026ldquo;A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 6793–6802.\n[64] M.-I. Georgescu et al., \u0026ldquo;Anomaly detection in video via self-supervised and multi-task learning,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 742–12 752.\n[65] Z. Yang et al., \u0026ldquo;Context-aware video anomaly detection in long-term datasets,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 4002–4011.\n[66] P. Narwade et al., \u0026ldquo;Synthetic video generation for weakly supervised cross-domain video anomaly detection,\u0026rdquo; in International Conference on Pattern Recognition. Springer, 2025, pp. 375–391.\n[67] A. Ponraj et al., \u0026ldquo;A video surveillance: Crowd anomaly detection and management alert system,\u0026rdquo; Quantum Computing Models for Cybersecurity and Wireless Communications, pp. 139–152, 2025.\n[68] L. Luo et al., \u0026ldquo;Detecting and quantifying crowd-level abnormal behaviors in crowd events,\u0026rdquo; IEEE Trans. Inf. Forensics Security, 2024.\n[69] O. Hirschorn et al., \u0026ldquo;Normalizing flows for human pose anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13 545–13 554.\n[70] S. Yu et al., \u0026ldquo;Regularity learning via explicit distribution modeling for skeletal video anomaly detection,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., 2023.\n[71] K. Biradar et al., \u0026ldquo;Robust anomaly detection through transformer-encoded feature diversity learning,\u0026rdquo; in Proceedings of the Asian Conference on Computer Vision, 2024, pp. 115–128.\n[72] D. 
Bhardwaj et al., \u0026ldquo;Leveraging dual encoders with feature disentanglement for anomaly detection in thermal videos,\u0026rdquo; in International Conference on Pattern Recognition. Springer, 2025, pp. 237–253.\n[73] D. Guo et al., \u0026ldquo;Ada-vad: Domain adaptable video anomaly detection,\u0026rdquo; in Proceedings of the 2024 SIAM International Conference on Data Mining (SDM). SIAM, 2024, pp. 634–642.\n[74] Z. Wang et al., \u0026ldquo;Domain generalization for video anomaly detection considering diverse anomaly types,\u0026rdquo; Signal, Image and Video Processing, vol. 18, pp. 3691–3704, 2024.\n[75] M. Cho et al., \u0026ldquo;Towards multi-domain learning for generalizable video anomaly detection,\u0026rdquo; Advances in Neural Information Processing Systems, vol. 37, pp. 50 256–50 284, 2024.\n[76] R. Nawaratne et al., \u0026ldquo;Spatiotemporal anomaly detection using deep learning for real-time video surveillance,\u0026rdquo; IEEE Transactions on Industrial Informatics, vol. 16, pp. 393–402, 2019.\n[77] M. M. Ali, \u0026ldquo;Real-time video anomaly detection for smart surveillance,\u0026rdquo; IET Image Processing, vol. 17, pp. 1375–1388, 2023.\n[78] H. Karim et al., \u0026ldquo;Real-time weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2024, pp. 6848–6856.\n[79] S. Zhu et al., \u0026ldquo;Video anomaly detection for smart surveillance,\u0026rdquo; in Computer Vision: A Reference Guide. Springer, 2021, pp. 1315–1322.\n[80] K. Doshi et al., \u0026ldquo;Online anomaly detection in surveillance videos with asymptotic bound on false alarm rate,\u0026rdquo; Pattern Recognition, vol. 114, p. 107865, 2021.\n[81] J. Micorek et al., \u0026ldquo;Mulde: Multiscale log-density estimation via denoising score matching for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 
18 868–18 877.\n[82] Y. Nie et al., \u0026ldquo;Interleaving one-class and weakly-supervised models with adaptive thresholding for unsupervised video anomaly detection,\u0026rdquo; in European Conference on Computer Vision. Springer, 2024, pp. 449–467.\n[83] A. Ntelopoulos et al., \u0026ldquo;Callm: Cascading autoencoder and large language model for video anomaly detection,\u0026rdquo; in 2024 IEEE Thirteenth International Conference on Image Processing Theory, Tools and Applications (IPTA). IEEE, 2024, pp. 1–6.\n[84] B. Asal et al., \u0026ldquo;Ensemble-based knowledge distillation for video anomaly detection,\u0026rdquo; Applied Sciences, vol. 14, p. 1032, 2024.\n[85] Y. Cai et al., \u0026ldquo;Medianomaly: A comparative study of anomaly detection in medical images,\u0026rdquo; Medical Image Analysis, p. 103500, 2025.\n[86] Z. Z. Darban et al., \u0026ldquo;Dacad: Domain adaptation contrastive learning for anomaly detection in multivariate time series,\u0026rdquo; arXiv preprint arXiv:2404.11269, 2024.\n[87] S. Wang et al., \u0026ldquo;Effective end-to-end unsupervised outlier detection via inlier priority of discriminative network,\u0026rdquo; Advances in neural information processing systems, vol. 32, 2019.\n[88] G. Yu et al., \u0026ldquo;Deep anomaly discovery from unlabeled videos via normality advantage and self-paced refinement,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition , 2022, pp. 13 987–13 998.\n[89] J. Lu et al., \u0026ldquo;Learning under concept drift: A review,\u0026rdquo; IEEE Trans. Knowl. Data Eng., vol. 31, pp. 2346–2363, 2018.\n[90] S. Saurav et al., \u0026ldquo;Online anomaly detection with concept drift adaptation using recurrent neural networks,\u0026rdquo; in Proceedings of the acm india joint international conference on data science and management of data , 2018, pp. 78–87.\n[91] S. 
Wu et al., \u0026ldquo;Adversarial sparse transformer for time series forecasting,\u0026rdquo; Advances in neural information processing systems, vol. 33, pp. 17 105–17 115, 2020.\n[92] X. Tang et al., \u0026ldquo;Deep anomaly detection with ensemble-based active learning,\u0026rdquo; in 2020 IEEE International Conference on Big Data (Big Data). IEEE, 2020, pp. 1663–1670.\n[93] S. Thrun et al., \u0026ldquo;Learning to learn: Introduction and overview,\u0026rdquo; in Learning to learn. Springer, 1998, pp. 3–17.\n[94] Y. Lu et al., \u0026ldquo;Few-shot scene-adaptive anomaly detection,\u0026rdquo; in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. Springer, 2020, pp. 125–141.\n[95] C. Finn et al., \u0026ldquo;Model-agnostic meta-learning for fast adaptation of deep networks,\u0026rdquo; in International conference on machine learning . PMLR, 2017, pp. 1126–1135.\n[96] E. Hazan et al., \u0026ldquo;Introduction to online convex optimization,\u0026rdquo; Foundations and Trends® in Optimization, vol. 2, pp. 157–325, 2016.\n[97] S. C. Hoi et al., \u0026ldquo;Online learning: A comprehensive survey,\u0026rdquo; Neurocomputing, vol. 459, pp. 249–289, 2021.\n[98] S. Han et al., \u0026ldquo;Log-based anomaly detection with robust feature extraction and online learning,\u0026rdquo; IEEE Trans. Inf. Forensics Security , vol. 16, pp. 2300–2311, 2021.\n[99] Z. Chen et al., \u0026ldquo;An effective cost-sensitive sparse online learning framework for imbalanced streaming data classification and its application to online anomaly detection,\u0026rdquo; Knowledge and Information Systems , vol. 65, pp. 59–87, 2023.\n[100] S. Yao et al., \u0026ldquo;Evaluating the effectiveness of video anomaly detection in the wild: Online learning and inference for real-world deployment,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2024, pp. 4832–4841.\n[101] R. M. 
French, \u0026ldquo;Catastrophic forgetting in connectionist networks,\u0026rdquo; Trends in cognitive sciences, vol. 3, pp. 128–135, 1999.\n[102] M. McCloskey et al., \u0026ldquo;Catastrophic interference in connectionist networks: The sequential learning problem,\u0026rdquo; in Psychology of learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165.\n[103] D. Cohn et al., \u0026ldquo;Improving generalization with active learning,\u0026rdquo; Machine learning, vol. 15, pp. 201–221, 1994.\n[104] R. M. Monarch, Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI. Simon and Schuster, 2021.\n[105] B. Settles, “Active learning literature survey,” 2009.\n[106] J. Redmon et al., \u0026ldquo;You only look once: Unified, real-time object detection,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.\n[107] E. Ilg et al., \u0026ldquo;Flownet 2.0: Evolution of optical flow estimation with deep networks,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2462–2470.\n[108] C. C. Loy et al., \u0026ldquo;Stream-based active unusual event detection,\u0026rdquo; in Asian Conference on Computer Vision. Springer, 2010, pp. 161–175.\n[109] C. Change Loy et al., \u0026ldquo;Stream-based joint exploration-exploitation active learning,\u0026rdquo; in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1560–1567.\n[110] Y. Chen et al., \u0026ldquo;Vision-based fall event detection in complex background using attention guided bi-directional lstm,\u0026rdquo; IEEE Access, vol. 8, pp. 161 337–161 348, 2020.\n[111] X. Cai et al., \u0026ldquo;Vision-based fall detection with multi-task hourglass convolutional auto-encoder,\u0026rdquo; IEEE Access, vol. 8, pp. 44 493–44 502, 2020.\n[112] S. 
Chhetri et al., \u0026ldquo;Deep learning for vision-based fall detection system: Enhanced optical dynamic flow,\u0026rdquo; Computational Intelligence, vol. 37, pp. 578–595, 2021.\n[113] W. Chen et al., \u0026ldquo;Fall detection based on key points of human-skeleton using openpose,\u0026rdquo; Symmetry, vol. 12, p. 744, 2020.\n[114] O. Keskes et al., \u0026ldquo;Vision-based fall detection using st-gcn,\u0026rdquo; IEEE Access, vol. 9, pp. 28 224–28 236, 2021.\n[115] J. Zhang et al., \u0026ldquo;Human fall detection based on body posture spatiotemporal evolution,\u0026rdquo; Sensors, vol. 20, p. 946, 2020.\n[116] C. Khraief et al., \u0026ldquo;Elderly fall detection based on multi-stream deep convolutional networks,\u0026rdquo; Multimedia Tools and Applications, vol. 79, pp. 19 537–19 560, 2020.\n[117] L. F. Gomez et al., \u0026ldquo;Improving parkinson detection using dynamic features from evoked expressions in video,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1562–1570.\n[118] L. Gomez-Gomez et al., \u0026ldquo;Exploring facial expressions and affective domains for parkinson detection,\u0026rdquo; arXiv preprint arXiv:2012.06563, 2020.\n[119] R. Kaur et al., \u0026ldquo;A vision-based framework for predicting multiple sclerosis and parkinson\u0026rsquo;s disease gait dysfunctions—a deep learning approach,\u0026rdquo; IEEE J. Biomed. Health Inform., vol. 27, pp. 190–201, 2022.\n[120] T. Connie et al., \u0026ldquo;Pose-based gait analysis for diagnosis of parkinson\u0026rsquo;s disease,\u0026rdquo; Algorithms, vol. 15, p. 474, 2022.\n[121] M. H. Monje et al., \u0026ldquo;Remote evaluation of parkinson\u0026rsquo;s disease using a conventional webcam and artificial intelligence,\u0026rdquo; Frontiers in neurology, vol. 12, p. 742654, 2021.\n[122] J. 
Archila et al., \u0026ldquo;A multimodal parkinson quantification by fusing eye and gait motion patterns, using covariance descriptors, from non-invasive computer vision,\u0026rdquo; Computer methods and programs in biomedicine, vol. 215, p. 106607, 2022.\n[123] K. Sun et al., \u0026ldquo;Spatial attentional bilinear 3d convolutional network for video-based autism spectrum disorder detection,\u0026rdquo; in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 3387–3391.\n[124] S. Chen et al., \u0026ldquo;Attention-based autism spectrum disorder screening with privileged modality,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1181–1190.\n[125] M. Jiang et al., \u0026ldquo;Learning visual attention to identify people with autism spectrum disorder,\u0026rdquo; in Proceedings of the ieee international conference on computer vision, 2017, pp. 3267–3276.\n[126] Y. Tao et al., \u0026ldquo;Sp-asdnet: Cnn-lstm based asd classification model using observer scanpaths,\u0026rdquo; in 2019 IEEE International conference on multimedia \u0026amp; expo workshops (ICMEW). IEEE, 2019, pp. 641–646.\n[127] A. Ali et al., \u0026ldquo;Video-based behavior understanding of children for objective diagnosis of autism,\u0026rdquo; in VISAPP 2022-17th International Conference on Computer Vision Theory and Applications, 2022.\n[128] C. Wu et al., \u0026ldquo;Machine learning based autism spectrum disorder detection from videos,\u0026rdquo; in 2020 IEEE International Conference on Ehealth Networking, Application \u0026amp; Services (HEALTHCOM). IEEE, 2021, pp. 1–6.\n[129] N. Kojovic et al., \u0026ldquo;Using 2d video-based pose estimation for automated prediction of autism spectrum disorders in young children,\u0026rdquo; Scientific Reports, vol. 11, p. 15069, 2021.\n[130] Y. 
Yang et al., \u0026ldquo;Video-based detection of generalized tonic-clonic seizures using deep learning,\u0026rdquo; IEEE J. Biomed. Health Inform., vol. 25, pp. 2997–3008, 2021.\n[131] J.-C. Hou et al., \u0026ldquo;A self-supervised pre-training framework for vision-based seizure classification,\u0026rdquo; in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 1151–1155.\n[132] P. K. Pothula et al., \u0026ldquo;A real-time seizure classification system using computer vision techniques,\u0026rdquo; in 2022 IEEE International Systems Conference (SysCon). IEEE, 2022, pp. 1–6.\n[133] D. Ahmedt-Aristizabal et al., \u0026ldquo;Vision-based mouth motion analysis in epilepsy: A 3d perspective,\u0026rdquo; in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2019, pp. 1625–1629.\n[134] C.-H. Chou et al., \u0026ldquo;Convolutional neural network-based fast seizure detection from video electroencephalograms,\u0026rdquo; Biomedical Signal Processing and Control, vol. 80, p. 104380, 2023.\n[135] V. M. Garção et al., \u0026ldquo;A novel approach to automatic seizure detection using computer vision and independent component analysis,\u0026rdquo; Epilepsia, vol. 64, pp. 2472–2483, 2023.\n[136] D. Ahmedt-Aristizabal et al., \u0026ldquo;Motion signatures for the analysis of seizure evolution in epilepsy,\u0026rdquo; in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2019, pp. 2099–2105.\n[137] J.-C. Hou et al., \u0026ldquo;A multi-stream approach for seizure classification with knowledge distillation,\u0026rdquo; in 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2021, pp. 1–8.\n[138] J. C. Hou et al., \u0026ldquo;Automated video analysis of emotion and dystonia in epileptic seizures,\u0026rdquo; Epilepsy Research, vol. 184, p. 
106953, 2022.\n[139] K. Simonyan et al., \u0026ldquo;Very deep convolutional networks for large-scale image recognition,\u0026rdquo; arXiv preprint arXiv:1409.1556, 2014.\n[140] K. He et al., \u0026ldquo;Deep residual learning for image recognition,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.\n[141] G. A. Martínez-Mascorro et al., \u0026ldquo;Criminal intention detection at early stages of shoplifting cases by using 3d convolutional neural networks,\u0026rdquo; Computation, vol. 9, p. 24, 2021.\n[142] I. Muneer et al., \u0026ldquo;Shoplifting detection using hybrid neural network cnn-bilsmt and development of benchmark dataset,\u0026rdquo; Applied Sciences, vol. 13, p. 8341, 2023.\n[143] V. Manikandan et al., \u0026ldquo;A neural network aided attuned scheme for gun detection in video surveillance images,\u0026rdquo; Image and Vision Computing, vol. 120, p. 104406, 2022.\n[144] M. T. Bhatti et al., \u0026ldquo;Weapon detection in real-time cctv videos using deep learning,\u0026rdquo; IEEE Access, vol. 9, pp. 34 366–34 382, 2021.\n[145] T. Nyajowi et al., \u0026ldquo;Cnn real-time detection of vandalism using a hybrid-lstm deep learning neural networks,\u0026rdquo; in 2021 IEEE AFRICON. IEEE, 2021, pp. 1–6.\n[146] Y. Yang et al., \u0026ldquo;Enhanced adversarial learning based video anomaly detection with object confidence and position,\u0026rdquo; in 2019 13th International Conference on Signal Processing and Communication Systems (ICSPCS). IEEE, 2019, pp. 1–5.\n[147] D. Chen et al., \u0026ldquo;Nm-gan: Noise-modulated generative adversarial network for video anomaly detection,\u0026rdquo; Pattern Recognition, vol. 116, p. 107969, 2021.\n[148] A. Barbalau et al., \u0026ldquo;Ssmtl++: Revisiting self-supervised multi-task learning for video anomaly detection,\u0026rdquo; Computer Vision and Image Understanding, vol. 229, p. 103656, 2023.\n[149] G. 
Wang et al., \u0026ldquo;Video anomaly detection by solving decoupled spatiotemporal jigsaw puzzles,\u0026rdquo; in European Conference on Computer Vision . Springer, 2022, pp. 494–511.\n[150] M. I. Georgescu et al., \u0026ldquo;A background-agnostic framework with adversarial training for abnormal event detection in video,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, pp. 4505–4523, 2021.\n[151] W. Luo et al., \u0026ldquo;Future frame prediction network for video anomaly detection,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, pp. 7505– 7520, 2021.\n[152] H. Shi et al., \u0026ldquo;Abnormal ratios guided multi-phase self-training for weakly-supervised video anomaly detection,\u0026rdquo; IEEE Trans. Multimedia , vol. 26, pp. 5575–5587, 2023.\n[153] Q. Li et al., \u0026ldquo;Attention-based anomaly detection in multi-view surveillance videos,\u0026rdquo; Knowledge-Based Systems, vol. 252, p. 109348, 2022.\n[154] C. Zhang et al., \u0026ldquo;Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16 271–16 280.\n[155] Z. Yang et al., \u0026ldquo;Text prompt with normality guidance for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 899–18 908.\n[156] J. Salido et al., \u0026ldquo;Automatic handgun detection with deep learning in video surveillance images,\u0026rdquo; Applied Sciences, vol. 11, p. 6085, 2021.\n[157] R. Rodrigues et al., \u0026ldquo;Multi-timescale trajectory prediction for abnormal human activity detection,\u0026rdquo; in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 2626–2634.\n[158] C. 
Huang et al., \u0026ldquo;Hierarchical graph embedded pose regularity learning via spatio-temporal transformer for abnormal behavior detection,\u0026rdquo; in Proceedings of the 30th ACM international conference on multimedia, 2022, pp. 307–315.\n[159] X. Zeng et al., \u0026ldquo;A hierarchical spatio-temporal graph convolutional neural network for anomaly detection in videos,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 33, pp. 200–212, 2021.\n[160] B. Fan et al., \u0026ldquo;Anomaly detection based on pose estimation and gru-ffn,\u0026rdquo; in 2021 IEEE Sustainable Power and Energy Conference (iSPEC). IEEE, 2021, pp. 3821–3825.\n[161] W. Luo et al., \u0026ldquo;Normal graph: Spatial temporal graph convolutional networks based prediction network for skeleton based video anomaly detection,\u0026rdquo; Neurocomputing, vol. 444, pp. 332–337, 2021.\n[162] Y. Jain et al., \u0026ldquo;Posecvae: Anomalous human activity detection,\u0026rdquo; in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 2927–2934.\n[163] C. Liu et al., \u0026ldquo;A self-attention augmented graph convolutional clustering networks for skeleton-based video anomaly behavior detection,\u0026rdquo; Applied Sciences, vol. 12, p. 4, 2021.\n[164] A. Markovitz et al., \u0026ldquo;Graph embedded pose clustering for anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 539–10 547.\n[165] N. Li et al., \u0026ldquo;Human-related anomalous event detection via memory-augmented wasserstein generative adversarial network with gradient penalty,\u0026rdquo; Pattern Recognition, vol. 138, p. 109398, 2023.\n[166] N. Li et al., \u0026ldquo;Human-related anomalous event detection via spatial-temporal graph convolutional autoencoder with embedded long short-term memory network,\u0026rdquo; Neurocomputing, vol. 490, pp. 482–494, 2022.\n[167] R. 
Morais et al., \u0026ldquo;Learning regularity in skeleton trajectories for anomaly detection in videos,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 996–12 004.\n[168] G. A. Noghre et al., \u0026ldquo;Human-centric video anomaly detection through spatio-temporal pose tokenization and transformer,\u0026rdquo; 2025.\n[169] X. Chen et al., \u0026ldquo;Multiscale spatial temporal attention graph convolution network for skeleton-based anomaly behavior detection,\u0026rdquo; Journal of visual communication and image representation, vol. 90, p. 103707, 2023.\n[170] G. A. Noghre et al., \u0026ldquo;Posewatch: A transformer-based architecture for human-centric video anomaly detection using spatio-temporal pose tokenization,\u0026rdquo; arXiv preprint arXiv:2408.15185, 2024.\n[171] A. Pramanik et al., \u0026ldquo;A real-time video surveillance system for traffic pre-events detection,\u0026rdquo; Accident Analysis \u0026amp; Prevention, vol. 154, p. 106019, 2021.\n[172] A. Aboah, \u0026ldquo;A vision-based system for traffic anomaly detection using deep learning and decision trees,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4207–4212.\n[173] C. Li et al., \u0026ldquo;Difftad: Denoising diffusion probabilistic models for vehicle trajectory anomaly detection,\u0026rdquo; Knowledge-Based Systems, vol. 286, p. 111387, 2024.\n[174] V. Katariya et al., \u0026ldquo;Vegaedge: Edge ai confluence for real-time iot-applications in highway safety,\u0026rdquo; Internet of Things, vol. 27, p. 101268, 2024.\n[175] J. Owens et al., \u0026ldquo;Application of the self-organising map to trajectory classification,\u0026rdquo; in Proceedings Third IEEE International Workshop on Visual Surveillance. IEEE, 2000, pp. 77–83.\n[176] W. Hu et al., \u0026ldquo;A system for learning statistical motion patterns,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell., vol. 
28, pp. 1450–1464, 2006.\n[177] X. Wang et al., \u0026ldquo;Unsupervised activity perception by hierarchical bayesian models,\u0026rdquo; in 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.\n[178] K. K. Santhosh et al., \u0026ldquo;Vehicular trajectory classification and traffic anomaly detection in videos using a hybrid cnn-vae architecture,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 23, pp. 11 891–11 902, 2021.\n[179] Z. Zhou et al., \u0026ldquo;Spatio-temporal feature encoding for traffic accident detection in vanet environment,\u0026rdquo; IEEE Transactions on Intelligent Transportation Systems, vol. 23, pp. 19 772–19 781, 2022.\n[180] J. Park et al., \u0026ldquo;Deep learning-based stopped vehicle detection method utilizing in-vehicle dashcams,\u0026rdquo; Electronics, vol. 13, p. 4097, 2024.\n[181] S. S. Htun et al., \u0026ldquo;Tempolearn network: Leveraging spatio-temporal learning for traffic accident detection,\u0026rdquo; IEEE Access, vol. 11, pp. 142 292–142 303, 2023.\n[182] J. Fang et al., \u0026ldquo;Traffic accident detection via self-supervised consistency learning in driving scenarios,\u0026rdquo; IEEE Trans. Intell. Transp. Syst. , vol. 23, pp. 9601–9614, 2022.\n[183] Y. Yao et al., \u0026ldquo;Unsupervised traffic accident detection in first-person videos,\u0026rdquo; in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 273–280.\n[184] H. Ru et al., \u0026ldquo;Enhanced anomaly detection in dashcam videos: Dual gan approach with swin-unet for optical flow and region of interest analysis,\u0026rdquo; in 2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 2024, pp. 1–8.\n[185] D. Bogdoll et al., \u0026ldquo;Hybrid video anomaly detection for anomalous scenarios in autonomous driving,\u0026rdquo; arXiv preprint arXiv:2406.06423 , 2024.\n[186] S. 
Haresh et al., \u0026ldquo;Towards anomaly detection in dashcam videos,\u0026rdquo; in 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020, pp. 1407–1414.\n[187] A. Khalil et al., \u0026ldquo;Fire detection using multi color space and background modeling,\u0026rdquo; Fire technology, vol. 57, pp. 1221–1239, 2021.\n[188] H. Farman et al., \u0026ldquo;Efficient fire detection with e-efnet: A lightweight deep learning-based approach for edge devices,\u0026rdquo; Applied Sciences , vol. 13, p. 12941, 2023.\n[189] S. Chitram et al., \u0026ldquo;Enhancing fire and smoke detection using deep learning techniques,\u0026rdquo; Engineering Proceedings, vol. 62, p. 7, 2024.\n[190] A. S. Mahdi et al., \u0026ldquo;An edge computing environment for early wildfire detection,\u0026rdquo; Annals of Emerging Technologies in Computing (AETiC) , vol. 6, pp. 56–68, 2022.\n[191] Z. Dou et al., \u0026ldquo;An improved yolov5s fire detection model,\u0026rdquo; Fire Technology, vol. 60, pp. 135–166, 2024.\n[192] V. E. Sathishkumar et al., \u0026ldquo;Forest fire and smoke detection using deep learning-based learning without forgetting,\u0026rdquo; Fire ecology, vol. 19, p. 9, 2023.\n[193] G. Son et al., \u0026ldquo;Video based smoke and flame detection using convolutional neural network,\u0026rdquo; in 2018 14th International Conference on Signal-Image Technology \u0026amp; Internet-Based Systems (SITIS). IEEE, 2018, pp. 365–368.\n[194] H. Zhao et al., \u0026ldquo;Fsdf: A high-performance fire detection framework,\u0026rdquo; Expert Systems with Applications, vol. 238, p. 121665, 2024.\n[195] N. Yunusov et al., \u0026ldquo;Robust forest fire detection method for surveillance systems based on you only look once version 8 and transfer learning approaches,\u0026rdquo; Processes, vol. 12, p. 1039, 2024.\n[196] F. Akhmedov et al., \u0026ldquo;Dehazing algorithm integration with yolo-v10 for ship fire detection,\u0026rdquo; Fire, vol. 7, p. 332, 2024.\n[197] H. 
Yar et al., \u0026ldquo;An efficient deep learning architecture for effective fire detection in smart surveillance,\u0026rdquo; Image and Vision Computing, vol. 145, p. 104989, 2024.\n[198] Y. Wang et al., \u0026ldquo;Computer vision-driven forest wildfire and smoke recognition via iot drone cameras,\u0026rdquo; Wireless Networks, vol. 30, pp. 7603–7616, 2024.\n[199] H. Yar et al., \u0026ldquo;A modified vision transformer architecture with scratch learning capabilities for effective fire detection,\u0026rdquo; Expert Systems with Applications, vol. 252, p. 123935, 2024.\n[200] P. V. K. Borges et al., \u0026ldquo;A probabilistic model for flood detection in video sequences,\u0026rdquo; in 2008 15th IEEE International Conference on Image Processing. IEEE, 2008, pp. 13–16.\n[201] I. E. Villalon-Turrubiates, \u0026ldquo;Convolutional neural network for flood-risk assessment and detection within a metropolitan area,\u0026rdquo; in 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS . IEEE, 2021, pp. 1339–1342.\n[202] P. Zhong et al., \u0026ldquo;Detection of urban flood inundation from traffic images using deep learning methods,\u0026rdquo; Water Resources Management , vol. 38, pp. 287–301, 2024.\n[203] K. Lohumi et al., \u0026ldquo;Automatic detection of flood severity level from flood videos using deep learning models,\u0026rdquo; in 2018 5th International Conference on Information and Communication Technologies for Disaster Management (ICT-DM). IEEE, 2018, pp. 1–7.\n[204] L. Lopez-Fuentes et al., \u0026ldquo;Multi-modal deep learning approach for flood detection.\u0026rdquo; MediaEval, vol. 17, pp. 13–15, 2017.\n[205] M. Sandler et al., \u0026ldquo;Mobilenetv2: Inverted residuals and linear bottlenecks,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.\n[206] A. 
Krizhevsky et al., \u0026ldquo;Imagenet classification with deep convolutional neural networks,\u0026rdquo; Advances in neural information processing systems , vol. 25, 2012.\n[207] C. Szegedy et al., \u0026ldquo;Going deeper with convolutions,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition , 2015, pp. 1–9.\n[208] S. Ren et al., \u0026ldquo;Faster r-cnn: Towards real-time object detection with region proposal networks,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell. , vol. 39, pp. 1137–1149, 2016.\n[209] S. Woo et al., \u0026ldquo;Cbam: Convolutional block attention module,\u0026rdquo; in Proceedings of the European conference on computer vision (ECCV) , 2018, pp. 3–19.\n[210] A. Van Den Oord et al., \u0026ldquo;Neural discrete representation learning,\u0026rdquo; Advances in neural information processing systems, vol. 30, 2017.\n[211] A. Dosovitskiy et al., \u0026ldquo;An image is worth 16x16 words: Transformers for image recognition at scale,\u0026rdquo; arXiv preprint arXiv:2010.11929, 2020.\n[212] N. Humaira et al., \u0026ldquo;Dx-floodline: End-to-end deep explainable pipeline for real time flood scene object detection from multimedia images,\u0026rdquo; IEEE Access, vol. 11, pp. 110 644–110 655, 2023.\nGhazal Alinezhad Noghre is currently a Ph.D. candidate in Electrical and Computer Engineering at the University of North Carolina at Charlotte, Charlotte, North Carolina, United States. Her research focuses on artificial intelligence, machine learning, and computer vision, with a particular emphasis on the application of AI in real-world environments and the associated challenges.\nArmin Danesh Pazho is a Ph.D. candidate in Electrical and Computer Engineering at the University of North Carolina at Charlotte. His research focuses on artificial intelligence, machine learning, and computer vision, with emphasis on developing scalable AI solutions for practical applications. 
He has researched, designed, and developed novel AI/ML algorithms, systems, and datasets with deployment in real-world testbeds.\nHamed Tabkhi is an associate professor of Electrical and Computer Engineering at the University of North Carolina at Charlotte. His research focuses on advancing artificial intelligence and computer vision to solve real-world challenges through close collaboration with experts and community stakeholders. The National Science Foundation recognized Dr. Tabkhi\u0026rsquo;s Smart and Connected Communities award as a program success story. His work has been featured by local news for its significant contributions to community-driven responsible AI solutions.\nTABLE VIII LIST OF ABBREVIATIONS USED THROUGHOUT THE PAPER. THIS TABLE PROVIDES FULL FORMS FOR TECHNICAL TERMS COMMONLY REFERENCED IN THE CONTEXT OF VIDEO ANOMALY DETECTION (VAD).\nAbbreviation | Full Form\nVAD | Video Anomaly Detection\nSOTA | State-of-The-Art\nAI | Artificial Intelligence\nCNN | Convolutional Neural Network\nVQ-VAE | Vector Quantized Variational Autoencoder\nViT | Vision Transformer\nGMM | Gaussian Mixture Model\nGRU | Gated Recurrent Unit\nLSTM | Long Short-Term Memory\nRGB | Red Green Blue (color video input)\nMLP | Multi-Layer Perceptron\nPD | Parkinson’s Disease\nSVM | Support Vector Machine\nGEI | Gait Energy Image\nASD | Autism Spectrum Disorder\nEEG | Electroencephalography\nVEEG | Video Electroencephalography\nMIL | Multiple Instance Learning\nMSE | Mean Squared Error\nGCN | Graph Convolutional Network\nVAE | Variational Autoencoder\nGIN | Graph Isomorphism Network\nkNN | k-Nearest Neighbors\nkDNN | k-Nearest Distance Neural Network (DNN-based approximation of kNN)\nIoU | Intersection-over-Union\nGAN | Generative Adversarial Network\nIX. ABBREVIATIONS # This section provides a list of abbreviations and their corresponding full forms used throughout the survey. These terms are commonly referenced in the literature on Video Anomaly Detection (VAD). 
The purpose of this list is to assist readers with quick reference and improve the clarity and accessibility of the material presented.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/survey-3/","section":"Papers","summary":"This survey provides a comprehensive overview of deep learning-based Video Anomaly Detection (VAD), covering challenges, methodologies, domain-specific applications, and future research directions across human-centric, vehicle-centric, and environment-centric contexts. It introduces a taxonomy of supervision levels, adaptive learning strategies, and explores diverse application areas including healthcare, public safety, road surveillance, and disaster detection, emphasizing the latest advancements and open challenges.","title":"A Survey on Video Anomaly Detection via Deep Learning: Human, Vehicle, and Environment","type":"survey"},{"content":" A VLM-based Method for Visual Anomaly Detection in Robotic Scientific Laboratories # Shiwei Lin, Chenxu Wang, Xiaozhen Ding, Yi Wang, Boyuan Du, Lei Song,\nChenggang Wang, and Huaping Liu\nAbstract— In robot scientific laboratories, visual anomaly detection is important for the timely identification and resolution of potential faults or deviations. It has become a key factor in ensuring the stability and safety of experimental processes. To address this challenge, this paper proposes a VLM-based visual reasoning approach that supports different levels of supervision through four progressively informative prompt configurations. To systematically evaluate its effectiveness, we construct a visual benchmark tailored for process anomaly detection in scientific workflows. Experiments on two representative vision-language models show that detection accuracy improves as more contextual information is provided, confirming the effectiveness and adaptability of the proposed reasoning approach for process anomaly detection in scientific workflows. 
Furthermore, real-world validations at selected experimental steps confirm that first-person visual observation can effectively identify process-level anomalies. This work provides both a data-driven foundation and an evaluation framework for vision anomaly detection in scientific experiment workflows.\nI. INTRODUCTION # In scientific experiments, when unexpected events occur during the process, accurately and efficiently detecting anomalies through visual information is critical for dynamic task scheduling [1], embodied perceptual decision-making [2], minimizing downtime, improving system efficiency, and reducing human intervention. It also contributes to the development of robust laboratory safety protocols [3]. Most existing experimental designs in scientific laboratories focus on workflow planning and schedule optimization, while paying limited attention to potential anomalies that may arise during the experimental process [4]–[6]. Current vision-based anomaly detection methods are mainly developed for specific domains such as manufacturing processes and abnormal behavior analysis [7]–[15], are not well suited to handle process anomaly detection in diverse scientific experiments, and exhibit limited transferability between diverse experimental setups.\nIn scientific experiments, process anomaly detection is context-dependent, as the same target state may be normal or abnormal at different workflow stages. This context-sensitive nature makes it difficult for conventional classification-based methods to accurately identify such anomalies, as illustrated in Fig.1.\n*This work was supported by the National Natural Science Fund for Key International Collaboration under grant 62120106005.\nShiwei Lin, Chenxu Wang, Yi Wang, and Huaping Liu are with the Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. linsw23@mails.tsinghua.edu.cn, hpliu@tsinghua.edu.cn.\nXiaozhen Ding is with the School of Physics and Electronic Information, Yantai University, Yantai 264005, China.\nBoyuan Du is with the School of Computer and Big Data, Fuzhou University, Fuzhou 350108, China.\nLei Song and Chenggang Wang are with the Department of Automation, Shanghai Jiao Tong University, Shanghai, 200240, P. R. China.\nHuaping Liu is the corresponding author.\nFig. 1. Illustration of context-dependent anomaly determination. Each experimental stage is represented by a combination of a visual state (image) and a corresponding textual description (context). The same image may be considered normal or anomalous depending on the subtask context. (a) Without contextual information, the status of the image is ambiguous. (b) In a stage where the presence of the test tube is expected, the image is considered normal. (c) During the pouring stage, the absence of the test tube means the image is considered abnormal.\nRecent advances have shown that VLMs possess strong multimodal reasoning capabilities, supporting zero-shot anomaly detection via context-aware VQA [13], [16], precise anomaly localization by integrating traditional industrial knowledge with visual encoders [14], [15], and enhanced visual tracking through linguistic guidance [17], [18]. These developments build on the growing role of LLM-powered agents in scientific domains [4], [19], [20]. Therefore, we adopt VLMs for anomaly detection, as they are capable of reasoning about the semantic meaning of target states in images, making them more suitable for context-aware decision-making in experimental workflows. 
Specifically, we propose a method that utilizes visual information collected during experimental execution, combined with structured prompt information, to perform anomaly detection using VLMs. As shown in Fig.2, this approach integrates experiment perception, visual-language reasoning, and contextual understanding to determine whether the current stage is anomalous.\nFig. 2. Illustration of the process anomaly detection problem in scientific experiments. A robotic system performs experimental steps while collecting visual data from a first-person perspective. At each step, the visual observations are used to determine whether the current process state is normal or anomalous, based on contextual information from the experimental workflow. The visual detection results can influence the subsequent experimental schedule.\nTo systematically evaluate the proposed method, we construct a multimodal visual benchmark for process anomaly detection in scientific experiments. Using the silicone preparation workflow, we collect and annotate first-person visual data in a real chemical laboratory, covering 15 key stages and 20 monitoring points. This benchmark supports both stage-level and full-process anomaly detection tasks. Based on this dataset, we introduce four progressively informative prompt configurations to study the impact of prompt granularity on VLM reasoning performance, enabling evaluation and analysis of different models. 
We conduct systematic experiments on two representative VLMs, GPT-4o and Qwen2.5-VL-72B-Instruct, and demonstrate that prompt design plays a crucial role in improving model accuracy and decision robustness.\nOur main contributions are summarized as follows:\nWe propose a VLM-based visual anomaly detection method that utilizes first-person images and stage-dependent semantics across diverse scientific scenarios, enabling systematic investigation of prompt granularity effects and providing an evaluation standard for process anomaly detection in scientific experiments.\nWe construct a vision-based benchmark for process anomaly detection in scientific experiments, grounded in a real silicone preparation workflow in a chemical lab. The benchmark includes multi-step, multi-view first-person images with detailed annotations, supporting multimodal models with contextual input and enabling research on anomaly understanding in scientific workflows.\nWe conduct real-world experiments to validate the effectiveness of our method in the selected stage, demonstrating the practical applicability of first-person visual anomaly detection.\nII. RELATED WORK # A. Anomaly Detection in Automated Systems # Anomaly detection has long been applied in manufacturing, behavior monitoring, and defect inspection. Traditional approaches based on statistical analysis and handcrafted features remain prevalent. For example, liquid motion patterns in video frames have been used to improve experimental safety [7]. However, these methods struggle with adaptability in complex or dynamic environments. Spatiotemporal regression has also been applied for the detection of localized overheating in additive manufacturing [8], but its computational cost limits practical deployment.\nDeep learning enables more effective feature extraction in industrial anomaly detection. 
Unsupervised methods improve the detection of human and vehicle anomalies in surveillance settings [9], though false-positive rates remain high under changing conditions. AMP-Net, which fuses spatial and temporal features, performs well in complex scenarios [10], but like many deep models, it relies on domain-specific training data and lacks generalization to diverse environments.\nGANs with attention mechanisms have achieved high accuracy in surface defect detection [11], but are sensitive to data distribution and resource-intensive. Transformer-based models that incorporate prior knowledge handle various types of anomalies [12], yet require extensive annotations and incur high inference costs. Overall, while deep learning outperforms traditional methods, it still faces limitations in robustness, adaptability, and resource demands.\nDespite progress in industrial contexts, existing methods often fail to generalize to the complex and dynamic workflows of scientific experiments.\nB. VLMs for Anomaly Detection # With the rapid development of multimodal learning, Visual Question Answering (VQA) has become a flexible and extensible solution for anomaly detection. AnyAnomaly [13] introduces a context-aware VQA framework that leverages user-defined textual anomaly descriptions to enable fast, zero-shot detection, evaluated on the proposed C-VAD dataset. MiniGPT-4 and Myriad enhance multimodal models by integrating prior knowledge from traditional industrial detection models with visual encoders, improving the accuracy of anomaly localization [14], [15], with MiniGPT-4 evaluated on the COCO caption benchmark and Myriad validated across MVTec-AD, VisA, and PCB Bank datasets. 
However, these methods are primarily designed for static production environments or fixed defect types, making them less suitable for the diverse processes and irregular events encountered in scientific experiments.\nIn video monitoring and tracking, adding textual descriptions can enhance visual features and improve behavior recognition [17]. Applied to experimental settings, keyframes from different time points can be described in natural language and semantically compared using large models [21], enabling the detection of both environmental and procedural anomalies. Additionally, few-shot learning improves model adaptability to unseen scenarios and rare anomaly types, offering a promising direction for low-resource conditions [22]. Existing vision-based anomaly detection methods are primarily developed for fixed industrial scenarios and often fail to generalize to the complex and variable workflows of scientific experiments. This limitation mainly stems from their dependence on task-specific data and limited adaptability to diverse procedural structures.\nSystem Prompt: You are an AI assistant for automated anomaly detection in a robotic chemical laboratory. Your role is to visually inspect laboratory procedures, identifying any abnormalities that may disrupt experiments or pose safety risks to laboratory staff.\nFig. 3. Hierarchical prompt design and reasoning process. The left part shows the construction of prompts at different information levels, including experimental context, stage description, detection content, and anomaly label description. The right part shows how the VLM takes the image and prompt as input and performs step-by-step reasoning to identify potential anomalies.\nIII. PROBLEM FORMULATION # A complete automated experimental workflow, as illustrated in Fig.2, can be formally represented by an ordered set of steps S = {s1, s2, \u0026hellip;, sn} along with fundamental information I about the context of the experiment. 
Each step si, defined as a meta-step, may include one or multiple robotic actions. Each meta-step si requires certain prerequisites (e.g., materials, reagents, or equipment) and completion conditions, which together serve as the anomaly detection target C, and may include a fine-grained anomaly description Cd. To support different levels of prompt granularity in anomaly detection, we introduce an information control function φ(S, C, Cd), which determines the semantic content provided to the model during inference. Given a visual observation x captured from the experiment, the anomaly detection task is formulated as a binary classification function:\ny = f(x, I, φ(S, C, Cd)),\nwhere y ∈ {0, 1} indicates whether an anomaly is present (1 for anomaly, 0 for normal). The function φ selectively includes the step information S, the detection target C, and the fine-grained anomaly description Cd, depending on the evaluation configuration. Formally, φ serves as a semantic controller that determines the composition and granularity of the prompt content. One specific instantiation of φ is presented in Section IV to support hierarchical prompt analysis.\nThis formulation provides a unified framework for modeling open-ended reasoning and goal-conditioned anomaly detection, and serves as the foundation for our benchmark\u0026rsquo;s layered evaluation strategy.\nIV. METHOD # Our method takes a first-person image and structured textual information as input, and uses VLMs to perform multimodal reasoning. By progressively incorporating contextual and semantic information through prompt design, the model determines whether the current experimental state is normal or anomalous.\nAt each monitoring point within the experimental procedure, a first-person image is captured by the robotic system. This image is then paired with a natural language prompt that encodes progressively detailed contextual information. 
The prompt may include the experimental background, the specific subtask being performed, the detection target, and a semantic description of abnormal conditions.\nThe resulting multimodal input is fed into a VLM, which jointly processes the visual and textual information to infer the state of the experiment. Rather than producing a direct binary classification, we adopt a reasoning-based approach using Chain-of-Thought (CoT) prompting, which guides the model to analyze the situation step by step before reaching a conclusion. The final output is then used to assess whether an anomaly is present and to support subsequent experimental decision-making.\nA. Multimodal Input Formatting # To enable effective reasoning by VLMs, our method formats both visual and textual information into a unified multimodal input. This section describes how the two modalities (first-person images and structured prompts) are prepared and combined, as illustrated in Fig.3.\nVisual Input. The visual input consists of RGB images captured from a first-person perspective by robotic arms during the execution of each experimental stage. These images provide contextual views of critical equipment, materials, and objects in the workspace. All images are resized to 640×480 resolution before being passed into the model to match input constraints and maintain visual consistency.\nPrompt Structure. The textual input follows a structured natural language format and is dynamically generated based on the current experimental stage and prompt configuration level. Each prompt is composed of up to four textual elements: Experiment Context, Stage Description, Detection Content, and Anomaly Label Description.\nExperiment Context: A brief description of the overall experimental background. Stage Description: A description of the current subtask, including the operator, target object, start and destination positions, and actions. 
Detection Content: A structured sentence that specifies the content to be checked in the scene, typically in the form of \u0026ldquo;Check whether [object] is [in a specific state or location]\u0026rdquo;. Anomaly Label Description: A semantic-level textual description that defines the condition under which a state is considered abnormal or normal. Multimodal Integration. The image and prompt are submitted as a combined query to VLMs via its multimodal input interface. No additional segmentation tokens are required; the model handles joint encoding of vision and text natively. This consistent formatting ensures that the model receives clearly organized information, allowing it to focus on relevant aspects of the visual scene during reasoning.\nB. Hierarchical Prompt Design # To investigate how the granularity of textual input affects model reasoning, we design a four-level hierarchical prompting scheme. Each level incrementally adds semantic or contextual information, enabling progressively more detailed guidance for the model.\nLevel 1: Contains only the Experiment Context, which provides a general background of the experiment and helps the model understand the overall task setting. Level 2: Includes both the Experiment Context and the Stage Description, describing the specific subtask being performed in the current stage. Level 3: Adds the Detection Content on top of Level 2, specifying what should be visually checked in the current image. Level 4: Further incorporates the Anomaly Label Description, offering a semantic-level explanation of how normal and abnormal states differ. This layered design enables a systematic evaluation of how varying prompt specificity influences model behavior. A visual illustration of this hierarchical prompt structure is provided in Fig.3. 
It also directly corresponds to the instantiation of the prompt function φi defined in our problem formulation, where each level represents an increasing degree of semantic richness and contextual precision in the input.\nThis progressive design allows us to systematically evaluate the effect of prompt granularity on model performance. As more context and guidance are provided, the model is expected to exhibit more accurate and consistent reasoning capabilities.\nC. Reasoning Process # Given a first-person image and its corresponding hierarchical prompt, the VLM is required to jointly interpret both the visual input and structured textual information to assess whether the current experimental state is normal or anomalous. Rather than performing direct classification, the model is guided to reason step by step through Chain-of-Thought (CoT) prompting. This approach encourages more interpretable and robust decisions, especially under complex or ambiguous visual conditions.\nFig. 4. Examples from two detection points in the proposed benchmark. Each example includes first-person images captured from different devices and viewpoints, along with their associated textual annotations. (a) Images captured near the workbench by the mobile robotic arm camera from different viewpoints, showing both the presence and absence of the silicone container on the table. (b) Images captured near the shaking device by the fixed robotic arm camera from different perspectives, showing both its presence and absence on the shaking device.\nThe model output is a natural language response that typically includes reasoning steps followed by a conclusion. To derive the final anomaly label, we apply a post-processing procedure that converts the model\u0026rsquo;s conclusion into a binary decision. 
Specifically, a rule-based natural language parser is employed to extract the final decision from the response.\nThis reasoning and decision pipeline ensures that model predictions are semantically grounded and suitable for comparison against annotated ground truth labels in the benchmark.\nV. BENCHMARK IMPLEMENTATION # The benchmark is constructed based on a real chemical laboratory setting. It provides a unified dataset, task definition, and evaluation protocol to systematically assess the performance of multimodal models in detecting anomalies during experiments.\nOur dataset captures a complete automated silicone preparation process and contains 1001 first-person images, including 501 normal and 500 abnormal samples. The workflow is divided into 15 discrete stages, with 20 monitoring points. Visual data are collected using both fixed and mobile robotic arms from multiple viewpoints to ensure diversity in spatial and visual configurations. Examples are shown in Fig.4.\nEach image is accompanied by structured textual annotations that provide contextual information for multimodal reasoning. These annotations include four distinct elements as introduced in the Method section. Annotations were initially provided independently by three annotators, followed by a consensus process to ensure consistency, semantic clarity, and alignment with real-world scientific procedures. An overview of the dataset structure and annotation scheme is shown in Fig.4. The evaluation metrics are introduced in the following section.\nFig. 5. Qualitative examples illustrating the effects of hierarchical prompting. Left: The model corrects its focus and successfully detects the anomaly after receiving all four levels of prompts. Right: The model succeeds at Level 3 when provided with clear detection content, but fails at Level 4 due to misinterpretation of the anomaly label description, resulting in incorrect judgment.\nVI. EXPERIMENTS # A. 
Experimental Setup # We evaluate two representative VLMs: GPT-4o and Qwen2.5-VL-72B-Instruct. The latter is abbreviated as QwenVL-72B. Both models are capable of performing image-text reasoning and support natural language prompting. Inference is conducted via publicly available APIs, without any fine-tuning.\nIn addition, we perform a real-world validation in a robotic laboratory, where anomaly detection is performed through a GPT-4o-driven reasoning pipeline to verify model behavior in actual experimental procedures.\nB. Evaluation Metrics # To comprehensively evaluate model performance, we adopt the following four metrics:\nAccuracy (ACC): The percentage of samples correctly classified as normal or abnormal, indicating the overall effectiveness.\nFalse Positive Rate (FPR): The percentage of normal samples incorrectly predicted as abnormal. This reflects the model\u0026rsquo;s tendency to generate false alarms, which may cause unnecessary workflow interruptions.\nMissed Detection Rate (MDR): The percentage of abnormal samples mistakenly classified as normal, representing the model\u0026rsquo;s failure to detect true anomalies - a critical factor in high-risk experimental contexts.\nUncertainty Rate (UR): The percentage of cases where the model cannot provide a confident judgment. This metric indicates the model\u0026rsquo;s ambiguity when handling visually complex or previously unseen samples, and provides insight into its robustness and reliability.\nC. Results and Analysis # Table I reports the average performance of both models across the four prompt levels.\nOverall Performance Trends # GPT-4o shows a significant performance improvement, with accuracy increasing from 41.6% at Level 1 to 79.2% at Level 4, and MDR dropping to 3.0%. QwenVL-72B demonstrates a more limited gain, with accuracy improving from 41.6% to 61.4%. A noticeable reduction in MDR (to 3.0%) only occurs at Level 4, indicating its stronger reliance on explicit anomaly descriptions. 
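The four metrics above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name `summarize` and the label strings are hypothetical. One assumption deserves emphasis: in Table I the four columns of every row sum to 100%, so the sketch divides every count by the total number of evaluated samples rather than by the per-class counts.

```python
from collections import Counter


def summarize(truth, pred):
    # truth: ground-truth labels, each 'normal' or 'abnormal' (hypothetical strings)
    # pred: parsed model decisions, each 'normal', 'abnormal', or 'uncertain'
    if not truth or len(truth) != len(pred):
        raise ValueError('truth and pred must be non-empty and of equal length')
    counts = Counter()
    for t, p in zip(truth, pred):
        if p == 'uncertain':
            counts['UR'] += 1    # no confident judgment
        elif p == t:
            counts['ACC'] += 1   # correct decision
        elif t == 'normal':
            counts['FPR'] += 1   # normal sample flagged as abnormal
        else:
            counts['MDR'] += 1   # abnormal sample judged normal
    n = len(truth)
    # Assumption: all four rates share the same denominator (all samples).
    return {k: round(100.0 * counts[k] / n, 1) for k in ('ACC', 'FPR', 'MDR', 'UR')}


if __name__ == '__main__':
    truth = ['normal', 'normal', 'abnormal', 'abnormal', 'abnormal']
    pred = ['normal', 'abnormal', 'abnormal', 'normal', 'uncertain']
    print(summarize(truth, pred))
```

Under this shared-denominator reading, ACC, FPR, MDR, and UR partition the evaluated samples, which is why the four values add up to 100. If the rates were instead normalized per class, as the wording of the FPR and MDR definitions could also be read, the rows would not generally sum to 100.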
Differences in False Positive Rate # QwenVL-72B has a high FPR of 38.6% at Level 1, suggesting a tendency to over-predict anomalies when contextual information is insufficient. Even with the most detailed prompts at Level 4, its FPR remains high at 35.6%, reflecting sensitivity to prompt design. In contrast, GPT-4o maintains a more stable FPR across levels and achieves the lowest FPR of 16.8% at Level 4, showing better robustness to prompt variations.\nSensitivity to Prompt Granularity # GPT-4o exhibits a sharp performance boost at Level 3, where MDR drops to 5.0%, indicating that specifying the detection objective alone can significantly enhance its anomaly recognition. QwenVL-72B only shows notable improvement at Level 4, where explicit anomaly descriptions are included, suggesting that its reasoning capability is more dependent on semantic cues provided in the prompt.\nIn addition to quantitative metrics, we further conduct qualitative analysis to better understand how prompt granularity influences model decision-making. As shown in Fig. 5, we present two representative cases. In the left example, the model initially fails due to inaccurate focus or object localization, but successfully identifies the anomaly after receiving all four levels of prompt information. In contrast, the right example remains misclassified at Levels 1 and 2, but correctly identifies the anomaly at Level 3 after receiving the detection content. However, when the anomaly label description is added at Level 4, the model incorrectly classifies the image as abnormal. This suggests that overly specific descriptions may inadvertently shift the model\u0026rsquo;s attention toward secondary features, leading to misjudgment. 
These cases illustrate how hierarchical prompting can both enhance and, in some cases, misguide model reasoning, depending on how well the prompt aligns with the visual context.\nTABLE I PERFORMANCE (%) ACROSS DIFFERENT PROMPT LEVELS.\nModel | Level | ACC↑ | FPR↓ | MDR↓ | UR↓\nGPT-4o | Level 1 | 41.6 | 25.7 | 32.7 | 0\nGPT-4o | Level 2 | 50.5 | 27.7 | 20.8 | 1\nGPT-4o | Level 3 | 67.3 | 27.7 | 5 | 0\nGPT-4o | Level 4 | 79.2 | 16.8 | 3 | 1\nQwenVL-72B | Level 1 | 41.6 | 38.6 | 19.8 | 0\nQwenVL-72B | Level 2 | 49.5 | 23.8 | 26.7 | 0\nQwenVL-72B | Level 3 | 57.4 | 19.8 | 22.8 | 0\nQwenVL-72B | Level 4 | 61.4 | 35.6 | 3 | 0\nFig. 6. Real-world demonstration of anomaly detection (pre-condition check and post-condition check): verifying the presence and correct placement of a silicone bottle during a robotic transfer operation.\nD. Real-World Demonstration # To further validate the benchmark\u0026rsquo;s applicability, we conducted a real-world test in a robotic chemical laboratory. The test focused on the operation where a robotic arm transfers a silicone bottle from the material table to the operation table.\nAs shown in Fig.6, the experiment is divided into two key checkpoints: before and after the execution of the transfer step. At the beginning of the step, the model receives the experiment background and step description as input, and is prompted to check whether the silicone bottle is present on the material table. The model correctly responds with no anomaly, confirming that the setup is valid.\nAfter the robotic arm completes the transfer, another image is captured at the workbench. The model is again prompted to verify whether the silicone bottle has been successfully placed. The result is also no anomaly, demonstrating the model\u0026rsquo;s ability to monitor the correctness of step execution under real-world conditions.\nVII. CONCLUSIONS # This paper proposes a vision-language reasoning approach for process anomaly detection in scientific experiments, leveraging VLMs guided by hierarchical prompts and CoT inference for step-wise anomaly judgment. 
To support systematic evaluation, we construct a visual reasoning benchmark based on a real-world chemical workflow. Experimental results show that prompt granularity significantly affects model performance. Real-world validation confirms the method\u0026rsquo;s effectiveness in detecting execution anomalies from first-person visual input. Future work will expand the benchmark and explore automatic prompt generation for broader applicability, and generate descriptive anomaly reports from VLM reasoning traces to support more explainable and actionable outputs.\nREFERENCES # [1] D. Ouelhadj and S. Petrovic, \u0026ldquo;A survey of dynamic scheduling in manufacturing systems,\u0026rdquo; Journal of scheduling, vol. 12, pp. 417–431, 2009.\n[2] H. Liu, D. Guo, and A. Cangelosi, \u0026ldquo;Embodied intelligence: A synergy of morphology, action, perception and learning,\u0026rdquo; ACM Computing Surveys, 2025.\n[3] G. Tom, S. P. Schmid, S. G. Baird, Y. Cao, K. Darvish, H. Hao, S. Lo, S. Pablo-García, E. M. Rajaonson, M. Skreta et al., \u0026ldquo;Self-driving laboratories for chemistry and materials science,\u0026rdquo; Chemical Reviews, vol. 124, no. 16, pp. 9633–9732, 2024.\n[4] K. Darvish, M. Skreta, Y. Zhao, N. Yoshikawa, S. Som, M. Bogdanovic, Y. Cao, H. Hao, H. Xu, A. Aspuru-Guzik et al., \u0026ldquo;Organa: a robotic assistant for automated chemistry experimentation and characterization,\u0026rdquo; Matter, vol. 8, no. 2, 2025.\n[5] Z. Yang, Y. Du, D. Liu, K. Zhao, and M. Cong, \u0026ldquo;A human-robot interaction system for automated chemical experiments based on vision and natural language processing semantics,\u0026rdquo; Engineering Applications of Artificial Intelligence, vol. 146, p. 110226, 2025.\n[6] D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes, \u0026ldquo;Autonomous chemical research with large language models,\u0026rdquo; Nature, vol. 624, no. 7992, pp. 570–578, 2023.\n[7] N. H. Sarker, Z. A. Hakim, A. Dabouei, M. R. Uddin, Z. Freyberg, A. 
MacWilliams, J. Kangas, and M. Xu, \u0026ldquo;Detecting anomalies from liquid transfer videos in automated laboratory setting,\u0026rdquo; Frontiers in Molecular Biosciences, vol. 10, p. 1147514, 2023. [8] H. Yan, M. Grasso, K. Paynabar, and B. M. Colosimo, \u0026ldquo;Real-time detection of clustered events in video-imaging data with applications to additive manufacturing,\u0026rdquo; IISE Transactions, vol. 54, no. 5, pp. 464– 480, 2022. [9] B. Li, S. Leroux, and P. Simoens, \u0026ldquo;Decoupled appearance and motion learning for efficient anomaly detection in surveillance video,\u0026rdquo; Computer Vision and Image Understanding, vol. 210, p. 103249, 2021. [10] Y. Liu, J. Liu, K. Yang, B. Ju, S. Liu, Y. Wang, D. Yang, P. Sun, and L. Song, \u0026ldquo;Amp-net: Appearance-motion prototype network assisted automatic video anomaly detection system,\u0026rdquo; IEEE Transactions on Industrial Informatics, vol. 20, no. 2, pp. 2843–2855, 2023. [11] L. Zhang, Y. Dai, F. Fan, and C. He, \u0026ldquo;Anomaly detection of gan industrial image based on attention feature fusion,\u0026rdquo; Sensors, vol. 23, no. 1, p. 355, 2022. [12] H. Yao, Y. Cao, W. Luo, W. Zhang, W. Yu, and W. Shen, \u0026ldquo;Prior normality prompt transformer for multiclass industrial image anomaly detection,\u0026rdquo; IEEE Transactions on Industrial Informatics, 2024. [13] S. M. Lukin and R. Sharma, \u0026ldquo;Anomaly detection with visual question answering,\u0026rdquo; DEVCOM Army Research Laboratory, Tech. Rep. ARLTR9817, 2023. [14] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, \u0026ldquo;Minigpt-4: Enhancing vision-language understanding with advanced large language models,\u0026rdquo; arXiv preprint arXiv:2304.10592, 2023. [15] Y. Li, H. Wang, S. Yuan, M. Liu, D. Zhao, Y. Guo, C. Xu, G. Shi, and W. Zuo, \u0026ldquo;Myriad: Large multimodal model by applying vision experts for industrial anomaly detection,\u0026rdquo; arXiv preprint arXiv:2310.19070 , 2023. [16] S. Tan, M. Ge, D. 
Guo, H. Liu, and F. Sun, \u0026ldquo;Knowledge-based embodied question answering,\u0026rdquo; IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 11 948–11 960, 2023. [17] D. Chen, H. Zhang, J. Song, Y. Feng, and Y. Yang, \u0026ldquo;Mamtrack: Vision-language tracking with mamba fusion,\u0026rdquo; in Proceedings of the 2024 8th International Conference on Computer Science and Artificial Intelligence, 2024, pp. 119–126. [18] X. Liu, X. Li, D. Guo, S. Tan, H. Liu, and F. Sun, \u0026ldquo;Embodied multiagent task planning from ambiguous instruction.\u0026rdquo; in Robotics: Science and Systems, 2022. [19] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin et al., \u0026ldquo;A survey on large language model based autonomous agents,\u0026rdquo; Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024. [20] N. C. Frey, R. Soklaski, S. Axelrod, S. Samsi, R. Gomez-Bombarelli, C. W. Coley, and V. Gadepally, \u0026ldquo;Neural scaling of deep chemical models,\u0026rdquo; Nature Machine Intelligence, vol. 5, no. 11, pp. 1297–1305, 2023. [21] U. De Silva, L. Fernando, B. L. P. Lik, Z. Koh, S. C. Joyce, B. Yuen, and C. Yuen, \u0026ldquo;Large language models for video surveillance applications,\u0026rdquo; in TENCON 2024-2024 IEEE Region 10 Conference (TENCON). IEEE, 2024, pp. 563–566. [22] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, \u0026ldquo;Generalizing from a few examples: A survey on few-shot learning,\u0026rdquo; ACM computing surveys (csur), vol. 53, no. 3, pp. 1–34, 2020. ","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/a-vlm-based-method-for-visual-anomaly-detection-in-robotic-scientific-laboratories/","section":"Papers","summary":"Proposes a vision-language reasoning approach utilizing hierarchical prompts and Chain-of-Thought inference for process anomaly detection in scientific experiments. 
Constructs a benchmark based on real chemical laboratory workflows and demonstrates improved accuracy with prompt granularity, validated through real-world robotic lab testing.","title":"A VLM-based Method for Visual Anomaly Detection in Robotic Scientific Laboratories","type":"method"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/abolfazl-razi/","section":"Authors","summary":"","title":"Abolfazl Razi","type":"authors"},{"content":" Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection # Canhui Tang, Sanping Zhou, Member, IEEE, Haoyue Shi, Le Wang, Senior Member, IEEE\nAbstract—Zero-Shot Video Anomaly Detection (ZS-VAD) requires temporally localizing anomalies without target domain training data, which is a crucial task due to various practical concerns, e.g., data privacy or new surveillance deployments. The skeleton-based approach has inherent generalization advantages for ZS-VAD because it eliminates domain disparities in both background and human appearance. However, existing methods learn only low-level skeleton representations and rely on a domain-limited normality boundary, so they cannot generalize well to new scenes with different normal and abnormal behavior patterns. In this paper, we propose a novel zero-shot video anomaly detection framework, unlocking the potential of skeleton data via action typicality and uniqueness learning. Firstly, we introduce a language-guided semantic typicality modeling module that projects skeleton snippets into action semantic space and distills an LLM\u0026rsquo;s knowledge of typical normal and abnormal behaviors during training. Secondly, we propose a test-time context uniqueness analysis module to finely analyze the spatio-temporal differences between skeleton snippets and then derive scene-adaptive boundaries. 
Without using any training samples from the target domain, our method achieves state-of-the-art results against skeleton-based methods on four large-scale VAD datasets: ShanghaiTech, UBnormal, NWPU, and UCF-Crime, featuring over 100 unseen surveillance scenes.\nIndex Terms—Video Anomaly Detection, Skeleton-based, Zero-shot, Action Semantic Typicality, Context Uniqueness.\nThis work was supported in part by National Science and Technology Major Project under Grant 2023ZD0121300, National Natural Science Foundation of China under Grants 62088102, U24A20325 and 12326608, and Fundamental Research Funds for the Central Universities under Grant XTR042021005. (Corresponding author: Sanping Zhou, E-mail: spzhou@mail.xjtu.edu.cn.)\nCanhui Tang, Sanping Zhou, Haoyue Shi, and Le Wang are all with the National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi\u0026rsquo;an Jiaotong University, Shaanxi 710049, China.\nI. INTRODUCTION # Video Anomaly Detection (VAD) aims to temporally locate abnormal events and has wide applications in video surveillance and public safety [1], [2]. Current mainstream paradigms include one-class [3]–[7] and weakly supervised methods [2], [8], [9], which require abundant samples from the target video domain for training. However, in surveillance scenarios involving privacy or newly installed monitoring devices, training samples from the target domain are usually not available. Therefore, designing a Zero-Shot Video Anomaly Detection (ZS-VAD) method that can generalize to diverse target domains becomes necessary. Despite the recent extensive attention given to zero-shot image anomaly detection [10]–[14], the zero-shot setting in the complex surveillance video domain remains under-explored [15].\nThe challenges of ZS-VAD come from significant variations in visual appearance and human activities across different video domains. While frame/object-based methods [3], [4], [16] have been prominent in video anomaly detection, their performance degrades when adapting to new scenes due to visual feature distribution shifts. In contrast, skeleton-based methods [6], [17]–[19] utilize mature pose detection systems [20], [21] to obtain skeleton data, learn to encode features via self-supervision tasks [18], [19], and then calculate the anomaly score. They are effective at identifying human behavior anomalies and are popular in the VAD task due to their superior efficiency and performance. Skeleton-based methods also have inherent generalization advantages for ZS-VAD, as they eliminate domain disparities in both background and human appearance.\nFig. 1. An illustration of skeleton-based VAD paradigm comparison. Previous approaches suffer from two main issues: (1) low-level representations and (2) domain-limited normal boundary. Our method enhances generalizability via action semantic typicality learning and context uniqueness analysis.\nHowever, as shown in Fig. 1, existing skeleton-based VAD methods still suffer from several limitations: (1) Low-level skeleton representations. They learn the normal distribution of skeleton patterns using self-supervised tasks, such as skeleton prediction [19], reconstruction [18], or coordinate-based normalizing flows [6]. Without semantic supervision signals, such methods fail to capture higher-level action patterns, making them sensitive to noise and unable to distinguish novel anomaly patterns that resemble normal ones. 
(2) Domain-limited normality boundary. They blindly rely on training-data-defined normality boundaries, leading to the misclassification of unseen normal events as anomalies. Both limitations hinder their generalization to unseen scenes with varying normal and abnormal patterns. This leads to a question: \u0026ldquo;Can we further unlock the potential of skeleton in ZS-VAD with generalizable representation learning and prior injection?\u0026rdquo;\nTo address this question, we reflect on how human observers judge normal and abnormal behavior in a new scenario. As shown in Fig. 1, we first identify the types of individual actions in the video and consider whether they are normal or abnormal based on our experiential knowledge of normality and abnormality, which is referred to as typicality. For instance, a pedestrian walking would be considered normal, while a fight or scuffle would be deemed abnormal. Secondly, for atypical normal or abnormal scenarios, we integrate the behaviors of all individuals in the video to observe whether any individual\u0026rsquo;s behavior significantly differs from the others, as anomalies are usually rare and unique; this is referred to as context uniqueness.\nBased on these complementary priors, we propose a novel skeleton-based zero-shot video anomaly detection framework, which captures both typical anomalies guided by language prior and unique anomalies in spatio-temporal contexts. First, we introduce a language-guided typicality modeling module to achieve high-level semantic understanding beyond previous low-level representations. Specifically, it projects skeleton snippets into language-aligned action semantic space and distills an LLM\u0026rsquo;s knowledge of typical normal and abnormal behaviors during training. Secondly, to derive scene-adaptive boundaries, we propose a context uniqueness analysis module at test time. 
It finely analyzes the spatio-temporal differences between skeleton snippets to get an adaptive understanding of target scene activities. Without using any training samples from the target domain, we achieve state-of-the-art results on four large-scale VAD datasets: ShanghaiTech [1], UBnormal [22], NWPU [23], and UCF-Crime [2], featuring over 100 unseen surveillance scenes. Our contributions are as follows:\n1) We propose a skeleton-based video anomaly detection framework that learns action typicality and uniqueness, enabling generalization across diverse target scenes.\n2) We propose a language-guided typicality modeling module that projects skeleton snippets into a generalizable semantic space and distills an LLM\u0026rsquo;s knowledge of typical normal and abnormal behaviors during training.\n3) We propose a test-time uniqueness analysis module to finely analyze the spatio-temporal differences between skeleton snippets and derive scene-adaptive boundaries between normal and abnormal behavior.\nThe rest of this paper is organized as follows. We review the related work in Section II. Section III describes the technical details of our proposed method. Section IV presents the experiment details and results. Finally, we summarize the paper in Section V.\nII. RELATED WORK # Video anomaly detection. Most previous video anomaly detection studies can be grouped into frame-based [1], [2], [4], object-centric [3], [16], [24], and skeleton-based methods [6], [19], [25]. In this work, we focus on the skeleton-based methods, which detect anomalies in human activity based on preprocessed skeleton/pose data. Morais et al. [17] propose an anomaly detection method that uses an RNN network to learn the representation of pose snippets, with prediction errors serving as anomaly scores. GEPC [25] utilizes autoencoders to learn pose graph embeddings, generates soft assignments through clustering, and uses a Dirichlet process mixture to determine anomaly scores. 
To model normal diversity, MoCoDAD [19] leverages diffusion probabilistic models to generate multimodal future human poses. FG-Diff [26] guides the diffusion model with observed high-frequency information and prioritizes the reconstruction of low-frequency components, enabling more accurate and robust anomaly detection. STG-NF [6] proposes a simple yet effective method that builds normalizing flows [27] on normal pose snippets to obtain normality boundaries. DA-Flow [28] proposes a lightweight dual attention module for capturing cross-dimension interaction relationships in spatio-temporal skeletal data. However, these methods rely on training with normal data from the target domain and overlook the semantic understanding of human behavior, which makes it difficult to ensure performance in scenarios where the target data is unavailable.\nZero-shot anomaly detection. Thanks to the development of vision-language models, zero-shot anomaly detection has received a lot of attention [10]–[12], [14], [29]–[31], especially in the field of image anomaly detection [32]. The pioneering work is WinCLIP [11], which utilizes the image-text matching capability of CLIP [33] to distinguish unseen normal samples from abnormal ones. Building on that, AnomalyCLIP [14] proposes to learn object-agnostic text prompts that capture generic normal and abnormal patterns in an image. AdaCLIP [13] introduces two types of learnable prompts to enhance CLIP\u0026rsquo;s generalization ability for anomaly detection. Despite the success in the image domain, only a few works [15], [34] have ventured into zero-shot video anomaly detection, with underwhelming performance. Although [35] recently proposed leveraging large visual language models for zero-shot video anomaly detection, it requires multi-stage reasoning and the collaboration of multiple large models, making it less user-friendly. 
We aim to develop a lightweight, user-friendly, and easily deployable zero-shot anomaly detector starting from skeleton data. Our work shares some similarities with a recent study [36]. However, we emphasize that our approach differs significantly from [36] in the following ways: 1) Different tasks: [36] addresses abnormal action recognition, involving no more than two individuals in a short video, while ours requires temporally localizing abnormal events in real surveillance videos. 2) Novel perspective: We combine the action typicality and uniqueness priors to address zero-shot anomaly detection challenges in video surveillance scenes.\nIII. METHOD # A. Overview # The objective of ZS-VAD is to train one model that can generalize to diverse target domains. Formally, let V^train be a training set from the source video domain and {W_1^test, W_2^test, \u0026hellip;, W_N^test} be multiple test sets from the target video domains. The test videos are annotated at the frame level with labels l_i ∈ {0, 1}, and the VAD model is required to predict each frame\u0026rsquo;s anomaly score. In this work, we focus on the skeleton-based paradigm, as it is computation-friendly and can benefit ZS-VAD by reducing the domain gap in both background and appearance.\nFig. 2. Overview of our approach for skeleton-based zero-shot video anomaly detection. I. Language-guided typicality modeling in the training phase. It projects skeleton snippets into the action semantic space, collects typicality knowledge from LLM, and then effectively learns the typical distribution of normal and abnormal behavior. (Only the black dashed boxes are used during inference.) II. Test-time uniqueness analysis in the inference phase. It finely analyzes the spatio-temporal differences between skeleton snippets and derives scene-adaptive boundaries between normal and abnormal behavior.\nFig. 2 overviews our proposed approach. 
Our model tackles the ZS-VAD problem from the perspective of action typicality and uniqueness learning. Firstly, to obtain a high-level semantic understanding, we propose a Language-Guided Typicality Modeling module that projects skeleton snippets into action semantic space and distills an LLM\u0026rsquo;s knowledge of typical normal and abnormal behaviors during training. Secondly, to get scene-adaptive decision boundaries, we propose a Test-Time Uniqueness Analysis module that finely analyzes the spatio-temporal differences between skeleton snippets. During inference on unseen VAD datasets, our model integrates typicality scores and uniqueness scores of human behavior to provide a holistic understanding of anomalies.\nB. Language-guided Typicality Modeling # Unlike previous works that learn low-level skeleton representations via self-supervised tasks [6], [18], [19], this module aims to obtain a high-level semantic understanding of human behavior. It learns language-aligned action features and scene-generic distributions of typical behavior by distilling an LLM\u0026rsquo;s knowledge during training. Specifically, this module consists of skeleton-text alignment, typicality knowledge selection, and typicality distribution learning. During inference, it can predict typicality anomaly scores with only a lightweight skeleton encoder and a normalizing flow module.\nSkeleton-text alignment. To achieve a generalizable semantic understanding of human behavior, we first propose to align the skeleton snippets with the corresponding semantic labels. For such skeleton-text pairs, we utilize external action recognition datasets (e.g., Kinect [37]) as the training set instead of specific VAD datasets (e.g., ShanghaiTech). The raw skeleton data of an action video is typically represented as X_i ∈ R^{C×J×L×M}, where C is the coordinate number, J is the joint number, L is the sequence length, and M is the pre-defined maximum number of persons. 
In addition, each video is annotated with a text label g_i representing the action class, which can also be transformed into a one-hot class vector y_i.\nCompared to action recognition tasks [38]–[40] that only predict video-level categories, the VAD task is more fine-grained, focusing on frame-level anomaly scores. Therefore, we decompose the original sequences into multiple short skeleton snippets A_i ∈ R^{C×J×T} using a sliding window, and discard snippets that are composed of zeros, where T is the length of a snippet. Snippets from the same action video share the same label and undergo a normalization operation like that of STG-NF [6] to make different snippets independent. Inspired by recent multimodal alignment works [41], [42], we then perform a skeleton-text alignment pretraining procedure to learn the discriminative representation. The procedure is built with a skeleton encoder E_s and a text encoder E_t, which generate skeleton features F_s and text features F_t, respectively. Additionally, the skeleton encoder also predicts a probability vector ŷ_i using a fully-connected layer. The training loss consists of a KL divergence loss and a cross-entropy classification loss following GAP [42]. This skeleton-text alignment procedure effectively guides the projection of skeleton snippets into language-aligned action semantic space, going beyond previous VAD works [6], [18], [19] that learn low-level skeleton representations.\nTABLE I THE GENERATED TYPICALITY LABELS (columns: Type, Typicality action list; rows: Normal, Abnormal).\nTypicality knowledge selection. In most video surveillance scenarios, some behaviors are generally considered normal or abnormal, which constitute a scene-generic set. Therefore, training a typicality-aware capability is one of the promising ways to achieve ZS-VAD. Thanks to the cutting-edge advancements of Large Language Models (LLMs), we propose to distill their prior knowledge about generic normality and abnormality during training. 
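
As an illustration of the snippet decomposition described in the skeleton-text alignment step above, the following is a minimal sketch rather than the authors' implementation. It assumes NumPy arrays with the (C, J, L, M) layout given in the text, treats a snippet as "composed of zeros" only when every value is zero, and omits the normalization step. The function names `extract_snippets` and `video_to_snippets` are hypothetical.

```python
import numpy as np

def extract_snippets(person_seq, window=16, stride=1):
    # person_seq: (C, J, L) skeleton sequence of one person slot.
    # Returns an array of shape (num_snippets, C, J, window).
    C, J, L = person_seq.shape
    snippets = []
    for start in range(0, L - window + 1, stride):
        snip = person_seq[:, :, start:start + window]
        if not np.any(snip):
            # Assumption: an all-zero window means no pose was present.
            continue
        snippets.append(snip)
    if not snippets:
        return np.empty((0, C, J, window), dtype=person_seq.dtype)
    return np.stack(snippets)

def video_to_snippets(video, window=16, stride=1):
    # video: (C, J, L, M) array; M is the maximum number of persons.
    per_person = [
        extract_snippets(video[..., m], window, stride)
        for m in range(video.shape[-1])
    ]
    return np.concatenate(per_person, axis=0)

# Usage: a 20-frame video with one real person and one empty person slot.
video = np.zeros((3, 17, 20, 2))
video[..., 0] = 1.0
snippets = video_to_snippets(video, window=16, stride=1)
```

With a window of 16 frames and a stride of 1, the 20-frame sequence yields five overlapping snippets for the populated person slot, and the empty slot contributes none.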
Based on the pre-trained skeleton-text representation, we aim to use an LLM as our knowledge engine to collect typical normal and abnormal data from the massive skeleton snippets. In detail, we give the large model a prompt P: \u0026ldquo;In most video surveillance scenarios, what are generally considered as normal actions and abnormal actions among these actions (Please identify the 20 most typical normal actions and 20 most typical abnormal actions, ranked in order of decreasing typicality). The action list is \u0026lt;T\u0026gt;\u0026rdquo;, where T refers to the set of all action class labels in the prepared action recognition dataset [37].\nThe large language model will respond with a list of typical normal action classes T^n and a list of typical abnormal action classes T^a, which can be formalized as:\nwhere T^n and T^a are subsets of T, and O_LLM denotes the offline LLM used for initial typicality label generation. Note that the LLM is needed only once, during training, for auxiliary data selection; it is not used during inference.\nAfter obtaining the typical action categories, we first collect the data of these selected categories and then select the high-quality snippets from them. This is because 1) some snippets contain noise, such as errors in pose detection and tracking, and 2) in an abnormal action sequence, not all the snippets are abnormal. Therefore, we use the skeleton-text similarity score to select the high-quality skeleton snippets, which is formulated as:\nwhere M^x refers to the index set of the selected snippets, g_i denotes the text label of snippet i, and β denotes the selection ratio. The superscript x represents n or a, indicating normal and abnormal, respectively. Using the index M^x, we obtain the corresponding skeleton data Ã^n and Ã^a, as well as skeleton features F̃_s^n and F̃_s^a.\nTypicality distribution learning. As shown in Fig. 
2, after obtaining the data, we proceed to model the feature distribution of typical behavior. Normalizing Flow (NF) [27] provides a robust framework for modeling feature distributions, transforming this distribution through a series of invertible and differentiable operations. Consider a random variable X ∈ R^D with target distribution p_X(x), and a random variable Z that follows a spherical multivariate Gaussian distribution. A bijective map f : X ↔ Z is then introduced, which is composed of a sequence of transformations: f_1 ∘ f_2 ∘ \u0026hellip; ∘ f_K. According to the change-of-variables formula, the log-likelihood of X can be expressed as:\nUsing such a transformation, the feature distribution of typical behavior is effectively modeled. Specifically, the bijective maps for the normal features and abnormal features are f : X^n ↔ Z^n and f : X^a ↔ Z^a, respectively. Here, the log-likelihoods of Z^n and Z^a are as follows:\nwhere Con is a constant, and u_n and u_z are the centers of the Gaussian distributions (|u_n − u_z| ≫ 0), respectively. During training, the normalizing flow is optimized to increase the log-likelihood of the skeleton features F_s with the following loss:\nDuring inference, the testing skeleton snippet feature F_s^i is sent to the trained normalizing flow, which outputs the typicality anomaly score as follows:\nwhere the normal skeletons will exhibit low S_i^t, while the anomalies will exhibit higher S_i^t. Our approach differs significantly from STG-NF [6]. STG-NF takes low-level skeleton coordinates as inputs and only learns implicit spatio-temporal features, which struggle to generalize to new datasets without the normality reference of training data from the target dataset. In contrast, we use action semantics as a generalizable representation for normalizing flow input and leverage experiential typicality labels to learn domain-general boundaries between normal and abnormal behavior.\nC. 
Test-time Uniqueness Analysis # The goal of this component is to serve as a complementary perspective to typicality, deriving scene-adaptive boundaries by considering the context of the target scene. To this end, we propose a context uniqueness analysis module for inference on the unseen VAD dataset.\nUnlike action recognition datasets, surveillance videos contain richer contextual information, featuring longer temporal spans, larger numbers of people, and more diverse behavioral patterns. For such a video, H skeleton sequences {X_1, \u0026hellip;, X_H} are extracted, where each sequence comprises L_i-frame poses, represented as X_i = {P_1, \u0026hellip;, P_{L_i}}. Here, P_t ∈ R^{J×2} comprises J keypoints, each defined by a pair of coordinate values. Targeted at frame-level anomaly scoring, the sequences are segmented into shorter skeleton snippets, denoted as A_i ∈ R^{C×J×T}, each of which is then individually scored based on its contextual information.\nSpatio-temporal context. As shown in Fig. 2, to gain a fine-grained context understanding of the scene, we construct two types of spatio-temporal context graphs: a cross-person graph G^c and a self-inspection graph G^s. The first graph is constructed by retrieving the feature nearest neighbors among the surrounding skeleton snippets, while the second one is constructed by retrieving the feature nearest neighbors from different time segments of the current person. In this way, we can filter out some unrelated activities and focus solely on behaviors related to the current individual. Given a skeleton snippet A_i with feature F_s^i, the cross-person graph is defined as G_i^c = {V_i^c, E_i^c}, where V_i^c = {A_i, N_c(A_i)} denotes the node set and E_i^c = {(i, j) | j ∈ N_c} denotes the edge set. Besides, during the preprocessing of skeleton snippets, A_i is associated with a human trajectory index p_i and timestamp t_i. The neighborhood N_c is formulated as:\nwhere d(·) represents the Euclidean distance, and D_c^k refers to the k-th smallest value among the cross-person distances. The second graph, which depicts self-inspection, is defined as G_i^s = {V_i^s, E_i^s}, where V_i^s = {A_i, N_s(A_i)} denotes the node set and E_i^s = {(i, j) | j ∈ N_s} denotes the edge set. Then, the neighborhood N_s is formulated as:\nwhere D_s^k refers to the k-th smallest value among the self-inspection distances, and α is a threshold that masks out a period of time before and after the current time window, as the individual\u0026rsquo;s state tends to remain stable during adjacent periods.\nUniqueness scores. Since abnormal activities are rare, anomalies in real-world surveillance videos often differ from other activities in both spatial and temporal context, which is referred to as uniqueness. Based on the pre-trained discriminative skeleton features, uniqueness can be represented as the feature distances between a query node and other nodes in the built graph. Specifically, the uniqueness score S_i^u for individual i is obtained by taking the larger of the cross-person and self-inspection distances, formulated as follows:\nHolistic anomaly scoring. By integrating the complementary typicality scores S_i^t and uniqueness scores S_i^u, our model can capture both typical anomalies in language prior and unique anomalies in spatio-temporal contexts. This helps gain a comprehensive understanding of anomalies in new scenes, where the holistic anomaly score of individual i is defined as:\nFinally, the frame-level anomaly scores are obtained by taking the highest score among all individuals within each frame. If any individual is considered anomalous, the entire frame is classified as anomalous. Frames in which no individuals are detected are classified as normal. 
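
The uniqueness scoring and frame-level aggregation described in this subsection can be sketched as follows. This is a hedged illustration rather than the authors' code, and it rests on several assumptions: the distance to each context graph is taken as the mean Euclidean distance to the k nearest neighbours, cross-person candidates are all snippets from other trajectories, self-inspection candidates are snippets of the same trajectory more than `alpha` time steps away, and frames without detections receive the minimum of the remaining frame-level scores. All function names are hypothetical.

```python
import numpy as np

def knn_mean_distance(query, candidates, k):
    # Mean Euclidean distance from query to its k nearest candidates.
    if len(candidates) == 0:
        return 0.0
    dists = np.linalg.norm(candidates - query, axis=1)
    k = min(k, len(dists))
    return float(np.sort(dists)[:k].mean())

def uniqueness_scores(feats, person_ids, timestamps, k=16, alpha=4):
    # feats: (N, D) snippet features; person_ids, timestamps: (N,) arrays.
    scores = np.zeros(len(feats))
    for i in range(len(feats)):
        same_person = person_ids == person_ids[i]
        far_in_time = np.abs(timestamps - timestamps[i]) > alpha
        cross = knn_mean_distance(feats[i], feats[~same_person], k)
        self_insp = knn_mean_distance(feats[i], feats[same_person & far_in_time], k)
        # Keep the larger of the cross-person and self-inspection distances.
        scores[i] = max(cross, self_insp)
    return scores

def frame_level_scores(person_scores, frame_ids, num_frames):
    # Highest per-person score in each frame; empty frames get the minimum.
    frame_scores = np.full(num_frames, -np.inf)
    np.maximum.at(frame_scores, frame_ids, person_scores)
    empty = np.isneginf(frame_scores)
    if empty.all():
        return np.zeros(num_frames)
    frame_scores[empty] = frame_scores[~empty].min()
    return frame_scores

# Usage: three people behave alike, a fourth differs.
feats = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
person_ids = np.array([0, 1, 2, 3])
timestamps = np.zeros(4, dtype=int)
scores = uniqueness_scores(feats, person_ids, timestamps, k=2, alpha=4)
per_frame = frame_level_scores(np.array([0.2, 0.9, 0.5]), np.array([0, 0, 2]), 4)
```

In the usage example the fourth person receives the largest uniqueness score because none of its nearest neighbours are close in feature space, and the two frames without detections inherit the lowest frame-level score.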
For such frames without detections, the anomaly score is assigned the minimum value among all scores in that video, following the approach in [6].\nTABLE II THE DETAILS OF OUR ZERO-SHOT VIDEO ANOMALY DETECTION BENCHMARKS. EACH SNIPPET CONTAINS 16 FRAMES OF SKELETON DATA WITH A 1-FRAME INTERVAL, WHILE THE SNIPPETS OF UCF-CRIME* ARE SAMPLED WITH A 16-FRAME INTERVAL AS ITS VIDEOS ARE TOO LONG.\n| Dataset | Resolution | Test Video Num. | Scenes Num. | Snippet Num. |\n| --- | --- | --- | --- | --- |\n| ShanghaiTech [1] | 480×856 | 107 | 13 | 156,571 |\n| UBnormal [22] | 720×1280 | 211 | 29 | 315,416 |\n| NWPU [23] | multiple | 242 | 43 | 723,490 |\n| UCF-Crime [2] | 240×320 | 290 | \u0026gt;50 | 152,231* |\nIV. EXPERIMENTS # A. Dataset and Implementation Details # Dataset. The training of our model is conducted on the Kinect-400-skeleton dataset [37], [38], while the ZS-VAD capability of our model is evaluated on four large-scale VAD datasets: ShanghaiTech [1], UBnormal [22], NWPU [23] and UCF-Crime [2]. Note that we only use the test set of these four VAD datasets.\nFor the training of our model, we use the external Kinect-400-skeleton [38] dataset. It is intended not for VAD tasks but for action recognition, and is gathered from YouTube videos covering 400 action classes. We utilize the preprocessed skeleton data obtained from ST-GCN [38] for training. For evaluation, we take four VAD-relevant datasets. Compared to some early VAD benchmarks [43], [44] that involve single scenes staged and captured at one location, the four datasets we evaluate on are more extensive, encompassing a wider variety of scenes. Consequently, these four datasets are better suited for testing the model\u0026rsquo;s zero-shot capabilities and assessing its cross-scenario performance. The details are summarized in Table II and the following descriptions. (1) ShanghaiTech. It is a widely-used benchmark for one-class video anomaly detection, which consists of 330 training videos and 107 test videos from 13 different scenes. (2) UBnormal. 
It is a synthetic dataset with virtual objects and real-world environments. It consists of 186 training videos and 211 test videos from 29 different scenes. (3) NWPU. It is a newly published dataset that contains some scene-dependent anomaly types. It comprises 305 training videos and 242 testing videos from 43 scenes. (4) UCF-Crime. It is a large-scale dataset with 1900 long untrimmed surveillance videos. The 290 testing videos are used for our evaluation.\nTABLE III ZERO-SHOT VIDEO ANOMALY DETECTION PERFORMANCE ON THE FOUR LARGE-SCALE DATASETS, SHANGHAITECH, UBNORMAL, NWPU, AND UCF-CRIME, WHERE THE SUBSCRIPT DENOTES THE NUMBER OF SCENES.\n| Type | Method | Training VAD | SHT_13 | UB_29 | NWPU_43 | UCFC_\u0026gt;50 |\n| --- | --- | --- | --- | --- | --- | --- |\n| LVLM | Imagebind [45] | ✘ | - | - | - | 55.8 |\n| LVLM | LAVAD [35] | ✘ | - | - | - | 80.3 |\n| Frame/Object | HF2-VAD [4] | SHT | 76.2 | 59.5 | 58.3 | 52.9 |\n| Frame/Object | Jigsaw-VAD [3] | SHT | 84.3 | 58.6 | 61.1 | 53.3 |\n| Skeleton | MocoDAD [19] | SHT | 77.6 | 67.0 | 56.4 | 51.8 |\n| Skeleton | MocoDAD [19] | UB | 76.0 | 68.4 | 56.6 | 52.0 |\n| Skeleton | STG-NF [6] | SHT | 85.9 | 68.8 | 57.6 | 51.6 |\n| Skeleton | STG-NF [6] | UB | 83.0 | 71.8 | 57.9 | 51.9 |\n| Skeleton | Ours | ✘ | 84.1 | 74.5 | 62.1 | 62.7 |\nImplementation Details. For a fair comparison, we directly use the skeleton data of ShanghaiTech and UBnormal from STG-NF [6]. For NWPU and UCF-Crime, as they do not have open-source skeleton data, we resort to utilizing AlphaPose [21] for data extraction. We use a segment window T = 16 and a stride of 1 to divide each sequence into snippets. Specifically, we use a stride of 16 for UCF-Crime because its videos are too long. For the backbone, we use multi-scale CTR-GCN [46] (2.1M) as the skeleton encoder and use a 4-layer feature normalizing flow [47] (2.9M) to model the normality probability. During training, we use the \u0026ldquo;ViT-B/32\u0026rdquo; CLIP [33] as the text encoder, and GPT-3.5-Turbo as our knowledge engine. During inference, these two models are removed. For the hyperparameters, the batch size is set to 1024, and the Adam optimizer is used with a learning rate of 0.0005. 
Additionally, β_n, β_a, k, and α are set to 90%, 10%, 16, and 4, respectively. For evaluation, we follow common practice [1], [2], [6] and report the micro-average frame-level AUC, which is computed by concatenating the frames of all test videos and calculating a single score.\nB. Main Results # We conduct a comprehensive zero-shot comparison against frame/object-based [3], [4], skeleton-based [6], [19], and LVLM-based methods [35], [45].\nComparison with frame/object-based methods. We use their open-source checkpoints trained on ShanghaiTech to evaluate the zero-shot performance on the remaining three VAD datasets. As shown in Table III, their generalization to new scene datasets is relatively poor due to the influence of human appearance and background variations.\nComparison with skeleton-based methods. We use their open-source checkpoints trained on ShanghaiTech or UBnormal to evaluate on the remaining three VAD datasets. The performance of prevalent skeleton-based methods is still underwhelming because, without target training data, they lack an understanding of complex normal and abnormal behaviors. Compared with our baseline STG-NF, our proposed method improves the frame-level AUC-ROC by 1.1% on ShanghaiTech, 5.7% on UBnormal, 4.2% on NWPU, and 10.8% on UCF-Crime. We also compare against the full-shot setting, where target domain data is used for training. Table IV shows that our zero-shot approach achieves results comparable or even superior to state-of-the-art (SOTA) full-shot methods. To evaluate our method under the popular full-shot setting, we additionally train our normalizing flow only on VAD normal data to model the normal distribution and test it on the same domain. This variant (Ours-full) outperforms SOTA full-shot methods.\nTABLE IV OUR ZERO-SHOT PERFORMANCE VS. SOTA FULL-SHOT PERFORMANCE. WE ALSO PROVIDE A VERSION NAMED OURS-FULL TO EVALUATE OUR METHOD UNDER THE POPULAR FULL-SHOT SETTING.\n| Setting | Method | Training VAD | SHT | UB |\n| --- | --- | --- | --- | --- |\n| zero-shot | Ours | ✘ | 84.1 | 74.5 |\n| full-shot | HF2-VAD [4] | SHT / UB | 76.2 | - |\n| full-shot | Jigsaw-VAD [3] | SHT / UB | 84.3 | - |\n| full-shot | SSMTL++ [48] | SHT / UB | 83.8 | 62.1 |\n| full-shot | GEPC [25] | SHT / UB | 76.1 | 53.4 |\n| full-shot | MocoDAD [19] | SHT / UB | 77.6 | 68.4 |\n| full-shot | STG-NF [6] | SHT / UB | 85.9 | 71.8 |\n| full-shot | FG-Diff [26] | - / UB | - | 68.9 |\n| full-shot | Ours-full | SHT / UB | 86.0 | 78.2 |\nComparison with LVLM-based methods. With the advancements in Large Vision-Language Models (LVLMs) [49]–[51], LAVAD [35] proposes a zero-shot video anomaly detection (ZS-VAD) framework. However, it relies on multistage reasoning and the coordination of multiple large models with over 13 billion (B) parameters, posing challenges for widespread deployment. In contrast, we develop a lightweight zero-shot anomaly detector with a mere 5.0 million (M) parameters, less than one two-thousandth of LAVAD\u0026rsquo;s parameter count.\nTABLE V ABLATION EXPERIMENTS OF THE TYPICALITY MODELING MODULE.\n| Experiments | SHT | UB | NWPU | UCFC |\n| --- | --- | --- | --- | --- |\n| (a) ours w/o aligning | 83.2 | 69.4 | 60.4 | 59.5 |\n| (b) ours w/o selection | - | 64.1 | 61.9 | 56.7 |\n| (c) ours w/o NF | 83.4 | 72.2 | 62.6 | 60.5 |\n| (d) prompt score | 81.3 | 64.4 | 61.5 | 61.0 |\n| (e) ours | 84.1 | 74.5 | 62.1 | 62.7 |\nC. Ablation Study # Ablation of typicality module. We conduct ablation experiments on the typicality modeling module with the following settings: (a) removing the aligning stage, training an STG-NF [6] network with our typicality data; (b) removing the collection phase, training with VAD source data (SHT); (c) removing the normalizing flow and calculating typicality scores using k-nearest neighbors distance techniques. As shown in Table V, the model performs poorly without the aligning stage, as it fails to learn generalizable and discriminative semantic representations. Moreover, performance deteriorates without the selection of typicality action knowledge, as the model can only learn a limited normality boundary from the VAD source data. 
Furthermore, without the normalizing flow (NF), the model also loses flexibility in modeling the distribution of typical behaviors.\nAblation of uniqueness module. We ablate the uniqueness scores and the holistic scores in this part. As demonstrated in Table VI, when only using the cross-person distance, the model can identify contextual anomalies with acceptable performance. When combined with the self-inspection score, the model can also spot changes in motion states, which helps detect a wider range of anomalies. The reason for the suboptimal performance of the uniqueness score on UBnormal is that UBnormal is a synthetic dataset where some videos contain only one person with relatively short movement durations, which does not align well with real surveillance video scenarios. By integrating both the typicality and uniqueness modules, our approach achieves the best overall performance.\nFig. 3. Example results of our method that succeed in capturing typical anomalies. STG-NF [6] fails to detect the \u0026ldquo;jumping in the street\u0026rdquo; event, while ours performs well through action typicality learning. Each individual (blue skeleton) has a predicted anomaly score (red font), where the frame-level score (orange line) is defined as the maximum among all individuals in that frame.\nFig. 4. Example results of our method that succeed in capturing unique anomalies. STG-NF misclassifies unseen normal events during periods where \u0026ldquo;riding\u0026rdquo; occurs as anomalies, whereas our method correctly identifies them as normal by recognizing their contextual similarity. Moreover, STG-NF fails to detect the \u0026ldquo;photographing in restricted areas\u0026rdquo; anomaly, while our approach successfully identifies it by recognizing a sudden change in the person\u0026rsquo;s movement trajectory.\nTABLE VI ABLATION STUDY OF THE UNIQUENESS ANALYSIS MODULE AND HOLISTIC ANOMALY SCORING.\n| Typ. Cross Self | SHT | UB | NWPU | UCFC |\n| --- | --- | --- | --- | --- |\n| ✓ ✓ | 81.9 | 73.2 | 62.1 | 59.6 |\n| ✓ ✓ | 81.9 | 62.9 | 60.7 | 59.9 |\n| ✓ ✓ | 67.8 | 60.1 | 61.0 | 62.6 |\n| ✓ ✓ | 82.0 | 64.5 | 61.7 | 61.0 |\n| ✓ ✓ ✓ | 84.1 | 74.5 | 62.1 | 62.7 |\nTABLE VII COMPARISON WITH PROMPT-BASED METHODS.\n| Experiments | SHT | UB | NWPU | UCFC |\n| --- | --- | --- | --- | --- |\n| (a) normal prompts | 81.3 | 64.4 | 61.5 | 61.0 |\n| (b) abnormal prompts | 80.7 | 63.6 | 61.0 | 60.2 |\n| (c) ensemble prompts | 80.7 | 63.9 | 61.1 | 61.3 |\n| (d) ours | 84.1 | 74.5 | 62.1 | 62.7 |\nComparison with prompt-based methods. Since prompt-based techniques have been popular in other zero-shot tasks [11], [36], we conduct experiments to compare our typicality module with them. To this end, we design typical normal prompts, typical abnormal prompts, and ensemble prompts, then use the skeleton-prompt similarity as the anomaly score. In detail, we use a normal prompt list: [\u0026ldquo;usual\u0026rdquo;, \u0026ldquo;normal\u0026rdquo;, \u0026ldquo;daily\u0026rdquo;, \u0026ldquo;stable\u0026rdquo;, \u0026ldquo;safe\u0026rdquo;], and an abnormal prompt list: [\u0026ldquo;danger\u0026rdquo;, \u0026ldquo;violence\u0026rdquo;, \u0026ldquo;suddenness\u0026rdquo;, \u0026ldquo;unusual\u0026rdquo;, \u0026ldquo;instability\u0026rdquo;]. The prompts are encoded into text features, and their similarity with the skeleton features is computed and combined with our uniqueness analysis. As shown in Table V (d) and Table VII, the results are suboptimal. Unlike the varied forms of text seen in CLIP image-text alignment, the current skeleton-text alignment scheme has only encountered action class names as text, so its alignment capability for prompt text is weak. Our method, on the other hand, distills the LLM\u0026rsquo;s knowledge and learns the typicality distribution, avoiding the direct use of skeleton-prompt similarity as the anomaly score.\nTABLE VIII ABLATION STUDIES OF THE BACKBONE NETWORKS. THE PERFORMANCE IS EVALUATED ON UBNORMAL / NWPU / UCF-CRIME DATASETS.\n| Method | Backbone | Params. | Performance |\n| --- | --- | --- | --- |\n| Jigsaw [3] | 3D-Conv | 1.5M | 58.6 / 61.1 / 53.3 |\n| HF2-VAD [4] | MemAE+C-VAE | 36.4M | 59.5 / 58.3 / 52.9 |\n| STG-NF [6] | STG-NF | 0.7K | 68.8 / 57.6 / 51.6 |\n| Ours | STG-NF | 0.7K | …0 / 60.1 / 58.5 |\n| Ours | STG-CN + NF | 4.1M | 73.1 / 61.3 / 61.0 |\n| Ours | CTR-GCN + NF | 5.0M | 74.5 / 62.1 / 62.7 |\nAblation of backbone. In this part, we ablate our backbone. For a fair comparison with our baseline STG-NF [6], we attempt to use STG-NF as the backbone. However, STG-NF takes XY-coordinates as input, with only 2 dimensions, making it extremely lightweight yet unable to learn high-level features easily. We therefore train STG-NF on the skeleton coordinates of our typicality data to obtain the typicality score. (Note that the results in Table V (a) also consider the uniqueness score, resulting in higher performance.) As shown in Table VIII, using STG-NF as our backbone still performs better than vanilla STG-NF, highlighting the effectiveness of our typicality training. In addition, when we switch our backbone from CTR-GCN [46] to STG-CN [38], the model becomes more lightweight and the performance remains good.\nHyper-parameter ablations. We ablate the number of nearest neighbors (NN) k and the masking threshold α in the uniqueness analysis module. As shown in Table IX, our method is robust to these two hyper-parameters. Choosing an appropriate k can filter out some unrelated activities and focus solely on behaviors related to the current skeleton snippets. In addition, taking the average of the k neighbors helps suppress noise, which also makes our model insensitive to α.\nWe also ablate the hyperparameters in the typicality knowledge selection step. 
As shown in Table X, the optimal values of β_n and β_a are 0.9 (90%) and 0.1 (10%), respectively. A smaller β_a can enhance performance by filtering out noisy data and normal snippets within the anomalous sequences.\nTABLE IX ABLATION RESULTS OF TWO MAIN HYPER-PARAMETERS.\n| (k, α) | SHT | UB | (k, α) | SHT | UB |\n| --- | --- | --- | --- | --- | --- |\n| (1, 1) | 83.9 | 74.2 | (16, 1) | 84.2 | 74.2 |\n| (4, 1) | 84.1 | 74.3 | (16, 2) | 84.2 | 74.4 |\n| (16, 1) | 84.1 | 74.5 | (16, 3) | 84.0 | 74.5 |\n| (64, 1) | 83.7 | 74.7 | (16, 4) | 84.1 | 74.5 |\nTABLE X HYPER-PARAMETER SENSITIVITY OF THE SELECTION RATIO.\n| (β_n, β_a) | SHT | UB | (β_n, β_a) | SHT | UB |\n| --- | --- | --- | --- | --- | --- |\n| (0.1, 0.1) | 82.8 | 68.3 | (0.9, 0.1) | 84.1 | 74.5 |\n| (0.3, 0.1) | 83.0 | 70.9 | (0.9, 0.3) | 84.1 | 73.3 |\n| (0.5, 0.1) | 83.4 | 74.1 | (0.9, 0.5) | 84.1 | 72.3 |\n| (0.7, 0.1) | 83.9 | 74.3 | (0.9, 0.7) | 83.5 | 72.0 |\n| (0.9, 0.1) | 84.1 | 74.5 | (0.9, 0.9) | 82.5 | 71.8 |\nD. Visualization Results # Fig. 3 presents the qualitative localization results of typical anomalies. STG-NF fails to detect the \u0026ldquo;jumping\u0026rdquo; anomaly because its low-level skeleton representation cannot separate subtle anomaly patterns from the normal patterns they resemble. Disturbed by the skeletal noise of frame 350, it erroneously identifies the anomaly\u0026rsquo;s position. In contrast, our model maps the skeleton snippets to a high-level space with generalizable and discriminative semantic information, allowing it to identify the anomaly based on our trained decision boundary. In surveillance scenarios lacking training samples, our model can still be effectively utilized to detect certain typical abnormal behaviors.\nFig. 4 shows the qualitative localization results of unique anomalies. Existing skeleton-based methods rely on the source normal data for training. When the source domain does not include some novel behaviors that appear in the target domain, these behaviors will be classified as anomalies. Consequently, STG-NF erroneously localizes the anomaly during time periods when \u0026ldquo;riding\u0026rdquo; is present. 
In contrast, our model can analyze the spatio-temporal differences and establish scene-adaptive decision boundaries. Since \u0026ldquo;riding\u0026rdquo; occurs multiple times in the video, its uniqueness score is low. On the other hand, \u0026ldquo;photographing in restricted areas\u0026rdquo; exhibits significant differences from the surrounding people\u0026rsquo;s behavior and appears as a sudden change in the person\u0026rsquo;s movement trajectory, resulting in a corresponding increase in its anomaly score.\nV. CONCLUSION # In this paper, we identify the advantages of the skeleton-based approach in ZS-VAD and introduce a novel framework that generalizes to various target scenes through typicality and uniqueness learning. First, we propose a language-guided typicality modeling module that effectively learns the typical distribution of normal and abnormal behavior. Second, we propose a test-time uniqueness analysis module to derive scene-adaptive boundaries. Experiments demonstrate the effectiveness of our model. Limitations and future work: Our work aims to exploit the full potential of skeleton data in the ZS-VAD task, while the complex relationships between behaviors and scenes will be left for future work.\nREFERENCES # [1] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection–a new baseline,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6536–6545.\n[2] W. Sultani, C. Chen, and M. Shah, \u0026ldquo;Real-world anomaly detection in surveillance videos,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6479–6488.\n[3] G. Wang, Y. Wang, J. Qin, D. Zhang, X. Bao, and D. Huang, \u0026ldquo;Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles,\u0026rdquo; in European Conference on Computer Vision. Springer, 2022, pp. 494–511.\n[4] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. 
Li, \u0026ldquo;A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 588–13 597.\n[5] C. Tang, S. Zhou, Y. Li, Y. Dong, and L. Wang, \u0026ldquo;Advancing pre-trained teacher: towards robust feature discrepancy for anomaly detection,\u0026rdquo; arXiv preprint arXiv:2405.02068, 2024.\n[6] O. Hirschorn and S. Avidan, \u0026ldquo;Normalizing flows for human pose anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13 545–13 554.\n[7] L. Wang, J. Tian, S. Zhou, H. Shi, and G. Hua, \u0026ldquo;Memory-augmented appearance-motion network for video anomaly detection,\u0026rdquo; Pattern Recognition, vol. 138, p. 109335, 2023.\n[8] M. Cho, M. Kim, S. Hwang, C. Park, K. Lee, and S. Lee, \u0026ldquo;Look around for anomalies: weakly-supervised anomaly detection via context-motion relational learning,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 12 137–12 146.\n[9] H. Shi, L. Wang, S. Zhou, G. Hua, and W. Tang, \u0026ldquo;Abnormal ratios guided multi-phase self-training for weakly-supervised video anomaly detection,\u0026rdquo; IEEE Transactions on Multimedia, 2023.\n[10] Y. Liu, S. Li, Y. Zheng, Q. Chen, C. Zhang, and S. Pan, \u0026ldquo;Arc: A generalist graph anomaly detector with in-context learning,\u0026rdquo; arXiv preprint arXiv:2405.16771, 2024.\n[11] J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, \u0026ldquo;Winclip: Zero-/few-shot anomaly classification and segmentation,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 606–19 616.\n[12] Z. Gu, B. Zhu, G. Zhu, Y. Chen, H. Li, M. Tang, and J. 
Wang, \u0026ldquo;Filo: Zero-shot anomaly detection by fine-grained description and highquality localization,\u0026rdquo; arXiv preprint arXiv:2404.13671, 2024.\n[13] Y. Cao, J. Zhang, L. Frittoli, Y. Cheng, W. Shen, and G. Boracchi, \u0026ldquo;Adaclip: Adapting clip with hybrid learnable prompts for zero-shot anomaly detection,\u0026rdquo; arXiv preprint arXiv:2407.15795, 2024.\n[14] Q. Zhou, G. Pang, Y. Tian, S. He, and J. Chen, \u0026ldquo;AnomalyCLIP: Objectagnostic prompt learning for zero-shot anomaly detection,\u0026rdquo; in The Twelfth International Conference on Learning Representations, 2024.\n[15] A. Aich, K.-C. Peng, and A. K. Roy-Chowdhury, \u0026ldquo;Cross-domain video anomaly detection without target domain adaptation,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , 2023, pp. 2579–2591.\n[16] J. Micorek, H. Possegger, D. Narnhofer, H. Bischof, and M. Kozinski, \u0026ldquo;Mulde: Multiscale log-density estimation via denoising score matching for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 868– 18 877.\n[17] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, \u0026ldquo;Learning regularity in skeleton trajectories for anomaly detection in videos,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 996–12 004.\n[18] S. Yu, Z. Zhao, H. Fang, A. Deng, H. Su, D. Wang, W. Gan, C. Lu, and W. Wu, \u0026ldquo;Regularity learning via explicit distribution modeling for skeletal video anomaly detection,\u0026rdquo; IEEE Transactions on Circuits and Systems for Video Technology, 2023.\n[19] A. Flaborea, L. Collorone, G. M. D. Di Melendugno, S. D\u0026rsquo;Arrigo, B. Prenkaj, and F. 
Galasso, \u0026ldquo;Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 318–10 329.\n[20] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, \u0026ldquo;Openpose: Realtime multi-person 2d pose estimation using part affinity fields,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 1, pp. 172–186, 2019.\n[21] H.-S. Fang, J. Li, H. Tang, C. Xu, H. Zhu, Y. Xiu, Y.-L. Li, and C. Lu, \u0026ldquo;Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time,\u0026rdquo; IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 7157–7173, 2022.\n[22] A. Acsintoae, A. Florescu, M.-I. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah, \u0026ldquo;Ubnormal: New benchmark for supervised open-set video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 20 143–20 153.\n[23] C. Cao, Y. Lu, P. Wang, and Y. Zhang, \u0026ldquo;A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 20 392–20 401.\n[24] S. Sun and X. Gong, \u0026ldquo;Hierarchical semantic contrast for scene-aware video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22 846–22 856.\n[25] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, \u0026ldquo;Graph embedded pose clustering for anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 539–10 547.\n[26] X. Tan, H. Wang, X. Geng, and L. 
Wang, \u0026ldquo;Frequency-guided diffusion model with perturbation training for skeleton-based video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2412.03044, 2024. [27] D. P. Kingma and P. Dhariwal, \u0026ldquo;Glow: Generative flow with invertible 1x1 convolutions,\u0026rdquo; Advances in neural information processing systems , vol. 31, 2018. [28] R. Wu, Y. Chen, J. Xiao, B. Li, J. Fan, F. Dufaux, C. Zhu, and Y. Liu, \u0026ldquo;Da-flow: Dual attention normalizing flow for skeleton-based video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2406.02976, 2024. [29] A. Li, C. Qiu, M. Kloft, P. Smyth, M. Rudolph, and S. Mandt, \u0026ldquo;Zeroshot anomaly detection via batch normalization,\u0026rdquo; Advances in Neural Information Processing Systems, vol. 36, 2024. [30] T. Aota, L. T. T. Tong, and T. Okatani, \u0026ldquo;Zero-shot versus manyshot: Unsupervised texture anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , 2023, pp. 5564–5572. [31] X. Chen, Y. Han, and J. Zhang, \u0026ldquo;A zero-/fewshot anomaly classification and segmentation method for cvpr 2023 vand workshop challenge tracks 1\u0026amp;2: 1st place on zero-shot ad and 4th place on few-shot ad,\u0026rdquo; arXiv preprint arXiv:2305.17382, vol. 2, no. 4, 2023. [32] A. Miyai, J. Yang, J. Zhang, Y. Ming, Y. Lin, Q. Yu, G. Irie, S. Joty, Y. Li, H. Li et al., \u0026ldquo;Generalized out-of-distribution detection and beyond in vision language model era: A survey,\u0026rdquo; arXiv preprint arXiv:2407.21794, 2024. [33] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in International conference on machine learning. PMLR, 2021, pp. 8748–8763. [34] D. Guo, Y. Fu, and S. 
Li, \u0026ldquo;Ada-vad: Domain adaptable video anomaly detection,\u0026rdquo; in Proceedings of the 2024 SIAM International Conference on Data Mining (SDM). SIAM, 2024, pp. 634–642. [35] L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, \u0026ldquo;Harnessing large language models for training-free video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 527–18 536. [36] F. Sato, R. Hachiuma, and T. Sekii, \u0026ldquo;Prompt-guided zero-shot anomaly action recognition using pretrained deep skeleton features,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 6471–6480. [37] J. Carreira and A. Zisserman, \u0026ldquo;Quo vadis, action recognition? a new model and the kinetics dataset,\u0026rdquo; in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308. [38] S. Yan, Y. Xiong, and D. Lin, \u0026ldquo;Spatial temporal graph convolutional networks for skeleton-based action recognition,\u0026rdquo; in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018. [39] X. Liu, S. Zhou, L. Wang, and G. Hua, \u0026ldquo;Parallel attention interaction network for few-shot skeleton-based action recognition,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1379–1388. [40] Y. Wang, S. Zhou, K. Xia, and L. Wang, \u0026ldquo;Learning discriminative spatiotemporal representations for semi-supervised action recognition,\u0026rdquo; arXiv preprint arXiv:2404.16416, 2024. [41] M. Wang, J. Xing, and Y. Liu, \u0026ldquo;Actionclip: A new paradigm for video action recognition,\u0026rdquo; arXiv preprint arXiv:2109.08472, 2021. [42] W. Xiang, C. Li, Y. Zhou, B. Wang, and L. 
Zhang, \u0026ldquo;Generative action description prompts for skeleton-based action recognition,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 276–10 285. [43] C. Lu, J. Shi, and J. Jia, \u0026ldquo;Abnormal event detection at 150 fps in matlab,\u0026rdquo; in Proceedings of the IEEE international conference on computer vision , 2013, pp. 2720–2727. [44] W. Li, V. Mahadevan, and N. Vasconcelos, \u0026ldquo;Anomaly detection and localization in crowded scenes,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 1, pp. 18–32, 2013. [45] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, \u0026ldquo;Imagebind: One embedding space to bind them all,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 180–15 190. [46] Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu, \u0026ldquo;Channelwise topology refinement graph convolution for skeleton-based action recognition,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 359–13 368. [47] X. Yao, R. Li, J. Zhang, J. Sun, and C. Zhang, \u0026ldquo;Explicit boundary guided semi-push-pull contrastive learning for supervised anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24 490–24 499.\n[48] A. Barbalau, R. T. Ionescu, M.-I. Georgescu, J. Dueholm, B. Ramachandra, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, \u0026ldquo;Ssmtl++: Revisiting self-supervised multi-task learning for video anomaly detection,\u0026rdquo; Computer Vision and Image Understanding, vol. 229, p. 103656, 2023. [49] H. Liu, C. Li, Y. Li, and Y. J. Lee, \u0026ldquo;Improved baselines with visual instruction tuning,\u0026rdquo; 2023. [50] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. 
Tang et al., \u0026ldquo;Qwen2. 5-vl technical report,\u0026rdquo; arXiv preprint arXiv:2502.13923, 2025. [51] C. Tang, Z. Han, H. Sun, S. Zhou, X. Zhang, X. Wei, Y. Yuan, J. Xu, and H. Sun, \u0026ldquo;Tspo: Temporal sampling policy optimization for longform video language understanding,\u0026rdquo; arXiv preprint arXiv:2508.04369 , 2025. ","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/action-hints-semantic-typicality-and-context-uniqueness-for/","section":"Papers","summary":"Proposes a zero-shot skeleton-based video anomaly detection framework leveraging action semantic typicality and context uniqueness learning, utilizing language-guided semantic modeling and test-time scene-adaptive boundaries to improve generalization without target domain training data.","title":"Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection","type":"method"},{"content":" Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection # Canhui Tang, Sanping Zhou, Member, IEEE, Haoyue Shi, Le Wang, Senior Member, IEEE\nAbstract—Zero-Shot Video Anomaly Detection (ZS-VAD) requires temporally localizing anomalies without target domain training data, which is a crucial task due to various practical concerns, e.g., data privacy or new surveillance deployments. Skeleton-based approach has inherent generalizable advantages in achieving ZS-VAD as it eliminates domain disparities both in background and human appearance. However, existing methods only learn low-level skeleton representation and rely on the domain-limited normality boundary, which cannot generalize well to new scenes with different normal and abnormal behavior patterns. In this paper, we propose a novel zero-shot video anomaly detection framework, unlocking the potential of skeleton data via action typicality and uniqueness learning. 
Firstly, we introduce a language-guided semantic typicality modeling module that projects skeleton snippets into an action semantic space and distills an LLM\u0026rsquo;s knowledge of typical normal and abnormal behaviors during training. Secondly, we propose a test-time context uniqueness analysis module to finely analyze the spatio-temporal differences between skeleton snippets and then derive scene-adaptive boundaries. Without using any training samples from the target domain, our method achieves state-of-the-art results against skeleton-based methods on four large-scale VAD datasets: ShanghaiTech, UBnormal, NWPU, and UCF-Crime, featuring over 100 unseen surveillance scenes.\nIndex Terms—Video Anomaly Detection, Skeleton-based, Zero-shot, Action Semantic Typicality, Context Uniqueness.\nI. INTRODUCTION # Video Anomaly Detection (VAD) aims to temporally locate abnormal events and has wide applications in video surveillance and public safety [1], [2]. Current mainstream paradigms include one-class [3]–[7] and weakly supervised methods [2], [8], [9], which require abundant samples from the target video domain for training. However, in surveillance scenarios involving privacy or newly installed monitoring devices, training samples from the target domain are usually not available. Therefore, designing a Zero-Shot Video Anomaly Detection (ZS-VAD) method that can generalize to diverse target domains becomes necessary. 
Despite the recent extensive attention given to zero-shot image anomaly detection [10]–[14], the zero-shot setting in the complex surveillance video domain remains under-explored [15].\nThis work was supported in part by National Science and Technology Major Project under Grant 2023ZD0121300, National Natural Science Foundation of China under Grants 62088102, U24A20325 and 12326608, and Fundamental Research Funds for the Central Universities under Grant XTR042021005. (Corresponding author: Sanping Zhou, E-mail: spzhou@mail.xjtu.edu.cn.)\nCanhui Tang, Sanping Zhou, Haoyue Shi, and Le Wang are all with the National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi\u0026rsquo;an Jiaotong University, Shaanxi 710049, China.\nFig. 1. An illustration of skeleton-based VAD paradigm comparison. Previous approaches suffer from two main issues: (1) low-level representations and (2) domain-limited normal boundary. Our method enhances generalizability via action semantic typicality learning and context uniqueness analysis.\nThe challenges of ZS-VAD come from significant variations in visual appearance and human activities across different video domains. While frame/object-based methods [3], [4], [16] have been prominent in video anomaly detection, their performance degrades when adapting to new scenes due to visual feature distribution shifts. In contrast, skeleton-based methods [6], [17]–[19] utilize mature pose detection systems [20], [21] to obtain skeleton data, learn to encode features via self-supervised tasks [18], [19], and then calculate the anomaly score. They are effective for identifying human behavior anomalies and are popular in the VAD task due to their superior efficiency and performance. 
Skeleton-based methods also have inherent generalization advantages for ZS-VAD, as they eliminate domain disparities in both background and human appearance.\nHowever, as shown in Fig. 1, existing skeleton-based VAD methods still suffer from several limitations: (1) Low-level skeleton representations. They learn the normal distribution of skeleton patterns using self-supervised tasks, such as skeleton prediction [19], reconstruction [18], or coordinate-based normalizing flows [6]. Without semantic supervision signals, such methods fail to capture higher-level action patterns, which makes them sensitive to noise and unable to distinguish novel anomaly patterns that resemble normal patterns. (2) Domain-limited normality boundary. They blindly rely on training-data-defined normality boundaries, leading to the misclassification of unseen normal events as anomalies. Both limitations hinder their generalization to unseen scenes with varying normal and abnormal patterns. This leads to a question: \u0026ldquo;Can we further unlock the potential of skeleton data in ZS-VAD with generalizable representation learning and prior injection?\u0026rdquo;\nTo address this question, we reflect on how human observers judge normal and abnormal behavior in a new scenario. As shown in Fig. 1, we first identify the types of individual actions in the video and consider whether they are normal or abnormal based on our experiential knowledge of normality and abnormality, which is referred to as typicality. For instance, a pedestrian walking would be considered normal, while a fight or scuffle would be deemed abnormal. 
Secondly, for atypical normal or abnormal scenarios, we integrate the behaviors of all individuals in the video to observe if any individual\u0026rsquo;s behavior significantly differs from others, as anomalies are usually rare and unique, referred to as context uniqueness.\nBased on these complementary priors, we propose a novel skeleton-based zero-shot video anomaly detection framework, which captures both typical anomalies guided by language prior and unique anomalies in spatio-temporal contexts. First, we introduce a language-guided typicality modeling module to achieve high-level semantic understanding beyond previous low-level representations. Specifically, it projects skeleton snippets into language-aligned action semantic space and distills LLM\u0026rsquo;s knowledge of typical normal and abnormal behaviors during training. Secondly, to derive scene-adaptive boundaries, we propose a context uniqueness analysis module at test time. It finely analyzes the spatio-temporal differences between skeleton snippets to get an adaptive understanding of target scene activities. Without using any training samples from the target domain, we achieve state-of-the-art results on four large-scale VAD datasets: ShanghaiTech [1], UBnormal [22], NWPU [23], UCF-Crime [2], featuring over 100 unseen surveillance scenes. Our contributions are as follows:\nWe propose a skeleton-based video anomaly detection framework that learns action typicality and uniqueness, enabling generalization across diverse target scenes.\nWe propose a language-guided typicality modeling module that projects skeleton snippets into a generalizable semantic space and distills LLM\u0026rsquo;s knowledge of typical normal and abnormal behaviors during training.\nWe propose a test-time uniqueness analysis module to finely analyze the spatio-temporal differences between skeleton snippets and derive scene-adaptive boundaries between normal and abnormal behavior.\nThe rest of this paper is organized as follows. 
We review the related work in Section II. Section III describes the technical details of our proposed method. Section IV presents the experiment details and results. Finally, we summarize the paper in Section V.\nII. RELATED WORK # Video anomaly detection. Most previous video anomaly detection studies can be grouped into frame-based [1], [2], [4], object-centric [3], [16], [24], and skeleton-based methods [6], [19], [25]. In this work, we focus on the skeleton-based methods, which detect anomalies in human activity based on preprocessed skeleton/pose data. Morais et al. [17] propose an anomaly detection method that uses an RNN to learn the representation of pose snippets, with prediction errors serving as anomaly scores. GEPC [25] utilizes autoencoders to learn pose graph embeddings, generates soft assignments through clustering, and uses a Dirichlet process mixture to determine anomaly scores. To model normal diversity, MoCoDAD [19] leverages diffusion probabilistic models to generate multimodal future human poses. FG-Diff [26] guides the diffusion model with observed high-frequency information and prioritizes the reconstruction of low-frequency components, enabling more accurate and robust anomaly detection. STG-NF [6] proposes a simple yet effective method that fits a normalizing flow [27] to normal pose snippets to obtain normality boundaries. DA-Flow [28] proposes a lightweight dual attention module for capturing cross-dimension interaction relationships in spatio-temporal skeletal data. However, these methods rely on training with normal data from the target domain, while overlooking the semantic understanding of human behavior, which makes it difficult to ensure performance in scenarios where the target data is unavailable.\nZero-shot anomaly detection. Thanks to the development of vision-language models, zero-shot anomaly detection has received a lot of attention [10]–[12], [14], [29]–[31], especially in the field of image anomaly detection [32]. 
The pioneering work is WinCLIP [11], which utilizes CLIP [33]\u0026rsquo;s image-text matching capability to distinguish between unseen normal and abnormal samples. Building on that, AnomalyCLIP [14] proposes to learn object-agnostic text prompts that capture generic normal and abnormal patterns in an image. AdaCLIP [13] introduces two types of learnable prompts to enhance CLIP\u0026rsquo;s generalization ability for anomaly detection. Despite the success in the image domain, only a few works [15], [34] have ventured into zero-shot video anomaly detection, with underwhelming performance. Although a recent work [35] proposes to leverage large vision-language models for zero-shot video anomaly detection, it requires multi-stage reasoning and the collaboration of multiple large models, making it less user-friendly. We aim to develop a lightweight, user-friendly, and easily deployable zero-shot anomaly detector starting from skeleton data. Our work shares some similarities with a recent study [36]. However, we emphasize that our approach differs significantly from [36] in the following ways: 1) Different tasks: It addresses abnormal action recognition, involving no more than two individuals in a short video, while ours requires temporally localizing abnormal events in real surveillance videos. 2) Novel perspective: We combine the action typicality and uniqueness priors to address zero-shot anomaly detection challenges in video surveillance scenes.\nIII. METHOD # A. Overview # The objective of ZS-VAD is to train one model that can generalize to diverse target domains. Formally, let V train be a training set from the source video domain and {W test 1 , W test 2 , \u0026hellip;, W test N } be multiple test sets from the target video domains. The test videos are annotated at the frame level with labels l i ∈ {0, 1}, and the VAD model is required to predict each frame\u0026rsquo;s anomaly score. In this work, we focus on the skeleton-based paradigm, as it is computation-friendly and can benefit ZS-VAD by reducing the domain gap in both background and appearance.\nFig. 2. Overview of our approach for skeleton-based zero-shot video anomaly detection. I. Language-guided typicality modeling in the training phase. It projects skeleton snippets into the action semantic space, collects typicality knowledge from LLM, and then effectively learns the typical distribution of normal and abnormal behavior. (Only the black dashed boxes are used during inference.) II. Test-time uniqueness analysis in the inference phase. It finely analyzes the spatio-temporal differences between skeleton snippets and derives scene-adaptive boundaries between normal and abnormal behavior.\nFig. 2 overviews our proposed approach. Our model tackles the ZS-VAD problem from the perspective of action typicality and uniqueness learning. Firstly, to obtain a high-level semantic understanding, we propose a Language-Guided Typicality Modeling module that projects skeleton snippets into action semantic space and distills LLM\u0026rsquo;s knowledge of typical normal and abnormal behaviors during training. Secondly, to get scene-adaptive decision boundaries, we propose a Test-Time Uniqueness Analysis module that finely analyzes the spatio-temporal differences between skeleton snippets. During inference on unseen VAD datasets, our model integrates typicality scores and uniqueness scores of human behavior to provide a holistic understanding of anomalies.\nB. Language-guided Typicality Modeling # Unlike previous works that learn low-level skeleton representations via self-supervised tasks [6], [18], [19], this module aims to obtain a high-level semantic understanding of human behavior. It learns language-aligned action features and scene-generic distributions of typical behaviors by distilling the LLM\u0026rsquo;s knowledge during training. 
Specifically, this module consists of skeleton-text alignment, typicality knowledge selection, and typicality distribution learning. During inference, it can predict typicality anomaly scores with only a lightweight skeleton encoder and a normalizing flow module.\nSkeleton-text alignment. To achieve a generalizable semantic understanding of human behavior, we first propose to align the skeleton snippets with the corresponding semantic labels. To obtain such skeleton-text pairs, we utilize an external action recognition dataset (e.g., Kinetics [37]) as the training set instead of specific VAD datasets (e.g., ShanghaiTech). The raw skeleton data of an action video is formally represented as Xi ∈ R C×J×L×M , where C is the number of coordinates, J is the number of joints, L is the sequence length, and M is the pre-defined maximum number of persons. In addition, each video is annotated with a text label gi representing the action class, which can also be transformed into a one-hot class vector yi .\nCompared to action recognition tasks [38]–[40] that only predict video-level categories, the VAD task is more fine-grained, focusing on frame-level anomaly scores. Therefore, we decompose the original sequences into multiple short skeleton snippets Ai ∈ R C×J×T using a sliding window, where T is the length of a snippet, and discard snippets that are composed of zeros. Snippets from the same action video share the same label and undergo a normalization operation, as in STG-NF [6], to make different snippets independent. Inspired by recent multimodal alignment works [41], [42], we then perform a skeleton-text alignment pretraining procedure to learn a discriminative representation. The procedure is built with a skeleton encoder E s and a text encoder E t , which generate skeleton features F s and text features F t , respectively. Additionally, the skeleton encoder also predicts a probability vector ŷ i using a fully-connected layer. 
The training loss consists of a KL divergence loss and a cross-entropy classification loss following GAP [42]. This skeleton-text alignment procedure effectively guides the projection of skeleton snippets into a language-aligned action semantic space, going beyond previous VAD works [6], [18], [19] that learn low-level skeleton representations.\nTypicality knowledge selection. In most video surveillance scenarios, some behaviors are generally considered normal or abnormal, which constitute a scene-generic set. Therefore, training a typicality-aware capability is one of the promising ways to achieve ZS-VAD. Thanks to the cutting-edge advancements of Large Language Models (LLMs), we propose to distill their prior knowledge about generic normality and abnormality during training. Based on the pre-trained skeleton-text representation, we aim to use an LLM as our knowledge engine to collect typical normal and abnormal data from the massive skeleton snippets. In detail, we give the large model a prompt P: \u0026ldquo;In most video surveillance scenarios, what are generally considered as normal actions and abnormal actions among these actions (Please identify the 20 most typical normal actions and 20 most typical abnormal actions, ranked in order of decreasing typicality). The action list is \u0026lt;T\u0026gt;\u0026rdquo;, where T refers to the set of all action class labels in the prepared action recognition dataset [37].\nTABLE I THE GENERATED TYPICALITY LABELS. (Columns: Type, Typicality action list; rows: Normal, Abnormal.)\nThe large language model will respond with a list of typical normal action classes T n and a list of typical abnormal action classes T a , which can be formalized as:\nwhere T n and T a are subsets of T , and OLLM denotes the offline LLM used for initial typicality label generation. 
Note that the LLM is used only once, during training, for auxiliary data selection; it is not needed at inference.\nAfter obtaining the typical action categories, we first collect the data of these selected categories and then select the high-quality snippets from them. This is because 1) some snippets contain noise, such as errors in pose detection and tracking, and 2) in an abnormal action sequence, not all the snippets are abnormal. Therefore, we use the skeleton-text similarity score to select the high-quality skeleton snippets, which is formulated as:\nwhere M x refers to the index set of the selected snippets, gi denotes the text label of snippet i, and β denotes the selection ratio. The superscript x represents n or a, indicating normal and abnormal, respectively. Using the index M x , we obtain the corresponding skeleton data Ã n and Ã a , as well as skeleton features F̃ s n and F̃ s a .\nTypicality distribution learning. As shown in Fig. 2, after obtaining the data, we proceed to model the feature distribution of typical behavior. Normalizing Flow (NF) [27] provides a robust framework for modeling feature distributions, transforming a distribution through a series of invertible and differentiable operations. Consider a random variable X ∈ R D with target distribution pX(x), and a random variable Z that follows a spherical multivariate Gaussian distribution. A bijective map f : X ↔ Z is then introduced, which is composed of a sequence of transformations: f1 ◦ f2 ◦ \u0026hellip; ◦ fK. According to the change-of-variables formula, the log-likelihood of X can be expressed as:\nlog pX(x) = log pZ(f(x)) + log |det(∂f(x)/∂x)|\nUsing such a transformation, the feature distribution of typical behavior is effectively modeled. Specifically, the bijective maps for the normal features and abnormal features are f : X n ↔ Z n and f : X a ↔ Z a , respectively. 
Here, the log-likelihoods of Z n and Z a are as follows:\nwhere Con is a constant, and u n and u z are the centers of the Gaussian distributions (|u n − u z | ≫ 0). During training, the normalizing flow is optimized to increase the log-likelihood of the skeleton features F s with the following loss:\nDuring inference, the feature F s i of a test skeleton snippet is sent to the trained normalizing flow, which outputs the typicality anomaly score as follows:\nwhere normal skeletons will exhibit a low S t i , while anomalies will exhibit a higher S t i . Our approach differs significantly from STG-NF [6]. STG-NF takes low-level skeleton coordinates as inputs and only learns implicit spatio-temporal features, which struggle to generalize to new datasets without the normality reference provided by training data from the target dataset. In contrast, we use action semantics as a generalizable representation for the normalizing flow input and leverage experiential typicality labels to learn domain-general boundaries between normal and abnormal behavior.\nC. Test-time Uniqueness Analysis # The goal of this component is to complement the typicality perspective by deriving scene-adaptive boundaries that consider the context of the target scene. To this end, we propose a context uniqueness analysis module applied during inference on the unseen VAD dataset.\nUnlike action recognition datasets, surveillance videos contain richer contextual information, featuring longer temporal spans, larger numbers of people, and more diverse behavioral patterns. For such a video, H skeleton sequences {X1 , \u0026hellip;, X H } are extracted, where each sequence comprises L i frames of poses, represented as Xi = {P1 , \u0026hellip;, P L i }. Here, P t ∈ R J×2 comprises J keypoints, each defined by a pair of coordinate values. 
Targeted at frame-level anomaly scoring, the sequences are segmented into shorter skeleton snippets, denoted as A i ∈ R C×J×T , each of which is then individually scored based on its contextual information.\nSpatio-temporal context. As shown in Fig. 2, to gain a fine-grained context understanding of the scene, we construct two types of spatio-temporal context graphs: a cross-person graph G c and a self-inspection graph G s . The first graph is constructed by retrieving the feature nearest neighbors among the surrounding skeleton snippets, while the second one is constructed by retrieving the feature nearest neighbors from different time segments of the current person. In this way, we can filter out some unrelated activities and focus solely on behaviors related to the current individual. Given a skeleton snippet Ai with feature F s i , the cross-person graph is defined as G c i = {V i c , E i c }, where V i c = {Ai , N c (Ai)} denotes the node set and E i c = {(i, j) | j ∈ N c } denotes the edge set. Besides, during the preprocessing of skeleton snippets, Ai is associated with a human trajectory index pi and timestamp ti . The neighborhood N c is formulated as:\nwhere d(·) represents the Euclidean distance, and D k c refers to the k-th smallest value for the cross-person distances. The second graph, which depicts self-inspection, is defined as G s i = {V i s , E i s }, where V i s = {Ai , N s (Ai)} denotes the node set and E i s = {(i, j) | j ∈ N s } denotes the edge set. Then, the neighborhood N s is formulated as:\nwhere D k s refers to the k-th smallest value for the self-inspection distances. α is a threshold that masks out a period of time before and after the current time window, as the individual\u0026rsquo;s state tends to remain stable during adjacent periods.\nUniqueness scores. 
Since abnormal activities are rare, anomalies in real-world surveillance videos often differ from other activities in both spatial and temporal context, which is referred to as uniqueness. Based on the pre-trained discriminative skeleton features, uniqueness can be represented as the feature distances between a query node and other nodes in the built graph. Specifically, the uniqueness score S u i for individual i is obtained by taking the larger of the cross-person and self-inspection distances, formulated as follows:\nHolistic anomaly scoring. By integrating the complementary typicality scores S t i and uniqueness scores S u i , our model can capture both typical anomalies in the language prior and unique anomalies in spatio-temporal contexts. This helps gain a comprehensive understanding of anomalies in new scenes, where the holistic anomaly score of individual i is defined as:\nFinally, the frame-level anomaly scores are obtained by taking the highest score among all individuals within each frame. If any individual is considered anomalous, the entire frame is classified as anomalous. Frames in which no individual is detected are classified as normal; in this case, the anomaly score is assigned the minimum value among all scores in that video, following the approach in [6].\nTABLE II THE DETAILS OF OUR ZERO-SHOT VIDEO ANOMALY DETECTION BENCHMARKS. EACH SNIPPET CONTAINS 16 FRAMES OF SKELETON DATA WITH A 1-FRAME INTERVAL, WHILE THE SNIPPETS OF UCF-CRIME* ARE SAMPLED WITH A 16-FRAME INTERVAL AS ITS VIDEOS ARE TOO LONG.\n| Dataset | Resolution | Test Video Num. | Scenes Num. | Snippet Num. |\n| --- | --- | --- | --- | --- |\n| ShanghaiTech [1] | 480×856 | 107 | 13 | 156,571 |\n| UBnormal [22] | 720×1280 | 211 | 29 | 315,416 |\n| NWPU [23] | multiple | 242 | 43 | 723,490 |\n| UCF-Crime [2] | 240×320 | 290 | \u0026gt;50 | 152,231* |\nIV. EXPERIMENTS # A. Dataset and Implementation Details # Dataset. 
The training of our model is conducted on the Kinetics-400-skeleton dataset [37], [38], while the ZS-VAD capability of our model is evaluated on four large-scale VAD datasets: ShanghaiTech [1], UBnormal [22], NWPU [23] and UCF-Crime [2]. Note that we only use the test sets of these four VAD datasets.\nFor the training of our model, we use the external Kinetics-400-skeleton [38] dataset. It is intended for action recognition rather than VAD and is gathered from YouTube videos covering 400 action classes. We utilize the preprocessed skeleton data obtained from ST-GCN [38] for training. For evaluation, we use four VAD datasets. Compared to some early VAD benchmarks [43], [44] that involve single scenes staged and captured at one location, the four datasets we evaluate on are more extensive, encompassing a wider variety of scenes. Consequently, these four datasets are better suited for testing the model\u0026rsquo;s zero-shot capabilities and assessing its cross-scenario performance. The details are summarized in Table II and the following descriptions. (1) ShanghaiTech. It is a widely-used benchmark for one-class video anomaly detection, which consists of 330 training videos and 107 test videos from 13 different scenes. (2) UBnormal. It is a synthetic dataset with virtual objects and real-world environments. It consists of 186 training videos and 211 test videos from 29 different scenes. (3) NWPU. It is a newly published dataset that contains some scene-dependent anomaly types. It comprises 305 training videos and 242 testing videos from 43 scenes. (4) UCF-Crime. It is a large-scale dataset with 1900 long untrimmed surveillance videos. The 290 testing videos are used for our evaluation.\nImplementation Details. For a fair comparison, we directly use the skeleton data of ShanghaiTech and UBnormal from STG-NF [6]. For NWPU and UCF-Crime, as they do not have open-source skeleton data, we resort to utilizing AlphaPose [21] for data extraction. 
We use a segment window T = 16 and a stride of 1 to divide each sequence into snippets, except for UCF-Crime, where we use a stride of 16 because its videos are too long. For the backbone, we use multi-scale CTR-GCN [46] (2.1M) as the skeleton encoder and a 4-layer feature normalizing flow [47] (2.9M) to model the normality probability. During training, we use the \u0026ldquo;ViT-B/32\u0026rdquo; CLIP [33] as the text encoder and GPT-3.5-Turbo as our knowledge engine. During inference, these two models are removed. For the hyperparameters, the batch size is set to 1024, and the Adam optimizer is used with a learning rate of 0.0005. Additionally, β n , β a , k, and α are set to 90%, 10%, 16, and 4, respectively. For the evaluation metrics, we follow common practice [1], [2], [6] by using the micro-average frame-level AUC as the evaluation metric, which involves concatenating all frames and calculating the score.\nTABLE III ZERO-SHOT VIDEO ANOMALY DETECTION PERFORMANCE ON THE FOUR LARGE-SCALE DATASETS, SHANGHAITECH, UBNORMAL, NWPU, AND UCF-CRIME, WHERE THE SUBSCRIPT DENOTES THE NUMBER OF SCENES.\n| Type | Method | Training VAD | SHT13 | UB29 | NWPU43 | UCFC\u0026gt;50 |\n| --- | --- | --- | --- | --- | --- | --- |\n| LVLM | Imagebind [45] | ✘ | - | - | - | 55.8 |\n| LVLM | LAVAD [35] | ✘ | - | - | - | 80.3 |\n| Frame/Object | HF2-VAD [4] | SHT | 76.2 | 59.5 | 58.3 | 52.9 |\n| Frame/Object | Jigsaw-VAD [3] | SHT | 84.3 | 58.6 | 61.1 | 53.3 |\n| Skeleton | MocoDAD [19] | SHT | 77.6 | 67.0 | 56.4 | 51.8 |\n| Skeleton | MocoDAD [19] | UB | 76.0 | 68.4 | 56.6 | 52.0 |\n| Skeleton | STG-NF [6] | SHT | 85.9 | 68.8 | 57.6 | 51.6 |\n| Skeleton | STG-NF [6] | UB | 83.0 | 71.8 | 57.9 | 51.9 |\n| Skeleton | Ours | ✘ | 84.1 | 74.5 | 62.1 | 62.7 |\nB. Main Results # We conduct a comprehensive comparison of ZS-VAD performance against frame-based/object-based [3], [4], skeleton-based [6], [19], and LVLM-based methods [35], [45].\nComparison with frame/object-based methods. We use their open-source checkpoints trained on ShanghaiTech to evaluate the zero-shot performance on the remaining three VAD datasets. 
As shown in Table III, their generalization capabilities on new scene datasets are relatively poor due to the influence of human appearance and background variations.\nComparison with skeleton-based methods. We use their open-source checkpoints trained on ShanghaiTech or UBnormal to evaluate on the remaining three VAD datasets. The performance of prevalent skeleton-based methods is still underwhelming due to a lack of understanding of complex normal and abnormal behaviors without target training data. Compared with our baseline STG-NF, our proposed method improves the frame-level AUC-ROC by 1.1% on ShanghaiTech, 5.7% on UBnormal, 4.2% on NWPU, and 10.8% on UCF-Crime. We also compare their performance in the full-shot setting, where target domain data is used for training. Table IV shows that our zero-shot approach can achieve comparable or even superior results to SOTA full-shot performance. To evaluate our method under the popular full-shot setting, we train our normalizing flow only on VAD normal data to model the normal distribution and test it on the same domain. The results outperform state-of-the-art (SOTA) full-shot methods.\nComparison with LVLM-based methods. With the advancements in Large Vision-Language Models (LVLMs) [49]–[51], LAVAD [35] proposes a zero-shot video anomaly detection (ZS-VAD) framework. However, it relies on multi-stage reasoning and the coordination of multiple large models with over 13 billion (B) parameters, posing challenges for widespread deployment. In contrast, we develop a lightweight zero-shot anomaly detector with a mere 5.0 million (M) parameters, less than one two-thousandth of LAVAD\u0026rsquo;s parameter count.\nTABLE IV OUR ZERO-SHOT PERFORMANCE VS. SOTA FULL-SHOT PERFORMANCE. WE ALSO PROVIDE A VERSION NAMED OURS-FULL TO EVALUATE OUR METHOD UNDER THE POPULAR FULL-SHOT SETTING.\n| Setting | Method | Training VAD | SHT | UB |\n| --- | --- | --- | --- | --- |\n| zero-shot | Ours | ✘ | 84.1 | 74.5 |\n| full-shot | HF2-VAD [4] | SHT / UB | 76.2 | - |\n| full-shot | Jigsaw-VAD [3] | SHT / UB | 84.3 | - |\n| full-shot | SSMTL++ [48] | SHT / UB | 83.8 | 62.1 |\n| full-shot | GEPC [25] | SHT / UB | 76.1 | 53.4 |\n| full-shot | MocoDAD [19] | SHT / UB | 77.6 | 68.4 |\n| full-shot | STG-NF [6] | SHT / UB | 85.9 | 71.8 |\n| full-shot | FG-Diff [26] | - / UB | - | 68.9 |\n| full-shot | Ours-full | SHT / UB | 86.0 | 78.2 |\nTABLE V ABLATION EXPERIMENTS OF THE TYPICALITY MODELING MODULE.\n| Experiments | SHT | UB | NWPU | UCFC |\n| --- | --- | --- | --- | --- |\n| (a) ours w/o aligning | 83.2 | 69.4 | 60.4 | 59.5 |\n| (b) ours w/o selection | - | 64.1 | 61.9 | 56.7 |\n| (c) ours w/o NF | 83.4 | 72.2 | 62.6 | 60.5 |\n| (d) prompt score | 81.3 | 64.4 | 61.5 | 61.0 |\n| (e) ours | 84.1 | 74.5 | 62.1 | 62.7 |\nC. Ablation Study # Ablation of typicality module. We conduct ablation experiments on the typicality modeling module with the following settings: (a) removing the aligning stage, training an STG-NF [6] network with our typicality data; (b) removing the collection phase, training with VAD source data (SHT); (c) removing the normalizing flow and calculating typicality scores using k-nearest neighbors distance techniques. As shown in Table V, the model shows poor performance without the aligning stage, as it fails to learn generalizable and discriminative semantic representations. Moreover, performance deteriorates without the selection of typicality action knowledge, as the model can only learn a limited normality boundary from the VAD source data. 
Furthermore, without the normalizing flow (NF), the model also loses flexibility in modeling the distribution of typical behaviors.\nAblation of uniqueness module. We ablate the uniqueness scores and the holistic scores in this part. As demonstrated in Table VI, when only using the cross-person distance, the model can identify contextual anomalies with acceptable performance. When combined with the self-inspection score, the model can spot changes in motion states, aiding in detecting a wider range of anomalies. The reason for the suboptimal performance of the uniqueness score on UBnormal is that UBnormal is a synthetic dataset where some videos contain only one person with relatively short movement durations, which does not align well with real surveillance video scenarios. By integrating both the typicality and uniqueness modules, our approach can achieve optimal performance.\nFig. 3. Example results of our method that succeed in capturing typical anomalies. STG-NF [6] fails to detect the \u0026ldquo;jumping in the street\u0026rdquo; event, while ours performs well through action typicality learning. Each individual (blue skeleton) has a predicted anomaly score (red font), where the frame-level score (orange line) is defined as the maximum among all individuals in that frame.\nFig. 4. Example results of our method that succeed in capturing unique anomalies. STG-NF misclassifies unseen normal events during periods where \u0026ldquo;riding\u0026rdquo; occurs as anomalies, whereas our method correctly identifies them as normal by recognizing their contextual similarity. Moreover, STG-NF fails to detect the \u0026ldquo;photographing in restricted areas\u0026rdquo; anomaly, while our approach successfully identifies it by recognizing a sudden change in the person\u0026rsquo;s movement trajectory.\nTABLE VI ABLATION STUDY OF THE UNIQUENESS ANALYSIS MODULE AND HOLISTIC ANOMALY SCORING.\nTyp. Cross Self SHT UB NWPU UCFC ✓ ✓ 81.9 73.2 62.1 59.6 ✓ ✓ 81.9 62.9 60.7 59.9 ✓ ✓ 67.8 60.1 61 62.6 ✓ ✓ 82 64.5 61.7 61 ✓ ✓ ✓ 84.1 74.5 62.1 62.7\nTABLE VII COMPARISON WITH PROMPT-BASED METHODS.\n| Experiments | SHT | UB | NWPU | UCFC |\n| --- | --- | --- | --- | --- |\n| (a) normal prompts | 81.3 | 64.4 | 61.5 | 61.0 |\n| (b) abnormal prompts | 80.7 | 63.6 | 61.0 | 60.2 |\n| (c) ensemble prompts | 80.7 | 63.9 | 61.1 | 61.3 |\n| (d) ours | 84.1 | 74.5 | 62.1 | 62.7 |\nComparison with prompt-based methods. Since prompt-based techniques have been popular in other zero-shot tasks [11], [36], we conduct experiments to compare our typicality module with theirs. To this end, we design typical normal prompts, typical abnormal prompts, and the ensemble prompts, then use the skeleton-prompt similarity as the anomaly score. In detail, we use a normal prompt list: [\u0026ldquo;usual\u0026rdquo;, \u0026ldquo;normal\u0026rdquo;, \u0026ldquo;daily\u0026rdquo;, \u0026ldquo;stable\u0026rdquo;, \u0026ldquo;safe\u0026rdquo;], and an abnormal prompt list: [\u0026ldquo;danger\u0026rdquo;, \u0026ldquo;violence\u0026rdquo;, \u0026ldquo;suddenness\u0026rdquo;, \u0026ldquo;unusual\u0026rdquo;, \u0026ldquo;instability\u0026rdquo;]. The prompts are encoded into text features, and their similarity with the skeleton features is computed and combined with our uniqueness analysis. As shown in Table V (d) and Table VII, the results are suboptimal. Unlike the various forms of text seen in CLIP image-text alignment, the current skeleton-text alignment scheme has only encountered text of action class names, so its alignment capability for prompt text is weak. Our method, on the other hand, distills the LLM\u0026rsquo;s knowledge and learns the typicality distribution, avoiding directly using the skeleton-prompt similarity as anomaly scores.\nTABLE VIII ABLATION STUDIES OF THE BACKBONE NETWORKS. THE PERFORMANCE IS EVALUATED ON UBNORMAL / NWPU / UCF-CRIME DATASETS.\n| Method | Backbone | Params. | Performance |\n| --- | --- | --- | --- |\n| Jigsaw [3] | 3D-Conv | 1.5M | 58.6 / 61.1 / 53.3 |\n| HF2-VAD [4] | MemAE+C-VAE | 36.4M | 59.5 / 58.3 / 52.9 |\n| STG-NF [6] | STG-NF | 0.7K | 68.8 / 57.6 / 51.6 |\n| Ours | STG-NF | 0.7K | \u0026hellip;0 / 60.1 / 58.5 |\n| Ours | ST-GCN + NF | 4.1M | 73.1 / 61.3 / 61.0 |\n| Ours | CTR-GCN + NF | 5.0M | 74.5 / 62.1 / 62.7 |\nAblation of backbone. In this part, we ablate our backbone. For a fair comparison with our baseline STG-NF [6], we attempt to use STG-NF as the backbone. However, STG-NF takes XY-coordinates as input, with only 2 dimensions, making it extremely lightweight yet difficult to learn high-level features. We then train STG-NF on the skeleton coordinates of our typicality data to obtain the typicality score. (Note that the results in Table V (a) also consider the uniqueness score, resulting in higher performance.) As shown in Table VIII, using STG-NF as our backbone still demonstrates better performance compared to vanilla STG-NF, highlighting the effectiveness of our typicality training. In addition, when we switch our backbone from CTR-GCN [46] to ST-GCN [38], the model becomes more lightweight and the performance remains good.\nHyper-parameter ablations. We ablate the nearest neighborhood (NN) number k and the masking threshold α in the uniqueness analysis module. As shown in Table IX, our method is robust to these two hyper-parameters. Choosing an appropriate k can filter out some unrelated activities and focus solely on behaviors related to the current skeleton snippets. In addition, taking the average of the k neighbors helps suppress noise, which also makes our model insensitive to α.\nWe also ablate the hyperparameters in the typicality knowledge selection step. 
As shown in Table X, the optimal values of β n and β a are 0.9 (90%) and 0.1 (10%), respectively. A smaller β a can enhance performance by filtering out noisy data and normal snippets within the anomalous sequences.\nTABLE IX ABLATION RESULTS OF TWO MAIN HYPER-PARAMETERS.\n| (k, α) | SHT | UB | (k, α) | SHT | UB |\n| --- | --- | --- | --- | --- | --- |\n| (1, 1) | 83.9 | 74.2 | (16, 1) | 84.2 | 74.2 |\n| (4, 1) | 84.1 | 74.3 | (16, 2) | 84.2 | 74.4 |\n| (16, 1) | 84.1 | 74.5 | (16, 3) | 84.0 | 74.5 |\n| (64, 1) | 83.7 | 74.7 | (16, 4) | 84.1 | 74.5 |\nTABLE X HYPER-PARAMETER SENSITIVITY OF THE SELECTION RATIO.\n| (β n , β a) | SHT | UB | (β n , β a) | SHT | UB |\n| --- | --- | --- | --- | --- | --- |\n| (0.1, 0.1) | 82.8 | 68.3 | (0.9, 0.1) | 84.1 | 74.5 |\n| (0.3, 0.1) | 83.0 | 70.9 | (0.9, 0.3) | 84.1 | 73.3 |\n| (0.5, 0.1) | 83.4 | 74.1 | (0.9, 0.5) | 84.1 | 72.3 |\n| (0.7, 0.1) | 83.9 | 74.3 | (0.9, 0.7) | 83.5 | 72.0 |\n| (0.9, 0.1) | 84.1 | 74.5 | (0.9, 0.9) | 82.5 | 71.8 |\nD. Visualization Results # Fig. 3 presents the qualitative localization results of typical anomalies. STG-NF fails to detect the \u0026ldquo;jumping\u0026rdquo; anomaly due to its low-level skeleton representation, which is unable to distinguish subtle anomaly patterns that are similar to normal patterns. Disturbed by the skeletal noise at frame 350, it erroneously identifies the anomaly\u0026rsquo;s position. In contrast, our model maps the skeleton snippets to a high-level space with generalizable and discriminative semantic information, allowing it to identify the anomaly based on our trained decision boundary. In surveillance scenarios lacking training samples, our model can still be effectively utilized to detect certain typical abnormal behaviors.\nFig. 4 shows the qualitative localization results of unique anomalies. Existing skeleton-based methods rely on the source normal data for training. When the source domain does not include some novel behavior that appears in the target domain, this behavior will be classified as anomalous. Consequently, STG-NF erroneously localizes the anomaly during time periods when \u0026ldquo;riding\u0026rdquo; is present. 
In contrast, our model can analyze the spatio-temporal differences and establish scene-adaptive decision boundaries. Since \u0026ldquo;riding\u0026rdquo; occurs multiple times in the video, its uniqueness score is low. On the other hand, \u0026ldquo;photographing in restricted areas\u0026rdquo; exhibits significant differences from the surrounding people\u0026rsquo;s behavior and appears as a sudden change in the person\u0026rsquo;s movement trajectory, resulting in a corresponding increase in its anomaly score.\nV. CONCLUSION # In this paper, we identify the advantages of the skeleton-based approach in ZS-VAD, and introduce a novel framework that can generalize to various target scenes with typicality and uniqueness learning. First, we propose a language-guided typicality modeling module that effectively learns the typical distribution of normal and abnormal behavior. Secondly, we propose a test-time uniqueness analysis module to derive scene-adaptive boundaries. Experiments demonstrate the effectiveness of our model. Limitations and future work: Our work aims to exploit the full potential of skeleton data in the ZS-VAD task, while the complex relationships between behaviors and scenes will be left for future work.\nREFERENCES # [1] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection–a new baseline,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6536–6545.\n[2] W. Sultani, C. Chen, and M. Shah, \u0026ldquo;Real-world anomaly detection in surveillance videos,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6479–6488.\n[3] G. Wang, Y. Wang, J. Qin, D. Zhang, X. Bao, and D. Huang, \u0026ldquo;Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles,\u0026rdquo; in European Conference on Computer Vision. Springer, 2022, pp. 494–511.\n[4] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. 
Li, \u0026ldquo;A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 588–13 597.\n[5] C. Tang, S. Zhou, Y. Li, Y. Dong, and L. Wang, \u0026ldquo;Advancing pre-trained teacher: towards robust feature discrepancy for anomaly detection,\u0026rdquo; arXiv preprint arXiv:2405.02068, 2024.\n[6] O. Hirschorn and S. Avidan, \u0026ldquo;Normalizing flows for human pose anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13 545–13 554.\n[7] L. Wang, J. Tian, S. Zhou, H. Shi, and G. Hua, \u0026ldquo;Memory-augmented appearance-motion network for video anomaly detection,\u0026rdquo; Pattern Recognition, vol. 138, p. 109335, 2023.\n[8] M. Cho, M. Kim, S. Hwang, C. Park, K. Lee, and S. Lee, \u0026ldquo;Look around for anomalies: weakly-supervised anomaly detection via context-motion relational learning,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 12 137–12 146.\n[9] H. Shi, L. Wang, S. Zhou, G. Hua, and W. Tang, \u0026ldquo;Abnormal ratios guided multi-phase self-training for weakly-supervised video anomaly detection,\u0026rdquo; IEEE Transactions on Multimedia, 2023.\n[10] Y. Liu, S. Li, Y. Zheng, Q. Chen, C. Zhang, and S. Pan, \u0026ldquo;Arc: A generalist graph anomaly detector with in-context learning,\u0026rdquo; arXiv preprint arXiv:2405.16771, 2024.\n[11] J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, \u0026ldquo;Winclip: Zero-/few-shot anomaly classification and segmentation,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 606–19 616.\n[12] Z. Gu, B. Zhu, G. Zhu, Y. Chen, H. Li, M. Tang, and J. 
Wang, \u0026ldquo;Filo: Zero-shot anomaly detection by fine-grained description and highquality localization,\u0026rdquo; arXiv preprint arXiv:2404.13671, 2024.\n[13] Y. Cao, J. Zhang, L. Frittoli, Y. Cheng, W. Shen, and G. Boracchi, \u0026ldquo;Adaclip: Adapting clip with hybrid learnable prompts for zero-shot anomaly detection,\u0026rdquo; arXiv preprint arXiv:2407.15795, 2024.\n[14] Q. Zhou, G. Pang, Y. Tian, S. He, and J. Chen, \u0026ldquo;AnomalyCLIP: Objectagnostic prompt learning for zero-shot anomaly detection,\u0026rdquo; in The Twelfth International Conference on Learning Representations, 2024.\n[15] A. Aich, K.-C. Peng, and A. K. Roy-Chowdhury, \u0026ldquo;Cross-domain video anomaly detection without target domain adaptation,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , 2023, pp. 2579–2591.\n[16] J. Micorek, H. Possegger, D. Narnhofer, H. Bischof, and M. Kozinski, \u0026ldquo;Mulde: Multiscale log-density estimation via denoising score matching for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 868– 18 877.\n[17] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, \u0026ldquo;Learning regularity in skeleton trajectories for anomaly detection in videos,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 996–12 004.\n[18] S. Yu, Z. Zhao, H. Fang, A. Deng, H. Su, D. Wang, W. Gan, C. Lu, and W. Wu, \u0026ldquo;Regularity learning via explicit distribution modeling for skeletal video anomaly detection,\u0026rdquo; IEEE Transactions on Circuits and Systems for Video Technology, 2023.\n[19] A. Flaborea, L. Collorone, G. M. D. Di Melendugno, S. D\u0026rsquo;Arrigo, B. Prenkaj, and F. 
Galasso, \u0026ldquo;Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 318–10 329.\n[20] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, \u0026ldquo;Openpose: Realtime multi-person 2d pose estimation using part affinity fields,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 1, pp. 172–186, 2019.\n[21] H.-S. Fang, J. Li, H. Tang, C. Xu, H. Zhu, Y. Xiu, Y.-L. Li, and C. Lu, \u0026ldquo;Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time,\u0026rdquo; IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 7157–7173, 2022.\n[22] A. Acsintoae, A. Florescu, M.-I. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah, \u0026ldquo;Ubnormal: New benchmark for supervised open-set video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 20 143–20 153.\n[23] C. Cao, Y. Lu, P. Wang, and Y. Zhang, \u0026ldquo;A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 20 392–20 401.\n[24] S. Sun and X. Gong, \u0026ldquo;Hierarchical semantic contrast for scene-aware video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22 846–22 856.\n[25] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, \u0026ldquo;Graph embedded pose clustering for anomaly detection,\u0026rdquo; in Proceed-\nings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 539–10 547.\n[26] X. Tan, H. Wang, X. Geng, and L. 
Wang, \u0026ldquo;Frequency-guided diffusion model with perturbation training for skeleton-based video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2412.03044, 2024. [27] D. P. Kingma and P. Dhariwal, \u0026ldquo;Glow: Generative flow with invertible 1x1 convolutions,\u0026rdquo; Advances in neural information processing systems , vol. 31, 2018. [28] R. Wu, Y. Chen, J. Xiao, B. Li, J. Fan, F. Dufaux, C. Zhu, and Y. Liu, \u0026ldquo;Da-flow: Dual attention normalizing flow for skeleton-based video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2406.02976, 2024. [29] A. Li, C. Qiu, M. Kloft, P. Smyth, M. Rudolph, and S. Mandt, \u0026ldquo;Zeroshot anomaly detection via batch normalization,\u0026rdquo; Advances in Neural Information Processing Systems, vol. 36, 2024. [30] T. Aota, L. T. T. Tong, and T. Okatani, \u0026ldquo;Zero-shot versus manyshot: Unsupervised texture anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , 2023, pp. 5564–5572. [31] X. Chen, Y. Han, and J. Zhang, \u0026ldquo;A zero-/fewshot anomaly classification and segmentation method for cvpr 2023 vand workshop challenge tracks 1\u0026amp;2: 1st place on zero-shot ad and 4th place on few-shot ad,\u0026rdquo; arXiv preprint arXiv:2305.17382, vol. 2, no. 4, 2023. [32] A. Miyai, J. Yang, J. Zhang, Y. Ming, Y. Lin, Q. Yu, G. Irie, S. Joty, Y. Li, H. Li et al., \u0026ldquo;Generalized out-of-distribution detection and beyond in vision language model era: A survey,\u0026rdquo; arXiv preprint arXiv:2407.21794, 2024. [33] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in International conference on machine learning. PMLR, 2021, pp. 8748–8763. [34] D. Guo, Y. Fu, and S. 
Li, \u0026ldquo;Ada-vad: Domain adaptable video anomaly detection,\u0026rdquo; in Proceedings of the 2024 SIAM International Conference on Data Mining (SDM). SIAM, 2024, pp. 634–642. [35] L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, \u0026ldquo;Harnessing large language models for training-free video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 527–18 536. [36] F. Sato, R. Hachiuma, and T. Sekii, \u0026ldquo;Prompt-guided zero-shot anomaly action recognition using pretrained deep skeleton features,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 6471–6480. [37] J. Carreira and A. Zisserman, \u0026ldquo;Quo vadis, action recognition? a new model and the kinetics dataset,\u0026rdquo; in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308. [38] S. Yan, Y. Xiong, and D. Lin, \u0026ldquo;Spatial temporal graph convolutional networks for skeleton-based action recognition,\u0026rdquo; in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018. [39] X. Liu, S. Zhou, L. Wang, and G. Hua, \u0026ldquo;Parallel attention interaction network for few-shot skeleton-based action recognition,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1379–1388. [40] Y. Wang, S. Zhou, K. Xia, and L. Wang, \u0026ldquo;Learning discriminative spatiotemporal representations for semi-supervised action recognition,\u0026rdquo; arXiv preprint arXiv:2404.16416, 2024. [41] M. Wang, J. Xing, and Y. Liu, \u0026ldquo;Actionclip: A new paradigm for video action recognition,\u0026rdquo; arXiv preprint arXiv:2109.08472, 2021. [42] W. Xiang, C. Li, Y. Zhou, B. Wang, and L. 
Zhang, \u0026ldquo;Generative action description prompts for skeleton-based action recognition,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 276–10 285. [43] C. Lu, J. Shi, and J. Jia, \u0026ldquo;Abnormal event detection at 150 fps in matlab,\u0026rdquo; in Proceedings of the IEEE international conference on computer vision , 2013, pp. 2720–2727. [44] W. Li, V. Mahadevan, and N. Vasconcelos, \u0026ldquo;Anomaly detection and localization in crowded scenes,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 1, pp. 18–32, 2013. [45] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, \u0026ldquo;Imagebind: One embedding space to bind them all,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 180–15 190. [46] Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu, \u0026ldquo;Channelwise topology refinement graph convolution for skeleton-based action recognition,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 359–13 368. [47] X. Yao, R. Li, J. Zhang, J. Sun, and C. Zhang, \u0026ldquo;Explicit boundary guided semi-push-pull contrastive learning for supervised anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24 490–24 499.\n[48] A. Barbalau, R. T. Ionescu, M.-I. Georgescu, J. Dueholm, B. Ramachandra, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, \u0026ldquo;Ssmtl++: Revisiting self-supervised multi-task learning for video anomaly detection,\u0026rdquo; Computer Vision and Image Understanding, vol. 229, p. 103656, 2023. [49] H. Liu, C. Li, Y. Li, and Y. J. Lee, \u0026ldquo;Improved baselines with visual instruction tuning,\u0026rdquo; 2023. [50] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. 
Tang et al., \u0026ldquo;Qwen2.5-vl technical report,\u0026rdquo; arXiv preprint arXiv:2502.13923, 2025. [51] C. Tang, Z. Han, H. Sun, S. Zhou, X. Zhang, X. Wei, Y. Yuan, J. Xu, and H. Sun, \u0026ldquo;Tspo: Temporal sampling policy optimization for long-form video language understanding,\u0026rdquo; arXiv preprint arXiv:2508.04369, 2025. ","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/typicality-and-context-uniqueness-for/","section":"Papers","summary":"Proposes a zero-shot skeleton-based video anomaly detection framework utilizing action semantic typicality and context uniqueness learning, involving a language-guided typicality modeling module and a test-time context uniqueness analysis module, achieving state-of-the-art results without target domain training data.","title":"Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection","type":"method"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/ali-dabouei/","section":"Authors","summary":"","title":"Ali Dabouei","type":"authors"},{"content":" Aligning Effective Tokens with Video Anomaly in Large Language Models # Yingxian Chen 1∗ Jiahui Liu 1∗ Ruidi Fan 1 Yanwei Li 2 Chirui Chang 1 Shizhen Zhao 1 Wilton W.T.Fok 1 Xiaojuan Qi 1† Yik-Chung Wu 1†\n1 The University of Hong Kong 2 The Chinese University of Hong Kong\n{chenyx, liujh, xjqi, ycwu}@eee.hku.hk\n∗ equal contributions † corresponding authors\nAbstract # Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. 
To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.\n1. Introduction # Detecting and summarizing abnormal events in videos is critical and challenging, and it has garnered considerable attention across multiple research domains and real-world applications, such as security monitoring, video analysis, and crime detection.\nAlthough many traditional methods [4, 16, 37, 43, 47, 73, 86] have been widely explored for video anomaly detection, they exhibit substantial limitations in their effectiveness [22, 51, 60, 66, 71, 84]. These limitations manifest in two aspects: 1) Traditional video anomaly detection methods [6, 13, 57, 63, 64, 86] essentially approach the task as a closed-set detection and classification problem, inherently limiting their ability to achieve a comprehensive understanding and interpretation of anomalies; 2) These methods [2, 19, 23, 65, 68, 78] are restricted by a limited vocabulary, making it difficult for them to handle unseen or novel situations effectively.\nFigure 1. The baseline video understanding MLLM feeds forward every visual token (yellow squares) equally to participate in fine-tuning and inference (top row). In contrast, our method focuses on the effective area (unobstructed area in medium video frames) in each frame and selects the Spatial Effective Tokens (orange squares) for the LLM (see Section 3.2) (filtered tokens are shown as gray squares). 
At the same time, we generate anomaly-aware Temporal Effective Tokens (green squares) (see Section 3.3) based on the anomaly score (denoted as s) that a pretrained classifier assigns to each frame, for better temporal localization of anomalies.\nRecent advancements [1, 28, 34, 36, 45, 58] in Vision Language Models (VLMs) and Large Language Models (LLMs) have demonstrated remarkable capabilities in scene understanding and comprehensive analysis. Multimodal Large Language Models (MLLMs), particularly those designed for video understanding [27, 29, 31, 39, 42, 74], have achieved significant progress in general video analysis tasks. However, while these models exhibit strong performance in general video understanding, they fall short in accurately detecting and interpreting anomalies.\nTo mitigate the above challenges, some works [12, 55, 67, 75, 79] proposed anomaly-aware video MLLMs to better understand the anomalies in videos. Although these models work well for detecting obvious abnormal events, such as fighting or fire, they typically struggle to effectively align abnormal regions with relevant captions, which requires addressing spatial redundancy, and to accurately identify abnormal time intervals, which requires mitigating temporal redundancy. This is because these methods treat all latent tokens with equal priority across spatial and temporal dimensions. This leads to performance degradation caused by redundant tokens unrelated to anomalies. 
However, in most cases, only small regions within a few frames contain the essential information that helps identify an anomaly (as shown in Figure 1). Thus, we explore: How can multimodal architectures evolve selective token generation and processing mechanisms to dynamically prioritize anomaly-salient information while preserving comprehensive scene understanding capabilities?\nTo address the aforementioned issues, we propose a new model named VA-GPT for analyzing various Videos for Abnormal events by aligning effective and accurate tokens with LLMs across both spatial and temporal dimensions. VA-GPT integrates two key components to identify effective visual tokens for alignment while eliminating redundant tokens that could hinder anomaly analysis and distract the model from extracting useful information: 1) we develop the Spatial Effective Token Selection (SETS) module for identifying tokens corresponding to regions with challenges for aligning them with LLMs, while filtering out tokens associated with minor dynamics to remove redundancy. This is because we find that abnormal events often result in different visual changes and variations in local areas (see Figure 1); and 2) we propose the Temporal Effective Token Generation (TETG) module, which employs a lightweight pre-trained classifier to assign a confidence score to each frame indicating the possibility of containing abnormal events. TETG then generates efficient tokens with prior knowledge of the temporal information of abnormal events directly in the language space as additional input to the LLMs, effectively enhancing the model\u0026rsquo;s temporal reasoning and understanding abilities about abnormal events.\nFurthermore, beyond conventional in-domain benchmarks, we establish a new cross-domain evaluation protocol that systematically evaluates model robustness under domain shifts. 
Based on a novel video dataset, XD-Violence [64], we design comprehensive QAs about abnormal events whose visual content differs from our training data, and integrate them as a new cross-domain benchmark. Meanwhile, we design temporal-information-oriented QAs on both in-domain and cross-domain benchmarks for evaluating temporal localization abilities. Comprehensive experiments demonstrate VA-GPT\u0026rsquo;s superiority, achieving state-of-the-art performance in both in-domain anomaly localization and cross-domain generalization scenarios.\nThe main contributions are summarized as follows:\n- We propose VA-GPT, a video-anomaly-aware MLLM for detecting and summarizing anomalies in various videos, which introduces the MLLM to the specific domain of video anomaly understanding.\n- We introduce SETS and TETG, which enable our MLLM to effectively capture both spatial and temporal information in video sequences, resulting in accurate understanding and localization of abnormal events. Meanwhile, we propose a new instruct-following dataset for video anomaly analysis and a comprehensive cross-domain evaluation benchmark for better evaluating the generalization abilities of MLLMs on video anomalies.\n- Our extensive experiments demonstrate that our method outperforms existing state-of-the-art methods in various benchmarks, highlighting its effectiveness and potential for practical applications in video anomaly understanding.\n2. Related Work # 2.1. Large Language Models (LLMs) # The domain of Natural Language Processing (NLP) has experienced significant progress, particularly with the emergence of Large Language Models (LLMs). The introduction of the Transformer architecture [3, 18, 59] is a critical turning point, followed by other influential language models [3, 10, 25, 80] that exhibited remarkable proficiency. 
Generative Pre-trained Transformers (GPT) [49] brought about a revolution in NLP by employing auto-regressive prediction, establishing itself as a powerful language modeling approach. More recent groundbreaking contributions, such as ChatGPT [45], GPT-4 [1], LaMDA [56] and LLaMA [58], have expanded the horizon even further. These models, trained on extensive textual data, display extraordinary performance in intricate linguistic tasks.\n2.2. Vision Language Models (VLMs) # Progress in the fields of computer vision and natural language processing has given rise to the development of vision-language models (VLMs) [14, 21, 33, 34, 50, 61, 69]. These models combine visual and linguistic systems to facilitate cross-modal comprehension and reasoning. Notable examples include CLIP [50], which pairs BERT [10] with ViT [11]; BLIP-2 [28], incorporating Vision Transformer features into Flan-T5 [8]; MiniGPT4 [85], connecting BLIP-2 with Vicuna [28, 48]; and PandaGPT [53], bridging ImageBind [15] with Vicuna. These models excel in tasks like image classification, captioning, and object detection [24, 53, 83]. Recent developments in vision-language models have extended into video processing with models like Video-Chat [29], Video-ChatGPT [42], Otter [26], Valley [39], mPLUG [27], Video-LLaMA [74], and LLaMA-VID [31]. These systems enable interactive video querying, enhance comprehension through audio-visual-text alignment, and support comprehensive video analysis. In this paper, we leverage VLMs and LLMs to develop a novel approach for video anomaly understanding.\nFigure 2. Detailed illustration of our proposed model. When a video is fed into the model, patch embeddings and class embeddings (c.ebd) are extracted from all frames. 
1) Based on the difference in patch embeddings between the current frame and its neighbouring frame, we obtain a filter mask that removes unimportant visual tokens (dashed squares) from the current frame\u0026rsquo;s visual tokens, thereby selecting Spatial Effective Tokens. A projector with pooling compresses them into an aligned content token for each frame, while attention with the user\u0026rsquo;s text input yields an aligned context token for each frame; 2) Based on the class embeddings (c.ebd) of all frames, we use a pre-trained Anomaly-aware Classifier to localize the time period of abnormal events, thereby generating Temporal Effective Tokens to feed forward into the LLM. All of the resulting aligned tokens are fed into the LLM for reasoning and inference of the whole model.\n2.3. Video Anomaly Understanding (VAU) # Annotating surveillance video frames is labor-intensive, prompting researchers to explore alternatives: one-class learning [72], unsupervised anomaly detection without annotations [4, 5, 16, 20, 37, 38, 47], or weakly supervised methods using only video-level annotations [6, 13, 30, 57, 60, 62, 63, 71, 76]. In one-class learning, Luo et al. developed a ConvLSTM network for normal segment learning [40]. Several researchers employed Auto-Encoder networks to reconstruct normal frame features [5, 20, 81], while others implemented memory mechanisms [4, 16, 37, 47] or meta-learning [38] to enhance generalizability. For weakly supervised learning, Tian et al. [57] used multiple-instance learning to localize anomalous clips. Zhong et al. 
utilized graph convolution networks [82], though with limited generalization capability. To address this, Ramachandra et al. developed a Siamese network for normal feature learning. Wan et al. and Zaheer et al. [60, 71] proposed clustering-based frameworks for anomaly identification. Recent studies have introduced new architectures for spatial-temporal feature ensemble learning [6, 13, 30, 57, 62, 63, 76]. However, these methods merely supply anomaly scores during inference, necessitating empirical threshold establishment on test sets to differentiate abnormal events. Recent research has begun exploring MLLMs to enhance models\u0026rsquo; capabilities in identifying and describing anomalies [12, 55, 67, 75, 79].\n3. Method # 3.1. Overview # Task. Video anomaly understanding MLLMs aim to determine whether an input video contains abnormal events, while also describing, and supporting interaction about, the temporal localization and the entire process of any detected abnormal events. We train the model with an instruct-following dataset based on abnormal videos [54], so that the model can better align the tokens between visual encoders and LLMs for presenting and generalizing information about abnormal events.\nPipeline. Consider the video understanding MLLM framework shown in Figure 2. Taking a video of $T$ frames as input, a frozen ViT-based [11] visual encoder (CLIP [52]) extracts visual tokens $X^t$ from each video frame $V^t$ ($t = 1, ..., T$). For the current frame, $X^t$ consists of $N$ visual tokens $x_i^t$ ($i = 1, ..., N$), one per image patch. Modality alignment converts the processed visual tokens $X^t$ into the semantic space of LLMs. At the same time, text prompts are processed and encoded as text tokens into the same semantic space and serve as a part of input to LLMs. 
Our key model design consists of (1) selecting Spatial Effective Tokens $X^{∗t}$ (SET) from $X^t$ for each frame, which participate in fine-tuning and inference instead of $X^t$ (see Section 3.2); and (2) generating Temporal Effective Tokens $S^{∗t}$ (TET) as anomaly-aware temporal priors, participating in inference to facilitate the temporal localization of abnormal events for LLMs (see Section 3.3). In addition, we produce high-quality instruct-following data on abnormal videos and develop a training strategy for it to maximize the effectiveness of our proposed method (see Section 3.4).\n3.2. Spatial Effective Token Selection (SETS) # In classical video classification tasks, context and relationships are critical. However, in our MLLM setting, beyond leveraging contextual information, the most crucial problem is aligning the visual and language modalities. Therefore, the key aspect of our design is to extract useful information for effectively aligning visual tokens with the LLM. Since text captions primarily describe anomaly events, which occupy only a small portion of the entire video, aligning all visual patterns with text tokens would be unreasonable and computationally heavy. Thus, we are the first to propose a novel token selection method SETS to achieve efficient and effective alignment.\nInter-frame Difference. For a video, we believe that areas with large changes in adjacent frames are more worthy of attention. As illustrated in Figure 2, for each frame $V^t$ of the video, we can regard its previous frame $V^{t-1}$ as the reference frame for investigating the difference between the current timestamp and the previous timestamp. Employing DINOv2 [46] as the feature extractor, denoted as FE, we can extract patch embeddings:\n$$F^t = FE(V^t), F^{t-1} = FE(V^{t-1}),$$\nwhere $F^t, F^{t-1} ∈ ℝ^{N×C}$ are the extracted embeddings ($N$ indicates the number of image patches and $C$ indicates the channels). 
Thanks to the distinction and stability of the extracted features, we calculate their patch-wise distances as the inter-frame difference map of the current frame:\n$$D^t = dis(F^t, F^{t-1}),$$\nwhere dis(·) indicates Manhattan distance [17] and $D^t ∈ ℝ^N$ indicates the distances between corresponding patch pairs in neighbouring frames.\nSelect Spatial Effective Tokens. According to the inter-frame difference map $D^t$, we can set up a vector $M^t = [m_1^t, m_2^t, ..., m_N^t]$ to record the difference of each patch, where the top $K$ ratio of elements with the largest distance are assigned the value 1, and the rest are assigned 0. Thus we get a mask for filtering and updating the visual tokens as:\n$$X^{∗t} = M^t ⊙ X^t,$$\nwhere $X^{∗t}$ contains the selected Spatial Effective Tokens (SET), which are fed into subsequent processing instead of $X^t$ as shown in Figure 2. SET can efficiently isolate the regions highly related to the abnormal events to participate in both fine-tuning and inference.\n3.3. Temporal Effective Token Generation (TETG) # Anomaly-aware Classifier. We design a simple but effective MLP $F_A$ for learning whether each frame is related to an abnormal event. For the class embeddings (denoted as $z$) extracted from the feature encoder, we can split them based on the training video captions into normal and anomaly embeddings, denoted as $z_n$ and $z_a$, respectively. Thus we can optimize $F_A$ using a binary classification loss.\nThe anomaly-aware classifier predicts whether each frame is related to anomalies in a video, which can bring additional important prior knowledge to LLMs at a very low cost to facilitate its inference.\nGenerate Temporal Effective Tokens. Since the information drawn from the anomaly-aware classifier is explicit, we can easily project it to LLMs\u0026rsquo; text token space through natural languages. 
Based on the prediction results of the anomaly-aware classifier, we select the first and last frames\u0026rsquo; timestamps with high confidence of containing abnormal events, denoted as \u0026lt;a-start\u0026gt; and \u0026lt;a-end\u0026gt;, respectively. Then we tokenize them with a template as: \u0026ldquo;Known common crime types are: \u0026lsquo;Shooting\u0026rsquo;,\u0026lsquo;Arson\u0026rsquo;,\u0026lsquo;Arrest\u0026rsquo;, \u0026hellip; There is one of the crime types occurring from \u0026lt;a-start\u0026gt; to \u0026lt;a-end\u0026gt;\u0026rdquo;, resulting in Temporal Effective Tokens (TET) in the text token space of LLMs. During inference, with the well-trained lightweight anomaly-aware classifier, TET is used as an additional input to participate in the forward process of the LLM to provide prior knowledge about abnormal events temporally (as shown in Figure 2).\n3.4. Training Strategy # For modality alignment and instruction tuning, we follow the baseline [31] to ensure visual features are well aligned with the language space. In this work, the training strategy can be divided into two stages: 1) Stage One: Fine-tuning with anomaly video data, and 2) Stage Two: Aligning Spatial Effective Tokens with LLMs.\nFine-tuning with Video Anomaly Data. To enhance the abnormal scene understanding of LLMs, we construct the UCF-crime [70] Question-Answer pairs for fine-tuning. We also mix different instruction pairs from various sources [32], including text conversations, single/multi-turn visual Question-Answer pairs, and video Question-Answer pairs. Different formats for text, image, and video inputs are adopted, and the image token is randomly inserted at the beginning or end of user input during training. All modules, except the frozen visual encoder, are optimized in this stage. After fine-tuning, the LLM has a preferred perception of anomalies, thus ensuring the effectiveness of the temporal effective tokens (see Section 3.3) during inference. More dataset details are illustrated in Section 4.\nTable 1. Comparison on the in-domain (UCF-Crime [54]) and the proposed cross-domain (XD-Violence [64]) benchmarks. Our method significantly outperforms other models and achieves state-of-the-art performance (accuracy is denoted as Acc.) on anomaly video understanding and temporal localization with LLMs (best results are shown in bold).\n\n| Method | LLM | In-domain Total Acc. (%) | In-domain Temporal Acc. (%) | Cross-domain Total Acc. (%) | Cross-domain Temporal Acc. (%) |\n| --- | --- | --- | --- | --- | --- |\n| Video-Chat [29] | Vicuna-7B | 22.41 | - | - | - |\n| Video-ChatGPT [42] | Vicuna-7B | 24.13 | 28.51 | 24.00 | 29.10 |\n| Otter [26] | LLaMa-7B | 22.41 | 22.17 | 25.20 | 23.80 |\n| Valley [39] | Vicuna-7B | 20.34 | 14.48 | 21.00 | 20.20 |\n| mPLUG [27] | LLaMa-7B | 22.76 | - | - | - |\n| Video-LLaMA2 [7] | Vicuna-7B | 21.38 | 26.62 | 24.20 | 23.00 |\n| Hawkeye [79] | LLaVA-7B | 28.60 | 30.00 | 25.30 | 28.50 |\n| LLaMA-VID [31] | Vicuna-7B | 14.83 | 26.70 | 18.80 | 23.60 |\n| VA-GPT (Ours) | Vicuna-7B | **30.69** | **35.00** | **26.20** | **31.02** |\n\nTable 2. We evaluate our method on the MMEval [12] benchmark and show that our method outperforms the related previous method in different aspects.\n\n| Method | Description | Causes | Effect |\n| --- | --- | --- | --- |\n| A-Guardian [12] | 79.65 | 58.92 | 50.64 |\n| Ours | 80.83 | 59.55 | 51.08 |\n\nAligning Spatial Effective Tokens with LLMs. For abnormal video scenes, most areas would not be aligned with languages. Therefore, we implement an additional fine-tuning step. This step involves utilizing Spatial Effective Tokens (see Section 3.2) derived from each video frame within the UCF-Crime dataset. By incorporating these tokens, we aim to provide the model with a more refined understanding of the spatial context of anomalies. It also brings efficient optimization, and the alignment here is only designed for very short-term fine-tuning, which can greatly improve the model\u0026rsquo;s ability to detect and understand anomalies.\n4. Experiments # Datasets. We fine-tune our model on our proposed training dataset in instruct-following format [34], which includes 4077 videos and 7730 images based on the UCF-Crime [54] dataset. 
We evaluate the models on two anomaly video understanding benchmarks: UCF-Crime [54] for in-domain evaluation and a proposed benchmark based on the XD-Violence [64] dataset for cross-domain evaluation. More details are given in the supplementary material.\nBenchmarks and Metrics. To evaluate the ability to review videos and identify anomalies, we use the video anomaly understanding evaluation from Video-Bench [44] to assess temporal comprehension; it contains natural-language Question-Answer pairs from the UCF-Crime [54] dataset. To evaluate the model\u0026rsquo;s cross-domain video anomaly understanding ability, we contribute Question-Answer pairs based on the XD-Violence dataset as an extra benchmark. Each of these Question-Answer pairs offers four options, and each option presents an anomaly category and the time interval during which the anomaly occurs. For each benchmark, we design different sets of questions for two evaluations: an overall evaluation of abnormal event detection and understanding, and a dedicated evaluation of temporal localization ability. Both are measured by question-answering accuracy (denoted as Total Acc. and Temporal Acc., respectively; higher is better).\nImplementation Details. For the network structure, we incorporate the pre-trained CLIP [52] and DINOv2 [46] as the visual encoder and Qformer [9] as the text decoder. We follow [31] to freeze the encoder during modality alignment and to optimize the trainable parameters using anomaly videos and instructions during instruction tuning. We train the model in PyTorch on four NVIDIA A100 GPUs. We employ the AdamW optimizer with a batch size of 64. The learning rate is 2e-5 with a cosine decay schedule, and training runs for 1 epoch.\n4.1. Main Results # Results on In-domain Dataset. 
We first evaluate our method on the in-domain dataset, where the test set shares the same style and recording mode as the data used for training in Section 3.4. As shown in Table 1, compared with the baseline [31], our method uses fewer visual embedding tokens together with temporal effective tokens, yet it more than doubles total accuracy (14.83% to 30.69%) and also significantly improves temporal localization (26.70% to 35.00%). Driven by our proposed training strategy and designed effective tokens, purer and more effective visual-semantic information about abnormal events is efficiently aligned with LLMs, yielding strong anomaly video understanding capabilities.\nFigure 3. Qualitative results in Question-Answer diagrams. The red circles in the figures correspond to the bold text in the answers. From short videos of only a dozen seconds, to medium videos longer than one minute, to long videos of about half an hour, our model can reason well and understand the content.\nTable 3. Ablation studies on Spatial Effective Token Selection (SETS), Temporal Effective Tokens Generation (TETG), and progressive training strategies. At different model training stages, starting from the baseline (w/o fine-tuning, w/o both), we compare the performance of only using SETS (w.SETS), only using TETG (w.TETG), and using both (w.Both) on the UCF-Crime benchmark. Stage One: Only anomaly video fine-tuning. Stage Two: Anomaly video fine-tuning + Fine-tuning with SETS (Best results are shown in bold).\nStage | Method | Total Acc. (%)↑ | Temporal Acc. (%)↑\nBaseline | w/o Both | 14.83 | 26.70\nBaseline | w.SETS | 24.83 | 27.20\nBaseline | w.TETG | 23.79 | 27.76\nBaseline | w.Both | 25.12 | 28.81\nStage One Fine-tuning | w.SETS | 25.86 | 29.68\nStage One Fine-tuning | w.TETG | 26.10 | 30.02\nStage One Fine-tuning | w.Both | 27.50 | 30.77\nStage Two Fine-tuning | w.SETS | 29.31 | 31.60\nStage Two Fine-tuning | w.TETG | 28.96 | 33.58\nStage Two Fine-tuning | w.Both | 30.69 | 35.00\n
At the same time, we conduct fair comparisons with existing video understanding models [26, 27, 29, 39, 42, 74] (see Table 1) and demonstrate competitive performance. It is worth noting that we use the fewest tokens among all methods while achieving state-of-the-art results on both total and temporal accuracy.\nResults on Cross-domain Dataset. To evaluate the robustness and generalization of the models, we additionally design a cross-domain benchmark. We conduct a fair comparison of our method with the baseline [31] and the existing in-domain methods on the proposed cross-domain benchmark. The results in Table 1 show a performance improvement over existing methods on the cross-domain dataset in both total and temporal accuracy, which supports the generalization and temporal localization capabilities of our method and its robustness across domains.\nInteraction with the Model. We also interact with our trained model for a qualitative evaluation. As shown in Figure 3, we demonstrate the performance of our model on various video anomaly understanding challenges. To evaluate the model\u0026rsquo;s effectiveness, we select videos of different durations: short (0 to 1 minute), medium (1 to 30 minutes), and long (over 30 minutes). This variety allows us to assess the model\u0026rsquo;s capabilities across diverse anomaly understanding scenarios. In the Road Accident video (left side in Figure 3), our method successfully identifies a car driving at high speed and detects people falling, even in low-resolution footage. For the Explosion video (middle in Figure 3), the model accurately predicts the scene and the anomaly in a medium-length video depicting an explosion. In a normal video exceeding 30 minutes (right side in Figure 3), we demonstrate the model\u0026rsquo;s ability to focus on both global and local information by asking it to summarize the content.\nTable 4. We fine-tune comparison models with our proposed UCF instruct-following data and evaluate the performance of these models before and after fine-tuning on the UCF benchmark.\nMethod | LLM | Fine-tuned | Total Acc.(%)↑\nVideo-ChatGPT | Vicuna-7B | - | 24.13\nVideo-ChatGPT | Vicuna-7B | ✓ | 26.23\nLLaMA-VID (Baseline) | Vicuna-7B | - | 14.83\nLLaMA-VID (Baseline) | Vicuna-7B | ✓ | 23.1\nOurs | Vicuna-7B | - | 25.12\nOurs | Vicuna-7B | ✓ | 30.69\nComparison on Other Benchmark. We additionally compare our method on another benchmark (MMEval [12]), which evaluates anomaly video understanding with LLMs from different aspects. We follow a fair evaluation protocol for our proposed model and obtain the quantitative results shown in Table 2, where our method scores higher in all three aspects.\n4.2. Ablation Studies # We conduct extensive ablation studies to validate the effectiveness of the key components in our method: Spatial Effective Token Selection (SETS, Section 3.2) and Temporal Effective Tokens Generation (TETG, Section 3.3), under the progressive training strategies in Section 3.4.\nFine-tuning Stages. Our high-quality UCF instruct-following data enhances the model\u0026rsquo;s performance: fine-tuning with this dataset yields a notable accuracy gain over the baseline. As shown in Table 3, with our model design (both SETS and TETG), the total accuracy for anomaly detection reaches only 25.12% without any fine-tuning (denoted as Baseline). With anomaly video fine-tuning (denoted as Stage One Fine-tuning), the accuracy increases to 27.50%. Furthermore, an efficient tuning step with SETS (denoted as Stage Two Fine-tuning) achieves our final performance of 30.69% total accuracy. Temporal accuracy shows a similar increasing pattern across the tuning stages.\nEffectiveness of Fine-tuning Data. 
For a fair comparison, we fine-tune some high-performance comparison models [31, 42] with our proposed UCF instruct-following data. As shown in Table 4, the performance of these comparison models increases after fine-tuning, which indicates the effectiveness of our proposed data. However, their performance is still lower than that of our method, which indicates the effectiveness of our proposed model.\nEffectiveness of SETS. Our proposed SETS is efficient at extracting useful abnormal information, leading to performance enhancement. As shown in Table 3, with SETS the accuracy reaches 24.83%, 25.86% and 29.31% without fine-tuning, with anomaly video fine-tuning, and with fine-tuning with SETS, respectively, which far exceeds the accuracy of the baseline. Its intuitive mechanism for information filtering can be further analyzed with reference to Figure 4. The initial video is often cluttered with irrelevant and deceptive data. For example, in Figure 4, case one illustrates a scenario where the overall structure is complex, yet only a small segment requires attention. SETS effectively filters out the dynamic features that do not require attention. Similarly, in case two, the abnormal area is quite small. Our proposed SETS mechanism filters out redundant and irrelevant information, significantly enhancing the model\u0026rsquo;s ability to accurately pinpoint and recognize abnormal situations.\nTable 5. Ablation on the token sample rate (K ratio) in SETS. We sample the patch tokens with ordered distance at a fixed sample rate (0.5). The ablation indicates that too large a sampling rate causes too much context noise, while too small a sampling rate loses visual information.\nSample Rate K | 0.1 | 0.3 | 0.5 | 0.7 | 0.9\nTotal Acc.(%)↑ | 23.61 | 24.83 | 30.69 | 28.67 | 27.27\nTemporal Acc.(%)↑ | 29.03 | 29.93 | 35.00 | 31.23 | 31.03\nWe also conduct ablation studies on the K ratio in SETS, as shown in Table 5. 
A K ratio that is too small or too large degrades both total and temporal accuracy. If the K ratio is too small, redundant information will affect the effectiveness of aligning abnormal event information with corresponding captions. In contrast, if the K ratio is too large, some important areas will be filtered out, resulting in information loss and suboptimal performance.\nEffectiveness of TETG. Our proposed TETG generates tokens directly in the text token space of LLMs as priors for each video, offering robust priors on the temporal aspects of abnormal events. These priors improve performance without the need for fine-tuning: as shown in Table 3, the total accuracy rises from 14.83% to 23.79%. With anomaly video fine-tuning and with additional fine-tuning with SETS, TETG alone reaches 26.10% and 28.96%, respectively, which demonstrates its effectiveness. Besides, the integration of SETS and TETG highlights the importance of leveraging spatial and temporal information effectively in anomaly detection systems, boosting the results to 25.12%, 27.50% and 30.69% across the three stages, respectively.\nFigure 4. Visualization of the initial videos and our masked results. These two cases illustrate road accident scenarios: one occurring in a bustling street and the other in an empty suburb. Our SETS effectively filters redundant and irrelevant regions (with black patch-level masks).\n5. Discussion # Key tokens play key roles. To the best of our knowledge, we are the first to explore how to assign different learnable knowledge to different tokens for better alignment with LLMs on visual contents, thus promoting video anomaly detection and understanding (see Table 1 and Figure 3). We assign the most effective roles to different tokens in both spatial and temporal dimensions, enabling the model to handle various tokens more efficiently. The video contains abundant but redundant information. 
Our proposed SETS and TETG effectively compress the spatial and temporal information of abnormal events, respectively, and use the existing alignment mechanism of MLLMs to participate in LLMs\u0026rsquo; reasoning at a very low cost (see Table 3). We hope our exploration inspires further representation learning for MLLMs to facilitate downstream tasks.\nData matters a lot. We construct instruct-following data for anomaly videos containing approximately 4,000 videos, which is significantly less than the amount of video data used to fine-tune the baseline (e.g., over 90k videos for fine-tuning in the baseline model [31]). We still achieve promising performance on both in-domain and cross-domain benchmarks (see Table 1). This relies on the high-quality Question-Answer pairs in our instruct-following data. Meanwhile, SETS also improves data quality during fine-tuning: visual areas irrelevant to the Question-Answer pairs are filtered out (see Figure 4), which allows for significant performance improvements in the second stage of fine-tuning (see Section 3.4) with very few steps (fewer than 150 iterations).\nBroader impacts. Video anomaly understanding has far-reaching implications across various sectors, including security, healthcare, industrial safety, and so on. By enhancing the ability to automatically identify and respond to unusual or suspicious activities in real time, LLMs can significantly improve public safety, crime prevention, patient monitoring, hazard detection, loss prevention, traffic management, and urban planning. These systems offer substantial benefits in terms of operational efficiency and safety.\nLimitations. Although our model adeptly portrays the occurrence, type, and area of abnormal video events, it still faces challenges in detecting and describing certain complex scenes. Our strategy represents an early successful validation and investigation of large models for video anomaly identification and localization. 
Consequently, there remains significant room to improve our method\u0026rsquo;s recognition of diverse abnormal video scenes. These insights motivate us to continue pursuing more powerful and efficient video anomaly understanding technologies in the future, aiming to address more challenges in the real world [35, 41, 77].\n6. Conclusions # In this paper, we propose a novel MLLM for understanding anomalies in videos with LLMs by aligning effective tokens in both temporal and spatial space. The proposed method includes Spatial Effective Token Selection (SETS) for identifying abnormal events in small areas of large scenes and Temporal Effective Tokens Generation (TETG) for addressing the sparseness of abnormal events in video time sequences. We also develop instruct-following data for video anomaly detection to fine-tune the model. Evaluation on the video anomaly understanding benchmark and a proposed cross-domain benchmark demonstrates the effectiveness of the proposed method. It further presents a promising approach for video anomaly understanding using MLLMs, showcasing the potential of effective tokens for enhancing video understanding tasks.\nAcknowledgements # This work has been supported in part by Hong Kong Research Grant Council - Early Career Scheme (Grant No.27209621), General Research Fund Scheme (Grant No.17202422, 17212923, 17215025), Theme-based Research (Grant No.T45-701/22-R) and Shenzhen Science and Technology Innovation Commission (SGDX20220530111405040). Part of the described research work is conducted in the JC STEM Lab of Robotics for Soft Materials funded by The Hong Kong Jockey Club Charities Trust.\nReferences # [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1 , 2 [2] Borislav Antić and Björn Ommer. Video parsing for abnormality detection. 
In 2011 International Conference on Computer Vision, pages 2415–2422, 2011. 1 [3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. 2 [4] Ruichu Cai, Hao Zhang, Wen Liu, Shenghua Gao, and Zhifeng Hao. Appearance-motion memory consistency network for video anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence, 35(2):938–946, 2021. 1 , 3 [5] Yunpeng Chang, Zhigang Tu, Wei Xie, and Junsong Yuan. Clustering driven deep autoencoder for video anomaly detection. In Computer Vision – ECCV 2020, 16th European Conference, pages 329–345, 2022. 3 [6] Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitudecontrastive glance-and-focus network for weakly-supervised video anomaly detection. AAAI2023, 2022. 1 , 3 [7] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatialtemporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 5 [8] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. 
Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models, 2022. 2 [9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose visionlanguage models with instruction tuning, 2023. 5 [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. 2 [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 2 , 3 [12] Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jianhang Chen, Qimei Cui, and Xiaofeng Tao. Uncovering what, why and how: A comprehensive benchmark for causation understanding of video anomaly, 2024. 2 , 3 , 5 , 7 [13] JiaChang Feng, FaTing Hong, and WeiShi Zheng. MIST: multiple instance self-training framework for video anomaly detection. In 2021 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 1 , 3 [14] Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Rongrong Ji, and Chunhua Shen. Pyramidclip: Hierarchical feature alignment for vision-language model pretraining, 2022. 2 [15] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all, 2023. 2 [16] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. 
Memorizing normality to detect anomaly: Memoryaugmented deep autoencoder for unsupervised anomaly detection. In IEEE International Conference on Computer Vision (ICCV), 2019. 1 , 3 [17] Rodolfo Gonzalez and Sandra Palais. A path-building procedure for iterative circuit computers. Technical report, 1962. 4 [18] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao. A survey on vision transformer. IEEE Transactions on Pattern Analysis \u0026amp; Machine Intelligence, pages 1–1, 2020. 2 [19] Mahmudul Hasan, Jonghyun Choi, jan Neumann, Amit K Roy-Chowdhury, and Larry Davis. Learning temporal regularity in video sequences. In Proceedings of IEEE Computer Vision and Pattern Recognition, 2016. 1 [20] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 3 [21] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision, 2021. 2 [22] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weaklysupervised video anomaly detection, 2023. 1 [23] Louis Kratz and Ko Nishino. Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1446–1453, 2009. 1 [24] Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. F-vlm: Open-vocabulary object detection upon frozen vision and language models, 2023. 2 [25] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 
Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, 2019. 2 [26] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning, 2023. 3 , 5 , 6 [27] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou, and Luo Si. mplug: Effective and efficient vision-language learning by cross-modal skip-connections, 2022. 1 , 3 , 5 , 6 [28] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. 1 , 2 [29] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding, 2024. 1 , 3 , 5 , 6 [30] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multisequence learning with transformer for weakly supervised video anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence, 36(2):1395–1403, 2022. 3 [31] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models, 2023. 1 , 3 , 4 , 5 , 6 , 7 , 8 [32] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2023. 4 [33] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Munan Ning, and Li Yuan. Moellava: Mixture of experts for large vision-language models, 2024. 2 [34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 1 , 2 , 5 [35] Jiahui Liu, Chirui Chang, Jianhui Liu, Xiaoyang Wu, Lan Ma, and Xiaojuan Qi. Mars3d: A plug-and-play motionaware model for semantic segmentation on multi-scan 3d point clouds. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9372–9381, 2023. 8 [36] Jiahui Liu, Xin Wen, Shizhen Zhao, Yingxian Chen, and Xiaojuan Qi. Can ood object detectors learn from foundation models? In European Conference on Computer Vision, pages 213–231. Springer, 2024. 1 [37] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE International Conference on Computer Vision, 2021. 1 , 3 [38] Yiwei Lu, Frank Yu, Mahesh Kumar Krishna Reddy, and Yang Wang. Few-shot scene-adaptive anomaly detection. In European Conference on Computer Vision, 2020. 3 [39] Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability, 2023. 1 , 3 , 5 , 6 [40] Weixin Luo, Wen Liu, and Shenghua Gao. Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 439–444, 2017. 3 [41] Xiaoyang Lyu, Chirui Chang, Peng Dai, Yang-Tian Sun, and Xiaojuan Qi. Total-decom: decomposed 3d scene reconstruction with minimal interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20860–20869, 2024. 8 [42] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models, 2023. 1 , 3 , 5 , 6 , 7 [43] Trong-Nguyen Nguyen and Jean Meunier. Anomaly detection in video sequence with appearance-motion correspondence. CoRR, abs/1908.06351, 2019. 1 [44] Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models, 2023. 5 [45] OpenAI. 
Introducing chatgpt. 2022. 1 , 2 [46] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2024. 4 , 5 [47] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14372–14381, 2020. 1 , 3 [48] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4, 2023. 2 [49] Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018. 2 [50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. 2 [51] Bharathkumar Ramachandra, Michael J. Jones, and Ranga Raju Vatsavai. Learning a distance function with a siamese network to localize anomalies in videos. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020. 1 [52] Aditya Sanghi, Hang Chu, Joseph G Lambourne, Ye Wang, Chin-Yi Cheng, and Marco Fumero. Clip-forge: Towards zero-shot text-to-shape generation. In CVPR, 2022. 3 , 5 [53] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all, 2023. 2 [54] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 
3 , 5 [55] Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and YingCong Chen. Hawk: Learning to understand open-world video anomalies, 2024. 2 , 3 [56] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. Lamda: Language models for dialog applications, 2022. 2 [57] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W. Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4975–4986, 2021. 1 , 3 [58] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023. 1 , 2 [59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 
In Advances in Neural Information Processing Systems, 2017. 2 [60] Boyang Wan, Yuming Fang, Xue Xia, and Jiajie Mei. Weakly supervised video anomaly detection via center-guided discriminative learning. In Proceedings of the IEEE International Conference on Multimedia and Expo, 2020. 1 , 3 [61] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, and Jifeng Dai. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks, 2023. 2 [62] Jie Wu, Wei Zhang, Guanbin Li, Wenhao Wu, Xiao Tan, Yingying Li, Errui Ding, and Liang Lin. Weakly-supervised spatio-temporal anomaly detection in surveillance video. In The Thirtieth International Joint Conference on Artificial Intelligence, 2021. 3 [63] Peng Wu and Jing Liu. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing, 30:3513–3527, 2021. 1 , 3 [64] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In European Conference on Computer Vision (ECCV), 2020. 1 , 2 , 5 [65] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection, 2023. 1 [66] Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open-vocabulary video anomaly detection, 2024. 1 [67] Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. Follow the rules: Reasoning for video anomaly detection with large language models, 2024. 2 , 3 [68] Zhiwei Yang, Jing Liu, and Peng Wu. Text prompt with normality guidance for weakly supervised video anomaly detection, 2024. 1 [69] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 
Filip: Fine-grained interactive language-image pre-training, 2021. 2 [70] Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin, and Zhenzhen Jiao. Towards surveillance videoand-language understanding: New dataset, baselines, and challenges, 2023. 4 [71] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In European Conference on Computer Vision (ECCV), pages 358–376. Springer, 2020. 1 , 3 [72] Muhammad Zaigham Zaheer, Arif Mahmood, Muhammad Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection, 2022. 3 [73] Muhammad Zaigham Zaheer, Jin-Ha Lee, Marcella Astrid, Arif Mahmood, and Seung-Ik Lee. Cleaning label noise with clusters for minimally supervised anomaly detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, June 2020. 1 [74] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding, 2023. 1 , 3 , 6 [75] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xiaonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, and Nong Sang. Holmes-vau: Towards long-term video anomaly understanding at any granularity, 2024. 2 , 3 [76] Jiangong Zhang, Laiyun Qing, and Jun Miao. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In 2019 IEEE International Conference on Image Processing (ICIP), pages 4030–4034, 2019. 3 [77] Jiaqi Zhang, Yan Hu, Xiaojuan Qi, Ting Meng, Lihui Wang, Huazhu Fu, Mingming Yang, and Jiang Liu. Polar eyeball shape net for 3d posterior ocular shape representation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 180–190. Springer, 2023. 8 [78] Bin Zhao, Fei-Fei Li, and Eric P. Xing. 
Online detection of unusual events in videos via dynamic sparse coding. In CVPR 2011, pages 3313–3320, 2011. 1 [79] Jianing Zhao, Jingjing Wang, Yujie Jin, Jiamin Luo, and Guodong Zhou. Hawkeye: Discovering and grounding implicit anomalous sentiment in recon-videos via sceneenhanced video large language model. In Proceedings of ACM MM 2024, 2024. 2 , 3 , 5 [80] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models, 2023. 2 [81] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM International Conference on Multimedia, page 1933–1941, New York, NY, USA, 2017. Association for Computing Machinery. 3 [82] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H. Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 3 [83] Lanfeng Zhong, Xin Liao, Shaoting Zhang, Xiaofan Zhang, and Guotai Wang. Vlm-cpl: Consensus pseudo labels from vision-language models for human annotation-free pathological image classification, 2024. 2 [84] Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection, 2023. 1 [85] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023. 2 [86] Yi Zhu and Shawn D. Newsam. Motion-aware feature for improved video anomaly detection. In British Machine Vision Conference (BMVC), 2019. 
1 ","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/aligning-effective-tokens-with-video-anomaly-in-large-language-models/","section":"Papers","summary":"Proposes VA-GPT, a multimodal Large Language Model for video anomaly detection and understanding, utilizing effective token selection and generation modules (SETS and TETG) to improve spatial and temporal localization of anomalies. Introduces instruct-following fine-tuning data and cross-domain benchmarks for robustness evaluation.","title":"Aligning Effective Tokens with Video Anomaly in Large Language Models","type":"other"},{"content":" An Attribute-based Method for Video Anomaly Detection # Tal Reiss # School of Computer Science and Engineering The Hebrew University of Jerusalem, Israel\nYedid Hoshen # School of Computer Science and Engineering The Hebrew University of Jerusalem, Israel\nReviewed on OpenReview: https://openreview.net/forum?id=XL1N6iLr0G\nAbstract # Video anomaly detection (VAD) identifies suspicious events in videos, which is critical for crime prevention and homeland security. In this paper, we propose a simple but highly effective VAD method that relies on attribute-based representations. The base version of our method represents every object by its velocity and pose, and computes anomaly scores by density estimation. Surprisingly, this simple representation is sufficient to achieve state-of-the-art performance in ShanghaiTech, the most commonly used VAD dataset. Combining our attribute-based representations with an off-the-shelf, pretrained deep representation yields state-of-the-art performance with 99.1%, 93.7%, and 85.9% AUROC on Ped2, Avenue, and ShanghaiTech, respectively. Our code is available at https://github.com/talreiss/Accurate-Interpretable-VAD .\n1 Introduction # Video anomaly detection (VAD) aims to discover interesting but rare events in video. 
The task has attracted much interest as it is critical for crime prevention and homeland security. One-class classification (OCC) is one of the most popular VAD settings, where the training set consists of normal videos only, while at test time the trained model needs to distinguish between normal and anomalous events. The key challenge for learning-based VAD is that classifying an event as normal or anomalous depends on the human operator\u0026rsquo;s particular definition of normality. Unlike in supervised learning, there are no training examples of anomalous events, which essentially requires the learning algorithms to have strong priors.\nMany previous VAD methods use a combination of deep networks with self-supervised objectives. A popular line of work consists of training a neural network to predict the next frame and classifying it as anomalous if the predicted and observed frames differ significantly. While such methods can achieve good performance, they do not make their priors explicit, i.e., it is unclear why they consider some frames more anomalous than others, making them difficult to debug and improve. More recent methods involve a preliminary object extraction stage from video frames, implying that objects are significant for VAD. However, they typically use object-level self-supervised approaches that do not make their priors explicit. Our hypothesis is that making priors more explicit will improve VAD performance.\nIn this paper, we propose a new approach that directly represents each video frame by simple attributes that are semantically meaningful to humans. Our method extracts objects from every frame, and represents each object by two attributes: its velocity and body pose (if the object is human). These attributes are well known to be important for VAD (Markovitz et al., 2020; Georgescu et al., 2021a). We detect anomalous values of these representations by using density estimation. 
Concretely, our method classifies a frame as anomalous if it contains one or more objects that have unusual values of velocity and/or pose (see Fig. 1). Our simple velocity and pose representations achieve state-of-the-art performance (85.9% AUROC) on the most popular VAD dataset, ShanghaiTech.\ntal.reiss@mail.huji.ac.il yedid.hoshen@mail.huji.ac.il\nFigure 1: The Avenue and ShanghaiTech datasets. We present the most normal and anomalous frames for each feature. For anomalous frames, we visualize the bounding box of the object with the highest anomaly score. Best viewed in color.\nWhile the velocity and pose representations are highly effective, they ignore other attributes, most importantly object category. As an example, if we never see a lion in normal videos, a test video containing a lion is anomalous. To represent these residual attributes, we model them with an off-the-shelf, deep representation (here, we use CLIP features). Our final method combines velocity, pose and the deep representations. It achieves state-of-the-art performance on the three most commonly reported datasets.\n2 Related Work # Classical video anomaly detection methods were typically composed of two steps: handcrafted feature extraction and anomaly scoring. Handcrafted features included optical flow histograms (Chaudhry et al., 2009; Colque et al., 2016; Perš et al., 2010) and SIFT (Lowe, 2004). Commonly used scoring methods include: density estimation (Eskin et al., 2002; Glodek et al., 2013; Latecki et al., 2007), reconstruction (Jolliffe, 2011), and one-class classification (Scholkopf et al., 2000).\nIn recent years, deep learning has gained in popularity as an alternative to these early works. The majority of video anomaly detection methods utilize at least one of four paradigms: reconstruction-based, prediction-based, skeletal-based, or auxiliary classification-based methods.\nReconstruction \u0026amp; prediction based methods. 
In the reconstruction paradigm, the normal training data is typically characterized by an autoencoder, which is then used to reconstruct input video clips. The assumption is that a model trained solely on normal training clips will not be able to reconstruct anomalous frames. This assumption does not always hold true, as neural networks can often generalize to some extent out-of-distribution. Notable works include (Nguyen \u0026amp; Meunier, 2019; Chang et al., 2020; Hasan et al., 2016b; Luo et al., 2017b; Yu et al., 2020; Park et al., 2020).\nPrediction-based methods learn to predict frames or flow maps in video clips, including inpainting intermediate frames, predicting future frames, and predicting human trajectories (Liu et al., 2018a; Feng et al., 2021b; Chen et al., 2020; Lee et al., 2019; Lu et al., 2019; Park et al., 2020; Wang et al., 2021; Feng et al., 2021a; Yu et al., 2020). Additionally, some works take a hybrid approach combining the two paradigms (Liu et al., 2021b; Zhao et al., 2017; Ye et al., 2019; Tang et al., 2020; Morais et al., 2019). As these methods are trained to optimize both objectives, input frames with large reconstruction or prediction errors are considered anomalous.\nSelf-supervised auxiliary tasks. There has been a great deal of research on learning from unlabeled data. A common approach is to train neural networks on suitably designed auxiliary tasks with automatically generated labels. Tasks include: video frame prediction (Mathieu et al., 2016), image colorization (Zhang et al., 2016; Larsson et al., 2016), puzzle solving (Noroozi \u0026amp; Favaro, 2016), rotation prediction (Gidaris et al., 2018), arrow of time (Wei et al., 2018), predicting playback velocity (Doersch et al., 2015), and verifying frame order (Misra et al., 2016). Many video anomaly detection methods use self-supervised learning. In fact, self-supervised learning is a key component in the majority of reconstruction-based and prediction-based methods. 
SSMTL (Georgescu et al., 2021a) trains a CNN jointly on three auxiliary tasks: arrow of time, motion irregularity, and middle-box prediction, in addition to knowledge distillation. Jigsaw-Puzzle (Wang et al., 2022) trains neural networks to solve spatio-temporal jigsaw puzzles. The networks are then used for VAD.\nSkeletal methods. Such methods rely on a pose tracker to extract the skeleton trajectories of each person in the video. Anomalies are then detected using the skeleton trajectory data. Our attribute-based method outperforms previous skeletal methods (e.g., Markovitz et al., 2020; Rodrigues et al., 2020; Yu et al., 2021; Sun \u0026amp; Gong, 2023) by a large margin. In concurrent work, Hirschorn \u0026amp; Avidan (2023) proposed a skeletal-based method that achieved comparable results to ours on the ShanghaiTech dataset but was outperformed by large margins on the rest of the evaluated benchmarks. Unlike skeletal approaches, our method does not require pose tracking, which is extremely challenging in crowded scenes. Our pose features only use a single frame, while our velocity features only require a pair of frames. In contrast, skeletal approaches require pose tracking across many frames, which is expensive and error-prone. It is also important to note that skeletal features by themselves are ineffective in detecting non-human anomalies and are therefore insufficient for a complete VAD solution.\nObject-level video anomaly detection. Early methods, both classical and deep learning, operated on entire video frames. This proved difficult for VAD as frames contain many variations, as well as a large number of objects. More recent methods (Georgescu et al., 2021a; Liu et al., 2021b; Wang et al., 2022) operate at the object level by first extracting object bounding boxes using off-the-shelf object detectors. Then, they detect if each object is anomalous. This is an easier task, as objects contain much less variation than whole frames. 
Object-based methods yield significantly better results than frame-level methods.\nIt is often believed that due to the complexity of realistic scenes and the variety of behaviors, it is difficult to craft features that will discriminate between them. As object detection was inaccurate prior to deep learning, classical methods were previously applied at the frame level rather than at the object level, and therefore underperformed on standard benchmarks. We break this misconception and demonstrate that it is possible to craft semantic features that are accurate.\n3 Method # 3.1 Preliminaries # Our method assumes a training set consisting of Nc video clips {c1, c2, \u0026hellip;, cNc} ∈ Xtrain that are all normal (i.e., do not contain any anomalies). Each clip ci comprises Ni frames, ci = [fi,1, fi,2, \u0026hellip;, fi,Ni]. The goal is to classify each frame f ∈ c in an inference clip c as normal or anomalous. Our method represents each frame f as ϕ(f) ∈ R^d, where d ∈ N is the feature dimension. We compute the anomaly score of frame f using an anomaly scoring function s(ϕ(f)), and classify it as anomalous if s(ϕ(f)) exceeds a threshold.\n3.2 Overview # Figure 2: An overview of our method. We first extract optical flow maps and bounding boxes for all of the objects in the frame. We then crop each object from the original image and its corresponding flow map. Our representation consists of velocity, pose, and deep (CLIP) features.\n\nWe propose a method that represents each video frame as a set of objects, with each object characterized by its attributes. This contrasts with most previous methods that do not explicitly represent attributes. Specifically, we focus on two key attributes: velocity and body pose. Object velocity is probably the most important attribute as it can detect if an object is moving unusually fast, e.g., running away, or in a strange direction, e.g., against the direction of traffic. 
Given the challenges of capturing true 3D velocity without depth information, we use optical flow as an effective proxy to represent apparent motion between frames. Similarly, human body poses can reveal anomalous activities, such as throwing an item. Instead of using the full 3D human pose, we represent it with 2D landmarks, which are easier to detect. Both attributes have been used in previous VAD studies, such as Georgescu et al. (2021a) and Markovitz et al. (2020), as part of more complex approaches.\nWe compute the anomaly score based on density estimation of object-level feature descriptors. This is done in three stages: pre-processing, feature extraction, and density estimation. In the pre-processing stage, our method (i) uses an off-the-shelf motion estimator to estimate the optical flow for each frame; (ii) localizes and classifies the bounding boxes of all objects within a frame using an off-the-shelf object detector. The outputs of these models are used to extract object-level velocity, pose, and deep representations (see Sec. 3.4). Finally, our method uses density estimation to calculate the anomaly score of each test frame. See Fig. 2 for an illustration.\n3.3 Pre-processing # Our method extracts velocity and pose from each object in each video frame. To do so, we compute optical flow and body landmarks for each object in the video.\nOptical flow. Our method uses optical flow as a proxy for its velocity. We extract the optical flow map o for each frame f ∈ c in every video clip c using an off-the-shelf optical flow model.\nObject detection. Our method models frames by representing every object individually. This follows many recent papers, e.g., (Georgescu et al., 2021a; Liu et al., 2021b; Wang et al., 2022) that found that object-based representations are more effective than global, frame-level representations. We first detect all objects in each frame using an off-the-shelf object detector. 
Formally, our object detection generates a set of m bounding boxes bb1, bb2, \u0026hellip;, bbm for each frame, with corresponding category labels y1, y2, \u0026hellip;, ym.\nFigure 3: An illustration of our velocity feature vector. Left: We quantize the orientations into B = 8 equi-spaced bins, and assign each optical flow vector in the object\u0026rsquo;s bounding box to a single bin. Right: The value of each bin is the average magnitude of the optical flow vectors assigned to this bin. Best viewed in color.\n3.4 Feature extraction # Our method represents each object by two semantic attributes, velocity and pose, and by an implicit, deep representation.\nVelocity features. Our working hypothesis is that an unusual speed or direction of motion can often identify anomalies in video. As objects can move in both x and y axes and both the magnitude (speed) and orientation of the velocity may be anomalous, we compute velocity features for each object in each frame. We begin by cropping the frame-level optical flow map o by the bounding box bb of each detected object. Following this step, we obtain a set of cropped object flow maps (see Fig. 2). We rescale these flow maps to a fixed size of Hflow × Wflow. Next, we represent each flow map with the average motion for each orientation, where orientations are quantized into B ∈ N equi-spaced bins, following classical optical flow representations, e.g., (Chaudhry et al., 2009). The final representation is a B-dimensional vector consisting of the average flow magnitude of the flow vectors in each bin (see Fig. 3). This representation is capable of describing motion in both radial and tangential directions. We denote our velocity feature extractor as: ϕvelocity : Hflow × Wflow → R^B.\nPose features. Most anomalous events in video involve humans, so we include activity features in our representation. 
While a full understanding of activity requires temporal features, we find that human body pose from a single frame can detect many unusual activities. Even though human pose is essentially a 3D feature, 2D body landmark positions already provide much of the signal. We compute pose features for each human object bb using an off-the-shelf 2D keypoint extractor that outputs the pixel coordinates of each landmark position, denoted by ϕ̂pose(bb) ∈ R^(2×d), where d ∈ N is the number of keypoints. To ensure invariance to the human\u0026rsquo;s position and size, we normalize the keypoints. First, we subtract the coordinates of the top-left corner of the object bounding box from each landmark. We then scale the x and y axes so that the object bounding box has a fixed size of Hpose × Wpose, where Hpose and Wpose are constants. Formally, let l ∈ R^2 be the top-left corner of the human bounding box. The pose descriptor becomes:\n\nϕpose(bb) = diag(Wpose / width(bb), Hpose / height(bb)) · (ϕ̂pose(bb) − l)\n\nHere, the two diagonal entries scale the x and y coordinates respectively, l is subtracted from every keypoint, and height(bb) and width(bb) are the height and width of the object bounding box bb. Finally, we flatten ϕpose to obtain the final pose feature vector.\nDeep features. While velocity is the most discriminative attribute, other attributes beyond pose may also matter. Deep features implicitly bundle together many different attributes. Hence, we use them to model the residual attributes which are not described by velocity and pose. We follow previous anomaly detection works (Reiss et al., 2021; Reiss \u0026amp; Hoshen, 2023), that used generic, pretrained encoders to implicitly represent image attributes. Concretely, we use a pretrained CLIP encoder (Radford et al., 2021), ϕdeep(·), to represent the bounding box of each object in each frame. Note that CLIP representations are not competitive on their own; in fact, they perform much worse than the velocity representations (see Tab. 3). 
However, together, velocity, pose, and CLIP features represent video sufficiently well to outperform the state-of-the-art.\n3.5 Density Estimation # We use density estimation for scoring samples as normal or anomalous, where a low estimated density indicates anomaly. To estimate density, we fit a separate estimator for each feature. As velocity features are low-dimensional, we use a Gaussian mixture (GMM) estimator. As our pose and deep features are high-dimensional, we estimate their density using kNN. Specifically, we compute the L2 distance between the feature x of a target object and the nearest k exemplars in the corresponding training feature set. We compare different exemplar selection methods in Sec. 4.5. We denote our density estimators by s_velocity(·), s_pose(·), s_deep(·).\nScore calibration. Combining the three density estimators requires calibration. To do so, we estimate the distribution of anomaly scores on the normal training set. We then scale the scores using min-max normalization. The kNN used for scoring pose and deep features presents a subtle point. When computing kNN on the training set, the exemplars must not be taken from the same clip as the target object. The reason is that the same object appears in nearby frames with virtually no variation, distorting kNN estimates. Instead, we compute the kNN between each training set object and all objects in the other video clips provided in the training set. We can now define ∀a ∈ {velocity, pose, deep}: µ_a = max_x {s_a(ϕ_a(x))} and ν_a = min_x {s_a(ϕ_a(x))}, where x ranges over the training-set objects.\n3.6 Inference # We extract a representation for each object bb in each frame f in each inference clip, as described above. We then compute an anomaly score for each attribute feature of each object bb. The score for every frame is simply the maximum score across all objects. 
The final anomaly score is the sum of the individual feature scores normalized by our calibration parameters:\n\ns(f) = Σ_a (s_a(f) − ν_a) / (µ_a − ν_a), a ∈ {velocity, pose, deep}\n\nwhere s_a(f) is the maximum score of attribute a over all objects in frame f, and µ_a and ν_a are the maximum and minimum scores of attribute a on the normal training set.\n\nAs anomalous events span multiple frames, we smooth the frame scores using a temporal smoothing filter.\n4 Experiments # 4.1 Datasets # We evaluated our method on three publicly available VAD datasets, using their training and test splits. Only test videos included anomalous events. We report the statistics of the datasets in Tab. 1.\nUCSD Ped2. This dataset (Mahadevan et al., 2010) contains 16 normal training videos and 12 test videos at a 240 × 360 pixel resolution. Videos show a fixed scene with a camera above the scene and pointed downward. The training video clips contain only normal behavior of pedestrians walking, while examples of abnormal events are bikers, skateboarding, and cars.\nCUHK Avenue. This dataset (Lu et al., 2013) contains 16 normal training videos and 21 test videos at 360 × 640 pixel resolution. Videos show a fixed scene using a ground-level camera. Training video clips contain only normal behavior. Examples of abnormal events are strange activities (e.g. throwing objects, loitering, and running), movement in the wrong direction, and abnormal objects.\nShanghaiTech Campus. This dataset (Liu et al., 2018a) is the largest publicly available dataset for VAD. There are 330 training videos and 107 test videos from 13 different scenes at 480 × 856 pixel resolution. 
ShanghaiTech contains video clips with complex light conditions and camera angles, making this dataset more challenging than the other two. Anomalies include robberies, jumping, fights, car invasions, and bike riding in pedestrian areas.\n\nTable 1: Statistics of the evaluation datasets.\n\n| Dataset | Total frames | Train set frames | Test set frames | Normal frames | Anomalous frames | Scenes | Anomaly types |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| UCSD Ped2 | 4,560 | 2,550 | 2,010 | 2,924 | 1,636 | 1 | 5 |\n| CUHK Avenue | 30,652 | 15,328 | 15,324 | 26,832 | 3,820 | 1 | 5 |\n| ShanghaiTech | 317,398 | 274,515 | 42,883 | 300,308 | 17,090 | 13 | 11 |\n\n4.2 Implementation Details # We use ResNet50 Mask-RCNN (He et al., 2017) pretrained on MS-COCO (Lin et al., 2014) to extract object bounding boxes. To filter out low confidence objects, we follow the same configurations as in (Georgescu et al., 2021a). Specifically, for Ped2, Avenue, and ShanghaiTech, we set confidence thresholds of 0.5, 0.8, and 0.8, respectively. In order to generate optical flow maps, we use FlowNet2 (Ilg et al., 2017). For our landmark detection, we use AlphaPose (Fang et al., 2017) pretrained on MS-COCO with d = 17 keypoints. We use a pretrained ViT B-16 CLIP (Dosovitskiy et al., 2020; Radford et al., 2021) image encoder as our deep feature extractor. Our method is built around the extracted objects and flow maps. We use Hvelocity × Wvelocity = 224 × 224 to rescale flow maps. As for Hpose × Wpose rescaling, we calculate the average height and width from the bounding boxes of the train set and use those values. The low resolution of Ped2 prevents objects from filling a histogram and makes pose extraction infeasible; we therefore use B = 1 orientations and rely solely on velocity and deep representations. We use B = 8 orientations for Avenue and ShanghaiTech. 
When testing, for anomaly scoring we use kNN for the pose and deep representations with k = 1 nearest neighbor. For velocity, we use GMM with n = 5 Gaussians. Finally, the anomaly score of a frame represents the maximum score among all the objects within that frame.\n4.3 Evaluation Metrics # Our study uses standard VAD evaluation metrics. We vary the threshold over the anomaly scores to measure the frame-level Area Under the Receiver Operating Characteristic (AUROC) with respect to the ground-truth annotations. We report two types of AUROC: (i) micro-averaged AUROC, which computes the score over all frames from all videos; (ii) macro-averaged AUROC, which computes the AUROC score individually for each video and then averages the scores of all videos. Most existing studies report micro-averaged AUROC, while only a few report macro-averaged AUROC.\n4.4 Quantitative Results # We compare our method and the state-of-the-art from recent years in Tab. 2. We took the performance numbers of the baseline methods directly from the original papers.\nPed2 results. Most methods obtained over 94% on Ped2, indicating that of the three public datasets, it is the simplest. While our method is comparable to the current state-of-the-art method (HF2 (Liu et al., 2021b)) in terms of performance, the near-perfect results on Ped2 indicate it is practically solved.\nAvenue results. Our method obtained a new state-of-the-art micro-averaged AUROC of 93.7%. Our method also outperformed the current state-of-the-art in terms of macro-averaged AUROC by a considerable margin of 2.8%, reaching 96.3%.\nShanghaiTech results. Our method outperforms all previous methods on the largest dataset, ShanghaiTech, by a considerable margin. Accordingly, our method achieves 85.9% AUROC, higher than the best performance previous methods achieved, 85.1% (MS-VAD (Zhang et al., 2024)). We note that in concurrent work, STF-NF (Hirschorn \u0026amp; Avidan, 2023) achieved comparable results to ours on the ShanghaiTech dataset. Our method outperforms it by large margins on Ped2 (by 6.0% AUROC) and Avenue (by 33.6% AUROC).\n\nTable 2: Frame-level AUROC (%) comparison. 
The best and second-best results are bolded and underlined, respectively.\n\n| Year | Method | Ped2 Micro | Ped2 Macro | Avenue Micro | Avenue Macro | ShanghaiTech Micro | ShanghaiTech Macro |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| | HOOF (Chaudhry et al., 2009) | 61.1 | - | - | - | - | - |\n| | HOFM (Colque et al., 2016) | 89.9 | - | - | - | - | - |\n| | SCL (Lu et al., 2013) | - | - | 80.9 | - | - | - |\n| | Conv-AE (Hasan et al., 2016a) | 90.0 | - | 70.2 | - | - | - |\n| | StackRNN (Luo et al., 2017a) | 92.2 | - | 81.7 | - | 68.0 | - |\n| | STAN (Lee et al., 2018) | 96.5 | - | 87.2 | - | - | - |\n| | MC2ST (Liu et al., 2018b) | 87.5 | - | 84.4 | - | - | - |\n| | Frame-Pred. (Liu et al., 2018a) | 95.4 | - | 85.1 | - | 72.8 | - |\n| | Mem-AE. (Gong et al., 2019) | 94.1 | - | 83.3 | - | 71.2 | - |\n| | CAE-SVM (Ionescu et al., 2019) | 94.3 | 97.8 | 87.4 | 90.4 | 78.7 | 84.9 |\n| | BMAN (Lee et al., 2019) | 96.6 | - | 90.0 | - | 76.2 | - |\n| | AM-Corr (Nguyen \u0026amp; Meunier, 2019) | 96.2 | - | 86.9 | - | - | - |\n| 2020 | MNAD-Recon. (Park et al., 2020) | 97.0 | - | 88.5 | - | 70.5 | - |\n| 2020 | CAC (Wang et al., 2020) | - | - | 87.0 | - | 79.3 | - |\n| 2020 | Scene-Aware (Sun et al., 2020) | - | - | 89.6 | - | 74.7 | - |\n| 2020 | VEC (Yu et al., 2020) | 97.3 | - | 90.2 | - | 74.8 | - |\n| 2020 | ClusterAE (Chang et al., 2020) | 96.5 | - | 86.0 | - | 73.3 | - |\n| 2021 | AMMCN (Cai et al., 2021) | 96.6 | - | 86.6 | - | 73.7 | - |\n| 2021 | SSMTL (Georgescu et al., 2021a) | 97.5 | 99.8 | 91.5 | 91.9 | 82.4 | 89.3 |\n| 2021 | MPN (Lv et al., 2021) | 96.9 | - | 89.5 | - | 73.8 | - |\n| 2021 | HF2 (Liu et al., 2021a) | 99.3 | - | 91.1 | 93.5 | 76.2 | - |\n| 2021 | CT-D2GAN (Feng et al., 2021a) | 97.2 | - | 85.9 | - | 77.7 | - |\n| 2021 | BA-AED (Georgescu et al., 2021b) | 98.7 | 99.7 | 92.3 | 90.4 | 82.7 | 89.3 |\n| 2022 | SSPCAB (Ristea et al., 2022) | - | - | 92.9 | 91.9 | 83.6 | 89.5 |\n| 2022 | DLAN-AC (Yang et al., 2022) | 97.6 | - | 89.9 | - | 74.7 | - |\n| 2022 | Jigsaw-Puzzle (Wang et al., 2022) | 99.0 | 99.9 | 92.2 | 93.0 | 84.3 | 89.8 |\n| 2023 | USTN-DSC (Yang et al., 2023) | 98.1 | - | 89.9 | - | 73.8 | - |\n| 2023 | EVAL (Singh et al., 2023) | - | - | 86.0 | - | 76.6 | - |\n| 2023 | FB-SAE (Cao et al., 2023) | 97.1 | 99.2 | 86.8 | 89.1 | 79.2 | 80.2 |\n| 2023 | FPDM (Yan et al., 2023) | - | - | 90.1 | - | 78.6 | - |\n| 2023 | LMPT (Shi et al., 2023) | 97.6 | - | 90.9 | - | 78.8 | - |\n| 2023 | STF-NF (Hirschorn \u0026amp; Avidan, 2023) | 93.1 | 91.2 | 60.1 | 63.5 | 85.9 | 87.8 |\n| 2024 | SD-MAE (Ristea et al., 2024) | 95.4 | 98.4 | 91.3 | 90.9 | 79.1 | 84.7 |\n| 2024 | MS-VAD (Zhang et al., 2024) | - | - | 92.4 | 92.9 | 85.1 | 89.8 |\n| 2024 | Ours | 99.1 | 99.9 | 93.7 | 96.3 | 85.9 | 89.6 |\n\n
To summarize, our method achieves the highest performance on the three most popular public benchmarks. It consists of just three simple representations and does not require training.\n4.5 Analysis # Table 3: Ablation study. Results are in frame-level AUROC (%). The best and second-best results are in bold and underline, respectively.\n\n| Pose Features | Deep Features | Velocity Features | Ped2 Micro | Ped2 Macro | Avenue Micro | Avenue Macro | ShanghaiTech Micro | ShanghaiTech Macro |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| ✓ | | | - | - | 73.8 | 76.2 | 74.5 | 81.0 |\n| | ✓ | | 96.4 | 95.3 | 85.4 | 87.7 | 72.5 | 82.5 |\n| | | ✓ | 98.8 | 99.6 | 86.0 | 89.6 | 84.4 | 84.8 |\n| ✓ | ✓ | | - | - | 89.3 | 88.8 | 76.7 | 84.9 |\n| | ✓ | ✓ | 99.1 | 99.9 | 93.0 | 95.5 | 84.5 | 88.7 |\n| ✓ | | ✓ | - | - | 86.8 | 93.0 | 85.9 | 88.8 |\n| ✓ | ✓ | ✓ | - | - | 93.7 | 96.3 | 85.1 | 89.6 |\n\nTable 4: Comparison of different numbers of velocity features bins (B). Frame-level AUROC (%) results. Best in bold.\n\n| Bins (B) | Avenue Micro | Avenue Macro | ShanghaiTech Micro | ShanghaiTech Macro |\n| --- | --- | --- | --- | --- |\n| B = 1 | 83.5 | 83.5 | 81.2 | 80.9 |\n| B = 2 | 84.1 | 83.8 | 82.1 | 82.7 |\n| B = 4 | 85.5 | 89.2 | 84.0 | 84.6 |\n| B = 8 | 86.0 | 89.6 | 84.4 | 84.8 |\n| B = 16 | 84.1 | 88.4 | 83.1 | 84.2 |\n\nTable 5: Our final results when kNN is replaced by k-means. Frame-level AUROC (%). Time is expressed in average ms per frame. Best in bold.\n\n| k = | Avenue Mic. | Avenue Mac. | Avenue Time | ShanghaiTech Mic. | ShanghaiTech Mac. | ShanghaiTech Time |\n| --- | --- | --- | --- | --- | --- | --- |\n| 1 | 91.8 | 94.0 | 0.51 | 84.2 | 87.2 | 0.45 |\n| 5 | 92.0 | 94.2 | 0.52 | 84.3 | 88.1 | 0.45 |\n| 10 | 92.1 | 94.5 | 0.52 | 84.6 | 88.1 | 0.45 |\n| 100 | 92.9 | 95.2 | 0.53 | 84.8 | 88.6 | 0.46 |\n| All | 93.7 | 96.3 | 4.93 | 85.1 | 89.6 | 36.0 |\n\n
Ablation study. We report in Tab. 3 the anomaly detection performance on the Ped2, Avenue and ShanghaiTech datasets of all attribute combinations. Our findings reveal that the velocity features provide the highest frame-level AUROC on Ped2, Avenue and ShanghaiTech, with 98.8%, 86.0% and 84.4% micro-averaged AUROC, respectively. In ShanghaiTech, our velocity features on their own are already state-of-the-art compared with all previous VAD methods. We expect this to be due to the large number of anomalies associated with speed and motion, such as running people and fast-moving objects, e.g. cars and bikes. Adding either pose or CLIP improved performance, mostly macro-AUROC, presumably as it provided information about human activity which accounts for some of the anomalies in this dataset. Velocity features were still the most performant on Ped2 and Avenue. However, combining them with deep features improved performance significantly. Overall, we observe that using all three features performed the best on Avenue. 
Due to the extremely low resolution of the Ped2 dataset, pose feature extraction is not feasible, so we rely solely on velocity and deep features for this dataset.\nNumber of velocity bins. We ablated the impact of different numbers of bins (B) in our velocity features in Tab. 4. We compared AUROC scores on the Avenue and ShanghaiTech datasets. The results indicate that the choice of B influences detection accuracy. Specifically, we observed that increasing the number of bins from B = 1 to B = 8 led to consistent improvements in both micro and macro AUROC scores on both datasets. This suggests that a finer quantization of velocity orientations represents motion better and improves anomaly detection. Beyond B = 8 the gains do not continue: B = 16 scores lower than B = 8 on both datasets.\nk-Means as a faster alternative. Computing kNN has linear complexity in the number of objects in the datasets, which may be slow for large datasets. We can speed it up by reducing the number of samples via k-means. In Tab. 5, we compare the performance of our method with kNN and k-means. Note that k-means still uses kNN to calculate anomaly scores as the sum of distances to nearest neighbor means. This is much faster than the original kNN as there are fewer means than the number of objects in the training set. We observe that it improves inference time with a small accuracy loss: on ShanghaiTech, k = 100 takes 0.46 ms per frame instead of 36.0 ms, at a cost of 0.3 points of micro-averaged AUROC.\nPose features for non-human objects. We extract pose representations exclusively for human objects and not for non-human objects. We calculate the pose anomaly score for each frame by taking the score of the object with the most anomalous pose. Non-human objects are given a pose anomaly score of −∞ and therefore do not contribute to the frame-wise pose anomaly score. While we acknowledge that non-human objects can also exhibit anomalies, our method leverages velocity and deep representations to capture these types of events.\nTable 6: Comparison of FlowNet2 vs. RAFT for flow map extraction. 
Frame-level AUROC (%) based on velocity features.\n| Backbone | Avenue Micro | Avenue Macro | ShanghaiTech Micro | ShanghaiTech Macro |\n| --- | --- | --- | --- | --- |\n| RAFT | 85.7 | 89.7 | 84.3 | 84.2 |\n| FlowNet2 | 86.0 | 89.6 | 84.4 | 84.8 |\nTable 7: Comparison of Mask R-CNN vs. YOLO-v8 for object detection. Frame-level AUROC (%) based on velocity features.\n| Backbone | Avenue Micro | Avenue Macro | ShanghaiTech Micro | ShanghaiTech Macro |\n| --- | --- | --- | --- | --- |\n| YOLO-v8 | 84.8 | 87.4 | 83.1 | 82.7 |\n| Mask R-CNN | 86.0 | 89.6 | 84.4 | 84.8 |\nTable 8: Comparison of video encoders and CLIP. Frame-level AUROC (%) results. Best in bold.\n| Encoder | Level | Avenue Micro | Avenue Macro | ShanghaiTech Micro | ShanghaiTech Macro |\n| --- | --- | --- | --- | --- | --- |\n| TimeSformer (Bertasius et al., 2021) | Frame | 61.2 | 64.1 | 58.2 | 60.1 |\n| TimeSformer (Bertasius et al., 2021) | Object | 63.1 | 64.0 | 59.1 | 59.2 |\n| VideoMAE V2 (Wang et al., 2023) | Frame | 68.3 | 67.9 | 60.3 | 60.5 |\n| VideoMAE V2 (Wang et al., 2023) | Object | 67.0 | 68.1 | 60.0 | 59.9 |\n| DINO | Object | 77.6 | 81.2 | 71.2 | 80.3 |\n| CLIP (ours) | Object | 85.4 | 87.7 | 72.5 | 82.5 |\nBackbone analysis. We performed additional ablation studies to evaluate the impact of different backbone networks on the overall performance of our method. Specifically, we tested alternative backbones for optical flow (FlowNet2 vs. RAFT (Teed \u0026amp; Deng, 2020)), shown in Tab. 6, and for object detection (Mask R-CNN vs. YOLO-v8), shown in Tab. 7. The results indicate that the effectiveness of our approach is primarily driven by the feature design rather than any specific choice of backbone.\nWhy do we use an image encoder instead of a video encoder? Recent self-supervised learning methods such as TimeSformer (Bertasius et al., 2021), VideoMAE (Tong et al., 2022; Wang et al., 2023), X-CLIP (Ni et al., 2022), and CoCa (Yu et al., 2022) have significantly improved the performance of pretrained video encoders on downstream tasks like Kinetics-400 (Kay et al., 2017). It is therefore reasonable to expect that video encoders, which capture both temporal and spatial information, would outperform image encoders in video anomaly detection (VAD). 
However, in our experiments, features extracted by pretrained video encoders did not perform as well as those extracted by pretrained image encoders on VAD benchmark datasets. We hypothesize that this weaker performance is due to the video encoders\u0026rsquo; focus on capturing frame-level temporal dynamics, whereas our method is object-centric. Additionally, when we tested video encoders on 10-frame windows of fixed object bounding boxes (centered around time t), we observed no performance gain, likely due to resolution constraints and the need for high-quality contextual information. Tab. 8 summarizes our findings on the limited effectiveness of video encoders in this setting. We also evaluated DINO (Caron et al., 2021) as a comparison to CLIP and found that, while DINO performed worse than CLIP (e.g., 77.6% vs. 85.4% micro-AUROC on Avenue), it still outperformed the video encoders. This result suggests that our deep features are not dependent on a specific image encoder.\nRunning times. We carried out all our experiments on an NVIDIA RTX 2080 GPU. Our preprocessing stage, which includes object detection and optical flow extraction, takes approximately 80 milliseconds (ms) per frame. The velocity, pose, and deep feature extraction stages, together with anomaly scoring, take approximately 5 ms. Our method runs at 12 FPS with an average of 5 objects per frame. For comparison, we evaluated two other methods on the same hardware: BA-AED (Georgescu et al., 2021b) runs at 24 FPS, while HF² (Liu et al., 2021a) runs at 12 FPS. Our method\u0026rsquo;s running speed is thus comparable to that of HF² but about half that of BA-AED.\nFigure 4: Frame-level scores and anomaly localizations for Avenue\u0026rsquo;s test video 04. Best viewed in color.\n4.6 Qualitative Results # We visualize the anomaly detection process for Avenue and ShanghaiTech in Fig. 4 and Fig. 
5, where we plot the anomaly scores across all frames of a video. Our anomaly scores are highly correlated with anomalous events, demonstrating the effectiveness of our method.\n5 Discussion # Exploring other semantic attributes. There are other important attributes for VAD beyond velocity and pose. Identifying other relevant attributes that correlate with anomalous events can further improve anomaly detection systems. For example, attributes related to object interactions, spatial arrangements, or temporal patterns may be very discriminative for some types of anomalies. Finding ways to systematically discover such attributes may significantly speed up research.\nGuidance. Finding relevant attributes for anomaly detection may require user guidance. In real-world scenarios, operators have domain knowledge about factors that lead to anomalous behavior. Efficiently incorporating this guidance, such as selecting velocity or pose features in our work, is essential for leveraging this knowledge effectively.\nOther academic benchmarks. While our method, using simple attributes, was effective on the three most popular VAD datasets, extending it to more complex datasets may require more work. Publicly available datasets such as UCF-Crime (Sultani et al., 2018) and XD-Violence (Wu et al., 2020), which feature a wider variety of anomalies and larger scales, present additional challenges. These datasets are fundamentally different from the ones tested here, as they contain distinct scenes in training and testing data and include moving cameras, which also change the scene. So far, only weakly-supervised VAD has been successful on these datasets, as they provide labeled anomalous data for training. 
The field needs new, more complex datasets within the fixed camera setting to further stress-test one-class classification VAD methods such as ours.\n6 Ethical Considerations # While VAD offers significant potential for enhancing public safety and security, it is crucial to acknowledge and address the ethical implications of such technology. VAD systems, including our proposed method, can be used in surveillance applications, which raises important privacy concerns. The continuous monitoring of public spaces may lead to a sense of constant observation, potentially infringing on individuals\u0026rsquo; right to privacy and freedom of movement. Moreover, there is a risk that VAD systems could be misused for unauthorized tracking or profiling of individuals.\nTo mitigate these ethical risks, several strategies should be considered in VAD systems development and deployment. First, strict data protection protocols should be implemented to ensure that collected video data is securely stored, accessed only by authorized individuals, and deleted after a defined period. Second, VAD use should be transparent, with clear warnings informing individuals when they are entering areas under surveillance. Third, VAD systems should be designed with privacy-preserving techniques, such as immediate data anonymization or the use of low-resolution data that can detect anomalies without identifying individuals. By implementing these measures, we can work towards harnessing VAD technology benefits while respecting individual privacy and civil rights.\nFigure 5: Frame-level scores and anomaly localizations for ShanghaiTech\u0026rsquo;s test video 03_0059. Best viewed in color.\nIn addition to technical safeguards, it is also necessary to consider regulatory and oversight mechanisms to ensure responsible deployment. 
We recommend that VAD systems be subject to civilian oversight, where independent authorities evaluate their use, especially in sensitive contexts like law enforcement or public monitoring. Such oversight would help prevent potential misuse, ensuring that VAD systems are applied in ways that benefit society without compromising human rights. Furthermore, restrictions should be placed on VAD deployment for purposes other than public safety, with guidelines that limit its use to specific cases where the benefits clearly outweigh the risks. These guidelines could include requiring legal authorization for certain VAD uses, particularly in private spaces or in applications that extend beyond standard anomaly detection use-cases.\n7 Conclusion # We propose a simple yet highly effective attribute-based method for video anomaly detection (VAD). Our method represents each object in each frame using velocity and pose representations and uses density estimation to compute anomaly scores. These simple representations are sufficient to achieve state-of-the-art performance on the ShanghaiTech and Ped2 datasets. By combining attribute-based representations with implicit deep representations, we achieve top VAD performance with AUROC scores of 99.1%, 93.7%, and 85.9% on Ped2, Avenue, and ShanghaiTech, respectively. Our extensive ablation study highlights the relative merits of the three representations. Overall, our method is both accurate and easy to implement.\nAcknowledgment # This research was partially supported by funding from the Israeli Science Foundation and the KLA Corporation. Tal Reiss is supported by the Google Fellowship and the Israeli Council for Higher Education.\nReferences # Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, volume 2, pp. 4, 2021.\nRuichu Cai, Hao Zhang, Wen Liu, Shenghua Gao, and Zhifeng Hao. Appearance-motion memory consistency network for video anomaly detection. 
In AAAI, 2021.\nCongqi Cao, Yue Lu, Peng Wang, and Yanning Zhang. A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20392–20401, 2023.\nMathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660, 2021.\nYunpeng Chang, Zhigang Tu, Wei Xie, and Junsong Yuan. Clustering driven deep autoencoder for video anomaly detection. In European Conference on Computer Vision, pp. 329–345. Springer, 2020.\nRizwan Chaudhry, Avinash Ravichandran, Gregory Hager, and René Vidal. Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1932–1939. IEEE, 2009.\nDongyue Chen, Pengtao Wang, Lingyi Yue, Yuxin Zhang, and Tong Jia. Anomaly detection in surveillance video based on bidirectional prediction. Image and Vision Computing, 98:103915, 2020.\nRensso Victor Hugo Mora Colque, Carlos Caetano, Matheus Toledo Lustosa de Andrade, and William Robson Schwartz. Histograms of optical flow orientation and magnitude and entropy to detect anomalous events in videos. IEEE Transactions on Circuits and Systems for Video Technology, 27(3):673–682, 2016.\nCarl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422–1430, 2015.\nAlexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929, 2020.\nEleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Sal Stolfo. A geometric framework for unsupervised anomaly detection. In Applications of data mining in computer security, pp. 77–101. Springer, 2002.\nHao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.\nXinyang Feng, Dongjin Song, Yuncong Chen, Zhengzhang Chen, Jingchao Ni, and Haifeng Chen. Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection. Proceedings of the 29th ACM International Conference on Multimedia, 2021a.\nXinyang Feng, Dongjin Song, Yuncong Chen, Zhengzhang Chen, Jingchao Ni, and Haifeng Chen. Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 5546–5554, 2021b.\nMariana-Iuliana Georgescu, Antonio Bărbălău, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. Anomaly detection in video via self-supervised and multi-task learning. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12737–12747, 2021a.\nMariana Iuliana Georgescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. A background-agnostic framework with adversarial training for abnormal event detection in video. IEEE transactions on pattern analysis and machine intelligence, 44(9):4505–4523, 2021b.\nSpyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.\nMichael Glodek, Martin Schels, and Friedhelm Schwenker. Ensemble gaussian mixture models for probability density estimation. Computational Statistics, 28(1):127–138, 2013.\nDong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. 
Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1705–1714, 2019.\nMahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733–742, 2016a.\nMahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733–742, 2016b.\nKaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969, 2017.\nOr Hirschorn and Shai Avidan. Normalizing flows for human pose anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13545–13554, October 2023.\nEddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2462–2470, 2017.\nRadu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. Object-centric autoencoders and dummy anomalies for abnormal event detection in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7842–7851, 2019.\nIan Jolliffe. Principal component analysis. Springer, 2011.\nWill Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.\nGustav Larsson, Michael Maire, and Gregory Shakhnarovich. 
Learning representations for automatic colorization. In ECCV, 2016.\nLongin Jan Latecki, Aleksandar Lazarevic, and Dragoljub Pokrajac. Outlier detection with kernel density functions. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 61–75. Springer, 2007.\nSangmin Lee, Hak Gu Kim, and Yong Man Ro. Stan: Spatio-temporal adversarial networks for abnormal event detection. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 1323–1327. IEEE, 2018.\nSangmin Lee, Hak Gu Kim, and Yong Man Ro. Bman: bidirectional multi-scale aggregation networks for abnormal event detection. IEEE Transactions on Image Processing, 29:2395–2408, 2019.\nTsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Springer, 2014.\nWen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6536–6545, 2018a.\nYusha Liu, Chun-Liang Li, and Barnabás Póczos. Classifier two-sample test for video anomaly detections. In BMVC, 2018b.\nZhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13588–13597, October 2021a.\nZhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13588–13597, 2021b.\nDavid G Lowe. Distinctive image features from scale-invariant keypoints. 
International journal of computer vision, 60(2):91–110, 2004.\nCewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pp. 2720–2727, 2013.\nYiwei Lu, K Mahesh Kumar, Seyed shahabeddin Nabavi, and Yang Wang. Future frame prediction using convolutional vrnn for anomaly detection. In 2019 16Th IEEE international conference on advanced video and signal based surveillance (AVSS), pp. 1–8. IEEE, 2019.\nWeixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE international conference on computer vision, pp. 341–349, 2017a.\nWeixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE international conference on computer vision, pp. 341–349, 2017b.\nHui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Learning normal dynamics in videos with meta prototype network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15425–15434, June 2021.\nVijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. In 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 1975–1981. IEEE, 2010.\nAmir Markovitz, Gilad Sharir, Itamar Friedman, Lihi Zelnik-Manor, and Shai Avidan. Graph embedded pose clustering for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10539–10547, 2020.\nMichael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. ICLR, 2016.\nIshan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European conference on computer vision, pp. 527–544. 
Springer, 2016.\nRomero Morais, Vuong Le, Truyen Tran, Budhaditya Saha, Moussa Mansour, and Svetha Venkatesh. Learning regularity in skeleton trajectories for anomaly detection in videos. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11996–12004, 2019.\nTrong-Nguyen Nguyen and Jean Meunier. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1273– 1283, 2019.\nBolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. arXiv preprint, 2022.\nMehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.\nHyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14360–14369, 2020.\nJanez Perš, Vildana Sulić, Matej Kristan, Matej Perše, Klemen Polanec, and Stanislav Kovačič. Histograms of optical flow for efficient representation of body motion. Pattern Recognition Letters, 31(11):1369–1376, 2010.\nAlec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.\nTal Reiss and Yedid Hoshen. Mean-shifted contrastive loss for anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 2155–2162, 2023.\nTal Reiss, Niv Cohen, Liron Bergman, and Yedid Hoshen. Panda: Adapting pretrained features for anomaly detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
2806–2814, 2021.\nTal Reiss, Niv Cohen, Eliahu Horwitz, Ron Abutbul, and Yedid Hoshen. Anomaly detection requires better representations. In European Conference on Computer Vision, pp. 56–68. Springer, 2022.\nTal Reiss, George Kour, Naama Zwerdling, Ateret Anaby-Tavor, and Yedid Hoshen. From zero to hero: Cold-start anomaly detection. arXiv preprint arXiv:2405.20341, 2024.\nNicolae-C Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, Mubarak Shah, et al. Self-distilled masked auto-encoders are efficient video anomaly detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15984– 15995, 2024.\nNicolae-Cătălin Ristea, Neelu Madan, Radu Tudor Ionescu, Kamal Nasrollahi, Fahad Shahbaz Khan, Thomas B Moeslund, and Mubarak Shah. Self-supervised predictive convolutional attentive block for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13576–13586, 2022.\nRoyston Rodrigues, Neha Bhargava, Rajbabu Velmurugan, and Subhasis Chaudhuri. Multi-timescale trajectory prediction for abnormal human activity detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2626–2634, 2020.\nBernhard Scholkopf, Robert C Williamson, Alex J Smola, John Shawe-Taylor, and John C Platt. Support vector method for novelty detection. In NIPS, 2000.\nChenrui Shi, Che Sun, Yuwei Wu, and Yunde Jia. Video anomaly detection via sequentially learning multiple pretext tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10330–10340, October 2023.\nAshish Singh, Michael J Jones, and Erik G Learned-Miller. Eval: Explainable video anomaly localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18717–18726, 2023.\nWaqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6479–6488, 2018.\nChe Sun, Y. Jia, Yao Hu, and Y. Wu. Scene-aware context reasoning for unsupervised abnormal event detection in videos. Proceedings of the 28th ACM International Conference on Multimedia, 2020.\nShengyang Sun and Xiaojin Gong. Hierarchical semantic contrast for scene-aware video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22846– 22856, 2023.\nYao Tang, Lin Zhao, Shanshan Zhang, Chen Gong, Guangyu Li, and Jian Yang. Integrating prediction and reconstruction for anomaly detection. Pattern Recognition Letters, 129:123–130, 2020.\nZachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 402–419. Springer, 2020.\nZhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602, 2022.\nGuodong Wang, Yunhong Wang, Jie Qin, Dongming Zhang, Xiuguo Bao, and Di Huang. Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles. In European Conference on Computer Vision (ECCV), 2022.\nLimin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14549–14560, 2023.\nXuanzhao Wang, Zhengping Che, Bo Jiang, Ning Xiao, Ke Yang, Jian Tang, Jieping Ye, Jingyu Wang, and Qi Qi. Robust unsupervised video anomaly detection by multipath frame prediction. IEEE transactions on neural networks and learning systems, 2021.\nZiming Wang, Yuexian Zou, and Zeming Zhang. Cluster attention contrast for video anomaly detection. 
Proceedings of the 28th ACM International Conference on Multimedia, 2020.\nDonglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8060, 2018.\nPeng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In European Conference on Computer Vision (ECCV), 2020.\nCheng Yan, Shiyu Zhang, Yang Liu, Guansong Pang, and Wenjun Wang. Feature prediction diffusion model for video anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5527–5537, October 2023.\nZhiwei Yang, Peng Wu, Jing Liu, and Xiaotao Liu. Dynamic local aggregation network with adaptive clusterer for anomaly detection. In European Conference on Computer Vision, pp. 404–421. Springer, 2022.\nZhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. Video event restoration based on keyframes for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14592–14601, 2023.\nMuchao Ye, Xiaojiang Peng, Weihao Gan, Wei Wu, and Yu Qiao. Anopcn: Video anomaly detection via deep predictive coding network. In ACM MM, 2019.\nGuang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin, and Marius Kloft. Cloze test helps: Effective video anomaly detection via learning to complete video events. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 583–591, 2020.\nJiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.\nShoubin Yu, Zhongyin Zhao, Haoshu Fang, Andong Deng, Haisheng Su, Dongliang Wang, Weihao Gan, Cewu Lu, and Wei Wu. 
Regularity learning via explicit distribution modeling for skeletal video anomaly detection. arXiv preprint arXiv:2112.03649, 2021.\nMenghao Zhang, Jingyu Wang, Qi Qi, Haifeng Sun, Zirui Zhuang, Pengfei Ren, Ruilong Ma, and Jianxin Liao. Multi-scale video anomaly detection by multi-grained spatio-temporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17385–17394, 2024.\nRichard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.\nYiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. Spatio-temporal autoencoder for video anomaly detection. In ACM MM, 2017.\nA More Qualitative Results # We provide more qualitative results of our methods over the evaluation datasets.\nIn Ped2, Fig. 6 and Fig. 7 demonstrate the effectiveness of our method, which can easily detect fast-moving objects such as trucks and bicycles. Accordingly, we can conclude that Ped2 has been practically solved based on the near-perfect results obtained by our method (as well as many others). Fig. 8 shows that our method is capable of detecting anomalies within a short timeframe. Fig. 9 and Fig. 10 provide more qualitative information regarding our method\u0026rsquo;s ability to detect anomalies of various types. 
In this way, our method achieves a new state-of-the-art on Avenue and ShanghaiTech, surpassing other approaches by a wide margin.\nFigure 6: Frame-level scores and anomaly localization examples for test video 04 from Ped2.\nFigure 7: Frame-level scores and anomaly localization examples for test video 05 from Ped2.\nFigure 8: Frame-level scores and anomaly localization examples for test video 03 from Avenue.\nFigure 9: Frame-level scores and anomaly localization examples for test video 01_0025 from ShanghaiTech.\nFigure 10: Frame-level scores and anomaly localization examples for test video 07_0048 from ShanghaiTech.\nB More Analysis \u0026amp; Discussion # Per-scene breakdown. In Tab. 9, we present the per-scene performance of our method on the ShanghaiTech dataset, which is the only dataset among the three benchmarks that includes multiple scenes. The results demonstrate that our method performs consistently well across most scenes, with a few exceptions. Specifically, scenes 8, 9, and 10 show lower performance than the others. These scenes (i.e., videos with the prefixes 08_, 09_, and 10_) feature anomalies involving complex activities such as erratic jumping, throwing objects, and pushing people, as well as frequent occlusions. Such activities involve high levels of motion and interaction between multiple subjects, which likely challenge the velocity-based feature representations, leading to reduced performance.\nTable 9: ShanghaiTech per-scene frame-level micro AUROC (%) results.\n| Scene Number | Total Test Frames | Total Test Anomalies | Anomaly Ratio | AUROC |\n| --- | --- | --- | --- | --- |\n| 01 | 11,894 | 4,884 | 0.41 | 88.2 |\n| 02 | 1,155 | 662 | 0.57 | 87.8 |\n| 03 | 4,090 | 1,212 | 0.29 | 90.8 |\n| 04 | 4,761 | 1,874 | 0.39 | 87 |\n| 05 | 4,160 | 1,016 | 0.24 | 94.6 |\n| 06 | 1,470 | 702 | 0.47 | 94.5 |\n| 07 | 3,368 | 886 | 0.26 | 92.7 |\n| 08 | 3,708 | 1,992 | 0.53 | 67.8 |\n| 09 | 361 | 84 | 0.23 | 72.6 |\n| 10 | 2,213 | 1,539 | 0.69 | 64.1 |\n| 11 | 337 | 141 | 0.41 | 99.3 |\n| 12 | 3, 74 | 2,334 | 0.71 | 81.1 |\nWhat are the benefits of pretrained features? 
Previous anomaly detection works (Reiss et al., 2021; Reiss \u0026amp; Hoshen, 2023; Reiss et al., 2022; 2024) demonstrated that using feature extractors pretrained on external, generic datasets (e.g. ResNet on ImageNet classification) achieves high anomaly detection performance. This was demonstrated on a large variety of datasets across sizes, domains, resolutions, and symmetries. These representations achieved state-of-the-art performance on distant domains, such as aerial, microscopy, and industrial images. As the anomalies in these datasets typically had nothing to do with velocity or human pose, it is clear the pretrained features model many attributes beyond velocity and pose. Consequently, by combining our attribute-based representations with CLIP\u0026rsquo;s image encoder, we are able to emphasize both explicit attributes (velocity and pose) derived from real-world priors and attributes that cannot be described by them, allowing us to achieve the best of both worlds.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/an-attribute-based-method-for-video-anomaly-detection/","section":"Papers","summary":"A simple attribute-based approach that represents each object by velocity and pose attributes, combining these with deep representations, and uses density estimation for anomaly scoring, achieving state-of-the-art performance.","title":"An Attribute-based Method for Video Anomaly Detection","type":"method"},{"content":" Anomaly-Led Prompting Learning Caption Generating Model and Benchmark # Qianyue Bao , Student Member, IEEE, Fang Liu , Senior Member, IEEE, Licheng Jiao , Fellow, IEEE, Yang Liu , Student Member, IEEE, Shuo Li , Member, IEEE, Lingling Li , Senior Member, IEEE, Xu Liu , Senior Member, IEEE, Xinyi Wang, Baoliang Chen\nAbstract—Video anomaly detection (VAD) is an important intelligent system application, but most current research views it as a coarse binary classification task that lacks a fine-grained 
understanding of abnormal video sequences. We explore a new task for video anomaly analysis called Comprehensive Video Anomaly Caption (CVAC), which aims to generate comprehensive textual captions (containing scene information such as time, location, anomalous subject, anomalous behavior, etc.) for surveillance videos. CVAC is more consistent with human understanding than VAD, but it has not been well explored. We constructed a large-scale benchmark CVACBench to lead this research. For each video clip, we provide 6 fine-grained annotations, including scene information and abnormal keywords. A new evaluation metric Abnormal-F1 (A-F1) is also proposed to more accurately evaluate the caption generation performance of the model. We also designed a method called Anomaly-Led Generating Prompting Transformer (AGPFormer) as a baseline. In AGPFormer, we introduce an anomaly-led language modeling mechanism (Anomaly-Led MLM, AMLM) to focus on anomalous events in videos. To achieve more efficient cross-modal semantic understanding, we design the Interactive Generating Prompting (IGP) module and Scene Alignment Prompting (SAP) module to explore the divide between video and text modalities from multiple perspectives, and to improve the model\u0026rsquo;s performance in understanding and reasoning about the complex semantics of videos. We conducted experiments on CVACBench by using traditional caption metrics and the proposed metrics, and the experimental results demonstrate the effectiveness of AGPFormer in the field of anomaly caption.\nIndex Terms—Video Anomaly Detection, Prompting Learning, Video Caption\nI. 
INTRODUCTION # This work was supported in part by the National Natural Science Foundation of China (No.62076192), the Joint Fund Project of National Natural Science Foundation of China (No.U22B2054), the Postdoctoral Fellowship Program of China Postdoctoral Science Foundation (CPSF) (No.GZC20232033), the China Postdoctoral Science Foundation (No. 2023M742738), the National Natural Science Foundation of China (No.62406231), the Program for Cheung Kong Scholars and Innovative Research Team in University(No. IRT 15R53), the Fund for Foreign Scholars in University Research and Teaching Programs(the 111 Project)(No. B07048), the Key Scientific Technological Innovation Research Project by Ministry of Education, the Fundamental Research Funds for the Central Universities and the Innovation Fund of Xidian University.\nThe authors are affiliated with the School of Artificial Intelligence, Xidian University, Xi\u0026rsquo;an 710071, P.R. China, with the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education, with the International Research Center for Intelligent Perception and Computation, with the Joint International Research Laboratory of Intelligent Perception and Computation. Fang Liu is the corresponding author. (e-mail: f63liu@163.com)\nFig. 1. VAD vs. CVAC; panel (a): Video Anomaly Detection. VAD only classifies videos with rough abnormal labels, while CVAC requires a comprehensive description of video scenes and abnormal events.\nVideo anomaly detection (VAD) has a very wide range of applications in the field of intelligent security [1], [2], and is a hot research topic in both computer vision and multimedia analysis. VAD tasks can be categorized into two types depending on the supervision signal: semi-supervised VAD [3]–[8] and weakly supervised VAD [9]–[13]. 
The former determines anomalies by learning the normality patterns in normal videos, while the latter uses simple video-level labels to learn the differences between normal and abnormal videos. VAD is a high-semantic, context-dependent [14]–[16] video understanding task; e.g., a fire appearing on a bus is abnormal, while people warming themselves around a campfire is a normal event. Individual labels in VAD alone are insufficient to describe consecutive anomalous events occurring in a video and are not interpretable, so the VAD community urgently needs to introduce additional scene knowledge and rules to improve the overall understanding. Since anomalous videos usually involve the interaction of multiple scenes, people, and objects, processing them with Video Caption (VC) [17]–[23] can cope with this complexity and is more consistent with human understanding. If a road accident occurs in a video, the model should describe it as: \u0026ldquo;At night, at a traffic intersection, a red car speeding and collided with an opposite white car\u0026rdquo; (which also contains comprehensive scene information such as time, location, abnormal subjects, abnormal behaviors, etc.), while ordinary VAD models can only express it with a rough label, e.g., \u0026ldquo;road accident\u0026rdquo;. Based on this motivation, in this paper, we propose a new task called Comprehensive Video Anomaly Caption (CVAC) and construct a large-scale benchmark dataset, CVACBench, to further promote research in video anomaly analysis. Fig. 1 shows the difference between the proposed CVAC task and the ordinary VAD task. Manually collecting large-scale abnormal video benchmarks from scratch consumes a lot of manpower and time. 
Therefore, CVACBench is mainly selected from the UCF-Crime dataset [12], a well-recognized benchmark in the VAD community, after filtering out noisy and extremely long videos. CVACBench consists of 863 normal videos and 925 abnormal videos. We design a new evaluation metric for CVACBench: Abnormal-F1 (A-F1), which aims to evaluate whether the model can truly understand the goals and causes of the abnormal occurrence.\nFig. 2. Comparison of CVAC with other video captioning tasks. CVAC covers information such as time, location, abnormal subjects, and abnormal behaviors.\n(a) Video Captioning (VC): [caption]: Several people shopping in the store.\n(b) Dense Video Captioning (DVC): [caption1]: Two people walked into the store. [caption2]: The cashier got up and left. [caption3]: A man takes something from the counter.\n(c) Comprehensive Video Anomaly Caption (CVAC): [scene]: In a supermarket, there is a clerk sitting at the counter, and there are several customers in front of the shelf. It is daytime. [caption]: The man in a brown suit stole an item from the counter when the cashier wasn\u0026rsquo;t looking. [abnormal key]: stole, counter.\nOur CVAC task is significantly different from traditional video description tasks. In the video description community, common tasks are Video Captioning (VC) [24]–[27], which aims to generate a brief overview of a given video, and Dense Video Captioning (DVC) [19], [20], [28], [29], which requires first detecting and slicing a period of time to obtain multiple events and then generating a description for each event. Both tasks are geared towards regular videos containing events that are common in daily life, and although they are able to describe the subjects and behaviors in the video, these methods lose critical information when generating the description text due to the lack of relevance to the specific task requirements. Our CVAC starts from capturing abnormal events. 
It first summarizes the scene information such as time and location, and then comprehensively describes the appearance and attributes of the main subjects in the video, as well as their interactions with other objects. In addition, CVAC requires strong temporal reasoning capabilities. Consider the video sequence in Fig. 2, where a shoplifting incident is captured: a man in a brown suit steals something from the counter while the cashier is away. CVAC first needs to clarify the location and time of the current video (some abnormal events often occur at night) and other scene information. The model is then required to understand the scene context based on the temporal evolution of the video and reason about the anomaly. For example, when the cashier leaves the counter, the customer\u0026rsquo;s behavior of reaching into the counter is judged as a theft incident. Previous VC and DVC methods clearly do not satisfy the reasoning requirements of such complex semantics, so it is inappropriate to directly transfer traditional video description methods to the CVAC task.\nWe design a baseline model for this new task, called Anomaly-Led Generating Prompting Transformer (AGPFormer), which jointly models the features of video sequences and text and lets them interact. Existing video caption methods [25], [30] usually use masked language modeling (MLM) [31] to optimize multi-modal representation. We improve MLM for the CVAC task and propose an Anomaly-Led MLM framework (AMLM). AMLM injects the anomaly confidence of a video clip into the attention mask, making the model pay more attention to the core information in abnormal events. Furthermore, in the multi-modal interaction stage (often considered the key to determining video caption performance), we introduce an Interactive Generating Prompting module (IGP). 
IGP predicts several learnable prompts containing spatiotemporal information to dynamically update the multi-modal interaction features of video and text. Since CVAC requires the model to have scene understanding capabilities, we design a Scene Alignment Prompting module (SAP) to achieve cross-modal alignment of the model from the scene perspective. In the experimental part, we extensively evaluate AGPFormer using conventional caption evaluation metrics and the proposed A-F1. The experimental results show that our method has significant advantages in cross-modal modeling and description of abnormal events.\nTABLE I. Comparison of CVACBench with other popular video caption datasets. Clip. is the number of video clips, Dur. is the total duration (h) of the video, Avg. Clip. is the average length (s) of the video clips, Avg. Sen. is the average number of words in the caption, and Domain is the source of video collection.\nDataset | Clip. | Dur.(h) | Avg. Clip.(s) | Avg. Sen. | Domain\nMSVD [32] | 1,970 | 5.3 | 9.7 | 8.67 | daily life\nMSR-VTT [26] | 10,000 | 41.2 | 20 | 9.28 | daily life\nVATEX [27] | 41,250 | 114.6 | 10 | 15.23 | daily life\nYouCookII [33] | 15,433 | 176 | 316 | - | cooking\nUCFC-VD [34] | 950 | 5.6 | 21.3 | 9.27 | surveillance\nCVACBench | 1,880 | 53 | 101.5 | 20.15 | surveillance\nII. RELATED WORK # A. Video Anomaly Detection # With the development of deep learning models [35]–[38], various technical routes have emerged in the VAD community [6], [7], [9], [15]. Previous video anomaly detection methods have used various forms of video labels (frame-level labels and video-level weak labels, etc.) to analyze abnormal events in a data-driven manner. 
We review recent progress and broadly classify existing methods into two categories: (1) Unsupervised/semi-supervised methods: these methods [6], [8], [39] usually use only normal videos in the training phase, and use autoencoders to learn fixed patterns in normal videos through self-supervised tasks such as frame reconstruction [4]–[7], [15], frame prediction [3], [8], [40]–[42], or frame sequence discrimination [39], [43]. In the test phase, video frames that deviate significantly from the learned patterns are judged as abnormal events, usually showing large frame reconstruction or prediction errors. These methods use the appearance and motion trends in video sequences as important cues for anomaly judgments, achieve high detection accuracy, and provide a solid foundation for the use of VAD systems in real-world scenarios. (2) Weakly supervised methods: these methods [9], [44], [45] use both normal and abnormal videos in the training phase. However, they rely on rough video-level anomaly labels to determine the anomaly type of a video frame. Such a label can only tell the network whether there is an abnormal event in the current video, but cannot accurately determine the specific location where the anomaly occurs. Existing methods have explored the performance of various advanced pre-trained video feature backbones in capturing abnormal semantics, such as C3D [12], [13], [44], [46]–[48], ViT [9], [49], [50], CLIP [45], [51], etc. In addition, various learning mechanisms such as multi-instance learning [12], [52], multi-sequence learning [9], feature magnitude learning [44], and glance-focus learning [53] have been introduced to improve performance. Although such methods can roughly identify the type of abnormal events, they are far from sufficient to fully express the abnormal semantics. Our proposed CVAC differs in that it analyzes the continuous abnormal events occurring in the video by generating text captions.\nB. 
Video Caption # TABLE II. Comparison of CVACBench with popular video anomaly detection datasets. Classes is the number of abnormal types.\nDatasets | Clip. | Classes | Domain | Captions\nCUHK Avenue [54] | 37 | 5 | campus | ✗\nUCF-Crime [12] | 1,900 | 13 | surveillance | ✗\nShanghaiTech [55] | 437 | 11 | campus | ✗\nXD-Violence [56] | 4,754 | 6 | movies | ✗\nNWPU Campus [57] | 547 | 28 | campus | ✗\nUBnormal [58] | 543 | 22 | virtual | ✗\nDoTA [59] | 4,677 | 18 | driving | ✗\nCVACBench | 1,880 | 13 | surveillance | ✓\nCaption generation is an important topic in the field of multimodal analytics. Image captioning [60]–[62] aims to describe the content in a single image in a human-like manner. Video captioning [21], [24], [25], [30], [63], [64] needs to capture the spatio-temporal dynamics in video sequences and convert them into vivid text descriptions, which is of great significance for the intelligent analysis of multimedia data. We provide a brief overview of this field from three aspects: (1) Conventional video captioning: These methods [24], [25], [63], [65] capture the spatio-temporal features of the video from a global perspective and feed the features into the language decoder to generate text. The performance of such methods usually depends on the video feature backbone and language decoder model used. For example, the RecNet [63] and SGN [24] methods need to deploy a CNN-based video representation network to extract video appearance and motion features and generate captions in a two-stage manner. The subsequent Swinbert [25] and AVLFormer [30] directly trained the transformer model in an end-to-end manner and achieved better performance. In addition, CLIP4Clip [65] and CoCap [66] attempted to adapt pre-trained visual language models (such as CLIP [51]) to video description generation. 
(2) Dense video captioning: These methods [19], [20], [28], [29] achieve finer-grained video description than conventional caption generation by first detecting event windows in the video and then generating captions for these windows to refine the description details. PDVC [20] proposed an end-to-end parallel decoding framework to simplify this process. The Vid2Seq [67] architecture enhanced the language model by designing special time tokens, so that both event window prediction and text description are implemented by the language model. (3) Video caption datasets: Existing video caption datasets include MSVD [32], MSR-VTT [26], VATEX [27], and YouCookII [33], which only cover daily life videos. Goyal et al. [34] presented a dataset called UCFC-VD that is similar to our work, but it only provides short captions for abnormal videos in UCF-Crime [12]. We build CVACBench to include captions for both abnormal and normal videos, as it is equally important to accurately describe normal events in surveillance scenarios. In addition, we provide more comprehensive fine-grained event annotations.\nFig. 3. An annotation example in CVACBench. For each clip, we provide 1 scene annotation [S.], 6 event captions [C.] (using different words and sentence patterns) and 1 Abnormalkey [A.], where time, location, abnormal subject and abnormal behavior are highlighted in orange, green, blue and red.\nC. Video Cross-Modal Modeling # Efficient video cross-modal modeling is crucial to improve the performance of downstream video understanding tasks. Recently, researchers have introduced new paradigms such as information maximization [68] and prompt learning [69]–[72] in cross-modal representation and modality alignment. Hoang et al. [68] introduced Cross-Modal Info-Max Hashing (CMIMH) to learn binary representations preserving both intra-modal and inter-modal similarities. Zhou et al. 
[69] developed prompt learning [70], [71] techniques for adapting pre-trained vision-language models to downstream tasks, while Liu et al. [72] proposed Multi-modal Attribute Prompting (MAP) to improve fine-grained visual perception. We briefly review the applications of cross-modal modeling techniques in various complex video tasks [73]–[76], especially in the fields of video scene graph generation [77], video entailment reasoning [78], and video summarization [79]. Due to the complexity of video semantics, the knowledge it carries (such as evidence [73], relations [74], causality [75], and modal consensus [78]) should be emphasized when modeling. Among them, Wu et al. [77] introduced an unbiased cross-modal learning (UCML) framework to make progress on the weakly supervised video scene graph generation (WS-VidSGG) task. For unsupervised video summarization, Gao et al. [79] developed a relation-aware assignment learning method. Yao et al. [78] proposed a new concept of cross-modal consensus, which can be used to exploit the underlying consensus of video and linguistic modalities to accomplish video entailment inference. Building on the above research, CVAC focuses on exploring video cross-modal modeling methods guided by abnormal signals; through the prompt learning mechanism, the video temporal features are adapted to the downstream language decoder to generate accurate abnormal descriptions.\nFig. 4. Sentence length distribution histogram in CVACBench.\nIII. CVACBENCH # A. Dataset construction and statistics # Since surveillance anomalies are rare in real life, it is extremely challenging to collect a large-scale video benchmark from scratch. Therefore, we start from UCF-Crime [12], which is the most influential dataset in the VAD community and contains 1900 videos captured from real cameras, with a total video duration of 128h, of which 950 are normal videos and 950 are abnormal videos. 
The event types cover 13 categories such as assault, explosion, burglary, fighting, robbery, road accident, etc., and each category contains multiple scenes such as stations, traffic intersections, shopping malls, and residential areas. This provides a rich diversity for us to further annotate the video scene information and the corresponding abnormal events.\nWe put considerable effort into building CVACBench. We first performed an expert selection on UCF-Crime, because the original dataset contains many poor-quality videos, e.g., videos whose duration is prolonged by replaying the same clip over and over again, haphazard splicing of multiple videos, incorrect video sizes and resolutions, and so on. We selected 1788 high-quality videos from it, including 863 normal videos and 925 abnormal videos. Compared to UCFC-VD [34], we ensure that the ratio of normal and abnormal videos is close to 1:1, which is very important for practical surveillance applications, as using only abnormal videos for training will cause the model to generate pseudo-anomalous captions for normal videos. We invited 6 experienced researchers to serve as annotators. The annotators were required to select the time window where the abnormal event occurred from each video, and then crop the video from this time window and sample 32 frames as the video clip. We provide 6 event caption annotations for each video clip. Fig. 3 shows a \u0026ldquo;fight\u0026rdquo; event that occurred in a detention center, where [C.1], [C.2], \u0026hellip; [C.6] are different annotations of the event. These annotations come from 6 different annotators, who describe the same abnormal event from their own perspectives. In addition, we also provide a scene description for the current scene, as shown in [S.], which is intended to describe the background of the video and usually does not contain expressions of abnormality. Finally, we double-check all the labeled sentences to ensure the quality of the benchmark. 
In addition, we also summarize an abnormal keyword list (Abnormalkey) for each video clip. As shown in [A.] of Fig. 3, the key information of the current event is \u0026ldquo;fight\u0026rdquo; and \u0026ldquo;police\u0026rdquo;. This is crucial for the subsequent evaluation of the model\u0026rsquo;s ability to identify abnormalities.\nWe compared CVACBench with several existing popular video caption datasets, as shown in Tab. I. The abbreviation Clip. represents the number of video clips in the dataset. Dur. represents the total duration of the video. Avg.Clip. is the average length of the video clip. Avg.Sen. is the average number of words in the caption. Domain represents the source of video collection. It can be seen that the average sentence length of CVACBench is 20.15 words, which is significantly longer than in the other datasets (the average sentence lengths of MSVD [32], MSR-VTT [26], and UCFC-VD [34] are 8.67, 9.28, and 9.27, respectively). In addition, the average clip length of CVACBench is 101.5s, which is lower only than that of the YouCookII dataset. This setting is better suited for complex anomalous events in realistic scenarios, which require longer sentences to be described in full. We also show the distribution of sentence lengths in CVACBench in Fig. 4, with the largest proportion of sentences in the 15-19 word range. We follow the experimental setup specification in the video caption community [26], [32] and divide CVACBench into training, validation, and testing sets of 1244, 281, and 263 video clips, respectively.\nIn addition, we compared CVACBench with several widely used datasets in the VAD community, summarized in Tab. II. The VAD task focuses on identifying anomalies or unusual events that deviate from normal patterns in video sequences. Recent studies have introduced various methods [59], [80]–[86] to address the challenges brought by temporal and spatial complexity. 
Other works [85], [87]–[89] focus on improving detection accuracy through unsupervised and weakly supervised frameworks. These methods usually exploit domain-specific knowledge [81], [83] to better handle the large-scale variations and unpredictability of anomalies in video data. It is worth noting that most VAD datasets [12], [54] focus on detecting anomalies in video frames or clips without providing text annotations. CVACBench provides richer and more contextualized anomaly representations by incorporating detailed event descriptions. This feature enhances the interpretability of anomaly detection models, allowing us not only to detect what the anomalies are, but also to better understand why an event is considered anomalous. This comprehensive approach is consistent with recent research in VAD, such as [59], [81], [83], which focuses on improving anomaly localization and detection under diverse and challenging conditions.\nB. Evaluation metrics # Existing video caption metrics include BLEU [90], ROUGE [91], METEOR [92], and CIDEr [93], which measure the model\u0026rsquo;s generative performance by calculating the similarity between predicted text and ground-truth (GT) text on a word-by-word basis. Among them, BLEU mainly measures the n-gram (word sequence) matching between the predicted caption and the reference caption. ROUGE not only matches words, but also pays attention to the order in which words appear. METEOR considers synonyms and word form changes (such as stemming) when matching words, which can better reflect semantic similarity. CIDEr gives higher weights to distinctive words in the generated captions, which helps evaluate accuracy and richness. For CVAC, we want to explicitly explore whether the subjects, behaviors, and causes of abnormal events in the video are understood by the model, and these metrics do not meet our needs. In order to evaluate the effectiveness of generated descriptions more completely, we design a new metric based on Abnormalkey, called Abnormal-F1 (A-F1). 
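To make the idea of a keyword-based F1 score concrete, the following is a minimal Python sketch and not the A-F1 implementation of this work. It assumes exact matching on crudely stemmed words in place of NLTK stemming and embedding similarity, and the helper names `crude_stem` and `keyword_f1` are hypothetical. The keywords are those of the Fig. 3 example (fight, police).

```python
def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: lowercases the word and
    # strips a few common English suffixes.
    word = word.lower()
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def keyword_f1(predicted_words, abnormal_keys):
    # Assumption: exact matches between stemmed word sets replace the
    # embedding-based similarity used by a full implementation.
    pred = {crude_stem(w) for w in predicted_words}
    keys = {crude_stem(w) for w in abnormal_keys}
    hits = pred & keys
    if not hits:
        return 0.0
    precision = len(hits) / len(pred)  # share of predicted words that are keywords
    recall = len(hits) / len(keys)     # share of keywords found in the prediction
    return 2 * precision * recall / (precision + recall)


caption = 'two men are fighting while the police watch'.split()
print(round(keyword_f1(caption, ['fight', 'police']), 3))
```

In this toy form, precision is taken over all caption words, so long captions are penalized; a similarity-based precision would soften that effect.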
A-F1 measures complex anomaly semantics by calculating how well the predicted text matches Abnormalkey. Considering the different tenses and synonym relationships of words, we use the NLP toolkit NLTK [94] to extract the stems of the words that participate in the calculation. The formula of A-F1 is as follows:\nHere, A is the predicted word set, and B is the AbnormalKey word set. We follow [30] and send the words into the pre-trained T5 model δ(·) [95] to obtain representation vectors, whose cosine similarity is used to approximately calculate the precision P(A, B). Then the recall R(A, B) is calculated by counting the number of predicted words in A that hit a word in B. A-F1 emphasizes the model\u0026rsquo;s attention to the complex semantics of abnormal behavior, as it directly matches the predicted captions with the provided AbnormalKey. At the same time, with the powerful semantic understanding ability of the T5 model, A-F1 can handle the different word forms of abnormal objects, ensuring that the captions generated by the model are associated with the core features of abnormal events.\nIV. METHOD # In this section, we detail our AGPFormer. The overall structure of the framework is shown in Fig. 5. The input to AGPFormer contains video tokens F_v ∈ R^(N_v × D_v), caption tokens F_c ∈ R^(N_c × D_c), and scene tokens F_s ∈ R^(N_s × D_s), where pre-trained VidSwin [96] is used for the video encoder and pre-trained BERT [31] is used for the text embedding. Among them, F_v is sent to an anomaly binary classifier to calculate the anomaly confidence G_v, and then the caption c and the corresponding tokens F_c are sent to AMLM for processing. In Sec. IV-B, we describe the proposed IGP module, and the SAP module is detailed in Sec. IV-C.\nA. Anomaly-Led Mask Language Modeling # Since CVAC focuses more on capturing abnormal events and objects than traditional video caption, we improve MLM under the guidance of the AbnormalKey and the anomaly confidence G_v. 
\nFig. 5. An overview of AGPFormer implementation. It consists of key modules such as video encoder and word embedding, Anomaly-Led Mask Language Modeling (AMLM), Interactive Generating Prompting (IGP) and Scene Alignment Prompting (SAP).\nFig. 6. Detailed illustration of AMLM. AMLM selects words related to anomalies from the input caption and assigns a high priority mask.\nPioneering work [97] has shown that maintaining semantic coherence between words when masking can improve model performance. We first use the syntactic analysis tool provided by spaCy [98] to extract phrases from the input caption. As shown in Fig. 6, the two subjects involved in the car accident event, \u0026ldquo;a white car\u0026rdquo; and \u0026ldquo;black car\u0026rdquo;, are both obtained. In the original MLM, the model may mask only the color word \u0026ldquo;white\u0026rdquo;, which can interrupt the learning of object semantics. AMLM instead randomly selects elements from the phrase list and performs phrase-level mask operations. In addition, we also perform high-priority mask operations on words that fall in the AbnormalKey (e.g., \u0026ldquo;collided\u0026rdquo;). As shown in Fig. 5, the input sent to the BERT encoder is divided into three parts: N_v video tokens, N_c caption tokens and N_p learnable prompts (introduced in Sec. IV-B). We follow [25] and deploy the attention mask E_v (size N_v × N_v) of the video tokens as a sparse learnable matrix. In AMLM, we use the video-level anomaly confidence G_v as a salient signal and take its dot product with the randomly initialized E_v to obtain the anomaly-salient attention mask:\nE_v is subsequently augmented in residual form to obtain the anomaly-led attention mask Ê_v, where Norm is the normalization and φ(·) is the sigmoid activation function. During model training, we use the sparse loss L_SPARSE [25] to regularize Ê_v. 
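The phrase-level masking with AbnormalKey priority described earlier in this subsection can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the AMLM implementation: phrase spans are passed in by hand rather than produced by a parser, high priority is interpreted as always masking keyword tokens, and the names `anomaly_led_mask` and `phrase_prob` are hypothetical.

```python
import random

MASK = '[MASK]'


def anomaly_led_mask(tokens, phrase_spans, abnormal_keys, phrase_prob=0.3, seed=0):
    # tokens: list of words.
    # phrase_spans: (start, end) index pairs, end exclusive, assumed to come
    # from a syntactic parser.
    # abnormal_keys: set of lowercase keywords.
    rng = random.Random(seed)
    masked = set()
    # High-priority step (assumption): every keyword token is always masked.
    for i, tok in enumerate(tokens):
        if tok.lower() in abnormal_keys:
            masked.add(i)
    # Phrase-level step: a sampled phrase is masked as a whole unit, so a
    # modifier is never masked without the noun it belongs to.
    for start, end in phrase_spans:
        if rng.random() < phrase_prob:
            masked.update(range(start, end))
    return [MASK if i in masked else tok for i, tok in enumerate(tokens)]


tokens = 'a white car collided with a black car'.split()
spans = [(0, 3), (5, 8)]  # the phrases 'a white car' and 'a black car'
print(anomaly_led_mask(tokens, spans, {'collided'}))
```

Masking whole spans is what keeps the object semantics intact: either the full phrase is hidden or none of it is.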
Through anomaly-salient guidance of sparse attention, the model can reduce the redundancy of video tokens while paying more attention to tokens where abnormal events occur.\nOur AMLM still follows the autoregressive language generation pattern. As shown on the right side of Fig. 5, we have carefully set up the mask attention matrix for AMLM, where the masked positions are light blue with a value of 0. For caption tokens (attention is represented in green), we adopt a single-direction causal attention mask strategy to ensure that the token predicted by the model only relies on the tokens at previous positions. For the video tokens (the attention is indicated in blue), we directly place the Ê_v obtained in the previous step in the overall transformer attention mask (size (N_v + N_t + N_p) × (N_v + N_t + N_p)). After being anomaly-guided and sparsely constrained, the anomaly-led attention mask Ê_v can adaptively select the video tokens fed into language modeling. For the additional learnable prompts, we set full attention (the attention is indicated in orange) because there is no order restriction. At the same time, we make video tokens and learnable prompts visible to text tokens, and video tokens and learnable prompts visible to each other (attention is indicated in brown and yellow, respectively), which forces the model to seek the help of video information and modal interaction information to predict the masked text.\nFig. 7. Detailed illustration of IGP module. IGP introduces learnable prompts to retrieve interactive features between video tokens and caption tokens.\nB. Interactive Generating Prompting # Previous transformer-based video caption methods [25], [30], after obtaining the video tokens and caption tokens, only perform a simple concatenation operation on them and feed the result into the downstream language model for generation. 
Due to the natural gap between visual and textual modalities, it is challenging to efficiently utilize multimodal information. Numerous studies [78], [99]–[101] have shown that in multi-modal analysis tasks, feature interactions between modalities have a significant impact on model performance. ALBEF [99] improves the quality of the base representation by aligning image and textual representations using contrastive loss. ViLT [100] focuses on exploring the impact of the complexity of feature extraction on the final interactive performance. VLMo [101] uses the Mixture-of-Experts (MoE) technique in cross-modal interaction. SACCN [78] introduces a concept of cross-modal consensus, emphasizing the importance of mining the underlying consensus knowledge of modalities during the interaction process. Different from these methods, we seek a new way to perform prompt interaction in token space to capture the complex abnormal semantics contained in multimodal data.\nFig. 8. Illustration of SAP module. SAP performs cross-prompting on the input video tokens F_v and scene tokens F_s to achieve cross-modal alignment.\nThe design of the Interactive Generating Prompting module (IGP) is inspired by the recently emerging prompt learning paradigm [69]–[71]. We randomly initialize a series of query prompts Q ∈ R^(N_p × D_p) and stack some lightweight transformer computing units to complete the prompting process, as shown in Fig. 7. We first concatenate video tokens F_v and caption tokens F_c as key inputs, and then send them with the query prompts Q to the multi-head attention layer (MHA) [102] for dynamic interaction. The overall calculation process can be expressed as follows:\nHere, LN denotes the layer normalization operation. After performing the MHA interaction, we add the output embedding features back to the query prompts Q to perform dynamic updates. The result is subsequently normalized and fed into an MLP layer to compute the dynamic interaction prompts P. 
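The prompting step just described (attention from query prompts to the concatenated tokens, a residual update, normalization, then an MLP) can be illustrated with a small pure-Python sketch. It is a simplification under several assumptions and not the IGP implementation: a single attention head without learned projections stands in for MHA, all tokens share one feature dimension, and `toy_mlp` is a hypothetical placeholder for a learned MLP.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    # Single-head scaled dot-product attention without learned projections
    # (a simplification of the multi-head attention layer).
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)])
    return out


def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def igp_step(query_prompts, video_tokens, caption_tokens, mlp):
    # Keys and values are the concatenated video and caption tokens.
    context = video_tokens + caption_tokens
    attended = attend(query_prompts, context)
    # Residual update of the query prompts, then normalization and an MLP.
    updated = [[q + a for q, a in zip(qs, at)] for qs, at in zip(query_prompts, attended)]
    return [mlp(layer_norm(u)) for u in updated]


def toy_mlp(vec):
    # Hypothetical placeholder for a learned MLP: a plain ReLU.
    return [max(0.0, x) for x in vec]


Q = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]    # N_p = 2 query prompts
F_v = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # 2 video tokens
F_c = [[0.0, 0.0, 1.0, 0.0]]                        # 1 caption token
P = igp_step(Q, F_v, F_c, toy_mlp)
print(len(P), len(P[0]))
```

The output has one row per query prompt, independent of how many video and caption tokens are supplied, which is why a fixed number of prompts can summarize inputs of varying length.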
We regard the query prompts Q as a predefined interaction connector, which prompts and predicts masked information from the naively concatenated tokens. IGP can construct modal interaction tokens more flexibly than directly concatenating video and caption tokens. Given the dynamic interaction prompts P, we append them to the current input tokens and feed the result into the BERT encoder for generation.\nC. Scene Alignment Prompting # A deep understanding of scene context is crucial for video anomaly detection and analysis, and there is a considerable amount of work in the VAD community on modeling surveillance scene context [3], [14], [16], [103]. Inspired by these works, we introduce scene knowledge into AGPFormer to supplement and improve video description. It should be noted that the scene annotations provided by our CVACBench are anomaly-agnostic neutral descriptions. They only describe the attributes of the current scene and do not involve abnormal semantics, so they are more suitable for constructing sample pairs with video tokens for cross-modal alignment.\nAs shown in Fig. 8, we propose a lightweight Scene Alignment Prompting module (SAP) to close the embedding distance between scene tokens and video tokens. This allows the model to learn the comprehensive information (time, location, anomalous subjects, etc.) in the surveillance scene, which allows it to generate more descriptive and expressive event descriptions. We first extract embeddings for the scene information to get scene tokens, and feed them together with the corresponding video tokens into the two branches of SAP. Each of these branches is set up with two MLP layers, e.g., the video branch consists of MLP_v^1 and MLP_v^2, and the scene branch consists of MLP_s^1 and MLP_s^2. Unlike IGP, we do not use a transformer-based structure, but a more concise prompt operation. The green box in Fig. 8 represents the branch that processes the scene tokens, and the blue box represents the branch that processes the video tokens. 
We first use AvgPool to unify the dimensions of the two modal tokens F_v ∈ R^{N_v × D_v} and F_s ∈ R^{N_s × D_s}, and then send them to MLP_1^v and MLP_1^s respectively to obtain the intermediate state features H_v ∈ R^{N_v × D_p} and H_s ∈ R^{N_s × D_p}:\nThen we feed H_v and H_s into MLP_2^v and MLP_2^s to predict the corresponding prompting vectors F̂_v and F̂_s, whose feature dimensions are both N_p:\nBased on this, we define a scene alignment loss L_CA to optimize F̂_v and F̂_s; details are introduced in Sec. IV-D.\nD. Training Objectives # First, for the AMLM language modeling task in AGPFormer, we set an AMLM loss L_AMLM between the masked text GT y^t_mask and the model\u0026rsquo;s predicted probability at the current t-th position. This probability is calculated from the previous (t − 1) caption tokens F_c^{:t−1}, the current video tokens F_v^t and the prompts F_p^t, formally represented as follows:\nFor the constructed anomaly-led attention mask Ê_v, we first follow most works in the VAD community [9], [12], [45] and use a video-level binary cross-entropy loss L_BCE to optimize the anomaly confidence G_v. We then use a sparse regularization loss L_SPARSE to further optimize the anomaly-led attention mask Ê_v:\nFor the two prompting vectors F̂_v and F̂_s output by the SAP module, we define a cosine alignment loss L_CA for optimization. The loss function is expressed as follows:\nThe overall optimization objectives of the model can be integrated as follows:\nV. EXPERIMENTS # A. Implementation Details # We use the pretrained VidSwin-Tiny [96] and VidSwin-Small [96] as video encoders, and the text encoder uses pretrained bert-base-uncased [31]. All video clips are sampled at 32 frames and scaled to 224 × 224. The maximum mask probability of AMLM is set to 0.6, and the maximum generation sequence length is 35. For learnable prompts, the number is set to N_p = 10 and the embedding dimension is set to D_p = 768. 
During the training phase, we set the batch size to 8 and train the model on CVACBench for 100 epochs. The optimizer is AdamW with an initial learning rate of 3e-4 and a weight decay of 1e-5. For hyperparameters, we set λ_1, λ_2 and λ_3 to 0.0001, 0.1 and 0.001 respectively. In the model inference phase, we generate sentences in an autoregressive manner, each time using the text tokens generated in the previous step as the current input, until the generation reaches the maximum sequence length or the predefined end token [EOS].\nB. Comparison with other methods # We select some representative video captioning methods to compare with AGPFormer on CVACBench. These include the two-stage methods RecNet [63] and SGN [24], which require pre-extraction of video appearance and motion features, and the end-to-end training method Swinbert [25]. We also include CLIP4Caption [65], CoCap [66], and CARE [64], which use the prior of a pre-trained visual language model (CLIP [51]) to improve the generated description. Among them, CLIP4Caption adapts the CLIP framework to the field of video captioning, and CoCap explores a captioning method that applies CLIP pre-trained representations in the video compression domain. CARE first detects concepts from text corpora based on the CLIP model, and improves the quality of caption generation through a global-local semantic guidance mechanism. We directly use the open source implementations of these methods and run them with the default settings to obtain experimental results. Quantitative and qualitative comparison results of these methods are provided below respectively.\nQuantitative comparison: In terms of experimental evaluation metrics, we use four traditional captioning metrics to evaluate performance: (1) BLEU (B@1, B@4) [90] measures n-gram precision between the generated and reference captions, emphasizing accuracy in word sequences but limited in capturing sentence structure and semantics. 
(2) ROUGE-L [91] focuses on the longest common subsequence (LCS) between the generated and reference captions, reflecting the coverage of both content and word order. (3) Meteor [92] accounts for word variations, synonyms, and word order, providing a more comprehensive measure of semantic similarity through a weighted precision and recall calculation. (4) CIDEr [93] gives higher weights to rare words in the generated text to evaluate similarity, which can highlight the relevance and uniqueness of the content. In addition, we also use A-F1 proposed in Sec. III-B to more comprehensively evaluate the ability of AGPFormer to capture anomaly semantics.\nTABLE III. COMPARISON WITH OTHER METHODS ON THE TEST SPLIT OF CVACBENCH. RECNET [63] AND SGN [24] ARE TWO-STAGE VIDEO CAPTIONING METHODS, SWINBERT [25] IS AN END-TO-END METHOD. BOLD REPRESENTS BEST PERFORMANCE.\n| Method | Backbone (2D Appearance) | Backbone (3D Motion) | B@1 | B@4 | Meteor | ROUGE-L | CIDEr | A-F1(T5) |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| RecNet [63] | Inception-V4 [104] | - | 41.06 | 8.80 | 11.77 | 27.20 | 10.39 | 10.03 |\n| SGN [24] | ResNet-101 [105] | 3D-ResNext-101 [106] | 42.60 | 9.62 | 13.04 | 29.18 | 16.60 | 13.49 |\n| Swinbert [25] | VidSwin-Tiny | VidSwin-Tiny | 42.34 | 9.85 | 13.14 | 30.26 | 19.35 | 15.98 |\n| Swinbert [25] | VidSwin-Small | VidSwin-Small | 43.71 | 10.05 | 13.50 | 29.75 | 20.71 | 17.12 |\n| CLIP4Caption [65] | CLIP-ViT-B/32 | CLIP-ViT-B/32 | 43.90 | 8.07 | **13.99** | 29.50 | 17.32 | 15.10 |\n| CoCap [66] | CLIP-ViT-B/32 | CLIP-ViT-B/32 | 44.96 | 9.73 | 13.84 | 30.45 | 18.42 | 16.48 |\n| CARE [64] | CLIP-ViT-B/32 | CLIP-ViT-B/32 | **45.66** | 9.93 | 13.46 | 29.53 | 21.71 | 20.52 |\n| AGPFormer | VidSwin-Tiny | VidSwin-Tiny | 44.01 | 10.11 | 13.54 | 30.56 | 22.10 | 22.85 |\n| AGPFormer | VidSwin-Small | VidSwin-Small | 44.48 | **10.61** | 13.72 | **31.52** | **23.97** | **24.32** |\nIn Tab. III, we show the caption generation accuracy of AGPFormer compared with other methods. 
For RecNet, we follow the default settings and use Inception-V4 [104] as the video backbone, and for SGN we use ResNet-101 [105] and 3D-ResNext-101 [106] as the video appearance and motion backbones respectively. For CLIP4Caption, CoCap, and CARE, we use CLIP\u0026rsquo;s ViT-B/32 weights to initialize the network. For Swinbert and AGPFormer we use two backbone versions of VidSwin [96] (VidSwin-Tiny and VidSwin-Small) for experiments. As we can see, our AGPFormer (VidSwin-Small) achieves the best performance on B@4, ROUGE-L, CIDEr and A-F1, reaching 23.97 on CIDEr and 24.32 on A-F1, significantly exceeding the other comparison methods. Traditional metrics such as BLEU, Meteor and ROUGE-L usually evaluate model performance based on word matching or word sequence matching in sentences. Since the generated description contains a large number of common words such as \u0026ldquo;the, a, to\u0026rdquo;, it is difficult for these metrics to highlight the model\u0026rsquo;s ability to capture rare anomalies in videos. In contrast, CIDEr first calculates weights based on word frequency in sentences, giving higher importance to rare but relevant words. Our proposed A-F1 uses the evaluation of abnormal keywords as the primary criterion for calculation. Therefore, it can more accurately reflect the model\u0026rsquo;s ability to capture abnormal events, which supports the claim that our AGPFormer is more suitable for generating captions for abnormal videos. In addition, we observe that RecNet and SGN perform poorly on CVACBench. These two-stage methods struggle to establish effective communication between the pre-extracted visual features and the downstream language model when faced with complex long videos. 
The CoCap and CARE methods based on CLIP pretrained representations show obvious advantages in word-level metrics (CARE\u0026rsquo;s B@1 achieves the best performance of 45.66), which may be due to the powerful visual language cross-modal transfer ability of the CLIP model, but there are still limitations in capturing abnormal key information. In contrast, Swinbert and AGPFormer, which are based on end-to-end sequence modeling, achieve a better trade-off between cross-modal alignment and capturing abnormal semantics.\nQualitative comparison: In Fig. 9, we randomly select two abnormal videos from CVACBench to show the results of the qualitative analysis of AGPFormer versus other comparative methods. We highlight the anomalous keywords in each event using red font and also provide the GT caption for each video. The first example is a challenging and complex robbery event where a man with a pistol enters the store and robs it. Although SGN and Swinbert were able to recognize that the current scene was a store, they completely ignored the robbery and fighting behaviors. In contrast, AGPFormer attempts to capture the human interactions in the scene and reason about the intent of the robbery. The second example is an explosion event, which is simpler than the first example. However, SGN fails to perform well in this example, generating a completely disorganized caption. The caption generated by Swinbert completely covers keywords such as \u0026ldquo;fire\u0026rdquo;, \u0026ldquo;smoke\u0026rdquo; and \u0026ldquo;explosion\u0026rdquo;, but the overall statement has many grammatical errors and is poorly readable. AGPFormer provides a more fine-grained event description, covering all Abnormalkey. At the same time, the generated statements are syntactically complete and very expressive (e.g., \u0026ldquo;massive explosion\u0026rdquo;).\nC. Ablation experiments # We performed several ablation studies using VidSwin-Small [96] as the backbone of AGPFormer as follows. 
This includes the validation of the importance of key modules and the study of some hyperparameters. The detailed results are shown in Tab. IV.\nFig. 9. Qualitative result of video description on CVACBench. The example above is a store robbery event and the example below is a road explosion event. We compared AGPFormer with SGN [24] and Swinbert [25], and the keywords in the generated statements are highlighted in red.\nTABLE IV. ABLATION EXPERIMENTS OF KEY MODULES IN AGPFORMER.\n| AMLM | L_BCE | L_SPARSE | IGP | SAP | B@1 | B@4 | Meteor | ROUGE-L | CIDEr | A-F1(T5) |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| - | ✓ | ✓ | ✓ | ✓ | 43.41 | 8.91 | 12.97 | 29.99 | 19.41 | 20.19 |\n| ✓ | - | - | ✓ | ✓ | 42.94 | 9.65 | 13.12 | 30.13 | 20.14 | 21.56 |\n| ✓ | ✓ | ✓ | - | - | 42.46 | 10.03 | 12.98 | 29.07 | 20.20 | 21.17 |\n| ✓ | ✓ | ✓ | - | ✓ | 43.84 | 10.45 | 13.60 | 30.15 | 21.65 | 22.44 |\n| ✓ | ✓ | ✓ | ✓ | - | 44.03 | 10.32 | 13.83 | 30.50 | 22.01 | 23.20 |\n| ✓ | ✓ | ✓ | ✓ | ✓ | 44.48 | 10.61 | 13.72 | 31.52 | 23.97 | 24.32 |\nAblation experiments of AMLM mechanism: We first remove the proposed AMLM mechanism. Since AMLM allows us to introduce the abnormal prior in the training phase, it directs the model to pay more attention to the Abnormalkey when generating masked tokens. Its impact on the overall framework is thus critical, as can be seen from the results in the first row of Tab. IV. Even with all the other modules running, the model loses its sensitivity to anomalous events: the CIDEr score decreases to 19.41 and the A-F1 score decreases to 20.19. Since the proportion of abnormal words in the dataset is very small in the proposed CVAC task, the data distribution is unbalanced. Because both CIDEr and A-F1 reflect how well the model describes rare words, these two scores drop significantly. The other metrics (BLEU, Meteor and ROUGE-L) weight all words equally, so their drop is less obvious.\nAblation experiments of L_BCE and L_SPARSE: We then conducted experiments on the proposed anomaly-led attention mask Ê_v. 
The design goal of Ê_v is to motivate the model to adaptively select the video tokens fed into the transformer according to the constraints of anomaly confidence and sparse loss, which we denote as a combination of L_BCE and L_SPARSE. The results are shown in the second row of Tab. IV. The CIDEr decreases to 20.14 and the BLEU@4 to 9.65. This shows that Ê_v can alleviate the redundancy of the original video sequence input and at the same time guide the model to pay more attention to anomalies in surveillance.\nAblation experiments of IGP and SAP: IGP and SAP are the key modules for performing cross-modal alignment in AGPFormer. When both modules are removed (results are shown in row 3 of Tab. IV), CIDEr (20.20) is reduced by 3.77 compared to AGPFormer (23.97), which is a more significant performance degradation. For IGP, we expect it to accurately retrieve the masked information from the naively concatenated tokens (video tokens and caption tokens) in the form of prompt learning. This is subsequently used as an efficient multimodal connector attached to the concatenation features for caption generation. When IGP alone is removed (row 4 of Tab. IV), CIDEr decreases to 21.65 (down 2.32). For SAP, we use it to align scene information and visual modalities. Since scene information is a neutral statement, it allows the model to learn commonsense knowledge beyond anomalies. When SAP alone is removed (row 5 of Tab. IV), CIDEr decreases to 22.01 (down 1.96). In comparison, IGP has a greater impact on the overall framework, which demonstrates that learning interactive prompts can help the model to better model anomaly descriptions.\nTABLE V. ABLATION EXPERIMENTS ON DIFFERENT FUSION STRUCTURES IN THE SAP MODULE. EACH STRUCTURE USES TWO LAYERS.\n| Method | Params | B@1 | B@4 | Meteor | ROUGE-L | CIDEr | A-F1(T5) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| MLP layer | 0.26M | 44.48 | 10.61 | 13.72 | 31.52 | 23.97 | 24.32 |\n| GCN layer | 1.58M | 43.49 | 9.77 | 13.11 | 29.61 | 20.76 | 23.10 |\n| SA layer | 2.36M | 46.09 | 10.91 | 14.42 | 30.67 | 24.76 | 25.65 |\n| LSTM layer | 3.68M | 44.06 | 9.47 | 13.36 | 29.85 | 22.01 | 24.35 |\nAblation experiments on different fusion structures in SAP: To explore the potential of the SAP module in visual language interaction, we replace the MLP layer with a variety of structures, including a more complex self-attention layer (SA), a graph convolution layer (GCN), and an LSTM layer, to comprehensively analyze the performance impact of different operations on the video and scene text alignment prompt process. The experimental results are shown in Tab. V. When we use GCN and LSTM as fusion structures, the number of parameters of the model increases, but its performance decreases on almost all metrics. Among them, GCN has the largest decrease, which may be because it was originally designed for data with a clear topological structure. When the SA layer is used for prompting, although the number of parameters (2.36M) is higher than for the MLP layer, the model achieves better results (B@1 is 46.09, CIDEr is 24.76, and A-F1 is 25.65). Therefore, we believe that the MLP layer strikes a balance between performance and the number of parameters, and the SA layer can further improve the performance of our model.\nEffectiveness of N_p in IGP and SAP: An important hyperparameter in the IGP and SAP modules is the number of learnable prompts N_p. In order to explore the sensitivity of IGP and SAP to the number of prompts, we redefine N_p as two independent hyperparameters N_IGP and N_SAP in this section. We use the control variable method to conduct experiments. First, we fix the N_SAP value of SAP, and then change the N_IGP value of IGP for multiple experiments. 
The N_SAP value range is {4, 6, 8, 10}, and the N_IGP value range is {6, 8, 10, 12} to cover a reasonable range. Fig. 10 shows the experimental results. The left and right figures are line graphs of CIDEr and A-F1, respectively. Each line represents the impact of changing N_IGP on performance when the current N_SAP is fixed. As can be seen from Fig. 10, more prompts are not necessarily better. The combination of prompts with the best performance is (N_SAP: 8, N_IGP: 10). Performance peaks at N_SAP = 8 and declines beyond it. This indicates that the IGP module may need longer prompts to capture the details of individual instances, while the SAP module may have a lower demand for scene-level information, so shorter prompts can be tried. When N_SAP and N_IGP both exceed 8, the performance of the model shows a downward trend, possibly because the additional prompts introduce more redundant information.\nFig. 10. The impact of different N_IGP and N_SAP on generating captions. The left and right figures are line graphs of CIDEr and A-F1, respectively. The horizontal axis represents the number of prompts and the vertical axis represents the evaluation score. Each line represents the impact of changing N_IGP on performance when the current N_SAP is fixed.\nFig. 11. False alarm caption visualization for normal videos by different methods.\nD. Analysis In-Depth # False alarm caption for normal videos: Although AGPFormer can capture abnormal events in video sequences, it also needs the ability to describe normal videos. 
Due to the inherent rarity and transient nature of anomalous events, they have more distinctive characteristics compared to normal videos. Although we set the ratio of normal and abnormal videos in CVACBench to 1:1, conventional methods still produce false alarms for normal videos. We randomly selected a normal video from the test set as shown in Fig. 11. We compared the captions generated by SGN [24], Swinbert [25], and AGPFormer. SGN and Swinbert misinterpreted the video as robbery and assault events respectively. This may be because the current scene is located in a store and the models erroneously associated these two abnormal events with the store environment. Our AGPFormer, on the other hand, was able to point out that a man was sitting on a chair in the store.\nTABLE VI. DETAILED COMPARISON OF SPEED WITH OTHER METHODS ON THE CVACBENCH TEST SET. MODELS RUN ON AN NVIDIA RTX 2080TI GPU.\n| Method | Backbone | Feature Extraction (ms) | Model Time (ms) | Total (ms) |\n| --- | --- | --- | --- | --- |\n| RecNet | Inception-V4 | 279 | 101 | 380 |\n| SGN | Res101\u0026amp;3D-ResN101 | 314 | 148 | 462 |\n| Swinbert | VidSwin-Small | - | - | 339 |\n| AGPFormer | VidSwin-Small | - | - | 343 |\nRunning time analysis: We analyze the running speed of AGPFormer in Tab. VI. The experiment was conducted on an NVIDIA RTX 2080Ti GPU, and the batch size was set to 1. We compared it with three representative methods, among which RecNet [63] (backbone is Inception-V4 [104]) and SGN [24] (backbone is ResNet-101 [105] and 3D-ResNext-101 [106]) are both two-stage methods. They first extract offline features from video frames (taking 279 and 314 ms respectively), and then feed the features into the model for generation (taking 101 and 148 ms respectively). So their total time is 380 and 462 ms. AGPFormer is similar to Swinbert [25] in that it is an end-to-end caption generation model, which does not require offline feature extraction, and thus takes less time. 
The total time taken by AGPFormer is 343 ms, so its processing efficiency is comparable to that of Swinbert (339 ms).\nVI. CONCLUSION # We propose CVAC, a new anomaly video analysis task designed to more comprehensively analyze anomalous events in surveillance video. We build a CVACBench benchmark to facilitate this research, which contains fine-grained scene and event annotations, and propose a new evaluation metric A-F1 to more accurately evaluate caption generation performance. To bring video captioning to anomalous events, we propose a baseline method, AGPFormer, which builds a novel video anomaly understanding framework based on the prompt learning paradigm and outperforms other video captioning methods. In the future, we hope that CVAC can help the community improve its comprehensive capabilities in anomaly analysis.\nREFERENCES # [1] B. Ramachandra, M. Jones, and R. R. Vatsavai, \u0026ldquo;A survey of single-scene video anomaly detection,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, 2020.\n[2] R. Nayak, U. C. Pati, and S. K. Das, \u0026ldquo;A comprehensive review on deep learning-based methods for video anomaly detection,\u0026rdquo; Image and Vision Computing, vol. 106, p. 104078, 2021.\n[3] Q. Bao, F. Liu, Y. Liu, L. Jiao, X. Liu, and L. Li, \u0026ldquo;Hierarchical scene normality-binding modeling for anomaly detection in surveillance videos,\u0026rdquo; in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 6103–6112.\n[4] Z. Fang, J. T. Zhou, Y. Xiao, Y. Li, and F. Yang, \u0026ldquo;Multi-encoder towards effective anomaly detection in videos,\u0026rdquo; IEEE Transactions on Multimedia, vol. 23, pp. 4106–4116, 2020.\n[5] N. Li, F. Chang, and C. Liu, \u0026ldquo;Spatial-temporal cascade autoencoder for video anomaly detection in crowded scenes,\u0026rdquo; IEEE Transactions on Multimedia, vol. 23, pp. 203–215, 2020.\n[6] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. 
Davis, \u0026ldquo;Learning temporal regularity in video sequences,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 733–742.\n[7] W. Luo, W. Liu, and S. Gao, \u0026ldquo;Remembering history with convolutional lstm for anomaly detection,\u0026rdquo; in 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017, pp. 439–444.\n[8] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. Li, \u0026ldquo;A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13 588–13 597.\n[9] S. Li, F. Liu, and L. Jiao, \u0026ldquo;Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1395–1403.\n[10] S. Chang, Y. Li, S. Shen, J. Feng, and Z. Zhou, \u0026ldquo;Contrastive attention for video anomaly detection,\u0026rdquo; IEEE Transactions on Multimedia, vol. 24, pp. 4067–4076, 2021.\n[11] H. Shi, L. Wang, S. Zhou, G. Hua, and W. Tang, \u0026ldquo;Abnormal ratios guided multi-phase self-training for weakly-supervised video anomaly detection,\u0026rdquo; IEEE Transactions on Multimedia, 2023.\n[12] W. Sultani, C. Chen, and M. Shah, \u0026ldquo;Real-world anomaly detection in surveillance videos,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6479–6488.\n[13] J.-C. Feng, F.-T. Hong, and W.-S. Zheng, \u0026ldquo;Mist: Multiple instance self-training framework for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14 009–14 018.\n[14] C. Sun, Y. Jia, Y. Hu, and Y. 
Wu, \u0026ldquo;Scene-aware context reasoning for unsupervised abnormal event detection in videos,\u0026rdquo; in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 184–192.\n[15] G. Yu, S. Wang, Z. Cai, E. Zhu, C. Xu, J. Yin, and M. Kloft, \u0026ldquo;Cloze test helps: Effective video anomaly detection via learning to complete video events,\u0026rdquo; in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 583–591.\n[16] M. J. Leach, E. P. Sparks, and N. M. Robertson, \u0026ldquo;Contextual anomaly detection in crowded surveillance scenes,\u0026rdquo; Pattern Recognition Letters, vol. 44, pp. 71–79, 2014.\n[17] S. Liu, A. Li, J. Wang, and Y. Wang, \u0026ldquo;Bidirectional maximum entropy training with word co-occurrence for video captioning,\u0026rdquo; IEEE Transactions on Multimedia, 2022.\n[18] J. Song, Y. Guo, L. Gao, X. Li, A. Hanjalic, and H. T. Shen, \u0026ldquo;From deterministic to generative: Multimodal stochastic rnns for video captioning,\u0026rdquo; IEEE transactions on neural networks and learning systems, vol. 30, no. 10, pp. 3047–3058, 2018.\n[19] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, \u0026ldquo;Dense-captioning events in videos,\u0026rdquo; in Proceedings of the IEEE international conference on computer vision, 2017, pp. 706–715.\n[20] T. Wang, R. Zhang, Z. Lu, F. Zheng, R. Cheng, and P. Luo, \u0026ldquo;End-to-end dense video captioning with parallel decoding,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6847–6857.\n[21] H. Ye, G. Li, Y. Qi, S. Wang, Q. Huang, and M.-H. Yang, \u0026ldquo;Hierarchical modular network for video captioning,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 939–17 948.\n[22] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. 
Dai, \u0026ldquo;Stat: Spatial-temporal attention mechanism for video captioning,\u0026rdquo; IEEE transactions on multimedia, vol. 22, no. 1, pp. 229–241, 2019.\n[23] S. Jing, H. Zhang, P. Zeng, L. Gao, J. Song, and H. T. Shen, \u0026ldquo;Memory-based augmentation network for video captioning,\u0026rdquo; IEEE Transactions on Multimedia, 2023.\n[24] H. Ryu, S. Kang, H. Kang, and C. D. Yoo, \u0026ldquo;Semantic grouping network for video captioning,\u0026rdquo; in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2514–2522.\n[25] K. Lin, L. Li, C.-C. Lin, F. Ahmed, Z. Gan, Z. Liu, Y. Lu, and L. Wang, \u0026ldquo;Swinbert: End-to-end transformers with sparse attention for video captioning,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 949–17 958.\n[26] J. Xu, T. Mei, T. Yao, and Y. Rui, \u0026ldquo;Msr-vtt: A large video description dataset for bridging video and language,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5288–5296.\n[27] X. Wang, J. Wu, J. Chen, L. Li, Y.-F. Wang, and W. Y. Wang, \u0026ldquo;Vatex: A large-scale, high-quality multilingual dataset for video-and-language research,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4581–4591.\n[28] N. Aafaq, A. S. Mian, N. Akhtar, W. Liu, and M. Shah, \u0026ldquo;Dense video captioning with early linguistic information fusion,\u0026rdquo; IEEE Transactions on Multimedia, 2022.\n[29] Z. Zhang, D. Xu, W. Ouyang, and L. Zhou, \u0026ldquo;Dense video captioning using graph-based sentence summarization,\u0026rdquo; IEEE Transactions on Multimedia, vol. 23, pp. 1799–1810, 2020.\n[30] X. Shen, D. Li, J. Zhou, Z. Qin, B. He, X. Han, A. Li, Y. Dai, L. Kong, M. 
Wang et al., \u0026ldquo;Fine-grained audible video description,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 585–10 596.\n[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, \u0026ldquo;Bert: Pre-training of deep bidirectional transformers for language understanding,\u0026rdquo; arXiv preprint arXiv:1810.04805, 2018.\n[32] D. Chen and W. B. Dolan, \u0026ldquo;Collecting highly parallel data for paraphrase evaluation,\u0026rdquo; in Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, 2011, pp. 190–200.\n[33] L. Zhou, C. Xu, and J. Corso, \u0026ldquo;Towards automatic learning of procedures from web instructional videos,\u0026rdquo; in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.\n[34] A. Goyal, M. Mandal, V. Hassija, M. Aloqaily, and V. Chamola, \u0026ldquo;Captionomaly: A deep learning toolbox for anomaly captioning in social surveillance systems,\u0026rdquo; IEEE Transactions on Computational Social Systems, 2023.\n[35] L. Jiao, J. Chen, F. Liu, S. Yang, C. You, X. Liu, L. Li, and B. Hou, \u0026ldquo;Graph representation learning meets computer vision: A survey,\u0026rdquo; IEEE Transactions on Artificial Intelligence, 2022.\n[36] L. Jiao, R. Shang, F. Liu, and W. Zhang, Brain and Nature-Inspired Learning, Computation and Recognition. Elsevier, 2020.\n[37] L. Jiao, R. Zhang, F. Liu, S. Yang, B. Hou, L. Li, and X. Tang, \u0026ldquo;New generation deep learning for video object detection: A survey,\u0026rdquo; IEEE Transactions on Neural Networks and Learning Systems, 2021.\n[38] F. Liu, X. Qian, L. Jiao, X. Zhang, L. Li, and Y. Cui, \u0026ldquo;Contrastive learning-based dual dynamic gcn for sar image scene classification,\u0026rdquo; IEEE Transactions on Neural Networks and Learning Systems, 2022.\n[39] G. Wang, Y. Wang, J. Qin, D. Zhang, X. Bao, and D. 
Huang, \u0026ldquo;Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles,\u0026rdquo; in European Conference on Computer Vision. Springer, 2022, pp. 494–511.\n[40] J. Yu, Y. Lee, K. C. Yow, M. Jeon, and W. Pedrycz, \u0026ldquo;Abnormal event detection and localization via adversarial event prediction,\u0026rdquo; IEEE Transactions on Neural Networks and Learning Systems, 2021.\n[41] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection–a new baseline,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6536–6545.\n[42] X. Wang, Z. Che, B. Jiang, N. Xiao, K. Yang, J. Tang, J. Ye, J. Wang, and Q. Qi, \u0026ldquo;Robust unsupervised video anomaly detection by multipath frame prediction,\u0026rdquo; IEEE transactions on neural networks and learning systems, 2021.\n[43] M.-I. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, \u0026ldquo;Anomaly detection in video via self-supervised and multi-task learning,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 742–12 752.\n[44] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, \u0026ldquo;Weakly-supervised video anomaly detection with robust temporal feature magnitude learning,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 4975–4986.\n[45] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang, \u0026ldquo;Vadclip: Adapting vision-language models for weakly supervised video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2308.11681, 2023.\n[46] C. Feichtenhofer, H. Fan, J. Malik, and K. He, \u0026ldquo;Slowfast networks for video recognition,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6202–6211.\n[47] J. Lin, C. Gan, and S. 
Han, \u0026ldquo;Tsm: Temporal shift module for efficient video understanding,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7083–7093.\n[48] J. Carreira and A. Zisserman, \u0026ldquo;Quo vadis, action recognition? a new model and the kinetics dataset,\u0026rdquo; in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.\n[49] X. Feng, D. Song, Y. Chen, Z. Chen, J. Ni, and H. Chen, \u0026ldquo;Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection,\u0026rdquo; in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 5546–5554.\n[50] C. Huang, C. Liu, J. Wen, L. Wu, Y. Xu, Q. Jiang, and Y. Wang, \u0026ldquo;Weakly supervised video anomaly detection via self-guided temporal discriminative transformer,\u0026rdquo; IEEE Transactions on Cybernetics, 2022.\n[51] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in International conference on machine learning. PMLR, 2021, pp. 8748–8763.\n[52] H. Lv, C. Zhou, Z. Cui, C. Xu, Y. Li, and J. Yang, \u0026ldquo;Localizing anomalies from weakly-labeled videos,\u0026rdquo; IEEE transactions on image processing, vol. 30, pp. 4505–4515, 2021.\n[53] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. Wu, \u0026ldquo;Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection,\u0026rdquo; in Proceedings of the AAAI conference on artificial intelligence, vol. 37, no. 1, 2023, pp. 387–395.\n[54] C. Lu, J. Shi, and J. Jia, \u0026ldquo;Abnormal event detection at 150 fps in matlab,\u0026rdquo; in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2720–2727.\n[55] W. Luo, W. Liu, and S. 
Gao, \u0026ldquo;A revisit of sparse coding based anomaly detection in stacked rnn framework,\u0026rdquo; in Proceedings of the IEEE international conference on computer vision, 2017, pp. 341–349.\n[56] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, \u0026ldquo;Not only look, but also listen: Learning multimodal violence detection under weak supervision,\u0026rdquo; in European Conference on Computer Vision (ECCV), 2020.\n[57] C. Cao, Y. Lu, P. Wang, and Y. Zhang, \u0026ldquo;A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 20 392–20 401.\n[58] A. Acsintoae, A. Florescu, M.-I. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah, \u0026ldquo;Ubnormal: New benchmark for supervised open-set video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 20 143–20 153.\n[59] Y. Yao, X. Wang, M. Xu, Z. Pu, Y. Wang, E. Atkins, and D. J. Crandall, \u0026ldquo;Dota: Unsupervised detection of traffic anomaly in driving videos,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 1, pp. 444–459, 2022.\n[60] M. Yang, W. Zhao, W. Xu, Y. Feng, Z. Zhao, X. Chen, and K. Lei, \u0026ldquo;Multitask learning for cross-domain image captioning,\u0026rdquo; IEEE Transactions on Multimedia, vol. 21, no. 4, pp. 1047–1061, 2018.\n[61] J. Wu, T. Chen, H. Wu, Z. Yang, G. Luo, and L. Lin, \u0026ldquo;Fine-grained image captioning with global-local discriminative objective,\u0026rdquo; IEEE Transactions on Multimedia, vol. 23, pp. 2413–2427, 2020.\n[62] A. Nguyen, Q. D. Tran, T.-T. Do, I. Reid, D. G. Caldwell, and N. G. 
Tsagarakis, \u0026ldquo;Object captioning and retrieval with natural language,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019, pp. 0–0.\n[63] B. Wang, L. Ma, W. Zhang, and W. Liu, \u0026ldquo;Reconstruction network for video captioning,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7622–7631.\n[64] B. Yang, M. Cao, and Y. Zou, \u0026ldquo;Concept-aware video captioning: Describing videos with effective prior information,\u0026rdquo; IEEE Transactions on Image Processing, 2023.\n[65] M. Tang, Z. Wang, Z. Liu, F. Rao, D. Li, and X. Li, \u0026ldquo;Clip4caption: Clip for video caption,\u0026rdquo; in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4858–4862.\n[66] Y. Shen, X. Gu, K. Xu, H. Fan, L. Wen, and L. Zhang, \u0026ldquo;Accurate and fast compressed video captioning,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 558–15 567.\n[67] A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev, J. Sivic, and C. Schmid, \u0026ldquo;Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 714–10 726.\n[68] T. Hoang, T.-T. Do, T. V. Nguyen, and N.-M. Cheung, \u0026ldquo;Multimodal mutual information maximization: A novel approach for unsupervised deep cross-modal hashing,\u0026rdquo; IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 9, pp. 6289–6302, 2022.\n[69] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, \u0026ldquo;Learning to prompt for vision-language models,\u0026rdquo; International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.\n[70] G. Sun, C. Wang, Z. Zhang, J. Deng, S. Zafeiriou, and Y. 
Hua, \u0026ldquo;Spatiotemporal prompting network for robust video feature extraction,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13 587–13 597.\n[71] Y. Pei, Z. Qing, S. Zhang, X. Wang, Y. Zhang, D. Zhao, and X. Qian, \u0026ldquo;Space-time prompting for video class-incremental learning,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11 932–11 942.\n[72] X. Liu, J. Wu, W. Yang, X. Zhou, and T. Zhang, \u0026ldquo;Multi-modal attribute prompting for vision-language models,\u0026rdquo; IEEE Transactions on Circuits and Systems for Video Technology, 2024.\n[73] J. Gao, M. Chen, and C. Xu, \u0026ldquo;Vectorized evidential learning for weakly-supervised temporal action localization,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, 2023.\n[74] J. Gao, T. Zhang, and C. Xu, \u0026ldquo;Learning to model relationships for zero-shot video classification,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3476–3491, 2020.\n[75] Y. Liu, F. Liu, L. Jiao, Q. Bao, L. Li, Y. Guo, and P. Chen, \u0026ldquo;A knowledge-based hierarchical causal inference network for video action recognition,\u0026rdquo; IEEE Transactions on Multimedia, 2024.\n[76] Y. Liu, F. Liu, L. Jiao, Q. Bao, L. Sun, S. Li, L. Li, and X. Liu, \u0026ldquo;Multigrained gradual inference model for multimedia event extraction,\u0026rdquo; IEEE Transactions on Circuits and Systems for Video Technology, 2024.\n[77] Z. Wu, J. Gao, and C. Xu, \u0026ldquo;Weakly-supervised video scene graph generation via unbiased cross-modal learning,\u0026rdquo; in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 4574–4583.\n[78] X. Yao, J. Gao, M. Chen, and C. 
Xu, \u0026ldquo;Video entailment via reaching a structure-aware cross-modal consensus,\u0026rdquo; in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 4240–4249.\n[79] J. Gao, X. Yang, Y. Zhang, and C. Xu, \u0026ldquo;Unsupervised video summarization via relation-aware assignment learning,\u0026rdquo; IEEE Transactions on Multimedia, vol. 23, pp. 3203–3214, 2020.\n[80] W. Luo, W. Liu, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction network for video anomaly detection,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 11, pp. 7505–7520, 2021.\n[81] C. Cao, H. Zhang, Y. Lu, P. Wang, and Y. Zhang, \u0026ldquo;Scene-dependent prediction in latent space for video anomaly detection and anticipation,\u0026rdquo; IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.\n[82] C. Sun, Y. Jia, H. Song, and Y. Wu, \u0026ldquo;Adversarial 3d convolutional autoencoder for abnormal event detection in videos,\u0026rdquo; IEEE Transactions on Multimedia, vol. 23, pp. 3292–3305, 2020.\n[83] M. I. Georgescu, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, \u0026ldquo;A background-agnostic framework with adversarial training for abnormal event detection in video,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 9, pp. 4505–4523, 2021.\n[84] P. Wu, W. Wang, F. Chang, C. Liu, and B. Wang, \u0026ldquo;Dss-net: Dynamic self-supervised network for video anomaly detection,\u0026rdquo; IEEE Transactions on Multimedia, 2023.\n[85] C. Tao, C. Wang, S. Lin, S. Cai, D. Li, and J. Qian, \u0026ldquo;Feature reconstruction with disruption for unsupervised video anomaly detection,\u0026rdquo; IEEE Transactions on Multimedia, 2024.\n[86] H. Song, C. Sun, X. Wu, M. Chen, and Y. Jia, \u0026ldquo;Learning normal patterns via adversarial attention-based autoencoder for abnormal event detection in videos,\u0026rdquo; IEEE Transactions on Multimedia, vol. 22, no. 8, pp. 
2138–2148, 2019.\n[87] C. Huang, Q. Xu, Y. Wang, Y. Wang, and Y. Zhang, \u0026ldquo;Self-supervised masking for unsupervised anomaly detection and localization,\u0026rdquo; IEEE Transactions on Multimedia, vol. 25, pp. 4426–4438, 2022.\n[88] P. Wu, X. Liu, and J. Liu, \u0026ldquo;Weakly supervised audio-visual violence detection,\u0026rdquo; IEEE Transactions on Multimedia, vol. 25, pp. 1674–1685, 2022.\n[89] J. Meng, H. Tian, G. Lin, J.-F. Hu, and W.-S. Zheng, \u0026ldquo;Audio-visual collaborative learning for weakly supervised video anomaly detection,\u0026rdquo; IEEE Transactions on Multimedia, 2025.\n[90] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, \u0026ldquo;Bleu: a method for automatic evaluation of machine translation,\u0026rdquo; in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.\n[91] C.-Y. Lin and E. Hovy, \u0026ldquo;Automatic evaluation of summaries using n-gram co-occurrence statistics,\u0026rdquo; in Proceedings of the 2003 human language technology conference of the North American chapter of the association for computational linguistics, 2003, pp. 150–157.\n[92] M. Denkowski and A. Lavie, \u0026ldquo;Meteor universal: Language specific translation evaluation for any target language,\u0026rdquo; in Proceedings of the ninth workshop on statistical machine translation, 2014, pp. 376–380.\n[93] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, \u0026ldquo;Cider: Consensus-based image description evaluation,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.\n[94] S. Bird, E. Klein, and E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit. O\u0026rsquo;Reilly Media, Inc., 2009.\n[95] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. 
Liu, \u0026ldquo;Exploring the limits of transfer learning with a unified text-to-text transformer,\u0026rdquo; The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.\n[96] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, \u0026ldquo;Video swin transformer,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3202–3211.\n[97] Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu, \u0026ldquo;Ernie: Enhanced representation through knowledge integration,\u0026rdquo; arXiv preprint arXiv:1904.09223, 2019.\n[98] spacy. [Online]. Available: https://spacy.io/\n[99] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, \u0026ldquo;Align before fuse: Vision and language representation learning with momentum distillation,\u0026rdquo; Advances in neural information processing systems, vol. 34, pp. 9694–9705, 2021.\n[100] W. Kim, B. Son, and I. Kim, \u0026ldquo;Vilt: Vision-and-language transformer without convolution or region supervision,\u0026rdquo; in International Conference on Machine Learning. PMLR, 2021, pp. 5583–5594.\n[101] H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal, S. Som, S. Piao, and F. Wei, \u0026ldquo;Vlmo: Unified vision-language pretraining with mixture-of-modality-experts,\u0026rdquo; Advances in Neural Information Processing Systems, vol. 35, pp. 32 897–32 912, 2022.\n[102] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, \u0026ldquo;Attention is all you need,\u0026rdquo; Advances in neural information processing systems, vol. 30, 2017.\n[103] S. Sun and X. Gong, \u0026ldquo;Hierarchical semantic contrast for scene-aware video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 846–22 856.\n[104] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. 
Alemi, \u0026ldquo;Inception-v4, inception-resnet and the impact of residual connections on learning,\u0026rdquo; in Proceedings of the AAAI conference on artificial intelligence, vol. 31, no. 1, 2017.\n[105] K. He, X. Zhang, S. Ren, and J. Sun, \u0026ldquo;Deep residual learning for image recognition,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.\n[106] K. Hara, H. Kataoka, and Y. Satoh, \u0026ldquo;Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?\u0026rdquo; in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555.\nQianyue Bao (Student Member, IEEE) received the bachelor\u0026rsquo;s degree in digital media technology from North University of China, Taiyuan, China in 2020. He is currently pursuing the Ph.D. degree with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi\u0026rsquo;an, China. His main research interests include video analysis and deep learning.\nFang Liu (Senior Member, IEEE) received the B.S. degree in computer science and technology from Xi\u0026rsquo;an Jiaotong University, Xi\u0026rsquo;an, China, in 1984, and the M.S. degree in computer science and technology from Xidian University, Xi\u0026rsquo;an, in 1995. She is currently a Professor at Xidian University, Xi\u0026rsquo;an, China. She has authored or co-authored five books and over 80 papers in journals and conferences. Her current research interests include image perception and pattern recognition, machine learning, evolutionary computation and data mining. She won the second prize of the National Natural Science Award in 2013.\nLicheng Jiao (Fellow, IEEE) received the B.S. degree from Shanghai Jiaotong University, Shanghai, China, in 1982 and the M.S. and Ph.D. 
degree from Xi\u0026rsquo;an Jiaotong University, Xi\u0026rsquo;an, China, in 1984 and 1990, respectively. Since 1992, he has been a Professor with the school of Electronic Engineering, Xidian University, Xi\u0026rsquo;an, where he is currently the Director of Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education of China. His research interests include image processing, natural computation, machine learning, and intelligent information processing. Dr. Jiao is the Chairman of the Awards and Recognition Committee, the Vice Board Chairperson of the Chinese Association of Artificial Intelligence, the Foreign member of the Academia Europaea, the Foreign member of the Russian Academy of Natural Sciences, the fellow of IEEE/IET/CAAI/CIE/CCF/CAA, a Councilor of the Chinese Institute of Electronics, a committee member of the Chinese Committee of Neural Networks, and an expert of the Academic Degrees Committee of the State Council.\nYang Liu (Student Member, IEEE) received the B.S. degree in software development and testing from North University of China, Taiyuan, China in 2020. She is currently pursuing the Ph.D. degree with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence Xidian University, Xi\u0026rsquo;an, China. Her main research interests include image processing and machine learning.\nShuo Li (Member, IEEE) received the B.S. degree in software engineering from Wuhan University in 2016, and the Ph.D. degree in Computer Science and Technology from Xidian University, Xi\u0026rsquo;an, China, in 2023. He is currently a postdoctoral researcher of Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi\u0026rsquo;an, China. 
His research interests include computer vision, pattern recognition, and image interpretation.\nLingling Li (Senior Member, IEEE) received the B.S. and Ph.D. degrees from Xidian University, Xian, China, in 2011 and 2017 respectively. Between 2013-2014, she was an exchange Ph.D. student with the Intelligent Systems Group, Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Spain. She is currently a postdoctoral researcher in the School of Artificial Intelligence at Xidian University. Her current research interests include quantum evolutionary optimization, machine learning and deep learning.\nXu Liu (Member, IEEE) received the B.S. degrees in Mathematics and applied mathematics from North University of China, Taiyuan, China in 2013. He received the Ph.D. degrees from Xidian University, Xian, China, in 2019. He is currently associate professor of Huashan elite and postdoctoral researcher of Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi\u0026rsquo;an, China. He is the chair of IEEE Xidian university student branch. His current research interests include machine learning and image processing.\nXinyi Wang received the B.S. degrees in software engineering from Shanxi University, Taiyuan, China in 2022. Currently, she is pursuing the master\u0026rsquo;s degree in the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education, School of Artificial Intelligence, Xidian University, Xi\u0026rsquo;an, China. Her main research interests include video anomaly detection and deep learning.\nBaoLiang Chen received the B.S. degrees in software engineering from Lanzhou University of Technology, Lanzhou, China in 2022. 
Currently, he is pursuing the master\u0026rsquo;s degree in the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education, School of Artificial Intelligence, Xidian University, Xi\u0026rsquo;an, China. His main research direction is computer vision and multimodal learning.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/anomaly-led_prompting_learning_caption_generating_model_and_benchmark/","section":"Papers","summary":"Introduces a new task for comprehensive video anomaly captioning, proposes a large-scale benchmark dataset CVACBench with fine-grained annotations, and designs a baseline model AGPFormer using prompt learning to improve anomaly understanding and description accuracy.","title":"Anomaly-Led Prompting Learning Caption Generating Model and Benchmark","type":"other"},{"content":" This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.\nExcept for this watermark, it is identical to the accepted version;\nthe final published version of the proceedings is available on IEEE Xplore.\nAnomize: Better Open Vocabulary Video Anomaly Detection # Fei Li 1 , 2# Wenxuan Liu3,4# Jingjing Chen 2 Ruixu Zhang 1 Yuran Wang 1 Xian Zhong 4 Zheng Wang 1*\n1 National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University 2 Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University 3 State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University 4 Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology lifeiwhu@whu.edu.cn, liuwx66@pku.edu.cn\nFigure 1. Challenges Related to Novel Anomalies. (a) Detection ambiguity: The model struggles to assign accurate anomaly scores to unfamiliar frames containing novel anomalies. 
(b) Categorization confusion: Novel anomalies are misclassified as visually similar base instances from the training set.\n*Corresponding author. #Contributed equally to this work.\nAbstract # Open Vocabulary Video Anomaly Detection (OVVAD) seeks to detect and classify both base and novel anomalies. However, existing methods face two specific challenges related to novel anomalies. The first challenge is detection ambiguity, where the model struggles to assign accurate anomaly scores to unfamiliar anomalies. The second challenge is categorization confusion, where novel anomalies are often misclassified as visually similar base instances. To address these challenges, we explore supplementary information from multiple sources to mitigate detection ambiguity by leveraging multiple levels of visual data alongside matching textual information. Furthermore, we propose incorporating label relations to guide the encoding of new labels, thereby improving alignment between novel videos and their corresponding labels, which helps reduce categorization confusion. The resulting Anomize framework effectively tackles these issues, achieving superior performance on the UCF-CRIME and XD-VIOLENCE datasets, demonstrating its effectiveness in OVVAD.\n1. Introduction # Video Anomaly Detection (VAD) identifies anomalies in videos and is widely used in public safety systems. Traditional VAD methods can be categorized based on the type of training data used. Semi-supervised VAD [3, 22, 49] is trained exclusively on normal samples, detecting anomalies as deviations from learned normal patterns. In contrast, weakly supervised VAD [13, 33, 50] is trained on both normal and anomalous samples but lacks precise temporal labels, treating VAD as a binary classification problem. Both methods focus on detecting specific anomalies within a closed set and exhibit limitations in open-world scenarios.\nRecent research has explored open-set VAD [1, 9], where anomalies seen in training are considered base cases, while others are treated as novel cases. It trains on normal samples and base anomalies to detect all anomalies, overcoming the limitations of closed-set detection. However, it struggles with understanding anomaly categories, leading to unclear outputs [42]. Consequently, a leading study [42] has further investigated open vocabulary (OV) VAD, which aims to detect and categorize all anomalies using the same training data as open-set VAD, offering more informative results.\nNovel anomalies in OVVAD introduce two challenges that remain unexplored by existing methods: (1) Detection ambiguity, where the model often lacks sufficient information to accurately assign anomaly scores to unfamiliar data, as shown in Fig. 1(a). Current methods rely on training or fine-tuning the model, which is inherently limited and cannot adapt to the variability of samples in an open setting. (2) Categorization confusion, where novel cases visually similar to base cases are misclassified, as shown in Fig. 1(b). OV tasks generally rely on multimodal alignment for categorization. Since the model tends to extract visual features for novel videos that resemble those of base videos, these features are more likely to align with base label encodings, leading to miscategorization. Traditional OV methods use pre-trained encoders to encode text, where the input contains labels with unified templates [16, 29, 44, 45] or embeddings [2, 4, 53]. These methods rely solely on pre-trained encoders without spatial guidance in label encoding, limiting multimodal alignment for novel cases.\nFigure 2. Feature Visualization of Our Design. (a) Text augmentation shifts ambiguous frames to the anomalous feature space. In the static stream, text represents anomaly-related nouns (e.g., \u0026ldquo;abandoned fire starter\u0026rdquo;), while in the dynamic stream, it denotes label descriptions. (b) Group-guided text encoding improves the alignment of novel anomalies with novel labels, especially for those resembling base samples.\nTo address detection ambiguity, we introduce a Text-Augmented Dual Stream mechanism with dynamic and static streams, each focusing on different visual features augmented by corresponding textual information. The dynamic stream captures sequential information through temporal visual encoding, augmented by label descriptions related to dynamic characteristics. The static stream captures scene information through the original visual features encoded by contrastive language-image pre-training (CLIP), augmented by a concept library related to static characteristics. The complementarity between dynamic and static data is crucial: certain anomalies rely on temporal information, such as tailing, while others depend on contextual cues, such as running on a highway. Synergistic training of the streams ensures mutual supplementation and constraints, delivering comprehensive temporal and contextual information, minimizing overfitting to specific anomaly categories, and improving overall performance. Additionally, the augmentation follows common-sense reasoning. To detect anomalies in real-world scenarios, we first define the anomaly and establish correlations between visual data and anomaly texts, providing a reference for detection within the overall context. Similarly, we augment visual features with relevant anomaly text, providing additional information for detection. As shown in Fig. 2(a), novel visual features that cause ambiguous detections are shifted into the anomalous feature space with support from text, helping the model better assess unfamiliar anomalies.\nTo address categorization confusion, we introduce a Group-Guided Text Encoding mechanism, encoding labels using group-based descriptions, with labels sharing similar visual characteristics grouped together. As shown in Fig. 
2(b), this mechanism establishes connections between base and novel data through grouping, positioning the encodings of novel labels close to those of base labels whose associated videos are visually similar, thereby enhancing multimodal alignment for novel data. For novel labels not grouped with base labels, the descriptions provide contextual support to pre-trained encoders for text encoding, thus enhancing alignment. Compared to previous methods mainly relying on pre-trained models, our approach strengthens the guidance of the feature space for novel labels, achieving more effective alignment for categorization.\nWith the Anomize framework, we achieve notable results across both the XD-VIOLENCE [41] and UCF-CRIME [33] datasets. For anomaly detection, we obtain a 2.78% overall improvement on XD-VIOLENCE and 8.21% on novel cases, with UCF-CRIME results comparable to a more complex state-of-the-art model. For categorization, we achieve a 25.61% overall increase in Top-1 accuracy on XD-VIOLENCE and 5.71% on UCF-CRIME, with improvements of 56.53% and 4.49% on novel cases, respectively. In summary, our contributions are threefold:\n\nTo address detection ambiguity, we discover the importance of providing sufficient informational support. We combine dynamic and static streams to effectively constrain and complement each other. Operating at different levels of visual features, each stream is augmented with corresponding textual information, offering comprehensive support for detection.\nTo tackle categorization confusion, we emphasize the importance of establishing connections between labels to guide their encodings. We propose a text encoding mechanism that groups labels based on visual characteristics and generates corresponding descriptions for encodings. 
\nOur Anomize framework targets the challenges of novel anomalies that remain unexplored, offering new insights for OVVAD and demonstrating superior performance on two widely-used datasets, particularly for novel cases.\n2. Related Work # 2.1. Video Anomaly Detection # Semi-Supervised VAD. Existing semi-supervised video anomaly detection (VAD) methods are typically categorized into three groups: one-class classification (OCC), reconstruction-based models, and prediction-based models, all of which are trained exclusively on normal data. OCC models [31, 32, 35, 40, 49] classify anomalies by identifying data points that fall outside a learned hypersphere of normal data. However, defining normality can be ambiguous, which often reduces their effectiveness [23, 46]. Reconstruction-based methods [3, 25, 30, 48, 52] use deep auto-encoders (DAEs) to learn normal patterns, detecting anomalies through high reconstruction errors. However, DAEs may still reconstruct anomalous frames with low error, weakening detection performance. Prediction models [5, 8, 18, 22, 24, 26], often utilizing GANs, forecast future frames and identify anomalies by comparing predicted frames with actual frames.\nWeakly-Supervised VAD. Weakly-supervised video anomaly detection (WSVAD) identifies anomalies using only video-level labels without precise temporal or spatial information. WSVAD methods typically frame the task as a multiple instance learning (MIL) problem [19, 33, 34, 39, 50], where videos are divided into segments, and predictions are aggregated into video-level anomaly scores. Sultani et al. [33] first define the WSVAD paradigm using a deep multiple-instance ranking framework. Recent methods focus on optimizing models. Tian et al. [34] introduce RTFM, which combines dilated convolutions and self-attention to detect subtle anomalies, while Zaheer et al. [50] add a clustering-based normalcy suppression mechanism. 
Other approaches [13, 27] leverage pre-trained models to gain task-agnostic knowledge. Wu et al. [43] propose VadCLIP, which uses CLIP [29] for dual-branch outputs of anomaly scores and labels.\nOpen-Set VAD. Open-set VAD models are trained on normal behaviors and base anomalies to detect all anomalies, addressing the challenges of open-world environments. Acsintoae et al. [1] first introduce open-set VAD, along with a benchmark dataset and evaluation framework. Zhu et al. [54] combine evidential deep learning and normalizing flows within a multiple instance learning framework. Hirschorn et al. [9] propose a lightweight normalizing flows framework that utilizes human pose graph structures.\nOur method provides both detection and categorization results in an open setting, focusing on addressing the challenges related to novel anomalies.\n2.2. Open Vocabulary Learning # Recent advancements in pre-trained vision-language models [11, 29] have spurred significant interest in open vocabulary tasks, including object detection [6, 15, 51], semantic segmentation [7, 20, 47], and action recognition [12, 14, 21, 36]. These studies leverage the pre-trained knowledge of multimodal models, demonstrating strong generalization. Wu et al. [42] first introduce open vocabulary video anomaly detection (OVVAD) using the pre-trained model CLIP. However, most methods emphasize the visual encoder while neglecting the text encoder, limiting zero-shot capabilities. Our method explores the text encoder and incorporates a guided encoding mechanism to enhance multimodal alignment in OVVAD.\n3. Proposed Anomize Method # 3.1. Overview # Following Wu et al. [42], we define the training sample set as D = {(v_i, y_i)}_{i=1}^{N+A}, which consists of N normal samples D^n and A abnormal samples D^a. Here, v_i represents video samples, and y_i ∈ C_base denotes the corresponding anomaly labels. Each v_i ∈ D^a contains at least one anomalous frame, while v_i ∈ D^n consists entirely of normal frames. 
The complete label set C includes both base and novel anomaly labels. The objective of OVVAD is to train a model on D to predict frame-level anomaly scores and video-level anomaly labels from C.\nFig. 3 gives an overview of the framework. We leverage the encoders of the pre-trained CLIP model for their strong generalization capabilities. Video frames are processed by the CLIP image encoder Φ_visual to extract original visual features x_f ∈ R^{n×d}, where n is the number of frames and d is the feature dimension. These features are temporally modeled by a lightweight temporal encoder. The original features, augmented by a concept library ConceptLib, pass through the static stream, while temporal features, augmented by label descriptions, pass through the dynamic stream. The prediction from each stream is obtained and aggregated to generate the final frame-level anomaly score s ∈ R^{n×1}. For categorization, a multimodal alignment method is used. A fused visual feature is first generated, and the CLIP text encoder Φ_text extracts textual features via the group-guided text encoding mechanism. Frame-level predictions are then obtained through alignment and aggregated for the final video-level result p_video.\n3.2. Lightweight Temporal Encoder # We utilize the frozen Φ_visual for visual features to leverage its zero-shot capabilities. However, since CLIP is pre-trained on image-text pairs, it lacks temporal modeling for video. Recent methods [14, 37, 38] commonly introduce a temporal encoder. However, this often leads to performance degradation on novel cases, as the additional parameters in the encoder may become specialized for the training set, leading to overfitting. Therefore, we adopt a lightweight long short-term memory (LSTM) [10] for temporal modeling, resulting in the temporal visual feature x_tem ∈ R^{n×d}:\nx_tem = LSTM(x_f)\nOther parameter-efficient models may also be suitable, as discussed in the supplementary material.\n3.3. 
Group-Guided Text Encoding # Previous methods mainly rely on the generalization capabilities of pre-trained models without task-specific guidance, often leading to categorization confusion. We introduce a group-guided text encoding mechanism to address this.\nFigure 3. Overview of Our Anomize Framework. (a) Process for obtaining label features via the Group-Guided Text Encoding mechanism. (b) Creation of the concept library ConceptLib for anomaly detection. (c) The framework processes anomaly labels and video frames to generate frame-level anomaly scores and detected labels. Scoring is performed using a Text-Augmented Dual Stream mechanism, where each stream receives corresponding text and visual features, and the fused scores are produced as output. For labeling, the model aligns label features from the Group-Guided Text Encoding mechanism with the fused original and temporal visual encodings. Both the text and image encoders, pre-trained as part of CLIP, remain frozen without further optimization.\nWe leverage a large language model (LLM), specifically GPT-4 [28], for textual encoding. We first use the prompt prompt_group to group labels, ensuring that corresponding videos in each group exhibit high visual similarity. Then, we apply the prompt prompt_desc to generate text descriptions for each label based on the grouping. These descriptions capture shared elements while emphasizing unique characteristics within each group, ensuring the encodings remain similar yet distinguishable:\nwhere result_group and result_desc represent the label groups and their descriptions. The descriptions are then passed into the frozen CLIP text encoder Φ_text to obtain encodings t_desc ∈ R^{c×d}, where c is the number of anomaly labels:\nt_desc = Φ_text(result_desc)\nThese encodings are used for multimodal alignment. For the visual features, we combine the temporal and original encodings, preserving the knowledge captured by the pre-trained model:\nwhere α is a scalar weight. 
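The fusion and alignment steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the convex combination controlled by `alpha`, the cosine-similarity scoring, the `temperature` value, and all function names are hypothetical choices made only for this example.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def fuse(x_tem, x_f, alpha):
    # ASSUMPTION: convex combination of temporal and original frame features.
    return [
        [alpha * t + (1.0 - alpha) * o for t, o in zip(row_t, row_o)]
        for row_t, row_o in zip(x_tem, x_f)
    ]


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_probabilities(x_fused, t_desc, temperature=0.07):
    # ASSUMPTION: cosine similarity between each frame and each label
    # encoding, divided by a temperature, then a softmax over the c labels.
    labels = [l2_normalize(t) for t in t_desc]
    probs = []
    for frame in x_fused:
        f = l2_normalize(frame)
        sims = [sum(a * b for a, b in zip(f, t)) / temperature for t in labels]
        probs.append(softmax(sims))
    return probs


# Toy example: n = 2 frames, d = 2 feature dimensions, c = 2 labels.
x_f = [[1.0, 0.0], [0.0, 1.0]]
x_tem = [[0.8, 0.2], [0.1, 0.9]]
t_desc = [[1.0, 0.0], [0.0, 1.0]]
p_frame = frame_probabilities(fuse(x_tem, x_f, alpha=0.5), t_desc)
```

Each row of `p_frame` is a distribution over the c labels for one frame. In practice the features would come from the frozen CLIP encoders and the LSTM rather than from hand-written lists.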
The prediction probabilities for video frames are expressed as:\nwhere p_frame ∈ R^{n×c} represents the probability distribution over c anomaly labels for each frame. To obtain the video-level prediction, we select the top M probabilities for each label and average these top values, where M is the total number of frames divided by 16:\nwhere p_avg ∈ R^c contains the average probability for each label after applying the softmax function σ(·). Finally, the video-level prediction p_video is determined as the label with the highest average probability:\np_video = argmax_j p_avg,j\n3.4. Augmenter # In our method, both streams utilize text to enhance visual encodings via a unified augmenter. The augmenter takes visual encodings e_visual and textual encodings e_text as input. These are processed by a multi-head attention layer MHA(·), where e_visual acts as the query and e_text acts as the key and value. This operation extracts the most relevant textual features for supplementation, denoted as e_refine:\ne_refine = MHA(e_visual, e_text, e_text)\nA fully connected layer FC(·) linearly projects the visual encoding, which is concatenated with the refined textual features and passed through a multi-layer perceptron MLP(·) for dimensionality reduction. This results in the augmented output e_aug:\ne_aug = MLP(Concat(FC(e_visual), e_refine))\n3.5. Text-Augmented Dual Stream # In open settings, models may struggle to assess unfamiliar anomalies due to limited information. We propose a Text-Augmented Dual Stream mechanism with complementary dynamic and static streams, each augmented by relevant text to provide sufficient support for detection.\nSince video anomaly detection (VAD) relies on temporal cues, we employ a dynamic stream to predict anomaly scores s_dyn ∈ R^{n×1} based on refined visual features f_aug ∈ R^{n×d}, derived from temporal visual features and augmented by label descriptions via the augmenter in Sec. 
3.4:\nwhere Sigmoid(·) converts predictions to [0, 1].\nSince the dynamic stream is limited in scene context, we employ a static stream with original visual features augmented by anomaly-relevant concept data.\nSpecifically, we create a concept library ConceptLib containing key features related to anomalies. These features are generated by Φ_text from various nouns describing significant characteristics of the anomalies:\nwhere prompt_conc is a prompt for relevant nouns. We then compute the cosine similarity between the visual feature of frame i, x_f^(i), and concept features h ∈ ConceptLib:\nThe top K relevant concept features h_f^(i) ∈ R^{K×d} and their scores s_f^(i) ∈ R^K are selected:\nThe selected features are then weighted by their scores and concatenated to form refined textual features for the video:\nNext, the refined features h_f^new ∈ R^{n×K×d} and the original visual features x_f are passed through the augmenter to generate the augmented encoding x_aug ∈ R^{n×d}:\nSimilar to the dynamic stream, x_aug is fed into a detector for anomaly score prediction s_sta in the static stream:\nFinally, the outputs of the two streams are aggregated for the overall anomaly score prediction s:\nwhere β is a tunable parameter balancing the contributions of the dynamic and static streams.\n3.6. Objective Functions # First Training Stage. In the first stage, we focus on video anomaly categorization to train the LSTM while freezing other modules to prevent optimization conflicts. We use cross-entropy loss L_ce for categorization. To prevent overfitting to normal data due to class imbalance, we add a separation loss L_sep to enhance the distinction between normal and anomalous predictions. The loss for the first stage is:\nwhere p_avg,i denotes the predicted probabilities for the i-th video, and the normal label is at the first index. g_i is the one-hot ground truth, and N denotes the batch size.\nSecond Training Stage. 
In the second stage, we focus on anomaly detection to train the static and dynamic streams while freezing other modules. Following Wu et al. [43], we apply the MIL loss. Specifically, we first average the top M frame anomaly scores to obtain the video-level prediction q̂_i, then compute the loss L_{X-MIL} for each stream using binary cross-entropy to quantify the difference between predictions and binary labels q_i, where q_i = 1 denotes an anomaly and X ∈ {D, S} denotes the type of stream. Additionally, we apply a weight w_i in this phase to tackle data imbalance by increasing the penalty for incorrect scores related to anomalous videos. The loss is defined as follows:\nTable 1. Detection Metrics (%) Comparisons for XD-VIOLENCE (AP columns) and UCF-CRIME (AUC columns). The best results are highlighted in bold, our method is shaded in gray, the symbol ∗ indicates different category divisions, and underlined values represent the second-best results.\n| Method | AP | AP_b | AP_n | AUC | AUC_b | AUC_n |\n| --- | --- | --- | --- | --- | --- | --- |\n| Zhu et al.∗ [54] | 64.4 | - | - | 78.82 | - | - |\n| Sultani et al. [33] | 52.26 | 51.25 | 54.64 | 78.25 | 86.31 | 80.12 |\n| Wu et al. [41] | 55.43 | 52.94 | 64.10 | 82.24 | 90.62 | 84.13 |\n| RTFM [34] | 58.99 | 55.72 | 65.97 | 84.47 | 92.54 | 85.87 |\n| Wu et al. [42] | 66.53 | 57.10 | 76.03 | 86.4 | 93.80 | 88.20 |\n| Ours | 69.31 | 57.37 | 84.24 | 84.49 | 93.00 | 87.05 |\n4. Experimental Results # 4.1. Datasets and Implementation Details # Datasets. We evaluate the performance of Anomize on two widely used benchmark datasets. XD-VIOLENCE [41] is the largest dataset focused on violent events in videos, containing 3,954 training videos and 800 testing videos. The videos are collected from movies and YouTube, capturing six types of anomalous events across diverse scenarios. UCF-CRIME [33] is a large-scale dataset with 1,610 untrimmed surveillance videos for training and 290 for testing, totaling 128 hours. This dataset includes 13 types of anomalous events, spanning both indoor and outdoor settings and providing broad coverage of real-world scenarios.\nEvaluation Metrics. 
For anomaly detection, we employ standard metrics from previous works [33, 41]. Specifically, for UCF-CRIME, we compute AUC, which captures the trade-off between true positive and false positive rates. For XD-VIOLENCE, we report AP, reflecting the balance between precision and recall. For anomaly categorization, we report Top-1 accuracy on anomalous test videos from both datasets. These metrics are provided for all categories combined, as well as separately for base and novel categories, denoted by the subscripts b and n, respectively.\nImplementation Details. We implement our model in PyTorch and train it on an RTX 4090 with a 256-frame limit. Using the AdamW optimizer [17] with a learning rate of 2 × 10^{-5} and a batch size of 32, we train for 16 and 64 epochs in the two phases. We use the pre-trained CLIP (ViT-B/16) model. The MLP module contains 2 fully connected layers with GeLU activation. The fusion weight α is 1 during training and 2 during testing. K is 25 for XD-VIOLENCE and 5 for UCF-CRIME. Score weight β is 1 on XD-VIOLENCE (0 for base categories) and 0.5 on UCF-CRIME (0 for novel categories). Loss weight w_i follows the normal-to-anomaly ratio per iteration.\nTable 2. Top-1 Accuracy (%) Comparisons on XD-VIOLENCE (first three columns) and UCF-CRIME (last three columns).\n| Method | ACC | ACC_b | ACC_n | ACC | ACC_b | ACC_n |\n| --- | --- | --- | --- | --- | --- | --- |\n| Wu et al. [42] | 64.68 | 89.31 | 30.9 | 41.43 | 49.02 | 37.08 |\n| Ours | 90.29 | 92.37 | 87.43 | 47.14 | 56.86 | 41.57 |\n4.2. Comparison with State-of-the-Art Methods # In Tab. 1, we compare the performance of our method with prior VAD methods, ensuring that all methods use the same visual features from CLIP and adopt an open-set setting. On XD-VIOLENCE, we achieve the best performance, with an increase of 2.78% overall and 8.21% on novel cases, demonstrating the effectiveness of our method in reducing detection ambiguity. On UCF-CRIME, our method is competitive but not the best, likely because the lightweight temporal encoder and segmented training limit optimization of the detection branch. 
Since other studies focus solely on traditional VAD without categorization, we compare our method\u0026rsquo;s categorization performance with a leading study [42] in Tab. 2. Our method shows significant improvements, with a 25.61% gain on XD-VIOLENCE and 5.71% on UCF-CRIME, as well as further improvements of 56.53% and 4.49% on novel cases, highlighting its effectiveness in reducing categorization confusion.\n4.3. Ablation Studies # Effectiveness of Lightweight Temporal Encoder. The experiments confirm the importance of temporal information in video-level tasks. Tab. 3 shows that the dynamic stream, with temporal encoding, provides useful temporal cues and complements the static stream effectively. While relying solely on temporal information performs poorly due to noise, the dynamic stream becomes more effective with loss weighting and text augmentation. Tab. 4 shows that adding the temporal encoder improves performance on base cases, but without further guidance, the lightweight encoder still introduces confusion for novel anomalies.\nEffectiveness of Group-Guided Text Encoding Mechanism. Comparisons in Tab. 4 between the second and third rows, or the fourth and fifth rows, on both datasets show that textual encodings based on group descriptions outperform the baseline, demonstrating the importance of our text encoding mechanism.\nEffectiveness of Text Augmentation. As shown in Tab. 3, text augmentation in both dynamic and static streams generally reduces detection ambiguity by compensating for the limitations of visual features. The dynamic stream with only text augmentation shows a slight drop on UCF-CRIME, suggesting noise in the temporal encodings. However, when combined with the loss weight, it demonstrates the importance of text, as shown in Rows 6 and 7.\nEffectiveness of Integrating Dynamic and Static Streams. Tab. 3 shows that integrating the two streams is generally more effective than using them independently, as they complement and constrain each other, which is evident from the comparisons in the third, sixth, and last rows.\n
Table 3. Effectiveness of Dynamic M_dyn and Static M_sta Streams with Text-Augmented Visual Data, Additional Loss Weight w_i, and Segmented Training (ST) on XD-VIOLENCE (AP columns) and UCF-CRIME (AUC columns). In M_sta, the visual data is the original visual feature output by CLIP, while in M_dyn, it is derived from a lightweight temporal encoder.\n| M_sta visual | M_sta text | M_sta w_i | M_dyn visual | M_dyn text | M_dyn w_i | ST | AP (%) | AP_b (%) | AP_n (%) | AUC (%) | AUC_b (%) | AUC_n (%) |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| √ | × | × | × | × | × | √ | 54.75 | 46.30 | 76.61 | 48.47 | 47.72 | 50.05 |\n| √ | √ | × | × | × | × | √ | 59.65 | 56.41 | 75.29 | 52.21 | 59.18 | 48.45 |\n| √ | √ | √ | × | × | × | √ | 59.26 | 57.20 | 74.28 | 84.26 | 92.06 | 86.86 |\n| × | × | × | √ | × | × | √ | 47.19 | 36.87 | 69.79 | 26.93 | 17.39 | 25.73 |\n| × | × | × | √ | √ | × | √ | 54.98 | 57.79 | 69.40 | 23.06 | 14.68 | 19.74 |\n| × | × | × | √ | √ | √ | √ | 54.16 | 56.03 | 68.19 | 83.21 | 91.94 | 85.43 |\n| × | × | × | √ | × | √ | √ | 48.55 | 51.15 | 59.93 | 80.97 | 90.71 | 83.63 |\n| √ | √ | × | √ | √ | × | √ | 64.92 | 52.93 | 79.91 | 52.37 | 59.44 | 48.61 |\n| √ | √ | √ | √ | √ | √ | × | 58.22 | 58.58 | 72.31 | 84.52 | 92.62 | 86.52 |\n| √ | √ | √ | √ | √ | √ | √ | 69.31 | 57.37 | 84.24 | 84.49 | 93.00 | 87.05 |\nTable 4. Effectiveness of the Lightweight Temporal Encoder E_tem, Group-Guided Text Encoding Mechanism M_group, Fusion Function F_fus, Additional Separation Loss L_sep, and Segmented Training (ST) on XD-VIOLENCE (first three accuracy columns) and UCF-CRIME (last three accuracy columns).\n| E_tem | M_group | F_fus | L_sep | ST | ACC (%) | ACC_b (%) | ACC_n (%) | ACC (%) | ACC_b (%) | ACC_n (%) |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| × | × | × | × | √ | 45.92 | 74.81 | 6.28 | 35.71 | 56.86 | 23.60 |\n| √ | × | × | × | √ | 53.42 | 92.37 | 0 | 25.71 | 70.59 | 0 |\n| √ | √ | × | × | √ | 56.07 | 92.75 | 5.76 | 27.86 | 68.63 | 4.49 |\n| √ | × | √ | × | √ | 56.51 | 94.66 | 4.19 | 27.14 | 74.51 | 0 |\n| √ | √ | √ | × | √ | 90.07 | 91.98 | 87.43 | 46.43 | 56.86 | 40.45 |\n| √ | √ | √ | √ | × | 89.95 | 91.98 | 86.91 | 46.43 | 52.94 | 42.70 |\n| √ | √ | √ | √ | √ | 90.29 | 92.37 | 87.43 | 47.14 | 56.86 | 41.57 |\nEffectiveness of Additional Loss Design. Tab. 3 emphasizes the importance of loss weight w_i, especially when training dynamic and static streams together. 
On XD-VIOLENCE, adding w_i slightly degrades single-stream training but significantly benefits integrated streams. In addition, Tab. 4 shows that adding separation loss L_sep improves performance, effectively addressing class imbalance.\nEffectiveness of Segmented Training. Results in Tab. 3 and Tab. 4 show that segmented training generally outperforms single-phase training, especially for novel cases, confirming that single-phase training may cause optimization conflicts and increase overfitting risks. On UCF-CRIME, single-phase training shows slightly better categorization performance on novel categories, likely due to randomness caused by optimization conflicts, as reflected in the poor performance on base categories.\n4.4. Qualitative Results # Fig. 4 presents qualitative results for anomaly detection, featuring two base and two novel categories from each dataset to cover all label groups. The comparison between the blue predicted score lines and the pink rectangles for ground truth demonstrates the effectiveness of our model in detecting anomalies. Notably, our method demonstrates its capability to handle novel cases with minimal detection ambiguity, highlighting the strong support of the text-augmented dual stream.\nTable 5. Cross-Dataset Detection and Categorization Results. Rows give the training dataset and columns give the test dataset.\n| Train | XD-VIOLENCE AP (%) | XD-VIOLENCE ACC (%) | UCF-CRIME AUC (%) | UCF-CRIME ACC (%) |\n| --- | --- | --- | --- | --- |\n| XD-VIOLENCE | 69.31 | 90.29 | 81.69 | 45.71 |\n| UCF-CRIME | 66.16 | 85.87 | 84.68 | 47.14 |\nFig. 5 shows similarity matrices of textual encodings, where labels indicated by the same dot are grouped together. Fig. 5(a) and (c) show results from encoding text using only label text, where visually similar anomalies lack corresponding textual similarity, revealing the limitations of relying solely on pre-trained models. Fig. 5(b) and (d) present results after being guided, where textual similarities within groups are enhanced to better align with visual similarities. 
This demonstrates the benefit of establishing label connections to guide encoding.\n4.5. Adaptability and Generalization # Analysis of Cross-Dataset Ability. Tab. 5 shows that training on one dataset in an open-set manner and testing on another achieves results comparable to direct training on the target dataset, highlighting the adaptability of our method.\nFigure 4. Qualitative Results for Anomaly Detection. The first and second rows present results on XD-VIOLENCE and UCF-CRIME respectively. Red boxes and rectangles highlight the ground-truth anomalous frames, while blue lines represent predicted anomaly scores.\nFigure 5. Similarity Matrices of Textual Encoding. (a) and (c) depict results using encodings from the original label data, while (b) and (d) show improvements achieved with the group-guided text encoding mechanism.\nTable 6. Impact of Data Addition on Top-1 Accuracy for XD-VIOLENCE and UCF-CRIME. \u0026ldquo;+ shoplifting\u0026rdquo; denotes the addition of both videos and labels, while \u0026ldquo;+ labels\u0026rdquo; indicates adding new labels only.\n| Dataset | Setting | ACC (%) | ACC_b (%) | ACC_n (%) |\n| --- | --- | --- | --- | --- |\n| XD-VIOLENCE | orig. | 90.29 | 92.37 | 87.43 |\n| XD-VIOLENCE | + shoplifting | 90.51 | 92.37 | 88.21 |\n| XD-VIOLENCE | + arrest | 89.96 | 92.37 | 86.73 |\n| XD-VIOLENCE | + arson | 88.77 | 92.37 | 84.08 |\n| XD-VIOLENCE | + assault | 87.28 | 87.4 | 87.11 |\n| XD-VIOLENCE | + labels | 85.21 | 86.64 | 83.25 |\n| UCF-CRIME | orig. | 47.14 | 56.86 | 41.57 |\n| UCF-CRIME | + riot | 65.69 | 56.86 | 68.09 |\n| UCF-CRIME | + labels | 45 | 54.9 | 39.33 |\nAnalysis of Open Vocabulary Ability. The open vocabulary ability is demonstrated by stable categorization performance after adding novel anomaly data, which we further validate by evaluating top-1 accuracy with the added data, as shown in Tab. 6. The additional data for one dataset is sourced from another. On XD-VIOLENCE, incorporating both same-group (e.g., arson) and different-group data (e.g., shoplifting) leads to stable performance, though assault shows the clearest decline due to confusion with similar labels like fighting. 
On UCF-CRIME, incorporating riot leads to notable improvement, with most instances accurately categorized despite being grouped with four other labels. Although the addition of new labels introduces some confusion, the performance remains robust across both datasets.\n5. Conclusion # We propose the Anomize framework to address detection ambiguity and categorization confusion in open vocabulary video anomaly detection (OVVAD). By augmenting visual encodings with anomaly-related text through a dualstream mechanism, Anomize improves the detection of unfamiliar samples and resolves ambiguity. Additionally, a group-guided text encoding mechanism enhances multimodal alignment and reduces categorization confusion. Experiments on XD-VIOLENCE and UCF-CRIME demonstrate the effectiveness of our method.\nAcknowledgments. This work was supported by the National Natural Science Foundation of China (Grants 62171325, 62271361), the Hubei Provincial Key Research and Development Program (Grant 2024BAB039), and the supercomputing system at Wuhan University.\nReferences # [1] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 20111–20121, 2022. 1 , 3\n[2] Dibyadip Chatterjee, Fadime Sener, Shugao Ma, and Angela Yao. Opening the vocabulary of egocentric actions. In Adv. Neural Inf. Process. Syst., 2023. 2\n[3] Yang Cong, Junsong Yuan, and Ji Liu. Sparse reconstruction cost for abnormal event detection. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 3449–3456, 2011. 1 , 2\n[4] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 14064–14073, 2022. 
2\n[5] Xinyang Feng, Dongjin Song, Yuncong Chen, Zhengzhang Chen, Jingchao Ni, and Haifeng Chen. Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection. In Proc. ACM Int. Conf. Multimedia, pages 5546–5554, 2021. 3\n[6] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In Proc. Int. Conf. Learn. Represent. , 2022. 3\n[7] Kunyang Han, Yong Liu, Jun Hao Liew, Henghui Ding, Jiajun Liu, Yitong Wang, Yansong Tang, Yujiu Yang, Jiashi Feng, Yao Zhao, and Yunchao Wei. Global knowledge calibration for fast open-vocabulary segmentation. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 797–807, 2023. 3\n[8] Yi Hao, Jie Li, Nannan Wang, Xiaoyu Wang, and Xinbo Gao. Spatiotemporal consistency-enhanced network for video anomaly detection. Pattern Recognit., 121:108232, 2022. 3\n[9] Or Hirschorn and Shai Avidan. Normalizing flows for human pose anomaly detection. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 13499–13508, 2023. 1 , 3\n[10] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term ¨ ¨ memory. Neural Comput., 9(8):1735–1780, 1997. 3\n[11] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proc. Int. Conf. Mach. Learn., pages 4904–4916, 2021. 3\n[12] Chengyou Jia, Minnan Luo, Xiaojun Chang, Zhuohang Dang, Mingfei Han, Mengmeng Wang, Guang Dai, Sizhe Dang, and Jingdong Wang. Generating action-conditioned prompts for open-vocabulary video action recognition. In Proc. ACM Int. Conf. Multimedia, pages 4640–4649, 2024. 3\n[13] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. CLIP-TSA: clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In Proc. IEEE Int. Conf. Image Process., pages 3230–3234, 2023. 
1 , 3\n[14] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In Proc. Eur. Conf. Comput. Vis., pages 105– 124, 2022. 3\n[15] Dahun Kim, Anelia Angelova, and Weicheng Kuo. Regionaware pretraining for open-vocabulary object detection with vision transformers. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 11144–11154, 2023. 3\n[16] Jooyeon Kim, Eulrang Cho, Sehyung Kim, and Hyunwoo J. Kim. Retrieval-augmented open-vocabulary object detection. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 17427–17436, 2024. 2\n[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. Int. Conf. Learn. Represent., 2015. 6\n[18] Sangmin Lee, Hak Gu Kim, and Yong Man Ro. BMAN: bidirectional multi-scale aggregation networks for abnormal event detection. IEEE Trans. Image Process., 29:2395–2408, 2020. 3\n[19] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multisequence learning with transformer for weakly supervised video anomaly detection. In Proc. AAAI Conf. Artif. Intell. , pages 1395–1403, 2022. 3\n[20] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 7061–7070, 2023. 3\n[21] Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, and Wei-Shi Zheng. Rethinking clip-based video learners in cross-domain openvocabulary action recognition. arXiv:2403.01560, 2024. 3\n[22] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection - A new baseline. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 6536–6545, 2018. 1 , 3\n[23] Wenxuan Liu, Shilei Zhao, Xiyu Han, Aoyu Yi, Kui Jiang, Zheng Wang, and Xian Zhong. 
Pixel-refocused navigated trimargin for semi-supervised action detection. In Proc. ACM Int. Conf. Multimedia Workshop, pages 23–31, 2024. 2\n[24] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 13568–13577, 2021. 3\n[25] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 FPS in MATLAB. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 2720–2727, 2013. 2\n[26] Yiwei Lu, K. Mahesh Kumar, Seyed Shahabeddin Nabavi, and Yang Wang. Future frame prediction using convolutional VRNN for anomaly detection. In Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill., pages 1–8, 2019. 3\n[27] Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, and Hanwang Zhang. Unbiased multiple instance learning for weakly supervised video anomaly detection. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 8022–8031, 2023. 3\n[28] OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023. 4\n[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proc. Int. Conf. Mach. Learn., pages 8748–8763, 2021. 2 , 3\n[30] Nicolae-Catalin Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, and Mubarak Shah. Self-distilled masked auto-encoders are efficient video anomaly detectors. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 15984–15995, 2024. 2\n[31] Mohammad Sabokrou, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli. Adversarially learned one-class classifier for novelty detection. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 3379–3388, 2018. 2\n[32] Bernhard Scholkopf, Robert C. Williamson, Alexander J. 
¨ ¨ Smola, John Shawe-Taylor, and John C. Platt. Support vector method for novelty detection. In Adv. Neural Inf. Process. Syst., pages 582–588, 1999. 2\n[33] Waqas Sultani, Chen Chen, and Mubarak Shah. Realworld anomaly detection in surveillance videos. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 6479–6488, 2018. 1 , 2 , 3 , 6\n[34] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W. Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proc. IEEE/CVF Int. Conf. Comput. Vis. , pages 4955–4966, 2021. 3 , 6\n[35] Jue Wang and Anoop Cherian. GODS: generalized one-class discriminative subspaces for anomaly detection. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 8200–8210, 2019. 2\n[36] Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition. arXiv:2109.08472, 2021. 3\n[37] Syed Talal Wasim, Muzammal Naseer, Salman H. Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Videogrounding-dino: Towards open-vocabulary spatiotemporal video grounding. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 18909–18918, 2024. 3\n[38] Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, and YuGang Jiang. Open-vclip: Transforming CLIP to an openvocabulary video model via interpolated weight optimization. In Proc. Int. Conf. Mach. Learn., pages 36978–36989, 2023. 3\n[39] Peng Wu and Jing Liu. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Trans. Image Process., 30:3513–3527, 2021. 3\n[40] Peng Wu, Jing Liu, and Fang Shen. A deep one-class neural network for anomalous event detection in complex scenes. IEEE Trans. Neural Networks Learn. Syst., 31(7): 2609–2622, 2020. 2\n[41] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also\nlisten: Learning multimodal violence detection under weak supervision. In Proc. Eur. Conf. Comput. 
Vis., pages 322– 339, 2020. 2 , 6 [42] Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open-vocabulary video anomaly detection. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 18297–18307, 2024. 1 , 3 , 6 [43] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In Proc. AAAI Conf. Artif. Intell. , pages 6074–6082, 2024. 3 , 5 [44] Tao Wu, Shuqiu Ge, Jie Qin, Gangshan Wu, and Limin Wang. Open-vocabulary spatio-temporal action detection. arXiv:2405.10832, 2024. 2 [45] Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, and Yu-Gang Jiang. Building an openvocabulary video CLIP model with better architectures, optimization and data. IEEE Trans. Pattern Anal. Mach. Intell. , 46(7):4747–4762, 2024. 2 [46] Haiyang Xie, Zhengwei Yang, Huilin Zhu, and Zheng Wang. Striking a balance: Unsupervised cross-domain crowd counting via knowledge diffusion. In Proceedings of the 31st ACM international conference on multimedia, pages 6520–6529, 2023. 2 [47] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. SAN: side adapter network for open-vocabulary semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 45(12):15546–15561, 2023. 3 [48] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. Video event restoration based on keyframes for video anomaly detection. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 14592–14601, 2023. 2 [49] Muhammad Zaigham Zaheer, Jin-Ha Lee, Marcella Astrid, and Seung-Ik Lee. Old is gold: Redefining the adversarially learned one-class classifier training paradigm. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 14171–14181, 2020. 1 , 2 [50] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. 
CLAWS: clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In Proc. Eur. Conf. Comput. Vis. , pages 358–376, 2020. 1 , 3 [51] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and ShihFu Chang. Open-vocabulary object detection using captions. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. , pages 14393–14402, 2021. 3 [52] Yuanhong Zhong, Xia Chen, Jinyang Jiang, and Fan Ren. A cascade reconstruction model with generalization ability evaluation for anomaly detection in videos. Pattern Recognit., 122:108336, 2022. 2 [53] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. Int. J. Comput. Vis., 130(9):2337–2348, 2022. 2 [54] Yuansheng Zhu, Wentao Bao, and Qi Yu. Towards open set video anomaly detection. In Proc. Eur. Conf. Comput. Vis. , pages 395–412, 2022. 3 , 6 ","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/li_anomize_better_open_vocabulary_video_anomaly_detection_cvpr_2025_paper/","section":"Papers","summary":"The paper introduces the Anomize framework that addresses detection ambiguity and categorization confusion in open vocabulary video anomaly detection (OVVAD) by leveraging visual and textual data augmentation, dual-stream mechanisms, and label relation guidance, achieving superior performance on multiple datasets.","title":"Anomize: Better Open Vocabulary Video Anomaly Detection","type":"method"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/anwaar-ulhaq/","section":"Authors","summary":"","title":"ANWAAR ULHAQ","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/type/application/","section":"Type","summary":"","title":"Application","type":"type"},{"content":"","date":"1 October 
2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/armin-danesh-pazho/","section":"Authors","summary":"","title":"Armin Danesh Pazho","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/ashish-bastola/","section":"Authors","summary":"","title":"Ashish Bastola","type":"authors"},{"content":" AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis # Zhiwei Yang 1, Chen Gao 2, Jing Liu 1†, Peng Wu 3, Guansong Pang 4, Mike Zheng Shou 2†\n1 Xidian University 2 Show Lab, National University of Singapore 3 Northwestern Polytechnical University 4 Singapore Management University\n†Corresponding authors\nAbstract # The rapid advancements in large language models (LLMs) have spurred growing interest in LLM-based video anomaly detection (VAD). However, existing approaches predominantly focus on video-level anomaly question answering or offline detection, ignoring the real-time nature essential for practical VAD applications. To bridge this gap and facilitate the practical deployment of LLM-based VAD, we introduce AssistPDA, the first online video anomaly surveillance assistant that unifies video anomaly prediction, detection, and analysis (VAPDA) within a single framework. AssistPDA enables real-time inference on streaming videos while supporting interactive user engagement. Notably, we introduce a novel event-level anomaly prediction task, enabling proactive anomaly forecasting before anomalies fully unfold. To enhance the ability to model intricate spatiotemporal relationships in anomaly events, we propose a Spatio-Temporal Relation Distillation (STRD) module. STRD transfers the long-term spatiotemporal modeling capabilities of vision-language models (VLMs) from offline settings to real-time scenarios. Thus it equips AssistPDA with a robust understanding of complex temporal dependencies and long-sequence memory. 
Additionally, we construct VAPDA-127K, the first large-scale benchmark designed for VLM-based online VAPDA. Extensive experiments demonstrate that AssistPDA outperforms existing offline VLM-based approaches, setting a new state-of-the-art for real-time VAPDA. Our dataset and code will be open-sourced to facilitate further research in the community.\n1. Introduction # Video anomaly detection (VAD) [1, 22, 32, 40] aims to automatically identify anomalous events in video. Traditional VAD methods mainly focus on score-based detection, i.e., assigning anomaly scores to frames, clips, or entire videos to indicate the degree of abnormality. However, these methods lack semantic interpretability, making them insufficient for handling complex and diverse anomalous events.\nThe emergence of large language models (LLMs) [5, 13, 26, 43] has inspired LLM-based VAD approaches. For instance, Du et al. proposed an anomaly causal understanding framework [8]; Zhang et al. introduced a multimodal LLM-based unbiased and interpretable VAD framework [41]; and Tang et al. developed an open-world anomaly comprehension method using vision-language models (VLMs) [23]. These works demonstrate the potential of LLMs in VAD, showcasing promising applications of VLMs in the field. However, a major limitation of these methods is that they operate in an offline setting, which fundamentally diverges from the real-world requirement for online VAD in practical surveillance scenarios. As of now, research on leveraging VLMs for online VAD remains unexplored.\nTo advance the practical application of VLMs in VAD, our primary goal is to develop an online video anomaly surveillance assistant. Specifically, as illustrated in Fig. 
1, we identify three core capabilities: (1) Video Anomaly Prediction (VAP): In real-world surveillance, anomalies should not only be detected post-occurrence but also anticipated as early as possible to minimize potential damage. (2) Video Anomaly Detection (VAD): The system must robustly detect sudden, unpredictable anomalies such as explosions or sudden attacks, ensuring timely alerts. (3) Video Anomaly Analysis (VAA): Given the diversity of real-world anomalies, users may require real-time assistance to analyze and respond appropriately to incidents. The surveillance assistant should facilitate real-time question answering and event analysis, aiding users in handling anomalies effectively.\nTo realize these goals, we propose AssistPDA, the first online video surveillance assistant for Video Anomaly Prediction, Detection, and Analysis (VAPDA). It unifies anomaly prediction, detection, and interactive analysis within a single system, supporting real-time streaming inference and interaction. AssistPDA operates in three primary modes: proactive anomaly prediction, real-time anomaly detection, and interactive analysis. In prediction and detection modes, the system autonomously alerts users to critical anomalies. In interactive mode, it responds to user queries in real time.\nFigure 1. Illustration of the proposed Video Anomaly Prediction, Detection, and Analysis (VAPDA) tasks.\nDeveloping AssistPDA presents two key challenges. (1) Constructing training data for online VAPDA. Existing LLM-based VAD methods have released video anomaly question-answering datasets. However, these datasets are constrained to clip-level Q\u0026amp;A, making them unsuitable for training a real-time video streaming-based model. To bridge this gap, we construct VAPDA-127K, the first large-scale benchmark dataset for online VAPDA. 
Built upon the UCF-Crime [22] and XD-Violence [31] video anomaly datasets, our dataset consists of 2,415 videos across 15 anomaly categories and 127K timestamped anomaly predictions, detections, and Q\u0026amp;A pairs in natural language form. (2) Enabling temporal awareness in frame-by-frame streaming inference. AssistPDA leverages Qwen2-VL [27] as the backbone, which inherently supports offline video/image inference. However, transitioning to frame-by-frame online inference introduces a critical challenge, since capturing long-range temporal dependencies is crucial for detecting complex and varied anomaly events.\nTo address this, we propose a SpatioTemporal Relation Distillation (STRD) module, inspired by recent advances in vision-language modeling. Many existing VLMs [24, 27], trained on large-scale video datasets, exhibit strong offline temporal reasoning capabilities. We aim to distill this spatiotemporal reasoning knowledge from a pre-trained offline VLM vision encoder into a lightweight module, integrating it within the online vision encoder-LLM pipeline. This enables AssistPDA to maintain robust long-term spatiotemporal understanding despite operating in a streaming frame-by-frame inference setting. Through extensive experiments, we demonstrate that AssistPDA significantly outperforms existing VLMs in VAPDA, marking a major step towards intelligent real-time video anomaly surveillance systems. To summarize, our major contributions are as follows:\nWe propose, for the first time, a unified framework that integrates video anomaly prediction, detection, and analysis in an online setting. Moreover, we propose event-level video anomaly prediction as a new task.\nWe devise AssistPDA, an assistant for online video anomaly surveillance, incorporating a novel STRD module to transfer offline VLM spatiotemporal reasoning capabilities to streaming inference, significantly enhancing long-term spatiotemporal understanding.\n
We construct VAPDA-127K, the first large-scale benchmark dataset for online VAPDA, containing 127K timestamped anomaly predictions, detections, and Q\u0026amp;A pairs in natural language form, providing a valuable resource for future VLM-based video anomaly research.\n2. Related Works # 2.1. Video Anomaly Detection # The VAD problem has been studied for many years [9, 11]. Early methods primarily relied on handcrafted features, such as those proposed in [6, 15, 19]. With the rapid advancements in deep learning, deep learning-based methods have become the dominant paradigm. These methods can be broadly categorized into three types: unsupervised, semi-supervised, and weakly supervised VAD.\nUnsupervised VAD methods [20] typically leverage clustering techniques or pseudo-label generation with self-training to directly mine anomaly-related information from mixed normal and abnormal data. Semi-supervised VAD approaches [14, 16, 35, 36] are mainly based on either frame reconstruction or frame prediction, both of which employ a surrogate task to learn patterns from normal video data. During inference, deviations from normal patterns are considered anomalies. Weakly supervised VAD methods [22, 25, 31, 33, 34, 37] are trained on datasets with only video-level annotations and often utilize multiple-instance learning to infer segment-level anomaly scores.\nFigure 2. Pipeline of data construction for the proposed VAPDA-127K dataset.\n2.2. Multimodal Video Anomaly Detection # With the rapid progress of LLMs and their superior performance in visual understanding [10, 28, 29], multimodal VAD based on LLMs has gained increasing attention [8, 39]. For instance, Lv et al. proposed a video anomaly detection and explanation framework leveraging a VLM [17]. Zanella et al. introduced a training-free VAD method using LLMs [39]. Du et al. proposed a framework that utilizes VLMs for causal reasoning in VAD [8]. Tang et al. 
introduced HAWK, which leverages VLMs to understand open-world video anomalies [23]. Zhang et al. developed Holmes-VAD for unbiased and interpretable VAD [41].\nHowever, existing VAD methods based on LLMs or VLMs are limited to single-task anomaly detection or video anomaly question answering. These methods operate only in offline settings without predictive capabilities. Such limitations hinder their applicability in real-world surveillance scenarios. In contrast to previous methods, we propose the first VLM-based online video anomaly surveillance assistant, unifying video anomaly prediction, detection, and real-time question answering within a single framework. Furthermore, we construct a large-scale benchmark dataset, VAPDA-127K, tailored for the online VAPDA task.\n3. Method # In this section, we first define the tasks of video anomaly prediction, detection, and analysis in Sec. 3.1. Sec. 3.2 introduces the construction process of the VAPDA-127K dataset. Sec. 3.3 presents the detailed model architecture, while Sec. 3.4 describes the training and inference procedures.\n3.1. Task Definition # As mentioned, a video anomaly surveillance assistant should possess three key capabilities: Video Anomaly Prediction, Video Anomaly Detection, and Video Anomaly Analysis. We first define these tasks under the setting of VLM-based streaming video online inference.\nVideo Anomaly Prediction. In this work, we introduce the event-level VAP task for the first time. Although previous studies [2, 30] have explored frame-level anomaly score prediction, such as predicting whether anomalies will occur in future $T + n$ frames based on the previous $T$ frames, existing methods are typically limited to a very short prediction window (0–1 s in advance), which restricts their practical applicability in real-world scenarios. 
In contrast, event-level prediction aims to anticipate potential anomalous events before they fully unfold, leveraging historical video information to generate early warnings in natural language. The predicted output includes the event category and a descriptive explanation of the anticipated anomaly.\nWe formalize this process as follows: At time $t_0$, a user issues a query, e.g., \u0026ldquo;Please predict potential abnormal events in real time based on the received video stream.\u0026rdquo; The actual anomaly occurs between $t_n$ and $t_m$. Given the observed video stream between $t_0$ and $t_k$ ($k ≤ n$), if an anomaly is deemed likely, the model should automatically generate a natural language response at $t_k$, detailing the predicted event type and description.\nVideo Anomaly Detection. Certain types of abrupt anomalous events are inherently unpredictable in advance. Therefore, the capability of the model to perform real-time anomaly detection is crucial. We formalize this process as follows: At time $t_0$, the user issues a query, e.g., \u0026ldquo;Please detect any abnormal events in real time based on the received video stream.\u0026rdquo; The actual anomaly occurs between $t_n$ and $t_m$. During this period, the model provides anomaly detection responses at multiple key moments within $t_n$ to $t_m$, each response containing the detected anomaly type and a descriptive explanation of the event.\nVideo Anomaly Analysis. For the VAA task, we adopt a user-centric approach, where the model provides responses based on the user\u0026rsquo;s inquiries regarding ongoing anomalous events. Since real-world anomalies can be highly diverse, and user queries are completely open-ended, we formalize video anomaly analysis as an online video question answering task. Specifically, assume that an anomaly begins to occur at $t_n$. 
At a later time $t_{n+k}$, the user issues a query, such as \u0026ldquo;How should the ongoing anomaly be handled?\u0026rdquo; Upon receiving the query, the model should generate an immediate response at $t_{n+l}$ (with $l ≥ k$), addressing the user\u0026rsquo;s question in real time.\n3.2. Dataset Construction # In this section, we detail the construction process of the VAPDA-127K dataset, which is designed to support the three tasks above. Fig. 2 illustrates the construction process.\nData Collection. We first collect raw video data from the two largest weakly supervised VAD datasets, UCF-Crime [22] and XD-Violence [31]. After filtering out low-quality videos, we obtain a total of 2,415 untrimmed videos. These videos only contain video-level annotations, indicating whether an anomaly occurs within the video, but lack precise timestamps for when anomalies happen. However, to train an online VLM for our three defined tasks, timestamp-level anomaly annotations are necessary. Because HIVAU-70K [42] provides annotated event start and end timestamps for UCF-Crime and XD-Violence, we build our task-specific dataset on top of it.\nData Annotation for Anomaly Prediction. For the VAP task, we require frame-level information preceding the occurrence of an anomaly. To reduce the computational burden caused by redundant frames, we first sample the raw videos at 1 FPS and then use an existing VLM to generate a caption for each frame. Next, we segment each video at the start time of each anomalous event. We then feed all captions (with their corresponding caption IDs) from the video start time up to the onset of the anomaly into an LLM. Using specifically designed prompts, we instruct the LLM to determine the earliest frame where a potential future anomaly could have been predicted and to generate the anomaly type and a brief description of the predicted anomaly.\nData Annotation for Anomaly Detection. 
For the VAD task, we focus on video segments corresponding to the actual anomaly occurrence. Using HIVAU-70K [42], which provides event start and end timestamps and segment-level captions, we first extract data containing explicit start and end timestamps for anomalous events. While segment-level captions exist within the anomalous event period, not all captions within this period necessarily contain explicit anomaly-related information due to the complexity of real-world events. We sequentially feed captions with timestamps along with historical captions into the LLM. The LLM is instructed to determine whether each caption describes an ongoing anomalous event. Furthermore, by leveraging both the current and historical captions, the LLM generates a concise anomaly description that includes the anomaly type. Through this process, we obtain timestamped anomaly detection captions corresponding to key moments during the anomaly occurrence.\nData Annotation for Anomaly Analysis. For the VAA task, we construct open-ended question-answer pairs based on ongoing anomalous events. This differs from existing anomaly Q\u0026amp;A data, which consist of fixed-template Q\u0026amp;A pairs constructed from entire videos or clips. Building upon the anomaly detection annotations, we extract key detection captions at critical moments within the anomaly period and combine them with historical captions in chronological order. These are then fed into an LLM, which, based on the 5W (Who, What, When, Where, Why) and 2H (How, How much) principle, generates questions relevant to the ongoing anomalous event. The LLM also generates factually and logically consistent answers based on both the current and historical captions.\nFigure 3. Pipeline of our method. VE and STRD are short for Video Encoder and SpatioTemporal Relation Distillation, respectively.\nManual Review and Refinement. 
To mitigate the effects of LLM hallucinations, we iteratively refine the prompts to improve the generated responses. Finally, all LLM-generated data undergo manual review, where inappropriate responses are removed or modified. This review process involved five annotators, each spending an average of 10 hours, ensuring high-quality annotations for the dataset. Please refer to the supplementary material for more details on the construction of the dataset.\n3.3. Model Architecture # Fig. 3 presents an overview of our proposed AssistPDA, which consists of three key components: a vision encoder, a spatiotemporal relationship distillation module, and an LLM with a fine-tunable LoRA module. In the following sections, we provide details of each module.\n3.3.1. Vision Encoder # We adopt the frozen vision encoder $φ_v$ from Qwen2-VL [27], which is based on a Vision Transformer (ViT) [7]. Following existing work [3], we sample frames from the original video at 2 FPS. To support both image and video inputs, the Qwen2-VL vision encoder duplicates input images when operating in image mode. To reduce redundant computation, we directly take every two consecutive frames as input to extract visual tokens. Given an input video frame sequence $ν ∈ R^{T×H×W×C}$, the visual tokens obtained from the $(i−1)$-th and $i$-th frames are formulated as:\nwhere $v^j_{i−1,i}$ ($j ∈ {1, 2, …, N}$) represents the patch tokens, with $N$ denoting the total number of patches obtained from every two input frames. For clarity and conciseness, we use \u0026ldquo;each frame\u0026rdquo; to denote the actual two-frame input in the subsequent discussion.\nFigure 4. Illustration of the STRD module.\n3.3.2. SpatioTemporal Relationship Distillation # In online processing mode, video frames are input frame by frame, making the learning of spatiotemporal relationships and long-term memory a significant challenge. 
Existing approaches often incorporate memory units between the vision encoder and the LLM to store historical frame information, which is then retrieved to maintain temporal memory or extract key information. However, such methods substantially constrain inference speed.\nTo ensure that our online framework, AssistPDA, maintains high inference efficiency while also exhibiting strong spatiotemporal reasoning and long-term memory capabilities, we introduce an STRD module $ϕ$. To minimize additional computational overhead on the VLM backbone, we adopt a lightweight approach by employing a two-layer Multi-Head Self-Attention (MHSA) network as the STRD module. This module transfers the VLM\u0026rsquo;s offline-mode ability to model global spatiotemporal relationships into an online processing pipeline. We perform the distillation using Qwen2-VL [27] with the goal that the tokens obtained from the frame-by-frame video input remain as consistent as possible in feature space with those obtained from processing the entire video directly. The spatiotemporal relationship distillation process is illustrated in Fig. 4. First, the input video frame sequence $ν ∈ R^{T×H×W×C}$ is directly processed by the Qwen2-VL vision encoder, obtaining the global visual token representation:\nwhere $v^j_i$ ($j ∈ {1, 2, …, M}$) denotes the patch tokens and $M$ is the number of patches extracted from the input video.\nSince the vision encoder applies a 3D convolution with a stride of 2 before patch embedding, each $v^j_i$ still represents a patch token fused from two consecutive frames. However, unlike frame-by-frame input, the visual tokens here are obtained through a global attention operation, so each $v^j_i$ inherently contains information from all other frame patches and thus incorporates the full spatiotemporal context. 
The role of the STRD module is to ensure that tokens obtained from frame-by-frame input, after transformation, also encapsulate global contextual information. To achieve this, we first concatenate the tokens obtained from frame-by-frame input along the temporal dimension:\nWe then apply the distillation module to transform these tokens, which is formulated as:\nFinally, we enforce consistency between $⌢V_{images}$ and the global video tokens $⌢V_{video}$ in feature space using a mean squared error (MSE) loss function:\nAfter training the STRD module, we insert it between the vision encoder and the LLM during LoRA fine-tuning. In real-time inference, the MHSA layers in the STRD module are equipped with a KV cache, allowing frame-by-frame input tokens to retain historical spatiotemporal context. By adjusting the length of the KV cache, we can control the temporal span of frames considered by the STRD module. On our experimental setup with an A6000 GPU, the maximum temporal receptive field can reach up to 20 minutes.\n3.3.3. LLM # The LLM used in our framework is QwenLM from Qwen2-VL [27]. The visual tokens obtained from the STRD module are concatenated in temporal order with the text tokens derived from the user query and fed into the LLM, which decodes them to generate the response.\n3.4. Training and Inference # Our training process consists of two stages. The first stage involves pre-training the STRD module. As described in Sec. 3.3.2, we optimize this module using the MSE loss function. The second stage involves instruction fine-tuning of the model using the constructed VAPDA-127K dataset. The loss function consists of two components. The first component is autoregressive language modeling, which aims to maximize the joint probability $P_i^{[Txt_{i+1}]}$ of the input text sequence. The second component is video streaming input prediction modeling. 
For real-time anomaly prediction and detection tasks, AssistPDA must be able to respond automatically, determining when to generate a response and when to remain silent. Following [3], we introduce an additional streaming End-of-Sequence (EOS) token appended to each video frame token. The probability $P_i^{[EOS]}$ of predicting the EOS token is used to decide whether to continue receiving video frame inputs or to generate a response. Both components are optimized using the cross-entropy loss function, formulated as follows:\nwhere $l_i$ and $f_i$ are condition indicators: $l_i$ is 1 if the $i$-th token is a language response token, and 0 otherwise; $f_i$ is 1 if (1) the $i$-th token is the last token of a frame and (2) $l_{i+1} = 0$; and $w$ is a balance term. Essentially, the streaming EOS loss is applied to frames before responding. $P_i^{[Txt_{i+1}]}$ denotes the probability of the $(i+1)$-th text token output from the language model head at the $i$-th token, while $P_i^{[EOS]}$ represents the probability assigned to the EOS token.\nDuring the inference stage, AssistPDA executes different tasks based on user-specified queries. For VAP and VAD tasks, we introduce a threshold $γ$ to control the prediction of the EOS token. When the predicted probability of the EOS token falls below $γ$, the model generates a response, enabling AssistPDA to provide predictions or detection alerts at critical moments while remaining silent during normal periods. For anomaly analysis tasks, AssistPDA responds immediately after the user completes their query; no threshold is required. On an A6000 GPU, AssistPDA achieves an average inference speed of 15–20 FPS.\n4. Experiments # 4.1. Dataset and Evaluation Metrics # Dataset. Our VAPDA-127K is constructed based on the raw videos from the two largest-scale VAD datasets, UCF-Crime [22] and XD-Violence [31]. 
VAPDA-127K consists of 2,415 untrimmed videos, covering a total of 15 categories of anomalous events, including abuse, arson, car accidents, fighting, explosions, riots, stealing, and shooting. These events come from real-world scenarios and from selected footage of movies and live broadcasts. More details about the dataset are in the supplementary material.\nEvaluation Metrics. For training and ablation studies, we follow [3] and adopt three evaluation metrics to efficiently assess the overall performance of the model: Language Modeling Perplexity (LM-PPL), Time Difference (TimeDiff), and Fluency. LM-PPL is a commonly used perplexity measure to evaluate language modeling capability, where a lower LM-PPL indicates more accurate responses. TimeDiff measures the temporal alignment ability of the model by computing the difference between the predicted response timestamp and the expected timestamp. Fluency evaluates the proportion of continuously and successfully predicted tokens within a dialogue round. Since this also includes language tokens, the fluency metric comprehensively reflects the model\u0026rsquo;s language modeling ability in an online streaming pipeline. During inference, different evaluation metrics are used depending on the task. For textual responses in real-time inference across VAP, VAD, and VAA tasks, we follow [8] and employ MoverScore (MS) [44], Bleurt [21], and Unieval [45] to evaluate response quality by comparing the responses with ground-truth text annotations. For VAP and VAD tasks, we also use the weighted F1-score to measure the model\u0026rsquo;s accuracy in classifying predicted and detected anomaly types. In addition, we introduce the average advance time (AAT) metric to evaluate the model\u0026rsquo;s capability to predict anomalies in advance by comparing the response time with the actual anomaly onset time.\n4.2. 
Implementation Details # The vision encoder and LLM module of our framework are initialized from Qwen2-VL-2B-Instruct. During the pretraining distillation stage of the STRD module, we train for a total of 10 epochs using the AdamW optimizer with an initial learning rate of $1 × 10^{-4}$, employing a cosine annealing decay strategy. In the main framework training stage, we train all linear layers of the LLM using LoRA with $r = 32$ and $α = 64$ for 2 epochs. Additionally, we fine-tune the final output linear layer of the STRD module. The default loss weight $w$ is set to 1. The EOS token prediction thresholds for video anomaly prediction and detection are set to 0.96 and 0.7, respectively. Further execution details can be found in the supplementary materials.\n4.3. Main Results # Since AssistPDA is the first framework that leverages a VLM for online VAPDA, we compare it with two types of baselines: general-purpose VLMs and VLMs designed for VAD. Most existing VLMs only support offline processing. To facilitate a fair comparison, we simulate an online setting for offline models by adopting a sliding window approach with a window size of 5 seconds. Specifically, we provide each VLM with prompts related to prediction, detection, and question answering, instructing them to generate anomaly categories, corresponding anomaly descriptions, and responses to user queries based on the input video segments. Due to differences in instruction fine-tuning data across VLMs, we optimize prompts for each model to maximize response accuracy. For VLMs that support online processing, such as VideoLLM-online [3], we directly feed streaming video input at 2 FPS. Table 1 presents the comparison results of AssistPDA with existing methods on the three tasks. The compared VLMs include Video-LLaMA2 [4], Video-LLaVA [12], Video-ChatGPT [18], InternVL2 [24], and Qwen2-VL [27], which are among the most advanced VLMs currently available. 
Additionally, Holmes-VAD [41] is an offline VLM specifically designed for VAD.\nAs shown in Table 1, our method significantly outperforms all baselines, achieving superior performance across all evaluation metrics. Except for Holmes-VAD, other VLMs exhibit low F1-scores on both VAP and VAD tasks. This is primarily due to the fact that VAPDA-127K encompasses 15 distinct categories of anomalous events, posing a considerable challenge for general-purpose VLMs. The online method, VideoLLM-online, fails to follow our instructions, producing largely garbled and redundant outputs, resulting in poor performance. Notably, in the video anomalous event prediction task, the average advance time (AAT) of our method is 29.19 s, which is a qualitative leap compared to frame-level prediction. The consistent superiority of our method across all metrics demonstrates the effectiveness of AssistPDA in VAP, VAD, and VAA tasks.\n4.4. Ablation Study # We conduct ablation experiments in this subsection to analyze the effectiveness of each component of our framework.\n| Method | Params | VAP F1-score (%) | VAP AAT (s) | VAP MS (%) | VAP Bleurt (%) | VAP Unieval (%) | VAD F1-score (%) | VAD MS (%) | VAD Bleurt (%) | VAD Unieval (%) | VAA MS (%) | VAA Bleurt (%) | VAA Unieval (%) |\n|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n| Sliding window (size = 5 s, fps = 2) | | | | | | | | | | | | | |\n| Video-LLaMA2 [4] | 7B | 28.26 | 10.32 | 53.05 | 37.98 | 79.84 | 9.56 | 50.65 | 29.67 | 65.23 | 56.64 | 52.15 | 80.22 |\n| Video-LLaVA [12] | 7B | 38.63 | 12.34 | 52.99 | 37.39 | 73.89 | 12.01 | 48.32 | 20.28 | 67.69 | 57.05 | 44.49 | 81.84 |\n
| Video-ChatGPT [18] | 7B | 18.94 | 7.25 | 53.54 | 38.06 | 41.08 | 11.35 | 54.15 | 40.66 | 64.12 | 56.29 | 47.56 | 80.60 |\n| InternVL2 [24] | 2B | 16.16 | 6.32 | 53.22 | 33.98 | 61.98 | 13.77 | 52.59 | 38.82 | 66.25 | 55.82 | 44.19 | 73.29 |\n| Qwen2-VL [27] | 2B | 30.71 | 11.64 | 54.59 | 40.02 | 72.12 | 11.83 | 55.45 | 48.14 | 69.29 | 54.78 | 47.13 | 75.45 |\n| Holmes-VAD [41] | 7B | 47.91 | 15.68 | 54.97 | 41.47 | 70.61 | 25.83 | 55.00 | 42.52 | 68.48 | 55.70 | 40.72 | 88.05 |\n| VideoLLM-online [3] | 8B | 0 | / | 5.23 | 6.75 | 10.23 | 0 | 6.78 | 4.43 | 9.46 | 5.67 | 3.56 | 8.92 |\n| AssistPDA | 2B | 64.69 | 29.19 | 61.89 | 51.63 | 76.69 | 45.66 | 65.45 | 63.83 | 72.46 | 62.87 | 61.12 | 88.32 |\nTable 1. Main results of VAP, VAD, and VAA on VAPDA-127K. We compare existing general-purpose VLMs with those tailored for VAD. 
For VLMs that do not support online inference, video is sampled at 2 FPS and processed using a 5-second sliding window as input.\nTable 2. Performance comparison of our method with different STRD settings.\n| Setting | VAP LM-PPL ↓ | VAP TimeDiff ↓ | VAP Fluency ↑ | VAD LM-PPL ↓ | VAD TimeDiff ↓ | VAD Fluency ↑ |\n|---|---|---|---|---|---|---|\n| Baseline | 1.76 | 1.52 | 53.02% | 2.15 | 5.14 | 46.69% |\n| w/o pretraining | 1.79 | 1.85 | 52.76% | 2.27 | 5.19 | 44.53% |\n| w/o finetune | 1.70 | 1.27 | 53.42% | 1.98 | 4.82 | 46.68% |\n| w finetune | 1.68 | 1.07 | 53.81% | 1.96 | 4.71 | 46.83% |\nTable 3. Performance comparison of our method on STRD with different numbers of MHSA layers.\n| Setting | VAP LM-PPL ↓ | VAP TimeDiff ↓ | VAP Fluency ↑ | VAD LM-PPL ↓ | VAD TimeDiff ↓ | VAD Fluency ↑ |\n|---|---|---|---|---|---|---|\n| 1 layer MHSA | 1.67 | 1.08 | 53.07% | 1.97 | 4.78 | 45.95% |\n| 2 layer MHSA | 1.68 | 1.07 | 53.81% | 1.96 | 4.71 | 46.83% |\n| 3 layer MHSA | 1.70 | 1.09 | 53.46% | 1.96 | 4.83 | 46.46% |\nEffectiveness of the STRD. To evaluate the effectiveness of the STRD module, we perform multiple ablation and comparative experiments. Table 2 presents the results of four experimental settings: (1) the baseline model without the STRD (baseline), (2) the model with the STRD module but without distillation pretraining (w/o pretraining), (3) the model with the STRD module but without fine-tuning (w/o finetune), and (4) the model with both the STRD module and fine-tuning (w finetune). From Table 2, it can be observed that the setting with both the STRD module and fine-tuning achieves the best performance. Compared to the baseline, it improves both LM-PPL and TimeDiff (e.g., on VAP, TimeDiff drops from 1.52 to 1.07), indicating improved language modeling accuracy and temporal alignment. These results demonstrate that the design of the STRD module, combined with fine-tuning, effectively enhances the model\u0026rsquo;s capability in spatiotemporal reasoning. Table 3 presents the impact of different numbers of MHSA layers in the STRD module. 
The results show that the 2-layer MHSA configuration achieves the best overall performance.\nImpact of EOS Token Prediction Threshold γ. Fig. 5(a) and (b) illustrate the impact of the EOS token threshold γ on model performance during the inference phase. We can observe that the prediction task is sensitive to the EOS token threshold, which aligns with its inherent nature, requiring heightened sensitivity to anomalous events. Finally, the optimal EOS token threshold γ is set to 0.96 for the prediction task and 0.7 for the detection task.\nFigure 5. F1-score variation for different EOS token prediction thresholds γ on VAP and VAD tasks.\n5. Conclusion # In this work, we propose AssistPDA, an online video anomaly surveillance assistant that integrates video anomaly prediction, detection, and analysis. Based on this framework, we introduce a novel event-level video anomaly prediction task aimed at enabling early warning of anomalous events. To enhance AssistPDA\u0026rsquo;s capability of understanding long-term spatiotemporal relationships in video streams under online inference settings, we introduce a novel STRD module, which can effectively transfer the spatiotemporal reasoning ability of existing VLMs from offline processing to online inference. To accommodate the tasks of online VAPDA, we construct a large-scale benchmark dataset, VAPDA-127K, which serves as a valuable resource for future research on online video anomaly understanding. Extensive experiments have shown that AssistPDA achieves superior performance compared to existing state-of-the-art VLMs across the VAP, VAD, and VAA tasks.\nReferences # [1] Yannick Benezeth, P-M Jodoin, Venkatesh Saligrama, and Christophe Rosenberger. Abnormal events detection based on spatio-temporal co-occurences. In CVPR, pages 2458–2465, 2009.\n[2] Congqi Cao, Yue Lu, Peng Wang, and Yanning Zhang. A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation. 
In CVPR, pages 20392– 20401, 2023. 4\n[3] Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. In CVPR, pages 18407–18418, 2024. 5 , 6 , 7 , 8\n[4] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatialtemporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 7 , 8\n[5] WeiLin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023. 1\n[6] Yang Cong, Junsong Yuan, and Ji Liu. Sparse reconstruction cost for abnormal event detection. In CVPR, pages 3449– 3456, 2011. 2\n[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 5\n[8] Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, et al. Uncovering what why and how: A comprehensive benchmark for causation understanding of video anomaly. In CVPR, pages 18793–18803, 2024. 1 , 3 , 7 , 2\n[9] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In CVPR, pages 1705–1714, 2019. 2\n[10] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. 
Anomalygpt: Detecting industrial anomalies using large vision-language models. In AAAI, pages 1932–1940, 2024. 3\n[11] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In CVPR, pages 733–742, 2016. 2\n[12] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 7 , 8\n[13] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, pages 34892–34916, 2023. 1\n[14] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In CVPR, pages 6536–6545, 2018. 3\n[15] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In CVPR, pages 2720–2727, 2013. 2\n[16] Weixin Luo, Wen Liu, and Shenghua Gao. Remembering history with convolutional lstm for anomaly detection. In ICME, pages 439–444, 2017. 3\n[17] Hui Lv and Qianru Sun. Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702, 2024. 3 , 2\n[18] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 7 , 8\n[19] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. In CVPR , pages 1975–1981, 2010. 2\n[20] Guansong Pang, Cheng Yan, Chunhua Shen, Anton van den Hengel, and Xiao Bai. Self-trained deep ordinal regression for end-to-end video anomaly detection. In CVPR, pages 12173–12182, 2020. 2\n[21] Thibault Sellam, Dipanjan Das, and Ankur P Parikh. Bleurt: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696, 2020. 7\n[22] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. 
In CVPR, pages 6479–6488, 2018. 1 , 2 , 3 , 4 , 7\n[23] Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and Yingcong Chen. Hawk: Learning to understand open-world video anomalies. In NeurIPS, pages 139751–139785, 2024. 1 , 3 , 2\n[24] OpenGVLab Team. Internvl2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy, 2024. 2 , 7 , 8\n[25] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In ICCV, pages 4975–4986, 2021. 3\n[26] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste ´ ´ Roziere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Open and efficient foundation language models. Preprint at arXiv. https://doi. org/10.48550/arXiv, 2302(3), 2023. 1\n[27] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model\u0026rsquo;s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2 , 5 , 6 , 7 , 8\n[28] Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, and Nong Sang. Few-shot action recognition with captioning foundation models. arXiv preprint arXiv:2310.10125, 2023. 3\n[29] Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli Zhao, and Nong Sang. Clip-guided prototype modulating for few-shot action recognition. IJCV, 132(6): 1899–1912, 2024. 3\n[30] Yang Wang, Jun Xu, Jiaogen Zhou, and Jihong Guan. Video anomaly prediction: Problem, dataset and method. In ICASSP, pages 3870–3874, 2024. 4\n[31] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. 
In ECCV, pages 322–339, 2020. 2 , 3 , 4 , 7\n[32] Peng Wu, Chengyu Pan, Yuting Yan, Guansong Pang, Peng Wang, and Yanning Zhang. Deep learning for video anomaly detection: A review. arXiv preprint arXiv:2409.05383, 2024. 1\n[33] Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open-vocabulary video anomaly detection. In CVPR, pages 18297–18307, 2024. 3\n[34] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In AAAI, pages 6074–6082, 2024. 3\n[35] Zhiwei Yang, Peng Wu, Jing Liu, and Xiaotao Liu. Dynamic local aggregation network with adaptive clusterer for anomaly detection. In ECCV, pages 404–421, 2022. 3\n[36] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. Video event restoration based on keyframes for video anomaly detection. In CVPR, pages 14592–14601, 2023. 3\n[37] Zhiwei Yang, Jing Liu, and Peng Wu. Text prompt with normality guidance for weakly supervised video anomaly detection. In CVPR, pages 18899–18908, 2024. 3\n[38] Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin, and Zhenzhen Jiao. Towards surveillance video-and-language understanding: New dataset baselines and challenges. In CVPR, pages 22052–22061, 2024. 2\n[39] Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly detection. In CVPR , pages 18527–18536, 2024. 3 , 1 , 2\n[40] Shuangfei Zhai, Yu Cheng, Weining Lu, and Zhongfei Zhang. Deep structured energy based models for anomaly detection. In ICML, pages 1100–1109, 2016. 1\n[41] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, and Nong Sang. Holmes-vad: Towards unbiased and explainable video anomaly detection via multi-modal llm. arXiv preprint arXiv:2406.12235, 2024. 
1 , 3 , 7 , 8\n[42] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xiaonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, and Nong Sang. Holmes-vau: Towards long-term video anomaly understanding at any granularity. arXiv preprint arXiv:2412.06171, 2024. 4 , 2\n[43] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023. 1\n[44] Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. arXiv preprint arXiv:1909.02622, 2019. 7\n[45] Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197, 2022. 7\nAssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis # Supplementary Material # 6. Dataset Construction Details # This section provides additional details on the dataset construction process.\n6.1. Caption Model # For generating video frame captions, we follow the work [39] and aggregate the outputs from five caption models: BLIP2-flan-t5-xl, BLIP2-flan-t5-xl-coco, BLIP2-flan-t5-xxl, BLIP2-opt-6.7b, and BLIP2-opt-6.7b-coco. This aggregation helps mitigate potential biases from individual caption models.\n6.2. Task Prompts for LLM # For the three tasks VAP, VAD, and VAA, task-specific data are generated based on captions using the Qwen2.5-72B-Instruct model, which was the most powerful open-source LLM available at the time of dataset construction. The specific task prompts for each task are shown in Figure 7.\n6.3. Dataset Splits # Table 5 provides a detailed split of the training and test set composition of VAPDA-127K across the three tasks. 
Notably, for the anomaly analysis test set, we select one of the ten open-ended questions corresponding to each timestamp as the final test sample to ensure compatibility with online streaming inference. Table 6 compares our dataset with existing VAD-related datasets, highlighting its advantages. Our dataset provides textual annotations tailored for online video anomaly prediction and detection tasks. Additionally, for the anomaly analysis task, it offers open-ended question-answer pairs specific to individual anomalous events.\n7. Impact of LoRA Fine-tuning Parameters # Table 4 presents the impact of different LoRA fine-tuning parameters, r and α, on performance. The results show that different parameter combinations affect LM-PPL, TimeDiff, and Fluency metrics differently. After a comprehensive trade-off, we select r = 32 and α = 64 as the optimal configuration.\n8. More Qualitative Results # Figure 6 further demonstrates the qualitative results on the test set. For the VAP and VAD tasks, the assistant receives the video stream input in real time and gives a response at the moment when the anomaly may occur as well as at the moment when the anomaly actually occurs. For the VAA task, the assistant immediately responds to the user\u0026rsquo;s question.\nFigure 6. Visualization results on the test set.\nTable 4. Performance comparison of our method with different LoRA fine-tuning parameters.\n| LoRA setting | VAP LM-PPL ↓ | VAP TimeDiff ↓ | VAP Fluency ↑ | VAD LM-PPL ↓ | VAD TimeDiff ↓ | VAD Fluency ↑ |\n| --- | --- | --- | --- | --- | --- | --- |\n| r=8/α=16 | 1.69 | 1.09 | 53.41% | 1.98 | 4.72 | 46.55% |\n| r=16/α=32 | 1.67 | 1.07 | 53.74% | 1.99 | 4.68 | 46.76% |\n| r=32/α=64 | 1.68 | 1.07 | 53.81% | 1.96 | 4.71 | 46.83% |\n| r=64/α=128 | 1.70 | 1.12 | 53.65% | 1.98 | 4.85 | 46.71% |\nTable 5. Detailed training and test set split for VAPDA-127K dataset.\n| VAPDA-127K dataset | Prediction text | Detection text | Anomaly Analysis (QA pair) | Timestamp |\n| --- | --- | --- | --- | --- |\n| Training set | 2511 | 6513 | 96720 | ✓ |\n| Test set | 556 | 1521 | 19630 (1963) | ✓ |\nTable 6. 
Comparison of other existing VAD method datasets.\n| Methods | #Categories | #Samples | Prediction text | Detection text | Anomaly Analysis (QA pair): fixed template | Anomaly Analysis (QA pair): open-end | Temp. Anno | VLM tuning |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| UCA [38] | 13 | 23542 | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |\n| LAVAD [39] | N/A | N/A | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |\n| VAD-VideoLLama [17] | 13/7 | 2400 | ✗ | ✗ | ✗ | ✗ | ✗ | projection |\n| CUVA [8] | 11 | 6000 | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |\n| Hawk [23] | - | 16000 | ✗ | ✗ | ✓ | ✗ | ✗ | projection |\n| HIVAU-70K [42] | 19 | 70000 | ✗ | ✗ | ✓ | ✗ | ✓ | LoRA |\n| VAPDA-127K (Ours) | 15 | 127451 | ✓ | ✓ | - | ✓ | ✓ | LoRA |\nAnomaly Prediction Task Prompt # You are an expert in anomaly event prediction. Based on the provided video frame caption occurring before an anomaly event, analyze whether it is possible to predict future anomaly events based on the information from these pre-anomaly video caption.\nPlease provide the output in the following JSON format:\n{ \u0026quot;Prediction Result\u0026quot;: \u0026ldquo;Yes/No\u0026rdquo;. \u0026lt;If it is possible to predict future anomaly events based on the information, respond with \u0026lsquo;Yes\u0026rsquo;. Otherwise, respond with \u0026lsquo;No\u0026rsquo;.\u0026gt;,\n\u0026quot;Potential Anomaly ID\u0026quot;: \u0026ldquo;\u0026lt;xxx\u0026gt;\u0026rdquo;. \u0026lt;From the provided pre-anomaly video caption with ID annotations, select the ID corresponding to the video caption that allows the prediction of future anomaly events. If no prediction is possible, respond with \u0026lsquo;None\u0026rsquo;.\u0026gt;,\n\u0026quot;Anomaly Type\u0026quot;: \u0026ldquo;\u0026lt;xxx\u0026gt;\u0026rdquo;. \u0026lt;Indicate the potential type of future anomaly, selecting from the following: Abuse, Arrest, Arson, Assault, Burglary, Explosion, Fighting, Road Accidents, Robbery, Shooting, Shoplifting, Stealing, Vandalism, Riot, Car accident. If prediction is not possible, respond with \u0026lsquo;None\u0026rsquo;. 
\u0026gt;,\n\u0026quot;Anomaly Prediction Description\u0026quot;: \u0026ldquo;\u0026lt;xxx\u0026gt;\u0026rdquo;. \u0026lt;Provide a brief explanation of the reason for the potential future anomaly event. Use the following template for the response: \u0026lsquo;A future \u0026lt;Anomaly Type\u0026gt; anomaly may occur because \u0026lt;reason\u0026gt;.\u0026rsquo; If prediction is not possible, respond with \u0026lsquo;None\u0026rsquo;.\u0026gt;}\nAnomaly Detection Task Prompt # You are an expert in anomaly event analysis. Based on the provided video caption, analyze whether the segment indicates that an anomaly event is currently occurring. Please provide the output in the following JSON format:\n{ \u0026ldquo;Detection Result\u0026rdquo;: \u0026ldquo;Yes/No\u0026rdquo;. \u0026lt;If it can be determined that an anomaly event is currently occurring, respond with \u0026lsquo;Yes\u0026rsquo;. Otherwise, respond with \u0026lsquo;No\u0026rsquo;.\u0026gt;,\n\u0026ldquo;Anomaly Type\u0026rdquo;: \u0026ldquo;\u0026lt;xxx\u0026gt;\u0026rdquo;. \u0026lt;Indicate the type of anomaly event currently occurring, selecting from the following: Abuse, Arrest, Arson, Assault, Burglary, Explosion, Fighting, Road Accidents, Robbery, Shooting, Shoplifting, Stealing, Vandalism, Riot, Car accident. If it cannot be determined, respond with \u0026lsquo;None\u0026rsquo;.\u0026gt;,\n\u0026ldquo;Anomaly Detection Description\u0026rdquo;: \u0026ldquo;\u0026lt;xxx\u0026gt;\u0026rdquo;. \u0026lt;Provide a refined description based on the provided anomaly event segment description. Use the following template: \u0026lsquo;A \u0026lt;Anomaly Type\u0026gt; anomaly is currently occurring, \u0026lt;reason\u0026gt;.\u0026rsquo; If the detection result is \u0026lsquo;No\u0026rsquo;, respond with \u0026lsquo;None\u0026rsquo;. 
\u0026gt;}\nAnomaly Analysis Task Prompt # You are an advanced video surveillance assistant capable of detecting and analyzing anomalous events in real time. Based on the provided descriptions of the current anomalous video clip and contextual information from past video clips, generate 10 possible questions and corresponding answers to analyze and address the anomalous event. The questions should primarily focus on the specific details of the current anomalous video clip, ensuring that the answers can be derived or inferred from the given contextual information. Frame the questions using the 5W2H framework: When, What, Who, Where, Why, How, and How much. Provide the output in the following JSON format:\n[ { \u0026ldquo;Question \u0026lt;id\u0026gt;\u0026rdquo;: \u0026ldquo;\u0026lt;A specific question related to analyzing and addressing the anomaly\u0026gt;\u0026rdquo;,\n\u0026ldquo;Answer \u0026lt;id\u0026gt;\u0026rdquo;: \u0026ldquo;\u0026lt;A detailed answer to the corresponding question based on the context\u0026gt;\u0026rdquo; }]\nImportant Notice: Ensure the questions and answers are detailed, contextually relevant, and practical for investigating and addressing the described anomaly.\nFigure 7. 
Illustration of how to prompt the LLM to generate data for VAP, VAD, and VAA tasks.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/assistpda-an-online-video-surveillance-assistant-for-video-anomaly-prediction-detection-and-analysis/","section":"Papers","summary":"Introducing AssistPDA, a pioneering framework for real-time online video anomaly prediction, detection, and analysis leveraging vision-language models with a novel spatiotemporal relation distillation module and constructed benchmark dataset VAPDA-127K.","title":"AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis","type":"application"},{"content":" AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection # Peng Wu, Wanshun Su, Guansong Pang Member, IEEE, Yujia Sun, Qingsen Yan Member, IEEE , Peng Wang Member, IEEE and Yanning Zhang Fellow, IEEE\nAbstract—With the increasing adoption of video anomaly detection in intelligent surveillance domains, conventional visual-only detection approaches often struggle with information insufficiency and high false-positive rates in complex environments. To address these limitations, we present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Capitalizing on the exceptional cross-modal representation learning capabilities of Contrastive Language-Image Pretraining (CLIP) across visual, audio, and textual domains, our framework introduces two major innovations: an efficient audio-visual fusion that enables adaptive cross-modal integration through lightweight parametric adaptation while maintaining the frozen CLIP backbone, and a novel audio-visual prompt that dynamically enhances text embeddings with key multimodal information based on the semantic correlation between audio-visual features and textual labels, significantly improving CLIP\u0026rsquo;s generalization for the video anomaly detection task. 
Moreover, to enhance robustness against modality deficiency during inference, we further develop an uncertainty-driven feature distillation module that synthesizes audio-visual representations from visual-only inputs. This module employs uncertainty modeling based on the diversity of audio-visual features to dynamically emphasize challenging features during the distillation process. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy in various scenarios. Notably, with unimodal data enhanced by uncertainty-driven distillation, our approach consistently outperforms current unimodal VAD methods.\nIndex Terms—video anomaly detection, audio-visual collaboration, weakly supervised learning.\nPeng Wu, Wanshun Su, Qingsen Yan, Peng Wang, and Yanning Zhang are with the School of Computer Science, Northwestern Polytechnical University, China. E-mail:{xdwupeng, suws0616, qingsenyan}@gmail.com; {peng.wang, ynzhang}@nwpu.edu.cn.\nGuansong Pang is with the School of Computing and Information Systems, Singapore Management University, Singapore. E-mail: pangguansong@gamil.com.\nYujia Sun is with the School of Artificial Intelligence, Xidian University, China. E-mail: yjsun@stu.xidian.edu.cn.\nManuscript received April 19, 2021; revised August 16, 2021.\nI. INTRODUCTION # VIDEO anomaly detection (VAD), as a pivotal technology in intelligent surveillance systems, focuses on identifying anomalous events within videos and has attracted substantial research interest in recent years [1]–[11]. Due to the rarity of anomalies and the high cost of manual annotation, fully supervised frameworks are impractical for large-scale deployment. As a solution, weakly supervised video anomaly detection (WSVAD) methods [12]–[15] have gained traction, aiming to discover latent anomalies under coarse supervision. Current WSVAD methods primarily rely on the multiple 
instance learning (MIL) framework, using video-level labels for model training [12], [16]. Specifically, these approaches treat videos as bags of segments (instances) and distinguish anomalous patterns through the hard attention mechanism (a.k.a. Top-K) [17]. With the rapid advancement of foundation models, Contrastive Language-Image Pretraining (CLIP) [18] has shown remarkable potential in various downstream tasks, including video understanding [19], [20]. Building on the success of CLIP, recent methods like VadCLIP [21] and TPWNG [15] have advanced WSVAD by leveraging CLIP\u0026rsquo;s semantic alignment capabilities.\nFig. 1. Left: Illustration of audio-visual collaboration effects; Right: Illustration of our proposed distillation (UKD) effects.\nHowever, these methods, whether CLIP-based or conventional, predominantly rely on unimodal visual information, which often leads to significant detection limitations in complex real-world scenarios. Visual occlusion, extreme lighting variations, and environmental noise can render visual features unreliable or ambiguous [22]–[24]. In these challenging conditions, multimodal information, particularly audio, offers indispensable contextual cues that can complement and enhance visual-based detection. For instance, audio remains robust when visual data is compromised, allowing detection of off-camera events. In acoustically rich environments, certain anomalies such as explosions, screams, or gunshots exhibit distinct acoustic signatures, making them more discriminative in the audio domain. Similarly, in low-light conditions where visual features degrade, audio serves as a critical supplementary modality. These observations underscore the importance of integrating audio and video modalities, as their complementary nature can significantly enhance the accuracy and robustness of anomaly detection systems in diverse and challenging environments. 
We illustrate the impact of audio-visual integration for WSVAD in Figure 1.\nExisting attempts [22], [25], [26] to incorporate audio into video anomaly detection typically adopt traditional feature concatenation methods, such as fusing visual features extracted by I3D [27] or C3D [28] with audio features extracted by VGGish [29]. These approaches fail to fully exploit the potential of multimodal learning, resulting in suboptimal cross-modal integration. Moreover, they overlook the inherent semantic alignment between visual and auditory modalities, which is essential for enhancing anomaly detection performance.\nTo address these limitations, we propose AVadCLIP, a WSVAD framework that leverages audio-visual collaborative learning to drive audio-visual anomaly detection by CLIP-powered cross-modal alignment. AVadCLIP fully exploits CLIP\u0026rsquo;s intrinsic capability to establish semantic consistency across vision, text, and audio, ensuring that video anomaly detection is performed within a unified multimodal semantic space rather than merely fusing raw features. Our framework introduces three significant innovations: an efficient audio-visual feature fusion mechanism that differs from naive feature concatenation and achieves adaptive cross-modal integration through lightweight parametric adaptation while keeping the CLIP backbone frozen; a novel audio-visual prompt mechanism that dynamically enriches text label embeddings with key multimodal information, enhancing contextual understanding of videos and enabling more precise identification of different categories; and an uncertainty-driven feature distillation (UKD) module that generates audio-visual-like enhanced features in audio-missing scenarios, ensuring robust anomaly detection performance (as illustrated in Figure 1). 
Overall, our AVadCLIP relies on only a small set of trainable parameters, effectively transferring CLIP\u0026rsquo;s pretrained knowledge to the weakly supervised audio-visual anomaly detection task. Furthermore, by employing a distillation strategy based on data uncertainty modeling, we transfer the learned knowledge from our audio-visual anomaly detector to a unimodal detector, enabling robust anomaly detection in scenarios with incomplete modalities.\nIn summary, our main contributions are as follows:\n\nWe propose a WSVAD framework that harnesses audio-visual collaborative learning, leveraging CLIP\u0026rsquo;s multimodal alignment capabilities. By incorporating a lightweight adaptive audio-visual fusion mechanism and integrating audio-visual information through prompt-based learning, our approach effectively achieves CLIP-driven robust anomaly detection in multimodal settings.\nWe design an uncertainty-driven feature distillation module, which transforms deterministic estimation into probabilistic uncertainty estimation. This enables the model to capture feature distribution variance, ensuring robust anomaly detection performance even with unimodal data.\nExtensive experiments on two WSVAD datasets demonstrate that our method achieves superior performance in audio-visual scenarios, while maintaining robust anomaly detection results even in audio-absent conditions.\nII. RELATED WORK # A. Video Anomaly Detection # Video anomaly detection has been extensively studied in recent years, with existing approaches broadly categorized into semi-supervised and weakly supervised methods. Among them, semi-supervised methods primarily rely on normal video clips for training and identify anomalies by detecting deviations from learned normal patterns during inference. These methods commonly adopt self-supervised learning techniques [30]–[32], such as reconstruction [33], [34] or prediction [35], [36]. 
Reconstruction-based methods assume that the model can effectively reconstruct normal videos, whereas abnormal videos, due to distributional discrepancies, result in significant reconstruction errors. Autoencoders [37], [38] are widely employed to capture normal pattern features, with reconstruction error serving as an anomaly indicator. Prediction-based methods [39] utilize models to forecast future frames, detecting anomalies based on prediction errors. However, a key limitation of semi-supervised methods is their tendency to overfit normal patterns, leading to poor generalization to unseen anomalies.\nWeakly supervised methods, in contrast, typically adopt the MIL framework, requiring only video-level anomaly labels and significantly reducing annotation costs. The classic work, DeepMIL [12], employs a ranking loss to distinguish normal from anomalous instances. Furthermore, a two-stage self-training strategy has been proposed to further enhance detection, where high-confidence anomalous regions identified during MIL training serve as pseudo-labels for a secondary refinement phase [40]–[42]. With the rise of Vision-Language Models (VLMs) [43], CLIP has shown remarkable cross-modal capabilities and is increasingly applied to WSVAD. VadCLIP [21], the first CLIP-based WSVAD method, integrates textual priors via text and visual prompts, enhancing anomaly detection. Building on this, TPWNG [15] refines feature learning through a two-stage approach. Recent research trends focus on large-model-driven strategies, e.g., training-free frameworks [44], [45], spatiotemporal anomaly detection [46], and open-scene anomaly detection [47]. Recent advances in multi-modal fusion [48] introduce powerful frameworks combining diverse modalities such as visual and audio features. For instance, AVCL [49] and DSRL [50] have shown significant promise in improving anomaly detection by leveraging both visual and audio cues.\nB. 
Audio-Visual Learning # The integration of audio and visual information has emerged as a critical research direction in multimodal learning, as it not only enhances model performance but also facilitates a deeper understanding of complex scenes. Significant progress has been achieved in various aspects of audio-visual fusion [51], [52]. In audio-visual segmentation, researchers aim to accurately segment sound-producing objects based on audio-visual cues. Chen et al. [53] proposed a novel informative sample mining method for audio-visual supervised contrastive learning. Ma et al. [54] introduced a two-stage training strategy to address the audio-visual semantic segmentation (AVSS) task. Building on these works, Guo et al. [55] introduced a new task: Open-Vocabulary AVSS (OV-AVSS), which extends AVSS to open-world scenarios beyond predefined annotation labels. Audio-visual event localization aims to identify the spatial and temporal locations of both visual and auditory events, with attention mechanisms widely used for modality fusion. For instance, He et al. [56] proposed an audio-visual co-guided attention mechanism, while Xu et al. [57] introduced an audio-guided spatial-channel attention mechanism. Related tasks include audio-visual video parsing [58], [59] and audio-visual action recognition [60]. Audio-visual anomaly detection [25], [61] has also become a growing research hotspot. For example, Yu et al. [62] applied a self-distillation module to transfer single-modal visual knowledge to an audio-visual model, reducing noise and bridging the semantic gap between single-modal and multimodal features. Similarly, Pang et al. [63] proposed a weighted feature generation approach, leveraging mutual guidance between visual and auditory information, followed by bilinear pooling for effective feature integration.\nC. 
Large Models in Video Understanding # In recent years, large models have exhibited exceptional capabilities in perception and reasoning for video understanding tasks, significantly accelerating the shift from purely visual models to multimodal video understanding frameworks. Representative visual models, such as VideoMAE [64], employ masked self-supervised learning to effectively model spatiotemporal dynamics in videos, facilitating their widespread application in video classification, action recognition, and anomaly detection. With the success of VLMs like CLIP [18] and ALIGN [65], integrating language priors into video understanding has emerged as a prominent research trend. These models perform cross-modal semantic alignment through joint image-text encoding and have been widely adopted in tasks such as zero-shot action recognition, video retrieval, and open-vocabulary scene understanding. Further advances, including X-CLIP [66] and VideoCLIP [20], introduce temporal modeling into VLM architectures, significantly improving semantic comprehension of long-form video content. Meanwhile, VLM-based video reasoning tasks are gaining increasing attention. Models such as VL-T5 [67] and VideoChat [68] leverage language-guided mechanisms to enable video question answering, event interpretation, and causal reasoning, thereby substantially broadening the scope of video understanding.\nIII. METHODOLOGY # A. Problem Statement # We are given a training set of videos {V_i}, where each video V_i contains both visual and corresponding audio information and is paired with a video-level label y ∈ R^C. Here, C denotes the number of categories (including the normal class and various anomaly classes). To facilitate model processing, we employ a video encoder and an audio encoder to extract high-level features X_v ∈ R^{N×d} and X_a ∈ R^{N×d}, respectively, where N represents the temporal length of the video (i.e., the number of frames or snippets) and d denotes the feature dimensionality. 
The objective of the WSVAD task is to train a detector using all available X_v, X_a, and their corresponding labels from the training set, enabling the model to accurately determine whether each frame in a test sample is anomalous and to identify the specific anomaly category.\nThe overall pipeline of our method, shown in Figure 2, starts by extracting features from video and audio using dedicated encoders and then adaptively fuses them for multimodal correspondence learning. We combine a classification branch with a CLIP-based alignment approach, using an audio-visual prompt to inject fine-grained multimodal information into text embeddings. Additionally, uncertainty-driven distillation is employed to improve anomaly detection robustness in scenarios with incomplete modalities.\nB. Video and Audio Encoders # Video encoder. Leveraging CLIP\u0026rsquo;s robust cross-modal representation, we use its image encoder (ViT-B/16) as the video encoder, in contrast to traditional models like C3D and I3D, which are less effective in capturing semantic relationships. We extract features from sampled video frames using CLIP; to address CLIP\u0026rsquo;s lack of temporal modeling, we incorporate a lightweight temporal model, such as a Graph Convolution Network (GCN) [16] or a Temporal Transformer [21], to capture temporal dependencies. This approach ensures efficient transfer of CLIP\u0026rsquo;s pre-trained knowledge to the WSVAD task. Audio encoder. For audio feature extraction, we use Wav2CLIP [69], a CLIP-based model that maps audio signals into the same semantic space as images and text. The audio is first converted into spectrograms, then sampled to match the number of video frames. These audio segments are processed by Wav2CLIP to extract features. To capture contextual relationships, we apply a temporal convolution layer [70], which models local temporal dependencies, preserving key dynamics within the audio modality.\nC. 
Audio-Visual Adaptive Fusion # In multimodal feature fusion, while both video and audio contain valuable semantic information, their importance often varies depending on the specific task. Inspired by human perception mechanisms [71], our approach follows a vision-centric, audio-assisted paradigm, where video features serve as the primary modality, and audio features complement and enhance visual information. To preserve the generalization capability of the original CLIP model in downstream tasks while avoiding the introduction of excessive trainable parameters, we design a lightweight adaptive fusion that integrates audio features without significantly increasing computational overhead. We present the structure of this fusion in Figure 3.\nSpecifically, given the video feature X_v and audio feature X_a, we first concatenate them to obtain a joint representation X_{a+v} ∈ ℝ^{N×2d}, which is then processed by two projection networks to generate the adaptive weight and residual feature. The first projection network computes adaptive fusion weights W, which determine the contribution of audio at each time step [72]. This is achieved through a linear transformation followed by a sigmoid activation:\nThe second projection network is responsible for residual mapping, which transforms X_{a+v} into a residual feature X_res that encodes the fused information from both modalities:\nFig. 2. The pipeline of our proposed AVadCLIP. Our method supports both multimodal inputs and visual-only inputs via distillation, enabling robust video anomaly detection through the proposed UKD strategy. Throughout the entire framework, the pre-trained CLIP backbone remains fully frozen, with only a few modules being trainable. This design allows for efficient and lightweight adaptation of CLIP\u0026rsquo;s knowledge to the specific task of audio-visual anomaly detection.\nFig. 3. 
The pipeline of our proposed adaptive fusion module, binary classifier, visual enhancement network, and uncertainty modeling network.\nFinally, the fused representation X_av is obtained by adaptively incorporating the residual feature into the original video feature: X_av = X_v + W ⊙ X_res,\nwhere ⊙ denotes element-wise multiplication. The adaptive weight W dynamically adjusts the degree of audio integration, ensuring that video features remain dominant while audio features provide auxiliary information. Additionally, the residual mapping enhances the expressiveness of the fused representation by capturing nonlinear transformations. By introducing an adaptive fusion mechanism and maintaining a lightweight design, our fusion approach effectively balances efficiency and expressiveness, leveraging the complementarity of visual and audio modalities while minimizing computational overhead.\nD. Dual Branch Framework with Prompts # We leverage a dual-branch framework [21] for the WSVAD task, consisting of a classification branch and an alignment branch, which together exploit audio-visual information to improve detection accuracy. The classification branch consists of a lightweight binary classifier (as shown in Figure 3), which takes X_av as input and directly predicts the frame-level anomaly confidence A. The alignment branch leverages a cross-modal semantic alignment mechanism, which computes the similarity between frame-level features and class label features. To obtain class label representations, we use the CLIP text encoder combined with the learnable textual prompt [73] and the audio-visual prompt to extract class embeddings, ensuring unified semantic alignment between visual and textual modalities. Given a set of predefined class labels (e.g., \u0026ldquo;normal\u0026rdquo;, \u0026ldquo;fighting\u0026rdquo;), we first introduce a learnable textual prompt, then concatenate the textual prompt with the class labels and feed them into the CLIP text encoder to obtain the class representation X_c. 
Compared to the manually defined prompt, the learnable prompt allows the model to dynamically adjust textual representations during training, making them more suitable for the specific requirements of WSVAD. Furthermore, we incorporate an audio-visual prompt into the class label features to enrich the class representations with additional multimodal information.\nThe proposed audio-visual prompt mechanism aims to dynamically inject instance-level key audio-visual information into text labels to enhance the representation. Specifically, we leverage the anomaly confidence A from the classification branch and audio-visual features X_av to generate a video-level global representation:\nwhere Norm represents the normalization operation. Next, we calculate the similarity matrix S_p between the class representation X_c and the global representation X_p to measure the alignment between class labels and videos:\nBased on S_p, we generate the enhanced instance-level audio-visual prompt X_mp:\nThis operation dynamically adjusts the class representation\u0026rsquo;s focus on different video instances by calculating the similarity between global audio-visual features and class labels, thereby enhancing cross-modal alignment.\nThen, we add X_mp and the class representation X_c, followed by a feed-forward network (FFN) transformation and a skip connection to obtain the final instance-specific class embedding X_cp:\nwhere ADD represents element-wise addition.\nThis dual-branch framework provides anomaly confidence through the classification branch and refines category identification with class information via the alignment branch, improving robustness and enabling fine-grained anomaly detection.\nE. Optimization of Audio-Visual Model # For the classification branch, we adopt the Top-K mechanism, as proposed in previous work [25], to select the top K anomaly confidence values from both normal and abnormal videos, which are averaged as the video-level prediction. 
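The Top-K pooling just described, and the binary cross-entropy it feeds into, can be sketched as follows. This is a minimal illustration, assuming a fixed K chosen by the caller and a 0/1 video-level label; the function names are hypothetical and the source does not state how K is set.

```python
import math

def topk_mean(scores, k):
    # Average of the k largest frame-level anomaly confidences.
    k = max(1, min(k, len(scores)))
    return sum(sorted(scores, reverse=True)[:k]) / k

def bce(pred, label, eps=1e-7):
    # Binary cross-entropy between a video-level prediction and its 0/1 label.
    p = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

frame_conf = [0.05, 0.10, 0.90, 0.80, 0.20, 0.70]
video_pred = topk_mean(frame_conf, k=3)  # mean of 0.9, 0.8, 0.7
loss = bce(video_pred, label=1)
```

Pooling only the highest-scoring frames lets a video-level label supervise frame-level scores without requiring every frame of an abnormal video to be abnormal.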
The classification loss L_BCE is then computed using binary cross-entropy between the prediction and the ground-truth class.\nIn the case of the alignment branch, the MIL-Align mechanism [21] is applied. We compute an alignment map M, reflecting the similarity between frame-level features X_av and all category embeddings X_cp. For each row in M, the top K similarities are selected and their average is used to quantify the alignment between the video and the current class. This results in a vector S = {s_1, . . . , s_m} representing the similarity between the video and all possible classes. The multi-class prediction is then calculated as: p_i = exp(s_i/τ) / Σ_j exp(s_j/τ),\nwhere p_i represents the prediction for the i-th class, and τ is the temperature scaling parameter. Then, we compute L_NCE based on cross-entropy. Besides, to address the class imbalance in WSVAD, where normal samples dominate and anomaly instances are sparse, we employ the focal loss [74]. Finally, the overall loss L_ALIGN for the alignment branch is the average of L_NCE and L_FOCAL.\nF. Uncertainty-Driven Distillation # In the WSVAD task, audio serves as a complementary modality to video, enhancing detection accuracy. However, audio may be unavailable in practical scenarios, leading to performance degradation. To address this, we apply knowledge distillation by using a pre-trained multi-modal (video+audio) teacher model to guide a unimodal (video-only) student model, ensuring robust anomaly detection even without audio. Traditional knowledge distillation methods typically assume a deterministic transfer of knowledge, employing mean square error (MSE) loss to align the student model with the teacher\u0026rsquo;s feature representations. However, this approach fails to account for the inherent uncertainty in audio-visual feature fusion. 
In real-world scenarios, factors such as noisy audio or occluded visual content can introduce distortions in the fused features, leading to inaccurate feature representations and diminished generalization capability.\nTo overcome this, we propose a probabilistic uncertainty distillation strategy [75], [76], which models data uncertainty during distillation, improving the student model\u0026rsquo;s robustness across diverse scenarios. Specifically, assume X_{av,i} = X_{vs,i} + ϵσ_i, where ϵ ∼ N(0, I) and X_vs represents the enhanced visual features generated by the student model, derived from X_v by passing it through a visual enhancement network (illustrated in Figure 3). Besides, σ_i refers to the inherent uncertainty between the i-th pair of features. We then model the observation as a Gaussian likelihood function to more accurately quantify data uncertainty in the feature distillation. The relationship between the audio-visual fusion feature X_{av,i} and the unimodal feature X_{vs,i} is formulated as:\nwhere θ denotes the model parameters. To maximize the likelihood for each pair of features X_{av,i} and X_{vs,i}, we adopt the log-likelihood form:\nIn practice, we design a network branch (a simple three-layer convolutional neural network, shown in Figure 3) to predict the variance σ_i² and reformulate the likelihood maximization problem as the minimization of a loss function. Specifically, we employ an uncertainty-weighted MSE loss:\nwhere L represents the number of feature pairs, and the constant term is omitted for clarity.\nDuring the distillation process, the student model not only learns the unimodal feature X_{vs,i} from the teacher model but also considers the feature uncertainty σ_i² to optimize its learning strategy. Specifically, the first term of the loss function represents the feature similarity between the student and teacher models, normalized by σ_i². 
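The two-term structure of this loss can be illustrated with the standard Gaussian negative log-likelihood. The sketch below is an assumption-laden illustration rather than the exact loss of this work: it assumes a squared L2 distance per feature pair, factors of one half on both terms, and a network that outputs the log-variance (for numerical stability) instead of the variance itself. The function name is hypothetical.

```python
import math

def uncertainty_weighted_mse(teacher, student, log_var):
    # Mean over feature pairs of ||t - s||^2 / (2 * sigma^2) + 0.5 * log(sigma^2).
    # teacher, student: lists of feature vectors; log_var: one log(sigma^2) per pair.
    total = 0.0
    for t, s, lv in zip(teacher, student, log_var):
        sq_err = sum((a - b) ** 2 for a, b in zip(t, s))
        total += sq_err / (2.0 * math.exp(lv)) + 0.5 * lv
    return total / len(teacher)

teacher = [[1.0, 2.0], [0.0, 0.0]]   # fused audio-visual features (teacher)
student = [[1.0, 1.0], [0.0, 2.0]]   # enhanced visual features (student)
log_var = [0.0, math.log(4.0)]       # sigma^2 = 1 and sigma^2 = 4
loss = uncertainty_weighted_mse(teacher, student, log_var)
```

In this example the second pair has four times the squared error of the first, but its larger predicted variance scales that error down to the same weighted value, at the cost of the added log-variance penalty.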
This assigns smaller weights to features with higher uncertainty, thereby avoiding overfitting to hard-to-learn information. The second term acts as a regularization term that prevents σ_i² from growing too large (which would otherwise drive the first term toward zero), ensuring effective distillation.\nUltimately, during the inference phase, we only input video and perform anomaly detection through the unimodal student model for audio-missing scenarios.\nIV. EXPERIMENTS # A. Datasets and Evaluation Metrics # Datasets: We conduct extensive experiments on two audio-visual benchmarks: XD-Violence [25] and CCTV-Fights_sub [49], both of which contain synchronized audio and visual modalities. Unlike traditional unimodal datasets, these benchmarks enable a more comprehensive evaluation of our framework\u0026rsquo;s robustness under multimodal settings. XD-Violence. As the largest publicly available audio-visual WSVAD dataset, XD-Violence [25] significantly surpasses existing datasets in scale and diversity. It comprises 3,954 training videos and 800 test videos, with the test set containing 500 violent and 300 non-violent videos. The dataset covers six distinct categories of violent events, including abuse, car accident, explosion, fighting, riot, and shooting, which occur at various temporal locations within videos. CCTV-Fights_sub. Derived from CCTV-Fights [77], CCTV-Fights_sub [49] is a carefully curated subset designed to address audio-visual anomaly detection. The subset retains 644 high-quality videos depicting real-world fight scenarios, each with meaningful audio content, making it a valuable resource for evaluating audio-visual anomaly detection methods in real-world surveillance contexts.\nEvaluation Metrics: For performance evaluation, we adopt distinct metrics tailored to different granularities of WSVAD tasks. For coarse-grained WSVAD, we employ frame-level Average Precision (AP), which provides a comprehensive measure of detection accuracy across varying confidence thresholds. 
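Frame-level AP can be illustrated with a small ranking-based computation. This is a simplified sketch that assumes binary frame labels and ignores tied scores; the function name is hypothetical, and evaluation code in practice often relies on a library routine such as scikit-learn average_precision_score instead.

```python
def average_precision(scores, labels):
    # Rank frames by descending anomaly score, then average the precision
    # measured at the rank of each truly anomalous frame.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

frame_scores = [0.9, 0.8, 0.7, 0.1]
frame_labels = [1, 0, 1, 0]
ap = average_precision(frame_scores, frame_labels)
```

Here the two anomalous frames sit at ranks 1 and 3, so the precisions 1/1 and 2/3 are averaged. Because the metric depends only on the ranking, no single decision threshold has to be fixed.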
For fine-grained anomaly detection, we utilize mean Average Precision (mAP) [22] computed at multiple intersection over union (IoU) thresholds. Specifically, we evaluate mAP at IoU thresholds ranging from 0.1 to 0.5 with an interval of 0.1 and report the average mAP (AVG) across these thresholds.\nTABLE I COARSE-GRAINED COMPARISONS ON XD-VIOLENCE.\n| Method | Reference | Modality | AP(%) |\n|---|---|---|---|\n| DeepMIL [12] | CVPR 2018 | RGB(ViT) | 75.18 |\n| Wu et al. [25] | ECCV 2020 | RGB(ViT) | 80 |\n| RTFM [78] | ICCV 2021 | RGB(ViT) | 78.27 |\n| AVVD [22] | TMM 2022 | RGB(ViT) | 78.1 |\n| Ju et al. [19] | ECCV 2022 | RGB(ViT) | 76.57 |\n| DMU [26] | AAAI 2023 | RGB(ViT) | 82.41 |\n| CLIP-TSA [79] | ICIP 2023 | RGB(ViT) | 82.17 |\n| AnomalyCLIP [80] | CVIU 2024 | RGB(ViT) | 78.51 |\n| TPWNG [15] | CVPR 2024 | RGB(ViT) | 83.68 |\n| VadCLIP [21] | AAAI 2024 | RGB(ViT) | 84.51 |\n| AVadCLIP∗ | this work | RGB(ViT) | 85.53 |\n| FVAI [63] | ICASSP 202 | RGB(I3D)+Audio | 81.69 |\n| MACIL-SD [62] | ACMMM 20 | RGB(I3D)+Audio | 81.21 |\n| CUPL [81] | CVPR 2 | RGB(I3D)+Audio | 81.43 |\n| AVCL [49] | TMM 202 | RGB(I3D)+Audio | 81.11 |\n| AVadCLIP | this work | RGB(ViT)+Audio | 86.04 |\nTABLE II COARSE-GRAINED COMPARISONS ON CCTV-FIGHTS_sub.\n| Method | Reference | Modality | AP(%) |\n|---|---|---|---|\n| VadCLIP [21] | AAAI 2024 | RGB(ViT) | 72.78 |\n| AVadCLIP∗ | this work | RGB(ViT) | 73.36 |\n| MACIL-SD [62] | ACMMM20 | RGB(I3D)+Audio | 72.92 |\n| DMU [26] | CVPR202 | RGB(I3D)+Audio | 72.97 |\n| AVCL [49] | TMM 202 | RGB(I3D)+Audio | 73.2 |\n| AVadCLIP | this work | RGB(ViT)+Audio | 73.38 |\nB. Implementation Details # We conduct experiments on an NVIDIA RTX 4090 GPU. The visual enhancement network is a single-layer 1D convolutional network, consisting of a convolutional layer with a kernel size of 3 and a padding size of 1, a ReLU activation function, and a skip connection. Such an operation effectively facilitates the aggregation of local contextual information. 
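The visual enhancement network described here can be sketched in plain Python as a temporal convolution with kernel size 3 and zero padding of 1, followed by ReLU and a skip connection. The sketch assumes equal input and output channel counts and that the skip connection wraps both the convolution and the ReLU; the source does not specify either detail, and the function name is hypothetical.

```python
def conv1d_relu_skip(x, weight, bias):
    # x: N frames, each a list of d features.
    # weight[o][i][k]: kernel tap k (0..2) from input channel i to output channel o.
    n, d = len(x), len(x[0])
    padded = [[0.0] * d] + x + [[0.0] * d]  # zero padding of 1 keeps length N
    out = []
    for t in range(n):
        row = []
        for o in range(d):
            acc = bias[o]
            for k in range(3):
                for i in range(d):
                    acc += weight[o][i][k] * padded[t + k][i]
            # Skip connection around conv + ReLU (assumed ordering).
            row.append(x[t][o] + max(acc, 0.0))
        out.append(row)
    return out

x_v = [[1.0], [2.0], [3.0]]        # three frames, one feature each
kernel = [[[1.0, 1.0, 1.0]]]       # sums each frame with its two neighbours
x_vs = conv1d_relu_skip(x_v, kernel, bias=[0.0])
```

Each output frame mixes in its immediate temporal neighbours, which is the local context aggregation that the kernel size of 3 provides.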
For input processing, we employ a frame selection strategy tailored to each dataset: we uniformly sample one frame per 16 frames for XD-Violence and one frame per 4 frames for CCTV-Fights_sub, with a maximum frame count of 256. During optimization, we set the batch size, learning rate, and total number of epochs to 96, 1e-5, and 10, respectively.\nC. Comparison with State-of-the-Art Methods # Performance comparison on XD-Violence: Our experiments evaluate both coarse-grained and fine-grained anomaly detection performance on XD-Violence, comparing our AVadCLIP against state-of-the-art approaches, as shown in Tables I and III. For coarse-grained anomaly detection, using only RGB inputs, AVadCLIP∗ (∗ denotes RGB-only input) achieves an AP score of 85.53%, surpassing all existing vision-only methods. Notably, it outperforms VadCLIP, the previous best-performing RGB-only approach, by 1.0%, demonstrating superior visual anomaly detection. When incorporating audio, AVadCLIP further improves performance, outperforming all multimodal baselines and achieving a 4.9% gain over the latest method, AVCL [49].\nFor fine-grained anomaly detection, AVadCLIP consistently outperforms all competitors across different IoU thresholds, as detailed in Table III. With RGB-only input, AVadCLIP∗ surpasses VadCLIP at all IoU thresholds, achieving an AVG improvement of 2.7%. Similarly, the full-modality model AVadCLIP leads across all metrics, boosting the AVG by 3.9%. These results highlight the effectiveness of multimodal learning in precisely localizing anomaly boundaries and improving category predictions.\nTABLE III FINE-GRAINED COMPARISONS ON XD-VIOLENCE (mAP@IoU, %).\n| Method | Reference | Modality | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | AVG |\n|---|---|---|---|---|---|---|---|---|\n| Random | - | RGB(ViT) | 1.82 | 0.92 | 0.48 | 0.23 | 0.09 | 0.71 |\n| DeepMIL [12] | CVPR 2018 | RGB(ViT) | 22.72 | 15.57 | 9.98 | 6.2 | 3.78 | 11.65 |\n| AVVD [22] | TMM 2022 | RGB(ViT) | 30.51 | 25.75 | 20.18 | 14.83 | 9.79 | 20.21 |\n| VadCLIP [21] | AAAI 2024 | RGB(ViT) | 37.03 | 30.84 | 23.38 | 17.9 | 14.31 | 24.70 |\n| AVadCLIP∗ | this work | RGB(ViT) | 39.63 | 32.77 | 26.84 | 21.58 | 16.39 | 27.44 |\n| AVadCLIP | this work | RGB(ViT)+Audio | 41.89 | 34.61 | 27.08 | 22.16 | 17.3 | 28.61 |\nTABLE IV FINE-GRAINED COMPARISONS ON CCTV-FIGHTS_sub (mAP@IoU, %).\n| Method | Reference | Modality | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | AVG |\n|---|---|---|---|---|---|---|---|---|\n| VadCLIP [21] | AAAI 2024 | RGB(ViT) | 19.34 | 14.32 | 9.25 | 6.64 | 3.73 | 10.66 |\n| AVadCLIP∗ | this work | RGB(ViT) | 21.1 | 14.57 | 9.01 | 5.74 | 4.69 | 11.02 |\n| AVadCLIP | this work | RGB(ViT)+Audio | 22.25 | 15.91 | 10.4 | 7 | 5.28 | 12.17 |\nTABLE V CROSS-DATASET WSVAD RESULTS ON XD-VIOLENCE AND CCTV-FIGHTS_sub.\n| Train⇓ / Test⇒ | XD-Violence AP(%) | CCTV-Fights_sub AP(%) |\n|---|---|---|\n| XD-Violence | 86.04 | 69.24 |\n| CCTV-Fights_sub | 76.60 | 73.38 |\nPerformance comparison on CCTV-Fights_sub: The coarse-grained anomaly detection results on CCTV-Fights_sub are presented in Table II. For RGB-only methods, AVadCLIP∗ achieves 73.36% AP, surpassing the state-of-the-art VadCLIP and demonstrating the effectiveness of our approach in unimodal scenarios. For audio-visual scenarios, AVadCLIP further improves performance, outperforming all existing methods. These results indicate that incorporating audio information can further enhance anomaly detection performance, validating the effectiveness of cross-modal complementary information mining. We present the fine-grained anomaly detection results on CCTV-Fights_sub in Table IV, and it can be observed that AVadCLIP consistently outperforms all competitors at different IoU thresholds. Using RGB-only input, AVadCLIP∗ surpasses VadCLIP at IoU thresholds of 0.1, 0.2, and 0.5 (it trails at 0.3 and 0.4), achieving a 0.4% improvement in AVG. Similarly, AVadCLIP with audio leads in all metrics, increasing AVG by 1.5%. 
These results further highlight the effectiveness of multimodal learning in accurately locating anomalous boundaries and improving category prediction.\nCross-dataset Results: Table V presents the cross-dataset evaluation results of AVadCLIP on XD-Violence and CCTV-Fights_sub, aiming to assess its generalization capability across different domains. Despite being trained on one dataset and tested on another, AVadCLIP consistently achieves competitive performance, demonstrating strong robustness and transferability. For example, AVadCLIP trained on XD-Violence still achieves an AP of 69.24% when directly tested on the surveillance-oriented CCTV-Fights_sub, a drop of about 4.1% compared to the model trained specifically on that dataset (73.38%). These results highlight the model\u0026rsquo;s ability to generalize well to unseen data distributions and diverse anomaly scenarios. Overall, AVadCLIP achieves state-of-the-art performance in both unimodal and multimodal settings across coarse-grained and fine-grained anomaly detection tasks. The comprehensive results validate its effectiveness in leveraging audio-visual collaboration and demonstrate the feasibility of the uncertainty-driven distillation strategy.\nD. Ablation Studies # The effect of audio-visual adaptive fusion: From Table VI, it can be observed that the introduction of audio-visual fusion improves coarse-grained detection performance, raising the AP from 79.85% to 82.9%. Furthermore, Table VII presents the impact of different audio-visual fusion strategies on anomaly detection performance. First, the cross attention fusion performs poorly in the WSVAD task, indicating that although it can capture the relationships between modalities, its complex parameterized design may negatively impact the generalization ability of the CLIP model in downstream WSVAD tasks. Next, the simple element-wise addition strategy achieves an AP of 83.02% and an AVG of 27.66%. Then, the concatenation with linear projection approach improves the AP to 83.36% and the AVG to 28.88%, indicating that enhancing feature representation through linear transformation facilitates more effective cross-modal information capture. Finally, our proposed adaptive fusion strategy achieves the best AP of 86.04%, outperforming the other three methods on the whole. This demonstrates that our adaptive fusion strategy, as a lightweight and effective fusion strategy, can more fully exploit complementary information between audio and visual modalities.\nFig. 4. Coarse-grained and Fine-grained WSVAD visualization results of AVadCLIP and the baseline model on XD-Violence.\nTABLE VI EFFECTIVENESS OF DESIGNED MODULES ON XD-VIOLENCE.\n| AV Fusion | AV Prompt | L_FOCAL | AP(%) | AVG(%) |\n|---|---|---|---|---|\n| × | × | × | 79.85 | 27.89 |\n| √ | × | × | 82.9 | 26.63 |\n| √ | √ | × | 86.18 | 26.79 |\n| √ | √ | √ | 86.04 | 28.61 |\nTABLE VII EFFECTIVENESS OF AUDIO-VISUAL FUSION ON XD-VIOLENCE.\n| Method | AP(%) | AVG(%) |\n|---|---|---|\n| Cross Attention | 75.15 | 10.51 |\n| Element-wise Addition | 83.02 | 27.66 |\n| Concat+Linear Projection | 83.36 | 28.88 |\n| Adaptive Fusion | 86.04 | 28.61 |\nTABLE VIII EFFECTIVENESS OF UKD ON XD-VIOLENCE.\n| Method | AP(%) | AVG(%) |\n|---|---|---|\n| Audio Model w/o UKD | 50.89 | 12.2 |\n| Audio Model w/ UKD | 52.51 | 13.5 |\n| Visual Model w/o UKD | 84.6 | 22.92 |\n| Visual Model w/ UKD | 85.53 | 27.44 |\n| Audio-Visual Model | 86.04 | 28.61 |\nTABLE IX EFFECTIVENESS OF UKD ON CCTV-FIGHTS_sub.\n| Method | AP(%) | AVG(%) |\n|---|---|---|\n| Visual Model w/o UKD | 67.89 | 10.65 |\n| Visual Model w/ UKD | 73.36 | 11.02 |\n| Audio-Visual Model | 73.38 | 12.17 |\nThe effect of audio-visual prompt and L_FOCAL: As presented in Table VI, the baseline model achieves an AP of only 79.85%. Integrating the audio-visual prompt on top of the adaptive fusion mechanism significantly enhances performance, increasing the AP to 86.18%. This improvement underscores the effectiveness of the audio-visual prompt in capturing critical multimodal patterns, thereby facilitating more precise anomaly recognition. 
Furthermore, incorporating focal loss into the model contributes to refining anomaly boundary detection, leading to more stable performance in fine-grained anomaly localization. In summary, the audio-visual prompt primarily enhances coarse-grained anomaly detection, and focal loss further refines boundary precision, enabling the model to achieve the best AVG (28.61%) while keeping the AP (86.04%) close to its peak.\nThe effect of uncertainty-driven distillation: As shown in Table VIII, the proposed UKD mechanism significantly enhances anomaly detection performance in both visual-only and audio-only models. Specifically, in the visual-only setting, UKD achieves a 0.9% improvement in AP and a 4.5% increase in AVG, attaining performance levels comparable to the teacher model trained with audio-visual inputs. Similarly, the audio-only model also benefits from UKD, exhibiting consistent performance gains. These results highlight the effectiveness of UKD in leveraging data uncertainty to enhance the robustness of unimodal representations during the distillation process, making it particularly well-suited for real-world applications where modality incompleteness is prevalent. In addition, we present a comparison of the effectiveness of the UKD module on CCTV-Fights_sub in Table IX. The results show that the proposed UKD mechanism significantly improves the anomaly detection performance of the unimodal model. Notably, in the visual-only scenario, adding UKD improves AP by 5.5%, achieving performance comparable to that of the audio-visual model. This finding further demonstrates the effectiveness of UKD in enhancing the robustness of unimodal representations through data uncertainty.\nFig. 5. Coarse-grained WSVAD visualization results of AVadCLIP, AVadCLIP∗, and the baseline model on XD-Violence.\nFig. 6. Coarse-grained WSVAD visualization results of AVadCLIP, AVadCLIP∗, and the baseline model on CCTV-Fights_sub.\nE. 
Qualitative Results # In Figure 4, we present the qualitative visualizations of AVadCLIP and the baseline model for both coarse-grained and fine-grained WSVAD. The blue curves denote the anomaly predictions by AVadCLIP, whereas the yellow curves represent those by the baseline model (RGB-only w/o UKD). As illustrated, compared to the baseline model, AVadCLIP significantly reduces anomaly confidence in normal video segments, thereby enhancing its ability to distinguish between abnormal and normal regions more accurately. The fine-grained map below also indicates that AVadCLIP can predict categories with greater precision. Notably, the observed performance improvement supports our hypothesis that audio information is more advantageous under visual occlusion (shooting) or in acoustically dominant scenes (explosion), and can effectively eliminate ambiguity between visually similar patterns in anomaly detection scenes, thereby ensuring more robust detection performance.\nIn addition, we compare the coarse-grained visualization results of the baseline model (RGB-only w/o UKD), the student model AVadCLIP∗, and the teacher model AVadCLIP on XD-Violence, as shown in Figure 5. Experimental results show that AVadCLIP significantly outperforms the other two counterparts. By using this model as a teacher model to guide the unimodal student model, it effectively mitigates anomaly confidence biases, steering them towards more accurate detection results. To a certain extent, this demonstrates the robustness of our proposed method.\nIn order to demonstrate the superiority of our proposed method more comprehensively and intuitively, Figure 6 shows the coarse-grained visualization results on the CCTV-Fights_sub dataset (since this dataset only includes the \u0026ldquo;Fighting\u0026rdquo; category, fine-grained visualizations are not provided). 
It can be seen that our method achieves significantly higher anomaly confidence scores in abnormal regions and notably lower scores in normal regions compared to the baseline model. This demonstrates that the integration of audio and video information can still yield substantial performance improvements in complex scenes. Besides, as can be seen from the last two rows, the unimodal model distilled with the UKD mechanism shows significantly fewer false positives compared to the baseline, demonstrating that the UKD mechanism effectively transfers audio-visual multi-modal knowledge into the unimodal model.\nV. CONCLUSION # In this work, we propose a novel weakly supervised framework for robust video anomaly detection using audio-visual collaboration. Leveraging the powerful representation ability and cross-modal alignment capability of CLIP, we design two distinct modules to achieve efficient audio-visual collaboration and multimodal anomaly detection, based on the frozen CLIP model. Specifically, to seamlessly integrate audio-visual information, we introduce a lightweight fusion mechanism that adaptively generates fusion weights based on the importance of audio to assist visual information. Additionally, we propose an audio-visual prompt strategy that dynamically refines text embeddings with key multimodal features, strengthening the semantic alignment between video content and corresponding textual labels. To further bolster robustness in scenarios with missing modalities, we develop an uncertainty-driven distillation module that synthesizes audio-visual representations from visual inputs, focusing on challenging features. Experimental results across two benchmarks demonstrate that our framework effectively enables video-audio anomaly detection and enhances the model\u0026rsquo;s robustness in scenarios with incomplete modalities. 
In the future, we will explore the integration of additional modalities (e.g., textual description) based on VLMs to achieve more robust video anomaly detection.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/avadclip-audio-visual-collaboration-for-robust-video-anomaly-detection/","section":"Papers","summary":"A novel weakly supervised framework leveraging audio-visual collaboration to improve the robustness and accuracy of video anomaly detection.","title":"AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection","type":"method"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/baoliang-chen/","section":"Authors","summary":"","title":"Baoliang Chen","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/bardia-safaei/","section":"Authors","summary":"","title":"Bardia Safaei","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/behzad-dariush/","section":"Authors","summary":"","title":"Behzad Dariush","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/benfeng-wang/","section":"Authors","summary":"","title":"Benfeng Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/bin-guo/","section":"Authors","summary":"","title":"Bin Guo","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/boyuan-du/","section":"Authors","summary":"","title":"Boyuan Du","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/canhui-tang/","section":"Authors","summary":"","title":"Canhui Tang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/changxin-gao/","section":"Authors","summary":"","title":"Changxin Gao","type":"authors"},{"content":"","date":"1 October
2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chen-gao/","section":"Authors","summary":"","title":"Chen Gao","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chenchen-tao/","section":"Authors","summary":"","title":"Chenchen Tao","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/cheng-fang/","section":"Authors","summary":"","title":"Cheng Fang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chenggang-wang/","section":"Authors","summary":"","title":"Chenggang Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chengliang-liu/","section":"Authors","summary":"","title":"Chengliang Liu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chenxu-wang/","section":"Authors","summary":"","title":"Chenxu Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chirui-chang/","section":"Authors","summary":"","title":"Chirui Chang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chong-wang/","section":"Authors","summary":"","title":"Chong Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/chuchu-han/","section":"Authors","summary":"","title":"Chuchu Han","type":"authors"},{"content":" CLIP: Assisted Video Anomaly Detection # Meng Dong a\nSchool of Electrical and Electronic Engineering, Nanyang Technological University, Singapore\nKeywords:\nCLIP, Video Anomaly Detection.\nAbstract:\nAs the main application of intelligent monitoring, video anomaly detection in surveillance has been well developed but remains 
challenging. The variety of anomaly types calls for dedicated detectors in general domains, whereas users in specific domains may need to define normal and abnormal situations through descriptions, such as \u0026ldquo;pedestrian no entry\u0026rdquo; or \u0026ldquo;people fighting\u0026rdquo;. Moreover, the anomalies in unseen videos are usually absent from the training datasets. Conventional techniques based on computer vision or machine learning are typically data-intensive or limited to specific domains. To develop a generalized framework for intelligent monitoring, we introduce generative anomaly descriptions that complement the visual branch and make it possible to adapt to specific application domains. In particular, we adopt contrastive language-image pre-training (CLIP) with generative anomaly descriptions as our general anomaly detector. Unlike state-of-the-art methods, this work uses category-level anomaly descriptions rather than simple category names as language prompts. A temporal module is developed on top of CLIP to capture the temporal correlations of anomaly events. Besides the above frame-level anomaly detection, we support the detection of object-centric anomalies for some specific domains. Extensive experimental results show that the novel framework offers state-of-the-art performance on the UCF-Crime and ShanghaiTech datasets.\n1 INTRODUCTION # The concept of automatic video surveillance, which could take over the role of human monitors, has attracted increasing attention with the spread of surveillance cameras. Developing highly discriminative anomaly detectors has become a major challenge for video anomaly detection (VAD) due to the characteristics of surveillance videos. There are unlimited unknown anomaly cases in real-time, 24/7 scenarios. Ideally, well-trained models could be updated whenever newly defined or undefined cases emerge. 
However, each update comes at the cost of frame annotation and of obtaining anomaly data.\nAccording to the supervision setting of the training datasets, there are three common kinds of anomaly detection methods: one-class classification (OCC), weakly supervised, and unsupervised. Both hand-crafted features (Medioni et al., 2001; Piciarelli et al., 2008) and deep features extracted using pre-trained models (Ravanbakhsh et al., 2017; Sun and Gong, 2023) have been explored in recent works. However, it is challenging for OCC approaches to classify well-reconstructed anomalous testing data, since training only on normal-class data and excluding anomalies may yield an ineffective classifier boundary. Weakly supervised approaches are proposed to address the above limitations: video-level labeled abnormal data, combined with normal data, are used in the training process (Tian et al., 2021; Cho et al., 2023; Zhang et al., 2023). Specifically, a video is labeled as normal if its contents are normal; otherwise, it is labeled as anomalous. In real applications, it is impractical to annotate all surveillance videos, especially raw footage recorded 24/7. Some works (Zaheer et al., 2022; Tur et al., 2023) explore unsupervised anomaly detection on unlabeled training datasets. Despite their impressive success in learning highly discriminative anomaly boundaries, these works face enormous challenges, such as rare normal samples in the testing data and domain-specific anomalies.\nDOI: 10.5220/0012356300003654\nUsually, anomaly events capture the interactions between actions/activities and entities over time. Rich prior knowledge of actions can provide extra context or semantic information for anomaly detection. Naturally, the prevalent vision-language models (VLMs), e.g., the CLIP model (Radford et al., 2021) and its variants, have attracted our attention. Their discriminative visual-language representations have also demonstrated success in related tasks, such as video understanding, captioning, and event localization (Wang et al., 2021; Li et al., 2022a; Guzhov et al., 2021; Xu et al., 2021). Recently, Joo et al. (2022) adopted the ViT-encoded visual features of CLIP to detect anomalies, without considering the semantic knowledge shared between vision and language. Language prompts can provide rich and powerful prior knowledge for activity localization, such as the objects, humans, and interactions in the scene. However, category labels are usually adopted as language prompts in current CLIP-related works. Simple category names or labels may be insufficient to reveal complex interactions in real-world scenarios. For example, we prefer a comprehensive description such as \u0026ldquo;a man being cruel to a dog\u0026rdquo; to the single word \u0026ldquo;Abuse\u0026rdquo;. Furthermore, the visual features from CLIP are designed for images rather than videos, and temporal dynamics are usually ignored or not fully explored.\nTo address the above challenges, we propose a novel framework for general anomaly detection on top of CLIP. Figure 1 (a) depicts the conventional approaches that explore discriminative classifiers or boundaries for extracted representations, (b) shows the standard CLIP, and (c) demonstrates our framework based on two developed modules: a temporal module and generative anomaly descriptions. In particular, we introduce generative anomaly descriptions instead of labels for the text encoder. Besides, a learnable prompt (Zhou et al., 2022) is adopted as the context of the anomaly descriptions for each category. To obtain discriminative representations from spatial-temporal correlations, a temporal module combining a local transformer and a lightweight GCN is introduced on top of the visual encoder to capture local and global temporal correlations. 
To evaluate the proposed temporal module, we further introduce frame-level and original CLIP-based visual representations as baselines. To obtain accurate category-level anomaly descriptions, both human-related and non-human-related, ChatGPT (Cha, ), one of the large language models (LLMs), is adopted to generate the language prompts used by the framework. We evaluate our proposed framework on two datasets, ShanghaiTech (Liu et al., 2018a) and UCF-Crime (Sultani et al., 2018). The experimental results show that the temporal module enhances performance and that the generative anomaly descriptions achieve superior results compared to category-level prompts.\nFurthermore, given the variety of anomaly types, frame-level features fail in scenarios with complex backgrounds. To reduce this bias (Liu and Ma, 2019), some object-centric approaches (Georgescu et al., 2021b) leverage the object\u0026rsquo;s appearance (Georgescu et al., 2021a; Georgescu et al., 2021b; Sabokrou et al., 2017), motion, or skeleton (Li et al., 2022b; Yang et al., 2022) to further improve performance. Anomaly detection is performed for each detected object. Once one detected object is abnormal, the whole frame is determined to be abnormal. However, such methods require additional costs for optical flow estimation in the inference process. To address this, we fine-tune our framework to be background-agnostic by switching from whole-frame mode to object-centric mode.\nThis work makes the following contributions. (1) We introduce a novel generalized anomaly detection framework based on CLIP, with the proposed generative anomaly descriptions and temporal adapter. It allows user-specific anomaly definitions through the anomaly descriptions module. (2) We adapt our generalized framework to support object-centric anomaly detection and overcome the bias caused by complex backgrounds. 
(3) Experiments on two video datasets illustrate the superior performance of our framework.\nFigure 1: Comparison of different frameworks of anomaly detection.\n2 RELATED WORK # Both hand-crafted features and deep features extracted using pre-trained models have been explored in recent works. However, it is challenging for OCC approaches to classify well-reconstructed anomalous test data, since training only on normal-class data and excluding anomalies may yield an ineffective classifier boundary. All these works assume that all or most of the collected training data is normal. However, rare normal samples in the testing data will then be classified as anomalies. Two techniques are commonly used in anomaly detection. (1) Reconstruction-based approaches, such as autoencoders (AE) (Hasan et al., 2016; Lv et al., 2021), memory-augmented AEs (Park et al., 2020), and generative models (Liu et al., 2018a), reconstruct the current frame (Ionescu et al., 2019a) or predict a future frame; frames with high reconstruction errors are detected as anomalies. (2) Distance-based approaches often adopt a one-class SVM (Ionescu et al., 2019a; Ionescu et al., 2019b) or a Gaussian mixture model (Sabokrou et al., 2017; Li et al., 2015) to compute a decision boundary, and anomalies deviate from normality. Most reconstruction-based or distance-based approaches that learn frame-level features fail in complex backgrounds. To reduce this bias (Liu and Ma, 2019), some object-centric approaches (Georgescu et al., 2021b) leverage the object\u0026rsquo;s appearance (Georgescu et al., 2021a), motion, or skeleton (Li et al., 2022b; Yang et al., 2022) to further improve performance. They perform anomaly detection for each object found by an object detector. When at least one detected object is determined to be abnormal, the frame is deemed to contain an abnormal situation. 
However, such methods require additional costs for optical flow estimation in the inference process. Furthermore, it would apply to abnormal situations, such as explosion and arson in the UCF-Crime dataset (Sultani et al., 2018). Even though users expect these anomaly detectors to be background-agnostic, some anomalies are scene-dependent. Novel scene-aware approaches (Bao et al., 2022; Cao et al., 2022) have emerged for such cases.\nIn this work, CLIP-based anomaly detection is a frame-level scheme, and we introduce a human-centric skeleton branch to make the framework background-agnostic.\n3 METHOD # The proposed anomaly detector has two branches: visual and text. For the visual branch, visual representations are captured in two ways, frame-level and video-level, with different temporal adapters. For the text branch, we adopt anomaly descriptions instead of category names, and a learnable prompt is then used as the context of the anomaly descriptions. Furthermore, ChatGPT (Cha, ) is adopted in this work to generate normal and abnormal descriptions for each scenario to cover a wide range of general anomalies.\n3.1 Generative Anomaly Descriptions # In this branch, the anomaly descriptions are not taken only from the dataset labels. We adopt generative anomaly descriptions that can cover a wide range of anomalies in both general and specific scenarios. Furthermore, these descriptions can comprehensively cover the interactions between actions and entities over time and contain rich prior knowledge about the activities in the scene. Therefore, the goal of this branch is to provide prior information about anomalies and to complement the visual branch, so that the generalized anomaly detection network can work with limited data and be adapted to specific domains by users.\n3.1.1 Category-Level Anomaly Descriptions # Currently, most public datasets use a single word to annotate complex real-world scenarios. 
However, some actions and activities are similar across different labels, which makes class boundaries less discriminative. Examples are shoplifting, stealing, and burglary in the UCF-Crime (Sultani et al., 2018) dataset, whose 13 abnormal labels cover most real-world scenarios. Some of the categories are intuitive, while some are ambiguous. In this work, we substitute some categories with the anomaly descriptions in Table 1 to pursue discriminative boundaries.\nTable 1: Samples of category-level Anomaly Descriptions.\n| Category | Anomaly Descriptions |\n| --- | --- |\n| abuse | child abuse, elder abuse, or animal cruelty |\n| arrest | police arrest |\n| arson | fire setting |\n| assault | street violence, bar fights |\n| theft | theft in street, theft in stores, or theft in buildings |\n| road accidents | traffic accidents involving vehicles, pedestrians or cyclists |\n| vandalism | break windows, remove or damage road signs |\n3.1.2 Generative Anomaly Descriptions # ChatGPT, based on GPT-4, is trained on a large corpus of online text, and we assume that the obtained descriptions should be explicit for situations of each typical location, such as normal and abnormal cases. Repetitive and ambiguous descriptions are filtered out to obtain clear, clean, and relevant normal and abnormal descriptions. Even though the obtained anomaly descriptions are suitable for general domains, they may not be accurate in specific domains. Users can therefore modify the relevant anomaly descriptions based on their prior knowledge. For example, the scenarios in UCF-Crime (Sultani et al., 2018) could cover and simulate the general domains. In a specific domain, taking ShanghaiTech (Liu et al., 2018a) as an example, the dataset contains a pedestrian-only zone, so bicycles, vehicles, and running pedestrians are forbidden there, while they are normal cases in UCF-Crime (Sultani et al., 2018). Table 2 shows some samples for normal and abnormal cases. 
So, based on generative anomaly descriptions, users can define their own specific anomalies.\n3.1.3 Learnable Prompts # Usually, categories or descriptions are short words or phrases. They are terse compared to event captions or sentences that summarize abnormal events. In this section, we add learnable prompts (Zhou et al., 2022) to the description embeddings for robust scalability of the text encoder. To evaluate the combination of description embedding and learnable prompts, we consider different settings. The descriptions are first transformed by the CLIP tokenizer,\nwhere \u0026ldquo;description\u0026rdquo; is the anomaly description. The class-specific concatenation [learnable prompt][description] and the shareable [learnable prompt] for all descriptions are formed as follows:\nwhere {c_1, \u0026hellip;, c_l} are learnable prompts containing l context tokens.\nTable 2: Samples on normal/abnormal descriptions generated by ChatGPT.\n| Normal | Abnormal |\n| --- | --- |\n| deliverymen deliveries | loitering |\n| cleaning crew working | unruly crowds |\n| walk the dog | fire |\n| pedestrian crossings | drug dealing |\n| children playing | shoplifting |\n| cleaning the sidewalk | hiding goods |\n| building access | assault |\n| Person chatting | fighting |\n| birds flying overhead | robbery |\n| sunrise | accidents |\n| routine patrols | falling down |\n| animals wandering around | smoke |\n| walking through the station | hit and run |\n| joggers | jaywalking |\n| guest leaving | vehicle collisions |\n| cashier bagging items | vehicle accidents |\n| cashier scanning | car theft |\n| restocking shelves | injuries |\n| running | burglary |\n| street performers | theft |\nForwarding the prompt t_p to the text encoder f_t(·), we obtain C classification vectors f_t^k ∈ R^d representing the concepts for the visual part:\n3.2 Video-Level Visual Features from CLIP # To obtain discriminative visual features, we extract visual features in two ways, video-level and frame-level. On top of the ViT encoder f(·) of CLIP, the temporal relationships are challenging for event detection. 
Given a video, snippets of T frames of size H×W are sampled as the input x ∈ R^(T×3×H×W). The T feature vectors f_v^i ∈ R^d, one for each frame x_i after f(·), are fed into the temporal module, where i ∈ {1, 2, ···, T} and d is the dimension of the feature vectors.\nThe temporal module consists of the local transformer and GCN layers, imposed on top of frame-level CLIP features. In particular, frame-level features are split into equal-length local windows (T frames), and self-attention is conducted within each window. Furthermore, a lightweight GCN, proven in many anomaly detection works (Wu et al., 2020; Zhong et al., 2019), is introduced after the local transformer to capture global temporal correlations. In this way, both long-range and short-range temporal dependencies in video can be captured. The overall framework of our anomaly detector is shown in Figure 2.\n3.2.1 Local Transformer Encoder # The T frame-level features f_v^i ∈ R^d are fed into a local temporal model g(·), consisting of several Transformer encoders, to explore temporal correlations and obtain the visual representation f_V^l ∈ R^d:\nwhere f_v^0 and e represent learnable vectors for the class token and position embedding.\nTaking the class-specific concatenation of learnable prompt and description as an example, given the prompt t_k and the feature vector f_t^k, the probability of prediction can be obtained as:\nwhere τ is a temperature parameter and cos(·, ·) denotes cosine similarity.\n3.2.2 Global Temporal Adapter # To model global temporal dependencies of consecutive images, a lightweight GCN, proven in many anomaly detection works (Wu et al., 2020; Zhong et al., 2019), is introduced after the local transformer to capture global temporal correlations. In this way, both long-range and short-range temporal dependencies in video can be explored. 
Similar to (Wu et al., 2020), we use relative distance and feature similarity to model global temporal dependencies, as follows:\nwhere M_sim and M_dis are the adjacency matrices, f_V^l is the video feature from the local transformer, and W is a weight for transforming feature spaces that can be learnable. Feature similarity is used to calculate the adjacency matrix as follows:\nPosition distance captures long-range dependencies, and the adjacency matrix between the i-th and j-th frames is calculated as follows:\nwhere the hyperparameter σ controls the influence range of the distance relation.\nFor video-level anomaly confidence, we adopt the alignment map M, which expresses the similarity between frame-level video features and anomaly class embeddings. Following the definition of M, the top-k similarities are selected and averaged to get the similarity between the video and the current class. Finally, S = {s_1, \u0026hellip;, s_m} is obtained to represent the similarity between the video and all anomaly classes. The highest score pairs the video with its class. The prediction for the j-th class is:\nwhere τ is the temperature hyper-parameter, and the alignment loss L_ali can be computed by the cross entropy. An additional contrastive loss is used to push the embeddings of abnormal classes away from the normal ones as follows:\nwhere t_n and t_a represent the embeddings of normal and abnormal classes.\nFinally, the total loss for video level is given by:\n3.3 Frame-Level Visual Features from CLIP # To bridge CLIP to anomaly detection comprehensively, we further conduct frame-level anomaly detection. The generative descriptions from ChatGPT about normal and abnormal cases are fed into the text encoder of CLIP to obtain normalized text features f_t^k, k = 1, \u0026hellip;, N, where N is the number of descriptions. CLIP \u0026ldquo;ViT-B/32\u0026rdquo; is selected in this work to produce the image and text features, f_v^i and f_t^k, and the feature dimension is set to 512. 
Figure 3 depicts the frame-level anomaly detection framework. In particular, we extract the whole-frame feature for UCF-Crime and ShanghaiTech. For the ShanghaiTech dataset, the object regions from the object detector are additionally used to extract features for background-agnostic anomaly types. For each normalized image feature f_v^i, the cosine similarities with f_t^k are computed.\nFigure 2: Proposed video-level framework of anomaly detection.\nWe slightly fine-tune the similarity calculation to adapt VLMs to zero-shot tasks. Trainable parameters are introduced to modify the calculation of similarity in CLIP for the k-th text description:\nwhere A_k ∈ R^(512×512) is a diagonal matrix, b_k is a scalar, and m is set to 0.01 in this work. A_k and b_k can be trained by gradient descent on the total loss function. The initial values of A_k and b_k are set to the identity matrix and zero, respectively. We use W(·, ·) to denote the similarity between two normalized vectors, which is then fed to a softmax. The softmax outputs of all abnormal descriptions are summed to obtain the frame-level or object-level anomaly score score(x):\nwhere C_a is the index set of the anomaly descriptions. The frame or detected object is classified as abnormal when the score exceeds a predefined threshold. To explore the temporal correlations of abnormal activities, we further introduce a simple majority voting scheme that assesses multiple frames for a more accurate score than single frames give. We apply the InfoNCE loss for the CLIP-based method:\nwhere λ_j is the loss weight for each x_j. In this work, λ_j = 1 and T is set to 1. A simple majority voting method is applied for event classification to explore the temporal relationship between consecutive frames in experiments.\n4 EXPERIMENT # 4.1 Datasets # To evaluate the proposed anomaly detection in surveillance settings, we use two public anomaly datasets, UCF-Crime and ShanghaiTech, summarized in Table 3. 
Abnormal situations in UCF-Crime are captured from various locations and scenarios (abuse, arrest, arson, assault, burglary, explosion, fighting, road accidents, robbery, shooting, shoplifting, stealing, and vandalism). It involves accidents and crimes that happen frequently in public. Most anomalies in ShanghaiTech are pedestrian-based. It captures 13 different scenes and contains 130 abnormal events with various numbers of people.\n4.2 Experiment Setting # The frozen image and text encoders are the pre-trained CLIP visual and text encoders, ViT-B/32. σ is set to 1, τ is set to 0.07, and the window length in the local transformer and GCN is 8. λ in the final loss equation is set to 1×10^−1. All experiments are implemented in PyTorch 1.12 on an Intel Core i9 CPU with 32 GB of RAM and an NVIDIA GeForce RTX 3060 with 24 GB of VRAM. The Adam optimizer (Kingma and Ba, 2014) is used with batch size 64. The learning rate is 1×10^−5.\nFigure 3: Proposed frame-level framework of anomaly detection.\nTable 3: Summary of anomaly datasets in this work.\n| Dataset | Description | Video and Duration | Annotation Types |\n| --- | --- | --- | --- |\n| ShanghaiTech (Liu et al., 2018b) | Person based abnormal situations in Campus | 437 videos, 317,398 frames | Frame-level, Pixel-Level |\n| UCF-Crime (Sultani et al., 2018) | 13 categories of abnormal situations | 1,900 videos, 128 hours | Video-Level |\n4.3 Comparison with State-of-the-Art # To evaluate our proposed framework, several state-of-the-art methods are chosen as references, including weakly supervised, unsupervised, fully supervised, and OCC methods on UCF-Crime and ShanghaiTech, shown in Table 4 and Table 5, respectively. In this work, the final anomaly detection result is calculated from the similarities between the visual features and all anomaly texts. To compare with conventional classifiers, we set two benchmarks that use a CLIP image encoder as a feature extractor followed by a linear classifier. Furthermore, our temporal module is also added to explore temporal relationships. 
From the two tables\u0026rsquo; results, our CLIP-text-based methods outperform the CLIP-classifier-based benchmark by more than 2% on both datasets, which also shows that the text branch is an effective complement. Besides, with the help of temporal modules, CLIP-based features are more discriminative than C3D and I3D features on both UCF-Crime and ShanghaiTech, because the latter are designed for action recognition tasks. The complex background also influences feature extraction. Compared with the CLIP-based method (Joo et al., 2022), our proposed method achieves comparable results on the ShanghaiTech dataset. On UCF-Crime, our results with the local transformer alone and with the added GCN both outperform it by up to about 1%, owing to the text branch. Furthermore, we also use frame-level multi-frame CLIP features to explore temporal dependencies between adjacent frames. Our frame-based method is slightly inferior to our video-based method because the simple majority voting scheme still lacks temporal relations. The results demonstrate that CLIP scales effectively to the downstream task of anomaly detection.\n4.4 Ablation Study # An exhaustive ablation analysis is conducted in this work to evaluate the effectiveness of individual components in our framework. In particular, we first compare the category prompt with the proposed anomaly description prompt, including a comparison of different prompts and different settings for the learnable context. Then, the temporal module variants are compared.\n4.4.1 Evaluation of Prompt # Table 6 shows the results of different prompts. In all cases, the learnable prompt length is set to 8 and the temporal module is the local transformer. 
In vision-language models (VLMs), the prompt helps adapt the model to specific tasks.\nTable 4: Comparisons with state-of-the-art on UCF-Crime Dataset.\n| Supervised Way | Method | Feature | AUC(%) |\n| --- | --- | --- | --- |\n| Un | (Wang and Cherian, 2019) | I3D | 70.46 |\n| Un | (Zaheer et al., 2022) | ResNext | 71.04 |\n| Fully | (Liu and Ma, 2019) | NLN | 82.0 |\n| OCC | (Schölkopf et al., 1999) | OCCSVM | 63.2 |\n| Weakly | (Purwanto et al., 2021) | TRN | 85.00 |\n| Weakly | (Thakare et al., 2022) | C3D+I3D | 84.48 |\n| Weakly | (Zhong et al., 2019) | TSN | 81.08 |\n| Weakly | (Tian et al., 2021) | C3D | 83.28 |\n| Weakly | (Wu et al., 2020) | C3D | 82.44 |\n| Weakly | (Tian et al., 2021) | I3D | 84.30 |\n| Weakly | (Wu and Liu, 2021) | I3D | 84.89 |\n| Weakly | (Joo et al., 2022) | CLIP | 87.58 |\n| Weakly | (Yu et al., 2023) | Pose | 64.63 |\n| Weakly | CLIP+Classifier | CLIP | 73.17 |\n| Weakly | CLIP+Local+Global+Classifier | CLIP | 86.17 |\n| Weakly | Ours-Video(Local) | CLIP | 88.13 |\n| Weakly | Ours-Video(Local+Global) | CLIP | 88.52 |\n| Weakly | Ours-Frame | CLIP | 86.62 |\nTable 5: Comparisons with state-of-the-art on ShanghaiTech Dataset.\n| Supervised Way | Method | Feature | AUC(%) |\n| --- | --- | --- | --- |\n| Un | (Zaheer et al., 2022) | ResNext | 78.93 |\n| Weakly | (Purwanto et al., 2021) | TRN | 96.85 |\n| Weakly | (Zhong et al., 2019) | TSN | 84.44 |\n| Weakly | (Tian et al., 2021) | C3D | 91.57 |\n| Weakly | (Tian et al., 2021) | I3D | 97.2 |\n| Weakly | (Wu and Liu, 2021) | I3D | 97.48 |\n| Weakly | (Joo et al., 2022) | CLIP | 98.32 |\n| Weakly | CLIP+Classifier | CLIP | 83.21 |\n| Weakly | CLIP+Local+Global+Classifier | CLIP | 94.17 |\n| Weakly | Ours-Video(Local) | CLIP | 97.31 |\n| Weakly | Ours-Video(Local+Global) | CLIP | 98.43 |\n| Weakly | Ours-Frame | CLIP | 95.02 |\nAs a baseline, we compare hand-crafted and learnable prompts on the two datasets with the same categories. Both achieve comparable results, and learnable prompts perform about 0.6% better on ShanghaiTech. 
Further, our description-based prompt shows that a learnable prompt with anomaly descriptions is more effective than one with category names.\n4.4.2 Evaluation of Variable Length # In this work, we further evaluate three variable settings: the length of the learnable prompt, the window length in the local transformer, and the depth of the transformer. Generally, a longer context/prompt length l should lead to better performance (Zhou et al., 2022), and there seems to be a golden rule for the optimal context length. The effectiveness of the temporal module has been verified in Table 4 and Table 5, so we also evaluate the depth of the transformer to select the optimal value. Usually, the temporal dependencies among consecutive frames decrease with the length of the window, especially for datasets annotated at the video level. We conducted three experiments for the analysis. As shown in Table 8, we first vary the context length over a range (4 to 32) with a fixed transformer depth (1) and a fixed window length (16 frames). The performance gradually improves up to a context length of 20 and decreases from 24 onward as more learnable vectors are added. Considering performance and (Zhou et al., 2022), we select 16 as the optimal context length for the two datasets. For deeper transformers, the AUC decreases while the network cost rises and model generalization drops. We therefore select a 1-layer transformer to model local temporal dependency. From the results, the performance is robust over a range of window lengths (8 to 64) and decreases with a longer window. These results also reveal that a single local transformer is not very effective for longer video temporal correlations. The combination of the local transformer and the global temporal adapter is the optimal one. Considering the duration of activities in the datasets and the GCN introduced in the temporal module, we select an intermediate value (16) for the window length in this work.\nTable 6: Comparisons of Different Prompts.\n| Prompt | UCF-Crime AUC(%) | ShanghaiTech AUC(%) |\n| --- | --- | --- |\n| a photo of [Category] | 87.43 | 96.20 |\n| Learnable Prompt+[Category] | 87.66 | 96.81 |\n| Learnable Prompt+[Description] | 88.19 | 97.32 |\nTable 7: Performance of our framework on Object-centric and Frame-level.\n| Mode | Method | Step | Time |\n| --- | --- | --- | --- |\n| Object-centric | Ours | CLIP feature calculation | 20.49 ms |\n| Object-centric | Ours | Similarity calculation | 0.2 ms |\n| Object-centric | Ours | Total | 20.69 ms |\n| Object-centric | (Georgescu et al., 2021b) | Optical flow calculation | 57.93 ms |\n| Object-centric | (Georgescu et al., 2021b) | Prediction | 4.57 ms |\n| Object-centric | (Georgescu et al., 2021b) | Total | 62.5 ms |\n| Frame-based | Ours | CLIP feature calculation | 4.79 ms |\n| Frame-based | Ours | Total | 4.98 ms |\n| Frame-based | (Zaheer et al., 2022) | ResNext feature calculation | 18.89 ms |\n| Frame-based | (Zaheer et al., 2022) | Total | 19.02 ms |\n4.4.3 Evaluation of the Position of Learnable Prompt # To evaluate the combination of description and learnable prompts, we tested two settings. The first is class-specific, in the form [Learnable prompt][description], where each description has its own learnable prompt. The second is a shareable [learnable prompt] for all descriptions (middle). The results on the two datasets are shown in Table 9. The [Learnable prompt][description] combination achieves better results, as a class-specific prompt provides more semantic information than a context shared by all classes. The context length is set to 16.\n4.4.4 Evaluation of the Temporal Module # The above results have shown the effectiveness of the temporal module. 
To further evaluate the local transformer and the global temporal adapter, we conduct an ablation analysis with three variants: (1) CLIP without the temporal module, (2) CLIP with only the local transformer, and (3) CLIP with the full temporal module (local transformer + global temporal adapter). From the results in Table 10, the global temporal adapter together with the local transformer captures more robust temporal correlations than the local transformer alone, improving AUC by about 0.9% on both datasets and by about 4% to 5% over CLIP without the temporal module. This holds even with a longer window, as Table 8 also shows, which makes the combination the best choice for temporal dependencies.\n4.5 Object-Centric CLIP Method # As mentioned before, some object-centric approaches (Georgescu et al., 2021b) leverage the object\u0026rsquo;s appearance (Georgescu et al., 2021a; Georgescu et al., 2021b; Sabokrou et al., 2017), motion, or skeleton (Li et al., 2022b; Yang et al., 2022) at the frame level to further improve performance by removing background bias. In this work, we additionally experiment on the ShanghaiTech dataset to evaluate object-centric performance. The frame is classified as abnormal when one detected object is abnormal. Figure 4 shows the anomaly scores of object-centric and frame-based CLIP on the ShanghaiTech dataset. The object-centric method gives more accurate anomaly scores during periods of abnormal events. Besides, we report an inference-time analysis in Table 7. Both the object-centric and frame-based methods have an efficient inference process. The object-centric method is run for each object, and we cap the number of objects at 20 in this work. 
Both the object-centric and frame-based methods are faster than the baseline, with inference times in the millisecond range for each module.\nTable 8: Comparisons of Different Variable Length AUC(%).\n| Context Number | Depth of Transformer | Window Length | UCF-Crime | ShanghaiTech |\n| --- | --- | --- | --- | --- |\n| 4 | 1 | 16 | 84.30 | 94.00 |\n| 8 | 1 | 16 | 85.21 | 95.2 |\n| 16 | 1 | 16 | 86 | 97.02 |\n| 20 | 1 | 16 | 86.37 | 97 |\n| 24 | 1 | 16 | 85.52 | 96.12 |\n| 32 | 1 | 16 | 84.31 | 95.21 |\n| 16 | 2 | 16 | 85.82 | 95.52 |\n| 16 | 3 | 16 | 85.5 | 95.34 |\n| 16 | 1 | 8 | 85.3 | 95.9 |\n| 16 | 1 | 32 | 86.56 | 97 |\n| 16 | 1 | 64 | 86.88 | 97.22 |\n| 16 | 1 | 128 | 86.39 | 96.8 |\nTable 9: Comparisons different positions of learnable context.\n| Anomaly Description Position | UCF-Crime AUC(%) | ShanghaiTech AUC(%) |\n| --- | --- | --- |\n| Middle | 87.45 | 96.26 |\n| End | 88.14 | 97.39 |\nTable 10: Comparisons Local Transformer and Global temporal adapter.\n| Method | UCF-Crime AUC(%) | ShanghaiTech AUC(%) |\n| --- | --- | --- |\n| w/o Temporal Module | 84.42 | 92.24 |\n| w/o GCN | 87.28 | 96.45 |\n| w Temporal Module | 88.17 | 97.33 |\nFigure 4: Anomaly scores in object-centric and frame-based on ShanghaiTech.\n5 CONCLUSION # In this work, we propose a novel framework for video anomaly detection based on CLIP. A local transformer and global temporal adapter are added to the frame-level features of CLIP to capture temporal dependencies. Furthermore, we present generative anomaly descriptions from ChatGPT to cover a wide range of anomalies in general and specific domains. The users can also modify the generative descriptions based on their prior knowledge. Several benchmarks for anomaly detection based on CLIP have been introduced to comprehensively evaluate the proposed generalized framework. The results also demonstrate the robustness and effectiveness of the proposed framework. To remove the background bias effects, we further proceed with the object-centric framework. The results have demonstrated the efficiency on detected regions. However, CLIP-based methods lack temporal dependencies, even with local transformers and global temporal adapter. 
In the future, we will explore video-level CLIP for potential further performance improvement.\nREFERENCES # ChatGPT. https://chat.openai.com/, March 2023.\nBao, Q., Liu, F., Liu, Y., Jiao, L., Liu, X., and Li, L. (2022). Hierarchical scene normality-binding modeling for anomaly detection in surveillance videos. In Proceedings of the 30th ACM International Conference on Multimedia, pages 6103–6112.\nCao, C., Lu, Y., and Zhang, Y. (2022). Context recovery and knowledge retrieval: A novel two-stream framework for video anomaly detection. arXiv preprint arXiv:2209.02899.\nCho, M., Kim, M., Hwang, S., Park, C., Lee, K., and Lee, S. (2023). Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12137–12146.\nGeorgescu, M.-I., Barbalau, A., Ionescu, R. T., Khan, F. S., Popescu, M., and Shah, M. (2021a). Anomaly detection in video via self-supervised and multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12742–12752.\nGeorgescu, M. I., Ionescu, R. T., Khan, F. S., Popescu, M., and Shah, M. (2021b). A background-agnostic framework with adversarial training for abnormal event detection in video. IEEE transactions on pattern analysis and machine intelligence, 44(9):4505–4523.\nGuzhov, A., Raue, F., Hees, J., and Dengel, A. (2021). Audioclip: Extending clip to image, text and audio.\nHasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A. K., and Davis, L. S. (2016). Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 733–742.\nIonescu, R. T., Khan, F. S., Georgescu, M.-I., and Shao, L. (2019a). Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7842–7851.\nIonescu, R. 
T., Smeureanu, S., Popescu, M., and Alexe, B. (2019b). Detecting abnormal events in video using narrowed normality clusters. In 2019 IEEE winter conference on applications of computer vision (WACV), pages 1951–1960. IEEE.\nJoo, H. K., Vo, K., Yamazaki, K., and Le, N. (2022). Cliptsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. arXiv preprint arXiv:2212.05136.\nKingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.\nLi, M., Xu, R., Wang, S., Zhou, L., Lin, X., Zhu, C., Zeng, M., Ji, H., and Chang, S.-F. (2022a). Clip-event: Connecting text and images with event structures. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16399–16408.\nLi, N., Chang, F., and Liu, C. (2022b). A self-trained spatial graph convolutional network for unsupervised human-related anomalous event detection in complex scenes. IEEE Transactions on Cognitive and Developmental Systems.\nLi, N., Wu, X., Guo, H., Xu, D., Ou, Y., and Chen, Y.-L. (2015). Anomaly detection in video surveillance via gaussian process. International Journal of Pattern Recognition and Artificial Intelligence, 29(06):1555011.\nLiu, K. and Ma, H. (2019). Exploring background-bias for anomaly detection in surveillance videos. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1490–1499.\nLiu, W., Luo, W., Lian, D., and Gao, S. (2018a). Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545.\nLiu, W., Luo, W., Lian, D., and Gao, S. (2018b). Future frame prediction for anomaly detection – a new baseline. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).\nLv, H., Chen, C., Cui, Z., Xu, C., Li, Y., and Yang, J. (2021). Learning normal dynamics in videos with meta prototype network. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15425–15434.\nMedioni, G., Cohen, I., Brémond, F., Hongeng, S., and Nevatia, R. (2001). Event detection and analysis from video streams. IEEE Transactions on pattern analysis and machine intelligence, 23(8):873–889.\nPark, H., Noh, J., and Ham, B. (2020). Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14372–14381.\nPiciarelli, C., Micheloni, C., and Foresti, G. L. (2008). Trajectory-based anomalous event detection. IEEE Transactions on Circuits and Systems for video Technology, 18(11):1544–1554.\nPurwanto, D., Chen, Y.-T., and Fang, W.-H. (2021). Dance with self-attention: A new look of conditional random fields on anomaly detection in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 173–183.\nRadford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.\nRavanbakhsh, M., Nabi, M., Sangineto, E., Marcenaro, L., Regazzoni, C., and Sebe, N. (2017). Abnormal event detection in videos using generative adversarial nets. In 2017 IEEE international conference on image processing (ICIP), pages 1577–1581. IEEE.\nSabokrou, M., Fayyaz, M., Fathy, M., and Klette, R. (2017). Deep-cascade: Cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Transactions on Image Processing, 26(4):1992–2004.\nSchölkopf, B., Williamson, R. C., Smola, A., Shawe-Taylor, J., and Platt, J. (1999). Support vector method for novelty detection. Advances in neural information processing systems, 12.\nSultani, W., Chen, C., and Shah, M. (2018). Real-world anomaly detection in surveillance videos. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488.\nSun, S. and Gong, X. (2023). Hierarchical semantic contrast for scene-aware video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22846–22856.\nThakare, K. V., Sharma, N., Dogra, D. P., Choi, H., and Kim, I.-J. (2022). A multi-stream deep neural network with late fuzzy fusion for real-world anomaly detection. Expert Systems with Applications, 201:117030.\nTian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J. W., and Carneiro, G. (2021). Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4975–4986.\nTur, A. O., Dall\u0026rsquo;Asen, N., Beyan, C., and Ricci, E. (2023). Exploring diffusion models for unsupervised video anomaly detection. arXiv preprint arXiv:2304.05841.\nWang, J. and Cherian, A. (2019). Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8201–8211.\nWang, M., Xing, J., and Liu, Y. (2021). Actionclip: A new paradigm for video action recognition. ArXiv, abs/2109.08472.\nWu, P. and Liu, J. (2021). Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing, 30:3513–3527.\nWu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., and Yang, Z. (2020). Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 322–339. Springer.\nXu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and Feichtenhofer, C. (2021). Videoclip: Contrastive pre-training for zero-shot video-text understanding. 
In Conference on Empirical Methods in Natural Language Processing.\nYang, Y., Fu, Z., and Naqvi, S. M. (2022). A two-stream information fusion approach to abnormal event detection in video. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5787–5791. IEEE.\nYu, S., Zhao, Z., Fang, H., Deng, A., Su, H., Wang, D., Gan, W., Lu, C., and Wu, W. (2023). Regularity learning via explicit distribution modeling for skeletal video anomaly detection. IEEE Transactions on Circuits and Systems for Video Technology.\nZaheer, M. Z., Mahmood, A., Khan, M. H., Segu, M., Yu, F., and Lee, S.-I. (2022). Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14744–14754.\nZhang, C., Li, G., Qi, Y., Wang, S., Qing, L., Huang, Q., and Yang, M.-H. (2023). Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16271–16280.\nZhong, J.-X., Li, N., Kong, W., Liu, S., Li, T. H., and Li, G. (2019). Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1237–1246.\nZhou, K., Yang, J., Loy, C. C., and Liu, Z. (2022). Learning to prompt for vision-language models. 
International Journal of Computer Vision, 130(9):2337–2348.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/clip-assisted/","section":"Papers","summary":"Proposes a generalized framework for video anomaly detection based on CLIP, introducing generative anomaly descriptions, temporal modules for capturing temporal correlations, and object-centric approaches to improve performance and robustness, with extensive experimentation on UCF-Crime and ShanghaiTech datasets.","title":"CLIP: Assisted Video Anomaly Detection","type":"method"},{"content":" Cross-Domain Learning for Video Anomaly Detection with Limited Supervision # Yashika Jain University of Delhi yashikajain201@gmail.com\nAli Dabouei * Carnegie Mellon University ali.dabouei@gmail.com\nAbstract # Video Anomaly Detection (VAD) automates the identification of unusual events, such as security threats in surveillance videos. In real-world applications, VAD models must effectively operate in cross-domain settings, identifying rare anomalies and scenarios not well-represented in the training data. However, existing cross-domain VAD methods focus on unsupervised learning, resulting in performance that falls short of real-world expectations. Since acquiring weak supervision, i.e., video-level labels, for the source domain is cost-effective, we conjecture that combining it with external unlabeled data has notable potential to enhance cross-domain performance. To this end, we introduce a novel weakly-supervised framework for Cross-Domain Learning (CDL) in VAD that incorporates external data during training by estimating its prediction bias and adaptively minimizing that using the predicted uncertainty. We demonstrate the effectiveness of the proposed CDL framework through comprehensive experiments conducted in various configurations on two large-scale VAD datasets: UCF-Crime and XD-Violence. 
Our method significantly surpasses the state-of-the-art works in cross-domain evaluations, achieving an average absolute improvement of 19.6% on UCF-Crime and 12.87% on XD-Violence.\n1. Introduction # Video anomaly detection (VAD) aims to locate anomalous events in videos [3, 10, 11, 15, 21, 25, 32, 33, 42, 47]. Unlike manual surveillance, which is costly and time-consuming, video anomaly detection eliminates the need for extensive human effort, saving resources and time. It holds significant potential for playing a vital role in video surveillance by identifying unusual behaviors and activities such as accidents, burglaries, explosions, and other events that signal security threats.\nCorresponding authors. Min Xu * Carnegie Mellon University mxu1@cs.cmu.edu\nFigure 1. Anomaly score comparison on a video of XD-Violence dataset, with and without employing the proposed CDL framework. The model trained without CDL on UCF-Crime as the weakly labeled set consistently yields high anomaly scores. In contrast, the model trained with CDL, using UCF-Crime as the weakly labeled set and HACS as the unlabeled set, is better able to localize the anomalous frames.\nVAD has been extensively studied previously [11, 15, 21, 32, 33, 47]. Owing to the high costs and time associated with obtaining frame-level labels, most approaches formulate the problem as either an unsupervised [10, 15, 21] or a weakly-supervised learning setup [11, 32, 33]. In the unsupervised (or one-class classification-based) learning setup, only normal videos are used to model the underlying distribution of normal spatiotemporal patterns, and any deviations from the modeled distribution are regarded as anomalies. Despite the convenience of the unsupervised setup, the lack of anomalous videos during training limits the model\u0026rsquo;s ability to learn the specific characteristics of anomalies. This results in limited performance that does not meet real-world expectations. 
To address this issue, the weakly-supervised setup has attracted significant attention. In this setup, only video-level labels indicating the presence of anomalies within the videos are incorporated as weak supervision to train models capable of making frame-level predictions at inference. Multiple Instance Learning (MIL) [32] is a prominent technique in this domain. By treating each video as a \u0026ldquo;bag\u0026rdquo; and each snippet as a \u0026ldquo;segment\u0026rdquo;, MIL-based algorithms operate under the premise of a worst-case scenario where the segment with the highest predicted probability of being abnormal is considered as the candidate to represent the whole video.\nIn real-world applications, it is inevitable to encounter environments and scenarios not fully represented in the model\u0026rsquo;s training set. However, it is essential that the model makes correct predictions in such novel situations. For instance, when the training data lacks samples of rare events like \u0026ldquo;riots\u0026rdquo; or accidents in novel scenes, the model should be able to characterize such occurrences as anomalous when they occur. Previous works study these novel situations under the cross-domain problem definition [3, 13, 23].\nExisting cross-domain VAD methods [3, 13, 23, 25] rely on unsupervised techniques and consequently exhibit limited performance, as demonstrated later in our empirical evaluations in Tables 2 and 3. A solution to this could be the adoption of weakly-supervised techniques for cross-domain VAD. While weakly-supervised approaches have proven promising in single-domain scenarios [11, 32, 33], their effectiveness in cross-domain scenarios has not been extensively explored. Our evaluations in Tables 2 and 3 suggest that directly employing existing weakly-supervised methods to address the cross-domain challenges results in a significant performance drop when tested in scenarios of even similar nature, such as surveillance videos. 
We argue that this performance gap is due to the following reasons. First, anomalous events, by their very nature, lack a specific pattern or predefined structure. Hence, the definition of anomaly is context-dependent and a naive adaptation of the previous method cannot capture the context-dependencies in multiple domains. Second, anomalous events are relatively infrequent, making VAD a class imbalance problem. This issue becomes more severe when dealing with multiple domains. Third, because of the limited amount of weakly labeled training data, the model\u0026rsquo;s learning capacity to detect novel (open-set) anomalies is also constrained. Due to these challenges, weakly-supervised methods cannot be readily applied to cross-domain or cross-dataset scenarios.\nTo overcome these challenges and develop a generalized VAD model, substantial amounts of weakly-labeled data are required. However, acquiring even video-level labels for a large number of videos is inefficient and labor-intensive. On the other hand, vast streams of unlabeled videos are generally available. Utilizing the limited weakly-labeled data alongside this abundant unlabeled data provides a notable opportunity to address the aforementioned challenges in cross-domain VAD. Prudent utilization of the unlabeled data can provide valuable insights into the underlying data distribution, leading to improved decision-making and identification of anomalous events.\nTo this end, we propose a weakly-supervised Cross-Domain Learning (CDL) framework for VAD that integrates external, unlabeled data from the wild with limited weakly-labeled data to provide competitive generalization across the domains. This is achieved by adaptively minimizing the prediction bias over the external data using the estimated prediction variance, which serves as an uncertainty regularization score. 
In the proposed framework, we first train fine-grained pseudo-label generation models on the weakly-labeled data to obtain sets of segment-level predictions for the external dataset. Second, we compute the variance of the predictions across multiple predictors as a proxy to represent uncertainty associated with the segments in the external data. Third, during the optimization process, involving training on both labeled and external data, we adaptively reweigh the bias on each external segment using the uncertainty regularization scores. This dynamic reweighing ensures that segments from the external dataset closer to the source dataset are emphasized during the training, while those with higher uncertainty are down-weighted. Finally, we iteratively regenerate pseudo-labels using the models trained on labeled and pseudo-labeled data, re-estimate the uncertainties, and re-train the model on the union of labeled and external datasets. This iterative process helps refine the pseudo-labels as the training progresses. With this training process, the model learns to generalize to both source and external data, given only supervision on the source data. Figure 1 illustrates the effectiveness of the CDL framework.\nTable 1. Brief overview of the taxonomy of current works for VAD using a source domain dataset (D) and a secondary domain dataset (D′). None of these methods uses labels for training on D′, and all assume distinct distributions for D and D′.\n| Method(s) | Sup. on D | Target |\n| --- | --- | --- |\n| Acsintoae et al. [1] | unsupervised | D |\n| rGAN [23], MPN [25] | unsupervised | D′ |\n| zxVAD [3] | unsupervised | D ∪ D′ |\n| Ours | weakly-supervised | D ∪ D′ |\nTo summarize, we make the following contributions:\nWe present a practical CDL framework for weakly-supervised VAD, in which unlabeled external videos are employed to enhance the cross-domain generalization of the model. 
We design a novel uncertainty quantification method that enables the adaptive uncertainty-driven integration of external videos into the training set.\nThrough extensive experiments and ablation studies on benchmark datasets, we validate the proposed approach, demonstrating state-of-the-art performance in cross-domain settings while retaining a competitive performance on the in-domain data.\n2. Related Works # Video Anomaly Detection (VAD). VAD is a well-established problem, with most works formulating it either as an unsupervised learning [15, 21, 22, 41, 44] or a weakly-supervised learning [29, 32, 33, 43, 48] problem. In unsupervised setups, the training data consists solely of normal videos, with the majority of works encoding normal patterns through techniques like frame reconstruction [15, 39], future frame prediction [21], dictionary learning [22, 44], and one-class classification [17, 24]. Any deviation from the encoded patterns is considered anomalous. Since the model categorizes anything beyond its learned representations as anomalous, it can label novel video actions and scenarios encountered during training but in altered environments as anomalous. Weakly-supervised VAD methods help mitigate these issues by incorporating video-level labels as weak supervision for the model, with the majority of methods utilizing the Multiple Instance Ranking Loss [11, 32, 35, 47]. Given that a VAD model is expected to encounter previously unseen scenarios during deployment, it is of paramount importance for the model to generalize well across domains. Previous works refer to this as cross-domain [3] or cross-dataset generalization [9]. We provide an overview of the existing works employing external data in VAD in Table 1. Previous works on cross-domain generalization focus on unsupervised methods based on few-shot target-domain scene adaptation. 
The works in [23, 25] employ data from the target domain via meta-learning to adapt to that specific domain. Aich et al. [3] proposed a zero-shot target domain adaptation method that incorporates external data to generate pseudo-abnormal frames. Despite the intriguing setup, these unsupervised cross-domain generalization methods lack explicit knowledge about what constitutes an anomaly, hindering the model\u0026rsquo;s ability to learn the specific characteristics of anomalies. To this end, we propose the use of weakly-supervised learning for cross-domain generalization. We integrate external datasets from diverse domains to enable the cross-domain generalization of a model trained in a weakly-supervised fashion.\nPseudo-Labeling and Self-training. Pseudo-labeling [4, 28] is a common technique where the model trained on labeled data assigns labels to unlabeled data. Subsequently, the model is trained on both the initially labeled data and the pseudo-labeled data. This self-training strategy [26, 40] operates iteratively, allowing the model to progressively enhance its generalization. In VAD, several works leverage pseudo-labeling and self-training for generating fine-grained pseudo-labels [11, 20, 42]. However, in contrast to the previous methods, instead of generating pseudo-labels for the weakly labeled data, we leverage pseudo-labels for incorporating the external data.\nUncertainty Estimation. To address pseudo-label noise, prior research in different contexts has explored uncertainty estimation using various approaches, such as data augmentation [5, 30], inference augmentation [12], and model augmentation [46]. While data augmentation is effective for images, it can disrupt temporal relationships in video frames and is not efficient for training on high-cardinality data like videos. 
On the other hand, inference augmentation methods, such as MC Dropout [12, 42], introduce perturbations during model inference to obtain slightly different predictions, but that is inefficient for training with fixed backbones. In contrast, model augmentation uses different models. Since different models may have varying biases and receptive fields, this would result in diverse predictions. This prediction discrepancy can help quantify uncertainty, making model augmentation well-aligned with our problem. To avoid any manual thresholding for learning from pseudo-labels during training, following [16, 46] we use adaptive reweighing of loss with uncertainty values. In [46], Zheng et al. quantify uncertainty by estimating discrepancies between predictions made by two classifiers using Kullback–Leibler (KL) divergence. However, given that VAD is a binary classification task, the divergence based on only two outcomes for the posterior probability is not optimally informative. Hence, we propose a method to quantify uncertainty in the high-dimensional feature space instead of the probability space.\n3. Method # 3.1. Problem Definition # In this work, we address a real-world VAD problem, where a weakly-labeled dataset D_l = {(X_l^i, Y_l^i)}_{i=1}^{n_l} and an external unlabeled dataset D_u = {X_u^i}_{i=1}^{n_u} are available for training. Here, n_l and n_u indicate the number of videos in the two datasets, respectively, with n_u ≫ n_l due to the convenience of gathering unlabeled video data. The video-level labels of X_l are denoted by Y_l ∈ {0, 1}. We do not make any assumption about the distributions of D_l and D_u, and therefore, they can be drawn from different distributions. We aim to find the model F(·|θ), parameterized by θ, that provides accurate predictions on weakly-labeled data while adaptively minimizing the prediction bias on the external data using the uncertainty regularization scores. We illustrate the proposed framework in Figure 2.\n3.2. 
Feature Extraction and Temporal Processing # The proposed uncertainty quantification method (Section 3.4) compares two diverse representations of each sample to estimate the uncertainty associated with the segment-level predictions on external data. To this aim, we employ two different backbones for feature extraction from videos, both of which are widely used for anomaly detection tasks. The first one is the conventional I3D backbone [6], which extracts segment-level features using 3D convolution, and the other is the CLIP backbone [27], which extracts frame-level features using the frozen CLIP model\u0026rsquo;s ViT encoder. The contrasting inductive biases of the 3D convolution-based I3D and the transformer-based CLIP help to effectively capture the prediction variance. Note that only the CLIP backbone is used during inference. We develop two prediction heads, namely the main model, P_m, built on top of the CLIP backbone, and the auxiliary model, P_a, built on top of the I3D backbone.\nFigure 2. Overview of the proposed CDL Framework. CDL Step 0: The Ranking Loss, L_rank (Supp. Mat. §6), is employed to train two pseudo-label generation models, P_m and P_a (§3.2), on weakly-labeled data, D_l. CDL Step k, k \u0026gt; 0: P_m and P_a are trained iteratively on D_l ∪ D_u, incorporating pseudo-labels for D_u generated at the end of the previous CDL step. To deal with noise in pseudo-labels, uncertainty regularization scores are estimated using the divergence between the predictions of the two models (§3.4). When optimizing on D_u, the prediction bias, L_bce (§3.3), for external data is reweighed using the computed uncertainty regularization scores (§3.5).\nVideo frames are highly correlated in the temporal dimension. To reduce the redundancy in frame-level features extracted by the CLIP backbone, we pool the representations by bilinearly interpolating them to a fixed, empirically determined length, n_s. 
Each of the n_s interpolated features represents one segment. To ensure consistency, we also fix the length of representations extracted by the I3D backbone. Evaluation in Section 4.6 analyzes the role of n_s on the model\u0026rsquo;s performance. To capture long-range temporal information over the sequence, we employ a lightweight temporal network, i.e., a transformer encoder, to implement P_m and P_a.\n3.3. Bias Estimation for External Data # Similar to [46], we formulate the prediction bias on external data as:\nwhere F(X_u|θ) represents a set of predicted probability distributions, each one corresponding to a distinct segment of X_u, and Y_u denotes the set of unknown segment-level labels of X_u. Bias(D_u) can be re-written as:\nwhere Ŷ_u denotes the set of segment-level pseudo-labels for X_u. Ŷ_u can be generated by running inference with the model trained on D_l. The first term in Equation 2 denotes the difference between the predicted posterior probability and the pseudo-labels, while the second term denotes the error between the pseudo-labels and the ground-truth labels. While minimizing the prediction bias, due to the lack of ground truth supervision, we employ a self-training mechanism, considering Ŷ_u as the soft labels, thereby treating the second term as a constant and minimizing the first term. Specifically, we use the binary cross-entropy (BCE) loss, L_bce, given by:\nto estimate the prediction bias associated with each video segment, for both P_m and P_a.\n3.4. Uncertainty Estimation # Since D_u and D_l do not necessarily share the same distribution, the generated pseudo-labels are noisy. This noise can adversely affect the subsequent training process as it causes bias to further magnify and propagate within the model. 
This issue, known as Confirmation Bias [4], is often mitigated by quantifying the uncertainty associated with pseudo-labels and then incorporating this uncertainty into the training process to compensate for the noise. As discussed in Section 2, we opt to address the confirmation bias by computing uncertainty using model augmentation. To quantify uncertainty through model augmentation, following [46], we estimate prediction variance, which is formulated as:\nDue to the lack of ground-truth labels, Equation 4 can be approximated as:\nWhen optimizing the prediction bias in Equation 2, the variance in Equation 5 will also be minimized, potentially resulting in inaccurate quantification of the true prediction variance. To address this, we adopt an alternative approximation, expressed as:\nSince VAD is a binary classification task, the probability distributions corresponding to each segment have limited support. Consequently, estimating prediction variance using only the predicted anomaly scores, as in Equation 6, may not be robust. Hence, instead of measuring the divergence between the predicted posterior probabilities for the two classes, we propose quantifying pseudo-label uncertainty in the high-dimensional space. To this end, we compute the cosine similarity between the segments in each set of the representations, Z_m and Z_a, obtained from the penultimate layer of P_m and P_a, respectively. Here, Z_m = {z_m^1, z_m^2, ..., z_m^{n_s}} and Z_a = {z_a^1, z_a^2, ..., z_a^{n_s}}.\nTo obtain a set of stabilized, segment-level uncertainty regularization scores within a bounded range from the computed cosine similarity, we introduce the following function. Let S = {s_1, s_2, ..., s_{n_s}} be the set of surrogate variances that we use as proxies for the uncertainty of segments. 
The surrogate variance is computed as:\nwhere s_j indicates the uncertainty regularization score for the j-th segment, ⟨z_m^j, z_a^j⟩ indicates the cosine similarity, and τ denotes the temperature parameter.\nHigher uncertainty regularization scores indicate that the two models encode the data similarly, implying less uncertainty in the predicted labels, while lower scores imply high uncertainty in the predicted labels. Empirical evidence in Section 4.4 demonstrates a significant negative correlation between uncertainty regularization scores and Binary Cross-Entropy (BCE) loss between the predicted labels and ground truths. This affirms that the proposed uncertainty regularization score effectively serves as a proxy for the quality of pseudo-labels.\n3.5. Training Process # CDL Step 0. We initially train P_m and P_a separately on the labeled set, optimizing both of them using the Ranking Loss, L_rank, discussed in Supp. Mat. Sec. 6. We then run inference with the trained models to generate the sets of soft segment-level pseudo-labels for training on D_u.\nCDL Step \u0026gt; 0. Following the generation of the sets of pseudo-labels for D_u, we enter an iterative pseudo-label refinement phase, where we train P_m and P_a on D_l ∪ D_u for multiple CDL steps. Each CDL step comprises a fixed number of epochs. In each epoch, we regenerate the sets of segment-level uncertainty regularization scores. To enable uncertainty-driven learning from external data, similar to [46], we use the estimated uncertainty regularization scores, S, as automatic thresholds: scaling the prediction bias associated with external data by S dynamically adjusts learning from noisy labels. This helps filter out unreliable predictions while prioritizing highly confident predictions. 
To encourage lower prediction variance, which would in turn lead to increased pseudo-label quality, we explicitly add the prediction variance to the optimization objective corresponding to the external data, L_ext, as:\nEquation 8 is rewritten with the approximated terms as:\nAlternatively, Equation 9 can be rewritten as:\nwhere λ_3 is a hyper-parameter to balance the losses. Similar to CDL step 0, to optimize the training on D_l, we use L_rank. The total optimization objective for training on D_l ∪ D_u can be expressed as:\nwhere λ_4 is a trade-off parameter for L_ext. We employ the optimization objective defined in Equation 11 during training on D_l ∪ D_u for each epoch within every CDL step. After each CDL step is completed, we re-generate the set of soft segment-level pseudo-labels using the models trained on D_l ∪ D_u. This iterative refinement process repeats k times, where k is a hyper-parameter determining the number of CDL steps. With each CDL step, the models\u0026rsquo; performance gets further refined as the pseudo-labels get iteratively improved.\n3.6. Inference - Extending Segment-level Scores to Frame-level Scores # During inference, we compute segment-level anomaly scores for the videos using P_m. Since we encounter long untrimmed videos with varying numbers of frames, to extend the segment-level anomaly scores to the frame level, for each video we divide the total number of frames n_f by the number of segments n_s to obtain the number of frames per segment, n_fs. We assign the anomaly score of each segment to its consecutive frames. The first segment corresponds to the first n_fs frames, and so forth until the (n_s − 1)-th segment. For the last segment, its anomaly score is assigned to all remaining frames, potentially exceeding n_fs if there is a remainder.\n4. Experiments # We evaluate the proposed method on the major video anomaly datasets, UCF-Crime (UCF) [32] and XD-Violence (XDV) [38]. 
Additionally, we use 11,000 videos from the HACS [45] dataset as a source of external data. We provide detailed information about the datasets in Supp. Mat. §7. In §4.1, we discuss the implementation details. In §4.2, we discuss the inherent noise in the test annotations of benchmark datasets. We proceed to compare the proposed framework with prior works in cross-domain scenarios (§4.3.1) and open-set scenarios (§4.3.2). Subsequently, in §4.4, we demonstrate a strong correlation between the quality of pseudo labels and the computed uncertainty scores. We then explore the evolution of these uncertainty scores through the training process in §4.5. Finally, in §4.6, we conduct ablation studies and hyper-parameter analysis to analyze the impact of individual components of the proposed framework.\n4.1. Implementation Details # We implement the proposed method using PyTorch. We extract CLIP and I3D features at a fixed frame rate of 30 FPS. CLIP features are extracted from the frozen CLIP model\u0026rsquo;s image encoder (ViT-B/32). For the hyper-parameters, in the open-set scenarios, we empirically set the value of n_s to 64, τ to 1.25, λ_1 and λ_2 to 5e-4, λ_3 to 1e-3, and λ_4 to 700. Ablation studies for selecting n_s and λ_3 are included in Section 4.6. We use the Adam optimizer with a weight decay of 1e-3, and we set a learning rate of 3e-5 for the transformer encoder and 5e-4 for the fully connected layers. We use a batch size of 64. In both P_m and P_a, we explicitly encode positional information in the segments using sinusoidal positional encodings [34]. We train on the weakly-labeled source dataset for 200 epochs, followed by training on the union of weakly-labeled and external datasets for 40 CDL steps, each CDL step comprising 4 epochs. Additional information regarding hyper-parameters is provided in Supp. Material Section 8.\nModel Architecture. 
Both P_m and P_a consist of a transformer encoder layer with four heads, followed by four fully connected layers with 4096, 512, 32, and 1 neurons, respectively. In both models, all layers except the last use ReLU [2] activation, while the last layer uses Sigmoid activation.\nEvaluation Setup. To reduce bias, we perform each experiment three times with different seeds and average the results. In open-set experiments, we repeat each experiment three times, using different sets of anomaly classes each time.\nEvaluation Metric. Following previous works on UCF-Crime [32], we adopt the frame-level area under the ROC curve (AUC) to evaluate on UCF-Crime. In line with previous works on XD-Violence [38], we use the frame-level area under the Precision-Recall curve (PRAUC), also known as Average Precision (AP), to evaluate on XDV.\nTable 2. Comparison with prior works on XDV, considering UCF-Crime as the source data. Asterisk (∗) indicates that evaluations were conducted by us using the official code. Dagger (†) indicates that evaluations were conducted by our implementation due to the lack of an official implementation.\n| Setting | Methods | Features | UCF AUC(%) | UCF-R AUC(%) | XDV AP(%) |\n| --- | --- | --- | --- | --- | --- |\n| Cross-Domain (Unsup.) | rGAN [23] | - | 64.35∗ | 65.19∗ | 37.74 |\n| Cross-Domain (Unsup.) | MPN [25] | - | 65.67 | 67.98∗ | 38.89 |\n| Cross-Domain (Unsup.) | zxVAD [3] | - | 68.74 | 69.39 | 40.68 |\n| Non Cross-Domain | Sultani et al. [32] | I3D | 80.70 | 84.63∗ | 53.88 |\n| Non Cross-Domain | MIST [11] | I3D | 82.30 | 86.17∗ | 50.33 |\n| Non Cross-Domain | RTFM [33] | I3D | 84.03 | 86.47∗ | 37.3 |\n| Non Cross-Domain | S3R [37] | I3D | 85.99 | 87.11∗ | 49.84 |\n| Non Cross-Domain | CU-Net [42] | I3D | 86.22 | 88.15∗ | 37.98 |\n| Non Cross-Domain | MGFN [8] | I3D | 86.98 | 87.33∗ | 32.16 |\n| Non Cross-Domain | SSRL [19] | I3D | 87.43 | 87.02∗ | 51.6 |\n| Non Cross-Domain | CLIP-TSA [18] | CLIP | 87.58 | 73.20∗ | 44.33 |\n| Non Cross-Domain | Ours (No ext. data) | CLIP | 84.49 | 89.96 | 58.13 |\n| Cross-Domain (Weakly-Sup.) | Ours (UCF + HACS) | CLIP | 84.63 | 90.53 | 65.14 |\n| Cross-Domain (Weakly-Sup.) | Ours (UCF + XDV) | CLIP | 84.73 | 90.26 | 68.37 |\n
Table 3. Comparison with prior works on UCF-Crime, considering XDV as the source data. Asterisk (∗) indicates that evaluations were conducted by us using the official code. Dagger (†) indicates that evaluations were conducted by our implementation due to the lack of an official implementation.\n| Setting | Methods | Features | XDV AP(%) | UCF-R AUC(%) |\n| --- | --- | --- | --- | --- |\n| Cross-Domain (Unsup.) | rGAN [23] | - | 40.10∗ | 59.82∗ |\n| Cross-Domain (Unsup.) | MPN [25] | - | 44.79∗ | 60.35∗ |\n| Cross-Domain (Unsup.) | zxVAD [3] | - | 47.53† | 63.61 |\n| Non Cross-Domain | Sultani et al. [32] | I3D | 73.20 | 71.23∗ |\n| Non Cross-Domain | RTFM [33] | I3D | 77.81 | 70.46∗ |\n| Non Cross-Domain | MGFN [8] | I3D | 80.11 | 69.12∗ |\n| Non Cross-Domain | S3R [37] | I3D | 80.26 | 69.04 |\n| Non Cross-Domain | CLIP-TSA [18] | CLIP | 80.67 | 67.58 |\n| Non Cross-Domain | Ours (No ext. data) | CLIP | 75.13 | 76.39 |\n| Cross-Domain (Weakly-Sup.) | Ours (XDV + UCF) | CLIP | 77.04 | 88.06 |\n| Cross-Domain (Weakly-Sup.) | Ours (XDV + HACS) | CLIP | 78.61 | 88.50 |\n4.2. Noise in the Test Annotations of Benchmark Datasets # Our manual inspection reveals that the frame-level testing annotations of the UCF-Crime (UCF) [32] and XD-Violence (XDV) [38] datasets, which are commonly used for benchmarking VAD models, exhibit significant noise. This noise largely stems from the fact that the original annotations do not consistently label the frames leading up to the primary anomalous events and their subsequent consequences as anomalous. For instance, in a video assigned a label like \u0026ldquo;shooting\u0026rdquo;, we assert that frames showing the person holding the gun and frames illustrating the injured victim should also be marked as anomalous. This perspective aligns with the fundamental goal of VAD, which is to identify all anomalous frames within a video, irrespective of the video\u0026rsquo;s primary label. However, it should also be noted that in the original annotations, for some videos, certain frames related to the video\u0026rsquo;s primary anomaly label are also not marked anomalous.\n
Table 4. Comparison with other methods in the open-set setting on the UCF-Crime dataset; c denotes the number of anomalous classes included for weakly-supervised training.\n| c | UCF (AUC%): Wu et al. [38] | UCF (AUC%): RTFM [33] | UCF (AUC%): Zhu et al. [49] | UCF (AUC%): Ours (w/o CDL) | UCF (AUC%): Ours (CDL) | UCF-R (AUC%): Ours (w/o CDL) | UCF-R (AUC%): Ours (CDL) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| 1 | 73.22 | 75.91 | 76.73 | 75.17 | 77.45 | 84.32 | 85.39 |\n| 3 | 75.15 | 76.98 | 77.78 | 81.51 | 82.57 | 86.84 | 87.69 |\n| 6 | 78.46 | 77.68 | 78.82 | 82.97 | 83.44 | 87.85 | 88.21 |\n| 9 | 79.96 | 79.55 | 80.14 | 83.02 | 83.37 | 89.22 | 89.82 |\nTo address this, we re-annotate the test set of UCF-Crime by assigning each video to three independent annotators. We then combine their annotations to generate more accurate frame-level labels. Compared to the original annotations where 7.58% of the total frames are labeled as anomalous, the proposed annotations label 16.55% of the total frames as anomalous. The proposed annotations are available online (footnote 1). We provide a comparison of the proposed and original annotations online (footnote 2). For the remainder of this paper, we refer to the re-annotated test set of the UCF-Crime dataset as UCF-R.\n4.3. Comparison with Prior Works # 4.3.1 Cross-Domain Scenarios # While the UCF-Crime [32] and XD-Violence [38] datasets share similar definitions of what constitutes anomalies, that definition differs from those of smaller datasets like ShanghaiTech [21], CUHK-Avenue [22], UCSD Pedestrian [7], and UBnormal [1], where anomalies are more subtle. For instance, running is considered anomalous in UBnormal but not in XD-Violence. Due to these divergent notions of anomalies across datasets, we conduct cross-domain experiments by simultaneously evaluating on the UCF-Crime and XD-Violence datasets, given their more aligned anomaly definitions.\nUCF-Crime as the Weakly-Labeled Source Set, XDV as the Cross-Domain Set. Table 2 summarizes the results for this scenario. 
First, we observe that the proposed method achieves state-of-the-art results on XDV and UCF-R even without utilizing any external data (without CDL). We believe this is due to the inductive bias of previous methods towards the noisy annotations of UCF-Crime. Next, we observe that the addition of external data, HACS and XDV, leads to a significant enhancement in the performance on the cross-domain dataset, XDV, by 11.26% and 14.49%, respectively, compared to the previous state-of-the-art baseline. Additionally, there is also a marginal improvement in the performance on the source set upon integration of external datasets.\nFootnote 1: https://drive.google.com/drive/folders/1IVjQQFHXVcsaT63HUjpfk8C5KH6HsQ7t?usp=drive_link\nFootnote 2: https://rb.gy/4vkr1r\nXDV as the Weakly-Labeled Source Set, UCF-Crime as the Cross-Domain Set. Table 3 summarizes the results for this scenario. Notably, the proposed method achieves state-of-the-art performance on the cross-domain dataset, UCF-R, even without the utilization of any external data during training. This is attributed to the simplicity of the proposed architecture compared to other baselines. The proposed architecture prevents overfitting to the source dataset, thereby increasing its generalizability to the cross-domain dataset. Additionally, integrating external data further enhances performance on both the cross-domain and source sets. Specifically, leveraging the CDL framework with UCF-Crime and HACS as external datasets boosts UCF-R\u0026rsquo;s AUC by 18.94% and 19.39%, respectively, compared to previous state-of-the-art baselines. We also observe that the proposed method\u0026rsquo;s performance is inferior on XDV. We attribute this to the noise in the annotations of XDV\u0026rsquo;s test set.\nThese results highlight that the proposed CDL framework is capable of effectively exploiting external data with vast domain gaps to achieve a significant cross-domain generalization. 
It\u0026rsquo;s noteworthy that the performance gain observed with the proposed CDL framework remains consistent across all tested datasets, suggesting that the performance improvement is not dependent on any specific source or external dataset.\n4.3.2 Open-Set Scenarios # In Table 4, we evaluate the proposed framework\u0026rsquo;s performance on the UCF-Crime dataset in a realistic open-set scenario, where the model is evaluated on both previously seen and unseen anomaly classes. To simulate this scenario, we randomly include c anomalous classes in the weakly-labeled set, while the remaining anomalous classes are placed in the unlabeled set. In both the weakly-supervised source set and the unlabeled set, the number of normal videos equals the number of anomalous videos. We evaluate two model configurations: one trained solely on the weakly-labeled set (without CDL) and the other on the union of the weakly-labeled and unlabeled sets using the CDL framework.\nFigure 3. (a) Correlation between uncertainty scores and BCE loss computed between the estimated scores and ground truth. When λ3 = 1e-3, as expected, a consistently high negative correlation emerges, demonstrating the effectiveness of the proposed uncertainty quantification method as a reliable proxy for pseudo-label quality. (b) Cumulative Distribution Function (CDF) plots illustrating the progression of average uncertainty regularization scores for each video during training. CDL step 20 has a higher concentration of scores around 1 compared to CDL step 2, while CDL step 2 has a higher concentration around 1 than CDL step 1. This suggests that, as training progresses, there is a higher tendency for scores to have elevated values, indicating more confident pseudo-label predictions. (c) Ablation study on the coefficient of the cosine similarity loss term, λ3. (d) Ablation study on the number of segments, n_s.\nOn UCF-Crime, the proposed model, without CDL, surpasses the state-of-the-art baselines for c \u0026gt; 1. This highlights its efficacy in open-set settings. With CDL, the model surpasses the baselines across all values of c by a considerable margin.\nFor both UCF-Crime and UCF-R, when unlabeled data is incorporated, we observe a consistent performance gain across all values of c, suggesting the effectiveness of the CDL framework across varying amounts of weakly-labeled and unlabeled data.\n4.4. Correlation between Uncertainty Scores and BCE Loss (Proxy to Label Quality) # To assess the efficacy of the proposed uncertainty quantification method as a proxy for pseudo-label quality, we compute the non-parametric Spearman correlation between the estimated uncertainty regularization scores and the BCE loss between the predicted pseudo-labels and the corresponding ground truths. For this experiment, we consider UCF-Crime as the weakly-labeled source set and XDV as the external set. In Figure 3(a), with λ3 = 1e-3, a consistently high negative correlation (-0.46 in CDL step 6, with a p-value \u0026lt; 1e-5) emerges from CDL step 1 onwards, indicating the robustness of the proposed uncertainty quantification framework. Conversely, setting λ3 to 0 results in a sustained positive correlation, signifying sub-optimal pseudo-labels in the absence of the cosine similarity loss term.\n4.5. Progression of Uncertainty Scores # To assess the evolution of uncertainty regularization scores through the training process, in Figure 3(b), we plot the Cumulative Distribution Function (CDF) of average uncertainty regularization scores for external videos across the first epoch of three different CDL steps. We conduct this experiment considering UCF-Crime as the weakly-labeled source set and XDV as the external set. 
We observe that in CDL step 1, 16.65% of the uncertainty scores fall within the range [0, 0.1]. As training progresses to CDL steps 2 and 20, this proportion decreases to 13.06% and 11.39%, respectively. Meanwhile, the proportion of uncertainty scores in the range [0.9, 1] increases from 35.11% in CDL step 1 to 56.70% in CDL step 2 and further to 57.68% in CDL step 20. This trend indicates a discernible shift towards higher uncertainty regularization scores as training progresses, suggesting an improvement in model confidence due to increased pseudo-label quality.\n4.6. Ablation Studies and Hyper-parameter Analysis # For the sake of consistency, we conduct all ablation studies on UCF-Crime in an open-set setting, with c = 1. However, it should be noted that for different training setups, hyperparameters are tuned separately as well.\nImpact of Various Components of the CDL Framework. We assess the effectiveness of each component of the CDL framework by adding them sequentially. The results are summarized in Table 5. We consider training on c = 1 anomaly class in a weakly-supervised fashion as our baseline. The remaining anomalous classes are placed in the external set. We first observe that integrating external data into the source set without accounting for pseudo-label uncertainty (S_{i,j} = 1, ∀i, j) and without minimizing cosine similarity between representations (λ3 = 0) yields a 0.35% gain in AUC, highlighting the effectiveness of external data in improving the model\u0026rsquo;s performance. Next, we study the impact of uncertainty-aware integration of external data, i.e., adaptively reweighing the prediction bias of external data with the computed uncertainty values and with λ3 set to 0. This results in a gain of 0.13% in AUC, demonstrating the superiority of uncertainty-driven integration compared to the standard integration. Finally, we assess the impact of adding the cosine similarity loss term during uncertainty-aware training. This leads to a further gain of 0.59%, validating its effectiveness.\nTable 5. Ablation study of various components on the UCF-R dataset in an open-set setting (c = 1).\nExternal data | Uncertainty coeff. | Cos. similarity loss | AUC (%)\n✗ | ✗ | ✗ | 84.32\n✓ | ✗ | ✗ | 84.67\n✓ | ✓ | ✗ | 84.80\n✓ | ✓ | ✓ | 85.39\nImpact of Cosine Similarity Loss. In Figure 3(c), we explore the impact of varying the coefficient of the cosine similarity loss on the model\u0026rsquo;s performance. We observe a gradual increase in AUC as λ3 increases from 1e-9 to 1e-3. This could be due to the effect of the cosine similarity loss getting more pronounced with higher values of λ3. However, beyond 1e-3, there is a rapid decline in AUC, likely due to the dominance of the cosine similarity loss over other losses when its coefficient is high. Therefore, we select 1e-3 as the optimal choice for λ3.\nImpact of Number of Segments. In Figure 3(d), we observe that the performance consistently improves as the number of segments, n_s, increases from 16 to 64, but it begins to decline rapidly afterward. Therefore, we set n_s to 64.\nImpact of the Size of External Data. To determine the optimal number of unlabeled external videos from the HACS dataset to integrate into the weakly-labeled training set of UCF-Crime, we conduct an ablation study, depicted in Figure 4. We observe that increasing the size of the external set increases the performance on XDV. However, this increase tends to plateau after the inclusion of 11,000 videos. Consequently, we do not include additional videos beyond the 11,000 threshold.\n5. Conclusion # In this work, we demonstrated the effectiveness of integrating external, unlabeled data with weakly-labeled source data to enhance the cross-domain generalization of VAD models. 
To enable this integration, we proposed a weakly-supervised CDL (Cross-Domain Learning) framework that adaptively minimizes the prediction bias on external data by scaling it with the prediction variance, which serves as an uncertainty regularization score. The proposed method outperforms baseline models significantly in cross-domain and open-set settings while retaining competitive performance in in-domain settings.\nFigure 4. Ablation study on the impact of the size of external data.\nAcknowledgement # This work was supported in part by U.S. NIH grants R01GM134020 and P41GM103712, NSF grants DBI-1949629, DBI-2238093, IIS-2007595, IIS-2211597, and MCB-2205148. This work was supported in part by Oracle Cloud credits and related resources provided by Oracle for Research, and the computational resources support from AMD HPC Fund. We thank Eshaan Mandal and Bhavay Malhotra for their assistance, which has been instrumental in completing this work.\nReferences # [1] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ubnormal: New benchmark for supervised open-set video anomaly detection. In CVPR, 2022. 2 , 7 [2] Abien Fred Agarap. Deep learning using rectified linear units (relu), 2019. 6 [3] Abhishek Aich, Kuan-Chuan Peng, and Amit K. Roy-Chowdhury. Cross-domain video anomaly detection without target domain adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2579–2591, 2023. 1 , 2 , 3 , 6 [4] Eric Arazo, Diego Ortego, Paul Albert, Noel E O\u0026rsquo;Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In IJCNN, 2020. 3 , 4 [5] David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution matching and augmentation anchoring. In ICLR, 2020. 
3 [6] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 6299–6308, 2017. 3 [7] Antoni B. Chan and Nuno Vasconcelos. Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 909–926, 2008. 7 [8] Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, 2023. 6 [9] MyeongAh Cho, Minjung Kim, Sangwon Hwang, Chaewon Park, Kyungjae Lee, and Sangyoun Lee. Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning. In CVPR, pages 12137–12146, 2023. 3 [10] Yang Cong, Junsong Yuan, and Ji Liu. Sparse reconstruction cost for abnormal event detection. In CVPR, pages 3449–3456, 2011. 1 [11] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. MIST: Multiple instance self-training framework for video anomaly detection. In CVPR, pages 14009–14018, 2021. 1 , 2 , 3 , 6 , 14 [12] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, pages 1050–1059, 2016. 3 [13] Mariana Iuliana Georgescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. A background-agnostic framework with adversarial training for abnormal event detection in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 4505–4523, 2022. 2 [14] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry Davis. Learning temporal regularity in video sequences. In Proceedings of IEEE Computer Vision and Pattern Recognition, 2016. 
14 [15] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K. Roy-Chowdhury, and Larry S. Davis. Learning temporal regularity in video sequences. In CVPR, 2016. 1 , 2 [16] Kexin Huang, Vishnu Sresht, Brajesh Rai, and Mykola Bordyuh. Uncertainty-aware pseudo-labeling for quantum calculations. In UAI, pages 853–862, 2022. 3 [17] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In CVPR, pages 7842–7851, 2019. 3 [18] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In ICIP, pages 3230–3234, 2023. 6 [19] Guoqiu Li, Guanxiong Cai, Xingyu Zeng, and Rui Zhao. Scale-aware spatio-temporal relation learning for video anomaly detection. In ECCV, pages 333–350, 2022. 6 [20] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. AAAI, pages 1395–1403, 2022. 3 , 14 [21] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In CVPR, pages 6536–6545, 2018. 1 , 2 , 3 , 7 [22] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In ICCV, pages 2720–2727, 2013. 2 , 3 , 7 , 14 [23] Yiwei Lu, Frank Yu, Mahesh Kumar Krishna Reddy, and Yang Wang. Few-shot scene-adaptive anomaly detection. In ECCV, pages 125–141, 2020. 2 , 3 , 6 [24] Weixin Luo, Wen Liu, and Shenghua Gao. Remembering history with convolutional lstm for anomaly detection. In ICME, pages 439–444, 2017. 3 [25] Hui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Learning normal dynamics in videos with meta prototype network. In CVPR, pages 15425–15434, 2021. 1 , 2 , 3 , 6 [26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In NAACL, pages 152–159, 2006. 
3 [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021. 3 [28] Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In ICLR, 2021. 3 [29] Hitesh Sapkota and Qi Yu. Bayesian nonparametric submodular video partition for robust anomaly detection. In CVPR, pages 3212–3221, 2022. 2 [30] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020. 3 [31] Fahad Sohrab, Jenni Raitoharju, Moncef Gabbouj, and Alexandros Iosifidis. Subspace support vector data description. In ICPR, pages 722–727, 2018. 14 [32] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In CVPR, pages 6479–6488, 2018. 1 , 2 , 3 , 6 , 7 , 12 [33] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In ICCV, pages 4975–4986, 2021. 1 , 2 , 6 , 7 , 14 [34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 6000–6010, 2017. 6 , 12 [35] Boyang Wan, Yuming Fang, Xue Xia, and Jiajie Mei. Weakly supervised video anomaly detection via center-guided discriminative learning. ICME, pages 1–6, 2020. 3 [36] Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In ICCV, pages 8200–8210, 2019. 
14 [37] Jhih-Ciang Wu, He-Yen Hsieh, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. Self-supervised sparse representation for video anomaly detection. In ECCV, pages 729–745, 2022. 6\n[38] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In ECCV, 2020. 6 , 7 , 12 , 14 [39] Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. Learning deep representations of appearance and motion for anomalous event detection. In BMVC, pages 8.1–8.12, 2015. 2 [40] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In ACL, pages 189–196, 1995. 3\n[41] Guang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin, and Marius Kloft. Cloze test helps: Effective video anomaly detection via learning to complete video events. In Proceedings of the 28th ACM International Conference on Multimedia, pages 583–591, 2020. 2 [42] Chen Zhang, Guorong Li, Yuankai Qi, Shuhui Wang, Laiyun Qing, Qingming Huang, and Ming-Hsuan Yang. Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection. In CVPR, pages 16271–16280, 2023. 1 , 3 , 6 , 14 [43] Jiangong Zhang, Laiyun Qing, and Jun Miao. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In ICIP, pages 4030–4034, 2019. 2 [44] Bin Zhao, Fei-Fei Li, and Eric Xing. Online detection of unusual events in videos via dynamic sparse coding. In CVPR, pages 3313–3320, 2011. 2 , 3 [45] Hang Zhao, Antonio Torralba, Lorenzo Torresani, and Zhicheng Yan. Hacs: Human action clips and segments dataset. In ICCV, pages 8668–8678, 2019. 6 , 12 [46] Zhedong Zheng and Yi Yang. Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. International Journal of Computer Vision (IJCV), 2021. 
3 , 4 , 5 [47] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In CVPR, pages 1237–1246, 2019. 1 , 3 [48] Yi Zhu and Shawn D. Newsam. Motion-aware feature for improved video anomaly detection. In BMVC, page 270, 2019. 2 [49] Yuansheng Zhu, Wentao Bao, and Qi Yu. Towards open set video anomaly detection. In ECCV, pages 395–412, 2022. 7 , 14 Cross-Domain Learning for Video Anomaly Detection with Limited Supervision # Supplementary Material # 6. Revisiting Multiple Instance Learning # Since acquiring frame-level labels requires significant time and effort, following Sultani et al. [32], we use Multiple Instance Learning (MIL) to train the classifiers using weakly-supervised video-level labels. By dividing a video (bag) into multiple temporal non-overlapping segments (instances) and encouraging anomalous video segments to have higher anomaly scores as compared to the normal segments, they formulate anomaly detection as a regression problem.\nThe multiple instance ranking objective function is given by:\nwhere D_l^a = {(X, Y) ∈ D_l : Y = 1} and D_l^n = {(X, Y) ∈ D_l : Y = 0} are the sets of abnormal and normal videos, respectively, and max is taken over all video segments in a bag.\nInstead of ranking every segment of the positive and negative bags, ranking is enforced on one segment from each bag, the one having the highest anomaly score. The overall loss function, L_rank, for a pair of abnormal and normal videos, is given by:\nwhere L_Ts is the temporal smoothness constraint, and L_Sp is the sparsity constraint.\n7. Datasets # UCF-Crime [32]: This is a large-scale VAD dataset having a total duration of 128 hours. It contains long and untrimmed real-world surveillance videos across 13 realistic anomaly categories that are specifically chosen due to their significant impact on public safety. 
The dataset comprises 1610 weakly-labeled training videos and 290 test videos annotated at the frame level.\nXD-Violence (XDV) [38]: This is a large-scale and multiscene audio-visual dataset for violence detection, having a total duration of 217 hours. Its long and untrimmed videos are collected from movies, games, and in-the-wild scenarios, with anomalies spread over 6 categories. It comprises 3954 weakly-labeled training videos and 800 test videos annotated at the frame level.\nHACS [45]: This is a large-scale dataset for human action recognition, sourced from YouTube. It features 200 action classes across 140K segments on 50K videos. Due to its diverse range of actions, larger size, and longer video durations compared to other video datasets such as UCF-101, Kinetics, and ActivityNet, we use a subset of 11K videos from HACS Segments as external, unlabeled data.\n8. Implementation Details # To ensure consistency and gradient stability, while training on D_l ∪ D_u, each mini-batch consists of an equal number of samples from D_l and D_u. Since the computation of L_rank necessitates pairs of abnormal and normal videos, each labeled sample within the mini-batch comprises a pair of anomalous and normal videos. All the experiments were conducted on an NVIDIA RTX A5000 24 GB GPU. For the experiments using UCF-Crime as the weakly-labeled data, we set the batch size to 64, and for the experiments using XD-Violence as the weakly-labeled data, we set the batch size to 32. In all our experiments except the open-set ones, we set n_s to 64, τ to 1.25, λ1 to 5e-3, λ2 to 1e-3, and λ3 to 1e-3. We set λ4 to 2000 for UCF+HACS and UCF+XDV, 1250 for XDV+HACS, and 700 for XDV+UCF. For all our experiments, we use the Adam optimizer with a weight decay of 1e-3. For the fully connected layers, we use a learning rate of 5e-4 when UCF-Crime is used as the weakly-labeled dataset and a learning rate of 1e-4 when XDV is used as the weakly-labeled dataset. 
For the transformer encoder layers, we use a learning rate of 3e-5 when UCF-Crime is used as the weakly-labeled dataset and a learning rate of 5e-5 when XDV is used as the weakly-labeled dataset. In all our experiments, we explicitly encode positional information in the segments using sinusoidal positional encodings [34]. We train on the weakly-labeled source dataset for 200 epochs, followed by training on the union of weakly-labeled and external datasets for 40 CDL steps, each CDL step comprising 4 epochs. Due to the finer granularity and semantic richness inherent in CLIP features, we choose to use CLIP features during inference.\n9. Comparison with Unsupervised Baselines in Open-Set Settings # Table 6 shows that the proposed method outperforms all the baselines in open-set settings on the UCF-Crime dataset by a large margin. As expected, all the weakly-supervised methods outperform the unsupervised methods, even when a small subset of the data is used for weakly-supervised training. This highlights the necessity of incorporating weak labels during training. Since a direct comparison of the proposed weakly-supervised framework with unsupervised methods is not fair, we did not include unsupervised baselines in Table 4.\nFigure 5. A comparison between the original annotations (UCF) and the proposed annotations (UCF-R). The green region represents frames labeled as anomalous by both the original and proposed annotations. The red region indicates frames labeled as anomalous by the proposed annotations but not by the original annotations. The unshaded (white) region denotes normal frames. For instance, in the first row, while the original annotations just label frames depicting arson (a person setting the Christmas tree on fire) as anomalous, UCF-R also labels the frames depicting the fire and smoke following arson as anomalous.\n10. 
Comparison of the Original and Proposed Annotations for UCF-Crime Dataset # Figure 5 illustrates a subset of instances from the UCF-Crime test set where the original annotations do not label frames as anomalous, despite their actual anomalous nature. We also provide a comparison of the proposed and original annotations superimposed on the videos at this link: https://rb.gy/4vkr1r\nTable 6. Comparison with prior works in open-set setting on UCF-Crime dataset; c denotes the number of anomalous classes included for weakly-supervised training. The values represent AUC (%).\nMethod | c = 0 | c = 1 | c = 3 | c = 6 | c = 9\nConv-AE [14] | 50.60 | - | - | - | -\nSohrab et al. [31] | 58.50 | - | - | - | -\nLu et al. [22] | 65.51 | - | - | - | -\nBODS [36] | 68.26 | - | - | - | -\nGODS [36] | 70.46 | - | - | - | -\nWu et al. [38] (offline) | - | 73.22 | 75.15 | 78.46 | 79.96\nWu et al. [38] (online) | - | 73.78 | 74.64 | 77.84 | 79.11\nRTFM [33] | - | 75.91 | 76.98 | 77.68 | 79.55\nZhu et al. [49] | - | 76.73 | 77.78 | 78.82 | 80.14\nOurs (w/o CDL) | - | 75.17 | 81.51 | 82.97 | 83.02\nOurs | - | 77.45 | 82.57 | 83.44 | 83.37\n11. Limitations # Similar to some recent weakly-supervised VAD works [11, 20, 42], the training process of the proposed CDL framework involves two stages. Consequently, the training does not operate in an end-to-end manner. This incurs additional complexity and challenges for training the model in real-world applications. However, since the generalization obtained using this multi-stage training is significant, the complex training setup of the multi-stage framework is reasonable. Nonetheless, developing end-to-end training frameworks would be an important direction for future research. This can facilitate the advancement of anomaly detection approaches for real-world applications, particularly the ones with limited training budgets.\nAdditionally, the cross-domain performance in case of drastic distribution shifts between the source and target domains may be hindered. 
For instance, a model primarily trained on videos from stationary surveillance cameras may not effectively work on videos with rapidly evolving scenes from car dashcams. This is mainly because the uncertainty-based reweighing approach in our framework aims to select samples from the external set that are similar to the source domain. In case of drastic shifts between the two domains, finding informative samples from the target domain would not be trivial.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/cross-domain-learning-for-vad-with-limited-supervision/","section":"Papers","summary":"A proposed weakly-supervised framework that incorporates external unlabeled data during training by estimating prediction bias and adaptively minimizing it using predicted uncertainty, to enhance cross-domain generalization in video anomaly detection.","title":"Cross-Domain Learning for Video Anomaly Detection with Limited Supervision","type":"method"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/elisa-ricci/","section":"Authors","summary":"","title":"Elisa Ricci","type":"authors"},{"content":" Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection # Kun Qian, Tianyu Sun, Wenhong Wang\nShangqiu University\nAbstract. Industrial anomaly detection (IAD) plays a crucial role in the maintenance and quality control of manufacturing processes. In this paper, we propose a novel approach, Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD), which leverages large vision-language models (LVLMs) to improve both anomaly detection and localization in industrial settings. CLAD aligns visual and textual features into a shared embedding space using contrastive learning, ensuring that normal instances are grouped together while anomalies are pushed apart. 
Through extensive experiments on two benchmark industrial datasets, MVTec-AD and VisA, we demonstrate that CLAD outperforms state-of-the-art methods in both image-level anomaly detection and pixel-level anomaly localization. Additionally, we provide ablation studies and human evaluation to validate the importance of key components in our method. Our approach not only achieves superior performance but also enhances interpretability by accurately localizing anomalies, making it a promising solution for real-world industrial applications.\nKeywords: Large Vision-Language Models · Industrial Anomaly Detection · Contrastive Learning.\n1 Introduction # Industrial anomaly detection (IAD) plays a critical role in ensuring the quality and safety of manufacturing processes, particularly in industries that rely on automated systems for production. Identifying unusual or faulty behavior in industrial systems, whether it involves machinery malfunctions, material defects, or process deviations, is crucial for minimizing downtime, reducing operational costs, and ensuring product quality. In recent years, the advent of large vision-language models (LVLMs) has provided a promising direction for advancing the state-of-the-art in IAD. LVLMs, which integrate both visual understanding and natural language processing, have demonstrated strong capabilities in tasks that involve both image and text data [1,2]. This dual-modal nature of LVLMs makes them particularly well-suited for industrial anomaly detection, where both visual patterns and textual descriptions (e.g., defect reports, product manuals, and machine logs) need to be comprehended in conjunction.\nDespite their potential, the application of LVLMs to IAD faces several significant challenges. 
First, current IAD methods, which often rely solely on visual features or simple anomaly scoring, struggle to capture complex relationships between visual defects and textual descriptions, leading to limited generalization across different industrial scenarios. Second, many existing methods require large amounts of labeled anomaly data for training, which is not always available in real-world industrial settings. Furthermore, anomalies can often be subtle, requiring the model to understand fine-grained details that may not be immediately obvious from raw visual input alone. Finally, current models often fail to effectively leverage textual data, which could provide valuable contextual information that helps differentiate between normal and anomalous behavior.\nOur motivation stems from the need to overcome these limitations by leveraging the power of LVLMs to align visual and textual information in a way that improves both anomaly detection and the interpretability of model predictions. In this work, we propose a novel method called Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD) . Our approach combines contrastive learning with cross-modal reasoning to create a joint embedding space for both visual and textual data. By doing so, we ensure that the model learns to distinguish normal and anomalous instances not just based on visual cues, but also by considering their textual context. This approach allows for the detection of both known and unseen anomalies in industrial environments, improving model generalization across diverse anomaly types and industrial setups. We further incorporate a contextualized reasoning module that enables the model to generate textual explanations for detected anomalies, thereby providing valuable insights into the model\u0026rsquo;s decision-making process.\nFor evaluation, we conduct extensive experiments on two benchmark datasets: MVTec-AD [3] and VisA [4]. 
These datasets provide a comprehensive testbed for evaluating anomaly detection methods across different types of industrial objects and defects. We use a combination of image-level and pixel-level AUC (Area Under Curve) scores, as well as accuracy measures, to assess the performance of our model. Our results show that CLAD significantly outperforms existing methods in both anomaly detection and localization tasks, demonstrating a clear improvement in both accuracy and robustness compared to prior approaches such as AnomalyGPT [5], PaDiM [6], and PatchCore [7].\nIn summary, the main contributions of our work are as follows:\n– We propose a novel method for industrial anomaly detection, CLAD, which leverages contrastive learning and cross-modal reasoning to jointly model visual and textual information for anomaly detection.\n– We introduce a contextualized reasoning module that enables the model to generate textual explanations for detected anomalies, improving both the interpretability and effectiveness of the detection process.\n– We demonstrate the effectiveness of CLAD through comprehensive experiments on benchmark datasets, showing significant improvements over existing methods in both detection performance and generalization capabilities.\n2 Related Work # 2.1 Large Vision-Language Models # Large Vision-Language Models (LVLMs) have emerged as a powerful framework for learning joint representations of images and text. One of the most influential models in this domain is CLIP (Contrastive Language-Image Pretraining) [8], which pre-trains a vision model and a language model by aligning images and their corresponding text descriptions in a shared embedding space. CLIP demonstrates impressive zero-shot performance across a variety of downstream tasks, enabling it to generalize well to unseen data without task-specific fine-tuning. 
Its architecture leverages a large-scale dataset of images and text to learn semantic correspondences, making it a highly versatile model for many vision-language tasks [9,10,11].\nFollowing CLIP, DALL·E [12], another model developed by OpenAI, introduced the ability to generate images from textual descriptions using a transformer-based architecture. Unlike CLIP, which primarily focuses on representation learning, DALL·E explores the creative aspect of image generation, utilizing a large dataset of image-caption pairs to learn how to create novel images conditioned on textual inputs. This model has inspired further research into generative tasks within the vision-language domain.\nAnother notable approach is VisualBERT [13], which extends the transformer-based BERT architecture to the vision-language domain. VisualBERT integrates visual features directly into the language model [14,15], treating both image regions and text tokens as a unified sequence. It shows strong performance on tasks such as Visual Question Answering (VQA) and image captioning. Other works, such as UNITER [16] and VL-BERT [17], have similarly adapted transformer models for joint image-text representation learning. These models perform well across multiple vision-language tasks, achieving state-of-the-art results by pretraining on large-scale datasets and fine-tuning on task-specific data [18,19].\nAdditionally, more recent methods like ALBEF [20] have explored improved fusion strategies for vision-language alignment. ALBEF introduces an alignment-before-fusion approach, where image and text features are first aligned and then fused into a shared representation. This method has been shown to improve performance in tasks requiring fine-grained alignment between visual and textual modalities, such as image-text retrieval and VQA.\nFinally, Florence [21], a recent contribution from Microsoft Research, is a foundational model designed for general-purpose vision and language understanding. 
Florence integrates large-scale vision and language pretraining, enabling it to achieve state-of-the-art performance across a wide range of vision-and-language tasks. Its scalable architecture and pretraining framework push the boundaries of what is achievable in multimodal learning.\nThese models represent significant steps forward in the field of vision-language understanding. They have demonstrated that large-scale pretraining and the alignment of visual and textual data can lead to highly effective representations that generalize across a variety of tasks. However, despite these advancements, challenges remain in adapting these models for specialized tasks, such as industrial anomaly detection, where domain-specific knowledge and precise localization are crucial.\n2.2 Detecting Industrial Anomalies # The detection of industrial anomalies has garnered increasing attention due to the potential for improving operational efficiency, preventing breakdowns, and minimizing production losses. Recent works have explored various methodologies, including machine learning, deep learning, and computer vision-based techniques, to address the challenges associated with anomaly detection in industrial settings.\nOne of the most commonly used approaches is unsupervised anomaly detection. Unsupervised methods do not rely on labeled data, making them particularly suitable for real-world industrial environments where obtaining labeled data is often costly and time-consuming. A prominent example of this approach is the use of autoencoders for anomaly detection in industrial systems. Autoencoders, such as convolutional autoencoders [22], learn to reconstruct the input data, and anomalies are detected when reconstruction errors exceed a threshold. 
These methods are particularly effective in detecting anomalies in images and sensor data, where the system learns a compact representation of normal operations and identifies deviations.\nIn addition to autoencoders, Generative Adversarial Networks (GANs) have been applied for anomaly detection in industrial settings [23]. GAN-based approaches learn the distribution of normal data and use the discriminator network to detect anomalies by identifying samples that do not conform to the learned distribution. GANs are particularly effective when there is limited labeled data available for training, as they can generate realistic samples of normal behavior.\nDeep learning models have also been explored in the context of industrial image anomaly detection. In the domain of manufacturing, defect detection in product images is a key application area. Convolutional neural networks (CNNs) have been used for automated defect detection [24], where models are trained to classify regions of images as normal or defective. Recently, methods like Vision Transformers (ViTs) have been investigated for their ability to capture global contextual information in industrial images [25], offering improvements in accuracy over traditional CNN-based models.\nAnother approach involves time-series anomaly detection, which is important in industrial control systems where sensor data is continuously collected [26]. Recurrent neural networks (RNNs), and specifically Long Short-Term Memory (LSTM) networks, have been widely applied for anomaly detection in time-series data [27]. These models are designed to capture temporal dependencies and detect deviations from the normal operational patterns of industrial equipment.\nThe MVTec AD dataset [3], a comprehensive benchmark for industrial anomaly detection, has been extensively used to evaluate the performance of anomaly detection models in industrial environments. 
The dataset contains high-resolution images of industrial products and associated anomalies, including class-specific defects such as scratches, dents, and missing parts. Many recent anomaly detection methods have been benchmarked using this dataset, demonstrating the effectiveness of modern deep learning techniques for detecting fine-grained anomalies in industrial settings.\nWhile significant progress has been made in industrial anomaly detection, challenges remain, particularly in real-time detection, anomaly localization, and adaptation to diverse industrial domains. Many models require substantial computational resources or rely on large labeled datasets, limiting their practicality for deployment in production environments. Furthermore, adapting existing anomaly detection techniques to specialized industrial tasks, such as detecting rare or subtle defects in highly variable manufacturing processes, remains a challenging research direction.\n3 Method # In this section, we present the methodology for Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD). Our approach combines the strengths of both generative and discriminative models, leveraging the power of large vision-language models (LVLMs) to jointly process visual and textual data. Specifically, we propose a discriminative approach that focuses on distinguishing normal and anomalous instances based on their visual and textual representations. The model is trained to map both visual features and textual descriptions into a shared embedding space, where normal instances are grouped together while anomalies are separated, allowing for both detection and localization of anomalies.\n3.1 Model Overview # Our proposed model consists of three key components:\nVisual Encoder: A pretrained convolutional neural network (CNN) or vision transformer (ViT) is used to extract visual features from the input image. 
Let I denote an input image and f_v(I) the feature vector extracted by the visual encoder. This feature vector captures the high-level spatial and semantic information of the industrial object in the image.\nTextual Encoder: A pretrained transformer-based language model (such as GPT or BERT) is used to process textual descriptions. Let T denote the textual input (such as defect descriptions or product manuals) and f_t(T) the textual feature vector. The textual encoder captures the semantic information related to the object and its potential anomalies.\nContrastive Learning Module: This component aligns the visual and textual embeddings into a shared space using a contrastive loss function, which is central to the anomaly detection process.\nThe overall architecture can be described as:\nz_v = f_v(I), z_t = f_t(T)\nwhere z_v and z_t are the visual and textual feature embeddings, respectively.\n3.2 Contrastive Loss for Cross-Modal Alignment # The core of our model\u0026rsquo;s training lies in a contrastive loss that ensures visual and textual representations of normal instances are closer in the shared embedding space, while those of anomalous instances are pushed apart. 
To achieve this, we define the contrastive loss as:\nwhere:\n- N is the batch size,\n- ‖z_v^i − z_t^i‖_2^2 is the squared Euclidean distance between the visual and textual embeddings for the same instance i,\n- α is a margin that encourages the embeddings of the same instance to be close in the feature space,\n- ‖z_v^i − z_t^j‖_2^2 is the squared Euclidean distance between the embeddings of different instances i and j,\n- β is a margin that encourages the embeddings of different instances to be far apart in the embedding space,\n- [·]_+ is the positive part, meaning the loss is zero if the distance between positive pairs is smaller than α.\nThis contrastive loss function pulls the positive (normal) pairs closer while pushing the negative (anomalous) pairs farther apart in the shared space.\n3.3 Anomaly Detection and Localization # Once the visual and textual features have been aligned using the contrastive loss, the next task is anomaly detection and localization. To detect anomalies, we compute the similarity score between the visual feature z_v of an unseen image and the corresponding textual feature z_t of the object description. For a new test sample, we use the following anomaly score function S(I, T):\nwhere:\n- z_v = f_v(I) is the visual feature of the test image,\n- z_t = f_t(T) is the textual feature of the associated description,\n- σ is a scaling factor that controls the sensitivity of the similarity measure.\nA lower value of S(I, T) indicates a higher degree of anomaly, and we classify the sample as anomalous if S(I, T) falls below a threshold.\nFor anomaly localization, we utilize a segmentation technique that identifies the specific pixels within the image that contribute most to the anomaly. 
This can be achieved using a simple gradient-based method, such as Grad-CAM, to highlight the regions of the image most responsible for the mismatch between the visual and textual embeddings:\nwhere:\n- α_k are the weights of the final convolutional layer,\n- A_k is the activation map at location k in the last convolutional layer,\n- the ReLU function ensures only positive contributions are considered.\nThis localization method provides a visual heatmap that highlights the anomalous regions in the input image, making the anomaly detection process more interpretable.\n3.4 Learning Strategy: Task-Driven Fine-Tuning # The learning strategy is designed to optimize the model for industrial anomaly detection. We use a task-driven fine-tuning approach, where the model is initially pre-trained on a large dataset of general vision-language pairs (e.g., images and captions from a large corpus) and then fine-tuned on the specific industrial dataset. During fine-tuning, we update both the visual and textual encoders by minimizing the contrastive loss in the context of the specific anomaly detection task.\nThe overall loss function for training consists of two parts:\nwhere L_reconstruction is a reconstruction loss that helps preserve the visual and textual details, particularly for the normal instances. The reconstruction loss ensures that the model does not overly generalize and that important visual and textual features are retained during the training process. The hyperparameter λ controls the balance between the contrastive loss and the reconstruction loss.\nThe reconstruction loss is defined as:\nwhere f_v^{-1} and f_t^{-1} represent the inverse functions of the visual and textual encoders, used to reconstruct the original inputs from the embeddings.\n3.5 Model Inference # During inference, given a test image I and its associated textual description T, we compute the anomaly score S(I, T) and classify the image as normal or anomalous. 
If S(I, T) is below a predefined threshold, the sample is classified as anomalous. The localization technique is then applied to highlight the anomalous regions in the image.\n4 Experiments # In this section, we present the experimental setup and results for evaluating the performance of our proposed method, Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD). We compare our approach with several state-of-the-art anomaly detection methods on two widely used industrial anomaly detection datasets: MVTec-AD and VisA. Our goal is to demonstrate that CLAD outperforms existing techniques in both anomaly detection and localization tasks. Additionally, we provide a human evaluation to assess the interpretability and usefulness of our method in real-world applications.\n4.1 Experimental Setup # We evaluate CLAD on two benchmark datasets: MVTec-AD [3] and VisA [4]. The MVTec-AD dataset contains 15 categories, with 3,629 training images and 1,725 test images, including both normal and anomalous samples. The VisA dataset includes 12 categories, with 9,621 normal images and 1,200 anomalous images. For comparison, we select several state-of-the-art anomaly detection methods, including:\n– SPADE [28]\n– PaDiM [29]\n– PatchCore [7]\n– WinCLIP [30]\nWe evaluate the models on two main tasks: anomaly detection (i.e., classification of normal vs. anomalous) and anomaly localization (i.e., pixel-level identification of anomalies). For anomaly detection, we report Image-AUC, and for anomaly localization, we report Pixel-AUC.\n4.2 Quantitative Results # Table 1 shows the comparison results of CLAD with other methods on the MVTec-AD and VisA datasets. We report the performance in terms of both Image-AUC and Pixel-AUC, with results averaged over five runs. As seen in the table, CLAD achieves the highest Image-AUC on both datasets and the highest Pixel-AUC on MVTec-AD, while WinCLIP is slightly ahead in Pixel-AUC on VisA (96.4 vs. 96.2).\nTable 1. 
Comparison of CLAD with other anomaly detection methods on the MVTec-AD and VisA datasets.\nMethod | MVTec-AD Image-AUC | MVTec-AD Pixel-AUC | VisA Image-AUC | VisA Pixel-AUC\nSPADE | 81.0±2.0 | 91.2±0.4 | 79.5±4.0 | 95.6±0.4\nPaDiM | 76.6±3.1 | 89.3±0.9 | 62.8±5.4 | 89.9±0.8\nPatchCore | 83.4±3.0 | 92.0±1.0 | 79.9±2.9 | 95.4±0.6\nWinCLIP | 93.1±2.0 | 95.2±0.5 | 83.8±4.0 | 96.4±0.4\nCLAD | 94.1±1.1 | 95.3±0.1 | 86.1±1.1 | 96.2±0.1\nAs shown in Table 1, our method, CLAD, achieves the best Image-AUC on both datasets. Notably, CLAD improves upon the next-best method (WinCLIP) in Image-AUC in both cases. For example, on the MVTec-AD dataset, CLAD achieves an Image-AUC of 94.1, outperforming WinCLIP by 1.0 points, and on VisA it reaches 86.1, a gain of 2.3 points. Pixel-AUC is comparable to WinCLIP (95.3 vs. 95.2 on MVTec-AD and 96.2 vs. 96.4 on VisA), with lower variance across runs.\n4.3 Ablation Study # To further validate the contributions of different components in our method, we conduct an ablation study to assess the impact of each key element. We perform experiments by progressively removing or modifying parts of our model, including:\n- Removing the contrastive loss and using only standard supervised training,\n- Removing the task-specific fine-tuning step,\n- Using a simple vision model (CNN) instead of the ViT-based encoder.\nThe results of the ablation study are presented in Table 2. The ablation study clearly shows that the contrastive loss and fine-tuning are critical components that contribute to the superior performance of CLAD.\nTable 2. Ablation study results, demonstrating the contribution of key components to the performance of CLAD.\nMethod | Image-AUC | Pixel-AUC\nCLAD (full) | 94.1±1.1 | 95.3±0.1\nWithout contrastive loss | 88.5±2.3 | 91.2±1.0\nWithout fine-tuning | 91.2±2.0 | 92.5±0.8\nSimple CNN (no ViT) | 85.6±3.1 | 89.4±1.3\nThe results confirm that both the contrastive loss and the fine-tuning step are crucial for achieving high performance. 
Removing the contrastive loss results in a significant drop in both Image-AUC and Pixel-AUC. Likewise, replacing the ViT with a simpler CNN leads to a noticeable degradation in performance, highlighting the importance of using powerful visual encoders.\n4.4 Human Evaluation # To assess the practical utility and interpretability of our method, we conduct a human evaluation. We invite experts in industrial defect detection to evaluate the anomaly localization results produced by our method and compare them with the ground truth annotations. The experts are asked to rate the quality of the anomaly localization on a scale of 1 to 5, where 1 indicates poor localization and 5 indicates highly accurate localization.\nThe results of the human evaluation are shown in Table 3. CLAD significantly outperforms other methods in terms of localization accuracy, with an average rating of 4.6, indicating that the anomaly localization produced by CLAD is both accurate and highly interpretable.\nTable 3. Human evaluation of anomaly localization. CLAD significantly outperforms other methods in terms of localization accuracy.\nMethod | Human Evaluation Score\nSPADE | 3.4\nPaDiM | 3.7\nPatchCore | 4.1\nWinCLIP | 4.4\nCLAD | 4.6\nThe human evaluation results indicate that our method not only performs well in quantitative evaluations but also provides practical benefits in real-world anomaly detection tasks. The high localization accuracy allows for more effective and interpretable detection, which is crucial for industrial applications.\n4.5 Analysis of Anomaly Localization Performance # In this subsection, we analyze the results of anomaly localization produced by CLAD. We focus on both the precision and recall of the localized anomaly regions. To evaluate these metrics, we compare the predicted anomaly regions against ground truth annotations using Intersection over Union (IoU). Table 4 presents the IoU scores for each method. 
CLAD consistently achieves the highest IoU, indicating superior performance in correctly identifying the boundaries of anomalies.\nThe high IoU score of CLAD further demonstrates its ability to not only detect anomalies effectively but also localize them with high precision, making it a reliable solution for industrial anomaly detection tasks.\nTable 4. Intersection over Union (IoU) scores for anomaly localization. CLAD achieves the highest IoU, indicating superior localization performance.\nMethod | IoU Score\nSPADE | 0.63\nPaDiM | 0.66\nPatchCore | 0.72\nWinCLIP | 0.75\nCLAD | 0.8\n5 Conclusion # In this paper, we proposed a novel method, Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD), that utilizes large vision-language models to enhance both anomaly detection and localization in industrial environments. By aligning visual and textual features in a shared embedding space through contrastive learning, CLAD improves the discrimination between normal and anomalous samples, leading to more accurate anomaly detection. Our extensive experiments on the MVTec-AD and VisA datasets demonstrate that CLAD outperforms existing state-of-the-art methods in both image-level anomaly detection and pixel-level anomaly localization. Furthermore, the ablation study and human evaluation reinforce the effectiveness of key components, such as the contrastive loss and fine-tuning, and highlight the superior localization capabilities of CLAD. In conclusion, our method offers a promising solution for industrial anomaly detection tasks, combining high performance with interpretability, making it a valuable tool for industrial quality control and maintenance.\nReferences # Zhou, Y., Li, X., Wang, Q., Shen, J.: Visual in-context learning for large vision-language models. In: Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024. pp. 15890–15902. 
Association for Computational Linguistics (2024) Zhou, Y., Rao, Z., Wan, J., Shen, J.: Rethinking visual dependency in long-context reasoning for large vision-language models. arXiv preprint arXiv:2410.19732 (2024) Bergmann, P., Fauser, M., Sattlegger, D., Steger, C.: Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9592–9600 (2019) Zou, Y., Jeong, J., Pemula, L., Zhang, D., Dabeer, O.: Spot-the-difference selfsupervised pre-training for anomaly detection and segmentation. In: European Conference on Computer Vision. pp. 392–408. Springer (2022) Gu, Z., Zhu, B., Zhu, G., Chen, Y., Tang, M., Wang, J.: Anomalygpt: Detecting industrial anomalies using large vision-language models. In: Wooldridge, M.J., Dy, J.G., Natarajan, S. (eds.) Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20-27, 2024, Vancouver, Canada. pp. 1932–1940. AAAI Press (2024). https://doi.org/10.1609/AAAI.V38I3.27963 , https://doi.org/10.1609/aaai.v38i3.27963 Defard, T., Setkov, A., Loesch, A., Audigier, R.: Padim: A patch distribution modeling framework for anomaly detection and localization. In: Bimbo, A.D., Cucchiara, R., Sclaroff, S., Farinella, G.M., Mei, T., Bertini, M., Escalante, H.J., Vezzani, R. (eds.) Pattern Recognition. ICPR International Workshops and Challenges - Virtual Event, January 10-15, 2021, Proceedings, Part IV. Lecture Notes in Computer Science, vol. 12664, pp. 475–489. Springer (2020). https://doi.org/10.1007/978-3-030-68799-1_35 , https://doi.org/10.1007/978-3-030-68799-1_35 Roth, K., Pemula, L., Zepeda, J., Schölkopf, B., Brox, T., Gehler, P.: Towards total recall in industrial anomaly detection. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14318–14328 (2022) Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021), http://proceedings.mlr.press/v139/radford21a.html Zhou, Y., Long, G.: Improving cross-modal alignment for text-guided image inpainting. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. pp. 3445–3456 (2023) Zhou, Y., Long, G.: Multimodal event transformer for image-guided story ending generation. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. pp. 3434–3444 (2023) Zhou, Y., Long, G.: Style-aware contrastive learning for multi-style image captioning. In: Findings of the Association for Computational Linguistics: EACL 2023. pp. 2257–2267 (2023) Reddy, M.D.M., Basha, M.S.M., Hari, M.M.C., Penchalaiah, M.N.: Dall-e: Creating images from text. UGC Care Group I Journal 8(14), 71–75 (2021) Li, L.H., Yatskar, M., Yin, D., Hsieh, C., Chang, K.: Visualbert: A simple and performant baseline for vision and language. CoRR abs/1908.03557 (2019), http://arxiv.org/abs/1908.03557 Zhou, Y., Shen, T., Geng, X., Long, G., Jiang, D.: Claret: Pre-training a correlation-aware context-to-event transformer for event-centric generation and classification. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2559–2575 (2022) Zhou, Y., Geng, X., Shen, T., Long, G., Jiang, D.: Eventbert: A pre-trained model for event correlation reasoning. 
In: Proceedings of the ACM Web Conference 2022. pp. 850–859 (2022) Chen, Y., Li, L., Yu, L., Kholy, A.E., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: UNITER: learning universal image-text representations. CoRR abs/1909.11740 (2019), http://arxiv.org/abs/1909.11740 Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: VL-BERT: pre-training of generic visual-linguistic representations. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net (2020), https://openreview.net/forum?id=SygXPaEYvH Zhou, Y., Tao, W., Zhang, W.: Triple sequence generative adversarial nets for unsupervised image captioning. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 7598–7602. IEEE (2021) Zhou, Y.: Sketch storytelling. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 4748–4752. IEEE (2022) Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021) Yuan, L., Chen, D., Chen, Y., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., Liu, C., Liu, M., Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B., Xiao, Z., Yang, J., Zeng, M., Zhou, L., Zhang, P.: Florence: A new foundation model for computer vision. CoRR abs/2111.11432 (2021), https://arxiv.org/abs/2111.11432 Heger, J., Desai, G., El Abdine, M.Z.: Anomaly detection in formed sheet metals using convolutional autoencoders. Procedia CIRP 93, 1281–1285 (2020) Li, D., Chen, D., Jin, B., Shi, L., Goh, J., Ng, S.: MAD-GAN: multivariate anomaly detection for time series data with generative adversarial networks. In: Tetko, I.V., Kurková, V., Karpov, P., Theis, F.J. (eds.) 
Artificial Neural Networks and Machine Learning - ICANN 2019: Text and Time Series - 28th International Conference on Artificial Neural Networks, Munich, Germany, September 17-19, 2019, Proceedings, Part IV. Lecture Notes in Computer Science, vol. 11730, pp. 703–716. Springer (2019). https://doi.org/10.1007/978-3-030-30490-4_56 , https://doi.org/10.1007/978-3-030-30490-4_56 Palakurti, N.R.: Challenges and future directions in anomaly detection. In: Practical Applications of Data Processing, Algorithms, and Modeling, pp. 269–284. IGI Global (2024) Dosovitskiy, A., Fischer, P., Springenberg, J.T., Riedmiller, M.A., Brox, T.: Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(9), 1734–1747 (2016). https://doi.org/10.1109/TPAMI.2015.2496141 , https://doi.org/10.1109/TPAMI.2015.2496141 Wang, Q., Hu, H., Zhou, Y.: Memorymamba: Memory-augmented state space model for defect recognition. arXiv preprint arXiv:2405.03673 (2024) Parsai, S., Mahajan, S.: Anomaly detection using long short-term memory. In: 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC). pp. 333–337. IEEE (2020) Zou, H., Cao, K., Jiang, C.: Spatio-temporal visual analysis for urban traffic characters based on video surveillance camera data. ISPRS International Journal of Geo-Information 10(3), 177 (2021) Defard, T., Setkov, A., Loesch, A., Audigier, R.: Padim: A patch distribution modeling framework for anomaly detection and localization. In: Bimbo, A.D., Cucchiara, R., Sclaroff, S., Farinella, G.M., Mei, T., Bertini, M., Escalante, H.J., Vezzani, R. (eds.) Pattern Recognition. ICPR International Workshops and Challenges - Virtual Event, January 10-15, 2021, Proceedings, Part IV. Lecture Notes in Computer Science, vol. 12664, pp. 475–489. Springer (2020). https://doi.org/10.1007/978-3-030-68799-1_35 , https://doi.org/10.1007/978-3-030-68799-1_35 14 K. Qian et al. 
Jeong, J., Zou, Y., Kim, T., Zhang, D., Ravichandran, A., Dabeer, O.: Winclip: Zero-/few-shot anomaly classification and segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19606– 19616 (2023) ","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/exploring-large-vision-language-models-for-robust-and-efficient-industrial-anomaly-detection/","section":"Papers","summary":"Proposes a novel approach (CLAD) leveraging large vision-language models with contrastive cross-modal training for improved industrial anomaly detection and localization, enhancing interpretability and robustness.","title":"Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection","type":"other"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/fang-liu/","section":"Authors","summary":"","title":"Fang Liu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/fei-li/","section":"Authors","summary":"","title":"Fei Li","type":"authors"},{"content":" Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models # Yuchen Yang 1⋆ , Kwonjoon Lee 2 , Behzad Dariush 2 , Yinzhi Cao 1 , and Shao-Yuan Lo 2\n1\nJohns Hopkins University {yc.yang, yinzhi.cao}@jhu.edu 2 Honda Research Institute USA {kwonjoon_lee, bdariush, shao-yuan_lo}@honda-ri.com\nAbstract. Video Anomaly Detection (VAD) is crucial for applications such as security surveillance and autonomous driving. However, existing VAD methods provide little rationale behind detection, hindering public trust in real-world deployments. In this paper, we approach VAD with a reasoning framework. Although Large Language Models (LLMs) have shown revolutionary reasoning ability, we find that their direct use falls short of VAD. 
Specifically, the implicit knowledge pre-trained in LLMs focuses on general context and thus may not apply to every specific real-world VAD scenario, leading to inflexibility and inaccuracy. To address this, we propose AnomalyRuler, a novel rule-based reasoning framework for VAD with LLMs. AnomalyRuler comprises two main stages: induction and deduction. In the induction stage, the LLM is fed with few-shot normal reference samples and then summarizes these normal patterns to induce a set of rules for detecting anomalies. The deduction stage follows the induced rules to spot anomalous frames in test videos. Additionally, we design rule aggregation, perception smoothing, and robust reasoning strategies to further enhance AnomalyRuler\u0026rsquo;s robustness. AnomalyRuler is the first reasoning approach for the one-class VAD task, which requires only few-normal-shot prompting without the need for full-shot training, thereby enabling fast adaption to various VAD scenarios. Comprehensive experiments across four VAD benchmarks demonstrate AnomalyRuler\u0026rsquo;s state-of-the-art detection performance and reasoning ability. AnomalyRuler is open-source and available at: https://github.com/Yuchen413/AnomalyRuler\n1 Introduction # Video Anomaly Detection (VAD) aims to identify anomalous activities, which are infrequent or unexpected in surveillance videos. It has a wide range of practical applications, including security (e.g., violence), autonomous driving (e.g., traffic accidents), etc. VAD is a challenging problem since anomalies are rare and longtailed in real life, leading to a lack of large-scale representative anomaly data.\n⋆ This work was mostly done when Y. Yang was an intern at HRI-USA.\nFig. 1: Comparison of one-class VAD approaches. In this specific safety application example, only \u0026ldquo;walking\u0026rdquo; is normal. The test frame contains \u0026ldquo;skateboarding\u0026rdquo;, so it is abnormal. 
(a) Traditional methods require full-shot training and only output anomaly scores, lacking reasoning. (b) Direct LLM use may not align with specific VAD needs. Here GPT-4V mistakenly treats \u0026ldquo;skateboarding\u0026rdquo; as normal. (c) Our AnomalyRuler has induction and deduction stages. It derives rules from few-shot normal reference frames to detect anomalies, correctly identifying \u0026ldquo;skateboarding\u0026rdquo; as an anomaly.\nHence, the one-class VAD (a.k.a. unsupervised VAD) paradigm [16 , 36 , 43 , 45 , 53] is preferred, as it assumes that only the more accessible normal data are available for training. Most existing one-class VAD methods learn to model normal patterns via self-supervised pretext tasks, such as frame reconstruction [16 , 25 , 27 , 36 , 45 , 53 , 54] and frame order classification [14 , 43 , 49]. Despite good performance, these traditional methods can only output anomaly scores, providing little rationale behind their detection results (see Fig. 1a). This hinders them from earning public trust when deployed in real-world products.\nWe approach the VAD task with a reasoning framework toward a trustworthy system, which is less explored in the literature. An intuitive way is to incorporate the emergent Large Language Models (LLMs) [1 , 7 , 19 , 38 , 39 , 47], which have shown revolutionary capability in various reasoning tasks. Still, we find that their direct use falls short of performing VAD. Specifically, the implicit knowledge pre-trained in LLMs focuses on general context, meaning that it may not always align with specific real-world VAD applications. In other words, there is a mismatch between an LLM\u0026rsquo;s understanding of anomalies and the anomaly definitions required for certain scenarios. For example, the GPT-4V [1] typically treats \u0026ldquo;skateboarding\u0026rdquo; as a normal activity, whereas certain safety applications need to define it as an anomaly, such as within a restricted campus (see Fig. 1b). 
However, injecting such specific knowledge by fine-tuning LLMs for each application is costly. This highlights the necessity for a flexible prompting approach that steers LLMs\u0026rsquo; reasoning strengths to different uses of VAD.\nTo arrive at such a solution, we revisit the fundamental process of the scientific method [4] emphasizing reasoning, which involves drawing conclusions in a rigorous manner [41]. Our motivation stems from two types of reasoning: inductive reasoning, which infers generic principles from given observations, and deductive reasoning, which derives conclusions based on given premises. In this paper, we propose AnomalyRuler, a new VAD framework based on reasoning with LLMs. AnomalyRuler consists of an induction stage and a deduction stage, as shown in Fig. 1c. In the induction stage, the LLM is fed with visual descriptions of few-shot normal samples as references to derive a set of rules for determining normality. Here we employ a Vision-Language Model (VLM) [23 , 51] to generate the description for each input video frame. Next, the LLM derives a set of rules for detecting anomalies by contrasting them with the rules for normality. The deduction stage, which also serves as the inference stage, follows the induced rules to identify anomalous frames in test video sequences. Additionally, in response to potential perception and reasoning errors by the VLM and LLM, we design three strategies: rule aggregation via randomized smoothing [10] to mitigate rule induction errors; perception smoothing via the proposed Exponential Majority Smoothing to reduce perception errors and enhance temporal consistency; and robust reasoning via a recheck mechanism to produce reliable reasoning output. These strategies are integrated into the AnomalyRuler pipeline to further enhance its detection robustness.\nApart from equipping VAD with reasoning ability, AnomalyRuler offers several advantages. 
First, AnomalyRuler is a novel few-normal-shot prompting approach that utilizes only a few normal samples from a training set as references to derive the rules for VAD. This avoids the need for expensive full-shot training or fine-tuning of the entire training set, as required by traditional one-class VAD methods. Importantly, it enables efficient adaption by redirecting LLM\u0026rsquo;s implicit knowledge to different specific VAD applications through just a few normal reference samples. Second, AnomalyRuler shows strong domain adaptability across datasets, as the language provides consistent descriptions across different visual domains, e.g., \u0026ldquo;walking\u0026rdquo; over visual data variance. This allows the application of induced rules to datasets with similar scenarios but distinct visual appearances. Furthermore, AnomalyRuler is a generic framework that is complementary to VLM and LLM backbones. It accommodates both closed-source models such as the GPT family [1 , 39] and open-source alternatives such as Mistral [19]. To the best of our knowledge, the proposed AnomalyRuler is the first reasoning approach for the one-class VAD problem. Extensive experiments on four VAD datasets demonstrate AnomalyRuler\u0026rsquo;s state-of-the-art performance, reasoning ability, and domain adaptability.\nIn summary, this paper has three main contributions. (1) We propose a novel rule-based reasoning framework for VAD with LLMs, namely AnomalyRuler. To the best of our knowledge, it is the first reasoning approach for one-class VAD. (2) The proposed AnomalyRuler is a novel few-normal-shot prompting approach that eliminates the need for expensive full-shot tuning and enables fast adaption to various VAD scenarios. 
(3) We propose rule aggregation, perception smoothing, and robust reasoning strategies for AnomalyRuler to enhance its robustness, leading to state-of-the-art detection performance, reasoning ability, and domain adaptability.\n2 Related Work # Video Anomaly Detection. VAD is a challenging task since anomaly data are scarce and long-tailed. Therefore, researchers often focus on the one-class VAD (a.k.a. unsupervised VAD) paradigm [14 , 16 , 18 , 25 , 27 , 36 , 43 , 45 , 49 , 53 , 54], which uses only normal data during training. Most one-class methods learn to model normal patterns via self-supervised pretext tasks, based on the assumption that the model would obtain poor pretext task performance on anomaly data. Reconstruction-based methods [16 , 25 , 27 , 36 , 45 , 53 , 54] employ generative models such as auto-encoders and diffusion models to perform frame reconstruction or frame prediction as pretext tasks. Distance-based methods [14 , 43 , 49] use classifiers to perform pretext tasks such as frame order classification. These traditional methods can only output anomaly scores, providing little rationale behind their detection. Several recent studies explore utilizing VLMs or LLMs in anomaly detection. Elhafsi et al. [12] analyze semantic anomalies in driving scenes with an object detector [33] and an LLM [7]. However, this method relies on predefined concepts of normality and anomaly, which limits its adaptation to different scenarios and prevents it from handling long-tailed, undefined anomalies. Moreover, this method has not been evaluated on standard VAD benchmarks [2 , 22 , 24 , 28]. Cao et al. [8] explore the use of GPT-4V for anomaly detection, but such direct use may suffer from the misalignment between GPT-4V\u0026rsquo;s implicit knowledge and specific VAD needs, as discussed. Gu et al. [15] adopt a large VLM for anomaly detection, but it focuses on industrial images. 
Despite supporting dialogues, this method can only describe anomalies rather than explain the rationales behind its detection. Lv et al. [31] equip video-based LLMs in the VAD framework to provide detection explanations. It involves three-phase training to fine-tune the heavy video-based LLMs. Besides, it focuses on weakly-supervised VAD, a relaxed paradigm that requires training with anomaly data and labels. Different from these works, our AnomalyRuler provides rule-based reasoning via efficient few-normal-shot prompting and enables fast adaption to different VAD scenarios.\nLarge Language Models. LLMs [1 , 7 , 19 , 38 , 39 , 46 , 47] have achieved significant success in natural language processing and are recently being explored for computer vision problems. Recent advances, such as the GPT family [1 , 7 , 38 , 39], the LLaMA family [46 , 47], and Mistral [19], have shown remarkable capabilities in understanding and generating human language. On the other hand, large VLMs [1 , 21 , 23 , 34 , 44 , 51 , 57 , 58 , 60] have shown promise in bridging the vision and language domains. BLIP-2 [21] leverages Q-Former to integrate visual features into a language model. LLaVA [23] introduces a visual instruction tuning method for visual and language understanding. CogVLM [51] trains a visual expert module to improve large VLM\u0026rsquo;s vision ability. Video-LLaMA [57] extends LLMs to understand video data. These models\u0026rsquo; parametric knowledge is trained for general purposes and thus may not apply to every VAD application. Recent studies explore prompting methods to exploit LLMs\u0026rsquo; reasoning ability. Chain-of-Thought (CoT) [11 , 52] guides LLMs to solve complex problems via multiple smaller and manageable intermediate steps. Least-to-Most (LtM) [20 , 59] decomposes a complex problem into multiple simpler sub-problems and solves them in sequence.\nFig. 2: The AnomalyRuler pipeline consists of two main stages: induction and deduction. 
The induction stage involves: i) visual perception converts normal reference frames to text descriptions; ii) rule generation derives rules based on these descriptions to determine normality and anomaly; iii) rule aggregation employs a voting mechanism to mitigate errors in rules. The deduction stage involves: i) visual perception converts continuous frames to descriptions; ii) perception smoothing adjusts these descriptions considering temporal consistency to ensure neighboring frames share similar characteristics; iii) robust reasoning rechecks the previous dummy answers and outputs reasoning.\nHypotheses-to-Theories (HtT) [61] learns a rule library for reasoning from labeled training data in a supervised manner. However, a reasoning approach for the VAD task in the one-class paradigm has not been well explored.\n3 Induction # The induction stage aims to derive a set of rules from a few normal reference frames for performing VAD. The top part of Fig. 2 shows the three modules in the induction pipeline. The visual perception module utilizes a VLM which takes a few normal reference frames as inputs and outputs frame descriptions. The rule generation module uses an LLM to generate rules based on these descriptions. The rule aggregation module employs a voting mechanism to mitigate the errors from rule generation. In the following sections, we discuss each module and the strategies applied in detail.\n3.1 Visual Perception # We design the visual perception module as the initial step in our pipeline. This module utilizes a VLM to convert video frames into text descriptions. We define F_normal = {f^normal_0, . . . , f^normal_n} as the few-normal-shot reference frames, with each frame f^normal_i ∈ F_normal randomly chosen from the training set. This module outputs the text description of each normal reference frame:\nD_normal = {VLM(f^normal_i, p_v) | f^normal_i ∈ F_normal}, with p_v as the prompt \u0026ldquo;What are people doing? 
What are in the images other than people?\u0026rdquo;. Instead of directly asking \u0026ldquo;What are in the image?\u0026rdquo;, we design p_v to separate humans and the environment with the following advantages. First, it enhances perception precision by directing the model\u0026rsquo;s attention to specific aspects of the scene, ensuring that no details are overlooked. Second, it simplifies the following rule generation module by dividing the task into two subproblems [20], i.e., rules for human activities and rules for environmental objects. We denote this strategy as Human and Environment.\n3.2 Rule Generation # With the text descriptions from normal reference frames D_normal, we design a Rule Generation module that uses a frozen LLM to generate rules (denoted as R). In formal terms, R = {LLM(d^normal_i, p_g) | d^normal_i ∈ D_normal}, where p_g is the prompt detailed in Appendix A.2. We craft p_g with three strategies to guide the LLM in gradually deriving rules from the observed normal patterns:\nNormal and Anomaly. The prompt p_g guides the LLM to perform contrast: it first induces rules for normal based on D_normal, which are assumed to be ground-truth normal. Then, it generates rules for anomalies by contrasting them with the rules for normal. For instance, if \u0026ldquo;walking\u0026rdquo; is a common pattern in D_normal, it becomes a normal rule, and then \u0026ldquo;non-walking movement\u0026rdquo; will be included in the rules for anomaly. This strategy sets a clear boundary between normal and anomaly without access to anomaly frames.\nAbstract and Concrete. The prompt p_g helps the LLM to perform analogy, which starts from an abstract concept and then effectively generalizes to more concrete examples. 
Taking the same \u0026ldquo;walking\u0026rdquo; example, the definition of a normal rule is now expanded to \u0026ldquo;walking, whether alone or with others.\u0026rdquo; Consequently, the anomaly rule evolves to include specific non-walking movements, i.e., \u0026ldquo;non-walking movement, such as riding a bicycle, scooting, or skateboarding.\u0026rdquo; This strategy clarifies the rules with detailed examples and enables the LLM to use analogy for reasoning without exhaustively covering every potential scenario.\nHuman and Environment. This strategy is inherited from the Visual Perception module. The prompt p_g leads the LLM to pay attention separately to environmental elements (e.g., vehicles or scene factors) and human activities. This enriches the rule set for VAD tasks, where anomalies often arise from interactions between humans and their environment.\nThese strategies align with the spirit of CoT [52] yet are further refined for the VAD task. The ablation study in Section 5.4 demonstrates their effectiveness.\n3.3 Rule Aggregation # The rule aggregation module uses an LLM as an aggregator with a voting mechanism to combine n sets of rules (i.e., R) generated independently from n randomly chosen normal reference frames into one set of robust rules, R_robust = LLM(R, p_a).\nThis module aims to mitigate errors from previous stages, such as the visual perception module\u0026rsquo;s potential misinterpretation of \u0026ldquo;walking\u0026rdquo; as \u0026ldquo;skateboarding\u0026rdquo;, leading to incorrect rules. The aggregation process filters out uncommon elements by retaining rule elements consistently present across the n sets. The prompt p_a for the LLM to achieve this is detailed in Appendix A.2. This strategy is based on the assumption of randomized smoothing [10], where errors may occur on a single input but are less likely to consistently occur across multiple randomly sampled inputs. 
Therefore, by aggregating these outputs, AnomalyRuler generates rules more resilient to individual errors. The hyperparameter n can be treated as the number of batches. We define m as the number of normal reference frames per batch, i.e., the batch size; for simplicity, the previous discussion assumes that each batch has only one frame, i.e., m = 1. We show the effectiveness of the rule aggregation and provide an ablation on different n and m values in Section 5.4.\n4 Deduction # After the induction stage derives a set of robust rules, the deduction stage follows these rules to perform VAD. The bottom part of Fig. 2 illustrates the deduction stage, which aims to precisely perceive each frame of videos and then use the LLM to reason if they are normal or abnormal based on the rules. To achieve this goal, we design three modules. First, the visual perception module works as described in the induction stage. However, instead of taking the few-normal-shot reference frames, the deduction stage processes continuous frames from each test video and outputs a series of frame descriptions D = {d_0, d_1, . . . , d_t}. Second, the perception smoothing module reduces errors with the proposed Exponential Majority Smoothing. This step alone can provide preliminary detection results, referred to as AnomalyRuler-base. Third, the robust reasoning module utilizes an LLM to recheck the preliminary detection results against the rules and perform reasoning. The perception smoothing and robust reasoning modules are elaborated in the following sections.\n4.1 Perception Smoothing # As discussed in Section 3.3, visual perception errors can occur in the induction stage, and this concern extends to the deduction stage as well. To address this challenge, we propose a novel mechanism named Exponential Majority Smoothing. This mechanism mitigates the errors by considering temporal consistency in videos, i.e., movements are continuous and should exhibit consistent patterns over time. 
We utilize the results of this smoothing to guide the correction of frame descriptions, enhancing AnomalyRuler\u0026rsquo;s robustness to errors. There are four key steps:\nInitial Anomaly Matching. For the continuous frame descriptions D = {d_0, d_1, . . . , d_t}, AnomalyRuler first matches the anomaly keywords K found within the anomaly rules from the induction stage (see details in Appendix A.2) and assigns each d_i a predicted label y_i, where i ∈ [0, t]. Formally, we have y_i = 1 if ∃k ∈ K such that k ⊆ d_i, indicating an anomaly triggered by keywords such as the ing-verbs \u0026ldquo;riding\u0026rdquo; or \u0026ldquo;running\u0026rdquo;. Otherwise, y_i = 0 indicates normal. We denote the initial matching predictions as Y = {y_0, y_1, . . . , y_t}.\nExponential Majority Smoothing. We propose an approach that combines Exponential Moving Average (EMA) and Majority Vote. This approach is designed to enhance the continuity in human or object movements by adjusting the predictions to reflect the most common state within a specified window. The final smoothed predictions are denoted as Ŷ = {ŷ_0, ŷ_1, . . . , ŷ_t}, where each ŷ_i is either 1 or 0. Formally, we have:\nStep I: EMA. For the original prediction y_t, the EMA value s_t is computed as\ns_t = [Σ_{i=0}^{t} (1−α)^{t−i} y_i] / [Σ_{i=0}^{t} (1−α)^{i}],\nwhere α is the parameter that influences the weighting of data points in the EMA calculation.\nStep II: Majority Vote. The idea is to apply a majority vote to smooth the prediction within a window centered at each EMA value s_i with a padding size p. This means that for each s_i, we consider its neighboring EMA values within the window and determine the smoothed prediction ŷ_i based on the majority of these values being above or below a threshold τ. We define this threshold as the mean of all EMA values: τ = (1/t) Σ_{i=1}^{t} s_i. Formally, the smoothed prediction ŷ_i is determined as:\nŷ_i = 1 if Σ_{j=max(1, i−p)}^{min(i+p, t)} 1(s_j \u003e τ) \u003e [min(i+p, t) − max(1, i−p) + 1] / 2, and ŷ_i = 0 otherwise, (1)\nwhere 1(·) denotes the indicator function and the window size is adaptively defined as min(i + p, t) − max(1, i − p) + 1, ensuring that the window does not extend beyond the boundaries determined by the range from max(1, i − p) to min(i + p, t).\nAnomaly Score. Given that Ŷ represents the initial detection results of AnomalyRuler, we can further assess these by calculating an anomaly score through a secondary EMA. Specifically, the anomaly scores are denoted as A = {a_0, a_1, . . . , a_t}, where a_t is:\na_t = [Σ_{i=0}^{t} (1−α)^{t−i} ŷ_i] / [Σ_{i=0}^{t} (1−α)^{i}]. (2)\nWe denote the above procedure AnomalyRuler-base as a baseline of our method, which provides a dummy answer, i.e., \u0026ldquo;Anomaly\u0026rdquo; if ŷ_i = 1 and \u0026ldquo;Normal\u0026rdquo; otherwise, with an anomaly score that is comparable with the state-of-the-art VAD methods [3 , 25 , 35 , 43]. Subsequently, AnomalyRuler utilizes the dummy answer in the robust reasoning module for further analysis.\nDescription Modification. In this step, AnomalyRuler modifies the descriptions D by comparing Y and Ŷ and outputs the modified descriptions D̂. If y_i = 0 while ŷ_i = 1, indicating a false negative in the perception module, AnomalyRuler corrects d_i by adding \u0026ldquo;There is a person {k}.\u0026rdquo;, where k ∈ K is the most frequent anomaly keyword within the window size w. Conversely, if y_i = 1 while ŷ_i = 0, indicating a false positive in the perception module, AnomalyRuler modifies d_i by removing parts of the description that contain the anomaly keyword k.\n4.2 Robust Reasoning # In the robust reasoning module, AnomalyRuler utilizes an LLM to achieve the reasoning task for VAD, with the robust rules R_robust derived from the induction stage as the context. The LLM is fed with each frame\u0026rsquo;s modified description d̂_i with its dummy answer, i.e., either \u0026ldquo;Anomaly\u0026rdquo; or \u0026ldquo;Normal\u0026rdquo; generated from AnomalyRuler-base. 
We denote the output of robust reasoning as Y* = {LLM(d̂_i, ŷ_i, R_robust, p_r) | d̂_i ∈ D̂, ŷ_i ∈ Ŷ}. To ensure reliable results, the prompt p_r, detailed in Appendix A.2, guides the LLM to recheck whether the dummy answer ŷ_i matches the description d̂_i according to R_robust. This validation step, instead of directly asking the LLM to analyze d̂_i, improves decision-making by using the dummy answer as a hint. This approach helps AnomalyRuler reduce missed anomalies (false negatives) and ensures that its reasoning is more closely aligned with the rules. Additionally, to compare AnomalyRuler with the state-of-the-art approaches based on thresholding anomaly scores, we apply Equation (2), replacing ŷ_i with y*_i ∈ Y*, to output anomaly scores.\n5 Experiments # This section compares AnomalyRuler with LLM-based baselines and state-of-the-art methods in terms of both detection and reasoning abilities. We also conduct an ablation study on each module within AnomalyRuler to evaluate their contributions. Examples of complete prompts, derived rules, and outputs are illustrated in Appendix A.2.\n5.1 Experimental Setup # Datasets. We evaluate our method on four VAD benchmark datasets. (1) UCSD Ped2 (Ped2) [22]: A single-scene dataset captured in pedestrian walkways with over 4,500 frames of videos, including anomalies such as skating and biking. (2) CUHK Avenue (Ave) [28]: A single-scene dataset captured in the CUHK campus avenue with over 30,000 frames of videos, including anomalies such as running and biking. (3) ShanghaiTech (ShT) [24]: A challenging dataset that contains 13 campus scenes with over 317,000 frames of videos, containing anomalies such as biking, fighting, and vehicles in pedestrian areas. (4) UBnormal (UB) [2]: An open-set virtual dataset generated by the Cinema4D software, which contains 29 scenes with over 236,000 frames of videos. For each dataset, we use the default training and test sets that adhere to the one-class setting. 
The normal reference frames used by AnomalyRuler are randomly sampled from the normal training set. The methods are evaluated on the entire test set if not otherwise specified.\nEvaluation Metrics. Following the common practice, we use the Area Under the receiver operating characteristic Curve (AUC) as the main detection performance metric. To compare with LLM-based methods that cannot output anomaly scores, we use the accuracy, precision, and recall metrics. Besides, we adopt the Doubly-Right metric [32] to evaluate reasoning ability. All the metrics are calculated with frame-level ground truth labels.\nTable 1: Detection performance with accuracy, precision, and recall (%) compared with different VAD with LLM methods on the ShT dataset.\n| Method | Accuracy | Precision | Recall |\n| --- | --- | --- | --- |\n| Ask LLM Directly | 52.1 | 97.1 | 6.2 |\n| Ask LLM with Elhafsi et al. [12] | 58.4 | 97.9 | 15.2 |\n| Ask Video-based LLM Directly | 54.7 | 85.4 | 8.5 |\n| AnomalyRuler | 81.8 | 90.2 | 64.3 |\nImplementation Details. We implement our method, AnomalyRuler, using PyTorch [37]. If not otherwise specified, we employ CogVLM-17B [51] as the VLM for visual perception, GPT-4-1106-Preview [1] as the LLM for induction, and the open-source Mistral-7B-Instruct-v0.2 [19] as the LLM for deduction (i.e., inference), because using GPTs on entire test sets is too costly. We discuss other VLMs/LLMs choices in Appendix A.4. The default hyperparameters of AnomalyRuler are set as follows: the number of batches in rule aggregation n = 10, the number of normal reference frames per batch m = 1, the padding size p = 5 in majority vote, and the weighting parameter α = 0.33 in EMA.\n5.2 Comparison with LLM-based Baselines # Reasoning for one-class VAD using LLMs is not well explored. To demonstrate AnomalyRuler\u0026rsquo;s superiority over the direct LLM use, we build baselines that ask an LLM or a Video-based LLM directly, and we also adapt related works [8 , 12] to our target problem as baselines. At test time, let us denote test video frames as F = {f_1, f_2, . . . , f_t}. 
We elaborate on our four baselines as follows. (1) Ask LLM Directly: {LLM(d_i, p) | d_i ∈ D}, where the LLM is Mistral-7B, D is F\u0026rsquo;s frame descriptions generated by CogVLM, and p is \u0026ldquo;Is this frame description anomaly or normal?\u0026rdquo; (2) Ask LLM with Elhafsi et al. [12]: {LLM(d_i, p) | d_i ∈ D}, where the LLM is Mistral-7B, D is F\u0026rsquo;s frame descriptions generated by CogVLM, and p is [12]\u0026rsquo;s prompts and predefined concepts of normality/anomaly. (3) Ask Video-based LLMs Directly: {Video-based LLM(c_i, p) | c_i ∈ C}, where p is \u0026ldquo;Is this clip anomaly or normal?\u0026rdquo; We use Video-LLaMA [57] as the Video-based LLM, which performs clip-wise inference. Each video clip c_i consists of consecutive frames in F with the same label. (4) Ask GPT-4V with Cao et al. [8]: {GPT-4V(f_i, p) | f_i ∈ F}, where p is [8]\u0026rsquo;s prompts. As a large VLM, GPT-4V directly takes frames as inputs.\nDetection Performance. Table 1 compares the accuracy, precision, and recall on the ShT dataset. Overall, AnomalyRuler achieves significant improvements with an average increase of 26.2% in accuracy and 54.3% in recall. Such improvements are attributed to the reasoning based on the rules generated in the induction stage. In contrast, the baselines tend to predict most samples as normal based on the implicit knowledge pre-trained in LLMs, resulting in very low recall and accuracy close to a random guess. Their relatively high precision arises because they rarely predict anomalies, leading to fewer false positives.\nTable 2: Reasoning performance with the Doubly-Right metric: {RR, RW, WR, WW} (%) on 100 (limited by GPT-4\u0026rsquo;s query capacity) randomly selected frames from the ShT test set. We evaluate cases with visual perception errors (w. Perception Errors) and with manually corrected visual perception (w/o. Perception Errors). In the column headers, (w.) and (w/o.) abbreviate these two cases.\n| Method | RR (w.) | RW (w.) | WR (w.) | WW (w.) | RR (w/o.) | RW (w/o.) | WR (w/o.) | WW (w/o.) |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Ask GPT-4 Directly | 57 | 4 | 15 | 24 | 73 | 3 | 0 | 24 |\n| Ask GPT-4 with Elhafsi et al. [12] | 60 | 3 | 15 | 22 | 76 | 2 | 0 | 22 |\n| Ask GPT-4V with Cao et al. [8] | 74 | 2 | 7 | 17 | 81 | 2 | 0 | 17 |\n| AnomalyRuler | 83 | 1 | 15 | 1 | 99 | 0 | 0 | 1 |\nReasoning Performance. The reasoning performance is evaluated using the Doubly-Right metric [32]: {RR, RW, WR, WW} (%), where RR denotes Right detection with Right reasoning, RW denotes Right detection with Wrong reasoning, WR denotes Wrong detection with Right reasoning, and WW denotes Wrong detection with Wrong reasoning. We desire a high percentage of RR (the best is 100%) and low percentages of RW, WR and WW (the best is 0%). Since {RW, WR, WW} may be caused by visual perception errors rather than reasoning errors, we also consider the case with manually corrected visual perception to exclusively evaluate each method\u0026rsquo;s reasoning ability, i.e., w. Perception Errors vs. w/o. Perception Errors in Table 2.\nDue to the lack of benchmarks for evaluating reasoning for VAD, we create a dataset consisting of 100 randomly selected frames from the ShT test set, with an equal split of 50 normal and 50 abnormal frames. For each frame, we offer four choices: one normal and three anomalies, where only one choice with the matched rules is labeled as RR, while the other choices correspond to RW, WR or WW. Details and examples of this dataset are illustrated in Appendix A.3. Since the 100 randomly selected frames are not consecutive, AnomalyRuler\u0026rsquo;s perception smoothing is not used here.\nTable 2 shows the evaluation results. With perception errors, AnomalyRuler outperforms the baselines by 10% to 27% RR, and it achieves a very low WW of 1% compared to the 17% WW of the second best, Ask GPT-4V with Cao et al. [8]. Without perception errors, AnomalyRuler\u0026rsquo;s RR jumps to 99%. 
These results demonstrate AnomalyRuler\u0026rsquo;s superiority over the GPT-4(V) baselines and its strong ability to make correct detections along with correct reasoning.\n5.3 Comparison with State-of-the-Art Methods # This section compares AnomalyRuler with 15 state-of-the-art one-class VAD methods across four datasets, evaluating their detection performance and domain adaptability. The performance values of these methods are sourced from their respective original papers.\nDetection Performance. Table 3 shows the effectiveness of AnomalyRuler. There are three main observations. First, AnomalyRuler, even with its basic version AnomalyRuler-base, outperforms all the Image-Only competitors, which also do not use any additional features (e.g., bounding boxes from object detectors or 3D features from action recognition networks), on the challenging ShT and UB datasets. This suggests that our rule-based reasoning benefits the challenging one-class VAD task. Second, for Ped2 and Ave, AnomalyRuler performs on par with the Image-Only methods. This is achieved without any tuning, meaning that our few-normal-shot prompting approach is as effective as the costly full-shot training on these benchmarks. Third, AnomalyRuler outperforms AnomalyRuler-base, indicating that the robust reasoning module improves performance further.\nTable 3: AUC (%) compared with different one-class VAD methods. \u0026ldquo;Image Only\u0026rdquo; methods only rely on image features. In contrast, others employ additional features such as bounding boxes from object detectors or 3D features from action recognition networks. \u0026ldquo;Training\u0026rdquo; indicates the methods that need a full-shot training process.\n| Method | Venue | Image Only | Training | Ped2 | Ave | ShT | UB |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| MNAD [36] | CVPR-20 | ✓ | ✓ | 97.0 | 88.5 | 70.5 | - |\n| rGAN [29] | ECCV-20 | ✓ | ✓ | 96.2 | 85.8 | 77.9 | - |\n| CDAE [9] | ECCV-20 | ✓ | ✓ | 96.5 | 86.0 | 73.3 | - |\n| MPN [30] | CVPR-21 | ✓ | ✓ | 96.9 | 89.5 | 73.8 | - |\n| NGOF [50] | CVPR-21 | ✗ | ✓ | 94.2 | 88.4 | 75.3 | - |\n| HF2 [25] | ICCV-21 | ✗ | ✓ | 99.2 | 91.1 | 76.2 | - |\n| BAF [14] | TPAMI-21 | ✗ | ✓ | 98.7 | 92.3 | 82.7 | 59.3 |\n| GCL [56] | CVPR-22 | ✗ | ✓ | - | - | 79.6 | - |\n| S3R [53] | ECCV-22 | ✗ | ✓ | - | - | 80.5 | - |\n| SSL [49] | ECCV-22 | ✗ | ✓ | 99.0 | 92.2 | 84.3 | - |\n| zxVAD [3] | WACV-23 | ✗ | ✓ | 96.9 | - | 71.6 | - |\n| HSC [45] | CVPR-23 | ✗ | ✓ | 98.1 | 93.7 | 83.4 | - |\n| FPDM [54] | ICCV-23 | ✓ | ✓ | - | 90.1 | 78.6 | 62.7 |\n| SLM [43] | ICCV-23 | ✓ | ✓ | 97.6 | 90.9 | 78.8 | - |\n| STG-NF [18] | ICCV-23 | ✗ | ✓ | - | - | 85.9 | 71.8 |\n| AnomalyRuler-base | - | ✓ | ✗ | 96.5 | 82.2 | 84.6 | 69.8 |\n| AnomalyRuler | - | ✓ | ✗ | 97.9 | 89.7 | 85.2 | 71.9 |\nTable 4: AUC (%) compared with different cross-domain VAD methods. We follow the compared works to use ShT as the source domain dataset for other target datasets.\n| Method | Venue | Image Only | Training | Ped2 | Ave | ShT¹ | UB |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| rGAN [29] | ECCV-20 | ✓ | ✓ | 81.9 | 71.4 | 77.9 | - |\n| MPN [30] | CVPR-21 | ✓ | ✓ | 84.7 | 74.1 | 73.8 | - |\n| zxVAD [3] | WACV-23 | ✗ | ✓ | 95.7 | 82.2 | 71.6 | - |\n| AnomalyRuler-base | - | ✓ | ✗ | 97.4 | 81.6 | 83.5 | 65.4 |\n¹ AnomalyRuler employs UB as the source domain when ShT serves as the target domain. The competitors have no cross-domain evaluation on ShT, so we report their same-domain results.\nDomain Adaptability. Domain adaptation considers the scenario that the source domain (i.e., training/induction) dataset differs from the target domain (i.e., testing/deduction) dataset [13 , 26 , 48]. We compare AnomalyRuler with three state-of-the-art VAD methods that claim domain adaptation ability [3 , 29 , 30]. We follow the compared works to use ShT as the source domain dataset for other target datasets. As shown in Table 4, AnomalyRuler achieves the highest AUC on Ped2, ShT and UB, outperforming the competitors by 9.88% on average. On Ave, AnomalyRuler trails zxVAD [3] by 0.6%, but it is still higher than the other two methods by 8.85% on average. The results indicate that AnomalyRuler has better domain adaptability across different datasets. 
This advantage arises because language provides consistent descriptions across different visual domains, which allows the application of induced rules to datasets with similar anomaly scenarios but distinct visual appearances. In contrast, traditional methods extract high-dimensional visual features that are sensitive to visual appearances, thereby struggling to transfer their knowledge across datasets.\n5.4 Ablation Study # In this section, we look into how the proposed strategies affect AnomalyRuler. We investigate two aspects: rule quantity (i.e., the number of induced rules) and rule quality (i.e., their resulting performance). To this end, we evaluate variants of AnomalyRuler-base on the ShT dataset.\nAblation on Strategies. Table 5 shows the effects of removing individual strategies compared to using all strategies. In terms of rule quantity, removing Human and Environment or Normal and Anomaly significantly reduces the number of rules by 47.6% and 82.4%, respectively. The first reduction occurs because not separating the rules for humans and the environment halves the number of rules. The second occurs because, without deriving anomaly rules from normal rules, we only have a limited set of normal rules. Removing Abstract and Concrete or Rule Aggregation slightly increases the number of rules, as the former merges rules within the same categories and the latter removes incorrect rules. Perception Smoothing does not affect rule quantity since it is used in the deduction stage. In terms of rule quality, removing Normal and Anomaly or Rule Aggregation has the most negative impact. The former happens because when only normal rules are present, the LLM overreacts to slightly different actions such as \u0026ldquo;walking with an umbrella\u0026rdquo; compared to the rule for \u0026ldquo;walking\u0026rdquo;, leading to false positives. Furthermore, without rules for anomalies as a reference, the LLM easily misses anomalies. 
The latter occurs because perception errors in the induction stage lead to incorrect normal rules. Besides, removing any of the other strategies also decreases AUC, underscoring their significance. In summary, the proposed strategies effectively improve AnomalyRuler\u0026rsquo;s performance. There is no direct positive or negative correlation between rule quantity and quality: having too few rules leads to inadequate coverage of normality and anomaly concepts, while having too many rules causes redundancy and errors.\nAblation on Hyperparameters. Fig. 3 illustrates the effects of the hyperparameters in the rule aggregation and perception smoothing modules. For rule aggregation, we conduct cross-validation on the number of batches n = [1, 5, 10, 20] and the number of normal reference frames per batch m = [1, 2, 5, 10]. We observe that both the number of rules and AUC increase as n and m increase, but they start to fluctuate when n × m becomes large. For example, when n = 20, AUC drops from 85.9% to 72.2% as m increases because having too many reference frames (e.g., over 100) results in redundant information in a long context. For perception smoothing, we test the padding size in majority vote p = [1, 5, 10, 20] and the weighting parameter in EMA α = [0.09, 0.18, 0.33, 1]. We found p = 5 to be optimal for capturing the motion continuity in a video while avoiding the excessive noise that can occur with more neighboring frames. α adjusts the weight of the most recent frames compared to previous frames. A smaller α emphasizes previous frames, resulting in more smoothing but less responsiveness to recent changes. In general, increasing α from 0.09 to 0.33 improves AUC, suggesting that moderate EMA smoothing is beneficial.\nTable 5: Ablation on strategies. We assess the effects of removing individual strategies in AnomalyRuler. We conduct the experiments five times with different randomly selected normal reference frames for induction and report their mean and standard deviation on the ShT dataset.\nStrategy | Stage | # Rules (mean) | # Rules (std) | Accuracy (mean) | Accuracy (std) | Precision (mean) | Precision (std) | Recall (mean) | Recall (std) | AUC (mean) | AUC (std)\nw. All Below (default) | Both | 42.2 | 4.2 | 81.6 | 1.3 | 90.9 | 0.8 | 63.9 | 2.7 | 84.5 | 1.1\nw/o. Human and Environment | Both | -20.1 | +1. | -3.3 | +0.8 | -3.9 | +0.8 | -1.9 | +1.6 | -2.4 | +2.0\nw/o. Normal and Anomaly | Induction | -34.8 | -1.3 | -20.5 | +4.3 | -41.2 | +7.0 | -14.4 | +11.6 | -18.8 | +1.2\nw/o. Abstract and Concrete | Induction | +2.3 | +2. | -0.6 | -0.2 | -0.9 | -0.2 | -0.3 | -0.4 | -0.9 | +0.1\nw/o. Rule Aggregation | Induction | +8.5 | +6.1 | -9.6 | +14.7 | +1.1 | +2.9 | -10.7 | +14. | -15.8 | +0.8\nw/o. Perception Smoothing | Deduction | NA | NA | -1.7 | -0.9 | -1.9 | +0.1 | -3.8 | -0.3 | -3.3 | +0.8\nFig. 3: Ablation on hyperparameters of the (a) (b) rule aggregation and (c) perception smoothing modules on the ShT dataset.\n6 Conclusion # In this paper, we propose AnomalyRuler, a novel rule-based reasoning framework for VAD with LLMs. With the induction and deduction stages, AnomalyRuler requires only few-normal-shot prompting without the need for expensive full-shot tuning, thereby quickly steering LLMs\u0026rsquo; reasoning strengths to various specific VAD applications. To the best of our knowledge, AnomalyRuler is the first reasoning approach for one-class VAD. Extensive experiments demonstrate AnomalyRuler\u0026rsquo;s state-of-the-art performance, reasoning ability, and domain adaptability. Limitations and potential negative social impact of this work are discussed in Appendix A.1. 
In future research, we expect this work to advance broader one-class problems and related tasks, such as industrial anomaly detection [6, 55], open-set recognition [5, 40], and out-of-distribution detection [17, 42].\nAcknowledgments # This work was supported in part by National Science Foundation (NSF) under grants OAC-23-19742 and Johns Hopkins University Institute for Assured Autonomy (IAA) with grants 80052272 and 80052273. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF or JHU-IAA.\nReferences # 1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)\n2. Acsintoae, A., Florescu, A., Georgescu, M.I., Mare, T., Sumedrea, P., Ionescu, R.T., Khan, F.S., Shah, M.: Ubnormal: New benchmark for supervised open-set video anomaly detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)\n3. Aich, A., Peng, K.C., Roy-Chowdhury, A.K.: Cross-domain video anomaly detection without target domain adaptation. In: IEEE/CVF Winter Conference on Applications of Computer Vision (2023)\n4. Bacon, F.: Novum organum (1620)\n5. Bendale, A., Boult, T.E.: Towards open set deep networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2016)\n6. Bergmann, P., Fauser, M., Sattlegger, D., Steger, C.: Mvtec ad – a comprehensive real-world dataset for unsupervised anomaly detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)\n7. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
In: Conference on Neural Information Processing Systems (2020)\n8. Cao, Y., Xu, X., Sun, C., Huang, X., Shen, W.: Towards generic anomaly detection and understanding: Large-scale visual-linguistic model (gpt-4v) takes the lead. arXiv preprint arXiv:2311.02782 (2023)\n9. Chang, Y., Tu, Z., Xie, W., Yuan, J.: Clustering driven deep autoencoder for video anomaly detection. In: European Conference on Computer Vision (2020)\n10. Cohen, J., Rosenfeld, E., Kolter, Z.: Certified adversarial robustness via randomized smoothing. In: International Conference on Machine Learning (2019)\n11. Diao, S., Wang, P., Lin, Y., Zhang, T.: Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246 (2023)\n12. Elhafsi, A., Sinha, R., Agia, C., Schmerling, E., Nesnas, I.A., Pavone, M.: Semantic anomaly detection with large language models. In: Autonomous Robots (2023)\n13. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning (2015)\n14. Georgescu, M.I., Ionescu, R.T., Khan, F.S., Popescu, M., Shah, M.: A background-agnostic framework with adversarial training for abnormal event detection in video. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2021)\n15. Gu, Z., Zhu, B., Zhu, G., Chen, Y., Tang, M., Wang, J.: Anomalygpt: Detecting industrial anomalies using large vision-language models. In: AAAI Conference on Artificial Intelligence (2024)\n16. Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A.K., Davis, L.S.: Learning temporal regularity in video sequences. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2016)\n17. Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: International Conference on Learning Representations (2017)\n18. Hirschorn, O., Avidan, S.: Normalizing flows for human pose anomaly detection. 
In: IEEE/CVF International Conference on Computer Vision (2023)\n19. Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D.d.l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al.: Mistral 7b. arXiv preprint arXiv:2310.06825 (2023)\n20. Lee, S., Kim, G.: Recursion of thought: A divide-and-conquer approach to multi-context reasoning with language models. In: Annual Meeting of the Association for Computational Linguistics (2023)\n21. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pretraining with frozen image encoders and large language models. In: International Conference on Machine Learning (2023)\n22. Li, W., Mahadevan, V., Vasconcelos, N.: Anomaly detection and localization in crowded scenes. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2013)\n23. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Conference on Neural Information Processing Systems (2023)\n24. Liu, W., Luo, W., Lian, D., Gao, S.: Future frame prediction for anomaly detection – a new baseline. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018)\n25. Liu, Z., Nie, Y., Long, C., Zhang, Q., Li, G.: A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In: IEEE/CVF International Conference on Computer Vision (2021)\n26. Lo, S.Y., Oza, P., Chennupati, S., Galindo, A., Patel, V.M.: Spatio-temporal pixel-level contrastive learning-based source-free domain adaptation for video semantic segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)\n27. Lo, S.Y., Oza, P., Patel, V.M.: Adversarially robust one-class novelty detection. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)\n28. Lu, C., Shi, J., Jia, J.: Abnormal event detection at 150 fps in matlab. In: IEEE/CVF International Conference on Computer Vision (2013)\n29. Lu, Y., Yu, F., Reddy, M.K.K., Wang, Y.: Few-shot scene-adaptive anomaly detection. 
In: European Conference on Computer Vision (2020)\n30. Lv, H., Chen, C., Cui, Z., Xu, C., Li, Y., Yang, J.: Learning normal dynamics in videos with meta prototype network. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)\n31. Lv, H., Sun, Q.: Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702 (2024)\n32. Mao, C., Teotia, R., Sundar, A., Menon, S., Yang, J., Wang, X., Vondrick, C.: Doubly right object recognition: A why prompt for visual rationales. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)\n33. Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al.: Simple open-vocabulary object detection. In: European Conference on Computer Vision (2022)\n34. Mittal, H., Agarwal, N., Lo, S.Y., Lee, K.: Can\u0026rsquo;t make an omelette without breaking some eggs: Plausible action anticipation using large video-language models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)\n35. Morais, R., Le, V., Tran, T., Saha, B., Mansour, M., Venkatesh, S.: Learning regularity in skeleton trajectories for anomaly detection in videos. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)\n36. Park, H., Noh, J., Ham, B.: Learning memory-guided normality for anomaly detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)\n37. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. In: Conference on Neural Information Processing Systems (2019)\n38. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training. 
OpenAI Blog (2018)\n39. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog (2019)\n40. Safaei, B., Vibashan, V., de Melo, C.M., Hu, S., Patel, V.M.: Open-set automatic target recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (2023)\n41. Seel, N.M.: Encyclopedia of the sciences of learning (2011)\n42. Sharifi, S., Entesari, T., Safaei, B., Patel, V.M., Fazlyab, M.: Gradient-regularized out-of-distribution detection. In: European Conference on Computer Vision (2024)\n43. Shi, C., Sun, C., Wu, Y., Jia, Y.: Video anomaly detection via sequentially learning multiple pretext tasks. In: IEEE/CVF International Conference on Computer Vision (2023)\n44. Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., Cai, D.: Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355 (2023)\n45. Sun, S., Gong, X.: Hierarchical semantic contrast for scene-aware video anomaly detection. In: IEEE/CVF Computer Vision and Pattern Recognition Conference (2023)\n46. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)\n47. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)\n48. Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018)\n49. Wang, G., Wang, Y., Qin, J., Zhang, D., Bao, X., Huang, D.: Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles. 
In: European Conference on Computer Vision (2022)\n50. Wang, H., Zhang, X., Yang, S., Zhang, W.: Video anomaly detection by the duality of normality-granted optical flow. arXiv preprint arXiv:2105.04302 (2021)\n51. Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al.: Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)\n52. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Conference on Neural Information Processing Systems (2022)\n53. Wu, J.C., Hsieh, H.Y., Chen, D.J., Fuh, C.S., Liu, T.L.: Self-supervised sparse representation for video anomaly detection. In: European Conference on Computer Vision (2022)\n54. Yan, C., Zhang, S., Liu, Y., Pang, G., Wang, W.: Feature prediction diffusion model for video anomaly detection. In: IEEE/CVF International Conference on Computer Vision (2023)\n55. You, Z., Cui, L., Shen, Y., Yang, K., Lu, X., Zheng, Y., Le, X.: A unified model for multi-class anomaly detection. In: Conference on Neural Information Processing Systems (2022)\n56. Zaheer, M.Z., Mahmood, A., Khan, M.H., Segu, M., Yu, F., Lee, S.I.: Generative cooperative learning for unsupervised video anomaly detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)\n57. Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for video understanding. In: Conference on Empirical Methods in Natural Language Processing (2023)\n58. Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al.: Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514 (2023)\n59. Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al.: Least-to-most prompting enables complex reasoning in large language models. 
In: International Conference on Learning Representations (2023)\n60. Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. In: International Conference on Learning Representations (2024)\n61. Zhu, Z., Xue, Y., Chen, X., Zhou, D., Tang, J., Schuurmans, D., Dai, H.: Large language models can learn rules. arXiv preprint arXiv:2310.07064 (2023)\nA Appendix # A.1 Limitations and Potential Negative Social Impact # Limitations. Similar to most existing LLM-based studies, AnomalyRuler assumes that the employed LLM backbones have decent capabilities. Sub-optimal LLMs may hinder the effectiveness of the method. Exploring this limitation further could be an interesting future investigation.\nPotential Negative Social Impact. The proposed method may enable malicious actors to more easily adapt VLMs/LLMs for illegal surveillance. To mitigate this risk, computer security mechanisms could be integrated.\nA.2 Examples of Input Prompts and Output Results # Induction. This stage starts from n randomly chosen normal reference frames F_normal = {f^normal_1, …, f^normal_n} and outputs a set of robust rules R_robust. To simplify the illustration, we show one frame f^normal_i ∈ F_normal as an example in the visual perception and rule generation steps.\nVisual Perception\n– Input f^normal_i and prompt p_v: p_v = How many people are in the image and what is each of them doing? What are in the image other than people? Think step by step.\n– Outputs: Frame description d^normal_i = VLM(f^normal_i, p_v): d^normal_i = There are four people in the image. Starting from the left, the first person is walking on the path. The second person is walking on the bridge. The third person is also walking on the bridge. The fourth person is also walking on the bridge. Other than people, there are trees, a railing, a path, and a bridge visible in the image.\nRule Generation\n– Input prompt p_g: 
p_g = [ {\u0026#34;role\u0026#34;: \u0026#34;system\u0026#34;, \u0026#34;content\u0026#34;: As a surveillance monitor for urban safety using the ShanghaiTech dataset, my job is to derive rules for detecting abnormal human activities or environmental objects.}, {\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: Based on the assumption that the given frame descriptions are normal, Please derive rules for normal, start from an abstract concept, and then generalize to concrete activities or objects.}, {\u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34;, \u0026#34;content\u0026#34;: **Rules for Normal Human Activities:** 1. **Rules for Normal Environmental Objects:** 1.}, {\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: Compared with the above rules for normal, can you provide potential rules for anomaly? Please start from an abstract concept then generalize to concrete activities or objects, compared with normal ones.}, {\u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34;, \u0026#34;content\u0026#34;: **Rules for Anomaly Human Activities:** 1. **Rules for Anomaly Environmental Objects:** 1.}, {\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: Now you are given frame description {d^normal_i}. What are the Normal and Anomaly rules you have? Think step by step. Reply following the above format, start from an abstract concept and then generalize to concrete activities or objects. 
List them using short terms, not an entire sentence.}, ]\n– Outputs: For each normal reference frame description d^normal_i, we will get one set of rules r_i = LLM(d^normal_i, p_g). Since the structure of the rules is identical to the robust rules, we only present the robust rules in the following step as an illustration of our final induction output.\nRule Aggregation\n– Input prompt p_a: p_a = [ {\u0026#34;role\u0026#34;: \u0026#34;system\u0026#34;, \u0026#34;content\u0026#34;: As a surveillance monitor for urban safety using the ShanghaiTech dataset, my job is to organize rules for detecting abnormal activities and objects.}, {\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: You are given {n} independent sets of rules for Normal and Anomaly. For the organized normal Rules, list the given normal rules with high-frequency elements For the organized anomaly Rules, list all the given anomaly rules}, {\u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34;, \u0026#34;content\u0026#34;: **Rules for Anomaly Human Activities:** 1. **Rules for Anomaly Environmental Objects:** 1. **Rules for Normal Human Activities:** 1. **Rules for Normal Environmental Objects:** 1.}, {\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: Now you are given {n} independent sets of rules as the sublists of {R}. What rules for Anomaly and Normal do you get? Think step by step, and reply following the above format.}, ]\n– Outputs: Robust rules R_robust = LLM(R = {r_1, …, r_n}, p_a):\n**Rules for Anomaly Human Activities:** 1. Using any non-walking movement such as riding a bicycle or scooter, skateboarding 2. Running or jumping 3. 
Moving with the usual speed such as very fast or in a hurry 4. Person lying or bending down on the ground 5. Aggressive or unusual movements such as fighting or pushing 6. Loitering around the trash bin 7. Climbing on trees or into manhole/utility access points 8. Tampering with manhole covers 9. Engaging in physical altercation 10. Lingering without a clear purpose near a sensitive area 11. Falling unexpectedly\n**Rules for Anomaly Environmental Objects:** 1. Missing, removed, or ajar manhole covers 2. Ground transportation such as vehicles, vans, bicycles. 3. Unattended bags or objects (security threat) 4. Unusual objects (vehicles, large unattended luggage, etc.) 5. Unauthorized posters or signs 6. Damaged manhole cover or trees 7. Abandoned items in a public area 8. Yellow objects (when not commonly seen) 9. Graffiti on building windows or vandalism to a fence 10. Obstructions on pedestrian crossing 11. Fire hydrant spraying water without presence of emergency services 12. Unidentified objects near the tall structure 13. Smoke or fire coming from a utility access point 14. Objects thrown or falling from a height\n**Rules for Normal Human Activities:** 1. Walking alone or with another person 2. Standing and using a mobile device 3. Walking on the sidewalk 4. Walking towards or away from a building 5. Entering a building 6. Standing near a trash bin 7. Waiting at traffic lights 8. Walking on pedestrian crossing 9. Sitting on bench or ground\n**Rules for Normal Environmental Objects:** 1. Manhole covers on the ground or street 2. Trees along the street or Plants present 3. Trash bin in vicinity or placed beside the street 4. Posters on glass door 5. Static building with glass windows 6. Fence along the water 7. Pedestrian crossing markings visible 8. Standing fire hydrant 9. Static tall structure in the background 10. Utility access points on the ground\nDeduction. This stage starts from a test video that contains t continuous frames F = {f_1, …, f_t} and outputs the reasoning results Ŷ* = {ŷ*_1, …, ŷ*_t}. 
To simplify the illustration, we show two frames of this test video, f_i, f_j ∈ F, as examples that represent one anomaly frame and one normal frame, respectively.\n· Visual Perception: # – Input test frames f_i, f_j and prompt p_v. – Outputs: Frame descriptions d_i = VLM(f_i, p_v), d_j = VLM(f_j, p_v):\nd_i = There are four people in the image. One person is walking, another is also walking, the third person is riding a bicycle, and the fourth person is walking near the bicycle. Other than people, there are trees, a pathway, a trash bin, a bicycle, and two manhole covers visible in the image.\nd_j = There are two people in the image. One person appears to be walking, the other seems to be walking together. Other than people, there are two manhole covers on the ground, a trash bin, and some trees and plants.\n· Perception Smoothing: # – R_robust → K (generate anomaly keywords from anomaly rules, see Section 4). · Input prompt p_k: # p_k = You will be given a set of rules for detecting abnormal activities and objects; please extract the anomaly keywords, activities using \u0026ldquo;ing\u0026rdquo; verbs, and anomaly objects using nouns, and provide a combined Python list with each represented by a single word. The output should be in the format: [\u0026#34;object1\u0026#34;, \u0026hellip;, \u0026#34;activity1\u0026#34;, \u0026#34;activity2\u0026#34;, \u0026hellip;]. 
Now you are given {R_robust}:\n· Output K: # anomaly_from_rule = [\u0026#34;trolley\u0026#34;, \u0026#34;cart\u0026#34;, \u0026#34;luggage\u0026#34;, \u0026#34;bicycle\u0026#34;, \u0026#34;skateboard\u0026#34;, \u0026#34;scooter\u0026#34;, \u0026#34;vehicles\u0026#34;, \u0026#34;vans\u0026#34;, \u0026#34;accident\u0026#34;, \u0026#34;running\u0026#34;, \u0026#34;jumping\u0026#34;, \u0026#34;riding\u0026#34;, \u0026#34;skateboarding\u0026#34;, \u0026#34;scooting\u0026#34;, \u0026#34;lying\u0026#34;, \u0026#34;falling\u0026#34;, \u0026#34;bending\u0026#34;, \u0026#34;fighting\u0026#34;, \u0026#34;pushing\u0026#34;, \u0026#34;loitering\u0026#34;, \u0026#34;climbing\u0026#34;, \u0026#34;tampering\u0026#34;, \u0026#34;lingering\u0026#34;] – d_i → d̂_i and ŷ_i, d_j → d̂_j and ŷ_j (generate modified descriptions and smoothed predictions, see Section 4). · Robust Reasoning: # – Input prompt p_r: p_r = You will be given an description of frame, your task is to double check my initial anomaly detection result based on the rules. The rules are: {R_robust} My initial result is {ŷ_i} First, if human activity is present, which rule is matching? List the rule category, e.g., normal or anomaly, with the rule number. Second, if an environmental object is present, which rule is matching? List the rule category, e.g., normal or anomaly, with the rule number. Third, are the human activities or environmental objects anomaly? Answer: anomaly, if ANY anomaly rule (even if only one, no matter human activities or environmental objects) matches, otherwise answer: normal. Now you are given the frame {d̂_i}, think step by step. – Final reasoning output: ŷ*_i = LLM(d̂_i, ŷ_i, R_robust, p_r)\nŷ*_i = The third person is riding a bicycle, which matches rule number 1 under Anomaly Human Activities. The bicycle matches rule number 2 under Anomaly Environmental Objects. 
Answer: anomaly.\nŷ*_j = LLM(d̂_j, ŷ_j, R_robust, p_r)\nŷ*_j = The human activities present are two people walking. The rules for normal human activities include walking alone or with another person. So, the rule matching is: 1. Walking alone or with another person (normal) The environmental objects present are two manhole covers on the ground, a trash bin, and some trees and plants. So, the rules matching are: 1. Manhole covers on the ground or street (normal) 2. Trees along the street or Plants present (normal) 3. Trash bin in vicinity or placed beside the street (normal) Answer: normal.\nA.3 Examples of Doubly-Right Reasoning Evaluation # In Section 5.2, we evaluate reasoning ability using the Doubly-Right metric [32]. We create a benchmark dataset with multiple choices for Doubly-Right reasoning evaluation. The evaluation is conducted in the deduction stage, where we input the visual perception description and the induced rules to an LLM. The goal is to demonstrate that the induced rules enable LLMs to perform correct reasoning.\nWe list the prompt for reasoning evaluation below and give one example of a description and its four choices in Table 6. The content of the normal choice is fixed, while the anomaly choices comprise one correct reasoning with a matched rule and two with randomly chosen non-matched rules from our generated anomaly rules. In this example, Choices A, B, C and D correspond to RW, WW, RR and RW, respectively.\n[ {\u0026#34;role\u0026#34;: \u0026#34;system\u0026#34;, \u0026#34;content\u0026#34;: You will be given a description of the frame and four choices. Your task is to make the correct choice based on the rules. 
The rules are: {R_robust}}, {\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: Description: {d̂_i} Choices: {Four Choices} Choose just one correct answer from the options (A, B, C, or D) and output without any explanation. Please Answer:}, ]\nTable 6: An example of reasoning performance evaluation with multiple reasoning choices. In this example, Choices A, B, C and D correspond to RW, WW, RR and RW of the Doubly-Right metric, respectively. The RR choice is highlighted in yellow.\nFrame Description: There are four people in the image. One person is … with a backpack, another person is riding a bicycle, a third person is standing and looking at the bicyclist, and the fourth person is sitting on a bench. Other than people, there are trees, a trash bin, and two manhole covers visible in the image.\nMultiple Choices for Reasoning Evaluation:\nA. Anomaly, since “climbing on a tree” matches anomaly human activities “Climbing on trees or into manhole/utility access points”.\nB. Normal, since no rules for anomaly human activities or non-human objects match.\nC. Anomaly, since “riding a bicycle” matches anomaly human activities “Using any non-walking movement such as riding a bicycle or scooter, skateboarding”.\nD. Anomaly, since “a vehicle parked blocking a pedestrian crossing” matches anomaly non-human objects “Obstructions on pedestrian crossing”.\nA.4 Different VLMs/LLMs as Backbones # Table 7 shows the results of using various VLMs/LLMs as backbones in the deduction stage, compared to the default setting (the first row). All the results are based on the same rules derived in the induction stage with the default setting.\nWe categorize the comparisons into three types: (1) VLMs only: AnomalyRuler uses the same VLMs as an end-to-end solution, combining visual perception and robust reasoning. It inputs the test frame and outputs the reasoning result. This category includes GPT-4V [1], LLaVA [23], and PandaGPT [44]. (2) VLMs + Mistral [19]: We keep Mistral as the default LLM for robust reasoning and test different VLMs (e.g., OWLViT [33], LLaVA, BLIP-2 [21], RAM [58]) for visual 
ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing”. ible in the image. 
crossing” ible in the image. ble in the image. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. 
crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing”. ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. crossing” ible in the image. 
Table 7: Detection performance with accuracy, precision, and recall (%) using different VLMs/LLMs as backbones in the deduction stage on 100 (limited by GPT-4\u0026rsquo;s query capacity) randomly selected frames from the ShT test set.\n| Visual Perception | Robust Reasoning | Accuracy | Precision | Recall | Open Source |\n| --- | --- | --- | --- | --- | --- |\n| CogVLM [51] (default) | Mistral [19] (default) | 82 | 88.1 | 74 | ✓ |\n| GPT-4V [1] | GPT-4V | 83 | 88.4 | 76 | ✗ |\n| LLaVA [23] | LLaVA | 40 | 40.4 | 42 | ✓ |\n| PandaGPT [44] | PandaGPT | 37 | 31.4 | 22 | ✓ |\n| OWLViT [33] | Mistral | 71 | 82 | 54 | ✓ |\n| LLaVA | Mistral | 76 | 79.5 | 70 | ✓ |\n| BLIP-2 [21] | Mistral | 50 | 50 | 94 | ✓ |\n| RAM [58] | Mistral | 45 | 47.2 | 84 | ✓ |\n| CogVLM | GPT-3.5 [7] | 81 | 86 | 74 | ✗ |\n| CogVLM | LLaMA-2 [47] | 60 | 70.8 | 34 | ✓ |\nperception. (3) CogVLM [51] + LLMs: We use CogVLM as the fixed VLM for visual perception and test different LLMs for robust reasoning (e.g., GPT-3.5 [7], LLaMA-2 [47]). We have the following observations.\nFor the VLMs-only category, GPT-4V performs well, but its limit on the number of queries and its high cost per query make it expensive for large-scale testing. LLaVA and PandaGPT, on the other hand, show poor reasoning ability: they cannot follow the provided robust rules, and they generate irrelevant content or hallucinations. An example frame with their outputs is shown below:\nFor the VLMs + Mistral category, using OWLViT and LLaVA as visual perception modules yields usable results, though they are still 6 to 10% lower than using CogVLM. However, the results with BLIP-2 and RAM are not usable due to serious hallucinations. For example, in a normal frame featuring only people walking, BLIP-2 outputs \u0026ldquo;A sidewalk with trees, two people are walking down a sidewalk, a man is riding a skateboard on a sidewalk, a woman walking down a sidewalk in a park.\u0026rdquo;, while RAM (recognize anything) outputs \u0026ldquo;Image Tags: path | person | skate | park | pavement | plaza | skateboarder | walk\u0026rdquo;.\nFor the CogVLM + LLMs category, GPT-3.5 performs well but is expensive for large-scale testing.
LLaMA-2, on the other hand, struggles with reasoning and fails to follow the given rules as context effectively.\nIn summary, the proposed AnomalyRuler is a generic plug-and-play framework that can improve VAD performance on top of both the closed-source GPTs and open-source VLMs/LLMs such as CogVLM and Mistral. AnomalyRuler applies to various VLM/LLM backbones as long as they have decent visual perception and rule-following capabilities.\nA.5 Further Discussions on Perception Smoothing and Robust Reasoning # Sections 5.3 and 5.4 demonstrate the effectiveness of the proposed perception smoothing and robust reasoning strategies. In this section, we provide a deeper investigation into them. Specifically, we aim to examine the extent to which the smoothing step may incorrectly smooth out anomalies from a sequence of video frames, and the extent to which the robust reasoning step can rectify these errors.\nTable 8 shows that at most 0.7% of anomalies are incorrectly smoothed out by the perception smoothing step (before the robust reasoning step), indicating very low false negative rates. The subsequent robust reasoning step successfully rechecks and corrects inaccuracies in the smoothed results, further reducing the false negative rates to at most 0.15%.\nTable 8: The percentage (%) of incorrectly smoothed-out anomalies by the perception smoothing strategy on each dataset.\n| Dataset | ShT | Ave | Ped2 | UB |\n| --- | --- | --- | --- | --- |\n| Before Robust Reasoning | 0.7% | 0.4% | 0.6% | 0.3% |\n| After Robust Reasoning | 0.08% | 0.15% | 0.08% | 0.01% |\nThe false negative rates are low because the smoothing step only smooths out brief, isolated frames within a sequence of continuous frames. Table 9 shows that brief anomalies are rare in VAD datasets, as anomalies typically persist for 97.9 to 441.3 continuous frames on average due to the time required for an anomaly to enter and exit the camera\u0026rsquo;s view. We also calculated the percentage of brief continuous anomalies, i.e., those of ≤ 10 frames, among all continuous anomalies.
The ShT dataset has the highest percentage at 17.5%, with an average length of 5.5 frames. In Section 5.4, we find that a padding size p = 5 in our majority vote step is the optimal window size for ShT for capturing the predominant motion continuity in a video. This aligns with the average length of brief continuous anomalies (5.5 frames) and may explain this optimal value.\nTable 9: Statistics for the number of continuous anomaly frames per video clip of each dataset.\n| Dataset | ShT | Ave | Ped2 | UB |\n| --- | --- | --- | --- | --- |\n| # Average continuous anomaly frames | 111.3 | 97.9 | 137.3 | 441.3 |\n| % Brief continuous anomalies (≤ 10 frames) | 17.5% | 2.1% | 0.0% | 0.0% |\n| # Average brief continuous anomaly frames | 5.5 | 10.0 | 0.0 | 0.0 |\nA.6 Normal Reference Frame Sampling # The proposed few-normal-shot prompting method is particularly beneficial when only a few normal data points are available in real-world scenarios. In our experiments, we simulate this scenario by randomly sampling normal reference frames from a training set, assuming only the randomly sampled frames are available.\nHowever, even when a set of normal data (e.g., a training set) has already been collected, our few-normal-shot prompting method is still useful for fast adaptation. In this scenario, different normal reference frame sampling strategies beyond random sampling can be considered, such as sampling by GPT-4V [1]. Table 10 compares random sampling and GPT-4V sampling (ten frames each) on the ShT dataset. The results of five trials show similar performance. The reason is that normal patterns in existing VAD datasets are not very diverse. Hence, randomly sampled normal frames are efficient as references for rule induction. Requiring only a few randomly sampled reference frames is one of our contributions, but GPT-4V sampling could be a promising extension for more complicated VAD scenarios.\nTable 10: Random sampling vs. GPT-4V sampling on the ShT dataset.
Results of five trials are reported.\n| Method | # Rules | AUC (%) |\n| --- | --- | --- |\n| Random sampling (ten frames) | 42.2 ± 4.2 | 84.5 ± 1.1 |\n| GPT-4V sampling (ten frames) | 39.9 ± 6.9 | 84.8 ± 1.6 |\nA.7 Unified Anomaly Detection # Unified anomaly detection [55] considers image anomaly detection that trains a single model across different object classes. We extend this setting to VAD by considering a single model across different datasets. Specifically, the proposed AnomalyRuler can perform as a unified anomaly detection approach by using normal reference frames randomly sampled from various datasets and deriving a set of unified rules for all datasets. Table 11 shows the results, which are on par with the main evaluation in Table 3. This demonstrates that AnomalyRuler performs well under the unified anomaly detection setting by inducing effective unified rules across datasets with similar anomaly scenarios but distinct visual appearances.\nTable 11: AUC (%) of AnomalyRuler under the unified anomaly detection setting. AnomalyRuler induces unified rules from a few normal reference frames across all four datasets and is evaluated on these datasets.\n| Ped2 | Ave | ShT | UB |\n| --- | --- | --- | --- |\n| 97.6 | 85.6 | 84.7 | 68.8 |","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/follow-the-rules-reasonin-for-vad-with-llm/","section":"Papers","summary":"Proposes a rule-based reasoning framework, AnomalyRuler, for video anomaly detection using large language models, enabling fast scenario adaptation with few-normal-shot prompting and enhanced robustness through strategic modules.","title":"Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models","type":"method"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/ghazal-alinezhad-noghre/","section":"Authors","summary":"","title":"Ghazal Alinezhad Noghre","type":"authors"},{"content":"","date":"1 October
2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/gijs-dubbelman/","section":"Authors","summary":"","title":"Gijs Dubbelman","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/hamed-tabkhi/","section":"Authors","summary":"","title":"Hamed Tabkhi","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/hao-lu/","section":"Authors","summary":"","title":"Hao Lu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/hao-wang/","section":"Authors","summary":"","title":"Hao Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/haoyue-shi/","section":"Authors","summary":"","title":"Haoyue Shi","type":"authors"},{"content":" This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.\nHarnessing Large Language Models for Training-free Video Anomaly Detection # Luca Zanella¹, Willi Menapace¹, Massimiliano Mancini¹, Yiming Wang², Elisa Ricci¹,²\n¹University of Trento, ²Fondazione Bruno Kessler\nhttps://lucazanella.github.io/lavad/\nAbstract # Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to be domain-specific, thus being costly for practical deployment as any domain change will involve data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.\n1. Introduction # Video anomaly detection (VAD) aims to temporally localize events that deviate significantly from the normal pattern in a given video, i.e. the anomalies. VAD is challenging as anomalies are often undefined and context-dependent, and they rarely occur in the real world. The literature [10] often casts VAD as an out-of-distribution detection problem and learns the normal distribution using training data with different levels of supervision (see Fig. 1), including fully-supervised (i.e. frame-level supervision of both normal and abnormal videos) [1, 32], weakly-supervised (i.e. video-level supervision of both normal and abnormal videos) [11, 13, 15, 24, 28, 35], one-class (i.e. only normal videos) [18, 20, 21, 25, 37, 38], and unsupervised (i.e. unlabeled videos) [30, 31, 40]. While more supervision leads to better results, the cost of manual annotation is prohibitive. On the other hand, unsupervised methods assume abnormal videos to constitute a certain portion of the training data, a fragile assumption in practice without human intervention.\nFigure 1. We introduce the first training-free method for video anomaly detection (VAD), diverging from state-of-the-art methods that are ALL training-based with different degrees of supervision. Our proposal, LAVAD, leverages modality-aligned vision-language models (VLMs) to query and enhance the anomaly scores generated by large language models (LLMs).\nCrucially, every existing method necessitates a training procedure to establish an accurate VAD system, and this entails some limitations. One primary concern is generalization: a VAD model trained on a specific dataset tends to underperform in videos recorded in different settings (e.g., daylight versus night scenes). Another aspect, particularly relevant to VAD, is the challenge of data collection, especially in certain application domains (e.g. video surveillance) where privacy issues can hinder data acquisition. These considerations led us to explore a novel research question: Can we develop a training-free VAD method?\nIn this paper, we aim to answer this challenging question. Developing a training-free VAD model is hard due to the lack of explicit visual priors on the target setting. However, such priors might be drawn using large foundation models, renowned for their generalization capability and wide knowledge encapsulation. Thus, we investigate the potential of combining existing vision-language models (VLMs) with large language models (LLMs) in addressing training-free VAD. On top of our preliminary findings, we propose the first training-free LAnguage-based VAD method (LAVAD), that jointly leverages pre-trained VLMs and LLMs for VAD. LAVAD first exploits an off-the-shelf captioning model to generate a textual description for each video frame. We address potential noise in the captions by introducing a cleaning process based on the cross-modal similarity between captions and frames in the video.
To capture the dynamics of the scene, we use an LLM to summarize captions within a temporal window. This summary is used to prompt the LLM to provide an anomaly score for each frame, which is further refined by aggregating the anomaly scores among frames with semantically similar temporal summaries. We evaluate LAVAD on two benchmark datasets: UCF-Crime [24] and XD-Violence [36], and empirically show that our training-free proposal outperforms unsupervised and one-class VAD methods on both datasets, demonstrating that it is possible to address VAD with no training and no data collection.\nContributions. In summary, our contributions are:\n- We investigate, for the first time, the problem of training-free VAD, advocating its importance for the deployment of VAD systems in real settings where data collection may not be possible.\n- We propose LAVAD, the first language-based method for training-free VAD using LLMs to detect anomalies exclusively from a scene description.\n- We introduce novel techniques based on cross-modal similarity with pre-trained VLMs to mitigate noisy captions and refine the LLM-based anomaly scoring, effectively improving the VAD performance.\n- Experiments show that, while using no task-specific supervision and no training, LAVAD achieves competitive results w.r.t. unsupervised and one-class VAD methods, opening new perspectives for future VAD research.\n2. Related Work # Video Anomaly Detection. Existing literature on training-based VAD methods can be categorized into four groups, depending on the level of supervision: supervised, weakly-supervised, one-class classification, and unsupervised. Supervised VAD relies on frame-level labels to distinguish normal from abnormal frames [1, 32]. However, this scenario has received little attention due to its prohibitive annotation effort.
Weakly-supervised VAD methods have access to video-level labels (the entire video is labeled as abnormal if at least one frame is abnormal, otherwise it is regarded as normal) [11, 13, 15, 24, 28, 35]. Most of these methods utilize 3D convolutional neural networks for feature learning and employ a multiple instance learning (MIL) loss for training. One-class VAD methods train only on normal videos, although manual verification is necessary to ensure the normality of the collected data. Several methods [18, 20, 21, 25, 37, 38] have been proposed, e.g. considering generative models [37] or pseudo-supervised methods, where pseudo-anomalous instances are synthesized from normal training data [38]. Finally, Unsupervised VAD methods do not rely on predefined labels, leveraging both normal and abnormal videos with the assumption that most videos contain normal events [26, 27, 30, 31, 40]. Most methods in this category exploit generative models to capture normal data patterns in videos. In particular, generative cooperative learning (GCL) [40] employs alternating training: an autoencoder reconstructs input features, and pseudo-labels from reconstruction errors guide a discriminator. Tur et al. [30, 31] use a diffusion model to reconstruct the original data distribution from noisy features, calculating anomaly scores based on the reconstruction error between denoised and original samples. Other approaches [26, 27] train a regressor network from a set of pseudo-labels generated using OneClassSVM and iForest [16].\nInstead, we completely sidestep the need for collecting data and training the model by exploiting existing large-scale foundation models to design a training-free pipeline for VAD.\nLLMs for VAD. Recently, LLMs have been explored in detecting visual anomalies across diverse application domains. Kim et al.
[12] propose an unsupervised method that mainly leverages VLMs for detecting anomalies, where ChatGPT is only utilized to produce textual descriptions that characterize normal and anomalous elements. However, the method involves a human in the loop to refine the LLM\u0026rsquo;s outputs according to specific application contexts and requires further training to adapt the VLM. Other examples include exploiting LLMs for spatial anomaly detection in images, addressing specific applications in robotics [4] or industry [7].\nDifferently, we leverage LLMs together with VLMs to address temporal anomaly detection on videos and propose the first training-free method for VAD, requiring no training and no data collection.\n3. Training-Free VAD # In this section, we first formalize the VAD problem and the proposed training-free setting (Sec. 3.1). We then analyze the capabilities of LLMs in scoring anomalies in video frames (Sec. 3.2). Finally, we describe LAVAD, our proposed VAD method (Sec. 3.3).\nFigure 2. Bar plot of the VAD performance (AUC ROC) by querying LLMs with textual descriptions of video frames from various captioning models on the UCF-Crime test set. Different bars correspond to different variants of the captioning model BLIP-2 [14], while different colors indicate two different LLMs [9, 29]. For reference, we also plot the performance of the best-performing unsupervised method [27] in a red dashed line, and that of a random classifier in a gray dashed line.\nFigure 3. The anomaly score predicted by Llama [29] over time for video Shooting033 from UCF-Crime. We highlight some sample frames with their associated BLIP-2 captions to demonstrate that the caption can be semantically noisy or incorrect (red bounding boxes are for abnormal predictions while blue bounding boxes are for normal predictions). Ground-truth anomalies are highlighted.
In particular, the caption of the frame enclosed by a blue bounding box within the ground-truth anomaly fails to accurately represent the visual content, leading to a wrong classification due to the low anomaly score given by the LLM.\n3.1. Problem formulation # Given a test video V = [I_1, ..., I_M] of M frames, traditional VAD methods aim to learn a model f, which can classify each frame I ∈ V as either normal (score 0) or anomalous (score 1), i.e. f : I^M → [0, 1]^M with I being the image space. f is usually trained on a dataset D that consists of tuples in the form (V, y). Depending on the supervision level, y can be either a binary vector with frame-level labels (fully-supervised), a binary video-level label (weakly-supervised), a default one (one-class), or absent (unsupervised). However, in practice, it can be costly to collect y, as anomalies are rare, and V itself, due to potential privacy concerns. Moreover, both label and video data may need regular updates due to evolving application contexts.\nDifferently, in this paper, we introduce a novel setup for VAD, termed training-free VAD. Under this setting, we aim to estimate the anomaly score of each I ∈ V using only pre-trained models at inference time, i.e. without any training or fine-tuning involving a training dataset D.\n3.2. Are LLMs good for VAD? # We propose to address training-free VAD by exploiting recent advances in LLMs. As the use of LLMs in VAD is still in its infancy [12], we first analyze the capabilities of LLMs in producing an anomaly score based on a textual description of a video frame.\nTo achieve this, we first exploit a state-of-the-art captioning model Φ_C, i.e. BLIP-2 [14], to generate a textual description for each frame I ∈ V. We then treat anomaly score estimation as a classification task, asking an LLM Φ_LLM to select only one score from a list of 11 uniformly sampled values in the interval [0, 1], where 0 means normal and 1 anomalous.
We get the anomaly score as:\nΦ_LLM(P_C ◦ P_F ◦ Φ_C(I)), (1)\nwhere P_C is a context prompt that provides priors to the LLM regarding VAD, P_F instructs the LLM on the desired output format to facilitate automated text parsing¹, and ◦ is the text concatenation operation. We devise P_C to simulate a potential end user of a VAD system, e.g. a law enforcement agency, as we empirically observe that impersonation can be an effective way of guiding the output generation of the LLM. For example, we can form P_C as: \u0026ldquo;If you were a law enforcement agency, how would you rate the scene described on a scale from 0 to 1, with 0 representing a standard scene and 1 denoting a scene with suspicious activities?\u0026rdquo;. Note that P_C does not encode any prior on the type of anomalies itself, but just on the context.\nFinally, with the estimated anomaly score from Eq. (1), we measure the VAD performance using the standard area under the curve of the receiver operating characteristic (AUC ROC). Fig. 2 reports the results obtained on the test set of the UCF-Crime dataset [24] with different variants of BLIP-2 for obtaining the frame captions, and with different LLMs including Llama [29] and Mistral [9] for computing the frame-level anomaly scores. For reference, we also provide the state-of-the-art performance under the unsupervised setting (the closest setting to ours) [27], and the random scoring as a lower bound. The plot demonstrates that state-of-the-art LLMs possess anomaly detection capabilities, largely outperforming random scoring. However, this performance is much lower w.r.t. trained state-of-the-art methods, even those in an unsupervised setting.\n¹ The exact form of P_F can be found in the Supp. Mat.\nFigure 4. The architecture of our proposed LAVAD for addressing training-free VAD. For each test video V, we first employ a captioning model to generate a caption C_i for each frame I_i ∈ V, forming a caption sequence C. Our Image-Text Caption Cleaning component addresses noisy and incorrect raw captions based on cross-modal similarity. We replace the raw caption with a caption Ĉ_i ∈ C whose textual embedding E_T(Ĉ_i) is most aligned to the image embedding E_I(I_i), resulting in a cleaned caption sequence Ĉ. To account for scene context and dynamics, our LLM-based Anomaly Scoring component further aggregates the cleaned captions within a temporal window centered around each I_i by prompting the LLM to produce a temporal summary S_i, forming a summary sequence S. The LLM is then queried to provide an anomaly score for each frame based on its S_i, obtaining the initial anomaly scores a for all frames. Finally, our Video-Text Score Refinement component refines each a_i by aggregating the initial anomaly scores of frames whose textual embeddings of the summaries are most aligned to the representation E_V(V_i) of the video snippet V_i centered around I_i, leading to the final anomaly scores ã for detecting the anomalies (anomalous frames are highlighted) within the video.\nWe observe that two aspects might be the limiting factors in LLMs\u0026rsquo; performance. Firstly, the frame-level captions can be very noisy: the captions might be broken or may not fully reflect the visual content (see Fig. 3). Despite the use of BLIP-2 [14], the best off-the-shelf captioning model, some captions appear corrupted, thus leading to unreliable anomaly scores. Secondly, the frame-level caption lacks details about the global context and the dynamics of the scene, which are key elements when modeling videos. In the following, we address these two limitations and propose LAVAD, the first training-free method for VAD that leverages LLMs for anomaly scoring together with modality-aligned VLMs.\n3.3. LAVAD: LAnguage-based VAD # LAVAD decomposes the VAD function f into five elements (see Fig. 4).
As in the preliminary study, the first two are the captioning module ΦC mapping images to textual descriptions in the language space T, i.e. ΦC : I → T, and the LLM ΦLLM generating text from language queries, i.e. ΦLLM : T → T. The other elements involve three encoders mapping input representations to a shared latent space Z. Specifically, we have the image encoder EI : I → Z, the textual encoder ET : T → Z, and the video encoder EV : V → Z for videos. Note that all five elements involve only off-the-shelf frozen models.\nFollowing the positive findings of the preliminary analysis, LAVAD leverages ΦLLM and ΦC to estimate the anomaly score for each frame. We design LAVAD to address the limitations related to noise and lack of scene dynamics in frame-level captions by introducing three components: i) Image-Text Caption Cleaning through the vision-language representations of EI and ET, ii) LLM-based Anomaly Scoring, encoding temporal information via ΦLLM, and iii) Video-Text Score Refinement of the anomaly scores via video-text similarity, using EV and ET. In the following, we describe each component in detail.\nImage-Text Caption Cleaning. For each test video V, we first employ ΦC to generate a caption Ci for each frame Ii ∈ V. Specifically, we denote as C = [C1, . . . , CM] the sequence of captions, where Ci = ΦC(Ii). However, as shown in Sec. 3.2, the raw captions can be noisy, with broken sentences or incorrect descriptions. To mitigate this issue, we rely on the captions of the whole video C, assuming that in this set there exist captions that are unbroken and better capture the content of their respective frames, an assumption often verified in practice as the video features a scene captured by static cameras at a high frame rate. Thus, semantic content among frames can overlap regardless of their temporal distances. 
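To make the caption-cleaning idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes toy 2-D vectors standing in for the outputs of the image and text encoders; the helper names `cosine` and `clean_captions` are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def clean_captions(image_embs, caption_embs, captions):
    # For each frame, retrieve the caption from the whole video whose text
    # embedding is most similar to that frame's image embedding.
    cleaned = []
    for img in image_embs:
        best = max(range(len(captions)), key=lambda j: cosine(img, caption_embs[j]))
        cleaned.append(captions[best])
    return cleaned

# Toy data (assumed values): the second raw caption is broken, but its frame
# looks almost identical to the first frame.
image_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
caption_embs = [[1.0, 0.1], [-1.0, 0.0], [0.1, 1.0]]
captions = ['a man walks down a street', 'a a a the', 'a car on fire']
cleaned = clean_captions(image_embs, caption_embs, captions)
```

In this toy case the broken caption of the second frame is replaced by the caption of the first frame, because the retrieval ignores temporal position and looks only at cross-modal similarity.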
From this perspective, we treat caption cleaning as finding the semantically closest caption to a target frame Ii within C.\nFormally, we make use of vision-language encoders and form a set of caption embeddings by encoding each caption in C via ET, i.e. {ET(C1), . . . , ET(CM)}. For each frame Ii ∈ V, we compute its closest semantic caption as:\nĈi = arg max C∈C ⟨EI(Ii), ET(C)⟩, (2)\nwhere ⟨·, ·⟩ is the cosine similarity, and EI the image encoder of the VLM. We then build the cleaned set of captions as Ĉ = [Ĉ1, . . . , ĈM], replacing each initial caption Ci with its counterpart Ĉi retrieved from C. By performing the caption cleaning process, we can propagate the captions of frames that are semantically more aligned to the visual content, regardless of their temporal positioning, to improve or correct noisy descriptions.\nLLM-based Anomaly Scoring. The obtained caption sequence Ĉ, while being cleaner than the initial set, lacks temporal information. To overcome this, we leverage the LLM to provide temporal summaries. Specifically, we define a temporal window of T seconds, centered around Ii. Within this window, we uniformly sample N frames, forming a video snippet Vi and a caption sub-sequence Ĉi = {Ĉn}, n = 1, . . . , N. We can then query the LLM with Ĉi and a prompt PS to get the temporal summary Si centered on frame Ii:\nSi = ΦLLM(PS ◦ Ĉi), (3)\nwhere the prompt PS is formed as \u0026ldquo;Please summarize what happened in few sentences, based on the following temporal description of a scene. Do not include any unnecessary details or descriptions.\u0026rdquo;².\nCoupling Eq. (3) with the refinement process of Eq. (2), we obtain a textual description of the frame (Si) which is semantically and temporally richer than Ci. With Si, we can then query the LLM for estimating an anomaly score. Following the same prompting strategy described in Sec. 3.2, we ask ΦLLM to assign to each temporal summary Si a score ai in the interval [0, 1]. We get the score as:\nai = ΦLLM(PC ◦ PF ◦ Si), (4)\nwhere, as in Sec. 
3.2, PC is a context prompt containing VAD contextual priors, and PF provides information on the desired output format.\n² Ĉi is represented as an ordered list, with items separated by \\n.\nVideo-Text Score Refinement. By querying the LLM for each frame in the video with Eq. (4), we obtain the initial anomaly scores of the video a = [a1, . . . , aM]. However, each score in a is based purely on the language information encoded in its own summary, without taking into account the whole set of scores. Thus, we further refine them by leveraging the visual information to aggregate scores from semantically similar frames. Specifically, we encode the video snippet Vi centered around Ii using EV and all the temporal summaries using ET. Let us define Ki as the set of indices of the K closest temporal summaries to Vi in {S1, . . . , SM}, where the similarity between Vi and a summary Sj is the cosine similarity, i.e. ⟨EV(Vi), ET(Sj)⟩. We obtain the refined anomaly score ãi:\nãi = Σk∈Ki ak · exp(⟨EV(Vi), ET(Sk)⟩) / Σj∈Ki exp(⟨EV(Vi), ET(Sj)⟩), (5)\nwhere ⟨·, ·⟩ is the cosine similarity. Note that Eq. (5) exploits the same principles of Eq. (2), refining frame-level estimations (i.e. score/captions) using their visual-language similarity (i.e. video/image) with other frames in the video. Finally, with the refined anomaly scores for the test video ã = [ã1, . . . , ãM], we identify the anomalous temporal windows via thresholding.\n4. Experiments # We validate our training-free proposal LAVAD on two datasets in comparison with state-of-the-art VAD methods that are trained with different levels of supervision, as well as training-free baselines. We conduct an extensive ablation study to justify our main design choices regarding the proposed components, prompt design, and score refinement. In the following, we first describe our experimental setup in terms of datasets and performance metrics. We then present and discuss the results in Sec. 4.1, followed by the ablation study in Sec. 4.2. 
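As a reference for the Video-Text Score Refinement step of Sec. 3.3, the sketch below aggregates the initial scores of the K summaries most similar to each video snippet. It is an illustration under stated assumptions rather than the authors' code: the scores are assumed to be combined with softmax-normalised cosine similarities, the 2-D embeddings are toy values, and `cosine` and `refine_scores` are hypothetical names.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def refine_scores(scores, video_embs, summary_embs, k):
    # For each snippet embedding, take the k most similar summary embeddings
    # and average their initial scores with softmax weights (assumed form).
    refined = []
    for v in video_embs:
        sims = [cosine(v, s) for s in summary_embs]
        top = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
        weights = [math.exp(sims[j]) for j in top]
        total = sum(weights)
        refined.append(sum(scores[j] * w for j, w in zip(top, weights)) / total)
    return refined

# Toy data (assumed values): frames 2-4 show the same event, but the LLM
# gave the third frame a low score.
scores = [0.1, 0.9, 0.2, 0.9]
video_embs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
summary_embs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
refined = refine_scores(scores, video_embs, summary_embs, k=3)
```

With these toy inputs the outlying score of 0.2 on the third frame is pulled up to about 0.67, since its three nearest summaries are equally similar and carry scores 0.9, 0.2 and 0.9.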
We show more qualitative results and ablation on minor designs in the Supp. Mat.\nDatasets. We evaluate our method using two commonly used VAD datasets featuring real-world surveillance scenarios, i.e. UCF-Crime [24] and XD-Violence [36].\nUCF-Crime is a large-scale dataset that is composed of 1900 long untrimmed real-world surveillance videos, covering 13 real-world anomalies. The training set consists of 800 normal and 810 anomalous videos, while the test set includes 150 normal and 140 anomalous videos.\nXD-Violence is another large-scale dataset for violence detection, comprising 4754 untrimmed videos with audio signals and weak labels that are collected from both movies and YouTube. XD-Violence captures 6 categories of anomalies and it is divided into a training set of 3954 videos and a test set of 800 videos.\nTable 1. Comparison with state-of-the-art weakly-supervised, one-class, unsupervised and training-free methods on the UCF-Crime dataset. The best results among training-free methods are highlighted in bold.\n\n| METHOD | BACKBONE | AUC (%) |\n| --- | --- | --- |\n| SULTANI et al. [24] | C3D-RGB | 75.41 |\n| SULTANI et al. [24] | I3D-RGB | 77.92 |\n| IBL [41] | C3D-RGB | 78.66 |\n| GCL [40] | ResNext | 79.84 |\n| GCN [42] | TSN-RGB | 82.12 |\n| MIST [5] | I3D-RGB | 82.3 |\n| WU et al. [36] | I3D-RGB | 82.44 |\n| CLAWS [39] | C3D-RGB | 83.03 |\n| RTFM [28] | VideoSwin-RGB | 83.31 |\n| RTFM [28] | I3D-RGB | 84.03 |\n| WU \u0026amp; LIU [35] | I3D-RGB | 84.89 |\n| MSL [15] | I3D-RGB | 85.3 |\n| MSL [15] | VideoSwin-RGB | 85.62 |\n| S3R [34] | I3D-R | 85.99 |\n| MGFN [2] | VideoSwin-RGB | 86.67 |\n| MGFN [2] | I3D-RGB | 86.98 |\n| SSRL [13] | I3D-RGB | 87.43 |\n| CLIP-TSA [11] | ViT | 87.58 |\n| SVM [24] | - | 50 |\n| SSV [23] | - | 58.5 |\n| BODS [33] | I3D-RGB | 68.26 |\n| GODS [33] | I3D-RGB | 70.46 |\n| GCL [40] | ResNext | 74.2 |\n| TUR et al. [30] | ResNet | 65.22 |\n| TUR et al. [31] | ResNet | 66.85 |\n| DYANNET [27] | I3D | 79.76 |\n| ZS CLIP [22] | ViT | 53.16 |\n| ZS IMAGEBIND (IMAGE) [6] | ViT | 53.65 |\n| ZS IMAGEBIND (VIDEO) [6] | ViT | 55.78 |\n| LLAVA-1.5 [17] | ViT | 72.84 |\n| LAVAD | ViT | **80.28** |\n\nPerformance Metrics. 
We measure the VAD performance using the area under the curve (AUC) of the frame-level receiver operating characteristics (ROC) as it is agnostic to thresholding for the detection task. For the XD-Violence dataset, we also report the average precision (AP), i.e. the area under the frame-level precision-recall curve, following the established evaluation protocol in [36].\nImplementation Details. We sample each video every 16 frames for computational efficiency. We employ BLIP-2 [14] as the captioning module ΦC. In particular, we consider an ensemble of BLIP-2 model variants in our Image-Text Caption Cleaning technique. Please refer to the Supp. Mat. for a detailed analysis of these variants. We use Llama-2-13b-chat [29] as our LLM module ΦLLM. We use the multimodal encoders provided by ImageBind [6]. Specifically, the temporal window is T = 10 seconds, in line with the pre-trained video encoder of ImageBind. We employ K = 10 in Video-Text Score Refinement.\nTable 2. Comparison with state-of-the-art weakly-supervised, one-class, unsupervised and training-free methods on the XD-Violence dataset. ∗ denotes results reported in [26]. The best results among training-free methods are highlighted in bold.\n\n| METHOD | BACKBONE | AP (%) | AUC (%) |\n| --- | --- | --- | --- |\n| WU et al. [36] | C3D-RGB | 67.19 | - |\n| WU et al. [36] | I3D-RGB | 73.20 | - |\n| MSL [15] | C3D-RGB | 75.53 | - |\n| WU AND LIU [35] | I3D-RGB | 75.90 | - |\n| RTFM [28] | I3D-RGB | 77.81 | - |\n| MSL [15] | I3D-RGB | 78.28 | - |\n| MSL [15] | VideoSwin-RGB | 78.58 | - |\n| S3R [34] | I3D-RGB | 80.26 | - |\n| MGFN [2] | I3D-RGB | 79.19 | - |\n| MGFN [2] | VideoSwin-RGB | 80.11 | - |\n| HASAN et al. [8] | AERGB | - | 50.32∗ |\n| LU et al. [19] | Dictionary | - | 53.56∗ |\n| BODS [33] | I3D-RGB | - | 57.32∗ |\n| GODS [33] | I3D-RGB | - | 61.56∗ |\n| RAREANOM [26] | I3D-RGB | - | 68.33∗ |\n| ZS CLIP [22] | ViT | 17.83 | 38.21 |\n| ZS IMAGEBIND (IMAGE) [6] | ViT | 27.25 | 58.81 |\n| ZS IMAGEBIND (VIDEO) [6] | ViT | 25.36 | 55.06 |\n| LLAVA-1.5 [17] | ViT | 50.26 | 79.62 |\n| LAVAD | ViT | **62.01** | **85.36** |\n\n4.1. 
Comparison with state of the art # We compare LAVAD against state-of-the-art approaches, including unsupervised methods [26, 27, 30, 31, 40], one-class methods [8, 19, 23, 24, 33], and weakly-supervised methods [2, 5, 11, 13, 15, 24, 28, 34–36, 39–42]. In addition, as none of the above methods specifically address VAD in a training-free setup, we further introduce a few training-free baselines with VLMs, i.e. CLIP [22], ImageBind [6], and LLaVA [17].\nSpecifically, we introduce Zero-shot CLIP [22] (ZS CLIP) and Zero-shot ImageBind [6] (ZS IMAGEBIND). For both baselines, we exploit their pre-trained encoders to compute the cosine similarities of each frame embedding against the textual embeddings of two prompts: \u0026ldquo;a standard scene\u0026rdquo; and \u0026ldquo;a scene with suspicious or potentially criminal activities\u0026rdquo;. We then apply a softmax function to the cosine similarities to obtain the anomaly score for each frame. Since ImageBind also supports the video modality, we include ZS IMAGEBIND (VIDEO) using the cosine similarities of the video embeddings against the two prompts. We choose ViT-B/32 [3] as the visual encoder for ZS CLIP and ViT-H/14 [3] as the visual encoder for ZS IMAGEBIND (IMAGE, VIDEO); both utilize CLIP\u0026rsquo;s text encoder [22]. Finally, we introduce a baseline based on LLAVA-1.5, where we directly query LLaVA [17] to generate an anomaly score for each frame, using the same context prompt as in our method. LLAVA-1.5 uses CLIP ViT-L/14 [22] as the visual encoder and Vicuna-13B as the LLM.\nFigure 5. We showcase qualitative results obtained by LAVAD on four test videos, including two videos from UCF-Crime (top row) and two videos from XD-Violence (bottom row). For each video, we plot the anomaly score over frames computed by our method. 
We display some keyframes alongside their most aligned temporal summary (blue bounding boxes for normal frame predictions and red bounding boxes for abnormal frame predictions), illustrating the relevance among the predicted anomaly score, visual content, and description. Ground-truth anomalies are highlighted.\nTab. 1 presents the results of the full comparison against the state-of-the-art methods, as well as our introduced training-free baselines, on the UCF-Crime dataset [24]. Notably, our method without any training demonstrates superior performance compared to both the one-class and unsupervised baselines, achieving a higher AUC ROC, with a significant improvement of +6.08% when compared to GCL [40] and a minor improvement of +0.52% against the current state of the art obtained by DyAnNet [27].\nMoreover, it is evident that training-free VAD is a challenging task, as a naive application of VLMs to VAD, such as ZS CLIP, ZS IMAGEBIND (IMAGE) and ZS IMAGEBIND (VIDEO), leads to poor VAD performance. VLMs are mostly trained to attend to foreground objects, rather than actions or the background information in an image that contributes to the judgment of anomalies. This might be the main reason for the poor generalization of VLMs on the VAD task. The baseline LLAVA-1.5, which directly prompts for the anomaly score for each frame, achieves a much higher VAD performance than directly exploiting VLMs in a zero-shot manner. Yet, its performance is still inferior to ours, where we leverage a richer temporal scene description for anomaly estimation, instead of working on a single-frame basis. The benefit of the temporal summary is also confirmed by our ablation study as presented in Tab. 3. We also report the comparison against state-of-the-art methods and our baselines evaluated on XD-Violence in Tab. 2. Ours achieves superior performance compared to all one-class and unsupervised methods. 
In particular, LAVAD outperforms RareAnom [26], the best-scoring unsupervised method, by a substantial margin of +17.03% in terms of AUC ROC.\nQualitative Results. Fig. 5 shows qualitative results of LAVAD with sample videos from UCF-Crime and XD-Violence, where we highlight some keyframes with their temporal summaries. In the three abnormal videos (Row 1, Column 1, and Row 2), we can see that the temporal summaries of the keyframes during the anomalies accurately portray the visual content regarding the anomalous situations, which in turn helps LAVAD to correctly identify the anomalies. In the case of Normal Videos 722 (Row 1, Column 2), we can see that LAVAD consistently predicts a low anomaly score throughout the video. For more qualitative results on the test videos, please refer to the Supp. Mat.\n4.2. Ablation study # In this section, we present the ablation study conducted with the UCF-Crime dataset. We first ablate the effectiveness of each proposed component of LAVAD. Then, we demonstrate the impact of task-related priors in the context prompt PC when prompting the LLM for estimating the anomaly scores. Finally, we show the effect of K when aggregating the K semantically closest frames in the Video-Text Score Refinement component.\nEffectiveness of each proposed component. We ablate different variants of our proposed method LAVAD to prove the effectiveness of the three proposed components, including Image-Text Caption Cleaning, LLM-based Anomaly Scoring, and Video-Text Score Refinement. Tab. 3 shows the results of all ablated variants of LAVAD. When the Image-Text Caption Cleaning component is omitted (Row 1), i.e. the LLM only exploits the raw captions to perform temporal summary and obtain the anomaly scores with refinement, the VAD performance degrades by −3.8% compared to LAVAD in terms of AUC ROC (Row 4). 
If we do not perform temporal summary, and only rely on the cleaned captions with refinement (Row 2), we observe a significant performance drop of −7.58% compared to LAVAD in AUC ROC, indicating that the temporal summary is an effective booster for LLM-based anomaly scoring. Finally, if we only use the anomaly scores obtained with the temporal summary on cleaned captions, without the final aggregation of semantically similar frames (Row 3), we can see that the AUC ROC decreases by a significant margin of −7.49% compared to LAVAD, showing that Video-Text Score Refinement also plays an important role in improving the VAD performance.\nTask priors in the context prompt. We investigate the impact of different priors in the context prompt PC and present the results in Tab. 4. In particular, we experimented on two aspects, i.e. impersonation and anomaly prior, which we believe can potentially benefit the LLM\u0026rsquo;s estimation. Impersonation may help the LLM to process the input from the perspective of potential end users of a VAD system, while an anomaly prior, e.g. anomalies are criminal activities, may provide the LLM with a more relevant semantic context. Specifically, we ablate LAVAD with various context prompts PC. We begin with a base context prompt: \u0026ldquo;How would you rate the scene described on a scale from 0 to 1, with 0 representing a standard scene and 1 denoting a scene with suspicious activities?\u0026rdquo; (Row 1). We inject only the anomaly prior by appending \u0026ldquo;or potentially criminal activities\u0026rdquo; to \u0026ldquo;suspicious activities\u0026rdquo; (Row 2). We incorporate only impersonation by adding \u0026ldquo;If you were a law enforcement agency,\u0026rdquo; at the beginning of the base prompt (Row 3). Finally, we integrate both priors into the base context prompt (Row 4). As shown in Tab. 
4, for videos within UCF-Crime, the anomaly prior appears to have a negligible effect on the LLM\u0026rsquo;s assessment of anomalies, while impersonation improves the AUC ROC by +0.96% compared to the one obtained with only the base context prompt. Interestingly, incorporating both priors does not further boost the AUC ROC. We hypothesize that a more stringent context might limit the detection of a wider range of anomalies.\nEffect of K on refining anomaly score. In this experiment, we investigate how the VAD performance changes in relation to the number of semantically similar temporal summaries, i.e. K, used for refining the anomaly score of each frame. As depicted in Fig. 6, the AUC ROC metric consistently increases as K increases, and saturates when K approaches 9. The plot confirms the contribution of accounting for semantically similar frames in obtaining more reliable anomaly scores of the video.\n\n| IMAGE-TEXT CAPTION CLEANING | LLM-BASED ANOMALY SCORING | VIDEO-TEXT SCORE REFINEMENT | AUC (%) |\n| --- | --- | --- | --- |\n| ✗ | ✓ | ✓ | 76.48 |\n| ✓ | ✗ | ✓ | 72.7 |\n| ✓ | ✓ | ✗ | 72.79 |\n| ✓ | ✓ | ✓ | 80.28 |\n\nTable 3. Results of LAVAD variants w/o each proposed component on the UCF-Crime Dataset.\nTable 4. Results of LAVAD on UCF-Crime with different priors in the context prompt when querying the LLM for anomaly scores.\n\n| ANOMALY PRIOR | IMPERSONATION | AUC (%) |\n| --- | --- | --- |\n| ✗ | ✗ | 79.32 |\n| ✓ | ✗ | 79.38 |\n| ✗ | ✓ | 80.28 |\n| ✓ | ✓ | 79.77 |\n\nFigure 6. Results of LAVAD on UCF-Crime over the number of K semantically similar frames used for anomaly score refinement.\n5. Conclusions # In this work, we introduced LAVAD, a pioneering method to address training-free VAD. LAVAD follows a language-driven pathway for estimating the anomaly scores, leveraging off-the-shelf LLMs and VLMs. 
LAVAD has three main components: the first uses image-text similarities to clean the noisy captions provided by a captioning model; the second leverages an LLM to aggregate scene dynamics over time and estimate anomaly scores; and the final component refines the latter by aggregating scores from semantically close frames according to video-text similarity. We evaluated LAVAD on both UCF-Crime and XD-Violence, demonstrating superior performance compared to training-based methods in the unsupervised and one-class settings, without the need for training and additional data collection.\nAcknowledgments. This work is supported by MUR PNRR project FAIR - Future AI Research (PE00000013), funded by NextGeneration EU and by PRECRISIS, funded by EU Internal Security Fund (ISFP-2022-TFI-AG-PROTECT-02101100539). We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources and support.\nReferences # [1] Shuai Bai, Zhiqun He, Yu Lei, Wei Wu, Chengkai Zhu, Ming Sun, and Junjie Yan. Traffic anomaly detection via perspective map based on spatial-temporal information matrix. In CVPRW, 2019.\n[2] Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. In AAAI, 2023.\n[3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.\n[4] Amine Elhafsi, Rohan Sinha, Christopher Agia, Edward Schmerling, Issa AD Nesnas, and Marco Pavone. Semantic anomaly detection with large language models. Autonomous Robots, 2023.\n[5] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist: Multiple instance self-training framework for video anomaly detection. 
In CVPR, 2021.\n[6] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In CVPR, 2023.\n[7] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anomalygpt: Detecting industrial anomalies using large vision-language models. arXiv, 2023.\n[8] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In CVPR, 2016.\n[9] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv, 2023.\n[10] Runyu Jiao, Yi Wan, Fabio Poiesi, and Yiming Wang. Survey on video anomaly detection in dynamic scenes with moving cameras. Artificial Intelligence Review, 2023.\n[11] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In ICIP, 2023.\n[12] Jaehyun Kim, Seongwook Yoon, Taehyeon Choi, and Sanghoon Sull. Unsupervised video anomaly detection based on similarity with predefined text descriptions. Sensors, 2023.\n[13] Guoqiu Li, Guanxiong Cai, Xingyu Zeng, and Rui Zhao. Scale-aware spatio-temporal relation learning for video anomaly detection. In ECCV, 2022.\n[14] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.\n[15] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In AAAI, 2022.\n[16] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation-based anomaly detection. ACM TKDD, 2012.\n[17] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 
Improved baselines with visual instruction tuning. arXiv, 2023.\n[18] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In ICCV, 2021.\n[19] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In ICCV, 2013.\n[20] Hui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Learning normal dynamics in videos with meta prototype network. In CVPR, 2021.\n[21] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. In CVPR, 2020.\n[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.\n[23] Fahad Sohrab, Jenni Raitoharju, Moncef Gabbouj, and Alexandros Iosifidis. Subspace support vector data description. In ICPR, 2018.\n[24] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In CVPR, 2018.\n[25] Shengyang Sun and Xiaojin Gong. Hierarchical semantic contrast for scene-aware video anomaly detection. In CVPR, 2023.\n[26] Kamalakar Vijay Thakare, Debi Prosad Dogra, Heeseung Choi, Haksub Kim, and Ig-Jae Kim. Rareanom: A benchmark video dataset for rare type anomalies. Pattern Recognition, 2023.\n[27] Kamalakar Vijay Thakare, Yash Raghuwanshi, Debi Prosad Dogra, Heeseung Choi, and Ig-Jae Kim. Dyannet: A scene dynamicity guided self-trained video anomaly detection network. In WACV, 2023.\n[28] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In ICCV, 2021. 
\n[29] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv, 2023.\n[30] Anil Osman Tur, Nicola Dall\u0026rsquo;Asen, Cigdem Beyan, and Elisa Ricci. Exploring diffusion models for unsupervised video anomaly detection. In ICIP, 2023.\n[31] Anil Osman Tur, Nicola Dall\u0026rsquo;Asen, Cigdem Beyan, and Elisa Ricci. Unsupervised video anomaly detection with diffusion models conditioned on compact motion representations. In ICIAP, 2023.\n[32] Gaoang Wang, Xinyu Yuan, Aotian Zheng, Hung-Min Hsu, and Jenq-Neng Hwang. Anomaly candidate identification and starting time estimation of vehicles from traffic videos. In CVPRW, 2019.\n[33] Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In ICCV, 2019.\n[34] Jhih-Ciang Wu, He-Yen Hsieh, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. Self-supervised sparse representation for video anomaly detection. In ECCV, 2022.\n[35] Peng Wu and Jing Liu. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE TIP, 2021.\n[36] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In ECCV, 2020.\n[37] Cheng Yan, Shiyu Zhang, Yang Liu, Guansong Pang, and Wenjun Wang. Feature prediction diffusion model for video anomaly detection. In ICCV, 2023.\n[38] Muhammad Zaigham Zaheer, Jin-ha Lee, Marcella Astrid, and Seung-Ik Lee. Old is gold: Redefining the adversarially learned one-class classifier training paradigm. In CVPR, 2020.\n[39] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. 
Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In ECCV, 2020.\n[40] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection. In CVPR, 2022.\n[41] Jiangong Zhang, Laiyun Qing, and Jun Miao. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In ICIP, 2019.\n[42] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In CVPR, 2019.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/zanella_harnessing_large_language_models_for_training-free_video_anomaly_detection_cvpr_2024_paper/","section":"Papers","summary":"Introduces a training-free method for video anomaly detection (VAD) leveraging pre-trained large language models (LLMs) and vision-language models (VLMs). Proposes techniques for caption cleaning, scene description, and anomaly scoring without additional training, demonstrating superior performance on surveillance datasets.","title":"Harnessing Large Language Models for Training-free Video Anomaly Detection","type":"other"},{"content":" Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM # Huaxin Zhang 1,3, Xiaohao Xu 2, Xiang Wang 1, Jialong Zuo 1, Chuchu Han 1,3, Xiaonan Huang 2, Changxin Gao 1, Yuehuan Wang 1, Nong Sang 1\n1 Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology\n2 University of Michigan, Ann Arbor\n3 Baidu Inc. 
Corresponding Author\nAbstract # Towards open-ended Video Anomaly Detection (VAD), existing methods often exhibit biased detection when faced with challenging or unseen events and lack interpretability. To address these drawbacks, we propose Holmes-VAD, a novel framework that leverages precise temporal supervision and rich multimodal instructions to enable accurate anomaly localization and comprehensive explanations. Firstly, towards an unbiased and explainable VAD system, we construct the first large-scale multimodal VAD instruction-tuning benchmark, i.e., VAD-Instruct50k. This dataset is created using a carefully designed semi-automatic labeling paradigm. Efficient single-frame annotations are applied to the collected untrimmed videos, which are then synthesized into high-quality analyses of both abnormal and normal video clips using a robust off-the-shelf video captioner and a large language model (LLM). Building upon the VAD-Instruct50k dataset, we develop a customized solution for interpretable video anomaly detection. We train a lightweight temporal sampler to select frames with high anomaly response and fine-tune a multimodal large language model (LLM) to generate explanatory content. Extensive experimental results validate the generality and interpretability of the proposed Holmes-VAD, establishing it as a novel interpretable technique for real-world video anomaly analysis. To support the community, our benchmark and model will be publicly available at https://holmesvad.github.io/.\n1 Introduction # Video Anomaly Detection (VAD) [14] aims to identify abnormal events in videos, and has been extensively researched in recent years due to its considerable application value in public safety [43] and video content understanding [55]. Current VAD approaches can be broadly classified into three categories according to the annotation type of the training data, i.e., unsupervised, weakly-supervised and fully-supervised. 
Unsupervised methods [14, 35, 30, 12, 49, 60] train solely on normal videos (one-class) or on unlabeled normal/abnormal videos, while weakly-supervised methods [43, 71, 45, 23, 54, 72, 37] train on normal/abnormal videos with video-level labels. Fully-supervised methods [29, 19] are less studied due to the high cost of precise frame-by-frame annotations. Recently, inspired by the strong representations of multi-modal large language models (MLLMs) pretrained on massive data [46, 7, 16, 59, 26, 74, 28, 69, 62, 9, 4, 51] and their impressive advancements in many downstream visual tasks [13, 52, 53], many efforts [41, 17, 57, 61, 56, 65] have started to integrate multi-modal knowledge into VAD systems, enabling more precise anomaly detection.\nDespite significant progress, existing VAD models still face the following primary challenges:\nPreprint. Under review.\nFigure 1: Towards unbiased and explainable VAD. In contrast to prevailing VAD approaches (a) that primarily concentrate on identifying anomalies, our method (b) facilitates not only unbiased predictions of anomaly scores (i.e., fewer false alarms on easily confused or unseen normality) but also explanations of detected anomalies, by constructing a large-scale VAD dataset with single-frame annotations for untrimmed videos and explainable instruction data for trimmed videos.\nBiased anomaly space: Due to the lack of reliable frame-level abnormal supervision, unsupervised methods fail to reconstruct or predict unseen normal data, while weakly-supervised methods struggle to select trustworthy snippets for training under video-level supervision. Consequently, the anomaly space learned by these methods is biased toward unseen or easily confused normality, so the question of \u0026ldquo;when does the anomaly happen\u0026rdquo; remains challenging. 
Although there are some fully-supervised studies [29, 19], the number of annotated videos is very small due to the inefficiency of the annotation process, resulting in a lack of scalability.\nLack of explainability: Existing video anomaly detection approaches do not offer transparent explanations and reasoning for detected anomalies, i.e., \u0026ldquo;what is the anomaly\u0026rdquo; and \u0026ldquo;why is it considered anomalous\u0026rdquo;. This opacity restricts human comprehension and engagement with the system.\nDrawing from the above analysis, our insight is that a strong AI-powered anomaly detection system requires not only identifying deviations but also providing insightful explanations, mirroring the deductive reasoning of the detective Sherlock Holmes. To this end, we present Holmes-VAD, an unbiased and explainable VAD framework based on MLLMs (see Fig. 1).\nMore specifically, to tackle the first issue, we propose a more label-friendly single-frame supervision (one click for each abnormal event) [38, 20, 67, 8, 21] for video anomaly detection instead of the prohibitive frame-by-frame annotation. Following this labeling paradigm, we manually make single-frame annotations for the two largest existing VAD datasets, i.e., UCF-Crime [43] and XD-Violence [55]. To address the second problem of lacking explainability, we construct a large amount of anomaly-aware instruction conversation data for fine-tuning the multi-modal LLM. We leverage the single-frame annotated videos and existing off-the-shelf large foundation models to build an efficient semi-automated data engine. This data engine can be divided into three main steps: 1) Data Collection: gathering video data, primarily from open-source datasets. 2) Annotation Enhancement: generating reliable video event clips around the single-frame annotated frames and giving textual descriptions to them through human effort or foundation models. 
3) Instruction Construction: utilizing a powerful LLM with open-world knowledge to generate explainable analysis in the context of the enhanced video annotations. Subsequently, the obtained analysis is filtered manually and structured into a conversational format. After the above steps, a new benchmark containing single-frame temporal annotations and explanatory text descriptions is constructed, and we name the final dataset VAD-Instruct50k.\nBuilt upon the proposed VAD-Instruct50k, we develop a customized solution for interpretable video anomaly detection, which has three key components, i.e., Video Encoder, Temporal Sampler and Multi-modal LLM. The Video Encoder and Multi-modal LLM are used to encode the input video and generate a text response to the input text prompt, respectively. Additionally, the Temporal Sampler is used to predict the anomaly scores of video frames and sample high-response parts as the input for the Multi-modal LLM; it is lightweight and enables efficient inference. Notably, these three components can be replaced by any other Video-MLLMs or VAD networks. Our primary focus is on how to construct a supervised multi-modal dataset to train these components. Extensive experiments demonstrate that our Holmes-VAD achieves outstanding performance in video anomaly detection and can provide detailed explanations for the detected abnormal events.\nTo summarize, our major contributions are as follows:\nWe propose Holmes-VAD, a video anomaly detection system that is capable of identifying anomalies and providing insightful explanations even across hour-long videos.\nTo bridge the dataset gap toward an unbiased and explainable VAD system, we introduce VAD-Instruct50k, a large-scale multimodal video anomaly detection dataset, including single-frame annotations for untrimmed videos and a large amount of instruction conversation data for trimmed abnormal/normal video clips.\n
Extensive quantitative and qualitative experiments demonstrate that the proposed Holmes-VAD achieves superior performance and interpretability over recent state-of-the-art methods.\n2 Related Works # Video Anomaly Detection. This task aims to temporally detect abnormal frames in a long untrimmed video [2, 39, 24, 33, 49, 14]. Early VAD attempts are based on hand-crafted features [2, 18, 70, 39, 33, 24]. Recently, deep learning approaches [14, 60, 37] have become dominant in VAD and are broadly classified into unsupervised, weakly-supervised, and fully-supervised methods. Unsupervised methods train only on normal videos to learn normal patterns and are often designed as reconstruction-based [14, 58, 12, 60], prediction-based [30], or a combination of both [31]. Some methods [64, 44, 47] also explore a fully unsupervised setting that includes both unlabeled normal and abnormal videos in the training set. Weakly-supervised methods [43, 71, 10, 55, 45, 23, 54, 72, 37, 68] use both normal and abnormal videos with video-level annotations. Fully-supervised methods [29, 19] are less common due to the high cost of precise frame-level annotations.\nMulti-modal Large Language Model. The universal and powerful conversational capabilities of ChatGPT [1] have inspired the entire AI community. This has prompted the emergence of open-source Large Language Models (LLMs), such as LLaMA [46], Vicuna [7], and Mistral [16]. Based on autoregressive models [48], they are pretrained and instruction-tuned on large amounts of text tokens and thus possess universal and powerful text generation capabilities. Recently, Multi-modal LLMs [59, 26, 74, 28, 27, 69, 62, 9, 4, 51] have empowered LLMs with visual understanding capabilities. Additionally, MLLMs for videos (e.g., VideoChat [22], Video-LLaMA [66], and Video-LLaVA [25]) pave the way for multi-modal temporal understanding.\nMulti-modal Video Anomaly Detection. 
Large-scale visual-language pretrained models such as CLIP [42] serve as a bridge between visual and textual modalities. Some recent works [41, 17, 57, 61] in the realm of video anomaly detection have leveraged textual information as prompts to enhance the model\u0026rsquo;s anomaly representation. Based on this, [56] first proposed the open-vocabulary VAD task. Furthermore, [65] extracted captions from video frames using a caption model and designed prompts for LLMs to provide anomaly scores. However, these approaches primarily focus on generating anomaly scores and lack fine-tuning on large-scale domain-specific instruction datasets, so their performance is highly dependent on the base LLMs.\n3 VAD-Instruct50k Benchmark # In this section, we illustrate the generation process of the VAD-Instruct50k dataset. First, the data collection process is presented. Subsequently, we elaborate on how to enhance the annotations of the collected videos. Finally, the generation process of the instruction conversation data is introduced. The overall pipeline of the data engine is shown in Fig. 2.\n3.1 Data Collection # We first collect videos from the training sets of the two largest weakly-supervised VAD datasets, UCF-Crime [43] and XD-Violence [55], because their video quantity far exceeds that of other existing datasets [50, 32, 34], and their video-level annotations provide a solid foundation for further data processing. After filtering out some low-quality videos via human inspection, we collected a total of 5547 untrimmed videos, including 810/800 abnormal/normal videos from UCF-Crime and 1905/2032 abnormal/normal videos from XD-Violence.\nFigure 2: Data engine for the proposed VAD-Instruct50k. We collect numerous abnormal/normal videos from existing datasets, followed by a series of annotation enhancements including temporal single-frame annotation, event clip generation and event clip captioning. Then we construct the instruction data by prompting a powerful LLM with the enhanced annotations. Throughout the pipeline, manual work and large foundation models are coordinated to ensure efficiency and quality in construction.\n3.2 Annotation Enhancement # The videos collected from UCF-Crime [43] and XD-Violence [55] only offer video-level anomaly labels, which denote whether a video includes anomalies. Going beyond these coarse annotations, we refine them to enable more discriminative anomaly detection model training.\nTemporal single-frame annotation. We adopt an efficient temporal annotation method involving sparse single-frame annotation for the collected abnormal videos, inspired by prior works [40, 38, 67, 20, 21, 68] that use this approach to balance model performance and annotation cost. Specifically, we annotate only one frame for each abnormal event in the video (see footnote 1). Through this process, we collect an average of 2.35 single-frame annotations per video.\nEvent clip generation. Based on the single-frame annotations, we design a reliable pseudo frame-level label generation method and leverage it to train a VAD network ϕ_s (see footnote 2). For each abnormal video with single-frame annotations G = {g_i}, i = 1, \u0026hellip;, N_g, and its anomaly scores estimated by the trained VAD network, we generate multiple anomaly event proposals around the annotated frames. Formally, each proposal is represented by a starting and an ending timestamp, i.e., s and e. For each normal video, we randomly extract several normal event proposals. After this process, we collect all trimmed event clips with anomaly labels: E = {(s_i, e_i, y_i)}, i = 1, \u0026hellip;, N_e, where y_i is set to the anomaly class of the video (e.g., Explosion) if the event clip is from an abnormal video; otherwise, it is set to Normal.\nEvent clip captioning. 
To fully extract semantic information from the event clips, we utilize a video-based multimodal large language model (MLLM) [25] to generate detailed captions for each event clip. We also include the SurveillanceVision dataset [63], which provides manually annotated, fine-grained event descriptions for video clips from UCF-Crime [43]. After combining these resources, we obtain all event clips with corresponding captions c and anomaly labels: E = {(s_i, e_i, y_i, c_i)}, i = 1, \u0026hellip;, N_e.\n1 More details about the annotation process are illustrated in Sec. A.2 of the Appendix.\n2 See Sec. A.3 of the Appendix for more details about the network.\nFigure 3: Overview of Holmes-VAD. Holmes-VAD takes an untrimmed video and a user prompt as inputs and produces anomaly scores and explanations for the detected anomalies as outputs. The Temporal Sampler takes the class tokens of frames as input and estimates the anomaly scores, and the dense visual tokens are resampled according to their anomaly scores before entering the projector.\n3.3 Instruction-tuning Data Construction # The process of annotation enhancement effectively fills the gap of insufficient information in the original video-level annotations. However, there is still a lack of anomaly-aware explanations for these event clips, i.e., what the anomaly is and why. To address this issue, we utilize a powerful LLM with sufficient open-world knowledge for further instruction dataset construction. Technically, for each event clip in E, we design a task prompt P_t combined with the referenceable anomaly context, i.e., the abnormal label y_i and the detailed caption c_i. Then we input the combined prompt into the LLM M to make a judgment on anomalies in the video clip and provide an explanation. The generated response is paired with a corresponding anomaly-aware question P_d, resulting in an instruction item consisting of the question and the response.\nWe use Llama3-Instruct-70B [3] as M here because of its open-source availability and comparable performance to GPT4. 
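As an illustration only, the pairing of a question with an LLM-generated analysis described above can be sketched in Python as follows. The names EventClip, build_task_prompt and build_instruction_item, the prompt wording, and the dictionary layout are assumptions made for this sketch and are not taken from the paper; llm stands for any callable that maps a prompt string to a response string.

```python
from dataclasses import dataclass


@dataclass
class EventClip:
    start: float   # s_i: start timestamp of the clip
    end: float     # e_i: end timestamp of the clip
    label: str     # y_i: anomaly class of the video, or 'Normal'
    caption: str   # c_i: detailed caption of the clip


def build_task_prompt(clip):
    # Hypothetical wording for the task prompt P_t; the exact text is not given.
    return (
        'Label: {}. Caption: {}. '
        'Judge whether the clip contains an anomaly and explain why.'
    ).format(clip.label, clip.caption)


def build_instruction_item(clip, question, llm):
    # llm is any callable str -> str, standing in for the LLM M.
    response = llm(build_task_prompt(clip))
    return {
        'clip': (clip.start, clip.end),
        'question': question,   # P_d
        'answer': response,     # analysis generated by the LLM
    }
```

Items produced this way would still go through the manual filtering mentioned in the Introduction before being used for tuning.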
We design multiple P_d to ensure the diversity of the instruction data; a typical P_d is: \u0026ldquo;\u0026lt;video\u0026gt;\\n Are there any unexpected or unusual events in the video clip?\u0026rdquo;.\n4 Holmes-VAD # Utilizing the proposed VAD-Instruct50k dataset for training, we develop a customized solution for interpretable video anomaly detection, namely Holmes-VAD, which has three key components: Video Encoder, Temporal Sampler and Multi-modal LLM with tunable LoRA [15] modules (see Fig. 3).\n4.1 Model Architecture # Visual Encoder. We utilize the frozen video encoder in LanguageBind [73] following [25]. It inherits the ViT-L/14 structure from CLIP [42], and we refer to it as ϕ_v. Different from the original ViT, it models the temporal relationship between frames through an additional self-attention layer in the temporal dimension. Given a video frame sequence V ∈ R^(N×H×W×C), the encoder outputs, for the i-th video frame, a class token feature f_i^cls and patch embeddings f_i^j (j ∈ {1, 2, \u0026hellip;, N_p}), where N_p represents the number of patches of each frame.\nTemporal Sampler. Due to the excessive computational burden caused by the numerous visual tokens in video, past video-based MLLM approaches [22, 66, 25] have resorted to uniform temporal frame sampling of videos, e.g., 8 frames. This method is clearly unsuitable for long videos in the video anomaly detection task, as it increases the probability of ignoring key information. LAVAD [65] conducts dense anomaly detection via an MLLM in a frame-by-frame mode, which inevitably leads to a large amount of redundant computation. 
To address this issue, we first input the dense video frames into the visual encoder. We then apply the VAD network ϕ_s trained in Sec. 3.2, which receives the class tokens of the video frames f_1^cls, f_2^cls, \u0026hellip;, f_N^cls and outputs the anomaly scores s_1, s_2, \u0026hellip;, s_N.\nThen, we sample the video tokens according to the anomaly scores. Specifically, only the tokens f_k from frames whose anomaly score s_k is above a set threshold θ are fed into the subsequent network: F_s = {f_k ∈ F_d : s_k \u0026gt; θ}, where F_s denotes the sampled sparse visual tokens from the original dense visual tokens F_d. In this way, the model can generate anomaly-aware responses to long untrimmed videos.\nProjector and LLM. To enable the LLM to understand the features output by the visual encoder, a projector ϕ_proj composed of two MLP layers is placed between them, which aligns the feature dimension with the input dimension of the LLM. We utilize Vicuna [7] as our LLM following [25].\nHere, T_{0:i} represents the input text tokens to the LLM and T_{i+1} indicates the predicted next token. ϕ_proj and ϕ_T represent the Projector and the Text Encoder, respectively. [·, ·] denotes concatenation.\n4.2 Training # Training of the Temporal Sampler. In this stage, we only train the Temporal Sampler under single-frame supervision. In essence, we employ a pseudo-labeling supervision strategy. The pseudo-labels are initialized from the single-frame annotations during training and are updated online around the annotated frames (see footnote 3). We use the generated pseudo-labels to supervise the predicted anomaly scores, which can effectively reduce the bias of the Temporal Sampler toward easily confused normality.\nInstruction Tuning. During this stage, we take the trimmed event clips as input and do not apply the Temporal Sampler because each clip has already been labeled as abnormal or normal. 
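To make concrete what is skipped here, the score-based sampling of Sec. 4.1 can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name sample_by_anomaly_score and the list-based token representation are assumptions, and the default threshold of 0.8 is the inference value reported in the implementation details.

```python
def sample_by_anomaly_score(frame_tokens, scores, theta=0.8):
    # frame_tokens[k] holds the visual tokens of frame k (any object);
    # scores[k] is the anomaly score s_k predicted for that frame.
    if len(frame_tokens) != len(scores):
        raise ValueError('expected exactly one score per frame')
    # Keep only frames whose score is strictly above the threshold theta.
    kept = [k for k, s in enumerate(scores) if s > theta]
    return kept, [frame_tokens[k] for k in kept]


indices, sparse_tokens = sample_by_anomaly_score(
    ['f0', 'f1', 'f2', 'f3'], [0.10, 0.85, 0.80, 0.95]
)
# indices is [1, 3]; the frame scoring exactly 0.80 is not kept
```

If no frame exceeds the threshold, the sketch returns empty lists; how such videos are handled downstream is not specified in the text.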
In this stage, we train the projector and use LoRA [15] to fine-tune the Multi-modal LLM. We explore different tuning strategies and compare them in the next section. Given the projected visual features F_v and the textual input embedding F_t, the LLM decodes them into a word sequence A. We follow mainstream works in using the original auto-regressive training objective. The objective aims to maximize the likelihood of generating the ground truth answer sequence given the input features, encouraging the model to produce coherent and accurate responses based on the input features.\n3 More details about the generation of pseudo labels can be found in the Appendix.\n5 Experiments # In this section, we conduct extensive experiments to thoroughly demonstrate the capabilities of our proposed model, i.e., Holmes-VAD.\n5.1 Experiment Setup # Datasets. We conduct the comparative experiments on two standard VAD datasets, namely, UCF-Crime [43] and XD-Violence [55]. (1) UCF-Crime [43] comprises 1900 untrimmed videos totaling 128 hours from outdoor and indoor surveillance cameras. It encompasses 13 classes of real-world anomalies, including Abuse, Explosion, Fighting, and Shooting. 
In the weakly-supervised setting, there are 1610/290 videos for training/testing, with the training set consisting of 810 abnormal videos and 800 normal videos. (2) XD-Violence [55] is the largest VAD benchmark, comprising 4754 videos totaling 217 hours sourced from surveillance, movies, car cameras, and games. It encompasses 6 anomaly classes: Abuse, Car Accidents, Explosions, Fighting, Riots, and Shooting. The training/testing video count stands at 3954/800, adhering to a weakly-supervised framework. The training set comprises 1905 abnormal videos and 2049 normal videos.\nTable 1: Comparison with state-of-the-art Video Anomaly Detection approaches. We include semi-supervised (Semi.) methods, unsupervised (Un.) methods, weakly-supervised (W.) methods and some other methods. \u0026ldquo;∗\u0026rdquo; represents the result reported in [65].\n| Methods | Backbone | Supervision | Explanation | XD-Violence AP/% | UCF-Crime AUC/% |\n| --- | --- | --- | --- | --- | --- |\n| Non-explainable VAD | | | | | |\n| Conv-AE [14] | - | Semi. | ✗ | 27.25 | 50.60 |\n| GODS [49] | I3D | Semi. | ✗ | N/A | 70.46 |\n| GCL [64] | ResNext | Un. | ✗ | N/A | 71.04 |\n| DYANNET [44] | I3D | Un. | ✗ | N/A | 84.50 |\n| MIST [10] | I3D | W. | ✗ | N/A | 82.30 |\n| Wu et al. [55] | I3D | W. | ✗ | 78.64 | 82.44 |\n| RTFM [45] | I3D | W. | ✗ | 77.81 | 84.30 |\n| MSL [23] | I3D | W. | ✗ | 78.28 | 85.30 |\n| S3R [54] | I3D | W. | ✗ | 80.26 | 85.99 |\n| MGFN [6] | I3D | W. | ✗ | 79.19 | 86.98 |\n| UR-DMU [72] | I3D | W. | ✗ | 81.66 | 86.97 |\n| CLIP-TSA [17] | ViT | W. | ✗ | 82.19 | 87.58 |\n| VadCLIP [57] | ViT | W. | ✗ | 84.51 | 88.02 |\n| Yang et al. [61] | ViT | W. | ✗ | 83.68 | 87.79 |\n| Wu et al. [56] | ViT | Open-Vocabulary | ✗ | 66.53 | 86.40 |\n| Explainable Multi-modal VAD | | | | | |\n| ZS CLIP [42] ∗ | ViT | Training-Free | ✓ | 17.83 | 53.16 |\n| ZS IMAGEBIND [11] ∗ | ViT g | Training-Free | ✓ | 25.36 | 55.78 |\n| LLAVA-1.5 [27] ∗ | ViT g | Training-Free | ✓ | 50.26 | 72.84 |\n| LAVAD [65] | ViT g | Training-Free | ✓ | 62.01 | 80.28 |\n| Holmes-VAD (Ours) | ViT | Instruction-Tuned | ✓ | 90.67 | 89.51 |\nMetrics. To evaluate the anomaly detection performance of the temporal sampler, we use the Area Under the Curve (AUC) of the frame-level ROC as the main evaluation metric for UCF-Crime following [45, 54, 23, 72, 68]. Meanwhile, the AUC of the frame-level precision-recall curve (AP) is utilized for XD-Violence. To evaluate the quality of the explanation responses, we randomly extract 86 abnormal/normal video segments from the test videos of UCF-Crime and XD-Violence, and then invite 10 users to vote on the responses of different models on three aspects: Judgement Accuracy (JA), Content Perception (CP) and Anomaly Explanatory (AE). Please see the Appendix for more details about the metrics.\nImplementation details. 
In our study, we take the ViT in the LanguageBind model [73] as the Video Encoder and initialize the Multi-modal LLM with Video-LLaVA [25]. UR-DMU [72] serves as the base architecture of our Temporal Sampler. To optimize the Temporal Sampler, we randomly sample one frame at 16-frame intervals and adopt the Adam optimizer with a learning rate of 1e-4. Note that when evaluating performance on XD-Violence and UCF-Crime, only videos in the corresponding training sets are used to train our model for fair comparisons. For instruction tuning, we train with a batch size of 128 for 1 epoch, using the AdamW optimizer with cosine learning rate decay and a warm-up period, setting the projector\u0026rsquo;s learning rate to 2e-5. The LoRA [15] parameters are set as: r=64, α=128, and learning rate=2e-4. The abnormal threshold θ is set to 0.8 during inference. Experiments are conducted on 2 NVIDIA A100 GPUs.\n5.2 Main Results # We compare our method with state-of-the-art methods, including semi-supervised methods [14, 49], unsupervised methods [64, 44], weakly-supervised methods [45, 23, 54, 72, 17, 57] and the recent training-free method [65]. Their backbones, supervision types, and performance on the UCF-Crime and XD-Violence datasets are listed in Table 1. Our method achieves an AP of 90.67% on XD-Violence and an AUC of 89.51% on UCF-Crime, significantly outperforming the prior state-of-the-art methods, which demonstrates that our method can generate less biased anomaly scores. It is worth noting that, while achieving precise localization of anomalies, Holmes-VAD is also capable of providing explanations and analysis for the detected anomalies, a feature unavailable in existing non-explainable VAD methods. 
Although LAVAD [65] offers explainability, its training-free large language model lacks an understanding of anomaly knowledge due to insufficient supervised data.\n5.3 Analytic Results # Table 2: Human evaluation of models under different training settings.\n| Training Strategy | Average Words of Model Response | JA(%) | CP(%) | AE(%) |\n| --- | --- | --- | --- | --- |\n| Training-free | 38.29 | 65.1 | 11.6 | 15.9 |\n| Projector | 40.84 | 81.4 | 27.2 | 32.2 |\n| Projector+LoRA (Default) | 46.13 | 86 | 61.2 | 51.9 |\nTable 3: Ablation study of backbone and supervision in Temporal Sampler.\n| Backbone | Single-frame | XD-AP(%) | UCF-AUC(%) |\n| --- | --- | --- | --- |\n| I3D [5] | ✗ | 82.4 | 86.54 |\n| ViT [73] | ✗ | 84.96 | 84.61 |\n| I3D [5] | ✓ | 89.4 | 90.8 |\n| ViT [73] | ✓ | 90.67 | 89.51 |\nTable 4: Effect of random temporal shifting of single-frame annotations.\n| Shifted Timestamp | XD-AP(%) | UCF-AUC(%) |\n| --- | --- | --- |\n| 0 | 90.67 | 89.51 |\n| 10 | 90.55 | 89.45 |\n| 50 | 90.45 | 89.32 |\n| 100 | 90.12 | 88.95 |\nTable 5: Temporal Sampler vs. Uniform Sampler. We averaged the inference time of all test videos.\n| Sampling Strategy | XD-AP(%) | UCF-AUC(%) | Avg. Inference Time (seconds per video) |\n| --- | --- | --- | --- |\n| Uniform Sampler | 67.25 | 78.38 | 32.82 |\n| Temporal Sampler (Default) | 90.67 | 89.52 | 4.24 |\nInfluence of varied training strategies on anomaly explanation. We conduct a user study to evaluate three different training strategies over 86 test samples with 10 volunteers: a) Training-free: no fine-tuning; b) Projector: fine-tuning on VAD-Instruct50k, only training the projector while keeping the Multi-modal LLM fixed; c) Projector+LoRA: fine-tuning on VAD-Instruct50k, training the projector and using LoRA [15] to fine-tune the Multi-modal LLM. As shown in Table 2, Projector+LoRA provides the most detailed responses (46.13 words on average) and reaches the highest Judgement Accuracy (86.0%). Additionally, it achieves the highest voting rates, 61.2% on Content Perception and 51.9% on Anomaly Explanatory, which demonstrates the better interpretability obtained by fine-tuning the Multi-modal LLM on VAD-Instruct50k.\nBackbone and supervision matter in Temporal Sampler. 
In Table 3, we ablate the impact of the video backbone and the supervision for the Temporal Sampler. We use UR-DMU [72] as our baseline method. The results indicate that on the XD-Violence dataset, LanguageBind [73] as a backbone outperforms I3D [5] significantly, whereas the opposite is observed on UCF-Crime. Additionally, single-frame supervision significantly enhances performance regardless of the backbone used, demonstrating the effectiveness of point supervision in improving anomaly localization capabilities.\nInfluence of perturbed single-frame annotations. To assess the robustness of our method to perturbed temporal positions of the single-frame annotations, we introduce various timestamp shifts to the original positions of the annotated frames. As shown in Table 4, there is no significant performance degradation under perturbed annotation positions, indicating that our method has a notable tolerance to degraded supervision.\nTemporal Sampler vs. Uniform Sampler. We replace the Temporal Sampler with a Uniform Sampler while maintaining the frame rate. The video is then divided into non-overlapping clips, which are sequentially fed into the Multi-modal LLM to output results. If the output is \u0026ldquo;Yes\u0026rdquo;, the anomaly scores of all frames in the input segment are set to 1; otherwise, they are set to 0. Finally, we compare the detection performance and inference efficiency in Table 5. The results demonstrate that the Temporal Sampler ensures higher inference efficiency while maintaining accurate detection results.\nFigure 4: Qualitative results. We compare our interpretability results with Video-LLaVA [36] (without instruction tuning). Correct and wrong explanations are in green and red, respectively.\nQualitative comparison. To provide a more intuitive understanding of the capabilities of MLLMs in explaining complex anomalies, we provide qualitative comparisons between Holmes-VAD and Video-LLaVA in Fig. 4. 
The results demonstrate that Holmes-VAD can accurately identify anomalies in videos and provide specific explanations for conflicts in sports competitions, explosions, and accidents captured by car cameras (Abnormal Cases). Even for normal videos, Holmes-VAD exhibits robust analytical abilities, correcting erroneous responses from the Temporal Sampler (Normal Cases). These findings highlight the effectiveness and advantage of Holmes-VAD in perceiving video events and analyzing anomalies.\n6 Conclusion # In this paper, we introduce a video anomaly detection system called Holmes-VAD to address the biases and lack of interpretability in existing anomaly detection methods. By introducing a more efficient labeling paradigm and constructing a large-scale multimodal video anomaly detection dataset, VAD-Instruct50k, we validated the generality and interpretability of Holmes-VAD. Through extensive experiments, we positioned Holmes-VAD as a valuable tool for real-world applications.\nLimitation and Future work. Despite the human effort spent on filtering noisy instruction data while constructing the VAD-Instruct50k dataset, the off-the-shelf video captioning models that we rely on for generating video descriptions may not always capture nuances and context-specific information. This is a trade-off we have made between labeling cost and efficiency. We believe that the quality of data is no less important than its quantity, and we plan to further enhance both within acceptable labor costs in the future. Furthermore, although we control the length of the video input to the Multi-modal LLM through the Temporal Sampler and accurately analyze abnormal content in the trimmed video clips, there is still no effective solution for the Multi-modal LLM to understand long-term video anomalies without compromising its image-level perceptual capabilities. 
We leave these for our future exploration.\nAcknowledgement This work is supported by the National Natural Science Foundation of China under grants U22B2053 and 623B2039.\nReferences # [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.\n[2] Amit Adam, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz. Robust real-time unusual event detection using multiple fixed-location monitors. IEEE transactions on pattern analysis and machine intelligence, 30(3):555–560, 2008.\n[3] AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.\n[4] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023.\n[5] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 6299–6308, 2017.\n[6] Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 387–395, 2023.\n[7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.\n[8] Ran Cui, Tianwen Qian, Pai Peng, Elena Daskalaki, Jingjing Chen, Xiaowei Guo, Huyang Sun, and Yu-Gang Jiang. Video moment retrieval from text queries via single frame annotation. 
In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages 1033–1043, 2022.\n[9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.\n[10] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14009–14018, 2021.\n[11] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.\n[12] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1705–1714, 2019.\n[13] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anomalygpt: Detecting industrial anomalies using large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 1932–1940, 2024.\n[14] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016.\n[15] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. 
arXiv preprint arXiv:2106.09685, 2021.\n[16] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.\n[17] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In 2023 IEEE International Conference on Image Processing (ICIP), pages 3230–3234. IEEE, 2023.\n[18] Jaechul Kim and Kristen Grauman. Observe locally, infer globally: a space-time mrf for detecting abnormal activities with incremental updates. In 2009 IEEE conference on computer vision and pattern recognition, pages 2921–2928. IEEE, 2009.\n[19] Federico Landi, Cees GM Snoek, and Rita Cucchiara. Anomaly locality in video surveillance. arXiv preprint arXiv:1901.10364, 2019.\n[20] Pilhyeon Lee and Hyeran Byun. Learning action completeness from points for weakly-supervised temporal action localization. In ICCV, pages 13648–13657, 2021.\n[21] Hanjun Li, Xiujun Shu, Sunan He, Ruizhi Qiao, Wei Wen, Taian Guo, Bei Gan, and Xing Sun. D3g: Exploring gaussian prior for temporal sentence grounding with glance annotation. arXiv preprint arXiv:2308.04197, 2023.\n[22] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.\n[23] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1395–1403, 2022.\n[24] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. 
IEEE transactions on pattern analysis and machine intelligence, 36(1):18–32, 2013.\n[25] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.\n[26] Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, et al. Mm-vid: Advancing video understanding with gpt-4v(ision). arXiv preprint arXiv:2310.19773, 2023.\n[27] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.\n[28] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.\n[29] Kun Liu and Huadong Ma. Exploring background-bias for anomaly detection in surveillance videos. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1490–1499, 2019.\n[30] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018.\n[31] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13588–13597, 2021.\n[32] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013.\n[33] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013.\n[34] Weixin Luo, Wen Liu, and Shenghua Gao. 
A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE international conference on computer vision, pages 341–349, 2017.\n[35] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE international conference on computer vision, pages 341–349, 2017.\n[36] Hui Lv and Qianru Sun. Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702, 2024.\n[37] Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, and Hanwang Zhang. Unbiased multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8022–8031, 2023.\n[38] Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, and Zheng Shou. Sf-net: Single-frame supervision for temporal action localization. In ECCV, pages 420–437. Springer, 2020.\n[39] Ramin Mehran, Alexis Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. In 2009 IEEE conference on computer vision and pattern recognition, pages 935–942. IEEE, 2009.\n[40] Pascal Mettes, Jan C Van Gemert, and Cees GM Snoek. Spot on: Action localization from pointly-supervised proposals. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 437–453. Springer, 2016.\n[41] Yujiang Pu, Xiaoyu Wu, and Shengjin Wang. Learning prompt-enhanced context features for weakly-supervised video anomaly detection. arXiv preprint arXiv:2306.14451, 2023.\n[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. 
PMLR, 2021.\n[43] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.\n[44] Kamalakar Vijay Thakare, Yash Raghuwanshi, Debi Prosad Dogra, Heeseung Choi, and Ig-Jae Kim. Dyannet: A scene dynamicity guided self-trained video anomaly detection network. In Proceedings of the IEEE/CVF Winter conference on applications of computer vision, pages 5541–5550, 2023.\n[45] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4975–4986, 2021.\n[46] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.\n[47] Anil Osman Tur, Nicola Dall\u0026rsquo;Asen, Cigdem Beyan, and Elisa Ricci. Exploring diffusion models for unsupervised video anomaly detection. In 2023 IEEE International Conference on Image Processing (ICIP), pages 2540–2544. IEEE, 2023.\n[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems , 30, 2017.\n[49] Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8201–8211, 2019.\n[50] Shu Wang and Zhenjiang Miao. Anomaly detection in crowd scene. In IEEE 10th International Conference on Signal Processing Proceedings, pages 1220–1223. 
IEEE, 2010.\n[51] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.\n[52] Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli Zhao, and Nong Sang. Clip-guided prototype modulating for few-shot action recognition. International Journal of Computer Vision, pages 1–14, 2023.\n[53] Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, and Nong Sang. Few-shot action recognition with captioning foundation models. arXiv preprint arXiv:2310.10125, 2023.\n[54] Jhih-Ciang Wu, He-Yen Hsieh, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. Self-supervised sparse representation for video anomaly detection. In European Conference on Computer Vision, pages 729–745. Springer, 2022.\n[55] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 322–339. Springer, 2020.\n[56] Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open-vocabulary video anomaly detection. arXiv preprint arXiv:2311.07042, 2023.\n[57] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. arXiv preprint arXiv:2308.11681, 2023.\n[58] Dan Xu, Yan Yan, Elisa Ricci, and Nicu Sebe. Detecting anomalous events in videos by learning deep representations of appearance and motion. Computer Vision and Image Understanding, 156:117–127, 2017.\n[59] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 
Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.\n[60] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. Video event restoration based on keyframes for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14592–14601, 2023.\n[61] Zhiwei Yang, Jing Liu, and Peng Wu. Text prompt with normality guidance for weakly supervised video anomaly detection. arXiv preprint arXiv:2404.08531, 2024.\n[62] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.\n[63] Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin, and Zhenzhen Jiao. Towards surveillance video-and-language understanding: New dataset, baselines, and challenges, 2023.\n[64] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14744–14754, 2022.\n[65] Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly detection. arXiv preprint arXiv:2404.01014, 2024.\n[66] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.\n[67] Huaxin Zhang, Xiang Wang, Xiaohao Xu, Zhiwu Qing, Changxin Gao, and Nong Sang. Hr-pro: Point-supervised temporal action localization via hierarchical reliability propagation. arXiv preprint arXiv:2308.12608, 2023.\n[68] Huaxin Zhang, Xiang Wang, Xiaohao Xu, Xiaonan Huang, Chuchu Han, Yuehuan Wang, Changxin Gao, Shanjun Zhang, and Nong Sang. 
Glancevad: Exploring glance supervision for label-efficient video anomaly detection. arXiv preprint arXiv:2403.06154, 2024.\n[69] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.\n[70] Bin Zhao, Li Fei-Fei, and Eric P Xing. Online detection of unusual events in videos via dynamic sparse coding. In CVPR 2011, pages 3313–3320. IEEE, 2011.\n[71] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1237–1246, 2019.\n[72] Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. arXiv preprint arXiv:2302.05160, 2023.\n[73] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023.\n[74] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.\nA Appendix # A.1 Broader Impact # The paper proposes a video anomaly detection framework, namely Holmes-VAD, that is capable of temporally identifying anomalies accurately and providing insightful explanations across even hour-long videos. 
Additionally, this paper provides VAD-Instruct50k, a large-scale multimodal video anomaly detection dataset, including single-frame annotations for untrimmed videos and a large amount of instruction conversation data for trimmed abnormal/normal video clips.\nThe positive societal impacts of the work include:\n\nImproved public safety: The development of more accurate and interpretable video anomaly detection systems can enhance public safety by enabling quicker and more precise identification of anomalies in surveillance videos, such as criminal activities or accidents.\nAdvancement in supervised and open-world VAD research: The proposed VAD-Instruct50k dataset provides a) accurate temporal timestamps of the abnormal events in videos, and b) video-explanation pairs for both abnormal and normal video clips, which can pave the way for further supervised and open-world research in the video anomaly detection area.\nThe negative societal impacts may include:\n\nPrivacy concerns: The use of video surveillance technology, especially in public spaces, raises concerns about privacy and the potential for intrusive monitoring of individuals without their consent.\nDisregard for minor anomalies: Despite efforts to reduce bias in anomaly detection, there is still a risk of overlooking subtle anomalies such as stealing in a supermarket, which may leave some anomalies undetected.\nConsequently, researchers should adhere to relevant laws and regulations, and strive to avoid using our model or dataset for any improper invasion of privacy. Meanwhile, all our models and data will be used only for research purposes to avoid the potential negative societal impacts.\nA.2 Process of single-frame annotation. # Figure 5: Screenshot of the single-frame annotation interface.\nAnnotation tool. We develop an interface designed specifically for single-frame annotation in videos, as shown in Fig. 5. 
This interface makes it easier to navigate through video lists, adjust video progress rapidly, and automatically record timestamps for annotating individual frames. Furthermore, it enables the preview of annotated frames. By clicking on the annotated frame ID, the video progress automatically synchronizes with the corresponding temporal position. These features greatly streamline the annotation process, enhancing convenience and efficiency. If annotators come across any errors or need to make adjustments, they can delete incorrect annotations and proceed with re-annotation.\nFigure 6: Examples of single-frame annotation.\nQuality control. We initially divide the entire dataset into several portions and distribute them among different annotators for labeling. Once the first round of annotations is completed, we proceed with a secondary review of the video annotations to eliminate incorrect or redundant annotations. In addition, we add any clicks that were previously overlooked to minimize the possibility of missing potential anomalies. This process ensures the reliability and comprehensiveness of the single-frame annotations.\nExamples of single-frame annotation. To facilitate a better understanding of the annotation process, we offer several examples of annotated videos in Fig. 6.\nA.3 Model architecture and training details of the Temporal Sampler. # Model architecture. We use UR-DMU [72] as the VAD network in our Temporal Sampler. As shown in Fig. 7, UR-DMU utilizes a Global and Local Multi-Head Self Attention (GLMHSA) module to capture both long-range and short-range temporal relationships among video snippets. Furthermore, UR-DMU introduces two memory banks to store and differentiate abnormal and normal prototypes, thereby maximizing the margins between these two representations. In order to learn discriminative representations, UR-DMU employs triplet loss to increase the feature distance after interacting with different memories. 
Simultaneously, it utilizes KL loss to constrain the normal memory to follow a Gaussian distribution, accounting for the variance introduced by noise. Thus, the base loss function for the UR-DMU baseline is defined as follows:\nFigure 7: Architecture of the Temporal Sampler.\nTraining details. During the training stage of the Temporal Sampler, we leverage the sparse single-frame annotations to generate reliable dense snippet-level pseudo labels. As illustrated in Alg. 1, we employ a dynamic threshold and perform local bidirectional mining based on the single-frame annotations. Snippets with anomaly scores exceeding a specific proportion of the annotated snippet\u0026rsquo;s score are identified as pseudo anomaly snippets. We set α = 0.9 in our implementation. After mining the pseudo anomaly snippets, we adopt a Gaussian function to smooth the binary pseudo label:\nAlgorithm 1 Pseudo Label Mining. # Input: Anomaly score S ∈ R^T, single-frame annotations G = {g_i}^{N_g}, anomaly ratio α.\nOutput: Pseudo anomaly snippets T^a = {t_i}_i^{N_a}.\n1: Let T^a ← ∅.\n2: for every g_i ∈ G do\n3: for t = g_i to g_{i−1} do\n4: if S[t] \u0026gt; α · S[g_i], then T^a ← t ∪ T^a, else break\n5: end if\n6: end for\n7: for t = g_i to g_{i+1} do\n8: if S[t] \u0026gt; α · S[g_i], then T^a ← t ∪ T^a, else break\n9: end if\n10: end for\n11: end for\n12: Return T^a\nwhere r = 0.1 indicates the smoothing ratio. We use the generated dense and smooth pseudo label to supervise the predicted anomaly score:\nwhere S and Ŝ denote the predicted anomaly score and the generated pseudo frame-level label, respectively.\nA.4 Details of human evaluation. 
# Figure 8: Screenshot of the human evaluation interface.\nTo evaluate the quality of the explanation responses, we randomly extract 86 abnormal/normal video segments from the test videos of UCF-Crime and XD-Violence, and then invite 10 users to vote on the responses of different models on three aspects: Judgement Accuracy (JA), Content Perception (CP), and Anomaly Explanatory (AE).\nJudgement Accuracy (JA): Whether the model\u0026rsquo;s judgment on anomalies is correct. We extract predictions by matching \u0026ldquo;Yes\u0026rdquo;/\u0026ldquo;No\u0026rdquo; in the answers, compare them with the ground truth labels (abnormal/normal), and calculate the accuracy of the judgments.\nContent Perception (CP): The accuracy and clarity of the model\u0026rsquo;s descriptions of the content, characters, and events in the video scenes, as well as any potential hallucination issues (descriptions of non-existent objects in the video or responses unrelated to the questions).\nAnomaly Explanatory (AE): The model\u0026rsquo;s ability to analyze and interpret abnormal/normal events in the video.\nWe provide a screenshot of the human evaluation interface in Fig. 8. To ensure a fair selection, the names of the models are not visible to the users, and choices can only be made from anonymous options. In Fig. 
9, we provide several test examples, with the results of the Judgement Accuracy (JA), Content Perception (CP) and Anomaly Explanatory (AE).\nA.5 Data statistical analysis of VAD-Instruct50k # In Table 6, we conduct a statistical analysis of our proposed VAD-Instruct50k and compare it with representative datasets in the VAD field, which shows the significant volume and excellent diversity of our constructed instruction dataset.\nTable 6: Dataset statistics.\nDataset | #Videos | Annotation Type | #Queries | Avg word\nCUHK Avenue | 37 | None | N/A | N/A\nShanghaiTech | 437 | None/video label | N/A | N/A\nUCF-Crime [43] | 1,610 | video label | N/A | N/A\nXD-Violence [55] | 4,754 | video label | N/A | N/A\nUCA [63] | 1,854 | segment caption | 23,542 | 20.15\nVAD-Instruct50k (Ours) | 5,547 | single-frame \u0026amp; segment instruction | 51,567 | 44.83\nFigure 9: Qualitative comparison in human evaluation. We show the results of Judgement Accuracy (JA), Content Perception (CP) and Anomaly Explanatory (AE) above the answer box of each model.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/holmes-vad-towards-unbiased-and-explainable-video-anomaly-detection-via-multi-modal-llm/","section":"Papers","summary":"A novel framework leveraging multimodal instructions and large-scale datasets to enable unbiased, interpretable, and accurate video anomaly detection with large language models, including a new dataset VAD-Instruct50k with single-frame annotations and explanatory instruction data.","title":"Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM","type":"method"},{"content":" Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity # Huaxin Zhang 1 Xiaohao Xu 2 Xiang Wang 1 Jialong Zuo 1 Xiaonan Huang 2 Changxin Gao 1 3 Li Yu 1 1*\nShanjun Zhang Nong Sang\n1 Key Laboratory of Image Processing and Intelligent Control,\nSchool of Artificial Intelligence and Automation, Huazhong University of Science and Technology 2 University of Michigan, 
Ann Arbor 3 Kanagawa University\n{zhanghuaxin,wxiang,cgao,hustlyu,nsang}@hust.edu.cn, {xiaohaox,xiaonanh}@umich.edu, {chiyoz01}@kanagawa-u.ac.jp\n* Corresponding author\nFigure 1. Motivation. Left: Existing datasets lack the hierarchical structure to capture transient and sustained anomalies across varying temporal scales. Our HIVAU-70k dataset addresses this by providing multi-granularity annotations—clip, event, and video levels—that enable detailed anomaly analysis in complex real-world scenarios. Right: Inspired by Sherlock Holmes\u0026rsquo;s knack for zeroing in on critical details, our Holmes-VAU method integrates an Anomaly-focused Temporal Sampler with a multi-modal LLM, directing model attention to anomaly-rich segments, which enables models to decode complex, long-term video anomalies efficiently.\nAbstract # How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). 
ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. Our benchmark and model are publicly available at https://github.com/pipixin321/HolmesVAU.\n1. Introduction # Video Anomaly Understanding (VAU) is crucial for applications such as video surveillance [46], violent content analysis [56], and autonomous driving [63]. Detecting deviations from normal patterns aids in hazard prevention and real-time decision-making. Traditional methods [14, 49, 74] mainly focus on frame-level predefined closed-set anomaly prediction, assigning an anomaly score to each frame. However, these approaches often fail to describe and understand complex anomalies in the real world.\nTo address this gap, open-world anomaly understanding [57] embraces the diversity and unpredictability of real-world anomalies. Recent work integrates multimodal approaches, combining visual data with textual descriptions [42, 58, 62, 65], while advances in multimodal visual-language models (VLMs) [6, 21, 27, 29, 68] have enabled more nuanced understanding through anomaly-related instruction tuning and text generation [9, 36, 47, 67].\nDespite these advancements, a significant gap remains in models\u0026rsquo; ability to comprehend anomalies across multiple temporal scales. For instance, while anomalies such as explosions or fights may be captured in a single frame, more complex events like theft or arson require understanding long-term contextual patterns. 
Existing VAU datasets [46, 56] typically provide annotations at a single level of granularity, limiting models to understanding either immediate perceptual anomalies or those requiring extended contextual reasoning. The lack of datasets with hierarchical annotations—encompassing both short-term and long-term anomalies—hinders models\u0026rsquo; capacity to reason about anomalies with diverse temporal characteristics. Moreover, constructing datasets that encapsulate this hierarchical complexity poses significant challenges in scalability and annotation quality.\nTo address these issues, we develop a semi-automated annotation engine that efficiently scales high-quality annotation by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). The process involves three key stages: 1) hierarchical video decoupling, where we manually identify anomaly events and segment them into shorter clips; 2) hierarchical free-text annotation, where captions for each clip are generated through human effort or video captioning models, then summarized at the event and video levels via LLMs; and 3) hierarchical instruction construction, where the textual data is transformed into question-answer instruction prompts by combining captions and summaries with designed prompts, creating a dataset with rich annotations for training and evaluating models.\nUtilizing the annotation engine, we introduce HIVAU-70k, a large-scale video anomaly understanding benchmark with hierarchical instructions. Our dataset comprises over 70,000 multi-granular instruction data organized across clip-level, event-level, and video-level segments as shown in Fig. 1. This hierarchical structure empowers models to detect immediate anomalies, e.g., sudden explosions or fighting, as well as complex events that require an understanding of long-term context, like theft or arson. 
By annotating at multiple temporal levels, HIVAU-70k provides diverse anomalies in open-world scenarios.\nTowards long-term VAU, efficiency remains a critical challenge. Previous methods [9, 47, 67] often rely on uniform frame sampling, which can either miss crucial anomaly frames or incur large computational costs [25, 47, 67]. To address this, we propose the Holmes-VAU method, which combines the proposed Anomaly-focused Temporal Sampler (ATS) with the multimodal visual-language model for efficient long-term video anomaly understanding (see Fig. 1). The ATS combines an anomaly scorer with a density-aware sampler that adaptively selects frames by their anomaly scores. This integration ensures that the visual-language model concentrates on anomaly-rich regions, enhancing both efficiency and accuracy.\nOur contributions are threefold: 1) We introduce HIVAU-70k, a large-scale, multi-granular benchmark for hierarchical video anomaly understanding. 2) We propose the Holmes-VAU method, which incorporates the proposed Anomaly-focused Temporal Sampler (ATS) to boost the efficiency and accuracy of inference on long-term videos. 3) We conduct extensive experiments demonstrating the effectiveness of hierarchical instruction data in enhancing anomaly comprehension and validate the performance gains provided by the integrated ATS and visual-language model in processing long videos.\n2. Related Works # Video Anomaly Detection. This task aims to temporally detect abnormal frames in a long untrimmed video [1, 14, 23, 34, 40, 52]. Early attempts are based on handcrafted features [1, 19, 23, 34, 40, 72]. Recently, deep learning approaches [14, 37, 61] have become dominant, broadly classified into unsupervised, weakly-supervised, and fully-supervised methods. Unsupervised methods train only on normal videos to learn normal patterns and are often designed as reconstruction-based [13, 14, 59, 61], prediction-based [31], or a combination [33]. 
Some methods [48, 50, 66] also explore a fully unsupervised setting, including both normal and abnormal videos in the training set without real labels. Weakly-supervised methods [11, 22, 37, 46, 49, 55, 56, 70, 73, 74] use both normal and abnormal videos with video-level annotations. Fully-supervised methods [20, 30] are less studied due to the high cost of precise frame-level annotations.\nMulti-modal Video Anomaly Understanding. Large-scale visual-language pre-trained models such as CLIP [44] serve as a bridge between visual and textual modalities. Some recent works [17, 42, 58, 62] in the realm of video anomaly detection have leveraged textual information as prompts to enhance the model\u0026rsquo;s anomaly representation. Based on this, [57] first proposed the open vocabulary VAD task, and [65] introduced a multimodal video anomaly dataset composed of dense clip captions. Furthermore, [67] extracted captions from video frames and designed prompts for LLMs to provide anomaly scores, while [9] and [47] constructed diverse and interactive instruction data at the video level. However, these datasets consider only a single temporal level of anomaly understanding data construction, i.e., clip-level [65] or video-level [9, 47]. Unlike these methods, we focus on building large-scale hierarchical video anomaly understanding data for multimodal instruction tuning.\nHierarchical Video Understanding. Video understanding is a challenging task due to its temporal-scale diversity. To better comprehend videos, many previous works have focused on both datasets and models in hierarchical video understanding. For example, [3, 24, 32, 35, 45, 54] proposed fine-grained action recognition and localization datasets, [16] provided free-form hierarchical captions for hour-long videos at multiple temporal scales, and [8, 18, 38, 43, 60, 69] trained models on hierarchical levels to obtain better video feature representation. 
Recently, to assess the capability of video vision-language models (VLMs) in handling challenges in real-world scenarios, several video benchmarks [10, 12] have been built, incorporating data at multiple temporal scales for evaluation. Unlike these works, we dive into the field of Video Anomaly Understanding, thus filling the gap in multi-scale annotations in this area.\n3. HIVAU-70k Benchmark # We first define the video anomaly understanding task in Sec. 3.1. Then, the construction process of the HIVAU-70k benchmark will be elaborated in Sec. 3.2. Finally, we will present HIVAU-70k\u0026rsquo;s statistical information in Sec. 3.3.\n3.1. Task Description # This work focuses on video anomaly understanding, involving temporal anomaly detection and anomaly explainability. Temporal anomaly detection aims to predict an anomaly score for each frame in a video V, represented as S ∈ R^T, where T is the total number of frames. Building on this, we explore the model\u0026rsquo;s ability to generate explanatory text outputs related to anomalies based on video input and user queries. Specifically, given a video V and a text query Q, the model generates an anomaly-related response A. We consider two key abilities: 1) visual perception, which involves recognizing main entities in the video, and 2) anomaly reasoning, which encompasses the model’s judgment and analysis of the anomaly content.\nFigure 2. Data Engine. We present a structured workflow for generating hierarchical annotations across video, event, and clip levels. Clips are first captioned, then processed through a large language model (LLM) with prompts for event summarization. The outputs include clip captions, event summaries, and video summaries, followed by manual checking and refinement. This multi-step approach enriches the dataset with detailed judgments, descriptions, and analyses of anomalies, enabling robust contextual understanding at varying granularities.\n3.2. 
LLM-Empowered Data Engine # As shown in Fig. 2, we develop a semi-automated annotation engine that efficiently scales high-quality annotations. It consists of three main steps: 1) hierarchical video decoupling, 2) hierarchical free-text annotation, and 3) hierarchical instruction construction.\nHierarchical Video Decoupling. Our video sources include the training set of the UCF-Crime [46] dataset and the XD-Violence [56] dataset, which contain videos of varying durations and diverse real-world anomalies. For abnormal videos, we first manually obtain the temporal boundaries of each anomaly event in the video. Non-continuous anomalous states are considered separate events. Then, we divide each event into clips of random lengths. For normal videos, we apply random sampling to obtain corresponding segments of varying granularities. Ultimately, we obtained 5,443 videos, 11,076 events, and 55,806 clips. This process took 5 annotators approximately 20 hours to complete. More details can be found in the appendix.\nFigure 3. HIVAU-70k dataset. (a) Duration distributions for clips, events, and full videos, showing dominance of short clips. (b) Hierarchical data organization from clip-level to video-level, enabling perception-to-reasoning insights. (c) Word count variations across annotation levels, with more detailed descriptions at the video level. (d) Sample annotations capturing captioning, judgment, description, and anomaly analysis, highlighting nuanced understanding of anomaly events in complex scenes.\nHierarchical Free-text Annotation. To fully extract semantic information from the clip-level videos, we utilize a powerful off-the-shelf video perception model, LLaVA-Next-Video [28], to generate detailed captions for each clip. We also include the UCA dataset [64], which provides manually annotated captions for video clips in the UCF-Crime [46] dataset. Then, we use an LLM [2] to consolidate all clip captions within an event, generating an event-level video summary. 
Specifically, we design prompts (detailed in the appendix) to guide the LLM to produce three parts of content for each anomalous event summary: 1) Judgment: A determination of whether an anomaly exists and its specific category, 2) Description: A detailed description of the anomalous or normal event, 3) Analysis: The reasoning behind the anomaly judgment, including causal analysis. To guide the LLM in generating reliable responses, we also inject the event\u0026rsquo;s category label (e.g., \u0026ldquo;Shooting\u0026rdquo;, \u0026ldquo;Explosion\u0026rdquo;) into the LLM prompt. We consolidate all event-level summaries to obtain the video-level summary. This results in free-text annotations across multiple temporal scales, including short-term visual perception (clip-level) and long-term anomaly reasoning (event-level, video-level).\nHierarchical Instruction Data Construction. The ability of VLMs to follow user instructions and generate responses is achieved through instruction-tuning [21, 25, 27, 29]. The training data format typically consists of the following:\n{Q: user instruction, A: model response}\nTo build instruction-tuning data for VLMs in the domain of Video Anomaly Understanding, we matched free-text annotations with pre-designed anomaly-related user instructions. Specifically, for clip-level segments, we only construct instructions related to captions, as it is challenging to obtain anomaly-related analysis for short videos. For event-level and video-level segments, we construct instruction data from the perspectives of Judgment, Description, and Analysis. Typical examples are shown in Fig. 3(d).\nManual Checking for Data Quality Control. To ensure dataset quality, we implemented several human inspection and curation strategies. First, we labeled the temporal boundaries of abnormal event segments and reviewed them at the second level. 
Next, anomaly labels were incorporated during summary generation using LLMs, directing focus on relevant entities. Finally, manual reviews were performed to correct low-quality instruction data.\n3.3. Data Statistics of HIVAU-70k # Utilizing the proposed annotation engine, we introduce HIVAU-70k (Fig. 3), a large-scale benchmark designed for hierarchical instruction-based video anomaly understanding. As shown in Fig. 3(b), HIVAU-70k contains over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments, achieving a progression from perception to reasoning. As shown in Fig. 3(a) and (c), the durations of segments and the word counts of text annotations at different granularities exhibit significant distributional differences. As shown in Fig. 3(d), HIVAU-70k\u0026rsquo;s instruction data covers Caption, Judgment, Description, and Analysis for real-world anomalies, which guides the model to develop both short-term and long-term video anomaly understanding capabilities.\n4. Method: Holmes-VAU # Long-term video anomaly understanding with LLMs/VLMs has traditionally been hindered by frame redundancy, complicating accurate anomaly detection. Previous VAU approaches struggle with focus: dense window sampling [67] adds redundancy, while uniform frame sampling [9, 47] often misses key anomalies, limiting application to short videos. To address this, we introduce the Anomaly-focused Temporal Sampler (ATS), integrate it into the VLM, and fine-tune the model via instruction tuning on HIVAU-70k to form our Holmes-VAU model.\nFigure 4. Holmes-VAU: a multi-modal-LLM-based video anomaly detection framework with adaptive anomaly focus.\n4.1. Pipeline Overview # The overall pipeline of our Holmes-VAU model is shown in Fig. 4. Video frames are processed by a visual encoder, creating visual tokens. 
These tokens are analyzed by an Anomaly-focused Temporal Sampler using an anomaly scorer and cumulative sum (S_cumsum) to select keyframes. A text encoder processes a prompt (e.g., \u0026lsquo;Describe the abnormal events in the video\u0026rsquo;). Visual and textual representations are combined in a pre-trained language model, fine-tuned with LoRA, to generate a description of detected anomalies, such as a \u0026lsquo;collision between a car and a motorcycle\u0026rsquo; and \u0026lsquo;traffic accident\u0026rsquo; indicators.\n4.2. Model Architecture # Visual and Text Embedding. We utilize the frozen visual encoder in InternVL2 [6], which inherits the ViT structure from CLIP [44]; we refer to it as ϕ_v. Following previous VAU works [55, 58, 67, 74], we sample dense video frames at an interval of 16 frames from the input video. Each video frame is then processed through the visual encoder to obtain the corresponding visual tokens. Given the input video frame sequence V ∈ R^(T×H×W×C), the output features of the i-th frame consist of a class token v_i^cls and patch tokens v_i^j (j ∈ {1, 2, \u0026hellip;, N_p}), where N_p represents the number of patches. The text encoder ϕ_t is also initialized from [6] and includes a tokenizer and an embedding layer. The prompt text Q is converted into text tokens through the text encoder: X_q = ϕ_t(Q).\nAnomaly-focused Temporal Sampler (ATS). ATS consists of two components: the anomaly scorer and the density-aware sampler. The anomaly scorer ϕ_s is a feature-based VAD network which estimates the anomaly score for each frame. We follow the network architecture in [74] due to its simplicity and good performance. 
Given the class tokens of the video frames {v_1^cls, v_2^cls, \u0026hellip;, v_T^cls}, the anomaly scores are obtained as s_i = ϕ_s(v_i^cls), where s_i denotes the anomaly score of the i-th frame.\nAnomalous frames typically contain more information and exhibit greater variation than normal frames [49]. This observation motivates us to sample more frames in regions with higher anomaly scores while reducing the sampling in areas with lower anomaly scores. As shown in Fig. 4, to achieve non-uniform sampling, we propose a density-aware sampler to selectively choose N frames from a total of T input frames. Specifically, we treat the anomaly scores S ∈ R^T as a probability mass function and first accumulate them along the temporal dimension to obtain the cumulative distribution function, denoted as S_cumsum. We then uniformly sample N points along the cumulative axis and map these points onto the cumulative distribution S_cumsum; the corresponding N timestamps on the time axis are mapped to the closest frame indices, which form the sampled frame indices, denoted as G. τ is a hyperparameter used to control the uniformity of the sampling.\nProjector and LLM. We select the tokens corresponding to the sampled frames, i.e., G, as the visual embedding. A projector ϕ_p is then used to map the visual embedding to the language feature space. Finally, we concatenate these embeddings with the text embeddings, input them into the pre-trained large language model, and compute the probability of the target answers X_a. To obtain an initial visual-language alignment space, we initialize the projector and LLM parameters from [6], with the parameters kept frozen. In the formulation of this process, cat[·] represents the concatenation operation, θ denotes the trainable parameters, X_{ins,\u0026lt;i} and X_{a,\u0026lt;i} are the instruction and answer tokens in all turns before the current prediction token x_i, and L is the sequence length.\n4.3. Training and Testing # Training. 
We train the model in two steps. In the first step, we use the video data and annotated frame-level labels (ŷ ∈ R^T) from HIVAU-70k to train the anomaly scorer, which provides more accurate anomaly supervision compared to previous unsupervised and weakly-supervised methods [14, 46, 55, 74].\nIn the second step, we keep the anomaly scorer fixed, and use all the instruction data from HIVAU-70k to train the model. To achieve more efficient fine-tuning without disrupting the original capabilities of the LLM, we employ LoRA [15] for fine-tuning, optimizing the cross-entropy loss between the predicted and the ground-truth tokens.\nTesting. During testing, the user inputs a video and text prompts; the model generates the corresponding text response following the user\u0026rsquo;s instruction.\n5. Experiments #\n5.1. Experiment Setup # Dataset. Our HIVAU-70k is built upon two large-scale real-world datasets, i.e., UCF-Crime [46] and XD-Violence [56], which provide a diverse range of videos with anomalous events. UCF-Crime [46] comprises 1,900 untrimmed videos totaling 128 hours from outdoor and indoor surveillance cameras. It encompasses 13 classes of real-world anomalies, including Abuse, Explosion, Fighting, and Shooting. XD-Violence [56] is the largest VAD benchmark, comprising 4,754 videos totaling 217 hours sourced from surveillance, movies, car cameras, and games. It encompasses 6 anomaly classes: Abuse, Car Accidents, Explosions, Fighting, Riots, and Shooting.\nMetric. We assess the anomaly understanding ability from two aspects: anomaly detection and reasoning. 1) For anomaly detection, we use the anomaly scores output by the Anomaly Scorer as the prediction and perform the evaluation. Following [14, 46, 67, 74], we use AUC and AP to quantify detection performance, which is evaluated only on the video level. 
2) For anomaly reasoning, we annotate instruction data from the UCF-Crime and XD-Violence test sets, which have been carefully reviewed and filtered by annotators. We finally collected 3,300 test samples at multiple granularities. The test set contains 2200/732/398 samples at clip/event/video levels. We calculate metrics including BLEU [41], CIDEr [51], METEOR [4] and ROUGE [26] to measure the quality of the reasoning text output by the model, comparing with the annotated ground-truth text.\nTable 1. Comparison of detection performance with state-of-the-art Video Anomaly Detection approaches. We include the results of explainable and non-explainable methods. \u0026ldquo;∗\u0026rdquo; represents the result reported in [67].\n| Methods | Backbone | XD-Violence AP/% | UCF-Crime AUC/% |\n| --- | --- | --- | --- |\n| Non-explainable VAD | | | |\n| Conv-AE [14] (CVPR’16) | - | 27.25 | 50.60 |\n| GODS [52] (ICCV’19) | I3D | N/A | 70.46 |\n| GCL [66] (CVPR’22) | ResNext | N/A | 71.04 |\n| DYANNET [48] (WACV’23) | I3D | N/A | 84.50 |\n| MIST [11] (CVPR’21) | I3D | N/A | 82.30 |\n| Wu et al. [56] (ECCV’20) | I3D | 78.64 | 82.44 |\n| RTFM [49] (ICCV’21) | I3D | 77.81 | 84.30 |\n| MSL [22] (AAAI’22) | I3D | 78.28 | 85.30 |\n| S3R [55] (ECCV’22) | I3D | 80.26 | 85.99 |\n| MGFN [5] (AAAI’23) | I3D | 79.19 | 86.98 |\n| UR-DMU [74] (AAAI’23) | I3D | 81.66 | 86.97 |\n| CLIP-TSA [17] (ICIP’23) | ViT | 82.19 | 87.58 |\n| VadCLIP [58] (AAAI’24) | ViT | 84.51 | 88.02 |\n| Yang et al. [62] (CVPR’24) | ViT | 83.68 | 87.79 |\n| Wu et al. [57] (CVPR’24) | ViT | 66.53 | 86.40 |\n| Explainable Multi-modal VAD | | | |\n| Zero-Shot CLIP [44] ∗ | ViT | 17.83 | 53.16 |\n| LLAVA-1.5 [27] ∗ | ViT | 50.26 | 72.84 |\n| LAVAD [67] (CVPR’24) | ViT | 62.01 | 80.28 |\n| Holmes-VAU (Ours) | ViT | 87.68 | 88.96 |\nImplementation Details. For the proposed Holmes-VAU method, we initialize the Multimodal LLM with InternVL2-2B [6]. To optimize the Anomaly-focused Temporal Sampler, we adopt the Adam optimizer with a learning rate of 1e-4. 
Note that when evaluating detection performance on XD-Violence and UCF-Crime, only videos in the corresponding training sets are used to train our model for fair comparisons. For instruction tuning, we train with a batch size of 512 for 1 epoch, using the AdamW optimizer with cosine learning rate decay and a warm-up period. The LoRA [15] parameters are set as: r=64, α=128, and learning rate=4e-5. During testing, τ in Eq. 2 is set to 0.1. Experiments are conducted on 2 NVIDIA A100 GPUs.\n5.2. Main Results # Anomaly Detection Results. We compare our method with state-of-the-art methods, including semi-supervised methods [14, 52], unsupervised methods [48, 66], weakly-supervised methods [17, 22, 49, 55, 58, 74] and the recent training-free method [67]. Their backbones and performance on the UCF-Crime and XD-Violence datasets are listed in Table 1. Our method achieves an AP of 87.68% on XD-Violence and an AUC of 88.96% on UCF-Crime, outperforming the prior state-of-the-art methods (e.g., VadCLIP [58] at 84.51% AP and 88.02% AUC), which demonstrates that our method can generate less biased anomaly scores. It is worth noting that while achieving precise localization of anomalies, Holmes-VAU is also capable of providing explanations and analysis for the detected anomalies, a feature unavailable in existing non-explainable VAD methods. Although LAVAD [67] has explainability, its training-free large language model lacks an understanding of anomaly knowledge because of insufficient supervised data.\nTable 2. Comparison of reasoning performance with state-of-the-art Multimodal Large Language Models (MLLMs). \u0026lsquo;BLEU\u0026rsquo; refers to the cumulative values from BLEU-1 to BLEU-4. 
We evaluate the quality of the generated text at different granularities, including clip-level (C), event-level (E), and video-level (V).\n| Method | Params | BLEU [41] C | BLEU [41] E | BLEU [41] V | CIDEr [51] C | CIDEr [51] E | CIDEr [51] V | METEOR [4] C | METEOR [4] E | METEOR [4] V | ROUGE [26] C | ROUGE [26] E | ROUGE [26] V |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Video-ChatGPT [39] | 7B | 0.152 | 0.068 | 0.066 | 0.033 | 0.011 | 0.013 | 0.102 | 0.069 | 0.044 | 0.153 | 0.048 | 0.079 |\n| Video-LLaMA [68] | 7B | 0.151 | 0.079 | 0.104 | 0.024 | 0.014 | 0.017 | 0.112 | 0.076 | 0.057 | 0.156 | 0.067 | 0.090 |\n| Video-LLaVA [25] | 7B | 0.164 | 0.046 | 0.055 | 0.032 | 0.009 | 0.013 | 0.097 | 0.022 | 0.014 | 0.132 | 0.023 | 0.045 |\n| LLaVA-Next-Video [71] | 7B | 0.435 | 0.091 | 0.120 | 0.102 | 0.015 | 0.031 | 0.117 | 0.085 | 0.096 | 0.198 | 0.080 | 0.106 |\n| QwenVL2 [53] | 7B | 0.312 | 0.082 | 0.155 | 0.044 | 0.020 | 0.044 | 0.133 | 0.092 | 0.112 | 0.163 | 0.081 | 0.137 |\n| InternVL2 [6] | 8B | 0.331 | 0.101 | 0.145 | 0.052 | 0.022 | 0.035 | 0.141 | 0.095 | 0.101 | 0.182 | 0.102 | 0.122 |\n| Holmes-VAU (Ours) | 2B | 0.913 | 0.804 | 0.566 | 0.467 | 1.519 | 1.437 | 0.190 | 0.165 | 0.121 | 0.329 | 0.370 | 0.355 |\nTable 3. Ablation of hierarchical instruction data. During the instruction tuning phase, we combined training data of different granularities, including clip (C), event (E), and video (V) levels.\n| Training Data C | Training Data E | Training Data V | BLEU(↑) C | BLEU(↑) E | BLEU(↑) V | CIDEr(↑) C | CIDEr(↑) E | CIDEr(↑) V |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| ✓ | | | 0.984 | 0.261 | 0.351 | 0.459 | 0.120 | 0.106 |\n| | ✓ | | 0.508 | 0.576 | 0.292 | 0.097 | 1.183 | 0.872 |\n| | | ✓ | 0.280 | 0.222 | 0.279 | 0.039 | 0.708 | 0.884 |\n| ✓ | ✓ | | 0.889 | 0.741 | 0.349 | 0.470 | 1.285 | 0.889 |\n| ✓ | | ✓ | 0.906 | 0.341 | 0.522 | 0.472 | 0.962 | 1.093 |\n| | ✓ | ✓ | 0.394 | 0.797 | 0.505 | 0.081 | 1.472 | 1.074 |\n| ✓ | ✓ | ✓ | 0.913 | 0.804 | 0.566 | 0.467 | 1.519 | 1.437 |\nAnomaly Reasoning Results. 
We compare the quality of the anomaly-related text generated by Holmes-VAU with that produced by state-of-the-art general Multimodal Large Language Models (MLLMs), and present the results at different temporal granularities, including clip-level, event-level, and video-level, as shown in Table 2. Earlier MLLMs, such as Video-ChatGPT [39] and Video-LLaMA [68], struggled with basic visual perception and instruction-following capabilities. Recent MLLMs [6, 7, 71] trained on larger and higher-quality video instruction data have made significant progress in general video understanding, with noticeable improvements at the clip-level perception task. However, due to the absence of learning from complex, real-world anomaly data, their reasoning abilities at the event-level and video-level are still lacking. Our Holmes-VAU, in contrast, shows significant improvements in video understanding across all temporal granularities compared to existing general MLLMs, highlighting the importance of injecting anomaly-related knowledge through instruction tuning on high-quality Video Anomaly Understanding benchmarks.\n5.3. Analytic Results # Influence of Hierarchical Instruction. To explore the impact of training data at different granularities on the model\u0026rsquo;s anomaly reasoning ability, we designed various training data combinations during the instruction tuning phase and evaluated the model\u0026rsquo;s performance, as shown in Table 3. The inclusion of clip-level data primarily enhanced the model\u0026rsquo;s basic visual perception abilities regarding actions and scenes within the video. Adding event-level data improved the model\u0026rsquo;s ability to judge and understand complete anomaly events. Furthermore, the involvement of video-level data further enhanced the model\u0026rsquo;s ability to analyze and summarize anomaly-related information across longer-span videos. The hierarchical instruction data structure facilitated a comprehensive and complementary improvement in the model\u0026rsquo;s anomaly-related perception-to-reasoning capabilities.\nInfluence of different sampling methods and the number of sampled frames. Our ATS (Anomaly-focused Temporal Sampler) is designed to adaptively sample frames input to the LLM based on the frame-level anomaly scores. To validate its advantages, we compared ATS with other sampling methods at various sample frame counts, including Uniform and Top-K sampling. In Uniform sampling, N frames are uniformly sampled from all frames, while Top-K sampling selects the frames with the top N highest anomaly scores. As shown in Table 4, ATS consistently outperforms other sampling methods, regardless of the sample count. We believe that Uniform sampling tends to overlook key anomaly frames, though this issue lessens as more frames are sampled. Besides, Top-K sampling tends to overly focus on local anomaly frames, missing contextual frame information. Our proposed ATS mitigates both issues. To balance inference efficiency and performance, we set N=16 as the default sample frame number.\nTable 4. Ablation study of sampling methods and the number of sampled frames. We compare the proposed Anomaly-focused Temporal Sampler (ATS) with other sampling methods under different frame sampling numbers, including Uniform and Top-K. 
Latency is the time delay in generating the first token.\n| Frames (N) | Sampler | Latency (ms) | Video-level BLEU(↑) | Video-level CIDEr(↑) |\n| --- | --- | --- | --- | --- |\n| 8 | Top-K | 244 | 0.462 | 1.229 |\n| 8 | Uniform | 244 | 0.491 | 1.276 |\n| 8 | ATS (Ours) | 244 | 0.514 | 1.324 |\n| 16 | Top-K | 566 | 0.476 | 1.302 |\n| 16 | Uniform | 566 | 0.511 | 1.345 |\n| 16 | ATS (Ours) | 566 | 0.566 | 1.437 |\n| | Top-K | 1402 | 0.481 | 1.332 |\n| | Uniform | 1402 | 0.558 | |\n| | ATS (Ours) | 1402 | | |\nFigure 5. Qualitative comparison of anomaly understanding explanation. 
Compared with state-of-the-art general MLLMs, i.e., InternVL2 [6] and QwenVL2 [53], our proposed Holmes-VAU demonstrates more accurate anomaly judgment, along with more detailed and comprehensive anomaly-related descriptions and analysis. Correct and wrong explanations are highlighted in green and red, respectively.\nInstruction Tuning Parameters. We conducted an ablation study on the parameter r in LoRA [15] to explore how the number of trainable parameters affects both the model\u0026rsquo;s VAU performance and its general capability. We use Video-MME [12] to evaluate the model\u0026rsquo;s general capability. The results are shown in Fig. 6: as r increases, the model gradually adapts to the VAU task. However, when r becomes too large, the model\u0026rsquo;s general capability decreases. To retain the original general video understanding capability of the MLLM, we set r=64 as the default value.\nFigure 6. Ablation study of trainable parameters. (a) Loss curve during instruction-tuning. (b) We tuned the LoRA [15] parameter r to control trainable parameters, evaluating its impact on VAU capability and general performance on Video-MME [12].\n5.4. Qualitative Comparison # We provide qualitative comparisons between Holmes-VAU and existing MLLMs in Fig. 5. The results demonstrate that Holmes-VAU can accurately identify anomalies in videos and provide accurate and complete explanations, highlighting the effectiveness and advantage of Holmes-VAU in perceiving video events and analyzing anomalies.\n6. Conclusion # In conclusion, this work pushes the boundaries of video anomaly understanding by introducing hierarchical anomaly detection across diverse temporal scales, from momentary clips to extended events. 
The HIVAU-70k benchmark, with over 70,000 multi-level annotations, addresses a critical gap in the field, enabling comprehensive anomaly analysis in real-world scenarios. Our Anomaly-focused Temporal Sampler (ATS) strategically enhances focus on anomaly-dense segments, optimizing both efficiency and accuracy in long-term anomaly detection. Extensive experiments demonstrate that our hierarchical dataset and ATS-enhanced VLM achieve significant performance gains over conventional methods, proving robust for open-world anomaly understanding. This work sets a new standard for multi-granular anomaly comprehension, paving the way for more fine-grained video anomaly understanding.\nAcknowledgement This work is supported by the National Natural Science Foundation of China under grants U22B2053 and 623B2039, and in part by the Interdisciplinary Research Program of HUST (2024JCYJ034).\nReferences # [1] Amit Adam, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz. Robust real-time unusual event detection using multiple fixed-location monitors. IEEE transactions on pattern analysis and machine intelligence, 30(3):555–560, 2008. 2 [2] AI@Meta. Llama 3 model card. 2024. 3 , 13 [3] Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, and Kristen Grauman. Hiervl: Learning hierarchical videolanguage embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 23066–23078, 2023. 3 [4] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005. 6 , 7 [5] Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitudecontrastive glance-and-focus network for weakly-supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 387–395, 2023. 
6 [6] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 2 , 5 , 6 , 7 , 8 , 17 [7] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatialtemporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 7 [8] Jisheng Dang, Huicheng Zheng, Xiaohao Xu, Longguang Wang, Qingyong Hu, and Yulan Guo. Adaptive sparse memory networks for efficient and robust video object segmentation. IEEE Transactions on Neural Networks and Learning Systems, 2024. 3 [9] Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, et al. Uncovering what why and how: A comprehensive benchmark for causation understanding of video anomaly. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18793– 18803, 2024. 2 , 3 , 4 , 16 , 17 [10] Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. arXiv preprint arXiv:2406.14515, 2024. 3 [11] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14009– 14018, 2021. 2 , 6 [12] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 
3 , 8\n[13] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1705–1714, 2019. 2\n[14] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016. 2 , 6\n[15] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan AllenZhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 6 , 8\n[16] Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, and Gedas Bertasius. Video recap: Recursive captioning of hour-long videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18198–18208, 2024. 3\n[17] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In 2023 IEEE International Conference on Image Processing (ICIP), pages 3230–3234. IEEE, 2023. 3 , 6\n[18] Kumara Kahatapitiya and Michael S Ryoo. Coarse-fine networks for temporal activity detection in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8385–8394, 2021. 3\n[19] Jaechul Kim and Kristen Grauman. Observe locally, infer globally: a space-time mrf for detecting abnormal activities with incremental updates. In 2009 IEEE conference on computer vision and pattern recognition, pages 2921–2928. IEEE, 2009. 2\n[20] Federico Landi, Cees GM Snoek, and Rita Cucchiara. Anomaly locality in video surveillance. arXiv preprint arXiv:1901.10364, 2019. 
2\n[21] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 2 , 4 , 7\n[22] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multisequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1395–1403, 2022. 2 , 6\n[23] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE transactions on pattern analysis and machine intelligence , 36(1):18–32, 2013. 2\n[24] Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In Proceedings of the European Conference on Computer Vision (ECCV), pages 513–528, 2018. 3\n[25] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 2 , 4 , 7\n[26] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004. 6 , 7\n[27] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 2 , 4 , 6\n[28] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. 3\n[29] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 2 , 4\n[30] Kun Liu and Huadong Ma. Exploring background-bias for anomaly detection in surveillance videos. In Proceedings of the 27th ACM International Conference on Multimedia , pages 1490–1499, 2019. 2\n[31] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018. 2\n[32] Yi Liu, Limin Wang, Yali Wang, Xiao Ma, and Yu Qiao. Fineaction: A fine-grained video dataset for temporal action localization. IEEE transactions on image processing, 31: 6937–6950, 2022. 3\n[33] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13588–13597, 2021. 2\n[34] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013. 2\n[35] Jian Lu, Xuanfeng Li, Bo Zhao, and Jian Zhou. A review of skeleton-based human action recognition. Journal of Image and Graphics, 28(12):3651–3669, 2023. 3\n[36] Hui Lv and Qianru Sun. Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702, 2024. 2 , 16\n[37] Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, and Hanwang Zhang. Unbiased multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8022–8031, 2023. 2\n[38] Baiteng Ma, Shiwei Zhang, Changxin Gao, and Nong Sang. Temporal global correlation network for end-to-end action proposal generation. Acta Electronica Sinica, 50(10):2452– 2461, 2022. 3\n[39] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024. 7\n[40] Ramin Mehran, Alexis Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. 
In 2009 IEEE conference on computer vision and pattern recognition, pages 935–942. IEEE, 2009. 2\n[41] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. 6 , 7\n[42] Yujiang Pu, Xiaoyu Wu, Lulu Yang, and Shengjin Wang. Learning prompt-enhanced context features for weaklysupervised video anomaly detection. IEEE Transactions on Image Processing, 2024. 2 , 3\n[43] Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian Tang, Changxin Gao, Rong Jin, and Nong Sang. Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13821–13831, 2022. 3\n[44] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2 , 5 , 6\n[45] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2616–2625, 2020. 3\n[46] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018. 2 , 3 , 6 , 13\n[47] Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and YingCong Chen. Hawk: Learning to understand open-world video anomalies. arXiv preprint arXiv:2405.16886, 2024. 
2 , 3 , 4 , 16 , 17\n[48] Kamalakar Vijay Thakare, Yash Raghuwanshi, Debi Prosad Dogra, Heeseung Choi, and Ig-Jae Kim. Dyannet: A scene dynamicity guided self-trained video anomaly detection network. In Proceedings of the IEEE/CVF Winter conference on applications of computer vision, pages 5541–5550, 2023. 2 , 6\n[49] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4975–4986, 2021. 2 , 5 , 6\n[50] Anil Osman Tur, Nicola Dall\u0026rsquo;Asen, Cigdem Beyan, and Elisa Ricci. Exploring diffusion models for unsupervised video anomaly detection. In 2023 IEEE International Conference on Image Processing (ICIP), pages 2540–2544. IEEE, 2023. 2\n[51] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 6 , 7\n[52] Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8201–8211, 2019. 2 , 6\n[53] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model\u0026rsquo;s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 7 , 8\n[54] SC Wang, Q Huang, YF Zhang, X Li, YQ Nie, and GC Luo. Review of action recognition based on multimodal data. Image Graph, 27(11):3139–3159, 2022. 3\n[55] Jhih-Ciang Wu, He-Yen Hsieh, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. Self-supervised sparse representation for video anomaly detection. 
In European Conference on Computer Vision, pages 729–745. Springer, 2022. 2 , 5 , 6\n[56] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 322–339. Springer, 2020. 2 , 3 , 6 , 13\n[57] Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open-vocabulary video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 18297–18307, 2024. 2 , 3 , 6\n[58] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6074–6082, 2024. 2 , 3 , 5 , 6\n[59] Dan Xu, Yan Yan, Elisa Ricci, and Nicu Sebe. Detecting anomalous events in videos by learning deep representations of appearance and motion. Computer Vision and Image Understanding, 156:117–127, 2017. 2\n[60] Xiaohao Xu, Jinglu Wang, Xiang Ming, and Yan Lu. Towards robust video object segmentation with adaptive object calibration. In Proceedings of the 30th ACM International Conference on Multimedia, pages 2709–2718, 2022. 3\n[61] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. Video event restoration based on keyframes for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14592–14601, 2023. 2\n[62] Zhiwei Yang, Jing Liu, and Peng Wu. Text prompt with normality guidance for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18899–18908, 2024. 
2 , 3 , 6\n[63] Yu Yao, Xizi Wang, Mingze Xu, Zelin Pu, Yuchen Wang, Ella Atkins, and David J Crandall. Dota: Unsupervised detection of traffic anomaly in driving videos. IEEE transactions on pattern analysis and machine intelligence, 45(1): 444–459, 2022. 2\n[64] Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin, and Zhenzhen Jiao. Towards surveillance video-and-language understanding: New dataset, baselines, and challenges, 2023. 3 , 16\n[65] Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin, and Zhenzhen Jiao. Towards surveillance video-and-language understanding: New dataset baselines and challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22052– 22061, 2024. 2 , 3 , 13 , 17\n[66] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14744–14754, 2022. 2 , 6\n[67] Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18527–18536, 2024. 2 , 3 , 4 , 5 , 6 , 16\n[68] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 2 , 7\n[69] Huaxin Zhang, Xiang Wang, Xiaohao Xu, Zhiwu Qing, Changxin Gao, and Nong Sang. Hr-pro: Point-supervised temporal action localization via hierarchical reliability propagation. arXiv preprint arXiv:2308.12608, 2023. 3\n[70] Huaxin Zhang, Xiang Wang, Xiaohao Xu, Xiaonan Huang, Chuchu Han, Yuehuan Wang, Changxin Gao, Shanjun Zhang, and Nong Sang. Glancevad: Exploring glance supervision for label-efficient video anomaly detection. 
arXiv preprint arXiv:2403.06154, 2024. 2\n[71] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 7 , 13\n[72] Bin Zhao, Li Fei-Fei, and Eric P Xing. Online detection of unusual events in videos via dynamic sparse coding. In CVPR 2011, pages 3313–3320. IEEE, 2011. 2\n[73] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1237–1246, 2019. 2\n[74] Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3769–3777, 2023. 2 , 5 , 6 , 14 , 15\nHolmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity # Supplementary Material\nA. Details of the Data Engine. # To construct a dataset with hierarchical annotations covering both short-term and long-term anomalies, we developed a semi-automated annotation engine that combines manual effort with the generative capabilities of LLMs. In the main paper, we present the complete annotation workflow. Below, we provide additional details about the data engine.\nA.1. Hierarchical Video Decoupling # Before annotation, we collected videos from the training sets of the UCF-Crime [46] and XD-Violence [56] datasets. From UCF-Crime, we selected 758 normal videos and 735 anomaly videos, while from XD-Violence, we selected 1,904 normal videos and 2,046 anomaly videos. The anomaly videos included their original video-level labels, e.g., Abuse, Explosion. For the anomaly videos, we organized a team of five annotators to label each anomaly event within the videos. 
The annotation process took approximately 20 hours to complete. For the normal videos, we considered all segments to be normal and randomly cropped segments of varying lengths to serve as normal event-level video segments. These anomaly and normal event-level videos were further divided into shorter clip-level segments. For UCF-Crime, we adopted the clip-level divisions from UCA [65]. For XD-Violence, we performed uniform division.\nA.2. Hierarchical Free-text Annotation # Clip Captioning. For videos in UCF-Crime, we fully utilized the manually annotated captions from UCA [65]. For videos in XD-Violence, we used LLaVA-Next-Video-7B [71] as our captioner to generate textual descriptions for clip-level videos. The specific prompt is as follows:\n\u0026lsquo;Please provide a short and brief description of the video clip, focusing on the main subjects and their actions.\u0026rsquo;\nEvent Summary. We combined all captions and video-level category labels to generate anomaly-related summaries for each event using an LLM. We selected LLaMA3-70B [2] as our LLM due to its strong summarization capabilities. The specific prompt is as follows:\n\u0026lsquo;The dense caption of the video is: {clip captions}. There are (is no) abnormal events ({video-level label}) in the video. Your response should include the following three parts: 1. Whether the anomaly exists and the specific name of the anomaly. 2. A summary of the anomaly events. 3. Brief explanation of the basis for judging the anomaly.\u0026rsquo;\nVideo Summary. Similar to generating event summaries, we generated video-level summaries by analyzing the event-level summaries. The specific prompt is as follows:\n\u0026lsquo;Below is a summary of all the events in the video: {event summaries}. There are (is no) abnormal events ({video-level label}) in the video. Your response should include the following three parts: 1. Whether the anomaly exists and the specific name of the anomaly. 2. Detailed description of the video anomaly event from start to end. 3. 
Brief analysis of the basis for judging the anomaly.\u0026rsquo;\nAnnotation Format. In Fig. A, we present an example of the hierarchical free-text annotations for a video.\n{\n  \u0026#34;video\u0026#34;: \u0026#34;v=2rfyeR-YaJw__#1_label_G-0-0\u0026#34;,\n  \u0026#34;n_frames\u0026#34;: 1940,\n  \u0026#34;fps\u0026#34;: 24.0,\n  \u0026#34;label\u0026#34;: [\u0026#34;Explosion\u0026#34;],\n  \u0026#34;clips\u0026#34;: [[[5.583, 11.903], [11.903, 18.222], [18.222, 24.542]], [[36.167, 43.48], [43.48, 50.792]]],\n  \u0026#34;clip_captions\u0026#34;: [\n    [\n      \u0026#34;A military tank moving across a barren landscape with low-rise buildings and sparse vegetation. the sky is overcast, and the overall color palette is muted with earthy tones.\u0026#34;,\n      \u0026#34;A series of images depicting a barren landscape with a few buildings in the background. the foreground consists of a rocky terrain with sparse vegetation. the sky is overcast, and there are no visible people or moving objects.\u0026#34;,\n      \u0026#34;A silhouette of a person operating a large, mounted weapon on a rocky terrain under a clear sky. the individual appears to be adjusting or aiming the weapon.\u0026#34;\n    ],\n    [\n      \u0026#34;A series of images depicting a hazy, dusty environment with buildings in the background. the focus is on a red light source that appears to be moving or flashing, possibly indicating some form of activity or event occurring in the area.\u0026#34;,\n      \u0026#34;A series of images depicting a military tank moving across a landscape with buildings in the background. the tank is seen from different angles, and there are visible explosions occurring in the distance.\u0026#34;\n    ]\n  ],\n  \u0026#34;events\u0026#34;: [[5.583, 24.542], [36.167, 50.792]],\n  \u0026#34;event_summary\u0026#34;: [\n    \u0026#34;The anomaly exists, specifically an Explosion anomaly. The basis for judging the anomaly is the presence of an unexpected and sudden release of energy, likely accompanied by a loud noise and visible effects such as smoke, fire, or debris.\u0026#34;,\n    \u0026#34;The anomaly exists and its specific name is Explosion. The anomaly event is a series of visible explosions occurring in the distance as a military tank moves across a landscape with buildings in the background. The basis for judging the anomaly is the sudden and intense release of energy and light in the form of explosions, which is an unusual and notable event in the otherwise desolate and quiet war-torn environment depicted in the video\u0026#34;\n  ],\n  \u0026#34;video_summary\u0026#34;: \u0026#34;The anomaly exists, specifically named as Explosion. The anomaly event depicts a series of visible explosions occurring in the distance as a military tank moves across a landscape with buildings in the background, characterized by sudden and intense releases of energy and light, accompanied by loud noise and visible effects such as smoke, fire, or debris. The basis for judging the anomaly lies in the unusual and notable nature of these explosions, which stand out against the otherwise desolate and quiet war-torn environment depicted in the video, making them an unexpected and sudden release of energy that grabs attention.\u0026#34;\n}\nFigure A. An example of hierarchical free-text annotations. For each labeled video, the hierarchical free-text annotations include clip-level captions and event-level and video-level anomaly analyses. Additionally, the temporal boundaries for each event and clip are annotated.\n
A.3. Hierarchical Instruction Data Construction # To construct the instruction dataset, we designed question prompts tailored to different tasks, including Caption, Judgement, Description, and Analysis. For each instruction item, we randomly selected one prompt from the pool and matched it with the corresponding content from the free-text annotations as the answer.\nCaption. # \u0026ldquo;Describe the video briefly.\u0026rdquo;\n\u0026ldquo;Describe the main events that take place in this video.\u0026rdquo;\n\u0026ldquo;Give a short description of the video.\u0026rdquo;\n\u0026ldquo;What happened in this video?\u0026rdquo;\n\u0026ldquo;Generate a brief caption for the video.\u0026rdquo;\n\u0026ldquo;Can you provide a brief description of the video?\u0026rdquo;\n\u0026ldquo;Briefly describe the main subjects and their actions in the video.\u0026rdquo;\n\u0026ldquo;Provide a short overview of what happens in the video?\u0026rdquo;\n\u0026ldquo;Describe the key moments that showcase the subjects\u0026rsquo; activities in the video.\u0026rdquo;\n\u0026ldquo;Describe the sequence of events involving the main subjects in the video.\u0026rdquo;\n\u0026ldquo;What activities happen throughout the video?\u0026rdquo;\n\u0026ldquo;Describe the main subjects and their roles in the video.\u0026rdquo;\n\u0026ldquo;What key moments stand out in the video?\u0026rdquo;\n\u0026ldquo;What are the primary activities showcased in the video?\u0026rdquo;\n\u0026ldquo;What happens to the main subjects as the video progresses?\u0026rdquo;\n\u0026ldquo;What is a brief overview of what happens in the video?\u0026rdquo;\n\u0026ldquo;Describe the main subjects and their contributions to the video.\u0026rdquo;\n\u0026ldquo;Describe the key events in the video.\u0026rdquo;\n\u0026ldquo;Describe the video\u0026rsquo;s main activities.\u0026rdquo;\n\u0026ldquo;Can you describe the main action in this video briefly?\u0026rdquo;\n\u0026ldquo;Describe the video clip concisely.\u0026rdquo;\n\u0026ldquo;Provide a brief description of the given video clip.\u0026rdquo;\n\u0026ldquo;Summarize the visual content of the video clip.\u0026rdquo;\n\u0026ldquo;Give a short and clear explanation of the subsequent video clip.\u0026rdquo;\n
Judgement. # \u0026ldquo;What types of anomalies are shown in the video clip?\u0026rdquo;\n\u0026ldquo;Are there any anomaly events detected in the video?\u0026rdquo;\n\u0026ldquo;Detect and classify the anomaly events in the video.\u0026rdquo;\n\u0026ldquo;Identify any abnormal behaviors depicted in the video.\u0026rdquo;\n\u0026ldquo;Determine whether there are anomaly events in the video and the specific name of the anomaly.\u0026rdquo;\n\u0026ldquo;What anomalies can be identified in the video?\u0026rdquo;\n\u0026ldquo;What categories of anomalies can be found in the video?\u0026rdquo;\n\u0026ldquo;Could you point out any abnormal actions in the video?\u0026rdquo;\n\u0026ldquo;Point out the abnormal actions in the video.\u0026rdquo;\n\u0026ldquo;Are there anomalies observed in the video clip?\u0026rdquo;\nDescription. # \u0026ldquo;Describe the anomaly events observed in the video.\u0026rdquo;\n\u0026ldquo;Could you describe the anomaly events observed in the video?\u0026rdquo;\n\u0026ldquo;Could you specify the anomaly events present in the video?\u0026rdquo;\n\u0026ldquo;Give a description of the detected anomaly events in this video.\u0026rdquo;\n\u0026ldquo;Could you give a description of the anomaly events in the video?\u0026rdquo;\n\u0026ldquo;Provide a summary of the anomaly events in the video.\u0026rdquo;\n\u0026ldquo;Could you provide a summary of the anomaly events in this video?\u0026rdquo;\n\u0026ldquo;What details can you provide about the anomaly in the video?\u0026rdquo;\n\u0026ldquo;How would you detail the anomaly events found in the video?\u0026rdquo;\n\u0026ldquo;How would you describe the particular anomaly events in the video?\u0026rdquo;\n
Analysis. # \u0026ldquo;Why do you judge this event to be anomalous?\u0026rdquo;\n\u0026ldquo;Can you provide the reasons for considering it anomalous?\u0026rdquo;\n\u0026ldquo;Can you give the basis for your judgment of this event as an anomaly?\u0026rdquo;\n\u0026ldquo;What led you to classify this event as an anomaly?\u0026rdquo;\n\u0026ldquo;Could you provide the reasons for considering this event as abnormal?\u0026rdquo;\n\u0026ldquo;What evidence do you have to support your judgment of this event as an anomaly?\u0026rdquo;\n\u0026ldquo;Can you analyze the factors contributing to this anomalous event?\u0026rdquo;\n\u0026ldquo;Could you share your analysis of the anomalous event?\u0026rdquo;\n\u0026ldquo;What patterns did you observe that contributed to your conclusion about this event being an anomaly?\u0026rdquo;\n\u0026ldquo;How do the characteristics of this event support its classification as an anomaly?\u0026rdquo;\nA.4. Data Samples. # To facilitate understanding, we provide the final constructed instruction data at various temporal granularities, including clip-level, event-level, and video-level, as shown in Fig. B, Fig. C, and Fig. D.\nB. Details of the Anomaly Scorer # B.1. Model Architecture # We use UR-DMU [74] as the anomaly scorer in our Anomaly-focused Temporal Sampler. As shown in Fig. E, UR-DMU utilizes a Global and Local Multi-Head Self Attention (GL-MHSA) module to capture both long-range and short-range temporal relationships among video snippets. Furthermore, UR-DMU introduces two memory banks to store and differentiate abnormal and normal prototypes, thereby maximizing the margins between these two representations. 
In order to learn discriminative representations, UR-DMU uses triplet loss to increase the feature distance after interacting with different memories. Simultaneously, it utilizes KL loss to constrain the normal memory to follow a Gaussian distribution, accounting for the variance introduced by noise. Furthermore, we leveraged the event-level anomaly boundaries obtained during the annotation phase to generate frame-level labels and computed a binary cross-entropy loss, i.e., L_AS, which is simple yet effective. Thus, the loss function for the anomaly scorer is defined as follows:\nB.2. Visualization Results # In Fig. F, we present visualized results of anomaly scores and sampled frames on the UCF-Crime and XD-Violence test sets. These results demonstrate the accuracy of our method in anomaly detection within complex real-world scenarios, with the sampled frames being concentrated in anomalous regions.\n
{\n  \u0026#34;id\u0026#34;: \u0026#34;Abuse007_x264_E0C2\u0026#34;,\n  \u0026#34;type\u0026#34;: \u0026#34;clip\u0026#34;,\n  \u0026#34;video\u0026#34;: \u0026#34;ucf-crime/clips/train/Abuse007_x264_E0C2.mp4\u0026#34;,\n  \u0026#34;conversations\u0026#34;: [\n    {\u0026#34;from\u0026#34;: \u0026#34;human\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;Describe the key events in the video.\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;gpt\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;The man in shorts was slapped to the ground by the strong man opposite.\u0026#34;}\n  ]\n}\nFigure B. An example of clip-level instruction data.\n{\n  \u0026#34;id\u0026#34;: \u0026#34;Abuse018_x264_E0\u0026#34;,\n  \u0026#34;type\u0026#34;: \u0026#34;event\u0026#34;,\n  \u0026#34;video\u0026#34;: \u0026#34;ucf-crime/events/train/Abuse018_x264_E0.mp4\u0026#34;,\n  \u0026#34;conversations\u0026#34;: [\n    {\u0026#34;from\u0026#34;: \u0026#34;human\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;Identify any abnormal behaviors depicted in the video.\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;gpt\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;An anomaly exists, specifically an instance of Police Brutality.\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;human\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;Could you provide a summary of the anomaly events in this video?\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;gpt\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;The anomaly event involves the police officer kicking the man who rushed towards him, causing the man to fall heavily to the ground with his head on his back.\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;human\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;Could you share your analysis of the anomalous event?\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;gpt\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;The excessive use of force by the police officer, which deviates from standard protocol and may be considered an abuse of authority.\u0026#34;}\n  ]\n}\nFigure C. An example of event-level instruction data.\n
{\n  \u0026#34;id\u0026#34;: \u0026#34;v=xLFbqPzSI2Y__#1_label_B4-0-0\u0026#34;,\n  \u0026#34;type\u0026#34;: \u0026#34;video\u0026#34;,\n  \u0026#34;video\u0026#34;: \u0026#34;xd-violence/videos/train/v=xLFbqPzSI2Y__#1_label_B4-0-0.mp4\u0026#34;,\n  \u0026#34;conversations\u0026#34;: [\n    {\u0026#34;from\u0026#34;: \u0026#34;human\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;Are there any anomaly events detected in the video?\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;gpt\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;The anomaly exists and its specific name is the Riot anomaly.\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;human\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;Could you give a description of the anomaly events in the video?\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;gpt\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;The anomaly event depicts a chaotic and intense urban unrest scene, starting with a chaotic street scene where individuals are running away from an area of fire and smoke, with people appearing to be in a state of panic or urgency, and some being sprayed with water by individuals in uniform. The scene escalates to show a growing fire in the background, police officers in riot gear standing on the street, and a formation of police officers spread out across the road, suggesting an ongoing riot or civil unrest.\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;human\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;Can you give the basis for your judgment?\u0026#34;},\n    {\u0026#34;from\u0026#34;: \u0026#34;gpt\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;The presence of multiple indicators of a riot or emergency situation, including fire, smoke, and people fleeing in panic, as well as the sudden appearance of flames, smoke, and police in riot gear, and the presence of violent and destructive events, such as fires and explosions, and the response of authorities, including the use of tear gas or pepper spray, which deviate significantly from normal, peaceful urban activity.\u0026#34;}\n  ]\n}\nFigure D. An example of video-level instruction data.\nFigure E. Architecture of the Anomaly Scorer (UR-DMU [74]).\nFigure F. Visualization results of anomaly scores and sampled frames output by the Anomaly-focused Temporal Sampler.\n
Table A. Comparison with related multimodal/explainable VAU methods and benchmarks. HIVAU-70k provides accurate temporal annotations and hierarchical anomaly-related free-text annotations.\nMethods | #Categories | #Samples | Text (clip-level) | Text (event-level) | Text (video-level) | Temp. Anno. | MLLM tuning\nUCA [64] | 13 | 23,542 | ✓ | ✗ | ✗ | ✓ | ✗\nLAVAD [67] | N/A | N/A | ✓ | ✗ | ✓ | ✗ | ✗\nVAD-VideoLLaMA [36] | 13/7 | 2,400 | ✗ | ✗ | ✓ | ✗ | projection\nCUVA [9] | 11 | 6,000 | ✗ | ✗ | ✓ | ✗ | ✗\nHawk [47] | - | 16,000 | ✗ | ✗ | ✓ | ✗ | projection\nHIVAU-70k (Ours) | 19 | 70,000 | ✓ | ✓ | ✓ | ✓ | LoRA\nC. Discussion with related works. # In Table A, we provide a comprehensive comparison with related works in terms of benchmarks and methods.\nSummary of related works: Recently, there has been substantial research on multi-modal Video Anomaly Understanding, making significant contributions to advancing open-world anomaly understanding. LAVAD [67] utilized several pre-trained foundational models to offer a training-free explainable VAD process. VAD-VideoLLaMA [36] designed a three-phase training method to finetune VideoLLaMA in the VAD domain. CUVA [9] introduced a dataset and metric for evaluating causation understanding of video anomalies. Hawk [47] constructed an instruction dataset and finetuned a video-language framework that incorporates both motion and video information.\nDifference and Advantages of our proposed benchmark and method: # We develop a semi-automated annotation engine that scales hierarchical anomaly annotation efficiently, combining manual refinement with LLM-based annotation to maintain high-quality data across multiple granularities. The result is over 70,000 annotations at the clip, event, and video levels, significantly surpassing previous datasets in scale.\nUCA [65] only provides clip-level captions, overlooking the understanding of anomalies across longer time spans. CUVA [9] and Hawk [47], on the other hand, only offer video-level instruction data, neglecting finer-grained visual perception and anomaly analysis. 
In contrast, our proposed HIVAU-70k takes a multi-temporal granularity perspective, offering more comprehensive and diverse anomaly annotations for open-world anomaly detection. It enables progressive and comprehensive learning, from short-term visual perception to long-term anomaly reasoning.\nWe propose the Anomaly-focused Temporal Sampler (ATS), integrated with a multi-modal visual-language model. Benefiting from the precise temporal annotations we provide, the ATS is able to focus on anomaly-dense video segments. This integration significantly improves efficiency and accuracy in long-video anomaly detection.\nD. More Qualitative Results. # As shown from Fig. G to Fig. J, we present the output of explainable text generated by Holmes-VAU compared with the base model, i.e., InternVL-2B [6]. The results demonstrate significant improvements in the model\u0026rsquo;s visual perception and anomaly analysis capabilities after fine-tuning on HIVAU-70k.\nE. Limitations and Future Work. # While our work demonstrates significant strides in multi-granular video anomaly understanding, several limitations present opportunities for future enhancement. First, optimizing for real-time streaming remains a challenge. Our sparse sampling approach improves efficiency, but further refinement is necessary for seamless deployment in streaming contexts. Additionally, our work has so far focused on surveillance data; extending our framework to other domains, such as industrial monitoring and medical diagnostics, will help validate its generalization capabilities. Lastly, integrating additional sensory data, like audio, and scalable hierarchical annotation could enhance anomaly detection and broaden applicability.\nGround Truth: # A green car driving on a road in rainy weather. the windshield wipers are in motion, and the road appears wet. 
there are traffic signs and other vehicles visible in the background.\nInternVL2 # Throughout the video, we see cars driving on the road, a green car driving on the road, a green car driving on the road, a green car driving on the road, a green car driving on the road\nHolmes-VAU (Ours) # A view from inside a vehicle, looking out through the windshield at a street scene. the sky is overcast, and the visibility is low due to the weather conditions. there are several cars on the road, including a green car in the foreground. the road is lined with traffic signs and streetlights, and there are buildings and trees visible in the background.\nPrompt: Describe the main events that take place in this video.\nGround Truth: # A man in a striped shirt holding another person by the hair, appearing to be in distress. the scene transitions to a close-up of the man\u0026rsquo;s hands as he picks up a knife from a cluttered kitchen counter.\nInternVL2 # The video clip shows a man in a striped shirt cutting a piece of meat with a knife on a kitchen counter. The camera focuses on the man\u0026rsquo;s hands and the knife as he cuts the meat. The background is blurred, emphasizing the action in the foreground.\nHolmes-VAU (Ours) # A close-up of two individuals engaged in a physical altercation. one person is wearing a striped shirt and appears to be grappling with another individual who is dressed in dark clothing. the setting is dimly lit, suggesting an indoor environment, possibly a room or a kitchen\nFigure G. Qualitative comparison of anomaly understanding explanation with our baseline model, i.e., InternVL-2B. Correct and wrong explanations are highlighted in green and red, respectively.\nclip-level \u0026lt;Caption\u0026gt;\nFigure H. Qualitative comparison of anomaly understanding explanation with our baseline model, i.e., InternVL-2B. Correct and wrong explanations are highlighted in green and red, respectively.\nFigure I. 
Qualitative comparison of anomaly understanding explanation with our baseline model, i.e., InternVL-2B. Correct and wrong explanations are highlighted in green and red, respectively.\nFigure J. Qualitative comparison of anomaly understanding explanation with our baseline model, i.e., InternVL-2B. Correct and wrong explanations are highlighted in green and red, respectively.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/holmes-vau-towards-long-term-video-anomaly-understanding-at-any-granularity/","section":"Papers","summary":"A semi-automated hierarchical video annotation framework combined with a novel Anomaly-focused Temporal Sampler and a multimodal large language model, aimed at comprehensive understanding of complex and long-term video anomalies across multiple temporal scales.","title":"Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity","type":"other"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/huaping-liu/","section":"Authors","summary":"","title":"Huaping Liu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/huaxin-zhang/","section":"Authors","summary":"","title":"Huaxin Zhang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/hui-lv/","section":"Authors","summary":"","title":"Hui Lv","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/isht-dwivedi/","section":"Authors","summary":"","title":"Isht Dwivedi","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jiacong-xu/","section":"Authors","summary":"","title":"Jiacong Xu","type":"authors"},{"content":"","date":"1 October 
2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jiafei-wu/","section":"Authors","summary":"","title":"Jiafei Wu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jiahui-liu/","section":"Authors","summary":"","title":"Jiahui Liu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jialong-zuo/","section":"Authors","summary":"","title":"Jialong Zuo","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jiangbo-lu/","section":"Authors","summary":"","title":"Jiangbo Lu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jiangbo-qian/","section":"Authors","summary":"","title":"Jiangbo Qian","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jianqin-wu/","section":"Authors","summary":"","title":"Jianqin Wu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jiaqi-tang/","section":"Authors","summary":"","title":"Jiaqi Tang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jiayou-qin/","section":"Authors","summary":"","title":"Jiayou Qin","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jingjing-chen/","section":"Authors","summary":"","title":"Jingjing Chen","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/john-suchanek/","section":"Authors","summary":"","title":"John Suchanek","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jun-wang/","section":"Authors","summary":"","title":"Jun 
Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/ke-ma/","section":"Authors","summary":"","title":"Ke Ma","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/kun-qian/","section":"Authors","summary":"","title":"Kun Qian","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/kwonjoon-lee/","section":"Authors","summary":"","title":"Kwonjoon Lee","type":"authors"},{"content":" Language-guided Open-world Video Anomaly Detection # Zihao Liu, Xiaoyu Wu*, Jianqin Wu, Xuxu Wang, Linlin Yang Communication University of China\n{liuzihao, wuxiaoyu}@cuc.edu.cn {wujianqin, wangxuxu}@mails.cuc.edu.cn mu4yang@gmail.com\nAbstract # Video anomaly detection models aim to detect anomalies that deviate from what is expected. In open-world scenarios, the expected events may change as requirements change. For example, not wearing a mask is considered abnormal during a flu outbreak but normal otherwise. However, existing methods assume that the definition of anomalies is invariable, and thus are not applicable to the open world. To address this, we propose a novel open-world VAD paradigm with variable definitions, allowing guided detection through user-provided natural language at inference time. This paradigm necessitates establishing a robust mapping from video and textual definition to anomaly score. Therefore, we propose LaGoVAD (Language-guided Open-world VAD), a model that dynamically adapts anomaly definitions through two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Training such adaptable models requires diverse anomaly definitions, but existing datasets typically provide given labels without semantic descriptions. 
To bridge this gap, we collect PreVAD (Pre-training Video Anomaly Dataset), the largest and most diverse video anomaly dataset to date, featuring 35,279 annotated videos with multi-level category labels and descriptions that explicitly define anomalies. Zero-shot experiments on seven datasets demonstrate SOTA performance. Data and code will be released.\n1. Introduction # Video Anomaly Detection (VAD) aims to identify frames in videos that deviate from expected patterns [5, 33], which is applicable in fields such as intelligent surveillance and monitoring [20]. In recent years, many VAD methods have achieved commendable performance employing weak supervision [6, 14, 21, 25, 35, 37] or semi-supervision [24, 31] in the closed-set setting. However, there is a consensus [20, 34, 45, 46] that the algorithm should detect anomalies beyond the training data in open-world scenarios.\n*Corresponding author\nFigure 1. Comparison of different VAD paradigms. Closed-set methods (b) can only detect anomalies in the training scope, while open-set methods (c) can detect novel anomalies. Our open-world approach (d) can deal with changes to the pattern\u0026rsquo;s label in open-world scenarios, with an example in (e).\nAs shown in Fig. 1a, the training data for VAD models encompass patterns labeled as normal or abnormal, where normal patterns include activities such as running and abnormal patterns comprise events like explosions. Conventional closed-set methods (Fig. 1b) [21, 35] aim to detect patterns identical to those encountered during training when applied to test sets, thereby restricting their application in open-world scenarios. In contrast, open-set approaches (Fig. 1c) [46] (including open-vocabulary [34] and domain generalization [2, 13, 30] methods) are able to detect novel patterns absent from the training data without tuning. However, these methods neglect the critical issue of potential label alteration during testing (Fig. 
1d), i.e., patterns originally labeled as normal may be redefined as abnormal (and vice versa). A representative example from Fig. 1e demonstrates this phenomenon: while a pedestrian on the road is regarded as normal behavior in conventional crime anomaly datasets [25], this same pattern would typically be classified as abnormal in freeway surveillance scenarios. The cause of such label alteration lies in the user\u0026rsquo;s different definition of what constitutes anomalies, driven by environments or temporal policies. Formally, this is a concept drift issue, as defined in [18], which refers to the divergence between the conditional probability distributions of training and testing phases, i.e., P_train(y|v) ≠ P_test(y|v), where v are videos and y are anomaly labels. While some attempts have begun to address this, critical limitations remain. Scene-dependent methods [2, 4, 8] associate the anomaly definition with scenes, neglecting user-specific requirements (e.g., hospital administrators may require detecting the anomaly of not wearing masks during influenza outbreaks but not at other times). Meanwhile, a dataset-dependent method [9] explores anomaly conflicts across datasets, but remains constrained by predefined categories of training datasets, lacking generalizability to open-world dynamics. Besides the limitation of task paradigms, existing methods are evaluated only on limited scenes with small-scale data, lacking extensive zero-shot cross-domain comparisons to verify the open-world capability.\nTo address the aforementioned concept drift challenge, we propose a novel open-world paradigm. First, we explicitly model the anomaly definition as a stochastic variable instead of fixing it as one or a few realizations. Then, we condition predictions y on both the video v and the anomaly definition z, i.e., learning a mapping Φ : (v, z) → y. Since we take the changing definition of anomalies into account, we effectively avoid concept drift (as detailed in Sec. 
3). Finally, to enable natural interaction, we employ textual anomaly definition, allowing users to dynamically define anomalies via language.\nHowever, learning Φ requires modeling a more complex multimodal space, resulting in decaying sample density that leads to overfitting. We tackle this overfitting from both the model and the dataset perspectives. We propose a Language-Guided Open-world Video Anomaly Detector (LaGoVAD), which employs two regularization strategies to reduce overfitting: 1) We align vision and language through contrastive learning with negative sample mining, which helps the model learn more robust features. 2) We incorporate a novel dynamic video synthesis module that generates long videos and pseudo-labels on the fly, which functions as a form of data augmentation to diversify the relative duration of abnormal events.\nWe construct a large-scale diversified Pre-training Video Anomaly Dataset (PreVAD), which is collected through a scalable data curation pipeline utilizing foundation models to automate data cleaning and annotation, significantly reducing manual labeling costs while ensuring high quality. PreVAD comprises 35,279 videos annotated with multi-level categories and anomaly descriptions. To our knowledge, PreVAD surpasses existing datasets in diversity and scale.\nFigure 2. Comparison of zero-shot performance between LaGoVAD and other methods across seven datasets [1, 17, 25, 27, 32, 38, 45]. All datasets use the AUC metric except XD-Violence, which uses AP. LaGoVAD significantly outperforms existing state-of-the-art methods on all datasets.\nTo evaluate the open-world capability, we set up a new benchmark that performs zero-shot anomaly detection on seven VAD datasets (UCF-Crime [25], XD-Violence [32], MSAD [45], UBNormal [1], DoTA [38], TAD [17], LAD [27]). 
Our contributions are summarized as follows:\nWe reformulate open-world VAD, formalizing the concept drift issue in VAD and proposing a language-guided paradigm to avoid it.\nWe propose a novel language-guided video anomaly detection model, LaGoVAD, which implements the proposed paradigm and incorporates two regularization strategies to mitigate overfitting.\nWe build a large-scale and diverse dataset, PreVAD, annotated with multi-level taxonomy and anomaly descriptions to enhance generalization under the new paradigm.\nWe conduct zero-shot evaluation across seven datasets to validate the generalization of our method, and LaGoVAD achieves state-of-the-art performance (Fig. 2).\n2. Related Work # 2.1. Video Anomaly Datasets # We summarize the characteristics of existing video anomaly datasets in Tab. 1.\nScale. The largest standalone dataset [32] contains only 5K videos, with ensemble datasets reaching 7.8K [26]. The data scarcity limits the performance of VAD.\nDomain \u0026amp; Category. Many datasets focus only on a single scene, such as traffic or campus. The few datasets that cover multiple scenes overlook domains like mishaps, animal-related violence, and production accidents.\nText Annotation. Existing VAD datasets are labeled with anomaly categories, which introduces semantic ambiguity. Although [12, 26, 39, 41, 43] provide different types of text annotation, they focus on understanding or captioning tasks and cannot provide a fine-grained overall description of the anomaly in a video.\nTable 1. Comparisons between PreVAD and existing datasets. 
Our dataset 1) has the largest scale and broadest domain coverage, 2) is annotated with abnormal video descriptions, 3) does not integrate data from existing VAD datasets.\n| Dataset | # videos (# abnormal videos) | Domain | # categories | Text Annotation | Source |\n| --- | --- | --- | --- | --- | --- |\n| ShanghaiTech [42] | 437 (107) | campus | 13 | - | recording |\n| UCF-Crime [25] | 1900 (950) | crime | 14 | - | web |\n| XD-Violence [32] | 4754 (2405) | crime | 7 | - | web, movie |\n| UBI-Fights [10] | 1000 (216) | violence | 2 | - | web |\n| LAD [27] | 2000 (762) | crime, traffic, animal, mishap | 14 | - | web |\n| TAD [17] | 500 (250) | crime | 8 | - | web |\n| UBNormal [1] | 543 (278) | pedestrian | 28 | - | synthesis |\n| DoTA [38] | 5677 (5677) | traffic | 9 | - | web |\n| NWPU-Campus [4] | 547 (124) | campus | 28 | - | recording |\n| MSAD [45] | 720 (240) | crime, traffic, mishap | 55 | - | web |\n| UCCD [43] | 1012 (382) | crime | - | dense | UCF |\n| UCA [39] | 1854 (944) | crime | - | dense | UCF |\n| VAD-Instruct50k [41] | 5547 (2715) | crime | - | instruction | UCF+XD |\n| HAWK [26] | 7852 (6677) | crime, traffic | - | instruction | 7 VAD datasets |\n| CUVA [12] | 1000 (1000) | crime, traffic, pedestrian, animal | 42 | instruction | web |\n| PreVAD | 35279 (11979) | crime, traffic, animal, mishap, production | 35 | anomaly description | web |\nSource. Current datasets are mainly from public web videos, while others rely on synthetic generation [1, 19, 34] or movie clips [34]. However, synthetic datasets suffer from misalignment with the real world, and movie data raises concerns about potential copyright infringement.\nIn this paper, we propose a scalable data curation pipeline to collect a novel dataset, which has large-scale diversified videos with multi-level taxonomy and anomaly descriptions.\n2.2. Open-world Video Anomaly Detection Methods # Intuitively, open-world VAD models should detect novel anomalies beyond the training set, as discussed in [13, 26, 34, 46]. From a task paradigm perspective, early attempts adopt open-set and domain generalization strategies [1, 13, 46]. Then, [34] extends this paradigm with open-vocabulary VAD, enabling both detection and classification of unseen anomalies. 
However, these approaches implicitly assume a fixed anomaly definition and restrict model exposure to partial categories during training, and are therefore unable to deal with the concept drift issue. Recent studies explore the dynamic anomaly definition: [2, 4, 8] posit that anomaly is scene-dependent (e.g., identical behaviors classified differently across scenes), training models to infer scene-anomaly correlations from data, and [9] trains dataset-specific classifiers. Despite these efforts, they do not support user-customizable anomaly definitions, limiting their applicability in open-world scenarios. Additionally, [26] explores open-world video understanding for video-QA tasks but lacks detection capabilities.\nFrom a model design perspective, current advancements primarily adopt two pathways: 1) data-driven approaches [1, 13, 19, 34, 46] enhance generalization by utilizing more data, while 2) cross-modal alignment approaches [6, 34, 35, 37] aim to construct more robust feature spaces by aligning vision and language. However, they neglect the problem of duration distribution shifts when leveraging more data and only align videos to class-level text embeddings without further fine-grained alignment.\nOur work introduces a novel open-world VAD paradigm that allows users to flexibly define anomalies to guide detection, thereby avoiding concept drift. We implement this paradigm via a model featuring dynamic video synthesis and contrastive learning with hard negative mining. This model synthesizes videos of variable durations and achieves fine-grained modal alignment.\n3. Paradigm: Language-guided Open-world Video Anomaly Detection # We define open-world video anomaly detection as the task of identifying video frames containing abnormal patterns, where the definition of abnormality may change during testing. Abnormal patterns manifest as events, behaviors, or actions (e.g., running). 
In practice, the definition of anomalies may change as users\u0026rsquo; requirements change, influenced by cultural differences, policy updates, and specific environments. The user may expand the definition to detect new anomalies or narrow the definition to remove those of no interest, which causes the abnormality label of a particular pattern to change. For instance, while running is generally normal behavior, it becomes abnormal in libraries or offices.\nFigure 3. Architecture of our proposed LaGoVAD. We implement the language-guided VAD paradigm by adding an anomaly definition branch (z → G → U). The model is trained with two novel regularization strategies: dynamic video synthesis L_dvs (Sec. 4.1) and contrastive learning loss with negative mining L_neg (Sec. 4.2).\nFormally, this is a concept drift issue [18]:\nP_train(Y|V) ≠ P_test(Y|V),\nwhere V denotes the video and Y denotes the label. This conditional probability distribution can be expanded as:\nwhere Z denotes the anomaly definition. We hypothesize that Y is solely determined by Z and V, i.e., the result only depends on the video to be detected and the anomaly definition. Therefore, the concept drift happens due to the change of P(Z|V).\nExisting methods can be seen as modeling Φ : v → y and performing detection based on a fixed definition z sampled from Z:\nwhere θ denotes the parameters of the model Φ, and L denotes the loss function. It is worth emphasizing that some methods that can detect unknown anomalies also belong to this paradigm, including open-set [1, 46], domain generalization [30] and open-vocabulary [34] methods, because they assume a fixed category set under a specific definition, of which only a subset is available in training. Under their assumption, an abnormal pattern would never change to normal, and thus they are unable to deal with the concept drift in the open world.\nIn contrast, we propose a paradigm that directly models Φ : (v, z) → y to avoid the concept drift. 
It assumes a dynamic anomaly definition and conditions predictions on both the video and the definition. Formally,\nDuring training, the model Φ learns an optimal set of parameters θ that detects anomalies in video v under the guidance of definition z. We implement z in the form of natural language, but theoretically, it could be an image, video, audio, or a learned embedding. Our paradigm is especially applicable when the user needs to temporarily specify an abnormal behavior, e.g., detecting not wearing a mask during a flu outbreak.\n4. Method: LaGoVAD # We implement the language-guided VAD paradigm via LaGoVAD. We first introduce the overall architecture, followed by details of two proposed regularization terms (i.e., dynamic video synthesis and negative mining). As illustrated in Fig. 3, we take video v and anomaly definition z as inputs. The video is synthesized by a non-parametric dynamic video synthesis module. The anomaly definition is a category set z = {z_0, z_1, ..., z_{C−1}}, where each class z_i is defined by a class name or a description and C is the number of categories in a certain definition. During training, we randomly choose either the class names or the anomaly descriptions within a batch as the definition. Specifically, z_0 denotes the normal class. We extract and encode features of videos with F, which includes a pretrained CLIP image encoder [22] and a Transformer-based temporal encoder. The text features are extracted with the CLIP text encoder G. Then, the encoded features are fused by a Transformer-based fusion module U. Finally, the fused features are fed into a detection head H^bin to obtain the anomaly score y^bin ∈ R^{L×1} and a classification head H^mul to obtain the classification probability y^mul ∈ R^{L×C}, where L is the length of the video. 
Formally,\nwhere N, A are normal and abnormal video sets, Synthesis(·, ·) is the dynamic video synthesis module, y^p is the pseudo-label generated during synthesis, v^t, z^t are encoded features and v^u, z^u are fused features. In this paper, we address the challenge of concept drift in open-world scenarios to enhance model generalization. To this end, we build a model with intentionally simplified architecture.\nDuring training, we optimize the model through four losses under weak supervision. Following [9, 21, 25, 34, 35], we use multiple instance learning loss L_MIL to optimize detection. And we employ MIL-align loss L_MIL-align to optimize classification following [34, 35]. Our paradigm operates in multimodal joint spaces (P(v, z, y)) that inherently suffer from exponentially decaying sample density, thereby inducing overfitting problems. Specifically, the algorithm may establish a wrong mapping or suppress a certain modality. Therefore, we leverage more diverse videos via a dynamic video synthesis loss L_dvs to learn better mappings. We also incorporate a contrastive learning loss with hard negative mining L_neg to better align two modalities and mitigate the imbalance of two modalities. Formally,\n4.1. Dynamic Video Synthesis # In real-world scenarios, anomalies typically occupy only a small portion of a lengthy video, whereas current datasets predominantly contain videos with high anomaly ratios due to web-sourced data limitations. To mitigate this bias, we dynamically synthesize videos with varying durations and compute a loss based on the pseudo label generated during synthesis, which is inspired by Mosaic [3] and Cutout [11] augmentation. The module initially determines whether to generate a normal or abnormal video, followed by specifying the number of segments. 
It then selects an anchor video and randomly selects similar videos from k-nearest neighbors to construct a semantically coherent sequence, where the anchor\u0026rsquo;s position is transformed to a binary pseudo label y^p ∈ {0, 1}^L, where L denotes the feature length. Notably, the distance metrics required for retrieval are pre-computed, effectively reducing computational overhead during training. Finally, the dynamic video synthesis loss is calculated as:\nwhere σ denotes the Sigmoid function, ŷ denotes the video-level ground truth, Ω_k^a and Ω_k^n are indices of Top-K scores of synthetic abnormal and normal videos, respectively.\n4.2. Contrastive Loss with Hard Negative Mining # Given the ambiguous boundary between normal and abnormal frames in anomaly videos, we incorporate contrastive learning with hard negative mining as a regularization term to enhance their discriminability. Specifically, we first aggregate the frame-level visual features into video-level features with binary abnormal scores as weights:\nwhere v_i^t denotes the i-th feature in v^t, η denotes the temperature, ṽ^pos denotes the aggregated foreground feature and ṽ^neg denotes the aggregated background feature. The background feature in an abnormal video is the normal part of it, which could be considered as the hard negative to its corresponding anomaly description. Therefore, we obtain ṽ^pos of all samples and ṽ^neg of only abnormal samples in a batch, forming Ṽ ∈ R^{(B_1+B_2)×E}, where B_1 is the batch size, B_2 is the number of abnormal videos in a batch, and E is the feature dimension. We also obtain text features before fusing, forming Z̃ ∈ R^{B_2×E}. The contrastive loss is as follows:\nwhere Norm is L2 normalization and τ denotes the temperature.\nDuring inference, the user can input either descriptions or class names as the anomaly definition. 
For the classification head, we select the minimum value of the normal class and the maximum value of the abnormal class over the temporal axis and use these values after applying Softmax as probabilities. More details are provided in the supp (Sec. B).\n5. Dataset: PreVAD # As in Eq. (4), our new paradigm requires diverse (v, z, y) triples for training. Therefore, we propose PreVAD, a large-scale video anomaly detection dataset, which is collected through a scalable curation pipeline.\n| PreVAD (209.5 hours) | Train Abnormal | Train Normal | Val Abnormal | Val Normal |\n| --- | --- | --- | --- | --- |\n| # videos | 10673 | 22000 | 1306 | 1300 |\n| duration | 49.26h | 145.6h | 6.33h | 8.26h |\nFigure 4. The statistics of PreVAD.\n5.1. Data Curation Pipeline # Our pipeline encompasses three stages: source, cleansing, and annotation, leveraging multiple foundation models to ensure cost efficiency while maximizing data quality. We aggregate videos from three sources. We first leverage existing video-text datasets [16, 29, 36, 44] to retrieve anomaly videos through text-based video retrieval. Second, we expand the collection through curated web resources, including 1) accident compilations, self-defense tutorials, and fail videos; 2) driving vlogs and travel documentaries; 3) violence recognition datasets [7]. Last, we obtain normal surveillance videos from YouTube streams and traffic camera videos from government-released road camera streams.\nIn the cleansing stage, we first use PySceneDetect to remove irrelevant segments such as intros and outros. Next, a multimodal LLM (MLLM) generates detailed video descriptions, and a vision-language model (VLM) verifies the consistency between the descriptions and video content. 
Finally, an LLM evaluates the descriptions to confirm the presence of anomalies, decreasing false positives and ensuring high-quality data.\nThe annotation stage involves a weakly supervised labeling approach with hybrid human-AI annotation. We annotate each video with a video-level category and annotate each validation-set video with a frame-level label. Additionally, an MLLM generates fine-grained anomaly descriptions for each video under constrained prompts, which include the human-labeled categories and guide the MLLM to describe only the anomaly in the video. Notably, we do not additionally label a test set, as we will conduct zero-shot evaluations on other existing VAD datasets. More details can be found in the supp (Sec. C).\nFigure 5. Comparison between PreVAD and existing datasets.\nFigure 6. A sample of PreVAD, which includes multi-level category label and precise description of the anomaly.\n5.2. Dataset Statistics # Our dataset stands out for its larger scale, wider variety of anomaly types, and high-quality anomaly descriptions.\nScale. As in Fig. 4b, PreVAD comprises 35,279 videos, with 11,979 abnormal videos and 23,300 normal videos, partitioned into training and validation sets, making it the largest video anomaly dataset to date.\nAnomaly Types. The diversity of a dataset is related to both the granularity and breadth of its taxonomy. Our dataset features a hierarchical taxonomy comprising 7 first-level categories (i.e., Violence, Vehicle Accident, Fire-related Accident, Robbery, Daily Accident, Animal-related Violence, Production Accident) and 35 subcategories (e.g., carjacking, mugging, sport fail, war). Our taxonomy spans minor (e.g., fall to the ground) to severe anomalies (e.g., shooting), covering most of the common types of anomaly under surveillance. Moreover, the hierarchical design helps the model learn diverse definitions.\nAnomaly Descriptions. 
Each abnormal video is annotated with a text description, which has a total vocabulary size of 5,298 words and an average of 22.9 words per description. As shown in Figure 6, our annotation accurately describes the abnormal objects and behaviors in a fine-grained manner, which enables the model to learn better alignment.\nStatistics. The videos have an average duration of 21.38 seconds (as in Fig. 4a), with the majority of them clustering within 5 to 20 seconds.\nTable 2. Zero-shot comparison with other methods in video anomaly detection. We reproduced models marked with ⋆ using their open-source codes, and the rest of the data are taken from their publications. Results marked with † are from [40].\nMethods Training-set UCF XD MSAD UBN DoTA TAD LAD\nOVVAD [34] AIGC+XD 82.42 - - - - - -\nLaGoVAD PreVAD+XD 82.81 - - - - - -\nOVVAD AIGC+UCF - 63.74 - - - - -\nLaGoVAD PreVAD+UCF - 76.28 - - - - -\nCLIP† [22] - 53.16 17.83 - - - - -\nLLaVA1.5† [15] - 72.84 50.26 - - - - -\nLAVAD [40] - 80.28 62.0 - - - - -\nMIL [25] UCF - - 49.50 - - -\nCMRL [8] UCF - 46.74 - - - - -\nMultiDomain [9] Multiple 78.55 - - - - 79.2 77.36\nPEL⋆ [21] UCF - 43.53 79.82 54.02 53.05 86.27 69.99\nPEL⋆ XD 54.52 - 68.25 49.55 44.97 43.02 30.82\nVadCLIP⋆ [35] UCF - 58.29 88.09 56.24 50.93 74.4 74.29\nVadCLIP⋆ XD 80.16 - 88.48 57.41 49.00 83.5 74.46\nVadCLIP⋆ PreVAD 79.37 67.43 89.79 55.66 50.59 85.96 75.02\nLaGoVAD PreVAD 81.12 74.25 90.41 58.07 62.60 89.56 78.91\nAnd there are over 2000 videos that exceed one minute. Unlike UCF-Crime [25] and XD-Violence [32], PreVAD does not include untrimmed videos that last for tens of minutes or even several hours, as this would lack diversity and be detrimental to both training and validation. As shown in Fig. 4c, most of the videos are from existing video-text datasets or streaming, significantly reducing the overhead of manual clipping and retrieval. 
Unlike datasets assembled by merging existing VAD datasets, PreVAD obtains videos independently, enabling cross-dataset validation as a new generalization benchmark. Please refer to the supplementary material for more details (Sec. C).\n6. Experiments # 6.1. Experiment Setup # Datasets \u0026amp; Metrics. This study aims to enhance the open-world generalization of VAD models. We therefore conduct comprehensive evaluations across seven datasets: UCF-Crime (UCF) [25], XD-Violence (XD) [32], MSAD [45], UBNormal (UBN) [1], DoTA [38], TAD [17], and LAD [27], which encompass diverse anomaly types. The validation set of our proposed PreVAD is utilized for in-domain performance analysis and ablation studies. During zero-shot evaluation, we use manually designed prompts based on the class names of the corresponding dataset as the anomaly definition. For detection metrics, we follow prior work in using Average Precision (AP) for XD-Violence, while employing Area Under the Curve of the frame-level receiver operating characteristic (AUC) for other datasets. For classification metrics, there is currently no consensus on evaluating classification performance in video anomaly detection.\nTable 3. Zero-shot comparison with other methods in video anomaly classification. The results for other models are obtained with their open-source codes and weights. † denotes providing ground-truth segments when classifying abnormal videos.\n| Method | Training | UCF Acc. | UCF F1 | XD Acc. | XD F1 |\n| --- | --- | --- | --- | --- | --- |\n| CLIP [22] | - | 19.31 | 12.08 | 56.25 | 45.04 |\n| CLIP† | - | 20.34 | 11.10 | 64.25 | 54.61 |\n| ActionCLIP [28] | K400 | 18.62 | 16.12 | 38.75 | 37.11 |\n| ActionCLIP† | K400 | 19.31 | 13.85 | 41.37 | 38.58 |\n| ViFi-CLIP [23] | K400 | 20.34 | 15.67 | 53.75 | 50.33 |\n| VadCLIP [35] | UCF | - | - | 46.38 | 26.16 |\n| VadCLIP | XD | 38.28 | 10.52 | - | - |\n| VadCLIP | PreVAD | 45.52 | 17.81 | 71.38 | 57.99 |\n| LaGoVAD | PreVAD | 51.72 | 16.64 | 78.13 | 63.80 |\n[35] uses mAP, and [34] uses accuracy for abnormal videos, but they neglect the evaluation of classifying normal videos. 
Therefore, we measure using accuracy and F1-score on both abnormal and normal videos.\nBaselines. Given the absence of prior zero-shot evaluations on all seven datasets, we reproduce PEL [21] and VadCLIP [35] using publicly available weights and code for comparison. For classification, we further benchmark against zero-shot action recognition models [23, 28]. To ensure fairness in comparisons with CLIP-based methods, we use the identical ViT-B/16 variant.\nDetails of the implementation, evaluation datasets, and reproduced baselines are provided in the supp (Secs. B.2, D and E).\n6.2. Comparison with State-of-the-Arts # As demonstrated in Fig. 2 and Tabs. 2 and 3, our method achieves state-of-the-art performance in both zero-shot detection and classification. In Tab. 2, LaGoVAD outperforms existing approaches, including the LLM-based method [40], vision-language aligned methods [21, 35], and the multidomain generalization method [9] on detection. Notably, on XD-Violence, our approach achieves a 20% improvement over prior methods. For fair comparison with OVVAD [34], which leverages AIGC-augmented training, we construct a comparable set and still observe superior performance. In Tab. 3, empirical results demonstrate that our approach outperforms both the CLIP baseline and state-of-the-art action recognition models under identical CLIP variants, attributable to its capacity for selectively attending to abnormal patterns in videos. While sharing the same feature extractor and a similar two-branch architecture capable of detection and classification, our framework achieves comprehensive improvements over VadCLIP [35]. Notably, it exhibits a 27% gain in detection performance (measured by AP) and a 68% improvement in classification accuracy on XD-Violence.\nTable 4. Ablation on each component. guided refers to guiding the detection with language. Det. Avg. refers to the average zero-shot detection performance on seven datasets. Cls. Avg. refers to the average zero-shot classification performance on UCF-Crime and XD-Violence.\nL_dvs L_neg guided PreVAD Det. Avg. Cls. Avg.\n✓ ✓ ✓ 69.98 76.42 52.57\n✓ ✓ 65.73 73.51 51.73\n✓ ✓ 68.92 73.96 51.85\n✓ 67.35 71.31 48.81\n✓ ✓ 69.87 73.84 46.23\n6.3. Ablation Studies # Dataset Effectiveness. To quantify dataset impacts, we compare VadCLIP [35] trained on three datasets in Tabs. 2 and 3 via average metrics in detection and classification. Experimental results reveal that the model trained with PreVAD outperforms the one trained with UCF-Crime by 14% in detection (average metrics on six other datasets) and 88% in classification (average metrics on XD-Violence) while surpassing the one trained with XD-Violence by 7.6% in detection and 44% in classification, respectively. This substantial margin validates that a larger and more diverse dataset can significantly improve zero-shot performance.\nArchitecture Effectiveness. As demonstrated in Tabs. 2 and 3, when trained on identical datasets, our LaGoVAD framework achieves consistent improvements over VadCLIP, with gains of 7.2% in average detection performance on seven datasets and 2.8% in classification across two datasets. This confirms the superiority of our paradigm in open-world scenarios.\nModule Effectiveness. We conduct a series of ablations to validate each module of our model in Tab. 4. Removing either the dynamic video synthesis module or the contrastive learning with hard negative mining leads to a noticeable degradation in detection and classification performance. When both are removed, the model exhibits a significant decline in zero-shot performance. We also validate the effect of language guidance by removing it in our experiments. When disabling the language guidance, we follow the approaches in [34, 35] and place the fusion module after the detection stage, which does not condition detection results on the given text. 
Experiments show that without language guidance, in-domain performance drops only slightly, while cross-domain performance decreases significantly. This indicates that the conventional paradigm lacks the capacity to incorporate user-defined guidance for detection, thereby limiting its adaptability to open-world scenarios.\nFigure 7. Visualization of predicted anomaly scores for two cases where concept drift occurs in open-world scenarios, compared with VadCLIP [35] and PEL [21] trained on UCF-Crime [25]. Both videos are from the UCF-Crime dataset, where (a) is labeled as abnormal and (b) is labeled as normal. (b) An example of a normal event turning into an abnormal event.\n6.4. Qualitative Results # Figure 7 visualizes the prediction results of PEL [21], VadCLIP [35] and our proposed LaGoVAD under concept drift. As illustrated in Fig. 7a, the video depicts an incident of cattle theft, which is considered abnormal in the UCF-Crime dataset. When users exclude such categories (e.g., in traffic or violence detection scenarios), our method can dynamically adapt by redefining anomalies. In Fig. 7b, the video shows a dog knocking over a trashcan, labeled as normal in UCF-Crime. However, when users explicitly define this behavior as an anomaly, our method adjusts its detection logic accordingly. In contrast, the two other methods are not able to address concept drift under different user requirements. These results validate the practicability and effectiveness of our paradigm in open-world video anomaly detection. More ablations and visualizations are in the supplementary material (Sec. F).\n7. Conclusion # In this work, we propose a novel paradigm, language-guided open-world video anomaly detection, to deal with concept drift in the open-world scenario. It assumes that the definition of anomaly is dynamic and models it as a stochastic variable input to the network. 
To support training this model, we build a large-scale video anomaly dataset that is annotated by multi-level taxonomy and anomaly descriptions. We empirically verify the effectiveness of the proposed framework through state-of-the-art zero-shot performance and sufficient ablations on seven datasets.\nReferences # [1] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ubnormal: New benchmark for supervised open-set video anomaly detection. In CVPR, pages 20111–20121, New Orleans, USA, 2022. IEEE. 2 , 3 , 4 , 7 [2] Abhishek Aich, Kuan-Chuan Peng, and Amit K. RoyChowdhury. Cross-domain video anomaly detection without target domain adaptation. In WACV, pages 2578–2590. IEEE, 2023. 1 , 2 , 3 [3] Alexey Bochkovskiy, Chien-Yao Wang, and HongYuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arxiv preprint, abs/2004.10934, 2020. 5 [4] Congqi Cao, Yue Lu, Peng Wang, and Yanning Zhang. A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation. In CVPR, pages 20392– 20401, Vancouver, Canada, 2023. IEEE. 2 , 3 [5] S. Chandrakala, K. Deepak, and G. Revathy. Anomaly detection in surveillance videos: a thematic taxonomy of deep models, review and performance analysis. Artif. Intell. Rev. , 56(4):3319–3368, 2023. 1 [6] Weiling Chen, Keng Teck Ma, Zi Jian Yew, Minhoe Hur, and David Aik-Aun Khoo. TEVAD: improved video anomaly detection with captions. In CVPRW, pages 5549–5559, Vancouver, CA, 2023. IEEE. 1 , 3 [7] Ming Cheng, Kunjing Cai, and Ming Li. RWF-2000: an open large scale video database for violence detection. In ICPR, pages 4183–4190. IEEE, 2020. 6 [8] MyeongAh Cho, Minjung Kim, Sangwon Hwang, Chaewon Park, Kyungjae Lee, and Sangyoun Lee. Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning. In CVPR, pages 12137– 12146, Vancouver, CA, 2023. IEEE. 
2 , 3 , 7 [9] MyeongAh Cho, Taeoh Kim, Minho Shim, Dongyoon Wee, and Sangyoun Lee. Towards multi-domain learning for generalizable video anomaly detection. In NIPS, Vancouver, CA, 2024. 2 , 3 , 5 , 7 [10] Bruno Degardin and Hugo Proenc¸a. Human activity analysis: Iterative weak/self-supervised learning frameworks for detecting abnormal events. In IEEE IJCB, pages 1–7, Houston, USA, 2020. IEEE. 3 [11] Terrance Devries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arxiv preprint, abs/1708.04552, 2017. 5 [12] Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiang- ming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jianhang Chen, Qimei Cui, and Xiaofeng Tao. Uncovering what, why and how: A comprehensive benchmark for causation understanding of video anomaly. In CVPR, pages 18793–18803. IEEE, 2024. 2 , 3 [13] Yashika Jain, Ali Dabouei, and Min Xu. Cross-domain learning for video anomaly detection with limited supervision. In ECCV, pages 468–484. Springer, 2024. 1 , 3\n[14] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. CLIP-TSA: clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In ICIP, pages 3230–3234, Kuala Lumpur, Malaysia, 2023. IEEE. 1\n[15] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. CVPR , pages 26286–26296, 2023. 7\n[16] Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, and Jinhui Tang. VALOR: vision-audio-language omni-perception pretraining model and dataset. IEEE Trans. Pattern Anal. Mach. Intell., 47(2): 708–724, 2025. 6\n[17] Hui Lv, Chuanwei Zhou, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Localizing anomalies from weakly-labeled videos. TIP, 30:4505–4515, 2021. 2 , 3 , 7\n[18] Jose G. Moreno-Torres, Troy Raeder, Roc ´ ´ıo Ala ´ ızRodr ´ ´ıguez, Nitesh V. Chawla, and Francisco Herrera. 
A unifying view on dataset shift in classification. Pattern Recognit., 45(1):521–530, 2012. 2 , 3\n[19] Pradeep Narwade, Ryosuke Kawamura, Gaurav Gajbhiye, and Koichiro Niinuma. Synthetic video generation for weakly supervised cross-domain video anomaly detection. In ICPR, pages 375–391. Springer, 2024. 3\n[20] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton van den Hengel. Deep learning for anomaly detection: A review. arxiv preprint, abs/2007.02500, 2020. 1\n[21] Yujiang Pu, Xiaoyu Wu, Lulu Yang, and Shengjin Wang. Learning prompt-enhanced context features for weaklysupervised video anomaly detection. TIP, 33:4923–4936, 2024. 1 , 5 , 7 , 8\n[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, Virtual Event, 2021. PMLR. 4 , 7\n[23] Hanoona Abdul Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman H. Khan, and Fahad Shahbaz Khan. Fine-tuned CLIP models are efficient video learners. In CVPR, pages 6545–6554, Vancouver, CA, 2023. IEEE. 7\n[24] Sorina Smeureanu, Radu Tudor Ionescu, Marius Popescu, and Bogdan Alexe. Deep appearance features for abnormal behavior detection in video. In Int. Conf. Image Anal. Proc. , pages 779–789. Springer, 2017. 1\n[25] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In CVPR, pages 6479–6488. Computer Vision Foundation / IEEE Computer Society, 2018. 1 , 2 , 3 , 5 , 7 , 8\n[26] Jiaqi Tang, Hao LU, RUIZHENG WU, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and Yingcong Chen. Hawk: Learning to understand open-world video anomalies. In NIPS, pages 139751–139785, 2024. 2 , 3\n[27] Boyang Wan, Wenhui Jiang, Yuming Fang, Zhiyuan Luo, and Guanqun Ding. Anomaly detection in video sequences: A benchmark and computational model. 
IET Image Process. , 15(14):3454–3465, 2021. 2 , 3 , 7\n[28] Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition. arxiv preprint , abs/2109.08472, 2021. 7\n[29] Xin Eric Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-fang Wang, and William Yang Wang. Vatex: A large-scale, highquality multilingual dataset for video-and-language research. In ICCV, pages 4580–4590. IEEE/CVF, 2019. 6\n[30] Zhiqiang Wang, Xiaojing Gu, Huaicheng Yan, and Xingsheng Gu. Domain generalization for video anomaly detection considering diverse anomaly types. Signal Image Video Process., 18(4):3691–3704, 2024. 1 , 4\n[31] Peng Wu, Jing Liu, and Fang Shen. A deep one-class neural network for anomalous event detection in complex scenes. IEEE Trans. Neural Networks Learn. Syst., 31(7): 2609–2622, 2020. 1\n[32] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In ECCV, pages 322–339. Springer, 2020. 2 , 3 , 7\n[33] Peng Wu, Chengyu Pan, Yuting Yan, Guansong Pang, Peng Wang, and Yanning Zhang. Deep learning for video anomaly detection: A review. arxiv preprint, abs/2409.05383, 2024. 1\n[34] Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open-vocabulary video anomaly detection. In CVPR, pages 18297–18307, Seattle, USA, 2024. IEEE. 1 , 3 , 4 , 5 , 7 , 8\n[35] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In AAAI, pages 6074–6082, Vancouver, Canada, 2024. AAAI Press. 1 , 3 , 5 , 7 , 8\n[36] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, pages 5288–5296. IEEE, 2016. 6\n[37] Zhiwei Yang, Jing Liu, and Peng Wu. 
Text prompt with normality guidance for weakly supervised video anomaly detection. In CVPR, pages 18899–18908, Seattle, USA, 2024. IEEE. 1 , 3\n[38] Yu Yao, Xizi Wang, Mingze Xu, Zelin Pu, Yuchen Wang, Ella Atkins, and David Crandall. Dota: unsupervised detection of traffic anomaly in driving videos. PAMI, 2022. 2 , 3 , 7\n[39] Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin, and Zhenzhen Jiao. Towards surveillance video-and-language understanding: New dataset, baselines, and challenges. In CVPR, pages 22052–22061, Seattle, USA, 2024. IEEE. 2 , 3\n[40] Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly detection. In CVPR , pages 18527–18536, Seattle, USA, 2024. IEEE. 7\n[41] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, and Nong Sang. Holmes-vad: Towards unbiased and explainable video anomaly detection via multi-modal LLM. arXiv preprint, abs/2406.12235, 2024. 2 , 3\n[42] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H. Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In CVPR, pages 1237–1246, Long Beach, USA, 2019. Computer Vision Foundation / IEEE. 3\n[43] Lingru Zhou, Yiqi Gao, Manqing Zhang, Peng Wu, Peng Wang, and Yanning Zhang. Human-centric behavior description in videos: New benchmark and model. IEEE Trans. Multim., 26:10867–10878, 2024. 2 , 3\n[44] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Hongfa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Caiwan Zhang, Zhifeng Li, Wei Liu, and Li Yuan. Languagebind: Extending video-language pretraining to nmodality by language-based semantic alignment. In ICLR . OpenReview.net, 2024. 6\n[45] Liyun Zhu, Lei Wang, Arjun Raj, Tom Gedeon, and Chen Chen. Advancing video anomaly detection: A concise review and a new dataset. In NeurIPS, pages 89943–89977. 
Curran Associates, Inc., 2024. 1 , 2 , 3 , 7\n[46] Yuansheng Zhu, Wentao Bao, and Qi Yu. Towards open set video anomaly detection. In ECCV, pages 395–412, Tel Aviv,Israel, 2022. Springer. 1 , 3 , 4\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/language-guided-open-world-vad/","section":"Papers","summary":"Proposes a novel open-world VAD paradigm guided by natural language, with a dynamic anomaly definition, regularization strategies, and a large-scale dataset (PreVAD) with multi-level annotations and descriptions. Achieves state-of-the-art zero-shot performance on seven datasets.","title":"Language-guided Open-world Video Anomaly Detection","type":"application"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/le-wang/","section":"Authors","summary":"","title":"Le Wang","type":"authors"},{"content":" Learning Suspected Anomalies from Event Prompts for Video Anomaly Detection # Chenchen Tao ∗ , Xiaohao Peng ∗ , Chong Wang R , Member, IEEE, Jiafei Wu, Puning Zhao, Jun Wang, Jiangbo Qian\nAbstract—Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. However, the ambiguous nature of anomaly definitions across contexts may introduce inaccuracy in discriminating abnormal and normal events. To show the model what is anomalous, a novel framework is proposed to guide the learning of suspected anomalies from event prompts. Given a textual prompt dictionary of potential anomaly events and the captions generated from anomaly videos, the semantic anomaly similarity between them could be calculated to identify the suspected events for each video snippet. It enables a new multi-prompt learning process to constrain the visual-semantic features across all videos, as well as provides a new way to label pseudo anomalies for self-training. 
To demonstrate its effectiveness, comprehensive experiments and detailed ablation studies are conducted on four datasets, namely XD-Violence, UCF-Crime, TAD, and ShanghaiTech. Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC (86.5%, 90.4%, 94.4%, and 97.4%). Furthermore, it shows promising performance in openset and cross-dataset cases. The data, code, and models can be found at: https://github.com/shiwoaz/lap.\nIndex Terms—Weakly Supervised, Video Anomaly Detection, Event Prompt, Multiple Instanse Learning.\nI. INTRODUCTION # V IDEO anomaly detection (VAD) [1], [2], [3] is crucial in video surveillance, given the extensive use of surveillance cameras. The task of VAD is to determine whether each frame in a video is normal or abnormal, which poses a significant challenge as it is not feasible to train a model with complete supervision. Consequently, weakly supervised learning methods (WS-VAD) [4], [5], [6] that solely rely on video-level annotations have gained importance and popularity in recent years.\nThe general paradigm of these methods involves utilizing convolutional networks such as 3D ConvNet (C3D) [7], inflated 3D ConvNet (I3D) [8], or vision transformer [9] to extract visual features and aggregate spatio-temporal information between consecutive frames. Subsequently, an anomalydetection network is trained using multiple instance learning (MIL) [10]. This approach simultaneously maximizes and minimizes the top-k highest scores from individual anomaly and normal videos, respectively. Most methods [10], [11] only focused on the visual-related modality, while some [12], [13] have incorporated semantic descriptions into videos. However,\n∗ These authors contributed equally to this work and should be considered co-first authors.\nR Corresponding Author: Chong Wang.\nFig. 1. The difference between the traditional multiple instance learning methods (upper) and our model (lower). 
The former one only learns the anomalies using top-k scores in each abnormal video, while the latter utilizes a prompt dictionary to provide extra guidance across different videos.\nsuch semantic information was simply fused with the visual one, instead of delving into the underlying meaning of the textual descriptions. As a result, the MIL based approaches often suffer from a relatively high false alarm rate (FAR) and low accuracy in detecting ambiguous abnormal events.\nMeanwhile, foundation models in natural language processing (NLP) and computer vision (CV), such as InstructGPT [14] and CLIP [15], have demonstrated impressive performance on multimodal tasks. Additionally, prompting techniques in the image field provide a new way to transfer semantic information from well-trained foundation models into vision tasks. It is intriguing to explore whether CLIP\u0026rsquo;s zero-shot detection ability can be effectively transferred into video anomaly detection.\nTherefore, a novel framework to Learn suspected Anomalies from event Prompts, called LAP, is proposed in this paper. As illustrated in Figure 1, a prompt dictionary is designed to list the potential anomaly events. In order to mark suspected anomalies, it is utilized to compare with the captions generated from anomaly videos in the form of semantic features. As a result, an anomaly vector that records the most suspected anomalous events for each video snippet can be obtained. This vector is used to guide a new multi-prompt learning scheme across different videos, as well as form a new set of pseudo anomaly labels.\nThe main contributions of this work are threefold:\nThe new textual prompts describing the abnormal events are introduced into weakly supervised video anomaly detection. Giving the explanation of what is anomalous, the score predictor can implicitly learn more details about\nthe anomalies. 
It leads to incredible performance on openset and cross-database problems.\nA new multi-prompt learning strategy is proposed to provide an overall understanding of normal and abnormal patterns across different videos, while MIL is limited to individual videos.\nAdditional pseudo labels are excavated from the anomaly videos according to the semantic similarity between the event prompts and videos. They are utilized to train the predictor effectively in a self-supervised manner.\nII. RELATED WORKS # A. Weakly Supervised Video Anomaly Detection # The weakly supervised methods tackle frame-level anomaly detection by video-level annotations. Most of them are based primarily on multiple instance learning (MIL) due to limited annotated labels [10]. However, conventional MIL faces challenges in providing sufficient supervision for various anomalies, leading to misclassifications and a high false alarm rate. To address these issues, Yu et al. propose cross-epoch learning (XEL) [4], which stores hard instances from previous epochs to optimize the anomaly predictor in the latest epoch. Additionally, dual memory units with uncertainty regulation (URDMU) [16] extend the anomaly memory unit into learnable dual memory units to alleviate the high false alarm rate issue. Another approach called robust temporal feature magnitude learning (RTFM) [11] trains a feature magnitude learning function to effectively recognize positive instances. All these methods are based on single or multiple visual modalities, including RGB and optical flow.\nAs the field embraces multi-modality models like GPT [14] and CLIP [15], researchers are now focusing on text-visual models. Text-empowered video anomaly detection (TEVAD) [12] demonstrates improvements by generating text and visual features independently. However, TEVAD treats text features as auxiliary to visual features. 
In contrast, our approach aims to capitalize on the high semantic-level guidance provided by text, offering a unique perspective for enhancing anomaly detection performance.\nB. Prompt Tuning for Visual Tasks # In the realm of pre-trained foundation multimodality models, a cost-effective prompt tuning approach is gaining traction for adapting models to downstream tasks in the domains of natural language processing (NLP) [17], [18] and computer vision [19].\nThe concept of prompt tuning originated in computer vision to tackle zero-shot or few-shot image tasks by incorporating semantic guidance. Multimodal models such as CLIP [15] leverage textual prompts for image classification, demonstrating state-of-the-art performance. In the video domain, Sato [20] explores prompt tuning for zero-shot anomaly action recognition, using skeleton features and text embeddings in a shared space to refine decision boundaries. Wang et al. introduce prompt learning for action recognition (PLAR) [21], which incorporates optical flow and learnable prompts to acquire input-invariant knowledge from a prompt expert dictionary and input-specific knowledge based on the data.\nA previous effort, the prompt-based feature mapping framework (PFMF) [22], applies prompt-based learning to semisupervised video anomaly detection. PFMF generates anomaly prompts by concatenating anomaly vectors from virtual datasets and scene vectors from real datasets, guiding the feature mapping network. However, the prompt in PFMF defines anomalies at the visual level, introducing ambiguity. In our work, we propose textual anomaly prompts based on prior knowledge to mine fine-grained anomalies to achieve high performance.\nSeveral recent studies have introduced additional information or tasks to maximize the CLIP\u0026rsquo;s effectiveness in WSVAD. CLIP-assisted temporal self-attention (CLIP-TSA) [23] incorporates temporal information into CLIP features using a self-attention mechanism. 
In contrast, VadCLIP [13] delves deeper into aligning textual category labels with CLIP\u0026rsquo;s visual features to enhance its WS-VAD performance. Unlike VadCLIP, which constructs learnable prompts based on class labels, our approach designs event prompts to describe specific anomaly-related situations, eliminating the need for additional supervised information or learning tasks.\nIII. METHODOLOGY # The proposed LAP framework, as shown in Figure 2, is built upon the basic VAD structure consisting of a visual feature extractor and a score predictor. To enhance discrimination between normal and abnormal videos, semantic clues from anomaly events are integrated using a prompt dictionary and an additional semantic feature extractor. This integration introduces three key processes: feature synthesis, multi-prompt learning, and pseudo anomaly labeling. Semantic features are extracted from videos and fused with visual features, enriching the overall representation. Simultaneously, anomaly prompts, describing abnormal events, are employed to generate another set of semantic features. An anomaly similarity matrix is then computed between these two semantic feature sets. This matrix identifies the most anomalous features corresponding to each prompt in the dictionary. This batch-level anomaly vector not only facilitates a new multi-prompt learning procedure but also acts as a set of snippet-level pseudo labels. The subsequent subsections delve into the specifics of these procedures.\nA. Feature Synthesis # Following the protocol of WS-VAD [24], we adopt a training approach using pairwise normal and abnormal data. Each training batch comprises an abnormal bag and a normal bag, consisting of b abnormal and normal videos with labels y a = 1 ∈ R b×1 and y n = 0 ∈ R b×1 , respectively. In this setup, every video is divided into L snippets, each containing 16 consecutive frames. Consequently, the total number of snippets in each bag is N = b × L. 
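The batch layout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' preprocessing: the helper names `split_into_snippets` and `build_bag`, the toy video lengths, and the choice to spread snippet start positions evenly over each video are all assumptions, since the text only fixes the snippet length (16 frames) and the count L.

```python
def split_into_snippets(num_frames, num_snippets, snippet_len=16):
    # Return num_snippets lists of snippet_len consecutive frame indices.
    # Start positions are spread evenly over the video (an assumption).
    if num_frames < snippet_len:
        raise ValueError('video is shorter than one snippet')
    last_start = num_frames - snippet_len
    if num_snippets == 1:
        starts = [0]
    else:
        starts = [round(i * last_start / (num_snippets - 1))
                  for i in range(num_snippets)]
    return [list(range(s, s + snippet_len)) for s in starts]


def build_bag(video_lengths, num_snippets):
    # Flatten the snippets of b videos into one bag of N = b * L snippets.
    bag = []
    for video_id, num_frames in enumerate(video_lengths):
        for frames in split_into_snippets(num_frames, num_snippets):
            bag.append((video_id, frames))
    return bag


# Toy abnormal bag: b = 3 videos, L = 4 snippets per video.
abnormal_bag = build_bag([400, 1000, 250], num_snippets=4)
print(len(abnormal_bag))  # N = b * L
```

With b = 3 and L = 4 the bag holds N = 12 snippets, each a run of 16 consecutive frame indices tagged with the video it came from.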
All of these snippets are then processed by the visual and semantic feature extractors.\nTo clarify, we acquire the visual features V^a ∈ R^{N×d_v} and V^n ∈ R^{N×d_v} from video snippets in the abnormal and normal bags, utilizing the visual encoder of a CLIP model [15]. Given that many VAD videos, primarily from surveillance, often lack associated text descriptions, we leverage a pre-trained visual-to-text encoder from SwinBERT [25], following TEVAD [12], to generate descriptions for each video snippet. These textual descriptions then undergo processing by the semantic feature extractor (SimCSE [26]), producing corresponding semantic features T^a ∈ R^{N×d_t} and T^n ∈ R^{N×d_t} for abnormal and normal video snippets. With extracted visual features and semantic features in hand, we feed these features into a multi-scale temporal network (MTN) to obtain both local and global temporal fused features.\nFig. 2. The overview of the proposed LAP framework. Synthetic features, as input to score predictors, are generated through the visual and semantic feature extractors. A prompt dictionary is used to produce the anomaly matrix and vector, which is employed to perform multi-prompt learning (MPL) and pseudo anomaly labeling (PAL) across different videos.\nIntuitively, a combination of visual and semantic features is employed to synthesize new features F^a ∈ R^{N×d_f} and F^n ∈ R^{N×d_f}, aiming for an enhanced feature representation,\nwhere θ symbolizes a feature alignment and fusion operation. It can be either a concatenation or addition. Subsequently, the anomaly scores s^a and s^n can be calculated by applying F^a and F^n to a score predictor. Typically, this predictor takes the form of a multi-layer perceptron (MLP) [27], expressed as:\nB. Multi-Prompt Learning # In recent WS-VAD, the prevalent approach for training the anomaly score predictor involves a multiple instance learning (MIL) framework [12], [28]. 
This framework selects the top-k highest anomaly scores from each video, whether abnormal or normal, and employs their average ŷ as the predicted value for the respective video. Given the complete score set s = [s^a; s^n] ∈ R^{2N×1},\nwhere max_k(s) denotes the operator that selects the k largest values from vector s, k is usually from 2 to 5, and i and j indicate the snippet and video indices, respectively. Then the MIL loss L_MIL is formulated as,\nFrom Equation 5, it can be seen that the top-k strategy focuses only on a few snippets with the highest scores within an individual video. Moreover, the top anomaly scores in an abnormal video may not be from an abnormal snippet. Therefore, a new textual prompt dictionary consisting of P anomaly prompts is designed to link abnormal video snippets from different videos. Unlike the category annotations used in VadCLIP [13], we expand the single-word annotations into complete anomaly sentences, like \u0026ldquo;someone is doing something to whom\u0026rdquo; or \u0026ldquo;something is what\u0026rdquo;. These sentences can better describe the events/actions related to a certain anomaly category. Then, the prompt dictionary is constructed as a set of these anomaly sentences, e.g. \u0026ldquo;A man is shooting a gun\u0026rdquo; or \u0026ldquo;Something is on fire\u0026rdquo;. As depicted in Figure 2, these prompts undergo the same semantic extraction process (SimCSE [26]) as the earlier video captions, generating their respective semantic features M ∈ R^{P×d_t}. Subsequently, we calculate the similarity between each prompt in the dictionary and every snippet in T^a to construct an anomaly matrix Ψ ∈ R^{N×P} as,\nin which || · || denotes the ℓ2-norm. The consideration of T^n is dismissed here since there are no abnormal snippets in the normal bag.\nFig. 3. Visualization of the proposed anomaly matrix Ψ^⊤. It is truncated due to the limited column width.\n
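The construction of the anomaly matrix Ψ can be sketched as follows. The displayed formula is not present in this text, so the cosine-similarity form used here is an assumption inferred from the mention of the ℓ2-norm; the function names and the two-dimensional toy features are hypothetical and stand in for the SimCSE features T^a and M.

```python
import math


def cosine_similarity(u, v):
    # Cosine similarity of two equal-length vectors (0.0 if either is all zeros).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def anomaly_matrix(snippet_feats, prompt_feats):
    # N x P matrix: entry [i][j] compares snippet i with prompt j.
    return [[cosine_similarity(t, m) for m in prompt_feats]
            for t in snippet_feats]


# Toy semantic features: N = 3 snippets from the abnormal bag, P = 2 prompts.
T_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
M = [[1.0, 0.0], [0.0, 1.0]]
psi = anomaly_matrix(T_a, M)
for row in psi:
    print([round(x, 3) for x in row])
```

Each row of `psi` belongs to one snippet and each column to one prompt, so a large entry marks a snippet whose caption is semantically close to that anomaly description.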
In essence, each element in Ψ provides insights into the probable type of anomaly associated with each snippet or indicates where a predetermined abnormal event might occur. Figure 3 offers a visual representation of Ψ, where frames containing abnormal events exhibit more pronounced colors. Notably, there is a discernible alignment between the frames and prompts.\nIn order to exploit the anomaly features across different videos, the most likely anomalous event of each snippet, i.e. the highest value in each row of Ψ, is picked to construct a new anomaly vector c ∈ R^{N×1}.\nTo leverage these potential anomaly samples, we introduce a novel multi-prompt learning strategy. Based on the predicted score s and the anomaly vector c, all features in F^n and F^a are categorized into three sets: anchor set, positive set, and negative set. Subsequently, their averages are computed, denoted as f_anc, f_pos, and f_neg. It\u0026rsquo;s important to note that f_anc and f_pos model the normal features in normal and abnormal videos, respectively, and can be expressed as,\nwhere argmin_P(s) denotes the operator to obtain the indices of the P lowest values in vector s, and F^n[i, :] and F^a[i, :] are the i-th rows of F^n and F^a, respectively, each of which is a synthetic feature vector representing a certain video snippet. In contrast, the negative set is built by choosing the most anomalous samples in anomaly videos, according to the similarity values in c. Thus the feature f_neg can be formulated as,\nwhere argmax_P(c) denotes the operator to obtain the indices of the P largest values in vector c.\nBased on these representative features, it is possible to provide an overall understanding of normal and abnormal patterns across different videos. Thus, the multi-prompt learning loss L_MPL is defined in the form of a triplet loss,\nwhere α represents the margin coefficient. 
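The selection of anchor, positive and negative features can be sketched as below. The loss formula itself is not present in this text, so the hinge form used here is a standard triplet loss chosen as an assumption; the function names, the `count` parameter (playing the role of P in the argmin and argmax operators) and the toy numbers are hypothetical.

```python
import math


def row_max(psi):
    # Anomaly vector c: the highest prompt similarity of each snippet.
    return [max(row) for row in psi]


def mean_of_rows(feats, indices):
    # Element-wise average of the selected feature rows.
    dim = len(feats[0])
    return [sum(feats[i][d] for i in indices) / len(indices) for d in range(dim)]


def lowest(values, count):
    # Indices of the count smallest entries.
    return sorted(range(len(values)), key=lambda i: values[i])[:count]


def highest(values, count):
    # Indices of the count largest entries.
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:count]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mpl_loss(F_n, F_a, s_n, s_a, psi, count, margin):
    # Assumed triplet hinge: pull anchor and positive together,
    # push the negative at least `margin` further away.
    c = row_max(psi)
    f_anc = mean_of_rows(F_n, lowest(s_n, count))  # lowest scores, normal bag
    f_pos = mean_of_rows(F_a, lowest(s_a, count))  # lowest scores, abnormal bag
    f_neg = mean_of_rows(F_a, highest(c, count))   # highest similarity in c
    return max(0.0, euclidean(f_anc, f_pos) - euclidean(f_anc, f_neg) + margin)


# Toy batch with three snippets per bag and two prompts.
F_n = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2]]
s_n = [0.1, 0.2, 0.3]
F_a = [[0.1, 0.1], [3.0, 3.0], [0.0, 0.1]]
s_a = [0.2, 0.9, 0.1]
psi = [[0.1, 0.2], [0.8, 0.3], [0.0, 0.1]]
print(mpl_loss(F_n, F_a, s_n, s_a, psi, count=1, margin=1.0))
```

In this toy batch the negative feature already lies far from the anchor, so the loss vanishes for a margin of 1.0 and becomes positive only once the margin exceeds the gap between the two distances.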
The goal of L_MPL is to establish a considerable distance between f_neg and both f_anc and f_pos while simultaneously bringing f_anc and f_pos closer together. This feature-level examination implicitly impacts the training of the score predictor, given that the selection of f_anc and f_pos is based on s.\nC. Pseudo Anomaly Labeling # In addition to constructing the negative set in MPL, the anomaly vector c serves as a metric for pseudo-labels, enabling the extraction of more latent information in the anomaly bag T a. Specifically, the snippet-level pseudo-anomaly label p is determined by a dynamic threshold within the current batch,\nwhere p[i] and c[i] are the i-th elements of p and c, mean{c} and std{c} are the mean and standard deviation of the anomaly vector c, and τ is a hyper-parameter. 
Then, the anomaly score predictor can be trained in a fully supervised manner, through a pseudo anomaly loss L_PAL,\nBy incorporating prior knowledge into the pseudo label, the PAL module can better distinguish fine-grained anomalies and generate more accurate detection results across abnormal videos.\nTABLE I PERFORMANCE COMPARISON OF STATE-OF-THE-ART METHODS ON XD-VIOLENCE (AP%) AND UCF-CRIME (AUC%). BOLD AND UNDERLINE INDICATE THE BEST AND SECOND-BEST RESULTS.\n| Type | Source | Method | Feat. | XD AP | UCF AUCall | UCF AUCabn |\n| --- | --- | --- | --- | --- | --- | --- |\n| Semi | CVPR 16’ | Conv-AE [29] | AE | 30.7 | - | - |\n| Semi | CVPR 22’ | GCL [30] | CNN | - | 71.0 | - |\n| Weakly | ICCV 21’ | RTFM [11] | CLIP | 78.3 | 85.7 | 63.9 |\n| Weakly | AAAI 22’ | MSL [31] | ViT | 78.6 | 85.6 | - |\n| Weakly | ECCV 22’ | CSL-TAL [32] | I3D | 71.7 | - | - |\n| Weakly | CVPR 22’ | BN-SVP [33] | I3D | - | 83.4 | - |\n| Weakly | CSVT 23’ | Yang [34] | I3D | 77.7 | 81.5 | - |\n| Weakly | AAAI 23’ | UR-DMU [16] | CLIP | 82.4 | 86.7 | 68.6 |\n| Weakly | CVPR 23’ | ECUPL [35] | I3D | 81.4 | 86.2 | - |\n| Weakly | CVPR 23’ | CMRL [36] | I3D | 81.3 | 86.1 | - |\n| Weakly | CVPR 23’ | TEVAD [12] | I3D | 79.8 | 84.9 | - |\n| Weakly | AAAI 23’ | MGFN [37] | ViT | 80.1 | - | - |\n| Weakly | CVPR 23’ | UMIL [28] | XCLIP | - | 86.7 | 68.7 |\n| Weakly | ICIP 23’ | CLIP-TSA [23] | CLIP | 82.2 | 87.6 | - |\n| Weakly | CVPR 24’ | Wu et al. [38] | CLIP | 66.5 | 86.4 | - |\n| Weakly | AAAI 24’ | VadCLIP [13] | CLIP | 84.5 | 88.0 | 70.2 |\n| Weakly | ours | LAP | CLIP (SwinBert) CLIP (CA) | 86.5 | 88.9 | 73.0 |\nThe final training loss L_LAP can be denoted as,\nwhere β and γ are hyper-parameters utilized in our model. Importantly, the MPL and PAL modules are trained collaboratively. During the inference stage, the test samples only traverse the feature extractors and the predictor to acquire abnormal scores, so the MPL and PAL modules incur no additional computational cost.\nD. Inference Process # The inference process is identical to that of the baseline model, i.e. TEVAD [12], which is the left part of Figure 2 without the prompt dictionary. We initially extract visual and text features, which are processed through the feature alignment and fusion operation. Then, the fused features are fed into the anomaly predictor to calculate the anomaly score for each video snippet.\nIV. EXPERIMENTS # In this section, the performance of our LAP model is evaluated on four datasets, namely XD-Violence [39], UCF-Crime [10], TAD [40] and ShanghaiTech [41]. The area under the precision-recall curve, also known as the average precision (AP), is employed as the evaluation metric for XD-Violence following the protocol in [35]. For UCF-Crime, TAD and ShanghaiTech, the area under the curve (AUC) of the frame-level receiver operating characteristic (ROC) is used instead. 
Specifically, AUCall represents the AUC for all testing videos, while AUCabn focuses only on abnormal videos in the test set. The false alarm rates for all videos (FARall) and abnormal videos (FARabn) are also reported in our ablation studies.\nTABLE II PERFORMANCE COMPARISON OF STATE-OF-THE-ART METHODS ON TAD (AUC%) AND SHANGHAITECH (AUC%). BOLD AND UNDERLINE INDICATE THE BEST AND SECOND-BEST RESULTS.\n| Type | Source | Method | Feat. | TAD AUC | ST AUC |\n| --- | --- | --- | --- | --- | --- |\n| Semi | ICCV 17’ | Luo et al. [43] | - | 57.9 | - |\n| Semi | CVPR 18’ | Liu et al. [44] | - | 69.1 | 72.8 |\n| Weakly | CVPR 21’ | MIST [45] | UNet | 89.2 | 94.8 |\n| Weakly | ICCV 21’ | RTFM [11] | I3D | 89.6 | 97.2 |\n| Weakly | TIP 21’ | WSAL [40] | I3D | 89.6 | - |\n| Weakly | CVPR 23’ | ECUPL [35] | I3D | 91.6 | - |\n| Weakly | CVPR 23’ | CMRL [36] | I3D | - | 97.6 |\n| Weakly | CVPR 23’ | TEVAD [12] | CLIP | 92.3 | 97.3 |\n| Weakly | CVPR 23’ | UMIL [28] | XCLIP | 92.9 | 96.8 |\n| Weakly | ours | LAP | CLIP | 94.4 | 97.4 |\nA. Datasets # XD-Violence [39] is a multi-scene public dataset for VAD. It has a total duration of 217 hours and includes 4,754 untrimmed videos. The training set contains 3,954 videos while the test set comprises 800 videos. XD-Violence covers various types of unusual events including abuse incidents, car accidents, explosions, fights, riots, and shootings. The UCF-Crime dataset [10] is a large-scale collection of 1,900 videos captured by surveillance cameras in various indoor and outdoor scenarios. This dataset consists of 1,610 labeled training videos and 290 labeled test videos with a total duration of 128 hours. The dataset covers 13 types of anomalous events such as abuse, robbery, shootings and arson. TAD [40] is a dataset for anomaly detection in traffic scenes, consisting of 400 training videos and 100 test videos, with a total of 25 hours of video footage. It covers seven types of real-world anomalies. ShanghaiTech consists of surveillance videos from different scenes on a campus [42]. The training set contains 237 videos while the testing set has 200 videos.\nB. Implementation Details # The dimension of the visual features d_v extracted by CLIP (ViT-L/14) [15] is 768, while the dimension of semantic features d_t is also 768. The prompt dictionary capacity P is set to 30 for the UCF-Crime, XD-Violence and TAD datasets, and 25 for ShanghaiTech. The batch size b is set to 64 for the TAD dataset, and it is halved to 32 on the other three datasets. The number of snippets per video L is set to 64 for all datasets. 
The feature operation θ is set as a) concatenation for UCF-Crime and b) addition for the other three datasets. The hyper-parameters α = 1, β = 0.1, γ = 0.001 and τ = 1 are consistent across all datasets. The Adam optimizer is utilized with a learning rate of 0.001 and weight decay of 0.005 during the training process.\nFig. 4. Qualitative comparisons of TEVAD [12] and our method on both UCF-Crime (UCF) and XD-Violence (XD). The ground truth of anomalous events is represented by light red regions.\nC. Comparison Results # Quantitative analysis. The comparisons between our LAP model and other state-of-the-art (SOTA) WS-VAD models on the XD-Violence and UCF-Crime datasets are presented in Table I, and Table II shows the comparisons on the TAD and ShanghaiTech datasets. It can be seen that the proposed model outperforms almost all the other methods on all datasets.\nSpecifically, our model achieves the highest AP of 86.5% on the XD-Violence dataset, outperforming the second-best method VadCLIP [13], which also combines RGB and text data, by 2.0%. Unlike the single-word descriptions, i.e. class labels, used in VadCLIP, the event-level descriptions in our LAP model can provide much richer information, leading to a better understanding of the anomalies. Another CLIP-based method (CLIP-TSA [23]) leverages the visual features from CLIP (ViT/B), while a transformer is employed to enhance its features. However, due to the lack of semantic guidance, its AP falls 4.3% below that of the proposed LAP. The performance of the other compared methods is also limited by the absence of efficient anomaly definitions.\nIn the UCF-Crime dataset, our LAP model achieves an AUCall of 88.9%, surpassing the most recent methods by at least 0.9%, including VadCLIP [13] (88.0%), CLIP-TSA [23] (87.6%) and UMIL [28] (86.7%). Notably, the learnable prompts of VadCLIP are based on class labels, which leads to a video-level anomaly matching. 
In contrast, our concise descriptions of basic suspected anomaly events can effectively match the segment-level features. This difference results in a relatively high AUCabn for our LAP (73.0%) compared to VadCLIP (70.2%). It is important to note that if we use more accurate text descriptions [46] of each snippet, we can achieve a higher AUCall of 90.4% and an AUCabn of 76.1%. The results indicate that our model can effectively utilize the textual prompts of abnormal events for accurate detection.\nTable II shows the comparisons on the other two less challenging datasets. The AUC of our approach (94.4%) is consistently higher than that of all compared SOTA methods [35], [40], [28] by a margin of 1.5% to 4.8% on the TAD dataset. Our method achieves the second-highest AUC (97.4%) on ShanghaiTech, which is only 0.2% lower than the best method, CMRL [36]. Note that if we switch our visual extractor from CLIP to I3D [8], the same as CMRL, the AUC is boosted to 98.0% (0.4% higher than CMRL). It indicates that the ShanghaiTech dataset is relatively less complex, so I3D [8] is good enough. The details will be discussed in Section IV-E. For fair comparison, we reimplement TEVAD [12] with visual features from CLIP. The AUC of our LAP exceeds that of TEVAD by 2.1% on the TAD dataset and 0.1% on the ShanghaiTech dataset. This minor performance gain on ShanghaiTech is due to the anomalies on campus being actually common activities such as riding, skating, and driving on the road, which are quite different from our suspected anomaly descriptions such as fighting, firing, or clashing.\nOverall, these results highlight the superior performance of our LAP model compared to state-of-the-art methods on all four datasets in terms of both AP and AUC metrics. For fair comparison, the UMIL [28], TEVAD [12], CLIP-TSA [23], VadCLIP [13], and the work by Wu et al. [38] are based on the same feature extractor (CLIP) as our LAP.\nFig. 5. 
The distribution of matched suspected anomalies in the UCF-Crime (upper) and TAD (lower) datasets.\nQualitative analysis. To further demonstrate the effectiveness of our method, the qualitative comparisons between our approach (LAP) and the SOTA method TEVAD [12] are visualized in Figure 4. The normal and abnormal frames of videos from the UCF-Crime and XD-Violence datasets are presented along with their corresponding frame-level anomaly scores, where green and red dashed rectangles indicate normal and abnormal ones, respectively. As shown in the figures, our method not only outperforms TEVAD [12] in terms of anomaly detection ability but also reduces false alarms on normal parts.\nOur prompt dictionary contains event descriptions for various conditions. For the UCF-Crime social dataset and the TAD traffic dataset, Figure 5 illustrates the distributions of matched suspected anomalies in the anomaly vector c, showcasing how our prompt dictionary operates. Since the majority of anomalies in UCF-Crime are linked to human behavior like fights, robbery and violence, the predominant subject is \u0026ldquo;person\u0026rdquo; and the most frequent activities include falling and using weapons. In contrast, the Traffic Anomaly Dataset (TAD) consists of anomalies caused by traffic accidents. As expected, \u0026ldquo;car\u0026rdquo; is the main subject, and the most common activities involve smoking and crashes. It indicates the effectiveness of our proposed event prompts.\nD. Ablation Studies # Components. The proposed prompt-related components, i.e. feature synthesis (FS), multi-prompt learning (MPL) and pseudo anomaly labeling (PAL), are the keys to our superior performance in VAD. The ablation results of these three components on three datasets are shown in Table III. The baseline module is a visual-only branch MIL-based network with a CLIP feature extractor [15]. 
By incorporating the text branch for feature synthesis, our model achieves 1.2%, 1.8% and 0.4% AUCall improvements on UCF-Crime, XD-Violence and TAD, respectively, which shows the effectiveness of semantic information. The MPL module can also improve the AUCall for all datasets by 0.3%, 0.4% and 0.1%, while the AP on XD-Violence and AUCabn on UCF-Crime are also boosted by 0.9% and 0.3%. Further incorporating the PAL module yields better results. It outperforms the baseline by 1.9% in AUCall and 6.0% in AUCabn on UCF-Crime, 0.7% in AUCall on TAD, as well as 2.4% in AUCall and 5.2% in AP on XD-Violence.\nTABLE III ABLATION STUDY OF PROPOSED MODULES. THE DEFAULT SETTINGS OF ALL EXPERIMENTS ARE MARKED IN GRAY COLOR.\n| Baseline | FS | MPL | PAL | UCF-Crime AUCall | UCF-Crime AUCabn | XD-Violence AUCall | XD-Violence AP | TAD AUCall |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| ✓ | | | | 87.0 | 67.0 | 93.2 | 81.3 | 93.7 |\n| ✓ | ✓ | | | 88.2 | 70.4 | 95.0 | 84.1 | 94.1 |\n| ✓ | ✓ | ✓ | | 88.5 | 70.7 | 95.4 | 85.0 | 94.2 |\n| ✓ | ✓ | ✓ | ✓ | 88.9 | 73.0 | 95.6 | 86.5 | 94.4 |\nTABLE IV COMPARISONS OF THE AUC (%) FOR OPEN-SET VAD ON UCF-CRIME. THE NUMBERS IN PARENTHESES AFTER EACH CATEGORY ARE THE NUMBER OF VIDEOS.\n| Open Category | No | Explosion (21) | RoadAccidents (23) | Shoplifting (21) |\n| --- | --- | --- | --- | --- |\n| RTFM [11] | 84.3 | 83.6 (-0.7) | 82.1 (-2.2) | 83.4 (-0.9) |\n| MLAD [47] | 85.4 | 84.3 (-1.1) | 83.2 (-2.2) | 84.5 (-0.9) |\n| TEVAD [12] | 84.9 | 83.7 (-1.2) | 81.0 (-3.9) | 83.1 (-1.8) |\n| Ours | 88.9 | 88.1 (-0.8) | 87.0 (-1.9) | 88.4 (-0.5) |\nPrompt format. The prompt format plays an important role in the proposed LAP model. Thus, two different formats are tested in this experiment. One is organized by anomaly phrases, such as \u0026ldquo;falling down\u0026rdquo; or \u0026ldquo;on fire\u0026rdquo;. The other contains complete anomaly sentences, like \u0026ldquo;someone is doing something to whom\u0026rdquo; or \u0026ldquo;something is what\u0026rdquo;. As shown in Figure 6a, the sentence-based prompt dictionary outperforms the phrase-based one by 3.1% on XD-Violence and 0.7% on UCF-Crime, respectively. 
It suggests that prompts containing richer information are more helpful in identifying suspected anomalies.\nPseudo anomaly threshold. As required by the PAL module, pseudo labels are determined according to the threshold Gh. The dynamic threshold given in Eq. 13, which is based on the distribution of the data in the current batch, is used in previous experiments. The hyper-parameter τ determines the number of pseudo anomalies. When it is set to 0.5, 1.0 and 2.0, the AUC results on UCF-Crime are 88.05%, 88.90%, and 88.21%, respectively. A static threshold strategy is also compared in this test, where Gh is set to 0.5 as prior knowledge. As shown in Figure 6b, the dynamic threshold is better than the static one, with AP and AUC that are 2.1% and 0.5% higher on the XD-Violence and UCF-Crime datasets, respectively.\nTABLE V CROSS-DATASET EXPERIMENTAL RESULTS ON UCF-CRIME (UCF) AND XD-VIOLENCE (XD) BENCHMARKS.\n| Method | Source UCF, Target UCF (AUC %) | Source XD, Target UCF (AUC %) | Source XD, Target XD (AP %) | Source UCF, Target XD (AP %) |\n| --- | --- | --- | --- | --- |\n| RTFM [11] | 84.3 | 68.6 (-15.7) | 76.6 | 37.3 (-39.3) |\n| CMRL [36] | 86.1 | 69.9 (-16.2) | 81.3 | 46.7 (-34.6) |\n| Ours | 88.9 | 83.5 (-5.4) | 86.5 | 60.9 (-25.6) |\nFig. 6. The ablation studies of the prompt format and pseudo anomaly threshold.\nE. Discussions # Class-wise AUC. To demonstrate the detailed performance on specific abnormal events, the class-wise AUC of our model is compared with RTFM [11] in Figure 7. It shows that the proposed LAP model outperforms RTFM in most categories, especially on \u0026ldquo;Assault\u0026rdquo;, \u0026ldquo;Explosion\u0026rdquo;, \u0026ldquo;RoadAccidents\u0026rdquo; and \u0026ldquo;Robbery\u0026rdquo;. This can be attributed to the effective use of our prompt dictionary to describe those anomalies, including representative texts such as \u0026ldquo;fire\u0026rdquo;, \u0026ldquo;knife\u0026rdquo; or \u0026ldquo;accident\u0026rdquo;. Combined with the MPL module, their synthetic features are more likely to be identified as abnormal ones. 
However, our model may be less effective in some cases if the action is subtle or difficult to describe, such as \u0026ldquo;Shoplifting\u0026rdquo; and \u0026ldquo;Fighting\u0026rdquo; in Figure 7.\nOpen set VAD. In practical applications, it is impossible to collect or define all possible anomalies in advance. Hence, it is crucial to examine the robustness of anomaly detection models when confronted with open abnormal categories in real-world scenarios. Following the protocol of open set VAD in MLAD [47], experiments are conducted on the top 3 largest anomaly categories from the UCF-Crime dataset, namely \u0026ldquo;Explosion\u0026rdquo;, \u0026ldquo;RoadAccidents\u0026rdquo; and \u0026ldquo;Shoplifting\u0026rdquo;. These categories are sequentially removed from the training set and treated as real open abnormal events. The comparisons with three SOTA models are presented in Table IV. The proposed LAP model outperforms RTFM [11], MLAD [47] and TEVAD [12] in all three categories. It is worth noting that our method achieves minimal decreases in AUC values when compared to alternative approaches. This indicates that our method is more effective in handling open abnormal event issues.\nTABLE VI PERFORMANCE OF MPL AND PAL EMBEDDED RTFM [11] ON SHANGHAITECH.\n| Method (ST) | AUCall | AUCabn | FARall | FARabn |\n| --- | --- | --- | --- | --- |\n| RTFM | 97.2 | 64.3 | 0.06 | 0.86 |\n| RTFM+MPL | 97.6 | 72.2 | 0.06 | 0.71 |\n| RTFM+PAL | 97.5 | 73.9 | 0.03 | 0.44 |\n| RTFM+Both | 98.0 | 75.6 | 0.04 | 0.58 |\nFig. 7. Comparison of class-wise AUC (%) on UCF-Crime dataset with RTFM [11].\nCross-dataset performance. The categories of anomalies vary across different VAD datasets. For instance, the abnormal events in the UCF-Crime dataset are collected from surveillance videos, which are quite different from the abnormal categories in XD-Violence developed from movies and online videos. Thus, it becomes a challenging transfer learning task if the model is trained and tested on different datasets. 
However, this is exactly what happens in real-world anomaly detection applications. To evaluate the generalization and zero-shot abilities of our proposed method, another set of experiments using different sources of training and inference videos is conducted. Compared with RTFM [11] and CMRL [36], our model explains the definition of anomalies with their descriptions using the prompt dictionary and multi-prompt learning scheme. Such a new paradigm of utilizing the semantic information leads to an extraordinary cross-dataset performance as shown in Table V. The performance degradation of the proposed model is only one-third of that of RTFM and CMRL when it is trained on XD-Violence and tested on UCF-Crime. It indicates that our method is much less sensitive to variations in the data domain, which is important for practical applications.\nPlug and play. To further explore the potential of our method, the proposed MPL and PAL modules are embedded into the representative WS-VAD work RTFM [11] on ShanghaiTech. For a fair comparison, I3D [8], instead of CLIP [15], is used as the feature extractor, while all experimental settings are kept the same as in the RTFM paper. As shown in Table VI, the reimplemented frameworks generally exhibit better performance. By incorporating either MPL or PAL alone, enhancements of 0.4% and 0.3% can be achieved on AUCall, whereas more significant enhancements of 7.9% and 9.6% can be observed on AUCabn. The efficacy of MPL and PAL is demonstrated by their ability to improve the performance of the conventional WS-VAD framework. Through the collaboration of MPL and PAL, the LAP-integrated RTFM demonstrates superior AUC and reduced FAR compared to its original version, showing significant enhancements (0.8%, 0.02%) on the ShanghaiTech dataset. It is worth noting that the reimplemented RTFM model (98.0%) could even surpass the latest SOTA model CMRL [36] by 0.4%. 
It indicates that more frameworks may benefit from our prompt-related modules, which are plug-and-play.\nV. CONCLUSION # In this study, we presented the LAP model, a straightforward yet effective method for WS-VAD. Specifically, the synthesized visual-semantic features have been employed for better feature representation. The multi-prompt learning strategy has shown its capability to guide the learning of suspected anomalies with a prompt dictionary. Additionally, the pseudo anomaly labels generated by the anomaly similarity between the prompts and video captions are useful to enhance the VAD performance. Extensive experiments have demonstrated the effectiveness of our model. We hope that our work will inspire further exploration of defining and learning anomalies from natural language.\nREFERENCES # [1] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X.-S. Hua, \u0026ldquo;Spatiotemporal autoencoder for video anomaly detection,\u0026rdquo; in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1933–1941.\n[2] G. Yu, S. Wang, Z. Cai, E. Zhu, C. Xu, J. Yin, and M. Kloft, \u0026ldquo;Cloze test helps: Effective video anomaly detection via learning to complete video events,\u0026rdquo; in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 583–591.\n[3] C. Tao, C. Wang, S. Lin, S. Cai, D. Li, and J. Qian, \u0026ldquo;Feature reconstruction with disruption for unsupervised video anomaly detection,\u0026rdquo; IEEE Transactions on Multimedia, 2024.\n[4] S. Yu, C. Wang, Q. Mao, Y. Li, and J. Wu, \u0026ldquo;Cross-epoch learning for weakly supervised anomaly detection in surveillance videos,\u0026rdquo; IEEE Signal Processing Letters, vol. 28, pp. 2137–2141, 2021.\n[5] H. Shi, L. Wang, S. Zhou, G. Hua, and W. Tang, \u0026ldquo;Abnormal ratios guided multi-phase self-training for weakly-supervised video anomaly detection,\u0026rdquo; IEEE Transactions on Multimedia, 2023.\n[6] J. Yu, B. Zhang, Q. Li, H. Chen, and Z. 
Teng, \u0026ldquo;Hierarchical reasoning network with contrastive learning for few-shot human-object interaction recognition,\u0026rdquo; in Proceedings of the 31st ACM International Conference on Multimedia, ser. MM \u0026rsquo;23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 4260–4268. [Online]. Available: https://doi.org/10.1145/3581783.3612311\n[7] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, \u0026ldquo;Learning spatiotemporal features with 3d convolutional networks,\u0026rdquo; in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.\n[8] J. Carreira and A. Zisserman, \u0026ldquo;Quo vadis, action recognition? a new model and the kinetics dataset,\u0026rdquo; in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.\n[9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., \u0026ldquo;An image is worth 16x16 words: Transformers for image recognition at scale,\u0026rdquo; arXiv preprint arXiv:2010.11929, 2020.\n[10] W. Sultani, C. Chen, and M. Shah, \u0026ldquo;Real-world anomaly detection in surveillance videos,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6479–6488.\n[11] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, \u0026ldquo;Weakly-supervised video anomaly detection with robust temporal feature magnitude learning,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 4975–4986.\n[12] W. Chen, K. T. Ma, Z. J. Yew, M. Hur, and D. A.-A. Khoo, \u0026ldquo;Tevad: Improved video anomaly detection with captions,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5548–5558.\n[13] P. 
Wu et al., \u0026ldquo;Vadclip: Adapting vision-language models for weakly supervised video anomaly detection,\u0026rdquo; AAAI, 2024.\n[14] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., \u0026ldquo;Training language models to follow instructions with human feedback,\u0026rdquo; Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022.\n[15] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in International conference on machine learning. PMLR, 2021, pp. 8748–8763.\n[16] H. Zhou, J. Yu, and W. Yang, \u0026ldquo;Dual memory units with uncertainty regulation for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 3769–3777.\n[17] X. L. Li and P. Liang, \u0026ldquo;Prefix-tuning: Optimizing continuous prompts for generation,\u0026rdquo; arXiv preprint arXiv:2101.00190, 2021.\n[18] B. Lester, R. Al-Rfou, and N. Constant, \u0026ldquo;The power of scale for parameter-efficient prompt tuning,\u0026rdquo; arXiv preprint arXiv:2104.08691, 2021.\n[19] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, \u0026ldquo;Visual prompt tuning,\u0026rdquo; in European Conference on Computer Vision. Springer, 2022, pp. 709–727.\n[20] F. Sato, R. Hachiuma, and T. Sekii, \u0026ldquo;Prompt-guided zero-shot anomaly action recognition using pretrained deep skeleton features,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6471–6480.\n[21] X. Wang, R. Xian, T. Guan, and D. Manocha, \u0026ldquo;Prompt learning for action recognition,\u0026rdquo; arXiv preprint arXiv:2305.12437, 2023.\n[22] Z. Liu, X.-M. Wu, D. Zheng, K.-Y. Lin, and W.-S. 
Zheng, \u0026ldquo;Generating anomalies for video anomaly detection with prompt-based feature mapping,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24 500–24 510.\n[23] Z. Joo et al., \u0026ldquo;Clip-tsa: Clip-assisted temporal self-attention for weakly supervised video anomaly detection,\u0026rdquo; ICIP, 2023.\n[24] C. Cao, X. Zhang, S. Zhang, P. Wang, and Y. Zhang, \u0026ldquo;Weakly supervised video anomaly detection based on cross-batch clustering guidance,\u0026rdquo; in 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2023, pp. 2723–2728.\n[25] K. Lin, L. Li, C.-C. Lin, F. Ahmed, Z. Gan, Z. Liu, Y. Lu, and L. Wang, \u0026ldquo;Swinbert: End-to-end transformers with sparse attention for video captioning,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 949–17 958.\n[26] T. Gao, X. Yao, and D. Chen, \u0026ldquo;Simcse: Simple contrastive learning of sentence embeddings,\u0026rdquo; arXiv preprint arXiv:2104.08821, 2021.\n[27] M.-C. Popescu, V. E. Balas, L. Perescu-Popescu, and N. Mastorakis, \u0026ldquo;Multilayer perceptron and neural networks,\u0026rdquo; WSEAS Transactions on Circuits and Systems, vol. 8, no. 7, pp. 579–588, 2009.\n[28] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, \u0026ldquo;Unbiased multiple instance learning for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8022–8031.\n[29] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, \u0026ldquo;Learning temporal regularity in video sequences,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 733–742.\n[30] M. Z. Zaheer, A. Mahmood, M. H. Khan, M. Segu, F. Yu, and S.-I. 
Lee, \u0026ldquo;Generative cooperative learning for unsupervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 14 744–14 754.\n[31] S. Li, F. Liu, and L. Jiao, \u0026ldquo;Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1395–1403.\n[32] A. Panariello, A. Porrello, S. Calderara, and R. Cucchiara, \u0026ldquo;Consistency-based self-supervised learning for temporal anomaly localization,\u0026rdquo; in European Conference on Computer Vision. Springer, 2022, pp. 338–349.\n[33] H. Sapkota and Q. Yu, \u0026ldquo;Bayesian nonparametric submodular video partition for robust anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3212–3221.\n[34] Z. Yang, Y. Guo, J. Wang, D. Huang, X. Bao, and Y. Wang, \u0026ldquo;Towards video anomaly detection in the real world: A binarization embedded weakly-supervised network,\u0026rdquo; IEEE Transactions on Circuits and Systems for Video Technology, 2023.\n[35] C. Zhang, G. Li, Y. Qi, S. Wang, L. Qing, Q. Huang, and M.-H. Yang, \u0026ldquo;Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16 271–16 280.\n[36] M. Cho, M. Kim, S. Hwang, C. Park, K. Lee, and S. Lee, \u0026ldquo;Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 137–12 146.\n[37] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. 
Wu, \u0026ldquo;Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection,\u0026rdquo; in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 387–395.\n[38] P. Wu, X. Zhou, G. Pang, Y. Sun, J. Liu, P. Wang, and Y. Zhang, \u0026ldquo;Open-vocabulary video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2311.07042, 2023.\n[39] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, \u0026ldquo;Not only look, but also listen: Learning multimodal violence detection under weak supervision,\u0026rdquo; in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer, 2020, pp. 322–339.\n[40] H. Lv, C. Zhou, Z. Cui, C. Xu, Y. Li, and J. Yang, \u0026ldquo;Localizing anomalies from weakly-labeled videos,\u0026rdquo; IEEE transactions on image processing, vol. 30, pp. 4505–4515, 2021.\n[41] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, \u0026ldquo;Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1237–1246.\n[42] Y. Zhang, H. Lu, L. Zhang, X. Ruan, and S. Sakai, \u0026ldquo;Video anomaly detection based on locality sensitive hashing filters,\u0026rdquo; Pattern Recognition, vol. 59, pp. 302–311, 2016.\n[43] W. Luo, W. Liu, and S. Gao, \u0026ldquo;Remembering history with convolutional lstm for anomaly detection,\u0026rdquo; in 2017 IEEE International conference on multimedia and expo (ICME). IEEE, 2017, pp. 439–444.\n[44] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection–a new baseline,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6536–6545.\n[45] J.-C. Feng, F.-T. Hong, and W.-S. 
Zheng, \u0026ldquo;Mist: Multiple instance self-training framework for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14 009–14 018.\n[46] T. Yuan, X. Zhang, K. Liu, B. Liu, C. Chen, J. Jin, and Z. Jiao, \u0026ldquo;Towards surveillance video-and-language understanding: New dataset baselines and challenges,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 052–22 061.\n[47] C. Zhang, G. Li, Q. Xu, X. Zhang, L. Su, and Q. Huang, \u0026ldquo;Weakly supervised anomaly detection in videos considering the openness of events,\u0026rdquo; IEEE transactions on intelligent transportation systems, vol. 23, no. 11, pp. 21 687–21 699, 2022.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/learning-suspected-anomalies-from-event-prompts/","section":"Papers","summary":"Proposes a novel framework named LAP that leverages textual event prompts and semantic similarity for weakly supervised video anomaly detection. 
It introduces a multi-prompt learning process, pseudo anomaly labeling, and integrates semantic features derived from a prompt dictionary to guide the detection model, resulting in improved performance across multiple datasets.","title":"Learning Suspected Anomalies from Event Prompts for Video Anomaly Detection","type":"other"},{"content":" Learning to Understand Open-World Video Anomalies # Jiaqi Tang 1 , 2 , 3∗ Hao Lu 1 , 2∗ Ruizheng Wu 4 Xiaogang Xu 5 , 6 Ke Ma 7 Cheng Fang 7 Bin Guo 7 Jiangbo Lu 3 , 4 Qifeng Chen 2 Ying-Cong Chen 1 , 2 , 3†\n1 The Hong Kong University of Science and Technology (Guangzhou)\n2 The Hong Kong University of Science and Technology 3 HKUST(GZ) – SmartMore Joint Lab 4 SmartMore Corporation 5 The Chinese University of Hong Kong 6 Zhejiang University 7 Northwestern Polytechnical University {jtang092, hlu585}@connect.hkust-gz.edu.cn {ruizheng.wu, jiangbo}@smartmore.com xiaogangxu00@gmail.com {2544552413,sura}@mail.nwpu.edu.cn guob@nwpu.edu.cn cqf@ust.hk yingcongchen@hkust-gz.edu.cn\nAbstract # Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs. However, current VAD systems are often limited by their superficial semantic understanding of scenes and minimal user interaction. Additionally, the prevalent data scarcity in existing datasets restricts their applicability in open-world scenarios. In this paper, we introduce HAWK, a novel framework that leverages interactive large Visual Language Models (VLM) to interpret video anomalies precisely. Recognizing the difference in motion information between abnormal and normal videos, HAWK explicitly integrates motion modality to enhance anomaly identification. To reinforce motion attention, we construct an auxiliary consistency loss within the motion and video space, guiding the video branch to focus on the motion modality. 
Moreover, to improve the interpretation of motion-to-language, we establish a clear supervisory relationship between motion and its linguistic representation. Furthermore, we have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users\u0026rsquo; open-world questions. The final results demonstrate that HAWK achieves SOTA performance, surpassing existing baselines in both video description generation and question-answering. Our codes/dataset/demo will be released at https://github.com/jqtangust/hawk .\n1 Introduction # \u0026ldquo;Have eyes like a HAWK!\u0026rdquo; – Longman Dictionary\nIn recent years, the deployment of Video Anomaly Detection (VAD) systems has seen a significant uptick across a diverse array of domains, including but not limited to autonomous driving [42, 22], surveillance [5, 20], and crime scene analysis [30]. The inherent capability of these systems to autonomously monitor and identify disturbances within a scene has markedly diminished the reliance on manual labor, thereby streamlining operational efficiency and reducing associated costs.\n∗ Equal contribution. † Corresponding author.\nPreprint. Under review.\nFigure 1: Different frameworks in video anomaly detection. (A) shows traditional video anomaly detection methods, which use binary classifiers to detect anomalies. (B), following (A), introduces a multi-class classifier for integrating semantic information, allowing users to obtain different types of anomaly information. Neither (A) nor (B) can interact with users. (C) is a previous video understanding framework that can interactively provide richer semantic information for users, but cannot specifically locate video anomalies. 
Our framework (D) enhances the anomaly understanding capability and provides annotated labels with rich semantic information.\nDespite the extensive focus on anomaly detection in most existing VAD systems [20, 41, 30, 28, 7, 10, 16, 31, 37, 45, 49] (as shown in Fig. 1 (A)), there is often a lack of deeper semantic understanding of the scenes and insufficient interaction with users. While Pu et al. [28] and Wu et al. [39] incorporated semantic information for video anomaly detection, their frameworks are limited to multi-class classifiers (as displayed in Fig. 1 (B)). Consequently, the functionality of these systems is confined to the detection of anomalous frames, and users must still manually analyze the detected anomalies to understand them comprehensively. Although Lv et al. [24] have pioneered the development of a large language model for video anomaly explanation, their approach primarily relies on pseudo labels for training. The lack of robust training data severely constrains its practical applicability. Besides, such a method focuses more on acquiring long-range context information than on anomaly-related features for anomaly understanding (as exhibited in Fig. 1 (C)).\nTo solve the above challenges, we propose an interactive large visual-language model [18, 15, 26], HAWK, for precisely understanding video anomalies (as illustrated in Fig. 1 (D)). Considering that the motion in normal and abnormal videos is significantly different [41, 49], we explicitly integrate the motion modality through a dual-branch framework in HAWK to enhance the understanding of anomalies (Section 4.1). Besides, to reinforce motion attention, we construct an auxiliary consistency loss based on the mutual information between the original video (appearance feature) and its motion in a tight space (Section 4.2), to implicitly guide the video branch to focus on motion-related features. However, the interpretation of motion to the corresponding language remains unclear.
Therefore, we extract the motion-related language (verbs and their entities) from the original description to directly supervise the visual and linguistic representations of motion, improving the accuracy with which HAWK interprets video anomalies (Section 4.3).\nFurthermore, we also collect seven video anomaly datasets from various scenarios and generate language descriptions for each video. Besides, to address the open-ended questions raised by users, we utilize language descriptions of the videos to generate potential question-answer pairs for training. Because these datasets cover a range of scenarios (Section 3), including crime (UCF-Crime [30]), campus environments (ShanghaiTech [19] and CUHK Avenue [20]), pedestrian walkways (UCSD Ped1 [5] and Ped2 [34]), traffic situations (DoTA [42]), and human behavior (UBnormal [2]), the model tends to generalize to open-world scenarios.\nTo train our framework, we initially pre-train it on WebVid [3] to equip it with the capability to understand general videos. Then, we fine-tune it on our proposed video anomaly dataset to enhance its understanding of video anomalies across multiple scenarios. Compared to other baselines, our model achieves SOTA performance in both Text-Level and GPT-Guided Metrics. Our contributions are summarized as follows:\nWe propose a novel video-language framework, HAWK, aiming at understanding video anomalies, which incorporates motion modality to enhance its capability.\nWe generate rich language descriptions for seven different video anomaly datasets. Meanwhile, considering the diversity of open-world problems, we also generate question-answer pairs to tackle potential user inquiries.\nCompared to other large video models, our framework demonstrates SOTA performance for video anomaly understanding and question-answering across multiple scenarios, which will help open-world anomaly understanding in the future.
2 Related Work # Video Anomaly Detection Video Anomaly Detection (VAD) usually focuses on identifying unexpected events in video and has been widely applied in various fields, including autonomous driving [42], public surveillance [5, 20], and crime scene analysis [30]. Previous VAD methods [24, 30, 20, 41, 7, 10, 16, 31, 37, 45, 49] follow a variety of approaches. Lu et al. [20] learn video features only from normal videos, leveraging hand-crafted or deep-learning-based features. Sultani et al. [30] proposed multiple instance learning (MIL), which is the main paradigm for many weakly-supervised learning methods. Recently, Lv et al. [24] first proposed video-based large language models in the framework of VAD.\nHowever, these methods lack sufficient semantic comprehension of scenes and offer inadequate user interaction. Several approaches [28, 39] have introduced multi-class classifiers to integrate semantic information with various types of anomaly information. Nevertheless, their output is still limited. In contrast, our framework not only integrates more comprehensive semantic information as a general video understanding system but also provides advanced interaction capabilities for users.\nLarge Model in Video Understanding Recent studies have demonstrated the reliable capabilities of large models in video understanding. Beyond powerful vision-language models [13, 48, 18, 21], recent research has increasingly explored more modalities [24, 15, 25, 43, 23]. Bain et al. [3] introduced a large-scale dataset with general video content descriptions. Several LLM-based works [15, 25, 43, 23] aim to comprehend visual content. Additionally, Video-LLaMA [46] extends comprehension to both auditory and visual information, while Su et al. [29] utilize multi-modal encoders to understand across six modalities.
Recently, Lv et al. [24] proposed video-based large language models for VAD tasks in a weakly supervised framework. In this paper, we introduce the motion modality in our proposed vision-language model, which enhances the model\u0026rsquo;s ability to locate anomalies by prioritizing relevant video content.\n3 Data Engineering # Previous datasets are inadequate for addressing our problem. Most existing VAD datasets, such as UBnormal [2] and DoTA [42], only contain simple video category labels and lack detailed language descriptions. This results in video understanding models lacking accurate and comprehensive supervision, creating a significant obstacle to identifying anomalies in videos. Recently, Lv et al. [24] attempted to create pseudo language descriptions for anomaly videos. However, these descriptions are naive combinations of labels and fixed text, relying on a rigid format that offers only limited information. Other datasets, like WebVid [3], include only general descriptions of video content, which may not direct the model\u0026rsquo;s focus to anomalies.\nOur Principle To tackle the above problems, we annotate detailed language descriptions specifically for anomaly scenes in seven different existing \u0026lt;VIDEO\u0026gt; datasets. These seven datasets include a variety of anomalous scenarios such as crime (UCF-Crime [30]), campus (ShanghaiTech [19] and CUHK Avenue [20]), pedestrian walkways (UCSD Ped1 [5] and Ped2 [34]), traffic (DoTA [42]), and human behavior (UBnormal [2]). With the support of these visual scenarios, we can perform comprehensive fine-tuning for various abnormal scenarios, bringing the model closer to open-world scenarios.\nFigure 2: Generation pipeline of our dataset. In the first line, we first segment videos into clips and generate dense captions for each segment, including a comprehensive description of the video content.
Then, we use GPT-4 to guide the generation of corresponding anomalous video descriptions based on these descriptions, which are then manually checked to reduce mistakes. In the second line, to generate user-centered QA pairs, we first use GPT-4 to generate open-ended questions based on the proposed two principles. Then, the questions and video descriptions are jointly input into GPT-4 to provide possible answers.\nMoreover, to better account for real-world user situations, we believe that language descriptions should not only include descriptions of the video anomalies themselves, but also address open questions asked by users. Therefore, we construct open-ended question-answer pairs for each scenario to further enhance the model\u0026rsquo;s practical ability to answer users\u0026rsquo; varying questions. The procedure for answering users\u0026rsquo; questions is shown in Fig. 2. The data format can be described by Eq. (1),\n\u0026lt;VIDEO\u0026gt;: {DIS: \u0026lt;DESCRIPTION\u0026gt; | QA: \u0026lt;QUESTION\u0026gt; → \u0026lt;ANSWERING\u0026gt;}. (1)\nAnomaly Video Description Generation To construct natural language descriptions \u0026lt;DESCRIPTION\u0026gt; for anomalous video datasets, we refer to previous research such as LLaVA [18] and VideoChat [15], and employ GPT-4 [1] as an assistant. We first split the video into dense clips to ensure key information is captured. Following VideoChat [15], we use perception tools (InternVideo [35], Tag2Text [11], or GRiT [36]) to automatically generate captions for each key clip, obtaining a dense representation of the videos (except for the UCF-Crime dataset, which already has a dense representation built in [44]). Next, we use GPT-4 [1] to generate anomaly-related descriptions based on the captions for each video. Unlike other general video understanding datasets [18, 15], we provide prompts for GPT-4 to generate specific descriptions closely related to video anomalies.
Finally, due to the varying quality of dense captions, some videos may have incorrect annotations. Thus, we manually recheck the final generated video anomaly descriptions to ensure label accuracy.\nHuman-Centric Question-Answering Pairs Generation So far, we have obtained nearly accurate descriptions of anomaly videos. However, our framework may still face challenges with more open-ended questions from users. Therefore, anomaly-related question-answering is a significant practical requirement. Given the diversity of open-world scenes, users may ask questions involving various question words. Thus, we mainly consider two principles:\n1. Anomaly-related: our questions should be strongly related to the anomaly in the video.\n2. 5W2H: we introduce seven different question words (What, Who, Where, When, How, How much, and Why) to simulate various question formats that users may employ.\nThis enables us to address a wide range of open questions related to video anomalies. We input these two principles into GPT-4 [1] to generate open questions for anomaly videos. We then manually review and select the 100 most suitable questions, which are randomly assigned to each video. Finally, GPT-4 [1] generates \u0026lt;ANSWERS\u0026gt; to these \u0026lt;QUESTIONS\u0026gt;.\nOur data is more practical than previous datasets: it not only covers multiple anomalies in videos but also supports question-answering in open scenarios (more details in Appendix D).\n4 Methodology # To construct a practical framework for understanding video anomalies, our goal is to accurately interpret these anomalies into natural language. However, most previous studies [15, 46, 26, 17, 24] focus on enhancing general video understanding capabilities while neglecting video anomalies. This oversight results in equal attention being given to all parts of the video, such as the background and human appearances, often at the expense of key anomaly features, as shown in Fig. 1 (C).
Consequently, these approaches are not effective in accurately focusing on anomaly-related features.\nOverview of Solution The core of our solution is guiding visual instruction to focus on anomalies. Previous studies in video anomaly detection [41, 49] have demonstrated that motion-related features help identify multiple anomalies. Therefore, in Section 4.1, we first explicitly integrate a motion modality into our proposed framework to target anomaly-related features. Subsequently, in Section 4.2, we maintain mutual information consistency between the appearance and motion modalities within a tight feature space, implicitly guiding the appearance branch to reinforce motion attention. Finally, in Section 4.3, to improve the interpretation of motion-to-language, we extract motion-related language descriptions to directly match the motion and its corresponding motion-related language.\n4.1 Explicit Motion Modality Integration # To enhance the capability of interpreting anomalies, we build a framework, HAWK, to explicitly integrate the motion modality. HAWK has a dual-branch architecture, with f_v as the original video understanding network and f_m for motion understanding. Inspired by Video-LLaMA [46], f_v and f_m share the same architecture but have separate parameters (Fig. 3). Eq. (2) denotes our framework as,\nY = LLaMA(P_v(f_v(X_v)) ⊕ P_m(f_m(X_m)) ⊕ f_t(T)), (2)\nwhere X_v ∈ R^{T×C×H×W} represents the \u0026lt;VIDEO\u0026gt; input for extracting appearance features, and T denotes the temporal dimension. X_m = M(X_v), with M(·) being the motion extractor.\nf_v(·) and f_m(·) are the frozen pre-trained video encoders from BLIP-2 [14], each consisting of one EVA-CLIP [8] and one pre-trained Video Q-Former to output embeddings. Then, the output embeddings from f_v(·) and f_m(·) are passed through learnable projection networks for video and motion, P_v(·) and P_m(·), respectively. These networks aim to project visual (video and motion) embeddings into the language feature space for interpretation.
f_t(·) is the frozen text-token-to-embedding projection, which allows textual information to be input into LLaMA-2 [32]. ⊕ denotes the combination of our input prompt, which we define as: \u0026ldquo;Here is the input video embedding: \u0026lt;VIDEO_EMBEDDING\u0026gt; and motion embedding \u0026lt;MOTION_EMBEDDING\u0026gt; in different frames, please help me to \u0026lt;DESCRIBE_VIDEO\u0026gt; | \u0026lt;QUESTION\u0026gt;.\u0026rdquo; \u0026lt;DESCRIBE_VIDEO\u0026gt; and \u0026lt;QUESTION\u0026gt; are the question classes for video description generation and video question answering, respectively (for details, see Appendix D). By combining the visual token embeddings with the textual embedding f_t(T), LLaMA-2 [32] is employed to generate the final language response, Y. This framework explicitly integrates the motion modality during visual instruction tuning, so that it targets anomaly-related features.\n4.2 Implicitly Motion Attention Reinforcement # Although we integrate the motion modality to facilitate fine-tuning of HAWK, the motion and video branches operate independently. Therefore, we cannot expect the original video branch to extract appearance features that focus on the region where the anomaly occurred (i.e., motion). To help HAWK focus more on these regions, we observe the containment relationship in mutual information between motion and the original video. We use this relationship to construct an auxiliary consistency loss function, implicitly reinforcing the motion attention (Fig. 4, item 2).\nFigure 3: Overview of HAWK. During training (black and gray paths), we optimize the video-language matching loss, along with Video-Motion Consistency and Motion-Language Matching. During inference (gray path only), we generate language descriptions using video, motion, and text.\nExtract Motion Specifically, to obtain motion, we employ a motion describer M(·), which generates motion between two successive frames as shown in Eq.
(3),\nX_Motion^{(t)} = M^{(t)}(X_v^{(t)}, X_v^{(t−1)}), (3)\nwhere M^{(t)}(·) is the motion describer at time step t (we currently use Gunnar Farneback\u0026rsquo;s algorithm [9]), and X_v^{(t)}, X_v^{(t−1)} ∈ R^{1×C×H×W} denote the video frames at time steps t and t − 1.\nX_Motion^{(t)} ∈ R^{2×H×W} contains a two-channel motion vector in the X (horizontal) and Y (vertical) directions. We use the optical flow magnitude from these channels as a mask, normalized to [0, 1] and multiplied with the original video appearance, to hide the non-motion regions, as in Eq. (4),\nwhere × is the pixel-wise multiplication operator. X_v^{(t)}, X_m^{(t)} ∈ R^{1×C×H×W} denote the original video and our input motion information at time step t, respectively. We usually extract T frames as the motion input X_m ∈ R^{T×C×H×W}, the same as for X_v.\nBuild L_MV Loss We consider that X_m contains only the key information for anomalies, that this information is contained in X_v, and that the feature space of X_v is sparser. Therefore, we compact the features from X_m and X_v into a tight space. In this space, we aim to keep the mutual information between X_m and X_v consistent, so that the appearance features focus on the motion region. We therefore construct an auxiliary loss to promote the motion attention of X_v, as in Eq. (5),\nwhere X_v^c = C_v(f_v(X_v)) and X_m^c = C_m(f_m(X_m)) denote the tightly compressed representations of X_v and X_m, respectively, produced by the compression functions C_v and C_m. C_v and C_m share some initial shallow-layer parameters with P_v and P_m (see Fig. 3). A subsequent tight projection then compresses both X_v and X_m into a more compact space.\nFinally, with this auxiliary loss, we can reinforce the motion attention in the appearance features, so that HAWK\u0026rsquo;s feature space focuses more on abnormality-related features, which promotes the understanding of anomalies in the whole framework.\nFigure 4: Visualization of HAWK\u0026rsquo;s loss. 1 is the original video-to-language loss.
2 is the cosine similarity loss for motion modality adaptation. 3 is the motion-to-language loss.\n4.3 Interpreting Motion-to-Language # Although HAWK has already accommodated the motion modality in its visual input, the correspondence between motion and language is still unclear. This limitation hinders HAWK\u0026rsquo;s interpretation of the motion modality. Hence, we aim to reinforce the correspondence between motion and its linguistic representation.\nExtract Motion-related Language Previous studies [4, 33, 40, 12] have shown that the representation of motion in language comes predominantly from verbs and their corresponding entities. Therefore, to extract the linguistic representation, the first step is to perform dependency parsing on the original sentences, as in Eq. (6),\nG_gt = D(Y_gt), (6)\nwhere D(·) is the dependency parsing and Y_gt is the ground truth. G_gt represents the graph of the dependency structure, which captures the syntactic relationships among the words in a sentence.\nBased on this graph, we can extract predicates (verbs) V, and also entities closely related to these predicates, such as subjects S, objects O, indirect subjects S_i, and indirect objects O_i. These elements are then combined to form short phrases representing motion, as in Eq. (7),\nwhere M_l(·) is the language motion extraction operator, and Y_gt^m is the motion-related language.\nBuild L_ML Loss After obtaining the motion-related language, we can establish strong supervision between motion in the visual and linguistic representations (Fig. 4, item 3), enhancing HAWK\u0026rsquo;s ability to interpret motion into language. Consequently, we design motion-language matching as an auxiliary loss, as in Eq. (8),\nwhere L_ML(·) is the cross-entropy loss computed over N words.\nOptimization Goal Finally, our total loss L is L = t_0 × L_VL + t_1 × L_MV + t_2 × L_ML, where L_VL is the original video-to-language loss (as Fig.
4, item 1), and t_0, t_1, and t_2 are hyper-parameters.\n5 Experiments # This section introduces the training, testing, baselines, evaluations, and ablation experiments of HAWK.\nTraining \u0026amp; Testing To enhance our framework\u0026rsquo;s anomaly understanding capabilities, we structure our training and testing process into three stages, as shown in Fig. 5. Stage 1 involves pre-training on the WebVid dataset [3] to acquire a general understanding of video content. In Stage 2, we fine-tune the model towards video anomaly understanding on the specially curated dataset described in Section 3, consisting of over 8,000 videos. We use 90% of these videos for training and allocate the remaining 10% for testing. We jointly train on two tasks: video \u0026lt;DESCRIPTION\u0026gt; generation and video \u0026lt;QUESTION\u0026gt;→\u0026lt;ANSWERING\u0026gt;. In Stage 3, we evaluate these two tasks independently on the test set to assess our model\u0026rsquo;s effectiveness.\nFigure 5: Training \u0026amp; Testing.\nBaselines To evaluate the anomaly understanding performance of our proposed framework, we conduct comparisons with SOTA video understanding baselines. We select five baselines: Video-ChatGPT [26], VideoChat [15], Video-LLaMA [46], LLaMA-Adapter [47], and Video-LLaVA [17]. The purpose of our comparison is to determine whether these baselines can fully understand and interpret video anomalies.\nTable 1: Quantitative performance of (A) anomaly video description generation and (B) video question-answering. Red indicates the best performance, while blue denotes the second best.
(A) Anomaly Video Description Generation\nText-Level (↑) [27]: BLEU-1 to BLEU-4. GPT-Guided (↑) [18]: Reasonability, Detail, Consistency.\nMethod | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Reasonability | Detail | Consistency\nVideo-ChatGPT [26] | 0.107 | 0.046 | 0.017 | 0.008 | 0.084 | 0.108 | 0.055\nVideoChat [15] | 0.053 | 0.023 | 0.008 | 0.003 | 0.107 | 0.205 | 0.054\nVideo-LLaMA [46] | 0.062 | 0.025 | 0.009 | 0.004 | 0.120 | 0.217 | 0.066\nLLaMA-Adapter [47] | 0.132 | 0.052 | 0.018 | 0.008 | 0.060 | 0.091 | 0.038\nVideo-LLaVA [17] | 0.071 | 0.030 | 0.012 | 0.005 | 0.077 | 0.115 | 0.038\nOurs | 0.270 | 0.139 | 0.074 | 0.043 | 0.283 | 0.320 | 0.218\n(B) Anomaly Video Question-Answering\nText-Level (↑) [27]: BLEU-1 to BLEU-4. GPT-Guided (↑) [18]: Reasonability, Detail, Consistency.\nMethod | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Reasonability | Detail | Consistency\nVideo-ChatGPT [26] | 0.177 | 0.096 | 0.058 | 0.038 | 0.508 | 0.430 | 0.421\nVideoChat [15] | 0.261 | 0.133 | 0.074 | 0.043 | 0.699 | 0.631 | 0.598\nVideo-LLaMA [46] | 0.156 | 0.081 | 0.045 | 0.027 | 0.586 | 0.485 | 0.497\nLLaMA-Adapter [47] | 0.199 | 0.109 | 0.067 | 0.043 | 0.646 | 0.559 | 0.549\nVideo-LLaVA [17] | 0.094 | 0.054 | 0.034 | 0.023 | 0.393 | 0.274 | 0.316\nOurs | 0.319 | 0.179 | 0.112 | 0.073 | 0.840 | 0.794 | 0.753\nEvaluation Metrics To accurately evaluate our model\u0026rsquo;s performance in understanding video anomalies, we first adopt four Text-Level metrics, BLEU-1 to BLEU-4 (Bilingual Evaluation Understudy) [27], to measure word overlap between the model-generated text and the ground truth.
This approach enables us to assess similarity objectively at several levels of text granularity, thus providing a clear indicator of how well the model understands and describes anomalies.\nBesides, we expand our evaluation framework by incorporating insights from recent research in LLaVA [18] and Video-ChatGPT [26], utilizing GPT-Guided [1] methods to assess the quality of the generated text. GPT [1] serves as a critical evaluator, generating scores for three key aspects of the language produced, with each aspect scored on a scale from 0 to 1. These three aspects are:\nReasonability: evaluates the logical reasoning and coherence of the generated language.\nDetail: assesses the level of detail and specificity of the generated language.\nConsistency: evaluates the coherence and consistency of the generated language.\nBy leveraging GPT [1] as an evaluative tool, we aim to provide a nuanced understanding of the text\u0026rsquo;s quality, focusing on aspects that traditional metrics may overlook.\nQuantitative Evaluation Table 1 (A) and (B) demonstrate the effectiveness of our model in describing abnormal phenomena. Our proposed model significantly outperforms the previous baselines, achieving SOTA performance on every Text-Level and GPT-Guided metric; it can thus generate text that aligns more closely with actual scenarios.\nQualitative Evaluation Table 2 (A) and (B) demonstrate that our proposed framework achieves optimal qualitative performance in video description generation and question-answering, respectively. Compared with other baselines, HAWK can accurately understand and focus on video anomalies. For example, in Table 2 (A), Video-LLaMA [46] pays more attention to the clothing of the people (wearing blue and red jacket), while ignoring the motion-related anomaly (slipping).
In Table 2 (B), Video-ChatGPT may produce hallucinations (two people\u0026hellip; who were hit by the car), which differ from the original video anomaly (car suddenly braking). In contrast, HAWK generates descriptions that are close to the real semantics (driver losing control).\nAblation Study We conducted ablation experiments on three key structures proposed in this paper and analyzed their impact on the overall performance in Table 3 (A) and (B).\nTable 2: Qualitative performance on (A) anomaly video description generation, and (B) question-answering. Red texts indicate key semantic inconsistencies, whereas Green texts signify that the generated results are closely aligned with the Ground Truth. [YELLOW] indicates the text problem.\nEffectiveness of Motion Information: We ablate all the motion components, including f_m, P_m, and the motion input X_m, to prove the effectiveness of introducing the motion modality. When explicit motion information is lacking, the model\u0026rsquo;s ability to describe motion-related anomalies diminishes, leading to inaccurate descriptions or even hallucinations (Table 4, w/o Motion Information), which in turn degrades the overall performance (Table 3).\nTable 3: Ablation study of (A) anomaly video description generation and (B) video question-answering.
Red indicates the best performance, while blue denotes the second best.\n(A) Anomaly Video Description Generation\nText-Level (↑) [27]: BLEU-1 to BLEU-4. GPT-Guided (↑) [18]: Reasonability, Detail, Consistency.\nSetting | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Reasonability | Detail | Consistency\nw/o Motion Information | 0.249 | 0.121 | 0.062 | 0.034 | 0.253 | 0.306 | 0.189\nw/o Video-Motion Consistency | 0.249 | 0.123 | 0.064 | 0.036 | 0.261 | 0.295 | 0.194\nw/o Motion-Language Matching Loss | 0.270 | 0.135 | 0.073 | 0.041 | 0.276 | 0.320 | 0.212\nOurs | 0.270 | 0.139 | 0.074 | 0.043 | 0.283 | 0.320 | 0.218\n(B) Anomaly Video Question-Answering\nText-Level (↑) [27]: BLEU-1 to BLEU-4. GPT-Guided (↑) [18]: Reasonability, Detail, Consistency.\nSetting | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Reasonability | Detail | Consistency\nw/o Motion Information | 0.309 | 0.171 | 0.105 | 0.065 | 0.837 | 0.790 | 0.743\nw/o Video-Motion Consistency | 0.313 | 0.172 | 0.105 | 0.066 | 0.833 | 0.784 | 0.742\nw/o Motion-Language Matching Loss | 0.316 | 0.176 | 0.109 | 0.069 | 0.836 | 0.788 | 0.752\nOurs | 0.319 | 0.179 | 0.112 | 0.073 | 0.840 | 0.794 | 0.753\nTable 4: Qualitative evaluation of ablation study. Red/Green texts indicate language semantic inconsistency/consistency with the Ground Truth respectively. [YELLOW] indicates the text problem.\nw/o Motion Information: While driving down the [UNRELATED INFORMATION] snowy street, the vehicle suddenly skids on patch of ice, causing it to veer into another car that was [HALLUCINATION] parked beside a tree on the side of the road.\nw/o Video-Motion Consistency: In this video, the car fails to stop in time due to the [WRONG CONDITION] slippery road caused by the snowy conditions. As a result, it skids and sideswipes the white car parked on the side of the road causing minor damage to both vehicles.\nw/o Motion-Language Matching: As the white truck drives down the street, it suddenly skids on the wet road surface, losing control, and sideswiping the parked white car.
The impact results in visible damage to both vehicles, [UNCLEAR CONDITION] with smoke emitting from the truck’s side and the car’s mirrors shattering.\nOurs: While driving down a narrow street with cars parked on both sides, the current vehicle’s front right side scrapes against a parked car, causing minor damage to both vehicles.\nGround Truth: While driving down the street, the silver car suddenly swerves to avoid a parked car, but clips its rear bumper, causing minor damage to both vehicles.\nEffectiveness of Video-Motion Consistency: The absence of video-motion consistency constraints reduces the generative model\u0026rsquo;s ability to adapt to the motion modality, causing difficulties in accurately understanding motion scenes (Table 4, w/o Video-Motion Consistency), which in turn degrades the overall performance (Table 3).\nEffectiveness of Motion-Language Matching: Without the motion-language matching loss, the correlation between motion and language becomes unclear. This ambiguity leads to the generation of language that includes unspecified motion information (Table 4, w/o Motion-Language Matching), subsequently degrading the overall performance (Table 3).\n6 Conclusion # In conclusion, we have developed a novel video-language framework for understanding video anomalies across various scenarios. By incorporating motion features and constructing rich linguistic descriptions, our model demonstrates SOTA performance in the open world. It has the potential to benefit practical applications in diverse domains and paves the way for improving the model\u0026rsquo;s interactivity with users, enabling more efficient and effective communication in addressing user-specific inquiries related to video anomalies.\nReferences # [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report.
arXiv preprint arXiv:2303.08774 (2023)\n[2] Acsintoae, A., Florescu, A., Georgescu, M.I., Mare, T., Sumedrea, P., Ionescu, R.T., Khan, F.S., Shah, M.: Ubnormal: New benchmark for supervised open-set video anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20143–20153 (2022)\n[3] Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: A joint video and image encoder for end-to-end retrieval. In: IEEE International Conference on Computer Vision (2021)\n[4] Cadiot, P., Lebas, F., Visetti, Y.M.: The semantics of the motion verbs. Space in Languages: Linguistic Systems and Cognitive Categories 66, 175 (2006)\n[5] Chan, A.B., Vasconcelos, N.: Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE transactions on pattern analysis and machine intelligence 30(5), 909–926 (2008)\n[6] Du, H., Zhang, S., Xie, B., Nan, G., Zhang, J., Xu, J., Liu, H., Leng, S., Liu, J., Fan, H., Huang, D., Feng, J., Chen, L., Zhang, C., Li, X., Zhang, H., Chen, J., Cui, Q., Tao, X.: Uncovering what, why and how: A comprehensive benchmark for causation understanding of video anomaly. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024)\n[7] Dubey, S., Boragule, A., Jeon, M.: 3d resnet with ranking loss function for abnormal activity detection in videos. In: 2019 International Conference on Control, Automation and Information Sciences (ICCAIS). pp. 1–6. IEEE (2019)\n[8] Fang, Y., Wang, W., Xie, B., Sun, Q.S., Wu, L.Y., Wang, X., Huang, T., Wang, X., Cao, Y.: Eva: Exploring the limits of masked visual representation learning at scale. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 19358–19369 (2022)\n[9] Farneback, G.: Fast and accurate motion estimation using orientation tensors and parametric motion models.
In: Proceedings 15th International Conference on Pattern Recognition. ICPR2000. vol. 1, pp. 135–139. IEEE (2000)\n[10] He, C., Shao, J., Sun, J.: An anomaly-introduced learning method for abnormal event detection. Multimedia Tools and Applications 77, 29573–29588 (2018)\n[11] Huang, X., Zhang, Y., Ma, J., Tian, W., Feng, R., Zhang, Y., Li, Y., Guo, Y., Zhang, L.: Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657 (2023)\n[12] Langacker, R.W.: Nouns and verbs. Language pp. 53–94 (1987)\n[13] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)\n[14] Li, J., Li, D., Savarese, S., Hoi, S.C.H.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning (2023)\n[15] Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355 (2023)\n[16] Li, S., Liu, F., Jiao, L.: Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 1395–1403 (2022)\n[17] Lin, B., Zhu, B., Ye, Y., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122 (2023)\n[18] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36 (2024)\n[19] Liu, W., Luo, W., Lian, D., Gao, S.: Future frame prediction for anomaly detection – a new baseline.
In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)\n[20] Lu, C., Shi, J., Jia, J.: Abnormal event detection at 150 fps in MATLAB. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2720–2727 (2013)\n[21] Lu, H., Niu, X., Wang, J., Wang, Y., Hu, Q., Tang, J., Zhang, Y., Yuan, K., Huang, B., Yu, Z., et al.: GPT as psychologist? Preliminary evaluations for GPT-4V on visual affective computing. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2024)\n[22] Lu, H., Tang, J., Xu, X., Cao, X., Zhang, Y., Wang, G., Du, D., Chen, H., Chen, Y.: Scaling multi-camera 3D object detection through weak-to-strong eliciting. arXiv (2024)\n[23] Luo, R., Zhao, Z., Yang, M., Dong, J., Qiu, M., Lu, P., Wang, T., Wei, Z.: Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207 (2023)\n[24] Lv, H., Sun, Q.: Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702 (2024)\n[25] Maaz, M., Rasheed, H., Khan, S., Khan, F.S.: Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424 (2023)\n[26] Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424 (2023)\n[27] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318 (2002)\n[28] Pu, Y., Wu, X., Wang, S.: Learning prompt-enhanced context features for weakly-supervised video anomaly detection. arXiv preprint arXiv:2306.14451 (2023)\n[29] Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., Cai, D.: PandaGPT: One model to instruction-follow them all. 
arXiv preprint arXiv:2305.16355 (2023)\n[30] Sultani, W., Chen, C., Shah, M.: Real-world anomaly detection in surveillance videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6479–6488 (2018)\n[31] Tian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J.W., Carneiro, G.: Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4975–4986 (2021)\n[32] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)\n[33] Vo, N.P.A., Manotas, I., Sheinin, V., Popescu, O.: Identifying motion entities in natural language and a case study for named entity recognition. In: Proceedings of the 28th International Conference on Computational Linguistics. pp. 5250–5258 (2020)\n[34] Wang, S., Miao, Z.: Anomaly detection in crowd scene. In: IEEE 10th International Conference on Signal Processing Proceedings. pp. 1220–1223. IEEE (2010)\n[35] Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z., Zhang, H., Xu, J., Liu, Y., Wang, Z., et al.: InternVideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191 (2022)\n[36] Wu, J., Wang, J., Yang, Z., Gan, Z., Liu, Z., Yuan, J., Wang, L.: GRiT: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280 (2022)\n[37] Wu, P., Liu, J.: Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing 30, 3513–3527 (2021)\n[38] Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., Yang, Z.: Not only look, but also listen: Learning multimodal violence detection under weak supervision. 
In: European Conference on Computer Vision (ECCV) (2020)\n[39] Wu, P., Zhou, X., Pang, G., Sun, Y., Liu, J., Wang, P., Zhang, Y.: Open-vocabulary video anomaly detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024)\n[40] Wunderlich, D.: Cause and the structure of verbs. Linguistic Inquiry pp. 27–68 (1997)\n[41] Xu, D., Ricci, E., Yan, Y., Song, J., Sebe, N.: Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553 (2015)\n[42] Yao, Y., Wang, X., Xu, M., Pu, Z., Wang, Y., Atkins, E., Crandall, D.J.: DoTA: Unsupervised detection of traffic anomaly in driving videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1), 444–459 (2022)\n[43] Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)\n[44] Yuan, T., Zhang, X., Liu, K., Liu, B., Chen, C., Jin, J., Jiao, Z.: Towards surveillance video-and-language understanding: New dataset, baselines, and challenges (2023)\n[45] Zaheer, M.Z., Mahmood, A., Astrid, M., Lee, S.I.: CLAWS: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16. pp. 358–376. Springer (2020)\n[46] Zhang, H., Li, X., Bing, L.: Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)\n[47] Zhang, R., Han, J., Liu, C., Gao, P., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Qiao, Y.: LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:2303.16199 (2023)\n[48] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In: The Twelfth International Conference on Learning Representations (2024)\n[49] Zhu, Y., Newsam, S.: Motion-aware feature for improved video anomaly detection. arXiv preprint arXiv:1907.10211 (2019)\nA Summary of Appendix # This appendix provides supplementary information that was not included in the main paper. First, we give the security statement of our study, which addresses the confidentiality and integrity of the data used. We then detail the training and testing resources, including the hardware and software configurations. We also present the statistics and distribution of the training data, along with the human resource costs involved in the study. Next, we describe the evaluation metrics used to assess the performance of our method, and present additional qualitative comparisons that illustrate the effectiveness of our approach. We further provide an open-world demo that demonstrates the real-world applicability of our method. Finally, we discuss the limitations of our paper and propose potential avenues for future research.\nB Security Statement # To prevent potential misuse and ensure responsible use, we have strictly limited the application scope of our proposed method, HAWK. Unless authorized, HAWK is only permitted for use in research domains.\nAdditionally, access to the proposed dataset is restricted to qualified institutions and organizations, which must provide a clear purpose for its use. 
We explicitly prohibit the application of the dataset in situations that may cause potential danger or have a significant social impact.\nThese measures are in place to ensure the ethical and responsible use of our research.\nC Details in Training and Testing # Computational Resource. During the pre-training phase, we used four NVIDIA RTX A6000 GPUs* to train on the WebVid dataset [3] for approximately 120 hours. In the fine-tuning phase, we used two NVIDIA RTX A6000 GPUs to fine-tune on our proposed dataset for about 80 hours.\nEfficiency. During testing, the average model response time for each round of conversation with HAWK is approximately 2 ms. Given the available GPU memory, the model can handle video clips of up to 32 frames, so frames must be sampled from longer videos.\nHyper-parameters. In the loss function, the weight t0 is set to 1 for our main task, video-to-language, and the weights t1 and t2 of the two auxiliary tasks are set to 0.1 to balance the different loss values.\nD Details in Dataset # Dataset Introduction and Statistics. Our study uses seven video anomaly datasets, each covering different scenes. Their statistics are as follows:\nUCF-Crime [30]: The UCF-Crime dataset comprises 128 hours of video. It consists of 1,900 long, untrimmed real-world surveillance videos featuring 13 distinct classes of realistic anomalies, chosen for their notable implications for public safety.\nShanghaiTech [19]: The ShanghaiTech Campus dataset comprises 13 scenes with complex lighting conditions and varied camera angles. It contains 130 abnormal events and over 270,000 training frames. Notably, this dataset includes both frame-level and pixel-level ground-truth annotations of abnormal events, supporting both anomaly detection and localization tasks. 
CUHK Avenue [20]: The CUHK Avenue dataset comprises 16 training and 21 testing video clips designed for abnormal event detection. Captured on the CUHK campus avenue, these videos contain 30,652 frames in total, divided into 15,328 frames for training and 15,324 frames for testing. The training videos capture normal situations, while the testing videos include both normal and abnormal events.\n* https://www.nvidia.com/en-us/design-visualization/rtx-a6000/\nUCSD Dataset [5, 34]: The UCSD Anomaly Detection Dataset was captured using an elevated stationary camera that provides an overhead view of pedestrian walkways. The crowd density in these walkways varies from sparse to very crowded. The dataset is split into two subsets, each corresponding to a different scene. Ped1 [5] includes 34 training video samples and 36 testing video samples, while Ped2 [34] consists of 16 training video samples and 12 testing video samples.\nDoTA [42]: The Detection of Traffic Anomaly (DoTA) dataset introduces the When-Where-What pipeline with temporal, spatial, and categorical annotations. It contains 4,677 videos, all with a resolution of 1280 × 720 pixels. The videos in this dataset were extracted at a frame rate of 10 fps.\nUBnormal [2]: The UBnormal dataset is a supervised open-set benchmark designed explicitly for video anomaly detection, comprising diverse virtual scenes. It provides abnormal events annotated at the pixel level during training, which enables fully supervised learning techniques for abnormal event detection.\nIn our study, we extend these existing datasets with our data engineering pipeline. 
This pipeline generates comprehensive descriptions of video anomalies and formulates open questions derived from these anomalies.\nData Distribution. To demonstrate the applicability of our data in an open-world scenario, we conducted a statistical analysis of the data distribution. Figure 6 illustrates the data distribution of all the datasets we used, indicating that our method can effectively support various open-world datasets. We nonetheless acknowledge the need to expand our dataset further to enhance the model\u0026rsquo;s applicability in this task.\nFigure 6: Violin plot of data distribution. We use PCA dimensionality reduction to compare the feature distributions of the different datasets, which differ significantly.\nManual Checking. Before conducting the experiments, we manually checked the textual descriptions generated for the videos. Specifically, we considered the following aspects:\nError Correction: We removed text descriptions that contained obvious errors about the video content and supplemented the correct object, behavior, and scene information. (For instance, GPT tends to misidentify dogs in videos, describe running pedestrians as skateboards and motorcycles, and mistake scenes containing water for rainy days.)\nDetail Enhancement: We provided more detailed textual descriptions of anomalies in the video (such as pedestrians lingering or jumping in the middle of the road).\nHuman Resource Cost: We formed a team of five annotators to manually check all the videos. Since most of the videos already had automatically generated annotations, each annotator invested approximately 30 hours of work during the labeling process, processing about 1,700 videos.\n\u0026lt;DESCRIBE_VIDEO\u0026gt; and Generated Open-World \u0026lt;QUESTION\u0026gt;. We set 20 prompts for \u0026lt;DESCRIBE_VIDEO\u0026gt;, and during each training iteration, we randomly select one of them.\n1. 
Can you describe the anomaly in the video? 2. How would you detail the anomaly found in the video? 3. What anomaly can you identify in the video? 4. Could you explain the anomaly observed in the video? 5. Can you point out the anomaly in the video? 6. What\u0026#39;s the anomaly depicted in the video? 7. Could you specify the anomaly present in the video? 8. How do you perceive the anomaly in the video? 9. Can you highlight the anomaly within the video? 10. What anomaly is noticeable in the video? 11. Could you characterize the anomaly seen in the video? 12. Can you detail the specific anomaly encountered in the video? 13. How would you describe the particular anomaly in the video? 14. What details can you provide about the anomaly in the video? 15. Could you elucidate on the anomaly detected in the video? 16. Can you illustrate the nature of the anomaly in the video? 17. What features of the anomaly in the video can you describe? 18. Could you outline the anomaly observed in the video? 19. How does the anomaly in the video manifest? 20. Can you clarify the aspects of the anomaly in the video?\nWe have also generated 100 \u0026lt;QUESTION\u0026gt; prompts for open-world anomalies. To mimic user behavior, some of these questions are closely related to the video scene, while others are less closely related. However, all of these questions are potential inquiries in an open-world scenario.\n1. Who is causing the disturbance in the video? 2. What is the unusual activity happening in the video? 3. When did the anomaly occur in the video? 4. Where is the strange event taking place in the video? 5. Why is the object in the video behaving abnormally? 6. How is the anomaly in the video affecting the surroundings? 7. How much damage was caused by the incident in the video? 8. Who is the main person involved in the unusual event? 9. What is the cause of the sudden change in the video? 10. When does the suspicious activity start in the video? 11. 
Where can I find more information about the incident in the video? 12. Why are the people in the video reacting in that way? 13. How can I identify the source of the problem in the video? 14. How much time does the abnormal event last in the video? 15. Who are the other people affected by the anomaly in the video? 16. What actions were taken to address the issue in the video? 17. When was the video recorded, and is it a recent event? 18. Where else can I find similar incidents in other videos? 19. Why is the vehicle in the video moving erratically? 20. How can I prevent such anomalies from occurring in the future? 21. How much impact does the abnormal event have on the overall situation? 22. Who should I contact if I notice a similar anomaly in another video? 23. What steps can I take to investigate the issue further? 24. When is the best time to report an unusual event in a video? 25. Where can I find resources to help me understand the anomaly better? 26. Why did the equipment in the video malfunction? 27. How can I differentiate between normal and abnormal behavior in a video? 28. How much does it cost to implement a system that detects anomalies in videos? 29. Who can provide expert advice on handling video anomalies? 30. What is the most common type of anomaly found in videos? 31. When should I be concerned about an anomaly in a video? 32. Where can I find a list of known video anomalies and their descriptions? 33. Why is it important to detect and analyze anomalies in videos? 34. How can I improve my ability to spot anomalies in videos? 35. How much training is required to become proficient in detecting video anomalies? 36. Who can I collaborate with to better understand video anomalies? 37. What are the potential consequences of ignoring an anomaly in a video? 38. When did the trend of analyzing anomalies in videos begin? 39. Where can I find examples of successfully resolved video anomaly cases? 40. Why do some anomalies in videos go unnoticed? 41. 
How can I report a video anomaly to the appropriate authorities? 42. How much time is needed to thoroughly analyze a video anomaly? 43. Who is responsible for monitoring and addressing video anomalies? 44. What are the best tools to use for detecting anomalies in videos? 45. When is it necessary to escalate a video anomaly for further investigation? 46. Where can I find guidelines on how to handle video anomalies? 47. Why do some video anomalies lead to serious consequences? 48. How can I ensure the accuracy of my video anomaly detection system? 49. How much effort is needed to maintain a video anomaly detection system? 50. Who should be informed when a video anomaly is detected? 51. What are the signs that indicate a potential anomaly in a video? 52. When should I perform a follow-up analysis on a detected video anomaly? 53. Where can I find support for dealing with video anomalies? 54. Why is it crucial to act quickly when a video anomaly is detected? 55. How can I improve the efficiency of my video anomaly detection process? 56. How much data is needed to accurately detect anomalies in videos? 57. Who can help me fine-tune my video anomaly detection system? 58. What are the key factors to consider when analyzing video anomalies? 59. When should I update my video anomaly detection system? 60. Where can I find the latest research on video anomaly detection techniques? 61. Why is it necessary to have a video anomaly detection system in place? 62. How can I minimize false alarms in my video anomaly detection system? 63. How much does it cost to maintain a video anomaly detection system? 64. Who can I consult if I encounter difficulties with my video anomaly detection system? 65. What are the best practices for dealing with video anomalies? 66. When is it appropriate to involve law enforcement in a video anomaly case? 67. Where can I find a community of professionals who specialize in video anomaly detection? 68. 
Why do some video anomalies require immediate attention? 69. How can I enhance the performance of my video anomaly detection system? 70. How much should I invest in a video anomaly detection system? 71. Who can provide training on how to detect and analyze video anomalies? 72. What are the most effective methods for detecting anomalies in videos? 73. When should I seek external help for a video anomaly case? 74. Where can I find a comprehensive database of video anomalies? 75. Why is it important to continuously monitor videos for anomalies? 76. How can I validate the results of my video anomaly detection system? 77. How much influence do external factors have on video anomalies? 78. Who can I reach out to for assistance with a complex video anomaly case? 79. What are the main challenges in detecting and analyzing video anomalies? 80. When is it necessary to involve other stakeholders in a video anomaly case? 81. Where can I find case studies on successful video anomaly detection projects? 82. Why is it essential to have a systematic approach to video anomaly detection? 83. How can I optimize my video anomaly detection system for different scenarios? 84. How much storage is needed to archive video anomalies for future analysis? 85. Who should be held accountable for undetected video anomalies? 86. What are the most common reasons for video anomalies to occur? 87. When should I reevaluate my video anomaly detection system? 88. Where can I find information on the latest video anomaly detection technologies? 89. Why is it beneficial to collaborate with others in the field of video anomaly detection? 90. How can I ensure the confidentiality of video anomaly cases? 91. How much should I rely on automated systems for video anomaly detection? 92. Who can I contact for technical support with my video anomaly detection system? 93. What are the ethical considerations when dealing with video anomalies? 94. When should I notify the public about a video anomaly case? 95. 
Where can I find reliable sources of information on video anomalies? 96. Why is it important to have a backup plan for dealing with video anomalies? 97. How can I customize my video anomaly detection system for specific use cases? 98. How much time should I allocate for analyzing video anomalies? 99. Who can I turn to for guidance on handling sensitive video anomaly cases? 100. What are the most critical factors to consider when choosing a video anomaly detection system?\nE Details in GPT-Guided Metrics # In the GPT-Guided metrics, we employ GPT-4 as an auxiliary tool to evaluate the responses generated by HAWK. Our evaluation focuses on three primary dimensions: Reasonability, Detail, and Consistency.\nWe first set the system prompt as follows:\n{\u0026#34;role\u0026#34;: \u0026#34;system\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;You are an intelligent chatbot designed for evaluating the generative outputs for video-based pairs. you will be given two answers, one reference ground truth and one our generated, but this does not mean that the reference GT is the only answer. Your task is to give the score of the predicted answers.\u0026quot;}\nOur system prompt is designed to compare the degree of matching between the reference and generated answers. However, this does not imply fine-grained matching at the text level. 
Instead, it emphasizes semantic agreement.\nTo assess a particular dimension of the metric, we employ the following prompt:\n{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;### Video Description Generation Please evaluate the following video-based video description pair: Reference: \u0026lt;DESCRIPTION_GT\u0026gt; Ours: \u0026lt;DESCRIPTION_Ours\u0026gt; ### Video Question-Answering Please evaluate the following video-based video question-answer pair: Question: \u0026lt;QUESTION\u0026gt; Reference: \u0026lt;ANSWER_GT\u0026gt; Ours: \u0026lt;ANSWER_Ours\u0026gt; Provide your evaluation only as a \u0026lt;Reasonability|Detail|Consistency\u0026gt; score where the \u0026lt;Reasonability|Detail|Consistency\u0026gt; score is a FLOAT value between 0 and 1, with 1 indicating the highest level of \u0026lt;Reasonability|Detail|Consistency\u0026gt;. Please generate the response in the form of a Python dictionary string with key \u0026#39;score\u0026#39;, where its value is the \u0026lt;Reasonability|Detail|Consistency\u0026gt; score in FLOAT, not STRING. DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. For example, your response should look like this: {\u0026#39;score\u0026#39;: 0.675}.\u0026quot;}\nWe have developed distinct prompts for two tasks: Video Description Generation and Video Question-Answering. The primary difference is the addition of the \u0026lt;QUESTION\u0026gt; field in Video Question-Answering, which indicates what kind of question the model should answer. \u0026lt;DESCRIPTION_GT\u0026gt; and \u0026lt;DESCRIPTION_Ours\u0026gt; represent the Ground Truth and our generated video description, respectively. Similarly, \u0026lt;ANSWER_GT\u0026gt; and \u0026lt;ANSWER_Ours\u0026gt; signify the Ground Truth and our generated video answers, respectively. 
\u0026lt;Reasonability|Detail|Consistency\u0026gt; represents the three dimensions we aim to evaluate. Lastly, besides the essential reminders, we have constrained GPT\u0026rsquo;s output format to {\u0026#39;score\u0026#39;: 0.675}.\nF More Results # Tables (A), (B), (C), (D), (E), and (F) below present additional qualitative results from different datasets. In the tables, red text indicates key semantic inconsistencies with the Ground Truth, while green text signifies that the generated results closely align with the Ground Truth.\nG Open-World Video Anomaly Understanding Demo # We present a demo showcasing the use of HAWK in an open-world scenario, using XD-Violence [38] (which is not included in our dataset). The practical capability of the system in an unseen open-world scenario is shown in Fig. 7 and Fig. 8. Furthermore, HAWK can provide accurate answers to users\u0026rsquo; questions and engage in long dialogues in the open world.\nH Limitations # Hallucination. Although motion information reduces most hallucinations, erroneous motion cues can still cause them. Future work may need to model the connection between hallucinations and the abnormal region more precisely.\nVideo-level vs. Streaming Data. This paper targets video-level anomaly understanding. A practical video anomaly detection system, however, must also detect anomalies in streaming input, so a system designed for streaming data is still needed.\nData Limitations. While our dataset includes multiple anomaly scenarios and our framework is designed for an open-world setting, the limitations of our data make it difficult to fully support open-world scenarios. This is a significant drawback of our study. 
To address this limitation, we recommend building larger and more diverse open datasets.\nI Future Work # In this section, we discuss potential avenues for future research that build upon this paper.\nExpanding Applicability. One possible direction for future work is to extend the model to a wider range of scenarios (as in the recent work of Du et al. [6]). While our current dataset includes multiple anomaly scenarios, there are still many diverse scenarios the model may not handle. We could incorporate additional data sources or create new datasets that cover a broader range of scenarios. Additionally, we could investigate deploying the model in cloud, edge, and hybrid computing environments to support more diverse deployment scenarios.\nEnriching Task Capabilities. While our current model can detect anomalies in videos, it does not indicate where in the scene the anomaly occurs. This information could be valuable for applications such as security monitoring or surveillance. A valuable direction for future work is therefore to improve the model\u0026rsquo;s ability to localize anomalies, which would further enhance its usability and expand its range of applications.\n[Video frames: 00:00, 00:01, 00:02]\nPrompt: Please help me to fully describe the details of this video, especially for the anomaly in the video.\nVideo-ChatGPT [26]: The video shows a car driving down a busy street, with a person sitting in the passenger seat. The car is surrounded by other cars, and the driver is seen looking at the camera. The car is seen turning a corner, and the person sitting in the passenger seat is looking at the camera. 
The video also shows a person walking in the street, and a truck driving by.\nVideoChat [15]: The video is a car driving on the road with the camera mounted inside the car. It is a clear day with some clouds in the sky. There are cars and other vehicles on the road, with one vehicle parked on the side of the road. The road is lined with trees on both sides. The car is driving along the road, passing other vehicles, and crossing roads. The anomaly in the video occurs when the car suddenly stops in front of a bus, blocking the road.\nVideoChat [15]: Suddenly, the car appears to be hit by a large object, causing it to flip over and crash. The person inside the car is thrown out, and the car is left in a wrecked state on the road. The scene is captured from a dash cam perspective, providing a clear view of the incident.\nVideo-LLaMA [46]: The video starts with a car driving down a street, and the camera captures the view from inside the car. The car is seen driving past a blue truck and a red car. The camera then pans to the side of the road, where a man is seen walking down the sidewalk. The camera then pans back to the street, where a blue truck is seen driving down the road.\nOurs: While driving down the street, the current vehicle is involved in an accident when a sudden stop causes it to rear-end the white car in front of it, resulting ……\n[Video frames]\nPrompt: Please help me to fully describe the details of this video, especially for the anomaly in the video.\nVideo-ChatGPT [26]: The video shows a man sitting on a bench in a busy city street, surrounded by many people. Suddenly, a dog appears and starts walking towards the man. The dog is wearing a red collar and is accompanied by a person. The dog is walking towards the man, who is sitting on the bench, and the man is watching the dog as it approaches. The dog is walking towards the man, and it appears to be a unique and unexpected event in the busy city street. 
VideoChat [15]: he video shows a group of students walking outside the entrance to an ffice building located in a park. There are two white benches in front of he entrance, one on the left side and one on the right side. The entrance has large blue sign above it that reads Science and Engineering. People are tanding and walking around the entrance while the students are walking owards the building Video-LLaMA [46]: g First, the video shows a busy street with many people walking around. There is a sign for a bus stop at one corner. Next, the video focuses on a line of people standing in front of a blue building, with a man in a black shirt and black jacket walking towards them. Finally, the video shows a crowd of people at the bus stop, with people walking to and from the bus. There is a green park bench and a building visible behind the people. The street is some walking in pairs or small groups. The hallway is long and has a few benches along the way. One of the people in the group appears to be wearing a tie, which stands out as an anomaly in the otherwise casual attire of the other individuals. In the video, a man is seen standing in a hallway while talking on his cell phone. He is wearing a red shirt and is standing in front of a group of people who are walking by. The man’s red shirt stands out against the crowd, Ours: The video is set at a subway entrance near campus, with pedestrians entering and exiting the area from different directions. A crowd of students, teachers, and tourists is standing at the subway entrance, watching a performance by a street musician. They are throwing coins and applauding. Other students and teachers are walking and talking near the subway entrance. Ground Truth: The video is set at a subway entrance near campus, with pedestrians entering and exiting the area from different directions. A crowd of students, teachers, and tourists is standing at the subway entrance, watching a performance by a street musician. 
They are throwing coins and applauding. Other students and teachers are walking and talking near the subway entrance. Ground Truth: The video is set at a subway entrance where some pedestrians are hurriedly entering and exiting the subway or passing through nearby pathways. An\u0002other group of people is captivated by a performance and stands at the subway entrance to watch. | …… …… 00:01 00:02 00:04 Question How much impact does the abnormal event have on the overall situation? …… …… | …… …… 00:01 00:02 00:04 Question How much impact does the abnormal event have on the overall situation?\n…… …… 00:01 dChtGPT [26] 2 00:04 Video-ChatGPT [26]: The abnormal event of a person falling on the ground in the middle of the train station has a significant impact on the overall situation. It disrupts the flow of people walking around the station, as they are forced to stop and look at the person lying on the ground. This creates a sense of confusion and concern among the people, as they may not know what to do or how to help the person. VideoChat [15]: ased on the information provided, it’s unclear if there is any abnormal ent occurring in the video. I cannot assess the overall situation without rther context or information. LLMAAdt[47] functioning of the metro station. It may cause delays or disruptions in the train service, potentially affecting passengers’ plans and travel experiences Additionally, first responders may have to intervene to ensure the person’s safety. The abnormal event of a train passing through a tunnel with no passengers or train cars significantly impacts the overall situation. In the image, the Video-LLaVA [17]: emergency situation. The abnormal event of the man falling down has a significant impact on he overall situation. It disrupts the normal flow of the game and adds an element of surprise and unpredictability. 
The players’ reactions to the event such as the man getting up and continuing to play, show that the game continues despite the unexpected interruption. Figure 7: Demo on Open-World Video Anomaly Understanding. (Part 1)\nFigure 8: Demo on Open-World Video Anomaly Understanding. (Part 2)\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/hawk--learning-to-understand-open-world-video-anomalies/","section":"Papers","summary":"Introduces HAWK, a novel framework leveraging interactive large Visual Language Models with explicit and implicit motion modality integration, auxiliary consistency loss, and detailed language annotations for diverse video anomaly scenarios. Demonstrates state-of-the-art performance in video description and question-answering tasks across multiple open-world datasets, with extensive annotated data and generation pipelines to enhance practical anomaly understanding and interaction capabilities.","title":"Learning to Understand Open-World Video Anomalies","type":"other"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/lei-song/","section":"Authors","summary":"","title":"Lei Song","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/li-shen/","section":"Authors","summary":"","title":"Li Shen","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/li-yu/","section":"Authors","summary":"","title":"Li Yu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/licheng-jiao/","section":"Authors","summary":"","title":"Licheng Jiao","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/lingling-li/","section":"Authors","summary":"","title":"Lingling Li","type":"authors"},{"content":"","date":"1 October 
2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/linlin-huang/","section":"Authors","summary":"","title":"Linlin Huang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/linlin-yang/","section":"Authors","summary":"","title":"Linlin Yang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/liyun-zhu/","section":"Authors","summary":"","title":"Liyun Zhu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/luca-zanella/","section":"Authors","summary":"","title":"Luca Zanella","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/massimiliano-mancini/","section":"Authors","summary":"","title":"Massimiliano Mancini","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/meng-dong/","section":"Authors","summary":"","title":"Meng Dong","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/mike-zheng-shou/","section":"Authors","summary":"","title":"Mike Zheng Shou","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/min-xu/","section":"Authors","summary":"","title":"Min Xu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/moshira-abdalla/","section":"Authors","summary":"","title":"MOSHIRA ABDALLA","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/muaz-al-radi/","section":"Authors","summary":"","title":"MUAZ AL RADI","type":"authors"},{"content":"","date":"1 October 
2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/muchao-ye/","section":"Authors","summary":"","title":"Muchao Ye","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/naoufel-werghi/","section":"Authors","summary":"","title":"NAOUFEL WERGHI","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/nong-sang/","section":"Authors","summary":"","title":"Nong Sang","type":"authors"},{"content":" This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.\nExcept for this watermark, it is identical to the accepted version;\nthe final published version of the proceedings is available on IEEE Xplore.\nOpen-Vocabulary Video Anomaly Detection # Peng Wu 1 , Xuerong Zhou 1 , Guansong Pang 2* , Yujia Sun 3 , Jing Liu 3 , Peng Wang 1∗ , Yanning Zhang 1\n1 Northwestern Polytechnical University, 2 Singapore Management University, 3 Xidian University\n{xdwupeng, zxr2333}@gmail.com, gspang@smu.edu.sg, yjsun@stu.xidian.edu.cn, neouma@163.com\nAbstract # Current video anomaly detection (VAD) approaches with weak supervision are inherently limited to a closed-set setting and may struggle in open-world applications where there can be anomaly categories in the test data unseen during training. A few recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos. However, such a setting focuses on predicting frame anomaly scores and has no ability to recognize the specific categories of anomalies, despite the fact that this ability is essential for building more informed video surveillance systems. This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies. 
To this end, we propose a model that decouples OVVAD into two mutually complementary tasks – class-agnostic detection and class-specific classification – and jointly optimizes both tasks. Particularly, we devise a semantic knowledge injection module to introduce semantic knowledge from large language models for the detection task, and design a novel anomaly synthesis module to generate pseudo unseen anomaly videos with the help of large vision generation models for the classification task. This semantic knowledge and these synthesized anomalies substantially extend our model\u0026rsquo;s capability in detecting and categorizing a variety of seen and unseen anomalies. Extensive experiments on three widely used benchmarks demonstrate that our model achieves state-of-the-art performance on the OVVAD task.\nCorresponding Authors {peng.wang, ynzhang}@nwpu.edu.cn\n1. Introduction # Video anomaly detection (VAD), which aims at detecting unusual events that do not conform to expected patterns, has become a growing concern in both academia and industry due to its promising application prospects in areas such as intelligent video surveillance and video content review. Through several years of vigorous development, VAD has made significant progress, with many works continuously emerging.\nTraditional VAD can be broadly classified into two types based on the supervision mode, i.e., semi-supervised VAD [17] and weakly supervised VAD [38]. The main difference between them lies in the availability of abnormal training samples. Although they differ in supervision mode and model design, both can be roughly regarded as classification tasks. Semi-supervised VAD falls under the category of one-class classification, while weakly supervised VAD pertains to binary classification. 
Specifically, semi-supervised VAD assumes that only normal samples are available during the training stage, and test samples that do not conform to these normal training samples are identified as anomalies, as shown in Fig. 1(a). Most existing methods essentially endeavor to learn the one-class pattern, i.e., the normal pattern, by means of one-class classifiers [50] or self-supervised learning techniques, e.g., frame reconstruction [9], frame prediction [17], and jigsaw puzzles [44]. Similarly, as illustrated in Fig. 1(b), weakly supervised VAD can be seen as a binary classification task under the assumption that both normal and abnormal samples are available during the training phase but the precise temporal annotations of abnormal events are unknown. Previous approaches widely adopt a binary classifier with the multiple instance learning (MIL) [38] or Top-K mechanism [27] to discriminate between normal and abnormal events. In general, existing approaches to both semi-supervised and weakly supervised VAD restrict their focus to classification and use the corresponding discriminator to categorize each video frame. While these practices have achieved significant success on several widely used benchmarks, they are limited to detecting a closed set of anomaly categories and are unable to handle arbitrary unseen anomalies. This limitation restricts their application in open-world scenarios and increases the risk of missed detections, as many real-world anomalies in actual deployment are not present in the training data.\nFigure 1. Comparison of different VAD tasks.\nTo address this issue, a few recent works explore a whole new line of VAD, i.e., open-set VAD [1, 5, 66, 67]. The core purpose of open-set VAD is to train a model with normal and seen abnormal samples to detect unseen anomalies (see Fig. 1(c)). 
For example, the abnormal training samples may include only fighting and shooting events, while the trained model is expected to detect abnormal events that occur in a road accident scene. Compared to traditional VAD, open-set VAD breaks out of the closed-set dilemma and is thus able to deal with open-world problems. Although these works partly reveal their open-world capacity, they fall short in addressing semantic understanding of the abnormal category, which leads to an ambiguous detection process in the open world.\nRecently, large language/vision model pre-training [11, 29, 34, 64] has been phenomenally successful across a wide range of downstream tasks [13–15, 24, 25, 28, 47, 48, 58, 65] on account of its learned cross-modal prior knowledge and powerful transfer learning ability, which also allows us to tackle open-vocabulary video anomaly detection (OVVAD). Therefore, in this paper, we propose a novel model built upon large pre-trained vision/language models for OVVAD that aims to detect and categorize seen and unseen anomalies, as shown in Fig. 1(d). Compared to previous VAD, OVVAD has high application value as it can provide more informed, fine-grained detection results, but it is more challenging since 1) it needs not only to detect but also to categorize the anomalies; 2) it needs to tackle seen (base) as well as unseen (novel) anomalies. To address these challenges, we explicitly disentangle the OVVAD task into two mutually complementary sub-tasks: class-agnostic detection and class-specific categorization. To improve the class-agnostic detection, we make efforts from two aspects. We first introduce a nearly weight-free temporal adapter (TA) module to model temporal relationships, and then introduce a novel semantic knowledge injection (SKI) module designed to incorporate textual knowledge into visual signals with the assistance of large language models. 
To enhance the class-specific categorization, we take inspiration from the contrastive language-image pre-training (CLIP) model [29] and use a scalable way to categorize anomalies, i.e., the alignment between textual labels and videos. Furthermore, we design a novel anomaly synthesis (NAS) module to generate visual materials (e.g., images and videos) that help the model better identify novel anomalies. Based on these operations, our model achieves state-of-the-art performance on three popular benchmarks for OVVAD, attaining 86.40% AUC, 66.53% AP and 62.94% AUC on UCF-Crime [38], XD-Violence [51] and UBnormal [1], respectively.\nWe summarize our contributions as follows:\n\nWe explore video anomaly detection under a challenging yet practically important open-vocabulary setting. To our knowledge, this is the first work for OVVAD.\nWe then propose a model built on top of pre-trained large models that disentangles the OVVAD task into two mutually complementary sub-tasks – class-agnostic detection and class-specific categorization – and jointly optimizes them for accurate OVVAD.\nIn the class-agnostic detection task, we design a nearly weight-free temporal adapter module and a semantic knowledge injection module for substantially enhanced normal/abnormal frame detection.\nIn the fine-grained anomaly classification task, we introduce a novel anomaly synthesis module to generate pseudo unseen anomaly videos for accurate classification of novel anomaly types.\n2. Related Work # Semi-supervised VAD. Mainstream solutions build a normal pattern in a self-supervised manner (e.g., reconstruction and prediction) or a one-class manner. As for the self-supervised manner [8, 54, 56], reconstruction-based approaches [4, 21, 22, 33, 39, 55, 60] typically leverage encoder-decoder frameworks to reconstruct normal events and compute the reconstruction errors, and events with large reconstruction errors are classified as anomalies. 
Follow-up prediction-based approaches [17, 19] focus on predicting the future frame from previous video frames and determine whether it is an anomalous frame by calculating the difference between the predicted frame and the actual frame. Recent work [37] combined reconstruction- and prediction-based approaches to improve detection performance. As for one-class models, some works endeavor to learn normal patterns by making use of one-class frameworks [35], e.g., the one-class support vector machine and its extensions (OCSVM [36], SVDD [50], GODS [45]).\nWeakly supervised VAD. In contrast to semi-supervised VAD, weakly supervised VAD [10, 40] uses normal as well as abnormal samples; it can be regarded as a binary classification task and aims to detect anomalies at the frame level despite the lack of precise temporal annotations. As a pioneering work, Sultani et al. [38] first proposed a large-scale benchmark and trained a lightweight network with the MIL mechanism. Then Zhong et al. [61] proposed a graph convolutional network based approach to capture the similarity relations and temporal relations across frames. Tian et al. [42] introduced self-attention blocks and pyramid dilated convolution layers to capture multi-scale temporal relations. Wu et al. [51, 52] built the largest-scale benchmark that includes audio-visual signals and proposed a multi-task model to deal with coarse- and fine-grained VAD. Zaheer et al. [57] presented a clustering-assisted weakly supervised framework with a novel normalcy suppression mechanism. Li et al. [16] proposed a transformer-based network with self-training multi-sequence learning. Zhang et al. [59] attempted to exploit the completeness and uncertainty of pseudo labels. 
The above approaches simply used video or audio inputs encoded by pre-trained models, such as C3D [43] and I3D [3]. Although a few works [12, 23, 53] introduced CLIP models to the weakly supervised VAD task, they simply used its powerful visual features and ignored the zero-shot ability of CLIP.\nOpen-set VAD. The VAD task naturally carries an open-world requirement. Faced with this requirement, traditional semi-supervised works are more prone to producing many false alarms, while weakly supervised works are effective in detecting known anomalies but may fail on unseen anomalies. Open-set VAD aims to train the model on normal samples and seen anomalies, and attempts to detect unseen anomalies. Acsintoae et al. [1] developed the first benchmark, called UBnormal, for the supervised open-set VAD task. Zhu et al. [67] proposed an approach to the open-set VAD task that integrates evidential deep learning and normalizing flows into a MIL framework. Besides, Ding et al. [5] proposed a multi-head network based model to learn disentangled anomaly representations, with each head dedicated to capturing one specific type of anomaly. Compared to our model, the above works mainly devote themselves to open-world detection and overlook anomaly categorization; moreover, they fail to take full advantage of pre-trained models.\n3. Method # Problem Statement. The studied problem, OVVAD, can be formally stated as follows. Suppose we are given a set of training samples X = {x_i}_{i=1}^{N+A}, where X_n = {x_i}_{i=1}^{N} is the set of normal samples and X_a = {x_i}_{i=N+1}^{N+A} is the set of abnormal samples. Each sample x_i in X_a has a corresponding video-level category label y_i ∈ C_base. Here, C_base represents the set of base (seen) anomaly categories, and C is the union of C_base and C_novel, where C_novel stands for the set of novel (unseen) anomaly categories. 
Based on the training samples X, the objective is to train a model capable of detecting and categorizing both base and novel anomalies. Specifically, the goal of the model is to predict an anomaly confidence for each frame and to identify the anomaly category if anomalies are present in the video.\n3.1. Overall Framework # Traditional methods based on closed-set classification struggle to handle VAD under the open-vocabulary scenario. To this end, we leverage language-image pre-training models, e.g., CLIP, as the foundation thanks to their powerful zero-shot generalization ability. As illustrated in Fig. 2, given a training video, we first feed it into the image encoder of CLIP, Φ_CLIP-v, to obtain frame-level features x_f with shape n × c, where n is the number of video frames and c is the feature dimension. These features then pass through the TA module, the SKI module and a detector to produce the frame-level anomaly confidence p; this pipeline is mainly applied to the class-agnostic detection task. On the other hand, for class-specific categorization, we take inspiration from other open-vocabulary works across different vision tasks [31, 46, 63] and use a cross-modal alignment mechanism. Specifically, we first generate a video-level aggregated feature from the frame-level features, then generate textual features/embeddings of the anomaly categories, and finally estimate the anomaly category based on the alignment between video-level features and textual features. Moreover, we introduce the NAS module to generate potential novel anomalies with the assistance of large language models (LLMs) and AI-generated content (AIGC) models, in order to support the identification of novel categories.\n3.2. Temporal Adapter Module # Temporal dependencies play a vital role in VAD [49, 62]. In this work, we employ the frozen image encoder of CLIP to obtain visual features, but it lacks consideration of temporal dependencies since CLIP is pre-trained on image-text pairs. 
To bridge the gap between images and videos, the use of a temporal transformer [13, 25] has emerged as a routine practice in recent studies. However, such a paradigm suffers from a clear performance degradation on novel categories [13, 32]; a possible reason is that the additional parameters in the temporal transformer could specialise on the training set, thus harming generalisation towards novel categories. Therefore, we design a nearly weight-free temporal adapter for temporal dependencies, which is built on top of classical graph convolutional networks. Mathematically, it can be presented as follows,\nx_t = LN(softmax(H) x_f),\nwhere LN is the layer normalization operation, H is the adjacency matrix, and the softmax normalization is used to ensure that each row of H sums to one. Such a design is used to capture contextual dependencies based on the positional distance between each pair of frames. The adjacency matrix is calculated as follows:\nthe proximity relation between the i-th and j-th frames is determined only by their relative temporal position. σ is a hyperparameter that controls the range of influence of the distance relation. According to this formula, the closer the temporal distance between two frames, the higher the proximity relation score, and vice versa. Notably, across the TA module, only layer normalization involves a few parameters.\n3.3. Semantic Knowledge Injection Module # Humans often make use of prior knowledge when perceiving the environment; for example, we can infer the presence of a fire based on the smell and smoke without directly seeing the flames. Building on this idea, we propose the SKI module to explicitly introduce additional semantic knowledge to assist visual detection. As depicted in Fig. 2, for normal events in videos, we prompt large-scale language models, e.g., ChatGPT [2] and SparkDesk 1, with a fixed template to obtain common scenarios and actions, such as street, park, shopping hall, walking, running, working, etc. Likewise, we generate additional words related to anomaly scenes, including terms like explosion, burst, firelight, etc. Finally, we obtain several phrase lists denoted by M_prior, which consist of nouns (scenes) and verbs (actions). With M_prior in hand, we exploit the text encoder of CLIP to extract textual embeddings as the semantic knowledge, which is shown as follows,\nF_text = Φ_CLIP-t(Φ_token(M_prior)),\nwhere F_text ∈ R^{l×c}, Φ_CLIP-t denotes the text encoder of CLIP, and Φ_token refers to the language tokenizer that converts words into vectors.\nThen, towards the goal of effectively incorporating this semantic knowledge into visual information to boost anomaly detection, we design a cross-modal injection strategy. This strategy encourages visual signals to seek related semantic knowledge and integrate it into the process. Such an operation is demonstrated as,\nwhere F_know ∈ R^{n×c}, and we employ sigmoid instead of softmax to ensure that visual signals can encompass more relevant semantic concepts.\nFinally, we concatenate F_know and x_t, creating an input that contains both visual information and integrated semantic knowledge. We feed this input into a binary detector to generate the anomaly confidence for class-agnostic detection.\n1 https://xinghuo.xfyun.cn\nFigure 2. Overview of our proposed framework.\n3.4. Novel Anomaly Synthesis Module # While current pre-trained vision-language models, such as CLIP, possess impressive zero-shot capacities, their zero-shot performance on various downstream tasks, especially video-related ones, remains far from satisfactory. For the same reason, our model, which is built on these pre-trained vision-language models, is trained on base anomalies and normal samples, making it susceptible to a generalization deficiency when faced with novel anomalies. With the advent of large generative models, generating samples as pseudo training data has become a feasible solution [20, 26]. Consequently, we propose the NAS module to generate a series of pseudo novel anomalies based solely on potential anomaly categories. We then leverage these samples to fine-tune the proposed model for improved categorization and detection of novel anomalies. On the whole, the NAS module consists of three key processes:\nInitially, we prompt LLMs (e.g., ChatGPT, ERNIE Bot [41]) with pre-defined templates prompt_gen, such as: generate ten shorter scene descriptions about \u0026ldquo;Fighting\u0026rdquo; in real world, to produce textual descriptions of potential novel categories. We then employ AIGC models, e.g., DALL·E mini [30] and Gen-2 [7], to generate corresponding images or short videos. This can be represented as follows, where V_gen is the combination of generated images (I_gen) and short videos (S_gen).\nSubsequently, for I_gen, we draw inspiration from [18] and introduce a simple yet effective animation strategy to convert single images into video clips that simulate scene changes. Specifically, given an image, we employ the center crop mechanism with different crop ratios to select corresponding image regions, then resize these regions back to the original size and cascade them to create new video clips S_cat.\nFinally, to mimic the real-world situation where anomaly videos are generally long and untrimmed, we introduce the third step, pseudo anomaly synthesis, by inserting S_cat or S_gen into randomly selected normal videos. Moreover, the insertion position is also randomly chosen. This process yields the final pseudo anomaly samples V_nas. Refer to the supplementary materials for detailed descriptions and results.\nWith V_nas in hand, we fine-tune our model, which was initially trained on X, to enhance its generalization capacity for novel anomalies.\n3.5. 
Objective Functions #\n3.5.1 Training stage without pseudo anomaly samples # For class-agnostic detection, following previous VAD works [27, 49], we use the Top-K mechanism to select the K highest anomaly confidences in both abnormal and normal videos. We compute the average of these selections and feed it into the sigmoid function to obtain the video-level prediction. Here, we set K = n/16 for abnormal videos and K = n for normal videos. Finally, we compute the binary cross entropy L_bce between the video-level predictions and the binary labels.\nIn regard to class-specific categorization, we compute the similarity between aggregated video-level features and textual category embeddings to derive video-level classification predictions. We also use a cross entropy loss function to compute the video-level categorization loss L_ce. Given that OVVAD is a weakly supervised task, we cannot obtain video-level aggregated features directly from frame-level annotations. Following [49], we employ a soft attention based aggregation, as shown below,\nFor textual category embeddings, we are inspired by CoOp [63] and append the learnable prompt to the original category embeddings.\nFor the parameters of the SKI module, namely F_text, we aim for explicit optimization during the training stage. We intend to distinguish between normal knowledge embeddings and abnormal knowledge embeddings. For normal videos, we expect their visual features to have higher similarities with normal knowledge embeddings and lower similarities with abnormal knowledge embeddings. To this end, we first extract the similarity matrix between each video and the textual knowledge embeddings, then select the top 10% highest scores for each frame and compute their average, and finally apply the cross-entropy-based loss L_sim-n. For abnormal videos, we anticipate high similarities between abnormal knowledge embeddings and abnormal video-frame features. 
Since precise frame-level annotations are absent under weak supervision, we employ a hard attention based selection mechanism known as Top-K to locate abnormal regions. The same operations are then performed to compute the loss L_sim-a.\nOverall, during the training phase, we employ three loss functions, with the total loss function given as:\nwhere L_sim is the sum of L_sim-n and L_sim-a.\n3.5.2 Fine-tuning stage with pseudo anomaly samples # After obtaining V_nas from the NAS module, we proceed with fine-tuning our model. Because V_nas is synthetic, it provides frame-level annotations and allows us to optimize our model with full supervision for detection. For categorization, L_ce2 remains the same as L_ce, with the key difference being that labels are available not only for base categories but also for potential novel categories. For detection, L_bce2 is the binary cross entropy loss at the frame level.\nFinally, the total loss function during the fine-tuning phase is shown as:\n4. Experiments #\n4.1. Experiment Setup # Datasets. UCF-Crime [38] is a large-scale VAD dataset for surveillance scenes, containing 13 types of abnormal events. 800 normal videos and 810 abnormal videos are provided for training, and the remaining 140 normal videos and 150 abnormal videos for testing. XD-Violence [51] is the largest VAD benchmark to date; it contains 6 anomalous categories with 3954 videos for training and 800 videos for testing. To align with our model, which supports single-category identification, we exclude videos with multiple categories in XD-Violence. UBnormal [1] is a synthesized benchmark that defines seven types of normal events and 22 types of abnormal events. During training, only 7 abnormal categories are visible, while 12 abnormal categories are used for testing.\nEvaluation metrics. OVVAD entails detecting and categorizing anomalies. To assess detection performance, we employ standard metrics from previous works [38, 51]. 
For UCF-Crime and UBnormal, we use the area under the curve of the frame-level receiver operating characteristic (AUC) to evaluate performance. For XD-Violence, we use the area under the frame-level precision-recall curve (AP). For classification, we report the video-level Top-1 accuracy for abnormal test videos on both UCF-Crime and XD-Violence. UBnormal lacks category labels, so we exclusively report AUC results. During the test phase, we provide these metrics for the entire set of categories, as well as separately for base and novel categories, on both UCF-Crime and XD-Violence.\nImplementation Details. The proposed model is implemented using PyTorch and trained on a single RTX 3090 GPU. The frozen image encoder and text encoder stem from the pre-trained CLIP (ViT-B/16) [6] model. The detector is a modified feed-forward network (FFN) layer in Transformer with ReLU replaced by GELU. In line with existing works, we process 1 out of every 16 frames for each video, and during the training phase, the maximum video length is set to 256. For model optimization, we use the AdamW optimizer with a learning rate of 1e-4 and train for 20 epochs. The batch size is set to 64, consisting of an equal number of normal and abnormal samples. During the fine-tuning phase with pseudo novel anomalies, the learning rate is set to 1e-5 on UBnormal and 5e-6 on UCF-Crime and XD-Violence. The fine-tuning process spans 10 epochs, with each batch containing 10 pseudo novel anomaly videos and 10 base anomaly videos. σ is set to 0.07 across all situations, and λ is set to 1e-1 on UCF-Crime and 1e0 on XD-Violence and UBnormal.\n4.2. Comparison with State-of-the-Arts # In Tab. 1 to Tab. 3, we report comparison results with existing approaches on three public benchmarks. Since prior approaches are designed for closed-set VAD, our focus is primarily on the comparison results for open-set detection. 
For the sake of fairness, most of the compared approaches are reimplemented with the same visual features as our approach. The symbol † indicates that these approaches follow traditional VAD works and use the entire training set, which includes novel anomaly samples. Consequently, these approaches outperform models trained without novel anomaly samples. This underscores the considerable challenge presented by OVVAD from a detection perspective. Regarding the comparison between our approach and other approaches on the OVVAD task, we observe that our approach demonstrates distinct advantages over state-of-the-art counterparts. In fact, our model performs on par with the best competitors that make use of the complete training dataset. For example, our approach surpasses the top-performing model, DMU[62], by 1.26% AUC on UCF-Crime, 2.63% AP on XD-Violence, and 3.03% AUC on UBnormal. Particularly, when it comes to novel categories, our approach exhibits a clear performance advantage compared to other approaches. Notably, Zhu et al. [67] is the first work to tackle open-set VAD, where the symbol ∗ indicates that its category division setup differs from ours. We report its detection results under settings as identical as possible, with the number of novel categories matching ours. Our model outperforms it by a substantial margin on both UCF-Crime and XD-Violence datasets.\n\nTable 1. AUC Comparisons on UCF-Crime.\n\n| Mode | Method | AUC(%) | AUCb(%) | AUCn(%) |\n| --- | --- | --- | --- | --- |\n| Semi | SVM baseline | 50.1 | N/A | N/A |\n| Semi | OCSVM[36] | 63.2 | N/A | N/A |\n| Semi | Hasan et al.[9] | 51.2 | N/A | N/A |\n| Weak | Sultani et al.†[38] | 84.14 | N/A | N/A |\n| Weak | Wu et al.†[51] | 84.57 | N/A | N/A |\n| Weak | AVVD†[52] | 82.45 | N/A | N/A |\n| Weak | RTFM†[42] | 85.66 | N/A | N/A |\n| Weak | DMU†[62] | 86.75 | N/A | N/A |\n| Weak | UMIL†[23] | 86.75 | N/A | N/A |\n| Weak | CLIP-TSA†[12] | 87.58 | N/A | N/A |\n| Weak | Zhu et al.∗[67] | 78.82 | N/A | N/A |\n| Weak | Sultani et al.[38] | 78.25 | 86.31 | 80.12 |\n| Weak | Wu et al.[51] | 82.24 | 90.62 | 84.13 |\n| Weak | RTFM[42] | 84.47 | 92.54 | 85.87 |\n| Weak | DMU[62] | 85.14 | 93.52 | 86.24 |\n| Weak | Ours | 86.4 | 93.80 | 88.20 |\n\nTable 2. AP Comparisons on XD-Violence.\n\n| Mode | Method | AP(%) | APb(%) | APn(%) |\n| --- | --- | --- | --- | --- |\n| | SVM baseline | 50.8 | N/A | N/A |\n| | OCSVM[36] | 28.63 | N/A | N/A |\n| | Hasan et al.[9] | 31.25 | N/A | N/A |\n| | Sultani et al.†[38] | 75.18 | N/A | N/A |\n| | Wu et al.†[51] | 80 | N/A | N/A |\n| | RTFM†[42] | 78.27 | N/A | N/A |\n| | AVVD†[52] | 78.1 | N/A | N/A |\n| | DMU†[62] | 82.41 | N/A | N/A |\n| | CLIP-TSA†[12] | 82.17 | N/A | N/A |\n| Weak | Zhu et al.∗[67] | 64.4 | N/A | N/A |\n| Weak | Sultani et al.[38] | 52.26 | 51.25 | 54.64 |\n| Weak | Wu et al.[51] | 55.43 | 52.94 | 64.10 |\n| Weak | RTFM[42] | 58.99 | 55.72 | 65.97 |\n| Weak | DMU[62] | 63.9 | 60.12 | 71.63 |\n| Weak | Ours | 66.53 | 57.10 | 76.03 |\n\nTable 3. AUC Comparisons on UBnormal.\n\n| Mode | Method | AUC(%) |\n| --- | --- | --- |\n| Semi | Georgescu et al.[8] | 59.3 |\n| Weak | Georgescu et al.[8]+anomalies Sultani et al[38] | 61.3 |\n| Weak | Sultani et al.[38] | 50.3 |\n| Weak | Wu et al.[51] | 53.7 |\n| Weak | RTFM[42] | 60.94 |\n| Weak | DMU[62] | 59.91 |\n| Weak | Ours | 62.94 |\n\nTable 4. Ablation studies with different designed modules on UCF-Crime for detection.\n\n| TA | SKI | NAS | AUC(%) | AUCb(%) | AUCn(%) |\n| --- | --- | --- | --- | --- | --- |\n| × | × | × | 84.79 | 92.75 | 86.73 |\n| √ | × | × | 85.14 | 93.22 | 86.79 |\n| × | √ | × | 85.04 | 92.96 | 86.89 |\n| √ | √ | × | 85.81 | 93.85 | 87.62 |\n| √ | √ | √ | 86.4 | 93.8 | 88.2 |\n\n4.3. Ablation Studies # 4.3.1 Contribution of TA module # As aforementioned, the TA module is devised to capture temporal dependencies, thus enhancing class-agnostic detection abilities. To verify the effectiveness of the TA module, we conduct experiments and present ablation results in Tab. 4 to Tab. 6. It can be seen that, with the inclusion of the TA module, our model achieves a significant performance improvement across various datasets and metrics. More importantly, unlike previous transformer-like temporal modeling modules [13 , 25] used on other open-vocabulary tasks, this nearly weight-free module also shows a clear gain for novel anomaly categories, e.g., adding the TA module results in an improvement of 14.47% AP on XD-Violence.\n\n4.3.2 Contribution of SKI module # In this section, we investigate the contribution of the SKI module to class-agnostic detection. As reported in Tab. 4 to Tab. 
6, SKI module boosts detection performance on all datasets regardless of whether TA module is introduced or not. Similar to TA module, SKI can also clearly improve performance for novel anomaly categories. The difference from the TA module is that the SKI module leverages LLMs to explicitly introduce semantic knowledge into visual signals, and this knowledge helps better distinguish between normal and abnormal events.\n\nTable 5. Ablation studies with different designed modules on XD-Violence for detection.\n\n| TA | SKI | NAS | AP(%) | APb(%) | APn(%) |\n| --- | --- | --- | --- | --- | --- |\n| × | × | × | 53.11 | 54.84 | 53.69 |\n| √ | × | × | 60.13 | 59.38 | 68.16 |\n| × | √ | × | 56.62 | 53.03 | 63.92 |\n| √ | √ | × | 65.6 | 61.4 | 73.67 |\n| √ | √ | √ | 66.53 | 57.1 | 76.03 |\n\nTable 6. Ablation studies with different designed modules on UBnormal for detection.\n\n| TA | SKI | NAS | AUC(%) | AP(%) |\n| --- | --- | --- | --- | --- |\n| × | × | × | 60.51 | 65.18 |\n| √ | × | × | 61.18 | 67.36 |\n| × | √ | × | 61.93 | 66.36 |\n| √ | √ | × | 61.67 | 67.43 |\n| √ | √ | √ | 62.94 | 68.07 |\n\nTable 7. Ablation studies on UCF-Crime and XD-Violence for categorization.\n\n| Setting | ACC(%) | ACCb(%) | ACCn(%) |\n| --- | --- | --- | --- |\n| w/o NAS | 37.86 | 43.14 | 34.83 |\n| Finetune N | 39.29 | 37.25 | 40.45 |\n| Finetune N+B(Ours) | 41.43 | 49.02 | 37.08 |\n| w/o NAS | 59.6 | 91.98 | 15.18 |\n| Finetune N | 62.03 | 82.06 | 34.55 |\n| Finetune N+B(Ours) | 64.68 | 89.31 | 30.9 |\n\nTable 8. Cross-dataset results on UCF-Crime and XD-Violence.\n\n| Train ⇓ / Test ⇒ | UCF-Crime AUC(%) | UCF-Crime ACC(%) | XD-Violence AP(%) | XD-Violence ACC(%) |\n| --- | --- | --- | --- | --- |\n| UCF-Crime | 86.05 | 45.00 | 63.74 | 47.90 |\n| XD-Violence | 82.42 | 40.71 | 82.86 | 88.96 |\n\n4.3.3 Contribution of NAS module # From Tab. 4 to Tab. 6, we can see that, for all test categories and novel anomaly categories, the NAS module obtains a significant performance gain for class-agnostic detection. For base categories, it causes a relatively small performance degradation, which is also observed in other open-vocabulary tasks. We argue that the introduction of pseudo novel samples makes the model pay more attention to these generated novel samples, thus partially diminishing the importance of base categories. Moreover, Tab. 
7 reveals that NAS module also obtains a significant performance gain for the class-specific categorization, especially for novel anomaly categories. Besides, we also found that using only generated novel samples during the fine-tuning phase results in a clear performance drop for base anomaly categories. This illustrates that, while generated anomaly samples benefit the generalization abilities of our model, it is essential to adopt reasonable and effective fine-tuning schemes.\n\nFigure 3. Qualitative results of our model on testing videos. Colored window denotes ground-truth anomalous region.\n\n4.3.4 Analysis of cross-dataset ability # To further investigate the zero-shot abilities of our model, we conduct experiments in which we train our model under the cross-dataset setup. In this case, we take UCF-Crime and XD-Violence as examples. These datasets have some overlapping categories but completely different sources, with UCF-Crime developed from surveillance videos and XD-Violence collected from movies and online videos. From the evaluation results in Tab. 8, we can draw the following conclusions: First, our model achieves better performance with the whole set of training samples. Second, the cross-dataset test results show that our model can compete with or outperform current approaches on both UCF-Crime and XD-Violence, further validating the favorable generalization abilities of the proposed model.\n\n4.4. Qualitative Results # We first present qualitative detection results on three datasets in Fig. 3, where the top row denotes UCF-Crime, the first three in the bottom row denote XD-Violence, and the rest denote UBnormal. As we can see, for both base and novel categories, our method produces high anomaly confidence in anomaly regions, even when there are multiple discontinuous abnormal regions in a video. Besides, we present confusion matrices of anomaly categorization in Fig. 
4. It is not hard to see that there are some anomaly categories, either base or novel, that our model cannot effectively identify, especially on UCF-Crime. Such results indicate that OVVAD is a unique and challenging task, especially for anomaly categorization. Refer to supplementary materials for more ablation studies and qualitative results.\n\nFigure 4. Confusion matrices of anomaly categorization.\n\n5. Conclusion # In this paper, we present a new model built on top of pre-trained large models for the open-vocabulary video anomaly detection task under weak supervision. Owing to the challenging nature of open-vocabulary video anomaly detection, current video anomaly detection approaches face difficulties in working efficiently. To address these unique challenges, we explicitly disentangle open-vocabulary video anomaly detection into the class-agnostic detection and class-specific classification sub-tasks. We then introduce several ad-hoc modules: the temporal adapter and semantic knowledge injection modules mainly aim at promoting detection for both base and novel anomalies, while the novel anomaly synthesis module generates several potential pseudo novel samples to assist the proposed model in perceiving novel anomalies more accurately. Extensive experiments on three public datasets demonstrate that the proposed model performs advantageously on the open-vocabulary video anomaly detection task. In the future, generating more vivid pseudo anomaly samples in the form of videos with the assistance of AIGC models is yet to be explored.\n\n6. Acknowledgments # This work is supported by the National Natural Science Foundation of China (No. 62306240, U23B2013), China Postdoctoral Science Foundation (No. 2023TQ0272), National Key R\u0026D Program of China (No. 2020AAA0106900), and the Fundamental Research Funds for the Central Universities (No. 
D5000220431).\nReferences # [1] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 20143–20153, 2022. 2 , 3 , 6\n[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 4\n[3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. 3\n[4] Yang Cong, Junsong Yuan, and Ji Liu. Sparse reconstruction cost for abnormal event detection. In CVPR 2011, pages 3449–3456. IEEE, 2011. 2\n[5] Choubo Ding, Guansong Pang, and Chunhua Shen. Catching both gray and black swans: Open-set supervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7388– 7398, 2022. 2 , 3\n[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 6\n[7] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023. 5\n[8] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. 
Anomaly detection in video via selfsupervised and multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12742–12752, 2021. 2 , 6\n[9] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016. 1 , 6\n[10] Chao Huang, Chengliang Liu, Jie Wen, Lian Wu, Yong Xu, Qiuping Jiang, and Yaowei Wang. Weakly supervised video anomaly detection via self-guided temporal discriminative transformer. IEEE Transactions on Cybernetics, 2022. 2\n[11] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021. 2\n[12] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In 2023 IEEE International Conference on Image Processing (ICIP), pages 3230–3234. IEEE, 2023. 3 , 6\n[13] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 105–124. Springer, 2022. 2 , 3 , 7\n[14] Chen Ju, Zeqian Li, Peisen Zhao, Ya Zhang, Xiaopeng Zhang, Qi Tian, Yanfeng Wang, and Weidi Xie. Multi-modal prompting for low-shot temporal action localization. arXiv preprint arXiv:2303.11732, 2023.\n[15] Dahun Kim, Anelia Angelova, and Weicheng Kuo. Regionaware pretraining for open-vocabulary object detection with vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11144–11154, 2023. 
2\n[16] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multisequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1395–1403, 2022. 3\n[17] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018. 1 , 2\n[18] Yu Liu, Huai Chen, Lianghua Huang, Di Chen, Bin Wang, Pan Pan, and Lisheng Wang. Animating images to transfer clip for video-text retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1906–1911, 2022. 5\n[19] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13588–13597, 2021. 2\n[20] Zuhao Liu, Xiao-Ming Wu, Dian Zheng, Kun-Yu Lin, and Wei-Shi Zheng. Generating anomalies for video anomaly detection with prompt-based feature mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24500–24510, 2023. 4\n[21] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013. 2\n[22] Weixin Luo, Wen Liu, and Shenghua Gao. Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 439–444. IEEE, 2017. 2\n[23] Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, and Hanwang Zhang. Unbiased multiple instance learning for weakly supervised video anomaly detection. arXiv preprint arXiv:2303.12369, 2023. 3 , 6\n[24] Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, and Tao Xiang. 
Zero-shot temporal action detection via vision-language prompting. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pages 681–697. Springer, 2022. 2\n[25] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV, pages 1–18. Springer, 2022. 2 , 3 , 7\n[26] Minheng Ni, Zitong Huang, Kailai Feng, and Wangmeng Zuo. Imaginarynet: Learning object detectors without real images and annotations. arXiv preprint arXiv:2210.06886 , 2022. 4\n[27] Yujiang Pu, Xiaoyu Wu, and Shengjin Wang. Learning prompt-enhanced context features for weakly-supervised video anomaly detection. arXiv preprint arXiv:2306.14451 , 2023. 1 , 5\n[28] Jie Qin, Jie Wu, Pengxiang Yan, Ming Li, Ren Yuxi, Xuefeng Xiao, Yitong Wang, Rui Wang, Shilei Wen, Xin Pan, et al. Freeseg: Unified, universal and open-vocabulary image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19446– 19455, 2023. 2\n[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2\n[30] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 5\n[31] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with contextaware prompting. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18082–18091, 2022. 3\n[32] Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6545–6554, 2023. 3\n[33] Nicolae-Catalin Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, and Mubarak Shah. Self-distilled masked auto-encoders are efficient video anomaly detectors. arXiv preprint arXiv:2306.12041, 2023. 2\n[34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2\n[35] Mohammad Sabokrou, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli. Adversarially learned one-class classifier for novelty detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3379–3388, 2018. 2\n[36] Bernhard Schölkopf, Robert C Williamson, Alex Smola, John Shawe-Taylor, and John Platt. Support vector method for novelty detection. Advances in neural information processing systems, 12, 1999. 2 , 6\n[37] Chenrui Shi, Che Sun, Yuwei Wu, and Yunde Jia. Video anomaly detection via sequentially learning multiple pretext tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10330–10340, 2023. 2\n[38] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018. 1 , 2 , 5 , 6\n[39] Shengyang Sun and Xiaojin Gong. Hierarchical semantic contrast for scene-aware video anomaly detection. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22846–22856, 2023. 2\n[40] Shengyang Sun and Xiaojin Gong. Long-short temporal co-teaching for weakly supervised video anomaly detection. arXiv preprint arXiv:2303.18044, 2023. 2\n[41] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. Ernie 2.0: A continual pretraining framework for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8968–8975, 2020. 5\n[42] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4975–4986, 2021. 3 , 6\n[43] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015. 3\n[44] Guodong Wang, Yunhong Wang, Jie Qin, Dongming Zhang, Xiuguo Bao, and Di Huang. Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles. In European Conference on Computer Vision, pages 494–511. Springer, 2022. 1\n[45] Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8201–8211, 2019. 2\n[46] Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021. 3\n[47] Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, and YuGang Jiang. Transforming clip to an open-vocabulary video model via interpolated weight optimization. arXiv preprint arXiv:2302.00624, 2023. 
2\n[48] Jianzong Wu, Xiangtai Li, Shilin Xu, Haobo Yuan, Henghui Ding, Yibo Yang, Xia Li, Jiangning Zhang, Yunhai Tong, Xudong Jiang, Bernard Ghanem, et al. Towards open vocabulary learning: A survey. arXiv preprint arXiv:2306.15880, 2023. 2\n[49] Peng Wu and Jing Liu. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing, 30:3513–3527, 2021. 3 , 5\n[50] Peng Wu, Jing Liu, and Fang Shen. A deep one-class neural network for anomalous event detection in complex scenes. IEEE transactions on neural networks and learning systems, 31(7):2609–2622, 2019. 1 , 2\n[51] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 322–339. Springer, 2020. 2 , 3 , 6\n[52] Peng Wu, Xiaotao Liu, and Jing Liu. Weakly supervised audio-visual violence detection. IEEE Transactions on Multimedia, pages 1674–1685, 2022. 3 , 6\n[53] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024. 3\n[54] Qingsen Yan, Tao Hu, Yuan Sun, Hao Tang, Yu Zhu, Wei Dong, Luc Van Gool, and Yanning Zhang. Towards high-quality hdr deghosting with conditional diffusion models. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2023. 2\n[55] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. Video event restoration based on keyframes for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14592–14601, 2023. 2\n[56] Guang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin, and Marius Kloft. 
Cloze test helps: Effective video anomaly detection via learning to complete video events. In Proceedings of the 28th ACM International Conference on Multimedia, pages 583–591, 2020. 2 [57] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, pages 358–376. Springer, 2020. 3 [58] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and ShihFu Chang. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14393–14402, 2021. 2 [59] Chen Zhang, Guorong Li, Yuankai Qi, Shuhui Wang, Laiyun Qing, Qingming Huang, and Ming-Hsuan Yang. Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16271–16280, 2023. 3 [60] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia, pages 1933–1941, 2017. 2 [61] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1237–1246, 2019. 3 [62] Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. arXiv preprint arXiv:2302.05160, 2023. 3 , 6 , 7 [63] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022. 
3 , 5 [64] Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen. Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection. arXiv preprint arXiv:2310.18961, 2023. 2 [65] Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yifan Liu. Zegclip: Towards adapting clip for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11175–11185, 2023. 2 [66] Jiawen Zhu, Choubo Ding, Yu Tian, and Guansong Pang. Anomaly heterogeneity learning for open-set supervised anomaly detection. arXiv preprint arXiv:2310.12790, 2023. 2 [67] Yuansheng Zhu, Wentao Bao, and Qi Yu. Towards open set video anomaly detection. In European Conference on Computer Vision, pages 395–412. Springer, 2022. 2 , 3 , 6 , 7 ","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/wu_open-vocabulary_video_anomaly_detection_cvpr_2024_paper/","section":"Papers","summary":"This paper explores open-vocabulary video anomaly detection (OVVAD) leveraging pre-trained large models to detect and categorize seen and unseen anomalies. 
It proposes a disentangled approach with class-agnostic detection and class-specific classification modules, enhanced by semantic knowledge injection, anomaly synthesis, and joint optimization, to achieve state-of-the-art performance.","title":"Open-Vocabulary Video Anomaly Detection","type":"other"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/pan-he/","section":"Authors","summary":"","title":"Pan He","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/peipei-yang/","section":"Authors","summary":"","title":"Peipei Yang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/puning-zhao/","section":"Authors","summary":"","title":"Puning Zhao","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/qianru-sun/","section":"Authors","summary":"","title":"Qianru Sun","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/qianyue-bao/","section":"Authors","summary":"","title":"Qianyue Bao","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/qifeng-chen/","section":"Authors","summary":"","title":"Qifeng Chen","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/qixiang-chen/","section":"Authors","summary":"","title":"Qixiang Chen","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/ruidi-fan/","section":"Authors","summary":"","title":"Ruidi Fan","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/ruixu-zhang/","section":"Authors","summary":"","title":"Ruixu Zhang","type":"authors"},{"content":"","date":"1 
October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/ruizheng-wu/","section":"Authors","summary":"","title":"Ruizheng Wu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/sajid-javed/","section":"Authors","summary":"","title":"SAJID JAVED","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/sanping-zhou/","section":"Authors","summary":"","title":"Sanping Zhou","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/shanjun-zhang/","section":"Authors","summary":"","title":"Shanjun Zhang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/shao-yuan-lo/","section":"Authors","summary":"","title":"Shao-Yuan Lo","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/shibo-gao/","section":"Authors","summary":"","title":"Shibo Gao","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/shiwei-lin/","section":"Authors","summary":"","title":"Shiwei Lin","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/shizhen-zhao/","section":"Authors","summary":"","title":"Shizhen Zhao","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/shuo-li/","section":"Authors","summary":"","title":"Shuo Li","type":"authors"},{"content":" Simplifying Traffic Anomaly Detection with Video Foundation Models # Svetlana Orlova, Tommie Kerssies, Bruno B. 
Englert, Gijs Dubbelman Eindhoven University of Technology\n{s.orlova, t.kerssies, b.b.englert, g.dubbelman}@tue.nl\nAbstract # Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) strong pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: https://github.com/tue-mps/simple-tad .\n1. Introduction # Traffic risk estimation is fundamental to safe driving, as failures to anticipate danger can lead to life-threatening consequences. Autonomous vehicles must therefore assess potential hazards in real time, even under uncertain, dynamic, and unfamiliar conditions. 
A common formulation for this problem is the ego-centric traffic anomaly detection (TAD) task [13, 48], which aims to identify abnormal or dangerous events in a video stream captured by a vehicle-mounted camera. Analyzing the top-performing TAD methods [24, 25, 32, 51], we find that they rely on specialized, complex architectures illustrated in Fig. 2(b) and (c): two-stage approaches [10, 34, 51], which combine a vision encoder with a temporal component, and multi-representation fusion approaches [24, 25, 32], which fuse additional representations, often generated by separate deep neural networks or model-based algorithms. While these complex designs have improved performance, their impact on efficiency has not been evaluated, even though this is crucial for TAD, where rapid detection is needed to enable timely action and prevent accidents.\nFigure 1. Traffic Anomaly Detection (TAD) performance on DADA-2000 [13]. Simple encoder-only models with strong pre-training (blue) are faster and more accurate than recent multi-component architectures (orange and red), see Tabs. 2 and 3.\nFigure 2. Types of model architectures for TAD: simple encoder-only model (a), two-stage design (b), and multi-representation fusion architecture (c).\nConsidering that TAD is, in essence, a binary classification task on video, we turn to recent methods for general video classification. On standard video classification benchmarks [8, 15, 19, 36], Video Foundation Models (ViFMs) achieve state-of-the-art performance, predominantly Video Vision Transformers (ViTs) [1, 11], which rely on large-scale self- and weakly-supervised pre-training to learn expressive and transferable spatiotemporal representations, rather than on architectural inductive biases. Indeed, recent work in related visual perception tasks shows that strong pre-training reduces the need for downstream task-specific components [20, 21, 41]. 
We hypothesize that the same applies to TAD, such that a simple ViT-based ViFM can be effectively applied to this task and match or even outperform complex architectures. Since TAD depends on motion understanding, we investigate whether pre-training strategies that capture spatio-temporal structure are particularly effective.\nTo test our hypotheses, we evaluate multiple ViT-based ViFMs, which share the same plain Video ViT architecture but differ in pre-training, on two TAD datasets, DoTA [48] and DADA-2000 [13]. We adopt an encoder-only design, with a single linear layer on top of the Video ViT model, as illustrated in Fig. 2(a). While prior work has attempted the encoder-only design [10], it was found to be inferior [24, 32, 34]. We revisit this design using stronger pre-training. Similar to prior work, we assess in-domain and out-of-domain generalization, and unlike prior work, we also assess computational efficiency of the models, and compare to the best-performing specialized TAD methods.\nWe confirm our hypothesis by showing that strong pre-training enables a plain Video ViT, used as an encoder-only model for TAD, to match and even surpass state-of-the-art methods while also being significantly more efficient, as shown in Fig. 1 and Tab. 2. Interestingly, performance on standard video classification benchmarks does not correlate with TAD performance. Our comparison of pre-trained models shows that weak supervision from language and full supervision from class labels are effective on standard benchmarks but less so for TAD, likely because they promote appearance-focused features that generalize poorly to anomalous motion [45]. In contrast, self-supervised learning with Masked Video Modeling (MVM), which trains the model to reconstruct missing spatio-temporal regions using both spatial structure and temporal continuity, proves most effective for TAD. Models pre-trained with this objective achieve state-of-the-art performance at their respective model size, as shown in Tab. 
3.\nNext, motivated by the scarcity of labeled data for TAD and the abundance of unlabeled driving videos, we explore whether we can leverage self-supervised learning to better adapt an off-the-shelf ViFM to the downstream domain. Specifically, we apply Domain-Adaptive Pre-Training (DAPT) [16] using the Video Masked Autoencoding (MAE) [38] approach. We find that MAE-based DAPT, even when applied at relatively small scale compared to the preceding generic pre-training, boosts performance further, particularly for smaller models, as shown in Fig. 3. Importantly, we find that including abnormal driving examples for DAPT is not necessary, as shown in Tab. 4, which is valuable given the difficulty of collecting them at scale.\nWe summarize our main contributions as follows:\n\nWe show that a plain encoder-only Video ViT, when equipped with strong pre-training, outperforms all prior specialized architectures for TAD, while also being significantly more efficient.\nWe compare pre-training strategies and find that self-supervised learning with a Masked Video Modeling (MVM) objective is most effective, outperforming both weakly- and fully-supervised alternatives.\nWe demonstrate that Domain-Adaptive Pre-training with MVM leads to measurable performance gains on TAD, even when applied at a small scale to domain-relevant but anomaly-free data.\nMoreover, since every TAD pipeline relies on its visual encoder, our results offer clear guidance for selecting effective pre-training strategies, enabling future methods to detect traffic anomalies more accurately and robustly.\n2. Related work # 2.1. Traffic Anomaly Detection # Traffic Anomaly Detection (TAD) is typically framed as a binary video classification task, where the goal is to detect potentially dangerous or abnormal events in traffic scenarios from the egocentric viewpoint of a vehicle-mounted camera. 
While related to the broader field of Video Anomaly Detection (VAD), which commonly targets static-camera surveillance settings, TAD poses unique challenges due to ego-motion, frequent occlusions, and the dynamic interaction of agents in complex driving environments.\nBefore the availability of annotated datasets for TAD, earlier approaches often relied on unsupervised reconstruction or prediction of video frames to flag anomalies using temporal autoencoders [29, 43], future frame prediction with spatial, temporal, and adversarial losses [26], or spatiotemporal tubelet modeling with ensemble scoring [48]. Some works also explored the use of synthetic data for training [22, 35].\nThe introduction of large-scale driving anomaly datasets [13, 48] with comprehensive annotations has enabled more active development of fully-supervised methods and led to substantial improvements in detection performance. As shown in Fig. 2, we categorize TAD methods into three classes based on their architectural complexity, which we detail below.\nEncoder-only design (Fig. 2a). Since TAD is a binary classification task, a minimal solution consists of a feature encoder followed by a linear classifier, without additional task-specific modules. Prior work [10] shows that such designs can be effective in related tasks, where R(2+1)D [39] and ViViT [1] demonstrate strong performance. However, several recent studies report underwhelming results for encoder-only ablation variants of their methods [24, 32]. Our evaluation of R(2+1)D and ViViT models on standard TAD benchmarks (see Tab. 2) reveals a substantial performance gap between these encoder-only models and current state-of-the-art methods.\nTwo-stage design (Fig. 2b). Two-stage methods combine a visual encoder with a separate temporal module. 
VidNeXt [10] pairs a ConvNeXt [27] backbone with a non-stationary transformer (NST) to model both stable and dynamic temporal patterns, and is introduced together with the CycleCrash dataset [10] for the related task of collision prediction. Its ablations, ConvNeXt+VT and ResNet+NST, also yield strong results. MOVAD [34] uses a VideoSwin Transformer [28] as a short-term memory encoder over several frames, followed by an LSTM-based long-term module, achieving state-of-the-art performance for TAD.\nMulti-representation fusion design (Fig. 2c). Fusion-based models, which currently report state-of-the-art performance in TAD, explicitly combine multiple information sources. TTHF [24] augments the CLIP [33] framework with a high-frequency motion encoder and a cross-modal fusion module to align motion features with textual prompts. PromptTAD [32] extends MOVAD [34] by incorporating bounding box prompts via instance- and relation-level attention, enhancing object-centric anomaly localization. ISCRTAD [25] integrates agent features (e.g., appearance, trajectory, depth) using graph-based modeling and fuses them with scene context through contrastive multimodal alignment for robust anomaly detection.\nIn this work, we follow the simple encoder-only design and explore whether strong pre-training can compensate for the lack of task-specific architectural inductive biases. We demonstrate that, when equipped with rich priors from large-scale self-supervised pre-training, such models can achieve state-of-the-art performance, while remaining architecturally simple and highly efficient.\n2.2. Video Foundation Models # While early 3D CNN-based architectures can be referred to as foundation models [30], the term foundation model today commonly refers to Transformer-based [40] models that leverage large-scale pre-training. 
For vision, these models typically adopt the Vision Transformer [11] (ViT) architecture.\nUnlike the convolutional architecture, which embeds strong spatial and temporal inductive biases, ViTs rely on learning such priors directly from data during pre-training [11]. The quality and scale of this pre-training directly influence their effectiveness in downstream tasks [50]. Pre-training methods vary in supervision type and scalability. Fully-supervised approaches rely on manually annotated labels, providing precise semantic guidance but limited scalability. Weakly-supervised methods, such as CLIP [33], use natural language or metadata as training signals. Though less curated, they offer rich semantic structure and broader concept coverage from web-scale data. Self-supervised learning (SSL) methods, including masked modeling [17, 38], learn from the data itself without any annotations, enabling large-scale pre-training and highly transferable representations. Unlike weak supervision, which relies on sparse and often noisy text labels [33], SSL provides denser and less biased training signals [6, 17].\nThe first Video ViTs used supervised classification as their pre-training method. ViViT [1] adapted the ViT architecture to video by introducing spatiotemporal 3D cubes called tubelets instead of the 2D patches used for images. While this demonstrated that attention-based models can handle video inputs, the method struggled to balance accuracy and efficiency. Subsequent models like TimeSformer [4], MViT [12], and VideoSwin [28] focused on improving efficiency.\nVideoMAE [38] adopted masked autoencoding (MAE), a type of Masked Video Modeling (MVM), as an effective and efficient self-supervised pre-training strategy for plain video ViTs. Its tube masking, when applied to a large fraction of input patches, forces the model to infer spatiotemporal structure from limited visible content. 
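As a rough illustration of the tube masking idea described above, the following Python sketch draws one random spatial mask and repeats it at every temporal position, so that each masked location forms a tube through time. The function name tube_mask, the 8 x 14 x 14 token grid, and the 0.75 ratio are illustrative assumptions and are not taken from the VideoMAE implementation.

```python
import random


def tube_mask(t_tokens, h_tokens, w_tokens, mask_ratio, seed=0):
    '''Return a nested list [t][h][w] of booleans, True meaning masked.

    Hypothetical helper for illustration only: one spatial mask is sampled
    and shared by all temporal positions, which is what makes it a tube.
    '''
    rng = random.Random(seed)
    n_spatial = h_tokens * w_tokens
    n_masked = round(mask_ratio * n_spatial)
    # Sample which spatial positions are hidden from the encoder.
    masked_positions = set(rng.sample(range(n_spatial), n_masked))
    spatial = [
        [(row * w_tokens + col) in masked_positions for col in range(w_tokens)]
        for row in range(h_tokens)
    ]
    # Copy the same spatial pattern to every temporal index.
    return [[row[:] for row in spatial] for _ in range(t_tokens)]


# Assumed grid: 8 temporal x 14 x 14 spatial tokens, 75% of tokens masked.
mask = tube_mask(8, 14, 14, 0.75)
masked_per_slice = sum(value for row in mask[0] for value in row)
visible_tokens = 8 * (14 * 14 - masked_per_slice)
```

Because the pattern is identical across time, a masked location cannot be recovered by copying it from a neighbouring frame, which is the intuition behind the claim that the model must infer structure from the visible content.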
This approach yielded strong results while maintaining architectural simplicity and has since inspired a series of ViT-based Video Foundation Models (ViFMs) that employ self-supervised pre-training [44–46].\nTable 1. Overview of ViFMs. For models trained via distillation, we denote the supervision type(s) used for the teacher. FSL: fully-supervised, WSL: weakly-supervised, SSL: self-supervised learning.\nYear | Model | Stage | Type | Objective | Supervision\n2021 | ViViT [1] | Stage 1 | FSL | Classification | Class labels\n2022 | VideoMAE [38] | Stage 1 | SSL | Masked Autoencoder | Video frame pixels\n2023 | MVD [45] | Stage 1 | SSL | Masked feature distillation | High-level features of VideoMAE and ImageMAE teachers\n2023 | VideoMAE2 [44] | Stage 1 | SSL | Dual MAE | Video frame pixels\n2023 | VideoMAE2 [44] | Stage 2 | FSL | Classification | Class labels\n2023 | VideoMAE2 [44] | Stage 3 | FSL | Logit distillation | Logits of larger VideoMAE2 after stage 2\n2024 | InternVideo2 [46] | Stage 1 | WSL+SSL | Unmasked feature distillation | Features of VideoMAE2 and a vision-language encoder\n2024 | InternVideo2 [46] | Stage 2 | WSL | Feature distillation + Contrastive | Features of audio and text encoders + video, text, audio\n2024 | InternVideo2 [46] | Stage 3 | WSL+SSL | Feature distillation | Features of InternVideo2 after stage 2 across multiple depths\nWhile other architectures such as recurrent, hybrid, and state-space models are active research areas [23, 31, 47], at the moment, ViT-based ViFMs are arguably the dominant paradigm due to their strong performance, scalability, versatile pre-training strategies, and widespread optimization support (e.g., FlashAttention [9], optimized libraries, hardware acceleration) designed around the plain Transformer [11] architecture. 
These qualities, along with the growing availability of pre-trained Video ViTs, make them particularly promising for tasks like traffic anomaly detection (TAD), where generalization, robustness, and efficiency are critical.\nTo our knowledge, we are the first to investigate which type of video pre-training is most effective for the TAD task, and hypothesize that MVM, with its emphasis on learning patch-dense and temporally-aware representations, is particularly well-suited for this task.\n3. Methodology # We fine-tune multiple ViT-based Video Foundation Models (ViFMs) for TAD and evaluate their performance against recent specialized TAD methods. We follow the encoder-only design and attach a single linear classification head to the output of the final encoder layer. This minimal design ensures that performance primarily reflects the effectiveness of the ViFM backbone in capturing patterns relevant for traffic anomaly detection. We investigate (i) whether a simple Video ViT model, pre-trained at scale, can achieve state-of-the-art performance on TAD, (ii) whether better general ViFMs are also better for TAD, and what type of pre-training is more effective, and (iii) whether small-scale domain-adaptive pre-training (DAPT) is feasible and effective for adapting Video ViTs to the driving domain.\n3.1. Task definition # We formulate Traffic Anomaly Detection (TAD) as a binary classification task, specifically focusing on frame-level, ego-centric anomaly classification, where each frame captured from a moving vehicle-mounted camera is assigned an anomaly label.\nLet X_t = {I_{t−τ+1}, I_{t−τ+2}, …, I_t} denote a time-ordered sliding window of τ consecutive video frames captured from a vehicle-mounted camera up to time t. 
Each I_k represents an RGB frame at time step k, from the egocentric viewpoint of the vehicle.\nThe task is to learn a function f_θ that maps an input window X_t to a prediction A_t at each timestep t:\nA_t = f_θ(X_t),\nwhere A_t ∈ {0, 1} is a binary label that indicates whether an anomaly is observed at time t.\nIn the general case, τ can be 1 and f_θ may also maintain an internal hidden state or operate in an autoregressive manner, explicitly conditioning on previously predicted outputs.\n3.2. Evaluation Procedure # Prior work typically reports the Area Under the Receiver Operating Characteristic Curve (AUCROC) as the primary evaluation metric for TAD [24, 25, 32, 48, 51], and we adopt this metric when comparing to previous methods. However, handling data imbalance is especially important in TAD, so in our evaluations we use the Matthews Correlation Coefficient (MCC), which has been used in related work [18, 37, 42]. MCC takes into account all entries of the confusion matrix, including true negatives, and better reflects overall performance under class imbalance [7]. MCC at a given threshold is defined as:\nMCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),\nwhere TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Note that MCC ranges from −1 (inverse prediction) to 1 (perfect prediction), with 0 indicating random performance, but we report it in the range [−100, 100] to improve readability.\nTo assess discriminative ability independently of the decision threshold, we compute MCC across thresholds in the range [0, 1] and report the area under this curve, referred to as the Area Under the MCC Curve (AUCMCC). We also report MCC at a fixed threshold of 0.5 (MCC@0.5).\nBeyond metric design, we implement a broader protocol focused on generalization and efficiency. We evaluate in-domain performance, out-of-domain performance, and computational cost.\n3.3. 
Pre-trained Encoders # We select a range of recent ViFMs that represent various pre-training strategies, and apply them to the TAD task; see Tab. 1 for an overview of their pre-training strategies. When possible, we select variants pre-trained on Kinetics-400 [19] for consistency.\nWe include ViViT [1] as a baseline to represent fully-supervised pre-training. VideoMAE [38], MVD [45], and VideoMAE2 [44] are selected to evaluate progressively stronger variants of self-supervised pre-training from videos. We also assess InternVideo2 [46], which combines self-supervised learning from videos and weakly-supervised learning from multiple modalities, and is one of the leading models across numerous video benchmarks. Together, these ViT-based models cover a diverse range of pre-training strategies.\nFor completeness, we also include the fully-convolutional R(2+1)D model, pre-trained in a fully-supervised manner, motivated by recent studies showing its competitive performance in the related task of collision anticipation [10].\n3.4. Domain-Adaptive Pre-Training # To better align the Video ViT encoder with the driving domain, we adopt the Domain-Adaptive Pre-Training (DAPT) strategy, a simple method originally proposed in the field of natural language processing [16]. DAPT introduces an additional pre-training stage between generic pre-training and downstream fine-tuning, using unlabeled data from the target domain.\nWe apply VideoMAE-based [38] DAPT as follows:\n\nStep 1: Generic pre-training. As before, we initialize the encoder with an off-the-shelf VideoMAE model pre-trained on large-scale generic video data, mostly unrelated to the driving domain.\nStep 2: Domain-Adaptive Pre-training (DAPT). 
We continue pre-training the same model on a medium-sized dataset of unlabeled driving videos using the exact same VideoMAE reconstruction objective:\nL_MAE = ‖M ⊙ (x − f_θ(x_masked))‖²,\nwhere x is the input video, x_masked is the masked input, f_θ is the encoder-decoder VideoMAE model, M is the binary mask, and ⊙ is element-wise multiplication.\nStep 3: Fine-tuning on TAD. As before, we fine-tune the encoder-only model on TAD datasets using the same configuration with a simple linear classification head.\nThe intermediate DAPT step (Step 2) specializes the model towards the driving domain without requiring any labels. It introduces no additional parameters, preserves model efficiency, and remains fully compatible with standard VideoMAE pipelines.\n4. Experiments # 4.1. Experimental setup # Datasets. We evaluate on DoTA [48] and DADA-2000 [13], two large-scale real-world driving anomaly datasets with temporal and frame-level annotations. DAPT uses Kinetics-700 [5], BDD100K [49], and CAP-DATA [14], described in detail in the Supplementary.\nModel input. All Video ViTs and R(2+1)D are trained on sliding windows of size 224×224×16 at 10 FPS (1.5s temporal context) by default. For InternVideo2, which uses tubelets of size 1, we use 224×224×8 at 5 FPS to match the same duration. MOVAD processes videos frame-by-frame at resolution 640×480.\nFine-tuning. With all Video ViTs and R(2+1)D, we closely follow the VideoMAE fine-tuning recipe for HMDB51. We train for 50 epochs (5 warmup), with 50K randomly sampled examples per epoch and a batch size of 56. For VidNeXt variants and MOVAD, we follow the original training settings.\nDomain-adaptive pre-training (DAPT). We apply the VideoMAE pre-training strategy [38], masking 75% of tokens, using MSE loss on masked tokens only. Training uses a batch size of 800 and 1M samples per epoch, with 12 epochs. 
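The DAPT settings above can be made concrete with a small Python sketch. It computes an MSE restricted to masked positions on a toy flattened input and derives the number of optimizer steps implied by the stated batch size, samples per epoch, and epoch count. The helper names masked_mse and dapt_steps are hypothetical, and averaging over the number of masked entries is an assumption, since the text only states that the loss is computed on masked tokens.

```python
def masked_mse(target, prediction, mask):
    '''Mean squared error over masked positions only (mask value 1 = masked).

    Hypothetical helper: dividing by the masked count is an assumed normalization.
    '''
    n_masked = sum(mask)
    if n_masked == 0:
        raise ValueError('mask selects no positions')
    squared = [m * (x - y) ** 2 for x, y, m in zip(target, prediction, mask)]
    return sum(squared) / n_masked


def dapt_steps(samples_per_epoch, batch_size, epochs):
    '''Number of optimizer steps for a fixed number of samples per epoch.'''
    return (samples_per_epoch // batch_size) * epochs


# Toy example: positions 1 and 3 are masked, each with a squared error of 4.
loss = masked_mse([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 3.0, 2.0], [0, 1, 0, 1])

# Settings stated in the text: 1M samples per epoch, batch size 800, 12 epochs.
steps = dapt_steps(1_000_000, 800, 12)
```

With these settings the step count works out to 1,250 steps per epoch over 12 epochs, which is consistent with the 15K DAPT steps quoted in Sec. 4.4.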
We explore DAPT on three domains: (a) Kinetics-700, (b) BDD100K (normal driving), and (c) BDD100K + CAP-DATA (abnormal driving), with dataset mixing ratios detailed in the Supplementary.\n4.2. Can an encoder-only model outperform specialized TAD methods? # To answer this question, we evaluate models along three critical axes: classification performance, generalization, and efficiency, as shown in Tab. 2. We select a range of recent top-performing methods proposed for the TAD task. Among encoder-only models, we apply the R(2+1)D [39] model and different sizes of VideoMAE pre-trained Video ViTs (with DAPT, see Sec. 4.4). The results show that these Video ViTs consistently strike a strong balance, demonstrating a good combination of predictive accuracy, generalization across domains, and computational efficiency. Notably, strongly pre-trained Video ViTs achieve the highest AUCROC scores across both DoTA and DADA-2000 datasets, both in-domain and in cross-dataset evaluation, while being highly efficient with a low memory footprint. In contrast, specialized TAD-specific models not only demonstrate lower classification performance but also incur substantially higher computational costs and latency. R(2+1)D and ResNet+NST, while being highly efficient, fall short in predictive quality. This confirms that we can outperform specialized, multi-component TAD methods with a simple encoder-only model by applying a Video ViT with strong pre-training.\nTable 2. Traffic Anomaly Detection (TAD) performance and efficiency. Video ViT-based encoder-only models set a new state of the art on both datasets, while being significantly more efficient than top-performing specialized methods. FPS measured using NVIDIA A100 MIG, 1/2 GPU. †From prior work. ‡Optimistic estimates using publicly available components of the model. \u0026ldquo;A→B\u0026rdquo;: trained on A, tested on B; D2K: DADA-2000.\nMethod | DoTA AUCROC, % (DoTA→DoTA) | DoTA AUCROC, % (D2K→DoTA) | D2K AUCROC, % (D2K→D2K) | D2K AUCROC, % (DoTA→D2K) | # Param | Peak GPU | FPS\nTwo-stage TAD methods\nVidNeXt [10] | 73.9 | 69.3 | 70.1 | 72.4 | 125 M | 0.78 GB | 27\nConvNeXt+VT [10] | 73.1 | 61.2 | 66.8 | 67.3 | 125 M | 0.77 GB | 27\nResNet+NST [10] | 74.0 | 70.1 | 71.2 | 72.3 | 24 M | 0.19 GB | 124\nMOVAD [34] | 82.2 | 77.6 | 77.0 | 75.2 | 153 M | 1.10 GB | 26\nMulti-representation fusion TAD methods\nTTHF [24] | 84.7† | – | – | 71.7† | 140 M | 0.80 GB | 26\nPromptTAD [32] | 83.9† | – | – | 74.6† | 106 M | 1.88 GB | 18\nISCRTAD [25] | 86.2† | – | – | 82.7† | 359 M‡ | 1.51 GB‡ | 33‡\nEncoder-only models\nR(2+1)D [39] | 81.5 | 76.4 | 78.8 | 78.4 | 27 M | 0.27 GB | 104\nDAPT-VideoMAE-S (ours) | 86.4 | 81.7 | 85.6 | 84.3 | 22 M | 0.16 GB | 95\nDAPT-VideoMAE-B (ours) | 87.9 | 83.5 | 87.6 | 85.8 | 86 M | 0.54 GB | 94\nDAPT-VideoMAE-L (ours) | 88.4 | 84.2 | 88.5 | 86.6 | 304 M | 1.80 GB | 34\n4.3. What pre-training is better for TAD? # We investigate how general video recognition performance and pre-training strategies relate to downstream performance on TAD. 
We evaluate a range of publicly available Video ViT models of several sizes, using their Top-1 accuracy on Kinetics-400 [19] and SomethingSomethingV2 [15] alongside AUCMCC on DoTA [48] and DADA-2000 [13].\nResults are summarized in Tab. 3, from which we observe two key trends. First, we find that for TAD the MAE pre-training objective dominates: MAE-pre-trained models (VideoMAE) and their distilled variants (VideoMAE2, MVD) achieve the highest AUCMCC within each size tier, even when they do not have the highest classification accuracy on general benchmarks. Second, classification accuracy on general benchmarks is not representative of TAD performance: ViViT-B, despite matching VideoMAE on Kinetics-400, demonstrates significantly lower AUCMCC, and InternVideo2, state-of-the-art on general benchmarks, also underperforms on TAD. These findings suggest that the representations which are beneficial for general video classification may not align well with those needed for TAD. In particular, TAD appears to benefit more from dense representations that emphasize fine-grained temporal irregularities rather than the coarse semantic categories typically targeted by general video recognition models.\nThe overall top-ranking model on TAD is VideoMAE2, which incorporates dual masking, an additional pre-training step with distillation from a larger model, and roughly 6 times larger pre-training datasets, compared to other MVM pre-trained models. This confirms that both the scale of pre-training and the choice of objectives significantly impact the transferability of ViFMs to TAD.\n4.4. Domain-Adaptive Pre-Training (DAPT) # Larger ViFMs can be pre-trained on a larger scale and, as a result, exhibit better out-of-the-box generalization across domains, while smaller models have been shown to benefit less from longer pre-training due to their limited capacity and faster saturation [50]. 
Therefore, we expect that domain adaptation can help better utilize the capacity of the smaller but at the same time more efficient and faster models. Given that MAE pre-training proves especially effective for TAD, and unlabeled driving data is available in abundance, we investigate whether small-scale self-supervised DAPT with MAE can be an effective and efficient way to scale the performance of smaller models. We initialize a ViT model with VideoMAE pre-trained weights and perform several epochs of additional pre-training with the VideoMAE objective on in-domain data. Compared to the original ∼192K training steps with batch size 2048, we use only 15K steps with batch size 800.\nTable 3. Comparing Video ViT pre-trainings. In contrast to general video classification benchmarks (K400, SthSthV2), fully- and weakly-supervised pre-training are less effective for TAD benchmarks. Self-supervised pre-training performs best for TAD (DoTA, D2K). FSL: fully-supervised; WSL: weakly-supervised; SSL: self-supervised learning; K400: Kinetics-400 [19]; SthSthV2: SomethingSomethingV2 [15]; D2K: DADA-2000 [13].\nModel | Variant | Type | Top-1 accuracy (K400) | Top-1 accuracy (SthSthV2) | MCC@0.5 (DoTA) | MCC@0.5 (D2K) | AUCMCC (DoTA) | AUCMCC (D2K)\nVideoMAE1600 [38] | Small | SSL | 79.0 | 66.8 | 55.5 | 49.5 | 52.1 | 48.1\nMVDfromB [45] | Small | SSL | 80.6 | 70.7 | 56.2 | 49.8 | 50.0 | 48.1\nMVDfromL [45] | Small | SSL | 81.0 | 70.9 | 56.5 | 51.1 | 50.2 | 49.1\nVideoMAE2 [44] | Small | SSL+FSL | 83.7 | – | 56.8 | 51.6 | 55.2 | 50.3\nInternVideo2 [46] | Small | WSL+SSL | 85.4 | 71.6 | 51.6 | 44.5 | 49.7 | 43.7\nViViT [1] | Base | FSL | 79.9 | – | 30.7 | 27.6 | 28.9 | 26.7\nVideoMAE800 [38] | Base | SSL | 80.0 | – | 58.0 | 52.0 | 54.5 | 51.2\nVideoMAE1600 [38] | Base | SSL | 81.0 | 69.7 | 58.7 | 52.6 | 56.0 | 52.2\nMVDfromB [45] | Base | SSL | 82.7 | 72.5 | 57.8 | 51.6 | 56.0 | 50.9\nMVDfromL [45] | Base | SSL | 83.4 | 73.7 | 59.2 | 52.1 | 57.0 | 51.0\nVideoMAE2 [44] | Base | SSL+FSL | 86.6 | 75.0 | 58.4 | 54.8 | 56.5 | 53.4\nInternVideo2 [46] | Base | WSL+SSL | 88.4 | 73.5 | 52.2 | 44.2 | 50.0 | 43.1\nVideoMAE1600 [38] | Large | SSL | 85.2 | 74.3 | 61.6 | 56.9 | 59.7 | 55.36\nMVDfromL [45] | Large | SSL | 86.0 | 76.1 | 60.5 | 54.6 | 59.0 | 53.7\nFigure 3. DAPT scaling across different model sizes. Smaller models benefit more. S: small, B: base, L: large variant of the Video ViT.\nAs shown in Fig. 3, DAPT via MAE brings clear improvements for small and base VideoMAE pre-trained models. As expected, given our small-scale DAPT protocol, the large model sees less improvement.\nTable 4. DAPT ablation. Comparing generic (Kinetics-700 [19]), driving (BDD100K [49]), and driving + anomaly (CAP-DATA [14]) domains shows that driving videos improve performance without requiring anomalies. Using Video ViT-Small. \u0026quot;A→B\u0026quot;: trained on A, tested on B; D: DoTA; D2K: DADA-2000.\nMethod | DoTA AUCMCC (D→D) | DoTA AUCMCC (D2K→D) | D2K AUCMCC (D2K→D2K) | D2K AUCMCC (D→D2K)\nw/o DAPT | 52.1 | 43.8 | 48.1 | 46.6\nGeneric DAPT | 51.6 (-0.5) | 43.8 | 48.5 (+0.4) | 46.2 (-0.4)\nDriving DAPT | 54.8 (+2.7) | 46.8 (+3.0) | 52.0 (+3.9) | 49.7 (+3.1)\n+ anomalies | 54.9 (+2.8) | 46.9 (+3.1) | 51.9 (+3.8) | 49.7 (+3.1)\nTo disentangle the impact of domain relevance from that of additional pre-training, we conduct an ablation study on the data used for DAPT. Specifically, we adapt a model, pre-trained on a general human activity dataset, using three types of unlabeled video data: (1) the original, general pre-training domain used by VideoMAE, which is not related to the TAD task, (2) normal ego-centric driving, and (3) ego-centric driving mixed with anomalies. This setup allows us to evaluate whether the observed improvements stem from domain alignment or from simply continuing generic pre-training. As shown in Tab. 4, adaptation with domain-relevant data (both normal and abnormal driving) consistently improves generalization and data efficiency, while additional pre-training on the original, generic domain yields no notable gains. 
Interestingly, pre-training on normal driving videos is sufficient, and mixing in data with driving anomalies does not provide further improvements. We conclude that small-scale self-supervised DAPT is a simple and effective way to improve the performance and generalization of smaller Video ViTs for TAD, and that it does not necessarily require rare anomaly data.\nFinally, Fig. 4 shows qualitative examples that illustrate the positive effect of DAPT.\nFigure 4. Qualitative examples for the effect of DAPT. Predicted anomaly scores of VideoMAE (top: S, bottom: B) w/ and w/o DAPT.\n5. Discussion # In this work, we show that with stronger pre-training, an encoder-only Video Vision Transformer outperforms all prior Traffic Anomaly Detection models while also being significantly more efficient. However, it remains an open question whether the additional components introduced in earlier work become redundant as pre-training scales, as shown in related perception tasks [21], or whether they still provide complementary benefits.\nWe use Video Masked Autoencoders [38, 44] as a simple and effective form of masked video modeling (MVM). Approaches that predict in latent space, such as V-JEPA [2, 3], may offer further gains by avoiding the limitations of noisy pixel-level reconstruction.\n6. Conclusion # Ego-centric Traffic Anomaly Detection (TAD) is a challenging task that requires modeling motion dynamics and agent interactions. While most recent methods for TAD rely on complex, multi-component architectures, we show that a simple encoder-only design using a plain Video Vision Transformer (ViT) with strong self-supervised pre-training is not only more efficient, but also more effective and generalizable. Building on this, Domain-Adaptive Pre-Training (DAPT) offers a label-free and data-efficient way to further boost performance, particularly for smaller models. 
These findings highlight the strength of learned inductive biases from large-scale pre-training as an alternative to manually crafted architectural complexity, a principle to which TAD is no exception.\nOur experiments further demonstrate that Masked Video Modeling (MVM) is the most effective pre-training strategy for TAD, in contrast to standard video classification tasks. This suggests that different video tasks may benefit from pre-training objectives tailored to their downstream requirements. While TAD is a crucial task in autonomous driving (AD), other AD tasks may align more closely with conventional action recognition. This motivates further research into a universally effective video pre-training strategy, evaluated by its generalization across diverse AD tasks. We hope our findings provide a foundation for future work in this direction.\nAcknowledgements This work was funded by the Horizon Europe programme of the European Union, under grant agreement 101076754 (project AITHENA). Views and opinions expressed here are however those of the author(s) only and do not necessarily reflect those of the European Union or CINEA. Neither the European Union nor the granting authority can be held responsible for them. We also acknowledge the Dutch national e-infrastructure with the support of the SURF Cooperative, grant agreement no. EINF-10314, financed by the Dutch Research Council (NWO), for the availability of high-performance computing resources and support.\nReferences # [1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021. 
1 , 3 , 4 , 5 , 7 [2] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, and Nicolas Ballas. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025. 8 [3] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024. 8 [4] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, page 4, 2021. 3 [5] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019. 5 [6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PmLR, 2020. 3 [7] Davide Chicco and Giuseppe Jurman. The matthews correlation coefficient (mcc) should replace the roc auc as the standard metric for assessing binary classification. BioData Mining, 16(1):4, 2023. 4 [8] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018. 1 [9] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Re. 
Flashattention: Fast and memory-efficient exact ´ ´ attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022. 4 [10] Nishq Poorav Desai, Ali Etemad, and Michael Greenspan. Cyclecrash: A dataset of bicycle collision videos for collision prediction and analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025. 1 , 2 , 3 , 5 , 6 [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1 , 3 , 4 [12] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision , pages 6824–6835, 2021. 3 [13] Jianwu Fang, Dingxin Yan, Jiahuan Qiao, Jianru Xue, and Hongkai Yu. Dada: Driver attention prediction in driving accident scenarios. IEEE transactions on intelligent transportation systems, 23(6):4959–4971, 2021. 1 , 2 , 3 , 5 , 6 , 7 [14] Jianwu Fang, Lei-Lei Li, Kuan Yang, Zhedong Zheng, Jianru Xue, and Tat-Seng Chua. Cognitive accident prediction in driving scenes: A multimodality benchmark. CoRR , abs/2212.09381, 2022. 5 , 7 [15] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The\u0026quot; something something\u0026quot; video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017. 1 , 6 , 7 [16] Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, ´ ´ Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 
Don\u0026rsquo;t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020. 2 , 5 [17] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, and Ross Girshick. Masked autoencoders are scalable ´ ´ vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000– 16009, 2022. 3 [18] P Rajesh Kanna, S Vanithamani, P Karunakaran, P Pandiaraja, N Tamilarasi, and P Nithin. An enhanced traffic incident detection using factor analysis and weighted random forest algorithm. In 2024 International Conference on IoT Based Control Networks and Intelligent Systems (ICICNIS) , pages 1355–1361. IEEE, 2024. 4 [19] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 , 2017. 1 , 5 , 6 , 7 [20] Tommie Kerssies, Daan de Geus, and Gijs Dubbelman. First Place Solution to the ECCV 2024 BRAVO Challenge: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation. arXiv preprint arXiv:2409.17208, 2024. 2 [21] Tommie Kerssies, Niccolo Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. Your vit is secretly an image segmentation model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25303–25313, 2025. 2 , 8 [22] Hoon Kim, Kangwook Lee, Gyeongjo Hwang, and Changho Suh. Crash to not crash: Learn to identify dangerous vehicles using a simulator. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 978–985, 2019. 3 [23] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. 
In European Conference on Computer Vision, pages 237–255. Springer, 2024. 3 [24] Rongqin Liang, Yuanman Li, Jiantao Zhou, and Xia Li. Text-driven traffic anomaly detection with temporal highfrequency modeling in driving videos. IEEE Transactions on Circuits and Systems for Video Technology, 2024. 1 , 2 , 3 , 4 , 6 [25] Rongqin Liang, Yuanman Li, Zhenyu Wu, and Xia Li. An interaction-scene collaborative representation framework for detecting traffic anomalies in driving videos. IEEE Transactions on Intelligent Transportation Systems, 2025. 1 , 3 , 4 , 6 [26] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018. 3 [27] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022. 3 [28] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022. 3 [29] Weixin Luo, Wen Liu, and Shenghua Gao. Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International conference on multimedia and expo (ICME), pages 439–444. IEEE, 2017. 3 [30] Neelu Madan, Andreas Møgelmose, Rajat Modi, Yogesh S Rawat, and Thomas B Moeslund. Foundation models for video understanding: A survey. Authorea Preprints, 2024. 3 [31] Viorica Patr ˘ ˘ aucean, Xu Owen He, Joseph Heyward, Chuhan ˘ ˘ Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, Joao Carreira, and Razvan Pascanu. Trecvit: A re- ˜ ˜ current video transformer. arXiv preprint arXiv:2412.14294 , 2024. 3 [32] Hao Qiu, Xiaobo Yang, and Xiaojin Gong. 
Prompttad: Object-prompt enhanced traffic anomaly detection. IEEE Robotics and Automation Letters, 2025. 1 , 2 , 3 , 4 , 6 [33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021. 3 [34] Leonardo Rossi, Vittorio Bernuzzi, Tomaso Fontanini, Massimo Bertozzi, and Andrea Prati. Memory-augmented online video anomaly detection. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6590–6594. IEEE, 2024. 1 , 2 , 3 , 6 [35] Tim J Schoonbeek, Fabrizio J Piva, Hamid R Abdolhay, and Gijs Dubbelman. Learning to predict collision risk from sim- ulated video data. In 2022 IEEE Intelligent Vehicles Symposium (IV), pages 943–951. IEEE, 2022. 3 [36] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 1 [37] Tiago Tamagusko, Matheus Gomes Correia, Minh Anh Huynh, and Adelino Ferreira. Deep learning applied to road accident detection with transfer learning and synthetic images. Transportation research procedia, 64:90–97, 2022. 4 [38] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022. 2 , 3 , 4 , 5 , 7 , 8 [39] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018. 3 , 5 , 6 [40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 
Attention is all you need. Advances in neural information processing systems, 30, 2017. 3 [41] Tuan-Hung Vu, Eduardo Valle, Andrei Bursuc, Tommie Kerssies, Daan de Geus, Gijs Dubbelman, Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang, Tomáš Vojíř, Jan Šochman, Jiří Matas, Michael Smith, Frank Ferrie, Shamik Basu, Christos Sakaridis, and Luc Van Gool. The BRAVO Semantic Segmentation Challenge Results in UNCV2024. 2024. 2 [42] Junyao Wang, Arnav Vaibhav Malawade, Junhong Zhou, Shih-Yuan Yu, and Mohammad Abdullah Al Faruque. Rs2g: Data-driven scene-graph extraction and embedding for robust autonomous perception and scenario understanding. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7493–7502, 2024. 4 [43] Lin Wang, Fuqiang Zhou, Zuoxin Li, Wangxia Zuo, and Haishu Tan. Abnormal event detection in videos using hybrid spatio-temporal autoencoder. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 2276–2280. IEEE, 2018. 3 [44] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14549–14560, 2023. 3 , 4 , 5 , 7 , 8 [45] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, and Yu-Gang Jiang. Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6312–6322, 2023. 2 , 4 , 5 , 7 [46] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024. 
3 , 4 , 5 , 7\n[47] Jiewen Yang, Xingbo Dong, Liujun Liu, Chao Zhang, Jiajun Shen, and Dahai Yu. Recurring the transformer for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14063–14073, 2022. 3 [48] Yu Yao, Xizi Wang, Mingze Xu, Zelin Pu, Yuchen Wang, Ella Atkins, and David Crandall. Dota: unsupervised detection of traffic anomaly in driving videos. IEEE transactions on pattern analysis and machine intelligence, 2022. 1 , 2 , 3 , 4 , 5 , 6 [49] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020. 5 , 7 [50] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022. 3 , 6 [51] Zhili Zhou, Xiaohua Dong, Zhetao Li, Keping Yu, Chun Ding, and Yimin Yang. Spatio-temporal feature encoding for traffic accident detection in vanet environment. IEEE Transactions on Intelligent Transportation Systems, 23(10): 19772–19781, 2022. 
1 , 4 ","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/simplifying-traffic-anomaly-detection-with-video-foundation-models/","section":"Papers","summary":"The paper investigates the use of simple encoder-only Video Vision Transformers (Video ViTs) with various pre-training strategies for traffic anomaly detection (TAD), demonstrating that with strong pretraining and domain adaptation, minimal architectural complexity can outperform complex prior methods, highlighting the importance of pretraining strategies like Masked Video Modeling (MVM).","title":"Simplifying Traffic Anomaly Detection with Video Foundation Models","type":"other"},{"content":" SUVAD: Semantic Understanding Based Video Anomaly Detection Using MLLM # Shibo Gao∗†, Peipei Yang†‡(B), Linlin Huang ∗\n∗ Beijing Jiaotong University\n† State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences ‡ School of Artificial Intelligence, University of Chinese Academy of Sciences\nAbstract—Video anomaly detection (VAD) aims at detecting anomalous events in videos. Most existing VAD methods distinguish anomalies by learning visual features of the video, which usually face several challenges in real-world applications. First, these methods are mostly scene-dependent, whose performances degrade obviously once the scene changes. Second, these methods are incapable of giving explanations to the detected anomalies. Third, these methods cannot adjust definitions of normal or abnormal events during test time without retraining the model. One important reason for the drawbacks is that these visualbased methods mainly detect anomalies by fitting visual patterns rather than semantically understanding the events in videos. In this paper, we propose a training-free method named Semantic Understanding based Video Anomaly Detection (SUVAD) using multi-modal large language model (MLLM). 
By exploiting MLLMs to generate detailed textual descriptions for videos, SUVAD achieves semantic video understanding, and then detects anomalies directly with large language models. We also designed several techniques to mitigate the hallucination problem of MLLMs. Compared to the methods based on visual features, SUVAD obtains clearly better scene generalization, anomaly interpretability, and flexibility in adjusting anomaly definitions. We evaluate our method on five mainstream datasets. The results show that SUVAD achieves the best performance among all the training-free methods.\nIndex Terms—Video Anomaly Detection, Multi-modal Large Language Model, Training-free\nI. INTRODUCTION # Video anomaly detection (VAD) aims to detect events in the videos that significantly deviate from normal patterns. Due to its widespread applications in areas such as intelligent surveillance systems and video censorship, VAD has garnered increasing attention in both academia and industry [1]–[11].\nMost existing VAD methods identify anomalous events by fitting the normal or anomalous visual patterns learned from the training videos [12]–[22]. Although these methods demonstrate good performances in their experiments, they encounter multiple challenges when applied to the real world.\nFirstly, the methods based on visual features tend to be scene-specific. Models over-fitted to specific scenes experience significant performance degradation when switched to another scene. Secondly, these methods focus on detecting anomalies while neglecting to provide detailed explanations of the anomalous events. Thirdly, the definition of anomalous events can be learnt only from the training videos and thus cannot be adjusted without re-training the model using new data.\nThe aforementioned problems stem largely from the fact that most existing methods focus exclusively on visual features when detecting anomalies, rather than comprehending the video content. 
In reality, anomalies within videos are dictated by the events. Clearly, the recently successful multi-modal large language models [23]–[26] (MLLMs) and large language models [27], [28] (LLMs) are well-suited to address this task. Zanella et al. [29] were the first to attempt using MLLMs for VAD tasks. Unfortunately, they failed to fully exploit the potential of MLLMs. Firstly, their definition of anomalies relies entirely on LLMs, making it completely uncontrollable. Secondly, they are severely affected by hallucination problems.\nFig. 1. Compared to the methods based on visual features, our method can support flexible anomaly definition, adapt to different scenes, and provide explanations for anomalies in a training-free manner.\nIn this paper, we propose a training-free VAD method named Semantic Understanding Based VAD (SUVAD), to address the aforementioned problems. It first exploits MLLMs to achieve semantic understanding of the training videos and generate textual descriptions for normal or abnormal events according to the labels. Subsequently, SUVAD generates a textual description for each frame of the test video and calculates its anomaly score by comparing it with the descriptions of normal and abnormal events using an LLM. The anomalous frames can be detected according to the scores with interpretative descriptions about the anomalies. Furthermore, we designed several techniques such as score smoothing and caption correction to mitigate hallucination problems. Fig. 1 illustrates the differences between SUVAD and other visual-based methods.\nUsing semantic information instead of visual features for anomaly detection, SUVAD can adapt well to different scenes without obvious performance degradation. Benefiting from the multi-modal model, SUVAD can conveniently explain the detected anomalies and adjust the definition of normal or abnormal events according to textual inputs without retraining the model. 
Finally, the strategy of coarse-to-fine VAD effectively suppresses hallucination problems, further enhancing model performance.\nWe evaluate our method on both semi-supervised and weakly-supervised tasks using five mainstream datasets. The results show that SUVAD achieves the best performance among all the training-free methods and achieves comparable performance to other supervised methods. We summarize our contributions as follows:\nWe propose a training-free VAD method named SUVAD that detects anomalies based on semantic understanding of the video contents. Compared with the methods based on visual features, it benefits from better generalization to scenes, ability of anomaly explanation and the flexibility to adjust anomaly definitions.\nWe propose a series of techniques to mitigate the hallucination problems of MLLMs for the VAD task.\nExperimental results on five mainstream datasets demonstrate that our method achieves excellent versatility and competitive performance.\nII. METHOD # A. Overview # Fig. 2. Illustration of the SUVAD.\nFig. 2 illustrates the overall architecture of our proposed method, SUVAD, designed for detecting anomalies in videos through three phases. Specifically, SUVAD accepts arbitrary labeled text, image, or video inputs and generates the lists of normal/anomalous events. Subsequently, SUVAD generates textual descriptions for video clips. By comparing the video descriptions with the event lists, SUVAD assigns the video-level anomaly scores and locates high-probability anomalous clips. Lastly, SUVAD analyzes the descriptions of each frame within the high-probability anomalous clip to provide a frame-level anomaly score. The final anomaly score is derived from a weighted combination of the video-level and frame-level anomaly scores. To mitigate the hallucination problems inherent in MLLMs, we employ a Caption Correction module to reassign image captions across consecutive frames and apply score smoothing to ensure event continuity.\nB. 
Normal/Anomalous Patterns Learning in the Training-Free Manner # Most existing VAD methods [12]–[18], [30], [31] typically learn visual features from the videos in the training set to define normal and anomalous events. Once the model is trained, the boundaries between abnormal and normal events are also fixed. These visual-based methods are not only susceptible to data imbalance, which can lead to misjudgments, but also lack the flexibility to adjust the definition of anomalous events without re-annotating the datasets and re-training the model.\nUnlike the visual-based methods, SUVAD can determine the anomalous or normal events in a simple and flexible way without training. Specifically, SUVAD can accept any input in the form of text Tinput, images Iinput, or videos Vinput, and summarize it into the lists of abnormal/normal events La, Ln with brief descriptions. For textual input, SUVAD utilizes LLMs to summarize it. For image or video input, SUVAD first generates corresponding descriptions using MLLMs, and then summarizes these descriptions using LLMs. Through this method, SUVAD retains the ability to learn the definitions of normal or anomalous events from data in the training-free manner, thereby addressing the aforementioned issues.\nC. Coarse-grained Anomaly Detection # In the field of VAD, anomalous events often exhibit a form of continuity, necessitating an analysis that not only focuses on the intricate visual information within individual video frames but also comprehends the associative information between frames [17], [18], [29]. Furthermore, while MLLMs demonstrate substantial advantages in modeling capabilities compared to CNNs, they also entail a significant increase in computational costs. Based on these two points, SUVAD initially performs a coarse-grained anomaly detection on the test video and identifies high-probability clips that are most likely to contain anomalies. 
This strategy, while fully considering the associative information between video frames, also reduces the subsequent computational overhead and alleviates the hallucination problems from MLLMs.\nSpecifically, for a test video containing a series of frames Vtest = {F1, F2, . . . , Fj}, SUVAD first divides it into video clips C1, C2, . . . , Cn according to a fixed interval d, where Ci = {Fi×d, Fi×d+1, . . . , F(i+1)×d−1}. Then SUVAD uses MLLMs to generate video captions Cap(Ci) for them.\nSubsequently, SUVAD employs LLMs to compare and analyze the video captions with the event lists produced in the previous stage. Based on the degree of match between the captions and the lists, SUVAD assigns an anomaly score s(Ci) to each video clip, indicating potential anomalous events and summarizing these events into a list L′a:\nTABLE I COMPARISON WITH OTHER STATE-OF-THE-ART METHODS ON THE SINGLE DATASET.\n| Method | Mode | Explanation | Semi-supervised: Ped2 (AUC) | Semi-supervised: Avenue (AUC) | Semi-supervised: SH Tech (AUC) | Weakly supervised: XD-Violence (AP) | Weakly supervised: UCF-Crime (AUC) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| BA Framework [21] | Semi | No | 98.7% | 92.3% | 82.7% | × | × |\n| Ristea et al. [32] | Semi | No | - | 91.6% | 83.8% | × | × |\n| Wang et al. [17] | Semi | No | 99.0% | 92.2% | 84.3% | × | × |\n| CLIP-TSA [14] | Weakly | No | × | × | × | 82.1% | 87.5% |\n| VadCLIP [15] | Weakly | No | × | × | × | 84.5% | 88.0% |\n| UMIL [33] | Weakly | No | × | × | × | - | 87.5% |\n| ZS CLIP [34] | Training-free | No | 61.7% | 52.3% | 50.2% | 17.8% | 53.1% |\n| LLAVA-1.5 [25] | Training-free | Yes | 82.9% | 67.4% | 59.6% | 50.2% | 72.8% |\n| Video chatgpt [26] | Training-free | Yes | 85.1% | 76.9% | 69.1% | 53.8% | 75.3% |\n| LAVAD [29] | Training-free | Yes | × | × | × | 62.0% | 80.2% |\n| SUVAD(Ours) | Training-free | Yes | 96.8% | 89.3% | 80.2% | 70.1% | 83.9% |\nWhen only input data labeled as normal is provided, La = ∅. In this scenario, SUVAD analyzes whether the captions Cap(Ci) contain any content that does not belong to the normal event list Ln. 
Based on this, SUVAD assigns the anomaly score s(Ci) ranging from 0/10 to 10/10. The score of 10/10 signifies that the captions Cap(Ci) definitely include at least one event that is not part of Ln, whereas a score of 0/10 indicates that the caption content entirely falls within the normal event category. Conversely, when only input data labeled as anomalous is provided, Ln = ∅. SUVAD examines whether the captions Cap(Ci) contain descriptions of events listed in La and assigns a score from 0/10 to 10/10 based on match degree, using the opposite scoring logic.\nLastly, SUVAD employs two methods to locate high-probability anomalous clips within the test video. On one hand, given a threshold τ, SUVAD flags clips with scores exceeding this threshold as high-probability anomalous clips: H1 = {Ci | s(Ci) \u0026gt; τ}.\nOn the other hand, considering that dividing the video at a fixed interval may result in the disruption of continuous events, SUVAD also leverages the temporal localization capabilities of MLLMs to uncover clips that might be overlooked yet contain anomalous events using the aforementioned list: H2 = MLLM(L′a, Vtest). Ultimately, SUVAD takes their union H = H1 ∪ H2 as the result and passes it on to the next processing stage.\nD. Fine-Grained Anomaly Detection # As mentioned in II-C, intricate visual information within video frames is also crucial. After coarse-grained anomaly detection, SUVAD conducts a detailed analysis of each frame in the captured high-probability anomalous clips H and integrates video-level and frame-level anomaly scores to provide the overall anomaly scores.\nFirstly, SUVAD generates an image caption Cap(Fj) for each frame in the high-probability clips. To further mitigate the impact of hallucination problems, SUVAD employs the Caption Correction module to refine the generated image captions. 
Specifically, SUVAD utilizes an aligned vision-language model to encode several adjacent frames and their corresponding image captions, and then redistributes them:\nwhere \u0026lt; · \u0026gt; is the cosine similarity, and ΦI and ΦT are the image encoder and the text encoder.\nSubsequently, similar to the previous stage, SUVAD compares the image captions with the lists, assigning an abnormality score s(Fj) to each frame. The final score is derived from the combination of two scores:\nwhere α and β are constants.\nConsidering that events are continuous and to further avoid the influence of randomness in MLLMs on anomaly detection, we apply the Savitzky-Golay filter to smooth the scores:\nwhere qp/Q is the smoothing coefficient, determined through polynomial fitting using the least squares method.\nIII. EXPERIMENTS # A. Datasets # SVAD Datasets: Commonly used SVAD datasets include UCSD Ped2 [35], CUHK dataset [36], and Shanghai Tech dataset [1]. These datasets only provide videos containing normal events during training and require the model to accurately locate anomalous segments in the test set during the testing phase. The anomalies in these datasets include running, cycling, skateboarding, etc.\nWVAD Datasets: Commonly used WVAD datasets include UCF-Crime [37] and XD-Violence [38]. These datasets provide videos containing both normal and anomalous events during training, along with specific anomalous categories, but the anomalous annotations are at the video level. These datasets also require the model to accurately locate anomalous segments in the test set. Compared to SVAD datasets, the anomalies included in these datasets are extreme events such as explosions, robberies, shootings, etc.\nB. Implementation Details # In the evaluation phase, SUVAD employs cogvlm2 [24] to generate image captions and its video version, cogvlm2-video [24], to generate video captions. 
Additionally, SUVAD utilizes llama-3 [27] for text analysis and score assignment. It is noteworthy that the selection of these models is not fixed, and currently mainstream MLLMs and LLMs with the same functionality can achieve similar performance. For the constant terms in Eq. 4, we set α to 0.4 and β to 0.6. Regarding the Savitzky-Golay filter, we set the order of the polynomial to 3 and the size of the smoothing window to 53.\nC. Comparison with State-of-the-art Methods # Comparison of our method with various representative methods is shown in Table I. The results demonstrate that SUVAD achieves the best performance among all training-free methods and achieves comparable performance to other supervised methods.\nDue to variations in supervisory information and event types across several datasets, it is generally challenging for visual-based methods to transition from one type of task to another. In Table I, \u0026ldquo;×\u0026rdquo; represents that the method cannot be evaluated on that particular dataset, while \u0026ldquo;−\u0026rdquo; indicates that the paper did not report evaluation results for that dataset.\nBenefiting from the powerful analytical capabilities of MLLMs and LLMs, as well as the training-free framework of SUVAD, our proposed method can easily switch between different VAD tasks. Furthermore, our method can provide explanations for anomalies and flexibly adjust the definition of anomalous events.\nD. Experiment on Scene Generalization # As mentioned previously, a significant advantage of SUVAD lies in its generalization to different scenes. 
Benefiting from its training-free framework and the ability to understand videos, SUVAD is not susceptible to the interference of visual features like other methods, thereby achieving strong generalization across different scenes.\nTABLE II EXPERIMENTAL RESULTS OF SCENE GENERALIZATION, USING AUC AS THE METRIC.\n| Method | Avenue→SHT | SHT→Avenue | SHT→Ped2 |\n| --- | --- | --- | --- |\n| ZS CLIP [34] | 60.9% | 62.3% | 52.7% |\n| ZS CLIP IB [39] | 61.3% | 64.5% | 53.6% |\n| Astrid et al. [40] | 51.7% | 54.3% | 65.9% |\n| Wang et al. [17] | 59.3% | 62.9% | 75.6% |\n| SUVAD(Ours) | 77.3% | 84.9% | 96.1% |\nWe designed three different experimental schemes using three mainstream SVAD datasets to explore the generalization capability of SUVAD in various scenarios: Shanghai Tech(SHT)→Avenue, Avenue→SHT, and SHT→Ped2. We compared our method with four other high-performance methods under identical experimental settings, and the results are presented in Table II. It is evident that when evaluating between different scenes, other methods failed to maintain their performance, whereas our method continued to achieve high detection accuracy.\nE. Ablation Study # To further illustrate the superiority of the SUVAD architecture, we conducted ablation experiments on various modules in SUVAD on Shanghai Tech, and the results are shown in Table III. It can be seen that each module we designed further improves the model\u0026rsquo;s performance.\nTABLE III EXPERIMENTAL RESULTS OF ABLATION STUDY.\n| Setting | Finer Detection | Coarse Detection | Caption Correction | Score Smoothing | Results (AUC) |\n| --- | --- | --- | --- | --- | --- |\n| A | ✓ | | | | 71.2% |\n| B | ✓ | ✓ | | | 74.0% |\n| C | ✓ | ✓ | ✓ | | 74.9% |\n| D | ✓ | ✓ | ✓ | ✓ | 80.2% |\nFig. 3. Visualization of ablation study.\nWe conducted a visualization experiment on the 01_0131 video from Shanghai Tech dataset using four settings from Table III. The results are shown in Fig. 3. Obviously, the coarse detection significantly reduces the hallucination problem of MLLMs when processing normal frames, while the Caption Correction module mitigates the hallucinations of MLLMs when dealing with anomalous ones. 
The Score Smoothing module takes into account the anomaly scores of the entire video, greatly improving detection accuracy while reducing the influence of hallucinations.\nIV. CONCLUSION # This paper introduces SUVAD, a training-free video anomaly detection method based on semantic understanding utilizing MLLMs. Most existing VAD methods detect anomalies by using the visual features of normal or abnormal patterns learnt from videos rather than by understanding video content. This results in poor scene generalization, poor interpretability of anomalies, and inflexible anomaly definitions, all of which matter in real-world VAD. Leveraging the powerful video understanding capability of MLLMs and LLMs, SUVAD performs significantly better in these aspects. The experimental results show that our method achieves the best performance among all training-free methods and is comparable to the state-of-the-art methods of other supervision settings.\nACKNOWLEDGMENT # This work has been supported by \u0026ldquo;Scientific and Technological Innovation 2030\u0026rdquo; Program of China Ministry of Science and Technology (2021ZD0113803) and the National Natural Science Foundation of China (NSFC) grant 62276258.\nREFERENCES # [1] W. Luo, W. Liu, and S. Gao, \u0026ldquo;A revisit of sparse coding based anomaly detection in stacked rnn framework,\u0026rdquo; in Proceedings of the IEEE international conference on computer vision, 2017, pp. 341–349. [2] H. Park, J. Noh, and B. Ham, \u0026ldquo;Learning memory-guided normality for anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14 372–14 381. [3] F. Dong, Y. Zhang, and X. Nie, \u0026ldquo;Dual discriminator generative adversarial network for video anomaly detection,\u0026rdquo; IEEE Access, vol. 8, pp. 88 170–88 176, 2020. [4] Y. Lu, F. Yu, M. K. K. Reddy, and Y. 
Wang, \u0026ldquo;Few-shot scene-adaptive anomaly detection,\u0026rdquo; in European Conference on Computer Vision, 2020, pp. 125–141. [5] C. Park, M. Cho, M. Lee, and S. Lee, \u0026ldquo;Fastano: Fast anomaly detection via spatio-temporal patch transformation,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 2249–2259. [6] Z. Xu, X. Zeng, G. Ji, and B. Sheng, \u0026ldquo;Improved anomaly detection in surveillance videos with multiple probabilistic models inference,\u0026rdquo; Intelligent Automation \u0026amp; Soft Computing, vol. 31, pp. 1703–1717, 2022. [7] K. Cheng, X. Zeng, Y. Liu, M. Zhao, C. Pang, and X. Hu, \u0026ldquo;Spatial-temporal graph convolutional network boosted flow-frame prediction for video anomaly detection,\u0026rdquo; in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5. [8] S. Gao, J. Gong, P. Yang, C. Liang, and L. Huang, \u0026ldquo;A stable long-term tracking method for group-housed pigs,\u0026rdquo; in International Conference on Image and Graphics. Springer, 2023, pp. 238–249. [9] S. Gao, P. Yang, and L. Huang, \u0026ldquo;Scene-adaptive svad based on multimodal action-based feature extraction,\u0026rdquo; in Proceedings of the Asian Conference on Computer Vision, 2024, pp. 2471–2488. [10] T. Feng, Q. Qi, L. Guo, and J. Wang, \u0026ldquo;Meta-uad: A meta-learning scheme for user-level network traffic anomaly detection,\u0026rdquo; arXiv preprint arXiv:2408.17031, 2024. [11] T. Feng, X. Wang, F. Han, L. Zhang, and W. Zhu, \u0026ldquo;U2udata: A large-scale cooperative perception dataset for swarm uavs autonomous flight,\u0026rdquo; in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 7600–7608. [12] H. Zhou, J. Yu, and W. Yang, \u0026ldquo;Dual memory units with uncertainty regulation for weakly supervised video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2302.05160, 2023. 
[13] M. Z. Zaheer, A. Mahmood, M. H. Khan, M. Segu, F. Yu, and S.-I. Lee, \u0026ldquo;Generative cooperative learning for unsupervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 14 744–14 754. [14] H. K. Joo, K. Vo, K. Yamazaki, and N. Le, \u0026ldquo;Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection,\u0026rdquo; in 2023 IEEE International Conference on Image Processing (ICIP) , 2023, pp. 3230–3234. [15] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang, \u0026ldquo;Vadclip: Adapting vision-language models for weakly supervised video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2308.11681, 2023. [16] A. Acsintoae, A. Florescu, M.-I. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah, \u0026ldquo;Ubnormal: New benchmark for supervised open-set video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 20 143–20 153. [17] G. Wang, Y. Wang, J. Qin, D. Zhang, X. Bao, and D. Huang, \u0026ldquo;Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles,\u0026rdquo; in European Conference on Computer Vision, 2022, pp. 494–511. [18] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection–a new baseline,\u0026rdquo; in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545. [19] C. Park, M. Cho, M. Lee, and S. Lee, \u0026ldquo;Fastano: Fast anomaly detection via spatio-temporal patch transformation,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , 2022, pp. 2249–2259. [20] G. Yu, S. Wang, Z. Cai, E. Zhu, C. Xu, J. Yin, and M. 
Kloft, \u0026ldquo;Cloze test helps: Effective video anomaly detection via learning to complete video events,\u0026rdquo; in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 583–591. [21] M. I. Georgescu, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, \u0026ldquo;A background-agnostic framework with adversarial training for abnormal event detection in video,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 9, pp. 4505–4523, 2021. [22] Y. Liu, D. Li, W. Zhu, D. Yang, J. Liu, and L. Song, \u0026ldquo;Msn-net: Multiscale normality network for video anomaly detection,\u0026rdquo; in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5. [23] T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Rojas, G. Feng, H. Zhao, H. Lai et al., \u0026ldquo;Chatglm: A family of large language models from glm-130b to glm-4 all tools,\u0026rdquo; arXiv preprint arXiv:2406.12793, 2024. [24] W. Hong, W. Wang, M. Ding, W. Yu, Q. Lv, Y. Wang, Y. Cheng, S. Huang, J. Ji, Z. Xue et al., \u0026ldquo;Cogvlm2: Visual language models for image and video understanding,\u0026rdquo; arXiv preprint arXiv:2408.16500, 2024. [25] H. Liu, C. Li, Y. Li, and Y. J. Lee, \u0026ldquo;Improved baselines with visual instruction tuning,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 296–26 306. [26] M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, \u0026ldquo;Video-chatgpt: Towards detailed video understanding via large vision and language models,\u0026rdquo; arXiv preprint arXiv:2306.05424, 2023. [27] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., \u0026ldquo;The llama 3 herd of models,\u0026rdquo; arXiv preprint arXiv:2407.21783, 2024. [28] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. 
Bhargava, S. Bhosale et al., \u0026ldquo;Llama 2: Open foundation and fine-tuned chat models,\u0026rdquo; arXiv preprint arXiv:2307.09288, 2023. [29] L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, \u0026ldquo;Harnessing large language models for training-free video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 527–18 536. [30] Z. Liu, X.-M. Wu, D. Zheng, K.-Y. Lin, and W.-S. Zheng, \u0026ldquo;Generating anomalies for video anomaly detection with prompt-based feature mapping,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 24 500–24 510. [31] N. Madan, N.-C. Ristea, R. T. Ionescu, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, \u0026ldquo;Self-supervised masked convolutional transformer block for anomaly detection,\u0026rdquo; IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. [32] N.-C. Ristea, F.-A. Croitoru, R. T. Ionescu, M. Popescu, F. S. Khan, and M. Shah, \u0026ldquo;Self-distilled masked auto-encoders are efficient video anomaly detectors,\u0026rdquo; arXiv preprint arXiv:2306.12041, 2023. [33] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, \u0026ldquo;Unbiased multiple instance learning for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 8022–8031. [34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in International conference on machine learning, 2021, pp. 8748–8763. [35] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, \u0026ldquo;Anomaly detection in crowded scenes,\u0026rdquo; in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981. [36] C. Lu, J. 
Shi, and J. Jia, \u0026ldquo;Abnormal event detection at 150 fps in matlab,\u0026rdquo; in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2720–2727. [37] W. Sultani, C. Chen, and M. Shah, \u0026ldquo;Real-world anomaly detection in surveillance videos,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6479–6488. [38] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, \u0026ldquo;Not only look, but also listen: Learning multimodal violence detection under weak supervision,\u0026rdquo; in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer, 2020, pp. 322–339. [39] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, \u0026ldquo;Imagebind: One embedding space to bind them all,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 180–15 190. [40] M. Astrid, M. Z. Zaheer, and S.-I. Lee, \u0026ldquo;Synthetic temporal anomaly guided end-to-end video anomaly detection,\u0026rdquo; in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 207–214. 
","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/suvad_semantic_understanding_based_video_anomaly_detection_using_mllm/","section":"Papers","summary":"Proposes a training-free video anomaly detection method leveraging multi-modal large language models for semantic understanding of videos, enabling scene generalization, interpretability, and flexible anomaly definition without retraining.","title":"SUVAD: Semantic Understanding Based Video Anomaly Detection Using MLLM","type":"method"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/tal-reiss/","section":"Authors","summary":"","title":"Tal Reiss","type":"authors"},{"content":" Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection # Zhiwei Yang 1 , Jing Liu 1* *, Peng Wu 2\nGuangzhou Institute of Technology, Xidian University, Guangzhou, China 2 School of Computer Science, Northwestern Polytechnical University, Xi\u0026rsquo;an, China\n1\n{zwyang97, neouma}@163.com, xdwupeng@gmail.com\nAbstract # Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels based on weak-label and then self-training a classifier is currently a promising solution. However, since the existing methods use only RGB visual modality and the utilization of category text information is neglected, thus limiting the generation of more accurate pseudo-labels and affecting the performance of self-training. Inspired by the manual labeling process based on the event description, in this paper, we propose a novel pseudo-label generation and selftraining framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model for aligning the video event description text and corresponding video frames to generate pseudo-labels. 
Specifically, we first fine-tune CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism, assisted by a normality visual prompt, to further improve the matching accuracy between video event description text and video frames. Then, we design a pseudo-label generation module based on normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Violence, demonstrating the effectiveness of our proposed method.\n*Corresponding authors.\n1. Introduction # Anomaly detection has been widely researched and applied in various fields, such as computer vision [23, 35, 40, 43, 49], natural language processing [1], and intelligent optimization [29]. One of the most important research issues is video anomaly detection (VAD). The main purpose of VAD is to automatically identify events or behaviors in a video that are inconsistent with our expectations.\nFigure 1. Illustration of the manual video frame labeling process.\nDue to the rarity of anomalous events and the difficulty of frame-level labeling, current VAD methods focus on semi-supervised [14, 16, 18] and weakly supervised [11, 26, 52] paradigms. Semi-supervised VAD methods aim to learn normality patterns from normal data, and deviations from these patterns are considered anomalies. However, due to the lack of discriminative anomaly information in the training phase, these models are often prone to overfitting, leading to poor performance in complex scenarios. Subsequently, weakly supervised video anomaly detection (WSVAD) methods came into prominence. 
WSVAD involves both normal and abnormal videos with video-level labels in the training phase, but the exact location of abnormal frames is unknown. Current WSVAD methods mainly include one-stage methods based on multi-instance learning (MIL) [17, 26, 27] and two-stage methods based on pseudo-label self-training [6, 11, 51, 53]. While the one-stage methods based on MIL show promising results, this paradigm tends to focus on video snippets with prominent anomalous features and pays suboptimal attention to minor anomalies, thus limiting further performance improvement.\nIn contrast to the one-stage methods mentioned above, two-stage methods based on pseudo-label self-training generally use an off-the-shelf classifier or MIL to obtain initial pseudo-labels, and then train the classifier with further refined pseudo-labels. Because these methods train the classifier directly with the generated fine-grained pseudo-labels, they show great potential in performance. However, these methods still leave two aspects unconsidered: first, the generation of pseudo-labels is based only on the visual modality and lacks the utilization of the textual modality, which limits the accuracy and completeness of the generated pseudo-labels. Second, the mining of temporal dependencies among video frames is insufficient.\nTo further exploit the potential of pseudo-label-based self-training for WSVAD, we investigate the two problems mentioned above in this paper. Our motivation for the first problem is to explore how textual modal information can be effectively utilized to assist in generating pseudo-labels. Recalling our manual process of video frame labeling, we mainly rely on textual definitions of anomalous events, i.e., prior knowledge about anomalous events, to accurately locate the video frames. As illustrated in Fig. 
1, assuming that we need to annotate the abnormal video frames that contain a \u0026ldquo;fighting\u0026rdquo; event, we first recall the textual definition of \u0026ldquo;fighting\u0026rdquo; and then look for matching video frames, which is actually a process of text-image matching based on prior knowledge. Inspired by this process, we adopt the popular and powerful contrastive language-image pre-training (CLIP) [19] model to assist us in achieving this goal. On the one hand, CLIP learns from a large number of image-text pairs on the web, and thus has rich prior knowledge; on the other hand, CLIP is trained by contrastive learning, which gives it excellent image-text alignment capabilities. For the second motivation, different video events have diverse durations, which leads to different ranges of temporal dependencies. Existing methods either do not consider temporal dependencies or only consider dependencies within a fixed temporal range, leading to inadequate modeling of temporal dependencies. Therefore, in order to achieve more flexible and adequate modeling of temporal dependencies, we should investigate methods that can adaptively learn temporal dependencies of different lengths.\nBased on the above two motivations, we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our main idea is to utilize the CLIP model to match the textual descriptions of video events with the corresponding video frames, and then infer the pseudo-labels from the match similarities. However, since the CLIP model is trained at the image-text level, it may suffer from domain bias and lacks the ability to learn temporal dependencies in videos. 
In order to better transfer the prior knowledge of CLIP to the WSVAD task, we first construct a contrastive learning framework by designing two ranking losses and a distributional inconsistency loss to fine-tune the CLIP model for domain adaptation under the weakly-supervised setting. To further improve the accuracy of aligning the descriptive text of video events with video frames, we employ learnable textual prompts to help the text encoder of CLIP generate more generalized textual embedding features. On this basis, we propose a normality visual prompt (NVP) mechanism to aid this process. In addition, because abnormal videos contain normal video frames as well, we design a pseudo-label generation (PLG) module based on normality guidance, which can reduce the interference caused by individual normal video frames to the alignment of abnormal video frames, thus helping to obtain more accurate frame-level labels.\nFurthermore, to compensate for the lack of temporal relationship modeling in CLIP and to mine the temporal dependencies between video frames more flexibly and adequately, we introduce a temporal context self-adaptive learning (TCSAL) module for temporal dependency modeling, inspired by the work [25]. TCSAL allows the attention module in the Transformer to adaptively adjust the attention span according to the inputs by designing a temporal span adaptive learning mechanism. This helps the model capture the temporal dependencies of video events of different durations more accurately and flexibly.\nOverall, our main contributions are summarized below:\nWe propose a novel framework, i.e., TPWNG, to perform pseudo-label generation and self-training for WSVAD. TPWNG fine-tunes CLIP with the designed ranking loss and distributional inconsistency loss to transfer its strong text-image alignment capability to assist pseudo-label generation by means of the PLG module.\nWe design a learnable text prompt and normality visual prompt mechanism to further improve the alignment accuracy of video event description text and video frames.\nWe introduce a TCSAL module to learn the temporal dependencies of different video events more flexibly and accurately. To the best of our knowledge, we are the first to introduce the idea of self-adaptive learning of temporal context dependencies for VAD.\nExtensive experiments have been conducted on two benchmark datasets, UCF-Crime and XD-Violence, where the excellent performance demonstrates the effectiveness of our method.\n2. Related Work # 2.1. Video Anomaly Detection # The VAD task has been widely studied, and many methods have been proposed to solve this problem. According to their supervision modes, these methods can be mainly categorized into semi-supervised and weakly supervised VAD.\nSemi-supervised VAD. Early researchers mainly used semi-supervised approaches to solve the VAD problem [2, 7, 8, 10, 14, 15, 20, 24, 31, 33, 41–44, 46, 50]. In the semi-supervised setting, only normal data can be acquired in the training phase, which aims to build a model that can characterize normal behavioral patterns by learning from normal data. During the testing phase, data that contradict the normal patterns are considered anomalies. Common semi-supervised VAD methods mainly include one-class classifier-based methods [21, 33, 37] and reconstruction-based [8, 38] or prediction error-based methods [14, 42]. For example, Xu et al. [38] used multiple one-class classifiers to predict anomaly scores based on appearance and motion features. Hasan et al. [8] built a fully convolutional auto-encoder to learn regular patterns in the video. Liu et al. [14] proposed a novel video anomaly detection method that utilizes the U-Net architecture to predict future frames, where frames with large prediction errors are considered anomalous.\nWeakly Supervised VAD. 
Compared to semi-supervised VAD methods, WSVAD can utilize both normal and anomalous data with video-level labels in the training phase, but the exact frame location where the abnormal event occurred is unknown. In such a setting, the one-stage approaches based on MIL [3–5, 13, 17, 22, 26, 27, 32, 34, 45, 54] and the two-stage approaches based on pseudo-label self-training [6, 11, 51, 53] are the two prevailing approaches. For example, Sultani et al. [26] first proposed a deep MIL ranking framework for VAD, where they considered anomalous and normal videos as positive and negative bags, respectively, and the snippets in the videos as instances. A ranking loss is then used to constrain the snippets with the highest anomaly scores in the positive and negative bags to stay away from each other. Later, many variants of the method were proposed on this basis. For example, Tian et al. [27] proposed a top-k MIL based VAD method with robust temporal feature magnitude learning.\nHowever, these one-stage methods generally use a MIL framework, which leads to models that tend to focus only on the most significant anomalous snippets while ignoring nontrivial anomalous snippets. A two-stage approach based on pseudo-label self-training provides a relatively more promising solution. The two-stage approach first generates initial pseudo-labels using MIL or an off-the-shelf classifier and then refines the labels before using them for supervised training of the classifier. For example, Zhong et al. [53] reformulated the WSVAD problem as a supervised learning task under noisy labels obtained by an off-the-shelf video classifier. Feng et al. [6] introduced a multiple instance pseudo-label generator that produces more reliable pseudo-labels for fine-tuning a task-specific feature encoder with a self-training mechanism. Zhang et al. [51] exploited completeness and uncertainty properties to enhance pseudo-labels for effective self-training. 
However, all these existing methods only generate pseudo-labels based on visual unimodal information and lack the utilization of the textual modality. Therefore, in this paper, we endeavor to combine both visual and textual modal information in order to generate more accurate and complete pseudo-labels for self-training of the classifier.\n2.2. Large Vision-Language Models # Recently, there has been an emergence of large vision-language models that learn the interconnections between visual and textual modalities by pre-training on large-scale datasets. Among these methods, CLIP demonstrates unprecedented performance in many vision-language downstream tasks, e.g., image classification [55], object detection [56], and semantic segmentation [12]. The CLIP model has recently been successfully extended to the video domain as well. VideoCLIP [39] is proposed to align video and textual representations by contrasting temporally overlapping video-text pairs with mined hard negatives. ActionCLIP [30] formulated the action recognition task as a multimodal learning problem rather than a traditional unimodal classification task. However, there have been few attempts to utilize CLIP models to solve VAD tasks. Joo et al. [9] simply utilize CLIP\u0026rsquo;s image encoder for extracting more discriminative visual features and do not use textual information. Wu et al. [36] and Zanella et al. [48] mainly use textual features from CLIP to enhance the expressiveness of the overall features, followed by MIL-based anomaly classifier learning. The major difference from the above works is that our method is the first to utilize the textual features encoded by the CLIP text encoder in conjunction with the visual features to generate pseudo-labels, and then employ a supervised approach to train an anomaly classifier.\n3. 
Method # In this section, we first present the definition of the WSVAD task, then introduce the overall architecture of our proposed method, and subsequently elaborate on the details of each module and the execution process.\n3.1. Overall Architecture # Formally, we first define sets D^a = {(v_i^a, y_i)}_{i=1}^M and D^n = {(v_i^n, y_i)}_{i=1}^M containing M abnormal and normal videos with ground-truth labels, respectively. Each v_i^a is labeled y_i = 1, indicating that this video contains at least one anomalous video frame, but the exact location of the anomalous frame is unknown. Each v_i^n is labeled y_i = 0, indicating that this video consists entirely of normal frames. With this setting, the WSVAD task is to utilize coarse-grained video-level labels to enable a classifier to learn to predict fine-grained frame-level anomaly scores.\nFig. 2 illustrates the overall pipeline of our approach. Normal and abnormal videos, along with learnable category prompt text, are encoded as feature embeddings by the image encoder and text encoder of CLIP, respectively. Then, the text encoder of CLIP is fine-tuned to produce textual feature embeddings of video event categories that accurately match anomalous or normal video frames, and the NVP assists in this process. Meanwhile, the image features are fed to the TCSAL module to perform self-adaptive learning of temporal dependencies. Finally, a video frame classifier is trained to predict anomaly scores under the supervision of pseudo-labels obtained by the PLG module.\nFigure 2. The overall architecture of our proposed TPWNG.\n3.2. Text and Normality Visual Prompt # Learnable Text Prompt. Constructing textual prompts that can accurately describe various video event categories is a prerequisite for realizing the alignment of text and corresponding video frames. However, it is impractical to manually define description texts that can completely characterize anomalous events in all different scenarios. 
Therefore, inspired by CoOp [55], we employ a learnable text prompt mechanism to adaptively learn representative video event text prompts to align the corresponding video frames. Specifically, we construct a learnable prompt template, which adds l learnable prompt vectors in front of the tokenized category name, as follows:\nwhere ∂_l denotes the l-th prompt vector. Tokenizer converts the original category labels, e.g., \u0026ldquo;fighting\u0026rdquo;, \u0026ldquo;accident\u0026rdquo;, \u0026hellip;, \u0026ldquo;normal\u0026rdquo;, into class tokens by means of the CLIP tokenizer. Then, we add the corresponding location information pos to the learnable prompts and feed the result to the CLIP text encoder ζ_text to get the feature embedding T_label ∈ R^D of the video event description text as follows:\nFinally, we compute all video event categories according to Eqs. (1) and (2) to obtain the video event description text embedding set E = {T_1^a, T_2^a, \u0026hellip;, T_{k-1}^a, T_k^n}, where {T_i^a}_{i=1}^{k-1} denotes the description text embeddings of the preceding k − 1 abnormal events and T_k^n denotes the description text embedding of normal events.\nNormality Visual Prompt. For an anomalous video, which contains both anomalous and normal frames, our core task is to infer pseudo-labels from the match similarities between the description text of the anomalous events and the video frames. However, this process is susceptible to interference from normal frames in the anomalous video because they have a similar background to the anomalous frames. To minimize this interference, we propose an NVP mechanism. NVP is used to help the normal event description text align more accurately with normal frames in the abnormal video, and thus indirectly helps the description text of the abnormal event align with abnormal video frames in the abnormal video by means of the distribution inconsistency loss that will be introduced in Sec. 3.5. 
Specifically, we first compute the match similarities S_{i,k}^{nn} ∈ R^F between the description text embedding of the normal event and the video frame features in the normal video. Then, the match similarities after the softmax operation are used as weights to aggregate normal video frame features to obtain the NVP Q_i ∈ R^D. The formulas are represented as follows:\nwhere X_i^n ∈ R^{F×D} denotes the visual features of the normal video v_i^n obtained by the CLIP image encoder, where F and D denote the number of video frames and the feature dimension, respectively. Then, we concatenate Q_i and T_k^n in the feature dimension and feed the result to an FFN layer with skip connections to obtain the enhanced description text embedding Ṫ_k^n of normal events. The formula is represented as follows:\n3.3. Pseudo Label Generation Module # In this subsection, we detail how to generate frame-level pseudo-labels. For a normal video, we can directly get the frame-level pseudo-labels, i.e., a v_i^n = {I_j}_{j=1}^F containing F normal frames corresponds to a label set {γ_{i,j}^n = 0}_{j=1}^F. Our main goal is to infer the pseudo-labels for anomalous videos that contain both anomalous and normal frames. To this end, we propose a PLG module for inferring accurate pseudo-labels based on the normality guidance. The PLG module infers frame-level pseudo-labels by incorporating the match similarities between the description text of the normal event and the abnormal video as a guide into the match similarities between the description text of the corresponding abnormal event and the abnormal video.\nSpecifically, we first compute the match similarities S_{i,k}^{an} = X_i^a (Ṫ_k^n)^⊤ between the normal event description text embedding enhanced with NVP and the anomalous video features, where X_i^a ∈ R^{F×D} denotes the visual features of the anomalous video v_i^a obtained by the CLIP image encoder. 
Similarly, we compute the match similarities S_{i,τ}^{aa} = X_i^a (T_a^τ)^⊤ between the description text embedding T_a^τ of the corresponding τ-th (1 ⩽ τ ⩽ k − 1) real anomaly category and the anomalous video features X_i^a.\nTheoretically, S_{i,τ}^{aa} should have high match similarities for abnormal frames and low match similarities for normal frames. But it may be disturbed by normal frames from the same video that share the same background. To reduce the interference of normal frames, we infer pseudo-labels by incorporating the match similarities of the normal event description text, with a certain weight, as guidance into the match similarities of the description text of the corresponding real abnormal event. Specifically, we first perform a normalization and fusion operation on S_{i,τ}^{aa} and S_{i,k}^{an} as follows:\nwhere the tilde (˜) denotes the normalization operation and α denotes the guidance weight. After obtaining ψ_i, we similarly perform a normalization operation on it to obtain ψ̃_i. Then, we set a threshold θ on ψ̃_i to obtain the frame-level pseudo-labels in the anomalous video as follows:\nwhere γ_{i,j}^a denotes the pseudo-label of the j-th frame in the i-th anomalous video. Finally, we combine the frame-level pseudo-labels γ_{i,j}^n and γ_{i,j}^a of normal and anomalous videos to get the total pseudo-label set {γ_{i,j}}_{j=1}^F.\n3.4. Temporal Context Self-adaptive Learning # To adaptively adjust the learning range of temporal relationships based on the input video data, inspired by the work [25], we introduce a TCSAL module. The backbone of TCSAL is the transformer encoder, but unlike the original transformer, the attention span is controlled by a soft mask function χ_z for each self-attention head at each layer. 
χ_z is a piecewise function mapping a distance to a value in [0, 1] as follows:\nwhere h represents the distance between the current t-th frame in a video and the r-th (r ∈ [1, t − 1]) frame in the past temporal range. R is a hyperparameter used to control the softness. z is a learnable parameter that is adaptively tuned with the input as follows:\nwhere σ represents the sigmoid operation, and C and b are parameters learned during model training. With the soft mask function χ_z, the corresponding attention weights ω_{t,r} are computed within this mask, i.e.,\nwhere β_{t,r} denotes the dot product of the Query corresponding to the t-th frame in a video with the Key corresponding to the r-th frame in the past. Under the control of χ_z, the self-attention heads are able to adaptively adjust their attention span according to the input.\nFinally, the video features after temporal context adaptive learning are fed into a classifier to predict the frame-level abnormality scores {η_{i,j}}_{j=1}^F.\n3.5. Objective Function # First, we fine-tune the CLIP text encoder. For a normal video, we further compute the match similarity set φ_i^{na} = {S_{i,τ}^{na} = X_i^n (T_a^τ)^⊤ | 1 ⩽ τ ⩽ k − 1} between the description texts of the other k − 1 anomalous events and the normal frames. We expect that the maximum in the similarity set φ_i^{na} should be as small as possible while the maximum in S_{i,k}^{nn} should be as large as possible. 
Thus, we design the following ranking loss as a constraint:\nFor an anomalous video, we first calculate the similarities S_{i,k}^{an} = X_i^a (Ṫ_n^k)^⊤ between the description text embedding of the normal event and the anomalous video features, the similarities S_{i,τ}^{aa} = X_i^a (T_a^τ)^⊤ between the description text embedding of the τ-th (1 ⩽ τ ⩽ k − 1) real anomalous event category and the anomalous video features, and the similarity set φ_i^{aa} = {S_{i,g}^{aa} = X_i^a (T_a^g)^⊤ | 1 ⩽ g ⩽ k − 1, g ≠ τ} between the description text embeddings of the other k − 2 anomalous event categories and the anomalous video features, respectively. We expect that the maximum value in S_{i,k}^{an} should be greater than the maximum value in φ_i^{aa}. Similarly, the maximum value in S_{i,τ}^{aa} should be greater than the maximum value in φ_i^{aa}. In short, we expect that the description texts of the real abnormal event and the normal event should match the abnormal and normal frames in the abnormal video with the highest possible similarity, respectively. Thus, the ranking loss for anomalous videos is designed as follows:\nIn addition, to further ensure that the description texts of real abnormal events and normal events can accurately align with the abnormal and normal video frames in the abnormal video, respectively, we design a distribution inconsistency loss (DIL). DIL constrains the similarity distribution between the description text of the real abnormal event and the video frames to be inconsistent with the similarity distribution between the description text of the normal event and the video frames. 
We use cosine similarity to implement this loss:\nThen, following the work [26], in order to make the generated pseudo-labels sparse and temporally smooth, we impose sparsity and smoothness constraints, L_sp = Σ_{j=1}^{F} S̃_{i,j,τ}^{aa} and L_sm = Σ_{j=1}^{F−1} (S̃_{i,j,τ}^{aa} − S̃_{i,j+1,τ}^{aa})^2, on the similarity vectors S̃_{i,τ}^{aa}.\nThen, we calculate the binary cross-entropy between the anomaly score η_{i,j} predicted by the classifier and the pseudo-label γ_{i,j} as the classification loss:\nThe final overall objective function balanced by λ_1 and λ_2 is designed as follows:\n4. Experiments # 4.1. Datasets and Evaluation Metrics # Datasets. We conduct extensive experiments on two benchmark datasets, UCF-Crime [26] and XD-Violence [34]. UCF-Crime is a large-scale real-scene dataset for WSVAD. It has a total duration of 128 hours and contains 1900 surveillance videos covering 13 anomaly event categories, of which 1610 videos with video-level labels are used for training and 290 videos with frame-level labels are used for testing. XD-Violence is a large-scale violence detection dataset collected from movies, online videos, surveillance videos, CCTV, etc. It lasts 217 hours and contains 4754 videos covering 6 anomaly event categories, of which 3954 videos with video-level labels are used for training and 800 videos with frame-level labels are used for testing.\nEvaluation Metrics. Following the previous methods [6, 26], for the UCF-Crime dataset, we measure the performance of our method using the area under the curve (AUC) of the frame-level receiver operating characteristics (ROC). Similarly, for the XD-Violence dataset, we follow the evaluation criterion of average precision (AP) suggested by the work [34] to measure the effectiveness of our method.\n4.2. 
Implementation Details # The image and text encoders in our method use a pre-trained CLIP (ViT-B/16), in which both the image and text encoders are kept frozen, except that the final projection layer of the text encoder is unfrozen for fine-tuning. The feature dimension D is 512. FFN is a standard block from Transformer. The length l of the learnable sequence in the text prompt is set to 8. The normality guidance weight α is set to 0.2 for both the UCF-Crime and XD-Violence datasets. The pseudo-label generation threshold θ is set to 0.55 and 0.35 for the UCF-Crime and XD-Violence datasets, respectively. The parameter R used to control the softness of the soft mask function is set to 256. The sparse loss and smoothing loss weights are set to λ_1 = 0.1 and λ_2 = 0.01. Please refer to the supplementary materials for more details on implementation.\n4.3. Comparison with State-of-the-art Methods # We compare the performance on the UCF-Crime and XD-Violence datasets with the current state-of-the-art (SOTA) methods in Tab. 1. As can be observed from the table, our method achieves a new SOTA on both the UCF-Crime and XD-Violence datasets. Specifically, for the UCF-Crime dataset, our method outperforms the current SOTA method CLIP-TSA [9] by 0.21%, which is not a trivial improvement for the challenging WSVAD task.\nTable 1. AUC and AP on the UCF-Crime and XD-Violence datasets.\n| Supervision | Methods | UCF (AUC) | XD (AP) |\n| --- | --- | --- | --- |\n| Weakly | Sultani et al. [26] | 77.92% | 73.20% |\n| Weakly | GCN [53] | 82.12% | - |\n| Weakly | HL-Net [34] | 82.44% | 73.67% |\n| Weakly | CLAWS [45] | 82.30% | - |\n| Weakly | MIST [6] | 82.30% | - |\n| Weakly | RTFM [27] | 84.30% | 77.81% |\n| Weakly | CRFD [32] | 84.89% | 75.90% |\n| Weakly | GCL [47] | 79.84% | - |\n| Weakly | MSL [11] | 85.62% | 78.58% |\n| Weakly | MGFN [3] | 86.67% | 80.11% |\n| Weakly | Zhang et al. [51] | 86.22% | 78.74% |\n| Weakly | UR-DMU [54] | 86.97% | 81.66% |\n| Weakly | CLIP-TSA [9] | 87.58% | 82.17% |\n| Weakly | Ours | 87.79% | 83.68% |\nMost importantly, compared to methods MIST [6] and Zhang et al. 
[51] similar to ours that also use pseudo-label-based self-training, our method significantly outperforms them by 5.49% and 1.57%, respectively. This demonstrates that our proposed pseudo-label generation and self-training framework is superior to the above two approaches. This also indicates that transferring visual-language multimodal associations through CLIP is conducive to generating more accurate pseudo-labels compared to merely utilizing unimodal visual information. For the XD-Violence dataset, our method also surpasses the current best method CLIP-TSA [9] by 1.52%. Compared to a similar pseudo-label-based self-training method, Zhang et al. [51], our method also outperforms it by 4.94%. The consistently superior performance on two large-scale real datasets demonstrates the effectiveness of our method. This also shows the potential of the pseudo-label-based self-training scheme, provided that accurate pseudo-labels can be generated using information from multiple modalities.\n4.4. Ablation Studies # We conduct ablation experiments in this subsection to analyze the effectiveness of each component of our framework.\nEffectiveness of Normal Visual Prompt. To verify the validity of NVP, we execute three comparison experiments: without NVP, with NVP based on frame averaging (NVP-FA), and with NVP based on match similarities aggregation (NVP-AS). As can be seen from the results in Tab. 2, in the absence of NVP, the performance of our method on the UCF-Crime and XD-Violence datasets decreases by 2.54% and 2.10%, respectively, compared to using NVP-AS. NVP-AS boosts the performance of the method by 0.47% and 0.55% more compared to NVP-FA on UCF-Crime and\nTable 2. The AUC and AP of our method on the UCF and XD datasets without NVP, with NVP-FA, and with NVP-AS.\n| Setting | UCF-Crime (AUC) | XD-Violence (AP) |\n| --- | --- | --- |\n| w/o NVP | 85.25% | 81.58% |\n| w NVP-FA | 87.32% | 83.13% |\n| w NVP-AS | 87.79% | 83.68% |\nTable 3. 
The AUC and AP of our method on the UCF and XD datasets with NG and without NG.\n| Setting | UCF-Crime (AUC) | XD-Violence (AP) |\n| --- | --- | --- |\n| w/o NG | 85.83% | 81.32% |\n| w NG | 87.79% | 83.68% |\nXD-Violence datasets, respectively. This reveals two facts: first, NVP can help the text embedding to better match normal frames in anomalous videos, which indirectly aids in generating more accurate pseudo-labels in cooperation with the DIL and the normality guidance mechanism. Second, NVP-AS can effectively reduce the interference of some noise snippets (e.g., prologue, perspective switching, etc.) in normal videos compared to the NVP-FA approach, thus obtaining a purer NVP.\nEffectiveness of the Normality Guidance. In the pseudo-label generation module, instead of inferring pseudo-labels directly based on the similarity between the corresponding abnormal event description text and the abnormal video, we incorporate guidance from the match similarities of the normal event description text, aiming to reduce the interference of partially noisy video frames and generate more accurate pseudo-labels. To verify the contribution of the normality guidance, we compare the performance of our method when the pseudo-label generation module is used with and without normality guidance (NG). As can be observed from Tab. 3, when our method is equipped with normality guidance, the performance rises by 1.96% and 2.36% on the UCF-Crime and XD-Violence datasets, respectively. This validates the effectiveness of the normality guidance.\nEffectiveness of TCSAL. To analyze the effectiveness of the TCSAL module, we conduct comparative experiments with the Transformer-encoder (TF-encoder) module in [28], the MTN module in [27], and the GL-MHSA module in [54] by replacing the temporal learning module in our framework with each of these three modules. From Tab. 
4, it can be observed that the TF-encoder module has the lowest performance, which is understandable since global self-attention computation neglects local temporal information. Both MTN and GL-MHSA outperform TF-encoder with comparable performance. Our introduced TCSAL module achieves the best performance on both datasets. This indicates that adopting the mechanism of self-attention span range adaptive learning enables the temporal learning module to self-adapt to the inputs of videos with different event lengths, achieving more accurate modeling of temporal dependencies while weakening the interference of other non-relevant temporal information in the non-event span range.\nFigure 3. Anomaly score curves of several test samples on the UCF-Crime and XD-Violence datasets.\nTable 4. The AUC and AP of our method on the UCF and XD datasets with different temporal modules.\n| Temporal module | UCF (AUC) | XD (AP) |\n| --- | --- | --- |\n| w TF-encoder | 85.12% | 80.02% |\n| w MTN | 86.22% | 81.02% |\n| w GL-MHSA | 86.43% | 81.23% |\n| w TCSAL | 87.79% | 83.68% |\nTable 5. Comparison of the AUC and AP of our method with different loss terms on the UCF-Crime and XD-Violence datasets. \u0026ldquo;bs\u0026rdquo; indicates that the three loss functions L_cl, L_sp, and L_sm are used.\n| bs | L_rank^n | L_rank^a | L_dil | UCF (AUC) | XD (AP) |\n| --- | --- | --- | --- | --- | --- |\n| ✓ | | | | 77.12% | 73.32% |\n| ✓ | ✓ | | | 81.34% | 78.67% |\n| ✓ | | ✓ | | 84.45% | 81.56% |\n| ✓ | | | ✓ | 82.47% | 79.96% |\n| ✓ | ✓ | ✓ | ✓ | 87.79% | 83.68% |\n4.5. Qualitative Results # We show the anomaly scores of our method on several test videos in Fig. 3. There is a steep rise in the anomaly scores when various anomalous events occur, and as the anomalous events end, the anomaly scores rapidly fall back to the lower range. For normal events, our method gives a lower anomaly score. 
This intuitively demonstrates that our method is sensitive to abnormal events and can detect them accurately and promptly while maintaining low anomaly scores for normal events.\n4.6. Analysis of Losses # To analyze the impact of the three loss functions L_rank^n, L_rank^a, and L_dil, we perform ablation experiments on the UCF-Crime and XD-Violence datasets. As shown in Tab. 5, when all three loss functions are absent, the performance of our method is unsatisfactory. This reveals that the original CLIP suffers from domain bias and is not directly applicable to the VAD domain. When each of the three loss functions is used individually, the performance of our method is clearly improved, with L_rank^a giving the biggest boost. When all three losses are combined and cooperate with each other, our method achieves the best performance. This demonstrates the effectiveness of the three loss functions we have designed, and they can effectively assist CLIP in domain adaptation for WSVAD.\n5. Conclusions # In this paper, we propose a novel framework, TPWNG, to perform pseudo-label generation and self-training for WSVAD. TPWNG finetunes CLIP with the designed ranking loss and distributional inconsistency loss to transfer its text-image alignment capability to assist pseudo-label generation with the PLG module. Further, we design learnable text prompt and normality visual prompt mechanisms to further improve the alignment accuracy between video event description text and video frames. Finally, we introduce a TCSAL module to learn the temporal dependencies of different video events more flexibly and accurately. We perform extensive experiments on the UCF-Crime and XD-Violence datasets, and the superior performance compared to existing methods demonstrates the effectiveness of our method.\n6. Acknowledgments # This work was supported by the Guangzhou Key Research and Development Program (No. 
202206030003), the Fundamental Research Funds for the Central Universities, the Innovation Fund of Xidian University (No. YJSJ24006), and the Guangdong High-level Innovation Research Institution Project (No. 2021B0909050008).\nReferences # [1] Christophe Bertero, Matthieu Roy, Carla Sauvanaud, and Gilles Trédan. Experience report: Log mining using natural language processing and application to anomaly detection. In ISSRE, pages 351–360, 2017. 1 [2] Ruichu Cai, Hao Zhang, Wen Liu, Shenghua Gao, and Zhifeng Hao. Appearance-motion memory consistency network for video anomaly detection. In AAAI, pages 938–946, 2021. 3 [3] Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. In AAAI, pages 387–395, 2023. 3 , 7 [4] MyeongAh Cho, Minjung Kim, Sangwon Hwang, Chaewon Park, Kyungjae Lee, and Sangyoun Lee. Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning. In CVPR, pages 12137–12146, 2023. [5] MyeongAh Cho, Minjung Kim, Sangwon Hwang, Chaewon Park, Kyungjae Lee, and Sangyoun Lee. Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning. In CVPR, pages 12137–12146, 2023. 3 [6] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist: Multiple instance self-training framework for video anomaly detection. In CVPR, pages 14009–14018, 2021. 1 , 3 , 6 , 7 [7] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In CVPR, pages 1705–1714, 2019. 3 [8] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In CVPR, pages 733–742, 2016. 3 [9] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. 
Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In ICIP, pages 3230–3234, 2023. 3 , 7 [10] Sangmin Lee, Hak Gu Kim, and Yong Man Ro. Bman: Bidirectional multi-scale aggregation networks for abnormal event detection. IEEE TIP, 29:2395–2408, 2019. 3 [11] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In AAAI, pages 1395–1403, 2022. 1 , 3 , 7 [12] Yuqi Lin, Minghao Chen, Wenxiao Wang, Boxi Wu, Ke Li, Binbin Lin, Haifeng Liu, and Xiaofei He. Clip is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation. In CVPR, pages 15305–15314, 2023. 3 [13] Tianshan Liu, Kin-Man Lam, and Jun Kong. Distilling privileged knowledge for anomalous event detection from weakly labeled videos. IEEE TNNLS, pages 1–15, 2023. 3 [14] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In CVPR, pages 6536–6545, 2018. 1 , 3\n[15] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In ICCV, pages 341–349, 2017. 3 [16] Hui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Learning normal dynamics in videos with meta prototype network. In CVPR, pages 15425–15434, 2021. 1 [17] Hui Lv, Chuanwei Zhou, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Localizing anomalies from weakly-labeled videos. IEEE TIP, 30:4505–4515, 2021. 1 , 3 [18] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. In CVPR, pages 14372–14381, 2020. 1 [19] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021. 
2 [20] Mohammad Sabokrou, Mahmood Fathy, Mojtaba Hoseini, and Reinhard Klette. Real-time anomaly detection and localization in crowded scenes. In CVPR, pages 56–62, 2015. 3 [21] Mohammad Sabokrou, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli. Adversarially learned one-class classifier for novelty detection. In CVPR, pages 3379–3388, 2018. 3 [22] Hitesh Sapkota and Qi Yu. Bayesian nonparametric submodular video partition for robust anomaly detection. In CVPR, pages 3212–3221, 2022. 3 [23] Fangtao Shao, Jing Liu, Peng Wu, Zhiwei Yang, and Zhaoyang Wu. Exploiting foreground and background separation for prohibited item detection in overlapping x-ray images. PR, 122:108261, 2022. 1 [24] Giulia Slavic, Abrham Shiferaw Alemaw, Lucio Marcenaro, David Martin Gomez, and Carlo Regazzoni. A kalman variational autoencoder model assisted by odometric clustering for video frame prediction and anomaly detection. IEEE TIP, 32:415–429, 2022. 3\n[25] Sainbayar Sukhbaatar, Édouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive attention span in transformers. In ACL, pages 331–335, 2019. 2 , 5\n[26] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In CVPR, pages 6479–6488, 2018. 1 , 3 , 6 , 7\n[27] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In ICCV, pages 4975–4986, 2021. 1 , 3 , 7\n[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017. 7\n[29] Chao Wang, Jing Liu, Kai Wu, and Zhaoyang Wu. Solving multitask optimization problems with adaptive knowledge transfer via anomaly detection. IEEE TEC, 26(2):304–318, 2021. 1\n[30] Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021. 
3\n[31] Xuanzhao Wang, Zhengping Che, Bo Jiang, Ning Xiao, Ke Yang, Jian Tang, Jieping Ye, Jingyu Wang, and Qi Qi. Robust unsupervised video anomaly detection by multipath frame prediction. IEEE TNNLS, 33(6):2301–2312, 2021. 3\n[32] Peng Wu and Jing Liu. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE TIP , 30:3513–3527, 2021. 3 , 7\n[33] Peng Wu, Jing Liu, and Fang Shen. A deep one-class neural network for anomalous event detection in complex scenes. IEEE TNNLS, 31(7):2609–2622, 2019. 3\n[34] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In ECCV, pages 322–339, 2020. 3 , 6 , 7\n[35] Peng Wu, Jing Liu, Xiangteng He, Yuxin Peng, Peng Wang, and Yanning Zhang. Towards video anomaly retrieval from video anomaly detection: New benchmarks and model. arXiv preprint arXiv:2307.12545, 2023. 1\n[36] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In AAAI, pages 6074–6082, 2024. 3\n[37] Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553, 2015. 3\n[38] Dan Xu, Yan Yan, Elisa Ricci, and Nicu Sebe. Detecting anomalous events in videos by learning deep representations of appearance and motion. CVIU, 156:117–127, 2017. 3\n[39] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021. 3\n[40] Minghui Yang, Jing Liu, Zhiwei Yang, and Zhaoyang Wu. Slsg: Industrial image anomaly detection by learning better feature embeddings and one-class classification. 
arXiv preprint arXiv:2305.00398, 2023. 1\n[41] Zhiwei Yang, Jing Liu, and Peng Wu. Bidirectional retrospective generation adversarial network for anomaly detection in videos. IEEE Access, 9:107842–107857, 2021. 3\n[42] Zhiwei Yang, Peng Wu, Jing Liu, and Xiaotao Liu. Dynamic local aggregation network with adaptive clusterer for anomaly detection. In ECCV, pages 404–421, 2022. 3\n[43] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. Video event restoration based on keyframes for video anomaly detection. In CVPR, pages 14592–14601, 2023. 1\n[44] Muchao Ye, Xiaojiang Peng, Weihao Gan, Wei Wu, and Yu Qiao. Anopcn: Video anomaly detection via deep predictive coding network. In ACM MM, pages 1805–1813, 2019. 3\n[45] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In ECCV, pages 358–376, 2020. 3 , 7\n[46] Muhammad Zaigham Zaheer, Jin-Ha Lee, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. Stabilizing adversarially learned one-class novelty detection using pseudo anomalies. IEEE TIP, 31:5963–5975, 2022. 3 [47] Muhammad Zaigham Zaheer, Arif Mahmood, Muhammad Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection. In CVPR, pages 14744–14754, 2022. 7 [48] Luca Zanella, Benedetta Liberatori, Willi Menapace, Fabio Poiesi, Yiming Wang, and Elisa Ricci. Delving into clip latent space for video anomaly recognition. arXiv preprint arXiv:2310.02835, 2023. 3 [49] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. Draem - a discriminatively trained reconstruction embedding for surface anomaly detection. In ICCV, pages 8330–8339, 2021. 1 [50] Shuangfei Zhai, Yu Cheng, Weining Lu, and Zhongfei Zhang. Deep structured energy based models for anomaly detection. In ICML, pages 1100–1109, 2016. 
3 [51] Chen Zhang, Guorong Li, Yuankai Qi, Shuhui Wang, Laiyun Qing, Qingming Huang, and Ming-Hsuan Yang. Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection. In CVPR, pages 16271–16280, 2023. 1 , 3 , 7 [52] Jiangong Zhang, Laiyun Qing, and Jun Miao. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In ICIP, pages 4030–4034, 2019. 1 [53] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In CVPR, pages 1237–1246, 2019. 1 , 3 , 7 [54] Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. In AAAI, pages 3769–3777, 2023. 3 , 7 [55] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 130(9):2337–2348, 2022. 3 , 4 [56] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In ECCV, pages 350–368, 2022. 3 Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection # Supplementary Material # 7. Network Structure Details # TCSAL. The TCSAL module consists of 4 transformer-encoder layers with 4 attention heads per layer, and each self-attention head of each layer self-adaptively adjusts its attention span through a soft mask function χ_z(h). The shape of the soft mask function χ_z(h) is shown in Fig. 4. Classifier. The classifier adopts a simple structure that consists of a layer normalization layer, a linear layer, and a sigmoid layer.\n8. Finetuning and Prompt Learning # The finetuning of the CLIP text encoder is performed together with the training of the NVP, PLG, TCSAL, and Classifier modules. 
During this process, the weights of both the CLIP image and text encoders are frozen, except for the last projection layer of the text encoder, which is unfrozen for finetuning.\nTo set the optimal finetuning configuration, we perform finetuning experiments on the final projection layers of the CLIP image encoder and text encoder as shown in Tab. 6. We find that, on top of prompt learning, there is a large performance improvement after finetuning the CLIP text encoder alone, whereas if we finetune only the final projection layer of the CLIP image encoder, or both the image encoder and the text encoder at the same time, the performance is lower than with text encoder finetuning alone, and the UCF-Crime AUC even drops below the no-finetuning baseline in both cases. Thus our final choice is prompt learning + finetuning (text encoder). We think this is due to the relatively small video anomaly dataset causing overfitting of the CLIP image encoder, which affects the method performance. When finetuning the text encoder alone, the overfitting is mitigated by the assistance of prompt learning. Finetuning also facilitates domain adaptation, so this combination of prompt learning + finetuning (text encoder) performs best.\nTable 6. The AUC and AP change of our method on the UCF-Crime and XD-Violence datasets with different finetuning and prompt learning configurations.\n| CLIP finetuning and prompt learning configurations | UCF (AUC) | XD (AP) |\n| --- | --- | --- |\n| No finetune + prompt learning | 86.45% | 81.33% |\n| Image encoder finetuning + prompt learning | 84.23% | 81.16% |\n| Text\u0026amp;Image encoder finetuning + prompt learning | 85.76% | 82.12% |\n| Text encoder finetuning + prompt learning | 87.79% | 83.68% |\nFigure 4. The shape of the soft mask function χ_z(h).\n9. Training and Inference # Different from existing two-stage methods that separate pseudo-label generation and classifier self-training into two stages, in our approach, we synchronize pseudo-label generation and classifier training until both converge. 
This ensures that the updated pseudo-labels are used for supervised classifier training in real time, minimizing the interference of noisy labels on classifier training. After training under the supervision of the generated pseudo-labels, only the CLIP image encoder, the TCSAL, and the classifier are involved in the testing phase, where the video frame anomaly scores are predicted directly by the classifier.\n10. Implementation Details. # Our method is implemented on a single NVIDIA RTX 3090 GPU using the PyTorch framework. We use the Adam optimizer with a weight decay of 0.005. The batch size is set to 64, which contains 32 normal videos and 32 abnormal videos randomly sampled from the training dataset. For the UCF-Crime dataset, the learning rate and total number of epochs are set to 0.001 and 50, respectively. For the XD-Violence dataset, the learning rate and total number of epochs are set to 0.0001 and 20, respectively.\n11. Impact of Normality Guidance Weight α. # The normality guidance weight α is used to control the degree of fusion of S̃_{i,k}^{an} and S̃_{i,τ}^{aa} during pseudo-label generation. In order to analyze the effect of α, we set different values of α for comparison experiments. As shown in Fig. 5, our method achieves optimal performance on both the UCF-Crime and XD-Violence datasets when α is set to 0.2. It can be observed that as α gradually increases, the performance of our method gradually decreases. We consider that this is because too large an α instead affects the alignment of the real anomaly event description text and the anomaly frames, and α = 0.2 is the best trade-off.\nFigure 5. The AUC and AP change of our method on the UCF-Crime and XD-Violence datasets with different normality guidance weight α.\nFigure 6. The AUC and AP change of our method on the UCF-Crime and XD-Violence datasets with different pseudo-label generation threshold θ.\n12. Impact of Pseudo-label Generation Threshold θ. 
# To analyze the impact of different pseudo-label generation thresholds θ on the performance of our method, we set up a series of different thresholds θ to perform comparative experiments. As shown in Fig. 6, the two datasets have different sensitivities to the threshold θ. When θ is set to 0.55 and 0.35, our method achieves the optimal performance on the UCF-Crime and XD-Violence datasets, respectively.\n13. Impact of Context Length l in Learnable Prompt. # To investigate the optimal length of learnable textual prompts, we conduct comparative experiments with the context length l being set to 4, 8, 16, and 32, respectively. As shown in Tab. 7, both datasets achieve the best performance with the context length l set to 8, and slightly lower performance with a length of 16. However, when the context length l is set to 4 or 32, the performance of our method suffers a large degradation. We conjecture that the reason for this result is that too short a context length leads to textual prompts that do not fully characterize the video frame events, leading to model underfitting. Conversely, too long a context length may lead to model overfitting.\nTable 7. The AUC and AP of our method on the UCF-Crime and XD-Violence datasets with different context lengths l.\n| l | UCF-Crime (AUC) | XD-Violence (AP) |\n| --- | --- | --- |\n| 4 | 82.26% | 77.45% |\n| 8 | 87.79% | 83.68% |\n| 16 | 87.24% | 82.99% |\n| 32 | 85.23% | 81.78% |\nFigure 7. Visualization of pseudo-labels of some video clips on the UCF-Crime dataset.\n14. Visualization of Pseudo-labels. # We visualize part of the pseudo-labels (UCF-Crime) in Fig. 7. The generated pseudo-labels (orange solid line) approximate the ground truth (shades of blue) well in most cases, which indicates the effectiveness of the generated pseudo-labels.\n15. Visualization of Match Similarities. 
# To show more intuitively that our framework helps the CLIP model perform domain adaptation for matching video event text descriptions with the corresponding video frames, we visualize S̃^{aa}_τ and S̃^{an}_k, i.e., the match similarities of the real abnormal event description text and the normal event description text with the corresponding abnormal videos, on the UCF-Crime and XD-Violence datasets. We can observe from Fig. 8 that the distributions of S̃^{aa}_τ and S̃^{an}_k are opposed to each other, aligning with anomalous video frames and normal video frames, respectively. This shows the effectiveness of our designed distributional inconsistency loss L_dil. In addition, we can notice from Fig. 8 (a) and (f) that the alignment between the real abnormal event description text and the corresponding abnormal video frames fluctuates in these two samples, while the normal event description text has a more accurate alignment. In such cases, our proposed normality guidance mechanism can assist S̃^{aa}_τ to better align with the abnormal video frames.\nFigure 8. Visualization of match similarities between video event description text and video frames for several anomaly samples on the UCF-Crime and XD-Violence test datasets. 
The light blue range represents abnormal ground truth.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/text-prompt-with-normality-guidance-for-weakly-supervised-video-anomaly-detection/","section":"Papers","summary":"Proposes a novel pseudo-label generation and self-training framework incorporating CLIP for text-image alignment, learnable text prompts, normality visual prompts, a pseudo-label generation module guided by normality clues, and a self-adaptive temporal dependence learning module, achieving state-of-the-art performance on benchmark datasets.","title":"Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection","type":"method"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/thomas-foltz/","section":"Authors","summary":"","title":"Thomas Foltz","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/tianyu-sun/","section":"Authors","summary":"","title":"Tianyu Sun","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/tommie-kerssies/","section":"Authors","summary":"","title":"Tommie Kerssies","type":"authors"},{"content":" Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models # 1* 2 1 1 2\nJiacong Xu Shao-Yuan Lo Bardia Safaei Vishal M. Patel Isht Dwivedi 1 Johns Hopkins University 2 Honda Research Institute USA\n{jxu155, bsafaei1, vpatel36}@jhu.edu {shao-yuan lo, idwivedi}@honda-ri.com\nAbstract # Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. 
However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD \u0026amp; reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D\u0026amp;R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: https://xujiacong.github.io/Anomaly-OV/\n1. Introduction # Visual Anomaly Detection (AD) is a well-established task in computer vision, extensively applied in scenarios such as industrial defect inspection [2 , 3 , 5 , 12 , 35 , 69 , 76 , 77 , 92 , 98] and medical image diagnosis [1 , 24 , 31 , 36 , 89 , 90 , 103 , 106]. In the traditional unsupervised AD (a.k.a. one-class AD) setting, models learn the distribution of normal visual features from normal samples and are required to identify anomaly samples during inference. While recent advancements [9 , 25 , 32 – 34 , 37 , 42 , 82 , 84 , 95 , 97 , 104] have significantly improved the detection performance, these approaches assume the availability of a substantial number of normal samples. However, this assumption becomes impractical in certain scenarios due to strict data privacy policies and the significant human effort required for data classification, sometimes involving experts or specialists. Therefore, Zero-Shot Anomaly Detection (ZSAD) is emerging as a popular research direction, leading to the development of many innovative methods [6 , 17 , 27 , 38 , 43 , 52 , 78 , 79 , 110 , 113].\nMost of this work was done when J. Xu was an intern at HRI-USA.\nFigure 1. Visualization of the image-level AUROC comparison between our Anomaly-OV and current state-of-the-art ZSAD methods (WinCLIP [38], AnoVL [19], AnomalyCLIP [110], AdaCLIP [6]). Notably, our zero-shot performance on VisA even surpasses most recent advances in the few-shot setting [28 , 51 , 112].\nRecent advances in Multimodal Large Language Models (MLLMs) [7 , 15 , 44 , 45 , 47 , 48 , 57 , 58 , 111] have shown revolutionary reasoning capabilities in various vision tasks [14 , 29 , 67 , 70 , 80 , 91 , 94 , 107 , 109]. However, the reasoning of image abnormalities has not been explored due to the challenges of collecting large-scale datasets and establishing benchmarks. Existing methods simply predict the likelihood of an anomaly without providing rationales [6 , 11 , 19 , 38 , 110]. In contrast, for better interpretability, robustness, and trustworthiness, people would expect models to explain why an image is considered anomalous and provide visual evidence. Interestingly, we find that recent advanced MLLMs, such as GPT-4o [72], fall short in AD \u0026amp; reasoning. As shown in Figure 2, while the detection is correct, the explanation from GPT-4o lacks accuracy, indicating a gap in a comprehensive understanding of the anomaly.\nTo expedite research in AD \u0026amp; reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D\u0026amp;R, through intensive human efforts. 
After evaluating current generalist MLLMs, we observe that these models fail to accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Unlike existing ZSAD methods [6 , 11 , 19 , 38 , 110], Anomaly-OV directly learns object-aware abnormality embeddings in feature space using only the visual encoder. Inspired by human behavior in visual inspection, Anomaly-OV employs a Look-Twice Feature Matching (LTFM) mechanism to assist its LLM in adaptively selecting and emphasizing the most suspicious abnormal visual tokens.\nExtensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extended results of Anomaly-OV, from applications in industrial defect detection to 3D inspection and medical image diagnosis, are provided for future study. With precise descriptions and rationales of visual anomalies, our model can infer potential causes (see Figure 2), assess current impacts, and offer improvement suggestions, positioning itself as a reliable assistant for visual inspection. Our contributions are twofold:\nWe establish the first visual instruction tuning dataset and benchmark for anomaly detection and reasoning.\nWe propose the first specialist visual assistant with state-of-the-art performance for this new impactful domain.\n2. Related Work # Multimodal Large Language Models. Vision-Language Models (VLMs), such as CLIP [73], exhibit robust zero-shot classification capabilities and have been applied to a range of downstream vision tasks [10 , 53 , 56 , 61 , 86]. Combining a VLM\u0026rsquo;s vision encoder and an LLM [20 , 62 , 74], MLLMs [15 , 48 , 49 , 57 , 111] enable text-based interactions related to visual content. 
MLLMs have shown remarkable reasoning capability, particularly when combined with prompting strategies such as Chain-of-Thought [66 , 88 , 105]. Recent studies have harnessed MLLMs to provide reasoning for downstream tasks, e.g., video anomaly detection [67 , 94], affective computing [14 , 29], and visual commonsense reasoning [109], revealing more interpretability.\nFigure 2. Industrial image anomaly reasoning results from GPT-4o [72] and our Anomaly-OV. The responses for fine-grained anomaly reasoning are highlighted, with the ground truth given for reference.\nUnsupervised Anomaly Detection. Due to the scarcity and difficulty of collecting anomalous data, researchers often focus on the unsupervised AD setting, which exclusively uses normal data to train an AD model. Earlier studies, such as reconstruction-based [63 , 69 , 99], student-teacher [18 , 85 , 102], and augmentation-based [46] approaches, assume a large amount of normal data is available. These traditional approaches are less practical when data are restricted or expensive, such as in the medical domain.\nZero-Shot Anomaly Detection. Unlike unsupervised AD [35 , 77] and few-shot AD [22 , 28 , 36 , 51 , 112], ZSAD models directly assess the likelihood of abnormality for a given image without requiring data specific to the target object. Existing works [6 , 19 , 38 , 110] accomplish ZSAD by comparing visual and textual features encoded by the visual and text encoders of CLIP and constructing their positive (anomaly) and negative (normal) prompts in the format of:\nwhere V_i and W_i are handcrafted or learnable tokens, and object refers to the word object or the class name of the object. However, simply utilizing object to represent all kinds of objects cannot capture the class-aware abnormality types. Also, for an intelligent visual assistant, the images should be totally blind to the user (object-agnostic).\nFigure 3. Overview of the Anomaly-OV architecture. 
It consists of two training stages: (1) professional training for the anomaly expert, and (2) visual instruction tuning for anomaly detection and reasoning. Text and visual tokens are distinguished by different colors.\n3. Method # 3.1. Preliminary # Training an MLLM from scratch demands extensive data and computational resources to align the visual and textual embedding spaces and develop robust instruction-following capabilities. Recent studies [23 , 83 , 93] reveal that pretrained MLLMs function as generalists, possessing a broad knowledge base but underperforming in specialized domains. Therefore, our goal is to introduce an auxiliary specialist or expert model designed to guide the generalist in selecting and utilizing critical visual tokens. This approach circumvents the need for large-scale pre-training while preserving the generalization capacity of the original model.\nWe choose LLaVA-OneVision [44] as our base MLLM because it is open-sourced and performs comparably to commercial models. LLaVA-OneVision follows the model architecture of the LLaVA family [47 , 57 – 59] and other generic MLLMs, which typically consist of three major components: Visual Encoder, Projector, and LLM. The visual encoder [73 , 100] extracts the visual information from the raw images, the projector aligns the space of visual features with the word embedding space, and the LLM is responsible for textual instruction processing and complex reasoning. Since the image resolution for CLIP pre-training is fixed, LLaVA-OneVision leverages the AnyRes strategy with pooling to scale up the input raw image resolution. Specifically, the high-resolution images are divided into a predefined number of crops, and the visual encoder independently processes the image crops before final spatial pooling.\n3.2. 
Architecture Overview # With the same image-splitting strategy AnyRes as LLaVA-OneVision, the input high-resolution image is split into several crops, and the new image set can be written as:\nwhere I_0 is the resized original image and I_{j≠0} refers to the image crops. As shown in Figure 3, the image set I will be processed by the visual encoder F_θ to generate the final visual features {v^o_j}. Similar to AnomalyCLIP [110], we store the outputs of four selected layers in the ViT [21] to capture the image representations from different levels and apply four adapters to compress the feature dimension. Then, the extracted visual features can be written as:\nwhere i denotes the i-th level and j refers to the index of the corresponding image in I. These multi-level features have been demonstrated to be effective in capturing fine-grained local semantics by recent works [6 , 28 , 110].\nFigure 4. Simulation of visual anomaly inspection by humans.\nThe large-scale pre-trained CLIP models align the projection spaces of the textual and visual encoders. Therefore, the encoded image features already contain the class information required by ZSAD. To avoid human involvement in object classification and reduce the model complexity, we remove the heavy text encoder commonly utilized in existing works and let the visual model itself parse the information for suspicious classes or objects. Specifically, the output visual features for the original image v^o_0 are leveraged to provide the global description of the target object or regions in the look-back path. 
With the multi-level features and the global embeddings, the LTFM module is responsible for the recognition and localization of suspicious tokens.\nDrawing inspiration from human visual inspection, where suspicious objects or regions are identified and then inspected closely (see Figure 4), we design the VT selector module for aggregating (zooming in) the crucial visual tokens and explicitly assisting the LLM in distinguishing these tokens from many irrelevant ones when dealing with instructions regarding anomaly detection and reasoning. Additionally, the original visual features are preserved to maintain the generalization capability of the base model on regular instructions, such as “Can you describe the content of the image?”\n3.3. Look-Twice Feature Matching # Given the global object information v^o_0 provided by the look-back path, we generate the class-aware abnormality description by merging v^o_0 with two learnable embeddings: e^+ ∈ R^D and e^− ∈ R^D, where + and − indicate positive (anomalous) and negative (normal) patterns and D is the embedding dimension. Specifically, a linear layer T^o_i is applied along the token dimension to select and fuse useful tokens from v^o_0, and then the fused vector is concatenated with e^+ and e^− independently and passed through two MLPs {G^+_i, G^−_i} to generate the abnormality and normality descriptions {d^+_i, d^−_i}, which can be represented by:\nThe visual features extracted from different levels of the ViT focus on different scales of semantics. 
Thus, the parameters of T^o_i and {G^+_i, G^−_i} should be independent for different levels, where i indicates the level number.\nSimilar to the zero-shot classification mechanism of CLIP models, we calculate the probability of each patch token in v^i_j belonging to the anomalous patterns by combining cosine similarity and softmax operations:\nwhere m^i_j represents the significance map for visual tokens, τ is the temperature hyperparameter, and \u0026lt;, \u0026gt; refers to the cosine similarity operator. The patch weight in m^i_j indicates the closeness of the corresponding visual token to the anomalous pattern. Then, all the maps are averaged to capture the token significances from low to high levels:\nThe visual features are leveraged twice, in the forward and look-back paths, so this module is named Look-Twice Feature Matching (LTFM), following the nature of two-step human visual inspection shown in Figure 4.\n3.4. Visual Token Selector # Under the image cropping strategy widely applied in recent MLLMs, there will be a large number of visual tokens for a high-resolution image, e.g., 7290 tokens for an image with 1152×1152 resolution in LLaVA-OneVision. While these tokens provide rich visual details, the LLM is required to pick the most useful information when adapting to a specific task. When the LLM lacks enough knowledge in this domain, the token-picking process will become complicated. Thus, our solution is to introduce a specialist or expert that knows which tokens are crucial and assists the LLM in selecting and emphasizing (zooming in) the crucial tokens.\nGiven the encoded visual tokens {v^o_j} for each image crop in I and the corresponding significance map m_j, the suspicious tokens are emphasized by direct multiplication of the two tensors. Then, the normal tokens will be scaled to zero while the anomalous tokens will be maintained. After that, spatial average pooling P is applied to reduce the number of tokens. 
This process can be written as:\nwhere q_j ∈ R^{h×w×D} refers to the pooled query tokens. Empirically, setting h = w = 2 provides a better trade-off than other options. Then, a Q-Former Q [49] is leveraged to aggregate the correlated tokens in the original output by forwarding q_j as the query and v^o_j as the key and value:\nThe Visual Token Selector (VT Selector) serves as a tool for the anomaly expert to hand-pick visual tokens that contain the most suspicious semantics for a given image.\nFigure 5. Composition of the instruction data in Anomaly-Instruct-125k. There are four main types of image samples: in-the-wild, industrial, medical, and 3D (in the format of multi-view images), covering most image anomaly detection tasks and enabling the possibility of a unified assistant for visual inspection. The reasoning words are highlighted in blue. For more information about dataset establishment, statistics, and the data collection pipeline, please refer to Section A1 in the supplementary.\n3.5. Inference and Loss # Anomaly Prediction. In the traditional anomaly detection task, the model predicts the probability of the image being abnormal. To achieve anomaly score prediction, we aggregate the anomaly information from all the image crops by an average operation weighted on the significance maps:\nwhere P is the same spatial pooling as in the VT Selector and r(I) is a vector containing the global anomaly information for the entire image. Then, the anomaly expert can calculate the image-level abnormal probability by parsing r(I):\nwhere G^o is an MLP for distinguishing normal and abnormal semantics. To handle the unbalanced sample distribution, we employ the balanced BCE loss as the professional training objective for the anomaly expert components.\nText Generation. 
Instead of directly forwarding the concatenation of the original {v^o_j} and the selected {r(I), v^s_j} visual tokens into the LLM, we apply an indication prompt with \u0026lt;adv\u0026gt; suspicious feature: in the middle of the two series of tokens, which will highlight the selected tokens for the LLM when handling anomaly-related instructions. This approach can be considered a form of prompt engineering in MLLMs. Besides, the \u0026lt;adv\u0026gt; is chosen from {highly, moderately, slightly} and is determined by score(I) and predefined thresholds {s_low, s_high}. When the input image I has a high likelihood of anomaly, the LLM will place greater emphasis on the selected tokens; otherwise, these tokens will have less significance. The text generation is implemented by the original auto-regressive token prediction mechanism of the LLM:\nwhere X_{a,\u0026lt;t} and X_{q,\u0026lt;t} are the answer and instruction tokens from all prior turns before the current prediction token x_t for a sequence of length L. The entire model is parameterized by θ and trained by the original language model cross-entropy loss for each predicted answer token x_t.\n4. Dataset and Benchmark # The lack of multimodal instruction-following data for image anomaly detection and reasoning hinders the development of specialized intelligent assistants in this domain. Even though AnomalyGPT [28] introduces a prompt tuning dataset by simulating the anomalies, the scale of their dataset and the diversity of their instructions and answers are limited, only focusing on anomaly localization. To resolve the data scarcity issue, we establish the first large-scale instruction tuning dataset, Anomaly-Instruct-125k, and the corresponding anomaly detection and reasoning benchmark, VisA-D\u0026amp;R.\n4.1. Anomaly-Instruct-125k # LLaVA [57] builds its instruction tuning dataset by leveraging the image captions and bounding boxes available in the COCO dataset [55] to prompt the text-only GPT-4. 
ShareGPT4V [8] provides a higher-quality dataset by directly prompting GPT-4V [71]. However, existing anomaly detection datasets [1 , 2] provide no image captions, and neither GPT-4V [71] nor the more recent GPT-4o [72] can accurately locate and describe the anomalies in an image without explicit human involvement.\nTo resolve these issues, we design a new prompt pipeline for accurate anomaly description generation. Since most of the datasets contain annotations for anomaly types, we manually combine the class name and anomaly type, such as a [capsule] with [poke] on surface. If the anomaly masks are provided, we draw bounding boxes on the images to highlight the anomalous area. The short description and the image with (or w/o) bounding boxes are used to prompt GPT-4o to generate the detailed image and anomaly descriptions. Then, we employ an in-context learning strategy similar to LLaVA to create the instructions.\nFigure 6. Prompt examples in VisA-D\u0026amp;R for detection and reasoning. The complex reasoning instructions are highlighted.\nFor a unified visual inspection dataset, precise instruction data is collected from MVTec AD [2], the training set of BMAD [1], Anomaly-ShapeNet [50], Real3D-AD [60], and MVTec-3D AD [4], covering both 2D and 3D data across industrial and medical domains. The 3D point cloud data are converted into 9 multi-view images, and the corresponding masks are rendered using predefined camera positions. However, the diversity and scale of these datasets are relatively limited, probably due to the difficulty of collecting anomaly images. To scale up the instruction data, we introduce an automatic anomaly data collection pipeline combining GPT-4o [72] and Google Image Search [26] for image collection, data cleaning, and instruction generation. Finally, 72k in-the-wild images (named WebAD) targeting anomaly detection are collected, significantly enriching our instruction dataset. 
Several samples from Anomaly-Instruct-125k are shown in Figure 5. The instructions are mainly in the format of multi-round conversations, covering anomaly detection and description for low-level reasoning, and potential causes and future suggestions for complex understanding.\n4.2. VisA-D\u0026amp;R # VisA [115] is a classic but challenging industrial anomaly detection dataset, providing fine-grained anomaly type and segmentation for each image. To evaluate the anomaly detection and reasoning performance of existing and future methods, we select 10 classes from VisA and follow a data generation pipeline similar to that of Anomaly-Instruct-125k to create the benchmark. Unlike that pipeline, significant human effort has been invested in meticulously reviewing all generated images and anomaly descriptions. Wrong descriptions are picked out and re-annotated by humans before utilizing them for Q\u0026amp;A generation. In total, the benchmark consists of 761 normal samples and 1000 anomalous ones.\nFor evaluating detection performance, questions designed to elicit a one-word answer are used to prompt the MLLMs (Figure 6), with results quantified using Accuracy, Precision, Recall, and F1-score. We divide the reasoning performance into two parts: low-level reasoning, which focuses on the description of visual defects or anomalies, and complex reasoning, which requires the MLLMs to provide the potential cause and future improvement strategies for the detected anomalies. ROUGE-L [54], Sentence-BERT (SBERT) [75], and GPT-Score (GPT-4 as the judge [57]) are utilized to quantify the similarity between generated text and ground truth. Note that low-level reasoning is highly correlated with detection performance, while the anomaly-type descriptions from low-level reasoning determine the output of complex reasoning.\n5. Experiments # 5.1. Training \u0026amp; Evaluation # There are two independent training stages for Anomaly-OV. 
In Stage 1, the components of the anomaly expert are trained to obtain the token selection capability, targeting traditional ZSAD. This stage utilizes all of the data with anomaly labels in Anomaly-Instruct-125k. Similar to previous works [6 , 110], when evaluating the model on the datasets contained in the training set, the corresponding datasets are replaced by VisA [115]. In Stage 2, the anomaly expert and visual encoder are frozen, while the projector and LLM are trainable. In addition to our instruction dataset, we sample around 350k data from the original training recipe of LLaVA-OneVision to maintain the generalization ability. For more details on training, please refer to the supplementary.\nThe ZSAD performance for the anomaly expert is evaluated on nine benchmarks, including MVTec AD [2], VisA [115], AITEX [81], ELPV [16], BTAD [68], and MPDD [39] for industrial inspection, and BrainMRI [40], HeadCT [41], and Br35H [30] for medical diagnosis. AUROC (Area Under the Receiver Operating Characteristic Curve) is leveraged to quantify the image-level AD performance. For text-based anomaly detection, both normal and anomaly data are employed to assess the accuracy by examining if the generated text contains the word Yes. In contrast, only anomaly data are utilized to prompt the MLLMs to determine their anomaly reasoning capabilities, since the justifications of normality are similar for different models.\nTable 1. Ablation study for the anomaly expert of Anomaly-OV. w/o. Look-back refers to the removal of v^o_0 in LTFM.\n| Method | MVTec | VisA | HeadCT | BrainMRI |\n| Full Model | 94.0 | 91.1 | 97.6 | 93.9 |\n| w/o. Look-back | 92.8 | 90.5 | 96.6 | 93.5 |\n| w/o. e+ \u0026amp; e− | 92.1 | 90.1 | 94.7 | 92.9 |\n| w/o. Q-Former | 91.7 | 89.9 | 92.8 | 95.1 |\n| w/o. WebAD | 88.5 | 88.9 | 91.2 | 93.4 |\n5.2. 
Zero-Shot Anomaly Detection # As shown in Table 2, compared with existing methods, the anomaly expert of Anomaly-OV achieves significant image-level AUROC improvements on most of the ZSAD benchmarks, which demonstrates that the text encoder widely applied in existing models is not necessary. The success of our model mainly originates from the extra data of WebAD (Table 1), which enables the model to learn more generic semantics for normality and abnormality from the data distribution in the absence of the text encoder. This observation also reveals that large-scale in-the-wild online data can benefit zero-shot performance in anomaly detection.\nTable 2. Quantitative comparison of image-level AUROC on different ZSAD methods (some of the results are borrowed from [6 , 110 , 114]). The first six datasets cover industrial defects and the next three cover medical anomalies. The best and the second-best results are bolded and underlined, respectively. Please refer to the supplementary for more detailed results.\n| Model | MVTec AD | VisA | AITEX | ELPV | BTAD | MPDD | BrainMRI | HeadCT | Br35H | Average |\n| CLIP [73] | 74.1 | 66.4 | 71.0 | 59.2 | 34.5 | 54.3 | 73.9 | 56.5 | 78.4 | 63.1 |\n| CoOp [108] | 88.8 | 62.8 | 66.2 | 73.0 | 66.8 | 55.1 | 61.3 | 78.4 | 86.0 | 70.9 |\n| WinCLIP [38] | 91.8 | 78.8 | 73.0 | 74.0 | 68.2 | 63.6 | 92.6 | 90.0 | 80.5 | 79.2 |\n| APRIL-GAN [11] | 86.2 | 78.0 | 57.6 | 65.5 | 73.6 | 73.0 | 89.3 | 89.1 | 93.1 | 78.4 |\n| AnoVL [19] | 92.5 | 79.2 | 72.5 | 70.6 | 80.3 | 68.9 | 88.7 | 81.6 | 88.4 | 80.3 |\n| AnomalyCLIP [110] | 91.5 | 82.1 | 62.2 | 81.5 | 88.3 | 77.0 | 90.3 | 93.4 | 94.6 | 84.5 |\n| AdaCLIP [6] | 89.2 | 85.8 | 64.5 | 79.7 | 88.6 | 76.0 | 94.8 | 91.4 | 97.7 | 85.3 |\n| Ours | 94.0 | 91.1 | 72.0 | 83.0 | 89.0 | 81.7 | 93.9 | 97.6 | 95.5 | 88.6 |\nWhile the Q-Former reduces the model performance on BrainMRI, it shows effectiveness on most benchmarks, indicating the importance of token aggregation. 
Similarly, the look-back information and two learnable embeddings are required for describing class-aware abnormality and distinguishing positive and negative features, respectively. As previously discussed, the anomaly expert is responsible for selecting suspicious visual tokens for the LLM, and the significance maps shown in Figure 7 demonstrate the interpretable token selection mechanism. The high intensities are automatically distributed around the anomalous areas even without any supervision of the anomaly masks.\nFigure 7. Visualization of the significance map on VisA samples.\n5.3. Anomaly Detection \u0026amp; Reasoning # With the strong capabilities of the anomaly expert for zero-shot detection and suspicious token selection, Anomaly-OV accomplishes significant improvement in text-based anomaly detection and reasoning over other open-sourced generalist MLLMs, as shown in Table 4. Here are a few observations: i) While a larger language model cannot guarantee better detection performance, it always provides greater reasoning ability; ii) Most of the existing MLLMs present much lower recall than precision, indicating their insensitivity to visual anomalies; iii) GPT-4o shows stronger reasoning ability compared to other open-sourced models.\nTable 3. Anomaly-OV presents more accurate anomaly detection.\nUser: Are there any anomalies visible in the image?\nGPT-4o: The image shows an HC-SR04 ultrasonic sensor. There don’t appear to be any visible anomalies with the sensor itself. The components and pins seem intact and properly aligned.\nLLaVA-OV: The image shows an ultrasonic sensor module, specifically the HC-SR04 model. There are no visible anomalies in the image; it appears to be a standard representation of this type of sensor module.\nOurs: Yes, there is a noticeable anomaly in the image: a short circuit on the PCB, indicated by a white, crusty residue near the ‘Echo’ pin.\n
Table 3 and Table 5 provide the qualitative comparison of our Anomaly-OV with its base model LLaVA-OV-7B [44] and GPT-4o [72]. Both GPT-4o and LLaVA-OV show insensitivity to anomalous features and cannot accurately detect the anomaly in the image. Sometimes, GPT-4o knows the image is anomalous but fails to describe the anomalies precisely.\nWe provide the fine-tuned version of the base model LLaVA-OV-0.5B on Anomaly-Instruct-125k, which presents much higher accuracy and more balanced precision and recall than its original version. This demonstrates the effectiveness of our instruction-tuning dataset. By integrating the anomaly expert with the base model, our Anomaly-OV-0.5B achieves 0.08 accuracy and 0.06 F1-score improvements in text-based anomaly detection and better reasoning capability in low-level and complex settings. Equipped with a larger language model, Anomaly-OV-7B provides the best detection performance among all the existing MLLMs and shows comparable reasoning ability with GPT-4o. Notably, we observe that the anomaly expert restricts the detection performance of Anomaly-OV. Therefore, the design of a stronger anomaly expert is suggested for future works.\nTable 4. Quantitative comparison of text-based anomaly detection and reasoning for MLLMs. Accuracy, Precision, Recall, and F1-score measure anomaly detection; ROUGE-L, SBERT, and GPT-Score measure low-level reasoning; the final SBERT and GPT-Score columns measure complex reasoning. Notably, the Accuracy and F1-score for the anomaly expert of Anomaly-OV can be calculated as {0.78, 0.77} with threshold 0.5. * indicates the model is fine-tuned on our dataset.\n| Model | Accuracy | Precision | Recall | F1-score | ROUGE-L (low-level) | SBERT (low-level) | GPT-Score (low-level) | SBERT (complex) | GPT-Score (complex) |\n| GPT-4V [71] | 0.68 | 0.90 | 0.49 | 0.55 | 0.16 | 0.65 | 3.31 | 0.77 | 5.64 |\n| GPT-4o [72] | 0.70 | 0.83 | 0.71 | 0.68 | 0.24 | 0.71 | 4.84 | 0.81 | 6.89 |\n| Qwen2-VL-2B [87] | 0.65 | 0.87 | 0.55 | 0.59 | 0.22 | 0.55 | 1.94 | 0.74 | 4.26 |\n| Qwen2-VL-7B [87] | 0.76 | 0.91 | 0.69 | 0.75 | 0.25 | 0.61 | 3.09 | 0.68 | 4.62 |\n| InternVL-2-8B [13] | 0.74 | 0.78 | 0.81 | 0.76 | 0.23 | 0.73 | 3.69 | 0.80 | 5.08 |\n| InternVL-2-26B [13] | 0.73 | 0.86 | 0.66 | 0.68 | 0.21 | 0.74 | 4.13 | 0.80 | 5.49 |\n| IXC-2.5-7B [101] | 0.72 | 0.88 | 0.63 | 0.67 | 0.21 | 0.58 | 2.45 | 0.77 | 5.14 |\n| LLaVA-OV-0.5B [44] | 0.54 | 0.70 | 0.19 | 0.28 | 0.20 | 0.63 | 2.54 | 0.81 | 4.34 |\n| LLaVA-OV-7B [44] | 0.71 | 0.95 | 0.56 | 0.63 | 0.24 | 0.66 | 3.57 | 0.79 | 5.44 |\n| LLaVA-OV-0.5B* | 0.71 | 0.77 | 0.84 | 0.76 | 0.31 | 0.70 | 3.69 | 0.82 | 5.31 |\n| Anomaly-OV-0.5B | 0.79 | 0.86 | 0.83 | 0.82 | 0.33 | 0.72 | 3.87 | 0.83 | 5.67 |\n| Anomaly-OV-7B | 0.79 | 0.83 | 0.86 | 0.83 | 0.34 | 0.73 | 4.26 | 0.84 | 6.34 |\nTable 5. Anomaly-OV presents more precise anomaly reasoning.\nMacaroni Example: Yellowish Spot\nGPT-4o: The image shows four pieces of elbow macaroni on a green background. The anomaly is that the macaroni pieces are not whole; they are cut in half.\nLLaVA-OV: The image shows four pieces of pasta, specifically macaroni shells, arranged on a green textured surface. The pasta appears to be uniformly colored and shaped with no visible defects or anomalies.\nOurs: Yes, there is an anomaly in the image. The bottom right pasta piece has a noticeable yellowish discoloration on its surface.\n5.4. Extension # With the generalization and multi-image processing capabilities of MLLMs, it is possible to build a unified assistant for visual inspection. 
Table 6 demonstrates the comprehensive knowledge of Anomaly-OV (without using AnomalyShapeNet [50] for training) on 3D and medical (testing set of BMAD [1]) AD \u0026amp; reasoning. More data, benchmarks, and investigation on a unified model are meaningful.\nTable 6. Extension to 3D and medical AD \u0026amp; reasoning.\n6. Conclusion # In this paper, we establish the first large-scale visual instruction tuning dataset, Anomaly-Instruct-125k, and the corresponding benchmark, VisA-D\u0026amp;R, to address the data scarcity issue for visual anomaly detection and reasoning. Then, a specialist MLLM, Anomaly-OV, targeting visual inspection is introduced to serve as the baseline in this domain. Anomaly-OV leverages an anomaly expert to assist the LLM with suspicious visual token selection and presents significant improvements on both traditional ZSAD and text-based anomaly detection and reasoning tasks over existing methods. Extension to 3D and medical domains is demonstrated.\nReferences # [1] Jinan Bao, Hanshi Sun, Hanqiu Deng, Yinsheng He, Zhaoxiang Zhang, and Xingyu Li. Bmad: Benchmarks for medical anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4042–4053, 2024. 1 , 5 , 6 , 8 , 14 [2] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9592–9600, 2019. 1 , 5 , 6 , 14 , 15 [3] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4183–4192, 2020. 1 [4] Paul Bergmann, Xin Jin, David Sattlegger, and Carsten Steger. The mvtec 3d-ad dataset for unsupervised 3d anomaly detection and localization. 
In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications. SCITEPRESS - Science and Technology Publications, 2022. 6 [5] Tri Cao, Jiawen Zhu, and Guansong Pang. Anomaly detection under distribution shift. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 6511–6523, 2023. 1 [6] Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, and Giacomo Boracchi. Adaclip: Adapting clip with hybrid learnable prompts for zero-shot anomaly detection. In European Conference on Computer Vision , pages 55–72. Springer, 2025. 1 , 2 , 3 , 6 , 7 [7] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024. 1 [8] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions, 2023. 5 , 14 [9] Qiyu Chen, Huiyuan Luo, Chengkan Lv, and Zhengtao Zhang. A unified anomaly synthesis strategy with gradient ascent for industrial anomaly detection and localization. arXiv preprint arXiv:2407.09359, 2024. 1 [10] Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. Clip2scene: Towards label-efficient 3d scene understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 7020–7030, 2023. 2 [11] Xuhai Chen, Yue Han, and Jiangning Zhang. April-gan: A zero-/few-shot anomaly classification and segmentation method for cvpr 2023 vand workshop challenge tracks 1\u0026amp;2: 1st place on zero-shot ad and 4th place on few-shot ad, 2023. 2 , 7 [12] Yuanhong Chen, Yu Tian, Guansong Pang, and Gustavo Carneiro. 
Deep one-class classification via interpolated gaussian descriptor. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 383–392, 2022. 1 [13] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, and Wenhai Wang. How far are we to gpt-4v? closing the gap to commercial multimodal models with opensource suites, 2024. 8 [14] Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander Hauptmann. Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning. In Conference on Neural Information Processing Systems , 2024. 1 , 2 [15] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirtyseventh Conference on Neural Information Processing Systems, 2023. 1 , 2 [16] Sergiu Deitsch, Vincent Christlein, Stephan Berger, Claudia Buerhop-Lutz, Andreas Maier, Florian Gallwitz, and Christian Riess. Automatic classification of defective photovoltaic module cells in electroluminescence images. Solar Energy , 185:455–468, 2019. 6 [17] Chenghao Deng, Haote Xu, Xiaolu Chen, Haodi Xu, Xiaotong Tu, Xinghao Ding, and Yue Huang. Simclip: Refining image-text alignment with simple prompts for zero-/fewshot anomaly detection. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 1761–1770, 2024. 1 [18] Hanqiu Deng and Xingyu Li. Anomaly detection via reverse distillation from one-class embedding. In IEEE/CVF conference on computer vision and pattern recognition, 2022. 
2 [19] Hanqiu Deng, Zhaoxiang Zhang, Jinan Bao, and Xingyu Li. Anovl: Adapting vision-language models for unified zero-shot anomaly localization. arXiv preprint arXiv:2308.15939, 2023. 1 , 2 , 7 [20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. 2 [21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. 3 [22] Zheng Fang, Xiaoyang Wang, Haocheng Li, Jiejie Liu, Qiugui Hu, and Jimin Xiao. Fastrecon: Few-shot industrial anomaly detection via fast feature reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17481–17490, 2023. 2\n[23] Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, and Michael J Black. Chatpose: Chatting about 3d human pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2093–2103, 2024. 3 [24] Tharindu Fernando, Harshala Gammulle, Simon Denman, Sridha Sridharan, and Clinton Fookes. Deep learning for medical anomaly detection–a survey. ACM Computing Surveys (CSUR), 54(7):1–37, 2021. 1 [25] Matic Fučka, Vitjan Zavrtanik, and Danijel Skočaj. Transfusion–a transparency-based diffusion model for anomaly detection. In European conference on computer vision, pages 91–108. Springer, 2025. 1 [26] Google. Google-images-search 1.4.7, 2024. https://pypi.org/project/Google-Images-Search . 6 , 14 [27] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Hao Li, Ming Tang, and Jinqiao Wang. Filo: Zero-shot anomaly detection by fine-grained description and high-quality localization. 
In Proceedings of the 32nd ACM International Conference on Multimedia, pages 2041–2049, 2024. 1 [28] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anomalygpt: Detecting industrial anomalies using large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1932–1940, 2024. 1 , 2 , 3 , 5 [29] Yuxiang Guo, Faizan Siddiqui, Yang Zhao, Rama Chellappa, and Shao-Yuan Lo. Stimuvar: Spatiotemporal stimuli-aware video affective reasoning with multimodal large language models. arXiv preprint arXiv:2409.00304, 2024. 1 , 2 [30] Ahmed Hamada. Br35h: Brain tumor detection 2020, 2020. 6 [31] Changhee Han, Leonardo Rundo, Kohei Murao, Tomoyuki Noguchi, Yuki Shimahara, Zoltán Ádám Milacski, Saori Koshino, Evis Sala, Hideki Nakayama, and Shin\u0026rsquo;ichi Satoh. Madgan: Unsupervised medical anomaly detection gan using multiple adjacent brain mri slice reconstruction. BMC bioinformatics, 22:1–20, 2021. 1 [32] Liren He, Zhengkai Jiang, Jinlong Peng, Liang Liu, Qiangang Du, Xiaobin Hu, Wenbing Zhu, Mingmin Chi, Yabiao Wang, and Chengjie Wang. Learning unified reference representation for unsupervised multi-class anomaly detection. arXiv preprint arXiv:2403.11561, 2024. 1 [33] Chih-Hui Ho, Kuan-Chuan Peng, and Nuno Vasconcelos. Long-tailed anomaly detection with learnable class names. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12435–12446, 2024. [34] Jinlei Hou, Yingying Zhang, Qiaoyong Zhong, Di Xie, Shiliang Pu, and Hong Zhou. Divide-and-assemble: Learning block-wise memory for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8791–8800, 2021. 1 [35] Chaoqin Huang, Haoyan Guan, Aofan Jiang, Ya Zhang, Michael Spratling, and Yan-Feng Wang. Registration based few-shot anomaly detection. In European Conference on Computer Vision, pages 303–319. Springer, 2022. 
1 , 2 [36] Chaoqin Huang, Aofan Jiang, Jinghao Feng, Ya Zhang, Xinchao Wang, and Yanfeng Wang. Adapting visual-language models for generalizable anomaly detection in medical images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11375–11385, 2024. 1 , 2\n[37] Brian KS Isaac-Medina, Yona Falinie A Gaus, Neelanjan Bhowmik, and Toby P Breckon. Towards open-world objectbased anomaly detection via self-supervised outlier synthesis. In European Conference on Computer Vision (ECCV) , 2024. 1\n[38] Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. Winclip: Zero/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023. 1 , 2 , 7\n[39] Stepan Jezek, Martin Jonak, Radim Burget, Pavel Dvorak, and Milos Skotak. Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions. In 2021 13th International congress on ultra modern telecommunications and control systems and workshops (ICUMT), pages 66–71. IEEE, 2021. 6\n[40] Pranita Balaji Kanade and PP Gumaste. Brain tumor detection using mri images. Brain, 3(2):146–150, 2015. 6\n[41] Felipe Campos Kitamura. Head ct - hemorrhage, 2018. 6\n[42] Mingyu Lee and Jongwon Choi. Text-guided variational image generation for industrial anomaly detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26519– 26528, 2024. 1\n[43] Aodong Li, Chen Qiu, Marius Kloft, Padhraic Smyth, Maja Rudolph, and Stephan Mandt. Zero-shot anomaly detection via batch normalization. Advances in Neural Information Processing Systems, 36, 2024. 1\n[44] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer, 2024. 
1 , 3 , 7 , 8 , 15\n[45] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. LLaVA-med: Training a large language-and-vision assistant for biomedicine in one day. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 1 , 15\n[46] Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. Cutpaste: Self-supervised learning for anomaly detection and localization. In IEEE/CVF conference on computer vision and pattern recognition, 2021. 2\n[47] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024. 1 , 3\n[48] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022. 1 , 2\n[49] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023. 2 , 4\n[50] Wenqiao Li, Xiaohao Xu, Yao Gu, Bozhong Zheng, Shenghua Gao, and Yingna Wu. Towards scalable 3d anomaly detection and localization: A benchmark via 3d anomaly synthesis and a self-supervised learning network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22207–22216, 2024. 6 , 8\n[51] Xiaofan Li, Zhizhong Zhang, Xin Tan, Chengwei Chen, Yanyun Qu, Yuan Xie, and Lizhuang Ma. Promptad: Learning prompts with only normal samples for few-shot anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16838–16848, 2024. 1 , 2\n[52] Yiting Li, Adam Goodge, Fayao Liu, and Chuan-Sheng Foo. 
Promptad: Zero-shot anomaly detection using text prompts. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1093–1102, 2024. 1\n[53] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023. 2\n[54] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004. 6\n[55] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 5\n[56] Yuqi Lin, Minghao Chen, Wenxiao Wang, Boxi Wu, Ke Li, Binbin Lin, Haifeng Liu, and Xiaofei He. Clip is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15305–15314, 2023. 2\n[57] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 1 , 2 , 3 , 5 , 6 , 16\n[58] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 1\n[59] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. 3\n[60] Jiaqi Liu, Guoyang Xie, ruitao chen, Xinpeng Li, Jinbao Wang, Yong Liu, Chengjie Wang, and Feng Zheng. Real3d-AD: A dataset of point cloud anomaly detection. 
In Thirtyseventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 6\n[61] Jie Liu, Yixiao Zhang, Jie-Neng Chen, Junfei Xiao, Yongyi Lu, Bennett A Landman, Yixuan Yuan, Alan Yuille, Yucheng Tang, and Zongwei Zhou. Clip-driven universal model for organ segmentation and tumor detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21152–21164, 2023. 2\n[62] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019. 2\n[63] Shao-Yuan Lo, Poojan Oza, and Vishal M Patel. Adversarially robust one-class novelty detection. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. 2\n[64] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017. 15\n[65] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. 15\n[66] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022. 2\n[67] Hui Lv and Qianru Sun. Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702, 2024. 1 , 2\n[68] Pankaj Mishra, Riccardo Verk, Daniele Fornasier, Claudio Piciarelli, and Gian Luca Foresti. Vt-adl: A vision transformer network for image anomaly detection and localization. In 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), pages 01–06. IEEE, 2021. 6\n[69] Shancong Mou, Xiaoyi Gu, Meng Cao, Haoping Bai, Ping Huang, Jiulong Shan, and Jianjun Shi. 
RGI: robust GAN-inversion for mask-free image inpainting and unsupervised pixel-wise anomaly detection. In The Eleventh International Conference on Learning Representations, 2023. 1 , 2\n[70] Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2025. 1\n[71] OpenAI. Gpt-4v(ision) system card, 2023. https://openai.com/index/gpt-4v-system-card . 5 , 8\n[72] OpenAI. Gpt-4o system card, 2024. https://openai.com/index/gpt-4o-system-card . 2 , 5 , 6 , 7 , 8 , 14 , 15\n[73] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2 , 3 , 7 , 14\n[74] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. 2\n[75] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks, 2019. 6\n[76] Tal Reiss and Yedid Hoshen. Mean-shifted contrastive loss for anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2155–2162, 2023. 1\n[77] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14318–14328, 2022. 1 , 2\n[78] Fumiaki Sato, Ryo Hachiuma, and Taiki Sekii. Prompt-guided zero-shot anomaly action recognition using pretrained deep skeleton features. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6471–6480, 2023. 1\n[79] Eli Schwartz, Assaf Arbelle, Leonid Karlinsky, Sivan Harary, Florian Scheidegger, Sivan Doveh, and Raja Giryes. Maeday: Mae for few-and zero-shot anomaly-detection. Computer Vision and Image Understanding, 241:103958, 2024. 1\n[80] Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024. 1\n[81] Javier Silvestre-Blanes, Teresa Albero-Albero, Ignacio Miralles, Rubén Pérez-Llorens, and Jorge Moreno. A public fabric database for defect detection methods and results. Autex Research Journal, 19(4):363–374, 2019. 6\n[82] Luc PJ Sträter, Mohammadreza Salehi, Efstratios Gavves, Cees GM Snoek, and Yuki M Asano. Generalad: Anomaly detection across domains by attending to distorted features. arXiv preprint arXiv:2407.12427, 2024. 1\n[83] Haomiao Sun, Mingjie He, Tianheng Lian, Hu Han, and Shiguang Shan. Face-mllm: A large face perception model, 2024. 3\n[84] Jiaqi Tang, Hao Lu, Xiaogang Xu, Ruizheng Wu, Sixing Hu, Tong Zhang, Tsz Wa Cheng, Ming Ge, Ying-Cong Chen, and Fugee Tsung. An incremental unified framework for small defect inspection. In European Conference on Computer Vision, pages 307–324. Springer, 2025. 1\n[85] Tran Dinh Tien, Anh Tuan Nguyen, Nguyen Hoang Tran, Ta Duc Huy, Soan Duong, Chanh D Tr Nguyen, and Steven QH Truong. Revisiting reverse distillation for anomaly detection. In IEEE/CVF conference on computer vision and pattern recognition, 2023. 2\n[86] Hualiang Wang, Yi Li, Huifeng Yao, and Xiaomeng Li. Clipn for zero-shot ood detection: Teaching clip to say no. 
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1802–1812, 2023. 2\n[87] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model\u0026rsquo;s perception of the world at any resolution, 2024. 8\n[88] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Conference on Neural Information Processing Systems, 2022. 2\n[89] Qi Wei, Yinhao Ren, Rui Hou, Bibo Shi, Joseph Y Lo, and Lawrence Carin. Anomaly detection for medical images based on a one-class classification. In Medical Imaging 2018: Computer-Aided Diagnosis, pages 375–380. SPIE, 2018. 1\n[90] Julia Wolleb, Florentin Bieder, Robin Sandkühler, and Philippe C Cattin. Diffusion models for medical anomaly detection. In International Conference on Medical image computing and computer-assisted intervention, pages 35–45. Springer, 2022. 1\n[91] Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, and Ziwei Liu. Funqa: Towards surprising video comprehension. In European Conference on Computer Vision, pages 39–57. Springer, 2025. 1\n[92] Guoyang Xie, Jinbao Wang, Jiaqi Liu, Yaochu Jin, and Feng Zheng. Pushing the limits of fewshot anomaly detection in industry vision: Graphcore. In The Eleventh International Conference on Learning Representations, 2023. 1\n[93] Hongxia Xie, Chu-Jun Peng, Yu-Wen Tseng, Hung-Jen Chen, Chan-Feng Hsu, Hong-Han Shuai, and Wen-Huang Cheng. Emovit: Revolutionizing emotion insights with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26596–26605, 2024. 
3\n[94] Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. Follow the rules: reasoning for video anomaly detection with large language models. arXiv preprint arXiv:2407.10299, 2024. 1 , 2\n[95] Hang Yao, Ming Liu, Haolin Wang, Zhicun Yin, Zifei Yan, Xiaopeng Hong, and Wangmeng Zuo. Glad: Towards better reconstruction with global and local adaptive diffusion models for unsupervised anomaly detection. arXiv preprint arXiv:2406.07487, 2024. 1\n[96] Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, and Jingdong Wang. Dense connector for mllms, 2024. 17\n[97] Xincheng Yao, Ruoqi Li, Zefeng Qian, Lu Wang, and Chongyang Zhang. Hierarchical gaussian mixture normalizing flow modeling for unified anomaly detection. arXiv preprint arXiv:2403.13349, 2024. 1\n[98] Zhiyuan You, Lei Cui, Yujun Shen, Kai Yang, Xin Lu, Yu Zheng, and Xinyi Le. A unified model for multi-class anomaly detection. In Advances in Neural Information Processing Systems, 2022. 1\n[99] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In IEEE/CVF international conference on computer vision, 2021. 2\n[100] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023. 3\n[101] Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output, 2024. 
8\n[102] Xuan Zhang, Shiyu Li, Xi Li, Ping Huang, Jiulong Shan, and Ting Chen. Destseg: Segmentation guided denoising studentteacher for anomaly detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2\n[103] Ximiao Zhang, Min Xu, Dehui Qiu, Ruixin Yan, Ning Lang, and Xiuzhuang Zhou. Mediclip: Adapting clip for fewshot medical image anomaly detection. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pages 458–468. Springer, 2024. 1\n[104] Ximiao Zhang, Min Xu, and Xiuzhuang Zhou. Realnet: A feature selection network with realistic synthetic anomaly for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 16699–16708, 2024. 1\n[105] Zhuosheng Zhang, Aston Zhang, Mu Li, hai zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024. 2\n[106] He Zhao, Yuexiang Li, Nanjun He, Kai Ma, Leyuan Fang, Huiqi Li, and Yefeng Zheng. Anomaly detection for medical images using self-supervised and translation-consistent features. IEEE Transactions on Medical Imaging, 40(12): 3641–3651, 2021. 1\n[107] Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7641–7649, 2024. 1\n[108] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022. 7\n[109] Kaiwen Zhou, Kwonjoon Lee, Teruhisa Misu, and Xin Eric Wang. Vicor: Bridging visual understanding and commonsense reasoning with large language models. In Findings of the Association for Computational Linguistics, 202\u0026rsquo;. 1 , 2\n[110] Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen. 
AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection. In The Twelfth International Conference on Learning Representations, 2024. 1 , 2 , 3 , 6 , 7\n[111] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language\nunderstanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024. 1 , 2\n[112] Jiawen Zhu and Guansong Pang. Toward generalist anomaly detection via in-context residual learning with few-shot sample prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17826– 17836, 2024. 1 , 2 [113] Jiaqi Zhu, Shaofeng Cai, Fang Deng, Beng Chin Ooi, and Junran Wu. Do llms understand visual anomalies? uncovering llm\u0026rsquo;s capabilities in zero-shot anomaly detection. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 48–57, 2024. 1 [114] Jiawen Zhu, Yew-Soon Ong, Chunhua Shen, and Guansong Pang. Fine-grained abnormality prompt learning for zeroshot anomaly detection, 2024. 7 [115] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference self-supervised pretraining for anomaly detection and segmentation. In European Conference on Computer Vision, pages 392–408. Springer, 2022. 6 , 15 A1. Dataset Establishment # A1.1. How to highlight the anomaly? # Prompt Example: Bubble Inside the Capsule # User Do you observe any anomalies in the image?\nGPT-4o The image shows green gel capsules that appear uniform in size, shape, and color. Upon inspection, no obvious anomalies such as broken capsules, irregular shapes, or unexpected colors are visible.\nLet’s draw a bounding box around the anomaly!\nUser Do you observe any anomalies in the image?\nGPT-4o\nYes, there appears to be an anomaly in the image. One of the capsules, marked with a red box, has a noticeable defect. 
It seems to have a bubble or an air pocket inside, which is not observed in the other capsules.\nThis could indicate a manufacturing defect or inconsistency in the capsule filling process.\nIf there is no bounding box, let’s give a short hint to GPT-4o!\nUser\nDo you observe any anomalies in the image? (hint: a bubble in the capsule)\nGPT-4o The anomaly in the image is the presence of a visible air bubble inside one of the green gel capsules. While most capsules appear uniformly filled with liquid, the capsule located near the top center of the image has a small, clear air pocket within it. This could be a result of incomplete filling during the manufacturing process or an issue with the capsule sealing.\nTable 7. Comparison of the GPT-4o [72] outputs with and without visual and textual hints for the anomaly.\nAs shown in Table 7, recent advanced MLLMs like GPT-4o fail to detect the anomalies in the image, so building the instruction tuning dataset using previous methods [8] is impractical. However, we observe that when GPT-4o is given some \u0026ldquo;hints\u0026rdquo;, it presents impressive performance on anomaly reasoning or description. For example, a red bounding box drawn around the anomalous area enables GPT-4o to detect the tiny bubble inside the small capsule. This observation indicates that the anomaly information is already contained in the visual tokens, and that existing MLLMs fail because the language model cannot effectively pick out the related tokens. This is the major inspiration for our token-picking mechanism.\nMost of the existing AD datasets, such as MVTec AD [2], contain anomaly masks for anomaly localization. Therefore, we leverage these masks to generate the bounding boxes on the images. Specifically, the masks for an anomalous image are dilated and merged (if two masks are too close) before calculating the coordinates of the bounding boxes. 
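
The mask-to-box step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes binary masks stored as nested lists, a square dilation window, and 4-connectivity, and the names `dilate`, `boxes_from_masks`, and `radius` are hypothetical. Masks whose dilated regions touch end up in one connected region, which is one simple way to realise "merged if two masks are too close".

```python
from collections import deque


def dilate(mask, radius):
    """Binary dilation with a square (2*radius+1) window (assumed shape)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def boxes_from_masks(masks, radius=1):
    """Union the masks, dilate, and return one box per connected region.

    Boxes are (x_min, y_min, x_max, y_max) in inclusive pixel coordinates,
    listed in row-major order of their first pixel.
    """
    h, w = len(masks[0]), len(masks[0][0])
    union = [[int(any(m[y][x] for m in masks)) for x in range(w)]
             for y in range(h)]
    grown = dilate(union, radius)
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if grown[y][x] and not seen[y][x]:
                # Breadth-first search over 4-connected foreground pixels.
                queue = deque([(y, x)])
                seen[y][x] = True
                x0 = x1 = x
                y0 = y1 = y
                while queue:
                    cy, cx = queue.popleft()
                    x0, x1 = min(x0, cx), max(x1, cx)
                    y0, y1 = min(y0, cy), max(y1, cy)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and grown[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes
```

With `radius=0` every mask keeps its own box; a larger `radius` both pads the boxes and merges nearby masks, so the value would need tuning per dataset. The resulting coordinates could then be drawn onto the image as red rectangles to form the visual prompt.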
Similarly, the image with bounding boxes drawn on it will serve as the visual prompt for GPT-4o. We also tried many other ways to utilize the anomaly masks, such as highlighting the mask area with different colors, consecutively providing the image and mask, and converting the normalized coordinates of the bounding box into a text prompt. None of them guides GPT-4o toward the anomalous features as effectively as drawing bounding boxes on the image.\nA1.2. WebAD – The largest AD dataset # Existing industrial or medical anomaly detection datasets, such as MVTec AD [2] and BMAD [1], only contain a limited number of classes (\u0026lt; 20) and several different anomaly types for each class (most of the anomaly types are similar), because collecting such anomaly images requires extensive human effort. This limitation hinders the ZSAD model from learning a generic description of anomaly and normal patterns. Also, the MLLMs cannot obtain enough knowledge of visual anomaly descriptions for unseen anomaly types. Therefore, more diverse data is required for a robust ZSAD \u0026amp; reasoning model. Many recent dataset works collect and annotate online images to enrich existing datasets and demonstrate their effectiveness in the training of current data-hungry deep learning models.\nTo collect online images that can be utilized for anomaly detection, we design an automatic data collection pipeline by combining GPT-4o [72] and Google Image Search [26]. As shown in Figure 8, we first employ GPT-4o to list 400 class names commonly seen in daily life. Then, for each class, GPT-4o is asked to generate 10 corresponding anomalous and normal phrases based on the class name. The abnormality or normality descriptions indicated by these phrases are specifically suitable for the class name. These phrases will serve as the search prompts to query the image links in Google Image Search. 
However, the downloaded images are very \u0026ldquo;dirty\u0026rdquo; and contain many noisy samples and duplicates. For example, the collected anomaly set contains lots of normal images, and vice versa. A data-cleaning step is applied after the image collection.\nSince the duplicates mainly occur within a specific class, we extract the CLIP [73] features for all the images in the class and compare the cosine similarity of these features. If the similarity between two images is larger than 0.99, one of them is removed. To deal with the problematic grouping of anomaly and normal images, we combine the image and its corresponding search prompt and give them to GPT-4o for normal and anomaly classification. In the system prompt, we explicitly tell GPT-4o that the search prompt is just a hint and not always correct, and ask GPT-4o to determine the normality and abnormality by itself. This step will remove the images with incorrect labels and the artificial images, such as cartoons or art. Some samples in the collected WebAD dataset are shown in Figure 9. In total, WebAD contains around 72k images from 380 classes and more than 5 anomaly types for each class.\nFigure 8. Automatic data collection pipeline for WebAD. The entire pipeline is fully automatic at an affordable cost (API usage). Other advanced open-sourced MLLMs can be applied to replace GPT-4o for further reduction of cost.\nA1.3. Instruction Data Generation # For existing datasets, we manually combine the anomaly type and the class name to create the short anomaly prompt (hint). Then, the image with or without the bounding boxes and the corresponding short prompt are utilized to prompt GPT-4o for the generation of detailed descriptions of the image and the anomalies. These descriptions contain all the information required for instruction-following data. The in-context learning strategy is implemented to generate the multi-round conversation data (see Figure 10). 
Questions designed to elicit a one-word answer are utilized to balance the distribution of the normal and anomaly samples.\nA2. Training Details # In the professional training stage, we use AdamW [65] as the optimizer and CosineAnnealingWarmRestarts [64] as the learning rate scheduler. The initial learning rate is set to 1e-4, and the restart period is half an epoch. The anomaly expert is trained on 8 H100 GPUs for 2 epochs (2 hours), and the total batch size is 128. In the instruction tuning stage, we follow the default training setting of LLaVA-OneVision [44] (reducing the batch size to 128), and the total training times for the 0.5B and 7B models are 7 hours and 50 hours on 8 H100 GPUs, respectively. When sampling the instruction data from the original recipe of LLaVA-OneVision, we put more emphasis on low-level image understanding and 3D multi-view Q\u0026amp;A, considering that anomaly detection originates from low-level feature differences and that 3D anomaly detection requires multi-image understanding. Besides, for more knowledge in the medical domain, the model is also fed with the data from LLaVA-Med [45].\nA3. Experimental Results # A3.1. Anomaly Detection # Similar to previous ZSAD works, the detailed image-level AUROC results for the anomaly expert of Anomaly-OV on VisA [115] and MVTec AD [2] are provided in Table 8.\nA3.2. Anomaly Reasoning # Tables 9 to 13 present more comparison results of GPT-4o [72], LLaVA-OneVision [44], and Anomaly-OV on AD \u0026amp; reasoning. Anomaly-OV shows better performance in the detection and description of the visual anomalies in the images. Table 14 demonstrates the low-level and complex reasoning capability of Anomaly-OV for an in-the-wild image, indicating a comprehensive understanding of the anomaly.\nA4. Limitation and Future Work # Limitation. 
As shown in Table 15, Anomaly-OV sometimes fails to classify the target object accurately, describes the anomaly with an overly general word (missing wax is described as \u0026ldquo;crack\u0026rdquo;), or presents wrong reasoning with hallucination. Also, there is still considerable room for improvement in the detection performance of Anomaly-OV. Besides, the images contained in VisA-D\u0026amp;R are from the industrial domain, so more benchmarks in other domains, such as 3D and medical anomaly detection, are required to evaluate a unified AD \u0026amp; reasoning model.\nFuture Work. The detection performance of Anomaly-OV is largely determined by the anomaly expert (see Table 4), so a more advanced design of the expert model is recommended in future research. One can change the base model to other open-sourced MLLMs to resolve the wrong classification issue. Also, we found that the diversity of anomaly types is very limited in existing industrial anomaly datasets (mainly \u0026lsquo;crack\u0026rsquo; or \u0026lsquo;broken\u0026rsquo;), causing the assistant to fail to provide fine-grained anomaly reasoning or description for unseen anomaly features. Therefore, a more diverse industrial anomaly detection dataset is urgently required. Similar to other traditional MLLMs, Anomaly-OV only utilizes the output visual tokens from the last layer of the visual encoder as the input to the LLM. However, anomaly detection is highly dependent on low-level visual clues. Hence, forwarding multi-level features from different layers to the LLM (as in the recent paper \u0026ldquo;Dense Connector for MLLMs\u0026rdquo; [96]) is a possible way to improve performance.\nFigure 9. Overview of the gallery for in-the-wild image samples in WebAD. The images on the left side are anomalous, while those on the right side are normal. The links to download these images will be released to avoid copyright issues.\nFigure 10. Prompt template for generating multi-round conversation in Anomaly-Instruct-125k (modified from the template of LLaVA [57]).\nTable 8. 
Per-class image-level AUROC of the anomaly expert of Anomaly-OV on VisA and MVTec AD.\nVisA (class AUROC): capsules 98.6, fryum 93.4, cashew 93.9, macaroni1 91.1, macaroni2 70.3, candle 96.5, pipe fryum 99.0, chewinggum 96.4, pcb1 90.3, pcb2 87.7, pcb3 86.6, pcb4 90.0.\nMVTec AD (class AUROC): bottle 95.6, cable 88.8, grid 99.2, capsule 92.2, tile 95.5, pill 92.3, screw 98.0, transistor 88.0, hazelnut 91.2, carpet 98.9, metal nut 77.7, wood 100.0, zipper 99.6, leather 100.0, toothbrush 94.2.\nPCB Example: Bent Transparent LED # Table 9. Additional results on VisA-D\u0026amp;R (PCB).\nUser Is there an anomaly in the image?\nGPT-4o The image shows an infrared sensor module, which appears to be intact and properly assembled. There doesn’t seem to be any visible anomaly or damage to the components.\nLLaVA-OV The image shows an electronic circuit board with various components, including a red LED that is not illuminated. There doesn’t appear to be anything obviously anomalous in the image.\nOurs Yes, the anomaly in the image is the bent LED on the circuit board, which is not aligned properly with the board’s surface.\nCandle Example: Pink Discoloration # User Is there an anomaly in the image I should know about?\nGPT-4o Yes, there is an anomaly in the image. The candle in the top left corner has its wick facing the opposite direction compared to the other three candles.\nLLaVA-OV Based on the image, there is no apparent anomaly among the candles. They are all similar in size and shape, with a consistent hole in the center for the wick. The coloration appears uniform across the candles, and there are no visible signs of melting or irregularities that would suggest an anomaly.\nOurs Yes, the anomaly is the presence of a small red spot on the bottom right candle, which contrasts with the otherwise uniform white color.\nTable 10. 
Additional results on VisA-D\u0026amp;R (Candle).\nCapsule Example: Dark Spot # User Is there an anomaly present in the image?\nGPT-4o Yes, there is an anomaly in the image. One of the capsules is a different color; it is brown instead of green.\nLLaVA-OV Based on the image alone, there is no clear indication of an anomaly. The pills appear to be uniformly shaped and colored, with no visible defects or irregularities. However, without additional context or a point of reference for size, it’s difficult to determine if there are any anomalies that might not be immediately apparent from the image.\nOurs Yes, there is an anomaly present in the image. One of the capsules has a darker spot on its surface, which contrasts with the otherwise uniform green color.\nTable 11. Additional results on VisA-D\u0026amp;R (Capsules).\nFryum Example: Break Edge # User Is there any anomaly in the image?\nGPT-4o The image shows a wheel-shaped snack on a green surface. There doesn\u0026rsquo;t appear to be any obvious anomaly.\nLLaVA-OV The image shows a circular object with a pattern that resembles a wheel or a gear, placed on a green textured surface. There doesn\u0026rsquo;t appear to be any obvious anomaly in the image; it seems to be a simple photograph of an object against a background. If there is something specific you are referring to as an anomaly, please provide more details or clarify what aspect you are questioning.\nOurs Yes, there is an anomaly in the image. The pasta wheel has a noticeable crack on its lower right side, disrupting its otherwise smooth and continuous surface.\nTable 12. Additional results on VisA-D\u0026amp;R (Fryum).\nCashew Example: Brown Spot # User Does the image contain any anomaly?\nGPT-4o\nThe image shows a cashew nut on a textured black surface. 
There are no visible anomalies; the cashew appears normal.\nLLaVA-OV\nBased on the image provided, there are no visible anomalies. The object appears to be a single, intact cashew nut with a smooth surface and a natural brown coloration that is typical for cashews. There are no signs of damage, such as cracks or holes, and the background does not show any irregularities that would suggest an anomaly in the context of the image.\nOurs\nYes, there is a noticeable anomaly on the cashew nut: a small brown spot located near the center of the nut.\nTable 13. Additional results on VisA-D\u0026amp;R (Cashew).\nIn-the-Wild Example: Graffitied Road Sign # User Describe the anomaly in the image.\nAnomaly-OV The anomaly in the image is the graffiti on the highway sign, which includes black spray paint marks and scribbles over the word ’Seattle’ and other parts of the sign.\nUser What is the reason for the anomaly?\nAnomaly-OV The graffiti is likely the result of vandalism, where individuals have used spray paint to deface public property for personal expression or as a form of protest.\nUser How does the anomaly affect the functionality of the object?\nAnomaly-OV The graffiti affects the sign’s functionality by potentially obstructing the visibility of important information, such as the word ’Seattle,’ which could confuse drivers and lead to navigational errors.\nUser How to avoid the anomaly in the future?\nAnomaly-OV To avoid such anomalies, regular maintenance and cleaning of signs should be implemented. Additionally, community awareness programs can help reduce vandalism by promoting respect for public property.\nTable 14. 
In-the-wild results for an unseen object (Road Sign).\nFailure Example 1: Wrong Classification # User Is there any anomaly visible in the image?\nAnomaly-OV\nYes, there is a noticeable anomaly on the soap bar: a hole with jagged edges located towards the left side.\nFailure Example 2: General Description # User\nAre there any anomalies present in the image with the set of candles?\nAnomaly-OV\nYes, there is an anomaly present in the top right candle. It has a noticeable crack on its surface, which is not typical for a smooth, intact candle.\nFailure Example 3: Wrong Reasoning # User Are there any anomalies in the image?\nAnomaly-OV\nYes, the anomaly in the image is the absence of a crystal oscillator on the HC-SR04 ultrasonic sensor module.\nTable 15. Failure results of Anomaly-OV on VisA-D\u0026amp;R.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/towards-zero-shot-anomaly-detection-and-reasoning/","section":"Papers","summary":"Introduces a specialist visual assistant, Anomaly-OV, leveraging an anomaly expert and visual token selection mechanism to improve zero-shot anomaly detection and reasoning, establishing new datasets and benchmarks in the domain.","title":"Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models","type":"other"},{"content":" ","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/survey-paper/","section":"Papers","summary":"Unspecified","title":"Unspecified","type":"survey"},{"content":" Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought # Chao Huang 1 Benfeng Wang 1 Jie Wen 2 Chengliang Liu 3 Wei Wang 1 Li Shen 1 Xiaochun Cao 1\n1 Shenzhen Campus of Sun Yat-sen University 2 Harbin Institute of Technology, Shenzhen 3 Hong Kong Polytechnic University\n{huangch253, wangbf23, wangwei29, caoxiaochun}@mail.sysu.edu.cn\nwenjie@hit.edu.cn liucl1996@163.com mathshenli@gmail.com\nAbstract # Recent advancements in 
the reasoning capability of Multimodal Large Language Models (MLLMs) demonstrate their effectiveness in tackling complex visual tasks. However, existing MLLM-based Video Anomaly Detection (VAD) methods remain limited to shallow anomaly descriptions without deep reasoning. In this paper, we propose a new task named Video Anomaly Reasoning (VAR), which aims to enable deep analysis and understanding of anomalies in the video by requiring MLLMs to think explicitly before answering. To this end, we propose Vad-R1, an end-to-end MLLM-based framework for VAR. Specifically, we design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies, guiding the MLLM to reason about anomalies step by step. Based on the structured P2C-CoT, we construct Vad-Reasoning, a dedicated dataset for VAR. Furthermore, we propose an improved reinforcement learning algorithm, AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs through a self-verification mechanism with limited annotations. Experimental results demonstrate that Vad-R1 achieves superior performance, outperforming both open-source and proprietary models on VAD and VAR tasks. Codes and datasets will be released at https://github.com/wbfwonderful/Vad-R1 .\n1 Introduction # Video Anomaly Detection (VAD) focuses on identifying abnormal events in videos, and has been widely applied in a range of domains such as surveillance systems [49] and autonomous driving [37, 75]. Traditional VAD methods typically fall into two paradigms: semi-supervised and weakly-supervised VAD. The semi-supervised VAD methods [75, 32, 20, 34, 19, 17] aim at modeling the features of normal events, while only video-level annotations are available for weakly-supervised VAD methods [66, 49, 18, 17, 24, 90, 21]. With the development of vision-language models, some studies introduce semantic information into VAD [60, 68, 67, 76, 7]. 
However, traditional VAD methods remain at the level of detection, lacking understanding and explanation of anomalies.\nRecently, the reasoning capability of large language models has emerged as a key frontier [41, 9, 54]. Unlike daily dialogue, reasoning requires models to think before answering, enabling them to perform causal analysis and further understanding. In particular, DeepSeek-R1 demonstrates the effectiveness of Reinforcement Learning (RL) in stimulating reasoning capability [9]. Besides, parallel efforts have begun to extend reasoning to the multimodal domain [53, 56].\nPreprint. Under review.\nFigure 1: Overview of Vad-R1. Vad-R1 is an end-to-end framework for video anomaly reasoning. A structured Perception-to-Cognition Chain-of-Thought is proposed to guide Vad-R1 in step-by-step reasoning. Based on the structured CoT, a new dataset for video anomaly reasoning is constructed, including fine-grained anomaly categories. A two-stage training pipeline is adopted to progressively enhance the reasoning capability of Vad-R1. Finally, Vad-R1 outperforms existing MLLM-based VAD methods by a large margin on the VANE benchmark.\nDespite the growing interest in reasoning capability, existing VAD methods based on Multimodal Large Language Models (MLLMs) still fall short in this regard. Those methods can be divided into two categories based on the role of MLLMs. Some methods regard MLLMs as auxiliary modules [36, 84, 85, 11], where MLLMs provide supplementary explanation after the classifier predicts the anomaly confidence. In this context, anomaly understanding is a step after detection, and the output of MLLMs does not directly promote anomaly detection. Other methods utilize MLLMs to directly perform anomaly detection and understanding [50, 38, 73, 80, 13, 12], but the MLLMs only generate anomaly descriptions or perform simple anomaly question answering based on video content, lacking thinking and analytical abilities. 
Thus, reasoning remains underexplored in VAD.\nTo bridge this gap, we propose a new task: Video Anomaly Reasoning (VAR), which aims to empower MLLMs with the ability to perform structured, step-by-step reasoning about anomalous events in videos. Compared with existing video anomaly detection or understanding tasks, VAR targets a deeper level of analysis by mimicking the human cognitive process, enabling contextual understanding, behavior interpretation, and norm violation analysis. To this end, we propose Vad-R1, the first end-to-end MLLM-based framework for VAR, which explicitly performs reasoning before generating a response. However, realizing reasoning in video anomaly tasks presents two major challenges. Firstly, existing VAD datasets lack structured reasoning annotations, making them unsuitable for training and evaluating anomaly reasoning models. Secondly, how to effectively train models to acquire reasoning capability remains an open challenge. Unlike tasks with clearly defined objectives, open-ended VAR requires models to perform multi-step reasoning, making it difficult to define clear training objectives or directly guide the reasoning process.\nFor the first challenge, we design a structured Perception-to-Cognition Chain-of-Thought (P2C-CoT) for video anomaly reasoning, as shown in Figure 1(a). Inspired by how humans understand anomalies in videos, the proposed P2C-CoT first guides the model to perceive the video from its global environment down to its suspicious clips. After perception, the model performs cognition based on visual clues, from a shallow to a deep level. Finally, the model gives the analysis result as the answer, including the anomaly category, the anomaly description, the temporal range of the anomaly, the approximate spatial position of the anomaly, and so on. Then, based on the CoT, we construct Vad-Reasoning, a specially designed dataset for VAR, which includes fine-grained anomaly categories as shown in Figure 1(b). 
Vad-Reasoning consists of two complementary subsets. One subset contains videos with P2C-CoT annotations, which are generated step by step by proprietary models. The other subset contains a larger number of videos, for which only video-level weak labels are available due to high annotation costs. For the second challenge, inspired by the success of DeepSeek-R1, we propose a training pipeline with two stages as shown in Figure 1(c). In the first stage, Supervised Fine-Tuning (SFT) is performed to equip the base MLLM with fundamental anomaly reasoning capability. In the second stage, RL is employed to further incentivize the reasoning capability with the proposed Anomaly Verification Augmented Group Relative Policy Optimization (AVA-GRPO) algorithm, an extension of the original GRPO [47] specifically designed for VAR. During RL training, the model first generates a group of completions. Based on these completions, the original videos are temporally trimmed and the trimmed videos are then fed back to the model to generate new completions. The two sets of completions are subsequently compared, and an additional anomaly verification reward is assigned if a predefined condition is satisfied. Finally, AVA-GRPO promotes the MLLM\u0026rsquo;s video anomaly reasoning capability through this self-verification mechanism with limited annotations. In summary, the contributions of this paper are threefold:\nWe propose Vad-R1, a novel end-to-end MLLM-based framework tailored for VAR, which aims at further analysis and understanding of anomalies in the video.\nWe design a structured Perception-to-Cognition Chain-of-Thought, and construct Vad-Reasoning, a specially designed dataset for video anomaly reasoning with two subsets. Besides, we propose an improved reinforcement learning algorithm, AVA-GRPO, which incentivizes the reasoning capability of MLLMs through self-verification.\n
The experimental results show that the proposed Vad-R1 achieves superior performance across multiple evaluation scenarios, surpassing both open-source and proprietary models in video anomaly detection and reasoning tasks.\n2 Related Works # Video Anomaly Detection and Dataset. Video anomaly detection aims at localizing the abnormal events in videos. Based on the training data, traditional VAD methods typically fall into two paradigms: semi-supervised VAD [75, 32, 20, 34, 19, 17, 45, 72, 79] and weakly supervised VAD [66, 49, 18, 17, 24, 90, 21, 91]. Furthermore, some studies try to introduce text descriptions to enhance detection [60, 68, 67, 76, 7, 8]. Recently, there has been growing interest in integrating MLLMs into VAD to improve understanding and explanation [36, 50, 38, 73, 80, 84, 85, 11, 13, 12]. However, current studies remain at shallow understanding with MLLMs, lacking in-depth exploration of reasoning capability. In this paper, we propose an end-to-end framework to explore the enhancement of reasoning capability for video anomaly tasks.\nFurthermore, the existing VAD datasets primarily provide coarse-grained category labels [49, 66, 37, 1] or abnormal event descriptions [13, 12, 50, 78], lacking annotation of the reasoning process. To address this gap, we propose a structured Perception-to-Cognition Chain-of-Thought and a dataset specially designed for video anomaly reasoning, providing step-by-step CoT annotations.\nVideo Multimodal Large Language Model. Video multimodal large models provide an interactive way to understand video content. Early works integrate visual encoders into large language models by aligning visual and textual tokens via mapping networks [25, 30, 39, 83, 87]. Compared to static images, videos contain more redundant information. Consequently, some studies explore token compression mechanisms to obtain longer context [29, 71, 86, 23]. 
In addition, recent works have explored online video stream understanding [6, 10, 74, 69]. Nevertheless, these methods remain at the level of video understanding and lack exploration of reasoning capability.\nMultimodal Large Language Model with Reasoning Capability. Enhancing the reasoning capability of MLLMs has become a major research focus. Some studies propose multi-stage reasoning frameworks and large-scale CoT datasets to enhance the reasoning capability of MLLMs [70, 59, 33]. Recently, DeepSeek-R1 [9] demonstrates the potential of reinforcement learning in enhancing the reasoning capability, inspiring subsequent efforts to reproduce its success in multimodal domains [22, 81]. In the field of video, some studies also utilize RL to improve spatial reasoning [28], temporal reasoning [64] and general causal reasoning [14, 88]. In this paper, we focus on the video anomaly reasoning task.\n3 Method: Vad-R1 # Overview. In this section, we introduce Vad-R1, a novel end-to-end MLLM-based framework for VAR. The reasoning capability of Vad-R1 is derived from a two-stage training strategy: SFT with high-quality CoT-annotated videos and RL based on the AVA-GRPO algorithm. We begin by introducing the proposed P2C-CoT in Section 3.1. Based on the P2C-CoT, we construct Vad-Reasoning, a new dataset as detailed in Section 3.2. Then, we introduce the improved RL algorithm AVA-GRPO in Section 3.3. Finally, we introduce the training pipeline of Vad-R1 in Section 3.4.\nFigure 2: Overview of the proposed Perception-to-Cognition CoT and Vad-Reasoning dataset. (b) Illustration of the answer after reasoning. (c) The arrangement of Vad-Reasoning dataset.\n3.1 Perception-to-Cognition Chain-of-Thought # When humans interpret a video, they typically first observe the events that occur in the video, and then develop a deeper understanding based on visual observation. 
Motivated by this, we design a structured Perception-to-Cognition Chain-of-Thought (P2C-CoT) for video anomaly reasoning, which gradually transitions from Perception to Cognition in 2 stages with 4 steps as shown in Figure 2(a), and concludes with a concise answer as shown in Figure 2(b).\nPerception. When watching a video, humans typically begin with a holistic observation of the scene and environment, and then shift attention to specific objects or events that appear abnormal. In line with this pattern, the perception stage of the proposed P2C-CoT reflects a transition from global observation to focused local observation. The model initially focuses on the whole environment, describes the scenes and recognizes the objects in the video. This step requires the model to have a comprehensive understanding of the normality in the video. Building upon this holistic understanding of the normality, the model then focuses on the events that deviate from the established normality, and identifies what happens as well as when and where it happens.\nCognition. After observing the video content, humans typically identify abnormal events based on visual cues, and then proceed to reason about the potential consequences. Similarly, the cognitive stage of the proposed P2C-CoT reflects a progression from shallow cognition to deep cognition. The model first assesses the abnormality of the event and explains why it is considered anomalous with relevant visual signals. It then engages in higher-level cognition to reason about the underlying causes, the violated social expectations, and the possible consequences of the abnormal event.\nAnswer. As shown in Figure 2(b), following the reasoning process, the model is expected to provide a short summary of its judgment about the given video. 
The final answer consists of key points related to the anomaly, including the category (Which), the description of the event (What), the spatio-temporal localization (When \u0026amp; Where), the reason it is identified as an anomaly (Why), and the potential influence (How). Notably, for normal videos, the corresponding P2C-CoT is simplified into two steps. Please refer to Appendix B for more details.\nFigure 3: Illustration of the two-stage training pipeline for Vad-R1. Stage 1 enables the model to acquire basic reasoning capability with CoT annotated video. Stage 2 further enhances the model\u0026rsquo;s reasoning capability through reinforcement learning.\n3.2 Dataset: Vad-Reasoning # Video Collection. The existing VAD datasets generally lack annotation of the reasoning process. To construct a more suitable dataset for VAR, we take the following two aspects into consideration. On the one hand, we aim for the proposed dataset to cover a wide range of real-life scenarios. Similar to HAWK [50], we collect videos from current VAD datasets. The video scenarios include crimes under surveillance (UCF-Crime [49]), violent events under camera (XD-Violence [66]), traffic (TAD [37]), campus (ShanghaiTech [32]) and city (UBnormal [1]). Besides, we also collect videos from ECVA [12], a multi-scene benchmark. On the other hand, we strive to broaden the coverage of anomaly categories. To this end, we define a taxonomy of anomalies comprising three main types: Human Activity Anomaly, Environments Anomaly, and Objects Anomaly. Each type is categorized into several main categories, which are further divided into fine-grained subcategories. Then, we collect additional videos from the internet based on the existing dataset to expand the categories of anomalies. In total, the proposed Vad-Reasoning dataset contains 8203 videos for training and 438 videos for testing. 
As shown in Figure 2(c), the training set of Vad-Reasoning is split into two subsets: Vad-Reasoning-SFT, which contains 1755 videos annotated with a high-quality reasoning process, and Vad-Reasoning-RL, which contains 6448 videos with video-level weak labels.\nAnnotation. To construct the proposed Vad-Reasoning dataset, we design a multi-stage annotation pipeline with two proprietary models, Qwen-Max [55] and Qwen-VL-Max [57]. In order to ensure that the P2C-CoT annotation covers all key information in the video, we follow the principle of high frame information density [77]. Specifically, we first prompt Qwen-VL-Max to generate dense descriptions of video frames. These frame-level descriptions are then fed into Qwen-Max to generate the CoT step by step with different prompts. Please refer to Appendix B for more details.\n3.3 AVA-GRPO # The original GRPO shows great effectiveness in text-based reasoning tasks. However, as mentioned above, multimodal tasks like VAR are inherently more complex. In addition, only video-level weak labels are available for the RL stage due to high annotation costs, making it difficult to evaluate output quality based solely on accuracy and format rewards. To address this challenge, we propose Anomaly Verification Augmented GRPO (AVA-GRPO), which introduces an additional reward through a self-verification mechanism, as illustrated in the right part of Figure 3.\nOverview of GRPO. We begin by reviewing the original GRPO [47]. GRPO discards the value model and aims at maximizing the relative advantages of the answers. For a question $q$, the model first generates a group of completions $O = \\{o_i\\}_{i=0}^{G}$. Subsequently, a set of rewards $R = \\{r_i\\}_{i=0}^{G}$ is computed based on the predefined reward functions. The rewards are then normalized to compute the relative advantages as\n$A_i = \\frac{r_i - \\mathrm{mean}(R)}{\\mathrm{std}(R)},$\nwhere $A_i$ is the advantage score of $o_i$, which provides a more effective assessment of both individual answer quality and relative comparisons within the group. 
Moreover, to prevent the current policy πθ from drifting excessively from the reference policy πref, GRPO introduces a KL-divergence regularization term. The final objective function of GRPO is formulated as\nJ_GRPO(θ) = E[ (1/|O|) Σ_i ( min( ρ_i A_i, clip(ρ_i, 1 − ϵ, 1 + ϵ) A_i ) − β D_KL(πθ ‖ πref) ) ], with ρ_i = πθ(o_i|q) / πθ_old(o_i|q),\nwhere the ratio ρ_i quantifies the relative change between the current policy and the old one, the clip(·, 1 − ϵ, 1 + ϵ) operation constrains the ratio within a range, and β weights the KL term.\nAnomaly Verification Reward GRPO replaces the value model with group-relative scores, reducing memory usage and training time. However, simple accuracy and format rewards are insufficient to evaluate the quality of answers for the video anomaly reasoning task. To address this, we propose AVA-GRPO, an extension of GRPO that incorporates a novel anomaly verification reward. As shown in the right part of Figure 3, for each completion o_i, the predicted category of the video is first extracted. The video is then temporally trimmed based on the extracted prediction, and the trimmed video is fed into the model to generate a new answer. Additional anomaly verification rewards are assigned by comparing the original and regenerated answers.\nOn the one hand, if the video is initially classified as abnormal, the predicted temporal range of the abnormal event is extracted, and the corresponding segment is discarded from the original video to create a new trimmed video containing only normal segments. The trimmed video is then fed into the model again. If the trimmed video is subsequently predicted as normal, it suggests that the discarded segment is indeed abnormal and the model\u0026rsquo;s initial prediction was correct. 
In this situation, a positive reward is assigned to reinforce the model\u0026rsquo;s original prediction.\nOn the other hand, inspired by Video-UTR [77], we consider the phenomenon of temporal hacking in video MLLMs, where models tend to generate predictions by relying on only a few frames, typically at the beginning or end of the video, instead of comprehensively processing the entire video sequence, which is detrimental to the recognition of anomalous events. Consequently, if the video is initially predicted as normal, we randomly discard either the beginning or the ending segment of the video and feed the trimmed video into the model again. If the trimmed video is then predicted as abnormal, it suggests that the model made its original prediction based on insufficient visual evidence, which is undesirable. Therefore, a negative reward is assigned in this case.\n3.4 Training Pipeline # We adopt Qwen-2.5-VL-7B [57] as the base MLLM. The training of Vad-R1 consists of two stages, as shown in Figure 3. 
For the first stage, supervised fine-tuning is performed on the Vad-Reasoning-SFT dataset, in which videos are annotated with high-quality Chain-of-Thought (CoT) as described before. In this stage, the model\u0026rsquo;s capability is gradually shifted from general multimodal understanding to video anomaly understanding, and it acquires basic anomaly reasoning capability. In the second stage, training continues on the Vad-Reasoning-RL dataset with the proposed AVA-GRPO reinforcement learning algorithm, which evaluates the quality of model responses in a self-verification manner with only video-level weak labels available. This stage aims at moving the model beyond pattern-matching tendencies from SFT, enabling it to develop more flexible, transferable anomaly reasoning capability. Please refer to Appendix C for more details.\nTable 1: Effectiveness of anomaly reasoning.\n| Method | Strategy | Answer BLEU-2 | Answer METEOR | Detection Recall | Detection F1 |\n| --- | --- | --- | --- | --- | --- |\n| Qwen2.5-VL-7B [57] | Direct Answer | 0.184 | 0.339 | 0.431 | 0.597 |\n| Qwen2.5-VL-7B [57] | Random Reasoning | 0.179 | 0.328 | 0.377 | 0.540 |\n| Qwen2.5-VL-7B [57] | Structured Reasoning | 0.198 (+0.019) | 0.352 (+0.013) | 0.696 (+0.265) | 0.730 (+0.133) |\n| Qwen3-8B [58] | Direct Answer | 0.038 | 0.184 | 0.368 | 0.534 |\n| Qwen3-8B [58] | Random Reasoning | 0.040 | 0.191 | 0.554 | 0.655 |\n| Qwen3-8B [58] | Structured Reasoning | 0.043 (+0.005) | 0.193 (+0.009) | 0.681 (+0.313) | 0.686 (+0.153) |\n| Vad-R1 | Direct Answer | 0.268 | 0.441 | 0.838 | 0.861 |\n4 Experiments # 4.1 Experimental Settings # Implementation Details Vad-R1 is trained in two stages based on Qwen-2.5-VL-7B [57]. For the first stage, SFT is performed on the Vad-Reasoning-SFT dataset for four epochs. For the second stage, RL is performed with AVA-GRPO for one epoch, where only video-level weak labels are available for the Vad-Reasoning-RL dataset. All experiments are conducted on 4 NVIDIA A100 (80GB) GPUs. 
Please refer to Appendix C for more details.\nEvaluation Metrics and Baselines We first evaluate Vad-R1 on the test set of Vad-Reasoning, focusing on two aspects: anomaly reasoning and anomaly detection. For anomaly reasoning, we assess the text quality of the reasoning process with BLEU [43], METEOR [3], and ROUGE [31] metrics. For anomaly detection, we report accuracy, precision, recall, and F1 scores for anomaly classification, along with mIoU and R@K for anomaly temporal grounding. Besides, to further explore the capabilities of Vad-R1, we also conduct experiments on VANE [15], a video anomaly benchmark for MLLMs, where the MLLMs are asked to answer single-choice questions. In this case, we report the accuracy for every category. We compare Vad-R1 with general video MLLMs [25, 30, 39, 83, 87], reasoning video MLLMs [28, 64, 14, 88], and some proprietary models [56, 40, 52, 51]. Furthermore, we also consider MLLM-based VAD methods [50, 85, 84].\nIn the following sections, we present our experimental results by addressing the following questions.\nQ1. Does reasoning improve anomaly detection?\nQ2. How well does Vad-R1 perform in anomaly reasoning and detection?\nQ3. How to acquire the capability of reasoning?\n4.2 Main Results # Q1: Does reasoning improve anomaly detection? Table 1 demonstrates the effectiveness of anomaly reasoning. On the one hand, we evaluate the performance of Qwen2.5-VL [57] and Qwen3 [58]. As shown in the first two rows of Table 1, compared with answering directly, prompting models to reason according to the proposed perception-to-cognition chain-of-thought yields better performance. Meanwhile, we evaluate the effect of random reasoning. In this case, the performance improvement is minimal, and the results can even be inferior to direct answering. Notably, Qwen3 is a hybrid reasoning model that supports both reasoning and non-reasoning modes for the same task. The consistent performance gap across different settings further highlights the effectiveness of the proposed P2C-CoT for anomaly reasoning and detection. On the other hand, we compare the performance of Vad-R1 trained with the full P2C-CoT versus training with only the final answer portion of the P2C-CoT, as shown in the third row of Table 1. When Vad-R1 is trained with only the final answer, it exhibits a performance drop.\nTable 2: Performance comparison of anomaly reasoning and detection on Vad-Reasoning dataset.\n| Method | Params. | Reasoning BLEU-2 | Reasoning METEOR | Reasoning ROUGE-2 | Detection Acc | Detection F1 | Detection mIoU | Detection R@0.3 | Detection R@0.5 |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Open-Source video MLLMs | | | | | | | | | |\n| InternVideo2.5 [65] | 8B | 0.110 | 0.264 | 0.109 | 0.715 | 0.730 | 0.417 | 0.458 | 0.424 |\n| InternVL3 [92] | 8B | 0.124 | 0.286 | 0.116 | 0.779 | 0.756 | 0.550 | 0.613 | 0.540 |\n| VideoChat-Flash [27] | 7B | 0.012 | 0.084 | 0.047 | 0.683 | 0.487 | 0.536 | 0.538 | 0.358 |\n| VideoLLaMA3 [82] | 7B | 0.066 | 0.200 | 0.092 | 0.665 | 0.624 | 0.425 | 0.451 | 0.419 |\n| LLaVA-NeXT-Video [89] | 7B | 0.094 | 0.238 | 0.104 | 0.651 | 0.423 | 0.576 | 0.601 | 0.585 |\n| Qwen2.5-VL [57] | 7B | 0.113 | 0.264 | 0.116 | 0.761 | 0.730 | 0.567 | 0.610 | 0.563 |\n| Open-Source video reasoning MLLMs | | | | | | | | | |\n| Open-R1-Video [63] | 7B | 0.060 | 0.179 | 0.084 | 0.793 | 0.790 | 0.559 | 0.642 | 0.540 |\n| Video-R1 [14] | 7B | 0.135 | 0.317 | 0.132 | 0.624 | 0.694 | 0.334 | 0.392 | 0.328 |\n| VideoChat-R1 [28] | 7B | 0.128 | 0.287 | 0.123 | 0.793 | 0.790 | 0.559 | 0.642 | 0.540 |\n| MLLM-based VAD methods | | | | | | | | | |\n| Holmes-VAD [84] | 7B | 0.003 | 0.074 | 0.027 | 0.565 | 0.120 | - | - | - |\n| Holmes-VAU [85] | 2B | 0.077 | 0.182 | 0.075 | 0.490 | 0.371 | - | - | - |\n| HAWK [50] | 7B | 0.042 | 0.156 | 0.042 | 0.513 | 0.648 | - | - | - |\n| Proprietary MLLMs | | | | | | | | | |\n| Claude3.5-Haiku [2] | - | 0.097 | 0.253 | 0.098 | 0.580 | 0.354 | 0.518 | 0.543 | 0.524 |\n| GPT-4o [40] | - | 0.154 | 0.341 | 0.133 | 0.711 | 0.760 | 0.472 | 0.565 | 0.476 |\n| Gemini2.5-Flash [51] | - | 0.133 | 0.308 | 0.120 | 0.624 | 0.707 | 0.370 | 0.437 | 0.358 |\n| Proprietary reasoning MLLMs | | | | | | | | | |\n| Gemini2.5-Pro [52] | - | 0.145 | 0.356 | 0.137 | 0.829 | 0.836 | 0.636 | 0.722 | 0.638 |\n| QVQ-Max [56] | - | 0.142 | 0.318 | 0.121 | 0.702 | 0.747 | 0.430 | 0.503 | 0.412 |\n| o4-mini [42] | - | 0.106 | 0.263 | 0.109 | 0.884 | 0.875 | 0.644 | 0.736 | 0.631 |\n| Vad-R1 (Ours) | 7B | 0.233 | 0.406 | 0.194 | 0.875 | 0.862 | 0.713 | 0.770 | 0.706 |\nQ2: How well does Vad-R1 perform in anomaly reasoning and detection? Table 2 shows the performance comparison of anomaly reasoning and detection tasks on the test set of Vad-Reasoning. Vad-R1 achieves strong performance in both the text quality of the anomaly reasoning process and the accuracy of anomaly detection. It is worth noting that Vad-R1 significantly outperforms the existing proprietary reasoning MLLMs Gemini2.5-Pro, QVQ-Max, and o4-mini in anomaly reasoning capability, with BLEU-2 score improvements of 0.088, 0.091, and 0.127, respectively. Besides, compared with existing MLLM-based VAD methods, Vad-R1 also exhibits greater advantages in anomaly reasoning and detection. Table 3 demonstrates the results on VANE benchmark. 
Vad-R1 also outperforms all baselines, including general video MLLMs and MLLM-based VAD methods.\n4.3 Ablation Studies # Q3: How to obtain the capability of reasoning? Table 4 shows the effectiveness of different training strategies. When RL is applied directly to the base model without prior SFT, the performance improvement is limited. This suggests that, without fundamental reasoning capability, the model struggles to benefit from RL training with video-level weak labels. In contrast, applying SFT leads to a more significant performance improvement, indicating that the structured Chain-of-Thought annotations effectively equip the model with basic anomaly reasoning capability. Notably, the combination of SFT and RL achieves the best performance. The results align with the conclusion of DeepSeek-R1 [9], which suggests that the SFT stage provides fundamental reasoning capability for the model, while the RL stage further enhances its reasoning capability.\nTable 3: Performance comparison on VANE.\n| Method | SORA | OpenSORA | RG2 | VideoLCM | MS-T2 | Avenue | Ped1 | Ped2 |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Open-Source MLLMs | | | | | | | | |\n| Video-LLaMA [83] | 11.59 | 18.00 | 16.00 | 10.57 | 10.41 | 30.00 | 16.66 | 5.55 |\n| VideoChat [25] | 10.74 | 28.00 | 4.00 | 17.64 | 20.83 | 32.25 | 13.33 | 13.88 |\n| Video-ChatGPT [39] | 26.47 | 22.00 | 12.00 | 18.26 | 16.66 | 39.39 | 40.00 | 19.44 |\n| Video-LLaVA [30] | 10.86 | 18.00 | 16.00 | 19.23 | 16.66 | 3.03 | 2.77 | 6.06 |\n| MovieChat [48] | 8.69 | 10.00 | 16.00 | 14.42 | 6.25 | 18.18 | 6.66 | 11.11 |\n| LLaMA-VID [29] | 7.97 | 14.00 | 20.00 | 19.23 | 14.58 | 27.27 | 6.66 | 19.44 |\n| TimeChat [44] | 21.73 | 26.00 | 28.00 | 22.11 | 20.83 | 24.20 | 27.58 | 11.11 |\n| LLM-based VAD methods | | | | | | | | |\n| Holmes-VAU [85] | 2.17 | 34.00 | 24.00 | 29.81 | 25.00 | 6.06 | 3.33 | 5.56 |\n| Holmes-VAD [84] | 6.52 | 34.00 | 32.00 | 33.56 | 22.92 | 12.12 | 20.00 | 5.56 |\n| HAWK [50] | 24.64 | 52.00 | 44.00 | 36.54 | 50.00 | 36.36 | 36.67 | 38.89 |\n| Vad-R1 (ours) | 41.30 | 78.00 | 56.00 | 63.46 | 60.42 | 75.76 | 60.00 | 63.89 |\nTable 4: Comparison of different training strategies for Vad-R1.\n| Strategy | Reasoning BLEU-2 | Reasoning ROUGE-1 | Reasoning ROUGE-2 | Reasoning ROUGE-L | Detection Prec. | Detection mIoU | Detection R@0.3 | Detection R@0.5 | Detection R@0.7 |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Qwen2.5-VL | 0.113 | 0.505 | 0.199 | 0.477 | 0.768 | 0.567 | 0.610 | 0.563 | 0.526 |\n| +SFT | 0.219 | 0.456 | 0.196 | 0.429 | 0.712 | 0.612 | 0.677 | 0.599 | 0.535 |\n| +AVA-GRPO | 0.143 | 0.513 | 0.207 | 0.486 | 0.810 | 0.675 | 0.736 | 0.661 | 0.606 |\n| +SFT+AVA-GRPO | 0.233 | 0.530 | 0.238 | 0.501 | 0.882 | 0.713 | 0.770 | 0.706 | 0.651 |\nFigure 4: Qualitative performance on VANE benchmark.\n4.4 Qualitative Analyses # As shown in Figure 4, Vad-R1 demonstrates strong reasoning capability in complex environments and correctly identifies anomalies in the video. In comparison, the reasoning process of Holmes-VAU is only partially correct, resulting in an incorrect judgment, while Holmes-VAD makes a correct judgment but follows an incorrect reasoning process. Please refer to Appendix D for more qualitative results.\n5 Conclusion # In this paper, we present Vad-R1, a novel end-to-end MLLM-based framework for video anomaly reasoning, which aims to enable deep analysis and understanding of anomalies in videos. Vad-R1 performs anomaly reasoning through a structured Chain-of-Thought that progresses gradually from perception to cognition. The anomaly reasoning capability of Vad-R1 is derived from a two-stage training strategy, combining supervised fine-tuning on CoT-annotated videos and reinforcement learning with an anomaly verification mechanism. Experimental results demonstrate that Vad-R1 achieves superior performance on anomaly detection and reasoning tasks.\nReferences # [1] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. 
Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20143–20153, 2022.\n[2] Anthropic. Claude 3.5 haiku, 2024. URL https://www.anthropic.com/claude/haiku\n[3] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.\n[4] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1:8, 2024.\n[5] Congqi Cao, Yue Lu, Peng Wang, and Yanning Zhang. A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20392–20401, June 2023.\n[6] Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024.\n[7] Junxi Chen, Liang Li, Li Su, Zheng-Jun Zha, and Qingming Huang. Prompt-enhanced multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18319–18329, 2024.\n[8] Weiling Chen, Keng Teck Ma, Zi Jian Yew, Minhoe Hur, and David Aik-Aun Khoo. Tevad: Improved video anomaly detection with captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5549–5559, 2023.\n[9] DeepSeek-AI. 
DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.\n[10] Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, and Hao Jiang. Streaming video question-answering with in-context video kv-cache retrieval. arXiv preprint arXiv:2503.00540, 2025.\n[11] Zongcan Ding, Haodong Zhang, Peng Wu, Guansong Pang, Zhiwei Yang, Peng Wang, and Yanning Zhang. Slowfastvad: Video anomaly detection via integrating simple detector and rag-enhanced vision-language model. arXiv preprint arXiv:2504.10320, 2025.\n[12] Hang Du, Guoshun Nan, Jiawen Qian, Wangchenhui Wu, Wendi Deng, Hanqing Mu, Zhenyan Chen, Pengxuan Mao, Xiaofeng Tao, and Jun Liu. Exploring what why and how: A multifaceted benchmark for causation understanding of video anomaly. arXiv preprint arXiv:2412.07183, 2024.\n[13] Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, et al. Uncovering what why and how: A comprehensive benchmark for causation understanding of video anomaly. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18793–18803, 2024.\n[14] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025.\n[15] Hanan Gani, Rohit Bharadwaj, Muzammal Naseer, Fahad Shahbaz Khan, and Salman Khan. Vane-bench: Video anomaly evaluation benchmark for conversational lmms. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3123–3140, 2025.\n[16] HPCAI Tech. Open-sora: Democratizing efficient video production for all. https://github.com/hpcaitech/Open-Sora, 2024.\n[17] Chao Huang, Zhihao Wu, Jie Wen, Yong Xu, Qiuping Jiang, and Yaowei Wang. 
Abnormal event detection using deep contrastive learning for intelligent video surveillance system. IEEE Transactions on Industrial Informatics, 18(8):5171–5179, 2021.\n[18] Chao Huang, Chengliang Liu, Jie Wen, Lian Wu, Yong Xu, Qiuping Jiang, and Yaowei Wang. Weakly supervised video anomaly detection via self-guided temporal discriminative transformer. IEEE Transactions on Cybernetics, 54(5):3197–3210, 2022.\n[19] Chao Huang, Jie Wen, Yong Xu, Qiuping Jiang, Jian Yang, Yaowei Wang, and David Zhang. Self-supervised attentive generative adversarial networks for video anomaly detection. IEEE transactions on neural networks and learning systems, 34(11):9389–9403, 2022.\n[20] Chao Huang, Jie Wen, Chengliang Liu, and Yabo Liu. Long short-term dynamic prototype alignment learning for video anomaly detection. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 866–874, 2024.\n[21] Chao Huang, Weiliang Huang, Qiuping Jiang, Wei Wang, Jie Wen, and Bob Zhang. Multimodal evidential learning for open-world weakly-supervised video anomaly detection. IEEE Transactions on Multimedia, 2025.\n[22] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.\n[23] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024.\n[24] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In 2023 IEEE International Conference on Image Processing (ICIP), pages 3230–3234. IEEE, 2023. 
\n[25] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.\n[26] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE transactions on pattern analysis and machine intelligence, 36(1):18–32, 2013.\n[27] Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024.\n[28] Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025.\n[29] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024.\n[30] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.\n[31] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.\n[32] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018.\n[33] Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. Videomind: A chain-of-lora agent for long video reasoning. arXiv preprint arXiv:2503.13444, 2025.\n[34] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. 
A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13588–13597, 2021.\n[35] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013.\n[36] Hui Lv and Qianru Sun. Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702, 2024.\n[37] Hui Lv, Chuanwei Zhou, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Localizing anomalies from weakly-labeled videos. IEEE transactions on image processing, 30:4505–4515, 2021.\n[38] Junxiao Ma, Jingjing Wang, Jiamin Luo, Peiying Yu, and Guodong Zhou. Sherlock: Towards multi-scene video abnormal event extraction and localization via a global-local spatial-sensitive llm. In Proceedings of the ACM on Web Conference 2025, pages 4004–4013, 2025.\n[39] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.\n[40] OpenAI. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.\n[41] OpenAI. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.\n[42] OpenAI. Openai o3 and o4-mini system card, 2025. URL https://openai.com/index/o3-o4-mini-system-card/\n[43] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.\n[44] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024.\n[45] Nicolae-C Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, Mubarak Shah, et al. Self-distilled masked auto-encoders are efficient video anomaly detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15984–15995, 2024.\n[46] Runway Research. Gen-2: The next step forward for generative ai. https://research.runwayml.com/gen2, 2024.\n[47] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.\n[48] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.\n[49] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.\n[50] Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and Yingcong Chen. Hawk: Learning to understand open-world video anomalies. Advances in Neural Information Processing Systems, 37:139751–139785, 2024.\n[51] Gemini Team. Gemini 2.5 flash preview model card, 2025. URL https://storage.googleapis.com/model-cards/documents/gemini-2.5-flash-preview.pdf\n[52] Gemini Team. Gemini 2.5 pro preview model card, 2025. URL https://storage.googleapis.com/model-cards/documents/gemini-2.5-pro-preview.pdf\n[53] Kimi Team. 
Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.\n[54] Qwen Team. QwQ: Reflect deeply on the boundaries of the unknown, 2024. URL https://qwenlm.github.io/blog/qwq-32b-preview/\n[55] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.\n[56] Qwen Team. QVQ-Max: Think with evidence, 2025. URL https://qwenlm.github.io/blog/qvq-max-preview/\n[57] Qwen Team. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.\n[58] Qwen Team. Qwen3: Think deeper, act faster, 2025. URL https://qwenlm.github.io/blog/qwen3/\n[59] Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025.\n[60] Benfeng Wang, Chao Huang, Jie Wen, Wei Wang, Yabo Liu, and Yong Xu. Federated weakly supervised video anomaly detection with multimodal prompt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 21017–21025, 2025.\n[61] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report, 2023.\n[62] Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. Videolcm: Video latent consistency model, 2023.\n[63] Xiaodong Wang and Peixi Peng. Open-r1-video, 2025. URL https://github.com/Wang-Xiaodong1899/Open-R1-Video\n[64] Ye Wang, Boshen Xu, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, and Qin Jin. Timezero: Temporal video grounding with reasoning-guided lvlm. arXiv preprint arXiv:2503.13377, 2025.\n[65] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2.5: Empowering video mllms with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025.\n[66] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 322–339. Springer, 2020.\n[67] Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open-vocabulary video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18297–18307, 2024.\n[68] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6074–6082, 2024.\n[69] Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, and Huchuan Lu. Streaming video understanding and multi-round interaction with memory-enhanced knowledge. arXiv preprint arXiv:2501.13468, 2025.\n[70] Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024.\n[71] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024.\n[72] Cheng Yan, Shiyu Zhang, Yang Liu, Guansong Pang, and Wenjun Wang. Feature prediction diffusion model for video anomaly detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5527–5537, 2023.\n[73] Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. 
Follow the rules: reasoning for video anomaly detection with large language models. In European Conference on Computer Vision, pages 304–322. Springer, 2024.\n[74] Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, and Changsheng Xu. Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding. arXiv preprint arXiv:2502.10810, 2025.\n[75] Yu Yao, Xizi Wang, Mingze Xu, Zelin Pu, Yuchen Wang, Ella Atkins, and David J Crandall. Dota: Unsupervised detection of traffic anomaly in driving videos. IEEE transactions on pattern analysis and machine intelligence, 45(1):444–459, 2022.\n[76] Muchao Ye, Weiyang Liu, and Pan He. Vera: Explainable video anomaly detection via verbalized learning of vision-language models. arXiv preprint arXiv:2412.01095, 2024.\n[77] En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, et al. Unhackable temporal rewarding for scalable video mllms. arXiv preprint arXiv:2502.12081, 2025.\n[78] Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin, and Zhenzhen Jiao. Towards surveillance video-and-language understanding: New dataset baselines and challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22052–22061, 2024.\n[79] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14744–14754, 2022.\n[80] Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18527–18536, 2024. 
\n[81] Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, and Jinqiao Wang. Vision-r1: Evolving human-free alignment in large vision-language models via visionguided reinforcement learning. arXiv preprint arXiv:2503.18013, 2025.\n[82] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.\n[83] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.\n[84] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, and Nong Sang. Holmes-vad: Towards unbiased and explainable video anomaly detection via multi-modal llm. arXiv preprint arXiv:2406.12235, 2024.\n[85] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xiaonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, and Nong Sang. Holmes-vau: Towards long-term video anomaly understanding at any granularity. arXiv preprint arXiv:2412.06171, 2024.\n[86] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.\n[87] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.\n[88] Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reasoning. arXiv preprint arXiv:2504.09641, 2025.\n[89] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. 
Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 8 , 23\n[90] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1237–1246, 2019. 1 , 3\n[91] Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 3769–3777, 2023. 3\n[92] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 8 , 23\nA Summary of Appendix # This appendix provides supplementary information for the main paper. First, we provide detailed information about the proposed Vad-Reasoning dataset, including the construction process, statistical analysis, and some examples. Then, we provide more experimental details covering prompts, settings, parameters, and computing resources. Furthermore, we provide more experimental results as well as visualizations. Finally, we discuss the potential impact and limitations.\nB The Proposed Vad-Reasoning Dataset # B.1 Annotation Pipeline # The training set of Vad-Reasoning consists of two subsets: Vad-Reasoning-SFT and Vad-Reasoning-RL. For Vad-Reasoning-RL, we retain the original dataset annotations and collapse them into video-level weak labels (Abnormal or Normal). 
For Vad-Reasoning-SFT, we design a multi-stage annotation process based on the proposed P2C-CoT, as shown in Figure 5.\nFigure 5: Illustration of the multi-stage annotation process of the Vad-Reasoning-SFT dataset.\nFrame Description First, each video is tagged with (1) the approximate spatial location of the anomaly, (2) the temporal span of the anomaly, and (3) the fine-grained anomaly category. Then, the video is decomposed into separate frames with a frame interval of 16. The extracted frames are then fed into Qwen-VL-Max to generate detailed descriptions.\nGlobal Perception All frame captions are concatenated in temporal order and passed to Qwen-Max, producing a holistic scene description covering environments, objects, and actions. Notably, only the normal pattern is described in this stage.\nLocal Perception Captions corresponding to the abnormal frames are isolated and sent to Qwen-Max again, yielding a description of the abnormal event. However, this stage remains at the perception of an event that is inconsistent with the normal pattern, without any judgment about its abnormality.\nShallow Cognition Given the descriptions of abnormal frames, the description of the abnormal event, and the corresponding anomaly category, Qwen-Max is required to perform anomaly identification and give a short explanation in this stage.\nDeep Cognition Building on the output of shallow cognition, Qwen-Max performs deeper reasoning about the anomaly in the video with the description of the abnormal event and the corresponding anomaly category.\nAnswer Finally, the outputs of the above steps are merged by Qwen-Max to generate a short summary of the anomaly with the key words enclosed by defined tags (e.g. 
\u0026lt;which\u0026gt;\u0026lt;/which\u0026gt; tags to enclose the predicted anomaly type, while \u0026lt;what\u0026gt;\u0026lt;/what\u0026gt; tags to enclose description of the abnormal event)\nFurthermore, throughout the entire annotation process, to ensure high-quality and ethically sound annotations generated by Qwen-VL-Max and Qwen-Max, we define the following annotation guidelines:\nRelevance: All responses should be directly related to the visual content of the video. Any unrelated assumptions or hallucinated contents must be strictly avoided. Objectivity: All responses must be based on observable visual evidence, avoiding speculation or subjective interpretation. Neutrality: All responses should exclude any references to geographic locations, race, gender, political views, or religious beliefs. Non-discrimination: Any form of biased, discriminatory, or offensive language is strictly prohibited. Style: Language should be clear, neutral, and general-purpose to ensure universal readability and usability. Conciseness: Each response should consist of 4 to 6 sentences to maintain clarity and focus. B.2 Statistical Analysis and Comparison # We compare Vad-Reasoning with existing video anomaly detection and understanding datasets in Table 5 and Table 6. Vad-Reasoning consists of a total of 8641 videos, covering 34 million frames and over 360 hours of duration, making it one of the largest datasets among video anomaly understanding benchmarks. Besides, Vad-Reasoning-SFT provides fine-grained Chain-of-Thought (CoT) annotations, explicitly simulating human reasoning over abnormal events, with an average annotation length of 260 words. For the annotations, the recent video anomaly understanding datasets like CUVA [13] and ECVA [12] contain the description about the cause and effect of the anomaly. However, their corresponding annotations are isolated and disjointed, lacking a systematic structure and logical progression. 
In contrast, the proposed Vad-Reasoning-SFT dataset provides structured and coherent anomaly reasoning annotations.\nFigure 6 presents a comprehensive statistical overview of the proposed Vad-Reasoning dataset. The overall distribution of video length is relatively even, as shown in Figure 6(a) and (b). Most of the videos in the Vad-Reasoning dataset are collected from UCF-Crime [49] and XD-Violence [66], as shown in Figure 6(c) and (d), and we collect an additional 10 percent of the videos from the internet. The proportion of normal and abnormal videos in the two subsets is roughly balanced, as shown in Figure 6(e). Finally, the fine-grained anomaly distributions are shown in Figure 6(f)-(h).\nB.3 Examples # We provide two examples of the proposed Vad-Reasoning dataset in Figure 7 and Figure 8. Notably, the CoT of normal videos is simplified into two steps: simple perception and cognition.\nC Implementation Details # C.1 Prompt # The prompt used for performing video anomaly reasoning is shown in Figure 9. The prompt is composed of three parts: Task Definition, Output Specification, and Format Requirements. First, the Task Definition outlines the overall goal of video anomaly reasoning and explicitly requires the model to think before answering. Second, the Output Specification provides detailed guidelines on the reasoning process and the expected answer. Finally, the Format Requirements presents concrete output examples with explicitly defined tags (e.g., \u0026lt;think\u0026gt;\u0026lt;/think\u0026gt; and \u0026lt;answer\u0026gt;\u0026lt;/answer\u0026gt;).\nTable 5: Basic metadata comparison of datasets. 
Here \u0026ldquo;Mixture\u0026rdquo; indicates that the dataset is composed by integrating videos from multiple existing datasets.\nDataset | Source | Videos | Frames | Duration | Resolution | FPS\nTraditional Video Anomaly Detection Datasets\nUCF-Crime [49] | Surveillance | 1900 | 13,741,393 | 128h | 320 × 240 | Multiple\nXD-Violence [66] | Multiple | 4754 | 18,714,328 | 217h | Multiple | 24\nShanghaiTech [32] | Campus | 437 | 317,398 | - | 856 × 480 | -\nUCSD Ped1 [26] | Campus | 70 | 14,000 | - | 238 × 158 | -\nUCSD Ped2 [26] | Campus | 28 | 4,560 | - | 360 × 240 | -\nCUHK Avenue [35] | Campus | 37 | 30,652 | 0.3h | 640 × 360 | 25\nTAD [37] | Traffic | 518 | 540,212 | - | Multiple | -\nUBnormal [1] | Generation | 543 | 236,902 | 2.2h | Multiple | 30\nNWPU Campus [5] | Campus | 547 | 1,466,073 | 16.3h | Multiple | 25\nVideo Anomaly Understanding Datasets\nUCA [78] | Surveillance | 1854 | 13,163,270 | 121.9h | 320 × 240 | Multiple\nCUVA [13] | Multiple | 986 | 3,345,097 | 32.5h | Multiple | Multiple\nECVA [12] | Multiple | 2127 | 19,042,560 | 88.2h | Multiple | Multiple\nVAD-Instruct50k [84] | Mixture | 6654 | 32,455,721 | 345h | Multiple | Multiple\nHIVAU-70k [85] | Mixture | 6654 | 32,455,721 | 345h | Multiple | Multiple\nHAWK [50] | Mixture | 7898 | 14,878,233 | 142.5h | Multiple | Multiple\nVad-Reasoning-SFT | Mixture | 2193 | 8,680,615 | 88.3h | Multiple | Multiple\nVad-Reasoning-RL | Mixture | 6448 | 25,495,729 | 272.2h | Multiple | Multiple\nVad-Reasoning | Mixture | 8641 | 34,173,344 | 360.5h | Multiple | Multiple\nTable 6: The annotation type comparison of datasets. 
* denotes that the videos in Vad-Reasoning-RL are only labeled with video-level labels (Abnormal or Normal).\nDataset | Anomalies | Text Annotation | Reasoning\nTraditional Video Anomaly Detection Datasets\nUCF-Crime [49] | 13 | Anomaly class | -\nXD-Violence [66] | 6 | Anomaly class | -\nShanghaiTech [32] | 13 | - | -\nUCSD Ped1 [26] | 5 | - | -\nUCSD Ped2 [26] | 5 | - | -\nCUHK Avenue [35] | 5 | - | -\nTAD [37] | 7 | - | -\nUBnormal [1] | 22 | - | -\nNWPU Campus [5] | 28 | - | -\nVideo Anomaly Understanding Datasets\nUCA [78] | 13 | Event descriptions | -\nCUVA [13] | 42 | Anomaly description, cause, effect | Isolated\nECVA [12] | 100 | Anomaly description, cause, effect | Isolated\nVAD-Instruct50k [84] | 13 | Clip caption \u0026amp; QA | -\nHIVAU-70k [85] | 13 | Clip/Event/Video-level Caption \u0026amp; QA | Isolated\nHAWK [50] | - | Anomaly description \u0026amp; QA | -\nVad-Reasoning-SFT | 37 | Chain-of-Thought | Structured \u0026amp; coherent\nVad-Reasoning-RL | 1* | Video-level label | -\nVad-Reasoning | 37 | Hybrid annotation | -\nC.2 Training Process of AVA-GRPO # The core of the proposed AVA-GRPO is the additional anomaly verification reward, as shown in Algorithm 1. In addition, we consider a length reward. We first separately calculate the length of the reasoning text for abnormal videos and normal videos in Vad-Reasoning-SFT. During RL training, if the length of the output falls within the corresponding range, a length reward is assigned. Notably, for each completion, the model is updated only once. Consequently, the objective function of AVA-GRPO is simplified as\nFigure 6: Statistical analyses of the proposed Vad-Reasoning dataset.\nwhere πθ no grad is equivalent to πθ. Finally, the training process of AVA-GRPO is shown in Algorithm 2.\nC.3 More Experimental Details # All experiments are conducted on 4 NVIDIA A100 (80GB) GPUs. 
For the supervised fine-tuning stage, we train the base MLLM on the Vad-Reasoning-SFT dataset for four epochs, taking approximately 6 hours. For the reinforcement learning stage, we continue to train the model on the Vad-Reasoning-RL dataset for one epoch, taking about 26 hours. For efficiency, we uniformly sample each video to 16 frames, and the maximum number of pixels per frame is limited to 128 × 28 × 28 during training. The learning rates for both stages are set to 1 × 10⁻⁶. The number of completions generated in a group is set to 4. The hyperparameter β in Equation 3 is set to 0.04. AVA-GRPO includes five types of rewards. The specific values and meanings are shown in Table 7. For normal videos, the length range of the reasoning process is set to [140, 261], while it is set to [233, 456] for abnormal videos.\nFigure 7: An abnormal example of Vad-Reasoning (panels include Step 1: Global Perception and Step 2: Local Perception).\nFigure 8: A normal example of Vad-Reasoning.\nC.4 Evaluation on VANE Benchmark # VANE [15] is a benchmark designed to evaluate the ability of video-MLLMs to detect anomalies in videos. It consists of 325 video clips and 559 question-answer pairs, covering both real-world surveillance and AI-generated video clips, categorized into nine anomaly types. For real-world anomalies, VANE collects 128 video clips from existing video anomaly detection datasets (e.g., CUHK Avenue [35], UCSD-Ped1/Ped2 [26], and UCF-Crime [49]). For AI-generated anomalies, VANE includes 197 video clips generated with SORA [4], OpenSora [16], Runway Gen2 [46], ModelScopeT2V [61], and VideoLCM [62]. 
We report the performance of Vad-R1 and other MLLM-based VAD methods on different categories. Notably, since Vad-R1 is trained with the proposed Vad-Reasoning dataset, which incorporates videos from UCF-Crime, we exclude the corresponding UCF-Crime subset from the VANE benchmark.\nFigure 9: Prompt template for performing video anomaly reasoning.\nTable 7: Reward types and the corresponding values.\nType | Meaning | Value\nAccuracy | Evaluate classification result | 1\nFormat | Evaluate format of output | 1\nAnomaly verification: Abnormal | Evaluate correctness of videos predicted as abnormal | 0.5\nAnomaly verification: Normal | Evaluate correctness of videos predicted as normal | -0.2\nLength | Evaluate length of output | 0.2\nD More Experimental Results # D.1 LLM-Guided Evaluation # Traditional evaluation metrics, such as BLEU and METEOR, focus primarily on token-level overlap between the generated answer and the reference ground truth. They are inherently limited in capturing the semantic quality of the generated answers, particularly in tasks that require causal reasoning and contextual judgment. To address these limitations, we additionally adopt a proprietary LLM to evaluate the quality of the generated answer. Following HAWK [50], we consider the following aspects:\nReasonability assesses whether the generated response presents a coherent and logically valid causal reasoning of the anomaly.\nAlgorithm 1 Anomaly verification reward # Input: Prompt template p, current video v, policy model πθ, generated completions O = {o_i}_{i=1}^G.\nOutput: Anomaly verification reward R_ano. 
\n1: Init anomaly verification reward: R_ano = {r_i}_{i=1}^G, where r_i = 0\n2: for each o_i ∈ O do\n3: Extract prediction p of v from completion o_i\n4: if p == Normal then\n5: Randomly discard either the beginning or the ending segment of v\n6: else\n7: Discard the predicted abnormal segment of v\n8: end if\n9: Obtain a trimmed video ṽ\n10: Generate a new completion õ ∼ πθ(· | p, ṽ)\n11: Extract new prediction p̃ of ṽ from new completion õ\n12: if p == Abnormal and p̃ == Normal then\n13: Assign positive reward r_i ← 0.5\n14: else if p == Normal and p̃ == Abnormal then\n15: Assign negative reward r_i ← −0.2\n16: end if\n17: end for\n18: return R_ano = {r_i}_{i=1}^G\nAlgorithm 2 AVA-GRPO # Input: Prompt template p, Vad-Reasoning-RL dataset D = {(v_j, Y_j)}_{j=1}^N, initial policy model πθ_init.\nOutput: Updated policy model πθ.\n1: Init policy model: πθ ← πθ_init\n2: Init reference model: π_ref ← πθ\n3: for e ∈ {1, ..., E} do\n4: for (v_j, Y_j) ∈ D do\n5: Generate a group of completions O = {o_i}_{i=1}^G ∼ πθ(· | p, v_j)\n6: Compute accuracy reward R_acc = {r_i}_{i=1}^G\n7: Compute format reward R_f = {r_i}_{i=1}^G\n8: Compute anomaly verification reward R_ano ← Algorithm 1\n9: Compute length reward R_len = {r_i}_{i=1}^G\n10: Compute sum R = R_acc + R_f + R_ano + R_len\n11: Compute advantages A = (R − mean(R)) / std(R)\n12: Update πθ with Equation 3\n13: end for\n14: end for\n15: return πθ\nDetail evaluates the level of specificity and informativeness in the model\u0026rsquo;s output. A high-quality response is expected to cover essential contextual elements.\nConsistency focuses on the factual alignment between the generated answer and the ground-truth metadata, including the event description, potential consequence, and so on.\nEach aspect is scored in the range of [0, 1], with 1 indicating the highest level of semantic alignment and reasoning quality. The comparisons on the test set of Vad-Reasoning are shown in Table 8. 
We observe that Vad-R1 achieves the best performance among all open-source methods. Vad-R1 is also competitive with proprietary MLLMs: it achieves the highest Reasonability and Consistency scores, outperforming even GPT-4o on both, although GPT-4o retains the best Detail score.\nTable 8: Comparison of anomaly reasoning quality evaluated by LLM-Guided metrics.\nMethod | Params. | Reasonability | Detail | Consistency\nOpen-Source video MLLMs\nInternVideo2.5 [65] | 8B | 0.580 | 0.517 | 0.487\nInternVL3 [92] | 8B | 0.692 | 0.608 | 0.586\nVideoChat-Flash [27] | 7B | 0.367 | 0.292 | 0.356\nVideoLLaMA3 [82] | 7B | 0.549 | 0.449 | 0.497\nLLaVA-NeXT-Video [89] | 7B | 0.541 | 0.452 | 0.491\nQwen2.5-VL [57] | 7B | 0.638 | 0.555 | 0.542\nOpen-Source video reasoning MLLMs\nOpen-R1-Video [63] | 7B | 0.411 | 0.307 | 0.338\nVideo-R1 [14] | 7B | 0.390 | 0.414 | 0.243\nVideoChat-R1 [28] | 7B | 0.634 | 0.559 | 0.528\nMLLM-based VAD methods\nHolmes-VAD [84] | 7B | 0.388 | 0.275 | 0.343\nHolmes-VAU [85] | 2B | 0.385 | 0.301 | 0.375\nHAWK [50] | 7B | 0.218 | 0.185 | 0.115\nProprietary MLLMs\nClaude3.5-Haiku [2] | - | 0.711 | 0.637 | 0.611\nQVQ-Max [56] | - | 0.690 | 0.639 | 0.521\nGPT-4o [40] | - | 0.724 | 0.679 | 0.542\nVad-R1 (Ours) | 7B | 0.734 | 0.659 | 0.662\nTable 9: Performance comparison of different numbers of input frames and spatial resolutions.\nFrames | Max Pixels | BLEU-2 (Reasoning) | METEOR (Reasoning) | ROUGE-2 (Reasoning) | Acc (Detection) | F1 (Detection) | mIoU (Detection) | R@0.3 (Detection) | R@0.5 (Detection)\n16 | 128 × 28 × 28 | 0.233 | 0.406 | 0.194 | 0.875 | 0.862 | 0.713 | 0.770 | 0.706\n16 | 256 × 28 × 28 | 0.238 | 0.412 | 0.198 | 0.886 | 0.878 | 0.713 | 0.772 | 0.702\n32 | 128 × 28 × 28 | 0.242 | 0.416 | 0.201 | 0.900 | 0.891 | 0.726 | 0.786 | 0.715\n32 | 256 × 28 × 28 | 0.238 | 0.413 | 0.198 | 0.888 | 0.883 | 0.708 
| 0.772 | 0.695\n64 | 128 × 28 × 28 | 0.244 | 0.420 | 0.203 | 0.895 | 0.892 | 0.709 | 0.777 | 0.695\nD.2 Experiments on More Input Tokens # During both training and inference, the video is uniformly sampled into 16 frames as input, with a maximum pixel count of 128 × 28 × 28 per frame. In this section, we increase the number of frames to 32 and 64 per video, and the maximum pixel count to 256 × 28 × 28 per frame. The results are shown in Table 9. On the one hand, we observe that increasing the number of frames from 16 to 64 improves most anomaly reasoning and detection metrics, suggesting that the extra frames provide more useful visual evidence. On the other hand, the benefit of a higher resolution depends on the number of input frames. When increasing the maximum number of pixels to 256 × 28 × 28 with 16 frames, the model gains small improvements on most metrics, suggesting that high-resolution details compensate for the short clip. In contrast, the performance drops if we increase the maximum pixels for 32 frames, possibly due to token redundancy. Consequently, increasing the number of frames is more useful, whereas higher resolution might lead to information overload.\nD.3 More Ablation Studies # In this section, we evaluate the effectiveness of the proposed AVA-GRPO. Compared with the original GRPO, AVA-GRPO has an additional anomaly verification reward, which incentivizes the anomaly reasoning capability of the MLLM with only video-level weak labels. In addition, we add a length reward to control the length of the output. The effectiveness of the two additional rewards is shown in Table 10. For both 16-frame and 32-frame settings, AVA-GRPO matches or outperforms the original GRPO on every reported reasoning and detection metric. 
In contrast, using only one reward leads to limited or unstable improvement. These results demonstrate that the combination of length and anomaly rewards is essential for improving the overall reasoning and detection performance.\nTable 10: Ablation results of different reward strategies.\nFrames | Strategy | ROUGE-L (Reasoning) | ROUGE-2 (Reasoning) | Precision (Detection) | mIoU (Detection) | R@0.3 (Detection) | R@0.5 (Detection) | R@0.7 (Detection)\n16 | GRPO | 0.502 | 0.475 | 0.861 | 0.712 | 0.770 | 0.699 | 0.640\n16 | GRPO+len_reward | 0.529 | 0.501 | 0.856 | 0.710 | 0.770 | 0.697 | 0.642\n16 | GRPO+ano_reward | 0.496 | 0.467 | 0.866 | 0.707 | 0.765 | 0.695 | 0.638\n16 | AVA-GRPO | 0.530 | 0.501 | 0.882 | 0.713 | 0.770 | 0.706 | 0.651\n32 | GRPO | 0.495 | 0.468 | 0.831 | 0.695 | 0.761 | 0.692 | 0.624\n32 | GRPO+len_reward | 0.528 | 0.499 | 0.849 | 0.701 | 0.770 | 0.695 | 0.631\n32 | GRPO+ano_reward | 0.494 | 0.467 | 0.842 | 0.699 | 0.763 | 0.686 | 0.629\n32 | AVA-GRPO | 0.533 | 0.504 | 0.900 | 0.726 | 0.786 | 0.715 | 0.661\nFigure 10: RL training curves of Vad-R1.\nD.4 Training Curves # Figure 10 shows the key training curves of Vad-R1 during the RL stage. Figure 10(a) shows the total reward of AVA-GRPO, which increases steadily and converges after approximately 1000 steps, indicating that the outputs of Vad-R1 increasingly satisfy the reward criteria. Figure 10(b) illustrates the standard deviation of the total reward, which decreases rapidly in the early stage and stabilizes below 0.1, suggesting that the output quality of Vad-R1 gradually improves as training progresses. Figure 10(c) reports the completion length, which increases in the early steps and then drops to a stable value. This may imply that the model achieves more concise and efficient completions while maintaining high rewards.\nD.5 More Qualitative Results # We provide two qualitative results in Figure 11 and Figure 12. Compared with some proprietary models, Vad-R1 demonstrates stable anomaly reasoning and detection capabilities. 
For example, in Figure 11, Vad-R1 correctly performs anomaly reasoning and identifies the white plastic bag as an anomaly. In contrast, although Claude identifies the plastic bag as abnormal, it defines the cause of the abnormality as a moving plastic bag, rather than the plastic bag acting as an obstacle. Besides, although QVQ-Max and o4-mini also identify the white plastic bag, they do not treat it as an anomaly.\nE Impact and Limitation # In this paper, we propose a new task: Video Anomaly Reasoning, which enables MLLMs to perform deep analysis and further understanding of the anomalies in the video. We hope our work can contribute to video anomaly research.\nHowever, the inference speed of Vad-R1 remains a limitation, as the multi-step reasoning process introduces additional computational overhead.\nFigure 11: Qualitative result for an abnormal video.\nFigure 12: Qualitative result for a normal video.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/vad-r1-towards-video-anomaly-reasoning-via/","section":"Papers","summary":"Proposes a structured Perception-to-Cognition Chain-of-Thought and introduces Vad-Reasoning dataset, along with an improved reinforcement learning algorithm AVA-GRPO, to enhance the deep reasoning capabilities of Multimodal Large Language Models in video anomaly detection and understanding tasks.","title":"Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought","type":"method"},{"content":" VADSK: VIDEO ANOMALY DETECTION WITH STRUCTURED KEYWORDS # Thomas Foltz # The Pennsylvania State University State College, PA tjf5667@psu.edu\nABSTRACT # This paper focuses on detecting anomalies in surveillance video using keywords by leveraging foundational models\u0026rsquo; feature representation generalization capabilities. We present a novel, lightweight pipeline for anomaly classification using keyword weights. 
Our pipeline employs a two-stage process: induction followed by deduction. In induction, descriptions are generated from normal and anomalous frames to identify and assign weights to relevant keywords. In deduction, inference frame descriptions are converted into keyword encodings using induction-derived weights for input into our neural network for anomaly classification. We achieved comparable performance on the three benchmarks UCSD Ped2, Shanghai Tech, and CUHK Avenue, with ROC AUC scores of 0.865, 0.745, and 0.742, respectively. These results are achieved without temporal context, making such a system viable for real-time applications. Our model improves implementation setup, interpretability, and inference speed for surveillance devices on the edge, introducing a performance trade-off against other video anomaly detection systems. As the generalization capabilities of open-source foundational models improve, our model demonstrates that the exclusive use of text for feature representations is a promising direction for efficient real-time interpretable video anomaly detection.\nKeywords Anomaly Detection · Machine Learning · Foundational Model · Binary Classification · TF-IDF\n1 Introduction # In our modern society, there is an increasing need for video anomaly detection to ensure public safety, prevent crime, and identify environmental hazards. As surveillance capabilities increase, especially in highly populated locations, there is a high demand for intelligent systems that can efficiently process large amounts of video data to identify anomalies [20]. The sheer volume of data has exceeded the human capacity for effective monitoring, which has led the machine learning research community to devote effort toward developing automated anomaly detection solutions.\nOriginally, video anomaly detection relied on separating feature extraction from the classification process. 
This proved to be limited when handling complex situations in the data; however, it proved beneficial as a foundation for emerging methods using deep learning techniques [18]. Many modern applications now leverage neural networks to learn representations from raw video data. This has been possible due to the emergence of benchmarks such as UCSD Ped2 [9], ShanghaiTech [6], and CUHK Avenue [8]. These datasets include diverse anomaly scenarios, labeling various events that exclude anomalies in the training data, and a blend of normal and anomalous events in the test data. This enables both one-class and binary classification tasks with supervised, semi-supervised, weakly supervised, and unsupervised learning depending on how the data are preprocessed [20]. Our work only uses the test dataset due to our supervised binary classification approach. Other commonly used benchmarks include UCF Crime [16] and XD-Violence [17], which focus on violent or criminal real-world events. These two datasets separate the types of anomalous events, allowing models to train specifically to identify a specific anomaly event type or to create a multiclass classifier.\nFigure 1: Pipeline Overview. FM abbreviates the foundational models necessary for generating text descriptions from frame input. TF-IDF abbreviates the Term Frequency-Inverse Document Frequency score, which we use to weigh keywords.\nAlthough the task has improved considerably over the past decade, some issues still hinder its real-world applicability. One-class classification tasks have difficulty capturing complexity and diversity between anomaly types. Since these implementations can only train on normal data, they can become sensitive to deviations, leading to high false positive rates [18]. Real-time methods have the issue of sacrificing accuracy for speed with oversimplified pipelines. 
In many cases, they still require video sequences as input, introducing latency that makes them unusable for applications requiring instant recognition of anomalies.\nAnother issue with deep learning techniques is that they are typically computationally expensive and lack interpretability [18]. The \u0026ldquo;black box\u0026rdquo; nature makes it difficult to interpret why certain examples are flagged as anomalous, which reduces the user\u0026rsquo;s trust and ability to refine the models. Attempts have been made to create interpretable systems that use object detection, pose, or trajectories to justify the predictions. However, the problem is that most of these cannot be implemented in real-time and include complex data pipelines, rendering them unusable in an application. New methods employing Large Language Models (LLMs) address the interpretability issue by generating textual explanations for anomalies that users easily comprehend. These solutions are complex to implement, still struggle to provide concise explanations for decisions, and use extensive computing power.\nThis paper addresses these issues by employing a novel approach for interpretable video anomaly detection without reducing real-time performance or requiring large amounts of computing power. Our approach leverages foundational models\u0026rsquo; feature representation generalization capabilities to extract meaningful keywords from video. We employ a two-stage induction and deduction system as in a similar LLM-based solution [19].\nAs seen in Figure 1, we lay out the sequence of events between frame-level input and prediction. Before test time, the induction stage is performed to preprocess the data and learn feature representations in keyword weightings. Randomly sampled normal and anomalous frames are passed through our pre-trained foundational model to output corresponding descriptions of those frames. 
Next, we take the normal and anomalous descriptions and pass them through a TF-IDF operation, outputting a corresponding TF-IDF matrix of scores. This matrix stores values of the top k relevant keywords based on how relevant each word is to its corpus of descriptions. With this information, we derive a vector of keyword weights based on how indicative those keywords are of an anomaly. In the deduction stage, we generate frame descriptions exactly as in induction, but now the keyword weight vector is used to convert the frame description into a keyword encoding. For each keyword, if that keyword is present in the frame description, the corresponding element of the encoding is set to the weight found in the keyword weight vector. This keyword encoding can be passed directly into the classification network to output the predicted probability that the frame includes an anomaly. It is important to note that before test time, some data must be used to train the classification network. After the network has learned the decision boundary, the model is ready for inference using only the deduction stage of the pipeline.\nThis approach allows us to reduce the computational overhead of inference due to the minimal feature space and simple classification architecture, making it suitable for application on resource-constrained systems. By demonstrating the effectiveness of foundational models for interpretable video anomaly detection, this work creates new opportunities for developing transparent and trustworthy surveillance systems.\nThe contributions of this paper are as follows:\nOur findings show that a keyword-based approach can potentially identify video anomalies. This approach increases the user interpretability of decisions, improving transparency and trust in our model.\nWe introduce a lightweight, two-stage video anomaly detection pipeline based on induction followed by deduction. 
During the induction step, normal and anomalous frames are randomly sampled, from which sets of descriptions are generated. We calculate the term frequency-inverse document frequency (TF-IDF) score with these two sets to determine weights for the most relevant keywords. In the deduction step, descriptions are generated on the inference frames and encoded based on the keyword weights from the induction step. This encoding is then passed into a binary classification network for the final prediction.\nOur methodology has achieved comparable performance on existing benchmarks while achieving near real-time inference, reduced model complexity, and decreased memory usage. This demonstrates the usability of our system for real-world applications that have constrained computing requirements and demand fast response times.\n2 Related Works # We reviewed current video anomaly detection approaches to identify areas for improvement in our system. We explored advancements in one-class, real-time, and interpretable video anomaly detection systems. Then, we thoroughly reviewed the emerging field that leverages large language models\u0026rsquo; prediction capabilities for identifying video anomalies. Finally, we looked at some natural language processing techniques for classifying anomalies in text information. We did this to understand how to reapply older techniques with emerging technologies to improve current methodologies.\n2.1 One-class Classification Methods # One of the original deep-learning-based detection techniques, one-class classification, detects when events occur outside of the normal distribution. Such methods exploit the vast amount of available normal data, unlike other systems that require labeled anomaly examples during training. One such method follows this idea by claiming that enhanced inliers and distorted outliers effectively decide anomalies [14]. They employ a dual reconstructor-discriminator architecture, where the reconstructor learns the concept of the normal class. 
This is done to reconstruct the normal samples correctly while distorting the anomalous samples that do not share the same concepts. The discriminator learns how to differentiate the two reconstructed image classes and make a prediction. Works such as HF2-VAD [7] focus on the flow reconstruction of objects in a frame. They predict the optical flow of previous frames and fuse that information with the current frame. A separate reconstructor module uses this information to predict the next frame. An anomaly is detected if the predicted future frame deviates more than expected, determined by a set threshold. Although the one-class classification technique has shown high performance, it often produces false positives when new normal scenarios arise in the data. Additionally, it is difficult for people to understand the criteria that determine what falls outside of the normal distribution, making it less interpretable.\n2.2 Real-time Methods # Even though many video anomaly detection classifiers have effective discriminative capability, there have been efforts toward real-time classification to decrease the dependence on extended temporal contexts. One approach uses an end-to-end pipeline that learns features directly from raw video data to train their custom visual feature extractor rather than relying on commonly used pre-trained feature extractors that most modern methods use [5]. Then, using k-nearest-neighbors (KNN) distances and uniform frame sampling, they train a lightweight classifier to predict anomalies in near-real time in a small decision window of approximately six seconds. Another approach for inference efficiency is introduced by MULDE [10], which measures how much feature vectors deviate from the normal distribution of frames, similar to how many one-class methods function. They train a classification model with different levels of injected noise into the training data to emulate anomalies. This can be done at the object or frame level. 
At test time, they use a Gaussian mixture model to combine these noise levels to identify different anomaly cases. They achieve near real-time inference due to their simple pipeline, including a feature extractor, feed-forward network, and Gaussian mixture model.\n2.3 Interpretable Methods # Interpretable methods in video anomaly detection have gained interest due to the \u0026ldquo;black-box\u0026rdquo; nature of many classical methods. These new methods aim to increase transparency in the decision-making process, crucial for understanding and trusting the model outputs. One approach focuses on semantic embedding using scene graphs [2]. It leverages relationships between objects in a scene to provide interpretability in the video. Text Empowered Video Anomaly Detection (TEVAD) increases accuracy and interpretability by fusing textual features with spatio-temporal information [1]. This is achieved using frame captions to capture events\u0026rsquo; semantic meaning and visual features. Another work utilizes attribute-based representations, representing objects in a scene with velocity and pose information [13]. This information is then used to determine an anomaly score through density estimation. These interpretable methods provide valuable information on the decision-making process by incorporating semantic information, textual features, and attribute-based representations. However, these methods rely on long temporal chains of information to effectively identify anomalies, and newer LLM-based methods for improving interpretability have emerged in recent years.\n2.4 Large Language Model (LLM) Methods # To further advance interpretability while taking advantage of advances in machine learning, researchers have begun to employ large-language models (LLMs) to detect anomalies in video. One such paper uses the captioning capabilities of video language models (VLMs) to identify activities and objects in a scene that indicate normal and anomalous behavior [19]. 
They curate a list of rules that indicate which objects and activities are normal or anomalous. This is done by describing the normal behavior from the normal frames and with that information identifying the opposite anomalous behavior. With these rules, they match captions to these rules during inference and use an LLM to reason if anomaly conditions have been met. We employ a similar approach for identifying anomalies but introduce a simplified and explainable implementation for selecting anomaly keywords used in classification. Another practical approach was proposed by Holmes-VAD [? ], where they leverage LLM capabilities to explain why anomalies can occur in hour-long video sequences. They achieve this through a multi-modal LLM that encodes user text prompts, projected visual classes, and patch tokens that the temporal sampler has selected as noteworthy. While these methods leverage the powerful feature extraction capabilities of large language and multimodal models, these methods are quite expensive to deploy in practice due to the large compute requirements necessary, making them mostly unviable for applications on the edge.\n2.5 Natural Language Processing (NLP) Methods # Natural Language Processing (NLP) methods have long been employed in classification tasks, extending into anomaly detection. The Term Frequency-Inverse Document Frequency (TF-IDF) score has proven to be a versatile and effective approach for this. TF-IDF is a statistical measure that evaluates the importance of a word within a document relative to a larger corpus. One work applies this scoring method to classify anomalies in process logs, treating each log entry as a distinct document within the broader corpus of all logs [15]. This approach enables the identification of unusual terms or patterns that may signify anomalous behavior. 
Similarly, in the analysis of network switch logs, TF-IDF has been combined with the log frequency and the log probabilities to calculate an abnormal score for different components of the logs, enhancing the overall precision of the identification of anomalies [11]. TF-IDF is effective for this task because of its ability to emphasize words that deviate from the norm within a given context, making it well-suited for detecting anomalies.\nFigure 2: Frame description generation. Fnorm represents labeled normal frames with their respective descriptions Dnorm and Fanom represents labeled anomalous frames with their respective descriptions Danom generated by the pre-trained foundational model.\n3 Approach # 3.1 Induction # In this first stage of the detection pipeline, it is necessary to identify keywords indicative of anomalies. There are three main steps during induction. First, we sample labeled normal and anomalous frames and pass them into a foundational model along with a prompt to generate frame descriptions. Second, these descriptions are separated into two corpora and passed into a term frequency-inverse document frequency vectorizer. Third, with the output vectors, we calculate the difference between the TF-IDF scores of highlighted keywords and normalize it to achieve a final weighting vector for use in the deduction stage.\n3.1.1 Frame Description Generation # We selected two foundational models that fulfill our requirements for frame description generation. These models must have multi-modal input that allows the user to pass in an image alongside a prompt since our raw data consists of videos split into frames for frame-level processing. Additionally, the model should be optimized for visual recognition and image reasoning to create meaningful frame descriptions. The model should have a small number of parameters or possess quantization capability to store weights and perform inference on edge devices. 
Finally, it is important to utilize open-source weights for the transparency and reproducibility of our implementation. Therefore, we selected the Llama-3.2-11B-Vision-Instruct model from Meta [3] and the MiniCPM-V-2_6-int4 model from OpenBMB [4], both of which are available to the public on the HuggingFace platform [? ].\nWith the selected models, we needed to generate captions that differentiate anomaly cases from normal scene behavior. To achieve this, we selected n frames randomly from the training data, which we know only includes normal frame samples. We then selected an equal number of random frames known to be anomalies from the test set. This gives us an even sampling of normal and anomaly cases. As shown in Figure 2, we pass those randomly sampled frames into our selected foundational model to generate the frame descriptions. Each frame is passed in with the user prompt \u0026lsquo;You are a surveillance monitor for urban safety. Describe the activities and objects present in this scene.\u0026rsquo; The first sentence in the prompt provides context to the foundational model of its task, and the second sentence explicitly asks for activities and objects so that we can extract meaningful keywords from the descriptions. We determined that sampling 20 normal and 20 anomalous frames was sufficient to capture the most influential keywords in the corpora without over-fitting the data.\n3.1.2 Corpus Formation # Once we have our two sets of generated frame descriptions, we concatenate the string descriptions in each set as depicted in equation 5 to create a tuple representing the text corpora C. 
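As a rough illustration of the induction steps outlined in Section 3.1, the sketch below concatenates each description set into one document, scores terms with the textbook TF-IDF definition (term frequency times the logarithm of the inverse document frequency), and turns the anomaly-minus-normal difference into keyword weights. It is an assumption-laden sketch rather than the implementation used in this work: the stop-word list, the regular-expression tokenizer, the max-absolute-value scaling, and all function names are hypothetical, and the scikit-learn vectorizer applies additional smoothing and normalization by default that are not reproduced here.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, used only to keep this sketch self-contained.
STOP_WORDS = {"a", "an", "the", "and", "or", "is", "are", "on", "in", "of", "with"}


def tokenize(text):
    """Lower-case the text and keep alphabetic unigrams that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def form_corpus(normal_descriptions, anomalous_descriptions):
    """Concatenate each description set into one document: (d_norm, d_anom)."""
    return (" ".join(normal_descriptions), " ".join(anomalous_descriptions))


def tfidf_scores(corpus):
    """Return one {term: tf * idf} dict per document, with idf = log(N / df)."""
    token_lists = [tokenize(doc) for doc in corpus]
    n_docs = len(corpus)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    scores = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against an empty document
        scores.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return scores


def keyword_weights(corpus):
    """Anomaly-minus-normal TF-IDF difference, scaled by its largest magnitude.

    The scaling choice is an assumption; the text only states that the
    difference vector is normalized.
    """
    normal, anomalous = tfidf_scores(corpus)
    vocab = sorted(set(normal) | set(anomalous))
    diff = {t: anomalous.get(t, 0.0) - normal.get(t, 0.0) for t in vocab}
    scale = max((abs(v) for v in diff.values()), default=0.0) or 1.0
    return {t: v / scale for t, v in diff.items()}


corpus = form_corpus(
    ["People walk on the path.", "A person walks with a bag."],
    ["A cyclist rides on the path.", "A car drives on the path."],
)
weights = keyword_weights(corpus)
```

In this toy corpus, terms that occur only in the anomalous document receive positive weights, terms that occur only in the normal document receive negative weights, and a term shared by both documents receives a weight of zero because its inverse document frequency is log(2/2).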
This corpus of two strings, one document for normal descriptions and one for anomalous descriptions, is passed into the TF-IDF vectorizer of the scikit-learn library [12] for calculating the TF-IDF scores.\n3.1.3 Term Frequency-Inverse Document Frequency Score # To identify how any term from the corpus relates to a corresponding document, we need to calculate the Term Frequency-Inverse Document Frequency score, a balance between measuring the frequency of a term occurring in a document and how many of the documents contain that term.\nIn equation 6, the overall TF-IDF score of term t in document d within the corpus C is calculated by multiplying each term\u0026rsquo;s term frequency score by the inverse-document frequency. In equation 7, we determine the frequency of term t in document d by calculating the number of times t appears in d divided by the total number of terms in document d. Finally, in equation 8 we obtain the inverse document frequency of term t in corpus C, calculated with the logarithm of the total number of documents N in C divided by the number of documents containing t. After this TF-IDF vectorization operation on our corpus, we are left with two vectors: one for the scores of terms associated with the normal descriptions and one for the associated scores of the anomaly descriptions.\n3.1.4 Anomaly Keyword Weighting # We derive a normalized difference vector by calculating the difference between the TF-IDF score vectors of the anomaly documents and the normal documents to identify terms indicative of anomalies. We normalize this difference vector to ensure that the magnitude of the differences does not skew the analysis and for stable classification training. The resulting vector highlights the terms more characteristic of the anomaly frame descriptions than the normal frame descriptions, providing insights into the keywords that distinguish anomalies.\nFigure 3: Creating the keyword encoding. 
Description Ddeduct is generated by passing frame Fdeduct into the pretrained foundational model. Then the description is mapped into a keyword encoding Ededuct using the keyword weights wkeywords from the induction stage.\n3.2 Deduction # In this second stage of the detection pipeline, we predict whether an anomaly was found frame by frame. The deduction stage has two main steps: creating a keyword encoding from the generated frame description and passing that encoding into a classification model for predicting the anomaly probability.\n3.2.1 Keyword Encoding # First, we take a frame from a video on which we intend to run inference. It is passed through the same pre-trained foundational model with identical user prompting as in induction. The difference is that instead of forming a corpus with a combination of frame descriptions as in Section 3.1.1, we individually map each description to a keyword encoding, as shown in Figure 3.\nFor each keyword found in the description, we assign its weight to the corresponding element of the encoding. This weighting keyword vector wkeywords is pre-generated in induction. The resulting encoding Ededuct reflects how strongly the generated frame description is associated with anomalies. The length of this encoding vector is equivalent to the number of elements in wkeywords. If a keyword is absent in the frame description, the respective component of the encoding is set to zero. This encoding is still interpretable to the user since we know each keyword\u0026rsquo;s position in the encoding and weight value. Therefore, each encoding can be interpreted as the potential abnormality of the frame based on the presence of anomaly-related keywords.\n3.2.2 Binary Classification Model # The keyword encoding Ededuct is fed into our binary classification model, designed to predict whether or not the frame input contains an anomaly. We decided to utilize a simple feed-forward neural network to satisfy this simple classification task. 
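The keyword encoding of Section 3.2.1 can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name, the example weights, and the regular-expression tokenizer are hypothetical, and only single-word keywords are matched.

```python
import re


def encode_description(description, keyword_weights):
    """Map a frame description to a fixed-length keyword encoding.

    Each position corresponds to one keyword (in sorted order, so positions
    stay interpretable); the value is the keyword weight when the keyword
    appears in the description and zero otherwise.
    """
    present = set(re.findall(r"[a-z]+", description.lower()))
    return [w if k in present else 0.0 for k, w in sorted(keyword_weights.items())]


# Hypothetical weights standing in for the output of the induction stage.
example_weights = {"bicycle": 0.9, "car": 0.7, "walking": 0.1}
encoding = encode_description(
    "A man rides a bicycle past people walking.", example_weights
)
```

Here the encoding has one element per keyword, in the order bicycle, car, walking; the element for car is zero because that keyword is absent from the description.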
Our network includes three fully connected layers, as shown in Figure 4. The input layer has k neurons, where k is the number of elements in the encoding, identical to the number of keywords identified in induction. The output layer has a single neuron that produces the anomaly probability of the frame input.\nWe train the model using this same pipeline. During this training process, our network learns from the keyword encodings to output the final anomaly prediction and effectively learns the patterns and relationships in the data that indicate anomalies. Once trained, our model can take any keyword encoding and predict the likelihood that the frame is an anomaly. We then set a threshold the output must exceed to indicate an anomaly during test time.\nFigure 4: Feed-forward network for binary classification. FC-n stands for the fully-connected layer number n. The input and output dimensions are represented by (i, o), with i for input and o for output. The dimension size k is based on the number of keywords generated from induction. The probability P(A) represents the chance that the inputted keyword encoding is an anomaly.\n4 Experiments and Results # This section will review the datasets selected for benchmarking our proposed method for video anomaly detection. Then, we will review the setup process of our experiments and the associated design choices. Next, we discuss the evaluation metrics necessary for measuring the success of our method. Classification results are reported and compared to other video anomaly detection methods. Finally, we compare qualitative features between different approaches and demonstrate the interpretability of our process with an example.\n4.1 Video Anomaly Detection Datasets # In VAD, datasets are typically created from videos split into sequences of frames. These frames typically blend normal activities and abnormal activities or unusual events [20]. 
Most frames commonly include only normal activities since it is difficult to capture many anomalies due to the infrequency of their occurrence. Regardless, these datasets provide an excellent resource for models to differentiate routine occurrences from abnormal events. The three datasets we use, UCSD Ped2, ShanghaiTech, and CUHK Avenue [9 , 6 , 8], are commonly used to develop robust video anomaly detection. Examples are included below in Figure 5.\nThe UCSD Ped2 dataset was gathered from a stationary camera mounted overlooking a pedestrian walkway at the University of California, San Diego campus. The dataset contains 16 videos for training and 12 videos for testing, including 12 abnormal events [9]. The videos should contain only pedestrians in the normal setting. Therefore, abnormal events like bikers, skaters, or cars can be described as any non-pedestrian entity on the walkway. Frame-level and pixel-level annotations are included, but we only utilize the frame-level annotations. UCSD Ped2 is a baseline for video anomaly detection because of its simplistic representation of an environment that lacks complex anomalies. However, many surveillance applications have a narrow scope like the one portrayed in this dataset, making it a suitable choice for testing detection effectiveness.\nFigure 5: VAD benchmark examples. These include UCSD Ped2 [9] (left), ShanghaiTech [6] (center), and the CUHK Avenue [8] (right) datasets. Images in the top row depict normal occurrences, and the bottom row depicts anomaly occurrences.\nAn alternative for video anomaly detection is the CUHK Avenue dataset, capturing scenes from an avenue at the Chinese University of Hong Kong. This dataset includes 16 training and 21 testing videos with 47 abnormal events, including loitering, throwing objects, and running [8]. The advantage of this dataset comes from the increased complexity compared to UCSD Ped2 while maintaining well-defined individual anomalies. 
It is also worth noting that it includes additional complexity from camera shake and infrequent normal behavior in the dataset, which is important for measuring model robustness.\nCreated from scenes at the Shanghai Tech (SHTech) University Campus, the SHTech dataset provides various anomaly scenarios. Unlike the previous two datasets, SHTech contains 13 scenes with multiple lighting conditions and camera angles [6]. There are 130 abnormal events with 274,515 training frames and 42,883 testing frames, making this dataset significantly larger than most video anomaly detection datasets. This allows us to test the adaptability of anomaly detection methods against a wide range of possible anomalies while providing more realistic scenarios compared to UCSD Ped2 and CUHK Avenue.\n4.2 Experimental Setup # Using the Llama-3.2 Vision-Instruct-11B [3] foundational model, we employ quantization for frame description generation to reduce computational and memory requirements. We convert the model\u0026rsquo;s 16-bit float precision to 4-bit integer precision using Hugging Face\u0026rsquo;s BitsAndBytesConfig configuration class [? ]. This allows us to achieve faster inference for near-real-time prediction on resource-constrained devices when we deploy our system.\nTo obtain keywords that generalize the normal and anomalous behavior of the datasets, we determined that randomly sampling 20 frames each of normal and anomalous behavior performed well. This is because enough frames were sampled to learn typical behaviors from all the dataset\u0026rsquo;s videos without over-fitting to any one specific occurrence in the videos. We omit these 40 randomly selected frames during classification model training/testing to maintain data integrity.\nWe employ the Scikit-Learn [12] library\u0026rsquo;s TfidfVectorizer class for converting the corpus into a TF-IDF score matrix. We utilize its arguments to simplify extracting meaningful keywords from both corpora. 
The first one we set is \u0026lsquo;stop_words\u0026rsquo;. In NLP, words such as \u0026ldquo;the\u0026rdquo;, \u0026ldquo;and\u0026rdquo;, \u0026ldquo;is\u0026rdquo;, and \u0026ldquo;or\u0026rdquo; are considered to be insignificant to the meaning of the sentence. Therefore, when we set this argument, we omit such words from consideration for anomaly keywords, allowing our keyword weights to focus on critical parts of the frame description. Next, the TfidfVectorizer enables us to set an ngram_range, for example from one to three. N-grams are collections of n successive pieces of text. In our implementation, we maintain the default value of one to consider each keyword independently. This also increases the keyword encoding speed in deduction since we don\u0026rsquo;t have to search the text for sets of words. An argument called max_features is used to limit the number of words generated in the TF-IDF score matrix and, in our case, the number of keywords in our encodings. We limit this amount to 100 for the more straightforward UCSD Ped2 dataset and 200 for the more complex CUHK Avenue and SHTech datasets. The final arguments we employ are the min_df and max_df values. In scikit-learn, these arguments bound the document frequency of the terms kept in the vocabulary: given as floats between 0 and 1, they are read as proportions of documents, so terms whose document frequency falls outside the range from min_df to max_df are dropped from the matrix. In our implementation, we set the max_df value on the UCSD Ped2 dataset to 0.95 for better performance and maintain the 0 to 1 range for the CUHK Avenue and SHTech datasets.\nNext, we made some design decisions for training and testing the binary classification model. We use a weighted Binary Cross-Entropy loss to address the common class imbalance issue in these video anomaly detection datasets. The positive class weight for each dataset was calculated as the inverse proportion of anomalous samples within the training set to provide further weight to the uncommon anomaly class during training. 
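To make the class weighting concrete, the following sketch computes a positive class weight as the inverse proportion of anomalous samples and applies it in a weighted binary cross-entropy. It is a minimal sketch under stated assumptions: the function names are hypothetical, the weight is read literally as the total number of samples divided by the number of anomalous samples, and a plain Python loop stands in for whatever deep learning framework is used in practice.

```python
import math


def positive_class_weight(labels):
    """Inverse proportion of anomalous samples: N / (number of anomalies)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one anomalous sample is needed")
    return len(labels) / n_pos


def weighted_bce(probabilities, labels, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive term scaled by pos_weight."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Toy labels: three normal frames and one anomalous frame.
labels = [0, 0, 0, 1]
pos_weight = positive_class_weight(labels)
loss = weighted_bce([0.5, 0.5, 0.5, 0.5], labels, pos_weight)
```

With this weighting, a missed anomaly is penalized more heavily than a false alarm of the same confidence, which counteracts the scarcity of anomalous frames.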
The model was trained with the AdamW optimizer using a learning rate and decay rate of 0.001. To decrease the chance of overfitting, we employ a 5-fold cross-validation. Each fold was trained for a maximum of 20 epochs, and early stopping was used if the validation loss did not decrease for three consecutive epochs. We used custom batch sizes specific to the datasets, with 200 for UCSD Ped2, 1000 for CUHK Avenue, and 2000 frames for SHTech. The model with the best performance across the validation folds was selected for evaluation. We used 80% of the frames for training and 20% for testing.\n4.3 Evaluation Metrics # For most implementations of video anomaly detection, the frame-level area under the receiver operating characteristic curve (AUROC) is used to evaluate predictions against the ground truth labels. AUROC measures how well a classification model can distinguish the positive and negative class in binary classification by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR). VAD employs this metric over accuracy because of its improved reliability when dealing with imbalanced datasets, where one class occurs more often than the other, which is common in VAD datasets. AUROC typically comes in two forms: micro-averaged, where the score is computed on all frames from all the videos, and macro-averaged, where the score is computed separately for each video and then averaged [2].\n4.4 Ablation Study # We evaluate the impact of the foundational model choice on anomaly detection performance by conducting an ablation study comparing the Llama-3.2 Vision-Instruct-11B [3] and OpenBmb Mini-CPM [4] models. Table 1 depicts our results on the UCSD Ped2, ShanghaiTech, and CUHK Avenue datasets.\nTable 1: Ablation study results. 
Each cell shows the AUROC / inference time in seconds per frame for different foundational models and datasets.\n| Model | UCSD Ped2 [9] | ShanghaiTech [6] | CUHK Avenue [8] |\n| --- | --- | --- | --- |\n| Vision-Instruct-11B [3] | 0.865 / 5.77s | 0.753 / 5.17s | 0.742 / 5.38s |\n| Mini-CPM [4] | 0.865 / 2.43s | 0.707 / 2.09s | 0.604 / 2.12s |\nThese results show a trade-off between detection accuracy and speed. Both models achieve comparable performance on the UCSD Ped2 dataset, but Vision-Instruct-11B performs noticeably better on the more complex ShanghaiTech and CUHK Avenue datasets. Specifically, Vision-Instruct-11B achieves AUROC scores of 75.3% and 74.2% on ShanghaiTech and CUHK Avenue compared to Mini-CPM\u0026rsquo;s 70.7% and 60.4%. This can be explained by the increase in detail in Vision-Instruct\u0026rsquo;s frame descriptions, which allows it to capture more subtleties compared to Mini-CPM. The improved accuracy and detail come at the cost of computational overhead and speed due to Vision-Instruct\u0026rsquo;s longer generation times. Mini-CPM exhibits significantly faster inference across all of the datasets, slightly above two seconds per frame compared to Vision-Instruct\u0026rsquo;s five or six seconds per frame. Mini-CPM also uses 4-bit integer precision by default and has fewer parameters than Vision-Instruct-11B, making it more viable on constrained systems.\nTherefore, the choice of a foundational model depends on the specific requirements of the application. If high accuracy is critical, even at the expense of computational speed, Vision-Instruct-11B is the better option. Conversely, if real-time performance or resource constraints are critical, Mini-CPM offers an alternative with a reduced computational footprint at the cost of a potential accuracy decrease on complex scenarios.\n4.5 Classification Results # As seen in Table 2, we compare the frame-level AUROC between our method and multiple state-of-the-art approaches across the three selected benchmark datasets: UCSD Ped2, ShanghaiTech, and CUHK Avenue [9 , 6 , 8]. 
Our results demonstrate that while Video Anomaly Detection with Structured Keywords (VADSK) doesn\u0026rsquo;t outperform most advanced SOTA methods, it achieves competitive performance in certain scenarios, such as on the ShanghaiTech dataset, where it achieves an AUROC of 75.3%. This is comparable to methods such as HF2-VAD with 76.2% and exceeds Towards Interpretable VAD (68.9%). It is important to understand that our results on the UCSD Ped2 and CUHK Avenue datasets still underperform the top-performing methods, which achieve scores above 90%.\nOur performance disparity between UCSD Ped2 and the other two benchmarks can be explained by their increased anomaly event complexity, indicating that our method is better suited for handling simpler anomaly events and that there is room for improvement in our method\u0026rsquo;s ability to generalize across many diverse anomaly contexts.\nTable 2: Frame-level AUROC Comparison\n| Method | UCSD Ped2 [9] | ShanghaiTech [6] | CUHK Avenue [8] |\n| --- | --- | --- | --- |\n| HF2-VAD [7] | 0.993 | 0.762 | 0.911 |\n| Towards Interpretable VAD [2] | - | 0.689 | 0.790 |\n| TEVAD [1] | 0.987 | 0.981 | - |\n| Attribute-based VAD [13] | 0.991 | 0.859 | 0.937 |\n| MULDE [10] | 0.997 | 0.864 | 0.931 |\n| AnomalyRuler [19] | 0.965 | 0.852 | 0.822 |\n| VADSK (ours) | 0.865 | 0.753 | 0.742 |\n4.6 Qualitative Comparison and Analysis # While the quantitative performance of our approach doesn\u0026rsquo;t outperform state-of-the-art methods across the datasets, as seen in Table 2, the significance of our approach is how we developed a memory-efficient, interpretable system with real-time inference capability. 
The methods we compare against ours have some, but not all, of these necessary traits, as depicted in Table 3.\nTable 3: Qualitative Comparison\n| Method | Interpretable | Real-time | Memory-efficient |\n| --- | --- | --- | --- |\n| HF2-VAD [7] | ✗ | ✗ | ✗ |\n| Towards Interpretable VAD [2] | ✓ | ✗ | ✓ |\n| TEVAD [1] | ✓ | ✗ | ✗ |\n| Attribute-based VAD [13] | ✓ | ✗ | ✗ |\n| MULDE [10] | ✗ | ✓ | ✓ |\n| AnomalyRuler [19] | ✓ | ✗ | ✗ |\n| VADSK (ours) | ✓ | ✓ | ✓ |\nInterpretability is an important aspect of our system, defined by the user\u0026rsquo;s ability to interpret the extracted features and understand how they are used in the classification decision-making process. Typically, this is done with textual information incorporated into the pipeline [19 , 2 , 1] or with visual information such as velocity and pose [13]. Our system is interpretable since we use frame descriptions and encode them based on pre-defined keyword weights that are transparent to the user. As seen in Figure 6, it is possible for the user to view the steps used in outputting the final prediction and adjust the keyword encoding or foundational model for frame description generation for improved results. Such transparency is vital in scenarios where understanding the reason behind an anomaly is equally important as detection.\nAnother advantage our system provides is the near real-time inference. We define real-time systems as those able to run inference with minimal latency on individual frames or windows of frames [10], compared to inputting the entire video at once to generate a prediction. Our approach is real-time because we generate descriptions for each frame, pass in the respective keyword encoding one at a time during inference, and receive predictions at most a few seconds after input. This proves valuable in applications that need immediate or rapid responses to detected anomalies.\nFigure 6: Interpretable Inference. The heatmap (right) visualizes a keyword encoding for classification derived from the frame description (center-left). 
The frame description was generated by passing the frame (top-left) into the MiniCPM foundational model [4]. The different colors in this heatmap represent different weight values between 0 and 1. Finally, the result is a probability (bottom-left) that an anomaly has occurred.\nLastly, our system maintains memory efficiency during the induction and deduction stages. We define memory-efficient systems as those that do not require the storage of temporal information and do not run expensive feature extraction processes in parallel with one another [10 , 2]. Our approach is memory-efficient since we do not require temporal information and use one sequential process for feature extraction and classification. Most of our computational overhead comes from the quantized foundation model, which generates initial frame descriptions from the inputted frame. This efficiency is necessary for many video anomaly detection systems since they are deployed in resource-constrained environments and must be scaled for large-scale surveillance systems.\nCombining these three traits (interpretability, near real-time inference, and memory efficiency) makes VADSK a practical solution for video anomaly use cases that require these critical capabilities at the cost of decreased performance against SOTA methods. Our work, therefore, represents an essential contribution by offering a more balanced approach that addresses these key considerations for the real-world application of video anomaly detection systems.\n5 Conclusion # This paper presents a novel approach to video anomaly detection using structured keywords, demonstrating the potential for exclusively using text-based features for detection. We developed a lightweight interpretable pipeline for video anomaly detection consisting of an induction and deduction stage. Our method achieves comparable performance to state-of-the-art methods on certain benchmarks and demonstrates the feasibility of real-time inference without temporal information. 
Our performance gap on more complex datasets, such as CUHK Avenue and ShanghaiTech, shows that there is still more room for improvement. Our work has broader implications. The increased interpretability could improve trust and adoption of video surveillance systems for public spaces and critical infrastructure. Our simple, lightweight pipeline can make such systems suitable for edge devices with limited computational resources or large-scale networks, enabling adoption in environments previously limited by hardware constraints or scalability issues. Finally, our near real-time inference allows for immediate response times to anomalous events, which can be critical in detecting security threats or other emergencies.\nFor future work, improving the keyword generation and selection process would be worthwhile. The selection process could include advanced natural language processing techniques for selecting nuanced keywords, dynamically updating the keyword selection from patterns emerging in the video footage, or generating domain-specific keywords depending on the particular environment. Experimenting with different foundational models and classification architectures would help improve the frame description quality and discriminative capabilities. Finally, it could be beneficial to test the effectiveness of our method on specialized domains such as industrial safety, healthcare, or traffic to investigate the effectiveness of our method in real-world scenarios. By leveraging the power of foundational models and natural language processing, we have opened up new possibilities for video anomaly detection to be deployed in real-world scenarios, paving the way for intelligent and responsive surveillance for improving urban safety.\nReferences # [1] W. Chen, K. T. Ma, Z. J. Yew, M. Hur, and D. A. A. Khoo. Tevad: Improved video anomaly detection with captions. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5549–5559, 2023.\n[2] K. Doshi and Y. Yilmaz. Towards interpretable video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2655–2664, 2023.\n[3] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, and others. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.\n[4] S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, and others. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.\n[5] H. Karim, K. Doshi, and Y. Yilmaz. Real-time weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 6848–6856, 2024.\n[6] W. Liu, W. Luo, D. Lian, and S. Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018.\n[7] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13588–13597, 2021.\n[8] C. Lu, J. Shi, and J. Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013.\n[9] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1975–1981, 2010.\n[10] J. Micorek, H. Possegger, D. Narnhofer, H. Bischof, and M. Kozinski. Mulde: Multiscale log-density estimation via denoising score matching for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18868–18877, 2024.\n[11] S. Nam, J. 
H. Yoo, and J. W. K. Hong. Log-tf-idf for anomaly detection in network switches. In NOMS 2024-2024 IEEE Network Operations and Management Symposium, pages 1–9, 2024.\n[12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, G. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.\n[13] T. Reiss and Y. Hoshen. Attribute-based representations for accurate and interpretable video anomaly detection. arXiv preprint arXiv:2212.00789, 2022.\n[14] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli. Adversarially learned one-class classifier for novelty detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3379–3388, 2018.\n[15] A. Sandhu and S. Mohammed. Detecting anomalies in logs by combining nlp features with embedding or tf-idf, 2022.\n[16] W. Sultani, C. Chen, and M. Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.\n[17] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 322–339. Springer International Publishing, 2020.\n[18] P. Wu and others. Deep learning for video anomaly detection: A review. arXiv preprint arXiv:2409.05383, 2024.\n[19] Y. Yang, K. Lee, B. Dariush, Y. Cao, and S. Y. Lo. Follow the rules: reasoning for video anomaly detection with large language models. In European Conference on Computer Vision, pages 304–322, 2025.\n[20] I. M. Yossef, M. Gamal, R. F. Abdel-Kader, and K. A. E. Ali. A review on video anomaly detection datasets. 
Suez Canal Engineering, Energy and Environmental Science, 1(2):1–9, July 2023.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/vadsk-video-anomaly-detection-with-structured/","section":"Papers","summary":"A lightweight, interpretable, two-stage video anomaly detection pipeline employing foundational models for frame description generation and keyword-based classification, achieving comparable performance to state-of-the-art methods with real-time inference and enhanced interpretability.","title":"VADSK: VIDEO ANOMALY DETECTION WITH STRUCTURED KEYWORDS","type":"method"},{"content":" VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning # Liyun Zhu 1 , 2 , ∗ Qixiang Chen 1 Xi Shen 3 Xiaodong Cun 2 , †\n1 Australian National University 2 GVC Lab, Great Bay University 3 Intellindust AI Lab\n{liyun.zhu, u7227010}@anu.edu.au, shenxiluc@gmail.com, cun@gbu.edu.cn\nAbstract # Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). In addition, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. 
Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at https://github.com/GVCLab/VAU-R1 .\nFigure 1: Effectiveness of Reinforcement Fine-Tuning. We compare QA accuracy and temporal anomaly grounding performance across different models. VAU-R1, trained via Reinforcement Fine-Tuning (RFT), consistently outperforms its Supervised Fine-Tuning (SFT) counterpart. This demonstrates that RFT enhances both reasoning and temporal localization capabilities in VAU tasks.\n∗ Work done while the author was a visiting student at GVC Lab, Great Bay University.\n† Corresponding Author\nFigure 2: Overview of VAU-R1. VAU-R1 leverages Reinforcement Fine-Tuning to enhance the reasoning ability of MLLMs for video anomaly understanding. Specifically, we adopt Group Relative Policy Optimization (GRPO) to optimize the model with task-specific rewards, such as answer format, accuracy, and temporal Intersection-over-Union (IoU). We decompose the VAU task into four complementary tasks to facilitate comprehensive reasoning: multiple-choice QA, temporal anomaly grounding, anomaly reasoning, and anomaly classification.\n1 Introduction # Anomalies are events or behaviors that deviate from regular patterns or expected activities in a given context. In surveillance settings, these may include incidents such as fighting, theft, or traffic violations. Video Anomaly Understanding (VAU) aims to detect and interpret such irregular events in unstructured, real-world video streams [22]. The task is challenging due to scene complexity, context dependence, varying camera viewpoints, and diverse anomaly types [27, 44, 56]. 
Early approaches focus only on detecting anomalies, typically framing the task as binary classification: they assign normal or abnormal labels to individual frames and identify the temporal boundaries of anomalous events [5, 6, 11, 16, 21, 33, 36, 47, 55]. While effective for localization, these methods offer limited interpretability and provide little insight into the underlying causes of anomalies [7, 8, 49]. Recent advances in Multi-modal Large Language Models (MLLMs) have introduced the ability to generate textual descriptions of anomalous events [9, 48, 51, 52, 53], improving model transparency to some extent. However, current methods still face three key limitations: (i) they lack the ability to generate coherent, multi-step reasoning chains; (ii) no comprehensive benchmark provides rich annotations to support detailed causal reasoning; and (iii) evaluation protocols for reasoning quality remain underdeveloped.\nTo move beyond shallow classification and toward deeper understanding, we decompose VAU into four progressive stages: (i) Perception: identifying the scene and relevant objects, either through free-text descriptions or guided multiple-choice questions; (ii) Grounding: localizing the precise temporal segment where the anomaly occurs; (iii) Reasoning: explaining the event by analyzing causal factors, temporal dynamics, and contextual cues; and (iv) Conclusion: summarizing the event with a final decision, such as assigning it to a specific category (e.g., fighting vs. robbery). This structured formulation enables models to progressively build semantic understanding and supports more interpretable and task-aligned evaluation.\nTo implement this four-stage formulation, we introduce VAU-R1, a Reinforcement Fine-Tuning (RFT) framework designed to improve the reasoning capabilities of MLLMs on the VAU task. 
Our method builds on Group Relative Policy Optimization (GRPO) [31], incorporating task-specific reward signals based on answer format correctness, question-answer accuracy, and temporal grounding alignment. The framework is data-efficient and can be applied in low-resource settings, making it practical for real-world deployments. To support training and evaluation, we also construct VAU-Bench, a new benchmark that spans diverse scenarios and provides rich annotations across the four reasoning stages, including multiple-choice QA pairs, detailed event descriptions, temporal groundings, and step-by-step rationales. Finally, we propose a set of evaluation metrics (QA accuracy, temporal Intersection-over-Union (IoU), GPT-based reasoning score, and classification accuracy) to quantitatively assess model performance across perception, grounding, reasoning, and conclusion. Together, VAU-R1 and VAU-Bench offer a scalable and unified framework for advancing structured video anomaly understanding. Our contributions can be summarized as follows:\n- We propose VAU-R1, a data-efficient Reinforcement Fine-Tuning framework that improves the reasoning ability of MLLMs for video anomaly understanding. It outperforms standard supervised fine-tuning on reasoning-intensive tasks.\n- We present VAU-Bench, the first large-scale benchmark with Chain-of-Thought annotations designed for video anomaly reasoning. It contains a diverse collection of videos, QA pairs, temporal labels, and detailed rationales spanning a wide range of real-world scenarios.\n- We design a unified evaluation protocol that measures model performance across four reasoning stages, jointly considering reasoning quality, answer correctness, and temporal localization to capture both interpretability and detection precision.\n2 Related Works # From Detection to Understanding. Early efforts in Video Anomaly Detection (VAD) can be broadly categorized into self-supervised and weakly-supervised paradigms. 
Self-supervised methods rely solely on normal video samples, learning the distribution of normal behavior and flagging deviations as anomalies [11, 21, 25]. In contrast, weakly-supervised methods are trained with both normal and anomalous videos using coarse video-level labels rather than fine-grained frame-level annotations [5, 16, 33, 36, 45, 47, 55]. These approaches typically adopt a top-k selection strategy to identify the most likely anomalous segments. While effective for localizing anomaly boundaries, they often rely heavily on motion cues [56], operating under the assumption that rapid or irregular motion is indicative of anomalous behavior. However, this assumption does not hold for subtle or semantically complex anomalies, leading to poor interpretability. To address these limitations, recent work has turned to video anomaly understanding, leveraging MLLMs to provide more semantically grounded and interpretable reasoning [26].\nPrompt-Based vs. Learning-Based Approaches for VAU. Building on the shift toward semantic understanding, recent approaches to VAU fall into two main categories: prompt-based and learning-based methods. Prompt-based methods typically use MLLMs as anomaly scorers [30, 51], or as reasoning agents via rule-based few-shot prompting [48] or learned question templates [49]. While these methods avoid computationally expensive training, their generalization ability is often limited due to the absence of task-specific adaptation. On the other hand, pretraining [8] and fine-tuning [52, 53] approaches aim to learn anomaly-aware representations by incorporating video captions and causal reasoning signals (e.g., cause and effect). Despite this progress, existing methods remain constrained to improving anomaly description and fail to capture the full logical chain of an anomaly. 
To overcome these limitations, we leverage reinforcement fine-tuning to enhance the model\u0026rsquo;s reasoning ability, enabling end-to-end identification of both when and why anomalies occur.\nReinforcement Learning in MLLMs. With the rise of powerful models such as OpenAI-o1 [15] and DeepSeek-R1 [12], reinforcement learning has been increasingly adopted in the post-training stage of MLLMs to enhance their reasoning capabilities [3, 10, 14, 42, 54]. While effective, this process often demands substantial computational resources and large-scale datasets, making it less practical for targeted downstream tasks [34]. To address these challenges, Visual-RFT [23] introduces Reinforcement Fine-Tuning (RFT) for visual tasks, demonstrating improved data efficiency and stronger performance compared to Supervised Fine-Tuning (SFT). Building on this idea, VideoChat-R1 [17] extends RFT to video domains, achieving promising results in tasks such as question answering, temporal grounding, and object tracking. Yet, these tasks remain fragmented and have not been unified under the video anomaly understanding setting. To bridge this gap, we propose a framework that jointly addresses multiple tasks, aiming to advance comprehensive and interpretable anomaly reasoning.\n3 Methodology # 3.1 Preliminary: Reinforcement Learning via Group Relative Policy Optimization # Group Relative Policy Optimization (GRPO) [31] is a reinforcement learning framework that optimizes a policy $\\pi_\\theta$ using preference-based feedback and multi-aspect reward signals. Given a question $x$, GRPO generates $M$ candidate outputs $O = \\{o_1, o_2, \\dots, o_M\\}$ from the old policy $\\pi_{\\theta_{\\mathrm{old}}}$, with each output $o_j$ assigned a reward $R_j$ computed as a weighted sum of $K$ task-specific components:\n$$R_j = \\sum_{k=1}^{K} \\lambda_k R_j^{(k)},$$\nwhere $R_j^{(k)}$ is the $k$-th task-specific reward (e.g., accuracy, IoU, format compliance) and $\\lambda_k$ is its weight. 
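As a rough numerical illustration of this reward aggregation, together with the group-wise normalization that GRPO applies to the resulting scores, the sketch below uses plain Python. The function names, the equal weights, the toy reward values, the population standard deviation, and the small `eps` guard are all assumptions made for the example; they are not taken from the VAU-R1 code.

```python
from statistics import mean, pstdev


def total_reward(components, weights):
    # Weighted sum over the K task-specific reward components of one output.
    return sum(w * r for w, r in zip(weights, components))


def group_normalize(rewards, eps=1e-8):
    # Centre and scale the rewards across the M candidates of one question.
    # eps (an assumption) avoids division by zero when all rewards are equal.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of M = 4 candidates, each scored on (format, accuracy).
candidates = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
weights = (1.0, 1.0)  # assumed equal weights for both components

rewards = [total_reward(c, weights) for c in candidates]
normalized = group_normalize(rewards)
print(rewards)
print(normalized)
```

Candidates that beat the group average end up with positive normalized rewards and the others with negative ones, so the policy update favours outputs that are better than their siblings rather than outputs that merely score high in absolute terms.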
To measure the relative quality of the $j$-th output, we calculate the normalized reward $\\tilde{R}_j$ for each output $o_j$ with the mean $\\mu_R$ and standard deviation $\\sigma_R$ across the $M$ candidates:\n$$\\tilde{R}_j = \\frac{R_j - \\mu_R}{\\sigma_R}.$$\nGRPO maximizes the following objective while keeping the update close to the original MLLM parameters $\\pi_{\\mathrm{ref}}$ through a KL penalty term $D_{\\mathrm{KL}}(\\cdot \\,\\|\\, \\cdot)$:\n$$\\max_{\\pi_\\theta} \\; \\mathbb{E}_{O \\sim \\pi_{\\theta_{\\mathrm{old}}}(\\cdot \\mid x)} \\left[ \\sum_{j=1}^{M} \\frac{\\pi_\\theta(o_j \\mid x)}{\\pi_{\\theta_{\\mathrm{old}}}(o_j \\mid x)} \\, \\tilde{R}_j - \\beta \\, D_{\\mathrm{KL}}\\left(\\pi_\\theta \\,\\|\\, \\pi_{\\mathrm{ref}}\\right) \\right],$$\nwhere $\\beta$ is a regularization coefficient. This formulation allows GRPO to incorporate diverse reward signals while retaining training stability through KL regularization.\n3.2 VAU-R1 # As shown in Figure 2, VAU-R1 is a data-efficient reinforcement fine-tuning framework designed for the four VAU tasks, including Multi-choice QA, Temporal Anomaly Grounding, Anomaly Reasoning, and Anomaly Classification. Given videos and task-specific questions, we fine-tune a pre-trained MLLM to improve its multi-step reasoning ability across different tasks. The model generates multiple candidate responses for each input, which are then scored using task-specific reward functions (e.g., accuracy, temporal IoU, or format compliance). We employ Group Relative Policy Optimization (GRPO) to optimize the model, which maximizes reward-weighted likelihood while constraining divergence from the reference model via KL regularization. Our reinforcement-based approach outperforms supervised fine-tuning (SFT) in both reasoning capability and generalization to unseen scenarios. The design of task-specific reward functions is further detailed in Section 3.3.\n3.3 Reward Rules # We adopt the general idea of GRPO-based RFT to optimize the VAU model by designing task-specific reward functions for different VAU components. Below, we detail each reward definition.\nFormat Reward. For multiple-choice QA and anomaly classification tasks, we instruct the model to enclose its reasoning within \u0026lt;think\u0026gt;\u0026hellip;\u0026lt;/think\u0026gt; tags and the answer within \u0026lt;answer\u0026gt;\u0026hellip;\u0026lt;/answer\u0026gt; tags. 
For the temporal anomaly grounding task, we additionally require \u0026lt;glue\u0026gt;\u0026hellip;\u0026lt;/glue\u0026gt; tags to indicate the predicted time span in seconds. The reward is defined as:\n$$R_{\\mathrm{format}} = \\mathbb{1}[\\text{the output follows the required tag format}],$$\nwhere $\\mathbb{1}[\\cdot]$ equals 1 when the condition holds and 0 otherwise. We apply a format reward to VAU tasks to enforce structured outputs and discourage format violations.\nAccuracy Reward. We also define an accuracy reward $R_{\\mathrm{acc}}$ to measure the correctness of the model\u0026rsquo;s answer. In our experiments, this reward is given by:\n$$R_{\\mathrm{acc}} = \\mathbb{1}[\\hat{y} = y],$$\nwhere $\\hat{y}$ is the predicted answer and $y$ is the ground-truth answer. This simple accuracy reward encourages the model to choose the right answer during training.\nTemporal IoU Reward. To encourage precise temporal grounding, we introduce a temporal Intersection-over-Union (IoU) reward $R_{\\mathrm{tIoU}}$, which measures the alignment between the predicted and ground truth anomaly intervals. The reward is defined as:\n$$R_{\\mathrm{tIoU}} = \\frac{\\left| [s_1, s_2] \\cap [s_1^*, s_2^*] \\right|}{\\left| [s_1, s_2] \\cup [s_1^*, s_2^*] \\right|}.$$\nHere, $[s_1, s_2]$ denotes the predicted temporal span of the anomaly, while $[s_1^*, s_2^*]$ is the ground truth interval. The temporal IoU quantifies the degree of overlap between these intervals, and serves as a fine-grained reward signal to guide the model toward more accurate temporal localization.\nTask-specific Reward Formulations. We apply task-specific combinations of the reward components mentioned above. For the multiple-choice QA task, we use a combination of format and accuracy rewards: $R_{\\mathrm{QA}} = R_{\\mathrm{format}} + R_{\\mathrm{acc}}$. For temporal anomaly grounding, we further include a temporal IoU term to evaluate localization quality: $R_{\\mathrm{TAG}} = R_{\\mathrm{format}} + R_{\\mathrm{acc}} + R_{\\mathrm{tIoU}}$. For anomaly classification, we adopt a similar reward design as QA: $R_{\\mathrm{CLS}} = R_{\\mathrm{format}} + R_{\\mathrm{acc}}$.\nFigure 3: Statistics of our VAU-Bench. (a) Distribution of main anomaly types. (b) Distribution of video durations (top) and the proportion of anomalous segments within each video (bottom). (c) The evaluation criteria for four VAU tasks.\n3.4 VAU-Bench # Task Definition. We decompose the VAU task into four stages: perception, grounding, reasoning, and conclusion. 
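Returning briefly to the reward rules of Section 3.3, the following Python sketch shows one way the format, accuracy, and temporal IoU rewards could be combined per task. It is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the binary 0/1 scoring of format and accuracy, the exact-match comparison, and the `[start, end]` layout assumed inside the glue tag are all hypothetical.

```python
import re


def extract(tag, text):
    # Return the stripped content of <tag>...</tag>, or None if the pair is absent.
    match = re.search('<%s>(.*?)</%s>' % (tag, tag), text, re.S)
    return match.group(1).strip() if match else None


def format_reward(text, tags=('think', 'answer')):
    # Assumed binary rule: 1.0 only if every required tag pair is present.
    return 1.0 if all(extract(t, text) is not None for t in tags) else 0.0


def accuracy_reward(text, truth):
    # Assumed exact-match rule on the content of the answer tag.
    answer = extract('answer', text)
    return 1.0 if answer is not None and answer.lower() == truth.lower() else 0.0


def temporal_iou(pred, truth):
    # Intersection-over-Union of two (start, end) intervals in seconds.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def reward_qa(text, truth):
    # R_QA = R_format + R_acc (classification uses the same form).
    return format_reward(text) + accuracy_reward(text, truth)


def reward_tag(text, truth, truth_span):
    # R_TAG = R_format + R_acc + R_tIoU, with the span read from the glue tag.
    fmt = format_reward(text, tags=('think', 'answer', 'glue'))
    span = extract('glue', text) or ''
    nums = [float(n) for n in re.findall('[0-9]+(?:[.][0-9]+)?', span)]
    iou = temporal_iou(nums, truth_span) if len(nums) == 2 else 0.0
    return fmt + accuracy_reward(text, truth) + iou


qa_out = '<think>The person collapses suddenly.</think><answer>B</answer>'
tag_out = '<think>Motion spikes early.</think><answer>Yes</answer><glue>[2.0, 6.0]</glue>'
print(reward_qa(qa_out, 'B'))
print(reward_tag(tag_out, 'Yes', (4.0, 8.0)))
```

Because the format and accuracy terms are binary in this sketch while the IoU term is continuous, the grounding reward is the only one that can separate two well-formed, correct outputs, which matches its role as a fine-grained signal.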
These stages address four core questions respectively: \u0026ldquo;What happens in this video?\u0026rdquo;, \u0026ldquo;When does the anomaly occur?\u0026rdquo;, \u0026ldquo;Why does the anomaly happen?\u0026rdquo;, and \u0026ldquo;What is our overall judgment of the anomaly?\u0026rdquo;. Corresponding to these stages, we define four VAU tasks:\n- Multiple-Choice QA: Targets event perception by answering questions about videos.\n- Temporal Grounding: Localizes anomalous segments in the video timeline.\n- Anomaly Reasoning: Explores causal relationships to explain why an anomaly arises.\n- Anomaly Classification: Assigns the anomaly to its corresponding category.\nThis structured decomposition provides a clear framework for systematically addressing different perspectives of VAU, with each task rigorously evaluated using domain-specific metrics.\nTable 1: Comparison of performance on MSAD and UCF-Crime datasets on multiple-choice QA task and anomaly reasoning task. Acc (w/o think) and Acc (w/ think) refer to the multiple-choice question accuracy without and with thinking, respectively. For the anomaly reasoning task, CLS, KM, FLU, INF, and FAC represent VAU-Eval scores generated by DeepSeek-V3, measuring classification accuracy, key concept alignment, linguistic fluency, informativeness, and factual consistency, respectively. Each dimension is scored on a 10-point scale. Total denotes the aggregated score over five dimensions.\n\n| Dataset | Model | Acc (w/o think) | Acc (w/ think) | CLS↑ | KM↑ | FLU↑ | INF↑ | FAC↑ | Total↑ |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| MSAD | InternVL2.5-2B | 76.67 | 72.08 | 6.84 | 6.23 | 8.55 | 6.64 | 6.64 | 34.90 |\n| MSAD | Qwen2.5-VL-7B | 84.58 | 83.33 | 6.75 | 6.41 | 9.27 | 7.74 | 6.92 | 37.08 |\n| MSAD | InternVL2.5-8B-MPO | 82.50 | 84.17 | 6.83 | 6.33 | 8.32 | 6.37 | 6.86 | 34.72 |\n| MSAD | Qwen2-VL-2B | 77.08 | 72.50 | 5.94 | 5.43 | 8.77 | 6.29 | 5.90 | 32.25 |\n| MSAD | +SFT | 82.92 | 85.83 | 6.04 | 5.43 | 8.89 | 6.55 | 5.93 | 32.84 |\n| MSAD | +RFT | 82.92 (↑5.84) | 83.75 (↑11.25) | 6.05(↑) | 5.49(↑) | 8.89(↑) | 6.50(↑) | 6.05(↑) | 32.98(↑) |\n| MSAD | Qwen2.5-VL-3B | 85.83 | 82.50 | 5.77 | 5.24 | 9.02 | 6.74 | 5.70 | 32.47 |\n| MSAD | +SFT | 86.25 | 84.58 | 2.89 | 2.22 | 4.89 | 3.52 | 2.44 | 15.96 |\n| MSAD | +RFT | 88.33 (↑2.50) | 87.08 (↑4.58) | 5.97(↑) | 5.49(↑) | 9.05(↑) | 6.84(↑) | 6.03(↑) | 33.38(↑) |\n| UCF-Crime | InternVL2.5-2B | 84.86 | 68.13 | 4.40 | 3.08 | 8.09 | 5.69 | 3.47 | 24.74 |\n| UCF-Crime | Qwen2.5-VL-7B | 92.03 | 89.64 | 4.80 | 3.73 | 8.95 | 7.05 | 4.25 | 28.78 |\n| UCF-Crime | InternVL2.5-8B-MPO | 89.64 | 90.44 | 3.79 | 3.20 | 8.23 | 5.77 | 3.48 | 24.47 |\n| UCF-Crime | Qwen2-VL-2B | 87.25 | 83.67 | 3.47 | 2.48 | 7.75 | 4.49 | 2.82 | 21.02 |\n| UCF-Crime | +SFT | 83.67 | 86.06 | 3.61 | 2.26 | 7.30 | 4.79 | 2.70 | 20.66 |\n| UCF-Crime | +RFT | 88.45 (↑1.20) | 88.05 (↑4.38) | 4.04(↑) | 2.75(↑) | 7.72(↓) | 4.89(↑) | 3.11(↑) | 22.52(↑) |\n| UCF-Crime | Qwen2.5-VL-3B | 91.63 | 83.27 | 4.31 | 2.88 | 8.70 | 5.95 | 3.27 | 25.10 |\n| UCF-Crime | +SFT | | | 1.80 | 1.01 | 4.15 | 2.82 | 1.11 | 10.89 |\n| UCF-Crime | +RFT | 92.03 (↑0.40) | 91.63 (↑8.36) | 4.42(↑) | 2.98(↑) | 8.71(↑) | 5.98(↑) | 3.39(↑) | 25.49(↑) |\n\nDataset Construction and Annotation. 
Existing video anomaly datasets typically provide only frame-level labels [1, 33, 56] or sparse descriptions [8, 9, 50], limiting their usefulness for reasoning-based tasks. To address this, we construct VAU-Bench, a unified benchmark built from MSAD [56], UCF-Crime [33], and ECVA [8], enriched with Chain-of-Thought (CoT) annotations, including: (i) video descriptions, (ii) temporal boundaries, (iii) multiple-choice QA, and (iv) reasoning rationales. We apply a cleaning pipeline to remove corrupted or overly long videos and merge overlapping anomaly types. For UCF-Crime and ECVA, we use DeepSeek-V3 [18] to generate video-level summaries, QA pairs, and reasoning chains. For MSAD, CoT annotations are produced through a two-stage pipeline: we first apply InternVL-8B-MPO [42] to generate initial captions and analyses, which are then verified and refined using DeepSeek-V3 to obtain more accurate QA pairs and coherent reasoning rationales. We also give further construction and annotation details in the Appendix.\nDataset Statistics. Figure 3 presents an overview of VAU-Bench, the first VAU benchmark designed for Chain-of-Thought reasoning. Our dataset contains 4,602 videos covering 19 major anomaly types, with a total duration of 169.1 hours. It includes over 1.5 million words of fine-grained textual annotations, averaging 337 words per video, encompassing detailed descriptions, reasoning rationales, and multiple-choice questions. The dataset is split into 2,939 training, 734 validation, and 929 test videos. Additionally, we provide 3,700 temporal annotations to support the anomaly grounding task. Figure 3a shows the distribution of the main anomaly categories, while Figure 3b illustrates the diversity in video duration and anomaly sparsity. The evaluation protocols and metrics used for different tasks are summarized in Figure 3c, and we give more dataset statistics in the Appendix.\nReasoning Evaluation Metric: VAU-Eval. 
For VAU tasks, prior work has adopted BLEU and ROUGE [8, 35, 53] to evaluate semantic content. However, such n-gram-based metrics often fall short in capturing reasoning quality and deeper relational understanding. To better assess anomaly reasoning, we propose VAU-Eval, a GPT-based metric that compares model-generated descriptions and analyses with ground truth annotations. As illustrated in Figure 3c, we evaluate each response along five dimensions using DeepSeek-V3 [18] as the judge: classification accuracy, key concept alignment, fluency, informativeness, and factual consistency. Each dimension is scored on a 10-point scale to provide fine-grained assessment of reasoning quality.\nTable 2: Comparison of temporal anomaly grounding performance on the three datasets. For each dataset, we present results for the base models, followed by SFT and RFT variants. w/o think and w/ think refer to the inference prompt without and with thinking, respectively. The UCF-Crime rows serve as an out-of-distribution test for cross-dataset evaluation.\n\n| Dataset | Model | mIoU (w/o think) | R@0.3 (w/o think) | R@0.5 (w/o think) | R@0.7 (w/o think) | mIoU (w/ think) | R@0.3 (w/ think) | R@0.5 (w/ think) | R@0.7 (w/ think) |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| MSAD | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |\n| MSAD | Qwen2.5-VL-7B | 45.90 | 70.83 | 45.83 | 21.67 | 17.57 | 26.67 | 11.67 | 3.33 |\n| MSAD | Qwen2.5-VL-3B | 21.27 | 30.00 | 10.83 | 4.17 | 13.00 | 16.67 | 5.83 | 1.67 |\n| MSAD | + SFT | 30.65 | 47.50 | 30.00 | 9.17 | 35.17 | 50.83 | 34.17 | 15.00 |\n| MSAD | + RFT | 35.77 (↑14.50) | 53.33 | 34.17 | 15.83 | 30.70 (↑17.70) | 48.33 | 29.17 | 12.50 |\n| ECVA | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.17 | 0.30 | 0.00 | 0.00 |\n| ECVA | Qwen2.5-VL-7B | 19.85 | 25.87 | 15.17 | 9.70 | 5.71 | 7.96 | 4.73 | 2.99 |\n| ECVA | Qwen2.5-VL-3B | 14.21 | 17.16 | 6.47 | 3.23 | 6.35 | 7.21 | 1.99 | 0.50 |\n| ECVA | + SFT | 45.30 | 66.67 | 49.75 | 24.13 | 45.96 | 65.67 | 51.00 | 26.12 |\n| ECVA | + RFT | 35.09 (↑20.88) | 49.00 | 28.86 | 19.40 | 33.25 (↑26.90) | 48.51 | 30.60 | 18.41 |\n| UCF-Crime | Qwen2-VL-2B | 2.74 | 4.84 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 | 0.00 |\n| UCF-Crime | Qwen2.5-VL-7B | 22.72 | 33.87 | 16.13 | 8.06 | 4.89 | 8.06 | 1.61 | 0.00 |\n| UCF-Crime | Qwen2.5-VL-3B | 10.91 | 15.32 | 6.45 | 3.23 | 7.68 | 10.48 | 4.84 | 1.61 |\n| UCF-Crime | + SFT | 4.98 | 3.23 | 0.81 | 0.00 | 5.76 | 5.65 | 0.81 | 0.81 |\n| UCF-Crime | + RFT | 16.80 (↑5.89) | 23.39 | 8.06 | 4.03 | 9.21 (↑1.53) | 9.68 | 4.03 | 1.61 |\n\n4 Experiment # Implementation Details. Our main experiments are conducted using the Qwen2-VL-2B-Instruct [41] and Qwen2.5-VL-3B-Instruct [2] models. We apply full-parameter fine-tuning without adapters or LoRA, using 2 NVIDIA H20 GPUs for training. During the RFT training process, we adopt a structured prompting strategy that guides the model to generate intermediate reasoning and final answers in a standardized format. Specifically, each prompt instructs the model to enclose its reasoning process within \u0026lt;think\u0026gt;\u0026hellip;\u0026lt;/think\u0026gt; tags and its final answer within \u0026lt;answer\u0026gt;\u0026hellip;\u0026lt;/answer\u0026gt; tags. This format ensures consistency across different tasks. During inference, for Qwen-VL models, we sample frames at 1 FPS. For InternVL models, we uniformly sample 16 frames per video.\n4.1 Evaluation of VAU-R1 # Evaluation Protocol. We report results separately on the MSAD, ECVA, and UCF-Crime datasets rather than using a single aggregated benchmark, as these datasets differ substantially in anomaly types, video durations, and scene contexts. All evaluation metrics for our four tasks are summarized in Figure 3c. For the QA task, we report multiple-choice accuracy. Temporal anomaly grounding is evaluated using temporal mean Intersection over Union (mIoU), as well as recall at different IoU thresholds: R@0.3, R@0.5, and R@0.7. For anomaly reasoning, we adopt the GPT-based VAU-Eval introduced in Section 3.4. Finally, binary and multi-class classification accuracy are used for evaluating the anomaly classification task.\nEvaluation on QA-Guided Reasoning. 
As shown in Table 1, we evaluate the reasoning capabilities of VAU-R1 on MSAD and UCF-Crime using multiple-choice QA accuracy and GPT-based VAU-Eval scores. We highlight two key observations. First, base models often perform worse when generating answers with reasoning (Acc w/ think) than without it (Acc w/o think), indicating that naive Chain-of-Thought generation may introduce hallucination. In contrast, reinforcement fine-tuning (RFT) improves both QA accuracy with reasoning (e.g., +11.25 on MSAD) and overall reasoning quality. Second, RFT improves over the base model on almost every VAU-Eval dimension (classification, key concept alignment, fluency, informativeness, and factual consistency), the one exception being fluency for Qwen2-VL-2B on UCF-Crime, demonstrating its ability to strengthen structured reasoning. For instance, on MSAD, Qwen2.5-VL-3B+RFT achieves the highest total VAU-Eval score among the fine-tuned variants (33.38), showing substantial improvement over its SFT counterpart (15.96). These results confirm that RFT not only enhances answer correctness but also fosters robust and generalizable multimodal reasoning under the VAU setting.\nTable 3: Ablation study on task co-training for anomaly classification. Bin. Acc. denotes binary classification accuracy (normal vs. abnormal), and Multi Acc. denotes multi-class accuracy over 19 anomaly types plus the normal class. Results are reported with and without think prompting.\n\n| Model | Bin. Acc. (w/o think) | Multi Acc. (w/o think) | Bin. Acc. (w/ think) | Multi Acc. (w/ think) |\n| --- | --- | --- | --- | --- |\n| Baseline (Qwen2.5-VL-3B-Instruct) | 62.77 | 47.96 | 59.33 | 39.06 |\n| +SFT w/ CLS | 81.12 | 29.08 | 83.37 | 32.19 |\n| +RFT w/ CLS | 60.30 | 46.14 | 59.01 | 42.27 |\n| +RFT w/ QA | 59.01 | 46.14 | 58.91 | 41.95 |\n| +RFT w/ TAG | 67.81 | 49.46 | 74.14 | 46.14 |\n| +RFT w/ QA-TAG | 65.77 | 47.53 | 67.60 | 45.06 |\n| +RFT w/ QA-TAG-CLS | 64.70 | 48.61 | 65.02 | 45.60 |\n\nEvaluation on Temporal Anomaly Grounding. As shown in Table 2, we evaluate the temporal anomaly grounding performance across three datasets. Note that all models are trained only on MSAD and ECVA, while UCF-Crime serves as an out-of-distribution test set. We observe several key findings. 
First, across both inference settings (w/ and w/o think), RFT consistently outperforms the corresponding base models, demonstrating its effectiveness in improving temporal localization. Notably, the RFT-finetuned 3B model achieves higher mIoU than the larger 7B base model on ECVA. Second, similar to our observations in QA-guided reasoning, Chain-of-Thought prompting does not necessarily enhance grounding performance. In some cases, adding reasoning leads to degraded localization accuracy. Third, RFT shows significantly better generalization compared to SFT. In cross-dataset evaluation (e.g., UCF-Crime as an out-of-distribution test), SFT demonstrates limited generalization, whereas RFT maintains strong performance across unseen scenarios. While SFT occasionally outperforms RFT in isolated cases, we observe that its direct predictions are opaque and lack interpretability, often yielding repetitive, non-discriminative outputs across videos (see Figure 4). These results highlight the advantages of RFT for enhancing generalization in VAU tasks.\nAblation Study. For VAU, the core objective is to make accurate high-level judgments about anomaly categories (e.g., distinguishing a fight from a robbery). To explore effective task formulations, we train models with different combinations of VAU tasks—multiple-choice QA, temporal anomaly grounding (TAG), and multi-class classification (CLS)—to assess their impact on reasoning. As shown in Table 3, RFT models trained with TAG alone achieve the highest binary accuracy (74.14) and strong multi-class performance (46.14) under the think setting, highlighting the benefit of temporal grounding for perception and category discrimination. Combining QA and TAG also improves performance but is slightly less effective than TAG alone. In contrast, SFT tends to over-predict anomalies, yielding high binary accuracy but poor multi-class results, suggesting overfitting. 
Overall, grounding-based tasks are more effective for anomaly classification, and jointly optimizing tasks via reinforcement learning yields complementary gains in both accuracy and reasoning.\nCase Study. Figure 4 illustrates two representative examples from the QA and TAG tasks, comparing SFT and our VAU-R1 under the same Chain-of-Thought (CoT) prompt. In the QA example, SFT incorrectly selects a normal explanation based on surface cues, while VAU-R1 correctly infers a people-falling anomaly by identifying posture and behavioral irregularities. In the TAG example, SFT outputs a coarse anomaly span without rationale, whereas VAU-R1 localizes the anomaly more precisely (0.0–13.6s) and provides an interpretable causal chain. These cases highlight VAU-R1\u0026rsquo;s superior reasoning and interpretability in both classification and localization settings. More qualitative case studies are provided in the Appendix.\nFigure 4: Qualitative cases for the QA (top) and TAG (bottom) tasks. All ground-truths and correct answers are highlighted in orange. Both SFT and RFT perform inference using the same CoT prompt. RFT\u0026rsquo;s explicit chain-of-thought yields a precise, interpretable QA choice and anomaly interval, whereas SFT\u0026rsquo;s output is less informative and tends to produce inaccurate responses.\n4.2 Discussion # RFT Enhances Generalization and Interpretability. Our experiments demonstrate that RFT consistently outperforms SFT across multiple VAU tasks, offering improved interpretability (Table 1) and better generalization (Table 2). In contrast, SFT tends to memorize task-specific patterns and suffers from poor generalization to unseen scenarios. This suggests that SFT-trained models are more prone to overfitting, especially when trained on limited or narrowly defined tasks.\nIs Chain-of-Thought Reasoning Necessary for VAU? Our findings suggest that Chain-of-Thought (CoT) reasoning does not always lead to better performance in visual understanding tasks. However, it significantly enhances interpretability by providing structured justifications. Unlike mathematical or logical tasks, where reasoning is more deterministic, visual understanding involves inherently diverse reasoning paths. Therefore, designing simpler sub-tasks with well-defined reward signals to guide reasoning effectively remains underexplored. Directly applying complex tasks (e.g., multi-class anomaly classification) without task co-training often leads to suboptimal results (Table 3).\nRethinking Anomaly Understanding in Multimodal Contexts. VAU calls for constructing a coherent reasoning chain that bridges spatial-temporal localization and causal inference. Yet, leveraging diverse cues such as keyframes, salient objects, and even additional modalities (e.g., audio) to support unified reasoning remains underexplored. We envision that future work could benefit from integrating these multimodal signals into a structured reasoning framework, enabling more robust and interpretable anomaly understanding. Our method and benchmark take a step in this direction by proposing a unified evaluation protocol across perception, localization, and reasoning dimensions, ultimately guiding models toward accurate and justifiable anomaly judgments.\n5 Conclusion # We present VAU-R1, an advanced and unified Video Anomaly Understanding framework focusing on four VAU tasks: multi-choice QA, temporal grounding, anomaly reasoning, and classification. VAU-R1 leverages a multimodal large language model (MLLM) and, notably, employs reinforcement fine-tuning to enhance anomaly reasoning and explainability via carefully designed GRPO reward functions for each task. 
To facilitate the training and evaluation of this framework, we also introduce VAU-Bench, the first chain-of-thought benchmark designed to train and evaluate VAU tasks at the reasoning level. Experiments across these tasks show that the proposed method outperforms the baselines.\nReferences # [1] A. Acsintoae, A. Florescu, M.-I. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah. Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20143–20153, 2022.\n[2] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.\n[3] J. Bi, S. Liang, X. Zhou, P. Liu, J. Guo, Y. Tang, L. Song, C. Huang, G. Sun, J. He, et al. Why reasoning matters? a survey of advancements in multimodal reasoning (v1). arXiv preprint arXiv:2504.03151, 2025.\n[4] C. Cao, Y. Lu, P. Wang, and Y. Zhang. A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20392–20401, 2023.\n[5] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. Wu. Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 387–395, 2023.\n[6] D. Ding, L. Wang, L. Zhu, T. Gedeon, and P. Koniusz. Lego: Learnable expansion of graph operators for multi-modal feature fusion. arXiv preprint arXiv:2410.01506, 2024.\n[7] K. Doshi and Y. Yilmaz. Towards interpretable video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2655–2664, 2023.\n[8] H. 
Du, G. Nan, J. Qian, W. Wu, W. Deng, H. Mu, Z. Chen, P. Mao, X. Tao, and J. Liu. Exploring what why and how: A multifaceted benchmark for causation understanding of video anomaly. arXiv preprint arXiv:2412.07183, 2024.\n[9] H. Du, S. Zhang, B. Xie, G. Nan, J. Zhang, J. Xu, H. Liu, S. Leng, J. Liu, H. Fan, et al. Uncovering what why and how: A comprehensive benchmark for causation understanding of video anomaly. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18793–18803, 2024.\n[10] K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, B. Wang, and X. Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025.\n[11] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1705–1714, 2019.\n[12] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.\n[13] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018.\n[14] W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Z. Xu, Y. Hu, and S. Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749 , 2025.\n[15] A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.\n[16] J. Leng, Z. Wu, M. Tan, Y. Liu, J. Gan, H. Chen, and X. Gao. Beyond euclidean: Dual-space representation learning for weakly supervised video violence detection. 
In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.\n[17] X. Li, Z. Yan, D. Meng, L. Dong, X. Zeng, Y. He, Y. Wang, Y. Qiao, Y. Wang, and L. Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025.\n[18] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.\n[19] K. Liu, W. Liu, C. Gan, M. Tan, and H. Ma. T-c3d: Temporal convolutional 3d network for real-time action recognition. In Proceedings of the AAAI conference on artificial intelligence , volume 32, 2018.\n[20] K. Liu and H. Ma. Exploring background-bias for anomaly detection in surveillance videos. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1490–1499, 2019.\n[21] W. Liu, W. Luo, D. Lian, and S. Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 6536–6545, 2018.\n[22] Y. Liu, D. Yang, Y. Wang, J. Liu, J. Liu, A. Boukerche, P. Sun, and L. Song. Generalized video anomaly event detection: Systematic taxonomy and comparison of deep models. ACM Computing Surveys, 56(7):1–38, 2024.\n[23] Z. Liu, Z. Sun, Y. Zang, X. Dong, Y. Cao, H. Duan, D. Lin, and J. Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.\n[24] C. Lu, J. Shi, and J. Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013.\n[25] Y. Lu, F. Yu, M. K. K. Reddy, and Y. Wang. Few-shot scene-adaptive anomaly detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 125–141. Springer, 2020.\n[26] H. Lv and Q. Sun. Video anomaly detection and explanation via large language models. 
arXiv preprint arXiv:2401.05702, 2024.\n[27] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel. Deep learning for anomaly detection: A review. ACM computing surveys (CSUR), 54(2):1–38, 2021.\n[28] B. Ramachandra and M. Jones. Street scene: A new dataset and evaluation protocol for video anomaly detection. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2569–2578, 2020.\n[29] R. Rodrigues, N. Bhargava, R. Velmurugan, and S. Chaudhuri. Multi-timescale trajectory prediction for abnormal human activity detection. In The IEEE Winter Conference on Applications of Computer Vision (WACV), March 2020.\n[30] Y. Shao, H. He, S. Li, S. Chen, X. Long, F. Zeng, Y. Fan, M. Zhang, Z. Yan, A. Ma, et al. Eventvad: Training-free event-aware video anomaly detection. arXiv preprint arXiv:2504.13092 , 2025.\n[31] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.\n[32] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014.\n[33] W. Sultani, C. Chen, and M. Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.\n[34] H. Tan, Y. Ji, X. Hao, M. Lin, P. Wang, Z. Wang, and S. Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025.\n[35] J. Tang, H. Lu, R. Wu, X. Xu, K. Ma, C. Fang, B. Guo, J. Lu, Q. Chen, and Y. Chen. Hawk: Learning to understand open-world video anomalies. Advances in Neural Information Processing Systems, 37:139751–139785, 2024.\n[36] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. 
In Proceedings of the IEEE/CVF international conference on computer vision, pages 4975–4986, 2021.\n[37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.\n[38] M. Vijay, W.-X. LI, B. Viral, and V. Nuno. Anomaly detection in crowded scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1975–1981, 2010.\n[39] L. Wang, W. Li, W. Li, and L. Van Gool. Appearance-and-relation networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 1430–1439, 2018.\n[40] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016.\n[41] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin. Qwen2-vl: Enhancing vision-language model\u0026rsquo;s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.\n[42] W. Wang, Z. Chen, W. Wang, Y. Cao, Y. Liu, Z. Gao, J. Zhu, X. Zhu, L. Lu, Y. Qiao, et al. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024.\n[43] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.\n[44] P. Wu, C. Pan, Y. Yan, G. Pang, P. Wang, and Y. Zhang. Deep learning for video anomaly detection: A review. arXiv preprint arXiv:2409.05383, 2024.\n[45] P. Wu, X. Zhou, G. Pang, Y. Sun, J. Liu, P. Wang, and Y. Zhang. Open-vocabulary video anomaly detection. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18297–18307, 2024.\n[46] P. Wu, X. Zhou, G. Pang, Z. Yang, Q. Yan, P. Wang, and Y. Zhang. Weakly supervised video anomaly detection and localization with spatio-temporal prompts. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 9301–9310, 2024.\n[47] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6074–6082, 2024.\n[48] Y. Yang, K. Lee, B. Dariush, Y. Cao, and S.-Y. Lo. Follow the rules: reasoning for video anomaly detection with large language models. In European Conference on Computer Vision, pages 304–322. Springer, 2024.\n[49] M. Ye, W. Liu, and P. He. Vera: Explainable video anomaly detection via verbalized learning of vision-language models. arXiv preprint arXiv:2412.01095, 2024.\n[50] T. Yuan, X. Zhang, K. Liu, B. Liu, C. Chen, J. Jin, and Z. Jiao. Towards surveillance video-and-language understanding: New dataset baselines and challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22052–22061, 2024.\n[51] L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci. Harnessing large language models for training-free video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18527–18536, 2024.\n[52] H. Zhang, X. Xu, X. Wang, J. Zuo, C. Han, X. Huang, C. Gao, Y. Wang, and N. Sang. Holmesvad: Towards unbiased and explainable video anomaly detection via multi-modal llm. arXiv preprint arXiv:2406.12235, 2024.\n[53] H. Zhang, X. Xu, X. Wang, J. Zuo, X. Huang, C. Gao, S. Zhang, L. Yu, and N. Sang. Holmesvau: Towards long-term video anomaly understanding at any granularity. arXiv preprint arXiv:2412.06171, 2024.\n[54] H. Zhou, X. Li, R. Wang, M. Cheng, T. 
Zhou, and C.-J. Hsieh. R1-zero\u0026rsquo;s \u0026quot;aha moment\u0026quot; in visual reasoning on a 2b non-sft model. arXiv preprint arXiv:2503.05132, 2025.\n[55] H. Zhou, J. Yu, and W. Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 3769–3777, 2023.\n[56] L. Zhu, L. Wang, A. Raj, T. Gedeon, and C. Chen. Advancing video anomaly detection: A concise review and a new dataset. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.\nA Further Dataset Details # Figure 5: More dataset statistics of our VAU-Bench. (a) Distribution of training, validation, and test splits across the four tasks included in VAU-Bench. (b) Word cloud visualization of frequent terms appearing in the multiple-choice questions and choices.\nDataset Annotation. VAU-Bench is constructed from three datasets: UCF-Crime, ECVA, and MSAD. While UCF-Crime [33] and ECVA [8] provide basic scene-level descriptions, they lack the structured annotations necessary for fine-grained reasoning. To address this, we leverage DeepSeek-V3 [18], a powerful large language model, to enrich the existing annotations from HIVAU-70K (which includes UCF-Crime) [53] and ECVA [8]. We use prompt-based instruction to guide the model in extracting key events, causal relationships, and anomalous behaviors, thereby producing reasoning-oriented annotations suitable for causal understanding. The detailed prompt design is provided in the blue-colored box below.\nVideo Understanding Prompt. # You are an expert in video understanding and reasoning. I will give you structured metadata for a surveillance or behavior-related video. Your task is twofold: Please analyze the entire video description, including anomaly labels, events, and all textual summaries. 
Based on this, generate a comprehensive summary of what happens in the video in the following JSON structure: { \u0026#34;judgement\u0026#34;: \u0026#34;Does this video depict an anomaly? If yes, what is it called?\u0026#34;, \u0026#34;description\u0026#34;: \u0026#34;Chronological and factual summary of what happened in the video\u0026#34;, \u0026#34;analysis\u0026#34;: { \u0026#34;Specific Anomaly Type\u0026#34;: \u0026#34;Select from the [Anomaly Type]\u0026#34;, \u0026#34;Location\u0026#34;: \u0026#34;Where the event occurs: indoor/outdoor/specific\u0026#34;, \u0026#34;Key Evidence\u0026#34;: \u0026#34;Key actions or objects that support classification\u0026#34;, \u0026#34;Detailed Explanation\u0026#34;: \u0026#34;Why these events are normal/anomalous\u0026#34;, \u0026#34;Cause and Effect\u0026#34;: \u0026#34;What led to the event and its outcome\u0026#34;, \u0026#34;Conclusion\u0026#34;: \u0026#34;Wrap-up reasoning with final conclusion about the event\u0026#34; } } Generating QA Pair Prompt. # You are an expert in reasoning-focused QA generation for surveillance analysis videos. You will be given a structured video summary, including: (i) A judgement (whether the video is anomalous or normal). (ii) A chronological description of what happens in the video. (iii) A multi-part analysis that breaks down the event\u0026rsquo;s anomaly type, location, key evidence, explanation, causes, and conclusion. Please generate a single multiple-choice question-answer pair in JSON format.\nFor the MSAD [56] dataset, which lacks textual annotations, we design a structured Chain-of-Thought (CoT) annotation pipeline. We first use InternVL2.5-8B-MPO [42] as the Vision-Language Model (VLM) to generate initial annotations that include detailed descriptions, step-by-step reasoning, and anomaly classification. 
To further improve the quality of these annotations, we apply DeepSeek-V3 in a secondary refinement stage, which enhances the coherence and clarity of the generated descriptions, QA pairs, and reasoning chains. The overall annotation pipeline consists of the following stages:\nTask Definition: The VLM is instructed to act as an anomaly detector. Video Description: The VLM generates a detailed description of the video content. Step-by-Step Reasoning: The VLM performs multi-step reasoning to analyze the presence and nature of anomalies. Verification: Given the ground-truth anomaly type, the VLM verifies whether its prediction aligns with it. If not, it regenerates both the description and reasoning. Key Object Summarization: The VLM identifies key visual objects or cues relevant to the anomaly, expressed in 1–3 words. QA Generation: The VLM constructs multiple-choice questions by generating and shuffling plausible anomaly-related answer options. Quality Enhancement: We use DeepSeek-V3 to validate and refine the generated QA pairs, descriptions, and reasoning chains. After completing the CoT annotation for the entire VAU-Bench, we perform a manual review to ensure the accuracy and consistency of all generated annotations.\nMore Dataset Statistics. Table 4 presents a detailed comparison of our VAU-Bench and existing video anomaly datasets. Compared to previous datasets, our benchmark offers a longer total video duration, a more diverse set of primary anomaly types (with similar categories merged), diverse multi-choice QA pairs, and richer Chain-of-Thought reasoning annotations. Figure 5a shows the dataset splits across four tasks. Each task contains a balanced number of training, validation, and test samples, supporting robust evaluation. Figure 5b presents a word cloud of frequent phrases extracted from the multiple-choice questions and answers in VAU-Bench. 
Notably, the presence of phrases such as \u0026ldquo;best describes\u0026rdquo;, \u0026ldquo;plausible explanation\u0026rdquo;, and \u0026ldquo;behavioral clue\u0026rdquo; highlights the variety of question formulations, encouraging models to engage in fine-grained interpretation. In addition, keywords such as robbery, man action, and scene indicate that our questions are intentionally crafted to guide models toward recognizing specific objects and anomaly types in complex real-world scenarios.\nDataset Examples. We present representative examples from our VAU-Bench, each annotated to support four core tasks of video anomaly understanding. As illustrated in Figure 7, each example is richly labeled with a question-answer pair, key visual evidence, anomaly type, temporal annotation, and a multi-part reasoning chain that includes location, cause and effect, and a high-level conclusion. This annotation format enables models not only to detect and classify anomalies, but also to explain them in a structured, interpretable manner. Figure 7 and Figure 8 show challenging anomaly scenarios, while Figure 9 depicts a normal scene, included to test model robustness and reduce false positives. These examples demonstrate the breadth and depth of our annotations, enabling holistic evaluation across perception and reasoning dimensions.\nB Experiment Details # Training Details. We use the Adam optimizer with a learning rate of 2 × 10⁻⁵. The supervised fine-tuning (SFT) stage runs for fewer steps (e.g., 200) to avoid overfitting, while the reinforcement fine-tuning (RFT) stage takes approximately 15 hours for 1.5k steps. We set the hyperparameter β in the KL divergence term of the GRPO to 0.04, using M = 4 candidate outputs per prompt. The maximum response length is capped at 1024 tokens.\nTable 4: Comparison of video anomaly detection benchmarks. We compare VAU-Bench with existing datasets in terms of size, annotation granularity, and reasoning capabilities. 
VAU-Bench is the first benchmark to support structured reasoning via multiple-choice questions and Chain-of-Thought (CoT) annotations. Columns indicate whether each dataset provides QA pairs, free-text descriptions (Descrip.), anomaly judgement (Judge.), reasoning (Reason.), and full CoT rationales.\n| Dataset | Year | #Videos | Total Len. | #Type | Annotation | QA Pairs | Descrip. | Judge. | Reason. | CoT |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| UCSD Ped1 [38] | 2010 | 70 | 0.1h | 5 | Bounding-box | ✗ | ✗ | ✗ | ✗ | ✗ |\n| UCSD Ped2 [38] | 2010 | 28 | 0.1h | 5 | Bounding-box | ✗ | ✗ | ✗ | ✗ | ✗ |\n| CUHK Avenue [24] | 2013 | 35 | 0.5h | 5 | Bounding-box | ✗ | ✗ | ✗ | ✗ | ✗ |\n| ShanghaiTech [21] | 2017 | 437 | 3.5h | 13 | Bounding-box | ✗ | ✗ | ✗ | ✗ | ✗ |\n| UCF-Crime [33] | 2018 | 1900 | 128.0h | 13 | Frame | ✗ | ✗ | ✗ | ✗ | ✗ |\n| Street Scene [28] | 2020 | 81 | 3.8h | 17 | Bounding-box | ✗ | ✗ | ✗ | ✗ | ✗ |\n| IITB Corridor [29] | 2020 | 358 | 2.0h | 10 | Frame | ✗ | ✗ | ✗ | ✗ | ✗ |\n| UBNormal [1] | 2022 | 543 | 2.2h | 22 | Frame | ✗ | ✗ | ✗ | ✗ | ✗ |\n| NWPU [4] | 2023 | 547 | 16.3h | 43 | Frame | ✗ | ✗ | ✗ | ✗ | ✗ |\n| MSAD [56] | 2024 | 720 | 4.1h | 11 | Frame | ✗ | ✗ | ✗ | ✗ | ✗ |\n| UCA [50] | 2024 | 1854 | 121.9h | 13 | Time Duration | ✗ | ✓ | ✗ | ✗ | ✗ |\n| CUVA [9] | 2024 | 1000 | 32.5h | 11 | Time Duration | ✓ | ✓ | ✓ | ✓ | ✗ |\n| ECVA [8] | 2024 | 2240 | 88.2h | 21 | Time Duration | ✓ | ✓ | ✓ | ✓ | ✗ |\n| HIVAU-70K [53] | 2025 | 5443 | NA | NA | Time Duration | ✓ | ✓ | ✗ | ✗ | ✗ |\n| VAU-Bench (Ours) | 2025 | 4596 | 169.1h | 9 | Time Duration | ✓ | ✓ | ✓ | ✓ | ✓ |\nVAU-Eval Prompt. # Below is a ground-truth description and analysis, followed by a model-generated description and analysis. 
Please evaluate the model\u0026rsquo;s outputs from the following aspects:\nClassification Correctness (10 pts)\nKey Object and Action Matching (10 pts)\nFluency and Coherence (10 pts)\nInformativeness and Domain Awareness (10 pts)\nFactual Consistency (10 pts)\nEvaluation Details for Anomaly Reasoning. To evaluate the alignment between model-generated outputs and our annotated ground truth in video anomaly understanding, we introduce VAU-Eval, an LLM-based evaluation protocol. The evaluation is structured as a multi-turn interaction, where the model first generates a description of the video and then performs reasoning to determine whether the video contains an anomaly. We then use DeepSeek-V3 [18] to assess the similarity between the predicted answers and the ground truth across five aspects: classification correctness, key object and action matching, fluency and coherence, informativeness and domain awareness, and factual consistency. Each aspect is scored out of 10 points, yielding a total of 50 points per sample. To better reflect the model\u0026rsquo;s actual reasoning capabilities, we do not fine-tune the model on any reasoning-style description or analysis. Instead, we directly test models that are trained solely on the multiple-choice QA task, thus ensuring that their descriptive reasoning is not memorized but inferred. The detailed evaluation prompt used in this process is shown in the blue box above.\nC Further Evaluations # More Evaluations. As shown in Table 5, we conduct experiments on the ECVA dataset. Compared to UCF-Crime and MSAD, ECVA poses greater challenges across both recognition and reasoning tasks. All models consistently achieve lower VAU-Eval reasoning scores on ECVA, indicating that its longer videos, more frequent camera movements, viewpoint shifts, and richer anomaly diversity make fine-grained understanding more difficult. 
While our RFT-enhanced models achieve consistent improvements in multiple-choice QA accuracy, their VAU-Eval reasoning scores do not always improve. This suggests that while RFT helps models better predict the final answer, it does not necessarily enhance the reasoning process. These findings highlight the need for more fine-grained reward signals to guide the generation of high-quality rationales in complex scenarios.\nTable 5: Comparison of performance on the ECVA dataset on the multiple-choice QA and anomaly reasoning tasks. Acc w/o think and Acc w/ think refer to the multiple-choice question accuracy without and with thinking, respectively. For the anomaly reasoning task, CLS, KM, FLU, INF, and FAC represent VAU-Eval scores generated by DeepSeek-V3, measuring classification accuracy, key concept alignment, linguistic fluency, informativeness, and factual consistency, respectively. Each dimension is scored on a 10-point scale. Total denotes the aggregated score over five dimensions.\n| Dataset | Model | Acc w/o think | Acc w/ think | CLS↑ | KM↑ | FLU↑ | INF↑ | FAC↑ | Total↑ |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| ECVA | InternVL2.5-2B | 78.84 | 58.84 | 2.86 | 2.78 | 7.57 | 4.62 | 3.03 | 20.86 |\n| ECVA | Qwen2.5-VL-7B | 83.02 | 86.98 | 3.70 | 3.67 | 8.64 | 6.40 | 4.04 | 26.45 |\n| ECVA | InternVL2.5-8B-MPO | 90.00 | 83.72 | 3.4 | 3.31 | 7.87 | 4.48 | 3.47 | 22.53 |\n| ECVA | Qwen2-VL-2B | 86.98 | 83.95 | 2.41 | 2.36 | 7.81 | 3.81 | 2.57 | 18.96 |\n| ECVA | +SFT | 84.88 | 84.65 | 2.20 | 2.12 | 7.37 | 3.99 | 2.22 | 17.90 |\n| ECVA | +RFT | 90.23 (↑3.25) | 84.42 (↑0.47) | 2.26 | 2.28 | 7.52 | 3.70 | 2.40 | 18.16 |\n| ECVA | Qwen2.5-VL-3B | 85.58 | 75.81 | 2.21 | 2.58 | 8.33 | 5.02 | 2.75 | 20.89 |\n| ECVA | +SFT | 89.30 | 86.98 | 1.50 | 1.22 | 4.37 | 2.66 | 1.24 | 10.99 |\n| ECVA | +RFT | 89.53 (↑3.95) | 86.51 (↑10.70) | 1.45 | 2.24 | 8.05 | 4.32 | 2.39 | 18.45 |\nTable 6: Performance of HolmesVAU 2B [53] and our VAU-R1 2B on the multiple-choice QA and anomaly reasoning tasks.\n| Model | Dataset | Acc w/o think | Acc w/ think | CLS↑ | KM↑ | FLU↑ | INF↑ | FAC↑ | Total↑ |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| HolmesVAU 2B | MSAD | 85.00 | 86.25 | 2.7 | 2.72 | 6.82 | 3.55 | 3.33 | 20.15 |\n| HolmesVAU 2B | UCF-Crime | 86.45 | 85.66 | 3.05 | 1.97 | 6.30 | 3.08 | 2.39 | 16.79 |\n| HolmesVAU 2B | ECVA | 70.47 | 70.70 | 2.54 | 1.71 | 6.26 | 2.78 | 2.30 | 15.59 |\n| VAU-R1 2B | MSAD | 82.92 | 83.75 (↓2.50) | 6.05 | 5.49 | 8.89 | 6.50 | 6.05 | 32.98 |\n| VAU-R1 2B | UCF-Crime | 88.45 (↑2.00) | 88.05 (↑2.39) | 4.04 | 2.75 | 7.72 | 4.89 | 3.11 | 22.52 |\n| VAU-R1 2B | ECVA | 90.23 (↑19.76) | 84.42 (↑13.72) | 2.26 | 2.28 | 7.52 | 3.70 | 2.40 | 18.16 |\nComparison with Prior Work. As shown in Table 6, we evaluate HolmesVAU 2B [53], a recently released baseline for VAU, on our benchmark to assess its reasoning capability in complex scenarios. While HolmesVAU 2B achieves reasonable performance across all datasets, it consistently underperforms compared to our Qwen-based models, particularly on the challenging ECVA dataset. This performance gap is evident in both multiple-choice QA accuracy and VAU-Eval reasoning scores, indicating limitations in HolmesVAU 2B\u0026rsquo;s ability to generalize to diverse and complex scenarios. In contrast, VAU-R1 demonstrates stronger alignment with human-annotated reasoning chains and greater robustness across datasets.\nClassification Results. Table 7 presents the binary and multi-class anomaly classification accuracy on three datasets: MSAD, UCF-Crime, and ECVA. We directly apply the RFT strategy to train a multi-class anomaly classification task, which includes 19 different anomaly types as well as the normal class. However, directly training the complex multi-class task with RFT degrades performance, suggesting it is more effective to decompose the task into simpler sub-tasks with structured rewards to better guide learning. We compare multiple models under two settings: w/o think and w/ think. 
We observe that, for the relatively challenging multi-class anomaly task, incorporating an explicit \u0026ldquo;think\u0026rdquo; reasoning step improves the model\u0026rsquo;s classification accuracy.\nTemporal Localization Performance. Table 8 summarizes the temporal localization (mIoU) performance of representative methods, categorized into traditional models, multi-modal approaches, and MLLMs. As expected, early appearance-based methods (e.g., Two-stream [32], TSN [40], C3D [37]) achieve limited performance. Incorporating spatio-temporal modeling via 3D convolutions (T-C3D [19], ARTNet [39], 3DResNet [13]) brings moderate improvements, with Liu et al. [20] reaching an mIoU of 16.40. More recent multi-modal approaches, such as VADClip [47] and STPrompt [46], achieve significantly better performance, with STPrompt reaching 23.90 mIoU. Our MLLM-based methods show promising yet limited temporal grounding capabilities. While Qwen2.5-VL-3B achieves only 10.91 mIoU, reinforcement tuning (+RFT) boosts performance to 16.80, indicating that structured reward learning helps align model outputs with temporal structures.\nTable 7: Comparison of anomaly classification accuracy on three datasets. Bin. Acc. denotes binary classification accuracy (normal vs. abnormal), and Multi Acc. denotes multi-class accuracy over 19 anomaly types and the normal class. Results are reported with and without think prompting.\n| Dataset | Model | Bin. Acc. (w/o think) | Multi Acc. (w/o think) | Bin. Acc. (w/ think) | Multi Acc. (w/ think) |\n| --- | --- | --- | --- | --- | --- |\n| MSAD | Qwen2-VL-2B-Instruct | 75.00 | 62.50 | 60.42 | 52.92 |\n| MSAD | Qwen2.5-VL-7B-Instruct | 90.00 | 70.00 | 75.00 | 66.67 |\n| MSAD | Qwen2.5-VL-3B-Instruct | 79.17 | 69.58 | 73.33 | 56.67 |\n| MSAD | + SFT | 70.83 | 28.75 | 74.58 | 33.33 |\n| MSAD | + RFT | 82.08 (↑2.91) | 71.25 (↑1.67) | 74.58 (↑1.25) | 60.83 (↑4.16) |\n| UCF-Crime | Qwen2-VL-2B-Instruct | 60.56 | 53.78 | 60.16 | 51.79 |\n| UCF-Crime | Qwen2.5-VL-7B-Instruct | 86.85 | 62.15 | 70.12 | 61.35 |\n| UCF-Crime | Qwen2.5-VL-3B-Instruct | 64.54 | 58.57 | 62.55 | 52.19 |\n| UCF-Crime | + SFT | 64.14 | 28.69 | 69.32 | 37.05 |\n| UCF-Crime | + RFT | 62.55 (↓1.99) | 57.77 (↓0.80) | 62.15 (↓0.40) |  |\n| ECVA | Qwen2-VL-2B-Instruct | 41.95 | 24.72 | 32.88 | 19.05 |\n| ECVA | Qwen2.5-VL-7B-Instruct | 64.85 | 32.88 | 43.54 | 23.81 |\n| ECVA | Qwen2.5-VL-3B-Instruct | 52.83 | 30.16 | 49.89 | 22.00 |\n| ECVA | + SFT | 96.37 | 29.48 | 96.15 | 28.80 |\n| ECVA | + RFT | 49.66 (↓3.17) | 30.61 (↑0.45) | 55.78 (↑5.89) | 31.07 (↑9.07) |\nTable 8: Comparison of temporal localization performance (mIoU) across different methods on the UCF-Crime dataset.\n| Category | Method | Feature | mIoU |\n| --- | --- | --- | --- |\n| Traditional | Two-stream [32] |  | 2.2 |\n| Traditional | TSN [40] | TSN | 2.6 |\n| Traditional | C3D [37] | C3D | 7.2 |\n| Traditional | T-C3D [19] | C3D | 10.2 |\n| Traditional | ARTNet [39] | ARTNets | 11.4 |\n| Traditional | 3DResNet [13] | I3D-ResNet | 10.3 |\n| Traditional | NLN [43] | I3D-ResNet | 12.2 |\n| Traditional | Liu et al. [20] | I3D-ResNet | 16.4 |\n| Multi-modal | VADClip [47] |  | 22.05 |\n| Multi-modal | STPrompt [46] | CLIP | 23.9 |\n| MLLMs | Qwen2.5-VL-3B |  | 10.91 |\n| MLLMs | Qwen2.5-VL-3B + RFT | ViT | 16.8 |\n| MLLMs | Qwen2.5-VL-7B | ViT | 22.72 |\nHowever, even with RFT, MLLMs still underperform compared to specialized temporal models, suggesting that current architectures may lack explicit temporal reasoning modules required for fine-grained localization.\nCase Study on Anomaly Reasoning. Figure 6 presents a qualitative comparison between outputs generated by SFT and our proposed VAU-R1 model on the anomaly reasoning task. 
Both models are evaluated using the same Chain-of-Thought (CoT) prompt and scored based on five criteria: classification correctness (CLS), key object matching (KM), fluency (FLU), informativeness (INF), and factual consistency (FAC). The SFT output incorrectly identifies the anomaly as a political argument, which does not match the core issue (an escalator malfunction). It also fails to mention any key visual evidence or relevant location. In contrast, VAU-R1 produces a more contextually appropriate response, identifying an emergency situation at a subway station involving injured individuals and emergency vehicles. While the response focuses on surface-level emergency context rather than the root cause, it demonstrates greater fluency and relevance. The evaluation assigns a higher total score of 22, with solid performance across all dimensions, particularly in fluency and informativeness.\nD Limitation and Future Work # One limitation of this work is its focus on a constrained set of tasks, namely multiple-choice question answering, temporal grounding, anomaly reasoning, and anomaly classification. While these tasks form a strong foundation for video anomaly understanding, there remains substantial room for extension. Future work could incorporate additional tasks such as spatial localization of key objects, which would enable more fine-grained event understanding. Moreover, introducing additional modalities (e.g., audio) may provide complementary cues that enhance both the robustness and contextual depth of anomaly reasoning.\nE Potential Societal Impact # We propose a new method and benchmark for video anomaly understanding. Accurate and interpretable anomaly understanding systems can contribute to a wide range of safety-critical applications, such as disaster early warning, fire prevention, fall detection, and public safety monitoring. 
By enabling models to reason about abnormal events, our approach can assist first responders in identifying urgent situations earlier and more reliably.\nHowever, this research inevitably involves scenarios that depict violent or chaotic abnormal behaviors. We strictly follow established ethical guidelines throughout our study. The datasets used in this study are publicly available and have been processed in accordance with the guidelines provided by their original publishers. We strictly adhere to these terms of use and employ the data solely for academic research purposes. To ensure privacy protection, the datasets include safeguards such as reduced video resolution and facial blurring, effectively preventing the identification of individuals. Looking ahead, we plan to explore anomaly understanding methods that incorporate privacy preservation as a core design principle.\nFigure 6: Qualitative case of the Anomaly Reasoning task. All correct descriptions and analyses are highlighted in orange. The evaluation results are presented to the right of each answer. Both SFT and VAU-R1 perform inference using the same CoT prompt. VAU-R1\u0026rsquo;s output correctly identifies the anomaly with high fluency but lacks reasoning for the core event, whereas SFT\u0026rsquo;s output is inaccurate and tends to produce repetitive responses.\nFigure 7: Example of VAU-Bench. An explosion case in an outdoor backyard, highlighting complex anomaly detection and dynamic scene understanding, labeled with a question-answer pair, key visual evidence, anomaly type, and a multi-part reasoning chain that includes location, cause and effect, and a high-level conclusion.\nFigure 8: Example of VAU-Bench. A stealing incident, demonstrating capabilities in human activity recognition and intent analysis.\nFigure 9: Example of VAU-Bench.
A normal scene, used to evaluate model robustness against false positives and to enhance dataset diversity.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/vau-r1-advancing-video-anomaly-understanding-via-reinforcement-fine-tuning/","section":"Papers","summary":"Introduces VAU-R1, a reinforcement fine-tuning framework leveraging Group Relative Policy Optimization (GRPO) to enhance multimodal large language models\u0026rsquo; (MLLMs) reasoning capabilities in video anomaly understanding (VAU). Develops VAUBench, a comprehensive Chain-of-Thought benchmark with rich annotations across perception, grounding, reasoning, and classification tasks, supported by multiple evaluation metrics including VAU-Eval, QA accuracy, temporal IoU, and Factual Consistency. Demonstrates significant improvements over supervised fine-tuning in question answering accuracy, temporal localization, and interpretability, thereby establishing a scalable, interpretable, and reasoning-aware VAU framework.","title":"VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning","type":"method"},{"content":" This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version;\nthe final published version of the proceedings is available on IEEE Xplore.\nVERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models # Muchao Ye 1*\nWeiyang Liu 2\nPan He 3\n1 The University of Iowa 2 Max Planck Institute for Intelligent Systems, Tübingen 3 Auburn University 1 muye@uiowa.edu 2 weiyang.liu@tuebingen.mpg.de 3 pan.he@auburn.edu * Corresponding Author https://vera-framework.github.io\nAbstract # The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehendible explanations for the decisions.
Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both detection performance and explainability of VLMs for VAD.\n1. Introduction # Video anomaly detection (VAD) aims to automatically identify unexpected and abnormal events in video sequences, with broad applications ranging from autonomous driving [2] to industrial manufacturing [31]. While achieving good performance in VAD is essential, providing clear explanations for detected anomalies is even more crucial.\nFigure 1. VERA renders frozen VLMs to describe and reason with learnable guiding questions learned from coarsely labeled data.\n
To this end, our work primarily focuses on explainable VAD, which requires both comprehensive visual understanding and the ability to generate human-interpretable predictions. The rapid advancement of vision-language models (VLMs) [7, 18, 21, 58] enables us to address both requirements through their strong visual reasoning and language interaction capabilities. As multi-modal architectures that effectively combine the reasoning capabilities from large language models (LLMs) [4] and the visual understanding capabilities from pretrained vision encoders [8], VLMs are particularly well-suited for VAD because they can offer explainable predictions that clearly illustrate the rationale behind specific anomalies, making the results more interpretable to users. Recent research on VAD has consequently focused on how to effectively leverage the power of pretrained VLMs. As shown in Fig. 1, existing approaches aim to address the misalignment problem between VLMs\u0026rsquo; pretraining tasks and the VAD requirements through either additional reasoning modules or instruction tuning (IT):\nOne line of research introduces external LLMs to assist frozen VLMs to reason in VAD [46, 52]. It uses VLMs to caption what they see given a video, and the descriptions are then passed to an external LLM, e.g., GPT-4 [1], to reason whether an anomaly occurs.\nAnother line of research, instead, expands VLMs to generate explainable predictions via IT [26, 55]. This research line creates additional VAD datasets with frame-level annotations and leverages exemplary instructions to fine-tune the VLM, enabling it to detect anomalies and generate human-interpretable explanations.\nKey Observations and Research Question.
While prior research demonstrates the potential of applying VLMs to VAD, we identify that this new paradigm is hindered by a shared critical issue: the use of additional reasoning modules or fine-grained labeled datasets incurs significant computational cost either in the inference or training phases. First, decoupling a VAD system into a frozen VLM and an extra LLM introduces more overhead in inference, because it separates the description generation and reasoning processes. Secondly, although IT-based methods enable VLMs to effectively integrate description and reasoning for VAD, they require additional manpower and computational resources for annotating and finetuning on fine-grained labeled instruction datasets, which is time-consuming and not scalable for large-scale datasets. In light of this, we investigate the following unexplored yet important question:\nCan we enable a frozen VLM to integrate description and reasoning for VAD without instruction tuning?\nOur Approach. This research question is nontrivial because the reasoning ability of a frozen VLM is limited in general visual tasks, and it struggles to handle complex reasoning tasks like VAD, which requires the understanding of subtle, context-dependent outliers. To illustrate, Table 1 shows that prompting frozen VLMs with simple VAD questions used in existing works leads to unsatisfactory results. Thus, instruction-tuning a VLM seems necessary to make it responsive to specific instructional cues and capture delicate visual variations. In this paper, we question the necessity of such an operation and propose a principled approach to tailor frozen VLMs for VAD.\nSpecifically, our solution is guided by the intuition that the reasoning ability of VLMs for VAD will improve if we find questions with suitable and concrete description of abnormal patterns rather than with abstract and general words like \u0026ldquo;anomaly\u0026rdquo; to prompt them. 
Our idea is to iteratively refine anomaly descriptions from abstract ones (e.g., \u0026ldquo;is there any anomaly?\u0026rdquo;) to detailed, specific characterizations.\nTable 1. Instructing a frozen VLM (InternVL2-8B [7]) with simple questions to perform VAD yields poor AUC on UCF-Crime [32] dataset.\n| VAD Question for InternVL2-8B | AUC (%) |\n| --- | --- |\n| “Describe the video and is there any anomaly?” [26] | 53.05 |\n| “Are there any abnormal events in the video?” [ | 65.03 |\nDriven by such insight, we propose a framework, termed VERA, to explore verbalized learning (VL) for VAD. This framework considers the practical constraint that it is suboptimal to manually write down VAD guiding questions across VLMs, so it introduces a data-driven learning task to identify suitable anomaly-characterization questions containing concrete abnormal patterns for the frozen VLM using coarsely labeled datasets, eliminating the need for IT. Specifically, in the training phase, VERA treats the questions guiding the reasoning of VLMs in VAD as learnable parameters, improving them based on the verbal feedback from an optimizer VLM on the performance of a learner VLM on an intermediate VAD subtask: binary video classification for each video in the VAD training set. This design is both efficient and appropriate for VAD, as it accounts for video-specific properties like temporality while relying solely on provided coarse video-level labels. After that, considering the large scale of video frames, VERA assigns a fine-grained anomaly score for each frame in a coarse-to-fine manner in the inference phase. First, VERA generates segment-level anomaly scores by querying VLMs with the learned guiding questions. Next, VERA improves the initial score by incorporating scene context into each segment score via ensembling. Finally, VERA outputs frame-level scores by fusing temporal context via Gaussian smoothing and frame-level position weighting.\nContributions.
To sum up, our contributions are:\nTo our knowledge, we present the first approach, that is, VERA, to adapt frozen VLMs as an integrated system for VAD by learning detailed anomaly-characterization questions in prompts that decompose anomalies into concrete and recognizable patterns. VERA learns them directly from coarsely labeled datasets, eliminating the need for IT or external reasoning modules.\nWe introduce an effective VL-based algorithm for VLMs in VAD, allowing direct adaptation without modifying model parameters. With only coarsely labeled VAD datasets, our approach obtains good guiding questions in VAD by relying on the verbal interaction between learner and optimizer VLMs in verbalized training. Additionally, we design a coarse-to-fine strategy to derive frame-level anomaly scores from verbally learned guiding questions in VAD, integrating both scene and temporal contexts for better VAD performance and reasoning.\nThe learned guiding questions from VERA are expressed in natural language, providing a unified method to encode and transfer prior VAD knowledge seamlessly to other datasets or VLMs. In challenging VAD datasets like UCF-Crime [32] and XD-Violence [42], VERA achieves state-of-the-art explainable VAD performance and enjoys good generalization ability across models and datasets.\n2. Related Work # Video Anomaly Detection. VAD is the task of localizing frames that contain abnormal events in a given video. This task is challenging because anomalies cover a broad scope of events like accidents and criminal activities while training sets only offer coarse annotations. Modern VAD methods are based on deep neural networks (DNNs) for their superiority and are going through a paradigm shift in using VLMs: (1) Early DNNs for VAD are task-specific and often employ unsupervised (including one-class) or weakly supervised (WS) learning techniques for training.
Most unsupervised learning methods [23, 25, 37, 38, 48, 56] train DNNs on frame reconstruction/prediction tasks to establish representation spaces for normal/abnormal videos. WS learning methods [5, 27, 32, 44, 47, 53] leverage both normal and abnormal videos to train a feature extractor that distinguishes anomalies from normalcy, typically using multiple instance learning [32] objectives. (2) Recent VAD methods adopt VLMs due to their remarkable success across core vision tasks [12, 21, 28, 33]. Early research [26, 46, 52, 55] has leveraged VLMs to generate textual descriptions of detected anomalies to enhance prediction explainability for VAD. However, current approaches incur high processing demands from external LLMs or require substantial effort and cost for fine-tuning on additional datasets, making them computationally inefficient in training or inference. Our work reduces the processing overhead by adapting frozen VLMs for VAD without model parameter modification or extra reasoning modules via learnable guiding questions, which elicit superior reasoning from frozen VLMs and significantly boost their performance in VAD.\nVerbalized Learning for VLMs. The designed VL framework is inspired by a recent technique called verbalized machine learning (VML) [45]. The main idea of VML is to use LLMs to approximate functions and learn the verbal rules and descriptions of performing specific tasks, which casts traditional machine learning tasks as language-based learning tasks. This approach regards the language expressions that define classification rules and other task-specific criteria as learned parameters, and optimizes them in a data-driven fashion through interactions between a learner and an optimizer modeled by LLMs or VLMs. However, the VML framework is limited to tasks involving regression on scalar values or classification for static images.
A similar idea has also been explored in a concurrent method, TextGrad [49], which integrates the process of incorporating textual feedback from LLMs for improving prompts in PyTorch and further proves its effectiveness in coding, question answering, and optimization in chemistry and medicine. Compared to existing works, our work pioneers VL for the VAD task and video data, which remains unsolved because previous VL frameworks focus on static-data tasks and cannot handle the challenges of temporality and scene dynamics in videos. Specifically, VERA introduces a new learning paradigm for VAD: generating effective questions that encapsulate key abnormal patterns in videos to elicit the reasoning ability from VLMs for explainable VAD. Additionally, VERA works for any VAD dataset and supports WS learning. Unlike previous WS methods, VERA only needs to learn concise text but not millions of parameters, so the training is lightweight.\n3. The VERA Framework # Our approach adapts VLMs to detect video anomalies without additional reasoning modules or IT. We now formulate the VAD task and detail the design of VERA.\n3.1. Problem Formulation # Video Anomaly Detection. Let V be a video with F frames, represented as V = {I_i}_{i=1}^F, where I_i is the i-th frame (1 ≤ i ≤ F). Our objective is to locate and detect the start and end of anomalous events within V. In standard labeling, any frame associated with an anomaly is labeled as 1, and normal frames are labeled as 0. Therefore, the ground truth label sequence for V is Y = [y_1, \u0026hellip;, y_F], where y_i ∈ {0, 1} represents the fine-grained label for I_i. We aim to use a frozen VLM, f_VLM, to generate anomaly score predictions across all frames, Ŷ = [ŷ_1, \u0026hellip;, ŷ_F], where ŷ_i ∈ [0, 1] is a continuous anomaly score for I_i.\nAvailable Training Data for VAD. Typically, VAD datasets only provide coarsely labeled training sets [23, 25, 32, 42].
We denote a VAD training set as D = {(V^(j), Y^(j))}_{j=1}^N, where N is the total number of training videos, V^(j) represents the j-th video (1 ≤ j ≤ N) and Y^(j) is the corresponding video-level label. Y^(j) = 1 if V^(j) contains any anomaly defined by the dataset annotators, e.g., abuse or arson activities, and Y^(j) = 0 if V^(j) has no anomalies. For V^(j), we suppose it contains F_j frames and denote the frame sequence as V^(j) = {I_i^(j)}_{i=1}^{F_j}, where I_i^(j) is the i-th frame (1 ≤ i ≤ F_j) in V^(j).\n3.2. Training in VERA # Training Objective. We aim to learn guiding questions that break down a complex and ambiguous concept (i.e., what is an \u0026ldquo;anomaly\u0026rdquo;) into a set of identifiable anomalous patterns to unlock reasoning capabilities within frozen VLMs for VAD tasks. Those patterns vary among datasets, making manually designed descriptions ineffective for generalization. To address this, we propose a general VL framework shown in Fig. 2 to generate the desired guiding questions. We denote the guiding question set as Q = {q_1, \u0026hellip;, q_m}, where q_i is the i-th question (1 ≤ i ≤ m) and m is the number of questions. The training framework considers Q as the learnable parameters, which are optimized through verbal interaction between a learner and an optimizer, modeled by VLMs through leveraging their ability to follow instructions with given prompts.\nTraining Data. The training data for learning Q consist of paired sampled video frames and video-level labels. Sampling is necessary because the number of video frames is too large to process every frame. We explore three types of sampling strategies and find that uniform sampling [54] yields the best results. That is, for any video V^(j) ∈ D, we first calculate the interval between sampled frames as l = floor(F_j/S), where S is the number of sampled frames, and floor denotes rounding down to the nearest integer. Given l, the uniformly sampled frames from V^(j) are represented by Ṽ^(j) = [I_1^(j), I_{l+1}^(j), \u0026hellip;, I_{(S-1)·l+1}^(j)]. The label used for training is Y^(j) only, resulting in training data pairs {(Ṽ^(j), Y^(j))}_{j=1}^N for VERA.\nFigure 2. The overall training pipeline in VERA aims to optimize VAD guiding questions iteratively. In each iteration t, the optimization is verbalized by providing verbal instructions for the learner and optimizer to follow. They will generate predictions and new guiding questions, respectively.\n
Updating Q via Learner and Optimizer. Since Q are verbal expressions for specific anomaly patterns, VERA inherits the idea of VML [45] in training: optimizing language-based parameters by verbal communication between a learner agent f_learner and an optimizer agent f_opt, rather than by numerical optimization algorithms like Adam [16]. W.l.o.g., we take an arbitrary iteration t when implementing the complete algorithm (detailed in Supplementary Material) for illustration. We denote any LLM-based model as f(x; φ), where x represents the input data, and φ denotes the natural language instructions for f to follow, which are considered as learnable parameters in our VL framework. Specifically, Q contains parameters to be learned in VERA. As depicted in Fig. 2, in each iteration t, the learner agent f_learner^(t) is modeled by the frozen VLM f_VLM(·) used for VAD with a specific prompt template θ that guides f_VLM(·) to conduct a learning task by pondering on current guiding questions Q_t. We denote the learner agent as f_learner^(t)(x) = f_VLM(x; (θ, Q_t)), where x is the input in a learning task, and Q_t, the learnable guiding questions applied in each iteration t, constitutes the core parameters that distinguish the learner between iterations. Meanwhile, we introduce an optimizer f_opt^(t) to assess the quality of the predictions of the learner and to optimize Q_t.
W.l.o.g., we use the same frozen VLM f_VLM to model the optimizer. As demonstrated in Fig. 2, we provide another specific prompt template ψ for the optimizer to follow to optimize Q_t, so we denote the optimizer agent as f_opt^(t)(z) = f_VLM(z; (ψ, Q_t)), where z is its input and ψ is the instruction to improve Q_t. It is important to note that f_learner^(t) ≠ f_opt^(t) because f_learner^(t) follows (θ, Q_t) to conduct a learning task, while f_opt^(t) follows (ψ, Q_t) to refine Q_t.\nLearning Task for f_learner. The learner executes the \u0026ldquo;forward pass\u0026rdquo; and outputs a prediction. Recall that we only use the original coarsely labeled information for training. Thus, we design a binary classification task for f_learner, which accounts for the temporal nature of video data, the sparsity of anomalies, and the weak supervision in VAD datasets. In this task, the job of the learner f_learner is to produce a binary classification prediction Ŷ^(j) to determine whether there is an anomaly in the video based on the sampled frames Ṽ^(j). As shown in Fig. 2, we explain the task in natural language in the \u0026ldquo;Model Description\u0026rdquo; section in θ. Guiding questions Q_t are inserted in the \u0026ldquo;Prompt Questions\u0026rdquo; section in θ to elicit reasoning of the VLM. This template design is based on the prompt structures used in VML, with targeted modifications to help the learner effectively address this WS learning task. Given θ and a sampled frame set Ṽ^(j), the learner will output a prediction as\nŶ^(j) = f_learner^(t)(Ṽ^(j)) = f_VLM(Ṽ^(j); (θ, Q_t)), (1)\nwhere Ŷ^(j) = 1 if the learner thinks there is an anomaly after skimming across the sampled frames Ṽ^(j) and reasoning through the guiding questions Q_t, and otherwise, Ŷ^(j) = 0.\nOptimization Step in f_opt. The optimizer executes the \u0026ldquo;backward pass\u0026rdquo; to update the questions Q_t via a mini-batch (batch size is n).
Suppose the visual input in a batch is V_batch = [Ṽ_batch^(1), \u0026hellip;, Ṽ_batch^(n)] and the corresponding ground truths are Y_batch = [Y_batch^(1), \u0026hellip;, Y_batch^(n)]. The learner generates predictions Ŷ_batch = [Ŷ_batch^(1), \u0026hellip;, Ŷ_batch^(n)] with the current questions Q_t by Eq. (1). The optimizer will output a new set of questions Q_{t+1} by following the prompt ψ with batched data. We denote the optimization step as\nQ_{t+1} = f_opt^(t)(V_batch, Ŷ_batch, Y_batch), (2)\nwhere Q_{t+1} is a new set of guiding questions constructed from f_opt^(t) owing to its text generation and instruction following abilities after reading ψ.\n3.3. Inference in VERA # During training, we denote the question set with the largest validation accuracy as Q*. In inference, given Q*, VERA yields the fine-grained anomaly score Ŷ for a test video V via a coarse-to-fine process shown in Fig. 3.\nStep 1: Initial Anomaly Scores via Learned Guiding Questions. We divide the video into segments and analyze each segment independently first. Following [52], we perform equidistant frame sampling within V to obtain the set of segment centers C = {I_1, I_{d+1}, \u0026hellip;, I_{(h-1)·d+1}}, where d is the interval between centers and h = floor(F/d) is the total number of segments. For each center frame I_{(u-1)·d+1} (1 ≤ u ≤ h), we define a 10-second window around it as the u-th segment, within which we uniformly sample 8 frames. We denote the sampled frame set in the u-th segment as V_u. Next, we input V_u to f_VLM with the prompt (θ, Q*) to get the initial score\nỹ_u = f_VLM(V_u; (θ, Q*)), (3)\nwhere ỹ_u = 1 if f_VLM thinks the segment contains an anomaly after reasoning via Q* with V_u, and otherwise, ỹ_u = 0. By repeating Eq. (3) for each segment, we have a segment-level initial anomaly score set Ỹ = [ỹ_1, \u0026hellip;, ỹ_h].\nStep 2: Ensemble Segment-Level Anomaly Scores with Scene Context. Note that the scores derived above only examine a short moment in a long video without considering any context.
To resolve it, we refine the initial segment-level score by incorporating scene context, defined as preceding and following segments that contain similar elements, such as actors and background, to those in the current segment.\nWe measure the relevance between different video segments by the cosine similarity of their feature representations [22], extracted by a pretrained vision feature extractor g, e.g., ImageBind [10]. For the u-th segment V_u, its similarity with any segment V_w (1 ≤ w ≤ h) is sim(u, w) = cos(e_u · e_w / (||e_u|| · ||e_w||)), where cos denotes the cosine function, and e_u = g(V_u) and e_w = g(V_w) represent their features. Let κ_u = [κ_u^(1), \u0026hellip;, κ_u^(K)] denote the indices of the top-K segments similar to V_u. We refine the anomaly score by\nȳ_u = Σ_{i=1}^{K} ỹ_{κ_u^(i)} · exp(sim(u, κ_u^(i)) / τ) / Σ_{k=1}^{K} exp(sim(u, κ_u^(k)) / τ), (4)\nwhere ȳ_u is an ensemble of initial scores of top-K video segments relevant to V_u. Here, the initial score of each retrieved segment is weighted by a factor derived from the cosine similarity and normalized by the Softmax function (with τ as the temperature hyperparameter). Accordingly, scenes with greater similarity are assigned higher weights, making the ensemble score a more comprehensive reflection of anomalies with the video context. By applying Eq. (4) for all segments, we obtain Ȳ = [ȳ_1, \u0026hellip;, ȳ_h].\nFigure 3. VERA computes anomaly scores with Q* in three steps.\n
Step 3: Frame-level Anomaly Scoring with Temporal Context. Given Ȳ, we aim to incorporate temporal context to capture how events evolve over time when computing frame-level anomaly scores, because the abnormality of an event often depends on the timing and progression of observed activities. In detail, we first apply Gaussian smoothing [11] to aggregate local temporal context into the segment-level anomaly scores. We denote the Gaussian kernel (suppose the filter size is ω) as G(p) = exp(-p^2 / (2σ_1^2)), where p is the distance from the kernel center and σ_1^2 is the variance. We update the segment-level scores by the convolution Ȳ ∗ G, where ∗ is the convolution operation, which yields one smoothed score per segment. Next, we integrate global temporal context by position weighting. We flatten the smoothed scores into frame-level scores by assigning the smoothed score of the u-th segment to each frame in that segment, i.e., [I_{(u-1)·d+1}, \u0026hellip;, I_{u·d}]. We denote the frame-level score sequence after flattening as [ρ_1, \u0026hellip;, ρ_F]. We then apply the Gaussian function to encode position weights as w(i) = exp(-(i-c)^2 / (2σ_2^2)), where i (1 ≤ i ≤ F) is any frame index, c = floor(F/2) is the center frame index, and σ_2^2 is the variance. The anomaly score for the i-th frame is:\nŷ_i = w(i) · ρ_i. (5)\nThis operation scales the score ρ_i, diminishing the anomaly score for frames near the beginning and end of the event. This helps better capture the temporal progression of anomalies: the score gradually increases as the anomaly reaches its peak and decreases afterward. The final scores are denoted as Ŷ = [ŷ_1, \u0026hellip;, ŷ_F] after applying Eq. (5).\nExplainable VAD by VERA. When using template θ embedded with Q* to compute Ŷ, we ask the VLM to \u0026ldquo;provide an explanation in one sentence\u0026rdquo; when reasoning, and the VLM will explain the anomaly score it assigns based on Q*.\n4. Experiments and Results # In this section, we present an evaluation of VERA as follows, addressing key questions of interest including: (Q1) Does it enhance the effectiveness of frozen VLMs in VAD? (Q2) Is its design reasonable and well-structured? (Q3) How well does it generalize across different scenarios?\n4.1. Experimental Settings # Datasets. We conduct experiments on two large-scale VAD datasets: (1) UCF-Crime [32] collected from surveillance videos with 13 types of anomalies and 290 (140 abnormal) test videos (2.13 minutes long on average).
(2) XD-Violence [42] with 6 anomaly categories and 800 (500 abnormal) test videos (1.62 minutes long on average).\nMetrics. Following approaches in [52, 55], we mainly evaluate VAD performance using the Area Under the Curve (AUC) of the frame-level Receiver Operating Characteristic (ROC) curve, as it provides a comprehensive measure of model performance across all thresholds.\nBaselines. We categorize baselines into non-explainable approaches and explainable ones as [55] does. Non-explainable ones are obtained by WS learning [6, 9, 15, 17, 19, 32, 36, 41–43, 50, 51, 57] and unsupervised learning [13, 25, 34, 35, 37, 38]. These non-explainable approaches cannot provide language-based explanations for VAD. For explainable approaches, we use LAVAD [52], Holmes-VAD [55], and VADor [26] as representatives of Pipeline 1 and Pipeline 2 shown in Fig. 1. It should be noted that [46] does not report performance on UCF-Crime and XD-Violence. Additionally, we include zero-shot (ZS) VAD by frozen VLMs designed by [52] as baselines.\nImplementation of VERA. In our experiments, we choose a small VLM, InternVL2-8B [7], as the backbone f_VLM for building VERA by default, if not otherwise specified. We also explore other backbones, such as Qwen2-VL-7B [40] and InternVL2-40B [7] for ablation. We train Q for no more than 10 epochs, with a validation accuracy calculated every 100 iterations to determine Q*. We set n as 2, S as 8, and m as 5 for training. The initial question set Q_0 is \u0026ldquo;1. Is there any suspicious person or object that looks unusual in this scene? 2. Is there any behavior that looks unusual in this scene?\u0026rdquo;, inspired by previous VAD methods [13, 43], which assume anomalies appear with unusual appearance or motions.\n4.2. Comparison to State-of-the-art Methods # We address Q1 by empirically comparing VERA to existing VAD methods.
First, in Table 2, VERA achieves the highest AUC among explainable VAD methods on UCF-Crime, outperforming Holmes-VAD and VADor (without IT, as reported in their papers) in a fair comparison. Importantly, unlike these methods, VERA does not need to modify the model parameters, demonstrating its suitability for directly adapting VLMs to the VAD task with minimal training requirements. Moreover, VERA surpasses LAVAD by about 6 percentage points in AUC on UCF-Crime (86.55% vs. 80.28%), uniquely integrating both description and reasoning capabilities in VAD. Compared to non-explainable methods, VERA achieves AUC performance that is comparable to one of the top-performing methods, CLIP-TSA, on UCF-Crime, while offering the additional advantage of explainable predictions.\nTable 2. AUC (%) on UCF-Crime. No IT is used for Holmes-VAD and VADor.\nMethod | AUC\nNon-explainable VAD methods |\nWu et al. [42] | 82.44\nOVVAD [43] | 86.40\nS3R [41] | 85.99\nRTFM [36] | 84.30\nMSL [19] | 85.62\nMGFN [6] | 86.98\nSSRL [17] | 87.43\nCLIP-TSA [15] | 87.58\nSultani et al. [32] | 77.92\nGCL [51] | 79.84\nGCN [57] | 82.12\nMIST [9] | 82.30\nCLAWS [50] | 83.03\nDYANNET [35] | 84.50\nTur et al. [37] | 66.85\nGODS [38] | 70.46\nExplainable VAD methods |\nLAVAD [52] | 80.28\nHolmes-VAD [55] | 84.61\nVADor [26] | 85.90\nZS CLIP [52] | 53.16\nZS IMAGEBIND-I [52] | 53.65\nZS IMAGEBIND-V [52] | 55.78\nLLAVA-1.5 [20] | 72.84\nVERA | 86.55\nSimilar advantages are also observed in Table 3 for XD-Violence. Considering multiple factors, including performance, training efficiency, system integration, and explainability, VERA stands out as a promising pipeline for VLMs in VAD.\nTable 3. AUC (%) on XD-Violence.\nMethod | AUC\nNon-explainable VAD methods |\nHasan et al. [13] | 50.32\nLu et al. [25] | 53.56\nBODS [38] | 57.32\nGODS [38] | 61.56\nRareAnom [34] | 68.33\nExplainable VAD methods |\nLAVAD [52] | 85.36\nZS CLIP [52] | 38.21\nZS IMAGEBIND-I [52] | 58.81\nZS IMAGEBIND-V [52] | 55.06\nLLAVA-1.5 [20] | 79.62\nVERA | 88.26\n4.3. Ablation Studies # We perform ablation studies on UCF-Crime to answer both Q2 and Q3 for a comprehensive evaluation of our design choices.\nFrame Sampling Strategy in Training. We compare three frame sampling strategies for obtaining each Ṽ^(j) in training: uniform sampling, random sampling, and TSN sampling (random sampling from equally divided segments).\nTable 4 shows that uniform sampling performs the best (with batch size n = 2 and S = 8). This is because uniform sampling preserves the temporal structure and maintains consistent motion patterns throughout the long video, making it easier for VLMs to understand the video and update Q.\nTable 4. Sampling strategies explored in VERA training.\nStrategy | AUC (%)\nRandom [3] | 83.63\nTSN [39] | 82.63\nUniform [54] | 86.55\nTable 5. The way we obtain guiding questions affects AUC substantially.\nQuestion Type | AUC (%)\nNo questions | 78.81\nManually written questions by human | 81.15\nLearned questions w/o iteratively inputting V_batch in Eq. (2) | 78.06\nIteratively learned questions (used in VERA) | 86.55\nHow to Obtain Guiding Questions Q for VLM. As seen in Table 5, if the guiding questions are not incorporated into the VLM prompt, the AUC drops sharply to 78.81%, confirming the need to use simpler and more focused questions to provoke reasoning in the VLMs for VAD. Meanwhile, if we use manually written questions (Q_0), the performance is suboptimal with an 81.15% AUC, which shows the need to use VL to find guiding questions. Lastly, if we only input batched predictions Ŷ_batch and ground truths Y_batch without inputting V_batch in the optimizer, the Q updated in this way degrades the VLM and yields a low AUC (78.06%). Thus, inputting video frames as Eq. (2) does is necessary to learn a good Q.\nNumber of Questions m. As shown in Fig. 
4, when m is set to 1, the reasoning is limited to a single perspective, resulting in a lower AUC. As m increases up to 5, the model captures more comprehensive anomaly patterns, leading to improved AUC. However, increasing m beyond 5 yields no significant gains. Therefore, we set m to 5 by default in VERA, if not otherwise specified.\nFigure 4. Effect of the number of guiding questions on AUC.\nTable 6. Ablation study of each step in VERA\u0026rsquo;s inference.\nOperation | AUC (%)\nInitial (Step 1) | 76.10\nInitial + Retrieval (Step 2) | 84.53 (+8.43)\nInitial + Retrieval + Smoothing (Step 3) | 85.48 (+0.95)\nInitial + Retrieval + Smoothing + Weighting (Step 3) | 86.55 (+1.07)\nCoarse-to-Fine Anomaly Score Computation. We also validate the anomaly score computation by VERA. Table 6 shows that the AUC is 76.10% when using the flattened initial score obtained in Step 1, and that leveraging retrieved segments in Step 2 significantly boosts the AUC to 84.53%, highlighting the effectiveness of incorporating ensemble scores based on scene context. Meanwhile, smoothing and weighting in Step 3 each further improve the AUC by around 1%, verifying the benefit of integrating temporal context.\nGeneralizability Test. We further examine the generalizability of VERA across different model sizes, VLM architectures, and datasets to address Q3.\nFirst, we apply VERA to InternVL2-40B, a larger model in the InternVL2 family compared to InternVL2-8B. As shown in Table 7, InternVL2-40B achieves effective AUC performance, slightly exceeding that of InternVL2-8B, indicating that VL in VERA enables models of various scales to identify a Q suitable for their reasoning capabilities. We also evaluate the transferability of Q across different scales and observe an interesting phenomenon: the Q learned by InternVL2-8B remains effective for InternVL2-40B, but not vice versa. This is likely because the Q learned by the smaller model is readily interpretable by the larger model, whereas the Q derived from the larger model is more complex in syntactic structure and does not align well with the reasoning framework of the smaller model. Secondly, we select a different VLM, Qwen2-VL-7B [40], as the backbone for VERA. As shown in Table 8, while the AUC achieved with Qwen2-VL-7B is lower than that with InternVL2-8B, the VL in VERA remains effective, allowing it to outperform notable baselines such as LAVAD [52]. However, a notable gap exists when transferring Q across different model architectures in Table 8. Developing a universal Q that can effectively elicit reasoning capabilities across various VLM structures would be a promising direction for future research. Lastly, we observe that the transferability of Q depends on the training dataset. From Table 9, we observe that transferring Q learned from UCF-Crime to XD-Violence results in a smaller performance drop compared to the reverse case. This suggests the source dataset is crucial to the transferability of Q across datasets.\nTable 7. AUC (%) across model sizes.\nf_VLM | Q from InternVL2-8B | Q from InternVL2-40B\nInternVL2-8B | 86.55 | 80.43\nInternVL2-40B | 85.24 | 86.72\nTable 8. AUC (%) across architectures.\nf_VLM | Q from InternVL2-8B | Q from Qwen2-VL-7B\nInternVL2-8B | 86.55 | 81.37\nQwen2-VL-7B | 79.60 | 82.64\nTable 9. AUC (%) across datasets.\nDataset | Q from UCF-Crime | Q from XD-Violence\nUCF-Crime | 86.55 | 88.26\nXD-Violence | 86.26 | 88.26\n4.4. 
Qualitative Results and Case Studies # W.l.o.g., we take one video on UCF-Crime to illustrate qualitatively the explainability brought by the learned Q* (on UCF-Crime, Q* is \u0026ldquo;1. Are there any people in the video who are not in their typical positions or engaging in activities that are not consistent with their usual behavior? 2. Are there any vehicles in the video that are not in their typical positions or being used in a way that is not consistent with their usual function? 3. Are there any objects in the video that are not in their typical positions or being used in a way that is not consistent with their usual function? 4. Is there any visible damage or unusual movement in the video that indicates an anomaly? 5. Are there any unusual sounds or noises in the video that suggest an anomaly?\u0026rdquo;).\nFigure 5. Given Q* by VERA, the frozen VLM (InternVL2-8B) will reason and explain the scene based on it. For illustration, we take as an example the video \u0026ldquo;Arrest007_x264\u0026rdquo; from UCF-Crime and include 6 scenes here. The complete anomaly scores are shown in Fig. 8.\nAs shown in Fig. 5, the main anomaly in this video is that a man tries to steal money from the washing machines in a laundromat and is arrested after being found by the police. 
In the selected 6 main video segments, the frozen VLM with VERA\u0026rsquo;s learned questions is able to explain the scene by closely following the detailed anomaly characterization of the five learned guiding questions. W.l.o.g., we take the first 3 segments in Fig. 5 as an example to closely compare the caption quality with LAVAD, a representative baseline. As shown in Fig. 6, VERA\u0026rsquo;s captions include both precise descriptions (bold text) and reasoning (text in purple) about anomalies, while LAVAD\u0026rsquo;s captions contain only plain descriptions. This difference is due to VERA\u0026rsquo;s learned guiding questions, which transform the VLM\u0026rsquo;s thinking and phrasing.\nA more interesting advantage of VERA is that it allows humans to further interact with VLMs because it retains the general question-answering ability of pretrained VLMs. This is because VERA does not require finetuning of the VLM backbone weights. Although finetuning VLMs with parameter-efficient methods like [14, 24, 29] is easy and computationally tractable, instruction-tuned models still inevitably lose the flexibility to handle general questions (due to catastrophic forgetting), as they are trained to respond to certain queries with fixed answer styles. In contrast, as shown in Fig. 7, the learned Q* can steer reasoning in a frozen VLM while allowing it to flexibly answer open-ended (like follow-up or counterfactual) questions, which is an important ability lost in IT-based models.\nFigure 6. Qualitative comparison between VERA and LAVAD.\nFigure 7. VERA can take open-ended questions and interact with humans.\nMoreover, as shown in Fig. 8, owing to the proposed coarse-to-fine anomaly scoring, the anomaly score dynamics from VERA closely track the actual real-time anomaly level in this video and gradually increase to nearly 1 when the man is being arrested. This result verifies that VERA allows VLMs to effectively identify anomalies with a holistic model, reducing the manpower and computational overhead for VAD.\nFigure 8. Anomaly scores generated by VERA (with InternVL2-8B) in \u0026ldquo;Arrest007_x264\u0026rdquo; from UCF-Crime.\n5. Concluding Remarks # We propose a novel pipeline, VERA, which can effectively elicit the reasoning ability from VLMs to perform explainable VAD without additional computation overhead. This is done through an effective and novel application of verbalized machine learning [45] to VLMs. In training, VERA obtains the guiding questions detailing anomaly patterns through the verbal interaction between the learner and the optimizer agents. In inference, VERA uses them to enhance VLMs to identify anomalies and compute frame-level anomaly scores in a coarse-to-fine process. Experimental results validate the effectiveness of the VERA framework in achieving state-of-the-art explainable VAD performance.\nReferences # [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1\n[2] Daniel Bogdoll, Maximilian Nitsche, and J Marius Zöllner. Anomaly detection in autonomous driving: A survey. In CVPR Workshops, 2022. 1\n[3] Meinardus Boris, Batra Anil, Rohrbach Anna, and Rohrbach Marcus. The surprising effectiveness of multimodal large language models for video moment retrieval. arXiv preprint arXiv:2406.18113, 2024. 7\n[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. 
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020. 1\n[5] Junxi Chen, Liang Li, Li Su, Zheng-jun Zha, and Qingming Huang. Prompt-enhanced multiple instance learning for weakly supervised video anomaly detection. In CVPR, 2024. 3\n[6] Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. In AAAI, 2023. 6 , 14\n[7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024. 1 , 2 , 6\n[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 1\n[9] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist: Multiple instance self-training framework for video anomaly detection. In CVPR, 2021. 6\n[10] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In CVPR, 2023. 5 , 15\n[11] Rafael C Gonzalez. Digital image processing. Pearson education india, 2009. 5\n[12] Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu. Regiongpt: Towards region understanding vision language model. In CVPR, 2024. 3\n[13] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In CVPR, 2016. 
6 , 11\n[14] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2021. 8\n[15] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In ICIP, 2023. 6 , 14\n[16] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 4\n[17] Guoqiu Li, Guanxiong Cai, Xingyu Zeng, and Rui Zhao. Scale-aware spatio-temporal relation learning for video anomaly detection. In ECCV, 2022. 6\n[18] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 1\n[19] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In AAAI, 2022. 6 , 14\n[20] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024. 6 , 14\n[21] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2024. 1 , 3\n[22] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. In NeurIPS, 2017. 5\n[23] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In CVPR, 2018. 3\n[24] Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, et al. Parameter-efficient orthogonal finetuning via butterfly factorization. In ICLR, 2024. 8\n[25] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In ICCV, 2013. 3 , 6\n[26] Hui Lv and Qianru Sun. Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702, 2024. 
1 , 2 , 3 , 6\n[27] Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, and Hanwang Zhang. Unbiased multiple instance learning for weakly supervised video anomaly detection. In CVPR, 2023. 3\n[28] Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. In ICCV, 2023. 3\n[29] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. In NeurIPS, 2023. 8\n[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 18\n[31] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. In CVPR, 2022. 1\n[32] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In CVPR, 2018. 2 , 3 , 6 , 14\n[33] Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and Yingcong Chen. Hawk: Learning to understand open-world video anomalies. Advances in Neural Information Processing Systems, 37:139751–139785, 2024. 3\n[34] Kamalakar Vijay Thakare, Debi Prosad Dogra, Heeseung Choi, Haksub Kim, and Ig-Jae Kim. Rareanom: A benchmark video dataset for rare type anomalies. Pattern Recognition, 140:109567, 2023. 6\n[35] Kamalakar Vijay Thakare, Yash Raghuwanshi, Debi Prosad Dogra, Heeseung Choi, and Ig-Jae Kim. Dyannet: A scene dynamicity guided self-trained video anomaly detection network. In WACV, 2023. 6\n[36] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In ICCV, 2021. 
6 , 14\n[37] Anil Osman Tur, Nicola Dall\u0026rsquo;Asen, Cigdem Beyan, and Elisa Ricci. Unsupervised video anomaly detection with diffusion models conditioned on compact motion representations. In International Conference on Image Analysis and Processing, 2023. 3 , 6\n[38] Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In ICCV, 2019. 3 , 6\n[39] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016. 7\n[40] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model\u0026rsquo;s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 6 , 7\n[41] Jhih-Ciang Wu, He-Yen Hsieh, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. Self-supervised sparse representation for video anomaly detection. In ECCV, 2022. 6 , 14\n[42] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In ECCV, 2020. 2 , 3 , 6 , 14\n[43] Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open-vocabulary video anomaly detection. In CVPR, pages 18297–18307, 2024. 6 , 11 , 14\n[44] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6074–6082, 2024. 3\n[45] Tim Z Xiao, Robert Bamler, Bernhard Schölkopf, and Weiyang Liu. Verbalized machine learning: Revisiting machine learning with language models. 
arXiv preprint arXiv:2406.04344, 2024. 3 , 4 , 8\n[46] Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. Follow the rules: reasoning for video anomaly detection with large language models. arXiv preprint arXiv:2407.10299, 2024. 1 , 3 , 6\n[47] Zhiwei Yang, Jing Liu, and Peng Wu. Text prompt with normality guidance for weakly supervised video anomaly detection. In CVPR, 2024. 3\n[48] Muchao Ye, Xiaojiang Peng, Weihao Gan, Wei Wu, and Yu Qiao. Anopcn: Video anomaly detection via deep predictive coding network. In ACM international conference on multimedia, 2019. 3\n[49] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic \u0026ldquo;differentiation\u0026rdquo; via text. arXiv preprint arXiv:2406.07496, 2024. 3\n[50] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In ECCV, 2020. 6\n[51] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection. In CVPR, 2022. 6\n[52] Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly detection. In CVPR , 2024. 1 , 3 , 5 , 6 , 7 , 14 , 15\n[53] Chen Zhang, Guorong Li, Yuankai Qi, Shuhui Wang, Laiyun Qing, Qingming Huang, and Ming-Hsuan Yang. Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection. In CVPR, 2023. 3\n[54] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. In EMNLP, 2023. 3 , 7\n[55] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, and Nong Sang. 
Holmes-vad: Towards unbiased and explainable video anomaly detection via multi-modal llm. arXiv preprint arXiv:2406.12235, 2024. 1 , 2 , 3 , 6 , 14\n[56] Menghao Zhang, Jingyu Wang, Qi Qi, Haifeng Sun, Zirui Zhuang, Pengfei Ren, Ruilong Ma, and Jianxin Liao. Multiscale video anomaly detection by multi-grained spatiotemporal representation learning. In CVPR, 2024. 3\n[57] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In CVPR, 2019. 6\n[58] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024. 1\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/ye_vera_explainable_video_anomaly_detection_via_verbalized_learning_of_vision-language_cvpr_2025_paper/","section":"Papers","summary":"Introduces VERA, a framework that enables frozen vision-language models to perform explainable video anomaly detection by learning detailed anomaly-characterization questions from coarsely labeled data, without model parameter modifications. The method decomposes complex reasoning into reflections on guiding questions, optimizes them via verbal interactions, and guides VLMs to generate segment- and frame-level anomaly scores with improved explainability and performance.","title":"VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models","type":"method"},{"content":" Video Anomaly Detection and Explanation via Large Language Models # Hui Lv 1 , Qianru Sun 1*\n1 Singapore Management University\n1 {huilyu, qianrusun}@smu.edu.sg\nFigure 1. Prediction scores from a baseline VAD model, and clip descriptions by using VLLMs, for a car accident video (as shown in the middle of the figure). On the score curve, the red dashed lines denote anomaly thresholds. 
The bottom shows the answers from Video-LLaMA [24] by feeding it with two pairs of video clips and questions, respectively: {Green: a normal video clip, \u0026ldquo;Is there any anomaly in the video?\u0026rdquo;} and {Orange: an abnormal video clip, \u0026ldquo;Is there a car accident? If so, is it an anomaly?\u0026rdquo;}\n*Corresponding author\nAbstract # Video Anomaly Detection (VAD) aims to localize abnormal events on the timeline of long-range surveillance videos. Anomaly-scoring-based methods have been prevailing for years but suffer from the high complexity of thresholding and low explainability of detection results. In this paper, we conduct pioneering research on equipping video-based large language models (VLLMs) in the framework of VAD, making the VAD model free from thresholds and able to explain the reasons for the detected anomalies. We introduce a novel network module Long-Term Context (LTC) to mitigate the incapability of VLLMs in long-range context modeling. We design a three-phase training method to improve the efficiency of fine-tuning VLLMs by substantially minimizing the requirements for VAD data and lowering the costs of annotating instruction-tuning data. Our trained model achieves the top performance on the anomaly videos of the UCF-Crime and TAD benchmarks, with the AUC improvements of +3.86% and +4.96%, respectively. More impressively, our approach can provide textual explanations for detected anomalies. Our code is in the Appendix.\n1. Introduction # Video Anomaly Detection (VAD) aims to identify unexpected events in video sequences. It has practical applications that span a multitude of fields including intelligent manufacturing [6], traffic surveillance [7, 14] and public security [16, 18]. Conventional VAD methods [13, 14, 18, 19, 24, 28] are designed to predict anomaly scores along the timeline of the video, i.e., one anomaly score for each video frame. A higher score indicates a higher possibility that the frame is abnormal. These anomaly-score-based designs are simple to implement but remain far away from the ideal agents of VAD, which should be both automatic (i.e., free from manually-selected thresholds) and explainable (i.e., being able to explain why an event is abnormal).\nFirstly, it is not intuitive how to determine the optimal threshold given diverse video content as well as abnormal events. For example, as depicted on the top of Figure 1, using different thresholds on the prediction results (scores) of the VAD model yields different detection outcomes. Secondly, with a carefully selected threshold, anomalies are localized along the timeline based on only scores, and these scores provide little information for users to comprehend the contexts or ascertain the reasons behind the anomalies. In this paper, we are interested in the VAD model not merely to automatically identify anomalies but also to provide comprehensive textual explanations. We incorporate Video-based Large Language Models (VLLMs), Video-LLaMA [24] in our case, into the framework of VAD, and call the method VAD-LLaMA. In the following, we elaborate on the challenges and our solutions.\nWell-trained VLLMs (such as Video-ChatGPT [15], Videochat [9], and Video-LLaMA [24]) can generate detailed captions for any input video. However, there is a discrepancy between VLLMs\u0026rsquo; and humans\u0026rsquo; understanding of anomalies. As illustrated in Figure 1, Video-LLaMA [24] identifies several irrelevant objects in the scene as \u0026ldquo;anomalies\u0026rdquo;, while overlooking the car accident, the real anomaly humans care about. To solve this issue, we propose a novel Video Anomaly Detector (VADor) equipped with the modules from Video-LLaMA [24], and a new method for co-training VADor and VLLMs without needing a large amount of domain-specific data and labels.\nThis co-training poses two key challenges. The first challenge is that open-sourced VLLMs lack long-range context modeling ability. They are mostly trained on short videos with simple contexts, but the videos of VAD exhibit high context complexities. For example, WebVid [1], commonly used for fine-tuning VLLMs, features an average video length of only 18 seconds, notably shorter than the average length of 240 seconds in the VAD dataset UCF-Crime [18]. In long videos, anomalies depend heavily on long-range video contexts. For example, identifying a burglary event requires consideration of preceding activities, such as the breaking of windows or doors, even if the event itself only displays moving valuables outside. The second challenge is the lack of VAD data and labels. The widely-used VAD dataset, UCF-Crime [18], is used for weakly-supervised VAD, as it offers only video-level anomaly annotations, i.e., given a video, it has only a one-hot label indicating normal or abnormal. Therefore, it is not intuitive how to generate text-based instruction data to fine-tune VLLMs. Besides, VAD datasets have a small scale, e.g., UCF-Crime contains 1.9K training videos, significantly smaller than the VLLM training datasets such as WebVid [1] containing 10M videos. The fine-tuning of VLLMs on VAD datasets is thus challenging.\nTo tackle the first challenge, we introduce a novel Long-Term Context (LTC) module in the VADor. The key idea is to integrate the long-term normal/abnormal contexts into the video representation. First, we split a video into multiple clips, and use the video encoder (VE) of Video-LLaMA to extract the features of each clip. Taking the features as input, VADor can output an anomaly score for each clip. 
Based on the lowest (highest) K anomaly scores, we pick corresponding clip features and stack them into a normal (abnormal) list. The generation of these two lists is implemented as an online operation for each video: every new clip is immediately evaluated based on its anomaly score to decide whether to update the lists. Given the \u0026ldquo;raw\u0026rdquo; features of the next clip, we integrate the current lists of normal and abnormal features by cross-attention and weighted-sum operations (see Sec. 3.2); this is how we integrate the long-term contexts of the video into the video representation.\nTo resolve the second challenge, we propose a three-phase training method. The first phase is to train a baseline VADor, based on which we can easily form a new VAD dataset with each frame \u0026ldquo;annotated\u0026rdquo; by an anomaly score. In the second phase, we co-train VADor and the proposed LTC on the above dataset. The primary objective here is to incorporate long-term contextual understanding into the LTC and then use it to enhance the video representation of Video-LLaMA in the final phase. The final phase is to fine-tune Video-LLaMA. Based on the above dataset, we manually compose simple textual templates (showcased in Sec. 3.1) to generate instruction-tuning data, and then use the data to train only the projection layer of Video-LLaMA. We avoid fine-tuning the entire Video-LLaMA due to the limited scale of the VAD dataset. Moreover, to prevent overfitting on VAD videos, we incorporate a diverse training sample set, drawing from both the UCF-Crime and WebVid datasets. The latter has been instrumental in the pre-training of Video-LLaMA. Our method enhances the efficiency of training VAD-LLaMA by substantially minimizing the requirements for VAD data and lowering the costs of creating instruction-tuning data. 
During testing, VAD-LLaMA is capable of not only identifying anomalies in the input video but also outputting textual explanations of why they are abnormal.\nOur contributions are thus three-fold. 1) A new approach called VAD-LLaMA that introduces VLLMs for tackling the task of VAD. 2) A novel LTC module that enhances the long video representation ability of existing VLLMs. 3) A novel three-phase training method for the proposed VAD-LLaMA, which resolves the issues of lacking VAD data and instruction-tuning data.\n2. Related Work # Video anomaly detection (VAD) has been a prominent research area with diverse real-life applications. However, it remains a challenging task primarily due to the scarcity of anomalous data and labels. Consequently, researchers often turn to Weakly Supervised Video Anomaly Detection (WSVAD) methods to address the VAD problem. These approaches make use of both normal and abnormal training data, relying on weak annotations provided only at the video level [18]. Multiple instance learning (MIL) is the mainstream paradigm that uses video-level labels for training snippet-level anomaly detectors [5, 10, 18–20, 22, 25, 29]. Generally, they embrace the two-stage anomaly detection pipeline, which performs anomaly detection upon pre-extracted features. In particular, Zhong et al. [27] considered the WSVAD task as supervised learning under noisy labels and designed an alternate training procedure to enhance the discrimination of action classifiers. Lv et al. [14] focused on anomaly localization and proposed a higher-order context model as well as a margin-based MIL loss. Li et al. [10] proposed multiple sequence learning, where consecutive snippets with high anomaly scores are selected in MIL learning. More recently, Lv et al. [13] proposed an unbiased MIL framework for removing the context bias, and integrated feature representation fine-tuning and anomaly detector learning into an end-to-end training fashion. 
In this paper, we follow the end-to-end manner to tackle the WSVAD problem and make the first effort to introduce VLLMs into VAD, endowing the VAD model with the ability of anomaly description.\nVideo-based large language models (VLLMs) have demonstrated remarkable language understanding and reasoning abilities, thanks to the ongoing research efforts in exploring the use of LLMs for processing multi-modal problems [4, 9]. Bain et al. [1] introduced WebVid, a large-scale dataset of short videos with textual descriptions sourced from stock footage sites. Based on it, Li et al. [9] improved image encoders, enabling large models to understand visual content in videos. Su et al. [17] utilized multi-modal encoders to enable large models to understand six modalities. Zhang et al. [24] trained fundamental models to comprehend both the visual and auditory content in videos. In this work, we focus on the visual modality in videos, since most videos in VAD are collected from road surveillance, which lacks audio signals. By integrating the designed VADor with the VLLMs, we propose a novel approach, VAD-LLaMA, which is able to not only detect the anomalies but also explain the details of the anomalies.\n3. Method # Our VAD-LLaMA architecture is illustrated in Figure 2 and its training method is shown in Figure 3. It aims to adapt the general video representation knowledge of a pretrained large video-language model, Video-LLaMA [24], to tackle VAD tasks. Below, we first elaborate on its network architecture in Sec. 3.1. Then, we delve into the specifics of the three-phase training method in Sec. 3.2.\n3.1. Model Architecture # Overview. As depicted in Figure 2, VAD-LLaMA mainly consists of a new VADor, and two pre-trained modules (VE and LLaMA) from Video-LLaMA [24]. The VADor is built upon the VE and includes a novel LTC module and a simple Anomaly Predictor (AP) g consisting of two fully-connected (fc) layers. 
Besides, VAD-LLaMA learns an adaptor f between the VADor and the LLaMA to align their feature distributions.\nVE and Feature Extraction. Given a video sequence, we first divide it into m segments. For each segment, we randomly sample a video clip (consecutive frames), and feed it into the pre-trained VE to extract clip-level features. We denote x_i, i ∈ {1, ..., m}, as the VE feature of the i-th clip. In this work, the adopted VE from Video-LLaMA [24] consists of an image encoder (BLIP-2 [8]) and a Video-Qformer, sharing the same architecture with Query Transformer [8]. The image encoder includes a ViT-G/14 from EVA-CLIP [3] and an image-level Query Transformer. As aforementioned, these VE features lack long-term context information, as the VE was pre-trained mainly with short and normal videos.\nFigure 2. The network architecture of the proposed VAD-LLaMA. It consists of a Video Anomaly Detector (VADor) with the Long-Term Context (LTC) module and a simple Anomaly Predictor (AP), a projection layer (called Adaptor), and the pre-trained Video-LLaMA [24] (composed of a Video Encoder (VE) and a LLaMA). The training of VAD-LLaMA is decomposed into three phases, and the trainable and frozen modules vary among different training phases. Training phases are given in Figure 3.\nLong-Term Context (LTC) Module. The LTC module is proposed to solve the above challenge. Specifically, we collect the clip-level VE features with the K lowest (highest) anomaly scores and stack them into a normal (abnormal) list. We denote the normal list as N = {n_j}_{j=1}^{K}, and the abnormal list as A = {a_j}_{j=1}^{K}. These two lists are updated online: every new clip is immediately evaluated based on its anomaly score to decide whether the lists should be updated. In addition, we introduce the cross-attention mechanism in the LTC module for integrating the two lists\u0026rsquo; information into the VE features. 
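The online maintenance of the normal and abnormal lists described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the class name `ContextLists`, its method names, and the tie-breaking by arrival order are hypothetical, and clip features are represented by plain strings for readability.

```python
import heapq


class ContextLists:
    '''Keeps the K lowest-scoring (normal) and K highest-scoring (abnormal) clips seen so far.

    Hypothetical helper for illustration; names and tie-breaking are assumptions.
    '''

    def __init__(self, k):
        self.k = k
        self._seen = 0        # arrival counter, used only to break score ties
        self._normal = []     # min-heap on -score: root is the highest score kept
        self._abnormal = []   # min-heap on score: root is the lowest score kept

    def update(self, score, feature):
        '''Offer one new clip; each list keeps it only if it ranks among its K best.'''
        self._seen += 1
        self._offer(self._normal, (-score, self._seen, feature))
        self._offer(self._abnormal, (score, self._seen, feature))

    def _offer(self, heap, entry):
        if len(heap) < self.k:
            heapq.heappush(heap, entry)
        else:
            # Push the new entry, then drop whichever entry now ranks worst.
            heapq.heappushpop(heap, entry)

    def normal(self):
        '''Features with the K lowest scores, lowest score first.'''
        return [f for _, _, f in sorted(self._normal, reverse=True)]

    def abnormal(self):
        '''Features with the K highest scores, highest score first.'''
        return [f for _, _, f in sorted(self._abnormal, reverse=True)]


lists = ContextLists(k=2)
for score, clip in [(0.1, 'c0'), (0.9, 'c1'), (0.5, 'c2'), (0.05, 'c3'), (0.7, 'c4')]:
    lists.update(score, clip)
print(lists.normal())    # ['c3', 'c0']
print(lists.abnormal())  # ['c1', 'c4']
```

Each update costs O(log K), so the lists can be refreshed for every incoming clip without re-sorting the whole history.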
Output features of the LTC module are not only taken as inputs into the AP, but also stacked with the VE features to serve as the visual prompts (input embeddings) of LLaMA. Based on the LTC-enhanced features, we are able to derive a more robust VADor and also provide comprehensive video contexts for LLaMA.\nFeature Adaptor. In VAD-LLaMA, the Adaptor f (one fc layer) is added to convert the visual prompts into the same dimension as the inputs of LLaMA and align the visual feature distributions with the pre-trained LLaMA.\nFigure 3. The training of VAD-LLaMA consists of three phases: 1) VAD baseline training, 2) VAD co-training with LTC, and 3) Instruction-tuning Adaptor. In the LTC module, N and A represent the long-term normal and abnormal feature lists, respectively. The red arrow denotes the generation process from anomaly scores to pseudo instructions with text templates.\nLLaMA. In this work, we adopt the LLaMA of version vicuna-7b [26]. It\u0026rsquo;s important to highlight that the fine-tuning of LLMs is based on the instruction-tuning data [1]. Typically, this data is comprised of video instruction pairs, where each pair includes a textual instruction corresponding to the content of the accompanying video. These instructions are often generated using simple templates, commonly in a question-answer format. Here\u0026rsquo;s an illustrative example with the underlined part generated as a pseudo instruction:\nQuestion: ### Human: \u0026lt;Video\u0026gt; [Video Tokens] \u0026lt;/video\u0026gt; [Video Description] Is there any anomaly in the video?\nAnswer: ### Assistant: Yes, there are anomalies from 1.5s to 2s.\nIn the Question, [Video Tokens] denotes the tokens (places) for inserting visual prompts. [Video Description] contains simple video clip details, e.g., video length and frame sample rate. 
During the instruction-tuning of VAD-LLaMA, the Question is first transformed into textual embeddings with a pre-trained LLM (vicuna-7b [26]) and then concatenated with visual prompts to serve as the inputs of LLaMA. Later, the textual embeddings transformed from the Answer are utilized as the \u0026ldquo;ground truth\u0026rdquo; of LLaMA\u0026rsquo;s generation.\n3.2. Model Training # The training pipeline of VAD-LLaMA is outlined in Figure 3. We implement a three-phase approach. In the first phase, clip-level VE features are input into the VADor to establish a baseline for predicting initial anomaly scores. In the second phase, these preliminary scores facilitate the aggregation of representative normal and abnormal features within the LTC module. This module is co-trained with the VADor to merge long-term contextual information into the process of representation learning and anomaly detection. In the third phase, we refine our VAD-LLaMA model by exclusively training the feature adaptor (the projection layer), utilizing the robust features produced by the VE and LTC modules. These features impart a broad understanding of general video content and specific guidance for anomaly detection, which are integral to the instruction-tuning process. Concurrently, the improved anomaly scores, ascertained through the VADor in conjunction with the LTC module, are transformed into pseudo instructions. These are then amalgamated with straightforward text templates, serving as the instruction-tuning data for LLaMA. The details of the training phases are given below.\nPhase 1: Training VADor. In this phase, we train a simple VADor baseline by directly passing the VE features through AP g, as shown in the left of Figure 3. Facing the scarcity of anomalous data and labels in VAD, many researchers opt to address the VAD problem in a Weakly Supervised Video Anomaly Detection (WSVAD) framework. 
In this setting, each training video is annotated with a binary anomaly label y ∈ {0, 1}, denoting whether it is categorized as normal or abnormal. This allows for training VAD models without the need for frame-level annotations of specific anomalous events, making it more feasible in real-world applications. In this work, we adopt the same setting as in WSVAD.\nThe prevailing approach to WSVAD is Multiple Instance Learning (MIL). It aims to train a clip-level AP g based on the VE features {x_i}_{i=1}^{m}. In this process, it treats the most anomalous clip in a normal video (i.e., y = 0) as normal, and the most anomalous clip within an abnormal video (i.e., y = 1) as abnormal. To achieve this, MIL constructs a tuple set S, one tuple for each video, which includes the prediction y′ generated by g on the most anomalous snippet and the corresponding video-level label, denoted as y. This tuple is represented as (y′, y), where y′ is computed as max{g(x_i)}_{i=1}^{m}. The parameters of g are trained by minimizing the binary cross-entropy (BCE) loss:\nL = Σ_{(y′, y) ∈ S} BCE(y′, y)    (1)\nIn this way, for a normal video with y = 0, by minimizing max{g(x_i)}_{i=1}^{m}, g is compelled to assign low abnormal probabilities to all video clips. Conversely, for an abnormal video with y = 1, by maximizing max{g(x_i)}_{i=1}^{m}, g is trained to yield an even higher probability for the most confident abnormal snippet. Following the previous method [13], g directly predicts binary logits for the normal and abnormal probabilities. During the inference of WSVAD, we utilize the abnormal probabilities as anomaly scores to calculate the evaluation metric (AUC).\nPhase 2: Co-Training VADor and LTC. The VADor trained in phase 1 is a MIL-based baseline. It is trained from VE features. 
However, this VE was pre-trained on short videos whose contexts significantly differ from those in the long and complex videos of VAD.\nTo enhance the long-term video representation ability of VADor, we co-train it with a novel LTC module, which is designed to encode the most normal as well as the most abnormal events seen in the input video. As mentioned in Sec. 3.1, the top-K normal and abnormal VE features are collected and stacked into a normal list N and an abnormal list A, respectively. In the forward pass, the top-K selection is based on the preliminary anomaly scores predicted by the VADor baseline: we pick the VE features with the K lowest (highest) scores as the items of list N (A). Moreover, we update the lists online by re-collecting the historical features based on their anomaly scores whenever a new video clip is input into the LTC module.\nBased on the \u0026ldquo;memory\u0026rdquo; stored in the LTC module, the next question is how to incorporate it into the representation (i.e., the VE features) of new video clips. To this end, we introduce the cross-attention mechanism to automatically retrieve contextual features from the LTC lists, based on their relevance to the current VE feature. Specifically, we regard the current VE feature x_i as the query, and the stacked features from the LTC lists N and A are utilized as both the key and the value. Taking the i-th VE feature as an example, the process derives as:\nWe denote the features acquired from Eq. (2) as n_i and a_i, retrieved from N and A, respectively. Here, × denotes the dot product. In detail, the i-th VE feature is first multiplied with the LTC-listed features to generate attention weights, and the relevant features are then retrieved with these weights, i.e., the higher the feature similarity, the higher the attention weight. 
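The retrieval step just described (query from the current clip, keys and values from a context list) can be illustrated with a small pure-Python sketch. It assumes plain unscaled dot-product similarity followed by a softmax, with no learned projections; the exact form of Eq. (2) is not reproduced here, so the function names `softmax`, `dot`, and `retrieve` and these design choices are assumptions for illustration only.

```python
import math


def softmax(values):
    '''Numerically stable softmax over a list of floats.'''
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    '''Dot product of two equal-length vectors.'''
    return sum(a * b for a, b in zip(u, v))


def retrieve(query, memory):
    '''Attend from one clip feature (query) over a context list (memory).

    The memory items act as both keys and values: similarity scores become
    attention weights, and the output is the weighted sum of the items.
    '''
    weights = softmax([dot(query, item) for item in memory])
    dim = len(query)
    return [sum(w * item[d] for w, item in zip(weights, memory)) for d in range(dim)]


x_i = [1.0, 0.0]                        # current VE feature (toy values)
normal_list = [[10.0, 0.0], [0.0, 10.0]]
n_i = retrieve(x_i, normal_list)        # dominated by the first, more similar item
print(n_i)
```

Calling `retrieve` once with the normal list and once with the abnormal list yields the two context features that are later combined with the clip feature.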
After that, we combine these features with the VE feature x_i:\nHere, instead of carefully tuning hyper-parameters to manually select a decent weight tuple, we introduce neural soft weights with parameters w_n, w_a ∈ W to automatically balance the features. Then the feature after cross-attention is fed into the AP, forming a more robust VADor. Later, the VADor and the LTC are co-trained and supervised with the BCE loss as in Eq. (1).\nNote that, to further integrate the short-term historical information that captures the variation of ongoing events, we add a list for storing the past K VE features, represented as H = {h_j}_{j=1}^{K}. These VE features in the short-term history contribute to learning a more comprehensive feature representation and boost the robustness of VADor. In this way, we upgrade the LTC module to an enhanced version, namely the Long-Short-Term Context (LSTC) module. Extensive experiments demonstrate the effectiveness of VADor with the LTC and LSTC modules, as in Sec. 4.5.\nPhase 3: Instruction-Tuning Adaptor. In this phase, we incorporate the VADor with the pre-trained VE and LLaMA from Video-LLaMA [24] by adding an adaptor. Considering the limited training data in VAD, we opt to freeze the large modules (VADor, VE, and LLaMA) and train only the adaptor that aligns the feature distribution of VADor with the LLaMA. Freezing these modules helps to reduce the model\u0026rsquo;s dependence on the scale of the training data. Also, the features from the well-trained VADor provide a comprehensive understanding of general video content and specific guidance for anomaly detection, which are integral to the instruction-tuning process.\nAnomaly Prompt. To seamlessly incorporate video representations and anomaly information into LLaMA, we utilize the LTC feature x̃_i trained in the co-training phase as the anomaly prompt. As illustrated in Figure 2, the anomaly prompt is stacked with the VE feature, resulting in x̂_i = [x_i, x̃_i]. 
Then, we add an adaptor (linear layer) f to project them into the same dimension as the inputs of LLaMA. The output feature embedding f(x̂_i) serves as clip-level soft visual prompts that guide the pre-trained LLaMA to generate text that is conditioned on the visual content and anomaly status of the video clip.\nPseudo-Instruction. There is no frame-level anomaly annotation in WSVAD, so it is not intuitive how to construct temporal instructions for LLaMA. To address this challenge, we propose to convert anomaly scores (output by VADor) into pseudo instructions and manually compose anomaly-related text templates to generate instruction-tuning data. The process is showcased in the red line of Figure 2 and Figure 3.\nTo generate video instruction pairs for VAD data (e.g., UCF-Crime dataset [18]), we start by inserting the visual prompts {x̂_i}_{i=1}^{m} into the textual embeddings Q of the Question. Then, we convert the predicted anomaly score from the VADor into the pseudo instruction as showcased in the Answer, e.g., \u0026ldquo;0.9\u0026rdquo; is transformed into the underlined part of Answer in Sec. 3.1, according to the time duration of the video clip. Hence, for the i-th video clip, the video instruction pair becomes (ŷ′, ŷ), where ŷ′ = [f(x̂_i), Q] stands for the inputs of LLaMA and ŷ denotes the textual embeddings transformed from the pseudo instruction in Answer.\nTo prevent overfitting on VAD videos, we incorporate a diverse training sample set P, drawing from both the UCF-Crime and WebVid datasets. Finally, we train the adaptor with the cross-entropy (CE) loss, which is commonly employed for training LLMs [2] and quantifies the disparity between the text sequence generated by the model and the target text sequence. The formula of CE loss is derived as follows:\nwhere n is the number of embedding tokens in ŷ, ŷ_j is the true label for token j, and LLaMA(ŷ′)_j is the LLaMA-predicted probability for token j. 
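As a rough illustration of the token-level cross-entropy described above, the sketch below computes the negative log-probability that a model assigns to each target token and averages over the n tokens. Averaging (rather than summing) is an assumption, as is the function name `token_cross_entropy`; the per-token distributions are toy values standing in for LLaMA's predicted probabilities.

```python
import math


def token_cross_entropy(predicted_probs, target_ids):
    '''Mean negative log-likelihood of the target tokens.

    predicted_probs[j] is the predicted distribution over the vocabulary at
    position j, and target_ids[j] is the index of the true token there.
    '''
    n = len(target_ids)
    total = 0.0
    for dist, target in zip(predicted_probs, target_ids):
        total -= math.log(dist[target])
    return total / n


# Two-token target over a toy vocabulary of size 2.
probs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
print(token_cross_entropy(probs, targets))
```

The loss reaches zero only when the model puts probability 1 on every target token, and it grows as probability mass moves to other tokens.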
Additionally, we illustrate the detailed pipeline with pseudo-code in the Appendix.\n4. Experiments # 4.1. Datasets and Evaluation Metrics # To verify the performance of our VADor, we conducted extensive experiments and ablations on two standard WSVAD evaluation datasets [14, 18]. As per the standard in WSVAD, the training videos only have video-level labels, and the test videos have frame-level labels. Other details of the experimental setting are given below.\nUCF-Crime [18] is a large-scale dataset comprising 1,900 untrimmed real-world surveillance videos on general scenarios, encompassing both outdoor and indoor environments. The dataset boasts a total duration of 128 hours and includes 13 distinct classes of anomalous events. It is divided according to the standard split, with a training set of 1,610 videos, and a test set of 290 videos.\nTAD dataset [14] features real-world traffic scene videos, with an average of 1,075 frames per video. These videos encompass over seven common road-related anomaly categories. The dataset is split into a training set consisting of 400 videos and a test set comprising 100 videos.\nEvaluation Metrics. Following previous works [13, 18], we adopted the Area Under the Curve (AUC) of the frame-level ROC (Receiver Operating Characteristic) as the main WSVAD evaluation metric for TAD and UCF-Crime. Intuitively, a larger AUC means a larger margin between the normal and abnormal predictions of video clips, suggesting a superior anomaly classifier. Taking inspiration from UMIL [13], our evaluation goes beyond calculating the AUC for the entire test set, denoted as AUC_O: we also compute the AUC specifically for abnormal videos, referred to as AUC_A. This excludes normal videos, in which all clips are normal (label 0), and keeps only the abnormal ones, which contain both kinds of clips (labels 0 and 1). 
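The difference between the two metrics can be made concrete with a small sketch. It computes the frame-level ROC AUC through its pairwise-ranking interpretation (the fraction of abnormal/normal frame pairs ranked correctly, with ties counted as half), once over all test videos and once over abnormal videos only. The function names `roc_auc` and `auc_overall_and_abnormal`, the input layout, and the toy scores are assumptions for illustration, not the evaluation code used in the paper.

```python
def roc_auc(labels, scores):
    '''ROC AUC as the probability that a positive frame outscores a negative one.'''
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one frame of each class')
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_overall_and_abnormal(videos):
    '''videos is a list of (frame_labels, frame_scores) pairs, one per test video.

    Returns (AUC over all frames, AUC over frames of abnormal videos only),
    where a video counts as abnormal if any of its frames is labeled 1.
    '''
    all_labels, all_scores, abn_labels, abn_scores = [], [], [], []
    for labels, scores in videos:
        all_labels.extend(labels)
        all_scores.extend(scores)
        if any(labels):
            abn_labels.extend(labels)
            abn_scores.extend(scores)
    return roc_auc(all_labels, all_scores), roc_auc(abn_labels, abn_scores)


videos = [
    ([0, 0, 0], [0.1, 0.2, 0.8]),   # normal video with one false alarm
    ([0, 1, 1], [0.5, 0.7, 0.9]),   # abnormal video
]
auc_o, auc_a = auc_overall_and_abnormal(videos)
print(auc_o, auc_a)  # 0.875 1.0
```

The quadratic pairwise loop is fine for a toy example; for real test sets a rank-based or library implementation would be the practical choice.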
This selective evaluation truly challenges a classifier\u0026rsquo;s capability to accurately localize and detect anomalies within a mixed context.\n4.2. Implementation Details # Given that VAD videos are predominantly sourced from CCTV surveillance, which typically lacks audio signals, we omit the audio branch while retaining the visual part as the VE for generating video clip features. We also adopted the pre-trained LLaMA from Video-LLaMA [24] to retain general video description knowledge. In our VAD-LLaMA, two fc layers are used to implement the AP g, and another two fc layers are used to balance the features as W in the LTC module, each corresponding to a feature list, i.e., normal or abnormal. All the fc layers in VAD-LLaMA are initialized with random weights and trained to locate and describe the possible anomalies in videos. The length of the LTC lists is uniformly set to 4 by default, according to the ablation study in Sec. 4.5.\nWe trained our model with the AdamW optimizer [11] using an initial learning rate of 1e-5, weight decay of 0.001, and batch size of 8 in training phases 1 and 2. Due to the large memory consumption of LLaMA, the batch size was reduced to 2 during phase 3. We utilized the cosine annealing scheduler and warmed up the learning rate for 5 epochs in each training phase. The VADor baseline was trained with the MIL-based BCE loss for 30 epochs, followed by 30 epochs of co-training with the LTC module. After that, we trained the adaptor for 30,000 iterations with the VADor, VE, and LLaMA frozen. We conducted all experiments on 4 Nvidia L40 GPUs. We implemented the max value scores and max margin scores in Eq. (1) following [13, 14].\n4.3. Quantitative Results # Weakly Supervised Video Anomaly Detection (WSVAD). In Table 1, we compared our VADor with other state-of-the-art (SOTA) methods in WSVAD. On UCF-Crime [18], VADor with LTC achieves the best AUC_O and AUC_A among all the methods, with an improvement of +0.88% and +2.44%, respectively. 
VADor with LTC achieves the second best AUC_O on TAD [14] and significantly outperforms all methods on AUC_A by +3.21%. Moreover, with the introduction of LSTC, we witness a further improvement on the two benchmarks.\nTable 1. WSVAD comparison on UCF-Crime. \u0026ldquo;2-stage\u0026rdquo; and \u0026ldquo;E2E\u0026rdquo; stand for the two-stage pipeline and the end-to-end framework. \u0026ldquo;w\u0026rdquo; and \u0026ldquo;w/o\u0026rdquo; are abbreviations for \u0026ldquo;with\u0026rdquo; and \u0026ldquo;without\u0026rdquo;. AUC_O and AUC_A denote the AUC computed on the overall test set and on only the abnormal test videos, respectively. The best results are in bold, and the second-best results are underlined.\n| Category | Method | AUC_O (%) | AUC_A (%) |\n| --- | --- | --- | --- |\n| 2-stage | Sultani et al. [18] | 75.41 | 54.25 |\n| 2-stage | Zhang et al. [25] | 78.66 | - |\n| 2-stage | Motion-Aware [29] | 79.1 | 62.18 |\n| 2-stage | GCN-Anomaly [27] | 82.12 | 59.02 |\n| 2-stage | Wu et al. [21] | 82.44 | - |\n| 2-stage | RTFM [19] | 84.3 | - |\n| 2-stage | WSAL [14] | 85.38 | 67.38 |\n| 2-stage | ECUPL [23] | 86.22 | - |\n| E2E | UMIL [13] | 86.75 | 68.68 |\n| E2E | VADor w/o LTC | 85.9 | 66.67 |\n| E2E | VADor w LTC | 87.63 | 71.12 |\n| E2E | VADor w LSTC | 88.13 | 72.54 |\nTable 2. WSVAD comparison on TAD benchmark.\n| Category | Method | AUC_O (%) | AUC_A (%) |\n| --- | --- | --- | --- |\n| 2-stage | Sultani et al. [18] | 81.42 | 55.97 |\n| 2-stage | Motion-Aware [29] | 83.08 | 56.89 |\n| 2-stage | GIG [12] | 85.64 | 58.65 |\n| 2-stage | RTFM [19] | 89.61 | - |\n| 2-stage | WSAL [14] | 89.64 | 61.66 |\n| 2-stage | ECUPL [23] | 91.66 | - |\n| E2E | UMIL [13] | 92.93 | 65.82 |\n| E2E | VADor w/o LTC | 85.2 | 58.19 |\n| E2E | VADor w LTC | 90.91 | 69.03 |\n| E2E | VADor w LSTC | 91.77 | 70.78 |\nOverall Observations. 1) Notice that our baseline VADor performs far better than the previous MIL-based two-stage model [18]. This validates the strong video representation power of the pre-trained VE [24]. 2) Moreover, our VADor with the LTC module significantly improves the AUC_A over the VADor baseline (e.g., +4.45% on UCF-Crime and +10.84% on TAD), which demonstrates the effectiveness of incorporating long-range contextual information into the anomaly analysis. Additionally, the incorporation of short-term historical information leads to a further enhancement of the AUC performance. 
3) Our VADor achieves the second best AUC_O results on TAD, mainly because we froze the pre-trained VE, while UMIL [13] fine-tuned the feature backbone with VAD data. The higher AUC_O but lower AUC_A of UMIL demonstrate that UMIL is better at distinguishing the normal clips in normal videos, whereas our VADor achieves a better anomaly localization performance among anomalous videos with a much higher AUC_A (+3.21%).\n4.4. Qualitative Examples # We showcase a video example of \u0026lsquo;Abuse\u0026rsquo; for comparison between our VAD-LLaMA and Video-LLaMA [24] in Figure 4. As observed, the Video-LLaMA, available as an open-source model, struggles to precisely correlate the detected anomaly (Abuse) with specific events in the video, notably the incident where a woman is knocked down. Additional examples of our model are shown in Figure 5. The model effectively identifies anomalies, i.e., \u0026ldquo;Abuse\u0026rdquo; and \u0026ldquo;Car accident\u0026rdquo;, accurately pinpoints their temporal locations, and provides a detailed description of the anomalies. For normal videos, our VAD-LLaMA is able to comprehensively analyze the content of the video and eliminate the possibility of anomalies. Moreover, users are able to engage in multi-turn dialogues pertaining to the video content. More qualitative examples and comparisons between VAD-LLaMA and Video-LLaMA are provided in the Appendix.\nFigure 4. An abuse example for comparison between the VAD-LLaMA and Video-LLaMA. The red boxes in the frames are ground-truth anomalies. The orange boxes are the question from humans. The gray and blue boxes are the answers from the Video-LLaMA and our VAD-LLaMA, respectively. Best viewed in color.\n4.5. Ablation studies # LTC Components. We validate the effectiveness of long-term context modeling in Table 3 with AUC_O. 
By comparing the third and fourth lines with the first line, we observe that the normal (abnormal) features in long-range video context can improve AUC_O from 85.90% to 87.08% (87.45%) on UCF-Crime and from 85.20% to 88.39% (89.08%) on TAD. In the fifth line, with the combination of the normal and abnormal contexts, a further improvement of AUC_O proves the critical role of the long-range video context in robust anomaly mining. In addition, we verify the benefit of short-term historical information in VADor, which boosts the AUC_O to 88.13% on UCF-Crime and 91.77% on TAD. For an independent evaluation of the effectiveness of our VADor, we re-implemented the previous SOTA UMIL [13] using the VE features, denoted as UMIL*. The results are presented in line 2. Our VADor with LTC in line 5 consistently outperforms UMIL* (+0.85% on UCF-Crime and +2.26% on TAD), thereby validating the efficacy of our design, based on the same feature backbone.\nFigure 5. Two qualitative examples of VAD-LLaMA.\nTable 3. Ablation studies of the components in the LTC module. Here, \u0026ldquo;Nor\u0026rdquo; and \u0026ldquo;Abn\u0026rdquo; denote the normal and abnormal list, respectively. \u0026ldquo;His\u0026rdquo; stands for the short-term history list and \u0026ldquo;UMIL\u0026rdquo; is the unbiased term proposed in [13].\n| Baseline | Nor | Abn | His | UMIL | UCF | TAD |\n| --- | --- | --- | --- | --- | --- | --- |\n| ✓ | | | | | 85.9 | 85.2 |\n| ✓ | | | | ✓ | 86.78 | 88.65 |\n| ✓ | ✓ | | | | 87.08 | 88.39 |\n| ✓ | | ✓ | | | 87.45 | 89.08 |\n| ✓ | ✓ | ✓ | | | 87.63 | 90.91 |\n| ✓ | ✓ | ✓ | ✓ | | 88.13 | 91.77 |\nLTC Length. In the LTC, K is employed as the length of the feature lists. Through empirical analysis presented in Table 4, we determine that K = 4 is a suitable choice across the two datasets, and thus, it is the default setting for our experiments. In general, the selection of K hinges on the reliance of anomaly detection on video contexts. For instance, a small K might not capture sufficient temporal information, while a large K could involve unexpected noise.\nClass-wise AUC. 
On the UCF-Crime dataset, each test video is labeled with the class of anomaly, enabling us to analyze models\u0026rsquo; capabilities in detecting subtle abnormal events through class-wise AUC_A comparisons. In Figure 6, we compare VADor with the baseline and UMIL, where \u0026ldquo;Average\u0026rdquo; represents the overall AUC_A, and the remaining bars show the class-wise values.\nOur observations are as follows: 1) Both the VADor baseline and UMIL demonstrate strong performance on anomaly classes characterized by drastic motions, such as \u0026ldquo;Assault\u0026rdquo; and \u0026ldquo;Vandalism\u0026rdquo;. These classes represent intuitive anomalies primarily relying on feature representation learning in a short time duration, given that VE and the backbone of UMIL are sufficient to capture local details in short video clips. 2) However, these methods struggle to distinguish anomalies that depend on long-range temporal analysis, like \u0026ldquo;Arson\u0026rdquo; and \u0026ldquo;Shoplifting\u0026rdquo;. These classes correspond to the hard examples, which are better addressed by the long-term context modeling in our VADor. VADor with the LTC module performs similarly well on the aforementioned intuitive anomaly classes and significantly outperforms the other methods on other anomalies that require comprehensive context modeling. This substantial improvement contributes to the superior anomaly detection performance. Overall, observations 1 and 2 empirically validate the effectiveness of mining long-range video contexts for a more robust anomaly analysis.\nFigure 6. Class-wise AUC_A of three methods on UCF-Crime. Here, \u0026ldquo;VADor\u0026rdquo; stands for our VADor with the LTC module.\nTable 4. Ablation of the LTC Length on UCF-Crime and TAD.\n| LTC length K | 0 | 2 | 4 | 6 | 8 |\n| --- | --- | --- | --- | --- | --- |\n| AUC_O (%) - UCF | 85.9 | 87.49 | 87.63 | 87.27 | 87.18 |\n| AUC_O (%) - TAD | 85.2 | 90.18 | 90.91 | 90.65 | 90.13 |\n5. 
Conclusion # In this work, we introduced VAD-LLaMA, a novel Video Anomaly Detection (VAD) approach that integrates video-based large language models (VLLMs) into the VAD framework, making the VAD model free from thresholds and able to explain the reasons for the detected anomalies. In our model, we introduced a Long-Term Context (LTC) module to mitigate the incapability of existing VLLMs in long-range context modeling. In addition, our three-phase training method significantly improves the efficiency of training VLLMs in specific domains such as VAD by minimizing the requirements for VAD data and reducing the costs of annotating instruction-tuning data. Our approach was empirically validated by the state-of-the-art performance and extensive ablations on standard WSVAD benchmarks. Also, we showcased the anomaly localization and description capability of VAD-LLaMA in multi-turn dialogue based on the video content. In the future, we seek to develop a VAD model with fast adaptation capability that can detect new anomalies based on either a few example clips or textual descriptions of the targeted anomalies.\nReferences # [1] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021.\n[2] Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An empirical study of training end-to-end vision-and-language transformers. In CVPR, 2022.\n[3] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023.\n[4] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. LLaMA-Adapter V2: Parameter-efficient visual instruction model. ArXiv, 2023.\n[5] Chengkun He, Jie Shao, and Jiayu Sun. 
An anomaly-introduced learning method for abnormal event detection. Multimedia Tools and Applications, 2018.\n[6] Zijie Huang and Yulei Wu. A survey on explainable anomaly detection for industrial internet of things. In DSC, 2022.\n[7] Shunsuke Kamijo, Yasuyuki Matsushita, Katsushi Ikeuchi, and Masao Sakauchi. Traffic monitoring and accident detection at intersections. ITITS, 2000.\n[8] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, 2023.\n[9] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. ArXiv, 2023.\n[10] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. AAAI, 2022.\n[11] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019.\n[12] Hui Lv, Chunyan Xu, and Zhen Cui. Global information guided video anomaly detection. In ACM MM, 2020.\n[13] Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, and Hanwang Zhang. Unbiased multiple instance learning for weakly supervised video anomaly detection. In CVPR, 2023.\n[14] Hui Lv, Chuanwei Zhou, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Localizing anomalies from weakly-labeled videos. TIP, 2021.\n[15] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. ArXiv, 2023.\n[16] Sadegh Mohammadi, Alessandro Perina, Hamed Kiani, and Vittorio Murino. Angry crowds: Detecting violent events in videos. In ECCV, 2016.\n[17] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. PandaGPT: One model to instruction-follow them all. ArXiv, 2023.\n[18] Waqas Sultani, Chen Chen, and Mubarak Shah. 
Real-world anomaly detection in surveillance videos. In CVPR, 2018. 1 , 2 , 5 , 6 , 7 [19] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In ICCV, 2021. 1 , 2 , 7 [20] Peng Wu and Jing Liu. Learning causal temporal relation and feature discrimination for anomaly detection. TIP, 2021. 2 [21] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In ECCV, 2020. 7 [22] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In ECCV, 2020. 2 [23] Chen Zhang, Guorong Li, Yuankai Qi, Shuhui Wang, Laiyun Qing, Qingming Huang, and Ming-Hsuan Yang. Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection. In CVPR, 2023. 7 [24] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. ArXiv, 2023. 1 , 2 , 3 , 5 , 6 , 7 [25] Jiangong Zhang, Laiyun Qing, and Jun Miao. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In ICIP, 2019. 2 , 7 [26] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv, 2023. 3 , 4 [27] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In CVPR, 2019. 2 , 7 [28] Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. AAAI, 2023. 1 [29] Yi Zhu and Shawn Newsam. 
Motion-aware feature for improved video anomaly detection. BMVC, 2019. 2 , 7 ","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/video-anomaly-detection-and-explanation-via-large-language-models/","section":"Papers","summary":"The paper introduces VAD-LLaMA, a novel framework integrating video-based large language models (VLLMs) for threshold-free, explainable video anomaly detection, featuring a Long-Term Context (LTC) module and a three-phase training process that enhances long-range context modeling and minimizes data annotation costs.","title":"Video Anomaly Detection and Explanation via Large Language Models","type":"other"},{"content":"Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.\nDigital Object Identifier 10.1109/ACCESS.2017.DOI\nVideo Anomaly Detection in 10 Years: A Survey and Outlook # MOSHIRA ABDALLA 1 , SAJID JAVED 1 , MUAZ AL RADI 1 , ANWAAR ULHAQ 2 AND NAOUFEL WERGHI 1\n1 Department of Computer Science, Khalifa University of Science and Technology, Abu Dhabi, UAE. 2 A. Ulhaq is with the School of Engineering \u0026amp; Technology, Central Queensland University Australia, 400 Kent Street, Sydney 2000, Australia.\nCorresponding author: Sajid Javed (sajid.javed@ku.ac.ae ).\nABSTRACT Video anomaly detection (VAD) holds immense importance across diverse domains such as surveillance, healthcare, and environmental monitoring. While numerous surveys focus on conventional VAD methods, they often lack depth in exploring specific approaches and emerging trends. This survey explores deep learning-based VAD, expanding beyond traditional supervised training paradigms to encompass emerging weakly supervised, self-supervised, and unsupervised approaches. A prominent feature of this review is the investigation of core challenges within the VAD paradigms including largescale datasets, features extraction, learning methods, loss functions, regularization, and anomaly score prediction. 
Moreover, this review also investigates vision-language models (VLMs) as potent feature extractors for VAD. VLMs integrate visual data with textual descriptions or spoken language from videos, enabling a nuanced understanding of scenes crucial for anomaly detection. By addressing these challenges and proposing future research directions, this review aims to foster the development of robust and efficient VAD systems leveraging the capabilities of VLMs for enhanced anomaly detection in complex real-world scenarios. This comprehensive analysis seeks to bridge existing knowledge gaps, provide researchers with valuable insights, and contribute to shaping the future of VAD research.\nINDEX TERMS Reconstruction-based Techniques, Video Anomaly Detection, Vision-Language Models, Video Surveillance, Weak Supervision.\nI. INTRODUCTION # Anomaly detection aims to pinpoint events or patterns that stray from the typical or expected behavior within a given data modality [1], [2]. It holds diverse applications, spanning from fraud detection and cybersecurity for network intrusion detection to quality assurance in manufacturing, fault identification in industrial machinery, healthcare monitoring, and beyond. Our primary emphasis, however, revolves around video anomaly detection (VAD) [3], a pivotal technology that automates the identification of unusual or suspicious activities within video sequences.\nWhile conventional methods for VAD have been extensively studied, the rapid advancements in deep learning techniques have opened up new avenues for more effective anomaly detection [4]. Deep learning algorithms, such as convolutional neural networks (CNNs) and vision transformers (ViTs), have shown remarkable capabilities in learning complex patterns and representations from large-scale data [5], [6]. These advancements have led to significant improvements in VAD performance, enabling more accurate and reliable detection of anomalies in video data. 
By leveraging deep learning techniques, researchers and practitioners have been able to develop innovative approaches that outperform traditional methods and address the limitations associated with traditional ML [7], [8].\nAlongside the deep learning-based VAD systems that rely on supervised learning paradigms, recent years have witnessed a surge in exploring novel approaches, including weakly supervised, self-supervised, and unsupervised methods [9]–[11]. These alternative approaches present promising solutions to the challenges encountered by conventional VAD methods, such as the requirement for extensively annotated datasets and the complexity of capturing intricate spatiotemporal patterns. Through the utilization of deep learning techniques, researchers aspire to cultivate robust and efficient VAD systems adept at navigating diverse real-world scenarios and applications [1], [12]. Similarly, several research challenges are either not thoroughly explored or remain unexplored. For instance, a crucial consideration is the quality and diversity of available datasets. The development of emerging comprehensive datasets, such as the UCF Crime dataset [11], XD-Violence dataset [13], and ShanghaiTech dataset [14], plays a pivotal role in this progress. These datasets encompass various types of anomalies, significantly contributing to the enhancement of VAD techniques.\nIn VAD, videos consist of sequences of frames, resulting in complex, high-dimensional spatiotemporal data. Detecting anomalies within such complex data necessitates methods capable of effectively capturing spatial, temporal, spatiotemporal, and textual features. To address these challenges, numerous VAD methods and deep feature extractors have been introduced in academic research, which has played a significant role in advancing the current state-of-the-art. It is also worth mentioning the importance of choosing a correct loss function based on the type of VAD paradigm. 
Hence, loss functions and regularization techniques are another challenge in the VAD problem. Another significant challenge lies in the diverse paradigms including self-supervised, weakly supervised, fully supervised, and unsupervised models utilized to address the VAD problem.\nThese paradigms can be categorized into classical and deep learning approaches. The classical approaches employ handcrafted features including spatiotemporal gradient [15], Histograms of Oriented Gradients (HOG) features [16], [17], and Histograms of Optical Flows (HOF) [18] within a temporal cuboid. These features are selected for their effectiveness in capturing appearance and motion information in a spatiotemporal context.\nThe deep learning approaches are more powerful in terms of rich feature representation and end-to-end learning and have gained much popularity in the past decade due to advancements in different learning models. Specifically, these approaches leverage powerful feature extractors to capture meaningful spatiotemporal features. Examples include convolutional neural networks [19], autoencoders [20], GANs [21], vision transformers [22], [23], and vision language models [24], [25].\nIn addition to the aforementioned inherent challenges within the VAD problem, each paradigm comes with diverse loss functions, spatiotemporal regularization/constraints, and anomaly score prediction components. We also identify these different modules in this survey and provide take-home messages for each of the abovementioned challenges. Figure 1 shows the performance variation of representative deep learning-based VAD methods on two publicly available benchmark datasets including UCF-Crime [11] and ShanghaiTech [14]. The performance is reported in terms of the area under the ROC curve (AUC%). 
The performance improvement trend in this figure underscores a significant advancement in deep learning methodologies over the past decade, with the most notable improvement achieved by utilizing the recently proposed vision-language-based model [26].\nFIGURE 1. Performance improvement from 2017 until 2023 on two popular benchmarks. Performance is measured by the area under the ROC curve (AUC%). Note that some of the models, such as [15], [16], were developed before the datasets were created but were evaluated on them afterwards by other researchers; the other models shown are [27], [28], [29], [30], [31], [32], [23], [33], [26].\nMotivation: Despite the growing interest in VAD and the proliferation of deep learning-based approaches, there remains a need for a comprehensive survey that explores the latest developments in this field. Existing surveys often focus on traditional VAD methods and may not adequately cover the emerging trends and methodologies such as vision-language models (VLMs) [26], [33]–[36]. By delving into VLMs and deep learning-based VAD, including supervised, weakly supervised, self-supervised, and unsupervised approaches, this survey aims to provide a thorough understanding of the state-of-the-art techniques and their potential applications. Our main focus lies on deep learning-based solutions for VAD problems. Particularly, we explore emerging VAD paradigms that employ different datasets, feature extractions, loss functions, and spatiotemporal regularization. In addition, we offer a comprehensive analysis of over 50 different methodologies employed in VAD, with a specific focus on the potential of textual features extracted with vision language models. 
In this survey, our main contributions are summarized as follows:\nWe identify the core challenges of the SOTA VAD paradigms, including the vision-language models, in terms of feature extraction, large-scale VAD training datasets, loss functions, spatiotemporal regularization, and video anomaly score prediction.\nWe provide a comparative analysis through quantitative and qualitative comparisons of SOTA models on different benchmarking datasets. This analysis sheds light on the strengths and weaknesses of existing methodologies, offering valuable insights for researchers and practitioners in the field.\nFollowing our analysis, we outline our proposed recommendations for addressing the open challenges in VAD. These recommendations are informed by our deep understanding of the current landscape of VAD research and aim to guide future research directions toward overcoming existing limitations and advancing the SOTA.\nThe structure of this paper is as follows: Section II outlines the methodology used to select the research studies included in this review over the past decade. Section III examines previous surveys conducted in the field of video anomaly detection. Section IV presents the formulation of the video anomaly detection problem. Section V presents a systematic approach and taxonomy for analyzing the VAD problem, and then the core challenges including the datasets (V-A), feature extraction (V-B), supervision schemes (V-C), loss functions (V-D), regularization techniques (V-E), and the anomaly score (V-F). Section VI outlines dataset guidelines and the evaluation protocols used. Section VII provides a comparative analysis through quantitative and qualitative comparisons of state-of-the-art models. Section VIII provides visualization of bibliometric networks for thematic analysis. Finally, we conclude our work and present additional future directions in Section IX.\nII. 
REVIEW METHODOLOGY # This survey paper exclusively considers research directly related to \u0026ldquo;Video Anomaly Detection.\u0026rdquo; The survey is conducted systematically, utilizing publications from top computer vision venues, including CVPR (Conference on Computer Vision and Pattern Recognition), ICCV (International Conference on Computer Vision), ECCV (European Conference on Computer Vision), IEEE TPAMI (Transactions on Pattern Analysis and Machine Intelligence), IJCV (International Journal of Computer Vision), and CVIU (Computer Vision and Image Understanding). The objective of this review paper is to present the state-of-the-art in Video Anomaly Detection. Accordingly, this study focuses on identifying published research concerning the implementation of various computer vision and deep learning-based methods to address the challenges of VAD. The reviewed works span the last decade, covering over 50 articles.\nIII. RELATED WORK # In the field of anomaly detection, several survey papers have been published in the past decade [1], [4], [7], [8], [37]. The first survey paper, published in 2018 by [8], focused on deep learning techniques for VAD, with particular attention to unsupervised and semi-supervised methods. Their classification of models included three distinct categories: reconstruction-based, spatio-temporal predictive, and generative models. Notably, this paper preceded the notable Multiple Instance Learning (MIL) approach, resulting in the omission of weakly supervised methods from their study.\nAnother significant study by Chalapathy et al. [37] highlighted the potential of deep anomaly detection methods in addressing various detection challenges. Their work covered diverse application areas such as the Internet of Things, intrusion detection, and surveillance videos. They further classified anomalies into collective, contextual, and point anomalies. 
They categorized deep learning methods into four primary classes: unsupervised, semi-supervised, hybrid, and One-Class Neural Network models.\nIn another survey, Ramachandra et al. [1] focus primarily on detecting anomalies within a single scene while also highlighting the differences from multi-scene anomaly detection. An important distinction lies in the fact that single-scene VAD may involve anomalies dependent on specific locations, whereas multi-scene detection cannot. The survey also sheds light on benchmark datasets employed for single-scene versus multi-scene detection and the associated evaluation procedures. In a broader context, the survey classifies previous research in video anomaly detection into three main categories: distance-based, probabilistic, and reconstruction-based approaches.\nIn a different work, Nayak et al. [7] categorize learning frameworks into four main categories: supervised, unsupervised, semi-supervised, and active learning. In the context of deep-learning-based VAD, state-of-the-art methods fall into several distinct categories, including Trajectory-based methods, Global pattern-based methods, Grid pattern-based methods, Representation learning models, Discriminative models, Predictive models, Deep generative models, Deep one-class neural networks, and Deep hybrid models. Moreover, the research provides a comprehensive analysis of performance evaluation methodologies, covering aspects such as the choice of datasets, computational infrastructure, evaluation criteria, and performance metrics.\nAnother work by Pang et al. [4] focused on deep learning techniques for anomaly detection and explored various challenges within the anomaly detection (AD) problem. These challenges included issues such as class imbalance in the data, complex anomaly detection, the presence of noisy instances in weakly supervised AD methods, and more. They discussed how deep learning methods offer solutions to these diverse challenges. 
To structure their analysis, Pang et al. introduced a hierarchical taxonomy for categorizing deep anomaly detection methods. This taxonomy comprised three principal categories: deep learning for feature extraction, learning feature representations of normality, and end-to-end anomaly score learning. Additionally, they provided 11 fine-grained subcategories from a modeling perspective.\nIn their research, Mohammad Baradaran and Robert Bergevin [38] delve deeply into the semi-supervised VAD approaches, focusing on scenarios where labeled anomaly data is limited. They emphasize the role of feature extractors in these contexts, highlighting their ability to differentiate intricate patterns within video data by capturing crucial spatial and temporal details. These feature extractors are pivotal for detecting anomalies in semi-supervised tasks, where the model learns predominantly from an extensive set of normal data. The authors conduct an experimental analysis to shed light on the strengths and weaknesses of various VAD methods. They categorize the DL semi-supervised approaches into six distinct types: reconstruction, prediction, memorization, object-centric, segmentation, and multi-task learning-based methods. For each category, they provide a comprehensive examination of the strengths and shortcomings, particularly focusing on the effectiveness of different feature extraction techniques.\nAnother recently published survey paper by Nomica et al. [39] provides an in-depth analysis of machine learning techniques for detecting anomalies in video surveillance systems. It categorizes these methods into supervised, semi-supervised, and unsupervised approaches, highlighting their strengths, weaknesses, and applicability. However, it does not address the critical differences between feature types, such as temporal, spatial, textual, and hybrid features. 
These differences significantly influence the choice of feature extractors, impacting the effectiveness of the detection models. Additionally, it significantly overlooks the topic of Vision-Language Models as feature extractors. In contrast, our work provides an in-depth analysis that encompasses all these aspects, offering a comprehensive understanding of their impact on anomaly detection models.\nA very recently published survey paper in 2024 by Yang et al. [40] categorizes the VAD approaches into unsupervised, weakly supervised, fully unsupervised, and supervised VAD. It highlights their strengths, such as improved feature extraction and detailed object analysis, while also noting the importance of handling spatiotemporal features and illumination changes. Although the survey provides an extensive comparison of different VAD methodologies, it notably omits the discussion of Vision-Language Models as feature extractors, a growing area of interest in the field.\nPrevious research efforts have overlooked the critical importance of employing diverse feature extractors within deep learning models. This oversight becomes apparent when considering emerging trends such as the widespread adoption of transformers and vision-language pretraining models within video anomaly detection. These innovations have significantly influenced the overall performance of deep learning models. In our survey paper, we aim to address these research gaps that have not been adequately outlined in previous surveys.\nAdditionally, we classify learning and supervision methods into four distinct groups: supervised, unsupervised, self-supervised, and weakly supervised techniques. This classification scheme allows us to systematically analyze and compare different approaches, providing readers with valuable insights into the diverse methodologies employed in video anomaly detection research. 
By organizing our study in this manner, we aim to facilitate a deeper understanding of the underlying principles and techniques driving advancements in this field.\nA. CONTRIBUTIONS # In this survey, our main contributions compared to existing survey papers are summarized as follows:\nThis work defines a clear and comprehensive problem formulation of the VAD problem specifically in the context of supervised learning, where frame-level labels are available.\nTo the best of our knowledge, this is the first survey to highlight the emerging importance of integrating Vision-Language Models (VLMs) as feature extractors in VAD. We explore how VLMs can significantly enhance model performance by effectively combining visual and textual data to better understand and detect anomalies.\nThe paper is organized to serve as a detailed guide for readers and researchers new to the field of VAD. We provide foundational knowledge and a structured approach to navigate the complexities of VAD research as shown in Figure 2.\nThis work presents a well-structured discussion starting from the selection of datasets and the types of features that are critical for VAD. This includes a focus on textual features and deep feature extractors. We also explore the various learning and supervision paradigms, including supervised, self-supervised, weakly supervised, and unsupervised/reconstruction approaches, detailing their respective advantages and disadvantages.\nThis work provides a taxonomy of video anomaly detection that is systematically categorized into two main dimensions: learning and supervision schemes, and feature extraction, as shown in Figure 3.\nThis work provides insights into selecting appropriate loss functions and regularization techniques, which are crucial for optimizing the performance of VAD models. 
A comprehensive guideline is provided for choosing the most suitable datasets and evaluation metrics for experimental purposes, ensuring that researchers can effectively assess and compare their VAD methods.\nThis work offers a detailed comparative analysis of the state-of-the-art (SOTA) VAD methods, both quantitatively and qualitatively. This analysis helps in understanding the current landscape and performance benchmarks in the field.\nFinally, this work provides an extensive discussion on potential future research directions in VAD. This includes exploring new technologies, methodologies, and application areas, and guiding researchers toward promising avenues for further investigation.\nIV. DEFINING THE VIDEO ANOMALY DETECTION PROBLEM # Anomalies within video data encompass events or behaviors that notably diverge from anticipated or regular patterns. The primary goal of video anomaly detection is to devise and implement resilient algorithms and models capable of autonomously identifying and flagging these anomalies in real-time. This involves converting raw video data into interpretable feature representations utilizing robust feature extractors adept at capturing both spatial and temporal characteristics. Additionally, it necessitates the selection of appropriate algorithms or techniques and the establishment of effective evaluation metrics to assess detection performance accurately.\nIn the context of supervised learning scenarios where frame-level labels are available, the video anomaly detection problem can be succinctly described as follows:\nWe represent each video as $V_i$, which consists of a sequence of frames $\\{f_{i,1}, f_{i,2}, \\dots, f_{i,n}\\}$. From each frame, we can extract essential feature representations, denoted by $x_{i,j}$. Define a model $M$ that takes these features extracted from each frame and produces an anomaly score for that frame. For each frame $f_{i,j}$, the anomaly score is $S(f_{i,j}) = M(x_{i,j})$. 
The total anomaly score for video $V_i$ is the sum of the scores of its frames:\n$$S(V_i) = \\sum_{j=1}^{n} S(f_{i,j})$$\nThis anomaly score $S(V_i)$ is compared with a predetermined threshold $T$ to define the predicted binary label $\\hat{Y}_i$ for the video: $\\hat{Y}_i = 1$ if $S(V_i)$ exceeds $T$, and $\\hat{Y}_i = 0$ otherwise.\nHere, 1 indicates that $V_i$ is anomalous and 0 indicates that $V_i$ is normal. The true label of the video is represented as $Y_i \\in \\{0, 1\\}$. The objective is to train the model $M$ so that the difference between $\\hat{Y}_i$ and $Y_i$ is minimized for all videos in the training set and the model generalizes well to unseen videos.\nV. ANALYZING VIDEO ANOMALY DETECTION: A SYSTEMATIC APPROACH # In this section, we dig deeper into the complexities of the VAD process by analyzing relevant literature. The diagram presented in Figure 2 acts as a guide for navigating VAD research in the subsequent sections, where we explore the challenges associated with the VAD problem. Commencing with an array of diverse datasets, we traverse through various feature extraction techniques utilized to extract spatial, temporal, spatiotemporal, or textual features, leveraging vision language models.\nOur exploration begins with the utilization of diverse datasets and a spectrum of feature extraction techniques. These methodologies are applied to extract spatial, temporal, spatiotemporal, and textual features. Furthermore, our exploration will encompass various learning and supervision strategies, including supervised methods, self-supervised techniques, and unsupervised techniques (often classified as reconstruction-based or one-class classification approaches), alongside weakly supervised and prediction methods. Additionally, we will illuminate the significance of loss functions, regularization techniques, and anomaly score computation. Evaluation protocols for models are discussed in section VI-B. 
In the upcoming sections, we will explore these challenges further, examining the ways they have been approached.\na: Taxonomy of Video Anomaly Detection # The taxonomy of video anomaly detection, as shown in Figure 3, is systematically categorized into two main dimensions: Learning and Supervision Schemes, and Feature Extraction. The Learning and Supervision Schemes dimension includes Supervised, Self-Supervised, Weakly Supervised (such as Multiple Instance Learning), and Unsupervised methods. Unsupervised methods are further divided into One-Class Classification, Reconstruction, and Future Frame Prediction approaches. The Feature Extraction dimension involves Deep Feature Extractors, including CNNs, Autoencoders, GANs, Sequential Deep Learning models (like LSTMs and Vision Transformers), Vision Language models, and Hybrid models. Additionally, it covers different types of features such as Spatial, Temporal, Spatiotemporal, and Textual.\nA. DATASETS BUILDING AND SELECTION # The field of Video Anomaly Detection relies significantly on publicly available datasets which are used for testing and benchmarking the proposed models. In this section, we present an overview of the common datasets utilized in the field of VAD, each carefully curated to facilitate the study of anomalies in various scenarios. These datasets encompass a wide range of scenes, offering diverse challenges for anomaly detection. We will investigate the number of videos contained in each dataset, the specific types of scenes they cover, and the anomalies present.\n1) Subway dataset # The Subway dataset [41] consists of two videos collected using a CCTV camera, each capturing distinct perspectives of an underground train station. The first video focuses on the \u0026ldquo;entrance gate\u0026rdquo; area, where individuals typically descend through the turnstiles and enter the platform with their backs turned to the camera. 
In contrast, the second video is captured at the \u0026ldquo;exit gate,\u0026rdquo; observing the platform where passengers ascend while facing the camera. These two cameras provide unique vantage points within the station, offering valuable insights for analysis and surveillance. The dataset has a combined duration of 2 hours. Anomalies within this dataset encompass activities such as walking in incorrect directions and loitering. Notably, this dataset is recorded within an indoor environment.\nFIGURE 2. Video anomaly detection paradigm including (A) state-of-the-art dataset building and selection V-A, (B) Spatial, temporal, spatio-temporal, and textual deep feature extraction V-B, (C) Diverse deep learning and supervision schemes (supervised, self-supervised, weakly supervised, and unsupervised methods) V-C, (D) selection of loss functions V-D, (E) integration of regularization techniques within loss functions V-E, (F) anomaly score calculation V-F, and (G) model evaluation techniques VI-B.\nFIGURE 3. Taxonomy of Video Anomaly Detection.\n2) UCSD Pedestrian # The UCSD anomaly detection dataset [42] was collected using a stationary camera positioned at an elevated vantage point to monitor pedestrian walkways. This dataset offers a wide range of scenes depicting varying crowd densities, spanning from sparsely populated to highly congested environments. Normal videos within the dataset predominantly feature pedestrians, while abnormal events arise from two primary sources: the presence of non-pedestrian entities in the walkways and the occurrence of anomalous pedestrian motion patterns. Common anomalies observed include bikers, skaters, small carts, individuals walking across designated walkways or in adjacent grass areas, as well as instances of people in wheelchairs. The dataset comprises two distinct subsets, Peds1 and Peds2, each capturing different scenes. 
Peds1 depicts groups of people walking towards and away from the camera, often exhibiting perspective distortion, while Peds2 focuses on scenes where pedestrians move parallel to the camera plane.\nThe dataset is further divided into clips, with each clip containing approximately 200 frames. Ground truth annotations are provided at a frame level, indicating the presence or absence of anomalies. Additionally, a subset of clips in both Peds1 and Peds2 is accompanied by manually generated pixel-level binary masks, enabling the evaluation of algorithms\u0026rsquo; ability to localize anomalies. The UCSD Ped1 \u0026amp; Ped2 datasets are publicly available: http://www.svcl.ucsd.edu/projects/anomaly/.\n3) Street Scene dataset # The Street Scene dataset proposed by Ramachandra et al. [43] is designed for video anomaly detection and comprises 46 training and 35 testing high-resolution (1280×720) video sequences. These sequences were captured using a USB camera positioned to overlook a two-lane street with bike lanes and pedestrian sidewalks during the daytime. This dataset presents a challenging environment due to the diverse range of activities captured, including cars driving, turning, stopping, and parking; pedestrians walking, jogging, and pushing strollers; and bikers riding in bike lanes. Additionally, the videos feature changing shadows, moving backgrounds such as flags and trees blowing in the wind, and occlusions from trees and large vehicles.\nThe dataset includes 56,847 frames for training and 146,410 frames for testing, extracted at a rate of 15 frames per second. It contains a total of 205 naturally occurring anomalous events. These anomalies range from illegal activities like jaywalking and illegal U-turns to uncommon events not present in the training set, such as pets being walked and a meter maid issuing tickets. This variety makes the Street Scene dataset a comprehensive and demanding resource for advancing video anomaly detection research. 
However, a notable limitation of this dataset is that it is a single-scene dataset where only the street scene is included. This prevents models trained only on this dataset from generalizing their anomaly detection capabilities to other scenes. On the other hand, this dataset could be beneficial for models that are specifically designed to work on single scenes such as EVAL [44]. The dataset is publicly available at: http://www.merl.com/demos/video-anomaly-detection\nFIGURE 4. Sample frames showcasing the diversity of scenes and anomalies present in publicly available datasets used for Video Anomaly Detection. These frames offer a glimpse into the range of challenges and scenarios addressed within the field, providing valuable insights for testing and benchmarking anomaly detection models.\n4) UCF-Crime dataset # The UCF-Crime dataset [11] is a large-scale, multi-scene dataset widely used in recent research. Comprising long, untrimmed surveillance videos, this dataset covers 13 real-world anomalies with significant implications for public safety. The anomalies span a broad spectrum, including Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism.\nTo ensure the dataset\u0026rsquo;s integrity and quality, a careful curation process was undertaken. Ten annotators were trained to collect videos from platforms like YouTube and LiveLeak, utilizing text search queries across multiple languages. Pruning conditions were enforced to exclude manually edited videos, prank videos, those captured by handheld cameras, and those sourced from news or compilations. 
The resultant dataset comprises 950 unedited real-world surveillance videos, each featuring clear anomalies, alongside an equal number of normal videos, totaling 1900 videos.\nTemporal annotations were acquired by assigning the same videos to multiple annotators and averaging their annotations. The dataset is partitioned into a training set, consisting of 800 normal and 810 anomalous videos, and a testing set, comprising 150 normal and 140 anomalous videos. With its broad coverage of anomalous events, the UCF-Crime dataset serves as a comprehensive resource for evaluating anomaly detection algorithms across diverse real-world scenarios. The dataset is publicly available: http://crcv.ucf.edu/projects/real-world/.\n5) CUHK Avenue # The CUHK Avenue dataset [15] is a widely employed resource for Video Anomaly Detection, created by researchers from the Chinese University of Hong Kong (CUHK). Set within the CUHK campus avenue, this dataset primarily focuses on anomalies commonly encountered in urban environments and public streets. It offers a diverse array of lighting conditions, weather variations, and human activities, thereby presenting formidable challenges for VAD models.\nAn array of anomalous behaviors is captured within the dataset, encompassing both physical anomalies such as fighting and sudden running, and non-physical anomalies like unusual gatherings and incorrect movement directions. Notably, the dataset introduces several key challenges for VAD models, including slight camera shake in certain frames, occasional absence of normal behaviors in the training set, and outliers in the training data. In total, the dataset comprises 30,652 frames, with 15,328 frames allocated for training and the remaining 15,324 frames reserved for testing. With its diverse scenarios and realistic challenges, the CUHK Avenue dataset serves as a valuable benchmark for evaluating and advancing VAD techniques. 
The dataset is publicly available: http://www.cse.cuhk.edu.hk/leojia/projects/detectabnormal/dataset.html.\n6) ShanghaiTech # The ShanghaiTech Campus dataset [14] is a large and diverse dataset for anomaly detection. Its 13 scenes, characterized by complex lighting conditions and varied camera angles, cover a comprehensive array of challenging scenarios. Notably, it contains over 270,000 training frames, alongside 130 abnormal events annotated at the pixel level, facilitating precise evaluation and analysis.\nThe dataset comprises a total of 330 training videos and 107 testing videos, all at a resolution of 856x480. Each video has a frame rate of 24 frames per second (24 fps). With its wealth of data and detailed annotations, the ShanghaiTech Campus dataset serves as a cornerstone resource for advancing anomaly detection methodologies in real-world scenarios. The dataset is publicly available: https://sviplab.github.io/dataset/campus_dataset.html\n7) XD-Violence # The XD-Violence dataset [13] is a large-scale, multi-scene dataset with a total duration of 217 hours, containing 4754 untrimmed videos. This dataset comprises 2405 violent videos and 2349 non-violent videos, all of which include audio signals and weak labels. The dataset targets weakly supervised violence detection, where only video-level labels are available in the training set. This approach is less labor-intensive than annotating frame-level labels. As a result, forming large-scale datasets of untrimmed videos and training a data-driven and practical system is no longer a challenging endeavor. 
The dataset incorporates audio signals and is sourced from both movies and in-the-wild scenarios. The dataset encompasses six physically violent classes: Abuse, Car Accident, Explosion, Fighting, Riot, and Shooting. The dataset is divided into a training set, which comprises 3954 videos, and a test set which includes 800 videos. Within the test set, there are 500 violent videos and 300 non-violent videos. https://roc-ng.github.io/XD-Violence/\n8) NWPU Campus dataset # The NWPU Campus dataset [45] is newly available. It represents a significant contribution to the field of video anomaly detection and anticipation. The dataset was compiled by setting up cameras at 43 outdoor locations on the campus to capture the activities of pedestrians and vehicles. To ensure a sufficient number of anomalous events, more than 30 volunteers were involved in performing both normal and abnormal activities. The dataset covers a wide range of normal events, including regular walking, cycling, driving, and other daily behaviors that adhere to rules. The anomalies encompass various categories such as single-person anomalies, interaction anomalies, group anomalies, scene-dependent anomalies, location anomalies, appearance anomalies, and trajectory anomalies. The dataset consists of 305 training videos and 242 testing videos, totaling 16 hours of footage. Frame-level annotations are provided for the testing videos, indicating the presence or absence of anomalous events. Privacy concerns are addressed by blurring the faces of volunteers and pedestrians. Compared to other datasets, the NWPU Campus dataset stands out due to its larger volume of data, diverse scenes, and inclusion of scene-dependent anomalies. Additionally, it is the first dataset designed for video anomaly anticipation. These features make the NWPU Campus dataset a valuable resource for advancing research in video anomaly detection and anticipation. 
The dataset is publicly available: https://campusvad.github.io/\n9) Discussion on Datasets # The existing literature presents a wide variety of datasets covering a broad spectrum of normal and abnormal scenarios. These datasets vary from single-scene datasets, which focus on specific situations and their anomalies, such as the UCSD Pedestrian dataset [42], to more diverse datasets featuring multiple scenes and corresponding anomalous conditions, like the UCF-Crime [11] and XD-Violence [13] datasets.\nFurthermore, these datasets vary in size and total duration, ranging from as short as five minutes to more than two hundred hours. Some datasets comprise a small number of long videos, primarily collected from fixed scenes, as seen in the Subway dataset [41], while others contain a larger number of shorter videos sourced from a wide array of locations, like the XD-Violence dataset [13]. Additionally, the datasets show a noticeable diversity in the types of anomalies present, spanning from action-related anomalies such as unexpected movement direction and criminal activities to object-related anomalies like suspicious personnel and specific vehicle types.\nTable 1 provides a summary of the most commonly used publicly available datasets, while Figure 4 illustrates sample frames extracted from these datasets.\nTABLE 1. Overview of the most important publicly available datasets.\n
| Dataset Name | Paper | Year | Balanced | # Videos | Length | Example anomalies | Frames per video | Train:Test | Challenges |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Subway Entrance Gate | [41] | 2008 | No | 1 | 1.5 hrs | Wrong direction, No payment | 121,749 | 13:87 | Single scene, indoor only, limited number of anomalies |\n| Subway Exit Gate | [41] | 2008 | No | 1 | 1.5 hrs | Wrong direction, No payment | 64,901 | 13:87 | Single scene, indoor only, limited number of anomalies |\n| UCSD Ped1 | [42] | 2010 | No | 70 | 5 min | Bikers, small carts, walking across walkways, wheelchair users | 201 | 49:51 | Small size, single scene, only outdoor, only vehicle anomalies |\n| UCSD Ped2 | [42] | 2010 | No | 28 | 5 min | Bikers, small carts, walking across walkways, wheelchair users | 163 | 55:45 | Small size, single scene, only outdoor, only vehicle anomalies |\n| CUHK Avenue | [15] | 2013 | No | 37 | 30 min | Strange action, Wrong direction, Abnormal object | 839 | 50:50 | Small size, single scene, outdoor only, camera shake |\n| ShanghaiTech Campus Dataset | [14] | 2017 | No | 437 | 3.67 hrs | Running, Loitering, Biking in Restricted Areas, Unusual Gatherings, Theft | 1,040 | 86:14 | Only university setting anomalies, single geographic location |\n| UCF-Crime | [11] | 2018 | Yes | 1,900 | 128 hrs | Abuse, arrest, arson, assault, accident, burglary, fighting, robbery | 7,247 | 85:15 | Imbalance between normal and abnormal classes, variation in video quality |\n| Street Scene | [43] | 2020 | No | 81 | 226 mins | Jaywalking, illegal U-turns, pets, metermaid ticketing a car | 2,509 | 57:43 | Single geographic location, single scene, only outdoor |\n| XD-Violence | [13] | 2020 | No | 4,754 | 214 hrs | Abuse, Car Accident, Explosion, Fighting, Riot, Shooting | 3,944 | 83:17 | Limited number of anomalies, variation in video quality |\n| NWPU Campus dataset | [45] | 2023 | No | 547 | 16 hrs | Single-person, interaction, group, location, appearance, trajectory | 2,527 | 56:44 | Only university setting anomalies, single geographic location |\nDespite the availability of numerous datasets dedicated to the problem of VAD, several crucial trends and challenges need to be 
addressed:\n- The majority of the publicly available VAD datasets offer limited environmental diversity and are restricted to a specific setting. Examples are the ShanghaiTech [14] and the CUHK Avenue [15] datasets, which are limited to university campus scenes and anomalies. Such a trend can significantly hinder the trained models\u0026rsquo; capability to generalize well to other types of settings.\n- A recurring theme in multiple datasets is the limited number of anomalous event types contained in the dataset. Some datasets contain as few as three anomalous event types (strange action, wrong direction, and abnormal object in CUHK Avenue [15]), in comparison to other datasets that feature up to 11 types of anomalies (ShanghaiTech [14]). This constraint lessens the models\u0026rsquo; efficacy in practical applications by limiting their capacity to learn a broad range of potential abnormal behaviors.\n- Most datasets available in the literature are imbalanced in terms of the normal and abnormal classes. Particularly in larger datasets like UCF-Crime [11], the imbalance between normal and abnormal classes makes it difficult to train models that detect anomalies accurately without being biased toward the majority class (normal).\nTo address the aforementioned issues, we draw the following future directions for the VAD problem in surveillance videos, particularly in the realm of benchmark datasets:\n- The benchmark datasets should be diverse and strive to encompass a wide range of anomalies, including both subtle and overt deviations from normal behavior.\n- The datasets should aim for realism, mimicking real-world surveillance environments as closely as possible. Additionally, scalability is crucial to accommodate the growing size and complexity of surveillance video data.\n- Anomalies often manifest over time, making temporal context essential for effective detection. 
Benchmark datasets should incorporate long-term temporal information to capture the dynamics of normal and abnormal behaviors accurately.\n- Beyond detecting anomalies, future benchmark datasets could focus on localizing and segmenting anomalous regions within video frames or sequences. This finer granularity aids in understanding the nature and extent of anomalies, facilitating prompt response measures.\n- Integrating data from multiple modalities, such as visual, auditory, and textual information, can enhance anomaly detection performance. Future benchmark datasets might include multi-modal data to reflect the complexity of real-world surveillance systems.\n- For advancing anomaly detection research, openness and collaboration are crucial; therefore, future benchmark datasets should be openly accessible, inviting contributions from researchers worldwide and fostering innovation in the field.\nBy addressing these aspects, datasets can facilitate the development of more robust and effective anomaly detection approaches, ultimately improving the security and efficiency of surveillance systems.\nB. FEATURE LEARNING IN DEEP VIDEO ANOMALY DETECTION # Feature learning plays a pivotal role in the effectiveness of deep learning models for video anomaly detection. This section examines various feature extraction techniques utilized in the literature, emphasizing their significance and performance in surveillance video analysis.\n1) Exploring Different Feature Types # Video frames encompass diverse types of features crucial for VAD. These include spatial, temporal, spatiotemporal, and textual features.\na: Spatial Features # Spatial features pertain to the visual characteristics present within individual frames of a video. These encompass attributes such as shapes, textures, colors, and object positions within the frame. In the realm of anomaly detection, the analysis of spatial features aids in the detection of unusual patterns or objects within specific regions of the video frame. 
Initially, traditional machine learning methods dominated this area, employing techniques like Gaussian mixture models and handcrafted features [17], [18]. However, the transition to deep learning enabled automated feature extraction, improving the ability to discern intricate spatial details within video data.\nb: Temporal Features # Temporal features encompass changes or movements occurring over time within videos, including object motion, speed variations, and environmental alterations from frame to frame. For instance, Liu et al. [46] employed optical flow to capture motion features across frames. In video anomaly detection, temporal features play a pivotal role in identifying unusual actions or events spanning consecutive frames, such as unauthorized running in restricted areas [11], [16].\nc: Spatiotemporal Features # Relying solely on one type of feature can be restrictive: temporal features might fail to pinpoint where an anomaly occurs, while spatial features might fail to capture when it happens. Combining both into spatiotemporal features offers a more holistic perspective, capturing not only the occurrence but also the precise location of an anomaly. This approach leads to more accurate and effective anomaly detection [46]–[48].\nd: Textual Features # In video anomaly detection, textual features such as captions and labels enhance the system\u0026rsquo;s ability to recognize and understand anomalies. By associating descriptive captions and relevant labels with video frames, these systems gain a deeper understanding of the content, context, and actions depicted in the videos. This semantic layer aids in distinguishing normal from abnormal activities more effectively. Advanced techniques like vision-language pretraining, including image-to-text models, are employed to analyze these textual annotations, which can also encompass temporal and spatial context. 
The integration of textual features with visual data leads to more sophisticated, context-aware anomaly detection [24], [25], [33], [35].\n2) Deep Feature Extractors # Different feature extractors have been utilized by previous researchers.\na: Convolutional Neural Networks (CNNs) #\n- 2D Convolutional Neural Networks (2D CNNs): CNNs have revolutionized spatial feature processing, enabling detailed analysis of structural elements in scenes. The authors of [49] discuss the use of Faster R-CNN, a specific CNN architecture, because of its high accuracy and its dual capability of performing object classification and bounding box regression simultaneously. This means it can both locate and classify objects within the video frames, which is particularly useful for identifying and localizing anomalies.\n- 3D Convolutional Networks (3D CNNs): 3D CNNs extend traditional CNNs by incorporating temporal analysis, allowing for an effective evaluation of spatiotemporal characteristics within video data. The adoption of models such as C3D [50] and I3D [51] has markedly elevated the performance of state-of-the-art (SOTA) systems. A significant number of studies, including [11], [28], [52], have employed these 3D CNN architectures as their backbone, demonstrating their efficacy in spatiotemporal feature extraction.\nb: Autoencoders (AEs) # AEs excel in VAD due to their unsupervised learning ability. They encode data into a lower-dimensional space and then reconstruct it, learning key data features without needing labeled examples. This is vital in anomaly detection, where anomalies are rare and often not well-defined, allowing AEs to effectively identify unusual patterns in video data.\nThe approach in [16] uses a fully convolutional autoencoder to learn low-level motion features and the regular dynamics of long-duration videos, making it an effective model for detecting irregularities. 
This method is versatile and applicable to various tasks, including temporal regularity analysis, frame prediction, and abnormal event detection in videos.\nc: Generative Adversarial Networks (GANs) # GANs consist of two parts: a generator that creates realistic data and a discriminator that distinguishes between generated and real data. Through this adversarial process, GANs effectively learn the distribution of real data. This ability is particularly valuable in anomaly detection, as GANs can generate data that closely resembles normal instances, making it easier to identify anomalies that deviate from this learned pattern [21].\nTheir use in reconstruction-based approaches is especially noteworthy; these approaches reconstruct input data (like video snippets) using high-level representations learned from normal videos [53]. The premise is that anomalies, being out-of-distribution inputs, are harder to reconstruct accurately compared to normal data, making reconstruction error a viable metric for anomaly detection.\nAs seen in the work of [47], GANs are employed for future frame predictions, which are then reconstructed, showcasing their versatility and effectiveness in anomaly detection tasks.\nIn this domain, GANs alongside AEs have been shown to be effective in capturing the intricate patterns in video data, aiding in the more precise identification of anomalous activities. Their combined use enables the learning of high-level representations and the generation of realistic data, enhancing the ability to detect anomalies accurately and efficiently.\nd: Sequential Deep Learning #\n- Long Short-Term Memory (LSTM): LSTMs are a specialized type of Recurrent Neural Network designed to capture temporal dependencies. This characteristic makes them well-suited for processing sequential data, such as textual content in NLP-based systems [54]. Moreover, their ability to remember and integrate long-range patterns is also invaluable in video applications. 
Specifically, they facilitate tracking and identifying temporal anomalies within sequential video streams, a critical function in spatiotemporal analysis [12].\n- Vision Transformers (ViT): Vision Transformers, known for their attention mechanisms, have revolutionized the way sequential data is processed. They can weigh the importance of different parts of the input data, giving precedence to the most relevant features. This makes them highly effective for extracting both temporal and spatiotemporal features in videos, which is crucial for detecting complex anomalies. Yang et al. [55] introduce a video anomaly detection method that restores events from keyframes, using a U-shaped Swin Transformer Network (a special type of ViT). This approach, distinct from traditional methods, focuses on inferring missing frames and capturing complex spatiotemporal relationships, demonstrating improved anomaly detection in video surveillance.\ne: Vision Language Models (VLM): # Conventional techniques often rely solely on spatial-temporal characteristics, which might fall short in complex real-life situations where a deeper semantic insight is required for greater precision. Vision-language features involve training models using both visual and textual encoders, enabling comprehensive analysis of complex surveillance scenarios.\nVision-language feature extractors trained with contrastive learning, such as CLIP [34] and BLIP [56], aim to align vision and language, promising to transform how surveillance videos are processed and interpreted. These models are designed to augment video content with rich semantic understanding, effectively narrowing the gap between simple pixel-based data and a more human-like interpretation of video content [35]. Additionally, language models using text captions are used in text-to-video anomaly retrieval, as in [57].\nIn the VadCLIP model for video anomaly detection [35], Wu et al. 
combined the CLIP model\u0026rsquo;s vision-language features with a dual-branch system using both visual and textual encoders. The model includes a local-global temporal adapter (LGT-Adapter) for modeling video temporal relations and leverages both visual and textual features for anomaly detection. One branch performs binary classification using visual features, while the other aligns visual and textual data for finer-grained anomaly detection.\nIn contrast, [33] produced text features using video captions with SwinBERT [36], enriching the semantic context for detecting abnormalities. This approach expands the semantic understanding of video content beyond the pixel level, enhancing anomaly detection capabilities.\nf: Hybrid Feature Learners: # Certain studies have developed hybrid feature extractors that integrate various types of feature extraction techniques to capture both spatial and temporal information effectively for video anomaly detection.\nFor example, [58] proposed a hybrid architecture that combines U-Net and a modified Video Vision Transformer (ViViT). The U-Net, known for its encoder-decoder structure with skip connections, excels at capturing detailed spatial information. In contrast, the modified ViViT, originally designed for video classification tasks, is adapted here to encode both spatial and temporal information effectively for video prediction. This hybrid model leverages the detailed high-resolution spatial information from the CNN features of the U-Net and the global context captured by the transformer.\nSimilarly, the work by [6] employed a composite system that combines CNNs with transformers. In this configuration, the CNN component is responsible for discerning spatial characteristics, whereas the transformers are tasked with identifying extended temporal attributes. This hybrid approach offers a complementary fusion of spatial and temporal features, enhancing the model\u0026rsquo;s capability to detect anomalies in video sequences. 
Additionally, [59] proposed a CNN architecture that integrates convolutional autoencoders (CAEs) and a U-Net. Each stream within the network contributes uniquely to the task of detecting anomalous frames. The CAEs are responsible for extracting spatial features, while the U-Net focuses on capturing contextual information, enabling the model to effectively identify anomalies in video frames by leveraging both local and global features.\nSimilarly, [60] combined the I3D 3D CNN with LSTM networks to enhance anomaly detection performance. The I3D CNN extracts spatio-temporal features from video frames, effectively identifying potential anomalies. However, as CNNs primarily capture short-term dynamics, LSTMs are integrated to capture long-range temporal dependencies crucial in lengthy, untrimmed surveillance footage. To handle the high-dimensional data from the I3D network, a pooling strategy is employed to optimize feature vectors for efficient processing by the LSTM. This hybrid architecture effectively integrates the strengths of both CNNs and LSTMs, enhancing the model\u0026rsquo;s ability to detect anomalies at varying temporal scales.\nEach feature representation has its strengths and weaknesses, as empirically investigated by the SOTA VAD approaches [6], [26], [35], [55], [61]. End-to-end deep feature learning paradigms are the recent trend, and these approaches have advanced VAD performance. We provide the following future directions in terms of feature representations to improve detection accuracy and efficiency.\n- Continuous exploration and refinement of deep learning architectures tailored for VAD is needed. This may involve developing more efficient architectures such as spatiotemporal CNNs, ViTs, or RNNs that can effectively capture temporal dependencies and spatial information in video sequences.\n
- Hybrid models, which integrate multiple types of features, including appearance, motion, and spatiotemporal information, could also improve performance. This could involve combining deep learning-based features with handcrafted features or leveraging multimodal approaches such as VLMs to capture diverse aspects of anomalies.\n- Investigation of cross-modal feature representations that integrate information from multiple modalities, such as visual, audio, and textual cues present in surveillance videos. Cross-modal representations can capture complementary information sources and improve the robustness of anomaly detection models to diverse types of anomalies and environmental conditions.\n- Exploration of self-supervised learning techniques for pre-training feature representations in an unsupervised manner can help in learning rich representations from unlabeled data. The pre-trained models can then be fine-tuned for specific anomaly detection tasks with limited labeled data.\n- Integration of attention mechanisms into deep learning architectures to focus on relevant spatiotemporal regions or frames within videos. Attention mechanisms can help improve the model\u0026rsquo;s ability to attend to informative parts of the input data and suppress irrelevant noise, leading to more robust anomaly detection features.\n- Utilization of graph-based feature representations to model complex relationships among different entities (e.g., objects, regions) within surveillance videos. Graph neural networks (GNNs) can capture dependencies and interactions among entities, enabling more effective anomaly detection in scenarios where anomalies manifest as abnormal interactions or relationships.\n- Investigation of adversarial learning techniques to enhance the robustness of feature representations against adversarial attacks. Adversarial training methods can help improve the model\u0026rsquo;s ability to generalize to unseen anomalies and mitigate the risk of evasion attacks in real-world surveillance systems.\n
- Development of incremental learning approaches to adaptively update feature representations over time as new data becomes available. This can help the model adapt to evolving patterns of normal and abnormal behavior in dynamic surveillance environments without forgetting previously learned knowledge.\nC. LEARNING AND SUPERVISION SCHEMES # In this section, we examine the various learning approaches used to address the VAD problem.\n1) Supervised Approaches # In supervised learning, algorithms are developed using datasets where each frame is pre-labeled as either \u0026lsquo;normal\u0026rsquo; or \u0026lsquo;anomalous.\u0026rsquo; This allows the model to learn explicit distinctions between normal and abnormal events based on these annotations. However, the use of supervised methods in VAD is relatively rare. This is primarily because acquiring datasets with detailed, frame-level annotations is highly challenging and resource-intensive. Anomalous events are often rare and diverse, making it difficult to compile a comprehensive set of labeled anomalies. Additionally, the manual process of annotating each frame in extensive video data is laborious and time-consuming, further limiting the availability of such datasets. Consequently, while supervised learning can be powerful when precise labels are available, its application in VAD is constrained by the practical difficulties in obtaining extensive, accurately annotated training data.\nAn exemplary study by [62] introduced an approach for detecting and localizing anomalies in crowded scenes using spatial-temporal Convolutional Neural Networks (CNNs). This method can simultaneously process spatial and temporal data from video sequences, capturing both appearance and motion information. 
Tailored specifically for crowded environments, it efficiently identifies anomalies by focusing on moving pixels, enhancing accuracy and robustness.\nAnother work by [53] addresses the scarcity of labeled anomalies in anomaly detection by employing a conditional Generative Adversarial Network (cGAN) to generate supplementary training data that is class balanced. This approach utilizes labeled data for training the model. They propose a novel supervised anomaly detector, the Ensemble Active Learning Generative Adversarial Network (EAL-GAN). This network, unique in its architecture of one generator against multiple discriminators, leverages an ensemble learning loss function and an active learning algorithm. The goal is to alleviate the class imbalance problem and reduce the cost of labeling real-world data.\n2) Self-supervised Approaches # Self-supervised approaches involve training models using data that is not explicitly labeled for anomalies. Instead, the models learn to identify anomalous events by solving \u0026ldquo;proxy tasks\u0026rdquo;, which generate supervisory signals from the data itself. These tasks are designed to be related to the main goal of detecting anomalies, helping the model to develop an understanding of normal patterns in the data without needing direct supervision from labeled examples of anomalous events. However, it demands careful design and selection of proxy tasks to ensure that the learned features are useful for identifying anomalous events. In the work by [9], researchers propose a method based on self-supervised and multi-task learning at the object level. Self-supervision involves training a 3D CNN on several proxy tasks that do not require labeled anomaly data. These tasks include determining the direction of object movement (arrow of time), identifying motion irregularities by comparing objects in consecutive versus intermittent frames, and reconstructing object appearances based on their preceding and succeeding frames. 
By learning from the video data itself what normal object behavior looks like, the model becomes adept at detecting anomalies when deviations from this learned behavior occur. This methodology enables effective anomaly detection even in the absence of explicit labels.\nThe authors continued their work in [63]. The new enhancements include integrating advanced object detection methods such as YOLOv5, optical flow, and background subtraction, which improve the detection of rapidly moving objects and those outside predefined classes. They also introduce transformer blocks into the architecture, exploring both 2D and 3D Convolutional Vision Transformers (CvT) to better capture complex spatiotemporal dependencies. These updates culminate in a more robust framework that significantly enhances the detection accuracy and adaptability in identifying anomalous events in video sequences.\n3) Weakly Supervised Approaches # Unlike supervised approaches, weakly supervised approaches do not require precise frame-level annotations, which are challenging and time-consuming to obtain for anomalies in long video sequences. Instead, annotators may label \u0026ldquo;snippets\u0026rdquo; or short segments within the video where anomalies are observed, serving as weak labels for training the model.\nThe concept of weak supervision was first introduced to VAD by Sultani et al. [11] through their pioneering multiple instance learning (MIL) model. This model was the first to adopt the notion of weakly labeled training videos. In this approach, normal and anomalous videos are considered negative and positive bags, respectively, with video segments serving as instances in MIL. These bags are processed through feature extractors to capture spatiotemporal features before being directed to a fully connected network to produce the final output. 
The output anomaly score ranges from 0 to 1, with the goal of increasing the model\u0026rsquo;s output anomaly score for abnormal segments while decreasing it for normal segments.\nHowever, this method can introduce noisy labels since the weak annotations do not provide precise information about the exact location of anomalies within the segments. This ambiguity can lead to the model learning less accurate representations of normal and abnormal behaviors. Additionally, the reliance on segment-level labels can sometimes result in the inclusion of irrelevant frames within the labeled segments, further complicating the training process.\nOther works have addressed this issue. For example, the authors of [28] introduced a novel approach to weakly supervised video anomaly detection (WSVAD), reframing it as a supervised learning task with noisy labels, departing from the conventional MIL framework. Leveraging a Graph Convolutional Network (GCN), this method effectively cleans noisy labels, thereby enhancing the training and reliability of fully supervised action classifiers for anomaly detection. Additionally, the authors of [64] introduced a binarization-embedded WSVAD (BE-WSVAD) method, which innovates by embedding binarization into a GCN-based anomaly detection module.\nIn another study, [27] enhances traditional MIL with a Temporal Convolutional Network (TCN) and a unique Inner Bag Loss (IBL). The IBL strategically focuses on the variation of anomaly scores within each bag (video), emphasizing a larger score gap in positive bags (containing anomalies) and a smaller gap in negative bags (without anomalies). Meanwhile, the TCN effectively captures the temporal dynamics in videos, a crucial aspect often overlooked in standard MIL approaches.\nMoreover, the authors of [29], [65] propose a self-reasoning approach based on binary clustering of spatio-temporal video features to mitigate label noise in anomalous videos. 
Their framework involves generating pseudo labels through clustering, facilitating the cleaning of noise from the labels, and enhancing the overall anomaly detection performance. This method not only removes noisy labels but also improves the network\u0026rsquo;s performance through a clustering distance loss.\nTraditional MIL approaches often neglect the intricate interplay of features over time. To address this, the method in [31] commences with a relation-aware feature extractor that captures multi-scale CNN features from videos. The unique aspect of their approach is the integration of self-attention with Conditional Random Fields (CRFs), leveraging self-attention to capture short-range feature correlations and CRFs to learn feature interdependencies. This approach offers a more comprehensive analysis of complex movements and interactions for anomaly detection.\nOn the other hand, [32] proposed Robust Temporal Feature Magnitude Learning (RTFM). RTFM addresses the challenge of recognizing rare abnormal snippets in videos dominated by normal events, particularly subtle anomalies that exhibit only minor differences from normal events. It employs temporal feature magnitude learning to improve the robustness of the MIL approach against negative instances from abnormal videos and integrates dilated convolutions and self-attention mechanisms to capture both long- and short-range temporal dependencies.\nFurthermore, [52] introduced \u0026ldquo;MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection\u0026rdquo;, a novel approach for WSVAD. MIST diverges from traditional MIL by introducing a pseudo label generator with a sparse continuous sampling strategy for more accurate clip-level pseudo labels, and a self-guided attention-boosted feature encoder for focusing on anomalous regions within frames.\nAdditionally, [66] presents a novel weakly supervised temporal relation learning framework (WSTR). 
The framework, which uses I3D for feature extraction and incorporates snippet-level classifiers and top-k video classification for weak supervision, is the first of its kind to apply transformer technology in this context.\nIn [26], a novel approach called CLIP-TSA leverages vision-language features for WSVAD. Unlike conventional models such as C3D or I3D, CLIP-TSA utilizes the Vision Transformer (ViT)-encoded visual features from CLIP [34] to efficiently extract discriminative representations. It incorporates a Temporal Self-Attention (TSA) mechanism to model both long- and short-range temporal dependencies, thereby enhancing detection performance in VAD.\nAnother recently published work, [61], proposes a novel weakly supervised framework called Text Prompt with Normality Guidance (TPWNG), leveraging the CLIP model for aligning textual descriptions with video frames to generate pseudo-labels. The method involves fine-tuning CLIP for domain adaptation using ranking and distributional inconsistency losses and introducing a learnable text prompt mechanism with normality visual prompts to improve text-video alignment. The framework also incorporates a pseudo-label generation module based on normality guidance to infer reliable frame-level pseudo-labels and a temporal context self-adaptive learning module to flexibly capture temporal dependencies in video events.\n4) Unsupervised and Reconstruction-based Approaches # Reconstruction-based approaches for video anomaly detection operate under the principle that normal events can be effectively reconstructed from a learned representation, while anomalies or abnormal events deviate significantly from this representation and hence are harder to reconstruct. Essentially, the model learns to represent or \u0026ldquo;reconstruct\u0026rdquo; normal data, and anomalies are detected based on how poorly the model reconstructs them.\nThese methods are particularly suitable when labeled anomaly data is scarce. 
During training, only normal videos are considered, while during testing, the model is evaluated on both normal and abnormal videos to assess its anomaly detection capabilities.\nAlthough these models are feasible, scalable, and cost-effective, their effectiveness is highly dependent on the quality and comprehensiveness of the normal training data. If the normal data is not representative of all possible normal variations, the model\u0026rsquo;s performance in detecting anomalies can suffer.\nAnother issue is that the model could generate a high number of false positives, labeling normal variations or benign deviations as anomalies. This occurs because any deviation from the learned normal patterns, even if it is not truly anomalous, might be flagged.\nDeep learning techniques, especially CNNs [59] or autoencoders [16], [67], are widely utilized for this approach. An autoencoder attempts to learn a compressed representation of its input data and then reconstructs the original data from this representation. During training, the model learns to minimize the reconstruction error between the input and the output.\nAfter training, when the model encounters new data, it tries to reconstruct it based on what it has learned. The difference between the reconstructed output and the original input is measured, typically as a reconstruction error. A high reconstruction error indicates that the input is significantly different from what the model considers \u0026ldquo;normal\u0026rdquo;, implying that the input might be an anomaly.\nIn the work by [16], the authors developed a method to assess the regularity of frames in video sequences using reconstruction models. They employed two autoencoders: one with convolutional layers and the other without. 
The model processed two distinct types of inputs: manually designed features (such as HOG and HOF, enhanced with trajectory-based characteristics), and a combination of 10 successive frames aligned along the time axis. The error in reconstructing these frames serves as an indicator of their regularity score.\na: The evolution of the reconstruction approaches # The reconstruction-based paradigm is a key component of unsupervised learning frameworks and is often categorized within certain learning models as one-class classification (OCC) or unsupervised learning, primarily due to its emphasis on training using only the \u0026ldquo;normal\u0026rdquo; class of videos. Within the OCC paradigm, the model is trained only on data from one class (typically the \u0026ldquo;normal\u0026rdquo; class), and any deviation from this learned structure in the testing phase (an \u0026ldquo;anomaly\u0026rdquo;) results in a higher reconstruction error.\nOn the other hand, unsupervised learning focuses on understanding the structure or distribution within the data itself. It learns to reconstruct data without explicit labels indicating what is normal or anomalous.\nFor example, the work of [68] proposed a novel approach called Generative Cooperative Learning, which combines a generator and a discriminator trained together using a negative learning paradigm. The generator, designed as an autoencoder, reconstructs both normal and abnormal representations. Through negative learning, the discriminator learns to estimate the probability of an instance being abnormal by identifying the reconstruction errors from the generator. This method leverages the assumption that anomalies are less frequent than normal events and that normal events exhibit temporal consistency, allowing for more effective anomaly detection.\nAnother unsupervised approach by [69] utilizes diffusion models, a type of generative model, leveraging their reconstruction capabilities for detecting anomalies. 
The approach begins by extracting features from video clips using a 3D convolutional neural network (3D-CNN). These features are then fed into a diffusion model, which reconstructs the features without relying on labeled data. The diffusion model progressively adds Gaussian noise to the input data and learns to reverse this process, effectively reconstructing the input.\nFuture Frame Prediction Approaches: The concept of reconstruction-based approaches evolved as researchers observed that deep neural networks may not always produce significant reconstruction errors for unusual events. Additionally, certain normal videos, previously unencountered, could be incorrectly labeled as abnormal. To address these challenges, researchers began focusing on predicting future frames based on previous video frames while considering optical flow constraints to ensure motion consistency. This evolved methodology, termed \u0026ldquo;prediction approaches,\u0026rdquo; introduced Generative Adversarial Networks (GANs), which played a significant role in enhancing this approach [46], [47]. Building upon this framework, the HSTGCNN [70] model incorporates a sophisticated Future Frame Prediction (FFP) mechanism that significantly refines the anomaly detection process. By integrating hierarchical graph representations, the model not only predicts future frames but also encodes complex interactions among individuals and their movements, thus providing a more robust and context-aware anomaly detection system.\nA hybrid system incorporating flow reconstruction and frame prediction was introduced by [20]. This system detects unusual events in videos by memorizing normal activity patterns and predicting future frames. It utilizes a memory-augmented autoencoder for accurate flow reconstruction and a Conditional Variational Autoencoder for frame prediction. 
Anomalies are highlighted by larger errors in flow reconstruction and subsequent frame prediction.\nDifferent VAD paradigms have their strengths and weaknesses. ViT-based approaches that integrate language models or language supervision are a recent trend for improving VAD performance. In the realm of different approaches for VAD, several promising directions are emerging:\nVision-language models for VAD are expected to become more dominant than vision-only models. These models not only capture the semantic representation of the videos but also consider natural language descriptions of anomalies.\nExploration of methods to localize and describe anomalies in videos using natural language descriptions is a recent trend. Language-guided anomaly localization techniques enable the model to identify spatial and temporal regions corresponding to anomalies and generate human-interpretable descriptions, enhancing situational awareness and facilitating response efforts.\nExploration of GAN-based approaches for VAD, where the generator learns to generate normal video frames while the discriminator distinguishes between real and generated frames. GANs can capture complex distributions of normal behavior and detect deviations indicative of anomalies.\nDevelopment of graph-based models that represent spatial and temporal dependencies among entities in surveillance videos as a graph structure. Spatiotemporal graph models can capture intricate relationships among objects, activities, and contexts, enabling more accurate anomaly detection in complex scenes.\nInvestigation of transfer learning techniques to transfer knowledge from labeled source domains to unlabeled or sparsely labeled target domains can help mitigate the scarcity of labeled data in target domains and improve anomaly detection performance by leveraging knowledge learned from related surveillance scenarios.\n
Adoption of multi-resolution analysis techniques to analyze videos at different spatial and temporal resolutions. Multi-resolution approaches enable the detection of anomalies occurring at varying scales, from small-scale events to large-scale spatial or temporal patterns, enhancing the model\u0026rsquo;s sensitivity to diverse types of anomalies.\nAdoption of dynamic ensemble learning strategies to combine predictions from multiple anomaly detection models dynamically. Dynamic ensemble methods can adaptively adjust the ensemble composition based on the current surveillance context, improving detection robustness and resilience to evolving anomalies.\nD. LOSS FUNCTIONS # Loss functions play a pivotal role in quantifying the disparity between predicted outcomes by a model and the actual results. They serve as guides for the optimization process, facilitating adjustments and enhancements to model parameters during training. The selection of an appropriate loss function is critical as it directly impacts the model\u0026rsquo;s capacity to discern underlying patterns in the data and, consequently, its performance on unseen data. Various tasks necessitate distinct loss functions to adeptly capture the specific characteristics or complexities inherent in the problem domain.\n1) Multiple Instance Learning (MIL) Loss # Within the category of weakly supervised learning, notably Multiple Instance Learning (MIL) [11], the objective function is crafted to proficiently distinguish between normalcy and anomaly within video segments.\nConsider a dataset of labeled videos in which V denotes a video segment. The videos are categorized into positive bags B+ and negative bags B−. Following the ranking formulation of [11], the overall objective function is represented as:\nL(θ) = max(0, 1 − max_{V ∈ B+} S(V; θ) + max_{V ∈ B−} S(V; θ))\nThis function aims to ensure that the anomaly score S(V; θ) of the highest-scoring segment in the positive bag is greater than that of any segment within the negative bag. 
This is achieved by employing a hinge loss function which is parameterized by θ.\n2) Cross Entropy Loss # Alongside the MIL loss, some other researchers [28], [71] used the binary cross entropy within their loss functions to classify between normal and abnormal video snippets in the WSVAD tasks.\nThe goal is to train the model so that the snippet with the highest anomaly score in a normal video is still classified as normal, while the highest-scoring snippet in an anomalous video is classified as anomalous. For each video instance, MIL formulates a pair that includes the model\u0026rsquo;s prediction for the snippet with the highest anomaly score and the true label for the video\u0026rsquo;s anomaly, i.e., (max{f(V_i)}_{i=1}^{n}, y_i), where y_i = 0 for normal and y_i = 1 for anomalous instances. MIL subsequently consolidates these pairs from all the videos to form a set C of confidently labeled snippets. The model f is refined by optimizing the Binary Cross-Entropy (BCE) loss, calculated in its standard form as follows:\nL_BCE = − Σ_{(ŷ_i, y_i) ∈ C} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]\nwhere the maximum score ŷ_i = max{f(V_i)}_{i=1}^{n}. Under this paradigm, the model f must assign the lowest anomaly probability to all snippets in a normal video, thereby minimizing max{f(V_i)}_{i=1}^{n}, and the highest anomaly probability in an anomalous video, hence maximizing max{f(V_i)}_{i=1}^{n}. The strategy is to focus on the snippet with the strongest abnormality, even if the video is largely normal, to generate a set of snippets labeled with high confidence as anomalous.\n3) Reconstruction Error loss # As for the unsupervised approach, the objective function of the model, as shown in Equation 4, is to minimize the reconstruction error (the squared Euclidean distance, ∥ · ∥_2^2), as proposed by [16], between the pixel values of the original input frames and their reconstruction by the model with weights θ. 
The input frames, denoted as V_i, are passed through an encoder to obtain a compressed representation, which is then reconstructed back into frames F(V_i; θ) using a decoder; N is the size of the minibatch.\nFuture directions for video anomaly detection in terms of loss functions involve the exploration and development of novel loss functions tailored to address specific challenges and objectives in anomaly detection tasks. The promising future directions in this domain are:\nDesigning anomaly-aware losses that explicitly account for the characteristics of anomalies could penalize model errors differently based on the severity or rarity of anomalies, helping the model focus more on detecting critical anomalies while reducing false alarms for common or benign events.\nIncorporating temporal consistency constraints into the loss function to encourage smooth transitions and coherent predictions over time may penalize abrupt changes or inconsistencies in the model\u0026rsquo;s predictions across consecutive frames, promoting more stable and coherent anomaly detection results.\nIntegrating adversarial losses into the training process to improve the robustness of anomaly detection models against adversarial attacks can encourage the model to generate predictions that are resilient to adversarial perturbations or manipulations, enhancing the reliability and effectiveness of anomaly detection systems in real-world scenarios.\nIncorporating uncertainty-aware losses to quantify and mitigate uncertainty in anomaly detection predictions may enable the model to estimate the confidence or uncertainty associated with its predictions, facilitating more reliable decision-making and uncertainty quantification in anomaly detection systems.\nE. 
REGULARIZATION # Regularization techniques in deep learning, particularly for video anomaly detection, are crucial for preventing overfitting and improving the generalization of the models.\n1) Weight Decay regularization # Weight decay is a form of regularization commonly used in training neural networks to prevent overfitting. The term \u0026ldquo;weight decay\u0026rdquo; specifically refers to a technique that modifies the learning process to shrink the weights of the neural network during training. Weight decay works on the principle of adding a penalty to the loss function of a neural network related to the size of the weights [72]. The primary objective is to keep the weights small, which helps in reducing the model\u0026rsquo;s complexity and its tendency to overfit the training data. By penalizing large weights, weight decay ensures that the model does not rely too heavily on any single feature or combination of features, leading to better generalization.\na: L1 Regularization # L1 regularization adds a penalty to the loss function of the model proportional to the sum of the absolute values of the model\u0026rsquo;s weights. A loss function with L1 regularization can be expressed generally as follows:\nL(θ) = L0(θ) + λ Σ_i |θ_i|\nwhere L(θ) is the total loss function after including the L1 regularization term, L0(θ) is the original loss function without regularization, θ_i are the individual weights, and λ is a regularization parameter that controls the strength of the regularization effect. Higher values of λ lead to greater regularization, potentially resulting in more features being effectively ignored by reducing their coefficients to zero. The key aspect of L1 regularization is its ability to create sparse models. Sparsity here refers to the property where some of the coefficients become exactly zero, effectively excluding some features from the model. 
This is particularly useful in feature selection, where the most important features in the model are to be identified.\nb: L2 Regularization # L2 regularization, also known as Ridge regularization, is a technique that modifies the loss function of a model by adding a penalty proportional to the squared magnitude of the model\u0026rsquo;s weights. The formula for a loss function with L2 regularization is generally expressed as follows:\nL(θ) = L0(θ) + λ Σ_i θ_i^2\nHere, L(θ) represents the total loss function, including the L2 regularization term. L0(θ) is the original loss function without regularization, θ_i are the individual weights, and λ is the regularization parameter, which determines the strength of the regularization effect. Increasing the value of λ enhances the regularization effect, penalizing larger weights more significantly. The key characteristic of L2 regularization is its ability to prevent overfitting by keeping the weights small, which leads to a more generalized model. Unlike L1 regularization, which promotes sparsity in the model (some coefficients becoming exactly zero), L2 regularization tends to shrink all coefficients towards zero but does not make them exactly zero. This attribute makes it particularly useful in scenarios where many features contribute small amounts to the predictions. By penalizing the square of the weights, L2 regularization ensures that no single feature dominates the model, which can be essential for models where all features carry some level of importance.\n2) Temporal and Spatial Constraints # Temporal and spatial constraints are other types of regularization terms that are commonly used in VAD [11]. Adding these terms to the loss function guides the model\u0026rsquo;s spatial and temporal learning for distinguishing between the normal and abnormal events in videos [73]. 
Temporal constraints are used to guarantee the continuity and progression of events over time, while spatial constraints are used to enhance the model\u0026rsquo;s understanding of the objects\u0026rsquo; positions and physical locations throughout the frames of the video and their relation to the existence of anomalies. By incorporating both types of constraints in the loss function of a VAD model, the model is inclined toward achieving more robust anomaly detection.\na: Temporal Smoothing Constraint # The temporal smoothing constraint is a crucial component of loss functions in VAD, employed to maintain stability in predicted anomaly scores across consecutive frames. This constraint aims to mitigate abrupt fluctuations in anomaly scores between adjacent frames, aligning with the expectation that successive frames typically exhibit similar anomaly characteristics. The temporal smoothing constraint penalizes sudden changes in the anomaly score between sequential frames as follows in equation 7:\nλ_1 Σ_i (S(V_i; θ) − S(V_{i+1}; θ))^2    (7)\nwhere λ_1 is a smoothing coefficient, S(V_i; θ) is the anomaly score of frame i, and S(V_{i+1}; θ) is the anomaly score of frame i + 1.\nb: Sparsity constraint # The sparsity constraint is a commonly used regularization term in VAD that forces the anomalous frames to be a minority among the video frames. In VAD, the anomalous frames are generally assumed to be the minority compared to the normal frames. The sparsity constraint penalizes the total anomaly score, and hence the number of frames flagged as anomalous in the video, as follows in equation 8:\nλ_2 Σ_i S(V_i; θ)    (8)\nwhere λ_2 is a sparsity coefficient and S(V_i; θ) is the anomaly score of frame i.\nWhile spatio-temporal constraints are well suited to handling abrupt anomalies within a video sequence, the recent trend toward incorporating VAD-specific constraints further improves overall performance. 
Future directions in this domain include the following:\nAdversarial regularization techniques may help the model learn more resilient features by introducing adversarial perturbations during training, thereby enhancing its ability to detect anomalies in the presence of adversarial manipulations.\nIncorporating temporal regularization techniques to enforce temporal consistency and coherence in anomaly detection predictions over time can encourage smooth transitions and consistent predictions across consecutive frames.\nGraph-based regularization techniques can impose structural constraints on the learned representations, encouraging the model to capture meaningful interactions and contextual information among objects, scenes, or events, leading to more accurate anomaly detection.\nIntegrating sparse regularization techniques to encourage sparsity in the learned representations, focusing the model\u0026rsquo;s attention on salient features or regions within videos, can promote the selection of informative features while suppressing irrelevant noise.\nAddressing catastrophic forgetting and model degradation over time by incorporating continual learning regularization techniques can enable the model to adapt and learn from new data while preserving previously learned knowledge, facilitating adaptation to evolving surveillance environments.\nF. ANOMALY SCORE # The anomaly score indicates the likelihood of a segment or frame in a video being abnormal, calculated based on how much it deviates from normal patterns. A high anomaly score suggests a high probability of an anomaly.\nFor the reconstruction approaches [16], [67], after training the model, its effectiveness is measured by inputting test data to see if it can accurately identify unusual activities with a minimal false alarm rate. 
Then, the anomaly score S(V_i) for each frame V_i is calculated by scaling the reconstruction error of the frame, e(V_i), between 0 and 1.\nHence, we can define the regularity score to be:\nS_r(V_i) = 1 − S(V_i)\nFor future frame prediction methods [46], [47], however, [74] has shown that Peak Signal-to-Noise Ratio (PSNR) is a superior measure of image quality, as indicated below:\nPSNR(Y, Ŷ) = 10 log10( [max_Ŷ]^2 / MSE(Y, Ŷ) )\nThe assumption is that we can make accurate predictions for normal events. Therefore, we can detect anomalies by measuring the difference between a predicted frame Ŷ and its corresponding ground truth Y. Here, max_Ŷ is the maximum possible value of the image intensities, and its square is divided by the mean square error (MSE) between Ŷ and Y. In image reconstruction tasks, a higher PSNR would indicate a lower error and, thus, higher fidelity to the original image, which is the desired outcome.\nAfter calculating the PSNR between the predicted Ŷ and its corresponding ground truth Y for each frame V_i in a test video and normalizing these values, each frame\u0026rsquo;s regularity score S_r(V_i) is determined using the equation below.\nThis score indicates the likelihood of a frame being normal or abnormal, with the potential to set a threshold point for distinguishing between the two.\nVI. MODEL EVALUATION # A. DATASETS GUIDELINE # As discussed in section V-A, several datasets were used in literature for training and evaluating VAD models. These datasets are well-suited for different types of supervision schemes. Most of the datasets, including UCF-Crime [11], ShanghaiTech [14], XD-Violence [13], and CUHK Avenue [15], provide video-level labels for the training set which could be utilized for training weakly-supervised VAD as in the works by Joo et al. [26] and Cho et al. [73]. These datasets could also be utilized for unsupervised VAD by utilizing the training set without the video-level labels as in the works of Shi et al. [75] and Zaheer et al. [68] and for self-supervised learning as in the work of Zhou et al. [62].\nB. 
EVALUATION METRICS # Most previous works mainly compared their results on different datasets using metrics such as the frame-level Area Under the ROC Curve (AUC), the Equal Error Rate (EER), and the frame-level Average Precision (AP) to measure performance. The main aim of these metrics is to assess the model\u0026rsquo;s proficiency in differentiating between normal and abnormal videos.\nArea Under the ROC Curve (AUC): The AUC is a significant metric used to evaluate model performance. It represents the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across various threshold settings. The TPR is defined as:\nTPR = TP / (TP + FN)\nand the FPR is given by:\nFPR = FP / (FP + TN)\nwhere TP, FN, FP, and TN denote the numbers of true positives, false negatives, false positives, and true negatives, respectively.\nA higher AUC value indicates better model performance, where an AUC of 1 signifies a perfect model and an AUC of 0.5 suggests no discriminative power between positive and negative classes. Two types of AUC that are commonly used for evaluating VAD models are Micro-AUC and Macro-AUC. Micro-AUC considers all samples across all classes together to compute a single AUC, making it sensitive to the most frequent classes. Macro-AUC, on the other hand, calculates the AUC for each class independently and averages these values, providing a balanced view of performance across all classes. While Micro-AUC reflects overall detection capability, Macro-AUC ensures the model performs well across different types of anomalies or normal events, offering insights into potential biases towards certain classes.\nAverage Precision (AP): The AP metric computes the average value of precision for different recall values. It summarizes the precision-recall curve into a single value representing average performance across all threshold levels.\nEqual Error Rate (EER): The EER represents the point at which the FPR equals the False Negative Rate (FNR). 
Lower EER values indicate a greater degree of accuracy in the system [62].\nIn this survey, we use the AUC as the performance metric in our quantitative comparison of the SOTA models in Section VII-A.\nVII. COMPARATIVE ANALYSIS OF SOTA MODELS # A. QUANTITATIVE COMPARISON # Several works in the literature have proposed different deep learning architectures for addressing the problem of VAD. Table 2 benchmarks recent anomaly detection works on publicly available datasets. The table offers a detailed look at the evolution and current state of anomaly detection techniques, primarily focusing on publicly available datasets including UCF-Crime, ShanghaiTech, XD-Violence, and Avenue. The data spans a decade, from 2013 to 2023, providing a comprehensive view of the field\u0026rsquo;s progression in the last ten years. An important trend that can be observed in the table is the shift in research interest from supervised VAD approaches to weakly supervised VAD learning methods. This change, particularly notable from 2018 onwards and greatly influenced by the work of Sultani et al. [11], who introduced the \u0026ldquo;UCF-Crime\u0026rdquo; dataset and the multiple instance learning (MIL) pipeline, suggests a growing preference in the industry for weakly supervised approaches that rely only on video-level labels over techniques that require fully labeled data, which is often expensive or difficult to obtain. This shift reflects the practical challenges and evolving needs in anomaly detection applications, particularly those where hundreds of hours of unlabeled or weakly labeled video footage are available.\nA notable aspect to consider is the wide array of feature extractors utilized in these studies. 
Ranging from simpler methods like 3D cube gradient with Principal Component Analysis (PCA) to more sophisticated deep learning architectures such as Convolutional Auto Encoders (CAE), 3D Convolutional Networks (C3D), Temporal Segment Networks (TSN), Inflated 3D ConvNet (I3D), and CLIP, the diversity in approaches underscores the intricate nature of the VAD task, emphasizing the necessity for tailored methodologies to address diverse scenarios.\nAnother critical facet of the analysis pertains to performance metrics, particularly the Area Under the Curve (AUC) scores across different datasets, which show notable enhancements in model performance over time. This trend not only signifies advancements in methodological efficacy but also underscores the increasing accuracy and reliability of anomaly detection systems. Leading models, notably recent ones like \u0026ldquo;CLIP-TSA\u0026rdquo; [26] published in 2023, exhibit high AUC scores (87.58 on UCF-Crime and 98.32 on ShanghaiTech in Table 2), illustrating the rapid progress achieved in recent years, particularly with the integration of advanced techniques from Natural Language Processing (NLP) into models like CLIP, BLIP, and GPT. A novel approach that utilizes vision-language models is [61], published in 2024, which outperformed CLIP-TSA on two datasets, UCF-Crime (87.79 vs. 87.58) and XD-Violence (83.68 vs. 82.19), as shown in Table 2. Based on the SOTA comparison, it is evident that vision-language-based approaches are currently at the forefront of VAD research.\nHowever, the analysis reveals several instances of missing data. These gaps across various methodologies and datasets highlight the necessity for more comprehensive benchmarking efforts. Addressing such gaps is crucial to foster a deeper understanding and facilitate meaningful comparisons among different anomaly detection methods, ultimately advancing the field.\nB. 
QUALITATIVE COMPARISON # To observe in more detail how current state-of-the-art models perform on various normal and anomalous surveillance videos, this section presents a qualitative analysis of these models. Figure 5 depicts a qualitative assessment of correctly classified and misclassified video frames using four VAD models, namely Sultani et al. [11], GCN [68], CLAV [73], and VAD-CLIP [35]. As shown in the figure\u0026rsquo;s first row, these four VAD models were able to correctly classify normal and anomalous video frames taken from different scenes and environments. However, these models misclassified other similar frames taken from the same videos, either by detecting anomalies in a normal video or by missing an anomalous video segment. For example, the model proposed by [11] misclassified frame 10000 in the video \u0026ldquo;Burglary032\u0026rdquo;, which shows a person entering an office through a window, by labeling it as \u0026ldquo;Normal\u0026rdquo; while it is an \u0026ldquo;Anomaly\u0026rdquo;. This was attributed to the extreme darkness of the scene, which made recognizing such an ambiguous anomaly difficult. Another example is the GCN model by [68], which misclassified frame 4400 in the video \u0026ldquo;Stealing058\u0026rdquo;, which shows a car leaving a parking spot, by labeling it as an \u0026ldquo;Anomaly\u0026rdquo; while it is \u0026ldquo;Normal\u0026rdquo;. This could be attributed to the sudden movement of the car with its lights turned on. Similarly, the model proposed by [73], named CLAV, misclassified the initial frames of a scene showing cars stopping and accelerating again in a street. This increased anomaly score could be explained by the abrupt stopping of the cars, which in many cases could be considered an anomaly. 
Finally, the VAD-CLIP model proposed by [35] misclassified the frame number 360 in the video \u0026ldquo;Shooting008\u0026rdquo;, which shows a person crawling on the floor, by classifying it as an \u0026ldquo;Anomaly\u0026rdquo; while it is \u0026ldquo;Normal\u0026rdquo;. This misclassification could be explained as the person crawling moves unconventionally in contrast with what commonly happens in normal street and shopping center scenes. It could be concluded that some normal and abnormal frames could be misclassified when the overall context of the video cannot be inferred directly from the frame in cases where the frames give a false impression of a different context.\nTABLE 2. Benchmarking of recent anomaly detection works on publicly available datasets.\n| Supervision type | Method | Venue | Year | Feature extractor | UCF-Crime AUC | STech AUC | XD-Violence AUC | Avenue AUC | Ref |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| self-supervised | SS-MTL | CVPR | 2021 | 3D CNN | // | // | // | 86.9 | [9] |\n| self-supervised | Lu et al. | ICCV | 2013 | 3D cube gradient and PCA | 65.51 | 68.0 | // | // | [15] |\n| self-supervised | Hasan et al. | CVPR | 2016 | Convolutional autoencoder | 50.60 | 60.85 | // | 70.20 | [16] |\n| self-supervised | AMC | ICCV | 2019 | Conv-A | // | // | // | 86.9 | [59] |\n| self-supervised | MemAE | ICCV | 2019 | 2D CNN Encoder-Decoder | // | 71.2 | // | 83.3 | [76] |\n| self-supervised | MPED-RNN | CVPR | 2019 | RNN | // | 73.4 | // | // | [77] |\n| self-supervised | CVAE | ICCV | 2021 | ML-MemAE-SC, CVAE | // | 76.2 | // | 91.1 | [20] |\n| self-supervised | GCL | CVPR | 2022 | ResNex | 71.04 | 78.93 | // | // | [68] |\n| self-supervised | USTN-DSC | CVPR | 2023 | Transformer Encoder-Decoder | // | 73.8 | // | 89.9 | [55] |\n| self-supervised | FPDM | ICCV | 2023 | Diffusion Model | 74.7 | 78.6 | // | // | [78] |\n| self-supervised | SLMPT | ICCV | 2023 | U-Net | // | 78.8 | // | 90. | [75] |\n| self-supervised | EVAL | CVPR | 2023 | 3D CNN | // | 76.63 | // | 86.02 | [44] |\n| | Sultani et al. | CVPR | 2018 | C3D | 75.41 | // | // | // | [11] |\n| | TCN-IBL | ICIP | 2019 | C3D | 78.66 | 82.50 | // | // | [27] |\n| | GCN-TSN | CVPR | 2019 | TSN - RG | 82.12 | 84.44 | // | // | [28] |\n| | SRF | IEEE SPL | 2020 | C3D | 79.54 | 84.16 | // | // | [29] |\n| | Noise Cleaner | CVPRW | 2021 | C3D | 78.26 | 84.16 | // | // | [65] |\n| | DAM | AVSS | 2021 | I3D-RGB | 82.67 | 88.22 | // | // | [30] |\n| | SA-CRF | ICCV | 2021 | Relation-aware TSN ResNet-50 | 85.00 | 96.85 | // | // | [31] |\n| | RTFM | ICCV | 2021 | I3D-RGB | 84.30 | 97.21 | 77.81 | // | [32] |\n| | MIST | CVPR | 2021 | I3D-RGB | 82.30 | 94.84 | // | // | [52] |\n| | MSLNet-CTE | AAAI | 2022 | VideoSwin-RGB | 85.62 | 97.32 | 78.5 | // | [23] |\n| | CUPL | CVPR | 2023 | I3D+VGGish | 86.22 | // | // | // | [79] |\n| | HSC | CVPR | 2023 | AE | // | 83.4 | // | 93.7 | [22] |\n| | UMIL | CVPR | 2023 | X-CLIP-B/32 | 86.75 | // | // | // | [71] |\n| | TeD-SPAD | ICCV | 2023 | U-Net + I3D | 75.06 | // | // | // | [80] |\n| | CLAV | CVPR | 2023 | 3D-RGB | 86.1 | 97.6 | 81.3 | 89.8 | [73] |\n| | TEVAD | CVPRW | 2023 | Net-50 I3D-RGB / SwinBERT | 84.90 | 98.10 | 79.80 | // | [33] |\n| | CLIP-TSA | ICIP | 2023 | CLIP | 87.58 | 98.32 | 82.19 | // | [26] |\n| | TPWNG | CVPR | 2024 | CLIP (ViT-B/16) | 87.79 | // | 83.68 | // | [61] |\nVIII. VISUALIZING BIBLIOMETRIC NETWORKS FOR THEMATIC ANALYSIS # In bibliometrics, visualization emerges as a potent technique for analyzing diverse networks, encompassing citation-based, co-authorship-based, or keyword co-occurrence-based networks. Density visualizations provide a rapid overview of the essential components within a bibliometric network. Employing NLP techniques, one can construct term co-occurrence networks for textual data in English. The rationale behind employing interconnected networks is to illustrate the temporal and relevance-driven evolution of the field.\nWe utilized VOSviewer [81], a tool employing a distance-based method to generate visualizations of bibliometric networks. Directed networks, like those formed by citation relationships, are treated as undirected in this context. VOSviewer autonomously organizes network nodes into clusters, where nodes with similar attributes are grouped. Each node in the network is neatly assigned to one of these clusters. 
Furthermore, VOSviewer employs color to denote a node\u0026rsquo;s membership within a cluster in bibliometric network visualizations.\nFive distinct clusters have emerged, as shown in Figure 6, each representing a different research theme. The green cluster, centered around the theme of diffusion, explores methods based on diffusion models for anomaly detection, with representative work such as \u0026ldquo;Anomaly Detection in Satellite Videos using Diffusion Models\u0026rdquo; [82]. In the yellow cluster, efficiency takes precedence, with research efforts directed towards developing algorithms capable of accurate anomaly detection at millisecond-level latencies, as demonstrated in the paper \u0026ldquo;EfficientAD: Accurate Visual Anomaly Detection at Millisecond-Level Latencies\u0026rdquo; [83]. The purple cluster focuses on interleaving one-class and weakly-supervised models with adaptive thresholding for unsupervised video anomaly detection, exemplified by the paper \u0026ldquo;Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection\u0026rdquo; [84]. Within the red cluster, the exploration delves into vision language models (VLM), particularly CLIP, for video anomaly recognition, as seen in \u0026ldquo;Delving into CLIP latent space for Video Anomaly Recognition\u0026rdquo; [85]. Finally, the blue cluster is dedicated to research involving Generative Adversarial Networks (GANs) for VAD tasks, with representative work titled \u0026ldquo;Video Anomaly Detection using GAN\u0026rdquo; [86]. All clusters are further illustrated in Table 3.\nIX. DISCUSSION AND FUTURE RESEARCH # In this survey paper, we present a comprehensive review that serves as a guideline to solve challenges related to VAD. 
The past decade has witnessed a remarkable evolution in the field of Video Anomaly Detection (VAD), marked by a notable transition from supervised to weakly supervised learning approaches and reconstruction-based techniques. This shift reflects a growing preference for methods capable of operating effectively with less reliance on fully labeled data, which can be both costly and challenging to obtain. Notably, the majority of benchmarking datasets are weakly supervised, featuring video-level annotations, while reconstruction approaches utilize only normal data in an unsupervised manner during training.\nFIGURE 5. A qualitative comparison and illustration of correctly and incorrectly classified frames using four VAD models, namely Sultani et al. [11], GCN [68], CLAV [73], and VAD-CLIP [35].\nFIGURE 6. Visualizing Bibliometric Networks for Thematic Analysis of Recent Literature (50 top cited papers) on Video Anomaly Detection between 2023 and 2024.\nThe existing literature presents a diverse array of VAD datasets, ranging from specific single-scene datasets like UCSD Pedestrian to more comprehensive, multi-scene collections such as UCF-Crime and XD-Violence. These datasets vary significantly in size, duration, and the nature of anomalies they encompass.\nTABLE 3. Cluster central point and color for each cluster with representative paper and theme\n| Cluster color | Representative paper | Theme |\n| --- | --- | --- |\n| Green | Anomaly Detection in Satellite Videos using Diffusion Models | Diffusion |\n| Red | Delving into CLIP latent space for Video Anomaly Recognition | VLM |\n| Yellow | EfficientAD: Accurate Visual Anomaly Detection at Millisecond-Level Latencies | Efficiency |\n| Blue | Video Anomaly Detection using GAN | GAN |\n| Purple | Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection | Weakly-Supervised |\nHowever, despite the progress made, challenges persist within the realm of VAD. 
Issues such as limited environmental diversity, a narrow range of anomalous event types, and class imbalance continue to impact the generalization and detection accuracy of models across different settings. This persistent gap underscores the urgent need for more diverse datasets to facilitate comprehensive and varied testing of VAD models.\nFurthermore, the utilization of various hybrid deep learning techniques for feature extraction, including Convolutional Neural Networks (CNNs), Autoencoders (AEs), Generative Adversarial Networks (GANs), sequential deep learning models, and vision-language models, highlights the complexity of VAD tasks. These techniques extract crucial types of features, such as spatiotemporal and textual features, underscoring the necessity for specialized approaches tailored to specific scenarios and reflecting the dynamic nature of the field.\nMoreover, the selection of appropriate loss functions has played a pivotal role in the effectiveness of VAD models, serving as a fundamental component of model optimization and directly influencing a model\u0026rsquo;s capacity to learn and make accurate predictions. Additionally, incorporating regularization terms into loss functions, particularly sparsity and smoothing constraints, has been essential to enhance the model\u0026rsquo;s capability to discern between normal and abnormal events.\nIn the testing phase, evaluating the anomaly score using metrics such as Area Under the Curve (AUC), Average Precision (AP), and Equal Error Rate (EER) has been critical for comprehending the quality and biases of false positives, thereby providing insights into the model\u0026rsquo;s limitations. It has been imperative to assess models across multiple datasets to ensure their robustness and generalization ability.\nA. 
FUTURE DIRECTIONS FOR RESEARCH # Exploring the fusion of state-of-the-art vision-language models, particularly integrating textual features, with traditional VAD approaches presents a promising avenue for future investigations. This interdisciplinary approach holds the potential to enhance anomaly detection systems by imbuing them with a deeper understanding of complex video content, where semantic meanings are enriched through textual annotations.\nWhile significant progress has been made in the field of VAD, there remains a pressing need for more diverse and extensive datasets. Specifically, datasets covering a broader range of scenarios, encompassing multiple scenes and anomalies, are essential. Such datasets would not only facilitate the benchmarking of existing models but also stimulate innovation by presenting researchers with more challenging real-world situations to address.\nFurthermore, since VLMs are emerging, an important direction for future research involves the integration of textual descriptions in the data containing contextual details, such as frame-level captions, into anomaly detection models. This integration has the potential to significantly improve model performance by providing rich, descriptive information that aids in the interpretation and analysis of visual content. Consequently, this trend will necessitate the development of stronger loss functions capable of more effectively managing textual information.\nIn conclusion, this survey paper highlights an exciting phase of growth and transformation in the field of VAD, characterized by methodological advancements, the integration of new technologies, and a shift towards more efficient learning approaches. However, the identified gaps and challenges underscore the need for continued efforts within the research community to develop comprehensive datasets and explore novel methodologies, ultimately advancing the state of the art in anomaly detection.\nREFERENCES # [1] B. Ramachandra, M. J. 
Jones, and R. R. Vatsavai, \u0026ldquo;A survey of single-scene video anomaly detection,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 5, pp. 2293–2312, 2020.\n[2] C. C. Aggarwal, \u0026ldquo;Applications of outlier analysis,\u0026rdquo; Outlier Analysis, pp. 373–400, 2013.\n[3] M. Jiang, C. Hou, A. Zheng, X. Hu, S. Han, H. Huang, X. He, P. S. Yu, and Y. Zhao, \u0026ldquo;Weakly supervised anomaly detection: A survey,\u0026rdquo; arXiv preprint arXiv:2302.04549, 2023.\n[4] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, \u0026ldquo;Deep learning for anomaly detection: A review,\u0026rdquo; ACM computing surveys (CSUR), vol. 54, no. 2, pp. 1–38, 2021.\n[5] Z. Shao, H. Bian, Y. Chen, Y. Wang, J. Zhang, X. Ji et al., \u0026ldquo;Transmil: Transformer based correlated multiple instance learning for whole slide image classification,\u0026rdquo; Advances in neural information processing systems, vol. 34, pp. 2136–2147, 2021.\n[6] W. Ullah, T. Hussain, F. U. M. Ullah, M. Y. Lee, and S. W. Baik, \u0026ldquo;Transcnn: Hybrid cnn and transformer mechanism for surveillance anomaly detection,\u0026rdquo; Engineering Applications of Artificial Intelligence, vol. 123, p. 106173, 2023.\n[7] R. Nayak, U. C. Pati, and S. K. Das, \u0026ldquo;A comprehensive review on deep learning-based methods for video anomaly detection,\u0026rdquo; Image and Vision Computing, vol. 106, p. 104078, 2021.\n[8] B. R. Kiran, D. M. Thomas, and R. Parakkal, \u0026ldquo;An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos,\u0026rdquo; Journal of Imaging, vol. 4, no. 2, p. 36, 2018.\n[9] M.-I. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, \u0026ldquo;Anomaly detection in video via self-supervised and multi-task learning,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 742–12 752.\n[10] G. Liu, L. 
Shu, Y. Yang, and C. Jin, \u0026ldquo;Unsupervised video anomaly detection in uavs: a new approach based on learning and inference,\u0026rdquo; Frontiers in Sustainable Cities, vol. 5, p. 1197434, 2023.\n[11] W. Sultani, C. Chen, and M. Shah, \u0026ldquo;Real-world anomaly detection in surveillance videos,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6479–6488.\n[12] R. Nawaratne, D. Alahakoon, D. De Silva, and X. Yu, \u0026ldquo;Spatiotemporal anomaly detection using deep learning for real-time video surveillance,\u0026rdquo; IEEE Transactions on Industrial Informatics, vol. 16, no. 1, pp. 393–402, 2019.\n[13] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, \u0026ldquo;Not only look, but also listen: Learning multimodal violence detection under weak supervision,\u0026rdquo; in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer, 2020, pp. 322–339.\n[14] W. Luo, W. Liu, and S. Gao, \u0026ldquo;A revisit of sparse coding based anomaly detection in stacked rnn framework,\u0026rdquo; in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 341–349.\n[15] C. Lu, J. Shi, and J. Jia, \u0026ldquo;Abnormal event detection at 150 fps in matlab,\u0026rdquo; in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2720–2727.\n[16] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, \u0026ldquo;Learning temporal regularity in video sequences,\u0026rdquo; in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 733–742.\n[17] N. Dalal and B. Triggs, \u0026ldquo;Histograms of oriented gradients for human detection,\u0026rdquo; in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), vol. 1. Ieee, 2005, pp. 886–893.\n[18] N. Dalal, B. Triggs, and C. 
Schmid, \u0026ldquo;Human detection using oriented histograms of flow and appearance,\u0026rdquo; in Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part II 9. Springer, 2006, pp. 428–441.\n[19] W. Luo, W. Liu, and S. Gao, \u0026ldquo;Remembering history with convolutional lstm for anomaly detection,\u0026rdquo; in 2017 IEEE International conference on multimedia and expo (ICME). IEEE, 2017, pp. 439–444.\n[20] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. Li, \u0026ldquo;A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 588–13 597.\n[21] D. Li, X. Nie, R. Gong, X. Lin, and H. Yu, \u0026ldquo;Multi-branch gan-based abnormal events detection via context learning in surveillance videos,\u0026rdquo; IEEE Transactions on Circuits and Systems for Video Technology, 2023.\n[22] S. Sun and X. Gong, \u0026ldquo;Hierarchical semantic contrast for scene-aware video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 846–22 856.\n[23] S. Li, F. Liu, and L. Jiao, \u0026ldquo;Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1395–1403.\n[24] B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. Fu, S. Xiang, and H. Ling, \u0026ldquo;Expanding language-image pretrained models for general video recognition,\u0026rdquo; in European Conference on Computer Vision. Springer, 2022, pp. 1–18.\n[25] C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, \u0026ldquo;Prompting visual-language models for efficient video understanding,\u0026rdquo; in European Conference on Computer Vision. Springer, 2022, pp. 105–124.\n[26] H. K. Joo, K. Vo, K. 
Yamazaki, and N. Le, \u0026ldquo;Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection,\u0026rdquo; in 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023, pp. 3230–3234.\n[27] J. Zhang, L. Qing, and J. Miao, \u0026ldquo;Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection,\u0026rdquo; in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 4030–4034.\n[28] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, \u0026ldquo;Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1237–1246.\n[29] M. Z. Zaheer, A. Mahmood, H. Shin, and S.-I. Lee, \u0026ldquo;A self-reasoning framework for anomaly detection using video-level labels,\u0026rdquo; IEEE Signal Processing Letters, vol. 27, pp. 1705–1709, 2020.\n[30] S. Majhi, S. Das, and F. Brémond, \u0026ldquo;Dam: dissimilarity attention module for weakly-supervised video anomaly detection,\u0026rdquo; in 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2021, pp. 1–8.\n[31] D. Purwanto, Y.-T. Chen, and W.-H. Fang, \u0026ldquo;Dance with self-attention: A new look of conditional random fields on anomaly detection in videos,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 173–183.\n[32] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, \u0026ldquo;Weakly-supervised video anomaly detection with robust temporal feature magnitude learning,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 4975–4986.\n[33] W. Chen, K. T. Ma, Z. J. Yew, M. Hur, and D. A.-A. 
Khoo, \u0026ldquo;Tevad: Improved video anomaly detection with captions,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5548–5558.\n[34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in International conference on machine learning. PMLR, 2021, pp. 8748–8763.\n[35] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang, \u0026ldquo;Vadclip: Adapting vision-language models for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 6074–6082.\n[36] K. Lin, L. Li, C.-C. Lin, F. Ahmed, Z. Gan, Z. Liu, Y. Lu, and L. Wang, \u0026ldquo;Swinbert: End-to-end transformers with sparse attention for video captioning,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 949–17 958.\n[37] R. Chalapathy and S. Chawla, \u0026ldquo;Deep learning for anomaly detection: A survey,\u0026rdquo; arXiv preprint arXiv:1901.03407, 2019.\n[38] M. Baradaran and R. Bergevin, \u0026ldquo;A critical study on the recent deep learning based semi-supervised video anomaly detection methods,\u0026rdquo; Multimedia Tools and Applications, vol. 83, no. 9, pp. 27 761–27 807, 2024.\n[39] N. Choudhry, J. Abawajy, S. Huda, and I. Rao, \u0026ldquo;A comprehensive survey of machine learning methods for surveillance videos anomaly detection,\u0026rdquo; IEEE Access, 2023.\n[40] Y. Liu, D. Yang, Y. Wang, J. Liu, J. Liu, A. Boukerche, P. Sun, and L. Song, \u0026ldquo;Generalized video anomaly event detection: Systematic taxonomy and comparison of deep models,\u0026rdquo; ACM Computing Surveys, 2023.\n[41] A. Adam, E. Rivlin, I. Shimshoni, and D. 
Reinitz, \u0026ldquo;Robust real-time unusual event detection using multiple fixed-location monitors,\u0026rdquo; IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 555–560, 2008.\n[42] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, \u0026ldquo;Anomaly detection in crowded scenes,\u0026rdquo; in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981.\n[43] B. Ramachandra and M. Jones, \u0026ldquo;Street scene: A new dataset and evaluation protocol for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 2569–2578.\n[44] A. Singh, M. J. Jones, and E. G. Learned-Miller, \u0026ldquo;Eval: Explainable video anomaly localization,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 717–18 726.\n[45] C. Cao, Y. Lu, P. Wang, and Y. Zhang, \u0026ldquo;A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 20 392–20 401.\n[46] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection–a new baseline,\u0026rdquo; in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6536–6545.\n[47] W. Luo, W. Liu, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction network for video anomaly detection,\u0026rdquo; IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 11, pp. 7505–7520, 2021.\n[48] Y. Zhong, X. Chen, Y. Hu, P. Tang, and F. Ren, \u0026ldquo;Bidirectional spatiotemporal feature learning with multiscale evaluation for video anomaly detection,\u0026rdquo; IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8285–8296, 2022.\n[49] R. F. Mansour, J. 
Escorcia-Gutierrez, M. Gamarra, J. A. Villanueva, and N. Leal, \u0026ldquo;Intelligent video anomaly detection and classification using faster rcnn with deep reinforcement learning model,\u0026rdquo; Image and Vision Computing, vol. 112, p. 104229, 2021.\n[50] J. Carreira and A. Zisserman, \u0026ldquo;Quo vadis, action recognition? a new model and the kinetics dataset,\u0026rdquo; in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.\n[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \u0026ldquo;Imagenet classification with deep convolutional neural networks,\u0026rdquo; Advances in neural information processing systems, vol. 25, 2012.\n[52] J.-C. Feng, F.-T. Hong, and W.-S. Zheng, \u0026ldquo;Mist: Multiple instance self-training framework for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14 009–14 018.\n[53] Z. Chen, J. Duan, L. Kang, and G. Qiu, \u0026ldquo;Supervised anomaly detection via conditional generative adversarial network and ensemble active learning,\u0026rdquo; IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 7781–7798, 2022.\n[54] M. Abdalla, H. Hassan, N. Mostafa, S. Abdelghafar, A. Al-Kabbany, and M. Hadhoud, \u0026ldquo;An nlp-based system for modulating virtual experiences using speech instructions,\u0026rdquo; Expert Systems with Applications, vol. 249, p. 123484, 2024.\n[55] Z. Yang, J. Liu, Z. Wu, P. Wu, and X. Liu, \u0026ldquo;Video event restoration based on keyframes for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 592–14 601.\n[56] J. Li, D. Li, S. Savarese, and S. Hoi, \u0026ldquo;Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,\u0026rdquo; arXiv preprint arXiv:2301.12597, 2023.\n[57] P. Wu, J. Liu, X. He, Y. Peng, P. Wang, and Y. 
Zhang, \u0026ldquo;Toward video anomaly retrieval from video anomaly detection: New benchmarks and model,\u0026rdquo; IEEE Transactions on Image Processing, vol. 33, pp. 2213–2225, 2024.\n[58] H. Yuan, Z. Cai, H. Zhou, Y. Wang, and X. Chen, \u0026ldquo;Transanomaly: Video anomaly detection using video vision transformer,\u0026rdquo; IEEE Access, vol. 9, pp. 123 977–123 986, 2021.\n[59] T.-N. Nguyen and J. Meunier, \u0026ldquo;Anomaly detection in video sequence with appearance-motion correspondence,\u0026rdquo; in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1273–1283.\n[60] S. Majhi, R. Dash, and P. K. Sa, \u0026ldquo;Temporal pooling in inflated 3dcnn for weakly-supervised video anomaly detection,\u0026rdquo; in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 2020, pp. 1–6.\n[61] Z. Yang, J. Liu, and P. Wu, \u0026ldquo;Text prompt with normality guidance for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 899–18 908.\n[62] S. Zhou, W. Shen, D. Zeng, M. Fang, Y. Wei, and Z. Zhang, \u0026ldquo;Spatial–temporal convolutional neural networks for anomaly detection and localization in crowded scenes,\u0026rdquo; Signal Processing: Image Communication, vol. 47, pp. 358–368, 2016.\n[63] A. Barbalau, R. T. Ionescu, M.-I. Georgescu, J. Dueholm, B. Ramachandra, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, \u0026ldquo;Ssmtl++: Revisiting self-supervised multi-task learning for video anomaly detection,\u0026rdquo; Computer Vision and Image Understanding, vol. 229, p. 103656, 2023.\n[64] Z. Yang, Y. Guo, J. Wang, D. Huang, X. Bao, and Y. Wang, \u0026ldquo;Towards video anomaly detection in the real world: A binarization embedded weakly-supervised network,\u0026rdquo; IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2023.\n[65] M. Z. Zaheer, J.-h. 
Lee, M. Astrid, A. Mahmood, and S.-I. Lee, \u0026ldquo;Cleaning label noise with clusters for minimally supervised anomaly detection,\u0026rdquo; arXiv preprint arXiv:2104.14770, 2021.\n[66] D. Zhang, C. Huang, C. Liu, and Y. Xu, \u0026ldquo;Weakly supervised video anomaly detection via transformer-enabled temporal relation learning,\u0026rdquo; IEEE Signal Processing Letters, vol. 29, pp. 1197–1201, 2022.\n[67] Y. S. Chong and Y. H. Tay, \u0026ldquo;Abnormal event detection in videos using spatiotemporal autoencoder,\u0026rdquo; in Advances in Neural Networks-ISNN 2017: 14th International Symposium, ISNN 2017, Sapporo, Hakodate, and Muroran, Hokkaido, Japan, June 21–26, 2017, Proceedings, Part II. Springer International Publishing, 2017, pp. 189–196.\n[68] M. Z. Zaheer, A. Mahmood, M. H. Khan, M. Segu, F. Yu, and S.-I. Lee, \u0026ldquo;Generative cooperative learning for unsupervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 14 744–14 754.\n[69] A. O. Tur, N. Dall\u0026rsquo;Asen, C. Beyan, and E. Ricci, \u0026ldquo;Exploring diffusion models for unsupervised video anomaly detection,\u0026rdquo; in 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023, pp. 2540–2544.\n[70] X. Zeng, Y. Jiang, W. Ding, H. Li, Y. Hao, and Z. Qiu, \u0026ldquo;A hierarchical spatio-temporal graph convolutional neural network for anomaly detection in videos,\u0026rdquo; IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 1, pp. 200–212, 2023.\n[71] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, \u0026ldquo;Unbiased multiple instance learning for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8022–8031.\n[72] J. Kukacka, V. Golkov, and D. 
Cremers, \u0026ldquo;Regularization for deep learning: A taxonomy,\u0026rdquo; arXiv preprint arXiv:1710.10686, 2017. [73] M. Cho, M. Kim, S. Hwang, C. Park, K. Lee, and S. Lee, \u0026ldquo;Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 137–12 146. [74] M. Mathieu, C. Couprie, and Y. LeCun, \u0026ldquo;Deep multi-scale video prediction beyond mean square error,\u0026rdquo; arXiv preprint arXiv:1511.05440, 2015. [75] C. Shi, C. Sun, Y. Wu, and Y. Jia, \u0026ldquo;Video anomaly detection via sequentially learning multiple pretext tasks,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 330–10 340. [76] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, \u0026ldquo;Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1705–1714. [77] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, \u0026ldquo;Learning regularity in skeleton trajectories for anomaly detection in videos,\u0026rdquo; in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 996–12 004. [78] C. Yan, S. Zhang, Y. Liu, G. Pang, and W. Wang, \u0026ldquo;Feature prediction diffusion model for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5527–5537. [79] C. Zhang, G. Li, Y. Qi, S. Wang, L. Qing, Q. Huang, and M.-H. Yang, \u0026ldquo;Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16 271–16 280. [80] J. Fioresi, I. 
R. Dave, and M. Shah, \u0026ldquo;Ted-spad: Temporal distinctiveness for self-supervised privacy-preservation for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13 598–13 609. [81] N. Van Eck and L. Waltman, \u0026ldquo;Software survey: Vosviewer, a computer program for bibliometric mapping,\u0026rdquo; scientometrics, vol. 84, no. 2, pp. 523– 538, 2010. [82] A. Awasthi, S. Ly, J. Nizam, S. Zare, V. Mehta, S. Ahmed, K. Shah, R. Nemani, S. Prasad, and H. Van Nguyen, \u0026ldquo;Anomaly detection in satellite videos using diffusion models,\u0026rdquo; arXiv preprint arXiv:2306.05376, 2023. [83] K. Batzner, L. Heckler, and R. König, \u0026ldquo;Efficientad: Accurate visual anomaly detection at millisecond-level latencies,\u0026rdquo; in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 128–138. [84] Y. Nie, H. Huang, C. Long, Q. Zhang, P. Maji, and H. Cai, \u0026ldquo;Interleaving one-class and weakly-supervised models with adaptive thresholding for unsupervised video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2401.13551, 2024. [85] L. Zanella, B. Liberatori, W. Menapace, F. Poiesi, Y. Wang, and E. Ricci, \u0026ldquo;Delving into clip latent space for video anomaly recognition,\u0026rdquo; arXiv preprint arXiv:2310.02835, 2023. [86] A. Sethi, K. Saini, and S. M. Mididoddi, \u0026ldquo;Video anomaly detection using gan,\u0026rdquo; 2023. MOSHIRA ABDALLA is a Graduate Research and Teaching Assistant (GRTA) and a Ph.D. in Electrical and Computer Engineering student at Khalifa University. She received a B.Sc. in Computer and Systems Engineering from Minya University, Egypt, in 2020 and her Master\u0026rsquo;s degree in Electrical and Computer Engineering from the University of Ottawa, Canada, in 2022. 
Her research is focused on Computer Vision, Anomaly Detection, and Artificial Intelligence (AI).\nSAJID JAVED is a faculty member at Khalifa University (KU), UAE. Prior to that, he was a research fellow at KU from 2019 to 2021 and at the University of Warwick, U.K., from 2017 to 2018. He received his B.Sc. degree in computer science from the University of Hertfordshire, U.K., in 2010. He completed his combined Master\u0026rsquo;s and Ph.D. degrees in computer science from Kyungpook National University, Republic of Korea, in 2017.\nMUAZ AL RADI is a Graduate Research and Teaching Assistant (GRTA) and a Ph.D. in Electrical and Computer Engineering student at Khalifa University. He received his B.Sc. degree in Sustainable and Renewable Energy Engineering from the University of Sharjah, Sharjah, UAE, in 2020 and his M.Sc. in Electrical and Computer Engineering from Khalifa University, Abu Dhabi, UAE, in 2022. His research is focused on Computer Vision, Anomaly Detection, Vision-based Control, Artificial Intelligence (AI), and Robotics.\nANWAAR ULHAQ received the Ph.D. degree in artificial intelligence from Monash University, Australia. He is currently working as a Senior Lecturer (AI) with the School of Computing, Mathematics, and Engineering, Charles Sturt University, Australia. He has developed national and international recognition in computer vision and image processing. His research has been featured 16 times in national and international news venues, including ABC News and IFIP (UNESCO). He is an Active Member of IEEE, ACS, and the Australian Academy of Sciences. 
As the Deputy Leader of the Machine Vision and Digital Health Research Group (MaViDH), he provides leadership in artificial intelligence research and leverages his leadership vision and strategy to promote AI research by mentoring junior researchers in AI and supervising HDR students devising plans to increase research impact.\nNAOUFEL WERGHI is a Professor at the Department of Computer Science at Khalifa University for Science and Technology, UAE. He received his Habilitation and PhD in Computer Vision from the University of Strasbourg. His main research area is 2D/3D image analysis and interpretation, where he has been leading several funded projects related to biometrics, medical imaging, remote sensing, and intelligent systems.\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/survey-2/","section":"Papers","summary":"A comprehensive survey exploring deep learning-based video anomaly detection, including emerging paradigms such as weakly supervised, self-supervised, and unsupervised approaches, with a focus on core challenges, feature extraction, supervision schemes, loss functions, regularization techniques, and the potential of vision-language models (VLMs) for enhanced anomaly detection.","title":"Video Anomaly Detection in 10 Years: A Survey and Outlook","type":"survey"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/vishal-m.-patel/","section":"Authors","summary":"","title":"Vishal M. 
Patel","type":"authors"},{"content":" VISIONGPT: LLM-ASSISTED REAL-TIME ANOMALY DETECTION FOR SAFE VISUAL NAVIGATION # Hao Wang\nSchool of Computing Clemson University Clemson, SC, USA hao9@g.clemson.edu\nAshish Bastola School of Computing Clemson University Clemson, SC, USA abastol@g.clemson.edu\nJiayou Qin Department of Electrical and Computer Engineering Stevens Institute of Technology Hoboken, NJ, USA jqin6@stevens.edu\nXiwen Chen School of Computing Clemson University Clemson, SC, USA xiwenc@g.clemson.edu\nJohn Suchanek School of Computing Clemson University Clemson, SC, USA jsuchan@g.clemson.edu\nZihao Gong School of Cultural and Social Studies Tokai University Tokyo, Japan 0CPD1206@mail.u-tokai.ac.jp\nAbolfazl Razi School of Computing Clemson University Clemson, SC, USA arazi@clemson.edu\nABSTRACT # This paper explores the potential of Large Language Models (LLMs) in zero-shot anomaly detection for safe visual navigation. With the assistance of the state-of-the-art real-time open-world object detection model YOLO-World and specialized prompts, the proposed framework can identify anomalies within camera-captured frames that include any possible obstacles, and then generate concise, audio-delivered descriptions emphasizing abnormalities to assist safe visual navigation in complex circumstances. Moreover, our proposed framework leverages the advantages of LLMs and the open-vocabulary object detection model to achieve dynamic scenario switching, allowing users to transition smoothly from scene to scene and addressing a limitation of traditional visual navigation. 
Furthermore, this paper explores the performance contribution of different prompt components, offers a vision for future improvement in visual accessibility, and paves the way for LLMs in video anomaly detection and vision-language understanding.\nKeywords Open World Object Detection · Anomaly Detection · Large Language Model · Vision-language Understanding · Prompt Engineering · Generative AI · GPT\n1 Introduction # Accessible technologies have seen remarkable development in recent years due to the rise of machine learning and mobile computing [1–5]. Deep learning has significantly enhanced the accuracy and speed of object detection and segmentation models [5–9], which catalyzed a surge of real-world applications, impacting numerous aspects of daily life, industry, and transportation. Visual navigation has benefited significantly from the evolution of such computer vision techniques [5, 10, 11].\nConsequently, innovations such as Augmented Reality (AR) have been instrumental in enhancing the safety and mobility of individuals across various scenarios, including driving and walking. Many of these technologies aim to bridge the gap between the physical world and digital assistance, highlighting the critical need for adaptive solutions to navigate the complexities of real-world environments.\nHowever, visual navigation presents significant challenges in dynamic urban environments [11–13]. Although the recently introduced zero-shot object detection [14] addresses significant limitations of classical object detection models such as YOLOv8 [15, 16] in complex scenarios, it encounters difficulties in developing custom class labels for dynamic environments due to the long-tail response. 
Furthermore, real-time vision-language understanding can be critical for safety in complex scenarios, especially for visually impaired individuals who must traverse streets, sidewalks, and other public spaces.\nVision-language understanding has recently become a hotspot due to the emergence of Multimodal Large Language Models (LLMs) [17]. Multimodal LLMs represent an evolutionary leap in the field of artificial intelligence as they integrate the processing of text, images, and even audio and video [18] to create a comprehensive understanding of the world that mirrors human cognitive abilities more closely than ever before, making it possible to handle more advanced tasks for robotics [19, 20]. Specifically, GPT-4V is now being heavily used in image tasks such as data evaluation [21, 22], medical image diagnosis [23–25], and content creation [26–28].\nFigure 1: Framework for vision-language processing and prompting.\nMultimodal LLMs offer substantial improvements in interpreting, analyzing, and generating content across different modalities [29], opening up interdisciplinary applications [30, 31]. Interestingly, LLMs also exhibit impressive zero-shot and few-shot learning abilities, potentially enabling them to capture visual concepts with minimal training data. This opens a way to address object detection challenges, particularly in data with limited annotation [32]. Recent research attempts to bring LLMs to the accessibility field, yet most work only focuses on basic tasks such as text reading, image recognition, and voice assistance [33].\nTherefore, a critical gap exists when using the vision-language understanding of LLMs for safety and accessibility applications. 
Although past works have investigated the use of LLMs in visual assistance [34] and visual navigation [35–37], only a few focused on safety aspects [32, 38], and these barely considered the latency induced during inference.\nOur research introduces a framework that combines the speed of locally executed open-world object detection with the intelligence of LLMs to create a universal anomaly detection system. The primary goal of this system is to deliver real-time, personalized scene descriptions and safety notifications, ensuring the safety and ease of navigation for visually impaired users by identifying and alerting them to potential obstacles and hazards in their path. These obstacles and hazards can be considered \u0026ldquo;anomalies\u0026rdquo; in the context of a safe and clear path for navigation. The proposed framework can also be applied to robotic systems, augmented reality platforms, and other mobile edge computing units.\nIn particular, the major contributions of this paper are summarized as follows:\nZero-shot anomaly detection: The proposed integration is training-free and ready for video anomaly detection and annotation with different response preferences.\nReal-time feedback: Our framework is optimized for real-time response in complex scenarios with very low latency.\nDynamic scene transition and interest setting: This framework can dynamically switch the object detection classes based on the user\u0026rsquo;s needs. Furthermore, users can interact with the LLM module and set a prior task (e.g., find the nearest bench).\n2 Related Work # 2.1 Open-vocabulary object detection # Open-vocabulary object detection (OVD) [39] represents a significant shift in object detection, focusing on identifying items outside predefined categories. Initial efforts [40] trained on known classes and evaluated the detection of novel objects, but faced generalization and adaptability issues due to limited datasets and vocabularies. 
Recent approaches [41–43], however, employ image-text matching with extensive data to expand the training vocabularies, inspired by vision-language pre-training [29, 44]. OWL-ViTs [45] and GLIP [46] utilize vision transformers and phrase grounding for effective OVD, while Grounding DINO [47] combines these with detection transformers for cross-modality fusion. Despite the promise, existing methods often rely on complex detectors, increasing computational demands significantly [48, 49]. ZSD-YOLO [50] also explored open-vocabulary detection with YOLO using language model alignment; however, YOLO-World [51] presents a more efficient OVD solution that achieves real-time inference through an effective pre-training strategy while still being highly generalizable.\n2.2 Prompt Engineering # Prompt engineering has emerged as a critical technique for unlocking the capabilities of large language models (LLMs) [52–55] for various applications without fine-tuning on large datasets. This involves carefully crafting text prompts, instructions, or examples to guide LLM behavior and elicit desired responses. Researchers are actively exploring prompting techniques such as zero-shot prompting [56], few-shot prompting [57], chain-of-thought prompting [58], and self-ask prompting [59] to adapt LLMs to various tasks, demonstrating significant performance gains compared to traditional model training approaches. Studies have showcased how prompt engineering can adapt LLMs for diverse natural language tasks like question-answering [60], smart-reply [55], summarization [61], and text classification [62, 63]. Furthermore, researchers are increasingly developing frameworks to systematize prompt engineering efforts. Such frameworks aim to simplify the creation of effective prompts, facilitate the adaptation of LLMs to specific domains and applications, and are highly customizable to user needs. 
While prompt engineering has seen significant improvements in natural language processing, its potential for computer vision in accessibility remains less explored. Our work builds upon the success of prompt engineering in NLP, exploring its application in the visual domain to enhance object detection and description.\n2.3 Accessible Technology # Computer vision-driven accessible technologies are mostly designed to empower individuals with visual impairments through enhanced scene understanding and hazard detection. A range of solutions exists, from mobile apps that provide object recognition and audio descriptions of surroundings [64–66] to wearable systems that offer real-time alerts about obstacles or potential dangers [67, 68]. For example, technologies that detect approaching vehicles and crosswalk signals significantly improve the safety of visually impaired pedestrians in urban environments. Moreover, computer vision is integrated into assistive technologies for reading text aloud from documents and identifying objects in daily life, enabling greater independence [5]. Research in this domain also focuses on indoor navigation, where object detection and spatial mapping can guide users within buildings and public spaces [69]. The core aim of these computer vision-powered accessibility technologies is to enhance safety. By providing real-time information on key elements within an individual\u0026rsquo;s surroundings, the risk of accidents and injuries is significantly reduced. Identifying potential hazards, such as oncoming traffic, obstacles on sidewalks, or unattended objects, allows visually impaired individuals to navigate with greater confidence and autonomy.\n3 Methodology # Our system offers real-time anomaly alerts by integrating object detection with large language model capabilities, featuring a multi-module architecture. The system operates continuously with the object detection module processing real-time camera frames. 
Multi-frame object information is then included in specially engineered prompts and submitted to the LLM module. The system then processes the LLM\u0026rsquo;s response, classifying potential anomalies. Finally, the LLM module conveys important alerts and essential scene descriptions to the user.\nThe proposed project is fully open-sourced and available at: https://github.com/AIS-Clemson/VisionGPT\n3.1 Object Detection Module # To ensure real-time performance on mobile devices, we employ lightweight yet powerful object detection models for real-time detection. Specifically, we applied the state-of-the-art YOLO-World model for open-vocabulary detection whose detection classes are customizable for a wide range of scenarios. As we focus on accessible visual navigation, we prompt the proposed LLM module to personalize the detection classes relevant to safe navigation in daily use circumstances, including pedestrians, vehicles, bicycles, traffic signals, and any potential road hazards or obstacles. Therefore, our proposed multi-functional prompt manager allows users to switch detection classes dynamically.\n3.2 Detection Class Manager # This sub-module aims to create a detailed categorization for object detection algorithms, enabling them to identify and distinguish potential obstacles, hazards, and useful landmarks. This approach ensures the detection system is finely tuned to urban navigation\u0026rsquo;s specific needs and challenges, enhancing the user\u0026rsquo;s ability to move safely and independently through city streets. By focusing on road hazards and obstacles, the updated list aims to provide a more relevant and focused set of detection classes for the \u0026lsquo;urban walking\u0026rsquo; context, optimizing the system\u0026rsquo;s utility and effectiveness for the visually impaired user.\nAs shown in Figure 1, the user can interact with the LLM (advance) module. 
Based on this operation logic, the user can ask to change the object detection classes based on scenarios. For instance, if a user experiences a scene transition from sidewalk to park, the detection classes specialized for sidewalk objects (e.g., car, road cone, traffic signal, etc.) can be replaced by new object classes that are more relevant to the park scene to adapt to the situation.\nOriginal prompt: \u0026ldquo;The user is switching the scene to custom_scene please generate a new list that contains the top 100 related objects, including especially road hazards and possible obstacles\u0026rdquo;\n3.3 Anomaly Handle Module # The proposed anomaly detection system aims to enhance navigation safety and awareness in various environments, particularly for visually impaired individuals and others requiring navigation assistance (e.g., robotics systems). The system analyzes real-time imagery captured from a camera and splits the image into four distinct regions based on an \u0026lsquo;H\u0026rsquo; pattern.\nFigure 2: Type H image splitter. (1) and (2) represent the left and right areas, (3) represents the ground area, and (4) represents the front area.\nSpecifically, as shown in Figure 2, the system categorizes detected objects into four types based on their location within the image, each corresponding to one part of the \u0026lsquo;H\u0026rsquo; pattern segmentation: Left, Right, Front, and Ground. This categorization helps identify and respond to potential hazards more effectively.\nThe regions are characterized as follows:\nLeft and Right: These regions cover the left and right 25% of the image, respectively. Objects detected here are typically in motion and may occupy much of the visual field. This area is crucial for identifying moving hazards such as vehicles or cyclists that may approach the user from the sides.\nFront: This region focuses on the center 50% of the image\u0026rsquo;s width and the upper half vertically. It captures objects still at a distance but directly ahead of the user. Identifying objects in this region is necessary for assessing the overall situation and planning movements, especially for detecting objects approaching at high speed such as cars and cyclists.\nGround: Occupying the center 50% of the image\u0026rsquo;s width and the lower half vertically, this area highlights objects nearby on the ground. Immediate attention to detections in this area is critical for avoiding hazards that require cautious navigation, such as cracks, puddles, or uneven surfaces.\n
The system then records detailed information for each object, including classification, size, and position. All size and position data are converted into percentages for easier interpretation by the LLM. Finally, by analyzing objects\u0026rsquo; locations and sizes, anomaly alerts are generated for objects that appear in the \u0026lsquo;ground\u0026rsquo; area or occupy significant space (\u0026gt; 10% in this study) in the \u0026lsquo;left\u0026rsquo; or \u0026lsquo;right\u0026rsquo; regions. The detection and movement information is then post-processed into a structured format that the LLM can interpret more easily.\nOriginal prompt: \u0026ldquo;The location information center_x, center_y, height, width of objects is the proportion to the image, the detected objects are categorized into 4 type based on the image region. 
Left and Right: objects located on left 25% or right 25% of the image, these objects are usually moving and has large proportion.Front: objects that may still far away, can be used to discriminate the current situation.Ground: objects that may nearby.\u0026rdquo;\n3.4 Data Collection # Even though various datasets exist for static images [70] and CCTV camera feeds [71–73], no extensive datasets are available for detecting large anomalies in visual navigation from the first person\u0026rsquo;s perspective.\nThus, we collected 50 video clips of point-of-view cruising in various scenarios. These custom videos are filmed in public spaces with first-person view and continuous forward-moving. Table 1 shows the details of the collected data.\nTable 1: Collected data for video anomaly detection.\n| Location | Scene | Movement | Weather | Clips | Total length | Unique classes | Total detected objects |\n|---|---|---|---|---|---|---|---|\n| Urban | Sidewalk | Scooter | Cloudy | 8 | 10 mins | 31 | 16944 |\n| Suburban | Bikeline | Scooter | Cloudy | 5 | 6 mins | 26 | 8394 |\n| Urban | Park | Scooter | Cloudy | 6 | 5 mins | 23 | 15310 |\n| City | Road | Biking | Sunny | 5 | 5 mins | 21 | 5464 |\n| City | Sidewalk | Biking | Sunny | 7 | 6 mins | 27 | 9569 |\n| City | Park | Biking | Cloudy | 5 | 5 mins | 19 | 4781 |\n| Town | Park | Walking | Cloudy | 6 | 4 mins | 18 | 5156 |\n| Town | Sidewalk | Walking | Sunny | 8 | 7 mins | 14 | 8274 |\n| City | Coast | Walking | Sunny | 2 | 5 mins | 37 | 29280 |\n| Suburban | Theme Park | Walking | Rain | 3 | 6 mins | 34 | 24180 |\nWe then conducted the experiments by combining the open-vocabulary object detection model with our novel image-splitting method to annotate the frames as anomalies.\nSpecifically, a frame is labeled as an anomaly if it meets either of the following criteria:\nObjects are detected within the Ground area.\nObjects appear in either the Left or the Right areas of the image and occupy more than 10% of the total image area.\n
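The two labeling criteria above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released implementation: the names h_region and is_anomalous_frame, the dictionary keys, assigning a region by the bounding-box center, a top-left image origin (so the upper half is center_y below 0.5), and estimating occupied area as width times height are all choices made for this example.

```python
# Hypothetical helper: map a box center (proportional coordinates in [0, 1])
# to one of the four H-pattern regions described in Section 3.3.
def h_region(center_x, center_y):
    if center_x < 0.25:
        return 'left'      # left 25% of the image width
    if center_x > 0.75:
        return 'right'     # right 25% of the image width
    # Center 50% of the width: upper half is Front, lower half is Ground.
    # Assumes y grows downward from the top edge of the image.
    return 'front' if center_y < 0.5 else 'ground'


# Hypothetical rule-based labeler: a frame is anomalous if any object lies in
# the Ground region, or lies in Left/Right and covers more than 10% of the image.
def is_anomalous_frame(objects, area_threshold=0.10):
    for obj in objects:
        region = h_region(obj['center_x'], obj['center_y'])
        if region == 'ground':
            return True
        if region in ('left', 'right'):
            # Box area as a proportion of the image (assumed width * height).
            if obj['width'] * obj['height'] > area_threshold:
                return True
    return False


# Example frame: one large object on the left, one small object far ahead.
frame = [
    {'center_x': 0.10, 'center_y': 0.50, 'width': 0.40, 'height': 0.50},
    {'center_x': 0.50, 'center_y': 0.20, 'width': 0.05, 'height': 0.05},
]
print(is_anomalous_frame(frame))
```

In this example the left-side object covers roughly 20% of the image, so the frame would be flagged; a frame containing only the small Front object would not be.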
We set this rule-based method as the baseline of anomaly detection in this study, as our captured video clips are customized for this H-splitting principle.\n3.5 LLM Module # This module processes the detected object information and passes it to the LLM. Specifically, we use both GPT-3.5 and GPT-4 to process the information. GPT-3.5 is mainly used for low-level information processing such as object detection data analysis, data format conversion, and prompt reasoning, while GPT-4 is used for high-level command understanding and comprehensive vision-language understanding.\nThe system sensitivity settings indicate a focus on identifying and reporting hazards based on their potential impact on the user\u0026rsquo;s safety and navigation. The system\u0026rsquo;s goal to report objects based on their level of inconvenience or danger aligns with the anomaly detection objective of identifying and reacting to deviations that matter most in the given context. Note that system sensitivity in this context is distinct from model sensitivity as a statistical term.\nThese prompts sketch the conceptual framework and operational guidelines for a voice-assisted navigation system for visual accessibility. The system utilizes data from a phone camera, which is always facing forward, to detect objects and categorize their location within the field of view. Based on these analyses, the system provides auditory feedback to users, helping them navigate their environment safely and avoid potential hazards. Furthermore, the annotated data can be used for the training of other anomaly detection models.\nThe main LLM prompts consist of:\nPrompt instruction: \u0026ldquo;You are a voice assistant for a visually impaired user, the input is the actual data collected by a phone camera, and the phone is always facing front, please provide the key information for the blind user to help him navigate and avoid potential danger. 
Please note that the center_x and center_y represent the object location (proportional to the image), object height and width are also a proportion.\u0026rdquo;\nPrompt sensitivity: \u0026ldquo;System sensitivity: Incorporate the sensitivity setting in your response. For a low-sensitivity setting, identify and report only imminent and direct threats to safety. For medium sensitivity, include potential hazards that could pose a risk if not avoided. For high sensitivity, report all detected objects that could cause any inconvenience or danger. Current sensitivity: low.\u0026rdquo;\n4 Experiments # We compare our proposed vision-LLM system with the rule-based anomaly detection (baseline) to show its performance and reliability.\n4.1 System Optimization # While the proposed system is running, we input captured images into the object detection model only every 5 frames to boost the FPS (frames per second). This can significantly improve performance, especially on mobile devices that have limited computation resources. Then, we send the detected information to the anomaly handle module to label the frames as the baseline. With the frame compensation, the real-time detection performance is boosted from 16 FPS to 73 FPS, as shown in Table 4.\nMeanwhile, because of the latency of LLMs, we apply the LLM module in parallel to process the detected information every 30 frames. To reduce latency, the proposed system uses the GPT-3.5 Turbo model as the core of the LLM module.\n4.2 Detection Accuracy # By setting the rule-based detector as the baseline, this study aims to test the zero-shot learning capability of the LLM detector and to identify which prompt components affect the performance significantly.\nAfter comparing the annotation results of prompt-based anomaly detection with the rule-based anomaly detection on our collected data, we find that prompt-based anomaly detection achieves high precision with all prompt modules working properly. 
Specifically, we compared the LLM anomaly detection with different sensitivity settings: low, normal, and high. As shown in Figure 3, the Receiver Operating Characteristic (ROC) curve indicates that a low system sensitivity leads to better performance, as it is less sensitive than the rule-based detector. For instance, objects detected by the rule-based detector with low confidence and classes of low risk will be filtered by LLM due to no emergency. Conversely, the higher the system sensitivity, the worse the performance, as the system tends to categorize all possible anomalies as immediate emergencies.\nAs shown in Figure 4, the LLM anomaly detector with low-system sensitivity captures more True Positive and True Negative cases and tries to minimize the False Positive rate.\nFigure 3: ROC curve.\n4.3 Quality Evaluation # We picked one of the video clips to analyze the detection difference between the rule-based detector and the LLM anomaly detector. Specifically, in Figure 5, the first row shows the anomalies labeled by the rule-based detector, while the second row indicates the anomalies predicted by the LLM-based detector (low sensitivity setting). As shown in Figure 5, the proposed LLM detector has less acuity with a low-sensitivity prompt setting, which tends to filter anomalies that are non-emergency. Table 2 shows the selected sample output of the LLM module.\nFigure 5: Anomaly annotation. The first row represents the labeled anomalies by the rule-based detector (binary), and the second row represents the anomalies predicted by the proposed LLM detector (float). Color represents the probability of anomalies.\nFigure 4: Confusion matrix of total frames. LLM setting is low-system sensitivity setting.\nTable 2: Selected feedback and caption of the output of the LLM module. 
Frame ID indicates the video frame index, the Anomaly Index represents the predicted anomalies of LLM, and Reason represents the response message from LLM for anomaly interpretation.\n| Frame ID | Anomaly Index | Reason |\n|---|---|---|\n| 2290 | 0.85 | ’Car and people nearby.’ |\n| 3770 | 0.2 | ’Green traffic light detected.’ |\n| 4430 | 0.5 | ’Obstacles in path.’ |\n| 5450 | 0.7 | ’Car on the left’ |\n| 7000 | 0 | ’No immediate danger.’ |\n| 8230 | 1 | ’Bike in close proximity’ |\n| 9800 | 1 | High risk of collision with multiple people |\n4.4 Ablation Study # To explore the contribution of different prompt modules, we conducted an ablation study of each module. Table 3 shows that the proposed system performs worse without specific prompt modules. For instance, when the instruction prompt is missing, the system may generate random content because it lacks instruction and is confused about the current task. Moreover, missing region information of detected objects may also weaken the performance, as the system cannot evaluate the priority of the emergency.\nTable 3: Ablation study. ✓ indicates incorporated modules of the system, and ✗ indicates missing modules.\nSensitivity Location Instruction AP AUC\nLow ✓ ✓ 88.01 87.26\nNormal ✓ 82.73 81.17\nHigh ✓ ✓ 72.35 77.29\nLow ✓ ✗ 68.84 73.64\nLow ✗ ✓ 69.57 75.56\n✗ ✓ ✓ 69.16 80.39\nMoreover, we find that the LLM performs differently with different sensitivity prompts. Unexpectedly, low system sensitivity yields higher accuracy and precision, as the system tries to catch as many True Positive cases as possible while avoiding false alarms. This is significant for visually impaired navigation, as the user can efficiently avoid misinformation and frequent, unnecessary alerts.\n4.5 Performance Evaluation # We further explore the performance efficiency of the proposed system on multiple platforms to reveal its potential for other applications.\nLatency: As shown in Table 4, we measured end-to-end system latency and individual module processing times to identify bottlenecks and optimize for real-time performance. 
Results indicated an average end-to-end latency of 60 ms on the mobile device (e.g., smartphone) with neural engines, ensuring timely feedback.\nTask Model Framework Chipset Architecture FPS Latency w/ Frame Compensation\nDetection Yolo-v8l PyTorch V100 GPU 22.01 45 ms 102.56\nDetection Yolo-v8x PyTorch V100 GPU 14.22 71 ms 70.11\nSegmentation Yolo-v8x-seg PyTorch V100 GPU 12.06 83 ms 59.68\nDetection Yolov8-World PyTorch V100 GPU 20.12 50 ms 98.06\nDetection Yolov8x-World-v2 PyTorch V100 GPU 16.74 62 ms 76.88\nDetection Yolov8x-World-v2 CoreML M2 CPU 5.01 199 ms N/A\nDetection Yolov8x-World-v2 CoreML M2 Neural Engine 19.6 51 ms N/A\nDetection Yolov8x-World-v2 CoreML A16-Bionic CPU 1.26 789 ms N/A\nDetection Yolov8x-World-v2 CoreML A16-Bionic Neural Engine 16.24 61 ms N/A\nTable 4: Object Detector Test on multiple platforms. A16-Bionic processors are used in iPhone 14 pro max, and M2 processors are widely used in Vision Pro and the latest Mac models. The PyTorch-based implementation was run on NVIDIA GPU.\nEconomy: We further investigated the system latency and token consumption for economy evaluation. We designed three different modes for users to choose from:\nVoice only: only output voice messages for emergency response, minimum latency.\nAnnotation: output both anomaly index and reason for system testing and practical annotation.\nFull: output full information in a structured JSON format. (See the original prompt for more information)\nTable 5: Economy and latency test.\nMode LLM Latency Completion Tokens Prompt Tokens Total Tokens Charge (USD/day)\nVoice only GPT3.5 407 35 573 608 2.44\nAnnotation GPT3.5 628 48 573 617 2.58\nFull GPT3.5 1818 176 1195 1371 13.53\nAs shown in Table 5, we estimated the cost of our system with different modes. 
The prices are calculated with an average of 2 hours of daily usage and are based on the ChatGPT API pricing policy.\nOriginal prompt:\nPrompt_format_full: \u0026lsquo;Please organize your output into this format: \u0026ldquo;scene\u0026rdquo;: quickly describe the current situation for blind user; \u0026ldquo;key_objects\u0026rdquo;: quickly and roughly locate the key objects for blind user; \u0026ldquo;anomaly_checker\u0026rdquo;: quickly diagnose if there is potential danger for a blind person; \u0026ldquo;anomaly_label\u0026rdquo;: output 1 if there is an emergency, output 0 if not; \u0026ldquo;anomaly_index\u0026rdquo;: object_id, danger_index, estimate a score from 0 to 1 about each objects that may cause danger; \u0026ldquo;voice_guide\u0026rdquo;: the main output to instant alert the blind person for emergency.\u0026rsquo;\nPrompt_format_voice: \u0026lsquo;Please organize your output into this format: \u0026ldquo;voice_guide\u0026rdquo;: the main output to instantly alert the blind person for an emergency.\u0026rsquo;\nPrompt_format_annotation: \u0026lsquo;Please organize your output into this format: \u0026ldquo;anomaly_score\u0026rdquo;: predict a score from 0 to 1 to evaluate the emergency level; \u0026ldquo;reason\u0026rdquo;: explain your annotation reason within 10 words.\u0026rsquo;\n5 Conclusion # This research demonstrates the significant potential of combining lightweight mobile object detection with large language models to enhance accessibility for visually impaired individuals. Our system successfully provides real-time scene descriptions and hazard alerts, achieving low latency and demonstrating the flexibility of prompt engineering for tailoring LLM output to this unique domain. Our experiments highlight the importance of balancing detection accuracy with computational efficiency for mobile deployment. Prompt design is a key component of our system in guiding LLM responses and ensuring the relevance of generated descriptions. 
Additionally, the integration of user feedback proved invaluable for refining the system\u0026rsquo;s usability and overall user experience.\nWhile this project offers a promising foundation, further research is warranted. Explorations into even more advanced prompt engineering for complex scenarios would pave the way for the wide adoption of such assistive technologies. Our findings illustrate the power of integrating computer vision and large language models, leading to greater independence and safety in daily life: a true testament to AI\u0026rsquo;s ability to improve the quality of life for all.\nA Original Prompt # This section illustrates all used prompts in the proposed system.\nA.1 LLM Instruction # System instructions are usually directly fed into LLMs as a self-prompt and generally do not consume token usage.\nMain Instruction: \u0026ldquo;You are a voice assistant for a visually impaired user, the input is the actual data collected by a phone camera, and the phone is always facing front, please provide the key information for the blind user to help him navigate and avoid potential danger. Please note that the center_x and center_y represent the object location (proportional to the image), object height and width are also a proportion.\u0026rdquo;\nSystem Sensitivity Prompt: \u0026ldquo;System sensitivity: Incorporate the sensitivity setting in your response. For a low-sensitivity setting, identify and report only imminent and direct threats to safety. For medium sensitivity, include potential hazards that could pose a risk if not avoided. For high sensitivity, report all detected objects that could cause any inconvenience or danger. Current sensitivity: low.\u0026rdquo;\nLocation Prompt: \u0026ldquo;The location information center x , center y , height, width of objects is the proportion to the image, the detected objects are categorized into 4 type based on the image region. Left and Right: objects located on left 25% or right 25% of the image, these objects are usually moving and has large proportion.Front: objects that may still far away, can be used to discriminate the current situation.Ground: objects that may nearby.\u0026rdquo;\nMotion Prompt: \u0026ldquo;Using the information from last frame and current frame to analyze the movement (speed and direction) and location of each object to determine its trajectory relative to the user. Use this information to assess whether an object is moving towards the user or they are static. 
If moving, how quickly a potential collision might occur based on the object\u0026rsquo;s speed and direction of movement.\u0026rdquo;\nA.2 LLM Prompt # LLM prompts are the master prompts that are input directly from the user end, and their text tokens count toward usage. To control and optimize usage, we designed three different output modes. Furthermore, the designed prompts guide the LLM to generate output in a structured data format (dictionary, list, JSON, etc.).\nFull diagnose mode: \u0026lsquo;Please organize your output into this format: \u0026ldquo;scene\u0026rdquo;: quickly describe the current situation for blind user; \u0026ldquo;key_objects\u0026rdquo;: quickly and roughly locate the key objects for blind user; \u0026ldquo;anomaly_checker\u0026rdquo;: quickly diagnose if there is potential danger for a blind person; \u0026ldquo;anomaly_label\u0026rdquo;: output 1 if there is an emergency, output 0 if not; \u0026ldquo;anomaly_index\u0026rdquo;: object_id, danger_index, estimate a score from 0 to 1 about each objects that may cause danger; \u0026ldquo;voice_guide\u0026rdquo;: the main output to instant alert the blind person for emergency.\u0026rsquo;\nVoice-only mode: \u0026lsquo;Please organize your output into this format: \u0026ldquo;voice_guide\u0026rdquo;: the main output to instantly alert the blind person for an emergency.\u0026rsquo;\nAnnotation mode: \u0026lsquo;Please organize your output into this format: \u0026ldquo;anomaly_score\u0026rdquo;: predict a score from 0 to 1 to evaluate the emergency level; \u0026ldquo;reason\u0026rdquo;: explain your annotation reason within 10 words.\u0026rsquo;\nA.3 Other Prompts # Detection Classes Switch: \u0026ldquo;The user is switching the scene to custom_scene please generate a new list that contains the top 100 related objects, including especially road hazards and possible obstacles\u0026rdquo;\nInterest Target Setting: \u0026ldquo;Please analyze the user command and extract the user required object, output into 
this format: \u0026ldquo;add\u0026rdquo;: object_name.\u0026rdquo;\nB Detection Labels for Custom Scenes # This section illustrates the detection classes generated by GPT-4 that are customized for specific scenes.\nVisually impaired navigation: [ \u0026lsquo;car\u0026rsquo;, \u0026lsquo;person\u0026rsquo;, \u0026lsquo;bus\u0026rsquo;, \u0026lsquo;bicycle\u0026rsquo;, \u0026lsquo;motorcycle\u0026rsquo;, \u0026rsquo;traffic light\u0026rsquo;, \u0026lsquo;stop sign\u0026rsquo;, \u0026lsquo;fountain\u0026rsquo;, \u0026lsquo;crosswalk\u0026rsquo;, \u0026lsquo;sidewalk\u0026rsquo;, \u0026lsquo;door\u0026rsquo;, \u0026lsquo;stair\u0026rsquo;, \u0026rsquo;escalator\u0026rsquo;, \u0026rsquo;elevator\u0026rsquo;, \u0026lsquo;ramp\u0026rsquo;, \u0026lsquo;bench\u0026rsquo;, \u0026rsquo;trash can\u0026rsquo;, \u0026lsquo;pole\u0026rsquo;, \u0026lsquo;fence\u0026rsquo;, \u0026rsquo;tree\u0026rsquo;, \u0026lsquo;dog\u0026rsquo;, \u0026lsquo;cat\u0026rsquo;, \u0026lsquo;bird\u0026rsquo;, \u0026lsquo;parking meter\u0026rsquo;, \u0026lsquo;mailbox\u0026rsquo;, \u0026lsquo;manhole\u0026rsquo;, \u0026lsquo;puddle\u0026rsquo;, \u0026lsquo;construction sign\u0026rsquo;, \u0026lsquo;construction barrier\u0026rsquo;, \u0026lsquo;scaffolding\u0026rsquo;, \u0026lsquo;hole\u0026rsquo;, \u0026lsquo;crack\u0026rsquo;, \u0026lsquo;speed bump\u0026rsquo;, \u0026lsquo;curb\u0026rsquo;, \u0026lsquo;guardrail\u0026rsquo;, \u0026rsquo;traffic cone\u0026rsquo;, \u0026rsquo;traffic barrel\u0026rsquo;, \u0026lsquo;pedestrian signal\u0026rsquo;, \u0026lsquo;street sign\u0026rsquo;, \u0026lsquo;fire hydrant\u0026rsquo;, \u0026rsquo;lamp post\u0026rsquo;, \u0026lsquo;bench\u0026rsquo;, \u0026lsquo;picnic table\u0026rsquo;, \u0026lsquo;public restroom\u0026rsquo;, \u0026lsquo;fountain\u0026rsquo;, \u0026lsquo;statue\u0026rsquo;, \u0026lsquo;monument\u0026rsquo;, \u0026lsquo;directional sign\u0026rsquo;, \u0026lsquo;information sign\u0026rsquo;, \u0026lsquo;map\u0026rsquo;, \u0026rsquo;emergency 
exit\u0026rsquo;, \u0026rsquo;no smoking sign\u0026rsquo;, \u0026lsquo;wet floor sign\u0026rsquo;, \u0026lsquo;closed sign\u0026rsquo;, \u0026lsquo;open sign\u0026rsquo;, \u0026rsquo;entrance sign\u0026rsquo;, \u0026rsquo;exit sign\u0026rsquo;, \u0026lsquo;stairs sign\u0026rsquo;, \u0026rsquo;escalator sign\u0026rsquo;, \u0026rsquo;elevator sign\u0026rsquo;, \u0026lsquo;restroom sign\u0026rsquo;, \u0026lsquo;men restroom sign\u0026rsquo;, \u0026lsquo;women restroom sign\u0026rsquo;, \u0026lsquo;unisex restroom sign\u0026rsquo;, \u0026lsquo;baby changing station\u0026rsquo;, \u0026lsquo;wheelchair accessible sign\u0026rsquo;, \u0026lsquo;braille sign\u0026rsquo;, \u0026lsquo;audio signal device\u0026rsquo;, \u0026rsquo;tactile paving\u0026rsquo;, \u0026lsquo;detectable warning surface\u0026rsquo;, \u0026lsquo;guide rail\u0026rsquo;, \u0026lsquo;handrail\u0026rsquo;, \u0026rsquo;turnstile\u0026rsquo;, \u0026lsquo;gate\u0026rsquo;, \u0026rsquo;ticket barrier\u0026rsquo;, \u0026lsquo;security checkpoint\u0026rsquo;, \u0026lsquo;metal detector\u0026rsquo;, \u0026lsquo;baggage claim\u0026rsquo;, \u0026rsquo;lost and found\u0026rsquo;, \u0026lsquo;information desk\u0026rsquo;, \u0026lsquo;meeting point\u0026rsquo;, \u0026lsquo;waiting area\u0026rsquo;, \u0026lsquo;seating area\u0026rsquo;, \u0026lsquo;boarding area\u0026rsquo;, \u0026lsquo;disembarking area\u0026rsquo;, \u0026lsquo;charging station\u0026rsquo;, \u0026lsquo;water dispenser\u0026rsquo;, \u0026lsquo;vending machine\u0026rsquo;, \u0026lsquo;ATM\u0026rsquo;, \u0026lsquo;kiosk\u0026rsquo;, \u0026lsquo;public telephone\u0026rsquo;, \u0026lsquo;public Wi-Fi hotspot\u0026rsquo;, \u0026rsquo;emergency phone\u0026rsquo;, \u0026lsquo;first aid station\u0026rsquo;, \u0026lsquo;defibrillator\u0026rsquo;, \u0026rsquo;tree\u0026rsquo;, \u0026lsquo;pole\u0026rsquo;, \u0026rsquo;lamp post\u0026rsquo;, \u0026lsquo;staff\u0026rsquo;, \u0026lsquo;road hazard\u0026rsquo;]\nUrban Walking: [\u0026lsquo;pedestrian\u0026rsquo;, 
\u0026lsquo;cyclist\u0026rsquo;, \u0026lsquo;car\u0026rsquo;, \u0026lsquo;bus\u0026rsquo;, \u0026lsquo;motorcycle\u0026rsquo;, \u0026lsquo;scooter\u0026rsquo;, \u0026rsquo;electric scooter\u0026rsquo;, \u0026rsquo;traffic light\u0026rsquo;, \u0026lsquo;stop sign\u0026rsquo;, \u0026lsquo;crosswalk\u0026rsquo;, \u0026lsquo;sidewalk\u0026rsquo;, \u0026lsquo;curb\u0026rsquo;, \u0026lsquo;ramp\u0026rsquo;, \u0026lsquo;stair\u0026rsquo;, \u0026rsquo;escalator\u0026rsquo;, \u0026rsquo;elevator\u0026rsquo;, \u0026lsquo;bench\u0026rsquo;, \u0026rsquo;trash can\u0026rsquo;, \u0026lsquo;pole\u0026rsquo;, \u0026lsquo;fence\u0026rsquo;, \u0026rsquo;tree\u0026rsquo;, \u0026lsquo;fire hydrant\u0026rsquo;, \u0026rsquo;lamp post\u0026rsquo;, \u0026lsquo;construction barrier\u0026rsquo;, \u0026lsquo;construction sign\u0026rsquo;, \u0026lsquo;scaffolding\u0026rsquo;, \u0026lsquo;hole\u0026rsquo;, \u0026lsquo;crack\u0026rsquo;, \u0026lsquo;speed bump\u0026rsquo;, \u0026lsquo;puddle\u0026rsquo;, \u0026lsquo;manhole\u0026rsquo;, \u0026lsquo;drain\u0026rsquo;, \u0026lsquo;grate\u0026rsquo;, \u0026rsquo;loose gravel\u0026rsquo;, \u0026lsquo;ice patch\u0026rsquo;, \u0026lsquo;snow pile\u0026rsquo;, \u0026rsquo;leaf pile\u0026rsquo;, \u0026lsquo;standing water\u0026rsquo;, \u0026lsquo;mud\u0026rsquo;, \u0026lsquo;sand\u0026rsquo;, \u0026lsquo;street sign\u0026rsquo;, \u0026lsquo;directional sign\u0026rsquo;, \u0026lsquo;information sign\u0026rsquo;, \u0026lsquo;parking meter\u0026rsquo;, \u0026lsquo;mailbox\u0026rsquo;, \u0026lsquo;bicycle rack\u0026rsquo;, \u0026lsquo;outdoor seating\u0026rsquo;, \u0026lsquo;planter box\u0026rsquo;, \u0026lsquo;bollard\u0026rsquo;, \u0026lsquo;guardrail\u0026rsquo;, \u0026rsquo;traffic cone\u0026rsquo;, \u0026rsquo;traffic barrel\u0026rsquo;, \u0026lsquo;pedestrian signal\u0026rsquo;, \u0026lsquo;crowd\u0026rsquo;, \u0026lsquo;animal\u0026rsquo;, \u0026lsquo;dog\u0026rsquo;, \u0026lsquo;bird\u0026rsquo;, \u0026lsquo;cat\u0026rsquo;, \u0026lsquo;public 
restroom\u0026rsquo;, \u0026lsquo;fountain\u0026rsquo;,\n\u0026lsquo;statue\u0026rsquo;, \u0026lsquo;monument\u0026rsquo;, \u0026lsquo;picnic table\u0026rsquo;, \u0026lsquo;outdoor advertisement\u0026rsquo;, \u0026lsquo;vendor cart\u0026rsquo;, \u0026lsquo;food truck\u0026rsquo;, \u0026rsquo;emergency exit\u0026rsquo;, \u0026rsquo;no smoking sign\u0026rsquo;, \u0026lsquo;wet floor sign\u0026rsquo;, \u0026lsquo;closed sign\u0026rsquo;, \u0026lsquo;open sign\u0026rsquo;, \u0026rsquo;entrance sign\u0026rsquo;, \u0026rsquo;exit sign\u0026rsquo;, \u0026lsquo;stairs sign\u0026rsquo;, \u0026rsquo;escalator sign\u0026rsquo;, \u0026rsquo;elevator sign\u0026rsquo;, \u0026lsquo;restroom sign\u0026rsquo;, \u0026lsquo;braille sign\u0026rsquo;, \u0026lsquo;audio signal device\u0026rsquo;, \u0026rsquo;tactile paving\u0026rsquo;, \u0026lsquo;detectable warning surface\u0026rsquo;, \u0026lsquo;guide rail\u0026rsquo;, \u0026lsquo;handrail\u0026rsquo;, \u0026rsquo;turnstile\u0026rsquo;, \u0026lsquo;gate\u0026rsquo;, \u0026lsquo;security checkpoint\u0026rsquo;, \u0026lsquo;water dispenser\u0026rsquo;, \u0026lsquo;vending machine\u0026rsquo;, \u0026lsquo;ATM\u0026rsquo;, \u0026lsquo;kiosk\u0026rsquo;, \u0026lsquo;public telephone\u0026rsquo;, \u0026lsquo;public Wi-Fi hotspot\u0026rsquo;, \u0026rsquo;emergency phone\u0026rsquo;, \u0026lsquo;charging station\u0026rsquo;, \u0026lsquo;first aid station\u0026rsquo;, \u0026lsquo;defibrillator\u0026rsquo;, \u0026rsquo;tree\u0026rsquo;, \u0026lsquo;pole\u0026rsquo;, \u0026rsquo;lamp post\u0026rsquo;, \u0026lsquo;staff\u0026rsquo;, \u0026lsquo;road hazard\u0026rsquo;]\nWalking General: [\u0026lsquo;vehicles\u0026rsquo;, \u0026lsquo;pedestrians\u0026rsquo;, \u0026rsquo;traffic signs and signals\u0026rsquo;, \u0026lsquo;roadway features\u0026rsquo;, \u0026lsquo;surface conditions\u0026rsquo;, \u0026lsquo;street furniture\u0026rsquo;, \u0026lsquo;construction areas\u0026rsquo;, \u0026lsquo;vegetation\u0026rsquo;, \u0026lsquo;animals\u0026rsquo;, 
\u0026lsquo;public amenities\u0026rsquo;, \u0026rsquo;navigation aids\u0026rsquo;, \u0026rsquo;temporary obstacles\u0026rsquo;, \u0026rsquo;emergency facilities\u0026rsquo;, \u0026rsquo;transportation hubs\u0026rsquo;, \u0026rsquo;electronic devices\u0026rsquo;, \u0026lsquo;safety features\u0026rsquo;]\nUrban Walking Hazards: [\u0026lsquo;person\u0026rsquo;, \u0026lsquo;cyclist\u0026rsquo;, \u0026lsquo;car\u0026rsquo;, \u0026lsquo;bus\u0026rsquo;, \u0026lsquo;motorcycle\u0026rsquo;, \u0026lsquo;scooter\u0026rsquo;, \u0026lsquo;fountain\u0026rsquo;, \u0026lsquo;red traffic light\u0026rsquo;, \u0026lsquo;green traffic light\u0026rsquo;, \u0026lsquo;stop sign\u0026rsquo;, \u0026lsquo;curb\u0026rsquo;, \u0026lsquo;ramp\u0026rsquo;, \u0026lsquo;stair\u0026rsquo;, \u0026rsquo;escalator\u0026rsquo;, \u0026rsquo;elevator\u0026rsquo;, \u0026lsquo;bench\u0026rsquo;, \u0026rsquo;trash can\u0026rsquo;, \u0026lsquo;pole\u0026rsquo;, \u0026rsquo;tree\u0026rsquo;, \u0026lsquo;fire hydrant\u0026rsquo;, \u0026rsquo;lamp post\u0026rsquo;, \u0026lsquo;construction barrier\u0026rsquo;, \u0026lsquo;construction sign\u0026rsquo;, \u0026lsquo;scaffolding\u0026rsquo;, \u0026lsquo;hole\u0026rsquo;, \u0026lsquo;crack\u0026rsquo;, \u0026lsquo;speed bump\u0026rsquo;, \u0026lsquo;puddle\u0026rsquo;, \u0026lsquo;manhole\u0026rsquo;, \u0026lsquo;drain\u0026rsquo;, \u0026lsquo;grate\u0026rsquo;, \u0026rsquo;loose gravel\u0026rsquo;, \u0026lsquo;ice patch\u0026rsquo;, \u0026lsquo;snow pile\u0026rsquo;, \u0026rsquo;leaf pile\u0026rsquo;, \u0026lsquo;standing water\u0026rsquo;, \u0026lsquo;mud\u0026rsquo;, \u0026lsquo;sand\u0026rsquo;, \u0026lsquo;street sign\u0026rsquo;, \u0026lsquo;directional sign\u0026rsquo;, \u0026lsquo;information sign\u0026rsquo;, \u0026lsquo;parking meter\u0026rsquo;, \u0026lsquo;mailbox\u0026rsquo;, \u0026lsquo;bicycle rack\u0026rsquo;, \u0026lsquo;outdoor seating\u0026rsquo;, \u0026lsquo;planter box\u0026rsquo;, \u0026lsquo;bollard\u0026rsquo;, 
\u0026lsquo;guardrail\u0026rsquo;, \u0026rsquo;traffic cone\u0026rsquo;, \u0026rsquo;traffic barrel\u0026rsquo;, \u0026lsquo;pedestrian signal\u0026rsquo;, \u0026lsquo;crowd\u0026rsquo;, \u0026lsquo;animal\u0026rsquo;, \u0026lsquo;dog\u0026rsquo;, \u0026lsquo;bird\u0026rsquo;, \u0026lsquo;cat\u0026rsquo;, \u0026lsquo;public restroom\u0026rsquo;, \u0026lsquo;fountain\u0026rsquo;, \u0026lsquo;statue\u0026rsquo;, \u0026lsquo;monument\u0026rsquo;, \u0026lsquo;picnic table\u0026rsquo;, \u0026lsquo;outdoor advertisement\u0026rsquo;, \u0026lsquo;vendor cart\u0026rsquo;, \u0026lsquo;food truck\u0026rsquo;, \u0026rsquo;emergency exit\u0026rsquo;, \u0026rsquo;no smoking sign\u0026rsquo;, \u0026lsquo;wet floor sign\u0026rsquo;, \u0026lsquo;closed sign\u0026rsquo;, \u0026lsquo;open sign\u0026rsquo;, \u0026rsquo;entrance sign\u0026rsquo;, \u0026rsquo;exit sign\u0026rsquo;, \u0026lsquo;stairs sign\u0026rsquo;, \u0026rsquo;escalator sign\u0026rsquo;, \u0026rsquo;elevator sign\u0026rsquo;, \u0026lsquo;restroom sign\u0026rsquo;, \u0026lsquo;braille sign\u0026rsquo;, \u0026lsquo;audio signal device\u0026rsquo;, \u0026rsquo;tactile paving\u0026rsquo;, \u0026lsquo;detectable warning surface\u0026rsquo;, \u0026lsquo;guide rail\u0026rsquo;, \u0026lsquo;handrail\u0026rsquo;, \u0026rsquo;turnstile\u0026rsquo;, \u0026lsquo;gate\u0026rsquo;, \u0026lsquo;security checkpoint\u0026rsquo;, \u0026lsquo;water dispenser\u0026rsquo;, \u0026lsquo;vending machine\u0026rsquo;, \u0026lsquo;ATM\u0026rsquo;, \u0026lsquo;kiosk\u0026rsquo;, \u0026lsquo;public telephone\u0026rsquo;, \u0026rsquo;emergency phone\u0026rsquo;, \u0026lsquo;charging station\u0026rsquo;, \u0026lsquo;first aid station\u0026rsquo;, \u0026lsquo;defibrillator\u0026rsquo;, \u0026lsquo;oil spill\u0026rsquo;, \u0026lsquo;road debris\u0026rsquo;, \u0026lsquo;branches\u0026rsquo;, \u0026lsquo;water\u0026rsquo; \u0026rsquo;low-hanging signage\u0026rsquo;, \u0026lsquo;road signs\u0026rsquo;, \u0026lsquo;roadworks\u0026rsquo;, 
\u0026rsquo;excavation sites\u0026rsquo;, \u0026lsquo;utility works\u0026rsquo;, \u0026lsquo;fallen objects\u0026rsquo;, \u0026lsquo;spilled cargo\u0026rsquo;, \u0026lsquo;flood\u0026rsquo;, \u0026lsquo;ice\u0026rsquo;, \u0026lsquo;snowdrift\u0026rsquo;, \u0026rsquo;landslide debris\u0026rsquo;, \u0026rsquo;erosion damage\u0026rsquo;, \u0026lsquo;parked vehicles\u0026rsquo;, \u0026lsquo;moving equipment\u0026rsquo;, \u0026rsquo;large gatherings\u0026rsquo;, \u0026lsquo;parade\u0026rsquo;, \u0026lsquo;marathon\u0026rsquo;, \u0026lsquo;street fair\u0026rsquo;, \u0026lsquo;scaffolding\u0026rsquo;, \u0026rsquo;electrical hazards\u0026rsquo;, \u0026lsquo;wire tangle\u0026rsquo;, \u0026lsquo;manhole covers\u0026rsquo;, \u0026lsquo;street elements\u0026rsquo;, \u0026lsquo;road hazards\u0026rsquo;, \u0026rsquo;toxic spill\u0026rsquo;, \u0026lsquo;biohazard materials\u0026rsquo;, \u0026lsquo;wildlife crossings\u0026rsquo;, \u0026lsquo;stray animals\u0026rsquo;, \u0026lsquo;pets\u0026rsquo;, \u0026lsquo;flying debris\u0026rsquo;, \u0026lsquo;air pollution\u0026rsquo;,\u0026lsquo;smoke plumes\u0026rsquo;, \u0026lsquo;dust storms\u0026rsquo;, \u0026lsquo;sandstorms\u0026rsquo;, \u0026lsquo;floods\u0026rsquo;, \u0026lsquo;road crack\u0026rsquo;]\nWalking test: [ \u0026lsquo;vehicle\u0026rsquo;, \u0026lsquo;pedestrian\u0026rsquo;, \u0026lsquo;cyclist\u0026rsquo;, \u0026rsquo;traffic signal\u0026rsquo;, \u0026lsquo;street sign\u0026rsquo;, \u0026lsquo;crosswalk\u0026rsquo;, \u0026lsquo;sidewalk\u0026rsquo;, \u0026lsquo;curb\u0026rsquo;, \u0026lsquo;ramp\u0026rsquo;, \u0026lsquo;stair\u0026rsquo;, \u0026rsquo;escalator\u0026rsquo;, \u0026rsquo;elevator\u0026rsquo;, \u0026lsquo;public seating\u0026rsquo;, \u0026rsquo;trash receptacle\u0026rsquo;, \u0026lsquo;street furniture\u0026rsquo;, \u0026rsquo;tree\u0026rsquo;, \u0026lsquo;construction site\u0026rsquo;, \u0026lsquo;road obstruction\u0026rsquo;, \u0026rsquo;loose materials\u0026rsquo;, \u0026lsquo;slick surface\u0026rsquo;, 
\u0026lsquo;animal\u0026rsquo;, \u0026lsquo;outdoor advertisement\u0026rsquo;, \u0026lsquo;vendor\u0026rsquo;, \u0026lsquo;water feature\u0026rsquo;, \u0026lsquo;monument\u0026rsquo;, \u0026lsquo;information point\u0026rsquo;, \u0026lsquo;access point\u0026rsquo;, \u0026lsquo;safety equipment\u0026rsquo;, \u0026rsquo;navigation aid\u0026rsquo;, \u0026lsquo;public amenity\u0026rsquo;, \u0026rsquo;transport hub\u0026rsquo;, \u0026lsquo;obstacle crowd\u0026rsquo; ]\nAnnotation mask = ’people’, ’human face’, ’car license plate’, ’license plate’, ’plate’\n","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/visiongpt-llm-assisted-real-time-anomaly-detection-for-safe-visual-navigation/","section":"Papers","summary":"A framework combining lightweight object detection and large language models for real-time visual navigation safety and anomaly detection, with dynamic scenario switching and prompt engineering.","title":"VISIONGPT: LLM-ASSISTED REAL-TIME ANOMALY DETECTION FOR SAFE VISUAL NAVIGATION","type":"application"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/wanshun-su/","section":"Authors","summary":"","title":"Wanshun Su","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/weiyang-liu/","section":"Authors","summary":"","title":"Weiyang Liu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/wenhong-wang/","section":"Authors","summary":"","title":"Wenhong Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/wenxuan-liu/","section":"Authors","summary":"","title":"Wenxuan Liu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/willi-menapace/","section":"Authors","summary":"","title":"Willi 
Menapace","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/wilton-w.t.fok/","section":"Authors","summary":"","title":"Wilton W.T.Fok","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xi-shen/","section":"Authors","summary":"","title":"Xi Shen","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xian-zhong/","section":"Authors","summary":"","title":"Xian Zhong","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiang-wang/","section":"Authors","summary":"","title":"Xiang Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiaodong-cun/","section":"Authors","summary":"","title":"Xiaodong Cun","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiaogang-xu/","section":"Authors","summary":"","title":"Xiaogang Xu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiaohao-peng/","section":"Authors","summary":"","title":"Xiaohao Peng","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiaojuan-qi/","section":"Authors","summary":"","title":"Xiaojuan Qi","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiaoyu-wu/","section":"Authors","summary":"","title":"Xiaoyu Wu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiaozhen-ding/","section":"Authors","summary":"","title":"Xiaozhen Ding","type":"authors"},{"content":"","date":"1 October 
2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xinyi-wang/","section":"Authors","summary":"","title":"Xinyi Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiwen-chen/","section":"Authors","summary":"","title":"Xiwen Chen","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xu-liu/","section":"Authors","summary":"","title":"Xu Liu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xuxu-wang/","section":"Authors","summary":"","title":"Xuxu Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yanwei-li/","section":"Authors","summary":"","title":"Yanwei Li","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yashika-jain/","section":"Authors","summary":"","title":"Yashika Jain","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yedid-hoshen/","section":"Authors","summary":"","title":"Yedid Hoshen","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yi-wang/","section":"Authors","summary":"","title":"Yi Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yik-chung-wu/","section":"Authors","summary":"","title":"Yik-Chung Wu","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yiming-wang/","section":"Authors","summary":"","title":"Yiming Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/ying-cong-chen/","section":"Authors","summary":"","title":"Ying-Cong 
Chen","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yingxian-chen/","section":"Authors","summary":"","title":"Yingxian Chen","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yinzhi-cao/","section":"Authors","summary":"","title":"Yinzhi Cao","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yuchen-yang/","section":"Authors","summary":"","title":"Yuchen Yang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yuehuan-wang/","section":"Authors","summary":"","title":"Yuehuan Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yujia-sun/","section":"Authors","summary":"","title":"Yujia Sun","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yuran-wang/","section":"Authors","summary":"","title":"Yuran Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/zheng-wang/","section":"Authors","summary":"","title":"Zheng Wang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/zhiwei-yang/","section":"Authors","summary":"","title":"Zhiwei Yang","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/zihao-gong/","section":"Authors","summary":"","title":"Zihao Gong","type":"authors"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/zihao-liu/","section":"Authors","summary":"","title":"Zihao Liu","type":"authors"},{"content":" VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs # Hanan Gani\n*1 Rohit 
Bharadwaj *2 Fahad Shahbaz Khan1,4 Salman Khan1,5\nMuzammal Naseer 3\n1 Mohamed Bin Zayed University of Artificial Intelligence, 2 University of Edinburgh,\n3 Department of Computer Science, Khalifa University,\n4 Linköping University, 5 Australian National University\nCorrespondence: hanan.ghani@mbzuai.ac.ae , rohit.bharadwaj@ed.ac.uk\nAbstract # The recent advancements in Large Language Models (LLMs) have greatly influenced the development of Large Multi-modal Video Models (Video-LMMs), significantly enhancing our ability to interpret and analyze video data. Despite their impressive capabilities, current Video-LMMs have not been evaluated for anomaly detection tasks, which is critical to their deployment in practical scenarios e.g., towards identifying deepfakes, manipulated video content, traffic accidents and crimes. In this paper, we introduce VANE-Bench, a benchmark designed to assess the proficiency of Video-LMMs in detecting and localizing anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models, encompassing a variety of subtle anomalies and inconsistencies grouped into five categories: unnatural transformations, unnatural appearance, pass-through, disappearance and sudden appearance. Additionally, our benchmark features real-world samples from existing anomaly detection datasets, focusing on crime-related irregularities, atypical pedestrian behavior, and unusual events. The task is structured as a visual question-answering challenge to gauge the models\u0026rsquo; ability to accurately detect and localize the anomalies within the videos. We evaluate nine existing Video-LMMs, both open and closed sources, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies. 
In conclusion, our research offers significant insights into the current capabilities of Video-LMMs in the realm of anomaly detection, highlighting the importance of our work in evaluating and improving these models for real-world applications. Our code and data are publicly available at https://github.com/rohit901/VANE-Bench.\n* Equal contribution\n1 Introduction # Large Language Models (LLMs) like ChatGPT have ushered in a new era of real-world AI applications in varied and diverse sectors like manufacturing, legal services, space exploration, transportation, retail, healthcare, education, and technology (Abdullah et al., 2022; Marr, 2023). Further, the current trend in the development of these LLMs has been to introduce multi-modal capabilities like vision and audio to these models along with text (et al., 2024; OpenAI, 2024). This motivates us to ask whether current Large Multi-modal Models (LMMs) are capable and accurate in tackling Video Anomaly Detection (VAD), which has immense practical applications in factories, autonomous driving, crime warning, and traffic management (Liu et al., 2024a).\nFurther, we have recently observed superior visual quality of various AI-generated videos due to improvements in the underlying algorithms, which are based on diffusion models and transformers (Brooks et al., 2024; HPCAI Tech, 2024; Peebles and Xie, 2023). The current state-of-the-art (SOTA) AI text-to-video model is SORA from OpenAI (Brooks et al., 2024). The videos produced by SORA are of extremely high fidelity, which makes them nearly indistinguishable from real-life footage. Thus, SORA brings new challenges in tackling misinformation, identifying deepfakes, and distinguishing real from fake videos, especially during crucial events like democratic elections. 
Therefore, developing automated solutions to identify AI-generated videos has become the need of the hour.\nFigure 1: Samples showing the AI-Generated video category of VANE-Bench. We collect these synthetic videos from SORA (Brooks et al., 2024), Open-Sora (HPCAI Tech, 2024), Runway Gen2 (Runway Research, 2024), ModelScopeT2V (Wang et al., 2023a), and VideoLCM (Wang et al., 2023b). The correct option in each question is highlighted in bold. Note that many of these anomalies are extremely subtle and difficult for humans to detect since the changes happen in rapid succession, with the entire video played in under a second. Anomalies are identified with red bounding boxes for clarity. Note that our actual dataset does not contain bounding box overlays.\nMotivated by the above-mentioned points, we propose a novel and challenging benchmark, VANE-Bench: Video ANomaly Evaluation Benchmark, to evaluate various closed-source and open-source Video-LMMs on their ability to detect anomalies in the videos. Our VANE-Bench consists of both real-world video anomalies from diverse surveillance footage capturing unusual pedestrian behaviour, criminal activities, and unusual events, as well as subtle and challenging anomalies and inconsistencies present in various AI-generated videos (see Fig. 1). These AI-generated videos, especially from SOTA models like SORA, have subtle and hard-to-detect anomalies, which makes this a challenging task even for many humans. However, automatically detecting and identifying the anomalies in these synthetic video clips serves as an important step towards identifying AI-generated videos in the wild. We reformulate the problem statement of VAD into a visual question-answering (VQA) task to facilitate easier evaluation of LMMs. 
However, despite evaluating nine recent Video-LMMs on VANE-Bench, we find that most current LMMs still struggle on this benchmark (see Fig. 2), making VANE-Bench a challenging and useful benchmark for tracking the progress of Video-LMMs for the foreseeable future.\nOur contributions can be summarized as follows:\nWe present VANE-Bench: Video ANomaly Evaluation Benchmark, consisting of 325 video clips and 559 challenging question-answer pairs from both real-world video surveillance and AI-generated videos.\nWe perform a detailed evaluation of nine state-of-the-art closed-source and open-source Video-LMMs on VANE-Bench, and show that most models exhibit poor performance, highlighting the challenging nature of our proposed benchmark.\nWe conduct detailed result analysis, and also perform human evaluation on VANE-Bench to set a reasonable benchmark target.\nWe will open-source our code, and describe the data construction process of VANE-Bench along with making our data publicly available.\nWe hope that VANE-Bench serves as a strong benchmark to improve the performance and capabilities of Video-LMMs on anomaly detection.\n2 Related Work # Video-LMMs: LMMs integrate linguistic and visual data to process videos, leveraging LLMs like Llama (Meta, 2024) and connecting them with modality-specific encoders via interfaces like Q-Former (Zhang et al., 2024; Dai et al., 2023; Yin et al., 2024). Notable open-source Video-LMMs include VideoChat (Li et al., 2023b), which uses a chat-centric system, and Video-ChatGPT (Maaz et al., 2023), which combines a visual encoder with an LLM for detailed video conversations. Video-LLaMA (Zhang et al., 2023a) integrates audio and visual signals using Q-Formers, while LLaMA-ViD (Li et al., 2023c) represents each frame with context and content tokens for efficient processing. 
Despite these advancements, our work shows current LMMs perform poorly on VANE-Bench, highlighting the need for stronger models in anomaly detection.\nVideo-LMMs Benchmarking: Benchmarks like SEED-Bench (Li et al., 2023a) and MVBench (Li et al., 2024b) assess general comprehension through multiple-choice questions but lack focus on anomaly detection in AI-generated videos. CVRR-ES (Khattak et al., 2024) evaluates real-world scenarios with open-ended questions but doesn\u0026rsquo;t address AI-generated inconsistencies. VANE-Bench specifically evaluates VAD in both real-world and AI-generated videos, providing a targeted benchmark for this task. While Perception Test (Pătrăucean et al., 2023) focuses on lower-level perception in real-world videos, VANE-Bench targets subtle anomalies in AI-generated content, making it essential for assessing Video-LMM robustness.\nVideo Anomaly Detection: Traditional VAD methods typically rely on hand-crafted features and statistical models to identify deviations from normality. CUVA (Du et al., 2024) is a comprehensive benchmark that focuses on the causation of video anomalies. A survey on generalized VAD (Liu et al., 2024b) categorizes various methodologies and highlights benchmark limitations. These methods often fail with complex AI-generated videos. VANE-Bench addresses this by focusing on VAD in such videos, complementing existing benchmarks and targeting subtle inconsistencies in high-fidelity AI-generated content.\n3 Dataset \u0026amp; Benchmark # Recent advancements in multi-modal Large Language Models (LLMs) have enabled these models to process text, image, and video data, presenting new opportunities and challenges in Video Anomaly Detection (VAD) (Liu et al., 2024a). 
Motivated by this progress, we aim to benchmark the capabilities of these multi-modal models (LMMs) on VAD.\nTo address VAD, we propose VANE-Bench: Video ANomaly Evaluation Benchmark for Conversational LMMs, comprising 325 video clips and 559 challenging ground-truth question-answer (QA) pairs. We have adapted the VAD problem into a Multiple-Choice Video Question Answering (MC-Video QA) (Tapaswi et al., 2016; Lei et al., 2019; Yu et al., 2019) task to facilitate the evaluation of LMMs, allowing for a more granular assessment of their video content understanding.\nWe evaluate the latest closed-source and open-source LMMs on VANE-Bench. Sec. 3.1 provides an overview of VANE-Bench, Sec. 3.2 describes the dataset categories, and Sec. 3.3 outlines our data collection methodology.\n3.1 Overview # VANE-Bench consists of 325 video clips spanning real-world and synthetic video anomalies. We adapted standard VAD surveillance datasets such as CUHK Avenue (Lu et al., 2013), UCF-Crime (Sultani et al., 2018), and UCSD Pedestrian (Li et al., 2014) to our MC-Video QA problem. Additionally, we included 197 video clips from various open-source and state-of-the-art closed-source text-to-video diffusion models (Brooks et al., 2024; HPCAI Tech, 2024; Runway Research, 2024; Wang et al., 2023a,b).\nFigure 2: Left: Performance of Video-LMMs on five anomaly categories of SORA dataset. Right: Overall performance of Video-LMMs averaged across all the benchmark datasets, including AI-generated and real-world anomaly datasets.\nThe diverse data backgrounds and varied difficulty levels in VANE-Bench make it ideal for evaluating the reasoning and understanding capabilities of video LMMs. 
Benchmarking these models on a range of real-world and synthetic anomalies helps us understand their strengths and limitations, guiding future multi-modal AI research.\nOverall, VANE-Bench aims to push the boundaries of what LMMs can achieve in video anomaly detection, providing a rigorous standard for evaluating their performance on this challenging task.\n3.2 Categories # The VANE-Bench dataset encompasses a variety of categories derived from both real-world surveillance footage and AI-generated video clips. Each category represents a distinct source and type of video anomaly. Below, we detail the different categories included in the dataset:\nReal-World anomalies: The videos with these anomalies are sourced from several established real-world anomaly datasets, encompassing diverse anomaly types. The distribution of these anomalies is depicted in Fig. 3 (middle). Fig. 3 (right) provides the total number of anomaly clips along with corresponding QA pairs for each dataset in this category. Detailed descriptions of each dataset within this category follow below.\nCUHK Avenue (Lu et al., 2013): This category consists of 11 video clips with 33 associated question-answer (QA) pairs. The clips capture anomalous events in a campus environment, showing individuals commuting on a university campus and walking in and out of buildings. Anomaly types. The anomalies include unusual pedestrian behavior like randomly throwing bags and papers or performing weird actions or dance moves.\nUCF-Crime (Sultani et al., 2018): Comprising 95 video clips with 95 QA pairs, this category includes real-world surveillance footage. Anomaly types. The videos depict various criminal activities, such as arrest, assault, burglary, robbery, stealing, and vandalism.\nUCSD-Ped1 (Li et al., 2014): This category contains 10 video clips with 30 QA pairs. The videos focus on pedestrian walkways. The Ped1 dataset is captured by a camera facing perpendicular to the road. Anomaly types. 
The anomalous events are due to the presence of non-pedestrian entities (i.e., bikers, skaters, small carts, and wheelchairs) in the walkways.\nUCSD-Ped2 (Li et al., 2014): Similar to UCSD-Ped1, this category includes 12 video clips with 36 QA pairs. In contrast with Ped1, the Ped2 dataset uses a camera that is parallel to the road. Anomaly types. Abnormal events are due to non-pedestrian entities in the walkways including bikers, skaters, small carts, and people walking across a walkway.\nFigure 3: VANE-Bench dataset statistics: Left and Middle: Composition and type of anomalies present in AI-generated and real-world videos. Right: Number of samples and QA pairs present in each type of video dataset.\nAI-Generated anomalies: The videos with these anomalies are obtained from various closed-source and open-source text-to-video diffusion models. The anomalies in these clips are usually subtle and hard to detect, which makes our VANE-Bench benchmark challenging. General anomaly types: The anomalies include the sudden appearance of objects, the unnatural transformation of solid physical objects, the disappearance of objects, objects passing through other solids, and unnatural appearance of objects (i.e., distorted and deformed facial features, or other unnatural appearance like presence of extra fingers). The distribution of these anomalies in the dataset is shown in Fig. 3 (left), and statistics about the number of clips and corresponding QA pairs are presented in Fig. 3 (right). Below, we describe the type of video samples in this category.\nSORA (Brooks et al., 2024): This category consists of 46 video clips with 138 QA pairs. The video clips are generated using SORA, a state-of-the-art AI text-to-video model. Due to the high quality and almost realistic-looking videos generated by SORA, it becomes quite difficult to accurately identify the inconsistencies or anomalies present in the videos. 
OpenSora (HPCAI Tech, 2024): With 50 video clips and 50 QA pairs, this category features AI-generated videos from the open-source version of SORA.\nRunway Gen2 (Runway Research, 2024): This category includes 25 video clips with 25 QA pairs created using a commercial text-to-video AI model.\nModelScopeT2V (Wang et al., 2023a): This category comprises 24 video clips with 48 QA pairs, leveraging the video diffusion model trained by Wang et al. (2023a) to produce videos from text captions. The videos were generated with 50 diffusion steps at 16 fps.\nVideoLCM (Wang et al., 2023b): This category features 52 video clips with 104 QA pairs, generated using latent consistency models (Wang et al., 2023b) designed to create videos with high variability and lower latency. We used 20 diffusion steps to generate the videos at 16 fps. The videos were further post-processed by an LCM model trained on higher-resolution videos to obtain better quality videos.\nBy including a wide range of video sources and anomaly types, the VANE-Bench dataset provides a comprehensive benchmark for evaluating the capabilities of large multi-modal models in video anomaly detection.\n3.3 Constructing VANE-Bench # Fig. 4 describes the construction process of the VANE-Bench dataset. Since the synthetic AI-generated videos from state-of-the-art models like SORA (Brooks et al., 2024) have subtle and hard-to-detect inconsistencies, we require high-quality captions describing all of the specific inconsistencies present in the given video. Our pipeline first annotates the anomalies using the frame annotation module (FAM). The caption generation module (CGM) then utilizes these annotations to produce captions, followed by the question-answer generation module (QAGM), creating QA pairs based on the annotated frames and captions. Annotating the clips before caption generation is crucial for focusing the model on the specific anomaly regions in the video (Shtedritski et al., 2023; Zhang et al. 
, 2023b; Yang et al., 2023). Without annotations, the CGM often fails to reference the anomalies in the captions, as demonstrated in Sec. C of the supplementary material. We briefly describe the three stages involved in the semi-automatic dataset construction pipeline below.\nFigure 4: Flow diagram showing the semi-automatic construction process of our VANE-Bench dataset. The entire process can be divided into 3 interconnected stages/modules, i.e., i. Frame Annotation Module (FAM), ii. Caption Generation Module (CGM), iii. Question Answer Generation Module (QAGM).\n3.3.1 Frame Annotation Module (FAM) # As described in Sec. 3.2, we first collect raw videos from existing VAD datasets like CUHK Avenue (Lu et al., 2013), UCF-Crime (Sultani et al., 2018), UCSD-Ped (Li et al., 2014), and also add additional challenging AI-generated videos to the mix. For the VAD datasets, the bounding box annotations were already provided for a subset of the videos from these datasets. Thus, we only annotate the anomalies present in the AI-generated videos. In this stage, we first break down the raw videos into their constituent image frames. Second, we select and filter 10 consecutive frames from the video that contain the inconsistency. We annotate these selected frames with a bounding box mentioning the type of inconsistency. We consider the following inconsistency types: \u0026lsquo;Sudden Appearance\u0026rsquo;, \u0026lsquo;Unnatural Transform\u0026rsquo;, \u0026lsquo;Disappearance\u0026rsquo;, \u0026lsquo;Pass-through\u0026rsquo;, and \u0026lsquo;Unnatural Appearance\u0026rsquo;. Fig. 
4 shows the annotated \u0026lsquo;Unnatural Transform\u0026rsquo; inconsistency affecting the kangaroo\u0026rsquo;s legs and tails.\n3.3.2 Caption Generation Module (CGM) # The second stage of our data collection process involves the Caption Generation Module (CGM), which uses the annotated video frames from FAM to generate a high-quality and detailed caption that describes the inconsistency, along with the general events in the video. To generate the caption, we design a specialised custom prompt (Sec. D.1), and use the recently released GPT-4o (OpenAI, 2024) LMM, which has shown both impressive performance gains and cost savings. Thus, the GPT-4o model takes in our custom prompt, along with the annotated frames, to generate the descriptive video caption as shown in Fig. 4.\n3.3.3 Question Answer Generation Module (QAGM) # The final stage of our VANE-Bench construction process uses the generated caption from CGM and the annotated frames from FAM to output the final high-quality and challenging Question and Answer (QA) pairs. We create another custom prompt (Sec. D.2), which we pass to the GPT-4o model along with the caption and the annotated frames to generate the QA pairs. The selected raw frames containing the inconsistency and their corresponding generated QA pairs form our VANE-Bench dataset.\n4 Experiments and Results # Video-LMMs. We evaluate the anomaly detection and comprehension capabilities of both open-source and closed-source models. Among the open-source models, we evaluate 7 recent Video-LMMs, including Video-LLaVA (Lin et al., 2023), TimeChat (Ren et al., 2023), MovieChat (Song et al., 2023), LLaMA-ViD (Li et al., 2023c), VideoChat (Li et al., 2023b), Video-ChatGPT (Maaz et al., 2023), and Video-LLaMA-2 (Zhang et al., 2023a). For evaluating closed-source models, we use Gemini-1.5 Pro (Google, 2023) and GPT-4o (OpenAI, 2024).\nEvaluation Protocol. 
For the evaluation of Gemini and GPT-4o, we utilize their respective official APIs, with each model receiving 10 video frames as input. The 10 frames are selected in a manner that encompasses all or the majority of the inconsistencies present in the video. In cases where an anomaly spans a longer duration, we sample multiple sets of 10 frames to ensure comprehensive coverage. As GPT-4o does not inherently support videos, we input the video clips as 10 frames to the GPT API, accompanied by the corresponding Visual Question-Answering (VQA) query. For each model under assessment, we generate responses to the questions independently and without retaining the conversation history. A few models, such as MovieChat, output hallucinated responses when instructed to answer the query. In such cases, we consider the hallucinated responses as incorrect answers due to the inability of the model to comprehend the situation in the video.\nTable 1: Evaluation results of Video-LMMs across different types of video samples on the VANE benchmark. We present results for both open-source and closed-source models. The first five rows show results on AI-generated videos and the last four contain results on real-world anomaly datasets.\n| Benchmark Category | Video-LLaMA | VideoChat | Video-ChatGPT | Video-LLaVA | MovieChat | LLaMA-ViD | TimeChat | Gemini-1.5 Pro | GPT-4o |\n|---|---|---|---|---|---|---|---|---|---|\n| SORA | 11.59 | 10.74 | 26.47 | 10.86 | 8.69 | 7.97 | 21.73 | 51.45 | 55.8 |\n| OpenSORA | 18 | 28 | 22 | 18 | 10 | 14 | 26 | 84 | 68 |\n| Runway Gen2 | 16 | 4 | 12 | 16 | 1600 | 20 | 28 | 28 | 40 |\n| VideoLCM | 10.57 | 17.64 | 18.26 | 19.23 | 14.42 | 19.23 | 22.11 | 49.04 | 50.96 |\n| Modelscope-T2V | 10.41 | 20.83 | 16.66 | 16.66 | 6.25 | 14.58 | 20.83 | 75 | 64.58 |\n| Avenue | 30 | 32.25 | 39.39 | 3.03 | 18.18 | 27.27 | 24.2 | 100 | 84.85 |\n| UCFCrime | 9.47 | 11.57 | 31.57 | 10.52 | 18.51 | 15.78 | 7.3 | 76.84 | 83.16 |\n| UCSD-Ped1 | 16.66 | 13.33 | 40 | 2.77 | 6.66 | 6.66 | 27.58 | 96.67 | 93.33 |\n| UCSD-Ped2 | 5.55 | 13.88 | 19.44 | 6.06 | 11.11 | 19.44 | 11.11 | 94.44 | 86.11 |\nEvaluation metric. 
To evaluate the Video-LMMs on our proposed VANE-Bench benchmark, we employ the standard VQA accuracy measure, which assigns a score of 1 to each correct answer and a score of 0 to each incorrect answer. The reported accuracy for each category is the percentage of its questions answered correctly.\n4.1 Main Evaluation Results # 4.1.1 Evaluation on Video-LLMs # AI-Generated anomalies. The AI-generated videos in our dataset are derived from five distinct models: SORA, OpenSORA, Runway Gen2, VideoLCM, and Modelscope-T2V. In the majority of these videos, the anomalies are subtle and not readily apparent, even to the human eye. As previously stated in Section 3.2, the synthetic anomalies can manifest in five different forms. As shown in Table 1, the performance of open-source models in detecting anomalies in these videos is subpar. Although closed-source models outperform their open-source counterparts, their overall comprehension and detection of anomalies in the videos remain inadequate. This indicates that even robust closed-source models encounter difficulties in identifying subtle anomalies within the videos.\nReal-world anomalies. Our real-world anomaly benchmark, as discussed in Section 3.2, comprises four real-world datasets and focuses on detecting crime-related irregularities, atypical pedestrian behavior, and unusual events. These anomalies are prevalent in real-world scenarios. In our analysis, we find that open-source models encounter difficulties in locating and identifying these anomalies. As shown in Table 1, these models perform poorly on these datasets. Conversely, we observe that closed-source models excel at detecting such real-world anomalies, indicating that they can effectively differentiate between unusual events in real-world scenarios. 
This can likely be attributed to the fact that these models are trained on a vast amount of existing real-world, internet-scale data.\nWe provide results on additional recent Video-LMMs in Section A.1 of the supplementary material.\n4.1.2 Human Evaluation # We conducted a human evaluation on SORA-generated videos, which contain subtle and challenging anomalies that are difficult for humans to detect (see Fig. 1 top row) in a single viewing. Moreover, most of the video clips contain a multitude of foreground and background characters and elements, which makes it difficult for humans to focus on the inconsistencies within the short time frame. Some of the questions also specifically inquire about inconsistencies present in the background characters of the clips rather than the foreground ones. To ensure fairness, our human evaluation was conducted under a set of rules, which include showing all 10 frames of the video to the human evaluator only once, followed by the question. Our human evaluation comparisons are presented in Fig. 5. While humans outperform open-source models in detecting these subtle anomalies, their performance remains sub-optimal. This indicates that, with the advancements in video generation techniques, there is a pressing need for more sophisticated and effective Video-LMMs that can assist in detecting such challenging cases, which can evade human eyes as well.\nFigure 5: Human vs Video-LMMs\u0026rsquo; performance on SORA. Performance comparison of humans vs Video-LMMs on the VQA task of detecting anomalies in the SORA dataset. We find that closed-source Video-LMMs perform comparably to humans while open-source Video-LMMs struggle to detect subtle anomalies.\n4.2 Additional Analysis # Inconsistencies in Predictions. We find that, in the majority of cases, open-source Video-LMMs generate different results when prompted to answer the same query multiple times. Fig. 
6 illustrates an example where the same questions were posed twice to the corresponding Video-LMMs, yielding different responses. In some instances, the answers generated by the Video-LMMs in both rounds were dissimilar and incorrect. However, we also found cases where Video-LMMs initially produced the correct answer, followed by an incorrect answer to the same query, albeit phrased slightly differently. This suggests that the majority of these open-source Video-LMMs struggle to comprehend the same query when presented in a different manner, leading to inconsistent and paradoxical predictions. In contrast, closed-source Video-LMMs are less prone to such inconsistent predictions and consistently produce the same output for the same queries, regardless of how they are phrased, indicating a superior comprehension of language. Refer to supplementary Section B for additional results.\nPerformance Analysis on SORA anomalies. The overall performance of open-source Video-LMMs on anomaly categories in synthetically generated SORA videos is subpar. As depicted in Figure 2 (left), all open-source Video-LMMs exhibit less than 10% accuracy in detecting the \u0026ldquo;disappearance\u0026rdquo; anomaly, indicating that this particular type is the most difficult to identify for the majority of Video-LMMs. Among the open-source models, VideoChat performs better than its open-source counterparts on most anomaly types, with the exception of the \u0026ldquo;unnatural appearance\u0026rdquo; category, where TimeChat outperforms it. The remaining models display a fluctuating trend, with accuracy levels ranging from extremely low to moderately low across all anomaly types. 
The closed-source models, on the other hand, demonstrate superior performance compared to open-source models across all anomaly types.\nWe provide more insights and discussions in Section A.4 of the supplementary material.\n5 Conclusion # We introduced VANE-Bench, a comprehensive benchmark for evaluating Video-LMMs in VAD tasks, featuring real-world and AI-generated video clips. The AI-generated content, especially from advanced models like SORA, includes subtle inconsistencies, making VANE-Bench particularly challenging. Our evaluation of nine recent Video-LMMs on VANE-Bench shows significant gaps in detecting video anomalies, with even robust closed-source models struggling with nuanced discrepancies. Human assessments on SORA-generated videos confirm these subtle anomalies are challenging to identify, highlighting the need for advanced Video-LMMs. VANE-Bench is vital for advancing Video-LMMs in anomaly detection. As high-fidelity AI-generated content rises, our benchmark is crucial for developing models to identify subtle inconsistencies, aiding in the fight against misinformation and deepfakes. We hope VANE-Bench will guide future research to enhance the robustness and capability of Video-LMMs in this critical area.\nFigure 6: Inconsistency in Predictions: Left: Video-ChatGPT and VideoChat predict accurately, while Video-LLaMA selects incorrectly. Right: With a rephrased query, predictions shift. Video-ChatGPT and VideoChat err, whereas Video-LLaMA predicts correctly. This indicates the sensitivity of Video-LMMs towards query rephrasing.\n6 Limitations # Our VANE-Bench is the first benchmark for evaluating Video-LMMs on anomalous videos from both AI-generated and real-world sources. While we have done our best to ensure a high-quality evaluation of these Video-LMMs, certain limitations still manifest.\nOur question-answer pairs are designed to have 4 options. We design the instruction prompt to ensure that each Video-LMM outputs one out of 4 options. 
However, in some instances, the model outputs a hallucinated response and does not follow the instructions. As a result, we employ a post-response human-based filtration process, which involves an exhaustive verification and rectification of these errors. In our current setup, we mark these cases as wrong. We believe that future Video-LMMs will be more aligned with human intent and will follow human instructions appropriately.\nAdditionally, the number of SORA video samples in VANE-Bench is limited. This is because the SORA model is not yet open-source, so we rely on publicly available SORA samples for evaluation.\nReferences # Malak Abdullah, Alia Madain, and Yaser Jararweh. 2022. Chatgpt: Fundamentals, applications and social impacts. In 2022 Ninth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 1–8. IEEE.\nKirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. 2024a. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413.\nKirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, and Mohamed Elhoseiny. 2024b. Goldfish: Vision-language understanding of arbitrarily long videos. Preprint, arXiv:2407.12679.\nTim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. 2024. Video generation models as world simulators.\nWenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. 
Preprint, arXiv:2305.06500.\nHang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jianhang Chen, Qimei Cui, and Xiaofeng Tao. 2024. Uncovering what, why and how: A comprehensive benchmark for causation understanding of video anomaly. Preprint, arXiv:2405.00181.\nGemini Team et al. 2024. Gemini: A family of highly capable multimodal models. Preprint, arXiv:2312.11805.\nGoogle. 2023. Gemini.\nHPCAI Tech. 2024. Open-sora: Democratizing efficient video production for all. https://github.com/hpcaitech/Open-Sora.\nMuhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Naseer Muzzamal, Federico Tombari, Fahad Shahbaz Khan, and Salman Khan. 2024. How good is my video lmm? complex video reasoning and robustness evaluation suite for video-lmms. arXiv:2405.03690.\nJie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. 2019. Tvqa: Localized, compositional video question answering. Preprint, arXiv:1809.01696.\nBohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. 2023a. Seed-bench: Benchmarking multimodal llms with generative comprehension. Preprint, arXiv:2307.16125.\nFeng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. 2024a. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895.\nKunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. 2023b. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355.\nKunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. 2024b. Mvbench: A comprehensive multimodal video understanding benchmark. Preprint, arXiv:2311.17005.\nWeixin Li, Vijay Mahadevan, and Nuno Vasconcelos. 2014. 
Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):18–32.\nYanwei Li, Chengyao Wang, and Jiaya Jia. 2023c. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043.\nBin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. 2023. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122.\nYang Liu, Dingkang Yang, Yan Wang, Jing Liu, Jun Liu, Azzedine Boukerche, Peng Sun, and Liang Song. 2024a. Generalized video anomaly event detection: Systematic taxonomy and comparison of deep models. ACM Comput. Surv., 56(7).\nYang Liu, Dingkang Yang, Yan Wang, Jing Liu, Jun Liu, Azzedine Boukerche, Peng Sun, and Liang Song. 2024b. Generalized video anomaly event detection: Systematic taxonomy and comparison of deep models. Preprint, arXiv:2302.05087.\nCewu Lu, Jianping Shi, and Jiaya Jia. 2013. Abnormal event detection at 150 fps in matlab. In 2013 IEEE International Conference on Computer Vision, pages 2720–2727.\nMuhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. 2023. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424.\nBernard Marr. 2023. 15 amazing real-world applications of ai everyone should know about. https://bit.ly/4f2nrTd. Accessed: 26 Mar, 2024.\nMeta. 2024. Introducing meta llama 3: The most capable openly available llm to date. https://ai.meta.com/blog/meta-llama-3/.\nOpenAI. 2024. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/.\nWilliam Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. 
Preprint, arXiv:2212.09748.\nViorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, and João Carreira. 2023. Perception test: A diagnostic benchmark for multimodal video models. Preprint, arXiv:2305.13786.\nShuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. 2023. Timechat: A time-sensitive multimodal large language model for long video understanding. arXiv preprint arXiv:2312.02051.\nRunway Research. 2024. Gen-2: The next step forward for generative ai. https://research.runwayml.com/gen2.\nAleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. 2023. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11987–11997.\nEnxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. 2023. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449.\nWaqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488.\nMakarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. Movieqa: Understanding stories in movies through question-answering. Preprint, arXiv:1512.02902.\nJiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. 2023a. Modelscope text-to-video technical report. Preprint, arXiv:2308.06571.\nXiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. 2023b. 
Videolcm: Video latent consistency model. Preprint, arXiv:2312.09109.\nJianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441.\nShukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2024. A survey on multimodal large language models. Preprint, arXiv:2306.13549.\nZhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019. Activitynet-qa: A dataset for understanding complex web videos via question answering. Preprint, arXiv:1906.02467.\nDuzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. Mm-llms: Recent advances in multimodal large language models. Preprint, arXiv:2401.13601.\nHang Zhang, Xin Li, and Lidong Bing. 2023a. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.\nShilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. 2023b. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601.\nAppendix # In the following sections, we provide additional information for the paper: VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs. The contents are organized in the following order.\nAdditional Findings and Results (Appendix A)\nAdditional Results on Prediction Inconsistency (Appendix B)\nImportance of Frame Annotation Module (FAM) (Appendix C)\nImplementation Details (Appendix D)\nDistribution of VANE-Bench dataset (Appendix E)\nA Additional Findings and Qualitative Results # A.1 Additional Quantitative Results # Video-LMMs are fewer in number compared to image-based multi-modal models, which limits the range of options available for evaluation. Given this scarcity, we selected 7 open-source and 2 closed-source LMMs that are currently among the most widely used. 
To ensure that our benchmark remains representative, we have also included additional results from other recent open-source LMMs (Ataallah et al., 2024a,b; Li et al., 2024a), as shown in Table 2. Our findings reveal that open-source models still lag behind their closed-source counterparts in performance, indicating that simply adding more models wouldn\u0026rsquo;t necessarily improve the overall representativeness of our benchmark. Our selected set, which includes both open-source and closed-source models, is already comprehensive, featuring state-of-the-art models like GPT-4o and Gemini-1.5 Pro. Given the limited number of Video-LMMs available, our reliance on this specific set of models is justified, as it accurately represents the current landscape of Video-LMM capabilities.\nTable 2: Evaluation results of additional recent Video-LMMs across different types of video samples on the VANE benchmark. We present results for both open-source and closed-source models. The first five rows show results on AI-generated videos and the last three contain results on real-world anomaly datasets.\n| Benchmark Category | LLaVA-NeXT | MiniGPT4-Video | Goldfish |\n|---|---|---|---|\n| SORA | 11.59 | 10.74 | 26.47 |\n| OpenSORA | 18 | 28 | 22 |\n| Runway Gen2 | 16 | 4 | 12 |\n| VideoLCM | 10.57 | 17.64 | 18.26 |\n| Modelscope-T2V | 10.41 | 20.83 | 16.66 |\n| UCFCrime | 9.47 | 11.57 | 31.57 |\n| UCSD-Ped1 | 16.66 | 13.33 | 40 |\n| UCSD-Ped2 | 5.55 | 13.88 | 19.44 |\nA.2 Qualitative Results # In Fig. 8, we showcase the response of both open-source and closed-source Video-LMMs on anomalous video samples from our VANE-Bench benchmark. The query to the Video-LMMs contains the video and a question with multiple options associated with the specific anomaly present in the video. The anomalies in Fig. 
8 comprise pass-through (first row), unnatural appearance (second row), sudden appearance (third row), disappearance (fourth row) and unnatural transformation (fifth row).\nA.3 VANE-Bench frequent instances # Figure 7: Frequent keywords: Illustration of the most frequent keywords in the correct option set of the VANE benchmark. These keywords signify the objects or human attributes in the videos that are most likely to exhibit anomalous behavior.\nFigure 7 presents a word cloud visualization, highlighting the most frequently occurring keywords within the correct option set of the VANE-Bench dataset. These prominent words are indicative of objects or human attributes in the videos that are most likely to exhibit anomalous behavior. From the figure, the most frequently occurring keyword is \u0026ldquo;Face\u0026rdquo;, which suggests that video generation models most often struggle to render a realistic human face.\nA.4 Additional Discussions on Experimental Results # Per anomaly performance: To give further insights, Figure 2 (left) of the main paper illustrates the performance of LMMs on each type of anomaly present in the AI-generated videos. We can observe that closed-source models like GPT-4o and Gemini-1.5 Pro consistently exhibit strong performance across all five anomaly categories compared to their open-source counterparts. This likely stems from their access to significantly larger training datasets and model parameters, allowing for a more robust understanding of visual anomalies. Conversely, open-source models exhibit fluctuating performance depending on the anomaly type. We also note that open-source models struggle especially with the \u0026ldquo;disappearance\u0026rdquo; anomaly. We believe this may be because these models are trained on datasets that focus on the presence of objects and actions, and are hence biased towards presence. 
Further, we believe that open-source models suffer from limited temporal reasoning capability and often use short-term mechanisms that limit their ability to track objects over time. The lack of datasets focusing on anomalies like \u0026ldquo;disappearance\u0026rdquo; also limits the models\u0026rsquo; capability to detect such patterns.\nHigher performance of some LMMs: As seen in Table 1 and Figure 5 of the main paper, some open-source LMMs perform better than their counterparts. For instance, Video-ChatGPT achieves higher performance compared to other open-source models. We believe this may be due to the following two reasons:\n1. Training Data: While most open-source models rely solely on web-scraped video captioning data, Video-ChatGPT incorporates a manually annotated video instruction dataset specifically designed for video understanding. This provides the model with a more direct and targeted learning experience, potentially enhancing its sensitivity to anomalies.\n2. Two-Stage Training: Video-ChatGPT employs a two-stage training process involving both video-language pre-training and instruction tuning. This enables the model to first develop a strong understanding of general video semantics and then refine its ability to follow user instructions and reason about specific events within videos.\nB Additional results on Prediction inconsistency # As discussed in Section 4.2, almost all Video-LMMs generate different results when prompted to answer the same query rephrased multiple times. While this is most common in open-source Video-LMMs, we found that closed-source Video-LMMs occasionally suffer from this problem as well. Fig. 
9 illustrates additional examples where the same questions (phrased slightly differently) were posed twice to the corresponding Video-LMMs, yielding different responses.\nC Importance of Frame annotation module # Since the video inconsistencies produced by state-of-the-art AI models like SORA are quite subtle and hard to detect, our Frame Annotation Module (FAM) ensures that we are able to generate high-quality and accurate captions for these videos. As shown in Fig. 10, without FAM, the generated caption is not able to describe the sudden appearance of the kangaroo\u0026rsquo;s right foot near its tail. Further, the caption generated without our FAM is also not able to describe the extra set of paws that appear suddenly from the legs of the cat. Thus, FAM plays an important role in curating high-quality and accurate video captions.\nD Implementation Details # We use the official code of each open-source Video-LMM for evaluation. Each of these codebases is implemented in the PyTorch framework. 
We evaluate each one of them on a single NVIDIA A100 40GB GPU. For closed source Video-LMMs, we use their respective API for evaluation. We use GPT-4o (OpenAI , 2024) as our LMM to generate the captions and the final QA pairs in VANE-Bench. Next, we describe the prompts used in our Caption Generation Module (CGM), Question Answer Generation Module (QAGM), and in evaluating various Video-LMMs on VANE-Bench in the subsequent subsections.\nD.1 Caption Generation Module (CGM) # System Prompt: You are a helpful and intelligent AI assistant which can generate informative captions for a given input of 10 consecutive images/frames from a video. The video is generated from an AI text-to-video diffusion model and has some obvious inconsistencies or anomalies in the form of various deformations, unrealistic physical transformations, unnatural appearance of objects, human faces, body parts, or sudden appearance, disappearance, or merging of objects. Your task is to generate a descriptive caption for the given input video, highlighting the inconsistencies or anomalies present in the video.\nText Prompt: Please generate a detailed caption which describes all the given frames. Some of the frames may contain inconsistencies which are annotated with a green bounding box around them with the type/name of the inconsistency. Your generated caption should capture the details of the entire video, while also describing all the inconsistencies. Thus, properly look at all the given frames and the region marked by the green bounding boxes when describing the inconsistencies. Further, make sure to mention specific details about each of the inconsistencies, and mention the exact names of the inconsistencies from the marked green bounding box. Also, while describing the inconsistency please be as specific and detailed as possible, don\u0026rsquo;t be vague or general about the inconsistency. 
The reader of the caption should perfectly understand what inconsistencies/anomalies are in the video and what the video is about. Do not mention the green bounding box in your response; it is only for you to identify the inconsistencies. Make sure to describe all the inconsistencies in your caption. Do not analyze the impact of the inconsistencies; you should only describe them. There is no need to mention when the inconsistencies start or end, just describe them.\nD.2 Question Answer Generation Module (QAGM) # System Prompt: You are a helpful and intelligent AI assistant which can curate high-quality and challenging question and their corresponding answers, which are used to test the video understanding capabilities of an multi-modal LLM model capable of taking videos as their inputs.\nText Prompt: You are given a video input, which is generated by a state-of-the-art AI algorithm. Thus, these videos look very natural and almost realistic, but they are actually synthetic and generated by an AI algorithm. The videos may have some inconsistencies or anomalies present in them, which are generally localized to only a specific location in the video as identified by the green bounding boxes in the video. The rest of the video appears completely natural or realistic. This specific inconsistency may last for only a few frames of the video or may last for the entire video itself. The inconsistency or anomalies in the video are generally events and phenomena which is not observed in real-world and physical scenarios. You will also be given a caption as input that describes the video, along with the specific inconsistency present in the video. Based on the given video and caption input, your task is to formulate 3 diverse and misleading questions to test whether the multi-modal LLM model can correctly identify the options based on the inconsistencies present in the video or not. 
So, your generated questions should give the model few options to choose from to make its answer, and these options should be of high quality and also have misleading choices so that you can test deeper level of understanding of these multi-modal LLM models. Thus, the goal of these questions is to accurately assess the multi-modal LLM\u0026rsquo;s ability to accurately identify the inconsistencies present in the video. Generate questions that comprise both interrogative and declarative sentences, utilizing different language styles, and provide an explanation for each. Your response should be presented as a list of dictionary strings with keys \u0026lsquo;Q\u0026rsquo; for questions and \u0026lsquo;A\u0026rsquo; for the answer. Follow these rules while generating question and answers:\nDo not provide answers in the question itself. For example, the ground-truth attribute or component that makes the video scene unusual should never be mentioned in the question itself.\nEnsure the questions are concrete and specific, and not vague or ambiguous. The questions should be formed based on your deep understanding of the video and the caption. Thus, properly read the caption and look at the given video to generate the questions. The questions should only pertain to the inconsistencies present in the video, and not about the video in general. You may also ask the model some misleading questions talking about non-existent inconsistencies in the video, to test the model\u0026rsquo;s ability to differentiate between real and fake inconsistencies. Do not ask vague questions, and the answer should only contain one of the correct option mentioned in the question. In your question itself you must provide multiple choice options for the answer, and the answer should be one of the options provided in the question. Please ensure you provide option choices and their corresponding letters in the question itself. In your answer, only mention the correct option letter from the question. 
Make sure that the correct option letter is not always the same, and randomly shuffle the correct option letter for each question. You must only follow the below output format and strictly must not output any other extra information or text. Your output format should be strictly as follows, without any additional information or text: [\u0026ldquo;Q\u0026rdquo;: \u0026lsquo;first question A) \u0026lt;option1\u0026gt; B) \u0026lt;option2\u0026gt; C) \u0026lt;option3\u0026gt; D) \u0026lt;option4\u0026gt;\u0026rsquo;, \u0026ldquo;A\u0026rdquo;: \u0026lsquo;Pick the correct option letter from A) B) C) D)\u0026rsquo;, \u0026ldquo;Q\u0026rdquo;: \u0026lsquo;second question A) \u0026lt;option1\u0026gt; B) \u0026lt;option2\u0026gt; C) \u0026lt;option3\u0026gt; D) \u0026lt;option4\u0026gt;\u0026rsquo;, \u0026ldquo;A\u0026rdquo;: \u0026lsquo;Pick the correct option letter from A) B) C) D)\u0026rsquo;, \u0026hellip; }]\nGiven below is the caption input which describes the given video along with the specific inconsistency present in the video. The caption is: {caption}\nD.3 Evaluating Video-LMMs # System Prompt: You are a helpful and intelligent multi-modal AI assistant, capable of performing visual question-answering (VQA) tasks. You will be given as input 10 consecutive frames from a video, and a corresponding question related to the video, you have to answer the given question after analyzing and understanding the given input video.\nThe question itself will present you with 4 lettered options like A) B) C) D), your task is to only output single letter corresponding to the correct answer (i.e. string literal \u0026lsquo;A\u0026rsquo;, \u0026lsquo;B\u0026rsquo;, \u0026lsquo;C\u0026rsquo;, or \u0026lsquo;D\u0026rsquo;), and you should not output anything else.\nText Prompt: {question}\nE Distribution of VANE-Bench dataset # How to view the dataset? The dataset alongside metadata will be hosted on the Hugging Face platform for download post acceptance of the paper. 
Users can directly load the dataset using the Hugging Face Datasets library or download the zip file in the same Hugging Face repository. All instructions and code files to reproduce the experiments of the paper will be provided in a GitHub repository.\nHow will the dataset be distributed? The dataset will be distributed to the public using the Hugging Face Dataset Hub. We have publicly released the codebase alongside instructions to reproduce and evaluate models on GitHub.\nDataset License. This work and dataset are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The videos in the VANE-Bench dataset are collected from publicly available sources and existing real-world datasets and are for academic research use only. The video generative models used to synthesize data samples in our VANE-Bench benchmark are open to use publicly and do not pose any privacy concerns as the persons or objects present in the generated videos are synthetic and do not exist in the real world. The real-world surveillance datasets used in our work, UCFCrime (Sultani et al., 2018), UCSD Pedestrian (Li et al., 2014), and Avenue (Lu et al., 2013), on the other hand, are all existing well-known and publicly available datasets that are released under open-source licenses. Thus, the original creators of these datasets have collected the data after taking informed consent from the stakeholders. By using VANE-Bench, you agree not to use the dataset for any harm or unfair discrimination. Please note that the data in this dataset may be subject to other agreements. Video copyrights belong to the original dataset providers, video creators, or platforms.\nFigure 8: Qualitative examples: Figure shows the response of Video-LMMs to the VQA task of detecting anomalies in the video. The correct answer is written in bold in the user query. 
We find that majority of Video-LMMs struggle to answer the questions correctly.\nFigure 9: Prediction Inconsistency: Figure shows the response of Video-LMMs to the VQA task of detecting anomalies in the video. The correct answer is written in bold in the user query. We find that the majority of Video-LMMs struggle to answer the questions correctly.\nFigure 10: Example showcasing the importance of our Frame Annotation Module (FAM). We note that without FAM, the LMM responsible for generating the captions is not able to identify or describe the accurate anomaly present in the video. However, by providing the bounding box annotation for the inconsistency, we are able to ensure that the generated caption accurately describes the anomaly in the video.\n","date":"15 July 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/vane-bench-video-anomaly-evaluation/","section":"Papers","summary":"This paper introduces a novel deep learning framework for detecting anomalies in video content by leveraging semi-supervised approaches that require minimal labeled data, enhancing robustness and efficiency.","title":"Advanced Video Anomaly Detection Using Deep Learning","type":"method"},{"content":"","date":"15 July 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/jane-doe/","section":"Authors","summary":"","title":"Jane Doe","type":"authors"},{"content":"","date":"15 July 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/john-smith/","section":"Authors","summary":"","title":"John Smith","type":"authors"},{"content":"","date":"1 May 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/fang-shen/","section":"Authors","summary":"","title":"Fang Shen","type":"authors"},{"content":"","date":"1 May 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/haodong-zhang/","section":"Authors","summary":"","title":"Haodong Zhang","type":"authors"},{"content":" SlowFastVAD: Video Anomaly Detection via Integrating 
Simple Detector and RAG-Enhanced Vision-Language Model # Zongcan Ding ∗ Northwestern Polytechnical University Xi\u0026rsquo;an, China dingzongcan@mail.nwpu.edu.cn\nGuansong Pang Singapore Management University Singapore, Singapore gspang@smu.edu.sg\nAbstract # Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. While semi-supervised methods trained on only normal samples have gained traction, they often suffer from high false alarm rates and poor interpretability. Recently, vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for explainable anomaly detection. However, their high computational cost and lack of domain adaptation hinder real-time deployment and reliability. Inspired by dual complementary pathways in human visual perception, we propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector (namely a retrieval augmented generation (RAG) enhanced VLM), to address these limitations. Specifically, the fast detector first provides coarse anomaly confidence scores, and only a small subset of ambiguous segments, rather than the entire video, is further analyzed by the slower yet more interpretable VLM for elaborate detection and reasoning. Furthermore, to adapt VLMs to domain-specific VAD scenarios, we construct a knowledge base including normal patterns based on few normal samples and abnormal patterns inferred by VLMs. During inference, relevant patterns are retrieved and used to augment prompts for anomaly reasoning. Finally, we smoothly fuse the anomaly confidence of fast and slow detectors to enhance robustness of anomaly detection. 
Extensive experiments on four benchmarks demonstrate that SlowFastVAD effectively combines the strengths of both fast and slow detectors, and achieves remarkable detection accuracy and interpretability with significantly reduced computational overhead, making it well-suited for real-world VAD applications with high reliability requirements. The code will be released upon acceptance.\n∗ These authors contributed equally to this work.\n† Corresponding author.\nHaodong Zhang ∗ Northwestern Polytechnical University Xi\u0026rsquo;an, China hdzhang@mail.nwpu.edu.cn\nZhiwei Yang Xidian University Guangzhou, China zwyang97@163.com\nYanning Zhang Northwestern Polytechnical University Xi\u0026rsquo;an, China ynzhang@nwpu.edu.cn\nCCS Concepts # Computing methodologies → Scene anomaly detection; Activity recognition and understanding.\nKeywords # Video anomaly detection, Vision-language model, Semi-supervised learning, Interpretable learning\n1 Introduction # Video Anomaly Detection (VAD) aims to automatically identify abnormal events in video streams that deviate significantly from typical normal patterns. It plays a vital role in a wide range of real-world applications [55, 65, 67]. Given the rarity and high acquisition cost of anomalous samples in real-world scenarios, increasing attention has been directed toward the semi-supervised VAD paradigm, which trains models exclusively on normal videos [30, 34, 75]. By learning the underlying distribution of normal patterns, these methods attempt to detect anomalies as deviations from expected behaviors during inference.\nHowever, semi-supervised VAD methods suffer from several inherent limitations. Since these models are trained exclusively on normal samples, they are prone to misclassifying rare yet plausible normal behaviors as anomalies, leading to high false positive rates. 
Moreover, existing one-class detection approaches, such as those based on autoencoders [7, 18, 40, 48, 53, 73, 83], generative adversarial networks (GANs) [13, 19, 77], or diffusion models [10, 15, 70], often exhibit limited adaptability when deployed in complex and dynamic real-world environments. In addition, most of these methods rely on end-to-end deep neural networks that are trained to fit only the distribution of normal behaviors. As a result, their decision outputs are often monotonous and lack interpretability or reasoning, making them ill-suited for scenarios where transparency and explainability are crucial.\nPeng Wu † Northwestern Polytechnical University Xi\u0026rsquo;an, China xdwupeng@gmail.com\nPeng Wang Northwestern Polytechnical University Xi\u0026rsquo;an, China peng.wang@nwpu.edu.cn\nFigure 1: Comparative analysis between conventional fast detector based on DNN (Left), recent slow detectors based on VLMs (Middle), and our SlowFastVAD (Right). Panel labels: Video, DNN, VLM, Normal\u0026amp;Abnormal Distribution, Fast Classification, Interpretable Reasoning, Efficient Classification, Majority Certain Video Clips, Minority Ambiguous Video Clips. Example reasoning shown in the figure: \u0026ldquo;A pedestrian is observed pushing a bicycle, an action that deviates from previously pedestrian behaviors. Based on this, this is classified as an anomaly.\u0026rdquo;\nRecently, vision-language models (VLMs) have achieved remarkable progress across various domains and have demonstrated great potential in the semi-supervised VAD task [71]. 
Benefiting from their multimodal integration and semantic understanding abilities, VLMs can effectively uncover latent behavioral patterns within video data. For instance, in the widely-used pedestrian street dataset Ped2, VLMs can infer the normative pattern that \u0026ldquo;only pedestrians are allowed to walk on the sidewalk\u0026rdquo; by learning from normal videos. In complex real-world scenarios, such models are capable of learning and constructing semantic representations of normal patterns, thereby enabling more accurate identification of anomalies that deviate from these learned norms. Furthermore, by leveraging their language generation capabilities and the established semantic rules, VLMs can provide clear reasoning for their detection results, significantly enhancing the interpretability and trustworthiness of those results.\nDespite the promising potential of VLMs in the VAD task, their practical application still faces several critical challenges that warrant further investigation. First, VLMs are susceptible to hallucination, where the generated reasoning or predictions deviate from the actual video content. For example, in Ped2, VLMs occasionally misinterpret normal pedestrian walking as a crowd gathering, thereby incorrectly labeling it as an anomalous event, resulting in semantically inconsistent judgments. Second, most current VLMs are pre-trained on general-purpose datasets, and their anomaly understanding is typically based on commonsense reasoning rather than task-specific behavioral modeling. Consequently, such models may misinterpret anomalies in specific environments due to semantic ambiguity. 
For instance, riding a bicycle on the sidewalk is treated as an anomalous event within the Ped2 dataset\u0026rsquo;s context. However, since such behavior is often deemed acceptable in real-world scenarios, the model may fail to detect it as an anomaly. Furthermore, from a deployment perspective, VLMs often incur substantial computational overhead and exhibit slow inference speeds. Performing dense inference on every video frame is particularly impractical in scenarios requiring real-time responses, such as public safety surveillance. These limitations significantly hinder the practical utility of VLM-based VAD methods.\nHuman visual perception relies on dual complementary pathways [12, 74], namely, a cognition-driven pathway for precise understanding (slow) and an action-driven pathway for rapid response (fast), which work in tandem to respond effectively even in extreme scenarios. 
Inspired by this, this paper proposes a novel VAD framework, SlowFastVAD, which integrates the complementary strengths of fast and slow detectors. The goal is to achieve efficient, accurate, and interpretable anomaly detection by combining a traditional feedforward network based fast detector with a high-generalization VLM based slow detector. Specifically, to enhance the adaptability and detection performance of large models in specific scenarios, we design a retrieval augmented generation (RAG) driven anomaly reasoning module. This module guides the VLM to generate various visual descriptions from normal samples, summarizes normal patterns under the given context, and further leverages Chain-of-Thought (CoT) reasoning to infer potential abnormal patterns. These normal and abnormal patterns are structured into a knowledge base. This process of knowledge base construction requires only a small number of normal samples, eliminating the requirement for full-sample training. 
During inference, the model retrieves relevant behavioral rules from the knowledge base based on the language description of the current video segment and incorporates them into prompts to guide the VLM toward more targeted anomaly detection. To mitigate the high computational overhead associated with the VLM inference, we propose an entropy-based intervention detection strategy. This strategy leverages the anomaly confidence generated by the fast detector to identify video segments with high uncertainty, which are then selectively forwarded to the VLM-based slow detector for further analysis. This enables significant improvements in detection accuracy and interpretability while maintaining computational efficiency. Finally, we introduce a decision fusion mechanism that integrates the predictions from both fast and slow detectors, thereby enhancing the overall robustness of the framework. We illustrate the key differences between SlowFastVAD, traditional DNN-based fast detectors, and VLM-based slow detectors in Figure 1. SlowFastVAD effectively addresses the limited generalization capability and poor interpretability of current fast detectors, as well as the high computational cost of slow detectors.\nThe main contributions of this work are summarized as follows:\n· We propose the SlowFastVAD framework, which, to our knowledge, is the first to integrate a traditional fast anomaly detector with a slow yet interpretable VLM-based detector, achieving a synergy between efficiency and explainability. 
\n· We develop a RAG-driven anomaly reasoning module, in which the VLM summarizes normal and abnormal patterns during training to construct a knowledge base. This knowledge base is then dynamically retrieved during inference to enhance prompts, improving the generalization to specific VAD scenarios.\n· We design an entropy-based intervention detection strategy that effectively selects video segments likely to be misclassified by the fast detector, precisely triggering the VLM inference. This strategy significantly reduces overall computational costs.\n· Extensive experiments on multiple public datasets demonstrate that our proposed SlowFastVAD effectively integrates the advantages of both fast and slow detectors, achieving state-of-the-art detection performance along with interpretable outputs.\n2 Related Work # 2.1 Non-VLM-based Video Anomaly Detection # 2.1.1 Semi-supervised VAD. In semi-supervised VAD, the training process relies solely on normal samples, where the model learns normal patterns and identifies deviations from these patterns during inference as anomalies. Under the current deep learning paradigm, semi-supervised VAD approaches can be broadly categorized based on the network architecture into three main types: autoencoder-based approaches, generative adversarial network (GAN)-based approaches, and diffusion-based approaches. Autoencoder-based approaches utilize an encoder to compress input samples into low-dimensional latent representations and a decoder to reconstruct the original input from the latent space. Anomalies are detected by measuring the reconstruction error between the input and output [7, 18, 40, 48, 53, 73, 83]. GAN-based approaches consist of a generator and a discriminator. The generator learns to synthesize realistic normal samples, while the discriminator aims to distinguish between real and generated data. 
Test samples with low authenticity scores from the discriminator are classified as anomalies [13, 19, 77]. Diffusion-based approaches progressively generate samples from noise through a reverse diffusion process. The quality of the generated samples is then used to assess the normality, with poor reconstruction indicating potential anomalies [10, 15, 70].\n2.1.2 Weakly Supervised VAD. Weakly supervised VAD utilizes both normal and abnormal samples during training, but lacks precise annotations of anomalies; only coarse video-level labels are available. Current research mainly follows two paradigms: one-stage multiple instance learning (MIL) approaches [29, 54] and two-stage self-training strategies [72, 80]. To further improve detection performance, recent efforts have explored various enhancement techniques, including temporal modeling, spatiotemporal modeling, MIL-based optimization, and feature metric learning. Specifically, temporal modeling captures sequential dependencies in videos, enabling the model to utilize contextual information [11, 17, 54, 86]. Spatiotemporal modeling integrates spatial and temporal features to localize anomalous regions while suppressing background noise [26, 54]. MIL-based optimization strategies address the limitation of conventional MIL methods that focus only on high-scoring segments, by incorporating external priors, such as textual knowledge, to improve anomaly localization [9, 36]. Feature metric learning constructs a discriminative embedding space by clustering similar features and separating dissimilar ones, thereby enhancing the representation discrimination [14].\n2.2 VLM-based Video Anomaly Detection # 2.2.1 Semi-supervised VAD. In the field of VAD, VLMs have demonstrated significant potential and adaptability. Yang et al. proposed AnomalyRuler [71], which detects anomalies by integrating the inductive summarization and deductive reasoning capabilities of VLMs. 
Specifically, in the inductive phase, the model derives behavioral rules from a small number of normal samples, while in the deductive phase, it identifies anomalous frames based on these rules. In addition, Jiang et al. introduced the VLAVAD framework [25], which employs cross-modal pre-trained models and leverages the reasoning capabilities of large language models (LLMs) to enhance the interpretability and effectiveness of VAD. However, due to the slow inference speed of VLMs, the overall processing time of these methods remains high. In contrast, our SlowFastVAD integrates conventional fast detectors with VLMs, enabling sparse yet deeper reasoning based on the initial outputs of the fast detector. This design effectively balances inference speed and detection accuracy.\n2.2.2 Weakly Supervised VAD. VLMs have also been widely applied in weakly supervised VAD. They not only enhance anomaly detection performance through visual-language enhanced features (e.g., CLIP-TSA [21]) and cross-modal semantic alignment (e.g., VadCLIP [69], TPWNG [72], and STPrompts [68]), but also contribute to interpretability by generating descriptions for anomalous events, as demonstrated in Holmes-VAU [82]. Moreover, VLMs can be leveraged for training-free anomaly detection by utilizing their extensive prior knowledge [35, 79], offering advantages in rapid deployment and reduced computational cost. For instance, Zanella et al. [79] adopted an explainable approach in which reflective questions are used to guide the model in generating anomaly scores, without requiring additional model training.\n2.3 VLM-based Vision Tasks # Currently, VLMs have made significant progress and found widespread application in various vision fields [57]. 
In image classification, VLMs enhance zero-shot classification capabilities, especially in handling unknown object categories, showing excellent performance and supporting stronger domain generalization [1, 22]. In semantic segmentation, VLMs significantly improve the ability to handle unseen categories by combining open-vocabulary techniques with image-text fusion [33, 60]. In video generation, VLMs are used to generate consistent and multi-scene video content, pushing forward the advancement of video generation technology [31]. In cross-modal retrieval, VLMs improve the effectiveness and efficiency by integrating image and language information [6, 23]. In action recognition, VLMs enhance the recognition of fine-grained actions by combining pose information with language models, particularly excelling in action anticipation [39, 81].\n3 Methodology # 3.1 Overview # Our proposed method, illustrated in Figure 2, consists of two branches: a fast DNN-based detector and a slow VLM-based detector. The fast detector is built upon an autoencoder-based architecture, offering high detection speed but limited interpretability. In contrast, the slow detector leverages VLMs, which provide strong interpretability at the cost of slower inference. By integrating multiple specialized components, our framework effectively combines the advantages of both detectors to achieve a balanced trade-off between efficiency and accuracy. The overall pipeline is as follows: The fast detector first performs preliminary detection and identifies potentially ambiguous segments, which are then passed to the slow detector for further analysis. The slow detector generates both anomaly confidence scores and interpretable descriptions. To select ambiguous segments more effectively, we propose an intervention detection strategy based on entropy measures. 
Additionally, to improve the adaptability of the VLM in specific anomaly detection scenarios, we introduce an anomaly-oriented RAG module. This module constructs a knowledge base by extracting normal patterns from training videos and inferring potential abnormal patterns, thus enhancing scene-specific reasoning capabilities. Finally, an integration mechanism combines the outputs from both detectors to yield the final prediction. This mechanism mitigates hallucination effects commonly associated with VLMs and enables the system to achieve high detection accuracy, faster inference, and interpretable output.\nFigure 2: Overview of the proposed SlowFastVAD method. It consists of two branches: a fast DNN-based detector and a slow VLM-based detector. To seamlessly integrate the two detection branches and leverage their respective strengths, we designed three key components, i.e., intervention detection strategy, RAG-driven anomaly reasoning module, and integration mechanism, enabling an efficient and interpretable VAD framework.\n3.2 Fast Detector # 3.2.1 Foundation Model. In the fast detector, we adopt AED-MAE [48], which utilizes a lightweight masked autoencoder architecture. By incorporating motion gradient based weighting, self-distillation training, and synthetic anomaly augmentation strategies, this method achieves fairly efficient VAD. AED-MAE is characterized by its compact model size and extremely fast inference speed, reaching up to 1655 FPS (frames per second).\n3.2.2 Intervention Detection Strategy. In the context of VAD, video frames that are easy to classify typically exhibit low variance in anomaly confidence scores, resulting in low uncertainty, i.e., low entropy. However, since the fast detector is trained solely on normal samples via reconstruction, it may produce high reconstruction errors for normal-but-rare samples during inference, leading to noisy or fluctuating anomaly confidence scores. 
Besides, in complex scenes where the test data deviates from the distribution of the training set, the fast detector may fail to generalize effectively, again causing instability in anomaly confidence scores. These fluctuations are reflected as increased entropy in the anomaly confidence scores.\nTo address this, we propose a novel entropy-based intervention detection strategy to identify and select ambiguous segments that are difficult to accurately classify. Specifically, given a testing video, we take its frame-level anomaly confidence scores from the fast detector as input and partition them into a set of non-overlapping subsequences using a fixed window size. For each subsequence, we compute its entropy. To account for temporal context, we apply a Gaussian filter for smoothing, integrating the entropy values of neighboring subsequences to obtain a context-aware entropy score. Given that the anomaly confidence scores are decimals ranging from 0 to 1, we adopt the differential entropy formula for calculation. The detailed calculation procedure is as follows.\nWe first estimate the probability density function of each subsequence of frame-level anomaly scores, using the frequency distribution histogram as an approximation of the density. The steps are as follows. First, fix the number of histogram bins. Next, calculate the difference between the maximum and minimum values within the subsequence and divide it by the number of bins to obtain the bin width, from which the bin intervals follow. Then count the number of elements of the subsequence that fall within each bin and compute the corresponding frequencies to obtain the frequency distribution histogram, a real-valued vector. 
For each value in the subsequence, first identify the bin to which it belongs in the frequency distribution histogram, and take the frequency of that bin as the probability of its occurrence. In this way, the final estimated probability density function of the subsequence is obtained. Based on this estimated density, we compute the differential entropy of the subsequence.\nWe further apply a Gaussian filter to the entropy values, integrating the information from neighboring subsequences, so as to obtain the final entropy value of each subsequence.\nWe set a threshold to determine which subsequences are considered uncertain. If the final entropy value of a subsequence exceeds this threshold, the corresponding video segment is fed into the slow detector for further analysis.\nMoreover, to improve the interpretability of overall detection results, we also introduce a periodic sampling mechanism. Specifically, one video segment is sampled at a fixed interval of video segments and sent to the slow detector for semantic description and anomaly scoring. These results serve as global context cues that complement the final decision-making process with interpretable outputs.\n3.3 Slow Detector # 3.3.1 Basic Procedure. The input to the slow detector is the ambiguous video segment identified by the intervention detection strategy. In the VAD task, spatiotemporal information is of crucial importance [63, 85]. Temporal information can capture the sequential evolution process of events and their durations, which helps to distinguish between normal and abnormal behaviors, because anomalies often manifest as sudden interruptions in the temporal dimension. Spatial information is divided into two parts: the foreground and the background. Foreground information focuses on the positions and motion patterns of foreground objects. 
Anomalies usually manifest as unusual spatial arrangements or sudden changes in position. Background information focuses on the relatively stable scene characteristics. By understanding the background information, the VLM and LLM can better extract and summarize the normal and abnormal patterns in the current scene. Based on this, the ambiguous video segment is concatenated with the CoT prompt (refer to the Appendix for details) and then fed into the VLM to extract its spatiotemporal representation. Subsequently, the spatiotemporal representation is encoded into a vector by the embedding model text-embedding-v2 (https://help.aliyun.com/zh/model-studio/user-guide/embedding).\nBased on the similarity between this vector and the constructed patterns, the top relevant patterns and their associated binary anomaly predictions (i.e., normal and abnormal) are retrieved from the constructed knowledge base D, which is introduced in the following section. The retrieved patterns and their predictions are combined into a set of (pattern, prediction) pairs that forms the knowledge related to the current video. Formally, Eq. (4) selects from D the pairs whose patterns rank highest under a similarity computation with the encoded representation.\nFinally, the extracted spatiotemporal representation and the retrieved knowledge are concatenated and combined with a CoT reasoning prompt to form a structured prompt. This prompt is fed into the LLM for step-by-step reasoning, producing anomaly scores along with corresponding interpretive descriptions.\n3.3.2 RAG-driven Anomaly Reasoning. This module is designed to extract normal patterns from training videos, enabling a VLM trained on general scenarios to better adapt to the specific VAD task. To achieve this, we apply a sparse temporal sampling strategy [59], where a segment containing consecutive frames is randomly selected from fixed-length segments of training videos. 
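The sparse temporal sampling step can be sketched in Python as follows. This is a minimal sketch, assuming that one clip of consecutive frames is drawn at a random offset inside each fixed-length segment; the function name `sparse_temporal_sample` and the parameters `segment_len` and `clip_len` are hypothetical, and the values used in the usage line are illustrative only.

```python
import random


def sparse_temporal_sample(num_frames, segment_len, clip_len, rng=None):
    """Pick one clip of `clip_len` consecutive frame indices per segment.

    The video is cut into non-overlapping segments of `segment_len` frames
    (an incomplete trailing segment is ignored), and a clip is placed at a
    random offset inside each segment.
    """
    if clip_len > segment_len:
        raise ValueError("clip_len must not exceed segment_len")
    rng = rng or random.Random()
    clips = []
    for start in range(0, num_frames - segment_len + 1, segment_len):
        offset = rng.randint(0, segment_len - clip_len)  # inclusive bounds
        first = start + offset
        clips.append(list(range(first, first + clip_len)))
    return clips


# Hypothetical usage: a 100-frame video, 20-frame segments, 8-frame clips.
clips = sparse_temporal_sample(100, segment_len=20, clip_len=8,
                               rng=random.Random(0))
print(clips)
```

Sampling one short clip per segment keeps the number of VLM calls proportional to the number of segments rather than the number of frames, while still covering the whole video.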
Throughout this process, we extensively incorporate the CoT prompt to guide the reasoning of models in a more interpretable and coherent manner. The overall procedure consists of four stages: visual description generation, pattern extraction and prediction, pattern refinement and aggregation, and knowledge base construction.\nVisual Description Generation: Here, we follow the same procedure described in Section 3.3.1 to extract the spatiotemporal representation for the video segment.\nPattern Extraction and Prediction: Based on the extracted spatiotemporal representations, we further employ the CoT prompt to guide the LLM in refining representative normal patterns N (e.g., \u0026ldquo;a person walking slowly on the road\u0026rdquo; or \u0026ldquo;a small group engaged in conversation\u0026rdquo;). Building upon these patterns, the model is further prompted to reason about spatial regularities, behavioral patterns, and interaction dynamics, thereby enabling the prediction of potential abnormal patterns A. This step not only encodes prior knowledge of normalcy but also enhances the semantic interpretability of potential anomalies. Both the extraction of N and the prediction of A rely on LLM reasoning assisted by the CoT prompt.\nPattern Refinement and Aggregation: After obtaining the initially extracted normal and abnormal patterns, we design a voting-based strategy for pattern refinement and aggregation. Considering that normal patterns within the same video scene often exhibit high consistency, while abnormal patterns tend to be more diverse, we aggregate highly similar patterns to refine stable behavioral representations. Meanwhile, dissimilar patterns are retained to preserve behavioral diversity. This process results in a pattern set that is both representative and diverse, laying a solid foundation for subsequent knowledge base construction. 
Specifically, we process the patterns summarized from the videos within each scene separately. Here, we take the normal patterns as an example for illustration, and the abnormal patterns are processed in the same way. For a given scene, each normal pattern is first compared for similarity with the existing patterns in the knowledge base. If the average similarity between it and the existing patterns is below a threshold, it indicates that this pattern is dissimilar to the existing patterns in the knowledge base, and it is directly added to the knowledge base. Conversely, if the average similarity is not less than the threshold, we identify the top normal patterns in the knowledge base that are most similar to it. These similar patterns are then aggregated and cleaned, and the aggregated and cleaned patterns are added to the knowledge base. Through continuous loop processing, after traversing all normal patterns of the scene, we finally obtain the set of all processed normal patterns for that scene. After obtaining this set of normal patterns and the corresponding set of abnormal patterns for the scene, we combine the two to obtain the set P of all patterns for this scene.\nKnowledge Base Construction: The cleaned normal and abnormal patterns P, along with their corresponding anomaly predictions, are structured into standardized data formats and then encoded into vector representations using the text-embedding-v2 model, thereby constructing the knowledge base tailored for the VAD task. Mathematically, the knowledge base D is the collection of these encoded patterns paired with their anomaly predictions.\n3.4 Slow-Fast Integration and Inference # To derive the final anomaly confidence score, we integrate the anomaly scores from the fast detector and those from the slow detector via an integration mechanism. 
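The integration step of this subsection (weighted averaging followed by Gaussian smoothing) can be sketched in Python as follows. This is a hedged illustration rather than the paper's implementation: the function name `fuse_scores`, the convention that `weight` multiplies the fast score, the use of NaN to mark frames the slow detector never scored, the fallback to the fast score on those frames, and the kernel parameters are all assumptions made for the example.

```python
import numpy as np


def fuse_scores(fast_scores, slow_scores, weight=0.7, sigma=2.0, radius=4):
    """Weighted average of fast and slow scores, then Gaussian smoothing.

    slow_scores uses NaN for frames that were never sent to the slow
    detector; on those frames the fast score is kept unchanged (assumption).
    """
    fast = np.asarray(fast_scores, dtype=float)
    slow = np.asarray(slow_scores, dtype=float)
    has_slow = ~np.isnan(slow)

    fused = fast.copy()
    fused[has_slow] = weight * fast[has_slow] + (1.0 - weight) * slow[has_slow]

    # Normalised Gaussian kernel; edge padding keeps the output length equal
    # to the input length.
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(fused, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")


# Hypothetical example: the slow detector re-scored frames 4..7 only.
fast = [0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.5, 0.4, 0.3, 0.2]
slow = [np.nan] * 4 + [0.9, 0.9, 0.9, 0.9] + [np.nan] * 2
print(fuse_scores(fast, slow))
```

Smoothing after fusion spreads the slow detector's local corrections to neighbouring frames, which is one way to read the later observation that localized enhancements alone have limited influence on the overall prediction.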
First, we use weighted averaging to obtain the initial fused score (Eq. 11), where a weighting factor balances the contributions of the fast and slow detectors. Subsequently, a Gaussian filter is applied for smoothing. Moreover, the anomaly reasoning is generated by the slow detector, endowing the detection result with high interpretability.\n4 Experiments # 4.1 Datasets and Evaluation Metrics # 4.1.1 Datasets. We evaluate the proposed method on four public datasets: UCSD Ped2 [38], Avenue [32], ShanghaiTech [42], and UBnormal [61]. UCSD Ped2 is a single-scene dataset captured on a pedestrian walkway that contains anomalies such as cyclists, skateboarders, and cars. Avenue is also a single-scene dataset, recorded on the main avenue of the CUHK campus, with anomalies including running and bicycling. ShanghaiTech is a more challenging multi-scene dataset from 13 different campus environments, characterized by variations in lighting conditions and camera perspectives. As the largest dataset for semi-supervised VAD, it comprises 270,000 frames for training and approximately 50,000 for testing. UBnormal is an open-set dataset comprising 29 synthetic scenes, where the sets of anomaly types in the training and testing splits are disjoint. For each dataset, we adopt the default training and testing splits under the semi-supervised setting, using only normal samples during training. The normal reference frames used by SlowFastVAD are randomly and uniformly sampled from normal training videos.\n4.1.2 Evaluation Metrics. We follow recent related works [48, 71] and report the frame-level Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC). Specifically, we compute both the Micro AUC and Macro AUC. For Micro AUC, all test frames from every video are merged into a single sequence, and AUC is calculated across the entire set. 
In contrast, Macro AUC is computed by first calculating the AUC for each individual test video, followed by averaging these scores to obtain the final result.\n4.2 Implementation Details # Our method is implemented using the PyTorch framework. Unless otherwise specified, Qwen-VL-Max [3] is used as the VLM for visual perception, while Qwen-Max [56] serves as the LLM for spatiotemporal information aggregation and retrieval-augmented generation. For Qwen-VL-Max, the sampling temperature is set to 0.01, while for Qwen-Max, the sampling temperature is set to 1.1 during model training and 0.7 during model testing. The default hyperparameter settings for SlowFastVAD are as follows: the window size of a video segment is set to 8, and the random/uniform sampling interval is set to 20 during training and reduced to 10 during testing to ensure finer temporal resolution for inference. The number of retrieved patterns in Eq. (4) is set to 6, and the weighting factor in Eq. (11) is empirically set to 0.8, 0.5, and 0.7 for Ped2, Avenue, and ShanghaiTech, respectively, to balance the fast and slow detectors. For the fast detector, we follow the configuration used in AED-MAE [48].\n4.3 Comparison with State-of-the-art Methods # In this section, we compare the proposed SlowFastVAD with dozens of baseline VAD methods across four datasets to evaluate its detection performance. Notably, for a fair comparison, we restrict our evaluation to frame- or cube-centric methods, as object-centric methods completely remove background information and irrelevant content. As shown in Tables 1 and 2, SlowFastVAD achieves overall state-of-the-art results, particularly excelling on the UCSD Ped2 and UBnormal datasets, with Micro AUC scores of 99.1% and 72.2%, respectively. These results demonstrate strong generalization ability and detection accuracy across diverse scenarios. 
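To make the difference between the Micro and Macro AUC figures quoted here concrete, the definitions of Section 4.1.2 can be sketched in Python. This is an illustrative sketch, not the evaluation code used in the paper: the helper names `roc_auc` and `micro_macro_auc` are hypothetical, AUC is computed through the rank-sum (Mann-Whitney U) identity with average ranks for ties, and videos containing only one class raise an error because their per-video AUC is undefined (how such videos are handled in practice is an assumption left open here).

```python
import numpy as np


def roc_auc(labels, scores):
    """Frame-level ROC AUC via the rank-sum identity (ties get average ranks)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos, n_neg = int(labels.sum()), int((~labels).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=float)
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank
        i = j + 1

    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))


def micro_macro_auc(videos):
    """videos: list of (labels, scores) pairs, one pair per test video."""
    all_labels = np.concatenate([np.asarray(l) for l, _ in videos])
    all_scores = np.concatenate([np.asarray(s, dtype=float) for _, s in videos])
    micro = roc_auc(all_labels, all_scores)          # all frames merged
    macro = float(np.mean([roc_auc(l, s) for l, s in videos]))  # per-video mean
    return micro, macro


# Two hypothetical videos that are each ranked perfectly but on different scales.
video_a = ([0, 0, 1, 1], [0.10, 0.20, 0.80, 0.90])
video_b = ([0, 0, 1, 1], [0.01, 0.02, 0.03, 0.04])
micro, macro = micro_macro_auc([video_a, video_b])
print(micro, macro)
```

In this toy case the Macro AUC is perfect while the Micro AUC is lower, because merging frames penalizes score scales that differ between videos; this is why the two metrics are reported side by side.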
The key advantage of SlowFastVAD lies in its dual-branch (slow and fast) architecture, which fully leverages the ability of VLMs to refine and amend the initial predictions from the fast detector. This design achieves a balanced trade-off between inference efficiency and detection accuracy. Compared to traditional VAD approaches based on visual features and reconstruction costs, SlowFastVAD benefits from the semantic understanding and external prior knowledge provided by VLMs, enabling more robust anomaly detection. For instance, compared to the previous best counterpart AED-MAE [48], our SlowFastVAD yields considerable gains on most evaluation metrics. Furthermore, in contrast to the VLM-only method AnomalyRuler [71], SlowFastVAD not only achieves significantly faster inference, but also delivers competitive or improved detection performance.\n\nTable 1: AUC scores (%) of several state-of-the-art methods versus SlowFastVAD on Ped2 and Avenue datasets. The top three methods are shown in red, green, and blue.\n\n| Method | Reference | Year | Ped2 Micro | Ped2 Macro | Avenue Micro | Avenue Macro |\n| --- | --- | --- | --- | --- | --- | --- |\n| LSHF [84] | PR | 2016 | 91.0 | - | - | - |\n| AnomalyGAN [47] | ICIP | 2017 | 93.5 | - | - | - |\n| FuturePred [27] | CVPR | 2018 | 95.4 | - | 85.1 | 81.7 |\n| MC2ST [28] | BMVC | 2018 | 87.5 | - | 84.4 | - |\n| DeepMIL [50] | CVPR | 2018 | - | - | - | - |\n| PnP-CMA [46] | WACV | 2018 | 88.4 | - | - | - |\n| MemAE [16] | ICCV | 2019 | 94.1 | - | 83.3 | - |\n| NNC [20] | WACV | 2019 | - | - | 88.9 | - |\n| BMAN [24] | TIP | 2019 | 96.6 | - | 90.0 | - |\n| AMCVAD [41] | ICCV | 2019 | 96.2 | - | 86.9 | - |\n| DeepOC [66] | TNNLS | 2019 | 96.9 | - | 86.6 | - |\n| StreetScene [45] | WACV | 2020 | 88.3 | - | 72.0 | - |\n| MNAD [44] | CVPR | 2020 | 97.0 | - | 82.8 | 86.8 |\n| SCRD [51] | ACMMM | 2020 | - | - | 89.6 | - |\n| CAC [62] | ACMMM | 2020 | - | - | 87.0 | - |\n| VEC-AM [75] | ACMMM | 2020 | 97.3 | - | 90.2 | - |\n| AEP [76] | TNNLS | 2021 | 97.3 | - | 90.2 | - |\n| LNRA [2] | BMVC | 2021 | 96.5 | - | 87.1 | - |\n| TimeSformer [5] | ICML | 2021 | - | - | - | - |\n| SSPCAB [49]+[27] | CVPR | 2022 | - | - | 87.3 | 84.5 |\n| SSPCAB [49]+[44] | CVPR | 2022 | - | - | 84.8 | 88.6 |\n| GCL [78] | CVPR | 2022 | - | - | - | - |\n| FastAno [43] | WACV | 2022 | 96.3 | - | 85.3 | - |\n| S3R [64] | ECCV | 2022 | - | - | - | - |\n| HSNBM [4] | ACMMM | 2022 | 95.2 | - | 91.6 | - |\n| ERVAD [52] | ACMMM | 2022 | 97.1 | - | 92.7 | - |\n| DM-UVAD [58] | ICIP | 2023 | - | - | - | - |\n| FPDM [70] | ICCV | 2023 | - | - | 90.1 | - |\n| SSMCTB [37]+[27] | TPAMI | 2023 | - | - | 89.1 | 84.8 |\n| SSMCTB [37]+[44] | TPAMI | 2023 | - | - | 86.4 | 86.3 |\n| AnomalyRuler [71] | ECCV | 2024 | 97.9 | - | 89.7 | - |\n| AED-MAE [48] | CVPR | 2024 | 95.4 | 98.4 | 91.3 | 90.9 |\n| SSAE [8] | TPAMI | 2024 | - | - | 90.2 | - |\n| SlowFastVAD | - | - | 99.1 | 99.7 | 89.6 | 93.2 |\n\nTable 2: AUC scores (%) of several state-of-the-art methods versus SlowFastVAD on ShanghaiTech and UBnormal datasets. The top three methods are shown in red, green, and blue.\n\n| Method | Reference | Year | ShanghaiTech Micro | ShanghaiTech Macro | UBnormal Micro | UBnormal Macro |\n| --- | --- | --- | --- | --- | --- | --- |\n| FuturePred [27] | CVPR | 2018 | 72.8 | 80.6 | - | - |\n| MC2ST [28] | BMVC | 2018 | - | - | - | - |\n| DeepMIL [50] | CVPR | 2018 | - | 76.5 | 50.3 | 76.8 |\n| MemAE [16] | ICCV | 2019 | 71.2 | - | - | - |\n| MNAD [44] | CVPR | 2020 | 68.3 | 79.7 | - | - |\n| SCRD [51] | ACMMM | 2020 | 74.7 | - | - | - |\n| CAC [62] | ACMMM | 2020 | 79.3 | - | - | - |\n| VEC-AM [75] | ACMMM | 2020 | 74.8 | - | - | - |\n| LNRA [2] | BMVC | 2021 | 75.9 | - | - | - |\n| TimeSformer [5] | ICML | 2021 | - | - | 68.5 | 80.3 |\n| SSPCAB [49]+[27] | CVPR | 2022 | 74.5 | 82.9 | - | - |\n| SSPCAB [49]+[44] | CVPR | 2022 | 69.8 | 80.2 | - | - |\n| GCL [78] | CVPR | 2022 | 78.9 | - | - | - |\n| FastAno [43] | WACV | 2022 | 72.2 | - | - | - |\n| S3R [64] | ECCV | 2022 | 80.4 | - | - | - |\n| HSNBM [4] | ACMMM | 2022 | 76.5 | - | - | - |\n| ERVAD [52] | ACMMM | 2022 | 79.3 | - | - | - |\n| DM-UVAD [58] | ICIP | 2023 | 76.1 | - | - | - |\n| FPDM [70] | ICCV | 2023 | 78.6 | - | 62.7 | - |\n| SSMCTB [37]+[27] | TPAMI | 2023 | 74.6 | 83.3 | - | - |\n| SSMCTB [37]+[44] | TPAMI | 2023 | 70.6 | 80.3 | - | - |\n| AnomalyRuler [71] | ECCV | 2024 | 85.2 | - | 71.9 | - |\n| AED-MAE [48] | CVPR | 2024 | 79.1 | 84.7 | 58.5 | 81.4 |\n| SSAE [8] | TPAMI | 2024 | 80.5 | - | - | - |\n| SlowFastVAD | - | - | 85.0 | 90.7 | 72.2 | 82.4 |\n\n4.4 Ablation Studies # 4.4.1 Impact of Each Component. In this section, we conduct an ablation study on different configurations of SlowFastVAD to evaluate the contribution of each component to overall VAD performance. The following configurations are considered: (1) Baseline: No additional components are used; the slow detector re-evaluates anomalies based solely on the fast detector\u0026rsquo;s results under uniform sampling; (2) +Intervention: Only the intervention strategy is added; (3) +Intervention+Integration: Both the intervention and integration components are used; (4) Full Model: All components, including the RAG module, are applied. The performance comparison is presented in Table 3.\n\nTable 3: Impact of each novel component on Ped2, Avenue, and ShanghaiTech datasets (AUC, %).\n\n| Intervention | Integration | RAG | Ped2 Micro | Ped2 Macro | Avenue Micro | Avenue Macro | ShanghaiTech Micro | ShanghaiTech Macro |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| × | × | × | 87.8 | 89.6 | 80.1 | 86.1 | 76.3 | 82.3 |\n| ✓ | × | × | 90.6 | 91.1 | 85.8 | 89.0 | 80.6 | 83.6 |\n| ✓ | ✓ | × | 94.3 | 97.2 | 86.1 | 88.5 | 83.9 | 88.4 |\n| ✓ | ✓ | ✓ | 99.1 | 99.7 | 89.6 | 93.2 | 85.0 | 90.7 |\n\nWe observe that the baseline setting with uniform sampling yields relatively conservative performance, indicating its limited ability to capture the key temporal segments of anomalous events. Introducing the intervention strategy leads to consistent improvements across all three datasets, especially on Avenue and ShanghaiTech, confirming its effectiveness in guiding the model to focus on informative abnormal regions. 
Adding the integration mechanism further boosts performance, notably on Ped2 and ShanghaiTech, suggesting that it effectively combines the outputs of the fast and slow detectors while better modeling temporal dependencies. Finally, incorporating the RAG module into the full model results in the best overall performance, with substantial gains on Ped2 and Avenue. This highlights the value of the enhanced prompts generated by RAG in assisting the slow detector with more accurate anomaly reasoning. In summary, each component contributes to performance improvements to varying degrees. The final configuration consistently outperforms the others across all datasets, particularly excelling on Ped2 and ShanghaiTech.\nFigure 3: Visualization of partial detection results on Ped2, Avenue, ShanghaiTech, and UBnormal. Three detection results are shown: the top displays anomaly scores generated solely by the fast detector; the middle shows the updated scores after intervention by the slow detector; the bottom presents the final results obtained through the integration of both detectors.\n\nTable 4: Impact of fast detector, slow detector and the hybrid SlowFastVAD on Ped2, Avenue, and ShanghaiTech datasets (AUC, %).\n\n| Branch | Ped2 Micro | Ped2 Macro | Avenue Micro | Avenue Macro | ShanghaiTech Micro | ShanghaiTech Macro | FPS |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| Fast Detector | 95.4 | 98.4 | 91.3 | 90.9 | 79.1 | 84.7 | 1655 |\n| Slow Detector | 98.4 | 99.0 | 74.5 | 78.0 | 87.7 | 85.6 | 0.5 |\n| SlowFastVAD | 99.1 | 99.7 | 89.6 | 93.2 | 85.0 | 90.7 | 16 |\n\nNote: The FPS results are obtained on a single RTX 3090 GPU. Due to limited GPU resources, the locally deployed model is Qwen2-VL-7B. If multiple GPUs are used for parallel processing, the speed can be further improved.\n4.4.2 Impact of Different Detectors. We further evaluated the performance of the fast detector, slow detector, and their hybrid approach across different datasets, with the results summarized in Table 4. 
The fast detector alone demonstrates competitive performance and delivers high inference efficiency (i.e., 1655 FPS) on all three datasets. In contrast, the slow detector is considerably slower (i.e., 0.5 FPS) and less consistent, falling well below the fast detector on Avenue; this can be attributed to the hallucination effects commonly observed in LLMs when operating independently, which compromise their ability to accurately identify anomalous events. By integrating both detectors, the hybrid approach achieves the best overall performance across all datasets. Although a slight decrease in Micro AUC is observed on the Avenue dataset, the dual-branch combination effectively suppresses hallucination effects, significantly reducing false positives and false negatives while leveraging the strengths of the fast detector. Moreover, the hybrid approach maintains a favorable balance between detection accuracy and real-time inference (16 FPS), making it a practical and robust solution for VAD in diverse scenarios. This also substantiates the effectiveness of our biologically inspired design, which emulates the human visual system\u0026rsquo;s dual complementary pathways, namely, mimicking the coordination between rapid action-oriented responses and slower cognition-driven reasoning.\n4.5 Qualitative Analyses # Figure 3 visualizes the detection results of our SlowFastVAD and its variants on different datasets. The abnormal parts are highlighted with green bounding boxes in the video frames. In the detection results, the red sections represent video segments labeled as abnormal in the ground truth, while the blue sections represent the detection results after the intervention of the slow detector. It is evident that using only the fast detector can achieve relatively good detection performance; however, it still suffers from noticeable false positives and false negatives, especially as observed in samples from Ped2 and Avenue. 
By incorporating the VLM-based slow detector through the intervention strategy to analyze suspicious regions, the local detection performance is significantly improved. Nevertheless, the localized enhancements have limited influence on the overall prediction. Therefore, the final integration of the fast and slow detectors via a Gaussian filter leads to a more globally consistent improvement, further enhancing overall detection performance.\nIn addition, we present several representative reasoning results from the slow detector. Due to space limitations, we randomly select a subset of intervention segments for illustration. Compared to the fast detector, which relies on simple data fitting to produce anomaly scores, the VLM-based slow detector leverages both pretrained knowledge and domain-specific information introduced via the RAG module to enable brain-inspired deep reasoning over events, thereby producing more interpretable and accurate anomaly assessments.\n5 Conclusion # In this work, we introduce SlowFastVAD, a novel hybrid framework that integrates a fast anomaly detector with a vision-language model enhanced by retrieval-augmented generation to achieve both efficiency and interpretability in video anomaly detection. The fast detector provides initial detection results, while several ambiguous segments are selectively analyzed by the slower yet more explainable VLM, reducing unnecessary computational overhead. By leveraging this dual-branch detection pipeline, our method effectively balances computational cost and detection accuracy. Specifically, the proposed entropy-based intervention strategy ensures that only uncertain segments are processed by the VLM, while the construction of a domain-adapted knowledge base further enhances the VLM\u0026rsquo;s adaptability to specific VAD scenarios. 
Extensive experiments conducted on four datasets demonstrate that SlowFastVAD outperforms existing methods, achieving state-of-the-art detection performance while maintaining interpretability. In the future, we will further explore task-specific foundation models centered on VAD and continue to enhance reasoning efficiency.\nReferences # [1] Sravanti Addepalli, Ashish Ramayee Asokan, Lakshay Sharma, and R Venkatesh Babu. 2024. Leveraging vision-language models for improving domain generalization in image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 23922–23932.\n[2] Marcella Astrid, Muhammad Zaigham Zaheer, Jae-Yeong Lee, and Seung-Ik Lee. 2021. Learning not to reconstruct anomalies. arXiv preprint arXiv:2110.09742 (2021).\n[3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966 (2023).\n[4] Qianyue Bao, Fang Liu, Yang Liu, Licheng Jiao, Xu Liu, and Lingling Li. 2022. Hierarchical scene normality-binding modeling for anomaly detection in surveillance videos. In Proceedings of the 30th ACM international conference on multimedia. 6103–6112.\n[5] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? In ICML, Vol. 2. 4.\n[6] Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1818–1826.\n[7] Ruichu Cai, Hao Zhang, Wen Liu, Shenghua Gao, and Zhifeng Hao. 2021. Appearance-motion memory consistency network for video anomaly detection. In Proceedings of the AAAI conference on artificial intelligence, Vol. 
35. 938–946.\n[8] Congqi Cao, Hanwen Zhang, Yue Lu, Peng Wang, and Yanning Zhang. 2024. Scene-dependent prediction in latent space for video anomaly detection and anticipation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).\n[9] Junxi Chen, Liang Li, Li Su, Zheng-Jun Zha, and Qingming Huang. 2024. Prompt-enhanced multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18319–18329.\n[10] Kai Cheng, Yaning Pan, Yang Liu, Xinhua Zeng, and Rui Feng. 2024. Denoising diffusion-augmented hybrid video anomaly detection via reconstructing noised frames. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. 695–703.\n[11] MyeongAh Cho, Minjung Kim, Sangwon Hwang, Chaewon Park, Kyungjae Lee, and Sangyoun Lee. 2023. Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 12137–12146.\n[12] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision. 6202–6211.\n[13] Xinyang Feng, Dongjin Song, Yuncong Chen, Zhengzhang Chen, Jingchao Ni, and Haifeng Chen. 2021. Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection. In Proceedings of the 29th ACM International Conference on Multimedia. 5546–5554.\n[14] Joseph Fioresi, Ishan Rajendrakumar Dave, and Mubarak Shah. 2023. Ted-spad: Temporal distinctiveness for self-supervised privacy-preservation for video anomaly detection. In Proceedings of the IEEE/CVF international conference on computer vision. 
13598–13609.\n[15] Alessandro Flaborea, Luca Collorone, Guido Maria D\u0026rsquo;Amely Di Melendugno, Stefano D\u0026rsquo;Arrigo, Bardh Prenkaj, and Fabio Galasso. 2023. Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection. In Proceedings of the IEEE/CVF international conference on computer vision. 10318–10329.\n[16] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. 2019. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF international conference on computer vision. 1705–1714.\n[17] Chao Huang, Chengliang Liu, Jie Wen, Lian Wu, Yong Xu, Qiuping Jiang, and Yaowei Wang. 2022. Weakly supervised video anomaly detection via self-guided temporal discriminative transformer. IEEE Transactions on Cybernetics 54, 5 (2022), 3197–3210.\n[18] Chao Huang, Jie Wen, Chengliang Liu, and Yabo Liu. 2024. Long short-term dynamic prototype alignment learning for video anomaly detection. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. 866–874.\n[19] Chao Huang, Jie Wen, Yong Xu, Qiuping Jiang, Jian Yang, Yaowei Wang, and David Zhang. 2022. Self-supervised attentive generative adversarial networks for video anomaly detection. IEEE transactions on neural networks and learning systems 34, 11 (2022), 9389–9403.\n[20] Radu Tudor Ionescu, Sorina Smeureanu, Marius Popescu, and Bogdan Alexe. 2019. Detecting abnormal events in video using narrowed normality clusters. In 2019 IEEE winter conference on applications of computer vision (WACV). IEEE, 1951–1960.\n[21] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. 2023. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 3230–3234.\n[22] Yannis Kalantidis, Giorgos Tolias, et al. 2024. 
Label propagation for zero-shot classification with vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 23209–23218.\n[23] Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. 2023. Grounding language models to images for multimodal inputs and outputs. In International Conference on Machine Learning. PMLR, 17283–17300.\n[24] Sangmin Lee, Hak Gu Kim, and Yong Man Ro. 2019. BMAN: Bidirectional multiscale aggregation networks for abnormal event detection. IEEE Transactions on Image Processing 29 (2019), 2395–2408.\n[25] Changkang Li and Yalong Jiang. 2024. VLAVAD: Vision-Language Models Assisted Unsupervised Video Anomaly Detection. (2024).\n[26] Guoqiu Li, Guanxiong Cai, Xingyu Zeng, and Rui Zhao. 2022. Scale-aware spatiotemporal relation learning for video anomaly detection. In European Conference on Computer Vision. Springer, 333–350.\n[27] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. 2018. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6536–6545.\n[28] Yusha Liu, Chun-Liang Li, and Barnabás Póczos. 2018. Classifier two sample test for video anomaly detections. In BMVC. 71.\n[29] Yang Liu, Jing Liu, Mengyang Zhao, Shuang Li, and Liang Song. 2022. Collaborative Normality Learning Framework for Weakly Supervised Video Anomaly Detection. IEEE Transactions on Circuits and Systems II: Express Briefs 69, 5 (2022), 2508–2512. doi:10.1109/TCSII.2022.3161061\n[30] Yang Liu, Zhaoyang Xia, Mengyang Zhao, Donglai Wei, Yuzheng Wang, Siao Liu, Bobo Ju, Gaoyun Fang, Jing Liu, and Liang Song. 2023. Learning causality-inspired representation consistency for video anomaly detection. In Proceedings of the 31st ACM international conference on multimedia. 203–212.\n[31] Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. 2024. VideoStudio: Generating Consistent-Content and Multi-Scene Videos. 
In European Conference on Computer Vision. Springer, 468–485.\n[32] Cewu Lu, Jianping Shi, and Jiaya Jia. 2013. Abnormal Event Detection at 150 FPS in MATLAB. In 2013 IEEE International Conference on Computer Vision. 2720–2727. doi:10.1109/ICCV.2013.338\n[33] Jiayun Luo, Siddhesh Khandelwal, Leonid Sigal, and Boyang Li. 2024. Emergent open-vocabulary semantic segmentation from off-the-shelf vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4029–4040.\n[34] Weixin Luo, Wen Liu, Dongze Lian, and Shenghua Gao. 2021. Future frame prediction network for video anomaly detection. IEEE transactions on pattern analysis and machine intelligence 44, 11 (2021), 7505–7520.\n[35] Hui Lv and Qianru Sun. 2024. Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702 (2024).\n[36] Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, and Hanwang Zhang. 2023. Unbiased multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8022–8031.\n[37] Neelu Madan, Nicolae-Cătălin Ristea, Radu Tudor Ionescu, Kamal Nasrollahi, Fahad Shahbaz Khan, Thomas B Moeslund, and Mubarak Shah. 2023. Self-supervised masked convolutional transformer block for anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 1 (2023), 525–542.\n[38] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. 2010. Anomaly detection in crowded scenes. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 1975–1981. doi:10.1109/CVPR.2010.5539872\n[39] Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, and Kwonjoon Lee. 2024. Can\u0026rsquo;t make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 
18580–18590.\n[40] Romero Morais, Vuong Le, Truyen Tran, Budhaditya Saha, Moussa Mansour, and Svetha Venkatesh. 2019. Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).\n[41] Trong-Nguyen Nguyen and Jean Meunier. 2019. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the IEEE/CVF international conference on computer vision. 1273–1283.\n[42] Carl Olsson, Marcus Carlsson, Fredrik Andersson, and Viktor Larsson. 2017. Nonconvex Rank/Sparsity Regularization and Local Minima. In 2017 IEEE International Conference on Computer Vision (ICCV). 332–340. doi:10.1109/ICCV.2017.44\n[43] Chaewon Park, MyeongAh Cho, Minhyeok Lee, and Sangyoun Lee. 2022. FastAno: Fast anomaly detection via spatio-temporal patch transformation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2249–2259.\n[44] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. 2020. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 14372–14381.\n[45] Bharathkumar Ramachandra and Michael Jones. 2020. Street scene: A new dataset and evaluation protocol for video anomaly detection. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2569–2578.\n[46] Mahdyar Ravanbakhsh, Moin Nabi, Hossein Mousavi, Enver Sangineto, and Nicu Sebe. 2018. Plug-and-play cnn for crowd motion analysis: An application in abnormal event detection. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1689–1698.\n[47] Mahdyar Ravanbakhsh, Moin Nabi, Enver Sangineto, Lucio Marcenaro, Carlo Regazzoni, and Nicu Sebe. 2017. Abnormal event detection in videos using generative adversarial nets. In 2017 IEEE international conference on image processing (ICIP). 
IEEE, 1577–1581.\n[48] Nicolae-C Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, and Mubarak Shah. 2024. Self-Distilled Masked AutoEncoders are Efficient Video Anomaly Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 15984–15995.\n[49] Nicolae-Cătălin Ristea, Neelu Madan, Radu Tudor Ionescu, Kamal Nasrollahi, Fahad Shahbaz Khan, Thomas B Moeslund, and Mubarak Shah. 2022. Self-supervised predictive convolutional attentive block for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 13576–13586.\n[50] Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6479–6488.\n[51] Che Sun, Yunde Jia, Yao Hu, and Yuwei Wu. 2020. Scene-aware context reasoning for unsupervised abnormal event detection in videos. In Proceedings of the 28th ACM international conference on multimedia. 184–192.\n[52] Che Sun, Yunde Jia, and Yuwei Wu. 2022. Evidential reasoning for video anomaly detection. In Proceedings of the 30th ACM International Conference on Multimedia. 2106–2114.\n[53] Shengyang Sun and Xiaojin Gong. 2023. Hierarchical semantic contrast for scene-aware video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 22846–22856.\n[54] Shengyang Sun and Xiaojin Gong. 2023. Long-short temporal co-teaching for weakly supervised video anomaly detection. In 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2711–2716.\n[55] Shengyang Sun, Jiashen Hua, Junyi Feng, Dongxu Wei, Baisheng Lai, and Xiaojin Gong. 2024. TDSD: Text-driven scene-decoupled weakly supervised video anomaly detection. In Proceedings of the 32nd ACM International Conference on Multimedia. 5055–5064.\n[56] Qwen Team. 2024. Qwen2.5 technical report. 
arXiv preprint arXiv:2412.15115 (2024).\n[57] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. 2024. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289 (2024).\n[58] Anil Osman Tur, Nicola Dall\u0026rsquo;Asen, Cigdem Beyan, and Elisa Ricci. 2023. Exploring diffusion models for unsupervised video anomaly detection. In 2023 IEEE international conference on image processing (ICIP). IEEE, 2540–2544.\n[59] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision. Springer, 20–36.\n[60] Yuan Wang, Rui Sun, Naisong Luo, Yuwen Pan, and Tianzhu Zhang. 2024. Image-to-image matching via foundation models: A new perspective for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3952–3963.\n[61] Zejin Wang, Jiazheng Liu, Guoqing Li, and Hua Han. 2022. Blind2Unblind: Self-Supervised Image Denoising with Visible Blind Spots. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2017–2026. doi:10.1109/CVPR52688.2022.00207\n[62] Ziming Wang, Yuexian Zou, and Zeming Zhang. 2020. Cluster attention contrast for video anomaly detection. In Proceedings of the 28th ACM international conference on multimedia. 2463–2471.\n[63] Jie Wu, Wei Zhang, Guanbin Li, Wenhao Wu, Xiao Tan, Yingying Li, Errui Ding, and Liang Lin. 2021. Weakly-supervised spatio-temporal anomaly detection in surveillance video. arXiv preprint arXiv:2108.03825 (2021).\n[64] Jhih-Ciang Wu, He-Yen Hsieh, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. 2022. Self-supervised sparse representation for video anomaly detection. In European Conference on Computer Vision. 
Springer, 729–745.\n[65] Peng Wu, Jing Liu, Xiangteng He, Yuxin Peng, Peng Wang, and Yanning Zhang. 2024. Toward video anomaly retrieval from video anomaly detection: New benchmarks and model. IEEE Transactions on Image Processing 33 (2024), 2213–2225.\n[66] Peng Wu, Jing Liu, and Fang Shen. 2019. A deep one-class neural network for anomalous event detection in complex scenes. IEEE transactions on neural networks and learning systems 31, 7 (2019), 2609–2622.\n[67] Peng Wu, Chengyu Pan, Yuting Yan, Guansong Pang, Peng Wang, and Yanning Zhang. 2024. Deep Learning for Video Anomaly Detection: A Review. arXiv preprint arXiv:2409.05383 (2024).\n[68] Peng Wu, Xuerong Zhou, Guansong Pang, Zhiwei Yang, Qingsen Yan, Peng Wang, and Yanning Zhang. 2024. Weakly supervised video anomaly detection and localization with spatio-temporal prompts. In Proceedings of the 32nd ACM International Conference on Multimedia. 9301–9310.\n[69] Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. 2024. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 6074–6082.\n[70] Cheng Yan, Shiyu Zhang, Yang Liu, Guansong Pang, and Wenjun Wang. 2023. Feature prediction diffusion model for video anomaly detection. In Proceedings of the IEEE/CVF international conference on computer vision. 5527–5537.\n[71] Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. 2024. Follow the rules: reasoning for video anomaly detection with large language models. In European Conference on Computer Vision. Springer, 304–322.\n[72] Zhiwei Yang, Jing Liu, and Peng Wu. 2024. Text prompt with normality guidance for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18899–18908.\n[73] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. 2023. 
Video Event Restoration Based on Keyframes for Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 14592–14601.\n[74] Zheyu Yang, Taoyi Wang, Yihan Lin, Yuguo Chen, Hui Zeng, Jing Pei, Jiazheng Wang, Xue Liu, Yichun Zhou, Jianqiang Zhang, et al. 2024. A vision chip with complementary pathways for open-world sensing. Nature 629, 8014 (2024), 1027–1033.\n[75] Guang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin, and Marius Kloft. 2020. Cloze test helps: Effective video anomaly detection via learning to complete video events. In Proceedings of the 28th ACM international conference on multimedia. 583–591.\n[76] Jongmin Yu, Younkwan Lee, Kin Choong Yow, Moongu Jeon, and Witold Pedrycz. 2021. Abnormal event detection and localization via adversarial event prediction. IEEE transactions on neural networks and learning systems 33, 8 (2021), 3572–3586.\n[77] Muhammad Zaigham Zaheer, Jin-Ha Lee, Marcella Astrid, and Seung-Ik Lee. 2020. Old Is Gold: Redefining the Adversarially Learned One-Class Classifier Training Paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).\n[78] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. 2022. Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 14744–14754.\n[79] Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. 2024. Harnessing large language models for training-free video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18527–18536.\n[80] Chen Zhang, Guorong Li, Yuankai Qi, Shuhui Wang, Laiyun Qing, Qingming Huang, and Ming-Hsuan Yang. 2023. Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16271–16280.\n[81] Haosong Zhang, Mei Chee Leong, Liyuan Li, and Weisi Lin. 2024. PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18857–18867.\n[82] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xiaonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, and Nong Sang. 2024. Holmes-vau: Towards long-term video anomaly understanding at any granularity. arXiv preprint arXiv:2412.06171 (2024).\n[83] Menghao Zhang, Jingyu Wang, Qi Qi, Pengfei Ren, Haifeng Sun, Zirui Zhuang, Huazheng Wang, Lei Zhang, and Jianxin Liao. 2024. Video Anomaly Detection via Progressive Learning of Multiple Proxy Tasks. In Proceedings of the 32nd ACM International Conference on Multimedia. 4719–4728.\n[84] Ying Zhang, Huchuan Lu, Lihe Zhang, Xiang Ruan, and Shun Sakai. 2016. Video anomaly detection based on locality sensitive hashing filters. Pattern Recognition 59 (2016), 302–311.\n[85] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. 2017. Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia. 1933–1941.\n[86] Hang Zhou, Junqing Yu, and Wei Yang. 2023. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 3769–3777. ","date":"1 May 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/slowfastvad-video-anomaly-detection-via-integrating-simpledetector-and-rag-enhanced-vision-language-model/","section":"Papers","summary":"Proposes a hybrid framework that integrates a fast anomaly detector with a slow, RAG-enhanced vision-language model to improve efficiency and interpretability in video anomaly detection. 
It employs a retrieval-augmented reasoning module for better scene-specific adaptation, uses an entropy-based intervention strategy to select ambiguous segments for slow detector analysis, and fuses outputs for robust detection.","title":"SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model","type":"method"},{"content":"","date":"1 May 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/zongcan-ding/","section":"Authors","summary":"","title":"Zongcan Ding","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/benedetta-liberatori/","section":"Authors","summary":"","title":"Benedetta Liberatori","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/david-aik-aun-khoo/","section":"Authors","summary":"","title":"David Aik-Aun Khoo","type":"authors"},{"content":" Delving into CLIP latent space for Video Anomaly Recognition # Luca Zanella ⋄a,∗∗, Benedetta Liberatori ⋄a, Willi Menapace a, Fabio Poiesi b, Yiming Wang b, Elisa Ricci a,b\na University of Trento, Trento, Italy\nb Fondazione Bruno Kessler, Trento, Italy\nABSTRACT # We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, utilising only video-level supervision. We introduce the novel method AnomalyCLIP, the first to combine Large Language and Vision (LLV) models, such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class. 
We also introduce a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, ultimately producing the final anomaly score and class prediction probabilities. We compare AnomalyCLIP against state-of-the-art methods considering three major anomaly detection benchmarks, i.e. ShanghaiTech, UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines in recognising video anomalies. Project website and code are available at https://luca-zanella-dvl.github.io/AnomalyCLIP/ .\n© 2023 This manuscript version is made available under the CC-BY-NC-ND 4.0 license.\n∗∗ Corresponding author: e-mail: luca.zanella-3@unitn.it (Luca Zanella ⋄)\n⋄ Equal contribution.\n1. Introduction # Video anomaly detection (VAD) is the task of automatically identifying activities that deviate from normal patterns in videos (Suarez and Naval Jr, 2020). VAD has been widely studied by the computer vision and multimedia communities (Bao et al., 2022; Feng et al., 2021b; Mei and Zhang, 2017; Nayak et al., 2021; Sun et al., 2022; Wang et al., 2020; Xu et al., 2019) for several important applications, such as surveillance (Sultani et al., 2018) and industrial monitoring (Roth et al., 2022).\nVAD is challenging because data is typically highly imbalanced, i.e. normal events are many, whilst abnormal events are rare and sporadic. VAD can be addressed as an out-of-distribution detection problem, i.e. one-class classification (OOC) (Liu et al., 2021; Lv et al., 2021; Park et al., 2020; Xu et al., 2019): only visual data corresponding to the normal state is used as training data, and an input test video is classified as normal or abnormal based on its deviation from the learnt normal state. However, OOC methods can be particularly ineffective in complex real-world applications where normal activities are diverse. 
An uncommon normal activity may cause a false alarm because it differs from the learnt normal activities. Alternatively, VAD can be addressed with fully-supervised approaches based on frame-level annotations (Bai et al., 2019; Wang et al., 2019). Despite their good performance, they are considered impractical because annotations are costly to produce. Unsupervised approaches can also be used, but their performance in complex settings is not yet satisfactory (Zaheer et al., 2022). For these reasons, the most recent approaches are designed for weakly-supervised learning scenarios (Li et al., 2022a; Sultani et al., 2018; Tian et al., 2021; Wu and Liu, 2021): they exploit video-level supervision.\nWhilst existing weakly-supervised VAD methods have been shown to be effective in anomaly detection (Li et al., 2022a), they are not designed for recognising anomaly types (e.g. shooting vs. explosion). Performing Video Anomaly Recognition (VAR) in addition to VAD, that is, not only detecting anomalous events but also recognising the underlying activities, is desirable as it provides more informative and actionable insights. However, addressing VAR in a weakly-supervised setting is highly challenging due to the extreme data imbalance and the limited samples representing each anomaly (Sultani et al., 2018).\nFig. 1: Comparison of various anomaly recognition methods on the ShanghaiTech, UCF-Crime, and XD-Violence datasets in terms of the mean area under the curve (mAUC) of the receiver operating characteristic (ROC) and the mean average precision (mAP) of the precision-recall curve (PRC), which calculate the mean of binary AUC ROC and AP PRC values for all anomalous classes, respectively. A higher mAUC and mAP are crucial for video anomaly recognition as they reflect the model\u0026rsquo;s ability to recognise the correct abnormal class. Notably, our proposed method, AnomalyCLIP, achieves the highest performance on all datasets, surpassing both the state-of-the-art methods on video anomaly detection that are re-purposed for anomaly recognition and CLIP-based video action recognition methods.\nWe have recently experienced the emergence of powerful deep learning models that are trained on massive web-scale datasets (Schuhmann et al., 2021). These models, commonly referred to as Large Language and Vision (LLV) models or foundation models (Radford et al., 2021; Singh et al., 2022), have shown strong generalisation capabilities in several downstream tasks and have become a key ingredient of modern computer vision and multimedia systems. These pre-trained models are publicly available and can be seamlessly integrated into any recognition system. LLV models can also be effectively applied to videos and to supervised action recognition tasks (Wang et al., 2021; Xu et al., 2021).\nIn this paper, we introduce the first method that jointly addresses VAD and VAR with LLV models. We argue that by leveraging representations derived from LLV models, we can obtain more discriminative features for recognising and classifying abnormal behaviours. However, as supported by our experiments (Fig. 1), a naive application of existing LLV models to VAR-VAD does not suffice due to the imbalance of the training data and the subtle differences between frames of the same video with and without anomalous content.\nTherefore, we propose AnomalyCLIP, a novel solution for VAR based on the CLIP model (Radford et al., 2021), achieving state-of-the-art anomaly recognition performance as shown in Fig. 1.\nAnomalyCLIP produces video representations that can be mapped to the textual description of the anomalous event. Rather than directly operating on the CLIP feature space, we re-centre it around a normality prototype, as shown in Fig. 
2 (a). In this way, the space assumes important semantics: the magnitude of the features indicates the degree of anomaly, while the direction from the origin indicates the anomaly type. To learn the directions that represent the desired anomaly classes, we propose a Selector model that employs prompt learning and a projection operator tailored to our new space to identify the parts in a video that better match the textual description of the anomaly. This ability is instrumental in addressing the data imbalance problem. We use the predictions of the Selector model to implement a semantically-guided Multiple Instance Learning (MIL) strategy that aims to widen the gap between the most anomalous segments of anomalous videos and normal ones. Unlike the features typically employed in VAD, which are extracted using temporal-aware backbones (Carreira and Zisserman, 2017; Liu et al., 2022), CLIP visual features do not bear any temporal semantics, as CLIP operates at the image level. We thus propose a Temporal model, implemented as an Axial Transformer (Ho et al., 2019), which models both short-term relationships between successive frames and long-term dependencies between parts of the video.\nAs illustrated in Fig. 1, we evaluate the proposed approach on three benchmark datasets, ShanghaiTech (Liu et al., 2018), UCF-Crime (Sultani et al., 2018) and XD-Violence (Wu et al., 2020), and empirically show that our method achieves state-of-the-art performance in VAR.\nThe contributions of our paper are summarised as follows:\nwe propose the first method for VAR that is based on LLV models to detect and classify the type of anomalous events;\nwe introduce a transformation of the LLV model feature space driven by a normality prototype to effectively learn the prompt directions for anomaly types;\nwe propose a novel Selector model that uses semantic information imbued in the transformed LLV feature space as a robust way to perform MIL segment selection and anomaly recognition;\nwe design a Temporal model to better aggregate temporal information by modelling both the short-term relationships between neighbouring frames and the long-term dependencies among segments.\n2. Related Works # Video Anomaly Detection. Recognising anomalous behaviours in video surveillance streams is a traditional task in computer vision and multimedia analysis. Existing methods for VAD can be grouped into four main categories based on the level of supervision available during training. The first group includes fully-supervised methods that assume available frame-level annotations in the training set (Bai et al., 2019; Wang et al., 2019). The second group includes weakly-supervised approaches that only require video-level normal/abnormal annotations (Li et al., 2022a,b; Sultani et al., 2018; Tian et al., 2021; Wu and Liu, 2021). The third group includes one-class classification methods that assume the availability of only normal training data (Liu et al., 2021; Lv et al., 2021; Park et al., 2020). The fourth group includes unsupervised models that do not use training data annotations (Narasimhan, 2018; Zaheer et al., 2022).\nAmongst these types of methods, weakly-supervised approaches have gained greater popularity, as they typically yield good results while limiting the annotation effort. Sultani et al. 
(2018) were the first to formulate weakly-supervised VAD as a multiple-instance learning (MIL) task, dividing each video into short segments that form a set, known as a bag. Bags generated from abnormal videos are called positive bags, and those generated from normal videos negative bags. Since this pioneering work, MIL has become a paradigm for VAD and several subsequent works have proposed to refine the associated ranking model to more robustly predict anomaly scores. For example, Tian et al. (2021) proposed a Robust Temporal Feature Magnitude (RTFM) loss that is applied to a deep network consisting of a pyramid of dilated convolutions and a self-attention mechanism to model both short-term and long-term relationships between video snippets close in time and events in the whole video. Wu et al. (2022) introduced Self-Supervised Sparse Representation Learning, an approach that combines dictionary-based representation with self-supervised learning techniques to identify abnormal events. Chen et al. (2022) introduced Magnitude-Contrastive Glance-and-Focus Network, a neural network that uses a feature amplification mechanism and a magnitude contrastive loss to enhance the importance of features that are discriminative for anomalies. Motivated by the fact that anomalies can occur at any location and at any scale of the video, Li et al. (2022a) proposed Scale-Aware Spatio-Temporal Relation Learning (SSRL), an approach that extends RTFM by not only learning short-term and long-term temporal relationships but also learning multi-scale region-aware features. While SSRL achieves state-of-the-art results in common VAD benchmarks, its high computational complexity limits its applicability. To the best of our knowledge, no previous works have explored foundation models (Radford et al., 2021) for VAD, as we propose in this work.\nLarge Language and Vision models. The emergence of novel large multimodal neural networks (Radford et al., 2021; Schuhmann et al., 2021, 2022; Singh et al., 2022), which can learn joint visual-text embedding spaces, has enabled unprecedented results in several image and video understanding tasks. Current LLV models adopt modality-specific encoders and are trained via contrastive techniques to align the data representations from different modalities (Jia et al., 2021; Radford et al., 2021). Despite their simplicity, these methods have been shown to achieve impressive zero-shot generalisation capabilities. While earlier approaches such as CLIP (Radford et al., 2021) operate on images, LLV models have recently been successfully extended to the video domain. VideoCLIP (Xu et al., 2021) is an example of this and it is designed to align video and textual representations by contrasting temporally overlapping video-text pairs with mined hard negatives. VideoCLIP can achieve strong zero-shot performance in several video understanding tasks. ActionCLIP (Wang et al., 2021) models action recognition as a video-text matching problem rather than a classical 1-out-of-N majority vote task. Similarly to ours, their method uses the feature space of CLIP to learn semantically-aware representations of videos. However, a direct exploitation of the CLIP feature space fails to capture information on anomalous events, for which a specific adaptation, proposed in this work, is necessary. In addition, action recognition methods often fall short in weakly-supervised VAD tasks due to data imbalance between normal and abnormal events, coupled with the need for frame-level evaluation at test time, despite only having video-level supervision. To the best of our knowledge, no prior work has specifically utilised LLV models to tackle the VAD problem.\n3. Proposed approach # Weakly-supervised VAD is the task of learning to classify each frame in a video as either normal or anomalous using a dataset of tuples in the form (V, y), where V is a video and y a binary label indicating whether the video contains an anomaly in any of its frames. 
With respect to VAD, in VAR we introduce the additional task of recognising the type of the detected anomaly in each frame. Therefore, VAR considers a dataset of tuples (V, c), where c indicates the type of anomaly in the video (c = ∅ means no anomaly is present, thus being Normal). In the following, we omit the subscripts for the purpose of readability.\nTo address the video-level supervision and the imbalance between normal videos and abnormal ones in VAD, the Multiple Instance Learning (MIL) framework (Sultani et al., 2018) is widely used. MIL models each video as a bag of segments V = [S_1, \u0026hellip;, S_S] ∈ R^{S×F×D}, where S is the number of segments, F is the number of frames in each segment, and D is the number of features associated to each frame. Each segment can be seen as S = [x_1, \u0026hellip;, x_F] ∈ R^{F×D}, where x ∈ R^D is the feature corresponding to each frame. MIL computes a likelihood of each frame being anomalous, selects the most anomalous ones based on it, and maximises the difference in the predicted likelihood between the normal frames and the ones selected as the most anomalous.\nIn this paper, we propose to leverage the CLIP model (Radford et al., 2021) to address VAR and show that:\ni) the alignment between the visual and textual modalities in the CLIP feature space can be used as an effective likelihood estimator for anomalies;\nii) such an estimator can detect not only anomalous occurrences, but also their types;\niii) such an estimator is effective only when adopting our proposed CLIP space re-centring transformation (see Fig. 2 (a)).\nFig. 2: (a) Illustration of the CLIP space and the effects of the re-centring transformation with features of normal. When the space is not re-centred around the normality prototype m, directions d′ are similar, making it difficult to discern anomaly types, and feature magnitude is not linked to the degree of anomaly, making it difficult to identify anomalous events. When re-centred, the distribution of the magnitudes of features projected on each d identifies the degree of detected anomaly of the corresponding type. (b) Illustration of our proposed framework. The Selector model learns directions d using CoOp (Zhou et al., 2022), and uses them to identify the likelihood of each feature x to represent an occurrence of the corresponding anomalous class. MIL selection of the top-K and bottom-K abnormal segments is performed by considering the distribution of likelihoods along the corresponding direction. A Temporal model performs temporal aggregation of the features to produce the final prediction.\nOur method is composed of two models as shown in Fig. 2 (b): a Selector model and a Temporal model. The Selector model S produces the likelihood that each frame belongs to an anomalous class S(x) ∈ R^C, where C is the number of anomalous classes. We exploit the vision-text alignment in the CLIP feature space and the CoOp prompt learning approach (Zhou et al., 2022) to estimate this likelihood. The Temporal model T assigns a binary likelihood to each frame of a video indicating whether the frame is anomalous or normal. Unlike S, T exploits temporal information to improve predictions and we implement it with a Transformer network (Ho et al., 2019). The predictions from S and T are then aggregated to produce a distribution indicating the probability of a frame being normal or abnormal, and which abnormal class it belongs to. We train our model using a combination of MIL and regularisation losses. Importantly, as T is randomly initialised, its likelihood scores are less reliable, thus we always use the likelihoods produced by S to perform segment selection in MIL.\nWe describe the proposed Selector model and Temporal model in detail in Sec. 3.1 and Sec. 3.2, respectively. In Sec. 
3.3, we show how we aggregate the predictions of both models for estimating the final probability distribution. Finally, we describe the training and inference in Sec. 3.4.\n3.1. Selector model # It is crucial for VAD and VAR to reliably distinguish anomalous and normal frames in anomalous videos given only video-level weak supervision. Motivated by the recent findings in applying LLV models to video action recognition tasks (Wang et al., 2021; Xu et al., 2021), we propose a novel likelihood estimator, encapsulated by our Selector model, that combines the CLIP (Radford et al., 2021) feature space and the CoOp (Zhou et al., 2022) prompt learning approach to learn a set of directions in this space that identify each type of anomaly and their likelihood.\nOur main intuition (see Fig. 2 (a)) is that the CLIP feature space presents an underlying structure where the set of CLIP features extracted for each frame in the dataset forms a space that is clustered around a central point which we call the normality prototype. Consequently, the difference between a feature and the normality prototype determines important characteristics: the magnitude of the distance reflects the likelihood of it being abnormal, while its direction indicates the type of anomaly. Such important characteristics would not be exploited by a naive application of the CLIP feature space to VAR (see Table 9). 
Unleashing the potential of this space in detecting anomalies thus requires a re-centring transformation, a main contribution of this work.\nFollowing this intuition, we define the normality prototype m as the average feature extracted by the CLIP image encoder E_I on all N frames I contained in videos labelled as normal in the dataset:\nm = (1/N) Σ_{i=1}^{N} E_I(I_i)\nFor each frame I in the dataset, we produce frame features x by subtracting the normality prototype from the CLIP encoded feature, i.e., x = E_I(I) − m.\nWe then exploit the visual-text aligned CLIP feature space and learn the textual prompt embeddings whose directions are used to indicate the anomalous classes. In particular, we employ the prompt learning CoOp method (Zhou et al., 2022), which we find ideal to find such directions as empirically demonstrated by our experiments (see Sec. 4.3).\nGiven a class c and the textual description of the corresponding label t_c expressed as a sequence of token embeddings, we consider a sequence of learnable context vectors t_ctx and derive the corresponding direction for the class d_c ∈ R^D as:\nd_c = E_T([t_ctx, t_c])\nwhere E_T indicates the CLIP text encoder. The use of the textual description acts as a prior for the learned direction to match the corresponding type of anomaly, while the context vectors are jointly optimised during training as part of the parameters of S in order to enable the refinement of the direction. A different direction is learned for each class.\nThe learned directions serve as the base for our Selector model S. As shown in Fig. 2 (b), the magnitude of the projection of frame feature x on direction d_c indicates the likelihood of the anomalous class c:\nP(x, d_c) = (x · d_c) / ‖d_c‖\nwhere P indicates our projection operation. However, simply projecting the feature vector on the direction would make the magnitude of the projection susceptible to scale, where anomalous features of one class can potentially have a different magnitude from features of another anomalous class. 
To mitigate this issue, we perform a batch normalisation (Ioffe and Szegedy, 2015) after the projection which produces a distribution of projected features with zero mean and unitary variance:\nS(x)_c = BN(P(x, d_c))\nwhere BN indicates batch normalisation without affine transformation. As such, we expect within a batch the dominant normal features to be close to the origin and the abnormal features to be at the right side tail of the distribution.\nThe definition of likelihood can be extended to segments by summing the likelihoods of each frame:\nS(S)_c = Σ_{x ∈ S} S(x)_c\n3.2. Temporal Model # The Selector model only learns an initial time-independent separation between anomalous and normal frames as the CLIP model operates at the image frame level. However, the temporal information is an important piece of information for VAR that we can exploit. We thus propose the Temporal model T to model the relationships among frames in both short-term and long-term, to enrich the visual features and to produce the predictions that indicate the likelihood of whether a frame is anomalous:\nWe use a Transformer architecture to capture the short-term temporal dependencies between frames in a segment and the long-term temporal dependencies between all segments in a video, motivated by their success in relevant sequence modelling tasks (Vaswani et al., 2017). As all the video segments of V are received as the input, the large number of segments S and frames F increases the computational requirements for self attention. To reduce this cost, we implement T as an Axial Transformer (Ho et al., 2019) that computes attention separately for the two axes corresponding to the segments and the features in each segment. As suggested by experiments in Sec. 4.3, Axial Transformer is also less prone to over-fitting, a likely case in VAR, as compared to standard Transformer. We terminate the model with a sigmoid activation so that the output likelihood can also be interpreted as a probability.\n3.3. 
Predictions Aggregation # We combine the predictions from S and T to obtain the final output: the probabilities indicating whether a frame is normal or anomalous (p_N(x) and p_A(x)) and the probability that a frame presents an anomaly of a certain class (p_{A,c}(x)).\nGiven an input frame feature x, we define its probability of being anomalous p_A(x) as its corresponding output from the Temporal model T. The probability of the frame being normal is p_N(x) = 1 − p_A(x). To obtain the probability that the frame presents an anomaly of a specific class, p_{A,c}(x), we employ the predictions of the Selector model, which can be seen as the conditional distribution over the anomalous classes p_{c|A}(x) = softmax(S(x)). From the definition of conditional probability it follows that p_{A,c}(x) = p_A(x) · p_{c|A}(x).\n3.4. Training # We train the model following the MIL framework. Specifically, MIL considers a batch with an equal number of normal and anomalous videos, uses the predicted likelihoods to identify the top-K most abnormal segments in anomalous videos, and imposes separation from the other, normal ones (Sultani et al., 2018). Due to the higher capacity of T with respect to S and its random initialisation, T cannot directly perform this selection, since its predicted likelihoods would be excessively noisy. Instead, we use the likelihood predictions from S to perform MIL segment selection.\nOur framework is trained end-to-end using losses on anomalous videos, losses on normal videos, and regularization losses, which we describe in the following.\nGiven an anomalous video V of class c, we define the set of top-K most anomalous segments V+ = {S+_1, \u0026hellip;, S+_K} and, symmetrically, the set of bottom-K least anomalous segments V− = {S−_1, \u0026hellip;, S−_K} according to the likelihood assigned by the frame-level model S on the direction corresponding to class c. 
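
As a minimal illustration, and not the authors' implementation, the prediction aggregation of Sec. 3.3 and the top-K / bottom-K segment selection just described could be sketched in NumPy as follows. The function names `aggregate_predictions` and `select_segments` and the array layouts are assumptions made for this example; the segment score is taken to be the sum of per-frame Selector likelihoods, as stated in Sec. 3.1.

```python
import numpy as np


def aggregate_predictions(p_anomalous, selector_scores):
    """Combine Temporal-model and Selector outputs for T frames.

    p_anomalous:     shape (T,), p_A(x) from the Temporal model, in [0, 1].
    selector_scores: shape (T, C), Selector likelihoods S(x), one per class.
    Returns p_N with shape (T,) and p_{A,c} with shape (T, C).
    """
    # Numerically stable softmax over classes gives p_{c|A}(x).
    shifted = selector_scores - selector_scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    p_class_given_anomalous = exp / exp.sum(axis=1, keepdims=True)
    p_normal = 1.0 - p_anomalous
    # p_{A,c}(x) = p_A(x) * p_{c|A}(x)
    p_joint = p_anomalous[:, None] * p_class_given_anomalous
    return p_normal, p_joint


def select_segments(frame_likelihoods, k):
    """Pick the top-K and bottom-K segments of one anomalous video.

    frame_likelihoods: shape (S, F), Selector likelihood of the video's
    class c for each of the F frames in each of the S segments.
    Returns (top_idx, bottom_idx), each holding k segment indices.
    """
    segment_scores = frame_likelihoods.sum(axis=1)  # sum over frames
    order = np.argsort(segment_scores)              # ascending order
    bottom_idx = order[:k]                          # least anomalous first
    top_idx = order[-k:][::-1]                      # most anomalous first
    return top_idx, bottom_idx
```

By construction, p_N(x) and the C values of p_{A,c}(x) sum to one for every frame, so the aggregated output is a distribution over "normal" plus the anomalous classes.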
We consider all frames in V+ and maximise the likelihood of the corresponding class being predicted by S by minimising the loss L^DIR_A:\nwhere the likelihood tensor is indexed using the class c. To provide gradients to the Temporal model, we also maximise p_{A,c}(x) for each frame contained in the segments using cross entropy:\nDistinguishing normal and anomalous frames in anomalous videos is a challenging problem in VAR due to the appearance similarity between frames of the same video. To foster a better separation between these frames, we additionally consider V− and maximise p_N(x) for each frame in the segments using cross entropy:\nTo leverage the information in normal videos, for each segment S_i in a normal video V, we minimise the likelihood predicted by the Selector model:\nFollowing the VAD literature (Feng et al., 2021a; Sultani et al., 2018; Tian et al., 2021), we also require the model to maximise the probability of each frame in its top-K most abnormal segments V+ = {S+_1, \u0026hellip;, S+_K} being normal:\nWe regularise training with two additional losses (Sultani et al., 2018) applied to all frames of anomalous videos only. One is a sparsity loss on the predicted scores, which encourages as few frames as possible to be predicted as abnormal:\nThe other is a smoothness term that regularises the predictions along the temporal dimension:\nwhere indexing is performed on the flattened sequence of frames in the video.\nWe jointly train the Selector and Temporal models using as final training objective:\n4. Experiments # In this section, we validate our method against a range of baselines taken from state-of-the-art VAD and action recognition methods, which we adapt to the VAR task. After introducing the metrics for the novel VAR task, we evaluate on three datasets and compare methods on both the VAD and VAR tasks. An extensive ablation study is performed to justify our main design choices. 
Sec. 4.1 describes our experiment setup in terms of datasets and evaluation protocols. We then present and discuss the results in comparison against state-of-the-art methods in Sec. 4.2 and the ablation study in Sec. 4.3.\n4.1. Experiment Setup # Datasets. We perform our study using three widely-used VAD datasets, i.e., ShanghaiTech (Liu et al., 2018), UCF-Crime (Sultani et al., 2018), and XD-Violence (Wu et al., 2020). ShanghaiTech consists of 437 videos, recorded from multiple surveillance cameras in a university campus. A total of 130 abnormal events of 17 anomaly classes are captured in 13 different scenes. We adopt the dataset in the configuration of Zhong et al. (2019), which adapts it to the weakly-supervised setting by organising it into 238 training videos and 199 testing videos. UCF-Crime is a large-scale dataset of real-world surveillance videos, containing 1900 long untrimmed videos that cover 13 real-world anomalies with significant impacts on public safety. The training set consists of 800 normal and 810 anomalous videos and the testing set includes the remaining 150 normal and 140 anomalous videos. XD-Violence is a large-scale violence detection dataset comprising 4754 untrimmed videos with audio signals and weak labels, divided into a training set of 3954 videos and a test set of 800 videos. With a total duration of 217 hours, the dataset covers various scenarios and captures 6 categories of anomalies. Notably, each violent video may have multiple labels, ranging from 1 to 3. To accommodate our training setup, where only one anomaly type per video is considered, we select the subset of 4463 videos containing at most one anomaly.\nPerformance Metrics. We perform evaluation in terms of both VAD and VAR. Following previous works, we measure the performance regarding VAD using the area under the curve (AUC) of the frame-level receiver operating characteristics (ROC) as it is agnostic to thresholding for the detection task. 
A larger frame-level AUC means better performance in discriminating between normal and anomalous events. To measure the VAR performance, we extend the AUC metric to the multi-classification scenario. For each anomalous class, we measure the AUC by considering the anomalous frames of the class as positives and all other frames as negatives. The mean AUC (mAUC) is then computed over all the anomalous classes. Similarly, for the XD-Violence dataset, we follow the established evaluation protocol (Wu et al., 2020) and present VAD results using the average precision (AP) of the precision-recall curve (PRC), while for VAR results we report the mean AP (mAP), which is calculated by averaging the binary AP values across all anomalous classes.\nImplementation details. At training time, each video is divided into S non-overlapping blocks. From each block, a random start index is sampled, from which a segment of F consecutive frames is taken. If the raw video is shorter than S × F frames, we adopt loop padding and repeat the video from the start until the minimum length of S × F is reached. Each mini-batch of size B used for training is composed of B/2 normal clips and B/2 anomalous clips. This is a simple but effective way to balance the mini-batch formation, which would otherwise contain mainly normal clips. At inference, to handle videos covering arbitrary temporal windows, we first divide each video V into S non-overlapping blocks, where each block contains a number of frames that is a multiple of F, i.e., J × F, where J depends on the length of V ⋄. We process V with J inferences to classify all frames in the video. At the j-th inference, we extract the j-th group of F consecutive frames from each block, forming segments with a total of S × F frames that span the whole video. We then feed the segments into our approach so that our Temporal model can reason about the long-term temporal relationships among segments.\nFor a fair comparison with previous works in VAD (Tian et al. 
, 2021; Wu et al. , 2022; Li et al. , 2022a), we use K = 3 for the MIL selection of the top-K and bottom-K abnormal segments, S = 32 number of segments, F = 16 frames per segment and B = 64 batch size. Please refer to Appendix A for more implementation details and Appendix B for more details on hyper-parameters.\n⋄\nWe perform loop padding to ensure that each video is of length J × S × F\nTable 1: Results of the state-of-the-art methods and our AnomalyCLIP in terms of VAD and VAR on ShanghaiTech.\nSupervision Method Features VAD VAR AUC(%) m mAUC(%) One-class MNAD (Park et al., 2020) MPN (Lv et al., 2021) HF2VAD (Liu et al., 2021) Zaheer et al. (2022) ✓ ✓ ✓ 70.50 73.80 One-class MNAD (Park et al., 2020) MPN (Lv et al., 2021) HF2VAD (Liu et al., 2021) Zaheer et al. (2022) ✓ 73.80 HF2VAD (Liu et al., 2021 ✓ 76.20 Ui Zaheer et al(2022) ResNext ✓ 7893 Unsupervised () CLIP (Rdfd t l ResNext i/6 ✓ ✓ 49.17 51.02 Sultani et al. (2018) C3D-RGB ✓ 86.30 IBL (Zhang et al., 2019) C3D-RGB ✓ 82.50 Zaheer et al. (2022) ResNext ✓ 86.21 GCN (Zhong et al., 2019) TSN-RGB ✓ 84.44 MIST (Feng et al., 2021a I3D-RGB ✓ 94.83 Wu et al. (2020) I3D-RGB ✓ Weakly- CLAWS (Zaheer et al., 2020) C3D-RGB ✓ 89.67 Weakly- RTFM (Tian et al., 2021) I3D-RGB ✓ 97.21 81.60 Wu and Liu (2021) I3D-RGB ✓ 97.48 MSL (Li et al., 2022b) I3D-RGB ✓ 96.08 MSL (Li et al., 2022b) VideoSwin-RGB ✓ 97.32 S3R (Wu et al., 2022) I3D-RGB ✓ 97.48 87.88 MGFN (Chen et al., 2022) I3D-RGB ✓ MGFN (Chen et al., 2022) VideoSwin-RGB ✓ ActionCLIP (Wang et al., 2021) ViT-B/16 ✓ 96.36 75.63 ActionCLIP (Wang et al., 202 ViT-B/16 ✓ 96.36 75.63 Table 2: Results of the state-of-the-art methods and our AnomalyCLIP in terms of VAD and VAR on UCF-Crime.\nSupervision Method Features VAD V AR AUC mAUC(%) One-class SVM Baseline (Sultani et al., 2018) SSV (Sohrab et al., 2018) BODS (Wang and Cherian, 2019) GODS (Wang and Cherian, 2019) Zaheer et al. 
(2022) I3D-RGB I3D-RGB ✓ ✓ ✓ ✓ ✓ 50.00 58.50 68.26 70.46 74.20 () Un-supervised CLIP (Radford et al2021 ResNext ✓ ✓ 5863 74.28 BODS (Wang and Cherian, 2019) GODS (Wang and Cherian, 2019) Zaheer et al. (2022) I3D-RGB ✓ 686 70.46 74.20 Zaheer et al. (2022) I3D-RGB ✓ 74.20 Zaheer et al. (2022) ResNext ✓ 74.20 Un-supervised Zaheer et al. (2022) ViT-B/16 58.63 74.28 Sultani et al. (2018) C3D-RGB ✓ 75.41 Sultani et al. (2018) I3D-RGB ✓ 77.92 IBL (Zhang et al., 2 C3D-RGB ✓ 79.84 GCN (Zhong et al., 2019 TSN-RGB ✓ 82.12 MIST (Feng et al., 2021a I3D-RGB ✓ 82.30 Wu et al. (2020) I3D-RGB ✓ 82.44 CLAWS (Zaheer et al., 202 C3D-RGB ✓ 82.44 8303 Weakly- supervised RTFM (Tian et al., 2021) VideoSwin-RGB ✓ 83.31 Weakly- supervised RTFM (Tian et al., 2021) I3D-RGB ✓ 84.03 84.86 Wu and Liu (2021) I3D-RGB ✓ 84.89 MSL (Li et al., 2022b I3D-RGB ✓ 85.30 MSL (Li et al., 2022b) VideoSwin-RGB ✓ 85.62 S3R (Wu et al., 2022) I3D-RGB ✓ 85.99 86.55 MGFN (Chen et al., 2022) VideoSwin-RGB 86.67 MGFN (Chen et al., 2022) I3D-RGB 86.98 SSRL (Li et al., 2022a) I3D-RGB ✓ 87.43 88.88 ActionCLIP (Wang et al., 2021 ViT-B/16 82.30 88.88 8772 AnomalyCLIP (ours) ViT-B/16 ✓ 86.36 87.72 Table 3: Results of the state-of-the-art methods and our AnomalyCLIP in terms of VAD and VAR on XD-Violence.\nSupervision Method Features VAD VAR AP(%) mAP(%) Zero-shot CLIP (Radford et al., 2021) ViT-B/16 ✓ 27.21 21.32 Wu et al. (2020) C3D-RGB ✓ 67.19 Wu et al. (2020) I3D-RGB ✓ 73.2 Weakly- supervised MSL (Li et al., 2022b) C3D-RGB ✓ 75.53 Weakly- supervised Wu and Liu (2021) I3D-RGB ✓ 75.9 Weakly- supervised RTFM (Tian et al., 2021) I3D-RGB ✓ 77.81 43.04 Weakly- supervised MSL (Li et al., 2022b) I3D-RGB ✓ 78.28 MSL (Li et al., 2022b) VideoSwin-RGB ✓ 78.58 S3R (Wu et al., 2022) I3D-RGB ✓ 80.26 36.06 MGFN (Chen et al., 2022) I3D-RGB ✓ 79.19 MGFN (Chen et al., 2022) VideoSwin-RGB ✓ 80.11 ActionCLIP (Wang et al., 2021) ViT-B/16 ✓ 61.01 40.24 AnomalyCLIP (ours) ViT-B/16 ✓ ✓ 78.51 49.41 4.2. 
Evaluation Against Baselines # Regarding VAD, we compare AnomalyCLIP against state-of-the-art methods with different supervision setups, including one-class (Park et al., 2020; Liu et al., 2021; Lv et al., 2021), unsupervised (Zaheer et al., 2022) and weakly-supervised (Li et al., 2022a; Tian et al., 2021; Wu et al., 2022). As none of the above-mentioned methods address the VAR task, we produce baselines by re-purposing some of the best-performing VAD methods, including RTFM (Tian et al., 2021), S3R (Wu et al., 2022) and SSRL (Li et al., 2022a) ⋄, and CLIP-based baselines (Radford et al., 2021; Wang et al., 2021):\nMulti-classification with RTFM (Tian et al., 2021), S3R (Wu et al., 2022) and SSRL (Li et al., 2022a) (weakly-supervised). We keep the original pretrained model frozen and add a multi-class classification head that we train to predict the class using a cross entropy objective on the top-K most anomalous segments selected as in the original method.\nCLIP (Radford et al., 2021) (zero-shot). We obtain the classification by applying a softmax to the cosine similarities of the input frame feature x with the vectors corresponding to the embedding of the textual prompt \u0026ldquo;a video from a CCTV camera of a {class}\u0026rdquo; using the pre-trained CLIP model.\nActionCLIP (Wang et al., 2021) (weakly-supervised). We retrain ActionCLIP (Wang et al., 2021) on our datasets by propagating the video-level anomaly labels to each frame of the corresponding video.\nTable 1 presents the results on ShanghaiTech (Liu et al., 2018). Although ShanghaiTech is a rather saturated dataset for VAD due to the simplicity of its scenarios, AnomalyCLIP achieves state-of-the-art results on both VAD and VAR, with +0.09% and +2.85% in terms of AUC ROC and mAUC ROC, respectively. ActionCLIP (Wang et al. 
, 2021) performs poorly in terms of mAUC, which we attribute to the low proportion of abnormal events in ShanghaiTech, which makes the MIL selection strategy particularly important for avoiding incorrect supervisory signals on normal frames of abnormal videos. In contrast, our proposal better recognises the positive instances in abnormal videos, thus achieving better performance even when anomalies are rare. AnomalyCLIP achieves a large improvement of +45.44% in terms of mAUC against zero-shot CLIP, demonstrating that a naive application of a VAR pipeline in the CLIP space does not yield satisfactory results. A revision of this space, implemented as our proposed transformation, is necessary to use it effectively.\nTable 2 reports the results on UCF-Crime (Sultani et al., 2018). Our method exhibits the best discrimination of the anomalous classes, achieving the highest mAUC ROC among baselines. Similar to ShanghaiTech, it also achieves an improvement in terms of mAUC against zero-shot CLIP, verifying the importance of our proposed adaptation of the CLIP space. Compared to ActionCLIP (Wang et al., 2021), our AnomalyCLIP obtains +2.94% in terms of mAUC, highlighting the need for a MIL framework to mitigate mis-assignment of anomalous class labels to normal frames of anomalous videos. It is also worth noting that the higher mAUC obtained by ActionCLIP does not result in a competitive AUC ROC on VAD, which implies a worse separation between normal and abnormal frames. When compared to the best performing method on VAD, SSRL (Li et al., 2022a), our method obtains an improvement of +1.78% in terms of mAUC on VAR, while being slightly worse with −1.07% in terms of AUC ROC on VAD.\n⋄ We thank the authors for making their code and models publicly available.\n
Table 4: Results of the state-of-the-art methods and our AnomalyCLIP in terms of VAR on UCF-Crime (per-class AUC and mAUC, %). The table highlights the top performers, with cells highlighted in red representing first place, cells in orange representing second place, and cells in yellow representing third place.\n| Method | Abuse | Arrest | Arson | Assault | Burglary | Explosion | Fighting | RoadAcc | Robbery | Shooting | Shoplifting | Stealing | Vandalism | mAUC |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| RTFM (Tian et al., 2021) | 79.99 | 62.57 | 90.53 | 82.27 | 85.53 | 92.76 | 85.21 | 90.31 | 81.17 | 82.82 | 92.56 | 90.23 | 87.20 | 84.86 |\n| S3R (Wu et al., 2022) | 86.38 | 68.45 | 92.19 | 93.55 | 86.91 | 93.55 | 81.69 | 85.03 | 82.07 | 85.32 | 91.64 | 94.59 | 83.82 | 86.55 |\n| SSRL (Li et al., 2022a) | 95.33 | 79.26 | 93.27 | 91.74 | 89.06 | 92.25 | 87.36 | 80.24 | 87.75 | 84.50 | 92.31 | 94.22 | 88.17 | 88.88 |\n| CLIP zero-shot (Radford et al., 2021) | 57.37 | 80.65 | 93.72 | 80.83 | 74.34 | 90.31 | 83.54 | 87.46 | 70.22 | 63.99 | 71.21 | 45.49 | 66.45 | 74.28 |\n| ActionCLIP (Wang et al., 2021) | 91.88 | 90.47 | 89.21 | 86.87 | 81.31 | 94.08 | 83.23 | 94.34 | 82.82 | 70.53 | 91.60 | 94.06 | 89.89 | 87.72 |\n| AnomalyCLIP | 75.03 | 94.56 | 96.66 | 94.80 | 90.08 | 94.79 | 88.76 | 93.30 | 86.85 | 87.45 | 89.47 | 97.00 | 89.78 | 90.66 |\n
Table 5: Results of the state-of-the-art methods and our AnomalyCLIP in terms of VAR on ShanghaiTech (per-class AUC and mAUC, %). The table highlights the top performers, with cells highlighted in red representing first place, cells in orange representing second place, and cells in yellow representing third place.\n| Method | Car | Chasing | Circuit | Fall | Fighting | Jumping | Monocycle | Push | Robbery | Running | Skateboard | Stoop | ThrowingObj | Vaudeville | Vehicle | mAUC |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| RTFM (Tian et al., 2021) | 99.70 | 95.41 | 99.83 | 70.19 | 97.36 | 89.14 | 37.9 | 35.28 | 67.01 | 90.59 | 96.81 | 64.1 | 97.93 | 91.75 | 90.85 | 81.60 |\n| S3R (Wu et al., 2022) | 98.71 | 96.80 | 99.97 | 85.63 | 95.93 | 69.33 | 96.82 | 54.76 | 61.19 | 94.43 | 96.92 | 75.46 | 97.63 | 97.78 | 96.84 | 87.8 |\n| SSRL (Li et al., 2022a) | 99.35 | 97.31 | 99.95 | 91.24 | 96.88 | 93.07 | 89.7 | 90.62 | 91.81 | 94.47 | 97.73 | 71.8 | 98.44 | 96.32 | 95.49 | 93.6 |\n| CLIP zero-shot (Radford et al., 2021) | 61.65 | 77.88 | 5.95 | 61.73 | 79.37 | 23.68 | 77.7 | 63.36 | 37.71 | 54.39 | 76.15 | 8.47 | 44.10 | 65.97 | 27.08 | 51.02 |\n| ActionCLIP (Wang et al., 2021) | 98.50 | 93.86 | 98.59 | 16.38 | 97.45 | 89.63 | 98.0 | 8.14 | 67.36 | 78.25 | 97.10 | 0.76 | 97.70 | 98.65 | 93.97 | 75.63 |\n| AnomalyCLIP | 98.08 | 96.66 | 97.97 | 96.69 | 98.03 | 95.48 | 86.8 | 97.99 | 95.00 | 97.95 | 97.29 | 98.62 | 96.50 | 96.97 | 96.79 | 96.46 |\n
Table 6: Results of the state-of-the-art methods and our AnomalyCLIP in terms of VAR on XD-Violence (per-class AP and mAP, %). The table highlights the top performers, with cells highlighted in red representing first place, cells in orange representing second place, and cells in yellow representing third place.\n| Method | Abuse | CarAccident | Explosion | Fighting | Riot | Shooting | mAP |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| RTFM (Tian et al., 2021) | 9.25 | 25.36 | 53.53 | 61.73 | 90.38 | 18.01 | 43.04 |\n| S3R (Wu et al., 2022) | 2.63 | 23.82 | 45.29 | 49.88 | 90.41 | 4.34 | 36.06 |\n| CLIP zero-shot (Radford et al., 2021) | 0.32 | 12.21 | 22.26 | 25.25 | 66.60 | 1.26 | 21.32 |\n| ActionCLIP (Wang et al., 2021) | 2.73 | 25.15 | 55.28 | 58.09 | 87.31 | 12.87 | 40.24 |\n| AnomalyCLIP | 6.10 | 31.31 | 68.75 | 71.44 | 92.74 | 26.13 | 49.41 |\n
Table 3 shows the results on XD-Violence (Wu et al., 2020). AnomalyCLIP outperforms other state-of-the-art methods on VAR, achieving the highest mAP. On VAD, AnomalyCLIP outperforms RTFM (Tian et al., 2021) and demonstrates performance close to S3R (Wu et al., 2022). Please refer to Appendix C for further details on how we obtain results on XD-Violence.\nTables 4, 5, and 6 display the multi-class AUC and AP for each individual abnormal class. The proposed method has a clear advantage when applied to the UCF-Crime and XD-Violence datasets, which are generally considered to be complex benchmarks in anomaly detection. 
Our method achieves the best mAUC and mAP on average, while it is less advantageous when dealing with anomalies that exhibit slight deviations from normal patterns, such as Shoplifting in UCF-Crime. The advantage of our proposed method is less noticeable when applied to the ShanghaiTech dataset, which captures simple scenes where most methods have achieved a saturated performance.\nFig. 3 presents the qualitative results of our proposed AnomalyCLIP in detecting and recognising anomalies within a set of UCF-Crime, ShanghaiTech, and XD-Violence test videos. The model is capable of predicting both the presence of anomalies in test videos and the category of the anomalous event. In the video Normal Video 246 from UCF-Crime (Row 2, Column 2), some frames have a higher-than-expected probability of being abnormal. In the video RoadAccidents133 from UCF-Crime (Row 1, Column 2), the anomaly score remains high even in the aftermath of the accident. For Normal videos, AnomalyCLIP obtains a relatively low anomaly probability across all frames, meaning our model has learnt a robust representation of normality. Please refer to Appendix E for more results on the test videos. Furthermore, for a more intuitive understanding of the results presented in the paper, we invite readers to access the website https://lucazanella-dvl.github.io/AnomalyCLIP, where easily accessible qualitative results are available.\n4.3. Ablation # In this section, we perform ablations of our method on UCF-Crime to validate our main design choices: the way in which we represent and learn directions, the transformations applied to the CLIP space and the method used to estimate the likelihood of anomaly, the choice of architecture for the Temporal model, the training objectives, and the impact of using features extracted from different backbones.\nRepresentation and Learning of the Directions. 
In the ablation shown in Table 7, we evaluate the choice of the CoOp (Zhou et al., 2022) framework to learn directions in the CLIP space. When CoOp is removed, we directly learn the directions from randomly initialized points in the CLIP space (Row 1) or make use of fixed engineered prompts of the form \u0026ldquo;a video from a CCTV camera of a {class}\u0026rdquo; (Row 2). Both choices result in degradation of the results, indicating that text-guided initialization of the directions and finetuning of the directions are both necessary. Furthermore, we show that unfreezing the last projection of the text encoder (Row 4) enables greater freedom in finetuning the discovered directions, yielding the best results.\n
Fig. 3: Qualitative results for VAR on four test videos from UCF-Crime (the top two rows), two test videos from ShanghaiTech (the third row), and two test videos from XD-Violence (the bottom row). For each video, we show at the bottom the predicted probability of each frame being anomalous by our model over the number of frames. We showcase some key frames to reflect the relevance between the predicted anomaly probability and the visual content. The red shaded areas denote the temporal ground-truth of anomalies. We also indicate the predicted anomalous class for detected abnormal frames in the red boxes, while videos without detected anomalies are indicated with blue boxes as Normal.\n
Table 7: Ablation on representation and learning of the directions of abnormality. \u0026lsquo;Finetuning\u0026rsquo; indicates that the last projection layer is fine-tuned. The final configuration of our model is represented by the row highlighted in grey in the table.\n| Text encoder | Directions | AUC | mAUC |\n| --- | --- | --- | --- |\n| No | Direct Optimisation | 84.98 | 69.86 |\n| Frozen | Engineered Prompts | 84.66 | 81.35 |\n| Frozen | CoOp | 85.88 | 87.39 |\n| Finetuning | CoOp | 86.36 | 90.66 |\n
Table 8: Comparisons of different architectural choices for the CoOp module. \u0026lsquo;Shared\u0026rsquo; means that all the classes share a unified context, otherwise each class has a specific context. The final configuration of our model is represented by the row highlighted in grey in the table.\n| Context vectors | Shared | AUC | mAUC |\n| --- | --- | --- | --- |\n| 4 | | 86.16 | 91.05 |\n| 8 | | 86.36 | 90.66 |\n| 16 | | 85.82 | 90.65 |\n| 8 | ✓ | 85.97 | 90.01 |\n
In the ablation shown in Table 8, we evaluate the architectural choices on the CoOp module to learn directions in the CLIP space. Specifically, we vary the number of context vectors t_ctx among 4, 8 and 16, and compare shared against class-specific context vectors. Although using 4 context vectors results in a slightly higher mAUC score, we eventually opted to use 8 context vectors because they produce a higher AUC score. The results (Rows 2 and 4) show that learning a specific set of context vectors for each class is better suited to fine-grained categories than relying on more generic context vectors shared by all classes.\nLikelihood Estimation and CLIP Latent Space Transformation. The way in which the extracted CLIP features are transformed and the chosen likelihood estimation method play a crucial role in the quality of segment selection. We evaluate several choices in this procedure in Table 9. First, directly using the CLIP space and cosine similarities with the learned directions as likelihood estimators (Row 1) produces the worst VAR results, indicating that the use of the normality prototype m is of high importance in the context of anomaly detection. Second, Row 2 shows that MIL segment selection as a function of the feature magnitude without accounting for the direction is not as effective, given that a large magnitude could be attributed to irrelevant factors.\nTemporal Model Architecture. 
Capturing temporal information is an essential aspect of VAR since it provides insights into the behaviour of objects and scenes over time. Table 10 shows results for different architectures of T, i.e., a 3-layer MLP, two Transformer Encoders (Vaswani et al., 2017), the multi-scale temporal network (MTN), designed in RTFM and used in S3R and SSRL, and the employed Axial Transformer. In particular, one transformer encoder (Row 2) performs self-attention on each independent 16-frame segment, solely modelling short-term dependencies. The other (Row 3) applies self-attention on segment embeddings, which are obtained by averaging the 16 frame feature embeddings within each segment, thereby only modelling long-term dependencies. To ensure a fair comparison, both transformers are designed to have a capacity similar to that of the Axial Transformer. The reduced performance of the MLP baseline (Row 1) indicates the necessity of considering temporal information, which is not readily available in the extracted CLIP features. The Axial Transformer captures both kinds of temporal dependencies and outperforms the compared architectures.\n
Table 9: Ablation of different likelihood estimation methods, feature space transformations and MIL selection. \u0026lsquo;Features\u0026rsquo; indicates the transformation applied to CLIP features. The final configuration of our model is represented by the row highlighted in grey in the table.\n| Likelihood | Features | MIL Selection | AUC | mAUC |\n| --- | --- | --- | --- | --- |\n| cosine sim. | CLIP | cosine sim. | 85.59 | 83.69 |\n| S | CLIP - m | feature magnitude | 84.92 | 89.82 |\n| S | CLIP - m | S | 86.36 | 90.66 |\n
Table 10: Comparisons of different architectural choices for the Temporal model. The final configuration of our model is represented by the row highlighted in grey in the table.\nTemporal Model  Short-term  Long-term  AUC  mAUC\nMLP 74.86 84.46\nTransformer ✓ 84.69 88.38\nTransformer ✓ 85.1 89.29\nMTN ✓ 82.71 87.65\nAxial Transformer ✓ ✓ 86.36 90.66\n
Table 11 shows the results for different values of the embedding size and the number of layers. In the final architecture we use 1 layer and an embedding size of 256, for a total of 10.4 M trainable parameters.\nLosses. Table 12 illustrates the contribution of the losses on the Selector model\u0026rsquo;s outputs, where we progressively remove the losses from the full training objective. The loss on abnormal videos contributes to improved VAD and VAR results on UCF-Crime. Similarly, using the loss on normal videos improves the results on ShanghaiTech and XD-Violence, as can be seen in Tables D.16 and D.17 of Appendix D.\nTable 13 similarly shows the contribution of the losses on the aggregated model\u0026rsquo;s output, where we remove each from the complete training objective. We validate that each of the proposed losses promotes performance on both the VAD and VAR tasks.\nThe bottom-K least anomalous segments V− = {S−_1, \u0026hellip;, S−_K} of anomalous videos proved to be beneficial for learning the Temporal Model. Inspired by this, we analyse the impact of incorporating this set of frames into the Selector Model loss by minimising the loss:\n
Table 11: Comparisons of different architectural choices for the Axial Transformer. The final configuration of our model is represented by the row highlighted in grey in the table.\n| Embedding size | Number of layers | AUC | mAUC |\n| --- | --- | --- | --- |\n| 64 | 1 | 82.83 | 90.1 |\n| 128 | 1 | 84.97 | 90.53 |\n| 256 | 1 | 86.36 | 90.66 |\n| 512 | 1 | 85.51 | 89.28 |\n| 256 | 2 | 85.89 | 89.67 |\n| 256 | 3 | 85.15 | 88.14 |\nTable 12: Ablation of the losses on the Selector model. 
The final configuration of our model is represented by the row highlighted in grey in the table.\nL^DIR_A  L^DIR_N  AUC  mAUC\n✓ 85.89 89.34\n✓ 85.91 87.26\n✓ 86.46 90.75\n✓ ✓ 86.36 90.66\nMoreover, instead of using all segments of normal videos in the Selector Model loss, we evaluate the impact of using only the top-K most abnormal segments V+ = {S+_1, \u0026hellip;, S+_K} by minimising the likelihood predicted by the Selector Model:\nIn Table 14, we present our findings, which indicate that modifying the loss function in either of the two ways causes a degradation of performance. Specifically, our experiments (Row 1) demonstrate that using the bottom-K least abnormal segments is only effective when learning the Temporal Model. This is because, if there is no clear separation between the bottom-K and top-K abnormal features, the Selector Model may select incorrect bottom-K features, which prevents it from learning good directions in the feature space. However, incorporating the bottom-K least abnormal segments becomes beneficial in the Temporal Model, which has a greater capacity. Furthermore, our experiments indicate that using all normal segments (Row 3) provides a more robust estimation of the direction from normal to anomalous compared to using only the top-K most abnormal segments (Row 2).\nFeature Representation. The purpose of this ablation study is to determine the most suitable feature space for the proposed method AnomalyCLIP. To achieve this, we first investigate whether the space learned by the Selector Model can be applied to the Temporal Model. This C-dimensional space is formed by projecting each frame feature x onto every direction d_c, where C represents the number of anomalous classes. Our results, presented in Table 15, indicate that using only this space leads to sub-optimal model performance (Row 3). This finding highlights the necessity of incorporating the information contained in the original feature space as well. 
We also experiment with using I3D features for both the Selector Model and the Temporal Model (Row 1), but the results demonstrate that the model using these features performs worse. We attribute this to the fact that I3D features are mapped to a region of space that is not aligned with the text features, unlike the features generated by CLIP\u0026rsquo;s image encoder. For this reason, we also experimented with using I3D features for the Temporal Model and features from CLIP\u0026rsquo;s image encoder for the Selector Model (Row 2). The result of this experiment further emphasises that the latent space of CLIP is a more semantic space in which anomalous events of different classes are more separated, which in turn leads to superior discriminative ability in detecting and recognising anomalous events.\n
Table 13: Ablation of losses on the aggregated outputs. The final configuration of our model is represented by the row highlighted in grey in the table.\nL_A+  L_A−  L_N+  AUC  mAUC\n✓ ✓ 45.23 69.57\n✓ ✓ 84.5 90.88\n✓ ✓ 80.96 86.1\n✓ ✓ ✓ 86.36 90.66\n
Table 14: Ablation on the variation of Selector model losses. The final configuration of our model is represented by the row highlighted in grey in the table.\nL^DIR_A  L^DIR_N  L^DIR_N+  L^DIR_A−  AUC  mAUC\n✓ ✓ ✓ 86.41 88.29\n✓ ✓ 86.17 90.53\n✓ ✓ 86.36 90.66\n
5. Conclusions # In this work, we addressed the challenging task of Video Anomaly Recognition, which extends the scope of Video Anomaly Detection by further requiring the classification of the anomalous activities. We proposed AnomalyCLIP, the first method that leverages LLV models in the context of VAR. Our work shed light on the fact that a naive application of existing LLV models (Radford et al., 2021; Wang et al., 2021) to VAR leads to unsatisfactory performance, and we demonstrated that several technical design choices are required to build a multimodal deep network for detecting and classifying abnormal behaviours. 
We also performed an extensive experimental evaluation showing that AnomalyCLIP achieves state-of-the-art VAR results on the benchmark ShanghaiTech (Liu et al., 2018), UCF-Crime (Sultani et al., 2018), and XD-Violence (Wu et al., 2020) datasets. As future work, we plan to extend our method to open-set scenarios to reflect real-world applications where anomalies are often not pre-defined. We will also investigate the applicability of our method in other multi-modal tasks, e.g., fine-grained classification.\nTable 15: Comparisons of different features. The final configuration of our model is represented by the row highlighted in grey in the table.\n\n| Selector Model | Temporal Model | AUC | mAUC |\n| --- | --- | --- | --- |\n| I3D-RGB | I3D-RGB | 65.05 | 84.24 |\n| ViT-B/16 | I3D-RGB | 78.11 | 88.26 |\n| ViT-B/16 | S(x) | 84.44 | 86.78 |\n| ViT-B/16 | ViT-B/16 | 86.36 | 90.66 |\n\nReferences # Bai, S., He, Z., Lei, Y., Wu, W., Zhu, C., Sun, M., Yan, J., 2019. Traffic anomaly detection via perspective map based on spatial-temporal information matrix, in: CVPR Workshops. Bao, Q., Liu, F., Liu, Y., Jiao, L., Liu, X., Li, L., 2022. Hierarchical scene normality-binding modeling for anomaly detection in surveillance videos, in: ACM Multimedia. Carreira, J., Zisserman, A., 2017. Quo vadis, action recognition? a new model and the kinetics dataset, in: CVPR. Chen, Y., Liu, Z., Zhang, B., Fok, W., Qi, X., Wu, Y.C., 2022. Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. arXiv. Feng, J.C., Hong, F.T., Zheng, W.S., 2021a. Mist: Multiple instance self-training framework for video anomaly detection, in: CVPR. Feng, X., Song, D., Chen, Y., Chen, Z., Ni, J., Chen, H., 2021b. Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection, in: ACM Multimedia. Ho, J., Kalchbrenner, N., Weissenborn, D., Salimans, T., 2019. Axial attention in multidimensional transformers. arXiv. Ioffe, S., Szegedy, C., 2015. 
Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: ICML. Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T., 2021. Scaling up visual and vision-language representation learning with noisy text supervision, in: ICML. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al., 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 . Li, G., Cai, G., Zeng, X., Zhao, R., 2022a. Scale-aware spatio-temporal relation learning for video anomaly detection, in: ECCV, Springer. Li, S., Liu, F., Jiao, L., 2022b. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection, in: AAAI. Liu, W., Luo, W., Lian, D., Gao, S., 2018. Future frame prediction for anomaly detection–a new baseline, in: CVPR. Liu, Z., Nie, Y., Long, C., Zhang, Q., Li, G., 2021. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flowguided frame prediction, in: ICCV. Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H., 2022. Video swin transformer, in: CVPR. Loshchilov, I., Hutter, F., 2019. Decoupled weight decay regularization, in: ICLR. Lv, H., Chen, C., Cui, Z., Xu, C., Li, Y., Yang, J., 2021. Learning normal dynamics in videos with meta prototype network, in: CVPR. Mei, T., Zhang, C., 2017. Deep learning for intelligent video analysis, in: ACM Multimedia. Narasimhan, M.G., 2018. Dynamic video anomaly detection and localization using sparse denoising autoencoders. Multimedia Tools and Applications . Nayak, R., Pati, U.C., Das, S.K., 2021. A comprehensive review on deep learning-based methods for video anomaly detection. Image and Vision Computing 106. Park, H., Noh, J., Ham, B., 2020. Learning memory-guided normality for anomaly detection, in: CVPR. 
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al., 2021. Learning transferable visual models from natural language supervision, in: ICML. Roth, K., Pemula, L., Zepeda, J., Schölkopf, B., Brox, T., Gehler, P., 2022. Towards total recall in industrial anomaly detection, in: CVPR. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv. Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A., 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv. Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., Kiela, D., 2022. Flava: A foundational language and vision alignment model, in: CVPR. Sohrab, F., Raitoharju, J., Gabbouj, M., Iosifidis, A., 2018. Subspace support vector data description, in: ICPR. Suarez, J.J.P., Naval Jr, P.C., 2020. A survey on deep learning techniques for video anomaly detection. arXiv. Sultani, W., Chen, C., Shah, M., 2018. Real-world anomaly detection in surveillance videos, in: CVPR. Sun, C., Jia, Y., Wu, Y., 2022. Evidential reasoning for video anomaly detection, in: ACM Multimedia. Tian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J.W., Carneiro, G., 2021. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning, in: ICCV. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. NeurIPS. Wang, G., Yuan, X., Zheng, A., Hsu, H.M., Hwang, J.N., 2019. Anomaly candidate identification and starting time estimation of vehicles from traffic videos, in: CVPR Workshops. Wang, J., Cherian, A., 2019. 
Gods: Generalized one-class discriminative subspaces for anomaly detection, in: ICCV. Wang, M., Xing, J., Liu, Y., 2021. Actionclip: A new paradigm for video action recognition. arXiv . Wang, Z., Zou, Y., Zhang, Z., 2020. Cluster attention contrast for video anomaly detection, in: ACM Multimedia. Wu, J.C., Hsieh, H.Y., Chen, D.J., Fuh, C.S., Liu, T.L., 2022. Self-supervised sparse representation for video anomaly detection, in: ECCV. Wu, P., Liu, J., 2021. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing . Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., Yang, Z., 2020. Not only look, but also listen: Learning multimodal violence detection under weak supervision, in: ECCV. Xu, H., Ghosh, G., Huang, P.Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., Feichtenhofer, C., 2021. Videoclip: Contrastive pretraining for zero-shot video-text understanding, in: EMNLP. Xu, K., Sun, T., Jiang, X., 2019. Video anomaly detection and localization based on an adaptive intra-frame classification network. IEEE Transactions on Multimedia . Zaheer, M.Z., Mahmood, A., Astrid, M., Lee, S.I., 2020. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection, in: ECCV. Zaheer, M.Z., Mahmood, A., Khan, M.H., Segu, M., Yu, F., Lee, S.I., 2022. Generative cooperative learning for unsupervised video anomaly detection, in: CVPR. Zhang, J., Qing, L., Miao, J., 2019. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection, in: ICIP. Zhong, J.X., Li, N., Kong, W., Liu, S., Li, T.H., Li, G., 2019. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection, in: CVPR. Zhou, K., Yang, J., Loy, C.C., Liu, Z., 2022. Learning to prompt for visionlanguage models. International Journal of Computer Vision . 
In this appendix, we provide further details on the implementation and training of the proposed AnomalyCLIP. We also provide more details on how we obtain XD-Violence results. Furthermore, we report supplementary results of the ablation performed on the loss of the Selector model to support our design choices. Lastly, we offer additional qualitative results.\nAppendix A. Implementation Details # Similarly to CoOp (Zhou et al., 2022), context vectors t_ctx are randomly initialised by drawing from a zero-mean Gaussian distribution with standard deviation equal to 0.02. We use the CLIP image encoder (Radford et al., 2021), specifically the ViT-B/16 implementation, without fine-tuning, and apply standard CLIP image augmentations to each frame. As supported by the ablation in Table 11, we employ a one-layer axial transformer (Ho et al., 2019) for the Temporal Model with an embedding size of 256 for UCF-Crime (Sultani et al., 2018) and 128 for XD-Violence (Wu et al., 2020), and a two-layer axial transformer with an embedding size of 256 for ShanghaiTech (Liu et al., 2018). In the case of UCF-Crime and XD-Violence, we use the image features of the CLIP space as input to the Temporal Model. However, for ShanghaiTech, we observe an improvement in performance by incorporating the output of the Selector Model as an additional input to the Temporal Model. This is likely because ShanghaiTech is less challenging than the other two and, as a result, the Selector Model already provides sufficient discriminative features.\nConsistent with previous work on VAD (Tian et al., 2021; Wu et al., 2022; Chen et al., 2022; Li et al., 2022a), we incorporate a random masking strategy during the selection process operated by S. Specifically, we randomly mask 70% of the segments to prevent the model from repeatedly selecting the same segments. This approach ensures a more diverse and representative selection of segments, thus improving the overall performance.\nAppendix B. 
Training Details # Training was performed using the AdamW optimiser (Loshchilov and Hutter, 2019) with parameters β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁸ and weight decay w = 0.2. We tuned the learning rate and the number of epochs based on the behaviour of the training loss. Specifically, the learning rate is set to 5 × 10⁻⁴, 10⁻⁵ and 5 × 10⁻⁶ for ShanghaiTech, UCF-Crime, and XD-Violence, respectively, warmed up for 10% of the total training epochs and decayed to zero following a cosine annealing schedule. The number of epochs is set to 50 for UCF-Crime and XD-Violence, while it is set to 100 for ShanghaiTech, due to its smaller size. We set the weight for each loss term to 1 without tuning. Following previous work (Tian et al., 2021; Wu et al., 2022; Li et al., 2022a), we use λ1 = 8 × 10⁻³ and λ2 = 8 × 10⁻⁴ for the sparsity and smoothness regularisation terms, respectively.\nTable D.16: Ablation of the losses on the Selector model on ShanghaiTech (Liu et al., 2018). The final configuration of our model is represented by the row highlighted in grey in the table.\nLDIR A | LDIR N | AUC | mAUC\n97.86 95.92\n✓ 97.35 96\n✓ 97.95 96.35\n✓ ✓ 98.07 96.46\nTable D.17: Ablation of the losses on the Selector model on XD-Violence (Wu et al., 2020). The final configuration of our model is represented by the row highlighted in grey in the table.\nLDIR A | LDIR N | AP | mAP\n77.45 47.74\n✓ 78.69 48.03\n✓ 78.16 49.02\n✓ ✓ 78.51 49.41\nAppendix C. Reproducibility XD-Violence # As the original implementations of RTFM (Tian et al., 2021) and S3R (Wu et al., 2022) provide neither the code nor the trained models for XD-Violence (Wu et al., 2020), we made the necessary adaptations to support XD-Violence based on the information available in the original papers and on the open-source platform GitHub, and used the 2048-D features extracted after the final average pooling layer of the I3D ResNet50 model, pre-trained on Kinetics400 (Kay et al., 2017). 
First, we pretrain their models on the entire XD-Violence dataset and save the checkpoint at the training iteration that obtains the highest average precision (AP) on the test set, following their training protocol. Subsequently, we maintain the original pre-trained model frozen and introduce a multiclass classification head. This newly introduced head undergoes training following the methodology outlined in Sec. 4.2 .\nAppendix D. Ablation # Losses. Tables D.16 and D.17 illustrate the contribution of the losses on the Selector model\u0026rsquo;s outputs, where we progressively remove the losses from the full training objective, on ShanghaiTech (Liu et al. , 2018) and XD-Violence (Wu et al. , 2020), respectively. Both the losses on anomalous and normal videos contribute to better VAD and VAR results.\nAppendix E. Qualitative Results # Fig. E.4 presents additional qualitative results of our proposed AnomalyCLIP in detecting and recognising anomalies within a set of UCF-Crime and ShanghaiTech test videos. The model is capable of predicting both the presence of anomalies in test videos and the category of the anomalous event. Videos Arson016 (Row 1, Column 1), Arrest001 (Row 1, Column 2) and Burglary033 (Row 2, Column 2) serve as good examples of the effectiveness of the proposed method. The anomalies are temporally located, and the ground-truth labels (as indicated in\nFig. E.4: Qualitative results for VAR on twelve test videos from UCF-Crime (the top three rows), ShanghaiTech (the fourth row) and XD-Violence (the bottom row). For each video, we show at the bottom the predicted probability of each frame being anomalous by our model over the number of frames. We showcase some key frames to reflect the relevance between the predicted anomaly probability and the visual content. The red shaded areas denote the temporal ground-truth of anomalies. 
We also indicate the predicted anomalous class for detected abnormal frames in the red boxes, while videos without detected anomalies are indicated with blue boxes as Normal.\nthe video name) are correctly identified. However, it is worth noting that in Arson016 some frames are misjudged as Explosion, which is nevertheless a similar type of anomaly.\nOne failure case is observed in the sample Shoplifting039 (Row 2, Column 2), where the proposed method fails to detect the anomaly. The reason for this failure could be attributed to the fact that the annotated anomaly is visually very similar to a normal situation, making it difficult even for humans to understand that a shoplifting is taking place and not an authorised person moving an object. This result underscores the challenge of accurately detecting anomalies in complex and visually similar scenarios. In video Robbery102 (Row 2, Column 1), the anomaly is correctly located but wrongly classified as Assault, indicating the challenges of VAR.\nVideos Shooting032 (Row 4, Column 2) and Fighting033 (Row 4, Column 1) are interesting examples that highlight the ability of the proposed method to detect anomalous situations even in the aftermath of the anomaly. In these videos, the anomaly probability remains high even after the anomalous situation annotated in the ground truth has ended, correctly indicating that there is still something anomalous happening.\nThe videos from ShanghaiTech (Row 5, Columns 1-2) also provide insights into the performance of the proposed method. In the video on the left, the anomaly is correctly classified as a vehicle. However, there is also a false alarm, which represents a failure case. On the right side of the last row, the video shows a monocycle anomaly that is wrongly classified as Running. 
It is reasonable to assume that the fast movement of the person riding the monocycle could have contributed to this misclassification.\n","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/delving-into-clip-latent-space-for-video-anomaly-recognition/","section":"Papers","summary":"Proposes AnomalyCLIP, a novel method leveraging Large Language and Vision (LLV) models like CLIP, combined with multiple instance learning and a re-centring transformation of the CLIP feature space, to detect and classify video anomalies and recognize anomaly types. Introduces a Selector model with prompt learning and a Temporal Transformer-based model for temporal dependency modeling; demonstrates state-of-the-art performance on multiple benchmarks.","title":"Delving into CLIP latent space for Video Anomaly Recognition","type":"other"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/dian-zheng/","section":"Authors","summary":"","title":"Dian Zheng","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/elisa-riccia/","section":"Authors","summary":"","title":"Elisa Riccia","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/fabio-poiesi/","section":"Authors","summary":"","title":"Fabio Poiesi","type":"authors"},{"content":" This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.\nExcept for this watermark, it is identical to the accepted version;\nthe final published version of the proceedings is available on IEEE Xplore.\nGenerating Anomalies for Video Anomaly Detection with Prompt-based Feature Mapping # Zuhao Liu, Xiao-Ming Wu, Dian Zheng, Kun-Yu Lin, Wei-Shi Zheng * School of Computer Science and Engineering, Sun Yat-sen University, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China\n{liuzh327, 
wuxm65, zhengd35, linky5}@mail2.sysu.edu.cn, wszheng@ieee.org\n(* Corresponding author.)\nAbstract # Anomaly detection in surveillance videos is a challenging computer vision task where only normal videos are available during training. Recent work released the first virtual anomaly detection dataset to assist real-world detection. However, an anomaly gap exists because the anomalies are bounded in the virtual dataset but unbounded in the real world, which reduces the generalization ability of the virtual dataset. There also exists a scene gap between virtual and real scenarios, including scene-specific anomalies (events that are abnormal in one scene but normal in another) and scene-specific attributes, such as the viewpoint of the surveillance camera. In this paper, we aim to solve the problem of the anomaly gap and scene gap by proposing a prompt-based feature mapping framework (PFMF). The PFMF contains a mapping network guided by an anomaly prompt to generate unseen anomalies with unbounded types in the real scenario, and a mapping adaptation branch to narrow the scene gap by applying a domain classifier and an anomaly classifier. The proposed framework outperforms the state-of-the-art on three benchmark datasets. Extensive ablation experiments also show the effectiveness of our framework design.\n1. Introduction # Video anomaly detection (VAD) aims to identify abnormal scenarios in surveillance videos with broad applications in public security. However, due to their small probability of occurrence, abnormal events are difficult to observe in real-life surveillance. The challenge increases because of the unconstrained nature of abnormal events. Given a specific scenario, any event that differs from normal events can be regarded as an anomaly, so the anomaly type is unbounded.\nMost VAD approaches address this challenge by learning the distribution of normal events in the training stage and detecting the out-of-distribution events in the testing stage. 
These methods are categorized into reconstruction-based methods [1, 14, 31] to reconstruct the current frame and prediction-based methods [26, 27, 30, 34] to predict the upcoming frame. Significant reconstruction or prediction error is regarded as an anomaly. However, due to the strong generalization ability of the deep networks and the similarity between normal and abnormal events, the anomalies do not always lead to enough error to be detected. Without prior knowledge of abnormal distribution, it is difficult for the network to detect unseen anomalies.\nTherefore, instead of calculating error with the distribution of normal behaviors, some methods [11, 12, 53, 54] try to generate pseudo anomalies to simulate the distribution of abnormal behaviors. For example, Georgescu et al. [12] collect a large number of images from Tiny ImageNet unrelated to the detection scenario as pseudo anomalous samples. Their other work [11] tries to generate temporal anomalies by reversing the action order or motion irregularity by extracting intermittent frames. The network can get a glimpse of the feature distribution different from normal events by manually applying pseudo anomalies. However, the main drawback of these methods is the unavoidable gap between pseudo and natural anomalies.\nTo solve the problem of pseudo anomalies, Acsintoae et al. [2] released a virtual VAD dataset named Ubnormal using 3D animations and 2D background images. It contains 22 types of anomaly, such as fighting, stealing, laying down, etc. The distribution of real anomalies can be well evaluated by applying the virtual dataset. However, applying virtual anomalies to real scenarios is a great challenge due to the large domain gap. Acsintoae et al. [2] train a CycleGAN [60] to achieve video-level style transfer from the virtual to the real domain to address the challenge.\nHowever, existing methods fail to address two key challenges. 
Firstly, the anomalies are bounded in the virtual dataset but unbounded in the real world, and we define this difference as the anomaly gap. Secondly, different scenarios have scene-specific anomalies (events that are abnormal in one scene but normal in another) and scene-specific attributes (such as the viewpoint of the surveillance camera), and we define this difference as the scene gap.\nFigure 1. An overview of the prompt-based feature mapping framework (PFMF). The PFMF contains three parts, i.e., feature extractor, prompt-based feature mapping network, and mapping adaptation branch. The feature extractor is used to transform the input instances into corresponding features, so the mapping process can be completed at the feature level. The prompt-based feature mapping network aims to map normal features into abnormal feature space under the same domain guided by an anomaly prompt, so the unseen anomalies in the real domain can be generated from normal features. The mapping adaptation branch is added to make the generated anomalies scene-specific and solve the problem of scene-specific attributes.\nOur work is motivated by the above two key challenges. To solve the problem of anomaly gap and scene gap, we propose a novel framework named prompt-based feature mapping framework (PFMF), as shown in Fig. 1. In terms of narrowing the anomaly gap, the PFMF employs a prompt-guided mapping network to generate unbounded anomalies through a divergent mapping process. The prompts are sampled from the distribution learned by a variational auto-encoder (VAE) [17]. As for the scene gap, we introduce a mapping adaptation branch to solve it. 
In detail, the branch consists of an anomaly classifier to make the generated anomalies scene-specific, and two domain classifiers to reduce the inconsistency caused by scene-specific attributes.\nIn summary, this paper makes the following contributions:\n(1) Proposing a novel prompt-based feature mapping framework (PFMF) for video anomaly detection. This framework addresses the challenge of applying virtual VAD datasets with limited anomalies to the real scenario by generating unseen anomalies with unbounded types.\n(2) Proposing a mapping adaptation branch to ensure the anomalies generated by PFMF are scene-specific and solve the problem of scene-specific attributes.\n(3) Showing the effectiveness of the proposed framework on three public VAD datasets, ShanghaiTech, Avenue, and UCF-Crime. Extensive experiments show that the proposed framework performs the best compared with the state-of-the-art.\n2. Related Work # 2.1. Video Anomaly Detection # The goal of the VAD task is to detect anomaly events in videos. In recent years, many works try to learn the distribution of normal events and detect out-of-distribution events in the testing stage [1, 13, 14, 26, 27, 30, 31, 34]. These methods are categorized into reconstruction-based or prediction-based. Some of the reconstruction-based methods use generative models [14], sparse coding [31], or deep auto-encoder [1] to reconstruct the current frame based on several adjacent frames. Prediction-based methods always predict the future frame using techniques such as motion feature extraction [27, 34], deep auto-encoder [13, 30] or ConvLSTM [26]. The occurrence of an anomaly will lead to significant reconstruction or prediction error. However, these methods face the \u0026lsquo;over-generalizing\u0026rsquo; dilemma where both normal and abnormal frames can be predicted or reconstructed well because of the strong representation ability of deep networks [32]. 
Recently, some methods try to solve this problem by adding pseudo anomalies in the training process [11, 12, 53, 54]. The pseudo anomalies are collected from unrelated datasets [12] or generated from normal events [11, 54]. However, these methods face the problem of the large gap between pseudo and natural anomalies.\n2.2. Datasets under Virtual Environment # Due to the enormous cost and privacy sensitivity of collecting real-world datasets, generating a virtual dataset has become a viable alternative in many fields, including person re-identification [47], semantic segmentation [40], action recognition [37], etc. Due to the lack of anomalies in real-world datasets, instead of generating pseudo anomalies, Acsintoae et al. [2] introduce the first virtual VAD dataset named Ubnormal with a large number of videos containing anomalies such as falling, fighting, stealing, etc.\n2.3. Feature Mapping # Our proposed framework shares underlying similarities with feature mapping techniques in domain adaptation [6, 18, 41, 57, 61]. To address the problem of heterogeneous feature spaces in different domains, feature mapping is used to map data from one domain to another. Two mapping paradigms, i.e., common space mapping [41, 61] and asymmetric mapping [6, 18, 57] are used. However, the gap between the real and virtual domains is large. Therefore, instead of mapping features from one domain to another, the proposed PFMF applies mapping from normal features to abnormal features under the same domain.\n2.4. Prompting Methods # Recently, prompt-based learning has been a popular method in both natural language processing [4, 25, 36, 49, 51] and computer vision [8, 16, 22, 58, 59]. Usually, a prompt in textual form is used to adapt a language model pre-trained on a large dataset to downstream tasks [4, 25, 36, 49, 51]. Textual prompts are also used in vision-language models [8, 22, 58, 59] to complete computer vision tasks. 
In addition to applying textual prompts, visual prompts in the form of learnable vectors are proposed to fine-tune the Vision Transformer (ViT) [7]. In this work, instead of applying a pre-trained model to downstream tasks, the proposed anomaly prompt is used to guide the mapping network to achieve divergent mapping.\n3. Method # In this section, we elaborate on the framework and the training process of our method. An overview of the proposed PFMF is provided in Section 3.1. Then, the training data organization is explained in Section 3.2. In Section 3.3, the proposed PFMF is illustrated in detail by describing the feature mapping procedure (Section 3.3.1), anomaly prompt (Section 3.3.2), and mapping adaptation branch (Section 3.3.3), respectively. Finally, the optimization process is elaborated in Section 3.4.\n3.1. Overview # The overall framework of our proposed method is shown in Fig. 1. Here, we call the real-world dataset (ShanghaiTech, Avenue, or UCF-Crime) the real domain and the Ubnormal dataset the virtual domain. The framework takes three inputs, i.e., real domain normal instance $S^{r}_{nor}$, virtual domain normal instance $S^{v}_{nor}$, and virtual domain abnormal instance $S^{v}_{abn}$. We first use a feature extractor to obtain the features of the three inputs. To solve the problem of the anomaly gap, a feature mapping network assisted by an anomaly prompt is used to map normal features to abnormal features. The anomaly prompt is the key factor in generating unbounded types of anomalies to narrow the anomaly gap. Then we can generate anomalies in the real domain by the feature mapping network and a sampled anomaly prompt. We also propose a mapping adaptation branch to make the generated anomalies scene-specific and solve the problem of scene-specific attributes, which greatly narrows the scene gap. The details of the proposed PFMF are illustrated in the following parts.\n3.2. 
Training Data Organization # The virtual VAD dataset provides rich instance-level anomaly annotations: each behavior is labeled as normal or abnormal, and the outline of each person is also provided. However, the real-world VAD dataset contains only raw videos without annotations, and abnormal behavior is absent from it. Therefore, as shown in the left part of Fig. 2, our framework takes three types of inputs, i.e., real-domain normal instances $S^r_{nor}$, virtual-domain normal instances $S^v_{nor}$, and virtual-domain abnormal instances $S^v_{abn}$. Both $S^v_{nor}$ and $S^v_{abn}$ are cropped from virtual videos based on the outline annotations of each person. Since the real-domain dataset has no bounding box annotations, we apply a YOLOv3 object detector [38] pre-trained on the MS COCO dataset [24] to extract the bounding box of each person.\n3.3. Prompt-based Feature Mapping # After obtaining the instance-level inputs, a feature extractor is used to extract high-dimensional features, as shown in Fig. 2. Then, a feature mapping network builds a bridge between normal and abnormal features in the virtual domain by asymmetric mapping. The mapping process is guided by an anomaly prompt to generate unbounded types of anomalies. The prompt generation process is shown in Fig. 3. Finally, all features are fed into the mapping adaptation branch to further narrow the scene gap, as shown in Fig. 4.\n3.3.1 Feature Mapping Network # The feature mapping network aims to map normal features into the abnormal feature space within the same domain. Denote the feature extractor output as $X ∈ ℝ^{C×T×H×W}$, where $C$, $T$, $H$, and $W$ represent the channel number, temporal length, height, and width, respectively. The normal feature in the real domain is represented as $X^r_{nor}$, and the normal and abnormal features in the virtual domain are represented as $X^v_{nor}$ and $X^v_{abn}$, respectively. The mapping network, denoted by $ε(·)$, contains an encoder $ε_e(·)$ to extract high-level information and a decoder $ε_d(·)$ to up-sample the encoded features.\n
Figure 2. Process of training data organization and feature mapping. For the virtual domain, the mapping network $ε$ maps the normal feature $X^v_{nor}$ toward the abnormal feature $X^v_{abn}$, and the MAE loss minimizes the gap between the mapped feature $X^v_{map}$ and the abnormal feature $X^v_{abn}$. For the real domain, the mapping network learned in the virtual domain is used to generate unseen anomalies $X^r_{abn}$.\nSince we want to generate unbounded types of anomalies, we design a divergent mapping process (one normal feature can be mapped to many types of abnormal features). Moreover, we apply an anomaly prompt $p$ to indicate the mapping direction. In the virtual domain, the mapped feature $X^v_{map}$ is generated as\n$X^v_{map} = ε([X^v_{nor}, p^v])$ (1)\nwhere $p^v$ is the anomaly prompt for a feature in the virtual domain, and $[·]$ denotes concatenating feature maps along the channel dimension.\nBy training the mapping network $ε$, we aim to minimize the mean absolute error (MAE) between the mapped abnormal feature and the true abnormal feature in the virtual domain, as\n$L_m = ‖X^v_{map} − X^v_{abn}‖_1 / N$ (2)\nwhere $N$ is the number of elements in the feature.\nThrough the optimization process of Eq. 2, the normal feature can be transformed into the abnormal feature space.\nIn the real domain, there are no abnormal samples in the training set, so the network has no perception of the abnormal feature distribution. To simulate the abnormal feature distribution in the real domain, we generate abnormal features by using the mapping network learned in the virtual domain. The formula is defined as\n$X^r_{abn} = ε([X^r_{nor}, p^r])$ (3)\nwhere $X^r_{abn}$ is the generated anomaly and $p^r$ is the real-domain prompt generated by sampling from a learned distribution.\n3.3.2 Anomaly Prompt # To generate unbounded types of anomalies through divergent feature mapping, we create an anomaly prompt as an extra input of the mapping network. Since we can assign different generated directions by different anomaly prompts, the produced anomalies tend to be unbounded. The anomaly prompt is obtained by concatenating the scene vector and the anomaly vector, as shown in Fig. 3.\n
Figure 3. Generation process of the anomaly prompt for the feature mapping network. It is obtained by concatenating the anomaly vector ($a_r$ or $a_v$) and the scene vector ($s_r$ or $s_v$). The scene vector is generated from a ResNet-18 network pre-trained on the Places365 dataset. The anomaly vector is sampled from the Gaussian distribution in the VAE. The VAE is trained by reconstructing abnormal features in the virtual domain.\nAnomaly vector. The anomaly vector contains information about the anomaly type. As shown in Fig. 3 (a), the anomaly vector in the virtual domain $a_v$ is obtained by squeezing the spatial dimension of abnormal features through global average pooling. Then, $a_v$ is fed into a VAE $f$ to generate the reconstruction vector $a^{∗}_v$. The VAE is trained by minimizing the mean square error (MSE) between the anomaly vector and the reconstruction vector, as\n$L_v = ‖a_v − a^{∗}_v‖_2^2 / n$\nwhere $n$ is the dimension of the anomaly vector.\nIn the real domain, we sample a latent variable $z$ from the posterior distribution of the VAE and decode $z$ to obtain the anomaly vector $a_r$, as shown in Fig. 3 (b). Since the VAE is learned from aligned abnormal features, it can simulate the distribution of the anomalies. The alignment is done by the mapping adaptation branch, which we discuss later. Then, by sampling latent variables that obey the Gaussian distribution, we can obtain more types of anomaly vectors.\nScene vector. We aim to make the generated anomaly features scene-independent (applicable to any input scene) to narrow the scene gap. Therefore, additional scene information is added by generating a scene vector $s_r$. As shown in Fig. 3 (b), we feed the scene image (without YOLOv3 detection) to a ResNet-18 pre-trained on the Places365 dataset [56] to identify scene information. 
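The prompt construction described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: features are nested lists of shape C×H×W (the temporal axis is omitted for brevity), the helper names `global_avg_pool`, `build_prompt`, and `concat_prompt` are hypothetical, and broadcasting each prompt entry to a constant H×W channel is an assumption, since the text only states that the prompt and the feature are concatenated along the channel dimension.

```python
def global_avg_pool(feature):
    # feature: nested list with shape [C][H][W]; returns one mean per channel.
    pooled = []
    for channel in feature:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def build_prompt(anomaly_vec, scene_vec):
    # The anomaly prompt is the concatenation of the two vectors.
    return list(anomaly_vec) + list(scene_vec)


def concat_prompt(feature, prompt):
    # Assumption: each prompt entry becomes a constant H x W channel
    # appended after the original feature channels.
    height, width = len(feature[0]), len(feature[0][0])
    extra = [[[p] * width for _ in range(height)] for p in prompt]
    return [[row[:] for row in ch] for ch in feature] + extra


# Toy abnormal and normal features with C=2, H=2, W=2.
abnormal_feature = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 2.0], [2.0, 4.0]]]
normal_feature = [[[0.0, 1.0], [1.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]

anomaly_vec = global_avg_pool(abnormal_feature)   # [4.0, 2.0]
prompt = build_prompt(anomaly_vec, [0.5])         # scene vector is a stand-in
mapped_input = concat_prompt(normal_feature, prompt)
```

Here `mapped_input` plays the role of the concatenated input in Eq. 1; in the real domain the anomaly vector would instead be decoded from a sampled latent variable rather than pooled from an abnormal feature.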
We apply the features before the softmax in ResNet-18 as the scene vector.\nAnomaly prompt. After obtaining both the anomaly vector and the scene vector, the anomaly prompt is generated by concatenating these two vectors to fuse input scene and anomaly type information. Then, as shown in Fig. 1, we use the anomaly prompt as an extra input of the mapping network to achieve divergent feature mapping.\n3.3.3 Mapping Adaptation Branch # In addition to the anomaly prompt, the mapping adaptation branch is also applied in PFMF to narrow the scene gap. As shown in Fig. 4, our mapping adaptation branch contains one anomaly classifier and two domain classifiers, which are designed to solve the problems of scene-specific anomalies and scene-specific attributes, respectively.\nFigure 4. The mapping adaptation branch in our PFMF contains one anomaly classifier and two domain classifiers.\nAnomaly classifier. The anomaly classifier is used to distinguish between normal and abnormal features for each input scene to explicitly make the generated anomalies scene-specific, as shown in Fig. 4. For each scene, all events that are not normal are treated as anomalies. Thus, maximizing the accuracy of the anomaly classifier pushes mapped abnormal features away from normal features in the same scene. Therefore, the generated anomalies have a different feature distribution from the normal events in the same scene, so they are regarded as scene-specific.\nDomain classifier. In addition, scene-specific attributes are also a great challenge when applying virtual datasets to real scenarios. 
The CycleGAN applied in previous work [2] can reduce some scene-specific attributes such as clothing and background. However, other scene-specific attributes still exist, such as the viewpoint of the surveillance camera. We solve this problem by aligning the feature space between the virtual and real domains. The alignment extracts common attributes of the two domains and reduces the inconsistency caused by scene-specific attributes.\nInspired by [10], we apply two domain classifiers and use a gradient reversal layer (GRL) to train the feature extractor. The domain classifier is dedicated to recognizing which domain the input feature belongs to. The preceding feature extractor tries to confuse the domain classifier to shrink the domain gap. The GRL acts as an identity function during forward propagation, as\n$R(X) = X$\nwhere $R(·)$ and $X$ represent the GRL and the input feature, respectively. During backward propagation, the GRL reverses the gradient flowing to the preceding feature extractor by multiplying it by $−λ$, as\n$∂R/∂X = −λI$\nwhere $I$ is the identity matrix and $λ$ is the adaptation factor, which is set to 1. We aim to extract common attributes between the real and virtual domains rather than the characteristics that separate normal and abnormal features. Therefore, we design two independent domain classifiers. These two classifiers act separately on normal and abnormal features in both domains, so features with the same label but different domains will have a similar distribution.\n3.4. Training and Inference # Loss function. The proposed PFMF is trained in a unified way. The total loss contains four terms: the feature mapping loss $L_m$, the anomaly classification loss $L_a$, the domain classification loss $L_d$, and the VAE reconstruction loss $L_v$. We employ the MAE loss for $L_m$ to minimize the error between mapped and true abnormal features. We also employ the MSE loss for $L_v$ to minimize the error between the input anomaly vector and the reconstruction vector of the VAE. 
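The gradient reversal layer used with the domain classifiers can be illustrated with a small sketch. It is written in plain Python with an explicit forward and backward method rather than an autograd framework, and the class name `GradientReversal` and its `lam` parameter are hypothetical names chosen for this illustration; only the behavior (identity forward, gradient multiplied by −λ backward, with λ = 1) comes from the text.

```python
class GradientReversal:
    # Hypothetical helper: identity in the forward pass, gradient scaled by
    # -lam in the backward pass (lam is the adaptation factor, 1 in the text).
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # R(X) = X: the domain classifier sees the features unchanged.
        return list(features)

    def backward(self, grad_output):
        # dR/dX = -lam * I: the gradient sent to the feature extractor
        # has its sign flipped, so the extractor learns to confuse the
        # domain classifier instead of helping it.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=1.0)
features = [0.2, -1.5, 3.0]
out = grl.forward(features)                          # [0.2, -1.5, 3.0]
grad_to_extractor = grl.backward([0.1, -0.4, 2.0])   # [-0.1, 0.4, -2.0]
```

In an autograd framework the same effect would typically be obtained with a custom function whose backward pass negates the incoming gradient; the sketch above only makes the two rules explicit.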
For $L_a$ and $L_d$, the cross-entropy loss is applied to achieve anomaly or domain classification. The entire loss $L_{all}$ is a weighted sum of these terms, in which $λ_d$ weights the domain classification loss. We empirically find that the choice of the domain loss weight $λ_d$ impacts the network training.\nInference. The generated anomalies allow the network to be trained in a fully supervised way. Given an unseen instance $S_{unseen}$, it is first fed into the feature extractor $θ$ and then into the anomaly classifier $σ$. The classification result of the anomaly classifier is regarded as the instance-level anomaly score, i.e., $σ(θ(S_{unseen}))$.\nFollowing [11], the instance-level anomaly scores are assembled into an anomaly map with the same shape as the input frame. The frame-level anomaly score is obtained by taking the maximum value of the anomaly map in each frame.\n4. Experiments # 4.1. Datasets and Metrics # In addition to the Ubnormal dataset described in Section 2.2 and Section 3.2, we evaluate the proposed PFMF on three real-world VAD datasets: ShanghaiTech [31], Avenue [29], and UCF-Crime [42].\nShanghaiTech is a large-scale VAD dataset containing 437 videos captured in 13 locations. The dataset is organized for unsupervised learning by dividing it into a training set of 330 videos containing only normal events and a testing set of 107 videos containing both normal and abnormal events. The anomalies include fighting, robbing, riding bikes on the sidewalk, etc. Each video in the dataset has a resolution of 480×856.\nThe Avenue dataset contains 16 videos for training and 21 videos for testing with 15324 frames. Similar to ShanghaiTech, only the testing set contains abnormal events. Each video in the dataset has a resolution of 480×856. The anomalies include throwing objects, running, loitering, etc.\nThe UCF-Crime dataset contains 13 anomaly types, and the total video length is 128 hours. 
We use normal videos from the training set for model training, and abnormal videos of human-related anomalies (excluding the explosion, car accident, and normal classes) from the testing set for model evaluation.\nEvaluation Metrics. In Section 4.3, we use the accuracy (Acc) on Ubnormal instance inputs and the feature mapping error (Err) to evaluate the feature mapping effect. A lower error means a better feature mapping effect. The feature mapping error is the MAE between the mapped and true abnormal features. In Sections 4.4 and 4.5, we use a common metric, the area under the ROC curve (AUC), to evaluate the frame-level anomaly detection performance of our framework [9, 30, 54]. A higher AUC value means better anomaly detection ability. Following [12], we evaluate both the Micro and Macro versions of AUC.\n4.2. Implementation Details # Based on the trials of the preliminary study (Section 4.3), we set the layer number of the mapping network to 4 and the loss function to the MAE loss for subsequent experiments. Each layer of the mapping network contains one convolution followed by an instance normalization and a ReLU activation. As described in Section 3.4, the choice of $λ_d$ influences the network training, and it is set to 0.2. The Adam optimizer is used with a learning rate of 0.001. The confidence threshold for the YOLOv3 detector is set to 0.5 for ShanghaiTech and UCF-Crime, and 0.95 for Avenue. The temporal length of each input video clip is set to 7. For the feature extractor, a 3D CNN with a total of six convolution layers is applied.\nSince the inputs of the proposed framework come from two domains, we empirically found that batch normalization leads to optimization failure due to inaccurate running mean and variance. Therefore, instance normalization is used in our framework to replace batch normalization.\nTable 1. Preliminary study to obtain the optimal structure and loss function for the mapping network. 
Accuracy (Acc) on Ubnormal instance inputs and feature mapping error (Err) are used to evaluate the feature mapping effect.\n| Loss Type | Layer Num | Acc(%) | Err(%) |\n| --- | --- | --- | --- |\n| MAE | 0 | 80.3 | 0.98 |\n| MAE | 1 | 83.4 | 1.01 |\n| MAE | 2 | 84 | 1.2 |\n| MAE | 3 | 84.1 | 1.82 |\n| MAE | 4 | 85.3 | 0.81 |\n| MSE | 0 | 86.9 | 4.29 |\n| MSE | 1 | 87.2 | 6.79 |\n| MSE | 2 | 84.9 | 5.33 |\n| MSE | 3 | 84.1 | 7.02 |\n| MSE | 4 | 85.8 | 6.7 |\n4.3. Preliminary Study # In this section, we explore the optimal structure and loss function for the mapping network. Only $S^v_{nor}$ and $S^v_{abn}$ are fed to our PFMF in this section because the mapping results in the real domain cannot be evaluated by reconstruction error and instance-level accuracy (we do not have abnormal instances in the real domain). After obtaining the virtual domain instances described in Section 3.2, we use 70% of the instances for training and 30% for testing. We design different structures by changing the number of down-sampling layers in the mapping network. Setting the layer number to 0 means that no down-sampling layer exists in the mapping network. We also evaluate the effect of different feature mapping losses $L_m$, i.e., the MAE loss and the MSE loss. Results are shown in Table 1. From the table, we find that the MAE loss significantly reduces the feature mapping error. When using the MAE loss, deeper layers result in higher accuracy. The structure with layer number 4 and the MAE loss achieves the lowest feature mapping error of 0.81%. The structure with layer number 1 and the MSE loss obtains the highest accuracy (87.2%), but its feature mapping error is too large (6.79%). In summary, we apply a mapping network with layer number 4 and the MAE loss to our PFMF.\n4.4. Comparisons with State-of-the-art # In this section, we compare the performance of the proposed PFMF with state-of-the-art methods in Micro and Macro AUC (%). 
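To make the two AUC variants concrete, the sketch below computes both in plain Python. It follows the usual convention in which Micro AUC pools the frames of all test videos before computing a single AUC, while Macro AUC averages per-video AUCs. The function names are hypothetical, the toy scores are invented for illustration only, and skipping videos that contain a single class in the macro average is an assumption of this sketch.

```python
def auc(scores, labels):
    # Probability that a random abnormal frame outscores a random normal one
    # (ties count as half), which equals the area under the ROC curve.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when one class is missing
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def micro_auc(videos):
    # Pool the frames of all videos, then compute a single AUC.
    scores = [s for sc, _ in videos for s in sc]
    labels = [y for _, lb in videos for y in lb]
    return auc(scores, labels)


def macro_auc(videos):
    # Compute one AUC per video, then average over videos where it is defined.
    per_video = [auc(sc, lb) for sc, lb in videos]
    valid = [a for a in per_video if a is not None]
    return sum(valid) / len(valid)


# Each entry is (frame scores, frame labels); 1 marks an abnormal frame.
videos = [
    ([0.1, 0.2, 0.9], [0, 0, 1]),
    ([0.92, 0.95, 0.97], [0, 1, 1]),
]
macro = macro_auc(videos)   # every video is ranked perfectly on its own
micro = micro_auc(videos)   # pooled frames expose the score offset of video 2
```

In this toy case each video is perfectly separable on its own, so the macro value is 1.0, but a normal frame of the second video scores above an abnormal frame of the first, which lowers the micro value to 8/9. This is why the two metrics can rank methods differently.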
Note that the current advanced methods [39] and [2] apply the multi-task frameworks ([11] and [12], respectively) provided by Georgescu et al. as the backbone. Therefore, we evaluate our PFMF with and without the multi-task backbone [11] on the Avenue and ShanghaiTech datasets.\nTable 2. Quantitative comparisons between our proposed PFMF and state-of-the-art methods [2, 3, 5, 11–13, 15, 19–21, 23, 27, 28, 30, 33–35, 39, 42–44, 46, 48, 50, 52–55] in Micro and Macro AUC (%). Bold font indicates the best results.\n| Year | Method | Avenue Micro AUC | Avenue Macro AUC | ShanghaiTech Micro AUC | ShanghaiTech Macro AUC |\n| --- | --- | --- | --- | --- | --- |\n| 2018 | Liu et al. [27] | 85.1 | - | 72.8 | - |\n| 2018 | Lee et al. [19] | 87.2 | - | - | - |\n| 2018 | Sultani et al. [42] | - | - | 76.5 | - |\n| 2019 | Lee et al. [20] | 90.0 | - | 76.2 | - |\n| 2019 | Ionescu et al. [15] | 87.4 | 90.4 | 78.7 | 84.9 |\n| 2019 | Gong et al. [13] | 90.4 | - | 84.9 | - |\n| 2019 | Nguyen et al. [34] | 86.9 | - | - | - |\n| 2019 | Wu et al. [50] | 86.6 | - | - | - |\n| 2020 | Park et al. [35] | 88.5 | - | 70.5 | - |\n| 2020 | Sun et al. [43] | 89.6 | - | 74.7 | - |\n| 2020 | Lu et al. [30] | 85.8 | - | 77.9 | - |\n| 2020 | Wang et al. [48] | 87.0 | - | 79.3 | - |\n| 2020 | Yu et al. [52] | 89.6 | - | 74.8 | - |\n| 2020 | Tang et al. [44] | 85.1 | - | 73.0 | - |\n| 2021 | Wang et al. [46] | 88.3 | - | 76.6 | - |\n| 2021 | Astrid et al. [3] | 87.1 | - | 73.7 | - |\n| 2021 | Liu et al. [28] | 91.1 | - | 76.2 | - |\n| 2021 | Madan et al. [33] | 88.6 | - | 74.6 | - |\n| 2021 | Li et al. [21] | 88.8 | - | 73.9 | - |\n| 2021 | Georgescu et al. [11] | 91.5 | 91.9 | 82.4 | 89.3 |\n| 2021 | Georgescu et al. [12] | 92.3 | 90.4 | 82.7 | 89.3 |\n| 2022 | Zaheer et al. [54] | 74.2 | - | 79.6 | - |\n| 2022 | Li et al. [23] | 82.0 | - | - | - |\n| 2022 | Zaheer et al. [53] | - | - | 69.9 | - |\n| 2022 | Cho et al. [5] | 88.0 | - | 76.3 | - |\n| 2022 | Zhong et al. [55] | 89.0 | - | 74.5 | - |\n| 2022 | Ristea et al. [39] ∗ | 92.9 | 91.9 | 83.6 | 89.5 |\n| 2022 | Acsintoae et al. [2] ▽∗ | 93.0 | 93.2 | 83.7 | 90.5 |\n| 2022 | PFMF (ours) ▽ | 91.8 | 92.3 | 83.8 | 87.8 |\n| 2022 | PFMF (ours) ▽∗ | **93.6** | **93.9** | **85.0** | **91.4** |\n▽ These methods apply a virtual dataset for training. ∗ These methods use a multi-task model ([11] or [12]) as the backbone.\nTable 3. Results on human-related anomalies of the UCF-Crime dataset.\n| Method | Micro AUC | Macro AUC |\n| --- | --- | --- |\n| Park et al. [35] | 55.5 | |\n| Ristea et al. [39] | 60.6 | 64.2 |\n| Georgescu et al. [11] | 62.3 | 65.5 |\n| PFMF (ours) | 67.9 | 74.0 |\nFor UCF-Crime, we did not use the multi-task backbone.\n
Table 4. Ablation study of the proposed PFMF. A total of five groups of experiments are conducted on the ShanghaiTech dataset to evaluate the effect of each network component.\n| Feature mapping | Anomaly prompt | Mapping adaptation | Micro AUC | Macro AUC |\n| --- | --- | --- | --- | --- |\n| - | - | - | 73.6 | 74.5 |\n| ✔ | - | - | 78.9 | 84.2 |\n| ✔ | - | ✔ | 80.9 | 86.5 |\n| ✔ | ✔ | - | 80.0 | 85.3 |\n| ✔ | ✔ | ✔ | 83.8 | 87.8 |\nFigure 5. Distributions of features generated by our PFMF and the method of Georgescu et al. [11], visualized by t-SNE [45]. The blue points denote the extracted features of normal videos in the ShanghaiTech [31] and Avenue [29] datasets. The red points indicate the abnormal features generated by our PFMF or [11]. The features of normal and abnormal events are tangled together for [11], whereas our proposed PFMF yields a nebula-like feature distribution. The pattern of concentrated normal features and dispersed abnormal features is consistent with our perception of anomalies: most normal behaviors are similar, while abnormal behaviors are a highly variable open set. The figure indicates that the generated features are close to the distribution of real anomalies.\nAvenue. The proposed PFMF achieves the best results on the Avenue dataset compared with state-of-the-art methods [2, 3, 5, 11–13, 15, 19–21, 23, 27, 28, 30, 33–35, 39, 43, 44, 46, 48, 50, 52, 54, 55], with a Micro AUC of 93.6% and a Macro AUC of 93.9%, which are 0.6% and 0.7% higher than the second-best model [2]. Without the multi-task backbone [11], our PFMF still obtains a Macro AUC of 92.3%, higher than all compared methods except [2].\nShanghaiTech. From Table 2, the proposed PFMF also outperforms state-of-the-art methods [2, 3, 5, 11–13, 15, 20, 21, 27, 28, 30, 33, 35, 39, 42–44, 46, 48, 52–55] with a Micro AUC of 85.0% and a Macro AUC of 91.4%. The PFMF outperforms the second best [2] by 1.3% and 0.9%, respectively. 
Without the multi-task backbone [11], the proposed PFMF can also achieve the best Micro AUC of 83.8%, even higher than [2] with the backbone [11].\nFigure 6. Visualization of anomaly score prediction results for test video 07 in the Avenue dataset. The green line denotes the anomaly prediction of the proposed PFMF. The red area denotes the abnormal interval.\nUCF-Crime. Due to the lack of published results on human-related anomalies in the UCF-Crime dataset, we run the code of [11, 35, 39], where [39] takes [35] as the backbone. As shown in Table 3, the proposed PFMF shows a clear advantage, with a 5.6% increase in Micro AUC and an 8.5% increase in Macro AUC over the second-best method.\n4.5. Ablation Study # We analyze the role played by each part of the proposed PFMF. A total of five groups of experiments are conducted on the ShanghaiTech dataset, as shown in Table 4. When the feature mapping part is removed, the performance drops significantly, to a Micro AUC of 73.6% and a Macro AUC of 74.5%. Adding feature mapping increases the Micro and Macro AUC by 5.3% and 9.7%, respectively, which indicates the significance of feature mapping in aligning the normal and abnormal features. In addition, the mapping adaptation branch also plays an important role in PFMF by narrowing the scene gap, raising the Micro and Macro AUC from 78.9% and 84.2% to 80.9% and 86.5%. Furthermore, the anomaly prompt improves model performance by generating unbounded types of anomalies, raising them from 78.9% and 84.2% to 80.0% and 85.3%. With all components, the final PFMF achieves the best performance, with a Micro AUC of 83.8% and a Macro AUC of 87.8%.\n4.6. Visualization # To validate that we can generate unbounded anomalies, we visualize the distribution of normal and abnormal features generated by PFMF via t-SNE [45], as shown in Fig. 5. For comparison, we also visualize the distribution of normal and abnormal features generated by [11], where the abnormal features are generated by reversing the action order and extracting intermittent frames, as also shown in Fig. 5. From the figures, we can observe that for PFMF, the normal features extracted from ShanghaiTech and Avenue concentrate at the center, and the abnormal features are scattered around in a divergent state. However, for [11], the features of normal and abnormal instances are tangled together. From the comparison, we can see that the results of our method are consistent with our perception of anomalies: most normal behaviors are similar, and abnormal behaviors are a highly variable open set. This indicates that the generated features are close to the distribution of real unbounded anomalies, which in turn validates the effectiveness of our anomaly prompt.\n
Figure 7. Visualization of anomaly score prediction results for test video 05_0018 in the ShanghaiTech dataset. The green line denotes the anomaly prediction of the proposed PFMF. The red area denotes the abnormal interval.\nTo further show what we learn in PFMF, we visualize the anomaly score predictions for test video 07 in Avenue and test video 05_0018 in ShanghaiTech, which are shown in Fig. 6 and Fig. 7. From the figures, the proposed PFMF correctly detects the anomalies in both samples.\n5. Conclusion # In this paper, we address the anomaly gap and scene gap between virtual and real scenarios by proposing a novel PFMF. The proposed framework includes a prompt-guided mapping network to generate unseen anomalies of unbounded types, and a mapping adaptation branch to narrow the scene gap by applying an anomaly classifier and domain classifiers. Our approach provides a new paradigm for leveraging virtual datasets to avoid cumbersome anomaly collection in real scenarios. The proposed PFMF achieves state-of-the-art performance on three benchmark datasets, and the ablation study shows the effectiveness of each component of our model design. 
In the future, we aim to extend the proposed paradigm of utilizing virtual datasets to more areas.\nAcknowledgments: This work was supported partially by the NSFC(U21A20471,U1911401,U1811461), Guangdong NSF Project (No. 2023B1515040025, 2020B1515120085).\nReferences # [1] Davide Abati, Angelo Porrello, Simone Calderara, and Rita Cucchiara. Latent space autoregression for novelty detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 481–490, 2019. 1 , 2 [2] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 20143–20153, 2022. 1 , 3 , 5 , 6 , 7 , 8 [3] Marcella Astrid, Muhammad Zaigham Zaheer, and Seung-Ik Lee. Synthetic temporal anomaly guided end-to-end video anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 207–214, 2021. 7 ¨\n[4] Fredrik Carlsson, Joey Ohman, Fangyu Liu, Severine Verlinden, Joakim Nivre, and Magnus Sahlgren. Fine-grained controllable text generation using non-residual prompting. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6837–6857, 2022. 3 [5] MyeongAh Cho, Taeoh Kim, Woo Jin Kim, Suhwan Cho, and Sangyoun Lee. Unsupervised video anomaly detection via normalizing flows with implicit latent features. Pattern Recognition, 129:108703, 2022. 7 [6] Wenyuan Dai, Yuqiang Chen, Gui-Rong Xue, Qiang Yang, and Yong Yu. Translated learning: Transfer learning across different feature spaces. Proceedings of the Advances in Neural Information Processing Systems, 21, 2008. 
3 [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3 [8] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022. 3 [9] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14009– 14018, 2021. 6 [10] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning, pages 1180– 1189. PMLR, 2015. 5 [11] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. Anomaly detection in video via selfsupervised and multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12742–12752, 2021. 1 , 2 , 5 , 6 , 7 , 8 [12] Mariana Iuliana Georgescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. A background-agnostic framework with adversarial training for abnormal event detection in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4505– 4523, 2021. 1 , 2 , 6 , 7\n[13] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1705–1714, 2019. 
2 , 7\n[14] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 733–742, 2016. 1 , 2\n[15] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7842–7851, 2019. 7\n[16] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. arXiv preprint arXiv:2203.12119, 2022. 3\n[17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 2\n[18] Brian Kulis, Kate Saenko, and Trevor Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1785–1792. IEEE, 2011. 3\n[19] Sangmin Lee, Hak Gu Kim, and Yong Man Ro. Stan: Spatiotemporal adversarial networks for abnormal event detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1323–1327. IEEE, 2018. 7\n[20] Sangmin Lee, Hak Gu Kim, and Yong Man Ro. Bman: bidirectional multi-scale aggregation networks for abnormal event detection. IEEE Transactions on Image Processing , 29:2395–2408, 2019. 7\n[21] Bo Li, Sam Leroux, and Pieter Simoens. Decoupled appearance and motion learning for efficient anomaly detection in surveillance video. Computer Vision and Image Understanding, 210:103249, 2021. 7\n[22] Muheng Li, Lei Chen, Yueqi Duan, Zhilan Hu, Jianjiang Feng, Jie Zhou, and Jiwen Lu. Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19880–19889, 2022. 3\n[23] Nanjun Li, Faliang Chang, and Chunsheng Liu. A selftrained spatial graph convolutional network for unsupervised human-related anomalous event detection in complex scenes.\nIEEE Transactions on Cognitive and Developmental Systems, 2022. 7\n[24] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence , 36(1):18–32, 2013. 3\n[25] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586 , 2021. 3\n[26] Wen Liu, Weixin Luo, Zhengxin Li, Peilin Zhao, Shenghua Gao, et al. Margin learning embedded prediction for video anomaly detection with a few anomalies. In Proceedings of the International Joint Conferences on Artificial Intelligence , pages 3023–3030, 2019. 1 , 2\n[27] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6536–6545, 2018. 1 , 2 , 7\n[28] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13588–13597, 2021. 7\n[29] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2720– 2727, 2013. 6 , 7\n[30] Yiwei Lu, Frank Yu, Mahesh Kumar Krishna Reddy, and Yang Wang. Few-shot scene-adaptive anomaly detection. In Proceedings of the European Conference on Computer Vision, pages 125–141. 
Springer, 2020. 1 , 2 , 6 , 7\n[31] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 341–349, 2017. 1 , 2 , 6 , 7\n[32] Hui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Learning normal dynamics in videos with meta prototype network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15425–15434, 2021. 2\n[33] Neelu Madan, Arya Farkhondeh, Kamal Nasrollahi, Sergio Escalera, and Thomas B Moeslund. Temporal cues from socially unacceptable trajectories for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2150–2158, 2021. 7\n[34] Trong-Nguyen Nguyen and Jean Meunier. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1273–1283, 2019. 1 , 2 , 7\n[35] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14372–14381, 2020. 7 , 8\n[36] Kunxun Qi, Hai Wan, Jianfeng Du, and Haolan Chen. Enhancing cross-lingual natural language inference by promptlearning from cross-lingual templates. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1910–1923, 2022. 3\n[37] Hossein Ragheb, Sergio Velastin, Paolo Remagnino, and Tim Ellis. Vihasi: virtual human action silhouette data for the performance evaluation of silhouette-based action recognition methods. In Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, pages 1– 10. IEEE, 2008. 3\n[38] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. 
3\n[39] Nicolae-Cat˘ ˘ alin Ristea, Neelu Madan, Radu Tudor Ionescu, ˘ ˘ Kamal Nasrollahi, Fahad Shahbaz Khan, Thomas B Moeslund, and Mubarak Shah. Self-supervised predictive convolutional attentive block for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13576–13586, 2022. 6 , 7 , 8\n[40] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3234– 3243, 2016. 3\n[41] Xiaoxiao Shi, Qi Liu, Wei Fan, S Yu Philip, and Ruixin Zhu. Transfer learning on heterogenous feature spaces via spectral transformation. In Proceedings of the IEEE International Conference on Data Mining, pages 1049–1054. IEEE, 2010. 3\n[42] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6479–6488, 2018. 6 , 7\n[43] Che Sun, Yunde Jia, Yao Hu, and Yuwei Wu. Scene-aware context reasoning for unsupervised abnormal event detection in videos. In Proceedings of the ACM International Conference on Multimedia, pages 184–192, 2020. 7\n[44] Yao Tang, Lin Zhao, Shanshan Zhang, Chen Gong, Guangyu Li, and Jian Yang. Integrating prediction and reconstruction for anomaly detection. Pattern Recognition Letters , 129:123–130, 2020. 7\n[45] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research , 9(11), 2008. 7 , 8\n[46] Xuanzhao Wang, Zhengping Che, Bo Jiang, Ning Xiao, Ke Yang, Jian Tang, Jieping Ye, Jingyu Wang, and Qi Qi. Robust unsupervised video anomaly detection by multipath frame prediction. IEEE Transactions on Neural Networks and Learning Systems, 2021. 7\n[47] Yanan Wang, Shengcai Liao, and Ling Shao. 
Surpassing real-world source training data: Random 3d characters for generalizable person re-identification. In Proceedings of the ACM international conference on multimedia, pages 3422– 3430, 2020. 3\n[48] Ziming Wang, Yuexian Zou, and Zeming Zhang. Cluster attention contrast for video anomaly detection. In Proceedings\nof the ACM International Conference on Multimedia, pages 2463–2471, 2020. 7\n[49] Hui Wu and Xiaodong Shi. Adversarial soft prompt tuning for cross-domain sentiment analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2438–2447, 2022. 3\n[50] Peng Wu, Jing Liu, and Fang Shen. A deep one-class neural network for anomalous event detection in complex scenes. IEEE Transactions on Neural Networks and Learning Systems, 31(7):2609–2622, 2019. 7\n[51] Hongbin Ye, Ningyu Zhang, Shumin Deng, Xiang Chen, Hui Chen, Feiyu Xiong, Xi Chen, and Huajun Chen. Ontologyenhanced prompt-tuning for few-shot learning. In Proceedings of the ACM Web Conference, pages 778–787, 2022. 3\n[52] Guang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin, and Marius Kloft. Cloze test helps: Effective video anomaly detection via learning to complete video events. In Proceedings of the ACM International Conference on Multimedia, pages 583–591, 2020. 7\n[53] Muhammad Zaigham Zaheer, Jin-Ha Lee, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. Stabilizing adversarially learned one-class novelty detection using pseudo anomalies. IEEE Transactions on Image Processing, 31:5963– 5975, 2022. 1 , 2 , 7\n[54] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14744–14754, 2022. 1 , 2 , 6 , 7\n[55] Yuanhong Zhong, Xia Chen, Yongting Hu, Panliang Tang, and Fan Ren. 
Bidirectional spatio-temporal feature learning with multi-scale evaluation for video anomaly detection. IEEE Transactions on Circuits and Systems for Video Technology, 2022. 7\n[56] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. 4\n[57] Joey Tianyi Zhou, Ivor W Tsang, Sinno Jialin Pan, and Mingkui Tan. Heterogeneous domain adaptation for multiple classes. In Artificial Intelligence and Statistics, pages 1095–1103. PMLR, 2014. 3\n[58] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022. 3\n[59] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022. 3\n[60] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2223–2232, 2017. 1\n[61] Yin Zhu, Yuqiang Chen, Zhongqi Lu, Sinno Jialin Pan, Gui-Rong Xue, Yong Yu, and Qiang Yang. Heterogeneous transfer learning for image classification. In Proceedings of the AAAI Conference on Artificial Intelligence, 2011. 
3\n","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/liu_generating_anomalies_for_video_anomaly_detection_with_prompt-based_feature_mapping_cvpr_2023_paper/","section":"Papers","summary":"The paper proposes a prompt-based feature mapping framework (PFMF) to generate unseen anomalies with unbounded types and narrow the scene gap for video anomaly detection, outperforming state-of-the-art methods on multiple datasets.","title":"Generating Anomalies for Video Anomaly Detection with Prompt-based Feature Mapping","type":"method"},{"content":" Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection # Shengyang Sun, Xiaojin Gong* College of Information Science \u0026amp; Electronic Engineering, Zhejiang University, Hangzhou, Zhejiang, China\n{sunshy,gongxj}@zju.edu.cn\n* Corresponding author.\nAbstract # Increasing scene-awareness is a key challenge in video anomaly detection (VAD). In this work, we propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos. We first incorporate foreground object and background scene features with high-level semantics by taking advantage of pre-trained video parsing models. Then, building upon the autoencoder-based reconstruction framework, we introduce both scene-level and object-level contrastive learning to enforce the encoded latent features to be compact within the same semantic classes while being separable across different classes. This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability. Moreover, to tackle rare normal activities, we design a skeleton-based motion augmentation to increase samples and refine the model further. Extensive experiments on three public datasets and scene-dependent mixture datasets validate the effectiveness of our proposed method.\n1. Introduction # With the prevalence of surveillance cameras deployed in public places, video anomaly detection (VAD) has attracted considerable attention from both academia and industry. 
It aims to automatically detect abnormal events so that the workload of human monitors can be greatly reduced. By now, numerous VAD methods have been developed under different supervision settings, including weakly supervised [13, 50, 55, 58, 64, 76], purely unsupervised [69, 72], and ones learning from normal videos only [20, 24, 33, 44, 45]. However, it is extremely difficult or even impossible to collect sufficient and comprehensive abnormal data due to the rare occurrence of anomalies, whereas collecting abundant normal data is relatively easy. Therefore, the setting of learning from normal data is more practical and plays the dominant role in past studies.\nAlthough a majority of previous techniques learn their VAD models from normal data, this task has still not been well addressed due to the following reasons. First, some anomalies are scene-dependent [46, 51], implying that an appearance or motion may be anomalous in one scene but normal in other scenes. How to detect scene-dependent anomalies while preventing background bias (i.e., learning the background noise rather than the essence of anomaly [31]) is a challenging problem. Second, normal patterns are diverse. How to enable a deep VAD model to represent the diverse normality well but not generalize to anomalous data is also a challenge [18, 44]. Last but not least, samples collected from different normal patterns are imbalanced because some normal activities may appear very sparsely [46]. How to deal with rare but normal activities is challenging as well.\nPrevious VAD methods mainly perform learning at the frame level [20, 47, 75] or in an object-centric [17, 24, 78] way. The former is prone to suffer from the background bias [31] while most of the latter methods are background-agnostic. There are some attempts to address the above-mentioned challenges in one or another aspect. For instance, a spatio-temporal context graph [51] and a hierarchical scene normality-binding model [1] are constructed to discover scene-dependent anomalies. Memory-augmented autoencoders (AE) [18, 44] are designed to represent diverse normal patterns while lessening the powerful capacity of AEs. An over-sampling strategy [32] is adopted, but only to solve the imbalance between normal and abnormal data. In contrast, in this work we address all of these challenges simultaneously and in distinct ways.\nFigure 1. An illustration of hierarchical semantic contrast. The encoded scene-appearance/motion features are gathered together with respect to their semantic classes. Best viewed in color.\n
The primary objective of our work is to handle scene-dependent anomalies. An intuition behind scene-dependent anomalies is that, if a type of object or activity is never observed in one scene in normal videos, then it should be viewed as an anomaly. It implies that we can first determine the scene type and then check if an object or activity has occurred in normal patterns of this scene. Based on this observation, we propose a hierarchical semantic contrast method to learn a scene-aware VAD model. Taking advantage of pre-trained video parsing networks, we group the appearance and activity of objects and background scenes into semantic categories. Then, building upon the autoencoder-based reconstruction framework, we design both scene-level and object-level contrastive learning to enforce the encoded latent features to gather together with respect to their semantic categories, as shown in Fig. 1. 
When a test video is input, we retrieve weighted normal features for reconstruction, and the clips with high errors are detected as anomalies.\nThe contributions of this work are as follows:\n• We build a scene-aware reconstruction framework composed of scene-aware feature encoders and object-centric feature decoders for anomaly detection. The scene-aware encoders take background scenes into account while the object-centric decoders reduce the background noise.\n• We propose hierarchical semantic contrastive learning to regularize the encoded features in the latent spaces, making normal features more compact within the same semantic classes and separable between different classes. Consequently, it helps to discriminate anomalies from normal patterns.\n• We design a skeleton-based augmentation method to generate both normal and abnormal samples based on our scene-aware VAD framework. The augmented samples enable us to additionally train a binary classifier that helps to boost the performance further.\n• Experiments on three public datasets demonstrate promising results on scene-independent VAD. Moreover, our method also shows a strong ability in detecting scene-dependent anomalies on self-built datasets.\n2. Related Work # 2.1. Video Anomaly Detection # Most previous VAD studies can be grouped into the weakly supervised category [13, 50, 55, 58, 64, 76], which learns with video-level labels, or the category that learns from normal videos only [18, 20, 44, 47, 75]. In this work, we focus on the latter category, which is mainly addressed by reconstruction- or distance-based techniques. The reconstruction-based techniques use an autoencoder (AE) [20, 38, 75], a memory-augmented AE [18, 34, 44], or generative models [42, 47] to reconstruct the current frame [18, 20, 44, 75] or predict future frames [33, 42], by which the frames with high reconstruction errors are detected as anomalies. 
The distance-based techniques often adopt one-class SVMs [24, 25], Gaussian mixture models [49, 52], or other classifiers [17] to compute a decision boundary, and those deviating from the normality are screened out as anomalies.\nA majority of reconstruction- and distance-based techniques [18, 20, 25, 42, 44, 47, 75] learn their models at the frame level, which may suffer from the background bias [31] and lack of explainability. To address this, various object-centric methods have been developed, leveraging the appearance and motion [16, 17, 24, 67, 78] or skeleton [28, 39, 41, 67] of objects to improve performance. However, the VAD models learned by most of them are background-agnostic. Considering that some anomalous events are scene-dependent, a few scene-aware methods [1, 3, 51, 52] have been proposed recently. For instance, Sun et al. [51] construct a spatio-temporal context graph to represent both objects and the background, Sun et al. [52] and Bao et al. [1] learn memory-augmented AEs to encode scene and objects, and Cao et al. [3] design a network with context recovery and knowledge retrieval streams. Our work adopts the autoencoder-based reconstruction framework like [1]. Differently, however, we build scene-aware encoders and object-centric decoders for reconstruction and propose the hierarchical semantic contrast to regularize the encoded latent features.\n2.2. Contrastive Learning # Contrastive learning has been successfully applied to various vision tasks, such as representation learning [7, 21, 27, 66], person re-identification [15, 61], and semantic segmentation [9, 77]. It performs learning via contrasting anchor instances with their positive and negative instances or prototypes, which are sampled either from a large batch [7] or from an external memory bank [21, 61]. Recently, contrastive learning has also been exploited in anomaly detection [23, 26, 36, 60, 63]. 
Most methods [23, 26, 36, 63] perform contrast between an instance and its augmented version, focusing on the instance level only. An exception is HSCL [60], which takes into account sample-to-sample, sample-to-prototype, and normal-to-abnormal contrasts to implement semi-supervised anomaly detection. In our work, we design a hierarchical contrastive learning strategy that performs contrast at the scene level and object level, enforcing instances to gather together according to their semantic categories.\nFigure 2. An overview of the proposed method. It consists of video parsing, scene-aware autoencoders, memory-based contrastive learning, and motion augmentation modules. Best viewed in color.\n2.3. Data Augmentation # Data augmentation is extensively used in contrastive learning and other class-imbalanced learning tasks [30]. The works most related to ours are skeleton augmentation methods. For instance, Meng et al. [40] design a transformation network to generate new skeleton samples. Thoker et al. [57] design spatial and temporal skeleton augmentation based on shear transformation and joint jittering. Guo et al. [19] apply shear, spatial flip, rotate, and axis mask to generate extreme augmentations. These methods apply skeleton-based augmentation to generate positive samples for the action recognition task. In contrast, we design a skeleton-based augmentation to produce both normal and abnormal samples of rare activities, helping the learning of imbalanced anomaly detection.\n3. The Proposed Method # Figure 2 presents an overview of the proposed method. When a video clip (i.e., a set of consecutive frames) is input, we first parse it to get high-level semantic features, including the appearance and motion of objects, together with the background scene. Then, the appearance or motion feature of each object is incorporated with the scene feature. 
The obtained scene-appearance and scene-motion features are fed into scene-aware encoders and object-centric decoders for feature encoding and reconstruction. All encoded latent features are stored in external memory banks, based on which we perform scene- and object-level semantic contrastive learning. The hierarchical contrastive learning enforces the diverse latent normal features to be compact within the same semantic classes and separable between different classes, which consequently increases the discrimination ability of normal patterns. During inference, normal features stored in memory are retrieved and weighted to reconstruct the features of objects in a test clip, and those with high errors are detected as anomalies.\n3.1. Video Parsing # Pre-trained video parsing networks are extensively used in many VAD methods [1, 16, 24, 39, 41, 51, 52, 67, 78] to extract different visual cues. In this work, we take advantage of several pre-trained networks to extract high-level features while introducing semantic labels.\nGiven a video clip C composed of T consecutive frames, we first adopt the pre-trained YOLOv3 [48] and FairMOT [74] to detect and track objects, which produce several object tracklets and their semantic class labels such as pedestrian, bicycle, etc. Then, we extract both appearance and motion features for each object tracklet and extract a scene feature for the remaining background as follows.\nAppearance feature extraction. Appearance information plays an important role in detecting appearance anomalies. Therefore, for an object tracklet O_i in the clip, we employ ViT [8] to extract an appearance feature for each frame of the tracklet, and the features of all frames are averaged to generate one appearance feature f^i_app ∈ ℝ^1024.\nMotion feature extraction. Motion information is of equal importance in VAD. 
Considering that human-related anomalies are dominant in non-traffic surveillance, we opt to extract action information as a motion feature instead of using optical flow. More specifically, for an object tracklet O_i, we use a pre-trained HRNet [54] to extract a skeleton feature for each frame. The features of all frames are further fed into PoseConv3D [10] to produce one motion feature f^i_mot ∈ ℝ^512, together with an action class label such as walking, jumping, kicking, etc.\nScene feature extraction. In pursuit of scene-awareness, we also extract a scene feature for the clip background. For each clip frame, we employ DeepLabV3+ [6] to generate a segmentation map while masking out the foreground object categories. Then, we perform max-pooling, reshaping, averaging, and ℓ2 normalization on all segmentation maps to obtain one scene feature f^B ∈ ℝ^{D_B}, where D_B depends on the size of the video frame. To discriminate different scenes at a fine-grained level, we utilize DBSCAN [11] for clustering and generating pseudo labels of scene classes.\n3.2. Semantic Feature Reconstruction # In this work, we adopt the extensively used reconstruction framework for our anomaly detection. For each appearance or motion feature, we design an autoencoder composed of a scene-aware encoder and an object-centric decoder for feature reconstruction.\nScene-aware feature encoder. To correlate foreground objects with the background scene, we incorporate each appearance/motion feature with its corresponding scene feature. The obtained scene-appearance or scene-motion feature is fed into a scene-aware feature encoder. Formally, it is represented by\nf̃^i_∗ = Φ_∗([f^i_∗, f^B]),\nwhere f̃^i_∗ ∈ ℝ^{D_E} is the encoded latent feature of object O_i in clip C, \u0026lsquo;∗\u0026rsquo; denotes either app or mot, and D_E is the feature dimension. 
Moreover, [·, ·] denotes concatenation and Φ_∗(·) is the feature encoder, which is implemented by a two-layer MLP followed by an ℓ2 normalization.\nObject-centric feature decoder. The reconstruction-based framework assumes that anomalies cannot be represented well by normal patterns. To reduce the background bias [31] in reconstruction, we opt to reconstruct the feature of each foreground object instead of the incorporated scene-aware feature. That is, given a latent code f̃^i_∗, we enforce the decoder to reconstruct a feature close to the appearance/motion feature f^i_∗, which is\nwhere ‖·‖_2 is the ℓ2 norm and Θ_∗ is the feature decoder, implemented by a two-layer MLP as well.\n3.3. Hierarchical Semantic Contrast # Due to the diversity of normal patterns as well as the large capacity of deep networks, the model learned from normal data may also reconstruct anomalies well [18, 44]. To address this problem, we propose a hierarchical semantic contrast (HSC) strategy to regularize the encoded normal features in the latent space, by which diverse normal patterns can be represented more compactly and therefore be more discriminative to anomalies. HSC conducts contrastive learning at the scene and object levels by taking advantage of the semantic labels introduced in video parsing.\nScene-level contrastive learning. The scene-level contrastive learning aims to attract the latent features within the same scene class and repel the features of different scenes. To this end, we adopt the InfoNCE loss [7, 66] to conduct learning, assisted by an external memory bank. 
The scene-level contrastive loss is defined by\nwhere N is the number of all encoded latent features, X_∗(f̃^i_∗) indicates the set of features sharing the same pseudo scene label with f̃^i_∗, τ is the temperature hyperparameter, and sim(·, ·) denotes the cosine similarity.\nBesides, we also build a linear classification (LC) head to classify each latent feature into its pseudo scene class by using the cross-entropy loss:\nwhere \u0026lt;·, ·\u0026gt; denotes the dot product, Λ_∗(·) is the linear classifier, and Y represents the pseudo scene label of f̃^i_∗.\nObject-level contrastive learning. Within each scene class, the object-level contrastive learning pulls the latent features of the same appearance/motion category together and pushes away those from different appearance/motion categories. Therefore, the object-level contrastive loss is defined by\nwhere N_∗(f̃^i_∗) represents the set of latent features sharing the same appearance/motion class and the same scene class with f̃^i_∗. Note that in this loss only the features within the same scene class are considered and all others are ignored.\nMemory banks. In contrast to the memory-augmented AEs [18, 44] that utilize memory for the learning of autoencoders, we use memory mainly for our contrastive learning. To this end, two memory banks are built for storing the latent scene-appearance and scene-motion features, respectively. Each entry is updated by\nfollowed by an ℓ2 normalization, where m ∈ [0, 1) is a momentum coefficient.\nFigure 3. An illustration of our skeleton-based motion augmentation, which consists of spatial transformation and temporal cutting.\n3.4. Motion Augmentation # The occurrence of rare but normal activities is a challenge in VAD [46]. This challenge stands out in scene-dependent anomaly detection when compared to the scene-agnostic case. The reason is that normal samples collected from different scenes are not counted together anymore. 
To address this problem, we design a skeleton-based augmentation to produce more samples, which includes spatial transformation and temporal cutting as shown in Fig. 3.\nSpatial transformation. A skeleton feature extracted from one object frame contains a set of human anatomical keypoints including shoulder, elbow, wrist, etc. In this work, we design a rotation-based augmentation scheme. For each keypoint K except those on the head, we set a probability P_st to decide whether the keypoint is rotated. If the keypoint K is chosen to rotate, it rotates around its parent node and the new coordinates K_rot are obtained by\nwhere P(K) is the parent keypoint of K, and α is a rotation angle randomly selected within a pre-defined range. Moreover, when K is rotated, its descendant keypoints are all rotated accordingly.\nTemporal cutting. An action is identified not only by the spatial distribution of keypoints but also by the temporal distribution. In this work, we simply adopt the cutting strategy for temporal augmentation. That is, given the frames of an object tracklet, we set a probability P_tc for each frame to decide whether it is left out.\nSpatio-temporal augmentation. To increase the diversity of motion samples, we combine spatial transformation and temporal cutting together as our spatio-temporal augmentation. Given an object tracklet, we apply the spatio-temporal augmentation to produce a new set of skeleton features and then feed them into PoseConv3D [10] to obtain the motion feature of the augmented sample.\n3.5. Training and Test # Training. The training loss of our full model contains a loss L_app for the appearance stream and a loss L_mot for the motion stream. That is, the total loss is defined by\nHere, the loss for each stream consists of two contrastive losses, together with a classification loss and a reconstruction loss. 
That is,\nin which ∗ denotes either app or mot as before.\nAt the first stage of training, we use the loss L to train our model on an original dataset without motion augmentation. Once the model is trained, we take augmented samples into consideration for refinement. Since the samples generated in motion augmentation are not guaranteed to be normal, we apply our trained model to discriminate normal and abnormal samples based on their reconstruction errors defined in Eq. (11). Then, we leverage both normal and abnormal samples to additionally train a binary classifier on the motion stream using a cross-entropy loss L^mot_aug.\nTest. During inference, we apply video parsing to obtain high-level features for each test video clip. Then, each test feature f^t_∗ is fed into the appearance/motion stream for encoding and reconstruction. Let us denote the encoded latent feature as f̃^t_∗. Different from training, which directly reconstructs from the latent feature, we calculate the similarity between it and each entry stored in the memory M_∗ by\nand get a weighted average of all stored normal features for reconstruction.\nThe reconstruction error of one stream is therefore defined by\nThe final anomaly score of an object is defined as the average reconstruction error of two streams, which is\nWhen motion augmentation is considered, the anomaly score of the motion stream is replaced by the anomaly probability output by the binary classifier. Moreover, the anomaly score of a clip is decided by the highest final anomaly score of objects in this clip. Finally, we apply a Gaussian filter for temporal smoothing over all video clips.\n4. Experiments # 4.1. Datasets and Evaluation Metrics # We evaluate the proposed method on three public datasets: UCSD Ped2 [29], Avenue [35], and ShanghaiTech [33]. UCSD Ped2 [29] is a single-scene dataset collected from pedestrian walkways, including anomalies such as bikers, skaters, and small carts across a walkway. 
Avenue [35] is a single-scene dataset as well. It is captured in CUHK campus avenue, containing anomalies like running, bicycling, etc. It also contains some rare normal patterns [35]. ShanghaiTech [33] is a challenging multi-scene dataset containing 13 campus scenes with various light conditions and camera angles. The statistics of these datasets are summarized in Table 1.\nHowever, these three datasets contain very few scene-dependent anomalies. And as far as we know, there is no public scene-dependent anomaly dataset available. In order to investigate the performance of our method on scene-dependent anomaly detection, we additionally create three mixture datasets based on ShanghaiTech. The mixture set [01, 02] consists of videos taken from scenes 01 and 02. We move a part of the test videos of scene 01 containing the cyclist events into the mixture training set and delete them from the test set. It implies that cyclist is normal in scene 01, but it is still abnormal in scene 02. Likewise, we get a mixture set [04, 08] and a set [10, 12], in which some events are normal in one scene but abnormal in the other scene. More details are provided in our supplementary material.\nFor performance evaluation, we adopt the area under the curve (AUC) of the frame-level receiver operating characteristics (ROC) as the evaluation metric following the common practice [2, 4, 12, 17, 18, 38, 45, 62]. It concatenates all frames and then computes the score, also known as micro-averaged AUC [17].\n4.2. Implementation Details # We implement the proposed method in PyTorch. The hyper-parameters involved in our model are set as follows. The dimension of encoded latent features is D_E = 1280. The temperature factor in contrastive learning is τ = 0.5 and the momentum coefficient in memory updating is m = 0.9. The probabilities used for motion augmentation are set as P_st = P_tc = 0.5. In addition, our model is trained using the AdaGrad optimizer with a learning rate of 0.01 and a batch size of 128 for both UCSD Ped2 and Avenue and 512 for ShanghaiTech. Some other details are provided in our supplementary material.\nTable 1. The statistics of three public datasets and self-built scene-dependent datasets.\n| Dataset | Training Frames | Test Frames | Scenes | Resolution |\n| --- | --- | --- | --- | --- |\n| UCSD Ped2 [29] | 2,550 | 2,010 | 1 | 360×240 |\n| CUHK Avenue [35] | 15,328 | 15,324 | 1 | 640×360 |\n| ShanghaiTech [33] | 274,515 | 42,883 | 13 | 856×480 |\n| Mixture [01, 02] | 14,080 | 5,488 | 2 | 856×480 |\n| Mixture [04, 08] | 37,600 | 5,104 | 2 | 856×480 |\n| Mixture [10, 12] | 33,856 | 3,584 | 2 | 856×480 |\nTable 2. The AUC(%) performance of our model variants.\n| MemCL | SA-AE | SM-AE | MA | Avenue | ShanghaiTech |\n| --- | --- | --- | --- | --- | --- |\n| | ✓ | | | 90.6 | 78.4 |\n| | | ✓ | | 81.3 | 77.6 |\n| | | ✓ | ✓ | 82.6 | 77.8 |\n| | ✓ | ✓ | | 91.1 | 80.7 |\n| | ✓ | ✓ | ✓ | 91.5 | 81.2 |\n| ✓ | ✓ | | | 92.1 | 79.3 |\n| ✓ | | ✓ | | 82.9 | 78.1 |\n| ✓ | | ✓ | ✓ | 84.9 | 78.3 |\n| ✓ | ✓ | ✓ | | 92.4 | 83 |\n| ✓ | ✓ | ✓ | ✓ | 93.7 | 83.4 |\n
4.3. Ablation Studies # Although the proposed method aims at scene-dependent VAD, it works for scene-independent anomalies as well. Therefore, we conduct ablation studies mostly on Avenue and ShanghaiTech and partially on the mixture sets.\nEffectiveness of the proposed components. We first validate the effectiveness of our proposed components. We decompose the full model into scene-appearance autoencoder (SA-AE), scene-motion autoencoder (SM-AE), and memory-based contrastive learning (MemCL), together with scene-motion augmentation (MA) components. The performance of the model variants holding different components is reported in Table 2. From the results, we observe that SA-AE outperforms SM-AE or SM-AE+MA when only a single stream is learned and the combination of both streams performs better. Besides, memory-based contrastive learning enables the models to outperform their counterparts by a considerable margin. 
Motion augmentation also improves the performance on both datasets, especially on the Avenue dataset that contains rare normal activities.\nEffectiveness of scene-aware AEs and HSC. We here go deeper into the above-mentioned components for investigation. More specifically, we check the effectiveness of the scene-aware feature encoder (SA-E) and object-centric feature decoder (OC-D) in our autoencoders, together with the contrastive losses used in hierarchical semantic contrast (HSC). We conduct a series of experiments on the model without using motion augmentation. The results are presented in Table 3. It shows that, when contrastive learning is not applied, the scene-aware feature encoder slightly degrades the performance on scene-independent Avenue and ShanghaiTech but improves the performance on the scene-dependent mixture sets. Moreover, the object-centric decoder improves the performance on all datasets since the background noise in reconstruction is avoided. In HSC, the individual contrastive learning at either the scene or object level can consistently boost the performance, indicating the necessity of regularizing encoded features in the latent space. The best performance is achieved when the losses work together.\nFigure 4. t-SNE [59] visualization of the scene-appearance/motion features encoded by our models without or with hierarchical semantic contrast. Panels: (a) Scene-appearance (w/o HSC), (b) Scene-appearance (w/ HSC), (c) Scene-motion (w/o HSC), (d) Scene-motion (w/ HSC). The points with the same color belong to the identical scene. Best viewed in color.\nFigure 5. The AUC(%) performance with respect to the memory size on Avenue and ShanghaiTech. Best viewed in color.\nFigure 6. The confusion matrices of encoded (a) scene-appearance and (b) scene-motion features. Best viewed in color.\nImpact of the memory size at test time. 
The memories in our work are used for hierarchical semantic contrast during training and feature reconstruction at test time. To make our model more compact and efficient for inference, we may reduce the memory size by reserving a small portion of normal patterns. In this experiment, we randomly select a number of entries and discard the rest at test time. Fig. 5 illustrates how the performance varies with the memory size. It shows that the performance is maintained well even if only 500 entries are reserved, and it degrades only slightly when only 100 entries are kept.\nTable 3. The AUC(%) performance of more detailed variants on CUHK Avenue (Avenue), ShanghaiTech (SHT), and the scene-dependent mixture datasets (i.e., [01,02] and [04,08]). When SA-E is not checked, only appearance/motion features are input to the encoders. When OC-D is not checked, the decoders reconstruct both scene and appearance/motion features.\n| SA-E / OC-D / L_Scn / L_Obj / L_LC | Avenue | SHT | [01,02] | [04,08] |\n| --- | --- | --- | --- | --- |\n| ✓ | 91 | 80.8 | 78.6 | 76.4 |\n| ✓ | 90.9 | 80.2 | 80.5 | 77.9 |\n| ✓ ✓ | 91.1 | 80.7 | 81 | 78.2 |\n| ✓ ✓ ✓ | 91.3 | 81.6 | 82.1 | 79 |\n| ✓ ✓ ✓ | 91.8 | 82 | 81.6 | 78.7 |\n| ✓ ✓ ✓ ✓ | 91.9 | 82.2 | 82.5 | 79.4 |\n| ✓ ✓ ✓ ✓ | 91.6 | 81.8 | 82.3 | 79.3 |\n| ✓ ✓ ✓ ✓ | 92.2 | 82.4 | 81.8 | 78.9 |\n| ✓ ✓ ✓ ✓ ✓ | 92.4 | 83 | 82.8 | 80 |\n4.4. Visualization # To investigate how well the hierarchical semantic contrast strategy works, we further analyze the scene classification results and the distribution of encoded latent features for data on ShanghaiTech.\nThe confusion matrix of scene classification. We first investigate whether the encoded scene-aware features correctly fall into the actual scene clusters they belong to. To this end, we check the confusion matrix of scene classification for all test samples on ShanghaiTech, which contains 12 scenes. Fig. 6 (a) and (b) visualize the confusion matrices of encoded scene-appearance and scene-motion features, respectively. We observe that most encoded scene-aware features are correctly grouped.\nThe distribution of encoded latent features. 
We further investigate the distribution of encoded scene-aware normal features stored in the memory banks. Fig. 4 visualizes the distribution of them in the latent space, obtained by the models without or with HSC. We observe that the features distribute more compactly within classes and more separately between classes, consequently helping to discriminate anomalies from these normal patterns.\n4.5. Comparison to State-of-the-Art # Finally, we compare our method with state-of-the-art. The comparison is first made on three public datasets which barely contain scene-dependent anomalies. To validate the effectiveness of our method on scene-dependent anomaly detection, we additionally make a comparison on the mixture datasets created upon ShanghaiTech.\nTable 4. Comparison results on UCSD Ped2 (Ped2), CUHK Avenue (Avenue), and ShanghaiTech (SHT). Besides the frame-level micro-averaged AUC(%) performance, we also list the inputs of the methods, in which \u0026lsquo;F\u0026rsquo; denotes the frame-level input and \u0026lsquo;O\u0026rsquo; is object-centric. The subscript \u0026lsquo;A\u0026rsquo; is appearance, \u0026lsquo;F\u0026rsquo; is optical flow, \u0026lsquo;S\u0026rsquo; is skeleton, and \u0026lsquo;M\u0026rsquo; is other motion information. 
Besides, in our HSC model, MA−,+ denotes using motion augmentation to generate both normal and abnormal samples.\nMethod Reference Input Ped2 Avenue SHT AMC [43] ICCV19 F 96.2 86.9 - Mem-AE [18] ICCV19 F 94.1 83.3 71.2 DeepOC [65] TNNLS19 F 96.9 86.6 - r-GAN [37] ECCV20 F 96.2 85.8 77.9 CDAE [4] ECCV20 F 96.5 86.0 73.3 MNAD [44] CVPR20 F 97.0 88.5 72.8 IPR [56] PRL20 F 96.3 85.1 73.0 LDF [45] WACV20 F 94.0 87.2 - CAC [63] MM20 F - 87.0 79.3 CT-D2GAN [14] MM21 F 97.2 85.9 77.7 AMMCN [2] AAAI21 F 96.6 86.6 73.7 MPN [38] CVPR21 F 96.9 89.5 73.8 AEP [71] TNNLS21 F 97.9 90.2 - SIGnet [12] TNNLS22 F 96.2 86.8 - IAAN [73] TCSVT22 F 92.9 80.5 80.3 ROADMAP [62] TNNLS22 F 96.4 88.3 76.6 GEPC [39] CVPR20 OS - - 76.1 STGformer [22] MM22 OS - 88.8 82.9 HSNBM [1] MM22 F+OA 95.2 91.6 76.5 STC-Graph [51] MM20 OA+OM - 89.6 74.7 SSMTL1[16] CVPR21 OA+OM 97.5 91.5 82.4 VEC [70] MM20 OA+OF 97.3 90.2 74.8 HF2-VAD [34] ICCV21 OA+OF 99.3 91.1 76.2 BAF [17] TPAMI22 OA+OF 98.7 92.3 82.7 CAFE [68] MM22 OA+OF 98.4 92.6 77.0 DERN [53] MM22 OA+OF 97.1 92.7 79.3 BDPN [5] AAAI22 OA+OF 98.3 90.3 78.1 HSC This work OA+OS 98.1† 92.4 83.0 HSC w/ MA−,+ This work OA+OS 98.1† 93.7 83.4\nComparison on scene-independent VAD. We compare our method with recent VAD methods that learn from normal data as well. The comparison results are presented in Table 4, in which the inputs of all methods are also provided for reference. Generally speaking, benefiting from the extracted high-level features, most of the object-centric methods perform better than the methods with frame-level inputs, although some of the latter methods also use motion information like optical flow. In addition, the proposed method outperforms all other methods on both Avenue and ShanghaiTech, validating the superiority of our design. Note that we are not able to test our full model on the UCSD Ped2 dataset due to its low resolution, in which no high-quality skeleton keypoints can be detected. 
Therefore, we only use the scene-appearance stream of our model for testing, which still achieves a performance higher than that of many methods.\n1 Here the micro-average AUC is reported from the officially released website https://github.com/lilygeorgescu/AED-SSMTL.\nTable 5. Comparison results on the scene-dependent mixture datasets built upon ShanghaiTech. MA−,+ denotes using motion augmentation to generate both normal and abnormal samples, MA− denotes augmenting normal samples only.\nMethod Reference Input [01,02] [04,08] [10,12] Mem-AE [18] ICCV19 F 77.7 60.2 50.2 MNAD [44] CVPR20 F 77.8 68.6 50 MPN [38] CVPR21 F 78.4 61.5 45.3 HF2-VAD [34] ICCV21 OA+OF 74.8 75.2 66.8 HSC This work OA+OS 82.8 80 87.3 HSC w/ MA− This work OA+OS 85.7 81.8 90.1 HSC w/ MA−,+ This work OA+OS 86.9 82.6 91\nComparison on scene-dependent VAD. Finally, we investigate the performance of our method on scene-dependent anomaly detection based on the mixture datasets introduced above. The results are presented in Table 5. We also test other SOTA methods [18, 34, 38, 44] using their released code for comparison. Unfortunately, we are not able to compare with the scene-aware methods [1, 5, 51] since their code is not available. The results show that the performance of the other methods, especially those with frame-level inputs, degrades dramatically. In contrast, all model variants of our proposed method consistently demonstrate promising performance.\n4.6. Limitations # Since our proposed method takes skeleton-based motion features as one of the inputs, the full model is restricted to human-related datasets and requires that the resolution of the datasets not be too low. Otherwise, only the scene-appearance stream works, which inevitably degrades the performance. A possible extension is replacing the skeleton-based motion features with optical flow and conducting contrastive learning based on the clustering results of optical flow features. 
Besides, other components of this framework can be replaced by other advanced modules, e.g., substituting another advanced background parsing model for the simple segmentation map.\n5. Conclusion # In this work, we have presented a hierarchical semantic contrast method to address scene-dependent video anomaly detection. The design of our hierarchical semantic contrastive learning, together with scene-aware autoencoders and motion augmentation, enables the proposed model to achieve promising results on both scene-independent and scene-dependent VAD. Experiments on three public datasets and self-created datasets have validated the effectiveness of our method.\nReferences # [1] Qianyue Bao, Fang Liu, Yang Liu, Licheng Jiao, Xu Liu, and Lingling Li. Hierarchical scene normality-binding modeling for anomaly detection in surveillance videos. In ACM MM, pages 6103–6112, 2022. 1 , 2 , 3 , 8\n[2] Ruichu Cai, Hao Zhang, Wen Liu, Shenghua Gao, and Zhifeng Hao. Appearance-motion memory consistency network for video anomaly detection. In AAAI, volume 35, pages 938–946, 2021. 6 , 8\n[3] Congqi Cao, Yue Lu, and Yanning Zhang. Context recovery and knowledge retrieval: A novel two-stream framework for video anomaly detection. arXiv preprint arXiv:2209.02899, 2022. 2\n[4] Yunpeng Chang, Zhigang Tu, Wei Xie, and Junsong Yuan. Clustering driven deep autoencoder for video anomaly detection. In ECCV, pages 1–16, 2020. 6 , 8\n[5] Chengwei Chen, Yuan Xie, Shaohui Lin, Angela Yao, Guannan Jiang, Wei Zhang, Yanyun Qu, Ruizhi Qiao, Bo Ren, and Lizhuang Ma. Comprehensive regularization in a bidirectional predictive network for video anomaly detection. In AAAI, pages 1–9, 2022. 8\n[6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018. 4\n[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 
A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607. PMLR, 2020. 2 , 4\n[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3\n[9] Ye Du, Zehua Fu, Qingjie Liu, and Yunhong Wang. Weakly supervised semantic segmentation by pixel-to-prototype contrast. In CVPR, pages 4320–4329, 2022. 2\n[10] Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, and Bo Dai. Revisiting skeleton-based action recognition. In CVPR, pages 2969–2978, 2022. 4 , 5\n[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996. 4\n[12] Zhiwen Fang, Jiafei Liang, Joey Tianyi Zhou, Yang Xiao, and Feng Yang. Anomaly detection with bidirectional consistency in videos. IEEE Transactions on Neural Networks and Learning Systems, 33(3):1079–1092, 2022. 6 , 8\n[13] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist: Multiple instance self-training framework for video anomaly detection. In CVPR, pages 14009–14018, 2021. 1 , 2\n[14] Xinyang Feng, Dongjin Song, Yuncong Chen, Zhengzhang Chen, Jingchao Ni, and Haifeng Chen. Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection. In ACM MM, pages 5546–5554, 2021. 8\n[15] Yixiao Ge, Dapeng Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. In NeurIPS, volume 33, pages 11309–11321, 2020. 2\n[16] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. Anomaly detection in video via self-supervised and multi-task learning. 
In CVPR, pages 12742–12752, 2021. 2 , 3 , 8\n[17] Mariana Iuliana Georgescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. A background-agnostic framework with adversarial training for abnormal event detection in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4505–4523, 2022. 1 , 2 , 6 , 8\n[18] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In ICCV, pages 1705–1714, 2019. 1 , 2 , 4 , 6 , 8\n[19] Tianyu Guo, Hong Liu, Zhan Chen, Mengyuan Liu, Tao Wang, and Runwei Ding. Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In AAAI, volume 36, pages 762–770, 2022. 3\n[20] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In CVPR, pages 733–742, 2016. 1 , 2\n[21] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020. 2\n[22] Chao Huang, Yabo Liu, Zheng Zhang, Chengliang Liu, Jie Wen, Yong Xu, and Yaowei Wang. Hierarchical graph embedded pose regularity learning via spatio-temporal transformer for abnormal behavior detection. In ACM MM, pages 307–315, 2022. 8\n[23] Chao Huang, Zhihao Wu, Jie Wen, Yong Xu, Qiuping Jiang, and Yaowei Wang. Abnormal event detection using deep contrastive learning for intelligent video surveillance system. IEEE Transactions on Industrial Informatics, 18(8):5171–5179, 2022. 2\n[24] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In CVPR, pages 7842–7851, 2019. 1 , 2 , 3\n[25] Radu Tudor Ionescu, Sorina Smeureanu, Marius Popescu, and Bogdan Alexe. 
Detecting abnormal events in video using narrowed normality clusters. In WACV, pages 1951–1960. IEEE, 2019. 2\n[26] Okan Kopuklu, Jiapeng Zheng, Hang Xu, and Gerhard Rigoll. Driver anomaly detection: A dataset and contrastive learning approach. In WACV, pages 91–100, 2021. 2\n[27] Junnan Li, Pan Zhou, Caiming Xiong, and Steven Hoi. Prototypical contrastive learning of unsupervised representations. In ICLR, 2020. 2\n[28] Nanjun Li, Faliang Chang, and Chunsheng Liu. A selftrained spatial graph convolutional network for unsupervised human-related anomalous event detection in complex scenes. IEEE Transactions on Cognitive and Developmental Systems, 2022. 2\n[29] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence , 36(1):18–32, 2013. 6\n[30] Swee Kiat Lim, Yi Loo, Ngoc-Trung Tran, Ngai-Man Cheung, Gemma Roig, and Yuval Elovici. Doping: Generative data augmentation for unsupervised anomaly detection with gan. In ICDM, pages 1122–1127. IEEE, 2018. 3\n[31] Kun Liu and Huadong Ma. Exploring background-bias for anomaly detection in surveillance videos. In ACM MM , pages 1490–1499, 2019. 1 , 2 , 4\n[32] Wen Liu, Weixin Luo, Zhengxin Li, Peilin Zhao, and Shenghua Gao. Margin learning embedded prediction for video anomaly detection with a few anomalies. In IJCAI , pages 3023–3030, 2019. 2\n[33] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In CVPR, pages 6536–6545, 2018. 1 , 2 , 6\n[34] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In ICCV, pages 13588–13597, 2021. 2 , 8\n[35] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In ICCV, pages 2720–2727, 2013. 6\n[36] Yue Lu, Congqi Cao, Yifan Zhang, and Yanning Zhang. 
Learnable locality-sensitive hashing for video anomaly detection. IEEE Transactions on Circuits and Systems for Video Technology, 2022. 2\n[37] Yiwei Lu, Frank Yu, Mahesh Kumar Krishna Reddy, and Yang Wang. Few-shot scene-adaptive anomaly detection. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, ECCV, pages 125–141, Cham, 2020. Springer International Publishing. 8\n[38] Hui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Learning normal dynamics in videos with meta prototype network. In CVPR, pages 15425–15434, 2021. 2 , 6 , 8\n[39] Amir Markovitz, Gilad Sharir, Itamar Friedman, Lihi Zelnik-Manor, and Shai Avidan. Graph embedded pose clustering for anomaly detection. In CVPR, pages 10539–10547, 2020. 2 , 3 , 8\n[40] Fanyang Meng, Hong Liu, Yongsheng Liang, Juanhui Tu, and Mengyuan Liu. Sample fusion network: An end-to-end data augmentation network for skeleton-based human action recognition. IEEE Transactions on Image Processing, 28(11):5281–5295, 2019. 3\n[41] Romero Morais, Vuong Le, Truyen Tran, Budhaditya Saha, Moussa Mansour, and Svetha Venkatesh. Learning regularity in skeleton trajectories for anomaly detection in videos. In CVPR, pages 11996–12004, 2019. 2 , 3\n[42] Khac-Tuan Nguyen, Dat-Thanh Dinh, Minh N. Do, and Minh-Triet Tran. Anomaly detection in traffic surveillance videos with gan-based future frame prediction. In ICMR, pages 457–463, 2020. 2\n[43] Trong-Nguyen Nguyen and Jean Meunier. Anomaly detection in video sequence with appearance-motion correspondence. In ICCV, pages 1273–1283, 2019. 8\n[44] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. In CVPR, pages 14372–14381, 2020. 1 , 2 , 4 , 8\n[45] Bharathkumar Ramachandra, Michael Jones, and Ranga Vatsavai. Learning a distance function with a siamese network to localize anomalies in videos. In WACV, pages 2598–2607, 2020. 1 , 6 , 8\n[46] Bharathkumar Ramachandra, Michael J. 
Jones, and Ranga Raju Vatsavai. A survey of single-scene video anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2293–2312, 2022. 1 , 5\n[47] Mahdyar Ravanbakhsh, Enver Sangineto, Moin Nabi, and Nicu Sebe. Training adversarial discriminators for crosschannel abnormal event detection in crowds. In WACV, pages 1896–1904, 2019. 1 , 2\n[48] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. 3\n[49] Mohammad Sabokrou, Mohsen Fayyaz, Mahmood Fathy, and Reinhard Klette. Deep-cascade: Cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Transactions on Image Processing , 26(4):1992–2004, 2017. 2\n[50] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In CVPR, pages 6479–6488, 2018. 1 , 2\n[51] Che Sun, Yunde Jia, Yao Hu, and Yuwei Wu. Scene-aware context reasoning for unsupervised abnormal event detection in videos. In ACM MM, pages 184–192, 2020. 1 , 2 , 3 , 8\n[52] Che Sun, Yunde Jia, and Yuwei Wu. Evidential reasoning for video anomaly detection. In ACM MM, pages 2106–2114, 2022. 2 , 3\n[53] Che Sun, Yunde Jia, and Yuwei Wu. Evidential reasoning for video anomaly detection. In ACM MM, pages 2106–2114, 2022. 8\n[54] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, pages 5693–5703, 2019. 4\n[55] Shengyang Sun and Xiaojin Gong. Long-short temporal coteaching for weakly supervised video anomaly detection. In ICME, pages 1–6. IEEE, 2023. 1 , 2\n[56] Yao Tang, Lin Zhao, Shanshan Zhang, Chen Gong, Guangyu Li, and Jian Yang. Integrating prediction and reconstruction for anomaly detection. Pattern Recognition Letters , 129:123–130, 2020. 8\n[57] Fida Mohammad Thoker, Hazel Doughty, and Cees GM Snoek. Skeleton-contrastive 3d action representation learning. In ACM MM, pages 1655–1663, 2021. 
3\n[58] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In ICCV, pages 4975–4986, 2021. 1 , 2\n[59] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research , 9(11), 2008. 7\n[60] Gaoang Wang, Yibing Zhan, Xinchao Wang, Mingli Song, and Klara Nahrstedt. Hierarchical semi-supervised contrastive learning for contamination-resistant anomaly detection. In ECCV, pages 110–128. Springer, 2022. 2\n[61] Menglin Wang, Jiachen Li, Baisheng Lai, Xiaojin Gong, and Xian-Sheng Hua. Offline-online associated camera-aware proxies for unsupervised person re-identification. IEEE Transactions on Image Processing, 31:6548–6561, 2022. 2\n[62] Xuanzhao Wang, Zhengping Che, Bo Jiang, Ning Xiao, Ke Yang, Jian Tang, Jieping Ye, Jingyu Wang, and Qi Qi. Robust unsupervised video anomaly detection by multipath frame prediction. IEEE Transactions on Neural Networks and Learning Systems, 33(6):2301–2312, 2022. 6 , 8\n[63] Ziming Wang, Yuexian Zou, and Zeming Zhang. Cluster attention contrast for video anomaly detection. In ACM MM , pages 2463–2471, 2020. 2 , 8\n[64] Jie Wu, Wei Zhang, Guanbin Li, Wenhao Wu, Xiao Tan, Yingying Li, Errui Ding, and Liang Lin. Weakly-supervised spatio-temporal anomaly detection in surveillance video. In Zhi-Hua Zhou, editor, IJCAI, pages 1172–1178. International Joint Conferences on Artificial Intelligence Organization, 8 2021. Main Track. 1 , 2\n[65] Peng Wu, Jing Liu, and Fang Shen. A deep one-class neural network for anomalous event detection in complex scenes. IEEE transactions on neural networks and learning systems , 31(7):2609–2622, 2019. 8\n[66] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018. 2 , 4\n[67] Yuxing Yang, Zeyu Fu, and Syed Mohsen Naqvi. 
A twostream information fusion approach to abnormal event detection in video. In ICASSP, pages 5787–5791, 2022. 2 , 3\n[68] Guang Yu, Siqi Wang, Zhiping Cai, Xinwang Liu, and Chengkun Wu. Effective video abnormal event detection by learning a consistency-aware high-level feature extractor. In ACM MM, pages 6337–6346, 2022. 8\n[69] Guang Yu, Siqi Wang, Zhiping Cai, Xinwang Liu, Chuanfu Xu, and Chengkun Wu. Deep anomaly discovery from unlabeled videos via normality advantage and self-paced refinement. In CVPR, pages 13987–13998, 2022. 1\n[70] Guang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin, and Marius Kloft. Cloze test helps: Effective video anomaly detection via learning to complete video events. In ACM MM, pages 583–591, 2020. 8\n[71] Jongmin Yu, Younkwan Lee, Kin Choong Yow, Moongu Jeon, and Witold Pedrycz. Abnormal event detection and localization via adversarial event prediction. IEEE Transactions on Neural Networks and Learning Systems, 2021. 8\n[72] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection. In CVPR, pages 14744–14754, 2022. 1\n[73] Sijia Zhang, Maoguo Gong, Yu Xie, AK Qin, Hao Li, Yuan Gao, and Yew-Soon Ong. Influence-aware attention networks for anomaly detection in surveillance videos. IEEE Transactions on Circuits and Systems for Video Technology , 2022. 8\n[74] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129:3069–3087, 2021. 3\n[75] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. Spatio-temporal autoencoder for video anomaly detection. In ACM MM, pages 1933–1941, 2017. 1 , 2\n[76] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. 
Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In CVPR, pages 1237–1246, 2019. 1 , 2\n[77] Tianfei Zhou, Meijie Zhang, Fang Zhao, and Jianwu Li. Regional semantic contrast and aggregation for weakly supervised semantic segmentation. In CVPR, pages 4299–4309, 2022. 2\n[78] Wenhao Zhou, Yingxuan Li, and Chunhui Zhao. Objectguided and motion-refined attention network for video anomaly detection. In ICME, pages 1–6. IEEE, 2022. 1 , 2 , 3\n","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/hierarchical-semantic-contrast-for-scene-aware-video-anomaly-detection/","section":"Papers","summary":"The paper proposes a hierarchical semantic contrast (HSC) method that leverages scene-aware autoencoders, semantic contrastive learning, and motion augmentation for improved scene-dependent and scene-independent video anomaly detection. It incorporates pre-trained video parsing models, hierarchical contrastive learning at scene and object levels, and skeleton-based motion augmentation to make the normal feature representations more compact and discriminative, thereby enhancing anomaly detection performance.","title":"Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection","type":"other"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/keng-teck-ma/","section":"Authors","summary":"","title":"Keng Teck Ma","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/kun-yu-lin/","section":"Authors","summary":"","title":"Kun-Yu Lin","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/minhoe-hur/","section":"Authors","summary":"","title":"Minhoe Hur","type":"authors"},{"content":"","date":"1 January 
2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/shengyang-sun/","section":"Authors","summary":"","title":"Shengyang Sun","type":"authors"},{"content":" This CVPR workshop paper is the Open Access version, provided by the Computer Vision Foundation.\nExcept for this watermark, it is identical to the accepted version;\nthe final published version of the proceedings is available on IEEE Xplore.\nTEVAD: Improved video anomaly detection with captions # Weiling Chen, Keng Teck Ma, Zi Jian Yew, Minhoe Hur, David Aik-Aun Khoo Hyundai Motor Group Innovation Center in Singapore 2 Bulim Link, Singapore 649674 # weiling.chen,kengteck.ma,zijian.yew, david.khoo@hmgics.com\nAbstract # Video surveillance systems are used to enhance public safety and protect private assets. Automatic anomaly detection is vital in such surveillance systems to reduce human labor and its associated costs. Previous works only consider spatial-temporal features. In many complex real-world scenarios, such visual features are unable to capture the semantic meanings required to further improve accuracy. To deal with such issues, we propose a novel framework: Text Empowered Video Anomaly Detection (TEVAD), which utilizes both visual and text features. Text features complement the visual features as they are semantically rich. Specifically, we compute text features based on the captions of the videos to capture the semantic meanings of abnormal events and thus improve the overall performance of video anomaly detection. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art results on four benchmark datasets (i.e. ShanghaiTech, UCF-Crime, XD-Violence, and UCSD-Pedestrians) and achieves improved robustness. We further analyze the captions to provide additional explainability for the anomalous videos identified by our proposed algorithm. Our code is available at https://github.com/coranholmes/TEVAD.\n1. 
Introduction # Video anomaly detection has many practical applications. In manufacturing, it can detect abnormal behavior (e.g. workers tripping) and irregular operations in the production process. In healthcare, intelligent video surveillance systems can reduce the workload of nurses, monitor the conditions of patients, and automatically trigger an alarm if an incident occurs so that timely assistance is delivered to patients. In the public safety domain, anomaly detection can be used to detect illegal behaviors such as fights and shootings so that police officers can be dispatched promptly, reducing personal and property losses [2, 37].\nFigure 1. Our TEVAD first generates dense captions for snippets of a video, before using both visual and text modalities for video anomaly detection. The right side shows the predicted anomaly score and the contributions of each word to the prediction. The use of captions provides explainability to the model: the illustrated video is classified anomalous due to the \u0026ldquo;skating\u0026rdquo; action.\nDespite the wide range of application scenarios, video anomaly detection is a challenging task because the training data are very unbalanced between positive and negative classes, i.e. there are usually fewer positive examples (abnormal events) than negative examples (regular events). In addition, the large diversity of abnormal events means that the training set often does not contain every possible type of anomaly, hindering the applicability of traditional supervised learning methods for detecting video anomalies. Furthermore, abnormal events in video are vaguely defined due to their ambiguous nature and may cover a wide variety of human activities. Such uncertainties further complicate video anomaly detection tasks.\nSince video anomaly detection can be used in many scenarios, there have been many attempts on this research topic. 
Most of the previous models use spatial-temporal visual features such as Temporal Segment Networks (TSN) [55], 3D ConvNet (C3D) [51] or Inflated 3D ConvNet (I3D) [7] to represent the video frames or snippets and perform video anomaly detection using these visual features.\nHowever, such methods do not consider the high-level semantic meanings of the videos, making it difficult to detect certain abnormal events and to generalize the models to complex scenarios. Moreover, the actual detection is based on the anomaly scores generated by the models, which are obscure to the users of front-end surveillance systems.\nOn the other hand, video captioning models are trained in a supervised manner using text-video pairs, and learn symbolic representations (words) that are grounded with the visual elements (e.g. people, objects, actions). Recently, through the use of advanced techniques such as the transformer [52], semantically rich features can be effectively embedded into video captioning models [30, 46, 63]. As a result, such models are able to interpret the input videos with semantically meaningful captions. Such semantic meanings are often absent or extremely difficult to extract solely from the visual features. Inspired by these works, we propose a novel approach that interprets the deep and rich semantic meanings through a video-to-text process to improve both the accuracy and robustness of weakly supervised video anomaly detection.\nSpecifically, we divide the videos into short snippets and generate dense captions for these snippets. Text features computed from these captions are fused with the visual features to compute the anomaly scores and perform video anomaly detection. Experimental results show that captions help improve the performance of video anomaly detection. The use of captions has the additional benefit of providing explainability to our model. 
An example is shown in Figure 1, where the high predicted anomaly score of the video snippet is largely due to the \u0026ldquo;skating\u0026rdquo; action.\nThe contributions of this work are:\nWe propose a framework, TEVAD, which exploits both visual and text features for video anomaly detection with different multi-modal fusion methods.\nWe extend multi-scale temporal learning to text features to better capture the dependencies between snippet features.\nOur proposed framework outperforms the state-of-the-art (SOTA) methods on four benchmark datasets and achieves improved robustness.\nWe further conduct additional analysis to provide explainability for the anomalous videos identified through the use of a word-masking protocol.\n2. Related work # 2.1. Image anomaly detection using captions # To the best of our knowledge, we are the first to propose incorporating captions in video anomaly detection tasks. Nevertheless, a few prior works use captions to perform image anomaly detection. In one of these works [20], the authors use a DenseCap [23] module to generate the regions of interest and their captions. Image-based features are extracted using CNN networks on the detected regions. Caption-based features are calculated using Word2Vec [36]. Then they concatenate the caption embeddings and image-based features and perform unsupervised anomaly detection using clustering. Another work [14] exploits the more recent CLIP [42] model and performs experiments on the CIFAR-10 dataset [27]. In their experimental setting, they treat one category as abnormal and the others as normal. Their proposed method basically follows the zero-shot classifier described in the original CLIP paper with limited adaptation. However, the assumption that the normal and abnormal categories are well defined is not practical in real-world scenarios.\n2.2. 
Video anomaly detection using visual features # The mainstream methods for anomaly detection in videos can be divided into several categories, depending on the amount of supervision during training.\nEarlier efforts focused on the unsupervised learning scenario, where only normal data are available during training. With the emergence of generative models, many approaches proposed such networks to learn the representation of normal data [11, 19, 32, 38, 40, 43, 44, 53, 54]. The basic assumption is that such models only learn the normal representation and thus would be unable to reconstruct the abnormal data. However, this assumption does not always hold due to the absence of prior knowledge of abnormal data, resulting in inferior performance. To address this issue, some researchers [3, 17, 64] proposed to generate pseudo anomalies and perform pseudo-supervised training on the normal and pseudo abnormal data.\nSince then, leveraging some abnormal samples has shown more potential compared to unsupervised learning methods. However, frame-level annotations on video datasets are especially expensive. Recently, weakly supervised methods have gradually attracted more attention in video anomaly detection tasks. This is because weakly supervised models can be trained on binary video-level labels while being able to predict frame-level labels. Most of the weakly supervised methods [8, 13, 35, 39, 45, 48, 50, 59, 68] are based on the Multiple Instance Learning (MIL) framework. These works mainly propose different aggregation functions to process features or anomaly scores so that video-level labels can be used to indirectly supervise instance-level learning. We design our framework to support weakly supervised learning as well.\n2.3. Video captioning # Video captioning is an important task in video understanding [47]. 
Several works [12, 21, 42, 49] focused on exploring different 2D/3D video representations to facilitate video captioning tasks. Moreover, many efforts have been made to learn object-level representations [22, 65, 66] to further improve the performance of video captioning.\nFigure 2. The overview of the proposed TEVAD. TEVAD first splits the input video into T snippets and feeds them into two individual branches. The text branch computes text features based on generated dense captions of snippets, while the visual branch extracts visual features. The features of both modalities go through a multi-scale temporal network before being fused together and passed to a binary classifier that outputs anomaly scores for each video snippet, which are then propagated to predict the frame-level anomaly scores.\nMore recently, with the success of transformer models [10, 52] in natural language processing, the computer vision community has tried to apply these ideas to different downstream tasks and achieved promising results [5, 18, 28, 29, 33, 67]. Specifically, [30, 46, 63] have proposed end-to-end vision transformer based models to perform video captioning and achieved significantly improved performance.\n3. Our method # Figure 2 shows an overview of our Text Empowered Video Anomaly Detection (TEVAD) framework. Given training videos V, TEVAD first splits each input video v ∈ V into T snippets. Afterwards, two separate branches extract visual and text features for each snippet in parallel. The text branch generates dense captions (Section 3.1.1) before transforming them to sentence embeddings (Section 3.1.2), while the visual branch extracts visual I3D [7] features. Multi-scale temporal networks are included in both branches to better capture multi-scale temporal dependencies (Section 3.3). The resulting multi-scale visual features F_vis ∈ ℝ^{d_vis} and text features F_txt ∈ ℝ^{d_txt} are fused together (Section 3.4), and used to calculate the feature magnitude of snippets. 
The snippets with the top-K largest feature magnitudes from normal and abnormal videos are used to train a binary snippet classifier. During the inference phase, the trained snippet classifier produces snippet-level predictions, which are propagated to the individual frames within each snippet to obtain frame-level predictions (Section 3.5).\n3.1. Generating text features for videos # 3.1.1 Generating dense captions for videos # Although some research works [26, 56] address generating dense captions for videos, the performance of such models is often not satisfactory compared to single caption generation models. In particular, it is challenging for dense caption models to determine the number of \u0026ldquo;important events\u0026rdquo; in the video sequences, which is essential in video anomaly detection.\nIn view of this, we propose to use single caption models to generate the captions needed for producing text features. To fuse the text features with visual features in the next step, a caption needs to be generated for each snippet. However, each snippet usually includes too few frames for generating meaningful video captions. To circumvent this problem, we employ a sliding window strategy and compute the caption over 64 consecutive frames every 16 frames. Although this sliding window strategy results in redundant information being encoded, it has the advantage of minimizing information loss and preserving important events.\nIn this work, we use one of the state-of-the-art video captioning models, SwinBERT [30], to generate the descriptions of video snippets. Apart from its performance, another reason we choose SwinBERT is that it uses a Video Swin Transformer (VidSwin) [34] to extract visual features instead of the I3D features used in the visual branch of TEVAD. 
The different network architectures encourage the learning of different representations so as to improve the anomaly detection performance.\nTo generate the captions, we use models pre-trained on several different video captioning datasets (i.e., MSVD [9], VATEX [58], TVC [29]) instead of training on the datasets used for the experiments described in Section 4. This is because the anomaly detection datasets do not contain the captions needed to train the captioning model. As a result, the captions do not always reflect the video contents accurately. Despite this, as we show in the results in Section 4, these inaccurate captions are still highly beneficial for anomaly detection.\n3.1.2 Generating sentence embeddings for videos # To compute the text features from generated video captions, we use SimCSE [15] to generate sentence embeddings. SimCSE is a framework that uses contrastive learning to learn sentence embeddings by using dropout noise and incorporating annotated pairs from natural language inference datasets. It uses \u0026ldquo;entailment\u0026rdquo; pairs as positives and \u0026ldquo;contradiction\u0026rdquo; pairs as hard negatives to train the framework and achieves good results.\nNotably, the proposed TEVAD framework is quite flexible in terms of each individual component, and SimCSE can be replaced by other state-of-the-art sentence embedding models with minimal adaptation.\n3.2. Generating visual features for videos # In this work, we extract I3D [7] features using a ResNet50 [21] as the backbone. Following previous works [13, 50], we perform ten-crop or five-crop augmentation on datasets to obtain better performance. For five-crop, we crop the given frame into the four corners and the central crop. For ten-crop, we further include the horizontally flipped versions of the five crops.\nC3D, TSN or other feature extractors can also be used to replace the I3D feature extractor used in the proposed framework. 
Previous experiments [8, 50] show that I3D achieves the best performance among feature extractors for similar tasks, thus we use I3D features for the following experiments.\n3.3. Multi-scale temporal feature learning # The Multi-scale Temporal Network (MTN) was first proposed in [50] to capture the long- and short-range temporal dependencies between visual features of snippets. In this work, we extend MTN to process the text features and then fuse them with visual features. The performance improves significantly after adding MTN to process text features (see Section 4).\nSimilar to the visual MTN, the text MTN also includes a 3-layer pyramid dilated convolution (PDC) [31] block and a non-local block (NLB) [57]. The PDC over the time span is used to learn multi-scale representations of video snippets while the NLB is used to learn the global temporal dependencies between video snippets. More details are introduced in Section A of the supplementary materials.\nThe outputs from the two blocks are concatenated and added to the original features to produce the final output of the text MTN, denoted as F̄_txt = f_MTN(F_txt; θ), where F̄_txt ∈ R^{d_txt} and θ comprises the weights for all convolution functions described in this section. Both visual and text features go through a similar process, thus we have F̄_vis = f_MTN(F_vis; θ), where F̄_vis ∈ R^{d_vis}. By applying MTN to process both visual and text features, TEVAD is able to learn the temporal dependencies between video snippets in both modalities.\n3.4. Multi-modal feature fusion # After obtaining the output from MTN, we employ the late fusion scheme [4] to fuse the features together. We investigate three different fusion methods: concatenation, addition and product. Since visual features are five/ten-cropped, the text features are tiled five/ten times to be consistent with the visual features.\n(a) concatenation: We directly concatenate F̄_vis and F̄_txt, given by X = {F̄_vis | F̄_txt}, where X ∈ R^{d_vis + d_txt}. 
(b) addition: We employ an element-wise addition between the visual and text embedding features. However, since d_vis \u0026gt; d_txt, we add a fully connected layer to reduce the dimension of the visual features to that of the text features and fuse the two by X = f_FC(F̄_vis; δ) + F̄_txt, where X ∈ R^{d_txt} and δ comprises all the weights of the fully connected layers described in this section. (c) product: We employ a Hadamard product between the visual and text embedding features. Similar to addition, a fully connected layer is added to reduce the dimension of the visual features and the fused features are calculated by X = f_FC(F̄_vis; δ) ⊙ F̄_txt, where X ∈ R^{d_txt}. Overall, we use X = f_fuse(F̄_vis, F̄_txt; δ) to denote the fused features in the following sections. Three fully connected layers are added to calculate the anomaly scores given by s = f_pred(X; δ). Additionally, S = {s_i}_{i=1}^T denotes the anomaly scores of snippets in one video v = {X_i}_{i=1}^T.\n3.5. Model training # During the training phase, the model only has access to video-level labels. According to [50], abnormal snippets have larger feature magnitude than normal ones. We follow the same work and use the l_2 norm to compute the feature magnitude. We use topK(v; k) to denote the subset that includes the k snippets with the highest magnitude among the T snippets in a video. The feature magnitude of a video v is computed as:\nThe purpose of training is to maximise the difference between the anomaly scores of normal videos and abnormal videos. 
Thus the total training loss of the normal and abnormal videos in one batch is denoted as:\nwhere c is a pre-defined constant and |V| is the number of videos in the training set.\nSimilarly, the average of the selected k snippets\u0026rsquo; anomaly scores is calculated to represent the anomaly score of the whole video as:\nFor the actual anomaly detection, we train a simple binary classifier by using a binary cross entropy loss:\nOverall, the loss function is given as below where α is a hyper-parameter to adjust the weights of the loss components.\n4. Experimental results # 4.1. Datasets and evaluation metrics # We present the results of TEVAD on four different datasets, namely UCSD Ped2 [62], ShanghaiTech [32], UCF-Crime [48], and XD-Violence [61]. Among the four datasets, UCF-Crime and XD-Violence are designed for the weakly supervised video anomaly detection task while the other two were originally designed for unsupervised or semi-supervised video anomaly detection tasks. A more detailed introduction to these datasets is provided in Section B of the supplementary materials.\nTo evaluate the performance of TEVAD, we consider the Area Under the ROC Curve (AUC), which is widely used for evaluation in video anomaly detection. We adopt the micro-averaged AUC by concatenating all frames and then computing the AUC scores on the UCF-Crime, ShanghaiTech and UCSD Ped2 datasets. For XD-Violence, since most of the previous work used Average Precision (AP), we use it as the evaluation metric to make the results comparable. Similarly, we adopt the micro-averaged AP by concatenating all frames.\n4.2. Implementation details # Visual Feature extraction Given a video, we split it into non-overlapping 16-frame snippets. For the UCF-Crime, ShanghaiTech and UCSD Ped2 datasets, we use an I3D feature extractor with a ResNet50 backbone pretrained on Kinetics-400 [24] to extract the visual features of snippets with a dimension d_vis = 2048 from the Mixed-5c layer. 
For XD-Violence, we directly use the I3D features provided by the authors of the dataset, with d_vis = 1024.\nText feature extraction We use the default setting for SwinBERT pretrained on the VATEX dataset [58] to generate captions. As described in Section 3, the caption of each snippet is generated based on the current and the following three snippets, for a total of 64 frames. To extract the sentence embeddings of the captions, we use the default setting of supervised SimCSE pretrained on bert-base-uncased. The dimension of text features for each snippet is d_txt = 768.\nMulti-scale temporal feature learning For the 3-layer pyramid dilated convolutions in MTN, we set the dilation parameters to 1, 2 and 4 respectively, following [50]. We set α = 0.0001 in Equation (5).\nTraining details We train our model on a single V100 GPU using PyTorch [41]. The model is trained with a batch size of 64 using an Adam [25] optimiser with a learning rate of 0.001 and weight decay of 0.005.\n4.3. Results on benchmark datasets # We divide previous models or frameworks for video anomaly detection into supervised and unsupervised methods and show the results in Tabs. 1 to 4. For comparisons, we use the published results of other methods.\nTable 1. Frame-level AUC results on UCSD Ped2 dataset.\nType Source Method AUC (%) Unsup CVPR’18 Liu et al. [32] 95.4 Unsup WACV’22 FastAno [40] 96.3 Unsup CVPR’21 SSMTL [16] 97.5 Unsup CVPR’20 CL-VAD [11] 97.8 Unsup TPAMI’21 Georgescu et al. [17] 98.7 Sup CVPR’19 GCN-Anomaly [68] 93.2 Sup CVPR’18 Sultani et al. [48] 92.3 Sup ICCV’21 RTFM [50] 98.6 Sup – TEVAD 98.7\nResults on UCSD Ped2: The frame-level micro AUC results on the UCSD Ped2 dataset are presented in Tab. 1. This dataset is relatively old and small-scale, and has thus been heavily studied. 
Nevertheless, our proposed model still matches or outperforms the SOTA unsupervised and supervised methods.\nResults on ShanghaiTech: The frame-level micro AUC results on the ShanghaiTech dataset are presented in Tab. 2. This dataset has been well studied but our proposed framework managed to outperform the SOTA unsupervised methods and supervised methods by a minimum of 14.9% and 1.2% respectively. [17] achieves similar performance as ours on this dataset but much worse on the UCF-Crime dataset, which indicates that their method can perform well on detecting anomalies in daily settings but does not adapt well to detecting rarer anomalies like crime-related events.\nTable 2. Frame-level AUC results on ShanghaiTech dataset.\nType Source Method AUC (%) Unsup CVPR’20 CL-VAD [11] 71.6 Unsup TPAMI’21 Georgescu et al. [17] 82.7 Unsup CVPR’22 SSPCAB [43] 83.6 Unsup CVPR’22 SSMTL [1] 83.7 Sup CVPR’19 GCN-Anomaly [68] 84.4 Sup ICME’20 AR-Net [53] 91.2 Sup IEEE Trans Multimedia’21 Chang et al. [8] 92.3 Sup CVPR’21 MIST [13] 94.8 Sup CVPR’22 BN-SVP [45] 96 Sup ICCV’21 RTFM [50] 97.2 Sup TIP’21 Wu et al. [59] 97.5 Sup – TEVAD 98.1\nTable 3. Frame-level AUC results on UCF-Crime dataset.\nType Source Method AUC (%) Unsup ICCV’19 BODS [54] 68.3 Unsup ICCV’19 GODS [54] 70.5 Unsup Pattern Recog’20 FSCN [60] 70.6 Sup CVPR’18 Sultani et al. [48] 75.4 Sup CVPR’19 GCN-Anomaly [68] 82.1 Sup CVPR’21 MIST [13] 82.3 Sup CVPR’22 BN-SVP [45] 83.4 Sup ICCV’21 RTFM [50] 84.3 Sup IEEE Trans Multimedia’21 Chang et al. [8] 84.6 Sup TIP’21 Wu et al. [59] 84.9 Sup – TEVAD 84.9\nTable 4. Frame-level AP results on XD-Violence dataset.\nType Source Method AP (%) Sup arXiv’22 CSL-TAL [39] 71.7 Sup CVPR’18 Sultani et al. [48] 75.7 Sup TIP’21 Wu et al. [59] 75.9 Sup IEEE Trans Multimedia’21 Chang et al. 
[8] 76.9 Sup ICCV’21 RTFM [50] 77.8 Sup – TEVAD 79.8\nResults on UCF-Crime: The frame-level micro AUC results on the UCF-Crime dataset are presented in Tab. 3. This dataset was first designed for weakly supervised anomaly detection tasks, thus there are fewer unsupervised solutions. Our proposed method outperforms all unsupervised methods by a minimum of 14.3% in AUC. In terms of supervised methods, our results are slightly better than the second-best model [59] if we consider two decimal digits, and our method outperforms theirs on all other datasets.\nResults on XD-Violence: The frame-level micro AP results on the XD-Violence dataset are presented in Tab. 4. Since this is a relatively new dataset released in 2020 with limited recent works focusing on unsupervised learning, we list only supervised methods here for comparison. Notably, XD-Violence is an audiovisual dataset which includes both visual and audio modalities. Since we only use the visual information for video anomaly detection, for a fair comparison, we include methods which use visual features only. Compared to other supervised methods with a similar setting, our method is 2% better than the second-best work [50] and more than 4% better than the other works [39, 48, 59].\nTo summarize, our proposed TEVAD framework consistently outperforms the SOTA methods on four benchmark datasets in the video anomaly detection field. This demonstrates that the proposed framework generalizes well to different background scenes.\n4.4. Ablation studies # 4.4.1 Effectiveness of main components # We perform an ablation study on different datasets to demonstrate the effectiveness of the main components in TEVAD, and the results are shown in percentage format in Tab. 5. To be consistent, we show the AUC results for the UCSD Ped2, ShanghaiTech and UCF-Crime datasets and AP results for the XD-Violence dataset. It can be observed from the table that all four datasets show a consistent improvement in performance when text features are added. 
In addition, the performance can be further boosted if the text features are processed using MTN. To sum up, TEVAD\u0026rsquo;s performance increases by 14.88%, 3.93%, 1.8% and 2.82% on the UCSD Ped2, ShanghaiTech, UCF-Crime and XD-Violence datasets respectively compared to using visual features alone.\nTable 5. Ablation study results.\nVisual Text Fusion Ped2 (%) Shanghai (%) Crime (%) Violence (%) ✓ × × 83.81 94.17 83.1 76.94 ✓ Vanilla concat 93.17 97.85 83.18 77.91 ✓ MTN concat 96.71 97.86 84.9 79.3 ✓ MTN add 98.69 98.1 84.13 79.76 ✓ MTN product 94.12 97.2 83.83 78.49\n4.4.2 Impact of caption quality # Since the anomaly detection datasets do not contain the captions needed to train the captioning model, we use models pre-trained on other video captioning datasets. To understand the impact of different pretrained models (i.e. caption quality), we perform additional experiments on the UCF-Crime dataset as it is the most challenging.\nIt can be observed from Tab. 6 that the VATEX pre-trained model performs better than the other two. These results are intuitive as MSVD [9] is a relatively small video captioning dataset and does not contain enough crime- or violence-related video content. In addition, although TVC [29] is relatively large, videos in this dataset are collected from TV programs and are significantly different from the surveillance contexts in the crime dataset. On the other hand, VATEX contains a large number of videos covering 600 human activities which follow the Kinetics-600 [6] taxonomy. Human activities such as punching a person (boxing), slapping, sword fighting and lighting a fire are highly likely to be relevant to crimes or violence. Such findings demonstrate that better captions help improve the overall video anomaly detection results.\nTable 6. Experimental results using different SwinBERT pretrained models.\nFusion Pre-trained AUC (%) add MSVD 82.9 concat MSVD 83.8 add TVC 82.3 concat TVC 82.6 add VATEX 84.1 concat VATEX 84.9\n4.5. Robustness comparisons # Another advantage of TEVAD is that it is more robust because it considers both visual and text modalities. We run 1,000 epochs for both RTFM and our method and evaluate every 5 epochs after training for 50 epochs. The standard deviations of AUC/AP are presented in Tab. 7.\nIt can be concluded from the experimental results that multi-modality features help improve the robustness of the model. TEVAD shows more robust results on the Ped2, ShanghaiTech and Crime datasets when the text features are added. In addition, the framework achieves the lowest standard deviation in terms of AUC/AP on all four datasets when MTN is applied to process the text features.\nTable 7. Robustness of using both modality features.\nVisual Text Fusion Ped2 (%) Shanghai (%) Crime (%) Violence (%) ✓ × × 14.77 3.18 1.98 4.63 ✓ Vanilla concat 7.4 1.62 1.86 4.96 ✓ MTN concat 3.43 1.33 1.75 6.92 ✓ MTN add 5.83 1.61 1.48 4.27 ✓ MTN product 4.62 2.09 1.62 4.67\n4.6. Qualitative analysis # We provide some qualitative results from different datasets in Figure 3. In terms of anomaly scores, TEVAD can effectively predict a small score for normal snippets and a large score for abnormal snippets regardless of the different background scenes and the types of abnormal events. Additionally, our model is able to detect multiple abnormal events (Figure 3 (c)), which makes it applicable to real-world scenarios. Moreover, the margins between normal and abnormal snippets are relatively clear.\nFigure 3. Example results from (a) ShanghaiTech (riding a bike), (b) XD-Violence (riot), and (c) UCF-Crime (vandalism) datasets. The top row shows predicted anomaly scores and the ground-truth labels. For frames labeled with green or red arrows, we also show the image frames and their associated generated captions in the bottom row.\nIn terms of usability (i.e. quality of generated captions), TEVAD works well on the ShanghaiTech dataset, which mainly contains day-to-day activities, and can effectively capture the main abnormal event like \u0026ldquo;riding bikes\u0026rdquo; (Figure 3 (a)). Figure 3 (b) and (c) present more challenging videos which include the abnormal events of \u0026ldquo;riot\u0026rdquo; and \u0026ldquo;vandalism\u0026rdquo; respectively. Notably, though the VATEX dataset used for training the captioning models does not explicitly include such activities, the generated captions capture similar semantic meaning in the embedding space. For example, \u0026ldquo;a large crowd of people are gathered\u0026rdquo; is possibly related to a riot while \u0026ldquo;throws it to the camera\u0026rdquo; indicates potential vandalism.\n4.7. Explainability analysis # Although the generated captions may not be completely accurate in some cases, we conduct additional analysis to demonstrate the explainability of incorporating captions for video anomaly detection tasks. During the inference phase, we iteratively mask each word in the caption of the snippet and calculate the sentence embeddings (i.e. text features) based on the masked captions. The text features are then fused with the visual features and fed into the trained model to predict the anomaly scores for each snippet of the video.\nFigure 4. Example results from (a) ShanghaiTech (riding a bike), (b) XD-Violence (shooting), and (c) UCF-Crime (arrest) datasets showing the contribution of each word in the caption to the snippet anomaly score. An image frame of the abnormal event from the snippet is also shown on the right of each caption.\nFigure 4 shows the explainability results to understand the contribution of each word in the captions of the snippets. The score above each word in the caption is the difference between the anomaly score obtained when this word is masked and the original anomaly score without masking. 
Therefore, a higher score indicates a higher contribution to the predicted anomaly score.\nFigure 4 (a) shows the caption and an image of a video snippet from the ShanghaiTech dataset. This snippet contains an abnormal event of \u0026ldquo;riding a bicycle\u0026rdquo;. Consequently, the word \u0026ldquo;bikes\u0026rdquo; contributes the most to identifying this anomalous event compared to other words in the caption. Similarly, in Figure 4 (b), the word \u0026ldquo;gun\u0026rdquo; contributes most to identifying the \u0026ldquo;shooting\u0026rdquo; scene in this snippet. On the other hand, Figure 4 (c) shows an inaccurate caption for a snippet related to an \u0026ldquo;arrest\u0026rdquo; scene from the crime dataset. Regardless of the inaccuracy of the caption, the word \u0026ldquo;fall\u0026rdquo;, which is possibly related to the \u0026ldquo;arrest\u0026rdquo; action, contributes significantly to identifying the anomalous event.\nThe observations described in this section and in Section 4.6 suggest that the performance of the TEVAD framework can potentially be further improved if some captions for the video anomaly detection datasets are available.\n5. Conclusions # Video anomaly detection is a critical yet challenging task in many real-world scenarios. Most previous works only consider using spatial-temporal visual features to perform video anomaly detection and fail to capture the semantic meaning of complex anomalies in real-world contexts. In this work, we have proposed a weakly supervised framework called TEVAD which uses both visual and text modality features to perform video anomaly detection tasks. We extend MTN to process sentence embeddings of captions to learn the dependencies between snippets and further improve the performance. In addition, the generated captions provide explainable results to surveillance end users. 
Our proposed TEVAD framework achieves SOTA performance on four different benchmark datasets.\nReferences # [1] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20143–20153, 2022. 6\n[2] S Anoopa and A Salim. Survey on anomaly detection in surveillance videos. Materials Today: Proceedings, 2022. 1\n[3] Marcella Astrid, Muhammad Zaigham Zaheer, Jae-Yeong Lee, and Seung-Ik Lee. Learning not to reconstruct anomalies. 2021. 2\n[4] Souhail Bakkali, Zuheng Ming, Mickaël Coustaty, and Marçal Rusiñol. Visual and textual deep feature fusion for document image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 562–563, 2020. 4\n[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020. 3\n[6] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018. 6\n[7] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. 1 , 3 , 4\n[8] Shuning Chang, Yanchao Li, Shengmei Shen, Jiashi Feng, and Zhiying Zhou. Contrastive attention for video anomaly detection. IEEE Transactions on Multimedia, 24:4067–4076, 2021. 2 , 4 , 6\n[9] David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. 
In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pages 190–200, 2011. 3 , 6\n[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 3\n[11] Keval Doshi and Yasin Yilmaz. Continual learning for anomaly detection in surveillance videos. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 254–255, 2020. 2 , 5 , 6\n[12] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019. 2\n[13] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14009–14018, 2021. 2 , 4 , 6\n[14] William Gan. Language guided out-of-distribution detection. 2021. 2\n[15] Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021. 4\n[16] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. Anomaly detection in video via self-supervised and multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12742–12752, 2021. 5\n[17] Mariana Iuliana Georgescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. A background-agnostic framework with adversarial training for abnormal event detection in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4505–4523, 2021. 2 , 5 , 6\n[18] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 244–253, 2019. 3\n[19] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1705–1714, 2019. 2\n[20] Yusuke Hatae, Qingpu Yang, Muhammad Fikko Fadjrimiratno, Yuanyuan Li, Tetsu Matsukawa, and Einoshin Suzuki. Detecting anomalous regions from an image based on deep captioning. In VISIGRAPP (5: VISAPP), pages 326–335, 2020. 2\n[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2 , 4\n[22] Yaosi Hu, Zhenzhong Chen, Zheng-Jun Zha, and Feng Wu. Hierarchical global-local temporal modeling for video captioning. In Proceedings of the 27th ACM International Conference on Multimedia, pages 774–783, 2019. 2\n[23] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4565–4574, 2016. 2\n[24] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 , 2017. 5\n[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 , 2014. 5\n[26] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017. 3\n[27] Alex Krizhevsky, Geoffrey Hinton, et al. 
Learning multiple layers of features from tiny images. 2009. 2\n[28] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696, 2018. 3\n[29] Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. Tvr: A large-scale dataset for video-subtitle moment retrieval. In European Conference on Computer Vision, pages 447–463. Springer, 2020. 3 , 6\n[30] Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, and Lijuan Wang. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17949– 17958, 2022. 2 , 3\n[31] Chenyang Liu, Xiangyu Xu, and Yujin Zhang. Temporal attention network for action proposal. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 2281–2285. IEEE, 2018. 4\n[32] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018. 2 , 5\n[33] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 3\n[34] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3202–3211, 2022. 3\n[35] Hui Lv, Chuanwei Zhou, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Localizing anomalies from weakly-labeled videos. IEEE transactions on image processing, 30:4505– 4515, 2021. 2\n[36] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 
2\n[37] Rashmiranjan Nayak, Umesh Chandra Pati, and Santos Kumar Das. A comprehensive review on deep learning-based methods for video anomaly detection. Image and Vision Computing, 106:104078, 2021. 1\n[38] Trong-Nguyen Nguyen and Jean Meunier. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1273–1283, 2019. 2\n[39] Aniello Panariello, Angelo Porrello, Simone Calderara, and Rita Cucchiara. Consistency-based self-supervised learning for temporal anomaly localization. arXiv preprint arXiv:2208.05251, 2022. 2 , 6\n[40] Chaewon Park, MyeongAh Cho, Minhyeok Lee, and Sangyoun Lee. Fastano: Fast anomaly detection via spatiotemporal patch transformation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2249–2259, 2022. 2 , 5\n[41] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019. 5\n[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2\n[43] Nicolae-Cătălin Ristea, Neelu Madan, Radu Tudor Ionescu, Kamal Nasrollahi, Fahad Shahbaz Khan, Thomas B Moeslund, and Mubarak Shah. Self-supervised predictive convolutional attentive block for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13576–13586, 2022. 2 , 6\n[44] Mohammad Sabokrou, Mahmood Fathy, Guoying Zhao, and Ehsan Adeli. Deep end-to-end one-class classifier. 
IEEE transactions on neural networks and learning systems , 32(2):675–684, 2020. 2\n[45] Hitesh Sapkota and Qi Yu. Bayesian nonparametric submodular video partition for robust anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3212–3221, 2022. 2 , 6\n[46] Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. End-to-end generative pretraining for multimodal video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 17959–17968, 2022. 2 , 3\n[47] Vijeta Sharma, Manjari Gupta, Ajai Kumar, and Deepti Mishra. Video processing using deep learning techniques: A systematic literature review. IEEE Access, 2021. 2\n[48] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018. 2 , 5 , 6\n[49] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence, 2017. 2\n[50] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4975–4986, 2021. 2 , 4 , 5 , 6\n[51] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015. 1\n[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 
2 , 3\n[53] Boyang Wan, Yuming Fang, Xue Xia, and Jiajie Mei. Weakly supervised video anomaly detection via centerguided discriminative learning. In 2020 IEEE International\nConference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2020. 6\n[54] Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8201–8211, 2019. 2 , 6\n[55] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016. 1\n[56] Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6847– 6857, 2021. 3\n[57] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018. 4\n[58] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, highquality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019. 3 , 5\n[59] Peng Wu and Jing Liu. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing, 30:3513–3527, 2021. 2 , 6\n[60] Peng Wu, Jing Liu, Mingming Li, Yujia Sun, and Fang Shen. Fast sparse coding networks for anomaly detection in videos. Pattern Recognition, 107:107515, 2020. 6\n[61] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. 
In European conference on computer vision , pages 322–339. Springer, 2020. 5\n[62] Dan Xu, Rui Song, Xinyu Wu, Nannan Li, Wei Feng, and Huihuan Qian. Video anomaly detection based on a hierarchical activity discovery within spatio-temporal contexts. Neurocomputing, 143:144–152, 2014. 5\n[63] Hanhua Ye, Guorong Li, Yuankai Qi, Shuhui Wang, Qingming Huang, and Ming-Hsuan Yang. Hierarchical modular network for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17939–17948, 2022. 2 , 3\n[64] Muhammad Zaigham Zaheer, Jin-ha Lee, Marcella Astrid, and Seung-Ik Lee. Old is gold: Redefining the adversarially learned one-class classifier training paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14183–14193, 2020. 2\n[65] Junchao Zhang and Yuxin Peng. Object-aware aggregation with bidirectional temporal graph for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8327–8336, 2019. 2\n[66] Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, and Zheng-Jun Zha. Object relational graph with teacher-recommended learning for video captioning. In\nProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13278–13288, 2020. 2\n[67] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6881–6890, 2021. 3 [68] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1237–1246, 2019. 
2 , 5 , 6 ","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/chen_tevad_improved_video_anomaly_detection_with_captions_cvprw_2023_paper/","section":"Papers","summary":"Proposes a framework that utilizes both visual and text features, generated through dense video captions, to enhance anomaly detection performance and explainability in videos.","title":"TEVAD: Improved video anomaly detection with captions","type":"method"},{"content":" Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos # Rongqin Liang, Student Member, IEEE, Yuanman Li, Senior Member, IEEE, Jiantao Zhou, Senior Member, IEEE, and Xia Li, Member, IEEE\nAbstract—Traffic anomaly detection (TAD) in driving videos is critical for ensuring the safety of autonomous driving and advanced driver assistance systems. Previous single-stage TAD methods primarily rely on frame prediction, making them vulnerable to interference from dynamic backgrounds induced by the rapid movement of the dashboard camera. While two-stage TAD methods appear to be a natural solution to mitigate such interference by pre-extracting background-independent features (such as bounding boxes and optical flow) using perceptual algorithms, they are susceptible to the performance of firststage perceptual algorithms and may result in error propagation. In this paper, we introduce TTHF, a novel single-stage method aligning video clips with text prompts, offering a new perspective on traffic anomaly detection. Unlike previous approaches, the supervised signal of our method is derived from languages rather than orthogonal one-hot vectors, providing a more comprehensive representation. Further, concerning visual representation, we propose to model the high frequency of driving videos in the temporal domain. 
This modeling captures the dynamic changes of driving scenes, enhances the perception of driving behavior, and significantly improves the detection of traffic anomalies. In addition, to better perceive various types of traffic anomalies, we carefully design an attentive anomaly focusing mechanism that visually and linguistically guides the model to adaptively focus on the visual context of interest, thereby facilitating the detection of traffic anomalies. It is shown that our proposed TTHF achieves promising performance, outperforming state-of-the-art competitors by +5.4% AUC on the DoTA dataset and achieving high generalization on the DADA dataset.\nIndex Terms—Traffic anomaly detection, multi-modality learning, high frequency, attention.\nI. INTRODUCTION # Traffic anomaly detection (TAD) in driving videos is a crucial component of automated driving systems [1], [2] and advanced driver assistance systems [3], [4]. It is designed to detect anomalous traffic behavior from the first-person driving perspective. Accurate detection of traffic anomalies helps improve road safety, shorten traffic recovery times, and reduce the number of regrettable daily traffic accidents.\nThis work was supported in part by the Key project of Shenzhen Science and Technology Plan under Grant 20220810180617001 and the Foundation for Science and Technology Innovation of Shenzhen under Grant RCBS20210609103708014; in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2022A1515010645; in part by the Open Research Project Programme of the State Key Laboratory of Internet of Things for Smart City (University of Macau) under Grant SKLIoTSC(UM)2021-2023/ORP/GA04/2022. (Corresponding author: Yuanman Li)\nRongqin Liang, Yuanman Li and Xia Li are with Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, China (email: 1810262064@email.szu.edu.cn; yuanmanli@szu.edu.cn; lixia@szu.edu.cn).\nJiantao Zhou is with the State Key Laboratory of Internet of Things for Smart City and the Department of Computer and Information Science, University of Macau, Macau (e-mail: jtzhou@um.edu.mo).\nFig. 1. Existing TAD approaches of single-stage paradigm (a) and two-stage paradigm (b) vs. the proposed TTHF framework (c). Existing single-stage approaches mainly rely on frame prediction, which is difficult to adapt to detecting traffic anomalies with a dynamic background, while the two-stage TAD approaches are vulnerable to the performance of the first-stage perceptual algorithms. The proposed TTHF framework is text-driven and focuses on capturing dynamic changes in driving scenes through modeling temporal high frequency to facilitate traffic anomaly detection.\nGiven the significance of traffic anomaly detection, scholars are actively involved in this field and have proposed constructive research [5]–[9]. We observe that these works on TAD can be mainly divided into the single-stage paradigm [6], [10], [11] and the two-stage paradigm [8], [9], [12]. As shown in Fig. 1, previous TAD methods mainly embrace a single-stage paradigm, exemplified by frame prediction [6] and reconstruction-based [11] TAD approaches. Nevertheless, these methods are subject to the dynamic backgrounds caused by the rapid movement of the dashboard camera and have limited accuracy in detecting traffic anomalies. To confront the challenges posed by dynamic backgrounds, researchers have advocated for TAD methods [8], [9], [12] that utilize a two-stage paradigm. These two-stage approaches first extract features such as optical flow, bounding boxes, or tracking IDs from video frames using existing visual perception algorithms, and then apply a TAD model to detect traffic anomalies. 
While these approaches have laid the foundation for TAD in driving videos, they are susceptible to the performance of the first-stage visual perception algorithm, which may cause error propagation, resulting in false or missed detections of traffic anomalies. Therefore, in this paper, we strive to explore an effective single-stage paradigm-based approach for traffic anomaly detection in driving videos.\nCopyright © 2024 IEEE. Personal use of this material is permitted.\nRecently, large-scale visual language pre-training models [13]–[15] have achieved remarkable results by utilizing language knowledge to assist with visual tasks. Among them, CLIP [13] stands out for its exceptional transferability through the alignment of image-text semantics and has demonstrated outstanding capabilities across various computer vision tasks such as object detection [16], semantic segmentation [17], and video retrieval [18]. The success of image-text alignment techniques can be attributed to their ability to map the natural languages associated with an image into high-dimensional non-orthogonal vectors. This is in contrast to traditional supervised methods that map predefined labels to low-dimensional one-hot vectors. Compared to the low-dimensional one-hot vectors, these high-dimensional vectors offer more comprehensive representations to guide the network training. Motivated by this, we endeavor to investigate a language-guided approach for detecting traffic anomalies in driving videos. Intuitively, the transition of CLIP from image-text alignment to video-text alignment primarily involves the consideration of modeling temporal dimensions. Despite the exploration of various methods [19]–[22] for temporal modeling, encompassing techniques such as Average Pooling, Conv1D, LSTM, and Transformer, the existing approaches predominantly concentrate on aggregating visual context along the temporal dimension. 
In the context of traffic anomaly detection for driving videos, we emphasize that beyond the visual context, characterizing dynamic changes in the driving scene along the temporal dimension proves advantageous in determining abnormal driving behavior. For instance, traffic events such as vehicle collisions or loss of control often result in significant and rapid alterations in the driving scene. Therefore, how to effectively characterize the dynamic changes of driving scenes holds paramount importance for traffic anomaly detection in driving videos .\nAdditionally, considering that different types of traffic anomalies exhibit unique characteristics, a straightforward encoding of the entire driving scene may diminish the discriminability of driving events and impede the detection of diverse traffic anomalies. For instance, traffic anomalies involving the ego-vehicle are often accompanied by global jittering of the dashboard camera, while anomalies involving non-ego vehicles often lead to local anomalies in the driving scene. Consequently, how to better perceive various types of traffic anomalies proves crucial for traffic anomaly detection .\nIn this work, we propose a novel traffic anomaly detection approach: Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling (TTHF), as shown in Fig. 2. To represent driving videos comprehensively, our fundamental idea is to not only capture the spatial visual context but also emphasize the depiction of dynamic changes in the driving scenes, thereby enhancing the visual representation of driving videos. Specifically, we initially leverage the pre-trained visual encoder of CLIP, endowed with rich prior knowledge of visual language semantics, to encode the visual context of driving videos. Then, to capture the dynamic changes in driving scenes, we innovatively introduce temporal high-frequency modeling (THFM) to obtain temporal high frequency representations of driving videos along the temporal dimension. 
Subsequently, the visual context and temporal high-frequency representations are fused to enhance the overall visual representation of driving videos. To better perceive various types of traffic anomalies, we propose an attentive anomaly focusing mechanism (AAFM) to guide the model to adaptively focus both visually and linguistically on the visual context of interest, thereby facilitating the detection of traffic anomalies.\nIt is shown that our proposed TTHF model exhibits promising performance on the DoTA dataset [9], outperforming state-of-the-art competitors by +5.4% AUC. Furthermore, without any fine-tuning, the AUC performance of TTHF on the DADA dataset [23] demonstrates its generalization capability. The main contributions of our work can be summarized as follows:\nWe introduce a simple yet effective single-stage traffic anomaly detection method that aligns the visual semantics of driving videos with matched textual semantics to identify traffic anomalies. In contrast to previous TAD methods, the supervised signals in our approach are derived from text, offering a more comprehensive representation in high-dimensional space.\nWe emphasize the modeling of high frequency in the temporal domain for driving videos. In contrast to previous approaches that solely aggregate visual context along the temporal dimension, we place additional emphasis on modeling high frequency in the temporal domain. This enables us to characterize dynamic changes in the driving scene over time, thereby significantly enhancing the performance of traffic anomaly detection.\nWe further propose an attentive anomaly focusing mechanism to enhance the perception of various traffic anomalies. Our proposed mechanism guides the model both visually and linguistically to adaptively focus on the visual contexts of interest, facilitating the detection of traffic anomalies.\nComprehensive experimental results on public benchmark datasets demonstrate the superiority and robustness of the proposed method. Compared to existing state-of-the-art methods, the proposed TTHF improves AUC by +5.4% on the DoTA dataset and also achieves state-of-the-art AUC on the DADA dataset without any fine-tuning.\nThe remainder of this paper is organized as follows. Section II gives a brief review of related works. Section III details our proposed TTHF for traffic anomaly detection in driving videos. Extensive experimental results are presented in Section IV, and we finally draw a conclusion in Section V.\nII. RELATED WORKS # A. Traffic Anomaly Detection (TAD) in Driving Videos # Traffic anomaly detection (TAD) in driving videos aims to identify abnormal traffic events from the perspective of driving, such as collisions with other vehicles or obstacles, being out of control, and so on. Such events can be classified into two categories: ego-involved anomalies (i.e., traffic events involving the ego-vehicle) and non-ego anomalies (i.e., traffic events involving observed objects but not the ego-vehicle). A closely related topic to TAD in driving videos is anomaly detection in surveillance videos (VAD), which involves identifying abnormal events such as fights, assaults, thefts, arson, and so forth from a surveillance viewpoint. In recent years, various VAD methods [24]–[29] have been proposed for surveillance videos, which have greatly contributed to the development of this field. However, in contrast to the static background in surveillance videos, the background in driving videos is dynamically changing due to the fast movement of the ego vehicle, which makes the VAD methods prone to failure in the TAD task [9], [12]. Recently, Wang et al. [30] proposed a method for detecting crowd flow anomalies by comparing anomalous samples with normal samples that were estimated based on prototypes. However, crowd flow anomaly detection methods are difficult to apply to the TAD task due to the differences in tasks and the data processed. 
In this paper, we work on the task of traffic anomaly detection in driving videos to provide a new solution for this community.\nEarly TAD methods [5], [31] mainly extracted features in a handcrafted manner and utilized a Bayesian model for classification. However, these methods are sensitive to well-designed features and generally lack robustness in dealing with a wide variety of traffic scenarios. With the advances of deep neural networks in computer vision, researchers have proposed deep learning-based approaches for TAD, laying the foundation for this task. Based on our observations, the existing TAD methods can be broadly classified into the single-stage paradigm [6], [10], [11] and the two-stage paradigm [12], [32]–[34].\nPrevious single-stage paradigm-based TAD approaches mainly comprise frame reconstruction-based and frame prediction-based TAD approaches [6], [10], [11]. These methods used reconstruction or prediction errors of video frames to evaluate traffic anomalies. For instance, Liu et al. [6] predicted video frames of normal traffic events through appearance and motion constraints, thereby helping to identify traffic anomalies that do not conform to expectations. Unfortunately, these methods tend to detect ego-involved anomalies (e.g., out of control) and perform poorly on non-ego traffic anomalies. This is primarily attributed to ego-involved anomalies causing significant shaking of the dashboard camera, leading to substantial global errors in frame reconstruction or prediction. Such errors undoubtedly facilitate anomaly detection. However, the methods based on frame reconstruction or prediction have difficulty distinguishing the local errors caused by the traffic anomalies of other road participants because of the interference of the dynamic background from the fast-moving ego-vehicle. 
This impairs their ability to detect traffic anomalies.\nIn recent years, to address the challenges posed by dynamic backgrounds, researchers have proposed applying a two-stage paradigm to the traffic anomaly detection task. In this paradigm, the perception algorithm is initially applied to extract visual features in the first stage. Then, the TAD model utilizes these features to detect traffic anomalies. For instance, Yao et al. [9], [32] applied Mask-RCNN [35], FlowNet [36], DeepSort [37], and ORBSLAM [38] algorithms to extract bounding boxes (bboxes), optical flow, tracking IDs, and ego motion, respectively. Then, they used these visual features to predict the future locations of objects over a short horizon and detected traffic anomalies based on the deviation of the predicted location. Along this line, Fang et al. [12] used optical flow and bboxes as visual features. They attempted to collaborate on frame prediction and future object localization tasks [39] to detect traffic anomalies by analyzing inconsistencies in predicted frames, object locations, and the spatial relation structure of the scene. Zhou et al. [8] obtained bboxes of objects in the scene from potentially abnormal frames as visual features. They then encoded the spatial relationships of the detected objects to determine the abnormality of these frames. Despite the success of the two-stage paradigm TAD methods, they rely on the perception algorithms in the first stage, which may cause error propagation and lead to missed or false detection of traffic anomalies. Different from existing TAD methods, we propose a text-driven single-stage traffic anomaly detection approach that provides a promising solution for this task.\nB. Vision-Text Multi-Modality Learning # Recently, there has been a gradual focus on vision-text multi-modal learning. 
Among them, contrastive language-image pre-training methods have achieved remarkable results in many computer vision tasks such as image classification [13], [14], object detection [16], [40], semantic segmentation [17], [41] and image retrieval [42], [43]. At present, CLIP [13] has become a mainstream visual learning method, which connects visual signals and language semantics by comparing large-scale image-language pairs. Essentially, compared to traditional supervised methods that convert labels into orthogonal one-hot vectors, CLIP provides richer and more comprehensive supervision information by collecting large-scale image-text pairs from web data and mapping the text into high-dimensional supervision signals (usually non-orthogonal). Following this idea, many scholars have applied CLIP to various tasks in the video domain, including video action recognition [19], [44], video retrieval [18], [20], [45], video recognition [46], [47], and so on. For example, ActionCLIP [19] modeled the video action detection task as a video-text matching problem in a multi-modal learning framework and strengthened the video representation with more semantic language supervision to enable the model to perform zero-shot action recognition. More recently, Wu et al. [48] proposed a vision-language model for anomaly detection in surveillance videos. However, as mentioned earlier, traffic anomaly detection faces the problem of dynamic changes in the driving scene, which often makes VAD methods prone to failure in TAD tasks. To the best of our knowledge, there is no effective approach to model the traffic anomaly detection task from the perspective of vision-text multi-modal learning. In this paper, we preliminarily explore an effective text-driven method for traffic anomaly detection, which we hope can provide a new perspective on this task.\nFig. 2. Overview of our proposed TTHF. It is a CLIP-like framework for traffic anomaly detection. 
In this framework, we first apply a visual encoder to extract visual representations of driving video clips. Then, we propose Temporal High-Frequency Modeling (THFM) to characterize the dynamic changes of driving scenes and thus construct a more comprehensive representation of driving videos. Finally, we introduce an attentive anomaly focusing mechanism (AAFM) to enhance the perception of various types of traffic anomalies. Besides, for brevity, we denote the cross-attention as CA, the visually focused representation as VFR, and the linguistically focused representation as LFR.\nIII. THE PROPOSED APPROACH: TTHF # In this section, we mainly introduce the proposed TTHF framework. First, we describe the overall framework of TTHF. Then, we explain two key modules in TTHF, i.e., temporal high-frequency modeling (THFM) and the attentive anomaly focusing mechanism (AAFM). Moreover, we describe the contrastive learning strategy for cross-modal learning of video-text pairs, and finally show how to perform traffic anomaly detection in our TTHF.\nA. Overview of Our TTHF Framework # The overall framework of TTHF is illustrated in Fig. 2. It presents a CLIP-like two-stream framework for traffic anomaly detection. For the visual context representation, considerable research [49]–[51] has demonstrated that CLIP possesses a robust foundation of vision-language prior knowledge. Leveraging this acquired semantic knowledge for anomaly detection in driving videos facilitates the perception and comprehension of driving behavior. Therefore, we advocate applying the pretrained visual encoder of CLIP to extract visual representations from driving video clips of two consecutive frames. After obtaining the frame representations, we employ Average Pooling along the temporal dimension as in previous works [19]–[21] to aggregate these representations to characterize the visual context of the video clip. 
For the text representation, we first describe normal and abnormal traffic events as text prompts (i.e. , a1 and a2 in Table I), and then apply the pretrained textual encoder in CLIP to extract text representations.\nIntuitively, after extracting the visual and textual representations of driving video clips, we can directly leverage contrastive learning to align them for traffic anomaly detection. However, in our task, solely modeling the visual representation from visual context is insufficient to capture the dynamic changes in the driving scene. Therefore, we introduce temporal high-frequency modeling (THFM) to characterize the dynamic changes and provide a more comprehensive representation of the driving video clips. Additionally, to better perceive various types of traffic anomalies, we further propose an attentive anomaly focusing mechanism (AAFM) to adaptively focus on the visual context of interest in the driving scene, thereby facilitating the detection of traffic anomalies. In the following sections, we will introduce these two key modules in detail.\nB. Temporal High-Frequency Modeling (THFM) # Video-text alignment diverges from image-text alignment by necessitating consideration of temporal characteristics. Numerous methods [19]–[21] have effectively employed CLIP in addressing downstream tasks within the video domain. The modeling strategies adopted in these approaches for the temporal domain encompass various techniques such as Average Pooling , Conv1D , LSTM, and Transformer. These strategies primarily emphasize aggregating visual context from distinct video frames along the temporal dimension. Nevertheless, for the anomaly detection task in driving videos, we contend that not only the visual context but also the temporal dynamic changes in the driving scene hold significant importance in modeling driving behavior. For instance, a collision or loss of vehicle control often induces substantial changes in the driving scene within a brief timeframe. 
Therefore, in our work, we propose to model the visual representation of driving videos in two aspects, i.e., the visual context of video frames in the spatial domain and the dynamic changes of driving scenes in the temporal domain. The key observation is that the high frequency of the driving video in the temporal domain reflects the dynamic changes of the driving scene. To clarify, we present several cases in Fig. 4 for illustration. Based on the above observations, we introduce Temporal High-Frequency Modeling (THFM) to enhance the visual representation of the driving video within the spatio-temporal domain.\nFig. 3. An illustration of the AAFM. The original video frames are displayed in column (a). In column (b), we visualize the attention of the visual representation to the deep features of a video clip under the visually focused strategy (VFS). In column (c), we visualize the attention of the soft text representation to the deep features of a video clip under the linguistically focused strategy (LFS). We present two types of traffic anomaly scenarios. Specifically, case 1 illustrates an instance where the ego-vehicle experiences loss of control while executing a turn. In case 2, the driving vehicle observes a collision between the car turning ahead and the motorcycle traveling straight on the right.\nFig. 4. An illustration of the high frequency. We show 3 cases as examples. The first and second columns correspond to the original consecutive video frames, and the last column is the high-frequency component extracted along the temporal dimension.\nOur fundamental idea involves utilizing the high frequency presented in the temporal domain of the driving video to characterize dynamic changes. 
Specifically, we first extract the high frequency of the driving video clip in the temporal dimension, which is formulated as:\n$I^{hp}_n = HP(I^n_{t-1}, I^n_t)$\nwhere HP(·) is the difference operation to extract high frequency $I^{hp}_n$ along the temporal dimension from two consecutive frames t − 1 and t of the n-th driving video clip. Further, we encode $I^{hp}_n$ to the high-frequency representation by\n$H^n_t = F_{hf}(I^{hp}_n)$\nwhere $F_{hf}(·)$ represents the high-frequency encoder, sharing the same architecture as the visual encoder (i.e., ResNet50 unless specified otherwise). The resultant high-frequency representation is denoted as $H^n_t$. Finally, to obtain the visual representation of the driving video clip in the spatio-temporal domain, we fuse the spatial visual context representation with the temporal high-frequency representation $H^n_t$, which is expressed as follows:\nwhere $F_{ve}$ is the visual encoder with frozen pre-trained parameter $ξ_{ve}$, $I^n_t$ and $I^n_{t-1}$ represent visual representations of frame t and t − 1, respectively, and $V^n_t$ denotes the spatial visual context representation after Average Pooling. Here, $F_n ∈ R^{1×C}$ is the fused visual representation, where C denotes the feature dimension. The fused visual representation $F_n$ not only models the visual context of driving video clips, but also characterizes the dynamic changes in the temporal dimension, which is beneficial for perceiving and understanding driving behaviors.\nC. Attentive Anomaly Focusing Mechanism # Different types of traffic anomalies tend to exhibit distinct characteristics. For instance, anomalies involving the ego vehicle are often accompanied by global jitter from the dashboard camera, whereas anomalies involving non-ego vehicles typically cause anomalies in local regions of the driving scene. Blindly encoding the entire driving scene may reduce the discriminability of driving events and impede the ability to detect various types of traffic anomalies. 
Therefore, adaptively focusing on the visual context of interest is critical to perceiving different types of traffic anomalies.\nIn our work, we propose an attentive anomaly focusing mechanism (AAFM). The fundamental idea is to decouple the visual context visually and linguistically, to guide the model to adaptively focus on the visual content of interest. Specifically, we carefully design two focusing strategies: the visually focused strategy (VFS) and the linguistically focused strategy (LFS). The former utilizes visual representations with global context to concentrate on the most semantically relevant visual context, while the latter adaptively focuses on visual contexts that are most relevant to text prompts through the guidance of language.\nVisually focused strategy (VFS): In fact, the spatial visual representation inherently captures the global context. Utilizing the attention of visual representation towards the deep features of various regions in the driving scene enables a focus on the most semantically relevant visual content. Specifically, as shown in Fig. 2, we focus on and weight the deep features of interest by using cross-attention (CA) on the spatial visual context representation V_t^n and deep features of the video clip, which can be written as:\nwhere Q, K, and V are linear transformations, P ∈ R^{(h·w)×C} is the deep feature map of the video clip, (h, w) represents the size of the feature map, and c is the scaling factor, equal to the square root of the feature dimension. Note that, for transformer-based visual encoders, V_t^n is represented by the class token, and P is represented by the patch tokens. VFR_n ∈ R^{1×C} denotes the visually focused representation of the n-th video clip. Since the spatial visual representation encodes global context, focusing on its most relevant visual content helps guide the model to perceive the semantics of the driving scene. As shown in Fig. 
3(b), our VFS can adaptively focus on the crucial scene semantics in the driving scene. Such attention helps to detect traffic anomalies involving the ego-vehicle, especially the loss of control of the ego-vehicle (case 1 in Fig. 3).\nLinguistically focused strategy (LFS): Intuitively, the fine-grained text prompts clearly define the subjects, objects, and traffic types involved in the traffic events. In contrast to general text prompts (as listed in a1 and a2 in Table I), utilizing fine-grained text prompts helps guide the model to focus on relevant visual contexts, thereby improving the comprehension of various traffic anomalies. Therefore, to facilitate the model\u0026rsquo;s adaptive perception of relevant visual context, we further design a linguistically focused strategy. The core idea is to utilize the carefully designed fine-grained text prompts (as listed in b1 to b4 in Table I) to guide the model to adaptively focus on the visual context of interest, thereby enhancing the understanding of traffic anomalies. Specifically, first, we categorize traffic events into four groups based on their types. Second, we further categorize each type of traffic event according to the different subjects (i.e., ego or non-ego vehicle) and objects (i.e., vehicle, pedestrian, or obstacle) involved. Finally, we define a total of 11 types of fine-grained text prompts, as summarized in Table I from b1 to b4. Note that the DoTA dataset used in our experiments is annotated with 9 types of traffic anomalies, as shown in Table II, with each anomaly encompassing both ego-involved and non-ego traffic anomalies. With the defined fine-grained text prompts, we apply the textual encoder in CLIP to extract the fine-grained text representation as follows:\nwhere F_te is the textual encoder with parameter ξ_te, t_m (m ∈ [1, 11] ∩ Z) denotes the m-th fine-grained text prompt, and T′_m represents the corresponding text representation. 
As we can see, the fine-grained text prompts describe the subjects and objects involved in a traffic event in a video frame, as well as the event type, which helps to focus on the visual regions in the driving scene where the traffic event occurred.\nTherefore, we further propose to leverage the similarity of the fine-grained text representation with each deep feature of the video clip to focus on the most relevant visual context of the text prompt. Note that in the driving scenario, we do not have direct access to realistic text prompts that match the driving video. To solve this problem, we leverage the similarity between the visual representation F_n and the fine-grained text representations to weight the text representations, and obtain the soft text representation as follows:\nwhere A_n^m is the cosine similarity between the n-th visual representation F_n and the m-th fine-grained text representation T′_m ∈ R^{1×C}. After obtaining the soft text representation T_soft ∈ R^{1×C}, similar to Section III-C1, we can further focus on the most semantically relevant visual context of the text description based on the cross-attention (CA) on the soft text representation T_soft and deep features P, which is denoted as:\nLFR_n ∈ R^{1×C} represents the linguistically focused representation of the n-th video clip, which focuses on the visual context that is most relevant to the soft text representation T_soft. Moreover, Fig. 3(c) shows that our LFS can indeed adaptively concentrate on road participants potentially linked to anomalies. This capability is crucial for identifying local anomalies in driving scenarios arising from non-ego vehicles (case 2 in Fig. 3).\nFinally, we enhance the visual representation F_n of driving videos by fusing it with visually and linguistically focused representations. Formally, it can be expressed as:\nwhere F_fusion is the fusion layer composed of multi-layer perceptrons with parameter ξ_f. 
F′_n is an enhanced visual representation that not only adaptively focuses on the visual contexts of interest but also more comprehensively characterizes the driving video clip in the spatio-temporal domain. Moreover, such representations facilitate the alignment of visual representations with general text prompts, thus improving the detection of traffic anomalies.\nD. Contrastive Learning Strategy and Inference Process # In this section, we introduce the contrastive learning strategy of the proposed TTHF framework for cross-modal learning and present how to perform traffic anomaly detection.\nTABLE I SUMMARY OF WELL-DESIGNED TEXT PROMPTS.\nGeneral Text Prompt | a1: “A traffic anomaly occurred in the scene.” | a2: “The traffic in this scenario is normal.”\nFine-grained Text Prompt | “The {ego, non-ego} vehicle collision with another {vehicle, pedestrian, obstacle}.” | “The {ego, non-ego} vehicle out-of-control and leaving the roadway.” | “The {ego, non-ego} vehicle has an unknown accident.”\nSuppose that there are N video clips in a batch. We denote:\nwhere F is the visual representation of N video clips and F′ represents the enhanced visual representation. For text prompts, we denote:\nwhere T means the matched general text representation of N video clips and T′ is the matched fine-grained text representation. Note that T_n and T′_n denote the high-dimensional representations of one of the D predefined text prompts. In our case, D = 2 for general text prompts and D = 11 for fine-grained text prompts. To better understand abstract concepts of traffic anomalies, we first perform contrastive learning to align visual representations F with fine-grained text representations T′. Formally, the objective loss along the visual axis can be expressed as:\nFor the j-th trained text representation T_j, it may actually match more than one visual representation. 
Symmetrically, we can calculate the loss along the text axis by:\nwhere τ is a learned temperature parameter [13]. Similarly, we further apply contrastive learning to align the enhanced visual representations with the general text representations. The calculations along the visual and textual axis are as follows:\nThe overall loss then becomes:\nThe inference procedure is similar to the training procedure. For the i-th testing driving video clip, our TTHF first extracts the visual representation F_i and the enhanced visual representation F′_i. For text prompts, the text encoder constructs 11 fine-grained text representations T′ = {T′_1, T′_2, \u0026hellip;, T′_11} and 2 general text representations T = {T_1, T_2}. We then compute the cosine similarity between F_i and T′ and between F′_i and T, respectively. Finally, we calculate the anomaly score for the i-th driving video clip as:\nScore_i = 1 − (S^f_11 + S^g_2) / 2\nwhere S^f_11 represents the cosine similarity after softmax between F_i and T′_11, and S^g_2 denotes the cosine similarity after softmax between F′_i and T_2. By taking the complement of the average over the prompts corresponding to normal traffic at different levels, we can obtain the final anomaly score Score_i.\nTABLE II TRAFFIC ANOMALY CATEGORY IN THE DOTA DATASET.\nLabel | Anomaly Category\nST | Collision with another vehicle that starts, stops, or is stationary\nAH | Collision with another vehicle moving ahead or waiting\nLA | Collision with another vehicle moving laterally in the same direction\nOC | Collision with another oncoming vehicle\nTC | Collision with another vehicle that turns into or crosses a road\nVP | Collision between vehicle and pedestrian\nVO | Collision with an obstacle in the roadway\nOO | Out-of-control and leaving the roadway to the left or right\nUK | Unknown\nIV. EXPERIMENTS AND DISCUSSIONS # In this section, we evaluate the performance of our proposed method on a platform with one NVIDIA 3090 GPU. 
All experiments were implemented using the PyTorch framework. Our source code and trained models will be publicly available upon acceptance.\nA. Implementation Details # In the experiments, we resize the driving video frames to 224 × 224 and take every two consecutive frames as the input video clip. Except where noted otherwise, in all experimental settings, we adopt ResNet-50 [52] for the visual and high-frequency encoders and Text Transformer [53] for the textual encoder. All of them are initialized with the parameters of CLIP\u0026rsquo;s pre-trained model. Note that during the training phase, we freeze the pre-trained parameters of the visual encoder to prevent the model from overfitting to a specific dataset (e.g., DoTA) while enhancing the generalization of the visual representation. Besides, we optimize the loss functions using the Adam algorithm with batch size 128, learning rate 5e-6, and weight decay 1e-4, and train the framework for 10 epochs. During inference, we evaluate the traffic anomaly score by taking the complement of the similarity score of normal traffic prompts on both fine-grained and general text prompts.\nB. Dataset and Metrics # Dataset: For the sake of fairness, we evaluate our method on two challenging datasets, namely, DoTA [9] and DADA-2000 [23], following prior works [8], [9], [12]. 
DoTA is the first traffic anomaly video dataset that provides detailed spatio-temporal annotations of anomalous objects for traffic anomaly detection in driving scenarios. The dataset contains 4677 dashcam video clips with a resolution of 1280 × 720 pixels, captured under various weather and lighting conditions. Each video is annotated with the start and end time of the anomaly and assigned to one of nine categories, which we summarize in Table II. The DADA-2000 dataset consists of 2000 dashcam videos with a resolution of 1584 × 660 pixels, each annotated with driver attention and one of 54 anomaly categories. In our experiments, we use the standard train-test split as used in [9], [23] and other previous works.\nTABLE III THE AUC ↑ (%) OF DIFFERENT APPROACHES ON THE DOTA DATASET.\nMethods | Input | Paradigm | AUC (%)\nConvAE [10] | Gray | Single-Stage | 64.3\nConvAE [10] | Flow | Two-Stage | 66.3\nConvLSTMAE [11] | Gray | Single-Stage | 53.8\nConvLSTMAE [11] | Flow | Two-Stage | 62.5\nAnoPred [6] | RGB | Single-Stage | 67.5\nAnoPred [6] | Mask RGB | Two-Stage | 64.8\nFOL-STD [32] | Box | Two-Stage | 66.7\nFOL-STD [32] | Box + Flow | Two-Stage | 69.1\nFOL-STD [32] | Box + Flow + Ego | Two-Stage | 69.7\nFOL-Ensemble [9] | RGB + Box + Flow + Ego | Two-Stage | 73\nSTFE [8] | RGB + Box | Two-Stage | 79.3\nTTHF-Base | RGB | Single-Stage | 75.8\nTTHF | RGB | Single-Stage | 84.7\nMetrics: Following prior works [8], [9], [54], we use the area under the ROC curve (AUC) to evaluate the performance of different TAD approaches. The AUC metric is calculated by computing the area under a standard frame-level receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR). A larger AUC indicates better performance.\nC. Competitors # To verify the superiority of the proposed framework, we compare with the following state-of-the-art TAD approaches: ConvAE [10], ConvLSTMAE [11], AnoPred [6], FOL-STD [32], FOL-Ensemble [9], DMMNet [55], SSC-TAD [12] and STFE [8]. 
Among them, the ConvAE [10] and ConvLSTMAE [11] methods contain two variants. The variant utilizing the grayscale image as input belongs to the single-stage paradigm, while the variant using optical flow as input belongs to the two-stage paradigm. The AnoPred method [6] also contains two variants. The variant employing the full video frame as input falls within the single-stage paradigm, whereas the variant utilizing pixels of foreground objects belongs to the two-stage paradigm. Besides, the DMMNet method [55] follows the single-stage paradigm, while the methods FOL-STD [32], FOL-Ensemble [9], SSC-TAD [12], and STFE [8] fall under the two-stage paradigm. Note that the experimental results for all these methods and their variants are obtained from the published papers [8], [9], [12]. In addition, we consider a CLIP-like TAD framework, denoted as TTHF-Base, as our baseline approach. This baseline lacks temporal high-frequency modeling and the attentive anomaly focusing mechanism and utilizes only general text prompts for alignment.\nD. Quantitative Results # Overall results: We conduct a comparative analysis of TTHF with a wide range of competitors and their variants in terms of the AUC metric. Table III presents the AUC performance of various competitors, along with labels indicating their respective variants (i.e., different inputs) and paradigms employed. Overall, our framework achieves the best AUC on the DoTA dataset. Specifically, our method outperforms the previous leading TAD method, the two-stage STFE [8], by +5.4% AUC. In previous work, two-stage methods employ a perception algorithm in the first stage to mitigate the impact of the dynamic background resulting from the ego-vehicle movement, and they generally outperform single-stage TAD methods [6], [10], [11]. However, such approaches depend on the performance of the first-stage perception algorithm, potentially leading to error propagation. 
In contrast, our proposed single-stage TAD method explicitly characterizes dynamic changes by modeling high frequency in the temporal domain, achieving a significant performance improvement over all previous methods and establishing a new state-of-the-art in traffic anomaly detection. Note that our baseline method outperforms all previous single-stage paradigm-based methods by at least +8.3% AUC. This is mainly attributed to our introduction of text prompts and the alignment of driving videos with text representations in a high-dimensional space, which facilitates the detection of traffic anomalies.\nPer-class results: To investigate the ability of our proposed method to detect traffic anomalies in different categories, we compared the detection performance of different methods for ego-involved and non-ego traffic anomalies. Based on the nine traffic anomaly categories defined in the DoTA dataset, detailed in Table II, we summarize the AUC performance of the different methods as well as the average AUC in Table IV. Our method achieves significant improvements in all categories of traffic anomalies except ST*, and in particular, improves the average AUC by at least +9.9% on ego-involved traffic anomalies. This further validates our idea that characterizing dynamic changes in driving scenarios is important for traffic anomaly detection. It also demonstrates the effectiveness of our proposed approach to model the temporal high frequency of driving videos to characterize the dynamic changes of driving scenes.\nGeneralization performance: To explore the generalization performance of our method for unseen types of traffic anomalies, we perform a generalization experiment on the DADA-2000 dataset. Specifically, we compare the AUC performance of our TTHF and TTHF-Base, without any fine-tuning on the DADA-2000 dataset, with previously trained models, as summarized in Table V. 
As we can see, our proposed TTHF-Base and TTHF methods outperform previously trained TAD methods, bringing at least +0.8% and +4.2% improvement in AUC, respectively, indicating the strong generalization performance of the proposed approach. This is mainly attributed to our introduction of a text-driven video-text alignment strategy for traffic anomaly detection from a new perspective, as well as the proposed attentive anomaly focusing mechanism and temporal high-frequency modeling for traffic anomaly detection.\nTABLE IV THE AUC ↑ (%) OF DIFFERENT METHODS FOR EACH INDIVIDUAL ANOMALY CLASS ON THE DOTA DATASET. THE ∗ INDICATES NON-EGO ANOMALIES, WHILE EGO-INVOLVED ANOMALIES ARE SHOWN WITHOUT ∗. N/A INDICATES THAT THE AUC PERFORMANCE FOR THE CORRESPONDING CATEGORY IS NOT AVAILABLE. WE BOLD THE BEST PERFORMANCE.\nMethods | ST | AH | LA | OC | TC | VP | VO | OO | UK | AVG\nAnoPred [6] | 69.9 | 73.6 | 75.2 | 69.7 | 73.5 | 66.3 | N/A | N/A | N/A | 71.4\nAnoPred [6]+Mask | 66.3 | 72.2 | 64.2 | 65.4 | 65.6 | 66.6 | N/A | N/A | N/A | 66.7\nFOL-STD [32] | 67.3 | 77.4 | 71.1 | 68.6 | 69.2 | 65.1 | N/A | N/A | N/A | 69.7\nFOL-Ensemble [9] | 73.3 | 81.2 | 74.0 | 73.4 | 75.1 | 70.1 | N/A | N/A | N/A | 74.5\nSTFE [8] | 75.2 | 84.5 | 72.1 | 77.3 | 72.8 | 71.9 | N/A | N/A | N/A | 75.6\nTTHF-Base | 72.8 | 79.6 | 83.7 | 76.4 | 82.6 | 72.3 | 81.8 | 80.4 | 72.7 | 78.0\nTTHF | 86.7 | 90.5 | 89.7 | 87.0 | 89.5 | 77.1 | 87.6 | 90.1 | 70.9 | 85.5\nMethods | ST* | AH* | LA* | OC* | TC* | VP* | VO* | OO* | UK* | AVG\nAnoPred [6] | 70.9 | 62.6 | 60.1 | 65.6 | 65.4 | 64.9 | 64.2 | 57.8 | N/A | 63.9\nAnoPred [6]+Mask | 72.9 | 63.7 | 60.6 | 66.9 | 65.7 | 64.0 | 58.8 | 59.9 | N/A | 64.1\nFOL-STD [32] | 75.1 | 66.2 | 66.8 | 74.1 | 72.0 | 69.7 | 63.8 | 69.2 | N/A | 69.6\nFOL-Ensemble [9] | 77.5 | 69.8 | 68.1 | 76.7 | 73.9 | 71.2 | 65.2 | 69.6 | N/A | 71.5\nSTFE [8] | 80.6 | 65.6 | 69.9 | 76.5 | 74.2 | N/A | 75.6 | 70.5 | N/A | 73.2\nTTHF-Base | 75.0 | 71.5 | 67.2 | 72.5 | 70.6 | 64.3 | 69.9 | 68.3 | 68.1 | 69.7\nTTHF | 74.9 | 76.0 | 76.4 | 79.8 | 81.5 | 79.2 | 79.0 | 77.5 | 68.9 | 77.0\nTABLE V THE AUC ↑ (%) OF DIFFERENT METHODS ON THE DADA-2000 DATASET.\nMethods | Trained | Ego-Involved | Non-Ego | Both\nAnoPred [6] | ✓ | 55.7 | 56.9 | 56.1\nFOL-STD [32] | ✓ | 71.3 | 57.1 | 66.6\nDMMNet [55] | ✓ | 73 | 56.3 | 67.5\nSSC-TAD [12] | ✓ | 67.6 | 58.7 | 66.5\nTTHF-Base | × | 78.7 | 59.4 | 68.3\nTTHF | × | 80.9 | 64 | 71.7\nTABLE VI ABLATION RESULTS OF DIFFERENT COMPONENTS ON THE DOTA DATASET. NOTE THAT FOR FAIR COMPARISON, IN THE EXPERIMENTS WITHOUT THFM, WE FINE-TUNE THE PARAMETERS OF THE VISUAL ENCODER. 
A LARGER AUC INDICATES BETTER PERFORMANCE.\nArch. | Visual | Textual | AAFM | THFM | AUC (%)\nTTHF | ✓ | × | × | × | 61\nTTHF | ✓ | ✓ | × | × | 75.8\nTTHF | ✓ | ✓ | ✓ | × | 76.8\nTTHF | ✓ | ✓ | ✓ | ✓ | 84.7\nTABLE VII ABLATION RESULTS ON HOW AAFM CONTRIBUTES TO TRAFFIC ANOMALY DETECTION ON THE DOTA DATASET. A LARGER AUC INDICATES BETTER PERFORMANCE.\nArch. | VFS | LFS | AUC (%)\nTTHF | − | − | 75.8\nTTHF | ✓ | × | 76.3\nTTHF | × | ✓ | 76.5\nTTHF | ✓ | ✓ | 76.8\nE. Qualitative Results # In this subsection, we visualize some examples to further illustrate the detection capability of our TTHF across various types of traffic anomalies and the feasibility of soft text representation in our framework.\nTABLE VIII ABLATION RESULTS OF DIFFERENT BACKBONES ON THE DOTA DATASET. A LARGER AUC INDICATES BETTER PERFORMANCE.\nArch. | Visual | Textual | AUC (%)\nTTHF | RN-50 | Text Transformer | 84.7\nTTHF | RN-50x64 | Text Transformer | 84.8\nTTHF | ViT-B-32 | Text Transformer | 84\nTTHF | ViT-L-14 | Text Transformer | 85\nVisualization of various types of traffic anomalies: As presented in Fig. 5, we show five representative traffic anomalies from top to bottom as examples: a) The other vehicle collides with another vehicle that turns into or crosses a road. b) The ego-vehicle collides with another oncoming vehicle. c) The ego-vehicle collides with another vehicle moving laterally in the same direction. d) The ego-vehicle collides with another vehicle waiting. e) The ego-vehicle is out-of-control and leaving the roadway to the left. From the above visualization results of different types of traffic anomalies, we can summarize as follows. First, our TTHF exhibits superior detection performance on various types of traffic anomalies. 
Secondly, while the most intuitive classification-based approach (which has the same network architecture as the visual encoder of TTHF but directly classifies the visual representation, denoted as Classifier in Fig. 5) also follows a single-stage paradigm, our proposed text-driven TAD approach offers a more comprehensive representation in high-dimensional space than orthogonal one-hot vectors. Consequently, both our proposed TTHF and its variants outperform the Classifier. Third, incorporating AAFM allows our method to better perceive different types of traffic anomalies, as evident in Fig. 5 when comparing the Base and AAFM variants across various traffic anomalies. Finally, capturing dynamic changes in driving scenarios significantly enhances traffic anomaly detection. This highlights the effectiveness of our approach in characterizing dynamic changes in driving scenarios by modeling high frequency in the temporal domain.\nFig. 5. The visualization of anomaly score curves for traffic anomaly detection of different variants on the DoTA dataset. The first row of each case shows the extracted video frames of the driving video, where the red boxes mark the object involved in or causing the anomaly. The second rows show the anomaly score curves of different methods on the corresponding whole videos. For brevity, we label the TTHF-Base variant as Base and TTHF-Base with AAFM as AAFM, while Classifier denotes the classification-based TAD method. Better viewed in color.\nVisualization of the weights used for soft text representation: We further investigate the feasibility of soft text representations. Specifically, as shown in Fig. 6, we use three cases from the test set as examples. For video frames captured at different moments in driving videos, we visualize the weights employed to compute the soft text representation and compare it with the real fine-grained text representation. 
From the visualization results, we observe that the text representation associated with the maximum weight (indicated by the darkest red) consistently aligns with the real fine-grained text representation. The above results indicate that the way we calculate the soft text representation is effective and can well reflect the real anomaly category.\nFig. 6. Visualization of the weights used for computing soft text representations. We present three illustrative cases, each involving video frames captured at different times. These frames are accompanied by the corresponding weight values used in the computation of soft text representations. Notably, we employ a blue-to-red color scale, where increasing redness signifies higher weights. Additionally, we label the ground-truth fine-grained text representations (denoted as T_i) associated with specific frames. Among them, T_1 corresponds to the text \u0026ldquo;The ego vehicle collision with another vehicle\u0026rdquo; (as described in Table I), T_4 corresponds to the text \u0026ldquo;The non-ego vehicle collision with another vehicle\u0026rdquo;, T_7 corresponds to the text \u0026ldquo;The ego vehicle out-of-control and leaving the roadway\u0026rdquo;, and T_11 corresponds to the text \u0026ldquo;The vehicle is running normally on the road\u0026rdquo;.\nF. Ablation Investigation # In this subsection, we conduct ablation studies by analyzing how different components of TTHF contribute to traffic anomaly detection on the DoTA dataset.\nVariants of our architecture: We first evaluate the effectiveness of different components in our TTHF framework including the visual encoder, the textual encoder, the attentive anomaly focusing mechanism (AAFM), and the temporal high-frequency modeling (THFM). The ablation results are summarized in Table VI. Note that when only the visual encoder is applied, we add a linear classification head after the visual representation. 
This adaptation formulates the traffic anomaly detection task as a straightforward binary classification task. The results presented in Table VI demonstrate that introducing linguistic modalities and aligning visual-text in high-dimensional space greatly facilitates anomaly detection in driving videos compared to the classifier, achieving an AUC improvement of +14.8%. Based on this, the designed AAFM helps guide the model to adaptively focus on the visual context of interest and thus enhance the perception ability of various types of traffic anomalies. Lastly, the incorporation of the modeling of temporal high frequency to capture dynamic background during driving significantly improves traffic anomaly detection, resulting in an AUC improvement of +7.9%.\nAnalysis of the AAFM: To investigate how the proposed attentive anomaly focusing mechanism (AAFM) contributes to traffic anomaly detection, we perform ablation on each component in the AAFM. The ablation results are presented in Table VII. We can conclude that both the Visually Focused Strategy (VFS) and the Linguistically Focused Strategy (LFS) explicitly guide the model to pay attention to the visual context most relevant to the representations of visual and linguistic modalities, respectively. This enhances the ability to perceive traffic anomalies with different characteristics, thereby improving traffic anomaly detection in driving videos. Our AAFM achieves the best detection performance when both VFS and LFS are applied.\nNetwork Architecture: Different network architectures of visual encoder may exhibit different representation capabilities. We now evaluate the performance of traffic anomaly detection when ResNet50 [52], ResNet50x64 [13], ViT-B-32 [56] and ViT-L-14 [56] are used. Specifically, the results of these visual encoders can be found in Table VIII, respectively. 
As can be seen, for traffic anomaly detection in driving videos, the ResNet-based networks achieve performance comparable to the Transformer-based networks. The larger models perform slightly better, with ViT-L-14 achieving an AUC of 85.0%. Therefore, considering both computing resources and performance gains, we ultimately chose ResNet50 as our visual encoder in all other experiments.\nFig. 7. Visualization of some bad cases of the proposed TTHF. The first row of each case shows the extracted video frames of the driving video, where the red boxes mark the objects involved in the anomaly. The second rows show the anomaly score curves of different methods on the corresponding whole videos. Better viewed in color.\nG. Discussion # In this subsection, we discuss the limitations of the proposed framework. We experimentally found that the detection accuracy of our proposed method needs improvement for two specific cases: 1) long-distance observation of traffic anomalies; and 2) subtle traffic anomalies involving other vehicles when the ego-vehicle is stationary. Fig. 7 shows several cases where the accuracy of our method needs to be further improved. In the first scenario, a distant non-ego vehicle collides with a turning or crossing vehicle. The second scenario depicts a distant vehicle losing control and veering to the left side of the road. The third scenario involves a slowly reversing vehicle scraping against other stationary vehicles. By analyzing the anomaly score curves in Fig. 7, we can conclude that our method faces challenges primarily because the traffic anomalies in these scenarios involve non-ego vehicles and occupy only small regions of the scene. These anomalies include small local anomalies caused by distant non-ego vehicles, and slow, slight traffic anomalies of other vehicles observed while the ego-vehicle is at rest. 
For such slight traffic anomalies, modeling the dynamic changes of the driving scene and using text guidance may not focus the model well on the corresponding abnormal regions. This also explains why our method is less effective at detecting non-ego traffic anomalies than ego-involved ones, especially ST* in Table IV. Despite the significant improvement of our approach over previous TAD methods, addressing these more challenging traffic anomalies undoubtedly requires a greater effort from the community.\nV. CONCLUSION # This paper has proposed an accurate single-stage TAD framework. For the first time, this framework introduces visual-text alignment to address the traffic anomaly detection task for driving videos. Notably, we verified that modeling the high frequency of driving videos in the temporal domain helps to characterize the dynamic changes of the driving scene and enhance the visual representation, thereby greatly facilitating the detection of traffic anomalies. In addition, the experimental results demonstrated that the proposed attentive anomaly focusing mechanism is indeed effective in guiding the model to adaptively focus on the visual content of interest, thereby enhancing the ability to perceive different types of traffic anomalies. Although extensive experiments have demonstrated that the proposed TTHF substantially outperforms state-of-the-art competitors, more effort is required to accurately detect the more challenging slight traffic anomalies.\nREFERENCES # [1] Z. Yuan, X. Song, L. Bai, Z. Wang, and W. Ouyang, \u0026ldquo;Temporal-channel transformer for 3d lidar-based video object detection for autonomous driving,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 4, pp. 2068–2078, 2022.\n[2] L. Claussmann, M. Revilloud, D. Gruyer, and S. Glaser, \u0026ldquo;A review of motion planning for highway autonomous driving,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 21, no. 
5, pp. 1826–1848, 2020.\n[3] M. Jeong, B. C. Ko, and J.-Y. Nam, \u0026ldquo;Early detection of sudden pedestrian crossing for safe driving during summer nights,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 6, pp. 1368–1380, 2017.\n[4] L. Yue, M. A. Abdel-Aty, Y. Wu, and A. Farid, \u0026ldquo;The practical effectiveness of advanced driver assistance systems at different roadway facilities: System limitation, adoption, and usage,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 21, no. 9, pp. 3859–3870, 2020.\n[5] Y. Yuan, D. Wang, and Q. Wang, \u0026ldquo;Anomaly detection in traffic scenes via spatial-aware motion reconstruction,\u0026rdquo; IEEE Trans. Intell. Transp. Syst. , vol. 18, no. 5, pp. 1198–1209, 2017.\n[6] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection – a new baseline,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2018, pp. 6536–6545.\n[7] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. Li, \u0026ldquo;A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis. , 2021, pp. 13 588–13 597.\n[8] Z. Zhou, X. Dong, Z. Li, K. Yu, C. Ding, and Y. Yang, \u0026ldquo;Spatio-temporal feature encoding for traffic accident detection in vanet environment,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 23, no. 10, pp. 19 772–19 781, 2022.\n[9] Y. Yao, X. Wang, M. Xu, Z. Pu, Y. Wang, E. Atkins, and D. J. Crandall, \u0026ldquo;Dota: Unsupervised detection of traffic anomaly in driving videos,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 444–459, 2023.\n[10] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, \u0026ldquo;Learning temporal regularity in video sequences,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2016, pp. 733–742.\n[11] Y. S. Chong and Y. H. 
Tay, \u0026ldquo;Abnormal event detection in videos using spatiotemporal autoencoder,\u0026rdquo; in Proc. Adv. Neural Networks, 2017, pp. 189–196.\n[12] J. Fang, J. Qiao, J. Bai, H. Yu, and J. Xue, \u0026ldquo;Traffic accident detection via self-supervised consistency learning in driving scenarios,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 23, no. 7, pp. 9601–9614, 2022.\n[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, \u0026ldquo;Learning transferable visual models from natural language supervision,\u0026rdquo; in Proc. Int. conf. mach. learn., vol. 139, 2021, pp. 8748–8763.\n[14] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.H. Sung, Z. Li, and T. Duerig, \u0026ldquo;Scaling up visual and vision-language representation learning with noisy text supervision,\u0026rdquo; in Proc. Int. conf. mach. learn., vol. 139, 2021, pp. 4904–4916.\n[15] Y. Yang, W. Huang, Y. Wei, H. Peng, X. Jiang, H. Jiang, F. Wei, Y. Wang, H. Hu, L. Qiu, and Y. Yang, \u0026ldquo;Attentive mask clip,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 2771–2781.\n[16] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, \u0026ldquo;Open-vocabulary object detection via vision and language knowledge distillation,\u0026rdquo; in Proc. Int. Conf. Learn. Represent., 2022.\n[17] J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, \u0026ldquo;Groupvit: Semantic segmentation emerges from text supervision,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2022, pp. 18 134– 18 144.\n[18] S. Chen, Q. Xu, Y. Ma, Y. Qiao, and Y. Wang, \u0026ldquo;Attentive snippet prompting for video retrieval,\u0026rdquo; IEEE Trans. Multimed., pp. 1–12, 2023.\n[19] M. Wang, J. Xing, and Y. Liu, \u0026ldquo;Actionclip: A new paradigm for video action recognition,\u0026rdquo; arXiv preprint arXiv:2109.08472, 2021.\n[20] H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. 
Duan, and T. Li, \u0026ldquo;Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning,\u0026rdquo; Neurocomputing, vol. 508, pp. 293–304, 2022.\n[21] H. Rasheed, M. U. Khattak, M. Maaz, S. Khan, and F. S. Khan, \u0026ldquo;Finetuned clip models are efficient video learners,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2023, pp. 6545–6554.\n[22] Y. Li, J. Ye, L. Zeng, R. Liang, X. Zheng, W. Sun, and N. Wang, \u0026ldquo;Learning hierarchical fingerprints via multi-level fusion for video integrity and source analysis,\u0026rdquo; IEEE Trans. Consum. Electron., pp. 1–11, 2024.\n[23] J. Fang, D. Yan, J. Qiao, J. Xue, and H. Yu, \u0026ldquo;Dada: Driver attention prediction in driving accident scenarios,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 4959–4971, 2022.\n[24] Y. Zhong, X. Chen, Y. Hu, P. Tang, and F. Ren, \u0026ldquo;Bidirectional spatiotemporal feature learning with multiscale evaluation for video anomaly detection,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 12, pp. 8285–8296, 2022.\n[25] M. I. Georgescu, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, \u0026ldquo;A background-agnostic framework with adversarial training for abnormal event detection in video,\u0026rdquo; IEEE Trans. Pattern Anal. Mach. Intell. , vol. 44, no. 9, pp. 4505–4523, 2022.\n[26] S. Zhang, M. Gong, Y. Xie, A. K. Qin, H. Li, Y. Gao, and Y.S. Ong, \u0026ldquo;Influence-aware attention networks for anomaly detection in surveillance videos,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 8, pp. 5427–5437, 2022.\n[27] X. Zeng, Y. Jiang, W. Ding, H. Li, Y. Hao, and Z. Qiu, \u0026ldquo;A hierarchical spatio-temporal graph convolutional neural network for anomaly detection in videos,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 200–212, 2023.\n[28] C. Huang, J. Wen, Y. Xu, Q. Jiang, J. Yang, Y. Wang, and D. 
Zhang, \u0026ldquo;Self-supervised attentive generative adversarial networks for video anomaly detection,\u0026rdquo; IEEE Trans. Neural Netw. Learn. Systems, vol. 34, no. 11, pp. 9389–9403, 2023.\n[29] Y. Gong, C. Wang, X. Dai, S. Yu, L. Xiang, and J. Wu, \u0026ldquo;Multiscale continuity-aware refinement network for weakly supervised video anomaly detection,\u0026rdquo; in Proc. IEEE Int. Conf. Multimedia Expo., 2022, pp. 1–6.\n[30] Y. Wang, X. Luo, and Z. Zhou, \u0026ldquo;Contrasting estimation of pattern prototypes for anomaly detection in urban crowd flow,\u0026rdquo; IEEE Trans. Intell. Transp. Syst., pp. 1–15, 2024.\n[31] Y. Yuan, J. Fang, and Q. Wang, \u0026ldquo;Incrementally perceiving hazards in driving,\u0026rdquo; Neurocomputing, vol. 282, pp. 202–217, 2018.\n[32] Y. Yao, M. Xu, Y. Wang, D. J. Crandall, and E. M. Atkins, \u0026ldquo;Unsupervised traffic accident detection in first-person videos,\u0026rdquo; in Proc. IEEE Int. Conf. Intell. Rob. Syst., 2019, pp. 273–280.\n[33] G. Sun, Z. Liu, L. Wen, J. Shi, and C. Xu, \u0026ldquo;Anomaly crossing: New horizons for video anomaly detection as cross-domain few-shot learning,\u0026rdquo; arXiv preprint arXiv:2112.06320, 2022.\n[34] R. Liang, Y. Li, Y. Yi, J. Zhou, and X. Li, \u0026ldquo;A memory-augmented multitask collaborative framework for unsupervised traffic accident detection in driving videos,\u0026rdquo; arXiv preprint arXiv:2307.14575, 2023.\n[35] K. He, G. Gkioxari, P. Dollar, and R. Girshick, \u0026ldquo;Mask r-cnn,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis., 2017.\n[36] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, \u0026ldquo;Flownet 2.0: Evolution of optical flow estimation with deep networks,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2017.\n[37] N. Wojke, A. Bewley, and D. Paulus, \u0026ldquo;Simple online and realtime tracking with a deep association metric,\u0026rdquo; in Proc. IEEE Int. Conf. Image Processing, 2017, pp. 
3645–3649.\n[38] R. Mur-Artal and J. D. Tardós, \u0026ldquo;Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,\u0026rdquo; IEEE Trans. Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.\n[39] R. Liang, Y. Li, J. Zhou, and X. Li, \u0026ldquo;Stglow: A flow-based generative framework with dual-graphormer for pedestrian trajectory prediction,\u0026rdquo; IEEE Trans. Neural Netw. Learn. Systems, pp. 1–14, 2023.\n[40] L. Yao, J. Han, X. Liang, D. Xu, W. Zhang, Z. Li, and H. Xu, \u0026ldquo;Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2023, pp. 23 497–23 506.\n[41] Z. Zhou, Y. Lei, B. Zhang, L. Liu, and Y. Liu, \u0026ldquo;Zegclip: Towards adapting clip for zero-shot semantic segmentation,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2023, pp. 11 175–11 185.\n[42] A. Baldrati, L. Agnolucci, M. Bertini, and A. Del Bimbo, \u0026ldquo;Zero-shot composed image retrieval with textual inversion,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 15 338–15 347.\n[43] M. Tschannen, B. Mustafa, and N. Houlsby, \u0026ldquo;Clippo: Image-and-language understanding from pixels only,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2023, pp. 11 006–11 017.\n[44] S. Nag, X. Zhu, Y.-Z. Song, and T. Xiang, \u0026ldquo;Zero-shot temporal action detection via vision-language prompting,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis., 2022, pp. 681–697.\n[45] Y. Ma, G. Xu, X. Sun, M. Yan, J. Zhang, and R. Ji, \u0026ldquo;X-clip: End-to-end multi-grained contrastive learning for video-text retrieval,\u0026rdquo; in Proc. ACM Int. Conf. Multi., 2022, pp. 638–647.\n[46] W. Wu, Z. Sun, and W. Ouyang, \u0026ldquo;Revisiting classifier: Transferring vision-language models for video recognition,\u0026rdquo; in Proc. AAAI Conf. Art. Intel., vol. 37, 2023, pp. 2847–2855.\n[47] B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. 
Fu, S. Xiang, and H. Ling, \u0026ldquo;Expanding language-image pretrained models for general video recognition,\u0026rdquo; in Proc. Eur. Conf. Comput. Vis., 2022, pp. 1–18.\n[48] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang, \u0026ldquo;Vadclip: Adapting vision-language models for weakly supervised video anomaly detection,\u0026rdquo; arXiv preprint arXiv:2308.11681, 2023.\n[49] R. Zhang, Z. Zeng, Z. Guo, and Y. Li, \u0026ldquo;Can language understand depth?\u0026rdquo; in Proc. ACM Int. Conf. Multi., 2022, p. 6868–6874.\n[50] Z. Liang, C. Li, S. Zhou, R. Feng, and C. C. Loy, \u0026ldquo;Iterative prompt learning for unsupervised backlit image enhancement,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 8094–8103.\n[51] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, \u0026ldquo;Conditional prompt learning for vision-language models,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2022, pp. 16 816–16 825.\n[52] K. He, X. Zhang, S. Ren, and J. Sun, \u0026ldquo;Deep residual learning for image recognition,\u0026rdquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2016.\n[53] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al. , \u0026ldquo;Language models are unsupervised multitask learners,\u0026rdquo; OpenAI blog , vol. 8, no. 1, pp. 1–9, 2019.\n[54] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, \u0026ldquo;Memorizing normality to detect anomaly: Memoryaugmented deep autoencoder for unsupervised anomaly detection,\u0026rdquo; in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1705–1714.\n[55] S. Li, J. Fang, H. Xu, and J. Xue, \u0026ldquo;Video frame prediction by deep multi-branch mask network,\u0026rdquo; IEEE Trans. Circuits Syst. Video Technol. , vol. 31, no. 4, pp. 1283–1295, 2021.\n[56] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al. 
, \u0026ldquo;An image is worth 16x16 words: Transformers for image recognition at scale,\u0026rdquo; in Proc. Int. Conf. Learn. Representat., 2021, pp. 1–22.\nRongqin Liang (Student Member, IEEE) received the B.Eng. degree in communication engineering from Wuyi University, Guangdong, China, in 2018 and the M.S. degree in Information and Communication Engineering from Shenzhen University, Shenzhen, China, in 2021. He is currently a Ph.D. candidate at the College of Electronics and Information Engineering, Shenzhen University. His current research interests include trajectory prediction, anomaly detection, computer vision and deep learning.\nYuanman Li (Senior Member, IEEE) received the B.Eng. degree in software engineering from Chongqing University, Chongqing, China, in 2012, and the Ph.D. degree in computer science from the University of Macau, Macau, in 2018. From 2018 to 2019, he was a Post-doctoral Fellow with the State Key Laboratory of Internet of Things for Smart City, University of Macau. He is currently an Assistant Professor with the College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China. His current research interests include multimedia security and forensics, data representation, computer vision and machine learning.\nJiantao Zhou (Senior Member, IEEE) received the B.Eng. degree from the Department of Electronic Engineering, Dalian University of Technology, in 2002, the M.Phil. degree from the Department of Radio Engineering, Southeast University, in 2005, and the Ph.D. degree from the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, in 2009. He held various research positions with the University of Illinois at Urbana-Champaign, Hong Kong University of Science and Technology, and McMaster University. 
He is an Associate Professor with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, and also the Interim Head of the newly established Centre for Artificial Intelligence and Robotics. His research interests include multimedia security and forensics, multimedia signal processing, artificial intelligence and big data. He holds four granted U.S. patents and two granted Chinese patents. He has co-authored two papers that received the Best Paper Award at the IEEE Pacific-Rim Conference on Multimedia in 2007 and the Best Student Paper Award at the IEEE International Conference on Multimedia and Expo in 2016. He serves as an Associate Editor of the IEEE TRANSACTIONS on IMAGE PROCESSING and the IEEE TRANSACTIONS on MULTIMEDIA.\nXia Li (Member, IEEE) received her B.S. and M.S. degrees in electronic engineering and SIP (signal and information processing) from Xidian University in 1989 and 1992, respectively. She later received the Ph.D. degree from the Department of Information Engineering, the Chinese University of Hong Kong, in 1997. Currently, she is a member of the Guangdong Key Laboratory of Intelligent Information Processing. 
Her research interests include intelligent computing and its applications, image processing and pattern recognition.\n","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/text-driven-traffic-anomaly-detection-with-temporal-high-frequency/","section":"Papers","summary":"Introduces a novel single-stage approach (TTHF) for traffic anomaly detection that aligns video clips with text prompts and models high-frequency temporal changes, enhanced by an attention focusing mechanism, outperforming state-of-the-art methods on benchmark datasets.","title":"Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos","type":"other"},{"content":" Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model # Peng Wu, Jing Liu Senior Member, IEEE, Xiangteng He, Yuxin Peng Senior Member, IEEE, Peng Wang, and Yanning Zhang Senior Member, IEEE\nAbstract—Video anomaly detection (VAD) has received increasing attention due to its potential applications. Its current dominant tasks focus on detecting anomalies online, which can be roughly interpreted as binary or multiple event classification. However, such a setup, which builds relationships between complicated anomalous events and single labels, e.g., \u0026ldquo;vandalism\u0026rdquo;, is superficial, since single labels are insufficient to characterize anomalous events. In reality, users tend to search for a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and valuable, but little research focuses on this. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos across modalities, e.g., language descriptions and synchronous audios. 
Unlike current video retrieval, where videos are assumed to be temporally well-trimmed with short duration, VAR is devised to retrieve long untrimmed videos which may be partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks and design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose an anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between video-text fine-grained representations. Besides, we leverage two complementary alignments to further match cross-modal contents. Experimental results on two benchmarks reveal the challenges of the VAR task and also demonstrate the advantages of our tailored method. Captions are publicly released at https://github.com/Roc-Ng/VAR.\nIndex Terms—video anomaly retrieval, video anomaly detection, cross-modal retrieval\nI. INTRODUCTION # VIDEO anomaly detection (VAD) plays a critical role in video content analysis, and has become a hot research topic due to its potential applications, e.g., danger early warning. VAD, by definition, aims to identify the location of anomaly occurrence, which can be regarded as frame-level event classification. VAD can be broadly divided into two categories, i.e., semi-supervised [1]–[5] and weakly supervised [6]–[10]. The former typically recognizes anomalies through self-supervised learning or one-class learning. The latter, thanks to massive normal and abnormal videos with video-level labels, achieves better detection accuracy.\nPeng Wu, Peng Wang, and Yanning Zhang are with the National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University, China. E-mail: xdwupeng@gmail.com; peng.wang, ynzhang@nwpu.edu.cn. Jing Liu is with the Guangzhou Institute of Technology, Xidian University, China. E-mail: neouma@163.com. Xiangteng He and Yuxin Peng are with the Wangxuan Institute of Computer Technology, Peking University, China. E-mail: hexiangteng, pengyuxin@pku.edu.cn. This work is supported by the National Natural Science Foundation of China (No. 62306240, U23B2013, U19B2037, 62272013, 61925201, 62132001), China Postdoctoral Science Foundation (No. 2023TQ0272), and the Fundamental Research Funds for the Central Universities (No. D5000220431). (Corresponding author: Yanning Zhang.)\nManuscript received April 19, 2021; revised August 16, 2021.\nFig. 1. VAD vs. VAR. Single labels may be unable to describe sequential anomalous events in VAD, but text captions or synchronous audios can sufficiently depict events in VAR.\nImpressive progress has been witnessed for VAD. However, an event in videos generally captures an interaction between actions and entities that evolves over time, and simply utilizing single labels in VAD may be insufficient to explain the sequential events depicted. Besides, compared with VAD, offline video search thus far is still more commonly used in real-world applications. When searching for related videos, we prefer to use comprehensive descriptions to search accurately, e.g., \u0026ldquo;At night, two topless men smashed the door of the store.\u0026rdquo;, rather than a single coarse word, e.g., \u0026ldquo;vandalism\u0026rdquo;, which returns a large collection of rough results.\nBased on VAD, we propose a new task called Video Anomaly Retrieval (VAR) and present two large-scale benchmark datasets, UCFCrime-AR and XDViolence-AR, to further facilitate the research of video anomaly analysis. The goal of VAR is to retrieve relevant untrimmed videos given cross-modal queries, e.g., text captions and synchronous audios, and vice versa. Unlike VAD, VAR depicts anomalies from multiple viewpoints and sufficiently characterizes sequential events. We illustrate the advantage of video anomaly retrieval in Figure 1. 
The VAR task has high value for real-world applications, especially for smart ground and car surveillance. Generally speaking, for surveillance, the recorded video is stored on a hard disk or memory card as a series of segments with a certain time length. After an abnormal event occurs, we need to search for the corresponding video segment that contains the queried abnormal event through descriptions, such as a white car crashed into the rear of a van, a group of people breaking into a house at night, etc.\nFig. 2. Comparison of VAR with video retrieval and video moment retrieval.\nOur VAR is considerably different from traditional video retrieval (VR) [11]–[13]. In traditional video retrieval, videos are assumed to be temporally pre-trimmed with short duration, and thus the whole video is supposed to be completely relevant to the paired query. In reality, videos are usually not well-trimmed, and only partial fragments may fully meet the query. In VAR, the main goal is to retrieve long and untrimmed videos. Such a setup better meets realistic requirements, and also poses a new challenge. Concretely, the length of relevant fragments is variable in videos w.r.t. given paired queries. For normal videos, the relevant fragment is generally the whole video; for abnormal videos, the relevant fragment may occupy only a fraction or a lion\u0026rsquo;s share of the entire video, since the length of anomalous events varies across videos. Besides, our VAR task also differs from video moment retrieval (VMR) [14]–[17] since the latter is to retrieve moments rather than untrimmed videos. Because both abnormal videos and normal videos (no anomaly) need to be retrieved in VAR, video moment retrieval methods struggle to tackle this task. 
Traditional video retrieval and video moment retrieval methods cannot solve this new challenge well; detailed results are listed in Tables II and III. The differences between video retrieval, video moment retrieval and video anomaly retrieval are shown in Figure 2.\nTo overcome the above challenge, we propose ALAN, an Anomaly-Led Alignment Network for video anomaly retrieval. In ALAN, video, audio, and text encoders are intended to encode raw data into high-level representations, and cross-modal alignment is introduced to match cross-modal representations from different perspectives. Since videos are long and untrimmed and anomalous events vary greatly in scenario and length, we expect that, in the encoding phase, the retrieval system maintains a holistic view while focusing on key anomalous segments, so that cross-modal representations can be well aligned in the joint embedding space. Vanilla fixed-frame sampling, e.g., uniform sampling and random sampling, is not flexible enough to focus on specific anomalous segments. Inspired by dynamic neural networks [18]–[21], we propose an anomaly-led sampling, which simply resorts to frame-level anomaly priors generated by an ad-hoc anomaly detector and does not require intensive pairwise interactions across modalities, to select key segments with a large anomaly identification degree. We then couple these two complementary sampling mechanisms for videos and audios, where anomaly-led sampling focuses on anomalous segments, and fixed-frame sampling pays attention to the entirety as well as normal videos. Furthermore, to establish associations between video-text fine-grained representations as well as maintain high retrieval efficiency, we also propose a pretext task, i.e., video prompt based masked phrase modeling (VPMPM), to serve model training. 
Particularly, a new module termed Prompting Decoder takes both frame-level video representations and contextual text representations as input and predicts the masked noun phrases or verb phrases by cross-modal attention, where video representations serve as fixed prompts [22], [23]. In this paper, video frames are regarded as the fine granularity of videos, as frames usually reflect more detailed video content; meanwhile, noun phrases and verb phrases in texts, e.g., \u0026ldquo;a black car\u0026rdquo; and \u0026ldquo;left quickly\u0026rdquo;, are regarded as the fine granularity of texts, which reflect the local spatial contents and temporal dynamics in the video, respectively. Notably, compared with nouns and verbs, noun phrases and verb phrases contain more content, and can also illustrate subtle differences. Finally, such a proxy training objective optimizes the encoder parameters and further promotes the semantic associations between local video frames and text phrases by cross-modal interactions.\nTo summarize, our contributions are three-fold:\nWe introduce a new task named video anomaly retrieval to bridge the gap between the literature and real-world applications in terms of video anomaly analysis. To our knowledge, this is the first work that moves towards VAR from VAD;\nWe present two large-scale benchmarks, i.e., UCFCrime-AR and XDViolence-AR, based on public VAD datasets. The former is applied to video-text VAR, and the latter to video-audio VAR;\nWe propose a model called ALAN, aiming at challenges in VAR, where anomaly-led sampling, video prompt based masked phrase modeling, and cross-modal alignment are introduced for the attention of anomalous segments, enhancement of fine-grained associations, and multi-perspective match, respectively.\nII. RELATED WORK # A. 
Video Anomaly Detection # Aided by the success of deep learning, VAD has made considerable progress in recent years. It is usually classified into two groups, semi-supervised anomaly detection and weakly supervised anomaly detection. In semi-supervised anomaly detection, only normal event samples are available for model training. Recent researchers mainly adopt deep auto-encoders [2], [4], [24]–[30] for self-supervised learning, e.g., reconstruction, prediction, jigsaw, etc. Weakly supervised anomaly detection can be regarded as a binary classification problem, i.e., obtaining frame-level predictions given coarse labels, and multiple instance learning [6], [8], [9], [31]–[34] is widely used to train models. Unlike VAD, which utilizes single labels to distinguish whether each frame is anomalous or not, our proposed VAR uses elaborate text descriptions or synchronous audios to depict the sequential events.\nB. Cross-Modal Retrieval # We mainly introduce cross-modal retrieval [35]–[38] built on videos, texts, and audios. Some works [39]–[44] focus on audio-text and audio-video retrieval. Specifically, Tian et al. [45] propose an audio-to-video/video-to-audio cross-modality localization/retrieval task [46], i.e., given a sound segment, locate the corresponding visual sound source temporally within a video, and vice versa. Then Wu et al. [47] introduce a novel dual attention matching method for this task. Recently, Lin et al. [48] propose a latent audio-visual hybrid adapter that adapts pre-trained vision transformers to audio-visual tasks; this method focuses on the audio-video event localization task rather than cross-modal retrieval. In addition, text-video retrieval plays a key role in cross-modal retrieval. Generally, text-video retrieval methods can be divided into two categories, i.e., dual-encoder and joint-encoder. 
Dual-encoder based methods usually train two individual encoders to learn video and text representations and then align these representations in joint embedding spaces. Among them, some works [13], [49]–[51] focus on learning single global representations, but they lack the consideration of fine-grained information. Thereby, several works devote efforts to aligning fine-grained information [52]–[58]. Joint-encoder based methods [59]–[62] typically feed the video and text into a joint encoder to capture their cross-modal interactions. In comparison to dual-encoder based methods, joint-encoder based methods explicitly learn fine-grained associations and achieve more impressive results, but sacrifice retrieval efficiency since every text-video pair needs to be fed into the encoder at inference time.\nDifferent from the above video retrieval, we consider a more realistic scenario, where most videos contain anomalous events, and a more realistic demand, where videos are long and untrimmed and partially relevant to the cross-modal query [63]. Such a new task poses extra challenges as well as multi-field research points. In addition, our ALAN also differs from video moment retrieval methods [64]–[67] in that it does not require complex cross-modal interactions.\nIII. BENCHMARK # Manually collecting a large-scale video benchmark is labor-intensive and time-consuming. It is also subjective, since video understanding can often be an ill-defined task with low annotator consistency [68]. Therefore, we start with two acknowledged datasets in the VAD community, i.e., UCF-Crime [6] and XD-Violence [7], and construct our benchmarks for VAR. We adopt these two datasets as the base since they thus far are the two most comprehensive VAD datasets in terms of length and scene, with total durations of 128 and 217 hours, respectively. Besides, they are also collected from a variety of scenarios. 
For example, UCF-Crime covers 13 real-world anomalies as well as normal activities, and XD-Violence captures 6 anomalies and normal activities from movies and YouTube. In addition, both of them contain half normal videos and half abnormal videos; therefore, retrieval systems retrieve both abnormal and normal videos from the video gallery given related cross-modal queries in VAR. Large and diverse video databases allow us to construct more practicable benchmarks for VAR.\nA. UCFCrime-AR # The UCF-Crime dataset consists of 1900 untrimmed videos with 950 abnormal videos and 950 normal videos. Notably, for anomaly videos in the training set, the start timestamp and duration of anomalous activities are unavailable. Normal videos are totally anomaly-free. We directly use all videos as the video search base. To achieve cross-modal retrieval, we require pairwise text descriptions.\nWe invite 8 experienced annotators who are proficient in Chinese and English to annotate these videos. The annotators watch the entire video and write the corresponding captions in both Chinese and English. Specifically, annotators are required to focus on anomalous events when describing anomaly videos. Due to the subtle differences in videos for the same anomaly category, we need to obtain quality sentence annotations to distinguish fine differences and avoid falling into a one-to-many dilemma [69], which often appears in current video retrieval. To be specific, at most two annotators describe videos in the same category, and for two similar videos in the same category, they describe the differences in as much detail as possible. Take the scene of a fight between two people as an example, e.g., \u0026ldquo;At a party, the yellow-haired man suddenly attacked a man opposite him.\u0026rdquo;, \u0026ldquo;A young man suddenly beat another man with glasses in the elevator.\u0026rdquo; The above two annotations clearly describe the difference between two similar videos. 
Finally, we double-check each sentence description to guarantee the quality.\nFollowing the partition of UCF-Crime, UCFCrime-AR includes 1610 training videos and 290 test videos. Each video is annotated with captions in both English and Chinese. In this work, we only use captions in English.\nB. XDViolence-AR # As for XD-Violence, we found that it is very hard to describe videos in a few sentences due to their complicated contents and scenarios. Hence we changed focus and started a new line of audio-to-video retrieval due to its natural audio-visual information, that is, we use videos and synchronous audios for cross-modal anomaly retrieval. Unlike texts, audios have the same granularity as videos. Similar to UCF-Crime, XD-Violence is also a weakly supervised dataset, namely, frame-level annotations are unknown. XDViolence-AR is split into two subsets, with 3954 long videos for training and 800 for testing.\nC. Benchmark Statistics # We compare two benchmarks with several cross-modal retrieval/location datasets in Table I. As we can see, video databases in UCFCrime-AR and XDViolence-AR are both large-scale and were made public in recent years, where the former is applied to video-text (V-T) anomaly retrieval, and the latter is applied to video-audio (V-A) anomaly retrieval. Notably, the average length of videos in VAR benchmarks is significantly longer than that of videos in traditional video retrieval datasets. For example, the average lengths of videos in UCFCrime-AR and XDViolence-AR are 242s and 164s, whereas those of MSR-VTT [70], VATEX [71], AVE [45], AudioCaps [72], and LLP [73] are in the range of 10s to 20s. TVR [74] is mainly applied to the video moment retrieval task, and its average video length is still much shorter than that of our benchmarks. Longer videos emphasize again that the goal of VAR is to retrieve long and untrimmed videos; such a setup meets realistic requirements, and also reveals that VAR is a more challenging task. For video-text UCFCrime-AR, we also present histogram distributions of captions in Figure 3. The average caption lengths of UCFCrime-AR-en and UCFCrime-AR-zh are 16.3 and 22.4, which are longer than those of previous datasets in video retrieval, e.g., the average caption lengths of VATEX-en [71], VATEX-zh [71], and MSR-VTT [70] are 15.23, 13.95, and 9.28, respectively.\nTABLE I COMPARISON OF UCFCRIME-AR AND XDVIOLENCE-AR WITH SEVERAL VIDEO-TEXT AND VIDEO-AUDIO RETRIEVAL/LOCALIZATION DATASETS.\nDatasets | Duration | #Videos | Avg. len. | Type | Year\nMSR-VTT [70] | 40h | 7.2k | 20s | V-T | 2016\nVATEX [71] | 114h | 41k | 10s | V-T | 2019\nTVR [74] | 463h | 21.8k | 76s | V-T | 2020\nAVE [45] | 11h | 4.1k | 10s | V-A | 2018\nAudioCaps [72] | 127h | 46k | 10s | V-A | 2019\nLLP [73] | 33h | 11k | 10s | V-A | 2020\nUCFCrime-AR | 128h | 1.9k | 242s | V-T | 2018\nXDViolence-AR | 217h | 4.8k | 164s | V-A | 2020\nFig. 3. Statistical histogram distributions on UCFCrime-AR. Left: text captions in English; Right: text captions in Chinese.\nIV. METHOD # In this section, we introduce ALAN in detail. In Sec. IV-A, we first introduce three encoders in ALAN, namely, the video encoder, text encoder, and audio encoder, whose goal is to project raw videos, texts, and audios into high-level representations. In Sec. IV-B, we introduce the anomaly-led sampling mechanism, which is utilized in both the video encoder and the audio encoder. In Sec. IV-C, we describe a novel pretext task, i.e., VPMPM, which is applied to video-text anomaly retrieval. At last, we describe the cross-modal alignment and training objectives in Secs. IV-D and IV-E.\nA. Encoders # Video encoder. Unlike images, videos possess space-time information [75], [76]. As a consequence, we consider both appearance and motion information to encode videos. Specifically, given a video v, we use I3D-RGB and I3D-Flow pre-trained on Kinetics [77] to extract frame-level object and motion features, respectively, then project these features into a d-dimensional space for the subsequent operations. 
Here, object and motion feature sequences are denoted as $F^o(v)$ and $F^m(v)$, respectively. Both sequences contain T clip features. For the sake of clarity, we use F(v) to denote either $F^o(v)$ or $F^m(v)$. Taking into account the variety of anomalous event durations in untrimmed videos, we sample two sparse clip sequences with different concerns, i.e., U and R, from F(v) by means of fixed-frame sampling and our proposed anomaly-led sampling.\nAs demonstrated in Figure 4, the video encoder is a symmetric two-stream model: one stream takes object features as input, and the other takes motion features. In order to fuse features in different modalities and different temporalities for final representations, we employ the Transformer [78] as the base model, which has been widely used in VAD and VR tasks with good results. For example, Huang et al. [34], [79] and Zhao et al. [80] used Transformers to tackle VAD and VR tasks, respectively. We first concatenate the two sampled clip sequences into a new sequence, i.e., $[U_{CLS}, U_1, ..., U_N, R_{CLS}, R_1, ..., R_N]$, where $U_{CLS}$ and $R_{CLS}$ are [CLS] tokens, computed as the average aggregation of all features in U and R, respectively. Then, we add positional embeddings [78] and sequence embeddings to this sequence. Here positional embeddings provide temporal information about the position in the video, and sequence embeddings indicate that features in U and R stem from different sequences. In the video encoder, the Self Encoder, a standard Transformer encoder layer, is devised to capture contextual information. The following Cross Encoder takes the self-modality as the query, and cross-modality contextual features as the key and value, to encode cross-modal representations through cross-modal attention. The Cross Encoder is composed of multi-head attention, a linear layer, a residual connection, and a layer normalization. 
Finally, we obtain two different video representations: one is the average of the outputs at $U_{CLS}$ and $R_{CLS}$, denoted as $g^v$ (including $g^{vo}$ and $g^{vm}$); the other is the mean of the average-pooled outputs of U and R, denoted as $h^v$ (including $h^{vo}$ and $h^{vm}$). Such a simple pooling operation is parameter-free and effective in our work, enabling $h^v$ to involve local fine-grained information.\nText encoder. Given a text caption t, we aim to learn the alignment between it and a related video at two different levels. At first, we leverage a pre-trained BERT [13] to extract features, aided by its widespread adoption and proven performance in language representations. Following the video encoder, we obtain $g^t$ from the [CLS] output of BERT, and $h^t$ by applying average pooling to the word-level representations. To match the object and motion representations of videos, we use the gated embedding unit [11] acting on $g^t$ and $h^t$ to produce $g^{to}$, $g^{tm}$ and $h^{to}$, $h^{tm}$, respectively.\nAudio encoder. Given an audio clip a, we first extract audio features using a pre-trained VGGish [81], and project these features into the d-dimensional space. As shown in Figure 4, the audio encoder is similar to the video encoder in terms of structure. The difference is that the audio encoder is a single-stream model and has no Cross Encoder. In a similar vein, two different audio representations $g^a$ and $h^a$ are obtained. The gated embedding unit is also applied to match the object and motion representations of videos.\nFig. 4. Overview of our ALAN. It consists of several components, i.e., video encoder, text encoder, audio encoder, pretext task VPMPM, and cross-modal alignment.\nB. Anomaly-Led Sampling # As mentioned, fixed-frame sampling (FS) alone cannot capture variable anomalous events in anomaly videos. We make use of anomaly priors and propose anomaly-led sampling (AS) so that anomalous clips are more likely to be selected. 
Since frame-level annotations are unknown, it is impossible to directly identify anomalous clips. To solve this problem, we leverage a weakly supervised anomaly detector to predict clip-level anomaly confidences $l ∈ R^T$, where $l_i ∈ [0, 1]$. With l in hand, we expect that the probability of a clip being selected is positively correlated with its anomaly confidence. A natural way is to select the top several clips with the highest anomaly confidence, but such a solution is too strict to be flexible. We believe that clips with low anomaly confidences should also have a certain probability of being selected, on the one hand for data augmentation, and on the other hand for salvaging false negatives of the anomaly detector. Taking inspiration from the selection strategy of evolutionary algorithms [82], [83], our anomaly-led sampling is based on the classical roulette-wheel selection [84]. To be specific, we regard anomaly confidences l as the fitness, and then normalize all values to the interval [0,1] to ensure that the selection probabilities sum to one,\n$p_i = exp(l_i / τ) / Σ_{j=1}^{T} exp(l_j / τ)$, (1)\nwhere p denotes the selection probabilities, and τ is a temperature hyper-parameter [85]. Then we calculate the cumulative probabilities,\n$q_i = Σ_{j=1}^{i} p_j$. (2)\nIt should be noted that $q_0 = 0$ and $q_T = 1$. The final step, then, is to generate N uniformly distributed random numbers in the interval [0, 1]. For each generated number r, the i-th feature in F(v) is selected if $q_{i-1}$ \u0026lt; r ≤ $q_i$. A sequence with N clip-level features is assembled in this way, where the larger the anomaly confidence of a clip is, the more likely it is to be selected. We present the algorithm flow of anomaly-led sampling in Algorithm 1.\nThe feature sequence based on anomaly-led sampling is mainly used to cover anomalous segments; meanwhile, we also use fixed-frame sampling, e.g., uniform or random, to generate another sequence with N clips for the entirety and normal scenarios.\nC. 
Video Prompt Based Masked Phrase Modeling # We propose a novel pretext task, i.e., video prompt based masked phrase modeling (VPMPM), for cross-modal fine-grained associations in video-text anomaly retrieval. VPMPM takes video representations and text representations as input and predicts the masked phrases, which is related to the prevalent masked language modeling in natural language processing. The main differences are that (1) VPMPM masks and predicts noun phrases and verb phrases instead of randomly selected words. Unlike single words, noun phrases and verb phrases comprise words of different parts of speech, e.g., nouns, adjectives, verbs, adverbs, etc., and thus better correspond to the local objects and motions in video frames; (2) VPMPM fuses video representations with text representations through cross-modal attention, where video representations serve as fixed prompts [23]. These two specific designs encourage the video encoder and text encoder to capture cross-modal and contextual representation interactions.\nAlgorithm 1: Anomaly-led sampling based on roulette-wheel selection # Input: anomaly confidence: l; video features: F(v)\nOutput: N clip-level features\nStep 1: Compute the selection probability;\nStep 2: Compute the cumulative probability;\nk ← 0;\nwhile k \u0026lt; N do\n  Step 3: Generate a random number r ∈ [0, 1]; // uniform distribution\n  Step 4: Select features;\n  if $q_{i-1}$ \u0026lt; r ≤ $q_i$ then\n    i-th feature in F(v) is selected;\n  end\n  k ← k + 1;\nend\nTo achieve this pretext task, we introduce a Prompting Decoder, which is a standard decoder layer used in the Transformer. Since VPMPM involves the objectives of predicting masked noun phrases and masked verb phrases, the Prompting Decoder needs to process noun phrases and verb phrases separately in a parameter-shared manner. 
Given the final video frame-level representations $X^v$ and text word-level representations $X^t$, we first randomly replace the representations of a noun phrase or verb phrase with mask embeddings [86], where each mask token is a shared, learned vector. Here we denote this masked text representation as $\\hat{X}^t$. Then we take $\\hat{X}^t$ as the query and $X^v$ as the key and value, and feed them into the Prompting Decoder to predict the masked contents.\nD. Cross-Modal Alignment # In this paper, cross-modal alignment is used to match representations of different modalities, e.g., video-text and video-audio, from two complementary perspectives. Hence, we deal with CLS alignment and AVG alignment. Unless otherwise stated, here we take video-text as an example to describe these two alignments.\nCLS alignment. CLS alignment is intended to compute the similarity between $g^v$ and $g^t$, which is a weighted sum [13] computed as,\n$s^g(v, t) = w_{to} cos(g^{vo}, g^{to}) + w_{tm} cos(g^{vm}, g^{tm})$, (3)\nwhere cos(·, ·) is the cosine similarity between two vectors. $w_{to}$ and $w_{tm}$ are weights, which are obtained from $g^{to}$ and $g^{tm}$, respectively. Specifically, we pass $g^{to}$ ($g^{tm}$) through a linear layer with softmax normalization, and output $w_{to}$ ($w_{tm}$).\nAVG alignment. AVG alignment is intended to compute the similarity $s^h(v, t)$ between $h^v$ and $h^t$ in the same way as CLS alignment. Notably, AVG alignment introduces more fine-grained information. The similarity is presented as,\nE. Training Objectives # The final similarity between v and t is the weighted sum of $s^g(v, t)$ and $s^h(v, t)$, namely,\n$s(v, t) = α s^g(v, t) + (1 − α) s^h(v, t)$, (5)\nwhere α is a hyper-parameter, which lies in the range of [0,1]. 
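As a rough illustration of the two-level alignment described above, the following Python sketch computes a per-level similarity as a weighted sum of object-stream and motion-stream cosine similarities, and then fuses the CLS-level and AVG-level scores with α. It is only a sketch under stated assumptions: the function and argument names are hypothetical, the stream weights are passed in directly rather than produced by a learned linear layer with softmax, and the fusion is assumed to take the form α·s_g + (1 − α)·s_h, which is consistent with the α = 0 and α = 1 rows of Table IX matching the AVG-only and CLS-only rows of Table VII.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors (lists of floats).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def stream_similarity(video_obj, video_mot, text_obj, text_mot, w_obj, w_mot):
    # Weighted sum of the object-stream and motion-stream cosine similarities.
    # w_obj and w_mot stand in for the text-derived weights (assumed inputs here).
    return w_obj * cosine(video_obj, text_obj) + w_mot * cosine(video_mot, text_mot)


def fused_similarity(s_cls, s_avg, alpha=0.5):
    # Assumed fusion: alpha = 1 keeps only CLS alignment, alpha = 0 only AVG alignment.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError('alpha must lie in [0, 1]')
    return alpha * s_cls + (1.0 - alpha) * s_avg


# Toy usage with 2-dimensional vectors (illustrative values only).
s_cls = stream_similarity([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], 0.5, 0.5)
s_avg = stream_similarity([1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.0], 0.5, 0.5)
score = fused_similarity(s_cls, s_avg, alpha=0.5)
```

Because both levels reduce to cosine similarities between independently encoded vectors, the fused score can be computed for every query-gallery pair at test time without any cross-modal interaction.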
Following the previous work [13], we obtain the bi-directional max-margin ranking loss, which is given by,\nwhere B is the batch size and $s_{ij} = s(v_i, t_j)$.\nTo optimize the weakly supervised anomaly detector in the video encoder, we use the top-k strategy [32], [87] to obtain the video-level prediction from the frame-level confidences l, which is calculated as,\n$ρ^v = (1/k) Σ_{l_i ∈ l^{topk}} l_i$, (7)\nwhere k = ⌊T/16⌋, and $l^{topk}$ is the set of the k largest frame-level confidences in l for the video v. We train this detector with the binary cross-entropy loss $L_{topk}$ between the video-level prediction $ρ^v$ and the video-level binary label $y^v$,\n$L_{topk} = −[y^v log(ρ^v) + (1 − y^v) log(1 − ρ^v)]$. (8)\nFor VPMPM in video-text anomaly retrieval, we adopt the cross-entropy loss $L_{mpm}$ between the model\u0026rsquo;s predicted probability $ρ^t(\\hat{X}^t, X^v)$ and the ground truth $y^{mask}$, which is presented as follows,\nwhere $y^{mask}$ is a one-hot vocabulary distribution.\nAt last, the overall loss is shown as follows,\nTABLE II COMPARISONS WITH THE STATE-OF-THE-ART METHODS ON UCFCRIME-AR.\n| Method | Text→Video R@1↑ | Text→Video R@5↑ | Text→Video R@10↑ | Text→Video MdR↓ | Video→Text R@1↑ | Video→Text R@5↑ | Video→Text R@10↑ | Video→Text MdR↓ | SumR↑ |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Random Baseline | 0.3 | 2.1 | 3.4 | 144.0 | 0.3 | 1.0 | 3.1 | 145.5 | 10.2 |\n| CE [12] | 6.6 | 19.7 | 32.4 | 23.5 | 5.5 | 19.7 | 32.4 | 21.0 | 116.3 |\n| MMT [13] | 8.3 | 26.2 | 39.3 | 16.0 | 7.2 | 23.1 | 39.0 | 16.0 | 143.1 |\n| T2VLAD [89] | 7.6 | 23.4 | 39.7 | 15.5 | 6.2 | 27.9 | 43.1 | 14.0 | 147.9 |\n| X-CLIP [58] | 8.2 | 27.2 | 41.7 | 16.0 | 6.9 | 25.8 | 40.3 | 15.0 | 150.1 |\n| HL-Net [7] | 5.5 | 20.2 | 38.3 | 19.5 | 5.5 | 22.8 | 35.5 | 20.0 | 127.8 |\n| XML [74] | 6.9 | 24.1 | 42.4 | 14.0 | 6.6 | 25.9 | 43.4 | 13.0 | 149.3 |\n| ALAN | 9.0 | 27.9 | 44.8 | 14.0 | 7.3 | 24.8 | 46.9 | 12.0 | 160.7 |\nTABLE III COMPARISONS WITH THE STATE-OF-THE-ART METHODS ON XDVIOLENCE-AR.\n| Method | Audio→Video R@1↑ | Audio→Video R@5↑ | Audio→Video R@10↑ | Audio→Video MdR↓ | Video→Audio R@1↑ | Video→Audio R@5↑ | Video→Audio R@10↑ | Video→Audio MdR↓ | SumR↑ |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Random Baseline | 0.4 | 0.6 | 2.5 | 399.5 | 0.1 | 0.6 | 0.8 | 399.5 | 5.0 |\n| CE [12] | 11.4 | 33.3 | 47.0 | 12.5 | 13.0 | 34.3 | 46.4 | 13.0 | 185.4 |\n| MMT [13] | 20.5 | 53.5 | 68.0 | 5.0 | 23.0 | 54.6 | 69.5 | 5.0 | 289.1 |\n| T2VLAD [89] | 22.4 | 56.1 | 71.0 | 4.0 | 23.2 | 57.1 | 73.5 | 4.0 | 303.3 |\n| X-CLIP [58] | 26.4 | 61.1 | 73.9 | 3.0 | 26.4 | 61.3 | 73.8 | 4.0 | 322.9 |\n| HL-Net [7] | 12.4 | 36.6 | 48.3 | 11.0 | 13.4 | 38.3 | 52.1 | 10.0 | 201.1 |\n| XML [74] | 22.9 | 55.6 | 70.3 | 5.0 | 22.6 | 57.4 | 71.4 | 4.0 | 300.2 |\n| ALAN | 29.8 | 68.0 | 82.0 | 3.0 | 32.3 | 70.0 | 82.3 | 3.0 | 364.4 |\nV. EXPERIMENTS #\nA. Experimental Settings # Evaluation metrics. Following prior works, we use rank-based metrics for performance evaluation, i.e., Recall at K (R@K, K=1, 5, 10) and Median Rank (MdR), together with the Sum of all Recalls (SumR) to measure the overall performance.\nImplementation details. We use spaCy to extract noun phrases and verb phrases. In the video encoder and audio encoder, the anomaly detector is composed of 3 temporal convolution layers with a kernel size of 7: the first layer has 128 units, followed by layers with 32 units and 1 unit. The first two layers are followed by ReLU, and the last layer is followed by Sigmoid. Dropout with a rate of 0.6 is applied to the first two layers. In the text encoder, we use the \u0026ldquo;BERT-base-cased\u0026rdquo; model and fine-tune it with a dropout rate of 0.3.\nTraining. We train our model with a batch size of 64 using the Adam [88] optimizer. The initial learning rate is set as $5 × 10^{-5}$ and decays by a multiplicative factor of 0.95 per epoch. For hyper-parameters, the hidden size d is set as 768, and the temperature parameter τ in Eq. 1 is set as 0.7. Empirically, we found that the weight ratio α=0.5 in Eq. 5 and the sampling length N=50 worked well across different benchmarks. Following the setup in [13], the margin ∆ in Eq. 6 is set as 0.05. λ1 and λ2 in Eq. 10 are set as 0.1 and 0.01, respectively; such a setup achieves optimal performance.\nB. Comparison with State-of-the-Art Methods # We conduct experiments on UCFCrime-AR and XDViolence-AR and compare our ALAN with several recent methods that are widely used in video retrieval, video moment retrieval, and VAD. 
CE [12], MMT [13], T2VLAD [89], and X-CLIP [58] are video retrieval methods. XML [74] is a video moment retrieval method; here it is used to retrieve videos, with the moment localization part removed since moment annotations are unavailable in VAR. HL-Net [7] is a VAD method; since VAD is quite distinct from VAR, it is hard to directly use a VAD method for VAR, so we modify it into a video encoder for VAR. All methods use BERT to extract language features except CE, which uses word2vec word embeddings [90]. We present comparison results in Tables II and III, and observe that our ALAN shows a clear advantage over the comparison methods in both text-video and audio-video VAR. Specifically, ALAN outperforms CE, MMT, T2VLAD, X-CLIP, HL-Net, and XML on UCFCrime-AR by 44.4, 17.6, 12.8, 10.6, 32.9, and 11.4 in terms of SumR, respectively. Furthermore, ALAN also achieves clear improvements against competitors on XDViolence-AR, with a significant improvement of 41.5 in terms of SumR over the previous best method. Moreover, it can be found that, in comparison to video and text, video and audio are easier to align. We argue that video and audio are synchronous with concordant granularity, thereby leading to better alignment performance in VAR.\nC. Ablation Studies # Study on anomaly-led sampling. As aforementioned, we propose a novel sampling mechanism, i.e., anomaly-led sampling, which is combined with ordinary fixed-frame sampling, and the joint effort is devoted to capturing local anomalous segments as well as overall information. To investigate the effectiveness of anomaly-led sampling, we conduct experiments on the two benchmarks, and show results in Tables IV and V. 
As we can see from the first two rows, using only fixed-frame sampling or only anomaly-led sampling results in a clear performance drop on both UCFCrime-AR and XDViolence-AR. Besides, anomaly-led sampling alone is inferior to fixed-frame sampling alone on XDViolence-AR; we find that the main reason is that the anomaly-led sampling mechanism is applied to both video and audio, resulting in misalignment of key segments to some extent. Moreover, we also investigate the effect of sampling length. From the third row, we found that increasing the sampling length from 50 to 100 does not dramatically improve performance, and fixed-frame sampling still lags behind the combination of fixed-frame sampling and anomaly-led sampling, even though both then have the same total sampling length. This also clearly demonstrates that the joint effect of anomaly-led sampling and fixed-frame sampling enables our model to capture key anomalous segments as well as holistic information, thus facilitating cross-modal alignment from local-anomaly and global-video perspectives. For example, in Figure 8, video frames that are selected by anomaly-led sampling are aligned with the key anomaly descriptions, e.g., two cars collided violently, a man in black lay on the ground and shot. On the other hand, the video frames selected by fixed-frame sampling are aligned with the complete descriptions.\nTABLE IV COMPARISONS OF DIFFERENT SAMPLINGS ON UCFCRIME-AR.\n| Sampling | Text→Video R@1↑ | Text→Video R@10↑ | Video→Text R@1↑ | Video→Text R@10↑ |\n| --- | --- | --- | --- | --- |\n| FS (N=50) | 6.6 | 35.5 | 4.8 | 42.4 |\n| AS (N=50) | 7.9 | 37.6 | 5.5 | 41.7 |\n| FS (N=100) | 6.6 | 37.6 | 6.2 | 40.3 |\n| FS+AS (N=50) | 9.0 | 44.8 | 7.3 | 46.9 |\nTABLE V COMPARISONS OF DIFFERENT SAMPLINGS ON XDVIOLENCE-AR.\n| Sampling | Audio→Video R@1↑ | Audio→Video R@10↑ | Video→Audio R@1↑ | Video→Audio R@10↑ |\n| --- | --- | --- | --- | --- |\n| FS (N=50) | 29.6 | 80.4 | 31.1 | 80.9 |\n| AS (N=50) | 26.9 | 78.6 | 27.4 | 78.9 |\n| FS (N=100) | 28.5 | 81.0 | 29.8 | 81.8 |\n| FS+AS (N=50) | 29.8 | 82.0 | 32.3 | 82.3 |\nStudy on VPMPM. 
Here we conduct experiments to verify the advantage of VPMPM for video-text fine-grained associations. When VPMPM is removed from ALAN at training time, the performance clearly drops, as shown in Table VI. Besides, masking and predicting random words rather than noun phrases and verb phrases in VPMPM hurts performance. We can also see that using noun phrases and verb phrases is superior to using noun and verb words on most evaluation metrics. This demonstrates that noun phrases and verb phrases, as sequences of words with different parts of speech, can better align with related local contents in videos.\nTABLE VI VPMPM STUDIES ON UCFCRIME-AR.\n| Method | Text→Video R@1↑ | Text→Video R@10↑ | Video→Text R@1↑ | Video→Text R@10↑ |\n| --- | --- | --- | --- | --- |\n| w/o VPMPM | 7.9 | 43.4 | 7.2 | 43.4 |\n| random words | 8.6 | 43.8 | 6.2 | 44.5 |\n| noun\u0026amp;verb words | 10.0 | 44.5 | 6.6 | 42.8 |\n| noun\u0026amp;verb phrases | 9.0 | 44.8 | 7.3 | 46.9 |\nTABLE VII COMPARISONS OF DIFFERENT ALIGNMENTS ON UCFCRIME-AR.\n| Alignment | Text→Video R@1↑ | Text→Video R@10↑ | Video→Text R@1↑ | Video→Text R@10↑ |\n| --- | --- | --- | --- | --- |\n| CLS | 6.2 | 42.8 | 7.6 | 40.3 |\n| AVG | 6.6 | 33.8 | 4.8 | 36.9 |\n| CLS+AVG | 9.0 | 44.8 | 7.3 | 46.9 |\nTABLE VIII COMPARISONS OF DIFFERENT ALIGNMENTS ON XDVIOLENCE-AR.\n| Alignment | Audio→Video R@1↑ | Audio→Video R@10↑ | Video→Audio R@1↑ | Video→Audio R@10↑ |\n| --- | --- | --- | --- | --- |\n| CLS | 26.8 | 77.9 | 28.6 | 77.4 |\n| AVG | 28.0 | 79.3 | 30.0 | 80.1 |\n| CLS+AVG | 29.8 | 82.0 | 32.3 | 82.3 |\nFig. 5. Influences of α on both UCFCrime-AR and XDViolence-AR.\nStudy on cross-modal alignment. Tables VII and VIII present the performance of two different alignments in our ALAN. We found that CLS alignment and AVG alignment obtain worse results when used alone than when used jointly. Such results demonstrate the complementarity of these two alignments. A key observation is that AVG alignment performs better than CLS alignment on XDViolence-AR, but the opposite is true on UCFCrime-AR; we suspect that video and audio are easier to align at the fine-grained level due to their concordant granularity. 
Moreover, we also investigate the influence of α. We vary α from 0.0 to 1.0 with an interval of 0.1. As shown in Figure 5, as α increases, the performance gradually improves and then decreases; when α is set as 0.5, our method achieves the best performance. In order to further explore how to choose α, we also show the detailed retrieval results of different α in Tables IX and X. It is not hard to see that setting α in the range of 0.4-0.6 is a balanced choice, where the two cross-modal alignments make nearly the same contribution.\nTABLE IX DETAILED INFLUENCES OF α ON UCFCRIME-AR.\n| Value of α | Text→Video R@1↑ | Text→Video R@10↑ | Video→Text R@1↑ | Video→Text R@10↑ |\n| --- | --- | --- | --- | --- |\n| 0.0 | 6.6 | 33.8 | 4.8 | 36.9 |\n| 0.2 | 8.3 | 38.6 | 6.2 | 40.3 |\n| 0.4 | 7.9 | 42.4 | 6.9 | 44.5 |\n| 0.5 | 9.0 | 44.8 | 7.3 | 46.9 |\n| 0.6 | 7.6 | 45.2 | 6.2 | 47.2 |\n| 0.8 | 7.6 | 43.8 | 5.5 | 43.4 |\n| 1.0 | 6.2 | 42.8 | 7.6 | 40.3 |\nTABLE X DETAILED INFLUENCES OF α ON XDVIOLENCE-AR.\n| Value of α | Audio→Video R@1↑ | Audio→Video R@10↑ | Video→Audio R@1↑ | Video→Audio R@10↑ |\n| --- | --- | --- | --- | --- |\n| 0.0 | 28.0 | 79.3 | 30.0 | 80.1 |\n| 0.2 | 30.4 | 80.8 | 32.1 | 82.6 |\n| 0.4 | 31.9 | 80.9 | 32.3 | 82.4 |\n| 0.5 | 29.8 | 82.0 | 32.3 | 82.3 |\n| 0.6 | 31.0 | 82.6 | 32.8 | 80.8 |\n| 0.8 | 28.0 | 79.8 | 30.1 | 79.3 |\n| 1.0 | 26.8 | 77.9 | 28.6 | 77.4 |\nD. Qualitative Analyses # Visualization of retrieval results. Some text-to-video retrieval examples on UCFCrime-AR are exhibited in Figure 6, where the retrieval results of a normal video are shown at the far right. We observe that ALAN successfully retrieves the related video given a text query, and there are considerable similarities between the top 3 retrieved videos. This also demonstrates that VAR is a challenging task, as some scenes are similar with delicate differences.\nFig. 6. Some retrieval examples on UCFCrime-AR. We visualize top 3 retrieved videos (green: correct; pink: incorrect).\nVisualization of coarse caption retrieval. In the VAR task, the purpose of using accurate captions is to distinguish fine differences and avoid falling into the one-to-many dilemma [69]. 
To further verify the generalization capacity of ALAN, we use several coarse captions that are not directly used in model training to retrieve videos. The results in Figure 7 clearly show that ALAN works very well with coarse captions of different lengths, and also demonstrate that ALAN has learned abstract semantic information, e.g., explosion, fighting, traffic. This also indicates that our method can meet practical requirements where users cannot provide a complete text description of the videos they intend to search. For instance, in the example in the lower right of Figure 7, users give the retrieval model an incomplete description, \u0026ldquo;man robbed people\u0026rdquo;, and the model returns the top 3 related videos, in which the contents correspond to robbery, stealing, man, and people.\nFig. 7. Some coarse caption retrieval examples on UCFCrime-AR.\nVisualization of anomaly-led sampling. We visualize video frames selected by fixed-frame sampling and anomaly-led sampling in Figure 8. These examples are taken from videos of road accident and shooting scenes. It can be seen from the second row that the duration of the anomalous event accounts for less than one-fifth of the entire video length; therefore, frames related to the anomalous event are hard to select with fixed-frame sampling. In stark contrast, anomaly-led sampling is based on anomaly confidences generated by the anomaly detector, and it can select more frames related to anomalous events since the selection probability is positively correlated with anomaly confidence, and the anomaly detector generates high confidences in anomalous segments, as shown in the second row.\nVisualization of zero-shot retrieval. ALAN is trained on UCFCrime-AR and XDViolence-AR for text-video and audio-video anomaly retrieval, respectively. 
Moreover, scenarios in these two benchmarks are different, because videos from UCFCrime-AR are captured with fixed cameras, whereas videos from XDViolence-AR are collected from movies and YouTube. Here we explore whether, given a cross-modal query from UCFCrime-AR (or XDViolence-AR), ALAN trained on UCFCrime-AR (or XDViolence-AR) is capable of retrieving some relevant videos from XDViolence-AR (or texts from UCFCrime-AR). We show the top 2 retrieval results in Figure 9. In text-to-video anomaly retrieval, we found that given text queries from UCFCrime-AR, ALAN can retrieve some videos from XDViolence-AR that look semantically plausible, even if there are no completely relevant videos in XDViolence-AR. Interestingly, the video in the bottom left is an animation. ALAN learns several local semantic contents and retrieves videos based on these local semantic contents, such as \u0026ldquo;huge fire\u0026rdquo; and \u0026ldquo;mushroom cloud\u0026rdquo;. In video-to-text anomaly retrieval, although the retrieved text descriptions are not completely related, ALAN captures partial semantic information from movie videos, such as \u0026ldquo;a man\u0026rdquo;, \u0026ldquo;a female companion\u0026rdquo;, \u0026ldquo;knock down somebody with fists\u0026rdquo;, etc.\nFig. 8. Different samplings for video frame selection. Left: road accident; Right: shooting. Rows along the time axis: fixed-frame sampling, anomaly confidence, anomaly-led sampling, and ground truth (GT). Example caption: \u0026ldquo;At night, the two cars collided violently in the middle of the crossroad and crashed into the side of the road.\u0026rdquo;\nFig. 9. Zero-shot retrieval results. The left two columns present zero-shot text-to-video anomaly retrieval, and the right two columns present zero-shot video-to-text anomaly retrieval.\nE. Running Time # We report the retrieval time for UCFCrime-AR with 290 video-text test pairs and XDViolence-AR with 800 video-audio test pairs; our method costs 2.7s and 5.6s, respectively. 
Generally, it only needs about 0.008s to process a pair on both datasets, showing its high efficiency. The reason why our method retains high retrieval efficiency is that it has a dual-encoder structure during the test stage, that is, it uses two separate encoders to embed video and text features and project them into the joint latent space, and only the cosine similarity between video and text features is calculated as the similarity, without complicated and inefficient cross-modal interactions. However, it is worth noting that, during the training phase, our method integrates text and video as inputs to a joint encoder for cross-modality fusion, which can establish local correlations between video and text features and improve retrieval accuracy. Therefore, our method obtains the advantages of both kinds of methods, that is, achieving fine-grained video-text interactions while maintaining high retrieval efficiency.\nVI. CONCLUSION # In this paper, we introduce a new task called video anomaly retrieval to remedy the inadequacy of video anomaly detection in depicting abnormal events, and to further facilitate video anomaly analysis research in cross-modal scenarios. We construct two VAR benchmarks, i.e., UCFCrime-AR and XDViolence-AR, based on popular VAD datasets. Moreover, we propose ALAN, which includes several components: anomaly-led sampling is used to capture local anomalous segments and coordinates with ordinary fixed-frame sampling to achieve complementary effects; video prompt based masked phrase modeling is used to learn cross-modal fine-grained associations; and cross-modal alignment is used to match cross-modal representations from two perspectives. Future work will lie in two aspects: 1) exploiting cross-modal pre-trained models to capture more powerful knowledge for VAR; 2) leveraging VAR to assist VAD methods for more precise anomaly detection.\nREFERENCES # [1] M. Sabokrou, M. Fayyaz, M. Fathy, and R. 
Klette, \u0026ldquo;Deep-cascade: Cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes,\u0026rdquo; IEEE Transactions on Image Processing , vol. 26, no. 4, pp. 1992–2004, 2017.\n[2] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026ldquo;Future frame prediction for anomaly detection–a new baseline,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.\n[3] P. Wu, J. Liu, and F. Shen, \u0026ldquo;A deep one-class neural network for anomalous event detection in complex scenes,\u0026rdquo; IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 7, pp. 2609–2622, 2019.\n[4] H. Park, J. Noh, and B. Ham, \u0026ldquo;Learning memory-guided normality for anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14 372–14 381.\n[5] M.-I. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, \u0026ldquo;Anomaly detection in video via self-supervised and multi-task learning,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 742–12 752.\n[6] W. Sultani, C. Chen, and M. Shah, \u0026ldquo;Real-world anomaly detection in surveillance videos,\u0026rdquo; in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.\n[7] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, \u0026ldquo;Not only look, but also listen: Learning multimodal violence detection under weak supervision,\u0026rdquo; in Proceedings of the European Conference on Computer Vision. Springer, 2020, pp. 322–339.\n[8] J.-C. Feng, F.-T. Hong, and W.-S. Zheng, \u0026ldquo;Mist: Multiple instance selftraining framework for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2021, pp. 14 009–14 018.\n[9] Y. Tian, G. Pang, Y. Chen, R. 
Singh, J. W. Verjans, and G. Carneiro, \u0026ldquo;Weakly-supervised video anomaly detection with robust temporal feature magnitude learning,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.\n[10] J. Wu, W. Zhang, G. Li, W. Wu, X. Tan, Y. Li, E. Ding, and L. Lin, \u0026ldquo;Weakly-supervised spatio-temporal anomaly detection in surveillance video,\u0026rdquo; arXiv preprint arXiv:2108.03825, 2021.\n[11] A. Miech, I. Laptev, and J. Sivic, \u0026ldquo;Learning a text-video embedding from incomplete and heterogeneous data,\u0026rdquo; arXiv preprint arXiv:1804.02516 , 2018.\n[12] Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman, \u0026ldquo;Use what you have: Video retrieval using representations from collaborative experts,\u0026rdquo; arXiv preprint arXiv:1907.13487, 2019.\n[13] V. Gabeur, C. Sun, K. Alahari, and C. Schmid, \u0026ldquo;Multi-modal transformer for video retrieval,\u0026rdquo; in Proceedings of the 16th European Conference on Computer Vision. Springer, 2020, pp. 214–229.\n[14] X. Yang, S. Wang, J. Dong, J. Dong, M. Wang, and T.-S. Chua, \u0026ldquo;Video moment retrieval with cross-modal neural architecture search,\u0026rdquo; IEEE Transactions on Image Processing, vol. 31, pp. 1204–1216, 2022.\n[15] A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, \u0026ldquo;Tubedetr: Spatiotemporal video grounding with transformers,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2022, pp. 16 442–16 453.\n[16] R. Cui, T. Qian, P. Peng, E. Daskalaki, J. Chen, X. Guo, H. Sun, and Y.-G. Jiang, \u0026ldquo;Video moment retrieval from text queries via single frame annotation,\u0026rdquo; in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval , 2022, pp. 1033–1043.\n[17] G. Wang, X. Xu, F. Shen, H. Lu, Y. Ji, and H. T. 
Shen, \u0026ldquo;Cross-modal dynamic networks for video moment retrieval with text query,\u0026rdquo; IEEE Transactions on Multimedia, vol. 24, pp. 1221–1232, 2022.\n[18] Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, \u0026ldquo;Dynamic neural networks: A survey,\u0026rdquo; arXiv preprint arXiv:2102.04906, 2021.\n[19] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, \u0026ldquo;Dynamicvit: Efficient vision transformers with dynamic token sparsification,\u0026rdquo; arXiv preprint arXiv:2106.02034, 2021.\n[20] Y. Zhi, Z. Tong, L. Wang, and G. Wu, \u0026ldquo;Mgsampler: An explainable sampling strategy for video action recognition,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.\n[21] M. Fayyaz, S. A. Koohpayegani, F. R. Jafari, S. Sengupta, H. R. V. Joze, E. Sommerlade, H. Pirsiavash, and J. Gall, \u0026ldquo;Adaptive token sampling for efficient vision transformers,\u0026rdquo; in Proceedings of the European Conference on Computer Vision. Springer, 2022, pp. 396–414.\n[22] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, \u0026ldquo;Align before fuse: Vision and language representation learning with momentum distillation,\u0026rdquo; Advances in Neural Information Processing Systems, vol. 34, pp. 9694–9705, 2021.\n[23] Z. Hou, F. Sun, Y.-K. Chen, Y. Xie, and S.-Y. Kung, \u0026ldquo;Milan: Masked image pretraining on language assisted representation,\u0026rdquo; arXiv preprint arXiv:2208.06049, 2022.\n[24] S. Lee, H. G. Kim, and Y. M. Ro, \u0026ldquo;Bman: Bidirectional multi-scale aggregation networks for abnormal event detection,\u0026rdquo; IEEE Transactions on Image Processing, vol. 29, pp. 2395–2408, 2019.\n[25] R. T. Ionescu, F. S. Khan, M.-I. Georgescu, and L. Shao, \u0026ldquo;Object-centric auto-encoders and dummy anomalies for abnormal event detection in video,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 
7842–7851.\n[26] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, \u0026ldquo;Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1705–1714.\n[27] G. Wang, Y. Wang, J. Qin, D. Zhang, X. Bao, and D. Huang, \u0026ldquo;Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles,\u0026rdquo; 2022.\n[28] Z. Yang, P. Wu, J. Liu, and X. Liu, \u0026ldquo;Dynamic local aggregation network with adaptive clusterer for anomaly detection,\u0026rdquo; in Proceedings of the European Conference on Computer Vision. Springer, 2022, pp. 404–421.\n[29] Z. Yang, J. Liu, Z. Wu, P. Wu, and X. Liu, \u0026ldquo;Video event restoration based on keyframes for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 592–14 601.\n[30] C. Yan, S. Zhang, Y. Liu, G. Pang, and W. Wang, \u0026ldquo;Feature prediction diffusion model for video anomaly detection,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5527–5537.\n[31] H. Lv, C. Zhou, Z. Cui, C. Xu, Y. Li, and J. Yang, \u0026ldquo;Localizing anomalies from weakly-labeled videos,\u0026rdquo; IEEE Transactions on Image Processing, vol. 30, pp. 4505–4515, 2021.\n[32] P. Wu and J. Liu, \u0026ldquo;Learning causal temporal relation and feature discrimination for anomaly detection,\u0026rdquo; IEEE Transactions on Image Processing, vol. 30, pp. 3513–3527, 2021.\n[33] C. Cao, X. Zhang, S. Zhang, P. Wang, and Y. Zhang, \u0026ldquo;Adaptive graph convolutional networks for weakly supervised anomaly detection in videos,\u0026rdquo; arXiv preprint arXiv:2202.06503, 2022.\n[34] C. Huang, C. Liu, J. Wen, L. Wu, Y. Xu, Q. Jiang, and Y. 
Wang, \u0026ldquo;Weakly supervised video anomaly detection via self-guided temporal discriminative transformer,\u0026rdquo; IEEE Transactions on Cybernetics, 2022 .\n[35] Y. Peng, X. Huang, and Y. Zhao, \u0026ldquo;An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges,\u0026rdquo; IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2372–2385, 2017.\n[36] Y. Peng, W. Zhu, Y. Zhao, C. Xu, Q. Huang, H. Lu, Q. Zheng, T. Huang, and W. Gao, \u0026ldquo;Cross-media analysis and reasoning: advances and directions,\u0026rdquo; Frontiers of Information Technology \u0026amp; Electronic Engineering , vol. 18, no. 1, pp. 44–57, 2017.\n[37] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, \u0026ldquo;Coca: Contrastive captioners are image-text foundation models,\u0026rdquo; arXiv preprint arXiv:2205.01917, 2022.\n[38] R. Zuo, X. Deng, K. Chen, Z. Zhang, Y.-K. Lai, F. Liu, C. Ma, H. Wang, Y.-J. Liu, and H. Wang, \u0026ldquo;Fine-grained video retrieval with scene sketches,\u0026rdquo; IEEE Transactions on Image Processing, 2023.\n[39] M. Monfort, S. Jin, A. Liu, D. Harwath, R. Feris, J. Glass, and A. Oliva, \u0026ldquo;Spoken moments: Learning joint audio-visual representations from video descriptions,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 871–14 881.\n[40] A.-M. Oncescu, A. Koepke, J. F. Henriques, Z. Akata, and S. Albanie, \u0026ldquo;Audio retrieval with natural language queries,\u0026rdquo; arXiv preprint arXiv:2105.02192, 2021.\n[41] V. Gabeur, A. Nagrani, C. Sun, K. Alahari, and C. Schmid, \u0026ldquo;Masking modalities for cross-modal video retrieval,\u0026rdquo; arXiv preprint arXiv:2111.01300, 2021.\n[42] A. Rouditchenko, A. Boggust, D. Harwath, S. Thomas, H. Kuehne, B. Chen, R. Panda, R. Feris, B. Kingsbury, M. 
Picheny et al., \u0026ldquo;Cascaded multilingual audio-visual learning from videos,\u0026rdquo; arXiv preprint arXiv:2111.04823, 2021.\n[43] P. Morgado, N. Vasconcelos, and I. Misra, \u0026ldquo;Audio-visual instance discrimination with cross-modal agreement,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 475–12 486.\n[44] W. Shen, J. Song, X. Zhu, G. Li, and H. T. Shen, \u0026ldquo;End-to-end pretraining with hierarchical matching and momentum contrast for text-video retrieval,\u0026rdquo; IEEE Transactions on Image Processing, 2023.\n[45] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, \u0026ldquo;Audio-visual event localization in unconstrained videos,\u0026rdquo; in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 247–263.\n[46] Y. Wei, D. Hu, Y. Tian, and X. Li, \u0026ldquo;Learning in audio-visual context: A review, analysis, and new perspective,\u0026rdquo; arXiv preprint arXiv:2208.09579, 2022.\n[47] Y. Wu, L. Zhu, Y. Yan, and Y. Yang, \u0026ldquo;Dual attention matching for audio-visual event localization,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6292–6300.\n[48] Y.-B. Lin, Y.-L. Sung, J. Lei, M. Bansal, and G. Bertasius, \u0026ldquo;Vision transformers are parameter-efficient audio-visual learners,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2299–2309.\n[49] X. Li, C. Xu, G. Yang, Z. Chen, and J. Dong, \u0026ldquo;W2vv++ fully deep learning for ad-hoc video search,\u0026rdquo; in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1786–1794.\n[50] J. Dong, X. Li, C. Xu, X. Yang, G. Yang, X. Wang, and M. Wang, \u0026ldquo;Dual encoding for video retrieval by text,\u0026rdquo; IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.\n[51] S. Liu, H. Fan, S. Qian, Y. Chen, W. Ding, and Z. 
Wang, \u0026ldquo;Hit: Hierarchical transformer with momentum contrast for video-text retrieval,\u0026rdquo; arXiv preprint arXiv:2103.15049, 2021.\n[52] M. Wray, D. Larlus, G. Csurka, and D. Damen, \u0026ldquo;Fine-grained action retrieval through multiple parts-of-speech embeddings,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 450–459.\n[53] P. Wu, X. He, M. Tang, Y. Lv, and J. Liu, \u0026ldquo;Hanet: Hierarchical alignment networks for video-text retrieval,\u0026rdquo; in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3518–3527.\n[54] N. Han, J. Chen, G. Xiao, H. Zhang, Y. Zeng, and H. Chen, \u0026ldquo;Fine-grained cross-modal alignment network for text-video retrieval,\u0026rdquo; in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3826–3834.\n[55] J. Yang, Y. Bisk, and J. Gao, \u0026ldquo;Taco: Token-aware cascade contrastive learning for video-text alignment,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11 562–11 572.\n[56] W. Wang, M. Zhang, R. Chen, G. Cai, P. Zhou, P. Peng, X. Guo, J. Wu, and X. Sun, \u0026ldquo;Dig into multi-modal cues for video retrieval with hierarchical alignment,\u0026rdquo; in Proceedings of the International Joint Conference on Artificial Intelligence, 2021.\n[57] Y. Ge, Y. Ge, X. Liu, D. Li, Y. Shan, X. Qie, and P. Luo, \u0026ldquo;Bridging video-text retrieval with multiple choice questions,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 167–16 176.\n[58] Y. Ma, G. Xu, X. Sun, M. Yan, J. Zhang, and R. Ji, \u0026ldquo;X-clip: End-to-end multi-grained contrastive learning for video-text retrieval,\u0026rdquo; in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 638–647.\n[59] L. Li, Y.-C. Chen, Y. Cheng, Z. Gan, L. Yu, and J. 
Liu, \u0026ldquo;Hero: Hierarchical encoder for video+language omni-representation pre-training,\u0026rdquo; arXiv preprint arXiv:2005.00200, 2020.\n[60] J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu, \u0026ldquo;Less is more: Clipbert for video-and-language learning via sparse sampling,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7331–7341.\n[61] H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, J. Li, T. Bharti, and M. Zhou, \u0026ldquo;Univl: A unified video and language pre-training model for multimodal understanding and generation,\u0026rdquo; arXiv preprint arXiv:2002.06353, 2020.\n[62] K. Ji, J. Liu, W. Hong, L. Zhong, J. Wang, J. Chen, and W. Chu, \u0026ldquo;Cret: Cross-modal retrieval transformer for efficient text-video retrieval,\u0026rdquo; in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 949–959.\n[63] J. Dong, X. Chen, M. Zhang, X. Yang, S. Chen, X. Li, and X. Wang, \u0026ldquo;Partially relevant video retrieval,\u0026rdquo; in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 246–257.\n[64] J. Gao, C. Sun, Z. Yang, and R. Nevatia, \u0026ldquo;Tall: Temporal activity localization via language query,\u0026rdquo; in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5267–5275.\n[65] D. Zhang, X. Dai, X. Wang, Y.-F. Wang, and L. S. Davis, \u0026ldquo;Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1247–1257.\n[66] N. C. Mithun, S. Paul, and A. K. Roy-Chowdhury, \u0026ldquo;Weakly supervised video moment retrieval from text queries,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 592–11 601.\n[67] X. Ding, N. Wang, S. 
Zhang, Z. Huang, X. Li, M. Tang, T. Liu, and X. Gao, \u0026ldquo;Exploring language hierarchy for video grounding,\u0026rdquo; IEEE Transactions on Image Processing, vol. 31, pp. 4693–4706, 2022.\n[68] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic, \u0026ldquo;Howto100m: Learning a text-video embedding by watching hundred million narrated video clips,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2630–2640.\n[69] M. Wray, H. Doughty, and D. Damen, \u0026ldquo;On semantic similarity in video retrieval,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3650–3660.\n[70] J. Xu, T. Mei, T. Yao, and Y. Rui, \u0026ldquo;Msr-vtt: A large video description dataset for bridging video and language,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 5288–5296.\n[71] X. Wang, J. Wu, J. Chen, L. Li, Y.-F. Wang, and W. Y. Wang, \u0026ldquo;Vatex: A large-scale, high-quality multilingual dataset for video-and-language research,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4581–4591.\n[72] C. D. Kim, B. Kim, H. Lee, and G. Kim, \u0026ldquo;Audiocaps: Generating captions for audios in the wild,\u0026rdquo; in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 119–132.\n[73] Y. Tian, D. Li, and C. Xu, \u0026ldquo;Unified multisensory perception: Weakly-supervised audio-visual video parsing,\u0026rdquo; in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, 2020, pp. 436–454.\n[74] J. Lei, L. Yu, T. L. Berg, and M. Bansal, \u0026ldquo;Tvr: A large-scale dataset for video-subtitle moment retrieval,\u0026rdquo; in Proceedings of the 16th European Conference on Computer Vision. 
Springer, 2020, pp. 447–463.\n[75] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, \u0026ldquo;Vivit: A video vision transformer,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836–6846.\n[76] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, \u0026ldquo;Video transformer network,\u0026rdquo; in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3163–3172.\n[77] J. Carreira and A. Zisserman, \u0026ldquo;Quo vadis, action recognition? a new model and the kinetics dataset,\u0026rdquo; in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.\n[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, \u0026ldquo;Attention is all you need,\u0026rdquo; in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.\n[79] C. Huang, Y. Liu, Z. Zhang, C. Liu, J. Wen, Y. Xu, and Y. Wang, \u0026ldquo;Hierarchical graph embedded pose regularity learning via spatio-temporal transformer for abnormal behavior detection,\u0026rdquo; in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 307–315.\n[80] S. Zhao, L. Zhu, X. Wang, and Y. Yang, \u0026ldquo;Centerclip: Token clustering for efficient text-video retrieval,\u0026rdquo; in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 970–981.\n[81] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, \u0026ldquo;Audio set: An ontology and human-labeled dataset for audio events,\u0026rdquo; in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017, pp. 776–780.\n[82] K. Wu, J. Liu, X. Hao, P. Liu, and F. 
Shen, \u0026ldquo;An evolutionary multiobjective framework for complex network reconstruction using community structure,\u0026rdquo; IEEE Transactions on Evolutionary Computation, vol. 25, no. 2, pp. 247–261, 2020.\n[83] Y. Jin, H. Wang, T. Chugh, D. Guo, and K. Miettinen, \u0026ldquo;Data-driven evolutionary optimization: An overview and case studies,\u0026rdquo; IEEE Transactions on Evolutionary Computation, vol. 23, no. 3, pp. 442–458, 2018.\n[84] T. Bäck, Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press, 1996.\n[85] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, \u0026ldquo;Unsupervised feature learning via non-parametric instance discrimination,\u0026rdquo; in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.\n[86] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, \u0026ldquo;Masked autoencoders are scalable vision learners,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 000–16 009.\n[87] S. Paul, S. Roy, and A. K. Roy-Chowdhury, \u0026ldquo;W-talc: Weakly-supervised temporal activity localization and classification,\u0026rdquo; in Proceedings of the European Conference on Computer Vision, 2018, pp. 563–579.\n[88] D. P. Kingma and J. Ba, \u0026ldquo;Adam: A method for stochastic optimization,\u0026rdquo; arXiv preprint arXiv:1412.6980, 2014.\n[89] X. Wang, L. Zhu, and Y. Yang, \u0026ldquo;T2vlad: global-local sequence alignment for text-video retrieval,\u0026rdquo; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5079–5088.\n[90] T. Mikolov, K. Chen, G. Corrado, and J. 
Dean, \u0026ldquo;Efficient estimation of word representations in vector space,\u0026rdquo; arXiv preprint arXiv:1301.3781 , 2013.\n","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/toward-video-anomaly-retrieval-from-video/","section":"Papers","summary":"Proposes a new task called Video Anomaly Retrieval (VAR), introduces two large-scale benchmarks (UCFCrime-AR and XDViolence-AR), and presents a model called Anomaly-Led Alignment Network (ALAN) for VAR, focusing on retrieving long untrimmed videos using cross-modal queries such as language descriptions and synchronous audios. The work introduces anomaly-led sampling, a pretext task (VPMPM), and cross-modal alignment strategies to address the challenges of VAR in practical scenarios.","title":"Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model","type":"other"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/wei-shi-zheng/","section":"Authors","summary":"","title":"Wei-Shi Zheng","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/weiling-chen/","section":"Authors","summary":"","title":"Weiling Chen","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiangteng-he/","section":"Authors","summary":"","title":"Xiangteng He","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiao-ming-wu/","section":"Authors","summary":"","title":"Xiao-Ming Wu","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiaojin-gong/","section":"Authors","summary":"","title":"Xiaojin Gong","type":"authors"},{"content":"","date":"1 January 
2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yuxin-peng/","section":"Authors","summary":"","title":"Yuxin Peng","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/zi-jian-yew/","section":"Authors","summary":"","title":"Zi Jian Yew","type":"authors"},{"content":"","date":"1 January 2023","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/zuhao-liu/","section":"Authors","summary":"","title":"Zuhao Liu","type":"authors"},{"content":"","date":"30 August 2021","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/hongchun-yuan/","section":"Authors","summary":"","title":"HONGCHUN YUAN","type":"authors"},{"content":"","date":"30 August 2021","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/hui-zhou/","section":"Authors","summary":"","title":"HUI ZHOU","type":"authors"},{"content":" Received July 17, 2021, accepted August 25, 2021, date of publication August 30, 2021, date of current version September 14, 2021.\nDigital Object Identifier 10.1109/ACCESS.2021.3109102\nTransAnomaly: Video Anomaly Detection Using Video Vision Transformer # HONGCHUN YUAN , ZHENYU CAI , HUI ZHOU , YUE WANG , AND XIANGZHI CHEN\nCollege of Information Technology, Shanghai Ocean University, Shanghai 201306, China\nCorresponding author: Hongchun Yuan (hcyuan@shou.edu.cn)\nThis work was supported in part by the National Natural Science Foundation of China under Grant 41776142.\nABSTRACT Video anomaly detection is challenging because abnormal events are unbounded, rare, equivocal, irregular in real scenes. In recent years, transformers have demonstrated powerful modelling abilities for sequence data. Thus, we attempt to apply transformers to video anomaly detection. In this paper, we propose a prediction-based video anomaly detection approach named TransAnomaly. 
Our model combines the U-Net and the Video Vision Transformer (ViViT) to capture richer temporal information and more global contexts. To make full use of the ViViT for the prediction, we modified the ViViT to make it capable of video prediction. Experiments on benchmark datasets show that the addition of the transformer module improves the anomaly detection performance. In addition, we calculate regularity scores with sliding windows and evaluate the impact of different window sizes and strides. With proper settings, our model outperforms other state-of-the-art prediction-based video anomaly detection approaches. Furthermore, our model can perform anomaly localization by tracking the location of patches with lower regularity scores.\nINDEX TERMS Anomaly detection, generative adversarial network, self attention.\nI. INTRODUCTION # Anomaly detection aims to identify events that do not conform to expected behaviours [1]. With the increasing use of video surveillance, video anomaly detection has become an important task. Because video anomalies are unbounded, rare, equivocal, and irregular in real applications [2], video anomaly detection is challenging and hard to tackle with classification methods. Thus, deep-learning-based semi-supervised anomaly detection methods have been proposed and have achieved significant improvements. Generally, these methods can be divided into two categories: i) reconstruction-based methods [3]–[8], which assume that normal events can be reconstructed correctly by models trained on normal data, whereas abnormal events yield a greater reconstruction error; and ii) prediction-based methods [9]–[13], which use the previous frames to predict the following ones. 
Similar to the reconstruction-based methods, it is assumed that normal events would be correctly predicted, while the abnormal ones would not.\n\nThe associate editor coordinating the review of this manuscript and approving it for publication was Zahid Akhtar.\nConvolutional Neural Networks (CNNs) have become essential for both reconstruction-based and prediction-based methods because of their exceptional representation abilities. U-Net [14], a variation of Fully Convolutional Networks (FCNs) [15], is a symmetric encoder-decoder network with skip-connections that retain more details, and it has been widely used in video anomaly detection. Moreover, with the emergence of Generative Adversarial Networks (GANs) [16], adversarial training has been applied to these methods, bringing better reconstruction and prediction results and improved performance. Notwithstanding the extraordinary power of CNNs, CNN-based methods are constrained by the inherent locality of convolutional operations, and they do not perform well in modelling long-range relations. To overcome this shortcoming, self-attention mechanisms, first used for sentence embedding [17], [18], were introduced into CNN-based models [19], [20], enhancing their ability to model complex structures.\n\nTransformer, as a sequence-to-sequence model, achieved significant improvements in the field of natural language processing (NLP) [21]–[23]. The transformer was first proposed by Vaswani et al. [21] for machine translation and English constituency parsing tasks, giving an alternative to prior natural language processing models based on Recurrent Neural Networks (RNNs) and CNNs. Furthermore, Devlin et al. [22] proposed BERT (Bidirectional Encoder Representations from Transformers), which obtained state-of-the-art performance on multiple NLP tasks by pre-training transformers on unlabeled text bi-directionally. Brown et al. 
[23] introduced a transformer-based model with 175 billion parameters named GPT-3 (Generative Pre-trained Transformer 3). This massive model trained with a large amount of training data is capable of different NLP tasks, and fine-tuning is not needed. Since the great success of transformers in the field of NLP has been witnessed, many works have recently applied transformers to the field of computer vision. For example, ViT (Vision Transformer) [24] takes 16 × 16 image patches as input to a transformer encoder to realize image classification. ViViT (Video Vision Transformer) [25], based on ViT, explored the application of ViT in video classification. DETR [26] and deformable DETR [27] are end-to-end object detection models that directly predict the final set of the detections. TransUNet [28] is a combination of U-Net and transformer, achieving superior medical image segmentation performances to previous methods. Transformers are also utilized in other computer vision tasks, such as segmentation [29], image generation [30] and video inpainting [31].\nIn this paper, inspired by TransUNet, we propose a video anomaly detection model based on U-Net and ViViT named TransAnomaly. In our model, CNN features extracted by the encoder part of U-Net are encoded by a modified ViViT. Thus, the encoded features have both spatial and temporal information. The decoder part of the U-Net then decodes the features, and abnormal frames can be identified by comparing the difference between predicted frames and ground truth frames. With the modified ViViT, our model is able to efficiently encode the input images in both spatial and temporal scales. Compared with previous prediction-based methods using stacked frames as inputs, our model captures global context and additional temporal information in the encoding stage, which helps generate better predictions and eventually improve anomaly detection performance. Experiments on multiple datasets show the superiority of our method.\nII. 
RELATED WORK # As mentioned above, deep-learning-based unsupervised anomaly detection methods can be generally categorized into reconstruction-based methods and prediction-based methods. These methods achieve good performance in the task of video anomaly detection.\n\nA. RECONSTRUCTION-BASED METHODS # Most reconstruction-based methods train models to reconstruct an input sequence of frames and then use the reconstruction errors for anomaly detection. For instance, Hasan et al. [3] trained a Fully Convolutional Auto-Encoder to reconstruct input sequences, and the regularity scores of the frames were computed based on reconstruction errors.\n\nFor richer temporal information, Chong and Tay [4] and Luo et al. [5] combined Convolutional Long Short Term Memory (ConvLSTM) with a Convolutional Auto-Encoder to reconstruct input sequences. This kind of enhanced motion representation learning contributed to higher accuracy in video anomaly detection. Besides improved models, motion constraints based on optical flow have been applied to the task in recent years to provide more temporal/motion information. For example, Nguyen and Meunier [6] designed a Convolutional Auto-Encoder with two branches to reconstruct input frames and corresponding optical flows. The reconstruction errors of pixel intensity and optical flow are both considered for anomaly detection. In addition, some of the reconstruction-based methods exploit the difference of latent representations between normal samples and abnormal samples to detect anomalies. Fan et al. [7] and Li and Chang [8] used Variational Auto-Encoders (VAEs) to reconstruct input frames, and the distribution difference of latent representations was used to compute regularity scores.\n\nB. PREDICTION-BASED METHODS # Unlike reconstruction-based methods, prediction-based methods train models to predict future frames based on previous input frames, and prediction errors are used for anomaly detection. 
In 2016, Medel and Savakis [9] proposed the Conditioned Composite Conv-LSTM Encoder-Decoder, which uses two decoders to reconstruct input frames and predict future frames separately, but only the reconstruction error is utilized to compute the regularity score. Similarly, Zhao et al. [10] designed a Spatio-Temporal Auto-Encoder with two decoder branches, reconstructing input frames and predicting future frames, respectively. The regularity score is computed with both reconstruction error and prediction error. Liu et al. [11] proposed a prediction model based on U-Net. Without reconstruction, the model computes the regularity score with only the prediction error. Furthermore, some works integrate reconstruction into prediction models. For instance, Ye et al. [12] proposed a Predictive Coding Network for anomaly detection, which first predicts future frames using a ConvLSTM with predictive coding. The prediction errors are then refined in a reconstruction manner. Finally, the predicted frames are updated with the refined errors for better prediction performance. In this way, the regularity score is still computed with prediction error, but reconstruction difference is also considered. Tang et al. [13] connected two U-Net blocks in series. The first block performs frame prediction, the second block reconstructs the predicted frames, and the reconstructed predictions are used to compute prediction errors.\n\nC. VISUAL TRANSFORMERS # Inspired by the transformer\u0026rsquo;s success in the field of NLP, many researchers attempted to use similar models to learn useful features for image tasks. Dosovitskiy et al. [24] proposed Vision Transformer (ViT), a pure transformer used for image classification. In ViT, an image is reshaped into a sequence of flattened 2D patches. After a linear projection layer, these flattened patches are fed into transformer encoders in the same way as tokens in the original transformer. 
ViT achieved excellent results when trained on large-sized datasets, proving that the transformer can extract image features effectively. For video classification, Arnab et al. [25] proposed Video Vision Transformer (ViViT). ViViT extracts spatio-temporal tokens from the input video. The transformer encoder is factorized into a spatial part and a temporal part for extracting spatial information and temporal information. TransUNet, proposed by Chen et al. [28], is a combination of U-Net and transformer encoder. As a hybrid CNN-Transformer architecture, TransUNet leverages detailed high-resolution spatial information from CNN features and the global context encoded by the transformer encoder. Such a design allows TransUNet to achieve superior performance in medical image segmentation.\nMost U-Net based video anomaly detection methods use stacked successive frames as input, and temporal information is extracted by applying motion constraints such as optical-flow loss. Limited by the structure, temporal information is insufficient for reconstruction and prediction. As a variation of transformers, ViViT outperforms other models in the video classification task. The performance of the ViViT shows that transformers are capable of encoding high-level features in videos, both spatially and temporally. On the other hand, the U-Net has been widely used in the task of video anomaly detection. Also, TransUNet has demonstrated the potential of the combination of the transformer and the U-Net. Thus, inspired by ViViT, we modified the transformer encoder in ViViT to make it suitable for video prediction. And the ViViT is combined with U-Net for detailed prediction results. In brief, our model encodes spatial information with the U-Net, and our modified transformer encoder encodes temporal information. Compared with the prediction-based baseline model without the transformer module, our model achieves better performance.\nIII. 
PROPOSED METHOD # Given a video clip with successive frames I_1, I_2, ..., I_t, our goal is to use these frames to predict the future frame I_{t+1}; the prediction of I_{t+1} is denoted as Î_{t+1}. After the prediction, the difference between I_{t+1} and Î_{t+1} can be used to compute the regularity score for anomaly detection. The framework of our model is shown in Fig. 1. In the following, we introduce all components of our model in detail. For comparison, t is set to 4, the same as in most prediction-based methods.\nA. FUTURE FRAME PREDICTION # The generator used to predict future frames is depicted in Fig. 2. The input of our generator is t successive frames from a video clip, and the output is a single frame, namely the frame that follows the input frames. All input frames are resized to 256 × 256 with 3 channels, and pixel values are normalized to [−1, 1]. The output is a predicted frame with a resolution of 256 × 256 and 3 channels, with pixel values in [−1, 1].\n1) ENCODER # As shown in Fig. 2, the input of an encoder is a 3-channel image with a resolution of 256 × 256, and the output is a feature map of 512 channels with a resolution of 32 × 32. Instead of stacking the consecutive frames together, the frames are encoded separately. Therefore, t identical encoders share the same parameters in the generator, and the t consecutive frames are encoded into corresponding feature maps. In this manner, the encoders focus only on extracting spatial information. The activation functions for all convolutions in the encoders are ReLUs.\n2) TRANSFORMER MODULE # The transformer module is a modification of the factorized-encoder ViViT. The temporal transformer receives the t feature maps output by the encoders and outputs a predicted feature map. The details of the temporal transformer are depicted in Fig. 3. First, the feature maps x_1, x_2, ..., x_t output by the encoders are embedded into groups of tokens. 
Specifically, each feature map x_i ∈ R^{C×H×W} is reshaped into a sequence of flattened 2D patches x_{p_i} ∈ R^{N_p×(P^2·C)}, where (H, W) is the resolution of the feature map x_i, C is the number of channels, (P, P) is the resolution of a patch, and N_p = HW/P^2 is the number of patches. The constant latent vector size is set to D in the temporal encoder, so the flattened patches are mapped to D dimensions with a trainable linear projection. The projected flattened patches are denoted as x′_{p_i,j}, where j = 1, 2, ..., N_p, and x′_{p_i,j} denotes the j-th projected token in the sequence x_{p_i}. Tokens with the same j are seen as a token group [x′_{p_1,j}; x′_{p_2,j}; ...; x′_{p_t,j}]. In other words, the tokens with the same spatial position but different temporal locations form one group. Therefore, there are N_p groups of tokens in total. Similar to the class token of ViT and ViViT, an additional token x_{pred_j} is added to each token group as the prediction token, whose state at the output of the temporal transformer encoder serves as the prediction representation of the spatial position j. Furthermore, a standard 1D learnable temporal position embedding is applied to preserve temporal location information. In our case, x_i ∈ R^{512×32×32}, P is set to 2, D is set to 512 and t is set to 4. After the embedding, there are N_p = (32 × 32)/2^2 = 256 groups of tokens. Each group contains t + 1 = 5 tokens, and each token has length 512. The patch embedding can be described as follows:\nz_j^{(0)} = [x_{pred_j}; x′_{p_1,j}; x′_{p_2,j}; ...; x′_{p_t,j}] + E_{pos_j}\nwhere x_{pred_j} ∈ R^D denotes the prediction token of the j-th token group, and E_{pos_j} ∈ R^{(t+1)×D} denotes the temporal position embedding of the j-th token group. z_j^{(0)} denotes the input of the first layer of the temporal transformer.\nThe temporal transformer encoder consists of L_t layers of Multi-head Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks. The N_p token groups are encoded by the\nFIGURE 1. The framework of our model. 
For training, we use our predictor based on U-Net and ViViT to predict Î_{t+1}. To generate high-quality predicted images, we adopt intensity loss and gradient loss as the appearance constraints and difference loss as the motion constraint. Computing the difference loss also requires Î_t, the prediction of I_t. Thus, during training, a training sample consists of t + 2 consecutive frames. In addition, adversarial training is leveraged to enhance the quality of generated frames. For testing, we use the trained generator to predict Î_{t+1}; then, with its ground truth I_{t+1}, the Peak Signal to Noise Ratio (PSNR) is calculated to compute the regularity score.\nFIGURE 2. An overview of our generator. There are three main parts in the generator: the encoder on the left, the decoder on the right and the transformer module on the bottom. The kernel sizes of the convolution and deconvolution filters are 3 × 3, and those of the max-pooling layers are 2 × 2. The strides of the convolutions are set to 1, and the strides of the max-poolings and deconvolutions are set to 2. Padding is used to preserve the sizes of feature maps.\ntemporal transformer encoder separately. The output of the l-th layer of the temporal transformer encoder can be described as follows:\nz′_j^{(l)} = MSA(LN(z_j^{(l−1)})) + z_j^{(l−1)}\nz_j^{(l)} = MLP(LN(z′_j^{(l)})) + z′_j^{(l)}, l = 1, ..., L_t\nwhere z_j^{(l)} denotes the output of the l-th layer of the temporal transformer encoder, z′_j^{(l)} denotes the intermediate output of the attention block in that layer, and LN(·) denotes layer normalization. After the encoding, z_j^{(L_t)} is the final output of the temporal transformer encoder, and z_{pred_j}^{(L_t)} is the prediction token of z_j^{(L_t)}, which is the predicted representation of the spatial position j. The output of the temporal transformer encoder consists of N_p groups of tokens.\nThe prediction tokens are then input into the spatial transformer, which encodes global information. As shown in Fig. 
4, the N_p prediction tokens are fed into the spatial transformer with L_s layers after the spatial position embedding, and then the prediction tokens are reshaped into a feature map x̂_{t+1} ∈ R^{D×(H/P)×(W/P)}, which is the final output of the transformer module.\n3) DECODER # The decoder receives the predicted feature map x̂_{t+1} and outputs a predicted frame Î_{t+1}. As shown in Fig. 2, the decoder consists of convolution layers and deconvolution layers. The activation functions for the convolutions are ReLUs, and the deconvolutions do not use activation functions. Similar to the original U-Net, the shortcuts between the encoders and the decoder suppress gradient vanishing, and more low-level features are leveraged in the upsampling process. Because there are multiple encoders, additional convolutions are used to reduce the dimensions of the concatenated feature maps.\nB. CONSTRAINTS # To keep the generated prediction close to its ground truth, we apply both appearance and motion constraints. Intensity loss and gradient loss are adopted as appearance constraints. The intensity loss is the difference of all pixel values in RGB space between the prediction and its ground truth, and the gradient loss sharpens the predicted frames. Following previous work [11], we define the intensity loss between a predicted frame Î_{t+1} and its ground truth I_{t+1} as follows:\nwhere N is the number of pixels in I, and the gradient loss is defined as follows:\nwhere i, j denote the spatial indexes of pixels. The gradient loss helps the model distinguish normal frames from abnormal frames because it drives the model to generate normal objects with sharp edges. The abnormal objects that have never appeared in the training data cannot be sharpened correctly when predicting. 
Therefore, the abnormal objects tend to have fuzzy edges, which leads to larger prediction errors.\nInstead of optical flow loss, we adopt image difference loss as the motion constraint, following [13]. The optical flow loss makes the network deeper, which makes the network difficult to train. Specifically, we have to use smaller learning rates to stabilize the training process, at the cost of a much longer training time, which is not practical in applications. The image difference loss is defined as follows:\nC. ADVERSARIAL TRAINING # Generative Adversarial Networks (GANs) are used to make generated results more realistic in image and video generation tasks. A GAN consists of a generator and a discriminator in most cases. The discriminator tries to distinguish a generated result from a realistic one. In the meantime, the generator tries to generate results that can confuse the discriminator. The generator has been described above, and we utilize the patch discriminator [32] as the discriminator.\n1) TRAINING THE DISCRIMINATOR # The discriminator D aims to distinguish the generated images from the realistic ones. Given a prediction Î and its ground truth I, the discriminator loss is defined as follows:\nwhere i, j denote the indexes of spatial patches in the output of the discriminator, and n denotes the number of patches.\n2) TRAINING THE GENERATOR # The generator G aims to generate more realistic images. The weights of the discriminator are fixed when training G. G can be trained by minimizing the adversarial loss defined as follows:\nD. OBJECTIVE FUNCTION # When training D, the objective function is defined as follows:\nWhen training G, the objective function is defined as follows:\nwhere λ_int, λ_gd, λ_dif and λ_adv are the weights of the loss functions.\nFor training the network, all the frames are resized to 256 × 256, and the pixel values are normalized to [−1, 1]. 
Adam-based [34] stochastic gradient descent is used for parameter optimization. The coefficients λ_int, λ_gd, λ_dif and λ_adv are set to 1.0, 1.0, 0.01 and 0.05, respectively, for all datasets. The mini-batch size is set to 4. The learning rates of the generator and the discriminator are set to 0.0001 and 0.00001, respectively, for grayscale datasets and to 0.0002 and 0.00002 for RGB datasets. The network is trained for 100000 iterations on all datasets.\nFIGURE 4. The spatial transformer encoder.\nE. ANOMALY DETECTION # In the testing phase, only the generator in our model is used to predict future frames. Given a generated frame Î and its ground truth I, the difference between them can be used for anomaly detection. Peak Signal to Noise Ratio (PSNR) has been widely used to assess image quality in video anomaly detection. PSNR is defined as follows:\nPSNR(I, Î) = 10 log_10( [max_Î]^2 / ( (1/N) Σ_{k=1}^{N} (I_k − Î_k)^2 ) )\nwhere max_Î is the maximum pixel value in Î, N is the number of pixels and k indexes the pixels. A lower PSNR between a predicted frame and its ground truth indicates that the frame is more likely to be abnormal. After calculating all the PSNRs in a testing video, the PSNRs are normalized to [0, 1], and the regularity score of the i-th frame in a testing video is calculated as follows:\nS(i) = (PSNR_i − min(PSNR)) / (max(PSNR) − min(PSNR))\nwhere PSNR_i is the PSNR of the i-th frame, max(PSNR) is the maximum PSNR value in the testing video, and min(PSNR) is the minimum PSNR value.\nAnother anomaly detection strategy is to compute the regularity score with sliding windows [6]. Given a predicted frame Î and its ground truth I, the mean square errors of corresponding patches are calculated, where a sliding window determines the patches. The p patches that have the largest mean square errors are denoted as MSE_{P_1}, MSE_{P_2}, ..., MSE_{P_p}, and the sliding-window PSNR of Î and I, denoted as PSNR_SW, is calculated as follows:\nthe regularity score of the i-th frame in a testing video is calculated as follows:\nIn this way, only the patches that are most likely to contain anomalies are considered, so the influence of background noise is suppressed. 
The choice of the size and the stride of the sliding windows will be discussed in the next section.\nIV. EXPERIMENTS # In this section, our proposed method is evaluated on the CUHK Avenue dataset and the UCSD Pedestrian dataset. We explore the impact of different settings on our method and then compare our method with other video anomaly detection methods.\nA. DATASETS # The CUHK Avenue dataset was captured on a CUHK campus avenue and consists of 16 training video clips and 21 testing video clips. The training videos only capture normal situations, and anomalies such as strange actions, wrong directions and abnormal objects are included in the testing videos. The UCSD Pedestrian dataset contains two subsets: Ped1 and Ped2. Ped1 consists of 34 training video clips and 36 testing video clips, and Ped2 consists of 16 training video clips and 12 testing ones. The training videos of both Ped1 and Ped2 are composed of normal scenes, and the testing videos include abnormal targets such as bikers, cars and skaters. Ped1 is more challenging than Ped2 because the sizes of the targets change with the camera\u0026rsquo;s position and angle.\nB. EVALUATION METRIC # To evaluate the performance of our method, we use the Area Under Curve (AUC) as the evaluation metric for anomaly detection performance. AUC is the area under the Receiver Operating Characteristic (ROC) curve, and the ROC curve is obtained from the regularity scores S. A higher AUC value suggests better anomaly detection performance. As described in Section III-E, the regularity scores can be calculated with different strategies.\nC. MODEL SETTINGS # 1) DEPTH OF THE TRANSFORMER ENCODERS # In our model, the transformer module comprises a temporal transformer encoder and a spatial transformer encoder. To clarify how the depth of the transformer encoders affects the anomaly detection performance, we first vary L_s from 0 to 6 with L_t fixed to 1. After training the model, we calculate regularity scores with frame-level PSNR. 
As shown in Table 1, the depth of the spatial transformer encoder clearly impacts the results. In our experiment, the results suggest that the optimal depth is 3 for our model. Compared with the model without the spatial transformer encoder (L_s = 0), a proper setting of the depth of the spatial transformer encoder improves the performance of anomaly detection.\nFurthermore, to evaluate the impact of the depth of the temporal transformer encoder, we fix L_s to 3 and vary L_t from 1 to 3. The results are shown in Table 2. For Ped2 and Avenue, a deeper temporal transformer encoder does not improve the performance of the model. Although a slight improvement is observed on Ped1 when L_t is set to 2, considering the computation cost, setting L_t to 1 is a better choice. The temporal transformer encoder makes predictions based on small patches, and the number of input frames t is set to 4. Therefore, a shallow temporal transformer is more suitable for our model. Overall, we set L_t to 1 for all datasets.\nTABLE 1. AUC of models with different L_s on the UCSD Ped1, UCSD Ped2 and Avenue.\nTABLE 2. AUC of models with different L_t on the UCSD Ped1, UCSD Ped2 and Avenue.\nFIGURE 5. Some abnormal frames from the datasets. The bounding boxes indicate the locations of anomalous objects.\n2) CHOICE OF LOSS FUNCTIONS # To choose appropriate constraints for the training, we conduct ablation experiments on the loss functions using the Ped2 dataset. As discussed above, L_s and L_t are set to 3 and 1, and frame-level PSNR is used to calculate regularity scores. We use the following combinations of loss functions to train the model: only the intensity loss, the intensity loss with the gradient loss, the intensity loss with the difference loss, and all three loss functions. The anomaly detection performance on Ped2 in AUC is summarized in Table 3. 
The results show that adding only the gradient loss or only the difference loss either slightly improves the performance or even makes it worse compared with the performance of the baseline (0.954). Nevertheless, the combination of all three loss functions yields a significant improvement. The results indicate that our model can make full use of the spatial transformer encoder and the temporal transformer encoder only when all the loss functions are used.\nFIGURE 6. AUCs at different training iterations on the datasets.\nTABLE 3. AUC of models with different loss functions on the UCSD Ped2.\n3) CHOICE OF THE WINDOW SIZE AND STRIDE # In different scenes, the sizes of foreground objects and the complexity of backgrounds vary. Meanwhile, different camera positions cause different degrees of perspective distortion. As shown in Fig. 5, the size of the anomalous objects varies across datasets. In general, anomalous objects in Avenue are larger than those in the other two datasets, and frames in Ped1 and Avenue have more obvious perspective distortion. Moreover, the background in the Avenue dataset is relatively more complicated. Therefore, it is more reasonable to calculate PSNR based on sliding windows.\nAs described in Section III-E, the value p decides how many patches are considered when calculating PSNR_SW. To evaluate the influence of the window size and stride, we set p to half of the total patch number. For example, on a 256 × 256 frame, a sliding window with a size of 64 and a stride of 32 gives ((256 − 64)/32 + 1)^2 = 49 patches, so p is set to 24. Table 4 shows the results of 10 different combinations of size and stride. On the Ped1 and Avenue datasets, a proper setting of window size and stride significantly improves the performance. On Ped2, which has a clear background and no noticeable perspective distortion, there is only a slight performance improvement.\n4) CHOICE OF THE TRAINING ITERATIONS # We trained our model for 100000 iterations on all datasets. The AUCs at different iterations are shown in Fig. 6. 
In our experiment, a longer training time does not mean better performance, which is evident on Ped1. Our model achieves the highest AUCs in the 52nd, the 67th and the 75th iteration on Ped1, Ped2 and Avenue, respectively. Therefore, we use the parameters from these iterations for our model.\nD. COMPARISON WITH STATE-OF-THE-ARTS # We compare our model with five state-of-the-art anomaly detection methods: 1) Future Frame Prediction (FFP) [11]; 2) Appearance-Motion Correspondence [6]; 3) AnoPCN [12]; 4) Integrating Prediction and Reconstruction [13]; 5) Dual Discriminator [33]. The AUC values are listed in Table 5. We first compare our model with the baseline (FFP). With the same PSNR calculating strategy (without sliding windows), our method shows superiority on all datasets, and the improvements are 0.009, 0.007 and 0.007 on Ped1, Ped2 and Avenue, respectively. The result shows that our transformer module is able to improve the performance due to its ability to encode richer temporal and global information. Moreover, by calculating PSNR with sliding windows, our model outperforms other methods on Ped1 and Avenue with AUCs of 0.867 and 0.870.\nTABLE 4. AUC of models with different sliding windows on the UCSD Ped1, UCSD Ped2 and Avenue.\nFIGURE 7. The score gaps of our model and the FFP (baseline) on the datasets.\nFIGURE 8. A visualized example of anomaly detection. Each bounding box represents a patch divided by the sliding window.\nTABLE 5. AUC of different methods on the UCSD Ped1, UCSD Ped2 and Avenue.\nThe score gap is the difference between the average score of normal frames and that of abnormal frames. A larger score gap indicates that the model can better distinguish normal and abnormal events. We compare the score gap of our model with the baseline, and the results are shown in Fig. 7. On datasets Ped1 and Ped2, the score gaps of our model are larger than those of the baseline. 
Although our model has a smaller score gap on Avenue, the average and the standard deviation of the regularity scores are 0.901 and 0.131 on normal frames, whereas those of the baseline are 0.788 and 0.161. This result suggests that our model\u0026rsquo;s regularity scores of normal frames are more consistent than those of the baseline, which leads to the higher AUC on the Avenue dataset.\nIn summary, our method achieves better performance and generalizes across multiple datasets.\nE. VISUALIZATION # Fig. 8 shows the predicted frames of a video clip with anomalous objects from the UCSD Ped2. Regularity scores are calculated with PSNR_SW. The window size and the stride are set to 64. In the figure, a thicker bounding box means a lower regularity score. The patches with non-pedestrian objects have lower regularity scores. The location of patches with lower regularity scores can be used as a reference for anomaly localization.\nF. COMPUTING TIME # Our model is trained on an NVIDIA Tesla V100 GPU. It takes about 16 hours to train our model for 100000 iterations on a dataset. Testing is performed on an NVIDIA RTX 3070 GPU, and the average testing speed is about 18 fps.\nV. CONCLUSION # In this paper, we proposed TransAnomaly for video anomaly detection. By combining the ViViT and the U-Net, our model predicts future frames with richer temporal information and global contexts. To fully leverage the power of ViViT, we modified the temporal transformer to make it suitable for image generation. Furthermore, in order to alleviate the influence of irrelevant factors during anomaly detection, we calculate PSNR based on sliding windows. Experiments conducted on three benchmark datasets demonstrate the validity of each component in our model, and the results show that our method outperforms other state-of-the-art prediction-based approaches.\nREFERENCES # [1] V. Chandola, A. Banerjee, and V. 
Kumar, \u0026lsquo;\u0026lsquo;Anomaly detection: A survey,\u0026rsquo;\u0026rsquo; ACM Comput. Surv., vol. 41, no. 3, pp. 1–58, Jul. 2009. [2] R. Nayak, U. C. Pati, and S. K. Das, \u0026lsquo;\u0026lsquo;A comprehensive review on deep learning-based methods for video anomaly detection,\u0026rsquo;\u0026rsquo; Image Vis. Comput., vol. 106, Feb. 2021, Art. no. 104078. [3] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, \u0026lsquo;\u0026lsquo;Learning temporal regularity in video sequences,\u0026rsquo;\u0026rsquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 733–742. [4] Y. S. Chong and Y. H. Tay, \u0026lsquo;\u0026lsquo;Abnormal event detection in videos using spatiotemporal autoencoder,\u0026rsquo;\u0026rsquo; in Proc. Int. Symp. Neural Netw. Cham, Switzerland: Springer, 2017, pp. 189–196. [5] W. Luo, W. Liu, and S. Gao, \u0026lsquo;\u0026lsquo;Remembering history with convolutional LSTM for anomaly detection,\u0026rsquo;\u0026rsquo; in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2017, pp. 439–444. [6] T. N. Nguyen and J. Meunier, \u0026lsquo;\u0026lsquo;Anomaly detection in video sequence with appearance-motion correspondence,\u0026rsquo;\u0026rsquo; in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1273–1283. [7] Y. Fan, G. Wen, D. Li, S. Qiu, M. D. Levine, and F. Xiao, \u0026lsquo;\u0026lsquo;Video anomaly detection and localization via Gaussian mixture fully convolutional variational autoencoder,\u0026rsquo;\u0026rsquo; Comput. Vis. Image Understand., vol. 195, Jun. 2020, Art. no. 102920. [8] N. Li and F. Chang, \u0026lsquo;\u0026lsquo;Video anomaly detection and localization via multivariate Gaussian fully convolution adversarial autoencoder,\u0026rsquo;\u0026rsquo; Neurocomputing, vol. 369, pp. 92–105, Dec. 2019. [9] J. R. Medel and A. 
Savakis, \u0026lsquo;\u0026lsquo;Anomaly detection in video using predictive convolutional long short-term memory networks,\u0026rsquo;\u0026rsquo; 2016, arXiv:1612.00390. [Online]. Available: http://arxiv.org/abs/1612.00390 [10] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X.-S. Hua, \u0026lsquo;\u0026lsquo;Spatio-temporal AutoEncoder for video anomaly detection,\u0026rsquo;\u0026rsquo; in Proc. 25th ACM Int. Conf. Multimedia, Oct. 2017, pp. 1933–1941. [11] W. Liu, W. Luo, D. Lian, and S. Gao, \u0026lsquo;\u0026lsquo;Future frame prediction for anomaly detection—A new baseline,\u0026rsquo;\u0026rsquo; in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6536–6545. [12] M. Ye, X. Peng, W. Gan, W. Wu, and Y. Qiao, \u0026lsquo;\u0026lsquo;AnoPCN: Video anomaly detection via deep predictive coding network,\u0026rsquo;\u0026rsquo; in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 1805–1813. [13] Y. Tang, L. Zhao, S. Zhang, C. Gong, G. Li, and J. Yang, \u0026lsquo;\u0026lsquo;Integrating prediction and reconstruction for anomaly detection,\u0026rsquo;\u0026rsquo; Pattern Recognit. Lett., vol. 129, pp. 123–130, Jan. 2020. [14] O. Ronneberger, P. Fischer, and T. Brox, \u0026lsquo;\u0026lsquo;U-Net: Convolutional networks for biomedical image segmentation,\u0026rsquo;\u0026rsquo; in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., 2015, pp. 234–241. [15] J. Long, E. Shelhamer, and T. Darrell, \u0026lsquo;\u0026lsquo;Fully convolutional networks for semantic segmentation,\u0026rsquo;\u0026rsquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440. [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, \u0026lsquo;\u0026lsquo;Generative adversarial nets,\u0026rsquo;\u0026rsquo; in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680. [17] J. Cheng, L. Dong, and M. 
Lapata, \u0026lsquo;\u0026lsquo;Long short-term memory-networks for machine reading,\u0026rsquo;\u0026rsquo; 2016, arXiv:1601.06733. [Online]. Available: http://arxiv.org/abs/1601.06733 [18] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit, \u0026lsquo;\u0026lsquo;A decomposable attention model for natural language inference,\u0026rsquo;\u0026rsquo; 2016, arXiv:1606.01933 . [Online]. Available: http://arxiv.org/abs/1606.01933 [19] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, \u0026lsquo;\u0026lsquo;Self-attention generative adversarial networks,\u0026rsquo;\u0026rsquo; 2018, arXiv:1805.08318. [Online]. Available: http://arxiv.org/abs/1805.08318 [20] X. Wang, R. B. Girshick, A. Gupta, and K. He, \u0026lsquo;\u0026lsquo;Non-local neural networks,\u0026rsquo;\u0026rsquo; in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) , Jun. 2018, pp. 7794–7803. [21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, \u0026lsquo;\u0026lsquo;Attention is all you need,\u0026rsquo;\u0026rsquo; in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008. [22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, \u0026lsquo;\u0026lsquo;BERT: Pre-training of deep bidirectional transformers for language understanding,\u0026rsquo;\u0026rsquo; 2018, arXiv:1810.04805. [Online]. Available: http://arxiv.org/abs/1810.04805 [23] T. B. Brown et al., \u0026lsquo;\u0026lsquo;Language models are few-shot learners,\u0026rsquo;\u0026rsquo; 2020, arXiv:2005.14165. [Online]. Available: http://arxiv.org/abs/2005.14165 [24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, \u0026lsquo;\u0026lsquo;An image is worth 16 × 16 words: Transformers for image recognition at scale,\u0026rsquo;\u0026rsquo; 2020, arXiv:2010.11929 . [Online]. Available: http://arxiv.org/abs/2010.11929 [25] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. 
Lučić, and C. Schmid, \u0026lsquo;\u0026lsquo;ViViT: A video vision transformer,\u0026rsquo;\u0026rsquo; 2021, arXiv:2103.15691. [Online]. Available: http://arxiv.org/abs/2103.15691 [26] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, \u0026lsquo;\u0026lsquo;End-to-end object detection with transformers,\u0026rsquo;\u0026rsquo; in Proc. Eur. Conf. Comput. Vis., 2020, pp. 213–229. [27] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, \u0026lsquo;\u0026lsquo;Deformable DETR: Deformable transformers for end-to-end object detection,\u0026rsquo;\u0026rsquo; 2020, arXiv:2010.04159. [Online]. Available: http://arxiv.org/abs/2010.04159 [28] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, \u0026lsquo;\u0026lsquo;TransUNet: Transformers make strong encoders for medical image segmentation,\u0026rsquo;\u0026rsquo; 2021, arXiv:2102.04306. [Online]. Available: http://arxiv.org/abs/2102.04306 [29] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. S. Torr, and L. Zhang, \u0026lsquo;\u0026lsquo;Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,\u0026rsquo;\u0026rsquo; in Proc. IEEE Conf. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6881–6890. [30] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran, \u0026lsquo;\u0026lsquo;Image transformer,\u0026rsquo;\u0026rsquo; in Proc. 35th Int. Conf. Mach. Learn., 2018, pp. 4055–4064. [31] Y. Zeng, J. Fu, and H. Chao, \u0026lsquo;\u0026lsquo;Learning joint spatial-temporal transformations for video inpainting,\u0026rsquo;\u0026rsquo; in Proc. Eur. Conf. Comput. Vis., 2020, pp. 528–543. [32] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, \u0026lsquo;\u0026lsquo;Image-to-image translation with conditional adversarial networks,\u0026rsquo;\u0026rsquo; 2016, arXiv:1611.07004 . [Online]. Available: http://arxiv.org/abs/1611.07004 [33] F. Dong, Y. Zhang, and X. 
Nie, \u0026lsquo;\u0026lsquo;Dual discriminator generative adversarial network for video anomaly detection,\u0026rsquo;\u0026rsquo; IEEE Access, vol. 8, pp. 88170–88176, 2020. [34] D. P. Kingma and J. Ba, \u0026lsquo;\u0026lsquo;Adam: A method for stochastic optimization,\u0026rsquo;\u0026rsquo; 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980\nHONGCHUN YUAN received the B.S. and M.S. degrees from Anhui Agricultural University, Anhui, China, and the Ph.D. degree in pattern recognition and intelligence system from the University of Science and Technology of China, Anhui. His research interests include the application of artificial intelligence, computer vision, and image processing. He is currently the Vice Chairman of the Smart Agriculture Special Committee of the Chinese Association of Automation and the Agriculture and Forestry Committee of the Association of Fundamental Computing Education in Chinese Universities.\nZHENYU CAI was born in 1996. He received the B.S. degree in information and computing science from Shanghai Ocean University, in 2018, where he is currently pursuing the M.S. degree with the College of Information Technology. His research interests include video anomaly detection and deep learning.\nHUI ZHOU was born in 1996. He received the B.S. degree in software engineering from Xuzhou University of Technology, Jiangsu, China, in 2019. He is currently pursuing the M.S. degree with the College of Information Technology, Shanghai Ocean University. His research interests include object detection and monocular depth estimation.\nYUE WANG received the B.S. degree in information and computing science from Tiangong University, Tianjin, China, in 2019. He is currently pursuing the M.S. degree with the College of Information Technology, Shanghai Ocean University. His research interests include underwater image enhancement and deep learning.\nXIANGZHI CHEN received the B.S. 
degree in computer science and technology from Chengdu University of Technology, Chengdu, China, in 2020. She is currently pursuing the M.S. degree with the College of Information Technology, Shanghai Ocean University. Her research interests include video anomaly detection and deep learning.\n","date":"30 August 2021","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/papers/transanomaly_video_anomaly_detection_using_video_vision_transformer/","section":"Papers","summary":"A prediction-based video anomaly detection approach combining U-Net and Video Vision Transformer (ViViT), with modifications for video prediction, capturing richer temporal and global context information, enabling anomaly localization.","title":"TransAnomaly: Video Anomaly Detection Using Video Vision Transformer","type":"method"},{"content":"","date":"30 August 2021","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/xiangzhi-chen/","section":"Authors","summary":"","title":"XIANGZHI CHEN","type":"authors"},{"content":"","date":"30 August 2021","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/yue-wang/","section":"Authors","summary":"","title":"YUE WANG","type":"authors"},{"content":"","date":"30 August 2021","externalUrl":null,"permalink":"/sis-arxiv-vad-papers/authors/zhenyu-cai/","section":"Authors","summary":"","title":"ZHENYU CAI","type":"authors"}]