SIS ArXiv VAD Papers


Hosts ArXiv papers on the topic of “video anomaly detection”.

This repository contains the source code for the “SIS ArXiv VAD Papers” website, a Hugo static site using the Blowfish theme.

The project manages, processes, and displays ArXiv research papers. It combines a Hugo static site with a backend of containerized services for AI-driven PDF processing, metadata extraction, and ArXiv interaction.

Features

ArXiv AI Agent: Includes an mcp-arxiv-mcp-server, which allows AI assistants to search, download, and read papers directly from the ArXiv repository.
Automated PDF-to-Markdown: Uses the GPU-accelerated docling-serve to convert complex PDFs into clean Markdown.
AI Metadata Extraction: A Python script orchestrates a pipeline that calls an n8n workflow to extract structured JSON metadata (title, authors, date, etc.) from converted text.
YAML Front Matter: Automatically writes the extracted JSON back into the Markdown files as clean YAML front matter, making them ready to publish.
Hugo Static Site: A clean, modern, and fast website built with Hugo and the Blowfish theme.
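
As a rough illustration of the YAML Front Matter feature, the sketch below renders extracted metadata as a front matter block using only the Python standard library. The function names `to_front_matter` and `add_front_matter` and the sample fields are hypothetical; the actual script in scripts/ may do this differently, for example with a YAML library. It relies on the fact that JSON scalars and lists are also valid YAML flow syntax, and it assumes simple keys without special characters.

```python
import json


def to_front_matter(metadata: dict) -> str:
    """Render a metadata dict as a YAML front matter block.

    Each value is emitted with json.dumps: JSON strings, numbers,
    booleans, and lists are valid YAML flow syntax as well.
    """
    lines = ["---"]
    for key, value in metadata.items():
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


def add_front_matter(markdown: str, metadata: dict) -> str:
    """Prepend front matter unless the text already starts with one."""
    if markdown.startswith("---\n"):
        return markdown  # treated as already having front matter
    return to_front_matter(metadata) + "\n" + markdown
```

Calling `add_front_matter(text, {"title": "Example", "authors": ["A. Author"]})` returns the same Markdown with a `---` delimited block on top, which is the shape Hugo expects for a page bundle's index.md.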

Architecture & Services

The project’s backend is defined in the docker/compose.yml file and includes several key services:

n8n: The workflow automation service. It is used here as an API endpoint (via Webhook) to run the AI metadata extraction pipeline. It also connects the mcp-arxiv-mcp-server to an LLM for searching and downloading the latest papers.
docling-serve: A powerful, GPU-enabled service that handles the core PDF-to-Markdown conversion. It is pre-loaded with models via the docling-serve-initial service.
mcp-gateway & mcp-arxiv-mcp-server: Services that together provide an AI-readable interface to the ArXiv repository, allowing for programmatic searching, downloading, and reading of papers.
Python Pipeline (scripts/): This is the “glue” that connects everything. It is a host-run script that:
1. Finds new PDFs in an input directory.
2. Calls docling-serve to convert the PDF to Markdown.
3. Renames the output to index.md in a new content/papers/ bundle.
4. Calls the n8n webhook with the path to the new index.md.
5. Receives the extracted JSON metadata back from n8n.
6. Writes this JSON as YAML front matter into the index.md file.
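
The six steps above can be sketched roughly as follows. This is not the repository's main.py: the names `process_pdf` and `run`, the choice to name each bundle after the PDF's file stem, and the skip-if-exists check are all assumptions made for illustration. The calls to docling-serve and the n8n webhook are passed in as plain callables (`convert`, `extract`) so the sketch does not guess at either service's API.

```python
import json
from pathlib import Path
from typing import Callable


def process_pdf(
    pdf_path: Path,
    content_root: Path,
    convert: Callable[[Path], str],
    extract: Callable[[Path], dict],
) -> Path:
    """Run steps 2-6 for one PDF and return the path of the new index.md."""
    bundle = content_root / pdf_path.stem  # assumed: one bundle per PDF stem
    bundle.mkdir(parents=True, exist_ok=True)
    index_md = bundle / "index.md"

    markdown = convert(pdf_path)                      # step 2: PDF -> Markdown
    index_md.write_text(markdown, encoding="utf-8")   # step 3: save as index.md

    metadata = extract(index_md)                      # steps 4-5: webhook round trip

    # Step 6: JSON values are valid YAML flow syntax, so json.dumps is enough here.
    front = "\n".join(
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in metadata.items()
    )
    index_md.write_text(f"---\n{front}\n---\n\n{markdown}", encoding="utf-8")
    return index_md


def run(input_dir: Path, content_root: Path, convert, extract) -> list[Path]:
    """Step 1: find PDFs without a bundle yet, then process each one."""
    processed = []
    for pdf in sorted(input_dir.glob("*.pdf")):
        if (content_root / pdf.stem / "index.md").exists():
            continue  # already processed in an earlier run
        processed.append(process_pdf(pdf, content_root, convert, extract))
    return processed
```

Keeping the two service calls behind callables also makes the flow easy to dry-run with stubs before the Docker services are up.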

File Structure

.
├── archetypes/         # Hugo new content templates
├── assets/             # Site assets (images, etc.)
├── config/             # Hugo configuration
├── content/            # The Markdown content for the site
│   └── papers/         # <-- Processed, AI-enhanced articles land here
├── docker/             # Docker service definitions
│   ├── compose.yml     # The main Docker Compose file for all services
│   └── catalog.yaml    # Describes the ArXiv MCP service
├── scripts/            # The Python automation pipeline
│   ├── config.py       # Holds paths and API configs
│   ├── main.py         # Main script to run the pipeline
│   ├── .env            # (Not shown) Stores secret keys
│   ├── pyproject.toml  # Python project definition
│   └── uv.lock         # Python dependencies
├── static/             # Static files (favicons, etc.)
├── themes/             # Hugo themes
│   └── blowfish/
└── hugo.toml           # Main Hugo configuration file

Setup & Installation

Clone the Repository:

git clone https://github.com/phuchoang2603/sis-arxiv-vad-papers.git
cd sis-arxiv-vad-papers

Configure Docker Environment: Create a .env file in the project’s root directory (next to docker/). This will provide environment variables to your compose.yml.

# ./.env

# -- Docker Services --
# MUST be an absolute path to your shared data folder
SHARED_FOLDER=/path/to/your/shared/data

# MUST be an absolute path for persistent Docker data
APPDATA=/path/to/your/appdata/sis-arxiv

# -- n8n --
SUBDOMAIN=n8n
DOMAIN_NAME=your-domain.com
GENERIC_TIMEZONE=America/New_York
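
Because SHARED_FOLDER and APPDATA must be absolute paths, a small pre-flight check can catch mistakes before starting the containers. The sketch below is not part of the repository; `parse_env` and `check_paths` are hypothetical helpers, and the parser only handles simple KEY=VALUE lines like the ones shown above.

```python
import os


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def check_paths(env: dict, keys=("SHARED_FOLDER", "APPDATA")) -> list:
    """Return a list of problems; an empty list means the paths look fine."""
    problems = []
    for key in keys:
        value = env.get(key, "")
        if not value:
            problems.append(f"{key} is missing")
        elif not os.path.isabs(value):
            problems.append(f"{key} must be an absolute path, got {value!r}")
    return problems
```

To use it, read the root .env file as text, pass it through `parse_env`, and print whatever `check_paths` returns.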

Configure n8n Workflow:
- Start your n8n instance and create your metadata extraction workflow.
- Start Node: Use a Webhook node.
- Authentication: Set to Header Auth and create a secure, random API key.
- Response Mode: Set to Respond at End of Workflow. This is critical for getting the JSON response back.
- Workflow: Add a Read Binary File from Disk node (using the path from the webhook), an Extract from File node, and your Information Extractor node.
- Activate: Click the “Active” toggle in the top-right.
- Copy: Copy the Production URL.

Configure Python Pipeline: Create a separate .env file inside the scripts/ directory for the Python script.

# scripts/.env
N8N_WEBHOOK_URL="https://n8n.your-domain.com/webhook/..." # <-- Your n8n PRODUCTION URL
N8N_API_KEY="your-secret-n8n-header-auth-key"
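
For orientation, here is a minimal sketch of how a script might call the webhook with these two values, using only the standard library. Several details are assumptions rather than facts about this project: the header name `X-API-Key` (use whatever name you gave the Header Auth credential in n8n), the `{"path": ...}` request body, and the helper names `build_webhook_request` and `fetch_metadata`. How the real script loads scripts/.env is not shown here either.

```python
import json
import urllib.request


def build_webhook_request(
    url: str,
    api_key: str,
    markdown_path: str,
    header_name: str = "X-API-Key",  # assumed; must match your n8n credential
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the n8n webhook."""
    body = json.dumps({"path": markdown_path}).encode("utf-8")  # assumed payload
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", header_name: api_key},
    )


def fetch_metadata(request: urllib.request.Request, timeout: float = 300.0) -> dict:
    """Send the request and decode the JSON the workflow responds with."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the variables exported into the environment, a call would look like `fetch_metadata(build_webhook_request(os.environ["N8N_WEBHOOK_URL"], os.environ["N8N_API_KEY"], path))`. The generous timeout reflects that the webhook only answers once the whole workflow has finished.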

Run Docker Services: Run this command from the project’s root directory:
```
docker-compose -f docker/compose.yml up --build -d
```
This will build and start n8n, docling-serve, and the other services.
Install Python Dependencies: Navigate to the scripts directory and use uv to install:
```
cd scripts
uv sync
```

How to Use the Pipeline

Add PDFs: Place your .pdf files into the input directory defined in scripts/config.py. (By default, this points to ../../arxiv_existing/test, which resolves to an arxiv_existing/test directory next to your project folder.)
Run Pipeline:
```
cd scripts
uv run python main.py
```
Check Output: Watch the terminal as the script processes each file. Your new content bundles, complete with index.md and YAML front matter, will appear in content/papers/.
Preview Site:
```
cd ..  # Return to the Hugo root
hugo server
```
Your site will be available at http://localhost:1313.
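
If the pipeline finds no files, it can help to check where the default input path from the "Add PDFs" step actually points. The sketch below assumes the relative path is resolved against the scripts/ directory, which is what makes it land next to the project folder; `resolve_input_dir` and `list_pdfs` are hypothetical helpers, not functions from scripts/config.py.

```python
import os
from pathlib import Path


def resolve_input_dir(scripts_dir: Path, relative: str = "../../arxiv_existing/test") -> Path:
    """Resolve the configured input path against the scripts/ directory."""
    return Path(os.path.normpath(scripts_dir / relative))


def list_pdfs(input_dir: Path) -> list[Path]:
    """Return the PDF files in input_dir, sorted by name."""
    return sorted(p for p in input_dir.iterdir() if p.suffix.lower() == ".pdf")
```

For a checkout at /home/me/sis-arxiv-vad-papers, `resolve_input_dir(Path("/home/me/sis-arxiv-vad-papers/scripts"))` gives /home/me/arxiv_existing/test.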

License

This project is licensed under the MIT License.
