LangSCD: Scene Change Detection with Vision-Language Representation Learning

Figure 1. Comparative results of the current state-of-the-art model and our method across diverse environments. Our method produces more precise and object-complete change masks, whereas the state-of-the-art baseline often yields fragmented predictions or misses semantic changes.


Project Overview

LangSCD is a modular vision–language scene change detection framework that integrates language-derived semantic priors and geometric–semantic matching to produce robust, object-aware change masks under real-world appearance and viewpoint variations.

Installation

We recommend using a dedicated Python environment to avoid dependency conflicts.

1. Create a virtual environment

conda create -n langscd python=3.9 -y
conda activate langscd

2. Install PyTorch

LangSCD is tested with PyTorch 1.13.1 and CUDA support. Please install the appropriate PyTorch build for your system following the official instructions:

# Example (CUDA 11.7)
pip install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu117

3. Install dependencies

All required Python packages are listed in requirements.txt:

pip install -r requirements.txt

Datasets

LangSCD supports multiple public scene change detection benchmarks. In this work, we primarily evaluate on NYC-CD, and also report results on VL-CMU-CD, PSCD, and ChangeVPR.

All datasets can be downloaded from Hugging Face.

NYC-CD

NYC-CD is a large-scale real-world street-view scene change detection dataset consisting of 8,122 image pairs collected across Manhattan, New York City. The dataset captures diverse urban changes over long temporal gaps, including new/missing objects, vegetation changes, and viewpoint-induced changes. Non-target variations such as weather and lighting are excluded. Pedestrians and vehicles are anonymized.

The dataset is organized as follows:

NYC-CD/
├── train/
│   ├── t0/        # reference images
│   ├── t1/        # target images
│   └── mask/      # ground-truth change masks
├── test/
│   ├── t0/
│   ├── t1/
│   └── mask/
├── quality_control/
│   └── ...        # manual inspection and refinement of VLM-generated pseudo-masks
└── vlm_captions/
    └── ...        # VLM-generated change descriptions for image pairs
  • train / test: Contain the training and evaluation splits, where each image pair is stored under t0 (earlier time), t1 (later time), and mask (pixel-level change annotations).
  • quality_control: Includes intermediate results used for manual verification and cleaning of VLM-generated pseudo-masks during dataset construction.
  • vlm_captions: Stores vision–language model (VLM) descriptions of changed objects and appearance variations. These captions cover image pairs from NYC-CD as well as other street-view and remote-sensing datasets used in our experiments.
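Given the directory layout above, locating the three files of one sample reduces to joining the split and sub-folder names. The following is a minimal sketch, assuming matching filenames across t0, t1, and mask; pair_paths is a hypothetical helper, not part of the released code:

```python
from pathlib import Path

def pair_paths(root: str, split: str, name: str):
    """Build the (t0, t1, mask) paths for one sample in the NYC-CD layout.

    Assumes each image pair shares a filename across the t0, t1, and mask
    sub-folders (hypothetical helper for illustration).
    """
    base = Path(root) / split
    return base / "t0" / name, base / "t1" / name, base / "mask" / name

t0, t1, mask = pair_paths("NYC-CD", "train", "000001.png")
```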

Other Datasets

In addition to NYC-CD, LangSCD supports evaluation on the following benchmarks, which can be accessed through the same Hugging Face repository:

  • VL-CMU-CD: A real-world street-view dataset focusing on structural changes under varying environmental conditions.
  • PSCD: A disaster-oriented scene change detection dataset capturing large-scale structural changes.
  • ChangeVPR: A viewpoint-robust change detection benchmark derived from long-term visual place recognition data.

Pretrained Weights

All checkpoints are hosted on Hugging Face.

The checkpoints cover different backbones (C-3PO, RSCD), training datasets, and whether the LangSCD modules are enabled.

Checkpoint Naming Convention

  • base: baseline model without LangSCD modules
  • langscd: model augmented with LangSCD language and matching modules
  • cmu: trained on VL-CMU-CD
  • pscd: trained on PSCD
  • our: trained on binary NYC-CD
  • multi: trained on multi-class NYC-CD

Available Pretrained Models

Backbone   Training Dataset   Baseline Checkpoint        LangSCD Checkpoint
C-3PO      VL-CMU-CD          c3po_base_cmu.pth          c3po_langscd_cmu.pth
C-3PO      NYC-CD             c3po_base_our.pth          c3po_langscd_our.pth
C-3PO      PSCD               c3po_base_pscd.pth         c3po_langscd_pscd.pth
RSCD       VL-CMU-CD          rscd_base_cmu.pth          rscd_langscd_cmu.pth
RSCD       NYC-CD             rscd_base_our.pth          rscd_langscd_our.pth
RSCD       PSCD               rscd_base_pscd.pth         rscd_langscd_pscd.pth
RSCD       NYC-CD-multi       rscd_base_multi_our.pth    rscd_langscd_multi_our.pth
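The naming convention composes each filename from the tokens listed above. A small sketch that reproduces it (checkpoint_name is a hypothetical helper, shown only to make the convention explicit):

```python
def checkpoint_name(backbone: str, variant: str, dataset: str, multi: bool = False) -> str:
    """Compose a checkpoint filename from the naming convention.

    backbone: "c3po" or "rscd"; variant: "base" or "langscd";
    dataset: "cmu" (VL-CMU-CD), "pscd" (PSCD), or "our" (NYC-CD).
    Hypothetical helper for illustration only.
    """
    parts = [backbone, variant] + (["multi"] if multi else []) + [dataset]
    return "_".join(parts) + ".pth"

checkpoint_name("rscd", "langscd", "our", multi=True)  # "rscd_langscd_multi_our.pth"
```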

Training

LangSCD supports training with two backbones, RSCD and C-3PO, each augmented with the LangSCD modules (RSCD + LangSCD and C-3PO + LangSCD).
Below we describe the required setup and commands for each setting.


Training RSCD + LangSCD

1. Configure dataset paths

  • Set the dataset root path in: src/rscd_lang/src/datasets/data_factory.py

  • Specify the target dataset name (e.g., NYC-CD, VL-CMU-CD, PSCD) in: src/rscd_lang/src/scripts/configs/train.yml

2. Update model paths

Set the correct folder paths in the following lines of: src/rscd_lang/src/models/CD_model.py

  • Line 13
  • Line 14
  • Line 192
  • Line 193

These paths are used to locate groundingdino backbones and checkpoints.

3. Run training

Single-GPU training

python src/scripts/train.py src/scripts/configs/train.yml

Multi-GPU training

torchrun --nproc_per_node=4 src/scripts/train_para.py src/scripts/configs/train.yml

Multi-class training on NYC-CD

python src/scripts/train_multi.py src/scripts/configs/train_multi.yml

4. Outputs

All training logs and checkpoints will be saved to: src/rscd_lang/src/output


Training C-3PO + LangSCD

1. Configure dataset paths

Set the dataset root path in: src/c3po_lang/src/dataset/path_config.py

You can directly modify the return value under:

if name == 'CMU_binary':

to point to your local dataset directory.
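The edit amounts to returning your local dataset root for the matching dataset name. A minimal sketch of the pattern in path_config.py, with an illustrative path (the actual branches and keys in the file may differ):

```python
# Sketch of the path_config.py pattern: map a dataset key to a local root.
# '/data/VL-CMU-CD' is a placeholder; point it at your local copy.
def get_dataset_path(name: str) -> str:
    if name == 'CMU_binary':
        return '/data/VL-CMU-CD'
    raise ValueError(f'Unknown dataset: {name}')
```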

2. Run training

Single-GPU training

python src/train.py \
  --train-dataset VL_CMU_CD \
  --test-dataset VL_CMU_CD \
  --input-size 640 \
  --model vgg16bn_mtf_msf_deeplabv3 \
  --mtf id \
  --msf 4 \
  --batch-size 4 \
  --warmup \
  --loss bi \
  --loss-weight \
  --lr-scheduler cosine

Multi-GPU training

torchrun --nproc_per_node=4 src/train_para.py \
  --train-dataset VL_CMU_CD \
  --test-dataset VL_CMU_CD \
  --input-size 640 \
  --model vgg16bn_mtf_msf_deeplabv3 \
  --mtf id \
  --msf 4 \
  --batch-size 4 \
  --warmup \
  --loss bi \
  --loss-weight \
  --lr-scheduler cosine

3. Outputs

All training outputs will be saved to: src/c3po_lang/output

Evaluation

Evaluation in LangSCD consists of two stages:
(1) generating intermediate masks for geometric and semantic matching, and
(2) running the final evaluation to compute quantitative metrics.


1. Generate Matching Masks

LangSCD relies on SAM2 and Grounding-SAM2 to provide object-level geometric and semantic cues.

SAM2 Masks

For each dataset, edit the dataset path in the corresponding script:

batch_track_<DatasetName>.py

After setting the correct path, run the script to generate SAM2 masks for all image pairs.

Grounding-SAM2 Masks

Grounding-SAM2 masks can be generated using either:

  • the free Grounding-DINO model, or
  • the Grounding-DINO-1.6 Pro model (paid).

To use Grounding-DINO-1.6 Pro, you must obtain an API token from the official Grounding-DINO website and configure it in the script accordingly.
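Rather than hard-coding the token in the script, one option is to read it from an environment variable. A sketch under that assumption; the variable name GROUNDING_DINO_API_TOKEN and the helper are illustrative, not part of the released code:

```python
import os

def load_api_token(var: str = "GROUNDING_DINO_API_TOKEN") -> str:
    """Fetch the Grounding-DINO-1.6 Pro API token from the environment.

    Illustrative variable name; fail early with a clear message if unset.
    """
    token = os.environ.get(var, "")
    if not token:
        raise RuntimeError(f"Set {var} before running the Pro model")
    return token
```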

Then edit the paths in:

batch_<DatasetName>_<Category>.py

and run the script to generate Grounded-SAM2 masks for all image pairs.


2. Generate Initial Prediction Masks

Before LangSCD refinement, an initial change prediction mask must be produced by the base model.

RSCD

python src/scripts/visualize.py train_output_folder/best.val.pth \
  --option DatasetName \
  --output visual_output_folder

C-3PO

python3 src/train.py \
  --test-only \
  --model vgg16bn_mtf_msf_deeplabv3 \
  --mtf id \
  --msf 4 \
  --train-dataset VL_CMU_CD \
  --test-dataset VL_CMU_CD \
  --input-size 512 \
  --resume train_output_folder/best.pth

3. Select Evaluation Framework

Edit the imported framework in:

src/gescf_match/src/test.py

Choose the framework according to the model and dataset:

  • Evaluate GeSCF: framework_orig
  • Evaluate GeSCF + LangSCD on ChangeVPR: framework_gescf_cv_lang
  • Evaluate RSCD or C-3PO + LangSCD on NYC-CD / VL-CMU-CD: framework_lang
  • Evaluate GeSCF + LangSCD on NYC-CD / VL-CMU-CD: framework_gescf_lang
  • Evaluate RSCD or C-3PO + LangSCD on PSCD: framework_pscd_lang
  • Evaluate GeSCF + LangSCD on PSCD: framework_gescf_pscd_lang
  • Evaluate RSCD or C-3PO + LangSCD on ChangeVPR: framework_cv_lang

After selecting the framework, update the dataset and checkpoint paths in the corresponding framework module.
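The choices above can be summarized as a lookup from (model setting, dataset) to framework module. The dict below is only an illustration of that mapping; the actual selection is made by editing the import in test.py, and "langscd" here stands for "RSCD or C-3PO + LangSCD":

```python
# Illustrative mapping of (model setting, dataset) -> framework module name.
# Module names are taken from the list above; None means dataset-independent.
FRAMEWORKS = {
    ("gescf", None):                   "framework_orig",
    ("langscd", "NYC-CD"):             "framework_lang",
    ("langscd", "VL-CMU-CD"):          "framework_lang",
    ("langscd", "PSCD"):               "framework_pscd_lang",
    ("langscd", "ChangeVPR"):          "framework_cv_lang",
    ("gescf+langscd", "NYC-CD"):       "framework_gescf_lang",
    ("gescf+langscd", "VL-CMU-CD"):    "framework_gescf_lang",
    ("gescf+langscd", "PSCD"):         "framework_gescf_pscd_lang",
    ("gescf+langscd", "ChangeVPR"):    "framework_gescf_cv_lang",
}
```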


4. Run Evaluation

Binary Change Evaluation

CUDA_VISIBLE_DEVICES=0 python test.py \
  --test-dataset DatasetName

Multi-class Change Evaluation (NYC-CD)

CUDA_VISIBLE_DEVICES=0 python test_multi.py \
  --test-dataset VL-CMU-CD \
  --dataset-path NYCCD_dataset_path

5. Metrics

The evaluation reports the following metrics:

  • Precision
  • Recall
  • Accuracy
  • F1-score
  • IoU

Metrics are computed for either binary change detection or multi-class change detection, depending on the dataset and evaluation mode.
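For the binary case, all five metrics follow from the pixel-level confusion counts. A generic sketch of the standard definitions (not the repository's evaluation code), operating on flat 0/1 label sequences:

```python
def binary_change_metrics(pred, gt):
    """Compute Precision, Recall, Accuracy, F1-score, and IoU for binary masks.

    pred, gt: flat iterables of 0/1 pixel labels, with 1 = changed.
    Standard definitions; a sketch, not the repository's evaluation code.
    """
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))  # true positives
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))  # false positives
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))  # false negatives
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gt))  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"Precision": precision, "Recall": recall,
            "Accuracy": accuracy, "F1": f1, "IoU": iou}
```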

6. Test a single image pair

You can also evaluate and visualize the prediction for a single image pair by editing the imported framework in src/gescf_match/src/test_single.py, and then running:

CUDA_VISIBLE_DEVICES=0 python test_single.py \
  --test-dataset VL_CMU_CD \
  --img-t0-path PATH_TO_T0_IMAGE \
  --img-t1-path PATH_TO_T1_IMAGE \
  --gt-path PATH_TO_GT_MASK

BibTeX

Acknowledgements

This project builds upon and benefits from several excellent open-source projects and models. We sincerely thank the authors and contributors of the following works for making their code and models publicly available:

  • SAM2 for class-agnostic object segmentation and tracking
  • Grounded SAM2 for text-guided segmentation and semantic grounding
  • Grounding-DINO for open-vocabulary object detection
  • GPT-4o and InternVL for vision–language understanding and change caption generation
  • RSCD (Robust Scene Change Detection) as a strong transformer-based baseline
  • C-3PO for CNN-based scene change detection
  • GeSCF for generalizable and zero-shot scene change detection

Their contributions have been instrumental in enabling the development, evaluation, and reproducibility of LangSCD.
