Figure 1. Comparative results of the current state-of-the-art model and our method under various environments. Our method produces more precise and object-complete change masks, while the state-of-the-art baseline often yields fragmented predictions or misses semantic changes.
- Project Overview
- Installation
- Datasets
- Pretrained Weights
- Training
- Evaluation
- BibTeX
- Acknowledgements
LangSCD is a modular vision–language scene change detection framework that integrates language-derived semantic priors and geometric–semantic matching to produce robust, object-aware change masks under real-world appearance and viewpoint variations.
We recommend using a dedicated Python environment to avoid dependency conflicts.
```bash
conda create -n langscd python=3.9 -y
conda activate langscd
```

LangSCD is tested with PyTorch 1.13.1 and CUDA support. Please install the appropriate PyTorch build for your system following the official instructions:
```bash
# Example (CUDA 11.7)
pip install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu117
```

All required Python packages are listed in requirements.txt:

```bash
pip install -r requirements.txt
```

LangSCD supports multiple public scene change detection benchmarks. In this work, we primarily evaluate on NYC-CD, and also report results on VL-CMU-CD, PSCD, and ChangeVPR.
All datasets can be downloaded from Hugging Face.
NYC-CD is a large-scale real-world street-view scene change detection dataset consisting of 8,122 image pairs collected across Manhattan, New York City. The dataset captures diverse urban changes over long temporal gaps, including new/missing objects, vegetation changes, and viewpoint-induced changes. Non-target variations such as weather and lighting are excluded. Pedestrians and vehicles are anonymized.
The dataset is organized as follows:
```
NYC-CD/
├── train/
│   ├── t0/               # reference images
│   ├── t1/               # target images
│   └── mask/             # ground-truth change masks
├── test/
│   ├── t0/
│   ├── t1/
│   └── mask/
├── quality_control/
│   └── ...               # manual inspection and refinement of VLM-generated pseudo-masks
└── vlm_captions/
    └── ...               # VLM-generated change descriptions for image pairs
```
- train / test: Contain the training and evaluation splits, where each image pair is stored under `t0` (earlier time), `t1` (later time), and `mask` (pixel-level change annotations).
- quality_control: Includes intermediate results used for manual verification and cleaning of VLM-generated pseudo-masks during dataset construction.
- vlm_captions: Stores vision–language model (VLM) descriptions of changed objects and appearance variations. These captions cover image pairs from NYC-CD as well as other street-view and remote-sensing datasets used in our experiments.
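Given the directory layout above, a split can be enumerated into (t0, t1, mask) triplets. The sketch below uses only the standard library; the helper name and the assumption that corresponding images share a filename across `t0/`, `t1/`, and `mask/` are ours, not part of the released code:

```python
from pathlib import Path

def list_pairs(root, split="train"):
    """Enumerate (t0, t1, mask) path triplets under NYC-CD/<split>/.

    Assumes matching images in t0/, t1/, and mask/ share a filename,
    which is our reading of the layout, not a documented guarantee.
    """
    base = Path(root) / split
    pairs = []
    for t0 in sorted((base / "t0").iterdir()):
        t1 = base / "t1" / t0.name
        mask = base / "mask" / t0.name
        # Skip images whose counterpart or annotation is missing.
        if t1.exists() and mask.exists():
            pairs.append((t0, t1, mask))
    return pairs
```

A `torch.utils.data.Dataset` wrapper would then only need to load the three files per index.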
In addition to NYC-CD, LangSCD supports evaluation on the following benchmarks, which can be accessed through the same Hugging Face repository:
- VL-CMU-CD: A real-world street-view dataset focusing on structural changes under varying environmental conditions.
- PSCD: A disaster-oriented scene change detection dataset capturing large-scale structural changes.
- ChangeVPR: A viewpoint-robust change detection benchmark derived from long-term visual place recognition data.
All checkpoints are hosted on Hugging Face.
The checkpoints cover different backbones (C-3PO, RSCD), training datasets, and whether the LangSCD modules are enabled.
- `base`: baseline model without LangSCD modules
- `langscd`: model augmented with LangSCD language and matching modules
- `cmu`: trained on VL-CMU-CD
- `pscd`: trained on PSCD
- `our`: trained on binary NYC-CD
- `multi`: trained on multi-class NYC-CD
| Backbone | Training Dataset | Baseline Checkpoint | LangSCD Checkpoint |
|---|---|---|---|
| C-3PO | VL-CMU-CD | c3po_base_cmu.pth | c3po_langscd_cmu.pth |
| C-3PO | NYC-CD | c3po_base_our.pth | c3po_langscd_our.pth |
| C-3PO | PSCD | c3po_base_pscd.pth | c3po_langscd_pscd.pth |
| RSCD | VL-CMU-CD | rscd_base_cmu.pth | rscd_langscd_cmu.pth |
| RSCD | NYC-CD | rscd_base_our.pth | rscd_langscd_our.pth |
| RSCD | PSCD | rscd_base_pscd.pth | rscd_langscd_pscd.pth |
| RSCD | NYC-CD-multi | rscd_base_multi_our.pth | rscd_langscd_multi_our.pth |
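The naming scheme above can be decoded mechanically. A small illustrative helper (the function and lookup tables are ours, not part of the repository):

```python
# Maps from filename tokens to human-readable labels (our assumption,
# derived from the checkpoint naming legend above).
BACKBONES = {"c3po": "C-3PO", "rscd": "RSCD"}
VARIANTS = {"base": "baseline", "langscd": "with LangSCD modules"}
DATASETS = {
    "cmu": "VL-CMU-CD",
    "pscd": "PSCD",
    "our": "NYC-CD (binary)",
    "multi_our": "NYC-CD (multi-class)",
}

def parse_checkpoint(name):
    """Split a checkpoint filename like 'rscd_langscd_multi_our.pth'
    into (backbone, variant, training dataset)."""
    stem = name.rsplit(".", 1)[0]
    # maxsplit=2 keeps 'multi_our' together as the dataset token.
    backbone, variant, dataset = stem.split("_", 2)
    return BACKBONES[backbone], VARIANTS[variant], DATASETS[dataset]
```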
LangSCD supports training with two backbones: RSCD + LangSCD and C-3PO + LangSCD.
Below we describe the required setup and commands for each setting.
- Set the dataset root path in: src/rscd_lang/src/datasets/data_factory.py
- Specify the target dataset name (e.g., `NYC-CD`, `VL-CMU-CD`, `PSCD`) in: src/rscd_lang/src/scripts/configs/train.yml
Set the correct folder paths in the following lines of: src/rscd_lang/src/models/CD_model.py
- Line 13
- Line 14
- Line 192
- Line 193
These paths are used to locate Grounding-DINO backbones and checkpoints.
Single-GPU training

```bash
python src/scripts/train.py src/scripts/configs/train.yml
```

Multi-GPU training

```bash
torchrun --nproc_per_node=4 src/scripts/train_para.py src/scripts/configs/train.yml
```

Multi-class training on NYC-CD

```bash
python src/scripts/train_multi.py src/scripts/configs/train_multi.yml
```

All training logs and checkpoints will be saved to: src/rscd_lang/src/output
Set the dataset root path in: src/c3po_lang/src/dataset/path_config.py
You can directly modify the return value under:

```python
if name == 'CMU_binary':
```

to point to your local dataset directory.
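For orientation, the relevant part of path_config.py plausibly looks like the following. This is a hypothetical sketch (the function name `get_dataset_root` and the example path are stand-ins; check the actual file):

```python
def get_dataset_root(name):
    # Hypothetical sketch of src/c3po_lang/src/dataset/path_config.py;
    # real function and variable names may differ.
    if name == 'CMU_binary':
        # Edit this return value to point at your local VL-CMU-CD root.
        return '/data/VL-CMU-CD'
    raise ValueError(f"unknown dataset: {name}")
```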
Single-GPU training

```bash
python src/train.py \
    --train-dataset VL_CMU_CD \
    --test-dataset VL_CMU_CD \
    --input-size 640 \
    --model vgg16bn_mtf_msf_deeplabv3 \
    --mtf id \
    --msf 4 \
    --batch-size 4 \
    --warmup \
    --loss bi \
    --loss-weight \
    --lr-scheduler cosine
```

Multi-GPU training

```bash
torchrun --nproc_per_node=4 src/train_para.py \
    --train-dataset VL_CMU_CD \
    --test-dataset VL_CMU_CD \
    --input-size 640 \
    --model vgg16bn_mtf_msf_deeplabv3 \
    --mtf id \
    --msf 4 \
    --batch-size 4 \
    --warmup \
    --loss bi \
    --loss-weight \
    --lr-scheduler cosine
```

All training outputs will be saved to: src/c3po_lang/output
Evaluation in LangSCD consists of two stages:
(1) generating intermediate masks for geometric and semantic matching, and
(2) running the final evaluation to compute quantitative metrics.
LangSCD relies on SAM2 and Grounding-SAM2 to provide object-level geometric and semantic cues.
For each dataset, edit the dataset path in the corresponding script:
batch_track_<DatasetName>.py
After setting the correct path, run the script to generate SAM2 masks for all image pairs.
Grounding-SAM2 can be generated using either:
- the free Grounding-DINO model, or
- the Grounding-DINO-1.6 Pro model (paid).
To use Grounding-DINO-1.6 Pro, you must obtain an API token from the official Grounding-DINO website and configure it in the script accordingly.
Then edit the paths in:
batch_<DatasetName>_<Category>.py
and run the script to generate Grounded SAM2 masks for all image pairs.
Before LangSCD refinement, an initial change prediction mask must be produced by the base model.
For RSCD:

```bash
python src/scripts/visualize.py train_output_folder/best.val.pth \
    --option DatasetName \
    --output visual_output_folder
```

For C-3PO:

```bash
python3 src/train.py \
    --test-only \
    --model vgg16bn_mtf_msf_deeplabv3 \
    --mtf id \
    --msf 4 \
    --train-dataset VL_CMU_CD \
    --test-dataset VL_CMU_CD \
    --input-size 512 \
    --resume train_output_folder/best.pth
```

Edit the imported framework in:
src/gescf_match/src/test.py
Choose the framework according to the model and dataset:
- Evaluate GeSCF: `framework_orig`
- Evaluate GeSCF + LangSCD on ChangeVPR: `framework_gescf_cv_lang`
- Evaluate RSCD or C-3PO + LangSCD on NYC-CD / VL-CMU-CD: `framework_lang`
- Evaluate GeSCF + LangSCD on NYC-CD / VL-CMU-CD: `framework_gescf_lang`
- Evaluate RSCD or C-3PO + LangSCD on PSCD: `framework_pscd_lang`
- Evaluate GeSCF + LangSCD on PSCD: `framework_gescf_pscd_lang`
- Evaluate RSCD or C-3PO + LangSCD on ChangeVPR: `framework_cv_lang`
After selecting the framework, update the dataset and checkpoint paths in the corresponding framework module.
Binary evaluation:

```bash
CUDA_VISIBLE_DEVICES=0 python test.py \
    --test-dataset DatasetName
```

Multi-class evaluation:

```bash
CUDA_VISIBLE_DEVICES=0 python test_multi.py \
    --test-dataset VL-CMU-CD \
    --dataset-path NYCCD_dataset_path
```

The evaluation reports the following metrics:
- Precision
- Recall
- Accuracy
- F1-score
- IoU
Metrics are computed for either binary change detection or multi-class change detection, depending on the dataset and evaluation mode.
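For reference, the five reported metrics on binary masks follow the standard confusion-matrix definitions. A self-contained sketch (this is not the repository's evaluation code; it operates on flat 0/1 lists for clarity):

```python
def binary_change_metrics(pred, gt):
    """Compute Precision, Recall, Accuracy, F1, and IoU for flat 0/1 masks."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))  # true positives
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))  # false positives
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))  # false negatives
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gt))  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pred) if pred else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"Precision": precision, "Recall": recall,
            "Accuracy": accuracy, "F1": f1, "IoU": iou}
```

For multi-class change detection the same quantities are typically computed per class and then averaged.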
You can also evaluate and visualize the prediction for a single image pair by editing the imported framework in src/gescf_match/src/test_single.py, and then running:
```bash
CUDA_VISIBLE_DEVICES=0 python test_single.py \
    --test-dataset VL_CMU_CD \
    --img-t0-path PATH_TO_T0_IMAGE \
    --img-t1-path PATH_TO_T1_IMAGE \
    --gt-path PATH_TO_GT_MASK
```

This project builds upon and benefits from several excellent open-source projects and models. We sincerely thank the authors and contributors of the following works for making their code and models publicly available:
- SAM2 for class-agnostic object segmentation and tracking
- Grounded SAM2 for text-guided segmentation and semantic grounding
- Grounding-DINO for open-vocabulary object detection
- GPT-4o and InternVL for vision–language understanding and change caption generation
- RSCD (Robust Scene Change Detection) as a strong transformer-based baseline
- C-3PO for CNN-based scene change detection
- GeSCF for generalizable and zero-shot scene change detection
Their contributions have been instrumental in enabling the development, evaluation, and reproducibility of LangSCD.