Figure 1. Comparative results of the current state-of-the-art model and our method under various environments. Our method produces more precise and object-complete change masks, while the state-of-the-art baseline often yields fragmented predictions or misses semantic changes.
- Project Overview
- Installation
- Datasets
- Pretrained Weights
- Training
- Evaluation
- BibTeX
- Acknowledgements
LangSCD is a modular vision–language scene change detection framework that integrates language-derived semantic priors and geometric–semantic matching to produce robust, object-aware change masks under real-world appearance and viewpoint variations.
We recommend using a dedicated Python environment to avoid dependency conflicts.
```bash
conda create -n langscd python=3.9 -y
conda activate langscd
```

LangSCD is tested with PyTorch 1.13.1 and CUDA support. Please install the appropriate PyTorch build for your system following the official instructions:
```bash
# Example (CUDA 11.7)
pip install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu117
```

All required Python packages are listed in requirements.txt:

```bash
pip install -r requirements.txt
```

LangSCD supports multiple public scene change detection benchmarks. In this work, we primarily evaluate on NYC-CD, and also report results on VL-CMU-CD, PSCD, and ChangeVPR.
All datasets can be downloaded from Hugging Face.
NYC-CD is a large-scale real-world street-view scene change detection dataset consisting of 8,122 image pairs collected across Manhattan, New York City. The dataset captures diverse urban changes over long temporal gaps, including new/missing objects, vegetation changes, and viewpoint-induced changes. Non-target variations such as weather and lighting are excluded. Pedestrians and vehicles are anonymized.
The dataset is organized as follows:
```
NYC-CD/
├── train/
│   ├── t0/               # reference images
│   ├── t1/               # target images
│   └── mask/             # ground-truth change masks
├── test/
│   ├── t0/
│   ├── t1/
│   └── mask/
├── quality_control/
│   └── ...               # manual inspection and refinement of VLM-generated pseudo-masks
└── vlm_captions/
    └── ...               # VLM-generated change descriptions for image pairs
```
- train / test: Contain the training and evaluation splits, where each image pair is stored under `t0` (earlier time), `t1` (later time), and `mask` (pixel-level change annotations).
- quality_control: Includes intermediate results used for manual verification and cleaning of VLM-generated pseudo-masks during dataset construction.
- vlm_captions: Stores vision–language model (VLM) descriptions of changed objects and appearance variations. These captions cover image pairs from NYC-CD as well as other street-view and remote-sensing datasets used in our experiments.
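Given the directory layout above, a split can be enumerated into (t0, t1, mask) triplets. The sketch below uses only the standard library; the helper name and the assumption that corresponding images share a filename across `t0/`, `t1/`, and `mask/` are ours, not part of the released code:

```python
from pathlib import Path

def list_pairs(root, split="train"):
    """Enumerate (t0, t1, mask) path triplets under NYC-CD/<split>/.

    Assumes matching images in t0/, t1/, and mask/ share a filename,
    which is our reading of the layout, not a documented guarantee.
    """
    base = Path(root) / split
    pairs = []
    for t0 in sorted((base / "t0").iterdir()):
        t1 = base / "t1" / t0.name
        mask = base / "mask" / t0.name
        # Skip images whose counterpart or annotation is missing.
        if t1.exists() and mask.exists():
            pairs.append((t0, t1, mask))
    return pairs
```

A `torch.utils.data.Dataset` wrapper would then only need to load the three files per index.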
In addition to NYC-CD, LangSCD supports evaluation on the following benchmarks, which can be accessed through the same Hugging Face repository:
- VL-CMU-CD: A real-world street-view dataset focusing on structural changes under varying environmental conditions.
- PSCD: A disaster-oriented scene change detection dataset capturing large-scale structural changes.
- ChangeVPR: A viewpoint-robust change detection benchmark derived from long-term visual place recognition data.
All checkpoints are hosted on Hugging Face.
The checkpoints cover different backbones (C-3PO, RSCD), training datasets, and whether the LangSCD modules are enabled.
- `base`: baseline model without LangSCD modules
- `langscd`: model augmented with LangSCD language and matching modules
- `cmu`: trained on VL-CMU-CD
- `pscd`: trained on PSCD
- `our`: trained on binary NYC-CD
- `multi`: trained on multi-class NYC-CD
| Backbone | Training Dataset | Baseline Checkpoint | LangSCD Checkpoint |
|---|---|---|---|
| C-3PO | VL-CMU-CD | c3po_base_cmu.pth | c3po_langscd_cmu.pth |
| C-3PO | NYC-CD | c3po_base_our.pth | c3po_langscd_our.pth |
| C-3PO | PSCD | c3po_base_pscd.pth | c3po_langscd_pscd.pth |
| RSCD | VL-CMU-CD | rscd_base_cmu.pth | rscd_langscd_cmu.pth |
| RSCD | NYC-CD | rscd_base_our.pth | rscd_langscd_our.pth |
| RSCD | PSCD | rscd_base_pscd.pth | rscd_langscd_pscd.pth |
| RSCD | NYC-CD-multi | rscd_base_multi_our.pth | rscd_langscd_multi_our.pth |
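The naming scheme above can be decoded mechanically. A small illustrative helper (the function and lookup tables are ours, not part of the repository):

```python
# Maps from filename tokens to human-readable labels (our assumption,
# derived from the checkpoint naming legend above).
BACKBONES = {"c3po": "C-3PO", "rscd": "RSCD"}
VARIANTS = {"base": "baseline", "langscd": "with LangSCD modules"}
DATASETS = {
    "cmu": "VL-CMU-CD",
    "pscd": "PSCD",
    "our": "NYC-CD (binary)",
    "multi_our": "NYC-CD (multi-class)",
}

def parse_checkpoint(name):
    """Split a checkpoint filename like 'rscd_langscd_multi_our.pth'
    into (backbone, variant, training dataset)."""
    stem = name.rsplit(".", 1)[0]
    # maxsplit=2 keeps 'multi_our' together as the dataset token.
    backbone, variant, dataset = stem.split("_", 2)
    return BACKBONES[backbone], VARIANTS[variant], DATASETS[dataset]
```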
LangSCD supports training with two backbones: RSCD + LangSCD and C-3PO + LangSCD.
Below we describe the required setup and commands for each setting.
- Set the dataset root path in: src/rscd_lang/src/datasets/data_factory.py
- Specify the target dataset name (e.g., `NYC-CD`, `VL-CMU-CD`, `PSCD`) in: src/rscd_lang/src/scripts/configs/train.yml
Set the correct folder paths in the following lines of: src/rscd_lang/src/models/CD_model.py
- Line 13
- Line 14
- Line 192
- Line 193
These paths are used to locate Grounding-DINO backbones and checkpoints.
Single-GPU training

```bash
python src/scripts/train.py src/scripts/configs/train.yml
```

Multi-GPU training

```bash
torchrun --nproc_per_node=4 src/scripts/train_para.py src/scripts/configs/train.yml
```

Multi-class training on NYC-CD

```bash
python src/scripts/train_multi.py src/scripts/configs/train_multi.yml
```

All training logs and checkpoints will be saved to: src/rscd_lang/src/output
Set the dataset root path in: src/c3po_lang/src/dataset/path_config.py
You can directly modify the return value under:

```python
if name == 'CMU_binary':
```

to point to your local dataset directory.
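For orientation, the relevant part of path_config.py plausibly looks like the following. This is a hypothetical sketch (the function name `get_dataset_root` and the example path are stand-ins; check the actual file):

```python
def get_dataset_root(name):
    # Hypothetical sketch of src/c3po_lang/src/dataset/path_config.py;
    # real function and variable names may differ.
    if name == 'CMU_binary':
        # Edit this return value to point at your local VL-CMU-CD root.
        return '/data/VL-CMU-CD'
    raise ValueError(f"unknown dataset: {name}")
```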
Single-GPU training

```bash
python src/train.py \
    --train-dataset VL_CMU_CD \
    --test-dataset VL_CMU_CD \
    --input-size 640 \
    --model vgg16bn_mtf_msf_deeplabv3 \
    --mtf id \
    --msf 4 \
    --batch-size 4 \
    --warmup \
    --loss bi \
    --loss-weight \
    --lr-scheduler cosine
```

Multi-GPU training

```bash
torchrun --nproc_per_node=4 src/train_para.py \
    --train-dataset VL_CMU_CD \
    --test-dataset VL_CMU_CD \
    --input-size 640 \
    --model vgg16bn_mtf_msf_deeplabv3 \
    --mtf id \
    --msf 4 \
    --batch-size 4 \
    --warmup \
    --loss bi \
    --loss-weight \
    --lr-scheduler cosine
```

All training outputs will be saved to: src/c3po_lang/output
Evaluation in LangSCD consists of two stages:
(1) generating intermediate masks for geometric and semantic matching, and
(2) running the final evaluation to compute quantitative metrics.
LangSCD relies on SAM2 and Grounding-SAM2 to provide object-level geometric and semantic cues.
For each dataset, edit the dataset path in the corresponding script:
batch_track_<DatasetName>.py
After setting the correct path, run the script to generate SAM2 masks for all image pairs.
Grounding-SAM2 can be generated using either:
- the free Grounding-DINO model, or
- the Grounding-DINO-1.6 Pro model (paid).
To use Grounding-DINO-1.6 Pro, you must obtain an API token from the official Grounding-DINO website and configure it in the script accordingly.
Then edit the paths in:
batch_<DatasetName>_<Category>.py
and run the script to generate Grounded SAM2 masks for all image pairs.
Before LangSCD refinement, an initial change prediction mask must be produced by the base model.
For RSCD:

```bash
python src/scripts/visualize.py train_output_folder/best.val.pth \
    --option DatasetName \
    --output visual_output_folder
```

For C-3PO:

```bash
python3 src/train.py \
    --test-only \
    --model vgg16bn_mtf_msf_deeplabv3 \
    --mtf id \
    --msf 4 \
    --train-dataset VL_CMU_CD \
    --test-dataset VL_CMU_CD \
    --input-size 512 \
    --resume train_output_folder/best.pth
```

Edit the imported framework in:
src/gescf_match/src/test.py
Choose the framework according to the model and dataset:
- Evaluate GeSCF: `framework_orig`
- Evaluate GeSCF + LangSCD on ChangeVPR: `framework_gescf_cv_lang`
- Evaluate RSCD or C-3PO + LangSCD on NYC-CD / VL-CMU-CD: `framework_lang`
- Evaluate GeSCF + LangSCD on NYC-CD / VL-CMU-CD: `framework_gescf_lang`
- Evaluate RSCD or C-3PO + LangSCD on PSCD: `framework_pscd_lang`
- Evaluate GeSCF + LangSCD on PSCD: `framework_gescf_pscd_lang`
- Evaluate RSCD or C-3PO + LangSCD on ChangeVPR: `framework_cv_lang`
After selecting the framework, update the dataset and checkpoint paths in the corresponding framework module.
Binary evaluation:

```bash
CUDA_VISIBLE_DEVICES=0 python test.py \
    --test-dataset DatasetName
```

Multi-class evaluation:

```bash
CUDA_VISIBLE_DEVICES=0 python test_multi.py \
    --test-dataset VL-CMU-CD \
    --dataset-path NYCCD_dataset_path
```

The evaluation reports the following metrics:
- Precision
- Recall
- Accuracy
- F1-score
- IoU
Metrics are computed for either binary change detection or multi-class change detection, depending on the dataset and evaluation mode.
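For reference, the five reported metrics on binary masks follow the standard confusion-matrix definitions. A self-contained sketch (this is not the repository's evaluation code; it operates on flat 0/1 lists for clarity):

```python
def binary_change_metrics(pred, gt):
    """Compute Precision, Recall, Accuracy, F1, and IoU for flat 0/1 masks."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))  # true positives
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))  # false positives
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))  # false negatives
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gt))  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pred) if pred else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"Precision": precision, "Recall": recall,
            "Accuracy": accuracy, "F1": f1, "IoU": iou}
```

For multi-class change detection the same quantities are typically computed per class and then averaged.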
You can also evaluate and visualize the prediction for a single image pair by editing the imported framework in src/gescf_match/src/test_single.py, and then running:
```bash
CUDA_VISIBLE_DEVICES=0 python test_single.py \
    --test-dataset VL_CMU_CD \
    --img-t0-path PATH_TO_T0_IMAGE \
    --img-t1-path PATH_TO_T1_IMAGE \
    --gt-path PATH_TO_GT_MASK
```

This project builds upon and benefits from several excellent open-source projects and models. We sincerely thank the authors and contributors of the following works for making their code and models publicly available:
- SAM2 for class-agnostic object segmentation and tracking
- Grounded SAM2 for text-guided segmentation and semantic grounding
- Grounding-DINO for open-vocabulary object detection
- GPT-4o and InternVL for vision–language understanding and change caption generation
- RSCD (Robust Scene Change Detection) as a strong transformer-based baseline
- C-3PO for CNN-based scene change detection
- GeSCF for generalizable and zero-shot scene change detection
Their contributions have been instrumental in enabling the development, evaluation, and reproducibility of LangSCD.