A Benchmark for Psychological Techniques in Real-World Scams
This repository contains the code and a partial dataset accompanying our paper submitted to EMNLP 2025:
PsyScam: A Benchmark for Psychological Techniques in Real-World Scams
Online scams exploit various psychological techniques (PTs) to manipulate victims. PsyScam provides a comprehensive benchmark to support the analysis and modeling of these techniques across three key NLP tasks:
- 🏷️ PT Classification: Multi-label classification of psychological techniques in scam content
- ✍️ Scam Completion: Generating realistic scam continuations given partial content
- 🔄 Scam Augmentation: Creating variations of existing scam content while preserving psychological techniques
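Because PT Classification is multi-label, each scam text can carry several techniques at once. A minimal sketch of how such targets are commonly represented as a multi-hot vector is shown below. The function name and the three label names are hypothetical and chosen only for illustration; the actual label set is the one listed in data/PTs.csv, and the encoding used by PTClassification.py may differ.

```python
from typing import Dict, List, Sequence


def encode_multi_hot(labels: Sequence[str], label_space: Sequence[str]) -> List[int]:
    """Return a 0/1 vector with one position per label in label_space."""
    index: Dict[str, int] = {name: i for i, name in enumerate(label_space)}
    vector = [0] * len(label_space)
    for label in labels:
        if label not in index:
            raise ValueError(f"unknown label: {label!r}")
        vector[index[label]] = 1
    return vector


# Hypothetical label names, for illustration only (see data/PTs.csv for the real ones).
LABEL_SPACE = ["authority", "scarcity", "reciprocity"]

if __name__ == "__main__":
    # A text annotated with two techniques maps to two active positions.
    print(encode_multi_hot(["scarcity", "authority"], LABEL_SPACE))  # [1, 1, 0]
```

The vector order follows the label space, not the order in which the annotations were given, so the same text always maps to the same target.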
PsyScam/
├── crawlers/ # Web scrapers for collecting scam reports from public sources
├── data/
│ ├── D2.csv # Evaluation subset used in our experiments (sample dataset)
│ └── PTs.csv # Comprehensive list of psychological technique labels
├── LLMExtractor.py # Human-LLM collaborative annotation using GPT-4
├── PTClassification.py # Multi-label psychological technique classification
├── ScamCompletion.py # Scam completion generation task implementation
├── ScamAugmentation.py # Scam augmentation generation task implementation
└── README.md # Project documentation
pip install -r requirements.txt
Create an api.key file in the root directory containing your OpenAI API key.
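For reference, a key file like this can be read with a few lines of Python. This is only a sketch under the assumption that api.key holds the key as plain text on a single line; the helper name `read_api_key` is hypothetical, and the scripts in this repository may load the key differently.

```python
from pathlib import Path


def read_api_key(path: str = "api.key") -> str:
    """Read an API key from a plain-text file, ignoring surrounding whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key
```

Stripping whitespace avoids authentication failures caused by a trailing newline that many editors add when saving the file.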
- PT Classification:
python PTClassification.py --csv data/D2.csv
Use the trained model for inference:
python inferencePT.py --model_path "./bert/results_multilabel/checkpoint-XXX" --text "Dear valued customer, you have been specially selected for this exclusive investment opportunity. Our expert team guarantees 500% returns within 30 days. This offer expires in 24 hours!"
- Scam Completion:
python ScamCompletion.py --llm_model gpt41
- Scam Augmentation:
python ScamAugmentation.py --llm_model gpt41
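For the PT Classification inference step, a multi-label model typically produces one logit per technique, and each logit is thresholded independently. The sketch below illustrates that common decoding scheme; it is an assumption about how predictions could be read out, not a description of what inferencePT.py does. The function names, the 0.5 threshold, and the label names are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_labels(
    logits: Sequence[float],
    label_space: Sequence[str],
    threshold: float = 0.5,
) -> List[str]:
    """Return every label whose independent probability reaches the threshold."""
    if len(logits) != len(label_space):
        raise ValueError("logits and label_space must have the same length")
    return [
        name
        for name, logit in zip(label_space, logits)
        if sigmoid(logit) >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical label names and logits, for illustration only.
    labels = ["authority", "scarcity", "reciprocity"]
    print(decode_labels([2.0, -1.0, 0.3], labels))  # ['authority', 'reciprocity']
```

Unlike a softmax over classes, each technique is scored on its own, so a text can receive zero, one, or several labels.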
Our benchmark includes carefully curated scam reports annotated with psychological techniques. Due to safety and ethical considerations, the complete dataset is available upon request for research purposes only.
The only dataset file included in this repo is the evaluation subset, data/D2.csv.
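The column layout of D2.csv is not described here, so a quick way to get oriented is to print the header and the first few rows before running any task. The helper below is a small sketch using only the standard library; the name `peek_csv` is hypothetical, and it makes no assumption about which columns the file contains.

```python
import csv
from itertools import islice
from typing import Dict, List, Tuple


def peek_csv(path: str, n_rows: int = 3) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return the column names and the first n_rows records of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, n_rows))
        columns = list(reader.fieldnames or [])
    return columns, rows


if __name__ == "__main__":
    columns, rows = peek_csv("data/D2.csv")
    print(columns)
    for row in rows:
        print(row)
```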