specdatri_reporting is an automated reporting tool that collects and aggregates download and usage statistics for RECETOX research software packages. It monitors the adoption and impact of the organisation's scientific software across multiple distribution platforms.
- Fetches download and traffic data from PyPI, CRAN, Bioconda, and GitHub on a weekly schedule.
- Aggregates raw data into human-readable TSV reports grouped by month (for package downloads) or by week (for GitHub traffic).
- Commits the updated reports back to the repository automatically via GitHub Actions.
| Platform | Metric | Aggregation |
|---|---|---|
| PyPI | Package downloads | Monthly |
| CRAN | Package downloads | Monthly |
| Bioconda | Package downloads | Monthly |
| GitHub | Repository views | Weekly |
| GitHub | Repository clones | Weekly |
Generated reports are stored as TSV files under reports/<YEAR>/ and are versioned in the repository.
A scheduled GitHub Actions workflow (.github/workflows/actions.yml) runs every Monday at 00:00 UTC and performs the following steps:
- Checkout the repository.
- Install dependencies (
pip install -r requirements.txt). - Collect statistics by calling
./specdatri collect-stats, which queries each configured data source and saves raw JSON responses totmp/runs/<YYYY-MM-DD>/. - Generate reports by calling
./specdatri generate-reports, which reads all collected JSON files and produces aggregated TSV files inreports/<YEAR>/. - Commit and push the updated report files back to the
mainbranch.
The workflow can also be triggered manually from the GitHub Actions UI via workflow_dispatch.
This section explains how to run the reporting tool manually.
- Python 3.12+
- A GitHub personal access token with
reposcope (github_token). Read access is sufficient for collecting traffic statistics manually; write access is required by the automated workflow to push report commits. - A PePy API key for PyPI statistics (
pepy_x_api_key).
# Clone the repository
git clone https://github.com/RECETOX/specdatri_reporting.git
cd specdatri_reporting
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Make the entry point executable (Linux/macOS)
chmod +x specdatri
# Provide API tokens
cp example.env .env
# Edit .env and fill in your tokens:
# github_token="your_github_token"
# pepy_x_api_key="your_pepy_api_key"Note: The tool loads
.envautomatically viapython-dotenv. Alternatively, export the variables in your shell before running any command.
The specdatri entry point exposes three subcommands.
# Add a package tracked on PyPI and GitHub
./specdatri add-repo --project mypackage --pypi --github
# Add a package with a custom repository path and all sources
./specdatri add-repo --project mypackage --repository OWNER/mypackage --pypi --bioconda --cran --github
# Add an R package tracked on CRAN only
./specdatri add-repo --project MyRPackage --cranOptions:
| Option | Description | Default |
|---|---|---|
--project TEXT |
Project name (required) | — |
--repository TEXT |
GitHub repo path OWNER/REPO |
RECETOX/<PROJECT> |
--repository-list PATH |
Path to the tracking list | ./repository_list.tsv |
--pypi |
Track PyPI downloads | off |
--bioconda |
Track Bioconda downloads | off |
--cran |
Track CRAN downloads | off |
--github |
Track GitHub views and clones | off |
# Collect stats for all configured packages
./specdatri collect-stats
# Use custom paths
./specdatri collect-stats --repository-list custom_list.tsv --tmp-dir data/tmpOutput: JSON files in tmp/runs/<YYYY-MM-DD>/ named using the pattern
{timestamp}___{project}___{package}___{source}___{action}.json.
# Generate reports for the current year
./specdatri generate-reports
# Generate reports for a specific year
./specdatri generate-reports --year 2025
# Use custom paths
./specdatri generate-reports --year 2025 --tmp-dir data/tmp --output-dir data/reportsOutput: five TSV files per year in reports/<YEAR>/:
pypi_downloads.tsvbioconda_downloads.tsvcran_downloads.tsvgithub_views.tsvgithub_clones.tsv
# 1. Add packages (one-time setup)
./specdatri add-repo --project matchms --pypi --bioconda --github
./specdatri add-repo --project spec2vec --pypi --bioconda --github
# 2. Collect statistics
./specdatri collect-stats
# 3. Generate reports
./specdatri generate-reports --year 2025# Install development dependencies (includes pre-commit hooks)
pip install -r ./requirements/local.txt
# Install pre-commit hooks
pre-commit install# Run all tests
python -m unittest discover -s tests
# Run with coverage
coverage run -m unittest discover -s tests
coverage report -m
coverage html # generates htmlcov/index.htmlYou need act to simulate GitHub Actions on your machine.
# Simulate the weekly schedule trigger
act --secret-file .env scheduleImportant notes when testing with
act:
- Use a token without push permissions so that test data is never written to the production branch.
- The push step will fail if branch protection rules are active — this is expected behaviour.
- Files created inside the Docker container are not written to your local filesystem.
- Never push directly to
main; always work on a feature branch.
Use the add-repo CLI command (see User Guide) and then commit the updated repository_list.tsv:
./specdatri add-repo --project newpackage --pypi --github
git add repository_list.tsv
git commit -m "Track newpackage on PyPI and GitHub"The repository_list.tsv file is a tab-separated table with the columns:
| Column | Description |
|---|---|
repository |
GitHub repository path (OWNER/REPO) |
project |
Human-readable project identifier |
package |
Package name on the distribution platform |
source |
One of: pypi, bioconda, CRAN, GitHub |
action |
One of: downloads, views, clones |
Data sources live in src/data_sources/ and inherit from the abstract DataSource base class in src/data_sources/base.py.
To add a new source:
-
Create a new module
src/data_sources/<name>.pyand implement theDataSourceinterface:from .base import DataSource class MyNewDataSource(DataSource): def fetch(self, package: str, project: str) -> dict: # Call the external API and return raw data as a dict ...
-
Register the source in
src/cli.pyby mapping the source identifier to the new class, following the same pattern as existing sources. -
Add a new source flag to the
add-repocommand if users need to opt in to this source. -
Write tests in
tests/following the patterns intests/test_data_sources.py.
See src/data_sources/README.md for a detailed description of the data source architecture.
Report generators live in src/reports/ and inherit from the abstract ReportGenerator base class in src/reports/base.py.
To add a new report:
-
Create a new module
src/reports/<name>.pyand implement theReportGeneratorinterface:from .base import ReportGenerator class MyNewReportGenerator(ReportGenerator): source = "mysource" action = "downloads" period = "monthly" # or "weekly" def aggregate(self, data: list[dict]) -> dict: # Aggregate raw records into period → count mapping ...
-
Register the generator in
src/cli.pyso thatgenerate-reportspicks it up automatically. -
Write tests in
tests/following the patterns intests/test_report.py.
See src/reports/README.md for a detailed description of the report generation architecture.
If you want to set up a similar automated download-statistics tracking system for your own organisation's repositories, follow these steps:
Fork this repository into your GitHub organisation (or copy the relevant files into a new repository).
Edit repository_list.tsv (or use ./specdatri add-repo) to list your own packages and their distribution channels.
In your repository's Settings → Secrets and variables → Actions, add:
| Secret | Description |
|---|---|
RECEBOT_REPORTING_TOKEN |
A GitHub personal access token (PAT) with repo scope. Used to read traffic data from the GitHub API and push report commits. Consider using a dedicated bot account. |
pepy_x_api_key |
Your PePy API key for PyPI download statistics. |
The
github_tokensecret used by the push step is the built-insecrets.github_tokenprovided by GitHub Actions and does not need to be configured manually.
Edit .github/workflows/actions.yml and change the cron expression to your preferred schedule:
schedule:
- cron: '0 0 * * 1' # Every Monday at 00:00 UTC- Update the repository description and any references to RECETOX in the README.
- Remove or replace entries in
repository_list.tsvwith your own packages. - Clear the
reports/directory (or delete historical data you do not need).
Enable branch protection on main to prevent accidental direct pushes. The GitHub Actions workflow pushes via the ad-m/github-push-action action using the workflow token, which is exempt from push restrictions when secrets.github_token is used.