Active learning facial recognition for Wikimedia Commons. Train an ML model to recognize specific people and automatically add P180 (depicts) Structured Data to matching images.
- Local dev guide: test-local.md
- Toolforge deploy guide: how-to-run-it.md
- Contributing: CONTRIBUTING.md
- Security policy: SECURITY.md
- Code of conduct: CODE_OF_CONDUCT.md
- Active learning UI: fast yes/no classification with keyboard shortcuts, undo, skip, and manual face drawing
- Background worker: crawls Commons categories, downloads images, detects faces (HOG), and stores 128D encodings
- Multi-instance workers: distributed locking lets multiple workers process projects concurrently without conflicts
- Persistent subprocess pool: face detection runs in long-lived subprocesses, eliminating per-image dlib import overhead
- Bootstrap from existing tags: seeds the model via SPARQL when P180 depicts claims already exist on Commons
- Autonomous inference: centroid-distance classification once you have enough confirmed examples
- User-triggered Commons edits: click "Send Edits to Wikimedia Commons" to write depicts claims via the Wikibase API
- i18n-ready: translations included (en, nb, es, fr)
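The centroid-distance inference can be sketched in a few lines. This is a minimal illustration, not the app's actual code: `classify_face` and its signature are hypothetical; only the 0.6 threshold and the minimum of 5 confirmed faces come from the project defaults.

```python
import math

def classify_face(encoding, confirmed_encodings,
                  distance_threshold=0.6, min_confirmed=5):
    """Centroid-distance classification (illustrative sketch).

    Average the confirmed 128D face encodings into a centroid, then
    accept a new face if its Euclidean distance to that centroid falls
    under the threshold. Returns "match", "no_match", or None while
    there are too few confirmed examples to infer autonomously.
    """
    if len(confirmed_encodings) < min_confirmed:
        return None  # stay in active-learning (manual yes/no) mode
    dims = len(encoding)
    centroid = [
        sum(e[i] for e in confirmed_encodings) / len(confirmed_encodings)
        for i in range(dims)
    ]
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(encoding, centroid)))
    return "match" if distance < distance_threshold else "no_match"
```

Lowering `distance_threshold` makes autonomous matching stricter; raising `min_confirmed` delays it until the centroid is better anchored.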
- Create a project → pick a Wikidata entity (e.g., Q42) and a Commons category
- Discover images → the worker traverses the category and detects faces
- Bootstrap (optional) → if Commons already has depicts claims, seed the model from them
- Classify → review faces one by one with Yes/No (keyboard shortcuts: Y/N)
- Infer → after enough confirmed faces (default 5), classify remaining faces automatically
- Write to Commons → send approved matches as depicts claims (OAuth)
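The bootstrap step reportedly goes through SPARQL; as a simpler sketch of the same idea, Commons CirrusSearch's `haswbstatement` filter finds files that already carry a P180 claim for the entity. Function names here are illustrative, not the worker's actual code.

```python
import json
import urllib.parse
import urllib.request

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def seed_search_params(entity_id, limit=50):
    """Search parameters for Commons files already tagged depicts=entity_id."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:P180={entity_id}",
        "srnamespace": "6",   # File: namespace
        "srlimit": str(limit),
        "format": "json",
    }

def find_seed_files(entity_id, limit=50):
    """Return file titles whose faces can seed the model before any manual review."""
    url = COMMONS_API + "?" + urllib.parse.urlencode(seed_search_params(entity_id, limit))
    req = urllib.request.Request(url, headers={"User-Agent": "WikiVisage-example/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return [hit["title"] for hit in data["query"]["search"]]
```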
```
+---------------------------+      +---------------------------+
|       Flask Web App       |      |    Background Worker(s)   |
|         (app.py)          |      |        (worker.py)        |
|---------------------------|      |---------------------------|
| OAuth 2.0 login           |      | Category traversal        |
| Project CRUD              |      | Image download            |
| Active learning UI        |      | HOG face detection (pool) |
| Classification UI         |      | SPARQL bootstrapping      |
| Queue SDC writes          |      | Autonomous inference      |
| Approve/reject/edit bbox  |      | Write SDC claims          |
|                           |      | Distributed claim locking |
+------------+--------------+      +------------+--------------+
             |                                  |
             +----------------------------------+
                              |
                       +------------+
                       |  MariaDB   |
                       | (ToolsDB)  |
                       +------------+
```
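The "Write SDC claims" box amounts to a Wikibase `wbcreateclaim` call against the file's MediaInfo entity (the `M<pageid>` of a Commons file). A sketch of the request body follows; the helper name is illustrative, and the real app sends this through an OAuth-authenticated session.

```python
import json

def depicts_claim_request(media_id, entity_id, csrf_token):
    """POST body for a Wikibase wbcreateclaim call that adds a
    depicts (P180) statement to a Commons MediaInfo entity.

    media_id:  MediaInfo id of the file page, e.g. "M12345"
    entity_id: Wikidata item being depicted, e.g. "Q42"
    """
    return {
        "action": "wbcreateclaim",
        "entity": media_id,
        "property": "P180",
        "snaktype": "value",
        # wikibase-item datavalue: the depicted entity, by numeric id
        "value": json.dumps({"entity-type": "item",
                             "numeric-id": int(entity_id.lstrip("Q"))}),
        "token": csrf_token,
        "format": "json",
    }
```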
- Stack: Python 3.11+, Flask, gunicorn, face_recognition (dlib HOG), PyMySQL, requests-oauthlib
- Hosted on: Wikimedia Toolforge (Kubernetes Build Service)
- Workers: multiple instances run concurrently; each claims projects via `SELECT … FOR UPDATE` with automatic stale-claim expiry (15 min)
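That claim pattern can be sketched as follows. This is a minimal illustration assuming a hypothetical `projects(id, claimed_by, claimed_at)` table; the app's actual schema and column names differ.

```python
import datetime

CLAIM_EXPIRY_MINUTES = 15  # stale claims are reaped after this

def claim_next_project(conn, worker_id):
    """Atomically claim one unclaimed (or stale-claimed) project.

    SELECT ... FOR UPDATE row-locks the candidate inside a transaction,
    so two workers polling at the same time can never claim the same
    project. Works with any DB-API connection such as PyMySQL's.
    """
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(minutes=CLAIM_EXPIRY_MINUTES)
    with conn.cursor() as cur:
        conn.begin()
        cur.execute(
            "SELECT id FROM projects"
            " WHERE claimed_by IS NULL OR claimed_at < %s"
            " LIMIT 1 FOR UPDATE",
            (cutoff,),
        )
        row = cur.fetchone()
        if row is None:
            conn.rollback()  # nothing to do this cycle
            return None
        cur.execute(
            "UPDATE projects SET claimed_by = %s, claimed_at = %s WHERE id = %s",
            (worker_id, datetime.datetime.utcnow(), row[0]),
        )
        conn.commit()
        return row[0]
```

A worker that crashes simply stops refreshing `claimed_at`, so after 15 minutes its projects become claimable again.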
```
WikiVisage/
├── app.py                 # Flask app: OAuth, routes, classification API
├── worker.py              # Background ML pipeline: crawl, detect, infer, write (multi-instance)
├── token_crypto.py        # Fernet encrypt/decrypt helpers for OAuth tokens at rest
├── database.py            # MariaDB connection pool with retry logic
├── schema.sql             # Database schema (9 tables + indices)
├── migrate.py             # Idempotent migration script with --reset flag
├── jobs.yaml              # Toolforge jobs definition (2 worker instances)
├── templates/             # Jinja2 templates (10 files, all extend base.html)
├── static/                # Logos + screenshots
├── translations/          # i18n: en, nb, es, fr
├── requirements.txt       # Runtime dependencies
├── requirements-dev.txt   # Dev/test deps (pytest, ruff)
└── tests/                 # 567 tests (unit + integration)
```
- A Toolforge tool account
- An OAuth 2.0 consumer registered on Meta with grants:
  - Basic rights
  - Edit existing pages
- Callback URL: `https://<toolname>.toolforge.org/auth/callback`
```bash
# Database credentials (find yours in ~/replica.my.cnf on Toolforge)
toolforge envvars create TOOL_TOOLSDB_USER "s<NNNNN>"
toolforge envvars create TOOL_TOOLSDB_PASSWORD "<password>"
toolforge envvars create WIKIVISAGE_DB_NAME "s<NNNNN>__wikiface"

# OAuth 2.0
toolforge envvars create OAUTH_CLIENT_ID "<client-id>"
toolforge envvars create OAUTH_CLIENT_SECRET "<client-secret>"
toolforge envvars create OAUTH_REDIRECT_URI "https://<toolname>.toolforge.org/auth/callback"

# Flask
toolforge envvars create FLASK_SECRET_KEY "$(python3 -c 'import secrets; print(secrets.token_hex(32))')"

# Token encryption (optional; tokens are stored as plaintext if unset)
toolforge envvars create WIKIVISAGE_TOKEN_KEY "$(python3 -c 'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())')"
```

Create the database on ToolsDB:

```bash
mariadb --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud
```

```sql
CREATE DATABASE s<NNNNN>__wikiface;
```

Then run the migrations:

```bash
python3 migrate.py
```

Build and start the services:

```bash
# Build container image
toolforge build start https://github.com/DiFronzo/WikiVisage.git

# Start web service
toolforge webservice buildservice start

# Start background workers (uses jobs.yaml; 2 worker instances)
toolforge jobs load jobs.yaml
```

The app will be live at `https://<toolname>.toolforge.org`.
```bash
pip install -r requirements-dev.txt

export TOOL_TOOLSDB_USER=root
export TOOL_TOOLSDB_PASSWORD=yourpassword
export TOOL_TOOLSDB_HOST=127.0.0.1
export WIKIVISAGE_DB_NAME=wikiface_dev
export OAUTH_CLIENT_ID=<client-id>
export OAUTH_CLIENT_SECRET=<client-secret>
export OAUTH_REDIRECT_URI=http://localhost:8000/auth/callback
export FLASK_SECRET_KEY=dev-secret-key
export OAUTHLIB_INSECURE_TRANSPORT=1

mysql -u root -p -e "CREATE DATABASE wikiface_dev"
python migrate.py

python app.py                          # Web app on http://localhost:8000
python worker.py --worker-id local-1   # Background worker (separate terminal)
```

For local OAuth you'll need a separate consumer with `http://localhost:8000/auth/callback` as the callback URL. Set `OAUTHLIB_INSECURE_TRANSPORT=1` to allow OAuth over HTTP.
Each project has a couple of tunables:
| Parameter | Default | Description |
|---|---|---|
| `distance_threshold` | 0.6 | Face-distance cutoff for autonomous classification (lower = stricter). |
| `min_confirmed` | 5 | Minimum confirmed matches before autonomous inference starts. |
| Variable | Default | Description |
|---|---|---|
| `WIKIVISAGE_WORKER_POLL_INTERVAL` | 60 | Seconds between poll cycles |
| `WIKIVISAGE_WORKER_MAX_PROJECTS` | 3 | Max projects processed concurrently per worker |
| `WIKIVISAGE_WORKER_IMAGE_THREADS` | 4 | Parallel image download/detection threads per project |
| `WIKIVISAGE_WORKER_BATCH_SIZE` | 50 | Images per processing batch |
| `WIKIVISAGE_DB_POOL_SIZE` | auto | DB connection pool size (auto = max_projects × image_threads + 3) |
| `COMMONS_DOWNLOAD_THROTTLE_SECONDS` | 0 | Delay between image downloads (seconds) |
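The `auto` pool-size rule can be expressed as a small resolver. This is a sketch of the documented formula only; the app's actual resolution logic may differ.

```python
import os

def db_pool_size(env=os.environ):
    """Resolve WIKIVISAGE_DB_POOL_SIZE.

    Falls back to the documented auto formula:
    max_projects * image_threads + 3 (one connection per concurrent
    image thread across all projects, plus headroom for the poll loop
    and claim bookkeeping).
    """
    explicit = env.get("WIKIVISAGE_DB_POOL_SIZE")
    if explicit and explicit != "auto":
        return int(explicit)
    max_projects = int(env.get("WIKIVISAGE_WORKER_MAX_PROJECTS", "3"))
    image_threads = int(env.get("WIKIVISAGE_WORKER_IMAGE_THREADS", "4"))
    return max_projects * image_threads + 3
```

With the defaults above this resolves to 3 × 4 + 3 = 15 connections.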
```bash
# Unit tests (CI mode; integration tests auto-skipped)
pytest tests/ -v

# Unit + integration tests (requires local MariaDB)
WIKIVISAGE_TEST_DB=1 pytest tests/ -v

# With coverage
WIKIVISAGE_TEST_DB=1 pytest tests/ --cov=. --cov-report=term-missing
```