Active learning facial recognition for Wikimedia Commons. Train an ML model to recognize specific people and automatically add P180 (depicts) Structured Data to matching images.
- Local dev guide: test-local.md
- Toolforge deploy guide: how-to-run-it.md
- Contributing: CONTRIBUTING.md
- Security policy: SECURITY.md
- Code of conduct: CODE_OF_CONDUCT.md
- Active learning UI: fast yes/no classification with keyboard shortcuts, undo, skip, and manual face drawing
- Background worker: crawls Commons categories, downloads images, detects faces (HOG), and stores 128D encodings
- Multi-instance workers: distributed locking lets multiple workers process projects concurrently without conflicts
- Persistent subprocess pool: face detection runs in long-lived subprocesses, eliminating per-image dlib import overhead
- Bootstrap from existing tags: seeds the model via SPARQL when P180 depicts claims already exist on Commons
- Autonomous inference: centroid-distance classification once you have enough confirmed examples
- User-triggered Commons edits: click "Send Edits to Wikimedia Commons" to write depicts claims via the Wikibase API
- i18n-ready: translations included (en, nb, es, fr)
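The centroid-distance inference can be sketched in a few lines. This is a minimal illustration, not the app's actual code: `classify_face` and its signature are hypothetical; only the 0.6 threshold and the minimum of 5 confirmed faces come from the project defaults.

```python
import math

def classify_face(encoding, confirmed_encodings,
                  distance_threshold=0.6, min_confirmed=5):
    """Centroid-distance classification (illustrative sketch).

    Average the confirmed 128D face encodings into a centroid, then
    accept a new face if its Euclidean distance to that centroid falls
    under the threshold. Returns "match", "no_match", or None while
    there are too few confirmed examples to infer autonomously.
    """
    if len(confirmed_encodings) < min_confirmed:
        return None  # stay in active-learning (manual yes/no) mode
    dims = len(encoding)
    centroid = [
        sum(e[i] for e in confirmed_encodings) / len(confirmed_encodings)
        for i in range(dims)
    ]
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(encoding, centroid)))
    return "match" if distance < distance_threshold else "no_match"
```

Lowering `distance_threshold` makes autonomous matching stricter; raising `min_confirmed` delays it until the centroid is better anchored.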
- Create a project → pick a Wikidata entity (e.g., Q42) and a Commons category
- Discover images → the worker traverses the category and detects faces
- Bootstrap (optional) → if Commons already has depicts claims, seed the model from them
- Classify → review faces one by one with Yes/No (keyboard shortcuts: Y/N)
- Infer → after enough confirmed faces (default 5), classify remaining faces automatically
- Write to Commons → send approved matches as depicts claims (OAuth)
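The bootstrap step reportedly goes through SPARQL; as a simpler sketch of the same idea, Commons CirrusSearch's `haswbstatement` filter finds files that already carry a P180 claim for the entity. Function names here are illustrative, not the worker's actual code.

```python
import json
import urllib.parse
import urllib.request

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def seed_search_params(entity_id, limit=50):
    """Search parameters for Commons files already tagged depicts=entity_id."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:P180={entity_id}",
        "srnamespace": "6",   # File: namespace
        "srlimit": str(limit),
        "format": "json",
    }

def find_seed_files(entity_id, limit=50):
    """Return file titles whose faces can seed the model before any manual review."""
    url = COMMONS_API + "?" + urllib.parse.urlencode(seed_search_params(entity_id, limit))
    req = urllib.request.Request(url, headers={"User-Agent": "WikiVisage-example/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return [hit["title"] for hit in data["query"]["search"]]
```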
```
+---------------------------+      +---------------------------+
|       Flask Web App       |      |    Background Worker(s)   |
|         (app.py)          |      |        (worker.py)        |
|---------------------------|      |---------------------------|
| OAuth 2.0 login           |      | Category traversal        |
| Project CRUD              |      | Image download            |
| Active learning UI        |      | HOG face detection (pool) |
| Classification UI         |      | SPARQL bootstrapping      |
| Queue SDC writes          |      | Autonomous inference      |
| Approve/reject/edit bbox  |      | Write SDC claims          |
|                           |      | Distributed claim locking |
+------------+--------------+      +------------+--------------+
             |                                  |
             +----------------------------------+
                              |
                       +------------+
                       |  MariaDB   |
                       | (ToolsDB)  |
                       +------------+
```
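The "Write SDC claims" box amounts to a Wikibase `wbcreateclaim` call against the file's MediaInfo entity (the `M<pageid>` of a Commons file). A sketch of the request body follows; the helper name is illustrative, and the real app sends this through an OAuth-authenticated session.

```python
import json

def depicts_claim_request(media_id, entity_id, csrf_token):
    """POST body for a Wikibase wbcreateclaim call that adds a
    depicts (P180) statement to a Commons MediaInfo entity.

    media_id:  MediaInfo id of the file page, e.g. "M12345"
    entity_id: Wikidata item being depicted, e.g. "Q42"
    """
    return {
        "action": "wbcreateclaim",
        "entity": media_id,
        "property": "P180",
        "snaktype": "value",
        # wikibase-item datavalue: the depicted entity, by numeric id
        "value": json.dumps({"entity-type": "item",
                             "numeric-id": int(entity_id.lstrip("Q"))}),
        "token": csrf_token,
        "format": "json",
    }
```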
- Stack: Python 3.11+, Flask, gunicorn, face_recognition (dlib HOG), PyMySQL, requests-oauthlib
- Hosted on: Wikimedia Toolforge (Kubernetes Build Service)
- Workers: multiple instances run concurrently; each claims projects via `SELECT … FOR UPDATE` with automatic stale-claim expiry (15 min)
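That claim pattern can be sketched as follows. This is a minimal illustration assuming a hypothetical `projects(id, claimed_by, claimed_at)` table; the app's actual schema and column names differ.

```python
import datetime

CLAIM_EXPIRY_MINUTES = 15  # stale claims are reaped after this

def claim_next_project(conn, worker_id):
    """Atomically claim one unclaimed (or stale-claimed) project.

    SELECT ... FOR UPDATE row-locks the candidate inside a transaction,
    so two workers polling at the same time can never claim the same
    project. Works with any DB-API connection such as PyMySQL's.
    """
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(minutes=CLAIM_EXPIRY_MINUTES)
    with conn.cursor() as cur:
        conn.begin()
        cur.execute(
            "SELECT id FROM projects"
            " WHERE claimed_by IS NULL OR claimed_at < %s"
            " LIMIT 1 FOR UPDATE",
            (cutoff,),
        )
        row = cur.fetchone()
        if row is None:
            conn.rollback()  # nothing to do this cycle
            return None
        cur.execute(
            "UPDATE projects SET claimed_by = %s, claimed_at = %s WHERE id = %s",
            (worker_id, datetime.datetime.utcnow(), row[0]),
        )
        conn.commit()
        return row[0]
```

A worker that crashes simply stops refreshing `claimed_at`, so after 15 minutes its projects become claimable again.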
```
WikiVisage/
├── app.py                 # Flask app: OAuth, routes, classification API
├── worker.py              # Background ML pipeline: crawl, detect, infer, write (multi-instance)
├── token_crypto.py        # Fernet encrypt/decrypt helpers for OAuth tokens at rest
├── database.py            # MariaDB connection pool with retry logic
├── schema.sql             # Database schema (9 tables + indices)
├── migrate.py             # Idempotent migration script with --reset flag
├── jobs.yaml              # Toolforge jobs definition (2 worker instances)
├── templates/             # Jinja2 templates (10 files, all extend base.html)
├── static/                # Logos + screenshots
├── translations/          # i18n: en, nb, es, fr
├── requirements.txt       # Runtime dependencies
├── requirements-dev.txt   # Dev/test deps (pytest, ruff)
└── tests/                 # 567 tests (unit + integration)
```
- A Toolforge tool account
- An OAuth 2.0 consumer registered on Meta with grants:
  - Basic rights
  - Edit existing pages
- Callback URL: `https://<toolname>.toolforge.org/auth/callback`
```bash
# Database credentials (find yours in ~/replica.my.cnf on Toolforge)
toolforge envvars create TOOL_TOOLSDB_USER "s<NNNNN>"
toolforge envvars create TOOL_TOOLSDB_PASSWORD "<password>"
toolforge envvars create WIKIVISAGE_DB_NAME "s<NNNNN>__wikiface"

# OAuth 2.0
toolforge envvars create OAUTH_CLIENT_ID "<client-id>"
toolforge envvars create OAUTH_CLIENT_SECRET "<client-secret>"
toolforge envvars create OAUTH_REDIRECT_URI "https://<toolname>.toolforge.org/auth/callback"

# Flask
toolforge envvars create FLASK_SECRET_KEY "$(python3 -c 'import secrets; print(secrets.token_hex(32))')"

# Token encryption (optional; tokens are stored as plaintext if unset)
toolforge envvars create WIKIVISAGE_TOKEN_KEY "$(python3 -c 'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())')"
```

Create the database on ToolsDB:

```bash
mariadb --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud
```

```sql
CREATE DATABASE s<NNNNN>__wikiface;
```

Then run the migrations:

```bash
python3 migrate.py
```

Build and start the services:

```bash
# Build container image
toolforge build start https://github.com/DiFronzo/WikiVisage.git

# Start web service
toolforge webservice buildservice start

# Start background workers (uses jobs.yaml; 2 worker instances)
toolforge jobs load jobs.yaml
```

The app will be live at `https://<toolname>.toolforge.org`.
```bash
pip install -r requirements-dev.txt

export TOOL_TOOLSDB_USER=root
export TOOL_TOOLSDB_PASSWORD=yourpassword
export TOOL_TOOLSDB_HOST=127.0.0.1
export WIKIVISAGE_DB_NAME=wikiface_dev
export OAUTH_CLIENT_ID=<client-id>
export OAUTH_CLIENT_SECRET=<client-secret>
export OAUTH_REDIRECT_URI=http://localhost:8000/auth/callback
export FLASK_SECRET_KEY=dev-secret-key
export OAUTHLIB_INSECURE_TRANSPORT=1

mysql -u root -p -e "CREATE DATABASE wikiface_dev"
python migrate.py

python app.py                          # Web app on http://localhost:8000
python worker.py --worker-id local-1   # Background worker (separate terminal)
```

For local OAuth you'll need a separate consumer with `http://localhost:8000/auth/callback` as the callback URL. Set `OAUTHLIB_INSECURE_TRANSPORT=1` to allow OAuth over HTTP.
Each project has a couple of tunables:
| Parameter | Default | Description |
|---|---|---|
| `distance_threshold` | 0.6 | Face-distance cutoff for autonomous classification (lower = stricter). |
| `min_confirmed` | 5 | Minimum confirmed matches before autonomous inference starts. |
| Variable | Default | Description |
|---|---|---|
| `WIKIVISAGE_WORKER_POLL_INTERVAL` | 60 | Seconds between poll cycles |
| `WIKIVISAGE_WORKER_MAX_PROJECTS` | 3 | Max projects processed concurrently per worker |
| `WIKIVISAGE_WORKER_IMAGE_THREADS` | 4 | Parallel image download/detection threads per project |
| `WIKIVISAGE_WORKER_BATCH_SIZE` | 50 | Images per processing batch |
| `WIKIVISAGE_DB_POOL_SIZE` | auto | DB connection pool size (auto = max_projects × image_threads + 3) |
| `COMMONS_DOWNLOAD_THROTTLE_SECONDS` | 0 | Delay between image downloads (seconds) |
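The `auto` pool-size rule can be expressed as a small resolver. This is a sketch of the documented formula only; the app's actual resolution logic may differ.

```python
import os

def db_pool_size(env=os.environ):
    """Resolve WIKIVISAGE_DB_POOL_SIZE.

    Falls back to the documented auto formula:
    max_projects * image_threads + 3 (one connection per concurrent
    image thread across all projects, plus headroom for the poll loop
    and claim bookkeeping).
    """
    explicit = env.get("WIKIVISAGE_DB_POOL_SIZE")
    if explicit and explicit != "auto":
        return int(explicit)
    max_projects = int(env.get("WIKIVISAGE_WORKER_MAX_PROJECTS", "3"))
    image_threads = int(env.get("WIKIVISAGE_WORKER_IMAGE_THREADS", "4"))
    return max_projects * image_threads + 3
```

With the defaults above this resolves to 3 × 4 + 3 = 15 connections.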
```bash
# Unit tests (CI mode; integration tests auto-skipped)
pytest tests/ -v

# Unit + integration tests (requires local MariaDB)
WIKIVISAGE_TEST_DB=1 pytest tests/ -v

# With coverage
WIKIVISAGE_TEST_DB=1 pytest tests/ --cov=. --cov-report=term-missing
```