WikiVisage


Active learning facial recognition for Wikimedia Commons. Train an ML model to recognize specific people and automatically add P180 (depicts) Structured Data to matching images.


✨ Highlights

  • 🧠 Active learning UI: fast yes/no classification with keyboard shortcuts, undo, skip, and manual face drawing
  • 🧵 Background worker: crawls Commons categories, downloads images, detects faces (HOG), and stores 128D encodings
  • 🔀 Multi-instance workers: distributed locking lets multiple workers process projects concurrently without conflicts
  • 🧪 Persistent subprocess pool: face detection runs in long-lived subprocesses, eliminating per-image dlib import overhead
  • 🧷 Bootstrap from existing tags: seeds the model via SPARQL when P180 depicts claims already exist on Commons
  • 🤖 Autonomous inference: centroid-distance classification once you have enough confirmed examples
  • ✍️ User-triggered Commons edits: click "Send Edits to Wikimedia Commons" to write depicts claims via the Wikibase API
  • 🌍 i18n-ready: translations included (en, nb, es, fr)

🧭 How it works

  1. 🆕 Create a project — pick a Wikidata entity (e.g., Q42) and a Commons category
  2. 🔎 Discover images — the worker traverses the category and detects faces
  3. 🧷 Bootstrap (optional) — if Commons already has depicts claims, seed the model from them
  4. ✅❌ Classify — review faces one-by-one with Yes/No (keyboard shortcuts: Y / N)
  5. 🤖 Infer — after enough confirmed faces (default 5), classify remaining faces automatically
  6. ✍️ Write to Commons — send approved matches as depicts claims (OAuth)

πŸ—οΈ Architecture

+---------------------------+      +---------------------------+
|       Flask Web App       |      |  Background Worker(s)     |
|          (app.py)         |      |       (worker.py)         |
|---------------------------|      |---------------------------|
| OAuth 2.0 login           |      | Category traversal        |
| Project CRUD              |      | Image download            |
| Active learning UI        |      | HOG face detection (pool) |
| Classification UI         |      | SPARQL bootstrapping      |
| Queue SDC writes          |      | Autonomous inference      |
| Approve/reject/edit bbox  |      | Write SDC claims          |
|                           |      | Distributed claim locking |
+------------+--------------+      +------------+--------------+
             |                                  |
             +----------------------------------+
                               |
                        +------------+
                        |   MariaDB  |
                        |  (ToolsDB) |
                        +------------+

  • 🧰 Stack: Python 3.11+, Flask, gunicorn, face_recognition (dlib HOG), PyMySQL, requests-oauthlib
  • ☁️ Hosted on: Wikimedia Toolforge (Kubernetes Build Service)
  • 🔀 Workers: Multiple instances run concurrently — each claims projects via SELECT … FOR UPDATE with automatic stale-claim expiry (15 min)
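
The claim-with-expiry behavior described above can be illustrated with an in-memory stand-in. In production this happens atomically via SELECT … FOR UPDATE inside a MariaDB transaction; the class and method names here are hypothetical:

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=15)  # matches the worker's stale-claim expiry

class ProjectClaims:
    """In-memory stand-in for the projects table. The real worker locks the
    row with SELECT ... FOR UPDATE so two workers can't both claim a project;
    a claim older than 15 minutes is treated as abandoned."""

    def __init__(self):
        self._claims = {}  # project_id -> (worker_id, claimed_at)

    def try_claim(self, project_id, worker_id, now):
        holder = self._claims.get(project_id)
        if holder is not None:
            _, since = holder
            if now - since < STALE_AFTER:
                return False  # another worker holds a fresh claim
            # Stale claim: the previous worker likely crashed; take over.
        self._claims[project_id] = (worker_id, now)
        return True
```

The expiry window is what lets a replacement worker pick up a project whose original claimant died without releasing its claim.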

πŸ—‚οΈ Project layout

WikiVisage/
├── app.py               # Flask app: OAuth, routes, classification API
├── worker.py            # Background ML pipeline: crawl, detect, infer, write (multi-instance)
├── token_crypto.py      # Fernet encrypt/decrypt helpers for OAuth tokens at rest
├── database.py          # MariaDB connection pool with retry logic
├── schema.sql           # Database schema (9 tables + indices)
├── migrate.py           # Idempotent migration script with --reset flag
├── jobs.yaml            # Toolforge jobs definition (2 worker instances)
├── templates/           # Jinja2 templates (10 files, all extend base.html)
├── static/              # Logos + screenshots
├── translations/        # i18n: en, nb, es, fr
├── requirements.txt     # Runtime dependencies
├── requirements-dev.txt # Dev/test deps (pytest, ruff)
└── tests/               # 567 tests (unit + integration)

πŸ§‘β€πŸ’» Setup

✅ Prerequisites

  • A Toolforge tool account
  • An OAuth 2.0 consumer registered on Meta with:
    • Grants: Basic rights, Edit existing pages
    • Callback URL: https://<toolname>.toolforge.org/auth/callback

1) πŸ” Environment variables (Toolforge)

# Database credentials (find yours in ~/replica.my.cnf on Toolforge)
toolforge envvars create TOOL_TOOLSDB_USER      "s<NNNNN>"
toolforge envvars create TOOL_TOOLSDB_PASSWORD  "<password>"
toolforge envvars create WIKIVISAGE_DB_NAME     "s<NNNNN>__wikiface"

# OAuth 2.0
toolforge envvars create OAUTH_CLIENT_ID        "<client-id>"
toolforge envvars create OAUTH_CLIENT_SECRET    "<client-secret>"
toolforge envvars create OAUTH_REDIRECT_URI     "https://<toolname>.toolforge.org/auth/callback"

# Flask
toolforge envvars create FLASK_SECRET_KEY "$(python3 -c 'import secrets; print(secrets.token_hex(32))')"

# Token encryption (optional — tokens stored as plaintext if unset)
toolforge envvars create WIKIVISAGE_TOKEN_KEY "$(python3 -c 'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())')"

2) πŸ—„οΈ Create database

mariadb --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud
CREATE DATABASE s<NNNNN>__wikiface;

3) 🧱 Run migration

python3 migrate.py

4) 🚀 Build & deploy

# Build container image
toolforge build start https://github.com/DiFronzo/WikiVisage.git

# Start web service
toolforge webservice buildservice start

# Start background workers (uses jobs.yaml — 2 worker instances)
toolforge jobs load jobs.yaml

The app will be live at https://<toolname>.toolforge.org.

🧪 Local development

pip install -r requirements-dev.txt

export TOOL_TOOLSDB_USER=root
export TOOL_TOOLSDB_PASSWORD=yourpassword
export TOOL_TOOLSDB_HOST=127.0.0.1
export WIKIVISAGE_DB_NAME=wikiface_dev
export OAUTH_CLIENT_ID=<client-id>
export OAUTH_CLIENT_SECRET=<client-secret>
export OAUTH_REDIRECT_URI=http://localhost:8000/auth/callback
export FLASK_SECRET_KEY=dev-secret-key
export OAUTHLIB_INSECURE_TRANSPORT=1

mysql -u root -p -e "CREATE DATABASE wikiface_dev"
python migrate.py
python app.py                                   # Web app on http://localhost:8000
python worker.py --worker-id local-1            # Background worker (separate terminal)

For local OAuth you'll need a separate consumer with http://localhost:8000/auth/callback as the callback URL. Set OAUTHLIB_INSECURE_TRANSPORT=1 to allow OAuth over HTTP.

βš™οΈ Configuration

Each project has a couple of tunables:

Parameter Default Description
distance_threshold 0.6 Face-distance cutoff for autonomous classification (lower = stricter).
min_confirmed 5 Minimum confirmed matches before autonomous inference starts.

🔧 Worker environment variables (optional)

| Variable | Default | Description |
| --- | --- | --- |
| WIKIVISAGE_WORKER_POLL_INTERVAL | 60 | Seconds between poll cycles |
| WIKIVISAGE_WORKER_MAX_PROJECTS | 3 | Max projects processed concurrently per worker |
| WIKIVISAGE_WORKER_IMAGE_THREADS | 4 | Parallel image download/detection threads per project |
| WIKIVISAGE_WORKER_BATCH_SIZE | 50 | Images per processing batch |
| WIKIVISAGE_DB_POOL_SIZE | auto | DB connection pool size (auto = max_projects × image_threads + 3) |
| COMMONS_DOWNLOAD_THROTTLE_SECONDS | 0 | Delay between image downloads (seconds) |
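
The `auto` pool-size rule can be expressed directly. A sketch only; the helper names are illustrative, not worker.py's internals:

```python
import os

def _env_int(name, default):
    """Read an integer worker setting from the environment, falling back to the documented default."""
    return int(os.environ.get(name, default))

def db_pool_size():
    # An explicit setting wins; otherwise auto = max_projects x image_threads + 3,
    # i.e. roughly one connection per detection thread plus a little headroom.
    explicit = os.environ.get("WIKIVISAGE_DB_POOL_SIZE")
    if explicit is not None:
        return int(explicit)
    return (_env_int("WIKIVISAGE_WORKER_MAX_PROJECTS", 3)
            * _env_int("WIKIVISAGE_WORKER_IMAGE_THREADS", 4) + 3)
```

With the defaults above this yields 3 × 4 + 3 = 15 connections.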

✅ Testing

# Unit tests (CI mode — integration tests auto-skipped)
pytest tests/ -v

# Unit + integration tests (requires local MariaDB)
WIKIVISAGE_TEST_DB=1 pytest tests/ -v

# With coverage
WIKIVISAGE_TEST_DB=1 pytest tests/ --cov=. --cov-report=term-missing

πŸ“ License

MIT
