Many AI workloads need somewhere to put data that isn't a final dataset yet: intermediate results, batch outputs, collected records that need processing before they're ready to publish.
That's exactly what Storage Buckets are for. Hugging Face just shipped them: mutable, non-versioned object storage on the Hub, powered by Xet.
I immediately had a use case. I maintain a pipeline that collects README cards from datasets and models across the Hub. The old setup:
- GitHub Actions every few hours
- Download the entire existing dataset (multiple GB)
- Merge in new cards, deduplicate
- Re-upload everything
Every run moved gigabytes of unchanged data, and it was slow and brittle because of storage limits in GitHub Actions.
Storage Buckets as a "working layer" made this much simpler:
- Fetch jobs just append JSONL batches to a bucket — no need to read existing data
- A daily compile reads from the bucket, deduplicates with Polars, publishes to the Hub
- The whole compile takes about a minute for 400k+ records
Because Buckets are backed by Xet, you can be lazy about deduplication at the storage level: Xet handles chunk-level dedup, so overlapping writes across runs don't balloon storage or transfer costs. You just write and let Xet sort it out.
Each stage only writes forward. Fetch never reads the bucket. Compile never modifies the bucket. The published dataset can be regenerated from the bucket at any time.
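The write-forward fetch side is even simpler. A minimal sketch (paths and the helper name are illustrative): each run serializes its batch to a fresh, uniquely named JSONL file and never reads or rewrites earlier batches.

```python
# Hypothetical append-only fetch step: one new file per run, nothing
# existing is ever read or touched.
import json
from datetime import datetime, timezone
from pathlib import Path

def append_batch(records: list[dict], out_dir: str) -> Path:
    # Timestamped name keeps each run's batch distinct.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    path = Path(out_dir) / f"batch-{stamp}.jsonl"
    with path.open("w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    # Only this single file gets uploaded to the bucket.
    return path
```

Since fetch never reads the bucket, runs can overlap or retry freely, and Xet's chunk-level dedup absorbs any redundant records until the next compile.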
The whole thing runs on HF Scheduled Jobs — UV scripts with inline dependencies, no Docker, no CI config 🤗
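For anyone unfamiliar with UV's inline dependencies: a single script can declare what it needs in a PEP 723 metadata header, and uv resolves the environment at run time. A minimal sketch (the dependency list here is illustrative):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["polars", "huggingface_hub"]
# ///
# uv reads the header above and installs the listed packages into an
# ephemeral environment, so the scheduled job needs no Dockerfile or
# CI config — just this one file.

def main() -> None:
    ...  # compile step goes here

if __name__ == "__main__":
    main()
```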
Wrote up the pattern here: https://lnkd.in/emxEKrMi