CollectorVision Part 3: Milo — Embedding 108,000 Cards into 128 Numbers

Part 3 of the CollectorVision series. Part 1 has the overview.

After Cornelius finds the card and the perspective is corrected, you have a clean 252 × 352 crop. The next problem is figuring out which of 108,000 cards it is.


Why not a classifier?

The obvious approach is to train a classifier with 108,000 output classes, one per card. This works, but has a few problems:

  • New cards get released every few months. Adding them means retraining the model.
  • The model tends to memorize its specific training images rather than what the card looks like in general, so it's brittle to changes in lighting, image quality, and so on.
  • Cards not in the training set can't be identified at all.

Metric learning handles all three. Instead of predicting a class, the model learns an embedding space where similar-looking cards map to similar vectors. New cards get added to the catalog without changing the model. The model generalizes to new printings and lighting conditions because it's learning what makes cards visually distinct, not memorizing specific training photos.


Milo

Milo is a MobileViT-XXS backbone with a shared linear projection, trained with ArcFace loss using two label types: illustration ID and set code. Input is a 448×448 RGB image, ImageNet-normalized. Output is a 128-dimensional L2-normalized float32 vector.
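
Preprocessing is the standard ImageNet recipe. Here's a minimal sketch of the inference path in NumPy, where milo stands in for however you've loaded the exported model (ONNX, TorchScript, etc.) and maps a (1, 3, 448, 448) tensor to a (1, 128) embedding:

import numpy as np
from PIL import Image

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def embed(crop, milo):
    # crop: the rectified card image from Cornelius (PIL.Image)
    img = crop.convert("RGB").resize((448, 448))
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD        # ImageNet normalization
    x = x.transpose(2, 0, 1)[None]                # HWC -> NCHW, add batch dim
    emb = milo(x)[0]                              # (128,) float32
    return emb / np.linalg.norm(emb)              # L2-normalize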

ArcFace adds an angular margin penalty during training: each sample's logit for its true class is computed as if the sample were a fixed angle farther from its class center, so the model has to pull samples well inside their own cluster and away from neighboring classes to drive the loss down. The result is tighter, more separated clusters in embedding space, which gives better discrimination between visually similar cards.
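
In code, the margin is applied to the target-class angle before the softmax. A minimal PyTorch sketch of that head (the scale s and margin m here are typical defaults, not necessarily Milo's values):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Cosine classifier with an additive angular margin (ArcFace)."""
    def __init__(self, emb_dim: int, n_classes: int, s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim) * 0.01)
        self.s, self.m = s, m

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class centers.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the margin only to each sample's true-class angle.
        target = F.one_hot(labels, num_classes=cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos) * self.s
        return F.cross_entropy(logits, labels)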

The multi-task training (illustration ID + set code) forces the model to be sensitive to visual differences between printings of the same card — things like frame treatment, collector number placement, set symbol — while remaining robust to photo quality variation.
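
With two label types this becomes two ArcFace heads over the shared embedding. Roughly, reusing the head above (the class counts and equal loss weighting are assumptions, not the published recipe):

illus_head = ArcFaceHead(emb_dim=128, n_classes=num_illustrations)   # hypothetical count
set_head   = ArcFaceHead(emb_dim=128, n_classes=num_sets)            # hypothetical count

emb  = project(backbone(images))   # MobileViT-XXS features -> shared 128-d projection
loss = illus_head(emb, illus_labels) + set_head(emb, set_labels)     # equal weighting assumed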


Search

The catalog is a matrix of reference embeddings, one row per card face. For Magic: The Gathering, that's about 108,000 rows × 128 columns, built by running every Scryfall card image through Milo.

Searching is a batched dot product. Since all embeddings are L2-normalized, the dot product equals cosine similarity:

scores = catalog_embeddings @ query_emb   # (N,) cosine similarities

108,000 cards × 128 dimensions = about 14 million multiply-adds per query. On a modern CPU with SIMD instructions, this runs in under 5ms. In the browser with WebAssembly it's 8–15ms.
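
Here's a sketch of the full top-k lookup in NumPy, matching the catalog.search(emb, top_k=5) interface used in the next section (np.argpartition avoids sorting all 108,000 scores):

import numpy as np

def search(catalog_embeddings, card_ids, query_emb, top_k=5):
    scores = catalog_embeddings @ query_emb            # (N,) cosine similarities
    idx = np.argpartition(scores, -top_k)[-top_k:]     # top-k indices, unordered
    idx = idx[np.argsort(scores[idx])[::-1]]           # best-first
    return [(float(scores[i]), card_ids[i]) for i in idx]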


Multi-frame aggregation

A single frame is noisy. Slight blur, unusual angle, glare on a sleeve — any of these can pull the embedding in a slightly wrong direction. The fix is simple: run multiple frames and accumulate scores per card.

from collections import defaultdict

score_map = defaultdict(float)
for emb in recent_frames:
    for score, card_id in catalog.search(emb, top_k=5):
        score_map[card_id] += score

best = max(score_map, key=score_map.get)

The correct card accumulates consistently across frames; noise doesn't. Three to five frames is usually enough.


The catalog file

The catalog is stored as an NPZ file with four keys:

embeddings     (108000, 128) float32
card_ids       (108000,) str
source         scalar str — "scryfall"
embedder_spec  scalar str — JSON spec for reconstructing the embedder

Card names and prices are not stored — callers look those up from Scryfall by UUID after identification. This keeps the catalog small and makes it easy to distribute independently of any particular metadata source.
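
Loading and querying the file is a few lines of NumPy. The local filename below is illustrative; name, set, and price come from a follow-up Scryfall lookup by UUID:

import numpy as np

data = np.load("milo-scryfall-mtg.npz")       # hypothetical local path to the catalog
catalog_embeddings = data["embeddings"]       # (108000, 128) float32
card_ids = data["card_ids"]                   # (108000,) Scryfall UUIDs

scores = catalog_embeddings @ query_emb
best_uuid = card_ids[int(scores.argmax())]
# Metadata comes from Scryfall, e.g. GET https://api.scryfall.com/cards/<uuid>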

The catalog lives on HuggingFace at hf://HanClinto/milo/scryfall-mtg and is updated monthly. The first load downloads ~29 MB and caches it locally. After that, everything runs offline.


Why 128 dimensions?

128 is a reasonable tradeoff for this task. Fewer dimensions (32, 64) make it hard to distinguish cards that look similar. More dimensions (512, 1024) give diminishing returns and make the catalog larger and the search slower. At 128, the full catalog is about 55 MB as float32 (108,000 × 128 × 4 bytes) or 28 MB as float16, and the search is fast enough for real-time use.


Next: Part 4 — Running it locally
