CollectorVision Part 8: The Sol Ring Benchmark — Testing Card Recognition on the Hardest Case
Sol Ring is ranked #1 on EDHREC — the most-played card in Commander, by a wide margin. It also happens to be one of the hardest cards for an image recognition system to get right.
The reason is the artwork. The most common Sol Ring printing uses the Mike Bierek illustration, and Scryfall currently shows 43 Sol Ring prints sharing that artwork. C13, C14, C15, C16, C17, C18, C19, C20, C21, CMR, CLB, DMC, KHC, ZNC — each has its own printing, each looks nearly identical to the others. The differences are subtle: set symbol placement, minor frame treatment variations, the collector number badge. Nothing you'd easily notice at a glance.
This makes Sol Ring a natural test case for edition discrimination. A system that claims good edition accuracy should be tested on cases like this, not just on cards where each printing looks obviously different.
The dataset
I put together an evaluation set of 307 frames covering 21 distinct Sol Ring printings — all Mike Bierek artwork, mostly from Commander precon sets, plus Mystery Booster and a Happy Holidays promo printing. Each printing was filmed on a phone against a white background across a range of lighting conditions, angles, and minor motion. The frames are labeled with Scryfall UUIDs.
The dataset is at https://huggingface.co/datasets/HanClinto/solring-eval.
The full card list:
| Set code | Set name |
|---|---|
| CMD | Commander 2011 |
| C13 | Commander 2013 |
| C14 | Commander 2014 |
| C15 | Commander 2015 |
| C16 | Commander 2016 |
| C17 | Commander 2017 |
| C18 | Commander 2018 |
| C19 | Commander 2019 |
| C20 | Commander 2020 |
| C21 | Commander 2021 |
| CMR | Commander Legends |
| CMA | Commander Anthology |
| CM2 | Commander Anthology V2 |
| AFC | Adventures in the Forgotten Realms Commander |
| KHC | Kaldheim Commander |
| ZNC | Zendikar Rising Commander |
| NEC | Neon Dynasty Commander |
| CLB | Battle for Baldur's Gate Commander |
| DMC | Dominaria United Commander |
| MB1 | Mystery Booster |
| PHED | Happy Holidays promo |
Why this benchmark is useful
Most card recognition benchmarks test against a sample of cards from across the whole catalog. Those tend to look good because most cards are visually distinct from each other, and a system only needs to get it "in the right neighborhood" to score a hit.
The Sol Ring set is adversarial in a specific way: all 21 cards look almost identical. A system that's doing anything like "find the card that looks most like this" and not "find the exact printing" will collapse most of these into a single answer. Card accuracy will be high; edition accuracy will be low.
Concretely: if a system correctly identifies these frames as "Sol Ring" but can't distinguish C17 from C18, it scores 21/21 on card accuracy and roughly 1/21 on edition accuracy (assuming it locks onto one printing). A system with genuine edition discrimination should do substantially better on the second number.
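To make the gap between the two metrics concrete, here's a minimal sketch of how they'd be computed. The printing IDs and the printing-to-card mapping are hypothetical stand-ins for the dataset's Scryfall UUIDs and an oracle-ID lookup:

```python
# Hypothetical (true_printing, predicted_printing) pairs; IDs are illustrative.
preds = [
    ("solring-c17", "solring-c17"),
    ("solring-c18", "solring-c17"),  # right card, wrong printing
    ("solring-c20", "solring-c20"),
]

# Assumed printing-to-card mapping; in practice, Scryfall's oracle ID
# groups all printings of the same card.
card_of = {p: "Sol Ring" for pair in preds for p in pair}

card_acc = sum(card_of[t] == card_of[p] for t, p in preds) / len(preds)
edition_acc = sum(t == p for t, p in preds) / len(preds)
print(card_acc, edition_acc)  # 1.0 and ~0.67: always the right card, not always the right printing
```

A system that collapses all printings into one answer maxes out the first number while the second sinks toward chance.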
Dataset construction
Each of the 21 printings was ordered from TCGPlayer. I recorded short phone videos of each card on a white background, varying the angle and lighting slightly across takes. FFmpeg extracted frames at roughly one per second of source video, and a blur filter removed obviously out-of-focus frames. The resulting set has 9–21 frames per printing.
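The post doesn't specify which blur filter was used; a common choice, and a reasonable guess at what it looks like, is a variance-of-Laplacian sharpness score with a fixed threshold. A pure-NumPy sketch (the threshold value below is illustrative, not the one used to build this dataset):

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Sharpness score: variance of the 4-neighbour discrete Laplacian.

    Blurred frames have little high-frequency content, so their
    Laplacian variance is low; a threshold can drop them.
    """
    g = gray.astype(np.float64)
    lap = (
        -4.0 * g[1:-1, 1:-1]
        + g[:-2, 1:-1] + g[2:, 1:-1]
        + g[1:-1, :-2] + g[1:-1, 2:]
    )
    return float(lap.var())

# Illustrative use: a noisy, detail-rich frame scores far higher than a
# featureless (maximally blurred) one. BLUR_THRESHOLD is an assumed value.
BLUR_THRESHOLD = 100.0
rng = np.random.default_rng(0)
sharp = rng.uniform(0, 255, (64, 64))
blurred = np.full((64, 64), 128.0)
keep = [f for f in (sharp, blurred) if laplacian_variance(f) > BLUR_THRESHOLD]
```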
Because I already had the physical cards, I also ran a SIFT homography pipeline to detect corner coordinates for each frame — matching each frame against the known Scryfall reference image. These corners are stored in corners.csv alongside the frame images. If you want to use the dataset without running your own corner detector, you can use the pre-computed corners directly to dewarp each frame.
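The corner coordinates in corners.csv can be reproduced in principle by estimating a homography H from the Scryfall reference image to the frame (e.g. SIFT matches plus RANSAC, as OpenCV's `cv2.findHomography` does) and projecting the reference image's four corners through it. The projection step in a minimal NumPy sketch, with the homography estimation itself assumed to come from elsewhere:

```python
import numpy as np

def project_corners(H: np.ndarray, width: int, height: int) -> np.ndarray:
    """Map the reference image's corners (TL, TR, BR, BL) through a 3x3
    homography H into frame coordinates."""
    ref = np.array(
        [[0, 0], [width, 0], [width, height], [0, height]], dtype=np.float64
    )
    homog = np.hstack([ref, np.ones((4, 1))])  # to homogeneous coordinates
    mapped = homog @ H.T                       # apply the homography
    return mapped[:, :2] / mapped[:, 2:3]      # back to Cartesian

# Sanity check: the identity homography leaves the corners unchanged.
# 488x680 is Scryfall's "normal" image size.
corners = project_corners(np.eye(3), 488, 680)
```

Once you have the four corners, dewarping each frame is a perspective warp to the reference rectangle.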
Temporal ordering
The frames within each printing are ordered by source video frame number. This matters if you want to simulate how a live-camera system performs in practice, where you process frames sequentially and accumulate evidence over time.
```python
from collections import deque, defaultdict

import numpy as np

# Group frames by card_id, sorted by frame_number
by_card = defaultdict(list)
for row in csv_rows:
    by_card[row["card_id"]].append(row)
for frames in by_card.values():
    frames.sort(key=lambda r: int(r["frame_number"]))

# Simulate rolling-buffer evaluation: keep the last 5 embeddings, average
# in the ones that agree with the current frame, and search with the mean.
for card_id, frames in by_card.items():
    buffer = deque(maxlen=5)
    correct = 0
    for row in frames:
        emb = embed(dewarp(row))  # dewarp the frame, then embed it
        kept = [e for e in buffer if cosine_sim(emb, e) >= 0.7]
        if kept:
            mean_emb = np.mean([emb] + kept, axis=0)
            search_emb = mean_emb / np.linalg.norm(mean_emb)
        else:
            search_emb = emb
        top1 = catalog_search(search_emb)  # id of the nearest catalog entry
        if top1 == card_id:
            correct += 1
        buffer.append(emb)
    print(f"{card_id}: {correct}/{len(frames)}")
```
Rolling-buffer evaluation gives a more realistic sense of live-camera performance than single-frame evaluation, since individual frames are noisier than the average over several seconds.
Results with Milo
Current single-frame edition accuracy on the Sol Ring dataset is around 74–78%, depending on threshold settings. Rolling-buffer accuracy (5-frame window) is around 88–92%.
For context: random guessing across 21 printings would give about 4.8% edition accuracy. Card accuracy is near 100% in both modes — the system consistently identifies these as Sol Ring, just not always the right printing.
The hard cases are the mid-era Commander sets (C16 through C21), where the card frame changed very little from year to year. The pre-C16 sets and the post-CMR sets are somewhat easier because their frames are more visibly different.
This is an area where the model is still improving. The multitask ArcFace training was designed partly to address this — forcing the model to be sensitive to set-level visual features — but fine-grained edition discrimination on reprints is genuinely hard and there's room to do better.
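For readers who skipped the earlier posts: the core ArcFace idea is to add an angular margin to the true class's cosine logit before softmax, which pushes embeddings of different printings apart even when the images look nearly identical. A minimal NumPy sketch of the margin step (the margin and scale values are illustrative, not Milo's actual hyperparameters):

```python
import numpy as np

def arcface_logits(emb, class_weights, label, margin=0.5, scale=30.0):
    """ArcFace logits: cosine similarity per class, with an additive
    angular margin applied to the true class, then scaled for softmax."""
    e = emb / np.linalg.norm(emb)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = (e @ w).copy()                       # cosine to each class weight
    theta = np.arccos(np.clip(cos[label], -1.0, 1.0))
    cos[label] = np.cos(theta + margin)        # penalise the target class
    return scale * cos

# The margin makes the target logit strictly harder to satisfy, so the
# model must pull same-printing embeddings tighter together to compensate.
emb = np.ones(4)
W = np.eye(4)  # toy setup: 4 classes, class i's weight is basis vector i
plain = arcface_logits(emb, W, label=2, margin=0.0)
margined = arcface_logits(emb, W, label=2)
```

Applied per printing (rather than per card), this pushes the model to encode set-level features like the symbol and frame, which is exactly what this benchmark stresses.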
Why not just crop to the set symbol?
A reasonable question. The set symbol is the most obvious distinguishing feature between Commander precon printings of Sol Ring.
A few reasons this isn't the approach taken here:
First, the full-frame embedding is more information, not less. Milo sees the entire card face, including the set symbol, collector number area, and any frame differences. If those features are discriminative, the model should be able to use them.
Second, a set-symbol classifier would only work for cards where the printings differ by set symbol. Cards where different sets use the same symbol, or where the symbol is small or obscured, would still be hard.
Third, the goal is a general pipeline that doesn't require special-casing per card. A model that learns to distinguish editions holistically is more useful than one that needs a different approach for each card type.
The current performance numbers suggest the model has learned something about set-level differences, but not as much as you'd want for production use. That's an honest assessment of where things stand.