
Project Shoes

Detect. Classify. Track. The same three jobs, every CV pipeline. We picked a small class — shoes — and built it cleanly.

I'm Bunny. I run research and analysis at AK-mee™, and — full disclosure — I like shoes. So when we needed a bounded test target to exercise the AK-mee detection / classification / tracking stack end-to-end, the answer was easy. Shoes are visually distinct, publicly well-datasetted, and the failure modes are interesting without being open-ended. The output is a clean CV pattern we can carry into any well-scoped object class — and four real-world applications that don't need anyone's face in the frame.

Python 3.11 · PyTorch · YOLO · DETR · ByteTrack · ReID embeddings · ONNX Runtime · TorchVision

Why shoes

A computer-vision pipeline is three jobs in a trench coat: detect, classify, track. Each stage gets harder when the target class is open-ended, so we picked one on purpose. Shoes are visually distinct enough for the detector, varied enough that classification matters, and small enough on the frame that a tracker has to work for a living. Public datasets cover the space. And — Bunny full disclosure — the test corpus is nicer to look at than another round of parked cars.

What we built

Detection. A YOLO-family backbone, fine-tuned on public shoe corpora (UT Zappos50K, ShoeStyle, plus internal scrape) to answer one question per frame: where are the shoes? The head emits oriented bounding boxes with a per-detection confidence; a DETR-family variant runs alongside on the harder layouts (cluttered shelves, partial occlusion). The detector is deliberately small — input crops drive the rest of the pipeline, so latency here sets the budget for everything downstream.
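
A minimal sketch of this stage, assuming an ultralytics-style YOLO interface; the checkpoint name and threshold are illustrative, and it emits axis-aligned boxes for brevity where the production head emits oriented ones:

```python
# Detection sketch: one frame in, shoe crops out.
# Assumes an ultralytics-style YOLO interface; "shoes_yolo.pt" is a
# hypothetical fine-tuned checkpoint, not the production artefact.
from ultralytics import YOLO

model = YOLO("shoes_yolo.pt")

def detect_shoes(frame, conf_thresh=0.4):
    """Return (bbox, confidence, crop) triples for one BGR frame."""
    results = model(frame, conf=conf_thresh, verbose=False)[0]
    detections = []
    for box in results.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crop = frame[y1:y2, x1:x2]  # this crop drives classification + ReID
        detections.append(((x1, y1, x2, y2), float(box.conf[0]), crop))
    return detections
```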

Classification. A separate head consumes the detection crops and assigns form-factor (athletic / dress / casual / boot / sandal), brand family where the silhouette is distinctive, and a confidence vector. Multi-label, with thresholds tuned per use case — a retail counter wants high precision on form-factor and doesn't care about brand; sponsorship analytics is the inverse. The architecture didn't change, only the threshold policy and the labelled data behind it.
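
The per-use-case tuning is a threshold table in front of a sigmoid, not a new model. A minimal sketch — class names match the form-factors above; the numbers are illustrative:

```python
# Per-class threshold policy for the multi-label head. Threshold values
# are illustrative; the point is that precision/recall is tuned per
# deployment by swapping the table, not retraining the model.
import torch

FORM_FACTORS = ["athletic", "dress", "casual", "boot", "sandal"]

THRESHOLDS = {"athletic": 0.60, "dress": 0.70, "casual": 0.55,
              "boot": 0.65, "sandal": 0.65}

def classify(logits: torch.Tensor) -> dict[str, float]:
    """Multi-label: sigmoid per class, keep classes above their own bar."""
    probs = torch.sigmoid(logits)
    return {name: float(p) for name, p in zip(FORM_FACTORS, probs)
            if float(p) >= THRESHOLDS[name]}
```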

Tracking. ByteTrack does the heavy lifting on motion association; a small ReID head behind the detector produces an appearance embedding so the same pair of shoes keeps its track ID through a brief occlusion or an off-frame crossing. The per-track buffer holds the cleanest crops seen so far, so the classification head can vote across frames rather than commit on the worst one. End to end, on a commodity GPU, the pipeline runs comfortably above 30 FPS at 1080p.
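
A sketch of that per-track buffer, assuming variance of the Laplacian as the "cleanest crop" score (an illustrative choice) and a classify_fn that maps one crop to one label:

```python
# Per-track crop buffer: keep the sharpest crops and let the classifier
# vote across them instead of committing on one bad frame.
from collections import Counter, deque

import cv2
import numpy as np

class TrackBuffer:
    def __init__(self, maxlen: int = 8):
        self.crops = deque(maxlen=maxlen)   # (sharpness, crop) pairs

    def add(self, crop: np.ndarray) -> None:
        # Variance of the Laplacian as a cheap blur/sharpness proxy.
        gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
        sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
        self.crops.append((sharpness, crop))

    def vote(self, classify_fn) -> str:
        """Classify the cleanest crops and take the majority label."""
        best = sorted(self.crops, key=lambda t: t[0], reverse=True)[:5]
        votes = Counter(classify_fn(crop) for _, crop in best)
        return votes.most_common(1)[0][0]
```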

Training and the data we used

Training data is the obvious lever and the obvious trap. We built on public shoe corpora — UT Zappos50K for fine-grained product imagery, ShoeStyle for category labels, plus a curated scrape pinned to a snapshot date so results are reproducible. Augmentation aimed at the realistic failure modes: motion blur, partial occlusion, harsh side-lighting, oblique angles, glare on glossy uppers. The classification head is trained with a per-class threshold curve rather than a single argmax — so a deployment can dial precision up where it costs to be wrong (sponsorship analytics, QA reject lanes) and trade for recall where it doesn't (foot-traffic counts). The ReID head is trained with metric learning, not classification — it doesn't need to know what the shoe is, only that two crops show the same shoe.
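
A sketch of what "metric learning, not classification" means in training code, assuming a triplet setup over the ReID head's L2-normalised 128-d embeddings; the margin is an illustrative value:

```python
# Cosine-distance triplet loss: pull two crops of the same physical
# shoe together, push a crop of a different shoe away. No class labels
# are involved; only same-shoe / different-shoe pairs.
import torch
import torch.nn.functional as F

triplet = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1.0 - F.cosine_similarity(a, b),
    margin=0.3,  # illustrative margin
)

def reid_loss(anchor: torch.Tensor, positive: torch.Tensor,
              negative: torch.Tensor) -> torch.Tensor:
    """anchor/positive: crops of the same shoe; negative: a different shoe."""
    a, p, n = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    return triplet(a, p, n)
```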

Where this goes — four real applications

Retail foot-traffic analysis

Anonymous shoe-class detection at a store entrance characterises shopper composition — athletic vs. dress vs. casual, brand-family distribution, dwell — without ever touching a face. A privacy-respecting alternative to facial-recognition foot-traffic systems, because nothing the camera looks at uniquely identifies a person.

Sports & sponsorship analytics

In broadcast video, who is wearing which silhouette, for how many seconds, in which camera shot. Tracker IDs persist through cuts and re-acquisitions, so a brand can buy a screen-time report grounded in actual frames-on-feet rather than estimated impressions.

Venue lost-and-found

A claimant uploads a phone photo; the system searches the venue's lost-and-found bin via a ReID embedding match. Classification narrows the search space (athletic / dress / kids / boot), embedding similarity ranks within the bucket. Reduces a manual rummage to a five-result page.
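
A sketch of the ranking step, assuming the bin's contents were photographed at intake and their 128-d embeddings stored; array names and the result count are illustrative:

```python
# Lost-and-found search: classification narrows the bucket, cosine
# similarity ranks inside it.
import numpy as np

def search(query_emb: np.ndarray, bin_embs: np.ndarray, top_k: int = 5):
    """query_emb: (128,) from the claimant's photo; bin_embs: (N, 128)
    embeddings of items already photographed into the bin."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bin_embs / np.linalg.norm(bin_embs, axis=1, keepdims=True)
    sims = b @ q                      # cosine similarity, shape (N,)
    return np.argsort(-sims)[:top_k]  # indices of the five best matches
```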

Manufacturing-line QA

Cameras over the conveyor flag misaligned, mispaired, or visibly defective footwear before it ships. The same detection head, fine-tuned on production-line lighting; classification narrowed to the SKUs running that shift; tracker continuity to count units and route them to the reject lane.

Pipeline

Project Shoes detection / classification / tracking pipeline (diagram, summarised in text): a camera frame enters a detection model that emits bounding-box crops. Crops feed a classification head (form-factor, style, brand) and a ReID embedding head. The tracker associates detections across frames using motion and appearance, producing stable track IDs that flow into a downstream database or API consumer.

  • Camera frame — RGB · 1080p · 30 fps · RTSP / file / V4L2 · any consumer or IP camera.
  • Detection — YOLO-family backbone; DETR variant on cluttered scenes; output { bbox, conf, crop }; trained on public datasets; ONNX · TorchScript export.
  • Classification head — form-factor · style · brand; multi-label with per-class thresholds; votes across the per-track crop buffer; output { class, conf, top-k }.
  • ReID embedding head — 128-d appearance vector; cosine distance for ID re-acquisition; survives brief occlusion.
  • Tracker — ByteTrack (Kalman + IoU) with embedding-aware association; stable track IDs across frames.
  • Output — REST API · JSONL · Parquet · live dashboard; records of { track_id, class, conf, bbox, t, embedding }.
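
For concreteness, the shape of one output record as it lands in the JSONL stream — field names follow the diagram; every value below is illustrative:

```python
# One Project Shoes output record as a JSON line.
import json

record = {
    "track_id": 17,
    "class": "athletic",
    "conf": 0.91,
    "bbox": [412, 640, 508, 722],   # x1, y1, x2, y2 in pixels
    "t": 1714659301.42,             # frame timestamp, seconds
    "embedding": [0.013, -0.088],   # 128-d in practice; truncated here
}
print(json.dumps(record))
```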

Lessons

  • A bounded class is the cheat code. Shoes were chosen on purpose. Visually distinct, well-datasetted publicly, with failure modes that are interesting without being open-ended. A constrained scope let us iterate the full pipeline without any one stage starving the others. The architecture is what generalises.
  • Detection and classification want different cameras. The detector is happy with motion, occlusion, and ugly lighting; the classification head wants stillness, scale, and clean colour. Instead of forcing one frame through both, the detector emits crops and a small per-track buffer accumulates the cleanest examples for the classifier to vote across. Top-1 accuracy moved noticeably without retraining either model.
  • ReID is what makes a tracker feel intelligent. Kalman + IoU is enough until two pairs of similar shoes cross in frame. After that, you need an appearance descriptor that survives a brief disappearance — embedding distance, not just spatial overlap. A small ReID head behind the detector dropped ID-switches from a frequent annoyance to a rare event, with no change to upstream timing.
  • Public datasets get you to a demo, not to a deployment. UT Zappos50K and friends prove the pipeline runs and the heads have learned something real. They are not a substitute for the target environment. Assume a domain-shift hit when the camera is real, and budget a small labelled set from the actual deployment for fine-tuning. Plan the loop before the demo.
  • Shipping ONNX is the shape of a product. Exporting to ONNX up front — instead of leaving models as PyTorch checkpoints — paid for itself the first time someone asked whether this could run on a laptop, a Jetson, or a cloud GPU. Same artefact, three runtimes, no retraining. The portability story is the deployment story. A minimal export sketch follows this list.
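
The export sketch referenced above — the model stand-in, input shape, and opset are illustrative:

```python
# Export a trained head to ONNX once; run it on laptop, Jetson, or
# cloud GPU via ONNX Runtime without retraining.
import torch

# Stand-in for a trained head; in practice this is the detector or
# classifier checkpoint.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())
model.eval()

dummy = torch.randn(1, 3, 640, 640)  # one frame-sized example input
torch.onnx.export(
    model, dummy, "shoes_detector.onnx",
    input_names=["frame"], output_names=["features"],
    dynamic_axes={"frame": {0: "batch"}},  # same artefact, any batch size
    opset_version=17,
)
```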

Bunny — Research & Analysis · AK-mee™ 2026-05-02
