AK-mee™ · Case study · Internal R&D · Public
Detect. Classify. Track. The same three jobs, every CV pipeline. We picked a small class — shoes — and built it cleanly.
I'm Bunny. I run research and analysis at AK-mee™, and — full disclosure — I like shoes. So when we needed a bounded test target to exercise the AK-mee detection / classification / tracking stack end-to-end, the answer was easy. Shoes are visually distinct, well covered by public datasets, and the failure modes are interesting without being open-ended. The output is a clean CV pattern we can carry into any well-scoped object class — and four real-world applications that don't need anyone's face in the frame.
A computer-vision pipeline is three jobs in a trench coat: detect, classify, track. Each stage gets harder when the target class is open-ended, so we picked one on purpose. Shoes are visually distinct enough for the detector, varied enough that classification matters, and small enough on the frame that a tracker has to work for a living. Public datasets cover the space. And — Bunny full disclosure — the test corpus is nicer to look at than another round of parked cars.
Detection. A YOLO-family backbone, fine-tuned on public shoe corpora (UT Zappos50K, ShoeStyle, plus an internal scrape) to answer one question per frame: where are the shoes? The head emits oriented bounding boxes with a per-detection confidence; a DETR-family variant runs alongside on the harder layouts (cluttered shelves, partial occlusion). The detector is deliberately small — its crops drive the rest of the pipeline, so latency here sets the budget for everything downstream.
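The contract between the detector and everything downstream can be sketched in a few lines. This is a minimal illustration, not our production code: the `Detection` shape and `crops_for_downstream` helper are hypothetical, and the real head emits oriented boxes where this sketch uses axis-aligned ones for brevity.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One per-frame detection: a crop box plus confidence.
    (The real head emits oriented boxes; axis-aligned is this sketch's
    simplification.)"""
    x1: int
    y1: int
    x2: int
    y2: int
    conf: float

def crops_for_downstream(detections, min_conf=0.5):
    """Keep only confident detections; their boxes become the crops
    that feed classification and ReID."""
    return [(d.x1, d.y1, d.x2, d.y2) for d in detections if d.conf >= min_conf]

dets = [Detection(10, 20, 60, 90, 0.92), Detection(0, 0, 5, 5, 0.31)]
print(crops_for_downstream(dets))  # only the confident box survives
```

The point of the sketch is the interface: everything after this line consumes crops, which is why detector latency is the whole pipeline's budget.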
Classification. A separate head consumes the detection crops and assigns form-factor (athletic / dress / casual / boot / sandal), brand family where the silhouette is distinctive, and a confidence vector. Multi-label, with thresholds tuned per use case — a retail counter wants high precision on form-factor and doesn't care about brand; sponsorship analytics is the inverse. The architecture didn't change, only the threshold policy and the labelled data behind it.
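The "same architecture, different threshold policy" idea reduces to a small lookup. A minimal sketch, assuming hypothetical policy names and labels — the actual per-use-case thresholds are tuned on labelled data, not hand-written like this:

```python
# Hypothetical threshold policies: same scores, different per-class cutoffs.
POLICIES = {
    # Retail counter: high precision on form-factor, brand ignored
    # (a threshold above 1.0 can never be met, so the label is disabled).
    "retail": {"form_factor": 0.8, "brand": 1.01},
    # Sponsorship analytics: the inverse emphasis.
    "sponsorship": {"form_factor": 0.4, "brand": 0.7},
}

def apply_policy(scores, policy_name):
    """Multi-label decision: emit every label whose score clears its
    per-class threshold under the chosen policy."""
    thresholds = POLICIES[policy_name]
    return {label: score for label, score in scores.items()
            if score >= thresholds.get(label, 0.5)}

scores = {"form_factor": 0.85, "brand": 0.75}
print(apply_policy(scores, "retail"))       # {'form_factor': 0.85}
print(apply_policy(scores, "sponsorship"))  # both labels clear their cutoffs
```

Swapping deployments means swapping a policy dict, which is exactly why the architecture never changes between use cases.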
Tracking. ByteTrack does the heavy lifting on motion association; a small ReID head behind the detector produces an appearance embedding so the same pair of shoes keeps its track ID through a brief occlusion or an off-frame crossing. The per-track buffer holds the cleanest crops seen so far on that track, so the classification head can vote across frames rather than commit on the worst one. End to end, on a commodity GPU, the pipeline runs comfortably above 30 FPS at 1080p.
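Cross-frame voting is the simplest part of the stack to show. A sketch of one reasonable aggregation rule (mean of per-frame class scores over the track buffer) — the function name and the voting rule itself are illustrative, not a claim about the production aggregator:

```python
def vote_across_frames(per_frame_scores):
    """Average per-class scores over a track's buffered crops, then
    commit to the class with the highest mean -- so one noisy frame
    can't override the rest of the track."""
    classes = per_frame_scores[0].keys()
    n = len(per_frame_scores)
    mean = {c: sum(frame[c] for frame in per_frame_scores) / n for c in classes}
    return max(mean, key=mean.get), mean

track_buffer = [
    {"athletic": 0.9, "casual": 0.1},
    {"athletic": 0.8, "casual": 0.2},
    {"athletic": 0.2, "casual": 0.8},  # one blurry frame votes the wrong way
]
label, mean = vote_across_frames(track_buffer)
print(label)  # 'athletic' -- the track consensus outvotes the bad frame
```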
Training data is the obvious lever and the obvious trap. We built on public shoe corpora — UT Zappos50K for fine-grained product imagery, ShoeStyle for category labels, plus a curated scrape pinned to a snapshot date so results are reproducible. Augmentation aimed at the realistic failure modes: motion blur, partial occlusion, harsh side-lighting, oblique angles, glare on glossy uppers. The classification head is trained with a per-class threshold curve rather than a single argmax — so a deployment can dial precision up where it costs to be wrong (sponsorship analytics, QA reject lanes) and trade for recall where it doesn't (foot-traffic counts). The ReID embedding is metric-learning, not classification — it doesn't need to know what the shoe is, only that two crops are the same shoe.
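The per-class threshold curve is operationally a lookup: given curve points measured on a validation set, pick the lowest threshold that still meets a target precision, so the deployment pays the smallest possible recall cost. A sketch under assumed data — the `(threshold, precision, recall)` points below are invented for illustration:

```python
def threshold_for_precision(curve, target_precision):
    """curve: list of (threshold, precision, recall) points from a
    validation set, precision rising with threshold. Return the lowest
    threshold that still meets the target precision -- i.e. buy the
    precision while giving up as little recall as possible."""
    eligible = [(t, p, r) for t, p, r in curve if p >= target_precision]
    if not eligible:
        return None  # this class can't hit the target at any cutoff
    return min(eligible, key=lambda tpr: tpr[0])

curve = [(0.3, 0.70, 0.95), (0.5, 0.85, 0.80), (0.7, 0.93, 0.55)]
print(threshold_for_precision(curve, 0.90))  # (0.7, 0.93, 0.55)
print(threshold_for_precision(curve, 0.80))  # (0.5, 0.85, 0.80)
```

A QA reject lane would call this with a high target; a foot-traffic counter with a low one — same classifier, different point on the same curve.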
Anonymous shoe-class detection at a store entrance characterises shopper composition — athletic vs. dress vs. casual, brand-family distribution, dwell — without ever touching a face. A privacy-respecting alternative to facial-recognition foot-traffic systems, because nothing the camera looks at uniquely identifies a person.
In broadcast video, who is wearing which silhouette, for how many seconds, in which camera shot. Tracker IDs persist through cuts and re-acquisitions, so a brand can buy a screen-time report grounded in actual frames-on-feet rather than estimated impressions.
A claimant uploads a phone photo; the system searches the venue's lost-and-found bin via a ReID embedding match. Classification narrows the search space (athletic / dress / kids / boot), embedding similarity ranks within the bucket. Reduces a manual rummage to a five-result page.
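The two-stage search above — classify to narrow the bucket, then rank by embedding similarity — can be sketched directly. The item IDs, class labels, and three-dimensional embeddings are toy stand-ins (real ReID embeddings are much higher-dimensional), and `search_bin` is a hypothetical name:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search_bin(query_class, query_emb, bin_items, k=5):
    """Classification narrows the bucket; embedding similarity ranks
    within it. bin_items: (item_id, class_label, embedding)."""
    bucket = [(i, e) for i, c, e in bin_items if c == query_class]
    ranked = sorted(bucket, key=lambda ie: cosine(query_emb, ie[1]), reverse=True)
    return [i for i, _ in ranked[:k]]

bin_items = [
    ("pair-001", "athletic", [0.9, 0.1, 0.0]),
    ("pair-002", "dress",    [0.1, 0.9, 0.0]),
    ("pair-003", "athletic", [0.7, 0.3, 0.1]),
]
print(search_bin("athletic", [0.88, 0.12, 0.0], bin_items))
# ['pair-001', 'pair-003'] -- the dress pair never enters the ranking
```

The bucket step is what turns a rummage through the whole bin into a short, ranked results page.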
Cameras over the conveyor flag misaligned, mispaired, or visibly defective footwear before it ships. The same detection head trained on production-line lighting; classification narrowed to the SKUs running that shift; tracker continuity to count units and route reject lanes.
Bunny — Research & Analysis · AK-mee™ 2026-05-02