2. About Me
• Chief Scientist at VoloMetrix
• Have a 2-year-old daughter
• Did not take me long to discover that “boys” clothing is fun, “girls”
clothing kind of sucks
5. The Data
• Downloaded image of every “toddler boys” and “toddler girls” t-shirt from
  • Carters
  • Children’s Place
  • Crazy 8
  • Gap Kids
  • Gymboree
  • Old Navy
  • Target
• 616 images of boys’ shirts and 446 images of girls’ shirts
• The goal: to build a model that predicts “boy shirt” or “girl shirt” just based
on the images!
6. Attempt #1: Colors
• Each image is a collection of RGB pixels
• There are 256 * 256 * 256 ~ 17 million possible colors (too many)
• Bucket each of R, G, B into [0,85), [85,170), or [170,256)
• This gives 3 * 3 * 3 = 27 possible colors
• Use features “does image contain at least one pixel of color j?”
• Train logistic regression model on 80% of shirts, test on other 20%
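The color features above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the talk: the names `bucket`, `color_index`, and `color_features` are hypothetical, and it assumes an image has already been loaded as a sequence of (R, G, B) tuples with channel values in 0..255.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # assumed layout: (R, G, B), each channel in 0..255


def bucket(channel: int) -> int:
    """Map one channel value to bucket 0, 1, or 2 (low, middle, high)."""
    # 255 // 85 == 3, so clamp it so that 255 lands in the top bucket.
    return min(channel // 85, 2)


def color_index(pixel: Pixel) -> int:
    """Combine the three channel buckets into one of 3 * 3 * 3 = 27 color ids."""
    r, g, b = pixel
    return 9 * bucket(r) + 3 * bucket(g) + bucket(b)


def color_features(pixels: Iterable[Pixel]) -> List[int]:
    """features[j] is 1 if at least one pixel falls in color j, else 0."""
    features = [0] * 27
    for pixel in pixels:
        features[color_index(pixel)] = 1
    return features
```

Each shirt then becomes a vector of 27 zeros and ones, which is the input the logistic regression would be trained on; the 80/20 split and the model fitting are left out of this sketch.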
10. Attempt #2: Eigenshirts
• To compare images, rescale all of them to 138 x 138
• Chose this size because many were 138 x 138 already
• Others mostly bigger
• Using R, G, B as coordinates for each pixel, think of each image as a
point in 138 * 138 * 3 = 57,132-dimensional space
• Obviously, with 57k features and only about 1,000 shirts (616 + 446 = 1,062), this will overfit
• Use dimensionality reduction to find the 10 most “interesting”
dimensions, project shirts into 10-d subspace, build model there
• Each subspace dimension determines a (Platonic ideal) “eigenshirt”
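A rough sketch of the projection step, assuming the dimensionality reduction is plain PCA computed with NumPy's SVD (the slides only say "dimensionality reduction", so the exact method is an assumption). The function name `eigenshirts` and the random stand-in images are hypothetical; real input would be the rescaled 138 x 138 RGB shirts.

```python
import numpy as np


def eigenshirts(images: np.ndarray, n_components: int = 10):
    """PCA sketch. images has shape (n_shirts, 138, 138, 3).

    Returns (mean, components, projected), where each row of
    components is one flattened "eigenshirt".
    """
    n = images.shape[0]
    # Flatten every image into one long row of 138 * 138 * 3 = 57,132 values.
    X = images.reshape(n, -1).astype(np.float64)
    mean = X.mean(axis=0)
    centered = X - mean
    # Thin SVD: the rows of vt are orthonormal principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Coordinates of each shirt in the n_components-dimensional subspace.
    projected = centered @ components.T
    return mean, components, projected


# Random stand-in data (NOT real shirts), just to show the shapes involved.
rng = np.random.default_rng(0)
fake_shirts = rng.integers(0, 256, size=(20, 138, 138, 3))
mean, components, projected = eigenshirts(fake_shirts)
```

Reshaping a row such as `components[0]` back to (138, 138, 3) gives something that can be viewed as an image, and the rows of `projected` are the 10 features a classifier would be built on.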
16. Future Directions
• Look at text on shirt (but too lazy to transcribe it)
• Try to make images same size / background color
• Build model to predict how “fun” a shirt is (but will require tedious
hand-labeling)
• ??
17. More info
• Code (but not data) is on https://github.com/joelgrus/shirts
• Two blog posts on joelgrus.com, both linked from the github README
(or Google them, they have the same title as this talk)
• Follow me on twitter: @joelgrus