Deep ML Architecture at Wildcard: At Wildcard, we develop technologies for a future native mobile web experience built around cards. Cards are a new UI paradigm for content on mobile, for which we schematize unstructured web content. Part of the challenge is developing an understanding of online content through machine learning algorithms. The extracted information is used to create cards that are surfaced in the Wildcard iOS app and in other card ecosystems. I will describe the challenge and how we structure the content-extraction problem as a deep architecture of classification and optimization algorithms that combines traditionally factorized sub-problems of content extraction, allowing the various stages to inform each other. The talk includes an overview of the data and features we use and of our training strategy with a partly human-powered labeling system. This ML system, called sic, runs in production, and I will show how we use either only fast features or a mix of fast and slow features depending on the use case in the app.
2. I am a Data Scientist at Wildcard. We launched last month and were featured in the App Store as “Best New App”. We are looking to grow our data team.
3. Wildcard
• founded in 2013
• we develop technologies for a future native mobile web experience built around cards
• Cards: a new UI paradigm for content on mobile, for which we schematize unstructured web content; surfaced in the Wildcard iOS app and in other card ecosystems
5. ML Challenge
• Extract online content through ML. The micro-service described in this talk powers 54% of cards in Wildcard.
url → {“title”: “…
6. Dataset
• Scrape articles from a diverse set of sources.
• Custom labeling tools based on Databench: databench.trivial.io.
7. Labeling Tools: Tree Based and Visual
• cross-matched labels between the tools
• in-house labeling sessions before handing off to offshore labelers (usability)
• labels are assigned to page elements
10. Features
• Text properties: length, capitalization, special characters, numbers, whether the first 20 characters are identical to the page’s meta title, … (sketch below)
• BoW text: bag-of-words of the visible text
• BoW meta: bag-of-words of CSS classes and other non-visible information inside HTML tags
• HTML tag
• Optional info from emulation: (x, y), (w, h), font-family, font-size, font-weight, …
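A rough sketch of the per-element text-property features; the field names and the meta-title comparison below are illustrative assumptions, not Wildcard’s exact production feature set:

import re

def text_features(text, meta_title=""):
    # illustrative per-element text properties (not the exact production feature set)
    return {
        "length": len(text),
        "n_capitalized_words": sum(1 for w in text.split() if w[:1].isupper()),
        "n_special_chars": len(re.findall(r"[^\w\s]", text)),
        "n_digits": sum(c.isdigit() for c in text),
        "first_20_chars_match_meta_title": text[:20] == meta_title[:20],
    }

print(text_features("Amtrak Train Derails in Philadelphia",
                    meta_title="Amtrak Train Derails in Philadelphia - NYMag"))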
11. Pipeline
• Parallelized document processing into features using Apache Spark; starts from a list of URLs.
• Scrapes web pages.
• Constructs the Content Tree.
• Matches labels.
• Filters for quality.
• Need the same processing for a single web page, but with low latency and small resource requirements:
  → pysparkling: a pure-Python implementation of Spark’s RDD interface (sketch below)
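A minimal sketch of this processing step against the RDD interface; because pysparkling mirrors SparkContext, the same code can run with PySpark on a cluster. The URLs and the body of scrape_and_featurize are placeholders:

from pysparkling import Context  # swap in pyspark.SparkContext to run on a cluster

def scrape_and_featurize(url):
    # placeholder for: scrape page, build content tree, match labels, emit feature rows
    return {"url": url, "features": {}}

sc = Context()  # pure Python, no JVM required
urls = ["http://example.com/article-1", "http://example.com/article-2"]
rows = sc.parallelize(urls).map(scrape_and_featurize).collect()
print(rows)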
12. pysparkling
• interface compatible with SparkContext and RDD, but no dependence on the JVM
• pysparkling.fileio can access local files, S3, HTTP, HDFS with a load-dump interface
• used in the Python micro-service endpoint that applies the scikit-learn classifiers
• used in labeling and evaluation tools and for local development
• used in dataset preparation tools (train-test split, split urls by domain, …); see the sketch below
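As one concrete illustration of the dataset-preparation use case, a small sketch that groups labeled URLs by domain, so a train-test split never shares a domain; the file name is a placeholder:

from pysparkling import Context

sc = Context()
urls = sc.textFile("urls.txt")  # s3://, http://, hdfs:// paths also work via pysparkling.fileio
by_domain = (urls
             .map(lambda u: (u.split("/")[2], u))  # key each URL by its domain
             .groupByKey()
             .collect())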
13. Pipeline II
• single-machine Random Forest training (sketch below)
• “256GB ought to be enough for anybody” (for machine learning) - Andreas Mueller
• multithreaded, fast
• use provided structured data (e.g. meta tags) as much as possible
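The training step could look roughly like this in scikit-learn; the feature matrix, labels, and hyperparameters below are placeholders, and the only points taken from the slide are that training happens on a single large-memory node with multiple threads:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# placeholders for the per-element feature matrix and element labels
X = np.random.rand(10000, 50)
y = np.random.choice(["title", "author", "body", "navigation"], size=10000)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)  # n_jobs=-1: use all cores
clf.fit(X, y)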
14. Architecture
15. ML Algorithms
Tough luck with Structured Learning
http://scikit-learn.org/stable/tutorial/machine_learning_map/
16. Algorithm: zeroth order
(diagram: page elements, addressed by XPaths such as /html/body/div[2]/div/div/div/ul/li[5], /html/body/div[3]/h1 and /html/body/div[3]/span, are each classified by a scikit-learn RandomForest into labels like “navigation”, “title”, “author”)
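In other words, the zeroth-order model labels every page element independently of its neighbours; a toy, self-contained sketch (the features and labels are made up for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy stand-ins for per-element feature rows, e.g. [text length, matches meta title]
X_train = np.array([[30, 1], [8, 0], [300, 0]])
y_train = np.array(["title", "navigation", "body"])
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# zeroth order: classify each element of a new page on its own,
# with no document-level consistency between the predictions
X_page = np.array([[28, 1], [7, 0], [250, 0]])
print(clf.predict(X_page))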
19. Requirements
• text-density-based labeling is too rigid: we want to extend this to types other than news articles
• clustering is too noisy:
  • ads in between paragraphs
  • cannot “cluster” authors after titles
• CRF: complexity beyond linear-chain CRFs grows too quickly
• want a “single step” process: multi-step algorithms erase information.
  Example: if the first step is to remove ads, then the second step cannot use information about ads to infer content.
20. First Attempt: Hypothesis Generation using Sampling
• start from a guess (using a zeroth-order-type classifier)
• generate variations of that guess with a proposal function
• evaluate an objective function based on a document-wide likelihood of the classification probabilities (simplified sketch below)
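A simplified, self-contained sketch of that idea. The proposal function, the document-wide likelihood, and the greedy accept rule below are stand-ins that illustrate the structure, not the production objective:

import random

LABELS = ["title", "author", "body", "navigation", "other"]

def propose(assignment):
    # flip the label of one randomly chosen page element
    new = list(assignment)
    i = random.randrange(len(new))
    new[i] = random.choice(LABELS)
    return new

def document_likelihood(assignment, probs):
    # stand-in objective: product of per-element classifier probabilities,
    # penalizing documents with more than one "title"
    score = 1.0
    for element_probs, label in zip(probs, assignment):
        score *= element_probs[label]
    if sum(1 for l in assignment if l == "title") > 1:
        score *= 0.01
    return score

def search(initial, probs, n_steps=1000):
    best, best_score = initial, document_likelihood(initial, probs)
    for _ in range(n_steps):
        candidate = propose(best)
        score = document_likelihood(candidate, probs)
        if score > best_score:
            best, best_score = candidate, score
    return best

# usage with classifier probabilities for a three-element page
probs = [{"title": 0.6, "author": 0.1, "body": 0.1, "navigation": 0.1, "other": 0.1},
         {"title": 0.2, "author": 0.5, "body": 0.1, "navigation": 0.1, "other": 0.1},
         {"title": 0.1, "author": 0.1, "body": 0.6, "navigation": 0.1, "other": 0.1}]
print(search(["other", "other", "other"], probs))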
21. First Attempt: Hypothesis Generation using Sampling
(diagram: per-element label hypotheses (“navigation”, “title”, “author”) are varied and re-scored by sampling over the whole page)
22. First Attempt: Hypothesis Generation using Sampling
• decent results
• training coverage questionable
• slow inference
23. Second Attempt: “Deep Learning Inspired”
• Borrow ideas from “scene description”, traditionally done with scene graphs and CRFs.
• With Deep Learning, one can avoid building a graph and go straight to assigning a label to every pixel.
Clément Farabet, 2011: http://www.clement.farabet.net/research.html#parsing
24. Second Attempt: “Deep Learning Inspired”
(diagram, shown on slides 24 and 25: each page element first receives a provisional label (“navigation”, “title”, “author”) and is then re-labeled in a second pass over the same elements)
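One way to read the two-layer diagram, stated here as an assumption rather than the confirmed production design: a second classifier re-labels each element using the first pass’s predicted probabilities for that element and its neighbours as extra features, so the whole process stays a single feed-forward pass. A toy sketch:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def add_context(probs):
    # append each element's first-pass probabilities and those of the
    # elements directly before and after it in document order
    zeros = np.zeros_like(probs[:1])
    padded = np.vstack([zeros, probs, zeros])
    return np.hstack([probs, padded[:-2], padded[2:]])

# toy data: 6 page elements, 4 raw features each
X = np.random.rand(6, 4)
y = np.array(["title", "body", "body", "navigation", "author", "body"])

first = RandomForestClassifier(random_state=0).fit(X, y)
p1 = first.predict_proba(X)            # first feed-forward pass

X2 = np.hstack([X, add_context(p1)])   # raw features plus first-pass context
second = RandomForestClassifier(random_state=0).fit(X2, y)
print(second.predict(X2))              # second feed-forward pass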
26. Feed forward process is much faster
• Processing time dropped by an order of magnitude.
• No significant degradation in quality.
• Training: from URLs ~2 hours; with cached external calls <1 hour.
• Introduced: Forward Model, Bucket Model for Load.
27. Business-visible Successes
• embedded media content: Twitter cards, Instagram posts, Facebook posts, Facebook videos, and YouTube videos
• on the right, a New York Magazine article on the train crash in Philadelphia: http://nymag.com/daily/intelligencer/2015/05/amtrak-train-derails-philadelphia.html
28. Business-visible Successes (preliminary)
• enabling domains that require JavaScript emulation (e.g. websites built purely with AngularJS)
• fixed individual publishers with high visibility in our app
• comparison to the competition: third party 71-82%, in-house 83% +/- 4%
29. Summary
• dataset creation, processing pipeline, content tree creation, evaluation tools, labeling tools, and training and inference strategies implemented over the past year
• chose tools that allow quick iteration: simple processing in parallel, ML on a single node
• two open source projects:
  databench.trivial.io (pip install databench)
  pysparkling.trivial.io (pip install pysparkling)
• competitive performance: 54% of cards in Wildcard are powered by pure ML
@svenkreiss