Deep ML Architecture at Wildcard: At Wildcard, we develop technologies for a future native mobile web experience built around cards. Cards are a new UI paradigm for content on mobile, for which we schematize unstructured web content. Part of the challenge is developing an understanding of online content through machine learning algorithms. The extracted information is used to create cards that are surfaced in the Wildcard iOS app and in other card ecosystems. I will describe the challenge and how we structure the content-extraction problem as a deep architecture of classification and optimization algorithms that combines traditionally factorized sub-problems of content extraction, allowing the various stages to inform each other. The talk includes an overview of the data and features we use and of our training strategy with a partly human-powered labeling system. This ML system, called sic, runs in production, and I will show how we use either only fast features or a mix of fast and slow features depending on the use case in the app.
2. I am a Data Scientist at Wildcard. We launched last month and were featured in the App Store as “Best New App”. We are looking to grow our data team.
3. Wildcard
• founded in 2013
• we develop technologies for a future native mobile web experience built around cards
• Cards: a new UI paradigm for content on mobile, for which we schematize unstructured web content; surfaced in the Wildcard iOS app and in other card ecosystems
5. ML Challenge
• Extract online content through ML. The micro-service described in this talk powers 54% of cards in Wildcard.
url → {“title”: “…
6. Dataset
• Scrape articles from a diverse set of sources.
• Custom labeling tools based on Databench: databench.trivial.io.
7. Labeling Tools: Tree Based and Visual
• cross-matched labels between the tools
• in-house labeling sessions before handing off to offshore labelers (usability)
• labels are assigned to page elements
10. Features
• Text properties: length, capitalization, special characters, numbers, whether the first 20 characters are identical to the page’s meta title, … (sketch below)
• BoW text: bag-of-words of the visible text
• BoW meta: bag-of-words of CSS classes and other non-visible information inside HTML tags
• HTML tag
• Optional info from emulation: (x, y), (w, h), font-family, font-size, font-weight, …
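A rough sketch of the per-element text-property features; the field names and the meta-title comparison below are illustrative assumptions, not Wildcard’s exact production feature set:

import re

def text_features(text, meta_title=""):
    # illustrative per-element text properties (not the exact production feature set)
    return {
        "length": len(text),
        "n_capitalized_words": sum(1 for w in text.split() if w[:1].isupper()),
        "n_special_chars": len(re.findall(r"[^\w\s]", text)),
        "n_digits": sum(c.isdigit() for c in text),
        "first_20_chars_match_meta_title": text[:20] == meta_title[:20],
    }

print(text_features("Amtrak Train Derails in Philadelphia",
                    meta_title="Amtrak Train Derails in Philadelphia - NYMag"))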
11. Pipeline
• Parallelized document processing into features using Apache Spark; starts from a list of URLs.
• Scrapes web pages.
• Constructs the Content Tree.
• Matches labels.
• Filters for quality.
• Need the same processing for a single web page, but with low latency and small resource requirements:
  → pysparkling: a pure-Python implementation of Spark’s RDD interface (sketch below)
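A minimal sketch of this processing step against the RDD interface; because pysparkling mirrors SparkContext, the same code can run with PySpark on a cluster. The URLs and the body of scrape_and_featurize are placeholders:

from pysparkling import Context  # swap in pyspark.SparkContext to run on a cluster

def scrape_and_featurize(url):
    # placeholder for: scrape page, build content tree, match labels, emit feature rows
    return {"url": url, "features": {}}

sc = Context()  # pure Python, no JVM required
urls = ["http://example.com/article-1", "http://example.com/article-2"]
rows = sc.parallelize(urls).map(scrape_and_featurize).collect()
print(rows)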
12. pysparkling
• interface compatible with SparkContext and RDD, but no dependence on the JVM
• pysparkling.fileio can access local files, S3, HTTP, HDFS with a load-dump interface
• used in the Python micro-service endpoint that applies the scikit-learn classifiers
• used in labeling and evaluation tools and for local development
• used in dataset preparation tools (train-test split, split urls by domain, …); see the sketch below
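As one concrete illustration of the dataset-preparation use case, a small sketch that groups labeled URLs by domain, so a train-test split never shares a domain; the file name is a placeholder:

from pysparkling import Context

sc = Context()
urls = sc.textFile("urls.txt")  # s3://, http://, hdfs:// paths also work via pysparkling.fileio
by_domain = (urls
             .map(lambda u: (u.split("/")[2], u))  # key each URL by its domain
             .groupByKey()
             .collect())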
13. Pipeline II
• single-machine Random Forest training (sketch below)
• “256GB ought to be enough for anybody” (for machine learning) - Andreas Mueller
• multithreaded, fast
• use provided structured data (e.g. meta tags) as much as possible
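The training step could look roughly like this in scikit-learn; the feature matrix, labels, and hyperparameters below are placeholders, and the only points taken from the slide are that training happens on a single large-memory node with multiple threads:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# placeholders for the per-element feature matrix and element labels
X = np.random.rand(10000, 50)
y = np.random.choice(["title", "author", "body", "navigation"], size=10000)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)  # n_jobs=-1: use all cores
clf.fit(X, y)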
14. Architecture
15. ML Algorithms
Tough luck with Structured Learning
http://scikit-learn.org/stable/tutorial/machine_learning_map/
16. Algorithm: zeroth order
(diagram: page elements, addressed by XPaths such as /html/body/div[2]/div/div/div/ul/li[5], /html/body/div[3]/h1 and /html/body/div[3]/span, are each classified by a scikit-learn RandomForest into labels like “navigation”, “title”, “author”)
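In other words, the zeroth-order model labels every page element independently of its neighbours; a toy, self-contained sketch (the features and labels are made up for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy stand-ins for per-element feature rows, e.g. [text length, matches meta title]
X_train = np.array([[30, 1], [8, 0], [300, 0]])
y_train = np.array(["title", "navigation", "body"])
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# zeroth order: classify each element of a new page on its own,
# with no document-level consistency between the predictions
X_page = np.array([[28, 1], [7, 0], [250, 0]])
print(clf.predict(X_page))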
19. Requirements
• text-density-based labeling is too rigid: we want to extend this to types other than news articles
• clustering is too noisy:
  • ads in between paragraphs
  • cannot “cluster” authors after titles
• CRF: complexity beyond linear-chain CRFs grows too quickly
• want a “single step” process: multi-step algorithms erase information.
  Example: if the first step is to remove ads, then the second step cannot use information about ads to infer content.
20. First Attempt: Hypothesis Generation using Sampling
• start from a guess (using a zeroth-order-type classifier)
• generate variations of that guess with a proposal function
• evaluate an objective function based on a document-wide likelihood of the classification probabilities (simplified sketch below)
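A simplified, self-contained sketch of that idea. The proposal function, the document-wide likelihood, and the greedy accept rule below are stand-ins that illustrate the structure, not the production objective:

import random

LABELS = ["title", "author", "body", "navigation", "other"]

def propose(assignment):
    # flip the label of one randomly chosen page element
    new = list(assignment)
    i = random.randrange(len(new))
    new[i] = random.choice(LABELS)
    return new

def document_likelihood(assignment, probs):
    # stand-in objective: product of per-element classifier probabilities,
    # penalizing documents with more than one "title"
    score = 1.0
    for element_probs, label in zip(probs, assignment):
        score *= element_probs[label]
    if sum(1 for l in assignment if l == "title") > 1:
        score *= 0.01
    return score

def search(initial, probs, n_steps=1000):
    best, best_score = initial, document_likelihood(initial, probs)
    for _ in range(n_steps):
        candidate = propose(best)
        score = document_likelihood(candidate, probs)
        if score > best_score:
            best, best_score = candidate, score
    return best

# usage with classifier probabilities for a three-element page
probs = [{"title": 0.6, "author": 0.1, "body": 0.1, "navigation": 0.1, "other": 0.1},
         {"title": 0.2, "author": 0.5, "body": 0.1, "navigation": 0.1, "other": 0.1},
         {"title": 0.1, "author": 0.1, "body": 0.6, "navigation": 0.1, "other": 0.1}]
print(search(["other", "other", "other"], probs))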
21. First Attempt: Hypothesis Generation using Sampling
(diagram: per-element label hypotheses (“navigation”, “title”, “author”) are varied and re-scored by sampling over the whole page)
22. First Attempt: Hypothesis Generation using Sampling
• decent results
• training coverage questionable
• slow inference
23. Second Attempt: “Deep Learning Inspired”
• Borrow ideas from “scene description”, traditionally done with scene graphs and CRFs.
• With Deep Learning, one can avoid building a graph and go straight to assigning a label to every pixel.
Clément Farabet, 2011: http://www.clement.farabet.net/research.html#parsing
24. Second Attempt: “Deep Learning Inspired”
(diagram, shown on slides 24 and 25: each page element first receives a provisional label (“navigation”, “title”, “author”) and is then re-labeled in a second pass over the same elements)
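One way to read the two-layer diagram, stated here as an assumption rather than the confirmed production design: a second classifier re-labels each element using the first pass’s predicted probabilities for that element and its neighbours as extra features, so the whole process stays a single feed-forward pass. A toy sketch:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def add_context(probs):
    # append each element's first-pass probabilities and those of the
    # elements directly before and after it in document order
    zeros = np.zeros_like(probs[:1])
    padded = np.vstack([zeros, probs, zeros])
    return np.hstack([probs, padded[:-2], padded[2:]])

# toy data: 6 page elements, 4 raw features each
X = np.random.rand(6, 4)
y = np.array(["title", "body", "body", "navigation", "author", "body"])

first = RandomForestClassifier(random_state=0).fit(X, y)
p1 = first.predict_proba(X)            # first feed-forward pass

X2 = np.hstack([X, add_context(p1)])   # raw features plus first-pass context
second = RandomForestClassifier(random_state=0).fit(X2, y)
print(second.predict(X2))              # second feed-forward pass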
26. Feed forward process is much faster
• Processing time dropped by an order of magnitude.
• No significant degradation in quality.
• Training: from URLs ~2 hours; with cached external calls <1 hour.
• Introduced: Forward Model, Bucket Model for Load.
27. Business-visible Successes
• embedded media content: Twitter cards, Instagram posts, Facebook posts, Facebook videos, and YouTube videos
• on the right, a New York Magazine article on the train crash in Philadelphia: http://nymag.com/daily/intelligencer/2015/05/amtrak-train-derails-philadelphia.html
28. Business-visible Successes (preliminary)
• enabling domains that require JavaScript emulation (e.g. websites built purely with AngularJS)
• fixed individual publishers with high visibility in our app
• comparison to the competition: third party 71-82%, in-house 83% +/- 4%
29. Summary
• dataset creation, processing pipeline, content tree creation, evaluation tools, labeling tools, and training and inference strategies implemented over the past year
• chose tools that allow quick iteration: simple processing in parallel, ML on a single node
• two open source projects:
  databench.trivial.io (pip install databench)
  pysparkling.trivial.io (pip install pysparkling)
• competitive performance: 54% of cards in Wildcard are powered by pure ML
@svenkreiss