In this session, join the Vice President of Mechanical Turk to explore how businesses are marrying human judgment with distributed data processing, improving the accuracy of Big Data analytics without sacrificing efficiency or scalability. We’ll highlight real-world examples and introduce Mike Brown, CTO of Comscore, to discuss how the combination of technologies such as Hadoop and Mechanical Turk is driving large-scale systems to cleanse and categorize business-critical data from unstructured and inconsistent data sources.
BDT102 Algorithms, Machines, and Crowdsourcing - AWS re:Invent 2012
1.
2. NASDAQ: SCOR
Clients: 2,000+ worldwide
Employees: 1,000+
Headquarters: Reston, VA
Global Coverage: 220+ countries under measurement
Local Presence: 32 locations in 23 countries
3.
4.
5. The Challenge
• Available in 7 countries: USA, Brazil, Britain, Canada, France, Germany, Spain
• 2013: Mexico and India
• Over 4B ads monthly
• 5M-10M unique new ads monthly
6. Display Ads
• Observes advertising creatives as they are encountered by the panelist
• Collects Facebook pages
• Regular and premium ads
• Extracting all this information (and more)
7. Production Hadoop Cluster
• 100 nodes
• 2276 total CPUs, 6TB total memory, 1.7PB total disk space, 1GB Ethernet
[Diagram labels: Hadoop DFS; Facebook Entity-Stream Extraction; Dictionary-Apply; Facebook Entity-Partitions; Facebook Ads; Facebook News & Profiles]
Daily: 2 Hr / 70G, 15 min / 15Gx, 30 min / 15Gx
8. Data size:
• Compressed ~ 2 TB
• Uncompressed ~ 6 TB
• Total Pages - 320M
Need to process 3,700 pages/sec…
• Avg size per page: 18 KB…
• Factor in time to collect, load to HDFS, buffer time for errors, etc…
Hadoop is used to extract entities
• Each node processes 85 pages/sec
• Daily Facebook entity extraction completes in ~2 hours
• Multi-Language Support
[Diagram labels: Client; NameNode; Hadoop-1, Hadoop-2, Hadoop-3 … Hadoop-N; HDFS; Load FB Pages; NTFS]
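The throughput figures on this slide can be sanity-checked with simple arithmetic. The sketch below assumes the 320M pages must be handled within a single 24-hour window and that "6 TB" means decimal terabytes; neither assumption is stated on the slide, and the function names are illustrative only.

```python
# Back-of-the-envelope check of the slide's figures.
TOTAL_PAGES = 320_000_000        # "Total Pages - 320M"
UNCOMPRESSED_BYTES = 6e12        # "Uncompressed ~ 6 TB" (decimal TB assumed)
SECONDS_PER_DAY = 24 * 60 * 60   # assumption: one-day processing window


def required_pages_per_sec(total_pages, window_seconds):
    """Sustained rate needed to get through all pages within the window."""
    return total_pages / window_seconds


def avg_page_size_kb(total_bytes, total_pages):
    """Average uncompressed page size in kilobytes (1 KB = 1000 bytes)."""
    return total_bytes / total_pages / 1000


rate = required_pages_per_sec(TOTAL_PAGES, SECONDS_PER_DAY)
size_kb = avg_page_size_kb(UNCOMPRESSED_BYTES, TOTAL_PAGES)

print(f"{rate:,.0f} pages/sec")  # 3,704 pages/sec
print(f"{size_kb:.2f} KB")       # 18.75 KB
```

Under these assumptions the results line up with the slide's rounded values of 3,700 pages/sec and 18 KB per page.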
9. AdMetrix:
• Total Ads: 85M
• Ads per Ad-page: 3.7
Social Essentials:
• Total news items: 351M
10. Ad-Volume
• 6M unique new ads monthly
?
Advertiser-Space (Product Dictionary)
• Over 56K companies
• Over 100K company/brand pairs
Problem: correctly, quickly, inexpensively
11. OCR-based / Image-Recognition-based
Pros
• Potentially applicable to all non-Facebook online ads
Cons
• Low Accuracy
• Low Coverage
• Difficult to scale and maintain for huge daily data volume
12. • Classify ads to cover ~80% of impressions
• Automated Classification:
    • Destination URL
    • Title
• Currently classifying 7-20% of new ads
    • no associated text for ad
    • new advertiser
    • multi-advertiser ads
    • new brand, movie
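As a rough illustration of automated classification from the destination URL and title, here is a minimal sketch. The dictionary contents, the matching rules, and the names `DOMAIN_TO_BRAND`, `KEYWORD_TO_BRAND`, and `classify_ad` are all hypothetical; the slides do not describe how the actual classifier works.

```python
from urllib.parse import urlparse

# Hypothetical product dictionary entries: (company, brand) pairs keyed
# by destination domain and by a keyword expected in the ad title.
DOMAIN_TO_BRAND = {
    "example-shoes.com": ("Example Shoes Inc.", "Example Runner"),
}
KEYWORD_TO_BRAND = {
    "example runner": ("Example Shoes Inc.", "Example Runner"),
}


def classify_ad(destination_url, title):
    """Return a (company, brand) pair, or None if no rule matches."""
    # Rule 1: match the destination URL's host against known domains.
    host = (urlparse(destination_url or "").hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in DOMAIN_TO_BRAND:
        return DOMAIN_TO_BRAND[host]

    # Rule 2: look for a known brand keyword in the ad title.
    lowered = (title or "").lower()
    for keyword, pair in KEYWORD_TO_BRAND.items():
        if keyword in lowered:
            return pair

    # No match: e.g. no associated text, or a new advertiser or brand.
    return None
```

An ad for which `classify_ad` returns `None` corresponds to the cases listed on the slide that automation does not cover, and would be a candidate for human review.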
13. Classify ads for
[Flowchart labels: Ads; Turk-Classification to Product-Names (three stacked boxes); decision "New Product?" with branches No (Prod Name) and Yes; Turk-Identification of Company-Name, URL, Category (three stacked boxes)]
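One way to picture the flow on this slide is a two-stage routing step: workers label an ad with a product name, and only products missing from the dictionary go on to company identification. The sketch below assumes that each ad is labeled by several workers and that answers are combined by majority vote; the slides do not state this, and `KNOWN_PRODUCTS`, `consensus`, and `route` are hypothetical names.

```python
from collections import Counter

# Hypothetical set of product names already in the product dictionary.
KNOWN_PRODUCTS = {"example runner"}


def consensus(answers, min_agree=2):
    """Return the normalized answer given by at least min_agree workers."""
    counts = Counter(a.strip().lower() for a in answers if a and a.strip())
    if not counts:
        return None
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= min_agree else None


def route(answers):
    """Decide the next step for an ad from the workers' product names."""
    product = consensus(answers)
    if product is None:
        # Workers disagreed: send the ad back for another round.
        return ("resubmit", None)
    if product in KNOWN_PRODUCTS:
        # "New Product?" -> No: the ad is classified to a known product.
        return ("classified", product)
    # "New Product?" -> Yes: ask workers for company name, URL, category.
    return ("identify_company", product)
```

For example, `route(["Example Runner", "example runner ", "Other"])` yields `("classified", "example runner")`, while three mutually different answers yield `("resubmit", None)`.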