BigData @ comScore
- 2. comScore is a Global Leader in Measuring the Digital World
NASDAQ SCOR
Clients 1600+ worldwide
Employees 1,000+
Headquarters Reston, VA
Global Coverage 170+ countries under measurement; 43 markets reported
Local Presence 30+ locations in 21 countries
2© comScore, Inc. Proprietary.
V0910
- 3. Broad Client Base and Deep Expertise Across Key Industries
Media Agencies Telecom/Mobile Financial Retail Travel CPG Pharma Technology
- 4. The Trusted Source for Digital Intelligence Across Vertical Markets
4 out of the top 4 WIRELESS CARRIERS
9 out of the top 10 INVESTMENT BANKS
9 out of the top 10
9 out of the top 10 INTERNET SERVICE PROVIDERS
9 out of the top 10 AUTO INSURERS
47 out of the top 50 ONLINE PROPERTIES
45 out of the top 50 ADVERTISING AGENCIES
9 out of the top 10 MAJOR MEDIA COMPANIES
9 out of the top 10 PHARMACEUTICAL COMPANIES
9 out of the top 10 CONSUMER FINANCE COMPANIES
9 out of the top 10 CPG COMPANIES
- 5. comScore History of Leadership and Innovation
1st:
– To measure the search market
– To measure video streaming
– To provide behavioral ad effectiveness
– To meter mobile user behavior
– To unify census + panel measurement
– To build and project from a 2 million+ longitudinal panel
– To monitor and report e-commerce data
– To deliver a worldwide Internet audience measurement
Global Shaper Company 2010
- 6. Average Records Captured per Day (2005-2009)
[Chart: average records captured per day, 2005-2009; y-axis from 0 to 1,800,000,000 records]
- 7. Launching the 3rd Generation
In 2009, in the midst of the recession, comScore decided to build and release its 3rd-generation product: Unified Digital Measurement (UDM, or Hybrid)
Technology Goals
– Ramp up data collection
– Deploy new methodologies for data processing and analysis
– Scale the environment linearly to support growth
– Have yesterday's data available today
And one more thing … do it in 4 months or less.
- 8. Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration
PANEL: Global PERSON Measurement
PAGE TAGS: Global MACHINE Measurement
Unified Digital Measurement (UDM)
– Patent-Pending Methodology
– Adopted by 88% of Top U.S. Media Properties
- 9. How Does the Hybrid Process Work?
– Collect traffic from PCs and devices
– Clean traffic: remove non-human traffic and bots, apply edit rules
– Apply the comScore URL Dictionary
Total Traffic → Filtered Traffic
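The cleaning step above can be sketched roughly as follows. This is a minimal illustration only: the `Hit` record layout, the `BOT_MARKERS` list, and the user-agent test are all assumptions made for the example, since the slides do not describe comScore's actual edit rules.

```python
from dataclasses import dataclass

# Hypothetical record layout; the slides do not specify the real one.
@dataclass(frozen=True)
class Hit:
    machine_id: str
    user_agent: str
    url: str

# Assumed markers of non-human traffic, for illustration only.
BOT_MARKERS = ("bot", "crawler", "spider")

def is_human(hit: Hit) -> bool:
    """Return True when the user agent contains none of the assumed bot markers."""
    ua = hit.user_agent.lower()
    return not any(marker in ua for marker in BOT_MARKERS)

def clean_traffic(hits):
    """Total traffic in, filtered traffic out."""
    return [h for h in hits if is_human(h)]

total_traffic = [
    Hit("m1", "Mozilla/5.0 (Windows NT 6.1)", "http://search.yahoo.com/search?p=x"),
    Hit("m2", "Googlebot/2.1", "http://www.msn.com/"),
    Hit("m3", "Mozilla/5.0 (iPhone)", "http://www.msn.com/"),
]
filtered_traffic = clean_traffic(total_traffic)
print(len(total_traffic), len(filtered_traffic))
```

In a real pipeline the filter would likely combine many rules; a single predicate keeps the total-versus-filtered distinction easy to see.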
- 10. URL Dictionary (CFD): Advertising Industry “Currency”
Intelligent grouping of properties with 7+ levels of detail
– Property (e.g., Yahoo! Properties, Microsoft Sites)
– Media Title (e.g., Yahoo!, MSN)
– Channel (e.g., Yahoo! Search, MSN Homepages)
– Subchannel (e.g., Yahoo! Image Search, MSNBC)
– Group/Subgroup (e.g., Yahoo! Calendar, Today)
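One way to picture a dictionary like this is a lookup from a host name to its place in the hierarchy, trying the most specific entry first. The sketch below is hypothetical: the level names come from the slide, but the host patterns, the `URL_DICTIONARY` table, and the suffix-matching rule are assumptions for illustration, not comScore's actual dictionary format.

```python
# Hypothetical entries: host -> (Property, Media Title, Channel).
# Level names follow the slide; the host patterns are invented for this example.
URL_DICTIONARY = {
    "yahoo.com": ("Yahoo! Properties", "Yahoo!", None),
    "search.yahoo.com": ("Yahoo! Properties", "Yahoo!", "Yahoo! Search"),
    "msn.com": ("Microsoft Sites", "MSN", None),
}

def classify(host: str):
    """Return the entry for the most specific matching host suffix, or None."""
    labels = host.lower().split(".")
    # Try the full host first, then drop leading labels one at a time.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in URL_DICTIONARY:
            return URL_DICTIONARY[candidate]
    return None

print(classify("search.yahoo.com"))
print(classify("mail.yahoo.com"))
print(classify("example.org"))
```

Hosts with no entry fall through to `None`, which mirrors the coverage gap implied by a dictionary that does not cover every page.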
- 11. URL Dictionary (CFD) Coverage Statistics
11MM unique domains on average per month in 2010
• Over 80% of pages viewed came from the top 131K domains in 2010 vs. 392K in 2009
• 2,360K patterns in January 2011 represent 85% of all pages
• 1,254K syndicated entities in January 2010
• 41K patterns added/month in 2010
- 12. Worldwide UDM™ Penetration
Percentage of Machines Included in UDM Measurement (July 2010 Penetration Data)
Europe
Austria 80%
Belgium 85%
Switzerland 84%
Germany 84%
Denmark 82%
Spain 90%
Finland 85%
France 91%
Ireland 91%
Italy 80%
Netherlands 88%
Norway 84%
Portugal 86%
Sweden 85%
United Kingdom 90%
Asia Pacific
Australia 91%
Hong Kong 88%
India 84%
Japan 73%
Malaysia 87%
New Zealand 88%
Singapore 91%
North America
Canada 94%
United States 91%
Latin America
Argentina 94%
Brazil 92%
Chile 94%
Colombia 95%
Mexico 93%
Puerto Rico 92%
Middle East & Africa
Israel 93%
South Africa 73%
- 13. Worldwide Tags per Day
[Chart: number of records per day, Jul 2009 to Feb 2011; y-axis from 0 to 25,000,000,000; series: Beacon Records, Panel Records]
- 14. Monthly Totals
[Chart: number of records per month, Jul 2009 to Feb 2011; y-axis from 0 to 600,000,000,000; series: Beacon Records, Panel Records]
- 15. High Level Data Flow
Panel ETL
Census ETL
Delivery
- 16. Enterprise Data Warehouse : Sybase IQ 15.2 Multiplex
EDW currently comprises 20 servers running Windows 2003 R2 x64
– Currently 220 Intel CPUs
– Dedicated EDW technical team of 3 DBAs and 1 administrator
– Ability to grow compute capacity and storage capacity independently
EDW data repository housed on both EMC VMAX and CLARiiON
– 4 EDW instances (2 in Virginia and 2 in Illinois)
– One EDW instance is 147TB usable (approx. 200TB of raw data)
– Production EDW drive layout:
416 x 1TB SATA, RAID6, 14+2
42 x 600GB 15K, RAID1
8 x 400GB Flash, RAID5, 7+1
Current Capacity and Performance Metrics
– 1,835,412,793,799 rows loaded
– 140TB in 14,168 tables
– Capable of loading 56 billion rows per hour
- 17. Subsystem
System designed using multiple subsystems
Components can easily be taken out and replaced as demands change
In some cases (first-stage tag processing), moved from a single server to a cluster of servers in a few months
Periodically redesign different subsystems to support increased processing demands
Many systems are on their third generation of technology
- 18. Homegrown Distributed Processing
2002 – comScore distributed processing framework
– Reduced core aggregation from 48 hours to 7 hours
– Reduced final product creation from 24 hours to 2 hours
Scalability Wall
Open Source Hadoop framework
- 19. Greenplum
Greenplum MPP
– 80-node cluster: 1 master; 6 ETL; 72 workers
– Dell R510 servers with 12 x 600GB 15K drives (RAID), 64GB RAM, 24 cores (HT)
– Supports analytic end users with access to record-level data through a SQL interface
– Ability to load over 400 billion rows in 8 hours
– Hourly data loading in place
– Allows analysts to mine the data for business uses
– Used for quick analysis of raw event data and for the ideation and creation of new products
- 20. Hadoop
Hadoop
– Dev: 6x Dell 2950 w/ 6 1TB
– Prod: 10x Dell R710 w/ 6 600GB
– Prod in 2 weeks: 10x Dell R710 w/ 6 600GB & 20x Dell R510 w/ 12 2TB
– Moving large processing jobs that are constrained by our current framework to Hadoop. Some large analytical runs currently take over 40 hours on 32 servers, and we are re-engineering them to reduce processing time.
– We have found that the Fair Scheduler works well for our job loads
– We use a “homegrown” workflow system (BORG) that manages tasks inside and outside Hadoop.
- 21. Sharding
Sharding divides work across multiple systems using different mechanisms
Shard data as far upstream as possible
Breaking data into multiple chunks early in processing makes it possible to add compute capacity downstream to accommodate large increases in data-ingest volume
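A common way to shard early is to hash a record key and take the result modulo the number of shards, so each downstream worker sees a stable subset of keys. The sketch below assumes that approach; the slide does not say which mechanisms comScore uses, and the `(cookie, url)` record shape and function names are hypothetical.

```python
import hashlib
from collections import defaultdict

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (same key, same shard)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def shard_records(records, num_shards: int):
    """Split (cookie, url) records into per-shard lists keyed by shard index."""
    shards = defaultdict(list)
    for cookie, url in records:
        shards[shard_for(cookie, num_shards)].append((cookie, url))
    return shards

records = [
    ("cookie-a", "/home"),
    ("cookie-b", "/search"),
    ("cookie-a", "/mail"),
    ("cookie-c", "/news"),
]
shards = shard_records(records, num_shards=4)
for index in sorted(shards):
    print(index, shards[index])
```

Because all records for one cookie land in the same shard, per-cookie work such as counting uniques can run independently on each shard.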
- 22. Sorting
We use DMExpress from Syncsort across hundreds of servers; this allows for efficient data processing
We sort input data based on a column in advance
To calculate uniques, check whether the current value differs from the prior value and, if so, increment a counter
We now have aggregation systems that can process over 50 GB of data with 357 million rows in less than an hour on a Dell R710 2U server
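The uniques calculation described above can be sketched in a few lines. This is a plain-Python illustration of the idea, not DMExpress code: once the input is sorted, a single pass that compares each value with the prior one is enough, and no hash table of seen values is needed.

```python
def count_uniques(sorted_values):
    """Count distinct values in an already-sorted iterable in one pass."""
    count = 0
    prior = object()  # sentinel that never equals a real value
    for value in sorted_values:
        if value != prior:  # value changed, so this is a new unique
            count += 1
            prior = value
    return count

cookies = ["c2", "c1", "c2", "c3", "c1"]
print(count_uniques(sorted(cookies)))
```

Memory use stays constant regardless of how many distinct values exist, which is what makes the approach attractive at hundreds of millions of rows.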
- 23. Compression w/Sorting
Compress log files when processing large volumes of log data
Several advantages to sorting data first:
– Reduces the size of the data
– Improves application performance
Example: 1 hour of our data (313 GB raw, 815 million rows)
– Standard compression of time-ordered data is 93GB (30% of original)
– Standard compression of a 2-key sorted set is 56GB (18% of original)
– For one day it saves 800GB
– For one month it saves 25TB
– For 90 days it saves 75TB
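The effect is easy to reproduce on synthetic data. The sketch below uses invented two-key records and `zlib` as a stand-in for "standard compression"; the record format and sizes are assumptions, so the ratios will not match the slide's. For reference, the slide's own figures imply a saving of 93 − 56 = 37 GB per hour, or a little under 0.9 TB per day, in line with the rounded 800GB-per-day figure.

```python
import random
import zlib

random.seed(0)

# Synthetic (cookie, page) pairs arriving in arbitrary "time" order.
records = [(random.randrange(200), random.randrange(50)) for _ in range(20_000)]

def serialize(rows):
    """Render rows as tab-separated log lines."""
    return "".join(
        f"{cookie:08d}\t/page/{page:03d}\n" for cookie, page in rows
    ).encode("ascii")

time_ordered = zlib.compress(serialize(records))
two_key_sorted = zlib.compress(serialize(sorted(records)))  # sort on both keys

print("time-ordered bytes:", len(time_ordered))
print("2-key sorted bytes:", len(two_key_sorted))
```

Sorting places identical and near-identical lines next to each other, which gives the compressor long repeated runs to exploit.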
- 24. Big data makes you think differently
Question: How many distinct cookies over 3 months?
Data: 3 monthly tables with distinct cookies, indexed
Size: 10B records per table
Platform: Sybase IQ
Attempt: SELECT COUNT of cookies over a UNION of the 3 monthly tables
– The UNION operator removes duplicates
Result: FAIL. Out of temp space. Out of luck.
– Failed after 30 minutes.
Why? UNION performs a SELECT and then a DISTINCT (sorting 30B rows)
- 25. Rethink the problem!
INNER joins are cheaper
– No sort; they use existing indexes
Remember set theory? Of course you do!
Let months be {A, B, C}
|A ∪ B ∪ C| = |A| + |B| + |C| – |A ∩ B| – |A ∩ C| – |B ∩ C| + |A ∩ B ∩ C|
A ∩ B ∩ C = (A ∩ B) ∩ (A ∩ C) ∩ (B ∩ C)
INNER join on only 2 tables of data at a time
– 2-month intersections took 2 hours each and were less taxing on memory
– Used intersection of intermediate (indexed!) results… 5 mins
Total query time: 6.5 hours
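The inclusion-exclusion identity can be checked on small sets. The sketch below uses Python sets in place of the monthly cookie tables; the sample values are made up, and the pairwise intersections stand in for the 2-table INNER joins.

```python
def distinct_over_three(a: set, b: set, c: set) -> int:
    """Count distinct members of a, b and c via inclusion-exclusion."""
    # Pairwise intersections correspond to the 2-table INNER joins.
    ab, ac, bc = a & b, a & c, b & c
    # The triple intersection is built from the intermediate results.
    abc = ab & ac & bc
    return len(a) + len(b) + len(c) - len(ab) - len(ac) - len(bc) + len(abc)

month_a = {1, 2, 3, 4}
month_b = {3, 4, 5}
month_c = {4, 5, 6, 7}

print(distinct_over_three(month_a, month_b, month_c))
print(len(month_a | month_b | month_c))
```

Both lines print the same count, which is the point: the union never has to be materialized and de-duplicated in one pass.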
- 26. TCO with Large Cluster Systems
Examine replication factor and disk configuration for systems with
replication built into the framework to support redundancy and
concurrency
Example:
Hadoop cluster that supports 108TB of base compressed data
Hypothetical Configurations:
– Replication Factor of 3
R710 (6x drives, JBOD); requires 162 servers
R510 (12x drives JBOD); requires 68 servers
– Replication Factor of 2
R710 (6x drives, RAID 5); requires 129 servers
R510 (12x drives, RAID 5); requires 54 servers
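The sizing arithmetic behind configurations like these can be sketched as base data times replication factor, divided by usable capacity per server. The slide does not state the usable capacity per server, so the 2.0 TB value below is an assumption chosen for illustration; it happens to reproduce the 162-server figure for the first configuration, and the other figures depend on capacities that are not given.

```python
import math

def servers_needed(base_tb: float, replication_factor: int,
                   usable_tb_per_server: float) -> int:
    """Servers required to hold base_tb of data at the given replication factor."""
    total_tb = base_tb * replication_factor
    return math.ceil(total_tb / usable_tb_per_server)

# 108 TB of base compressed data is from the slide;
# 2.0 TB usable per server is a hypothetical value.
print(servers_needed(108, 3, 2.0))
print(servers_needed(108, 2, 2.0))
```

Lowering the replication factor reduces the raw capacity needed, but RAID parity then takes some of each server's usable space, which is why both factors have to be examined together.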
- 27. Useful Factoids
Colorful, bite-sized graphical representations of the best discoveries we unearth.
Visit www.comscoredatamine.com or follow @datagems for the latest gems.