SlideShare a Scribd company logo
1 of 25
Download to read offline
nearly three years of continuous changes of approach
to data gathering and processing
(Martin Strycek, Juraj Sottnik)
@rubyslava 2014
We better get it right first
time!
Starting point
● we had two developers
● we had one live server
● we had one cold backup
● we can’t store all the data
● we can’t process all the data
Batch processing - the downsides
● batch every 3 hours
○ delete old data

● updating counters
○ you need to define them upfront

● throwing away old data
○ developer point of view
■ you have no way to correct your mistake

○ business
■ you lose your data
Batch processing - the benefits
● you will learn
○ profiler is your best friend
○ optimizing can be hard and can take time

● what are good access logs good for
○ reconstruct your deleted data
Business says:
save all data
Big Data
● It’s not only about the volume
● What we gonna do with it?
○ We had NO idea!

● We rent more servers.
○ We needed place where to store the data
Big Data
● We went the NoSQL way
○ MongoDB
■ easy replication, possible sharding
■ upsert
■ rich document based queries - we still were one foot
in the SQL world
■ fast prototype

● We were still doing batch processing
● ~15m impressions per day ending with
~5GB raw data per day
Big Data
● each day as collection
○ easy for batch processing

● each impression as a document
● adding processed parameters over
time
● pulling data from 30 collections
○ server is not responding
○ virtual memory is low
Big Data - analytics
● Visitors counts on website/section
○ active - with subscription
○ inactive - without subscription
○ anonymous

● Content consumption
○ how many pageviews
■ active
■ inactive
■ anonymouse

● and others
Business asks:
how many UNIQUE users
did … in month
What we really need
● COUNT(* || DISTINCT ...) GROUP BY
○ entities
○ date periods (day, week, month)
○ combination of entities and date periods [and
some other flags]

● Special demands from analytics team
○ Not too hard to implement with SQL magic

● As fast as possible
○ Minimally as fast as data are incoming

● Still store all historical raw data
○ Ideally compressed
What to do
● Processing raw data?
○ Use lot of space, before getting result
■ We need to store historical data anyway
■ You can store compressed files (LZO) in Hadoop

● Sharding
○ For how long?
○ How to properly determine sharding key(s)?

● Do you have really big amount of data?
● Do you have hardware for running
Hadoop? Really?
● What overnight batch processing really
means?
Naive solution
● Separate counter for each needed
combination, updated for each
impression, maybe with touching DB
○ Fast to generate unique key for combination
■ md5([entityType, entityId, day, dayId].join("|"))

○ Really fast to get value
■ Always primary key
■ Multiget

○ Need to define all GROUP BY combinations
on beginning
○ Failure during processing one impression
■ Need to increment counters in transaction
Real world solution
● Kafka
○ Buffering incoming data
○ Web workers as producers

● Storm / Trident
○ Consuming data from Kafka
○ Processing incoming data
○ Using cassandra as storage backend

● Cassandra
○ Holding counters and helper informations to
determine uniquity
Storm
● Real time processing of unbounded
streams of data
○ Processing data as they come
○ You still need to have computing power
○ Need to transform COUNT(* || DISTINCT ...)
GROUP BY everything to steps of updates of
counters
○ Java, but bolts can be written in different
languages
Storm
● Spouts
● Bolts
Trident
● High level abstraction over Storm
○
○
○
○
○

Joins
Aggregations
Grouping
Filtering
Functions
Trident
● Operating in transactions
● Persistent aggregation
○ “Memcached”
○ Cassandra

● DRPC calls
○ No need to touch Cassandra

● Local cluster for development
● Easy to learn basics
● Hard to discover advanced stuff
■ Lack of documentation
■ Need to tune configuration
Trident
● Functions
○ You can do everything you want
■ Touch DB, read emails, …

○ Stay with java
■ No dependencies problem
■ No performance penalty

● Topology
○ Good to define on beginning
■ Spend time on detailed diagram
■ Save you during implementation and future updates

○ Don’t do it too much complex
■ Problem with loading it
Trident
Cassandra
● Already in our production on different
project
● No SPOF
● Multi Master
● Scalable
● More good stuff
● Lot of new features in 2.x
○ Lite transactions
○ Lot of fixes
■ Good old times on 0.8
■ Our bug report from 2011 - Double load of commit log
on node start :)
Kafka
● A high-throughput distributed
messaging system
● Something like distributed commit log
○ You can set retention
○ You can move reading offset back
■ Used by Trident transactions

● Cluster
● Ideally to use with Trident
Business asks:
are you ready for ~250m
impressions per day?
Thank you.

More Related Content

What's hot

2013 DATA @ NFLX (Tableau User Group)
2013 DATA @ NFLX (Tableau User Group)2013 DATA @ NFLX (Tableau User Group)
2013 DATA @ NFLX (Tableau User Group)Albert Wong
 
Austin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_dataAustin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_dataAlex Pinkin
 
SOLR Power FTW: short version
SOLR Power FTW: short versionSOLR Power FTW: short version
SOLR Power FTW: short versionAlex Pinkin
 
Improve your SQL workload with observability
Improve your SQL workload with observabilityImprove your SQL workload with observability
Improve your SQL workload with observabilityOVHcloud
 
Apache Tajo on Swift
Apache Tajo on SwiftApache Tajo on Swift
Apache Tajo on SwiftJihoon Son
 
Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016Adrianos Dadis
 
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...GeeksLab Odessa
 
MongoDB.local Austin 2018: PetroCloud: MongoDB for the Industrial IOT Ecosystem
MongoDB.local Austin 2018: PetroCloud: MongoDB for the Industrial IOT EcosystemMongoDB.local Austin 2018: PetroCloud: MongoDB for the Industrial IOT Ecosystem
MongoDB.local Austin 2018: PetroCloud: MongoDB for the Industrial IOT EcosystemMongoDB
 
OpenStack MagnetoDB. Atlanta Summit 2014
OpenStack MagnetoDB. Atlanta Summit 2014OpenStack MagnetoDB. Atlanta Summit 2014
OpenStack MagnetoDB. Atlanta Summit 2014Ilya Sviridov
 
Intro To Graph Databases - Oxana Goriuc
Intro To Graph Databases - Oxana GoriucIntro To Graph Databases - Oxana Goriuc
Intro To Graph Databases - Oxana GoriucFraugster
 
Open stack @ iiit hyderabad
Open stack @ iiit hyderabad Open stack @ iiit hyderabad
Open stack @ iiit hyderabad openstackindia
 
Hitachi datasheet-universal-replicator
Hitachi datasheet-universal-replicatorHitachi datasheet-universal-replicator
Hitachi datasheet-universal-replicatorHitachi Vantara
 
An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache CassandraSaeid Zebardast
 
Handle TBs with $1500 per month
Handle TBs with $1500 per monthHandle TBs with $1500 per month
Handle TBs with $1500 per monthHung Lin
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Databases through out and beyond Big Data hype
Databases through out and beyond Big Data hypeDatabases through out and beyond Big Data hype
Databases through out and beyond Big Data hypeParinaz Ameri
 
Big Data Lakes Benchmarking 2018
Big Data Lakes Benchmarking 2018Big Data Lakes Benchmarking 2018
Big Data Lakes Benchmarking 2018Tom Grek
 

What's hot (20)

2013 DATA @ NFLX (Tableau User Group)
2013 DATA @ NFLX (Tableau User Group)2013 DATA @ NFLX (Tableau User Group)
2013 DATA @ NFLX (Tableau User Group)
 
Austin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_dataAustin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_data
 
SOLR Power FTW: short version
SOLR Power FTW: short versionSOLR Power FTW: short version
SOLR Power FTW: short version
 
AmazonRedshift
AmazonRedshiftAmazonRedshift
AmazonRedshift
 
Improve your SQL workload with observability
Improve your SQL workload with observabilityImprove your SQL workload with observability
Improve your SQL workload with observability
 
Apache Tajo on Swift
Apache Tajo on SwiftApache Tajo on Swift
Apache Tajo on Swift
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
 
Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016
 
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
WebCamp:Front-end Developers Day. Алексей Ященко, Сергей Руденко "Фронтенд-мо...
 
MongoDB.local Austin 2018: PetroCloud: MongoDB for the Industrial IOT Ecosystem
MongoDB.local Austin 2018: PetroCloud: MongoDB for the Industrial IOT EcosystemMongoDB.local Austin 2018: PetroCloud: MongoDB for the Industrial IOT Ecosystem
MongoDB.local Austin 2018: PetroCloud: MongoDB for the Industrial IOT Ecosystem
 
OpenStack MagnetoDB. Atlanta Summit 2014
OpenStack MagnetoDB. Atlanta Summit 2014OpenStack MagnetoDB. Atlanta Summit 2014
OpenStack MagnetoDB. Atlanta Summit 2014
 
Intro To Graph Databases - Oxana Goriuc
Intro To Graph Databases - Oxana GoriucIntro To Graph Databases - Oxana Goriuc
Intro To Graph Databases - Oxana Goriuc
 
Open stack @ iiit hyderabad
Open stack @ iiit hyderabad Open stack @ iiit hyderabad
Open stack @ iiit hyderabad
 
Hitachi datasheet-universal-replicator
Hitachi datasheet-universal-replicatorHitachi datasheet-universal-replicator
Hitachi datasheet-universal-replicator
 
An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache Cassandra
 
Handle TBs with $1500 per month
Handle TBs with $1500 per monthHandle TBs with $1500 per month
Handle TBs with $1500 per month
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Databases through out and beyond Big Data hype
Databases through out and beyond Big Data hypeDatabases through out and beyond Big Data hype
Databases through out and beyond Big Data hype
 
Big Data Lakes Benchmarking 2018
Big Data Lakes Benchmarking 2018Big Data Lakes Benchmarking 2018
Big Data Lakes Benchmarking 2018
 

Viewers also liked

Removing backgrounds in Photoshop
Removing backgrounds in PhotoshopRemoving backgrounds in Photoshop
Removing backgrounds in PhotoshopLandonPhillips
 
Usability for Port Chester Votes
Usability for Port Chester VotesUsability for Port Chester Votes
Usability for Port Chester VotesWhitney Quesenbery
 
Jenny Greeve - AIGA Design for Democracy in Washington State
Jenny Greeve - AIGA Design for Democracy in Washington StateJenny Greeve - AIGA Design for Democracy in Washington State
Jenny Greeve - AIGA Design for Democracy in Washington StateWhitney Quesenbery
 
Saving Your Budget with Plain Language
Saving Your Budget with Plain LanguageSaving Your Budget with Plain Language
Saving Your Budget with Plain LanguageWhitney Quesenbery
 
Monyfon 1 22 2009
Monyfon 1 22 2009Monyfon 1 22 2009
Monyfon 1 22 2009ralcalde
 
Multi-Tasking Map (MapReduce, Tasks in Rust)
Multi-Tasking Map (MapReduce, Tasks in Rust)Multi-Tasking Map (MapReduce, Tasks in Rust)
Multi-Tasking Map (MapReduce, Tasks in Rust)David Evans
 
10 iemesli, kāpēc man ir vajadzīgs rekrūteris
10 iemesli, kāpēc man ir vajadzīgs rekrūteris10 iemesli, kāpēc man ir vajadzīgs rekrūteris
10 iemesli, kāpēc man ir vajadzīgs rekrūterisArtyom Kobakhidze
 
Like it presentation for it students (LV)
Like it presentation for it students (LV)Like it presentation for it students (LV)
Like it presentation for it students (LV)Artyom Kobakhidze
 
TypeScript intro / mobile dev camp
TypeScript intro / mobile dev campTypeScript intro / mobile dev camp
TypeScript intro / mobile dev campAndrea Balducci
 
Sakai Tools That Engage Students
Sakai Tools That Engage StudentsSakai Tools That Engage Students
Sakai Tools That Engage StudentsLandonPhillips
 
The Game is Afoot - SUNY Conference 2015
The Game is Afoot - SUNY Conference 2015The Game is Afoot - SUNY Conference 2015
The Game is Afoot - SUNY Conference 2015LandonPhillips
 
Preparing to become Professional Accountant in Nigeria
Preparing to become Professional Accountant in NigeriaPreparing to become Professional Accountant in Nigeria
Preparing to become Professional Accountant in NigeriaAbdulsalam Masud
 
Hitchikers guide to Ux'ing without a Ux'er
Hitchikers guide to Ux'ing without a Ux'erHitchikers guide to Ux'ing without a Ux'er
Hitchikers guide to Ux'ing without a Ux'erChrissy Welsh
 
Persona Stories: Weaving together quant & qual for a richer picture
Persona Stories: Weaving together quant & qual for a richer picturePersona Stories: Weaving together quant & qual for a richer picture
Persona Stories: Weaving together quant & qual for a richer pictureWhitney Quesenbery
 
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法Takeshi Furusato
 
Accessibility as Innovation: Creating accessible user experiences
Accessibility as Innovation: Creating accessible user experiencesAccessibility as Innovation: Creating accessible user experiences
Accessibility as Innovation: Creating accessible user experiencesWhitney Quesenbery
 
Programming The Arduino Due in Rust
Programming The Arduino Due in RustProgramming The Arduino Due in Rust
Programming The Arduino Due in Rustkellogh
 

Viewers also liked (20)

Removing backgrounds in Photoshop
Removing backgrounds in PhotoshopRemoving backgrounds in Photoshop
Removing backgrounds in Photoshop
 
Usability for Port Chester Votes
Usability for Port Chester VotesUsability for Port Chester Votes
Usability for Port Chester Votes
 
Jenny Greeve - AIGA Design for Democracy in Washington State
Jenny Greeve - AIGA Design for Democracy in Washington StateJenny Greeve - AIGA Design for Democracy in Washington State
Jenny Greeve - AIGA Design for Democracy in Washington State
 
Saving Your Budget with Plain Language
Saving Your Budget with Plain LanguageSaving Your Budget with Plain Language
Saving Your Budget with Plain Language
 
Monyfon 1 22 2009
Monyfon 1 22 2009Monyfon 1 22 2009
Monyfon 1 22 2009
 
Multi-Tasking Map (MapReduce, Tasks in Rust)
Multi-Tasking Map (MapReduce, Tasks in Rust)Multi-Tasking Map (MapReduce, Tasks in Rust)
Multi-Tasking Map (MapReduce, Tasks in Rust)
 
Typescript intro
Typescript introTypescript intro
Typescript intro
 
10 iemesli, kāpēc man ir vajadzīgs rekrūteris
10 iemesli, kāpēc man ir vajadzīgs rekrūteris10 iemesli, kāpēc man ir vajadzīgs rekrūteris
10 iemesli, kāpēc man ir vajadzīgs rekrūteris
 
Like it presentation for it students (LV)
Like it presentation for it students (LV)Like it presentation for it students (LV)
Like it presentation for it students (LV)
 
TypeScript intro / mobile dev camp
TypeScript intro / mobile dev campTypeScript intro / mobile dev camp
TypeScript intro / mobile dev camp
 
Sakai Tools That Engage Students
Sakai Tools That Engage StudentsSakai Tools That Engage Students
Sakai Tools That Engage Students
 
Class Walkthrough
Class WalkthroughClass Walkthrough
Class Walkthrough
 
The Game is Afoot - SUNY Conference 2015
The Game is Afoot - SUNY Conference 2015The Game is Afoot - SUNY Conference 2015
The Game is Afoot - SUNY Conference 2015
 
Preparing to become Professional Accountant in Nigeria
Preparing to become Professional Accountant in NigeriaPreparing to become Professional Accountant in Nigeria
Preparing to become Professional Accountant in Nigeria
 
Hitchikers guide to Ux'ing without a Ux'er
Hitchikers guide to Ux'ing without a Ux'erHitchikers guide to Ux'ing without a Ux'er
Hitchikers guide to Ux'ing without a Ux'er
 
Persona Stories: Weaving together quant & qual for a richer picture
Persona Stories: Weaving together quant & qual for a richer picturePersona Stories: Weaving together quant & qual for a richer picture
Persona Stories: Weaving together quant & qual for a richer picture
 
Need a little usability?
Need a little usability?Need a little usability?
Need a little usability?
 
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
 
Accessibility as Innovation: Creating accessible user experiences
Accessibility as Innovation: Creating accessible user experiencesAccessibility as Innovation: Creating accessible user experiences
Accessibility as Innovation: Creating accessible user experiences
 
Programming The Arduino Due in Rust
Programming The Arduino Due in RustProgramming The Arduino Due in Rust
Programming The Arduino Due in Rust
 

Similar to Piano Media - approach to data gathering and processing

Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheapMarc Cluet
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseHakan Ilter
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadKrivoy Rog IT Community
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleItai Yaffe
 
Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Hisham Mardam-Bey
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodbDeep Kapadia
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward
 
Big data on google platform dev fest presentation
Big data on google platform   dev fest presentationBig data on google platform   dev fest presentation
Big data on google platform dev fest presentationPrzemysław Pastuszka
 
#lspe Building a Monitoring Framework using DTrace and MongoDB
#lspe Building a Monitoring Framework using DTrace and MongoDB#lspe Building a Monitoring Framework using DTrace and MongoDB
#lspe Building a Monitoring Framework using DTrace and MongoDBdan-p-kimmel
 

Similar to Piano Media - approach to data gathering and processing (20)

Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheap
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 
Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Activity feeds (and more) at mate1
Activity feeds (and more) at mate1
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
 
Big data on google platform dev fest presentation
Big data on google platform   dev fest presentationBig data on google platform   dev fest presentation
Big data on google platform dev fest presentation
 
#lspe Building a Monitoring Framework using DTrace and MongoDB
#lspe Building a Monitoring Framework using DTrace and MongoDB#lspe Building a Monitoring Framework using DTrace and MongoDB
#lspe Building a Monitoring Framework using DTrace and MongoDB
 
Workflow Engines + Luigi
Workflow Engines + LuigiWorkflow Engines + Luigi
Workflow Engines + Luigi
 

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Piano Media - approach to data gathering and processing

  • 1. nearly three years of continuous changes of approach to data gathering and processing (Martin Strycek, Juraj Sottnik) @rubyslava 2014
  • 2. We better get it right first time!
  • 3. Starting point ● we had two developers ● we had one live server ● we had one cold backup ● we can’t store all the data ● we can’t process all the data
  • 4. Batch processing - the downsides ● batch every 3 hours ○ delete old data ● updating counters ○ you need to define them upfront ● throwing away old data ○ developer point of view ■ you have no way to correct your mistake ○ business ■ you lose your data
  • 5. Batch processing - the benefits ● you will learn ○ profiler is your best friend ○ optimizing can be hard and can take time ● what are good access logs good for ○ reconstruct your deleted data
  • 7. Big Data ● It’s not only about the volume ● What we gonna do with it? ○ We had NO idea! ● We rent more servers. ○ We needed place where to store the data
  • 8. Big Data ● We went the NoSQL way ○ MongoDB ■ easy replication, possible sharding ■ upsert ■ rich document based queries - we still were one foot in the SQL world ■ fast prototype ● We were still doing batch processing ● ~15m impressions per day ending with ~5GB raw data per day
  • 9. Big Data ● each day as collection ○ easy for batch processing ● each impression as a document ● adding processed parameters over time ● pulling data from 30 collections ○ server is not responding ○ virtual memory is low
  • 10. Big Data - analytics ● Visitors counts on website/section ○ active - with subscription ○ inactive - without subscription ○ anonymous ● Content consumption ○ how many pageviews ■ active ■ inactive ■ anonymouse ● and others
  • 11. Business asks: how many UNIQUE users did … in month
  • 12. What we really need ● COUNT(* || DISTINCT ...) GROUP BY ○ entities ○ date periods (day, week, month) ○ combination of entities and date periods [and some other flags] ● Special demands from analytics team ○ Not too hard to implement with SQL magic ● As fast as possible ○ Minimally as fast as data are incoming ● Still store all historical raw data ○ Ideally compressed
  • 13. What to do ● Processing raw data? ○ Use lot of space, before getting result ■ We need to store historical data anyway ■ You can store compressed files (LZO) in Hadoop ● Sharding ○ For how long? ○ How to properly determine sharding key(s)? ● Do you have really big amount of data? ● Do you have hardware for running Hadoop? Really? ● What overnight batch processing really means?
  • 14. Naive solution ● Separate counter for each needed combination, updated for each impression, maybe with touching DB ○ Fast to generate unique key for combination ■ md5([entityType, entityId, day, dayId].join("|")) ○ Really fast to get value ■ Always primary key ■ Multiget ○ Need to define all GROUP BY combinations on beginning ○ Failure during processing one impression ■ Need to increment counters in transaction
  • 15. Real world solution ● Kafka ○ Buffering incoming data ○ Web workers as producers ● Storm / Trident ○ Consuming data from Kafka ○ Processing incoming data ○ Using cassandra as storage backend ● Cassandra ○ Holding counters and helper informations to determine uniquity
  • 16. Storm ● Real time processing of unbounded streams of data ○ Processing data as they come ○ You still need to have computing power ○ Need to transform COUNT(* || DISTINCT ...) GROUP BY everything to steps of updates of counters ○ Java, but bolts can be written in different languages
  • 18. Trident ● High level abstraction over Storm ○ ○ ○ ○ ○ Joins Aggregations Grouping Filtering Functions
  • 19. Trident ● Operating in transactions ● Persistent aggregation ○ “Memcached” ○ Cassandra ● DRPC calls ○ No need to touch Cassandra ● Local cluster for development ● Easy to learn basics ● Hard to discover advanced stuff ■ Lack of documentation ■ Need to tune configuration
  • 20. Trident ● Functions ○ You can do everything you want ■ Touch DB, read emails, … ○ Stay with java ■ No dependencies problem ■ No performance penalty ● Topology ○ Good to define on beginning ■ Spend time on detailed diagram ■ Save you during implementation and future updates ○ Don’t do it too much complex ■ Problem with loading it
  • 22. Cassandra ● Already in our production on different project ● No SPOF ● Multi Master ● Scalable ● More good stuff ● Lot of new features in 2.x ○ Lite transactions ○ Lot of fixes ■ Good old times on 0.8 ■ Our bug report from 2011 - Double load of commit log on node start :)
  • 23. Kafka ● A high-throughput distributed messaging system ● Something like distributed commit log ○ You can set retention ○ You can move reading offset back ■ Used by Trident transactions ● Cluster ● Ideally to use with Trident
  • 24. Business asks: are you ready for ~250m impressions per day?