nearly three years of continuous changes of approach
to data gathering and processing
(Martin Strycek, Juraj Sottnik)
@rubyslava 2014
We'd better get it right the first
time!
Starting point
● we had two developers
● we had one live server
● we had one cold backup
● we couldn’t store all the data
● we couldn’t process all the data
Batch processing - the downsides
● batch every 3 hours
○ delete old data

● updating counters
○ you need to define them upfront

● throwing away old data
○ developer point of view
■ you have no way to correct your mistake

○ business
■ you lose your data
Batch processing - the benefits
● you will learn
○ profiler is your best friend
○ optimizing can be hard and can take time

● what good access logs are good for
○ reconstruct your deleted data
Business says:
save all data
Big Data
● It’s not only about the volume
● What are we going to do with it?
○ We had NO idea!

● We rented more servers.
○ We needed a place to store the data
Big Data
● We went the NoSQL way
○ MongoDB
■ easy replication, possible sharding
■ upsert
■ rich document-based queries - we still had one foot
in the SQL world
■ fast prototype

● We were still doing batch processing
● ~15M impressions per day, resulting in
~5 GB of raw data per day
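
A quick back-of-the-envelope check of what these volumes mean per impression and per second. This is only a sketch: it assumes decimal gigabytes and an even spread of traffic over the day (real peaks would be higher), and the 30-day figure is an illustrative extrapolation, not a number from the slides.

```python
# Figures from the slide; everything derived below is an average.
IMPRESSIONS_PER_DAY = 15_000_000
RAW_BYTES_PER_DAY = 5 * 10**9        # assumes 1 GB = 10^9 bytes
SECONDS_PER_DAY = 24 * 60 * 60

# Average size of one raw impression record, in bytes.
bytes_per_impression = RAW_BYTES_PER_DAY / IMPRESSIONS_PER_DAY

# Average sustained write rate, in impressions per second.
impressions_per_second = IMPRESSIONS_PER_DAY / SECONDS_PER_DAY

# Raw storage needed for 30 days of data, in GB (illustrative window).
raw_gb_per_30_days = 30 * RAW_BYTES_PER_DAY / 10**9

print(round(bytes_per_impression), round(impressions_per_second), raw_gb_per_30_days)
```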
Big Data
● each day as collection
○ easy for batch processing

● each impression as a document
● adding processed parameters over
time
● pulling data from 30 collections
○ server is not responding
○ virtual memory is low
Big Data - analytics
● Visitor counts on website/section
○ active - with subscription
○ inactive - without subscription
○ anonymous

● Content consumption
○ how many pageviews
■ active
■ inactive
■ anonymous

● and others
Business asks:
how many UNIQUE users
did … in month
What we really need
● COUNT(* || DISTINCT ...) GROUP BY
○ entities
○ date periods (day, week, month)
○ combination of entities and date periods [and
some other flags]

● Special demands from analytics team
○ Not too hard to implement with SQL magic

● As fast as possible
○ At least as fast as the data comes in

● Still store all historical raw data
○ Ideally compressed
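
To make the COUNT(* || DISTINCT ...) GROUP BY requirement concrete, here is a minimal sketch using an in-memory SQLite database. The table name, column names, and sample rows are hypothetical and chosen only for illustration; the slides do not describe the actual schema.

```python
import sqlite3

# Hypothetical schema: one row per impression.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions "
    "(user_id TEXT, entity_type TEXT, entity_id TEXT, day TEXT)"
)
rows = [
    ("u1", "section", "sport", "2014-03-01"),
    ("u1", "section", "sport", "2014-03-01"),
    ("u2", "section", "sport", "2014-03-01"),
    ("u1", "section", "sport", "2014-03-02"),
    ("u3", "section", "news", "2014-03-02"),
]
conn.executemany("INSERT INTO impressions VALUES (?, ?, ?, ?)", rows)


def monthly_stats(conn):
    """Return pageviews and unique users per entity per month."""
    query = """
        SELECT entity_type,
               entity_id,
               substr(day, 1, 7)       AS month,
               COUNT(*)                AS pageviews,
               COUNT(DISTINCT user_id) AS unique_users
        FROM impressions
        GROUP BY entity_type, entity_id, substr(day, 1, 7)
        ORDER BY entity_id
    """
    return conn.execute(query).fetchall()


stats = monthly_stats(conn)
for row in stats:
    print(row)
```

The query itself is simple; the difficulty described in the slides is running it fast enough over the full raw data set.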
What to do
● Processing raw data?
○ Uses a lot of space before you get a result
■ We need to store historical data anyway
■ You can store compressed files (LZO) in Hadoop

● Sharding
○ For how long?
○ How to properly determine sharding key(s)?

● Do you really have a big amount of data?
● Do you have the hardware for running
Hadoop? Really?
● What does overnight batch processing really
mean?
Naive solution
● Separate counter for each needed
combination, updated for each
impression, maybe touching the DB
○ Fast to generate unique key for combination
■ md5([entityType, entityId, day, dayId].join("|"))

○ Really fast to get value
■ Always primary key
■ Multiget

○ Need to define all GROUP BY combinations
up front
○ Failure while processing one impression
■ Need to increment counters in a transaction
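
The naive counter scheme can be sketched as follows. This is a minimal illustration, not the original implementation: the in-memory dictionary stands in for the key-value store, and the function names and the day/month period choice are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def counter_key(entity_type, entity_id, period, period_id):
    """Build a fixed-length key for one GROUP BY combination."""
    raw = "|".join([entity_type, str(entity_id), period, str(period_id)])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# In-memory stand-in for the key-value store (assumption for this sketch).
counters = defaultdict(int)


def record_impression(entity_type, entity_id, day_id, month_id):
    """Increment every predefined counter this impression contributes to."""
    for period, period_id in (("day", day_id), ("month", month_id)):
        counters[counter_key(entity_type, entity_id, period, period_id)] += 1


def multiget(keys):
    """Fetch several counters at once; every lookup is by primary key."""
    return {key: counters.get(key, 0) for key in keys}


record_impression("section", "sport", "2014-03-01", "2014-03")
record_impression("section", "sport", "2014-03-02", "2014-03")

day_key = counter_key("section", "sport", "day", "2014-03-01")
month_key = counter_key("section", "sport", "month", "2014-03")
values = multiget([day_key, month_key])
```

Note that the loop updates several counters per impression without any atomicity, which is exactly the failure case the slide warns about: a crash between two increments leaves the counters inconsistent.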
Real world solution
● Kafka
○ Buffering incoming data
○ Web workers as producers

● Storm / Trident
○ Consuming data from Kafka
○ Processing incoming data
○ Using Cassandra as storage backend

● Cassandra
○ Holding counters and helper information to
determine uniqueness
Storm
● Real time processing of unbounded
streams of data
○ Processing data as they come
○ You still need to have computing power
○ Need to transform COUNT(* || DISTINCT ...)
GROUP BY everything into a series of counter
updates
○ Java, but bolts can be written in different
languages
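
One way to picture turning COUNT(DISTINCT ...) GROUP BY into counter updates is the sketch below. It is plain Python rather than Storm code, and the class and field names are hypothetical: a helper set remembers which (entity, period, user) combinations were already counted, and the unique counter is incremented only on first sight.

```python
from collections import defaultdict


class UniqueCounter:
    """Streaming replacement for COUNT(DISTINCT user) GROUP BY entity, period."""

    def __init__(self):
        # (entity, period) -> number of unique users seen so far
        self.counts = defaultdict(int)
        # helper markers used only to decide whether a user is new
        self.seen = set()

    def update(self, entity, period, user_id):
        """Process one impression; return True if it added a new unique user."""
        marker = (entity, period, user_id)
        if marker in self.seen:
            return False
        self.seen.add(marker)
        self.counts[(entity, period)] += 1
        return True


counter = UniqueCounter()
stream = [
    ("sport", "2014-03", "u1"),
    ("sport", "2014-03", "u1"),
    ("sport", "2014-03", "u2"),
    ("news", "2014-03", "u1"),
]
for entity, period, user_id in stream:
    counter.update(entity, period, user_id)
```

The trade-off is that the helper markers grow with the number of distinct users per period, so they need a persistent store rather than process memory in a real deployment.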
Storm
● Spouts
● Bolts
Trident
● High level abstraction over Storm
○ Joins
○ Aggregations
○ Grouping
○ Filtering
○ Functions
Trident
● Operating in transactions
● Persistent aggregation
○ “Memcached”
○ Cassandra

● DRPC calls
○ No need to touch Cassandra

● Local cluster for development
● Easy to learn basics
● Hard to discover advanced stuff
■ Lack of documentation
■ Need to tune configuration
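
The idea behind transactional persistent aggregation can be sketched in a few lines. This is a simplified Python model, not the Trident API: it assumes batches are applied in order and that a replayed batch has the same transaction id and the same content as the original, and all names are hypothetical. The state stores the id of the last applied batch next to each value, so a replay is skipped instead of being counted twice.

```python
class TransactionalCounterState:
    """Counter store that stays correct when a batch is replayed."""

    def __init__(self):
        # key -> (txid of the last batch applied to this key, current value)
        self.store = {}

    def apply_batch(self, txid, increments):
        """Apply one batch of increments, ignoring keys already updated by txid."""
        for key, delta in increments.items():
            last_txid, value = self.store.get(key, (None, 0))
            if last_txid == txid:
                continue  # this batch was already applied to this key
            self.store[key] = (txid, value + delta)

    def get(self, key):
        """Return the current value for key, or 0 if it was never updated."""
        return self.store.get(key, (None, 0))[1]


state = TransactionalCounterState()
state.apply_batch(1, {"sport|day|2014-03-01": 3})
state.apply_batch(2, {"sport|day|2014-03-01": 2})
# Simulate a failure after batch 2 was stored: the batch is replayed.
state.apply_batch(2, {"sport|day|2014-03-01": 2})
```

After the replay the counter still holds 5, not 7, which is what makes incrementing counters per impression safe under failures.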
Trident
● Functions
○ You can do everything you want
■ Touch DB, read emails, …

○ Stay with Java
■ No dependency problems
■ No performance penalty

● Topology
○ Good to define at the beginning
■ Spend time on a detailed diagram
■ Saves you during implementation and future updates

○ Don’t make it too complex
■ Problems with loading it
Trident
Cassandra
● Already in production on a different
project
● No SPOF
● Multi Master
● Scalable
● More good stuff
● Lots of new features in 2.x
○ Lightweight transactions
○ Lots of fixes
■ Good old times on 0.8
■ Our bug report from 2011 - Double load of commit log
on node start :)
Kafka
● A high-throughput distributed
messaging system
● Something like a distributed commit log
○ You can set retention
○ You can move the read offset back
■ Used by Trident transactions

● Cluster
● Ideal for use with Trident
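
The commit-log behaviour described above can be modelled with a toy class. This is not Kafka client code: retention here is a simple message count (Kafka itself retains by time or size), and the class and method names are invented for the sketch. It shows the two properties that matter for replay: old messages expire, and a consumer can move its offset back and read the same messages again.

```python
class CommitLog:
    """Toy append-only log with count-based retention and offset-based reads."""

    def __init__(self, retention):
        self.retention = retention  # maximum number of messages kept
        self.base_offset = 0        # offset of the oldest retained message
        self.messages = []

    def append(self, message):
        """Append a message, drop expired ones, and return its offset."""
        self.messages.append(message)
        overflow = len(self.messages) - self.retention
        if overflow > 0:
            del self.messages[:overflow]
            self.base_offset += overflow
        return self.base_offset + len(self.messages) - 1

    def read(self, offset, max_count):
        """Read up to max_count messages starting at offset."""
        if offset < self.base_offset:
            raise ValueError("offset already expired by retention")
        start = offset - self.base_offset
        return self.messages[start:start + max_count]


log = CommitLog(retention=3)
offsets = [log.append(m) for m in ["a", "b", "c", "d"]]

first_read = log.read(1, 2)
# Moving the read offset back replays exactly the same messages.
replayed = log.read(1, 2)
```

Because the reader, not the log, owns the offset, a failed batch can be re-read from the same position, which is the property the Trident transactions above rely on.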
Business asks:
are you ready for ~250M
impressions per day?
Thank you.

Commit 2024 - Secret Management made easy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Piano Media - approach to data gathering and processing

  • 1. nearly three years of continuous changes of approach to data gathering and processing (Martin Strycek, Juraj Sottnik) @rubyslava 2014
  • 2. We'd better get it right the first time!
  • 3. Starting point
      ● we had two developers
      ● we had one live server
      ● we had one cold backup
      ● we can’t store all the data
      ● we can’t process all the data
  • 4. Batch processing - the downsides
      ● batch every 3 hours
        ○ delete old data
      ● updating counters
        ○ you need to define them upfront
      ● throwing away old data
        ○ developer point of view
          ■ you have no way to correct your mistake
        ○ business
          ■ you lose your data
  • 5. Batch processing - the benefits
      ● you will learn
        ○ profiler is your best friend
        ○ optimizing can be hard and can take time
      ● what good access logs are good for
        ○ reconstruct your deleted data
  • 7. Big Data
      ● It’s not only about the volume
      ● What are we going to do with it?
        ○ We had NO idea!
      ● We rented more servers.
        ○ We needed a place to store the data
  • 8. Big Data
      ● We went the NoSQL way
        ○ MongoDB
          ■ easy replication, possible sharding
          ■ upsert
          ■ rich document-based queries - we still had one foot in the SQL world
          ■ fast prototype
      ● We were still doing batch processing
      ● ~15m impressions per day, resulting in ~5GB of raw data per day
  • 9. Big Data
      ● each day as a collection
        ○ easy for batch processing
      ● each impression as a document
      ● adding processed parameters over time
      ● pulling data from 30 collections
        ○ server is not responding
        ○ virtual memory is low
  • 10. Big Data - analytics
      ● Visitor counts on website/section
        ○ active - with subscription
        ○ inactive - without subscription
        ○ anonymous
      ● Content consumption
        ○ how many pageviews
          ■ active
          ■ inactive
          ■ anonymous
      ● and others
  • 11. Business asks: how many UNIQUE users did … in month
  • 12. What we really need
      ● COUNT(* || DISTINCT ...) GROUP BY
        ○ entities
        ○ date periods (day, week, month)
        ○ combination of entities and date periods [and some other flags]
      ● Special demands from analytics team
        ○ Not too hard to implement with SQL magic
      ● As fast as possible
        ○ At least as fast as the data comes in
      ● Still store all historical raw data
        ○ Ideally compressed
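
The COUNT(* || DISTINCT ...) GROUP BY requirement above can be illustrated with a minimal Python sketch. The impression fields (user, entity, day), the sample rows, and the function names are hypothetical and only serve the illustration; they are not taken from the talk.

```python
from collections import defaultdict
from datetime import date

# Hypothetical impressions: (user id, entity id, day of the impression).
impressions = [
    ("u1", "article-1", date(2014, 3, 3)),
    ("u1", "article-1", date(2014, 3, 3)),
    ("u2", "article-1", date(2014, 3, 4)),
    ("u1", "article-2", date(2014, 3, 10)),
]


def period_key(day, period):
    """Map a date to its day, ISO week, or month bucket."""
    if period == "day":
        return day.isoformat()
    if period == "week":
        iso = day.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    if period == "month":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unknown period: {period}")


def count_group_by(rows, period):
    """Return {(entity, bucket): (COUNT(*), COUNT(DISTINCT user))}."""
    totals = defaultdict(int)
    users = defaultdict(set)
    for user, entity, day in rows:
        key = (entity, period_key(day, period))
        totals[key] += 1          # COUNT(*)
        users[key].add(user)      # basis for COUNT(DISTINCT user)
    return {key: (totals[key], len(users[key])) for key in totals}


monthly = count_group_by(impressions, "month")
weekly = count_group_by(impressions, "week")
```

This version keeps every user id per group in memory, which is exactly what stops scaling once the raw data no longer fits on one machine.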
  • 13. What to do
      ● Processing raw data?
        ○ Uses a lot of space before you get a result
          ■ We need to store historical data anyway
          ■ You can store compressed files (LZO) in Hadoop
      ● Sharding
        ○ For how long?
        ○ How to properly determine sharding key(s)?
      ● Do you really have a big amount of data?
      ● Do you have hardware for running Hadoop? Really?
      ● What does overnight batch processing really mean?
  • 14. Naive solution
      ● Separate counter for each needed combination, updated for each impression, possibly touching the DB
        ○ Fast to generate a unique key for a combination
          ■ md5([entityType, entityId, day, dayId].join("|"))
        ○ Really fast to get a value
          ■ Always primary key
          ■ Multiget
        ○ Need to define all GROUP BY combinations at the beginning
        ○ Failure while processing one impression
          ■ Need to increment counters in a transaction
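
A minimal Python sketch of the naive counter scheme above. The key follows the md5([...].join("|")) pattern from the slide; the in-memory Counter standing in for the database, the function names, and the day/week/month combinations are assumptions made for the illustration.

```python
import hashlib
from collections import Counter


def counter_key(entity_type, entity_id, period, period_id):
    """Build a key the way the slide does: md5 of the parts joined by "|"."""
    raw = "|".join([entity_type, str(entity_id), period, str(period_id)])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Assumption: a Counter stands in for the key-value store.
store = Counter()


def record_impression(entity_type, entity_id, day, week, month):
    """Increment one counter per GROUP BY combination defined upfront."""
    for period, period_id in (("day", day), ("week", week), ("month", month)):
        store[counter_key(entity_type, entity_id, period, period_id)] += 1


def multiget(keys):
    """Fetch several counters at once; every lookup is by primary key."""
    return [store[key] for key in keys]


record_impression("article", 42, "2014-03-03", "2014-W10", "2014-03")
record_impression("article", 42, "2014-03-04", "2014-W10", "2014-03")

month_key = counter_key("article", 42, "month", "2014-03")
day_key = counter_key("article", 42, "day", "2014-03-03")
values = multiget([month_key, day_key])
```

The sketch does not address the last point on the slide: if the process dies between two increments of the same impression, the counters drift apart unless the increments run in a transaction.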
  • 15. Real world solution
      ● Kafka
        ○ Buffering incoming data
        ○ Web workers as producers
      ● Storm / Trident
        ○ Consuming data from Kafka
        ○ Processing incoming data
        ○ Using Cassandra as storage backend
      ● Cassandra
        ○ Holding counters and helper information to determine uniqueness
  • 16. Storm
      ● Real time processing of unbounded streams of data
        ○ Processing data as they come
        ○ You still need to have computing power
        ○ Need to transform COUNT(* || DISTINCT ...) GROUP BY everything into a sequence of counter updates
        ○ Java, but bolts can be written in different languages
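
One way to turn COUNT(DISTINCT ...) GROUP BY into per-event counter updates is to keep a "seen" marker per (group, user) pair and bump the unique counter only the first time a pair appears. The Python sketch below illustrates that idea with in-memory structures; it is not Storm code, and the class and field names are hypothetical.

```python
class UniqueCounter:
    """Streaming COUNT(*) and COUNT(DISTINCT user) per group key."""

    def __init__(self):
        self.seen = set()    # helper data: (group key, user) pairs already counted
        self.totals = {}     # COUNT(*) per group key
        self.uniques = {}    # COUNT(DISTINCT user) per group key

    def update(self, group_key, user):
        """Process one impression as it arrives."""
        self.totals[group_key] = self.totals.get(group_key, 0) + 1
        marker = (group_key, user)
        if marker not in self.seen:
            # First time this user shows up in this group: count them once.
            self.seen.add(marker)
            self.uniques[group_key] = self.uniques.get(group_key, 0) + 1


counter = UniqueCounter()
stream = [
    ("article-1|2014-03", "u1"),
    ("article-1|2014-03", "u1"),
    ("article-1|2014-03", "u2"),
    ("article-2|2014-03", "u1"),
]
for group_key, user in stream:
    counter.update(group_key, user)
```

The trade-off is storage: the marker set grows with the number of distinct (group, user) pairs, so it has to live in a store sized for that.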
  • 18. Trident
      ● High level abstraction over Storm
        ○ Joins
        ○ Aggregations
        ○ Grouping
        ○ Filtering
        ○ Functions
  • 19. Trident
      ● Operating in transactions
      ● Persistent aggregation
        ○ “Memcached”
        ○ Cassandra
      ● DRPC calls
        ○ No need to touch Cassandra
      ● Local cluster for development
      ● Easy to learn basics
      ● Hard to discover advanced stuff
          ■ Lack of documentation
          ■ Need to tune configuration
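
A common way to keep persistent counters correct when a batch is replayed is to store the id of the last applied transaction next to each value and skip a batch whose id is already recorded. The Python sketch below shows that idea only; it is not Trident's API, the class and method names are hypothetical, and it assumes a replayed batch carries the same id and the same contents as the original.

```python
class TransactionalCounter:
    """Counters that tolerate replay of the most recent batch."""

    def __init__(self):
        self.state = {}  # key -> (txid of last applied batch, value)

    def apply_batch(self, txid, increments):
        """Apply {key: delta} once per txid, even if the batch is replayed."""
        for key, delta in increments.items():
            last_txid, value = self.state.get(key, (None, 0))
            if last_txid == txid:
                continue  # this batch was already applied to this key
            self.state[key] = (txid, value + delta)

    def get(self, key):
        return self.state.get(key, (None, 0))[1]


counters = TransactionalCounter()
counters.apply_batch(1, {"article-1": 3, "article-2": 1})
counters.apply_batch(2, {"article-1": 2})
counters.apply_batch(2, {"article-1": 2})  # replay after a simulated failure
```

Because the check is per key, a batch that failed halfway can be replayed: keys that were already updated are skipped, the rest are applied.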
  • 20. Trident
      ● Functions
        ○ You can do everything you want
          ■ Touch DB, read emails, …
        ○ Stay with Java
          ■ No dependency problems
          ■ No performance penalty
      ● Topology
        ○ Good to define at the beginning
          ■ Spend time on a detailed diagram
          ■ Saves you during implementation and future updates
        ○ Don’t make it too complex
          ■ Problems with loading it
  • 22. Cassandra
      ● Already in production on a different project of ours
      ● No SPOF
      ● Multi Master
      ● Scalable
      ● More good stuff
      ● Lots of new features in 2.x
        ○ Lightweight transactions
        ○ Lots of fixes
          ■ Good old times on 0.8
          ■ Our bug report from 2011 - Double load of commit log on node start :)
  • 23. Kafka
      ● A high-throughput distributed messaging system
      ● Something like a distributed commit log
        ○ You can set retention
        ○ You can move the reading offset back
          ■ Used by Trident transactions
      ● Cluster
      ● Ideal for use with Trident
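
The commit-log properties listed above (retention, and a reading offset the consumer can move back) can be shown with a toy in-memory log in Python. This is not a Kafka client; the class is hypothetical, and retention is simplified to "keep the last N messages" instead of Kafka's time- or size-based settings.

```python
class ToyLog:
    """Append-only log with count-based retention and offset-based reads."""

    def __init__(self, retention):
        self.retention = retention  # assumption: max number of messages kept
        self.base_offset = 0        # offset of the oldest retained message
        self.messages = []

    def append(self, message):
        """Append a message and return its offset."""
        self.messages.append(message)
        overflow = len(self.messages) - self.retention
        if overflow > 0:
            # Retention: drop the oldest messages, offsets keep increasing.
            del self.messages[:overflow]
            self.base_offset += overflow
        return self.base_offset + len(self.messages) - 1

    def read(self, offset, max_messages):
        """Read from any retained offset; the consumer owns its position."""
        if offset < self.base_offset:
            raise IndexError("offset is older than the retained data")
        start = offset - self.base_offset
        return self.messages[start:start + max_messages]


log = ToyLog(retention=3)
offsets = [log.append(m) for m in ("a", "b", "c")]
first_read = log.read(0, 2)
replayed = log.read(0, 2)      # moving the offset back re-reads the same data
last_offset = log.append("d")  # pushes "a" out of retention
```

Re-reading from an earlier offset is what lets a consumer replay a failed batch, as long as the data is still within retention.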
  • 24. Business asks: are you ready for ~250m impressions per day?
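
For a rough sense of scale, the figures from the slides (~15m impressions and ~5GB of raw data per day, against a target of ~250m impressions per day) can be extrapolated in Python. The sketch assumes the raw size per impression stays the same, uses decimal gigabytes, and spreads traffic evenly over the day; none of these assumptions come from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

current_impressions = 15_000_000    # ~15m impressions per day (slide 8)
current_raw_bytes = 5 * 10**9       # ~5GB raw data per day, decimal GB assumed
target_impressions = 250_000_000    # ~250m impressions per day (slide 24)

# Average raw size of one impression at the current volume.
bytes_per_impression = current_raw_bytes / current_impressions

# Growth factor and projected raw volume if the per-impression size holds.
growth = target_impressions / current_impressions
target_raw_gb = bytes_per_impression * target_impressions / 10**9

# Average ingest rate if traffic were spread evenly across the day.
target_avg_per_second = target_impressions / SECONDS_PER_DAY

print(f"{bytes_per_impression:.0f} bytes per impression")
print(f"{growth:.1f}x growth, about {target_raw_gb:.0f} GB of raw data per day")
print(f"about {target_avg_per_second:.0f} impressions per second on average")
```

Real traffic has peaks, so the average rate is a lower bound on what the pipeline has to absorb.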