Hadoop @ eBuddy
eBuddy
Web-based chat (started in 2003)
● Initially no statistics, MSN only
● Started basic logging in 2004
● Today
  ○ 34,467,010,693 login records (34 x 10^9)
  ○ It takes about 40min to select them all.
XMS (Launched May 23, 2011)
● Today
   ○ 1,334,794,121 records (1.3 x 10^9)
Website (Google Analytics)
Banners (OpenX)
Warehousing needs
● Product owners
  ○ Comparing product versions
     ■ average duration
     ■ messages sent/received
  ○ Churn analysis
  ○ Feature analysis
● Marketing
  ○ What countries should we focus on
  ○ What people should we target?
● Sales
  ○ Sell banners in countries/products.
● Operations/Dev
  ○ Help solve bugs
  ○ Blocked in countries/providers
Interesting to know
● Developers are Java centric
● Hosting in the US but BI people in Amsterdam
● 18 Hadoop nodes, each having
    ○ 16 cores
    ○ 24 GB RAM
    ○ 4 x 400 GB hard disks
● We make money with banners
    ○ So don't expect deep pockets
Warehouse timeline
● Traditional RDBMS (2004)
● Custom MapReduce code (2008)
  ○ Joining two files (merge join/map join?)
  ○ Repeating code
  ○ Consider abstraction
  ○ Changing data means changing code?
● Pig scripts (2008/2009)
  ○ Much simpler to read but domain-specific
● Hive (2009)
  ○ Generic SQL but with some limitations
  ○ Existing tools can be used
Hive
● Hey I already know this:
select *
from table1 t1
  left outer join table2 t2 on (t1.id = t2.id)
where t2.id is null;


● Java programmers will like this:
  ○ Spring JdbcTemplates
  ○ Existing JDBC tools (SQuirreL)
  ○ Syntax highlighting
  ○ Code completion
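
The query on this slide is an anti-join: it keeps the rows of table1 that have no partner in table2. As a minimal sketch, the same statement can be run from Python against an in-memory SQLite database. SQLite only stands in for Hive here, and the column `name`, the sample rows, and both helper functions are hypothetical. The slide selects `*`; the sketch names its columns so the result is easy to read.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table table1 (id integer, name text)")
    conn.execute("create table table2 (id integer)")
    conn.executemany(
        "insert into table1 values (?, ?)",
        [(1, "alice"), (2, "bob"), (3, "carol")],
    )
    conn.executemany("insert into table2 values (?)", [(1,), (3,)])
    return conn


def rows_missing_from_table2(conn):
    """Return the table1 rows whose id does not occur in table2."""
    query = """
        select t1.id, t1.name
        from table1 t1
          left outer join table2 t2 on (t1.id = t2.id)
        where t2.id is null
        order by t1.id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(rows_missing_from_table2(build_demo_db()))
```

Only id 2 is absent from table2 in the sample data, so that is the only row the filter on `t2.id is null` lets through.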
Present
● App servers log to MySQL
  ○ Brittle but it works
● Hive
  ○ SQL (most developers know this)
  ○ Partition pruning issues
  ○ No rollup queries
● ETL
  ○ Star schema
  ○ Fair scheduling (ETL vs BI)
     ■ Capacity reserved for the ETL pool
     ■ Don't start reducers until 90% of the mappers are done
  ○ LZO compression on all jobs
● MicroStrategy (ODBC)
● SQuirreL (JDBC)
Future
● Look at users from A to Z
  ○ Website logs
  ○ Banners
● Cassandra handler for Hive
  ○ Looking at contact lists (not just size)
● Streaming ETL
  ○ Flume
      ■ No more MySQL & scripts
      ■ Write directly into the correct partition
  ○ Avro
      ■ Fewer schema-related problems
  ○ Snappy
      ■ Lightweight compression
Questions?
Hive partition pruning
● Won't work
select count(*)
from chatsessions cs
  inner join calendar c on (c.cldr_id = cs.login_cldr_id)
where c.iso_date = '2012-06-14';


● Will work
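-- Step 1: resolve the date to its calendar id
select cldr_id from calendar where iso_date = '2012-06-14';
-- Step 2: filter on the literal id (1234 stands for the result of step 1)
select count(*) from chatsessions where login_cldr_id in (1234);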
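
The two-step workaround can be sketched in Python. This is a toy model, not Hive: it assumes that chatsessions is partitioned by login_cldr_id (the slide implies this but does not state it), and the dictionaries, sample ids, and function names are all hypothetical. The point is that once the date has been resolved to literal ids, only the matching partitions need to be read.

```python
def build_pruned_count_query(cldr_ids):
    """Build the second-step query with literal partition keys."""
    if not cldr_ids:
        raise ValueError("no calendar ids resolved for the requested date")
    # int() guards against anything but numeric ids ending up in the SQL text.
    id_list = ", ".join(str(int(i)) for i in cldr_ids)
    return (
        "select count(*) from chatsessions "
        f"where login_cldr_id in ({id_list})"
    )


def count_sessions(partitions, calendar, iso_date):
    """Count sessions for a date while reading only the matching partitions.

    partitions: dict mapping login_cldr_id -> list of session rows (toy data)
    calendar:   dict mapping cldr_id -> ISO date string (small dimension table)
    Returns (row_count, ids_of_partitions_read).
    """
    # Step 1: resolve the date against the small calendar table.
    cldr_ids = [cid for cid, day in calendar.items() if day == iso_date]
    # Step 2: read only the partitions whose key was resolved in step 1.
    scanned = [cid for cid in cldr_ids if cid in partitions]
    total = sum(len(partitions[cid]) for cid in scanned)
    return total, scanned


if __name__ == "__main__":
    calendar = {1233: "2012-06-13", 1234: "2012-06-14"}
    partitions = {1233: ["s1", "s2"], 1234: ["s3", "s4", "s5"]}
    print(count_sessions(partitions, calendar, "2012-06-14"))
    print(build_pruned_count_query([1234]))
```

In the join form, the date predicate sits on the calendar table, so the literal partition key is not known when the scan of chatsessions is planned; resolving it first makes the key a constant.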
Left outer join in Pig
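-- Pig counterpart of the Hive query: rows of A with no matching key in B
A = LOAD 'file1' USING PigStorage(',') AS (a1:int,a2:chararray);
B = LOAD 'file2' USING PigStorage(',') AS (b1:int,b2:chararray);
-- Group both inputs by key; OUTER keeps groups whose B bag is empty
C = COGROUP A BY a1, B BY b1 OUTER;
-- Keep only the keys that never occur in B
X = FILTER C BY IsEmpty(B);
-- Emit a2 for every remaining A row
Z = FOREACH X GENERATE flatten(A.a2);
DUMP Z;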
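
For readers who do not know Pig, the script's logic can be mirrored in plain Python. This is a sketch of the semantics only, with made-up rows and hypothetical function names; it says nothing about how Pig executes the job, and Pig does not guarantee the output order that a Python dict happens to give.

```python
from collections import defaultdict


def cogroup(a_rows, b_rows):
    """Group two lists of (key, value) rows by key, like COGROUP.

    Returns a dict mapping key -> (bag of A rows, bag of B rows).
    """
    groups = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[row[0]][0].append(row)
    for row in b_rows:
        groups[row[0]][1].append(row)
    return dict(groups)


def unmatched_a_values(a_rows, b_rows):
    """Return a2 for every A row whose key never occurs in B."""
    result = []
    for a_bag, b_bag in cogroup(a_rows, b_rows).values():
        if not b_bag:  # FILTER C BY IsEmpty(B)
            result.extend(a2 for _, a2 in a_bag)  # flatten(A.a2)
    return result


if __name__ == "__main__":
    a = [(1, "x"), (2, "y"), (2, "z"), (3, "w")]
    b = [(1, "p"), (4, "q")]
    print(sorted(unmatched_a_values(a, b)))
```

Keys 2 and 3 occur only in A, so their values survive; key 4 occurs only in B and is dropped by the same emptiness test.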
● Avro & Hive: https://issues.apache.org/jira/browse/HIVE-895

● Flume: https://cwiki.apache.org/FLUME/
