Leveraging Hadoop in Wikimart
             Roman Zykov
          Head of analytics
          http://wikimart.ru




London, Big Data World Europe, 20th September 2012
Key problem


To be or not to be….

Hadoop

Introduction
Key tasks for Wikimart

What
• BI tasks
• Web analytics (in-house solution)
• Recommendations on site
• Data services for marketing

Who
• Core analytics team
• Analytics members in other departments
• IT site operations
Problem

Too time consuming or too
expensive?
• Data volume
• # of data services
Map Reduce

[Diagram: DATA with two processing options, Standalone and Map Reduce]
Our idea

New platform for “Big Data” tasks only

• Start research on Map Reduce software
• First patient - recommendation engine

Difficulties
- no planned budget ->   Hadoop is free
- no experts        ->   learn it
- no hardware       ->   virtual cluster
Requirements for Hadoop


•   Easily scalable
•   Easy deployment
•   Easy integration
•   Less low-level Java coding
•   SQL-like queries
Data flow

[Diagram: DWH and data feeds]
Accomplishments

Recommendations
• Collaborative filtering (item-to-item on browsing history, Pig)
• Similar products (item attributes, Pig)
• Most popular items (browsing history + orders, HiveQL)
• Internal and external search recommendations (HiveQL)
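The item-to-item idea above can be sketched in a few lines. This is a minimal, in-memory illustration of co-occurrence counting on browsing sessions, written as a map step and a reduce step; it is not the Pig script used at Wikimart. The function names, the cosine-style score, and the sample sessions are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def map_sessions(sessions):
    """Map step: emit ((item_a, item_b), 1) for every pair co-viewed in a session."""
    for items in sessions:
        for a, b in combinations(sorted(set(items)), 2):
            yield (a, b), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def item_views(sessions):
    """Number of sessions in which each item was viewed."""
    views = defaultdict(int)
    for items in sessions:
        for item in set(items):
            views[item] += 1
    return dict(views)


def similar_items(sessions, top_n=3):
    """Rank neighbours of each item by co-views / sqrt(views_a * views_b)."""
    co_views = reduce_counts(map_sessions(sessions))
    views = item_views(sessions)
    scores = defaultdict(list)
    for (a, b), n in co_views.items():
        score = n / sqrt(views[a] * views[b])
        scores[a].append((b, score))
        scores[b].append((a, score))
    return {
        item: [other for other, _ in
               sorted(cands, key=lambda c: (-c[1], c[0]))[:top_n]]
        for item, cands in scores.items()
    }


# Hypothetical browsing sessions, one list of viewed items per visitor.
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger"],
    ["laptop", "mouse"],
    ["phone", "case", "laptop"],
]
recs = similar_items(sessions)
print(recs["phone"])
```

On a cluster the same two steps would run as a distributed job; the pair-emitting map step is where most of the intermediate volume comes from.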

Some statistics after 1 year
• >10% of revenue
• 3 months to launch
• Tens of gigabytes processed daily in 2 hours
• 1 crash only (cluster lost power)

Decision: Invest in a hardware cluster
End user

Internal high-level languages
• HiveQL
• Pig

Reporting
• Pre-aggregated data for OLAP
• RDBMS - front end
• OLAP and Reporting software should
  support HiveQL
Data Integration

• Sqoop
  • Parallel data exchange with RDBMS
    (MS SQL, MySQL, Oracle, Teradata… )
  • Incremental updates
  • HDFS, Hive, HBase

• Talend Open Studio
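The "incremental updates" point can be illustrated with a watermark: each run pulls only rows whose check column exceeds the last value seen. Sqoop's append mode works along these lines with its `--check-column` and `--last-value` options. The sketch below only mimics the idea against an in-memory SQLite database; the `orders` table, its columns, and the `incremental_import` helper are hypothetical.

```python
import sqlite3


def incremental_import(conn, table, check_column, last_value):
    """Fetch rows with check_column > last_value and return them with the new watermark.

    Table and column names are interpolated directly, so they must come from
    trusted configuration, never from user input.
    """
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}",
        (last_value,),
    )
    col_index = [d[0] for d in cur.description].index(check_column)
    rows = cur.fetchall()
    new_last_value = rows[-1][col_index] if rows else last_value
    return rows, new_last_value


# Stand-in for the source RDBMS (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (item) VALUES (?)",
                 [("phone",), ("case",), ("laptop",)])

# First run: everything after watermark 0.
first_batch, watermark = incremental_import(conn, "orders", "id", 0)

# New data arrives; the second run picks up only the new row.
conn.execute("INSERT INTO orders (item) VALUES (?)", ("mouse",))
second_batch, watermark = incremental_import(conn, "orders", "id", watermark)
print(len(first_batch), second_batch, watermark)
```

The watermark has to be stored between runs; a monotonically increasing key or a last-modified timestamp are the usual choices for the check column.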
Hadoop vs RDBMS

• Will never replace an RDBMS:
   • Latency
   • Weak capabilities of HiveQL vs SQL
• Only for some tasks with offline processing:
   • Machine learning
   • Queries on big tables
   • ….
• Real time: NoSQL
Hadoop myth


      Terabytes?
      Petabytes?

      Big tasks!
Conclusion

• Hadoop is not rocket science
• Intermediate data can be Big Data
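
A quick back-of-the-envelope check shows how modest input can produce large intermediate data. In pair-based jobs such as item-to-item co-occurrence, a session with n distinct items emits n(n-1)/2 pairs. The session count and length below are made-up numbers chosen only to show the arithmetic.

```python
from math import comb


def pair_count(session_lengths):
    """Total item pairs emitted when every session yields all 2-item combinations."""
    return sum(comb(n, 2) for n in session_lengths)


# Hypothetical workload: one million sessions of 20 distinct items each.
session_lengths = [20] * 1_000_000
input_records = sum(session_lengths)     # item views read as input
pairs = pair_count(session_lengths)      # intermediate records emitted
blowup = pairs / input_records           # intermediate records per input record
print(input_records, pairs, blowup)
```

Because the pair count grows quadratically with session length, long sessions dominate the intermediate volume.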

Starter kit
• Hadoop management system
• Virtual hardware (cloud, virtual servers, etc.)
• Offline data tasks
• Pig or HiveQL
• Sqoop: import data from existing data sources
Thank you!!!

     rzykov@gmail.com
linkedin.com/in/romanzykov

Leveraging Hadoop to mine customer insights in a developing market
