Transforming Mobile Marketing & Advertising™




                        Harnessing Hadoop for Big Data
                        Analytics

                                                                   Jobin Wilson
                                                                   jobin.wilson@flytxt.com




                                                                                             Confidential
               Copyright © 2010 Flytxt B.V. All rights reserved.
Who am I ?

   • Architect @ Flytxt (Big Data Analytics & Automation)

   • Passionate about data, distributed computing, machine learning

   • Previously

        • Virtualization & Cloud Lifecycle Management (BMC)

               • Designed and implemented Cloud Lifecycle Management interface for BMC

        • Large Scale Data Centre Automation (AOL)

               • Implemented Centralized Data Center Management Framework for AOL

        • Workflow Systems & Automation (Accenture)

               • Implemented Service Management Suite for various customers




Session Agenda!

• Data – What's the big deal?

• What is Hadoop (& what it is not)

• Map-Reduce Model & HDFS

• Hadoop Ecosystem & Tools

• Let's get started!

• Q&A




Five computers & a 640k ;-)

      "I think there is a world market for about five computers"
            Attributed to Thomas Watson, 1943, Chairman of the board of IBM

      "640k ought to be enough for anybody"
            Attributed to Bill Gates in 1981.

      Moore's Law




Data Explosion!




Do I also know what you might do next summer?


                                        •     Does your travel company know you visited Goa &
                                              Cochin twice in the last two years?

                                        •     Collaborative Filtering




                                        •     Lots of Data + Statistics = WOW!!!

                                        •     BTW, don't worry about the equation
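
A minimal sketch of the collaborative filtering idea mentioned above, assuming a tiny made-up travel history. The names `trips` and `recommend`, and all of the data, are hypothetical and not from the talk. Destinations a user has not visited are scored by how many visited destinations that user shares with other travellers.

```python
from collections import Counter

# Hypothetical travel history: user -> set of destinations visited.
trips = {
    "asha": {"Goa", "Cochin"},
    "ravi": {"Goa", "Cochin", "Munnar"},
    "meera": {"Goa", "Jaipur"},
    "john": {"Cochin", "Munnar"},
}


def recommend(user, history, top_n=2):
    """Rank unseen destinations by overlap with other users' trips."""
    seen = history[user]
    scores = Counter()
    for other, places in history.items():
        if other == user:
            continue
        overlap = len(seen & places)  # shared destinations act as similarity
        for place in places - seen:
            scores[place] += overlap
    return [place for place, score in scores.most_common(top_n) if score > 0]


print(recommend("asha", trips))
```

With real data the same counting would run over millions of users, which is where a cluster becomes useful.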




Don't throw away data just because it doesn't 'fit'


 •   Relational tuples, log files, semi-structured textual data (e.g., e-mail), pictures,
     videos

 •   User generated data & System generated data

 •   Applications need more than structured data

 •   My application is not “Dumb” any more!!

 •   “I keep saying that the sexy job in the next 10 years will be
      statisticians, and I’m not kidding.” - Hal Varian (Google’s chief economist)




Let's get to business!!

What is Apache Hadoop?

•   Apache Hadoop is an open-source system to
    reliably store and process extremely large data sets
    across many commodity computers.

•   Originally developed to support the Nutch search engine
    project.

•   Scales close to linearly with data size or analysis
    complexity as machines are added

•   Scale-out, shared-nothing architecture

•   Inspired by Google's MapReduce and Google File
    System (GFS) papers




Basics of Hadoop


 •   Two core components – HDFS & Map-Reduce

 •   Machines are unreliable

 •   Separates distributed, fault-tolerant computing code from application
     logic.

 •   No need to worry about the identity of a machine

 •   Lets you interact with a cluster, not a bunch of machines.

 •   Analysis workloads span multiple machines

 •   Runs as a cloud (cluster) & possibly on a cloud (EC2)




Lead Actors


•   Name Node – Bookkeeping metadata server

•   Secondary Name Node – Assistant to Name Node (checkpoints its metadata; not a hot standby)

•   Job Tracker – Scheduler

•   Task Tracker – Task execution

•   Data Node – Block storage




HDFS Write Model
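
The diagram for this slide did not survive extraction. As a rough stand-in, the toy sketch below shows the general idea of an HDFS write: a file is cut into fixed-size blocks, and each block is stored on several Data Nodes while the Name Node only records the plan. The block size (64 MB), the replication factor (3), the round-robin placement and the names `plan_write`, `dn1` to `dn4` are all assumptions made for illustration. Real HDFS placement is rack-aware and more involved.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size: 64 MB
REPLICATION = 3                # assumed replication factor


def plan_write(file_size, data_nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return the metadata a name node might keep for one file.

    Each entry is (block_index, block_length, replica_nodes).
    Placement is simple round-robin, a simplification of real HDFS.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    plan = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        start = index % len(data_nodes)
        replicas = [data_nodes[(start + i) % len(data_nodes)]
                    for i in range(replication)]
        plan.append((index, length, replicas))
        offset += length
        index += 1
    return plan


# A hypothetical 150 MB file on a four-node cluster.
for block in plan_write(150 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"]):
    print(block)
```

Losing any single node in this plan still leaves two copies of every block, which is the point of replicating across unreliable machines.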




Map-Reduce Model
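
The diagram for this slide is missing from the extracted text. The sketch below imitates the Map-Reduce model in plain Python on a single machine, using word count as the example. It does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and chosen only to mirror the three stages.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data is big", "data is data"]))
```

The application author writes only the mapper and the reducer; grouping by key is the framework's job, which is how Hadoop keeps distributed plumbing out of application logic.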




Map-Reduce Execution Flow
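
The flow diagram for this slide was lost in extraction. The single-process sketch below is a rough approximation of the usual stages: input splits are mapped independently, each intermediate key is routed to one reducer by a partition function, and each reducer processes its keys in sorted order. The names `partition`, `run_job`, `tokenize` and `total` are hypothetical, and CRC32 is used only as a stable stand-in hash; none of this is Hadoop's actual code.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer that owns a key, using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, mapper, reducer, num_reducers=2):
    # One bucket of grouped values per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    # Map: every split is processed independently of the others.
    for split in splits:
        for record in split:
            for key, value in mapper(record):
                # Shuffle: route the pair to the reducer owning this key.
                buckets[partition(key, num_reducers)][key].append(value)
    # Reduce: each reducer walks its own keys in sorted order.
    output = {}
    for bucket in buckets:
        for key in sorted(bucket):
            output[key] = reducer(key, bucket[key])
    return output


def tokenize(line):
    return [(word, 1) for word in line.split()]


def total(key, values):
    return sum(values)


splits = [["a b a"], ["b c"]]  # two hypothetical input splits
print(run_job(splits, tokenize, total))
```

Because all values for a key land on the same reducer, the result does not depend on how many reducers are configured.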




Hadoop Ecosystem
•   Oozie – Open-source workflow/coordination
    service to manage data processing jobs for Apache
    Hadoop™. Developed at Yahoo!

•   HBase – Column-store database based on
    Google's Bigtable. Holds extremely large data sets
    (petabytes)

•   Hive – SQL-based data warehousing app with
    features for analyzing very large data sets.
    Developed at Facebook

•   ZooKeeper – Distributed consensus engine
    providing leader election, service
    discovery, distributed locking / mutual exclusion

•   Pig – Platform for analyzing large data sets that
    consists of a high-level language for expressing
    data analysis steps

•   Ganglia – A scalable distributed monitoring system
    for high-performance computing systems such as
    clusters and grids
Hadoop is not a “Holy Grail”

•   Not a substitute for a database

•   MapReduce is not always the best algorithm

•   HDFS is not a substitute for a
    High Availability SAN-hosted FS

•   HDFS is not a POSIX file system

•   Not a place to learn Java programming

•   Not a place to learn Unix/Linux system administration

•   Not a place to learn basics of networking




Notable Users of Hadoop
(Source: http://en.wikipedia.org/wiki/Hadoop)



     • A9.com                               • Meebo
     • AOL                                  • Metaweb
     • EHarmony                             • The New York Times
     • eBay                                 • Rackspace
     • Facebook                             • StumbleUpon
     • Fox Interactive Media                • Twitter
     • IBM                                  • Yahoo
     • Last.fm                              • Amazon
     • LinkedIn




Q&A




                                                    www.flytxt.com
THANK YOU
      Contact us: dev2dev@flytxt.com / jobin.wilson@flytxt.com





Harnessing hadoop for big data analytics v0.1

  • 1. Transforming Mobile Marketing & Advertising™ Harnessing s for Big Data Analytics Jobin Wilson jobin.wilson@flytxt.com Confidential Copyright © 2010 Flytxt B.V. All rights reserved.
  • 2. Who am I ? • Architect @ Flytxt (Big Data Analytics & Automation) • Passionate about data, distributed computing , machine learning • Previously •Virtualization & Cloud Lifecycle Management(BMC) • Designed and Implemented Cloud Life Cycle Management Interface for BMC • Large Scale Data Centre Automation(AOL) • Implemented Centralized Data Center Management Framework for AOL •Workflow Systems & Automation (Accenture) • Implemented Service Management Suit for various customers Confidential Copyright © 2010 Flytxt B.V. All rights reserved.
  • 3. Session Agenda! • Data – What's the big deal? • What is Hadoop( & What it is not  ) • Map-Reduce Model & HDFS • Hadoop Ecosystem & Tools • Lets get started! • Q&A 3 Confidential Copyright © 2010 Flytxt B.V. All rights reserved.
  • 4. Five computers & a 640k ;-)
      • "I think there is a world market for about five computers" (Thomas Watson, Chairman of the Board of IBM, 1943)
      • "640k ought to be enough for anybody" (attributed to Bill Gates, 1981)
      • Moore's Law
  • 5. Data Explosion!
  • 6. Do I also know what you might do next summer?
      • Does your travel company know you visited Goa & Cochin twice in the last two years?
      • Collaborative Filtering
      • Lots of Data + Statistics = WOW!!!
      • BTW, don't worry about the equation
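The travel example on this slide can be made concrete with a toy item co-occurrence recommender, which is one simple form of collaborative filtering. The traveller names, the `trips` data, and the functions `co_occurrence` and `recommend` are all made up for illustration. This is a sketch under those assumptions, not the method used in the talk.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical travel histories: user -> set of destinations visited.
trips = {
    "asha": {"Goa", "Cochin", "Munnar"},
    "ravi": {"Goa", "Cochin"},
    "mina": {"Goa", "Munnar"},
    "john": {"Cochin", "Munnar"},
}

def co_occurrence(histories):
    """Count how often each ordered pair of destinations shares a history."""
    counts = defaultdict(int)
    for places in histories.values():
        for a, b in combinations(sorted(places), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(user, histories):
    """Rank unvisited destinations by co-occurrence with visited ones."""
    counts = co_occurrence(histories)
    seen = histories[user]
    scores = defaultdict(int)
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda place: (-scores[place], place))

print(recommend("ravi", trips))
```

With more users and more trips the same counting becomes a large aggregation over pairs, which is the kind of workload the rest of the deck maps onto Hadoop.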
  • 7. Don't throw away data just because it doesn't 'fit'
      • Relational tuples, log files, semi-structured textual data (e.g., e-mail), pictures, videos
      • User-generated data & system-generated data
      • Applications need more than structured data
      • My application is not "dumb" any more!!
      • "I keep saying that the sexy job in the next 10 years will be statisticians, and I'm not kidding." - Hal Varian (Google's chief economist)
  • 8. Let's get to business!! What is Apache Hadoop?
      • Apache Hadoop is an open-source system to reliably store and process extremely large data sets across many commodity computers.
      • Originally developed to support the Nutch search engine project
      • Scales linearly with data size or analysis complexity
      • Scale-out, shared-nothing architecture
      • Inspired by Google's MapReduce and Google File System (GFS) papers
  • 9. Basics of Hadoop
      • Two core components: HDFS & Map-Reduce
      • Machines are unreliable
      • Separates distributed, fault-tolerant computing code from application logic
      • No need to worry about the identity of a machine
      • Lets you interact with a cluster, not a bunch of machines
      • Analysis workloads span multiple machines
      • Runs as a cloud (cluster) & possibly on a cloud (EC2)
  • 10. Lead Actors
      • Name Node – Bookkeeping metadata server
      • Secondary Name Node – Assistant to the Name Node
      • Job Tracker – Scheduler
      • Task Tracker – Task execution
      • Data Node – Block storage
  • 11. HDFS Write Model
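The diagram for this slide did not survive in the transcript. As a rough stand-in, the sketch below simulates the two roles named on the previous slide: a Name Node that only keeps metadata (which block lives where) and Data Nodes that hold the block bytes. The classes and functions (`ToyNameNode`, `write_file`, `read_file`), the round-robin placement, and the tiny block size are assumptions made for illustration. Real HDFS uses much larger blocks, rack-aware placement, and a write pipeline between Data Nodes, none of which is modelled here.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

class ToyNameNode:
    """Keeps only metadata: (path, block index) -> list of Data Node names."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        self.replication = min(replication, len(self.data_nodes))
        self.block_map = {}
        self._next = 0  # round-robin cursor (an assumption, not HDFS policy)

    def allocate(self, path, index):
        n = len(self.data_nodes)
        targets = [self.data_nodes[(self._next + k) % n]
                   for k in range(self.replication)]
        self._next = (self._next + 1) % n
        self.block_map[(path, index)] = targets
        return targets

def write_file(name_node, storage, path, data, block_size):
    """Store every block on each Data Node the Name Node picked for it."""
    for index, block in enumerate(split_into_blocks(data, block_size)):
        for node in name_node.allocate(path, index):
            storage[node][(path, index)] = block

def read_file(name_node, storage, path):
    """Reassemble a file by reading each block from its first replica."""
    indexes = sorted(i for (p, i) in name_node.block_map if p == path)
    return b"".join(
        storage[name_node.block_map[(path, i)][0]][(path, i)] for i in indexes
    )

nodes = ["dn1", "dn2", "dn3", "dn4"]
name_node = ToyNameNode(nodes, replication=3)
storage = {node: {} for node in nodes}  # per-node block store
write_file(name_node, storage, "/logs/a.txt", b"abcdefghij", block_size=4)
print(name_node.block_map)
```

Because every block exists on several nodes, losing one machine in this toy model still leaves a readable copy of each block, which is the point behind "machines are unreliable" earlier in the deck.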
  • 12. Map-Reduce Model
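Since the slide itself is only a picture, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using the usual word-count example. It runs in plain Python and does not touch Hadoop; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are chosen for this illustration only.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

print(word_count(["big data", "Big deal"]))
```

The application author writes only the map and reduce functions. On a cluster the framework, not the author, takes care of the grouping step and of moving data between machines.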
  • 13. Map-Reduce Execution Flow
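The execution flow can be approximated in a few lines: input is cut into splits, one map task handles each split, map output is partitioned by key so that every key reaches exactly one reduce task, and each reduce task processes its keys in sorted order. The sketch below assumes a CRC32-based `partition` function as a stand-in for hash partitioning, and `run_job` is a hypothetical driver written for this example. Scheduling, retries, and data locality, which the Job Tracker and Task Trackers handle, are left out.

```python
import zlib
from collections import defaultdict

def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; CRC32 is used so results are repeatable."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(splits, mapper, reducer, num_reducers=2):
    # One bucket of grouped values per reduce task.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:                      # one map task per input split
        for record in split:
            for key, value in mapper(record):
                buckets[partition(key, num_reducers)][key].append(value)
    output = {}
    for bucket in buckets:                    # one reduce task per partition
        for key in sorted(bucket):            # keys arrive sorted at the reducer
            output[key] = reducer(key, bucket[key])
    return output

def count_mapper(line):
    """Emit (word, 1) for each word in a line."""
    return ((word, 1) for word in line.split())

def sum_reducer(key, values):
    """Add up the counts collected for one word."""
    return sum(values)

splits = [["a b", "a"], ["b c"]]
print(run_job(splits, count_mapper, sum_reducer))
```

Changing `num_reducers` changes which reduce task sees which key, but not the final counts, because all values for a given key always land in the same partition.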
  • 14. Hadoop Ecosystem
      • Oozie – Open-source workflow/coordination service to manage data processing jobs for Apache Hadoop™; developed at Yahoo!
      • HBase – Column-store database based on Google's Bigtable; holds extremely large data sets (petabytes)
      • Hive – SQL-based data warehousing application with features for analyzing very large data sets; developed at Facebook
      • ZooKeeper – Distributed consensus engine providing leader election, service discovery, distributed locking / mutual exclusion
      • Pig – Platform for analyzing large data sets that consists of a high-level language for expressing data analysis steps
      • Ganglia – Scalable distributed monitoring system for high-performance computing systems such as clusters and grids
  • 15. Hadoop is not a "Holy Grail"
      • Not a substitute for a database
      • MapReduce is not always the best algorithm
      • HDFS is not a substitute for a high-availability SAN-hosted file system
      • HDFS is not a POSIX file system
      • Not a place to learn Java programming
      • Not a place to learn Unix/Linux system administration
      • Not a place to learn the basics of networking
  • 16. Notable Users of Hadoop (Source: http://en.wikipedia.org/wiki/Hadoop)
      • A9.com, AOL, eHarmony, eBay, Facebook, Fox Interactive Media, IBM, Last.fm, LinkedIn
      • Meebo, Metaweb, The New York Times, Rackspace, StumbleUpon, Twitter, Yahoo, Amazon
  • 17. Q&A. www.flytxt.com
  • 18. Thank you. Contact us: dev2dev@flytxt.com / jobin.wilson@flytxt.com, www.flytxt.com