Big Data




Steve Watt
Technology Strategy @ HP
swatt@hp.com
Agenda

Hardware, Software, Data

• Big Data
• Situational Applications
Situational Applications




– eaghra (Flickr)
Web 2.0 Era Topic Map
Topics on the map:
• Data Explosion
• Inexpensive Storage
• LAMP
• Social Platforms
• Publishing Platforms
• Web 2.0
• Mashups
• Situational Applications
• Enterprise
• SOA
• Produce
• Process
Big Data




– blmiers2 (Flickr)
The data just keeps growing…

1024 GIGABYTES = 1 TERABYTE
1024 TERABYTES = 1 PETABYTE
1024 PETABYTES = 1 EXABYTE

1 PETABYTE: 13.3 years of HD video
20 PETABYTES: amount of data processed by Google daily
5 EXABYTES: all words ever spoken by humanity
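
The unit ladder above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original deck: it assumes binary units (factors of 1024, as on the slide) and 365-day years, and the function name `implied_bitrate_mbps` is made up for illustration. It derives the average video bitrate implied by "1 petabyte = 13.3 years of HD video".

```python
# Binary storage units as defined on the slide: each step is a factor of 1024.
GIGABYTE = 1024 ** 3
TERABYTE = 1024 * GIGABYTE
PETABYTE = 1024 * TERABYTE
EXABYTE = 1024 * PETABYTE

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: leap years are ignored


def implied_bitrate_mbps(total_bytes: int, years: float) -> float:
    """Average bitrate in megabits per second if total_bytes holds `years` of video."""
    seconds = years * SECONDS_PER_YEAR
    return total_bytes * 8 / seconds / 1_000_000


if __name__ == "__main__":
    print(f"1 PB = {PETABYTE:,} bytes")
    print(f"Implied HD bitrate: {implied_bitrate_mbps(PETABYTE, 13.3):.1f} Mbit/s")
```

Under these assumptions the implied bitrate lands a little above 20 Mbit/s, which is a plausible figure for HD video.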
Web as a Platform
• Infrastructure: Web 1.0 - Connecting Machines
• API Foundation: Web 2.0 - Connecting People
• The Fractured Web: Facebook, Twitter, LinkedIn, Google, NetFlix, New York Times, eBay, Pandora, PayPal
    Web 2.0 Data Exhaust of Historical and Real-time Data
• Service Economy: Service for this, Service for that
• App Economy for Devices: App for this, App for that (Set Top Boxes, Tablets, etc.)
• Mobile: Multiple Sensors in your pocket
• Sensor Web: An instrumented and monitored world (Real-time Data)
• Opportunity
Data Deluge! But filter patterns can help…
Kakadu (Flickr)
Filtering With Search
Filtering Socially




            Awesome
Filtering Visually
But filter patterns force you down a pre-processed path
M.V. Jantzen (Flickr)
What if you could ask your own questions?

– wowwzers(Flickr)
And go from discovering Something about Everything…

– MrB-MMX (Flickr)
To discovering Everything about Something?
How do we do this?
Let's examine a few techniques for Gathering, Storing, Processing & Delivering Data @ Scale
Gathering Data

Data Marketplaces




Gathering Data

Apache Nutch (Web Crawler)
Storing, Reading and Processing - Apache Hadoop
•    Cluster technology with a single master that scales out across multiple slaves
•    It consists of two runtimes:
     •   The Hadoop Distributed File System (HDFS)
     •   Map/Reduce
•    As data is copied onto HDFS, it is split into blocks and replicated to other machines to provide redundancy
•    A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop master, which in turn distributes the job's tasks to the slaves in the cluster
•    Tasks run on data that is on the local disks of the machine they are sent to, ensuring data locality
•    Node (slave) failures are handled automatically by Hadoop. Hadoop may execute or re-execute a task on any node in the cluster.
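
The Map/Reduce model described above can be illustrated with a small in-memory simulation. This is only a sketch of the programming model, not Hadoop's API: the names `map_words`, `reduce_counts` and `run_job` are hypothetical, and each inner list stands in for an HDFS block that a real cluster would process on the node holding it.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(
    blocks: Iterable[Iterable[str]],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run mapper over every block, group values by key, then reduce each group."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for block in blocks:  # each block stands in for data local to one node
        for line in block:
            for key, value in mapper(line):
                grouped[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job([["big data big"], ["Data at scale"]], map_words, reduce_counts))
```

Because the mapper only ever sees one record and the reducer only one key, either step can be rerun on another node after a failure without affecting the rest of the job.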

     Want to know more?
     “Hadoop – The Definitive Guide (2nd Edition)”
Delivering Data @ Scale

•    Structured Data
•    Low Latency & Random Access
•    Column Stores (Apache HBase or Apache Cassandra)
     •   faster seeks
     •   better compression
     •   simpler scale out
     •   De-normalized – Data is written as it is intended to be queried
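
To make "data is written as it is intended to be queried" concrete, here is a minimal in-memory sketch of a sorted wide-row table. It is not the HBase or Cassandra API: the class `WideRowTable`, the `sector#year#company` row-key layout and the sample company names are all assumptions made for illustration. The row key is composed so that the expected query ("all companies in a sector") becomes a cheap prefix scan over sorted keys.

```python
import bisect
from typing import Dict, Iterator, List, Tuple


class WideRowTable:
    """Toy column store: rows are kept sorted by key, each row is a dict of columns."""

    def __init__(self) -> None:
        self._keys: List[str] = []              # row keys in sorted order
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, row_key: str, columns: Dict[str, str]) -> None:
        """Insert or update the columns of one row."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key: str) -> Dict[str, str]:
        """Random access to a single row by key."""
        return self._rows.get(row_key, {})

    def scan_prefix(self, prefix: str) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Yield all rows whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so no later key can match
            yield key, self._rows[key]


if __name__ == "__main__":
    table = WideRowTable()
    # Hypothetical rows: the key already encodes the query path sector -> year -> company.
    table.put("BioTech#2010#ExampleBio", {"city": "Cambridge", "amount": "5000000"})
    table.put("ConsumerWeb#2011#ExampleWeb", {"city": "Seattle", "amount": "2000000"})
    table.put("BioTech#2011#OtherBio", {"city": "San Diego", "amount": "7000000"})
    for key, columns in table.scan_prefix("BioTech#"):
        print(key, columns)
```

The trade-off is that a different query, such as "all companies in a city", would need its own copy of the data written under a different key.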




         Want to know more?
         “HBase – The Definitive Guide” & “Cassandra High Performance
Storing, Processing & Delivering : Hadoop + NoSQL

Gather
• Web Data -> Nutch Crawl -> Copy -> HDFS
• Log Files -> Flume Connector -> HDFS
• Relational Data (JDBC), e.g. MySQL -> SQOOP Connector -> HDFS

Read/Transform (Apache Hadoop)
- Clean and Filter Data
- Transform and Enrich Data
- Often multiple Hadoop jobs

Low-latency
• HDFS -> NoSQL Connector/API -> NoSQL Repository

Application
• Query and Serve from the NoSQL Repository
Some things to keep in mind…
– Kanaka Menehune (Flickr)
Some things to keep in mind…

•    Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing the data with many different kinds of readers.
     Hadoop is really great at this!
•    However, readers won’t really help you process truly unstructured data such as prose. For that you’re going to have to get handy with Natural Language Processing, which is really hard.
     Consider using parsing services & APIs like Open Calais.
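
The idea of "many different kinds of readers" feeding one normalized record shape can be sketched as follows. This is a standalone illustration rather than Hadoop's InputFormat machinery: the function names and the sample fields (`Company`, `City`) are assumptions. One reader handles tab-delimited text, another handles JSON lines, and both are passed through the same `normalize` step.

```python
import csv
import io
import json
from typing import Dict, Iterator, List


def read_tsv(text: str, fields: List[str]) -> Iterator[Dict[str, str]]:
    """Reader for structured, tab-delimited input with a known column order."""
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if row:
            yield dict(zip(fields, row))


def read_json_lines(text: str) -> Iterator[Dict[str, object]]:
    """Reader for semi-structured input: one JSON object per line."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


def normalize(record: Dict[str, object]) -> Dict[str, object]:
    """Bring records from any reader into one shape: lowercase keys, trimmed strings."""
    normalized: Dict[str, object] = {}
    for key, value in record.items():
        normalized[key.strip().lower()] = value.strip() if isinstance(value, str) else value
    return normalized


if __name__ == "__main__":
    tsv_records = read_tsv("Acme\tAustin\n", ["Company", "City"])
    json_records = read_json_lines('{"Company": " Acme ", "City": "Austin"}\n')
    for record in list(tsv_records) + list(json_records):
        print(normalize(record))
```

Prose has no such field structure for a reader to recover, which is why the slide points to Natural Language Processing services for that case.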

     Want to know more?
     “Programming Pig” (O’REILLY)
Open Calais (Gnosis)




Statistical real-time decision making

•    Capture historical information
•    Use Machine Learning to build decision-making models (such as Classification, Clustering & Recommendation)
•    Mesh real-time events (such as sensor data) against the models to make automated decisions
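
The three steps above can be sketched with a deliberately simple classifier. This is not Mahout and not the method used in the talk: it swaps in a nearest-centroid classifier written in plain Python, and the feature values and labels ("normal", "alert") are invented sample data. Historical, labeled readings are averaged into one centroid per label, and each incoming event is assigned the label of the closest centroid.

```python
from collections import defaultdict
from math import dist
from typing import Dict, Iterable, List, Sequence, Tuple

Features = Sequence[float]


def train_centroids(history: Iterable[Tuple[Features, str]]) -> Dict[str, Tuple[float, ...]]:
    """Build the model: the mean feature vector (centroid) of each label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for features, label in history:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: tuple(v / counts[label] for v in total) for label, total in sums.items()}


def decide(model: Dict[str, Tuple[float, ...]], event: Features) -> str:
    """Score one real-time event against the model: pick the nearest centroid's label."""
    return min(model, key=lambda label: dist(model[label], event))


if __name__ == "__main__":
    # Hypothetical historical sensor readings: (temperature, vibration) with a label.
    history = [
        ((20.0, 0.1), "normal"),
        ((22.0, 0.2), "normal"),
        ((80.0, 0.9), "alert"),
        ((90.0, 0.8), "alert"),
    ]
    model = train_centroids(history)
    print(decide(model, (85.0, 0.7)))
```

Training runs offline over the historical data, while `decide` is cheap enough to call per event, which mirrors the split between batch model building and real-time scoring.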




     Want to know more?
     “Mahout in Action”
Pascal Terjan (Flickr)
Making the data STRUCTURED




•    Retrieving HTML
•    Preliminary filtering on URL
•    Company POJO, then tab-delimited (\t) output
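
The three callouts above describe a fetch, filter, extract and emit flow. The slide's original code was Java and is not reproduced in this transcript, so the following is only a Python sketch of the same shape under stated assumptions: the `/company/` URL prefix, the use of the page `<title>` as the company name, and the `Company` record fields are all hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urlparse


@dataclass
class Company:
    """Plain record object, standing in for the slide's Company POJO."""
    name: str
    url: str

    def to_tsv(self) -> str:
        """Emit the record as one tab-delimited line."""
        return "\t".join((self.name, self.url))


class TitleParser(HTMLParser):
    """Collects the text inside the page's <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def is_company_url(url: str) -> bool:
    """Preliminary filter: keep only pages under the assumed /company/ path."""
    return urlparse(url).path.startswith("/company/")


def extract(url: str, html: str) -> Optional[Company]:
    """Turn one crawled page into a Company record, or None if it is filtered out."""
    if not is_company_url(url):
        return None
    parser = TitleParser()
    parser.feed(html)
    name = parser.title.strip()
    return Company(name, url) if name else None


if __name__ == "__main__":
    page = "<html><head><title> Acme </title></head><body></body></html>"
    record = extract("http://example.com/company/acme", page)
    if record is not None:
        print(record.to_tsv())
```

Filtering on the URL before parsing keeps the expensive HTML work off pages that could never yield a record.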




Aargh! My viz tool requires zip codes to plot geospatially!
Apache Pig Script to Join on City to get Zip Code and Write the results to Vertica

-- Load the city-to-zip-code lookup (tab-delimited)
ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);

-- Load the CrunchBase investment records (tab-delimited)
CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS
(Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);

-- Join on (City, State) to attach a zip code to each investment
CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);

-- Write the joined records to Vertica; column types follow the schema loaded above
STORE CrunchBaseZip INTO
'{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}'
USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
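
For readers unfamiliar with Pig, the JOIN step in the script can be mirrored in a few lines of Python. This is an illustrative equivalent rather than how Pig executes the join: the function name `join_on_city_state` and the sample rows are assumptions, while the field names follow the script's schema. It builds a lookup keyed on (City, State) and emits one output row per match, which is inner-join behavior.

```python
from typing import Dict, Iterable, List, Tuple


def join_on_city_state(
    investments: Iterable[Dict[str, object]],
    zipcodes: Iterable[Dict[str, object]],
) -> List[Dict[str, object]]:
    """Inner join of investment rows with zip-code rows on the (City, State) pair."""
    lookup: Dict[Tuple[object, object], List[object]] = {}
    for row in zipcodes:
        lookup.setdefault((row["City"], row["State"]), []).append(row["ZipCode"])

    joined: List[Dict[str, object]] = []
    for investment in investments:
        key = (investment["City"], investment["State"])
        for zip_code in lookup.get(key, []):  # no match means the row is dropped
            joined.append({**investment, "ZipCode": zip_code})
    return joined


if __name__ == "__main__":
    # Hypothetical sample rows for illustration only.
    zipcodes = [
        {"State": "TX", "City": "Austin", "ZipCode": 78701},
        {"State": "TX", "City": "Austin", "ZipCode": 78702},
    ]
    investments = [
        {"Company": "ExampleCo", "City": "Austin", "State": "TX", "Amount": 1000000},
        {"Company": "NoMatchCo", "City": "Nowhere", "State": "ZZ", "Amount": 500000},
    ]
    for row in join_on_city_state(investments, zipcodes):
        print(row)
```

One design point worth noting: a city with several zip codes yields several output rows per investment, so a lookup file with one representative zip code per city avoids double counting when amounts are later summed.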
Total Tech Investments By Year
Investment Funding By Sector
Total Investments By Zip Code for all Sectors
• $7.3 Billion in San Francisco
• $2.9 Billion in Mountain View
• $1.7 Billion in Austin
• $1.2 Billion in Boston
Total Investments By Zip Code for Consumer Web
• $1.7 Billion in San Francisco
• $1.2 Billion in Chicago
• $600 Million in Seattle
Total Investments By Zip Code for BioTech
• $1.3 Billion in Cambridge
• $1.1 Billion in San Diego
• $528 Million in Dallas
Questions?



More Related Content

What's hot

Platforms for data science
Platforms for data sciencePlatforms for data science
Platforms for data scienceDeepak Singh
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Big Data Platform adopting Spark and Use Cases with Open Data
Big Data  Platform adopting Spark and Use Cases with Open DataBig Data  Platform adopting Spark and Use Cases with Open Data
Big Data Platform adopting Spark and Use Cases with Open DataJongwook Woo
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousingDataWorks Summit
 
Business of Big Data
Business of Big DataBusiness of Big Data
Business of Big DataLeonid Zhukov
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Mahantesh Angadi
 
Hadoop World 2011: Hadoop in a Mission Critical Environment - Jim Haas - CBSi
Hadoop World 2011: Hadoop in a Mission Critical Environment - Jim Haas - CBSiHadoop World 2011: Hadoop in a Mission Critical Environment - Jim Haas - CBSi
Hadoop World 2011: Hadoop in a Mission Critical Environment - Jim Haas - CBSiCloudera, Inc.
 
Introduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using HadoopIntroduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using HadoopJongwook Woo
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
 
Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Jongwook Woo
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInHari Shankar Sreekumar
 
Django and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks assDjango and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks assTobias Lindaaker
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshopFang Mac
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingJongwook Woo
 

What's hot (20)

Platforms for data science
Platforms for data sciencePlatforms for data science
Platforms for data science
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Big Data Platform adopting Spark and Use Cases with Open Data
Big Data  Platform adopting Spark and Use Cases with Open DataBig Data  Platform adopting Spark and Use Cases with Open Data
Big Data Platform adopting Spark and Use Cases with Open Data
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousing
 
Business of Big Data
Business of Big DataBusiness of Big Data
Business of Big Data
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
Hadoop World 2011: Hadoop in a Mission Critical Environment - Jim Haas - CBSi
Hadoop World 2011: Hadoop in a Mission Critical Environment - Jim Haas - CBSiHadoop World 2011: Hadoop in a Mission Critical Environment - Jim Haas - CBSi
Hadoop World 2011: Hadoop in a Mission Critical Environment - Jim Haas - CBSi
 
Introduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using HadoopIntroduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using Hadoop
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedIn
 
Django and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks assDjango and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks ass
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and Training
 

Similar to Tech4Africa - Opportunities around Big Data

Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big DecisionsInnoTech
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshConfluentInc1
 
Large Scale Data Analysis Tools
Large Scale Data Analysis ToolsLarge Scale Data Analysis Tools
Large Scale Data Analysis Toolsboorad
 
Cytoscape Untangles the Web: a first step towards Cytoscape Cyberinfrastructu...
Cytoscape Untangles the Web: a first step towards Cytoscape Cyberinfrastructu...Cytoscape Untangles the Web: a first step towards Cytoscape Cyberinfrastructu...
Cytoscape Untangles the Web: a first step towards Cytoscape Cyberinfrastructu...Keiichiro Ono
 
Big Data Basic Concepts | Presented in 2014
Big Data Basic Concepts  | Presented in 2014Big Data Basic Concepts  | Presented in 2014
Big Data Basic Concepts | Presented in 2014Kenneth Igiri
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
 
The Information Workbench as a Self-Service Platform for Linked Data Applicat...
The Information Workbench as a Self-Service Platform for Linked Data Applicat...The Information Workbench as a Self-Service Platform for Linked Data Applicat...
The Information Workbench as a Self-Service Platform for Linked Data Applicat...Peter Haase
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarCloudera, Inc.
 
Linked Data as a Service
Linked Data as a ServiceLinked Data as a Service
Linked Data as a ServicePeter Haase
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open PlatformJongwook Woo
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
Tableau 7.0 prsentation
Tableau 7.0 prsentationTableau 7.0 prsentation
Tableau 7.0 prsentationinam_slides
 
Leonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldLeonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldOutlyer
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataRichard McDougall
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big datasolarisyourep
 

Similar to Tech4Africa - Opportunities around Big Data (20)

Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big Decisions
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
 
Large Scale Data Analysis Tools
Large Scale Data Analysis ToolsLarge Scale Data Analysis Tools
Large Scale Data Analysis Tools
 
Cytoscape Untangles the Web: a first step towards Cytoscape Cyberinfrastructu...
Cytoscape Untangles the Web: a first step towards Cytoscape Cyberinfrastructu...Cytoscape Untangles the Web: a first step towards Cytoscape Cyberinfrastructu...
Cytoscape Untangles the Web: a first step towards Cytoscape Cyberinfrastructu...
 
Addressing dm-cloud
Addressing dm-cloudAddressing dm-cloud
Addressing dm-cloud
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Big Data Basic Concepts | Presented in 2014
Big Data Basic Concepts  | Presented in 2014Big Data Basic Concepts  | Presented in 2014
Big Data Basic Concepts | Presented in 2014
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
The Information Workbench as a Self-Service Platform for Linked Data Applicat...
The Information Workbench as a Self-Service Platform for Linked Data Applicat...The Information Workbench as a Self-Service Platform for Linked Data Applicat...
The Information Workbench as a Self-Service Platform for Linked Data Applicat...
 
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
 
Linked Data as a Service
Linked Data as a ServiceLinked Data as a Service
Linked Data as a Service
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open Platform
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
Tableau 7.0 prsentation
Tableau 7.0 prsentationTableau 7.0 prsentation
Tableau 7.0 prsentation
 
Leonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldLeonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning World
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big Data
 
Big data with java
Big data with javaBig data with java
Big data with java
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 

More from Steve Watt

Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerSteve Watt
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerSteve Watt
 
Hadoop for the disillusioned
Hadoop for the disillusionedHadoop for the disillusioned
Hadoop for the disillusionedSteve Watt
 
Hadoop file systems
Hadoop file systemsHadoop file systems
Hadoop file systemsSteve Watt
 
Apache con 2013-hadoop
Apache con 2013-hadoopApache con 2013-hadoop
Apache con 2013-hadoopSteve Watt
 
Apache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureApache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureSteve Watt
 
Mining the Web for Information using Hadoop
Mining the Web for Information using HadoopMining the Web for Information using Hadoop
Mining the Web for Information using HadoopSteve Watt
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaSteve Watt
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopSteve Watt
 

More from Steve Watt (12)

Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and Docker
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and Docker
 
Hadoop for the disillusioned
Hadoop for the disillusionedHadoop for the disillusioned
Hadoop for the disillusioned
 
Hadoop file systems
Hadoop file systemsHadoop file systems
Hadoop file systems
 
Apache con 2013-hadoop
Apache con 2013-hadoopApache con 2013-hadoop
Apache con 2013-hadoop
 
Apache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureApache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructure
 
Mining the Web for Information using Hadoop
Mining the Web for Information using HadoopMining the Web for Information using Hadoop
Mining the Web for Information using Hadoop
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
 
Final deck
Final deckFinal deck
Final deck
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Extractiv
ExtractivExtractiv
Extractiv
 

Recently uploaded

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Tech4Africa - Opportunities around Big Data

  • 1. Big Data. Steve Watt, Technology Strategy @ HP, swatt@hp.com
  • 2. Agenda: Hardware, Software, Data • Big Data • Situational Applications
  • 3. Situational Applications – eaghra (Flickr)
  • 4. Web 2.0 Era Topic Map: Produce, Process, Inexpensive Storage, Data Explosion, LAMP, Social Platforms, Publishing Platforms, Situational Applications, Web 2.0, Mashups, Enterprise, SOA
  • 5. 5
  • 6. Big Data 6 – blmiers2 (Flickr)
  • 7. The data just keeps growing… 1024 GIGABYTES = 1 TERABYTE; 1024 TERABYTES = 1 PETABYTE; 1024 PETABYTES = 1 EXABYTE. 1 PETABYTE: 13.3 years of HD video. 20 PETABYTES: amount of data processed by Google daily. 5 EXABYTES: all words ever spoken by humanity.
  • 8. Mobile: App Economy for Devices (App for this, App for that; Set Top Boxes, Tablets, etc.). Sensor Web: An instrumented and monitored world (Multiple Sensors in your pocket; Real-time Data). The Fractured Web: Service Economy (Service for this, Service for that): Facebook, Twitter, LinkedIn, Google, NetFlix, New York Times, eBay, Pandora, PayPal. Opportunity: Web 2.0 Data Exhaust of Historical and Real-time Data. Web 2.0 - Connecting People: API Foundation, Web as a Platform. Web 1.0 - Connecting Machines: Infrastructure.
  • 9. Data Deluge! But filter patterns can help… – Kakadu (Flickr)
  • 11. Filtering Socially Awesome
  • 13. But filter patterns force you down a pre-processed path – M.V. Jantzen (Flickr)
  • 14. What if you could ask your own questions? – wowwzers (Flickr)
  • 15. And go from discovering Something about Everything… – MrB-MMX (Flickr)
  • 16. To discovering Everything about Something?
  • 17. How do we do this? Let’s examine a few techniques for Gathering, Storing, Processing & Delivering Data @ Scale
  • 19. 19
  • 20. 20
  • 22. Storing, Reading and Processing - Apache Hadoop • Cluster technology with a single master that scales out with multiple slaves • It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce • As data is copied onto HDFS, it is split into blocks and replicated to other machines to provide redundancy • A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop master, which in turn distributes the job to each slave in the cluster • Jobs run on data that is on the local disks of the machine they are sent to, ensuring data locality • Node (slave) failures are handled automatically by Hadoop. Hadoop may execute or re-execute a job on any node in the cluster. Want to know more? “Hadoop – The Definitive Guide (2nd Edition)”
  • 23. Delivering Data @ Scale • Structured Data • Low Latency & Random Access • Column Stores (Apache HBase or Apache Cassandra) • faster seeks • better compression • simpler scale out • De-normalized – Data is written as it is intended to be queried. Want to know more? “HBase – The Definitive Guide” & “Cassandra High Performance
  • 24. Storing, Processing & Delivering: Hadoop + NoSQL. Gather: Web Data (Nutch crawl), Log Files (Flume connector), Relational Data (JDBC, e.g. MySQL, via SQOOP connector), all copied into HDFS. Read/Transform (Apache Hadoop): clean and filter data; transform and enrich data; often multiple Hadoop jobs. Low-latency: NoSQL repository, fed through a NoSQL connector, serves application queries through a NoSQL connector/API.
  • 25. Some things to keep in mind… – Kanaka Menehune (Flickr)
  • 26. Some things to keep in mind… • Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing data with many different kinds of readers. Hadoop is really great at this! • However, readers won’t really help you process truly unstructured data such as prose. For that you’re going to have to get handy with Natural Language Processing. But this is really hard. Consider using parsing services & APIs like Open Calais. Want to know more? “Programming Pig” (O’Reilly)
  • 28. Statistical real-time decision making • Capture historical information • Use Machine Learning to build decision-making models (such as Classification, Clustering & Recommendation) • Mesh real-time events (such as sensor data) against models to make automated decisions. Want to know more? “Mahout in Action”
  • 30. 30
  • 31. 31
  • 32. 32
  • 33. 33
  • 34. Making the data STRUCTURED: Retrieving HTML, Prelim Filtering on URL, Company POJO then \t Out
  • 35. Aargh! My viz tool requires zipcodes to plot geospatially!
  • 36. Apache Pig Script to Join on City to get Zip Code and Write the results to Vertica
      ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);
      CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
      CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);
      STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}' USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
  • 39. Total Investments By Zip Code for all Sectors: $1.2 Billion in Boston; $7.3 Billion in San Francisco; $2.9 Billion in Mountain View; $1.7 Billion in Austin
  • 40. Total Investments By Zip Code for Consumer Web: $600 Million in Seattle; $1.2 Billion in Chicago; $1.7 Billion in San Francisco
  • 41. Total Investments By Zip Code for BioTech: $1.3 Billion in Cambridge; $528 Million in Dallas; $1.1 Billion in San Diego
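
The join on slide 36 and the per-zip-code totals on slides 39 to 41 can be sketched in plain Python. This is only an illustration of the logic, not the deck's actual pipeline: the sample rows, company names, and helper functions below are hypothetical, and the column order is assumed to match the schemas declared in the Pig script.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated sample data (State, City, ZipCode).
ZIPCODES_TSV = "CA\tSan Francisco\t94105\nTX\tAustin\t78701\n"

# Hypothetical rows: Company, City, State, Sector, Round, Month, Year, Investor, Amount.
CRUNCHBASE_TSV = (
    "Acme\tSan Francisco\tCA\tweb\ta\t1\t2010\tInvestorX\t5000000\n"
    "Beta\tSan Francisco\tCA\tbiotech\tb\t6\t2010\tInvestorY\t2000000\n"
    "Gamma\tAustin\tTX\tweb\ta\t3\t2011\tInvestorZ\t1000000\n"
    "Delta\tNowhere\tZZ\tweb\ta\t3\t2011\tInvestorZ\t9\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def join_on_city_state(companies, zipcodes):
    """Inner join on (City, State), like the JOIN in the Pig script."""
    index = defaultdict(list)
    for state, city, zipcode in zipcodes:
        index[(city, state)].append(int(zipcode))
    joined = []
    for row in companies:
        company, city, state, amount = row[0], row[1], row[2], int(row[8])
        # Rows with no matching (City, State) are dropped, as in an inner join.
        for zipcode in index.get((city, state), []):
            joined.append((company, city, state, zipcode, amount))
    return joined


def total_by_zip(joined):
    """Sum the investment amounts per zip code."""
    totals = defaultdict(int)
    for _company, _city, _state, zipcode, amount in joined:
        totals[zipcode] += amount
    return dict(totals)


joined_rows = join_on_city_state(read_tsv(CRUNCHBASE_TSV), read_tsv(ZIPCODES_TSV))
totals = total_by_zip(joined_rows)
```

One thing to watch: a real city usually has many zip codes, so an inner join on (City, State) yields one output row per matching zip code, and totals would be inflated unless a single representative zip code per city is chosen first.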

Editor's Notes

  1. What is Big Data? “The challenges, solutions and opportunities around the storage, processing and delivery of data at scale.” Tag Cloud created from a week of Tech4Africa Tweets: an example of trend analysis, which is a popular Big Data Analytics pattern. Goals are to explain the importance and opportunity and tell you how to do it. Hadoop/NoSQL deep dive not covered.
  2. As hardware became increasingly commoditized, the margin & differentiation moved to software; as software becomes increasingly commoditized, the margin & differentiation is moving to data. 2000 - Cloud is an IT Sourcing Alternative (Virtualization extends into Cloud). Explosion of Unstructured Data. Mobile. “Let’s create a context in which to think…” Focused on 3 major tipping points in the evolution of the technology. Mention that this is a very web-centric view, contrasted with Barry Devlin’s Enterprise view. Assumes Networking falls under Hardware & Cloud is at the intersection of Software and Data. Why should you care? Tipping Point 1: Situational Applications. Tipping Point 2: Big Data. Tipping Point 3: Reasoning.
  3. Web 2.0 (Information Explosion; now many channels, turning consumers into producers (Shirky); tipping point: Web Standards allow Rapid Application Development; advent of Situational Applications, Folksonomies, Social). SOA (Functionality exposed through open interfaces and open standards; great strides in modularity and re-use whilst reducing complexities around system integration; still need to be a developer to create applications using these service interfaces (WSDL, SOAP, way too complex!). Enter mashups…). Mashups (Place a façade on the service and you have the final step in the evolution of services and service-based applications. Now anyone can build applications (i.e. non-programmers). We’ve taken the entire SOA Library and exposed it to non-programmers. What do I mean? Check out this YouTunes app…). 1st example where we saw arbitrary data/content re-purposed in ways the original authors never intended, e.g. Craigslist/Gumtree homes for sale scraped and placed on a Google map, mashed up with crime statistics. Whole greater than the sum of its parts -> new kinds of information! BUT there are limitations around how much arbitrary data can be scraped and turned into info: usually no pre-processing, and just what can be rendered on a single page. Demo.
  4. http://www.housingmaps.com/
  5. “Every 2 days we create as much data as we did from the dawn of humanity until 2003” – We’ve hit the Petabyte & Exabyte age. What does that mean? Let’s look (next slide).
  6. Mention Enterprise Growth over time, Mobile/Sensor Data, Web 2.0 Data Exhaust, Social Networks. Advances in Analytics – keep your data around for deeper business insights and to avoid Enterprise Amnesia.
  7. How about we summarize a few of the key trends in the Web as we know it today… This diagram shows some of the main trends of what Web 3.0 is about… Netflix accounts for 29.7% of US Traffic. Mention Web 2.0 Summit Points of Control. Having more data leads to better context, which leads to deeper understanding/insight or new discoveries. Refer to Reid Hoffman’s views on what Web 3.0 is.
  8. Pre-processed though, not flexible, you can’t ask specific questions that have not been pre-processed
  9. Mention folksonomies in Web 2.0 with searching Delicious Bookmarks. Mention Chilean Earthquake Crisis Video using Twitter to do Crisis Mapping.
  10. Talk about Visualizations and InfoGraphics – manual and a lot of work
  11. They are only part of the solution & don’t allow you to ask your own questions
  12. This is the real promise of Big Data
  13. These are not all the problems around Big Data. These are the bigger problems around deriving new information out of web data. There are other issues as well, like inconsistency, skew, etc.
  14. Give a Nutch example
  15. Specifically call out the color coding reasoning for Map/Reduce and HDFS as a single distributed service
  16. Give examples of how one might use Open Calais or Entity Extraction libraries
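
The Map/Reduce model referred to in the notes and on slide 22 can be illustrated with a single-process word count. This is a minimal sketch of the programming model only: it does not use the Hadoop API, and the function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical. On a real cluster the map and reduce steps would run in parallel on the nodes that hold the data blocks.

```python
from collections import defaultdict


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_words(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


word_counts = run_job(["big data big", "data at scale"])
```

Because each map call looks at only one document and each reduce call at only one key, the framework is free to schedule them on any node and to re-run them after a failure.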