1




 Big Data
 the next frontier

Leonid Zhukov
Professor, Higher School of Economics
RVC Seminar
Moscow, 08/02/2013
2
Big data




+ Graph of term popularity




                              www.visibletechologies.com
3
McKinsey, May 2011




                     www.mckinsey.com
4
Headlines




            Data driven business

            Data democratization

            Data scientists
5
The White House



+ $200M initiative
+ NSF: core techniques
+ NIH: 1000 genomes
+ DOE: advanced computing
+ DOD: data to decisions
+ USGS: Earth system


                            www.whitehouse.gov
6
Gartner Hype Cycle




                     www.gartner.com
7
 Market Forecast




+ Market forecasts:
  + IDC: 2015 - $16.9B
  + Gartner: 2016 - $55B
+ Venture money invested (Reuters):
  + 2009 - $1.1B
  + 2010 - $1.53B
  + 2011 - $2.47B
                                                      www.wikibon.com
8
Big Data Revenue 2012




 + Big Business:
    +   IBM
    +   HP
    +   Oracle
    +   Teradata
    +   EMC
                        www.wikibon.com
9
Big Data Vendors!




    + Hadoop:
      + Cloudera
      + MapR Technologies
      + HortonWorks
                             www.wikibon.com
10
Forrester Wave




                 www.forrester.com
11
What is big data




+ Big data:
  + “Data you can’t process by traditional tools”
  + “A phenomenon defined by the rapid acceleration in the
     expanding volume of high velocity, complex and diverse
     types of data.”

  + “Refers to a collection of tools, techniques and technologies
     for working with data productively, at any scale.”
12
What is Big data

 + 3V
    + Volume: petabytes (1000TB) to exabytes (1000PB)
    + Variety: structured, semi-structured, unstructured
    + Velocity: Tb/s data streams
 + Requires distributed processing
 + Big data = storage + processing
 + Big data = Hadoop (but not only Hadoop)
13
Big data Glossary


+ Hadoop, MapReduce, Hive, Pig, Cascading,
  HBase, Hypertable, Cassandra, Flume, Sqoop,
  MongoDB, Voldemort, Storm, Kafka, Drill, Dremel,
  Impala, ZooKeeper, Ambari, Oozie, YARN, Redis,
  Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R,
  Mahout, Weka
14
How big is Big?

+ Google
  + 24 PB data processed daily
+ Twitter
  + 340 million daily tweets
  + 1.6 billion search queries
  + 7 TB added daily
+ Facebook
  + 750 million users
  + 12 TB of new content daily
  + 2.7 billion “likes” and comments daily
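
To put the daily figures above on a per-second scale, here is a rough back-of-the-envelope conversion. It assumes the load is spread evenly over 24 hours (real traffic is not) and uses decimal units (1 TB = 10^12 bytes), which the slide does not specify.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(daily_total):
    """Average rate per second for a quantity reported per day."""
    return daily_total / SECONDS_PER_DAY

# Decimal units are an assumption; the slide does not state a convention.
TB = 10**12
PB = 10**15

daily = {
    "Google bytes processed": 24 * PB,
    "Twitter tweets": 340e6,
    "Twitter bytes added": 7 * TB,
    "Facebook likes and comments": 2.7e9,
}

for name, total in daily.items():
    print(f"{name}: {per_second(total):,.0f} per second")
```

Under these assumptions, 340 million tweets a day averages out to roughly 3,900 tweets per second, and 7 TB a day to roughly 81 MB per second.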
15
Sources of Big Data




                      www.ibm.com
16
Supercomputing


+ National Labs, Universities, Military
+ Processing power, flops, MPI
+ Parallel computing:
   + Cray, IBM SP, SGI
   + Beowulf cluster (Linux commodity)
17
New realities


+ Yahoo, AltaVista, Inktomi, Google
+ Consumer web companies:
   + web search (crawling, indexing)
   + advertising
   + email services
   + ecommerce


   + Commodity hardware
18
Google




+ 2003: Google File System (GFS) paper
+ 2004: MapReduce paper
19
GFS/HDFS

+ Distributed, replicated data blocks (64 MB)
+ Master-slave architecture (NameNode, DataNodes)
+ Not a general-purpose file system
+ Access via command-line utilities and API
+ Files can’t be modified after they are written
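
The block and replication idea can be sketched in a few lines. This is a toy model, not HDFS code: the replication factor of 3, the round-robin placement, and all function and node names are illustrative assumptions (real HDFS placement is rack-aware).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide
REPLICATION = 3                # assumed replication factor, not stated on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the list of block lengths for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy name-node table: block index -> list of distinct data nodes.

    Round-robin placement is an illustrative assumption, not the HDFS policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    table = {}
    for b in range(num_blocks):
        table[b] = [data_nodes[(b + r) % len(data_nodes)] for r in range(replication)]
    return table

blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block_id, nodes in layout.items():
    print(block_id, blocks[block_id], nodes)
```

The master only keeps the small block-to-node table; the data nodes hold the bytes, so losing one node still leaves other copies of each block.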
20
  MapReduce


+ MapReduce programming model:
  + functional programming
  + like UNIX pipeline
+ Master-slave architecture
  + Master: divide, schedule, monitor work
  + Slave: actual processing
+ Scalable:
  + no file IO
  + no networking
  + no synchronization
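
A minimal single-process sketch of the programming model, using the classic word-count example. The function names are hypothetical and the `shuffle` step stands in for the grouping that a real framework performs across machines.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)

def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["big data", "Big Data big ideas"])
print(counts)
```

The user-written parts are only `map_phase` and `reduce_phase`. Both are pure functions of their inputs, which is what lets a master hand them to many workers without the user writing any file, network, or synchronization code.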
21
 Data movement




+ store and process data on the same nodes
+ bring code to data, data “locality”
                                             www.cloudera.com
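
"Bring code to data" can be illustrated with a toy scheduler that prefers a node already holding the block. The greedy policy and every name here are assumptions made for illustration; this is not how the Hadoop scheduler is implemented.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores the block.

    block_locations: dict block_id -> list of nodes holding a replica
    free_slots: dict node -> number of tasks it can still accept
    Returns dict block_id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy
    assignment = {}
    for block_id, holders in block_locations.items():
        local = next((n for n in holders if slots.get(n, 0) > 0), None)
        if local is not None:
            node, is_local = local, True
        else:
            # No replica holder is free: use any free node, so the data must travel.
            node = next((n for n, s in slots.items() if s > 0), None)
            if node is None:
                raise RuntimeError("no free slots left")
            is_local = False
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment

plan = assign_tasks(
    {"b0": ["dn1", "dn2"], "b1": ["dn1"], "b2": ["dn1"]},
    {"dn1": 1, "dn2": 1, "dn3": 1},
)
print(plan)
```

In this run only `b0` is processed locally. The greedy choice is deliberately simple: sending `b0` to `dn2` would have left `dn1` free for `b1`, which is the kind of trade-off a real scheduler has to weigh.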
22
Hadoop
+ Doug Cutting
  + Search indexer - Lucene
  + Web crawler - Nutch
  + Hadoop
     + HDFS
     + MapReduce
23
Yahoo!
+ 40,000 servers
+ 170PB storage
+ 1000+ active users
+ 5M+ monthly jobs
+ email spam filters
+ categorization, personalization
+ computational advertising
24
NoSQL Database Revolution
+ Needed:
   + fast read/write time
   + high concurrency
   + easy horizontal scaling
+ Flat data structure
+ Sacrificed:
   + DB Schema
   + SQL
   + Transactions
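
The trade-off above can be sketched as a toy sharded key-value store: keys are hashed onto independent shards, values carry no schema, and there is no SQL or cross-key transaction. The class and its hash-modulo placement are illustrative assumptions, not the design of any particular product.

```python
import hashlib

class ShardedKeyValueStore:
    """Toy key-value store that spreads keys over several shards by hashing."""

    def __init__(self, num_shards):
        # Each dict stands in for a separate server.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)

store = ShardedKeyValueStore(num_shards=4)
store.put("user:42", {"name": "Ann", "likes": 7})  # values need not share a schema
store.put("user:43", {"name": "Bob"})
print(store.get("user:42"))
print([len(shard) for shard in store.shards])
```

Each read or write touches exactly one shard, so adding servers adds capacity. The price is that anything spanning several keys (joins, multi-row transactions) has no natural home. Plain modulo placement also moves most keys when the shard count changes, which is why Dynamo-style stores use consistent hashing instead.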
25
NoSQL World

+ Key-value: Dynamo, Voldemort, Redis, Riak
+ Column (tabular): HBase, Hypertable, Cassandra
+ Document store: CouchDB, MongoDB
+ Graph: Neo4J, FlockDB
+ 120+ products (2012)
26
Hadoop stack




               www.hortonworks.com
27
Hadoop tools

+ Pig
  + high-level scripting language (Pig Latin)
  + converts to MapReduce jobs
+ Hive
  + SQL-like queries on data in HDFS
  + converts to MapReduce jobs
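
To show what "converts to MapReduce jobs" means, here is a hand-written illustration of how a simple filter-and-aggregate query maps onto a mapper and a reducer. It is not the plan Hive or Pig actually generates; the table, columns, and function names are hypothetical.

```python
from collections import defaultdict

# Query being illustrated (SQL-like; table and column names are hypothetical):
#   SELECT country, SUM(amount) FROM sales WHERE amount > 10 GROUP BY country;

def mapper(row):
    """WHERE + projection: emit (group key, value) for rows that pass the filter."""
    if row["amount"] > 10:
        yield row["country"], row["amount"]

def reducer(key, values):
    """Aggregate: SUM over all values that share a group key."""
    return key, sum(values)

def run_query(rows):
    groups = defaultdict(list)  # stands in for the shuffle/sort step
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

sales = [
    {"country": "RU", "amount": 30},
    {"country": "US", "amount": 5},
    {"country": "RU", "amount": 20},
    {"country": "US", "amount": 40},
]
result = run_query(sales)
print(result)
```

The `WHERE` clause and column selection land in the map step, `GROUP BY` becomes the shuffle key, and the aggregate function becomes the reducer.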
28

Hadoop data movement




                       www.cloudera.com
29
Typical hadoop usage
 +   Text mining
 +   Pattern recognition
 +   Recommendation systems (collaborative filtering)
 +   Prediction models
 +   Risk assessment
 +   Sentiment analysis
 +   Customer churn prediction
 +   Customer segmentation
 +   Point of Sale Transaction analysis
 +   Data “sandbox”
30

Application fields

+ Science: sensors, genome, weather, satellite,
   imaging

+ Engineering: log analytics, status feeds, network
   messages, spam filters

+ Product: financial, pharmaceutical, insurance,
   energy, retail, ecommerce, healthcare, telecom

+ Business: analytics, BI
31
Business analytics



+ Analytic
+ Operational




        Capture, analyze, learn from data
                                            www.datasciencecentral.com
32
Who uses Hadoop?




                   www.cloudera.com
33
Why Hadoop?




              www.thinkbiganalytics.com
34
Cloudera




+ Enterprise support for Apache Hadoop
+ Founded 2008, funding $141M
+ Employees: 230
+ Products:
  + CDH 4 (Cloudera distribution of Hadoop)
  + Impala
  + Consulting and training
                                           www.cloudera.com
35
MapR




+ Founded 2009, funding $20M
+ MapR Technologies is engineering game-changing
  MapReduce-related technologies

+ Products:
  + M3,M5,M7
  + NFS access, no single point of failure
  + NOT open source!
                                             www.mapr.com
36
HortonWorks




+ Founded 2011
+ Yahoo spin-off
+ Products:
  + HDP distribution
  + tools

                       www.hortonworks.com
37
Hadoop Ecosystem




                   www.datameer.com
38
Big Data Landscape




                     www.bigdatalandscape.com
39
Splunk




+ Founded 2003, raised $230M, IPO 2012, market cap $3.35B
+ Machine logs analysis, operational intelligence
+ Collecting, searching, monitoring




                                                            www.splunk.com
40
Datameer




+ Founded 2009, funding $17.8M

+ Big data:
  + Data integration
  + Data Analytics
  + Data Visualization
                         www.datameer.com
41
Datasift




+ Founded 2010, funding $29.7M
+ Data platform for social web
+ Aggregate and filter data



                                 www.datasift.com
42
Infochimps




+ Founded 2009, funding $5.5M
+ Transitioned from data marketplace to big data platform
+ End-to-end big data solution, real time




                                                        www.infochimps.com
43
Tableau software




+ Founded 2003, funding $15M
+ Big data analytics
+ Big data visualization

                               www.tableau.com
44
Big data startups 2012

+ Platfora, in memory BI on Hadoop
+ Sumologic, log file analysis
+ Hadapt, Hadoop + RDBMS
+ Metamarkets, patterns in data flow
+ DataStax, consulting, training
+ Karmasphere, BI, analytics on Hadoop
45
Big data startups 2013!


+ 10gen, MongoDB
+ ClearStory, big data aggregation + analytics
+ Continuuity, Hadoop API
+ Parstream, database analytics
+ Zoomdata, data visualization
+ Climate corporation, predictive analytics
46
Big data by industry




                       www.gartner.com
47
Big data Processing

                     Batch processing    Interactive                Stream

Query time           minutes to hours    milliseconds to seconds    continuous

Data volume          TB to PB            GB to PB                   continuous

Programming model    MapReduce           Queries                    DAG

Users                Developers          Analysts                   Developers

Open source          Hadoop MapReduce    Drill, Impala              Storm, Kafka
48
New technologies

+ Real-time querying
  + Drill (based on Google Dremel)
  + Impala (Cloudera)


+ Data stream processing
  + Storm (Twitter), real time analytics
  + Kafka (LinkedIn), messaging system
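
Stream processing differs from batch in that each event is handled as it arrives and results are always current. A minimal sketch in plain Python, counting hashtags over a sliding time window; it does not use the Storm or Kafka APIs, and the class name and sample events are made up for illustration.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Count events per key over the last `window` seconds of a stream."""

    def __init__(self, window):
        self.window = window
        self.events = deque()   # (timestamp, key), oldest first
        self.counts = Counter()

    def add(self, timestamp, key):
        """Process one event as it arrives; timestamps must be non-decreasing."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window and undo their counts.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n=1):
        return self.counts.most_common(n)

counter = SlidingWindowCounter(window=60)
stream = [(0, "#bigdata"), (10, "#hadoop"), (30, "#bigdata"),
          (40, "#bigdata"), (65, "#storm")]
for ts, tag in stream:
    counter.add(ts, tag)
print(counter.top(1))
```

The state kept in memory is bounded by the window, not by the total stream, which is what allows the "continuous" query time and data volume in the comparison above.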
49
Machine learning

 + Predictive analytics
 + Pattern discovery
 + Data mining
 + Tools:
    + Mahout
    + R
50
Big data revolution

+ Google: GFS, MapReduce, BigTable
+ Yahoo: Hadoop
+ Amazon: DynamoDB
+ Facebook: Cassandra, HBase
+ Twitter: FlockDB, Storm
+ LinkedIn: Voldemort, Kafka
51
Observations

+ Game changing technologies come from big companies
+ Open Source (!)
+ Start-up ecosystem
+ Less general, more specialized
+ Next step: big data analytics and visualization
52
Data scientist

+ Machine Learning
+ Data Mining
+ Statistics
+ Software Engineering
+ Hadoop/MapReduce/HBase/Hive/Pig
+ Java, Python, C/C++, SQL

“By 2018, the United States alone could face a shortage of 140,000 to 190,000
people with deep analytical skills as well as 1.5 million managers and analysts with
the know-how to use the analysis of big data to make effective decisions.”
53
Big Data Products Mind Map




                    www.garycrawford.co.uk
54
Contacts


+ Leonid Zhukov, Ph.D.
+ School of Applied Mathematics and Information Science
   Higher School of Economics, NRU-HSE

+ lzhukov@hse.ru
+ www.leonidzhukov.ru

What's New in Teams Calling, Meetings and Devices March 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Business of Big Data

  • 1. 1 Big Data the next frontier RVC Seminar Leonid Zhukov Moscow, 08/02/2013 Professor Higher School of Economics
  • 2. 2 Big data + Graph of terms popularity www.visibletechologies.com
  • 3. 3 McKinsey, May 2011 www.mckinsey.com
  • 4. 4 Headlines Data driven business Data democratization Data scientists
  • 5. 5 The White House + $200M initiative + NSF: core techniques + NIH: 1000 genomes + DOE: advanced computing + DOD: data to decisions + USGS: Earth system www.whitehouse.gov
  • 6. 6 Gartner Hype Cycle www.gartner.com
  • 7. 7 Market Forecast + Market forecasts: + IDC: 2015 - $16.9B + Gartner: 2016 - $55B + Venture money invested (Reuters): + 2009 - $1.1B + 2010 - $1.53B + 2011 - $2.47B www.wikibon.com
  • 8. 8 Big Data Revenue 2012 + Big Business: + IBM + HP + Oracle + Teradata + EMC www.wikibon.com
  • 9. 9 Big Data Vendors! + Hadoop: + Cloudera + MapR Technologies + HortonWorks www.wikibon.com
  • 10. 10 Forrester Wave www.forrester.com
  • 11. 11 What is big data + Big data: + “Data you can’t process by traditional tools” + “A phenomenon defined by the rapid acceleration in the expanding volume of high velocity, complex and diverse types of data.” + “Refers to a collection of tools, techniques and technologies for working with data productively, at any scale.”
  • 12. 12 What is Big data + 3V + Volume: petabytes (1000TB) to exabytes (1000PB) + Variety: structured, semi-structured, unstructured + Velocity: Tb/s data streams + Requires distributed processing + Big data = storage + processing + Big data = Hadoop (not only)
  • 13. 13 Big data Glossary + Hadoop, MapReduce, Hive, Pig, Cascading, HBase, Hypertable, Cassandra, Flume, Sqoop, Mongo, Voldemort, Storm, Kafka, Drill, Dremel, Impala, Zookeeper, Ambari, Oozie, Yarn, Redis, Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R, Mahout, Weka
  • 14. 14 How big is Big? + Google + 24 PB data processed daily + Twitter + 340 mln daily tweets + 1.6 bln search queries + 7 TB added daily + Facebook + 750 mln users + 12 TB daily content + 2.7 bln “likes” and comments daily
  • 15. 15 Sources of Big Data www.ibm.com
  • 16. 16 Supercomputing + National Labs, Universities, Military + Processing power, flops, MPI + Parallel computing: + Cray, IBM SP, SGI + Beowulf cluster (Linux commodity)
  • 17. 17 New realities + Yahoo, AltaVista, Inktomi, Google + Consumer web companies: + web search (crawling, indexing) + advertising + email services + ecommerce + Commodity hardware
  • 19. 19 GFS/HDFS + Distributed replicated data blocks (64 MB) + Master-slave architecture (Name Node, Data Nodes) + Not a general file system + Access via command line utils and API + Can’t modify files after they are written
  • 20. 20 MapReduce + Scalable: + no file IO + no networking + no synchronization + Master-slave architecture + Master: divide, schedule, monitor work + Slave: actual processing + MapReduce programming model: + functional programming + like UNIX pipeline
  • 21. 21  Data movement + store and process data on the same nodes + bring code to data, data “locality” www.cloudera.com
  • 22. 22 Hadoop + Doug Cutting + Search indexer - Lucene + Web crawler - Nutch + Hadoop + HDFS + MapReduce
  • 23. 23 Yahoo! + 40,000 servers + 170PB storage + 1000+ active users + 5M+ monthly jobs + email spam filters + categorization, personalization + computational advertising
  • 24. 24 Database NoSQL Revolution + Needed: + fast read/write time + high concurrency + easy horizontal scalability + Flat data structure + Sacrificed: + DB Schema + SQL + Transactions
  • 25. 25 NoSQL World + Key-value: Dynamo, Voldemort, Redis, Riak + Column (tabular): HBase, Hypertable, Cassandra + Document store: CouchDB, MongoDB + Graph: Neo4J, FlockDB + 120+ products (2012)
  • 26. 26 Hadoop stack www.hortonworks.com
  • 27. 27 Hadoop tools + Pig + high level scripting language (PigLatin) + converts to MapReduce jobs + Hive + SQL-like queries on data in HDFS + converts to MapReduce jobs
  • 28. 28 Hadoop data movement www.cloudera.com
  • 29. 29 Typical hadoop usage + Text mining + Pattern recognition + Recommendation systems (collaborative filtering) + Prediction models + Risk assessment + Sentiment analysis + Customer churn prediction + Customer segmentation + Point of Sale Transaction analysis + Data “sandbox”
  • 30. 30 Application fields + Science: sensors, genome, weather, satellite, imaging + Engineering: log analytics, status feeds, network messages, spam filters.. + Product: financial, pharmaceutical, insurance, energy, retail, ecommerce, healthcare, telecom + Business: analytics, BI
  • 31. 31 Business analytics + Analytic + Operational Capture, analyze, learn from data www.datasciencecentral.com
  • 32. 32 Who uses Hadoop? www.cloudera.com
  • 33. 33 Why Hadoop? www.thinkbiganalytics.com
  • 34. 34 Cloudera + Enterprise support for Apache Hadoop + Founded 2008, funding $141M + Employees: 230 + Products: + CDH 4 (Cloudera distribution of Hadoop) + Impala + Consulting and training www.cloudera.com
  • 35. 35 MapR + Founded 2009, funding $20M + MapR Technologies is engineering game-changing Map/Reduce related technologies + Products: + M3, M5, M7 + NFS, no single node failure + NOT open source! www.mapr.com
  • 36. 36 HortonWorks + Founded 2011 + Yahoo spin-off + Products: + HDP distribution + tools www.hortonworks.com
  • 37. 37 Hadoop Ecosystem www.datameer.com
  • 38. 38 Big Data Landscape www.bigdatalandscape.com
  • 39. 39 Splunk + Founded 2003, raised $230M, IPO 2011, Market cap $3.35B + Machine logs analysis, operational intelligence + Collecting, searching, monitoring www.splunk.com
  • 40. 40 Datameer + Founded 2009, funding $17.8M + Big data: + Data integration + Data Analytics + Data Visualization www.datameer.com
  • 41. 41 Datasift + Founded 2010, funding $29.7M + Data platform for social web + Aggregate and filter data www.datasift.com
  • 42. 42 Infochimps + Founded 2009, funding $5.5M + Transitioned from data marketplace to big data platform + End-to-end big data solution, real time www.infochimps.com
  • 43. 43 Tableau software + Founded 2003, funding $15M + Big data analytics + Big data visualization www.tableau.com
  • 44. 44 Big data Startups 2012 + Platfora, in memory BI on Hadoop + Sumologic, log file analysis + Hadapt, Hadoop+RDBMS + Metamarkets, patterns in data flow + DataStax, consulting, training + Karmasphere, BI, analytics on Hadoop
  • 45. 45 Big data startups 2013! + 10gen, MongoDB + ClearStory, big data aggregation + analytics + Continuuity, Hadoop API + Parstream, database analytics + Zoomdata, data visualization + Climate corporation, predictive analytics
  • 46. 46 Big data by industry www.gartner.com
  • 47. 47 Big data Processing
                           Batch               Interactive               Stream processing
      Query time           minutes to hours    milliseconds to seconds   continuous
      Data volume          TB to PB            GB to PB                  continuous
      Programming model    MapReduce           Queries                   DAG
      Users                Developers          Analysts                  Developers
      Open Source          Hadoop MapReduce    Drill, Impala             Storm, Kafka
  • 48. 48 New technologies + Real time querying + Drill (based on Google Dremel) + Impala (Cloudera) + Data stream processing + Storm (Twitter), real time analytics + Kafka (LinkedIn), messaging system
  • 49. 49 Machine learning + Predictive analytics + Pattern discovery + Data mining + Tools: + Mahout + R
  • 50. 50 Big data revolution + Google: GFS, MapReduce, BigTable + Yahoo: Hadoop + Amazon: DynamoDB + Facebook: Cassandra, HBase + Twitter: FlockDB, Storm + LinkedIn: Voldemort, Kafka
  • 51. 51 Observations + Game changing technologies come from big companies + Open Source (!) + Start-up ecosystem + Less general, more specialized + Next step: big data analytics and visualization
  • 52. 52 Data scientist + Machine Learning + Data Mining + Statistics + Software Engineering + Hadoop/MapReduce/HBase/Hive/Pig + Java, Python, C/C++, SQL “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
  • 53. 53 Big Data Products MindMap www.garycrawford.co.uk
  • 54. 54 Contacts + Leonid Zhukov, Ph.D. + School of Applied Mathematics and Information Science Higher School of Economics, NRU-HSE + lzhukov@hse.ru + www.leonidzhukov.ru
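
Slide 20 describes MapReduce as a functional programming model that behaves like a UNIX pipeline. The idea can be sketched in a single Python process, assuming word counting as the job. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are illustrative only and are not part of Hadoop's API; on a real cluster the map and reduce steps would run on many nodes, next to the data blocks they read.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, the step a framework does for you."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map -> shuffle -> reduce over a list of text chunks."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one data block.
chunks = ["big data is big", "bring code to data", "store data on nodes"]
counts = word_count(chunks)
print(counts["data"], counts["big"])  # prints: 3 2
```

Each phase is a pure function of its input, which is why the work can be divided among slave nodes by the master without the job author writing any networking or synchronization code.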