SlideShare una empresa de Scribd logo
1 de 37
2013: year of real-time access to Big Data?

              Geoffrey Hendrey
               @geoffhendrey
                @vertascale
Agenda

•   Motivation
•   Hadoop stack & data formats
•   File access times and mechanics
•   Key-based indexing systems (HBase)
•   MapReduce, Hive/Pig
•   MPP approaches & alternatives
Motivation
• Big Data is more opaque than small data
  – Spreadsheets choke
  – BI tools can’t scale
  – Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
  – Faster “time to answer” on Big Data
  – Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
Survey or real-time capabilities

• Real-time, in-situ, self-service is the
  “Holy Grail” for the business analyst
• spectrum of real-time capabilities exists
  on Hadoop

 Available in Hadoop                   Proprietary

           HDFS        HBase   Drill
 Easy                                        Hard
Hadoop Stack Review
Real-time spectrum on Hadoop
Use Case                                      Support       Real-time

Seek to a particular byte in a distributed    HDFS          YES
file

Seek to a particular value in a distributed   HBase         YES
file, by key (1-dimensional indexing)

Answer complex questions expressible in       MapReduce     NO
code (e.g. matching users to music            (Hive, Pig)
albums). Data science.

Ad-hoc query for scattered records given MPP                YES
simple constraints (“field*4+==“music” && Architectures
field*9+==“dvd”)
Hadoop Underpinned By HDFS
•   Hadoop Distributed File System (HDFS)
•   inspired by Google FileSystem (GFS)
•   underpins every piece of data in “Hadoop”
•   Hadoop FileSystem API is pluggable
•   HDFS can be replaced with other suitable
    distributed filesystem
    – S3
    – kosmos
    – etc
Amazon S3
Distributed File System
Distributed File System Mechanics
HDFS performance characteristics
• HDFS was designed for high throughput, not low
  seek latency
• best-case configurations have shown HDFS to
  perform 92K/s random reads
  [http://hadoopblog.blogspot.com/]

• Personal experience: HDFS very robust. Fault
  tolerance is “real”. I’ve unplugged machines
  and never lost data.
Typical Hadoop File Formats
MapFile for real-time access?
  – Index file must be loaded by client (slow)
  – Index file must fit in RAM of client by default
  – scan an average of 50% of the sampling
    interval
  – Large records make scanning intolerable
  – not a viable “real world” solution for random
    access
Apache HBase

• Clone of Google’s Big Table.
• Key-based access mechanism
• Designed to hold billions of rows
• “Tables” stored in HDFS
• Supports MapReduce over tables, into
  tables
• Requires you to think hard, and commit
  to a key design.
HBase Architecture
HBase random read performance
http://hstack.org/hbase-performance-testing/
• 7 servers, each with
   • 8 cores
   • 32GB DDR3 and
   • 24 x 146GB SAS 2.0 10K RPM disks.
• Hbase table
   • 3 billion records,
   • 6600 regions.
   • data size is between 128-256 bytes per row,
     spread in 1 to 5 columns.
Zoomed-in “Get” time histogram




      http://hstack.org/hbase-performance-testing/
MapReduce
• “MapReduce is a framework for processing
  parallelizable problems across huge datasets
  using a large number of computers”-wikipedia
• MapReduce is strongly tied to HDFS in Hadoop.
• Systems built on HDFS (i.e. HBase) leverage this
  common foundation for integration with the MR
  paradigm
MapReduce and Data Science
• Many complex algorithms can be expressed in
  the MapReduce paradigm
  – NLP
  – Graph processing
  – Image codecs
• The more complex the algorithm, the more Map
  and Reduce processes become complex
  programs in their own right.
• Often cascade multiple MR jobs in succession
A very bad* diagram




*this diagram makes it appear that data flows through the master node.
A better picture
Is MapReduce real-time?
• MapReduce on Hadoop has certain latencies
  that are hard to improve
  – Copy
  – Shuffle, sort
  – Iterate
• time-dependent on the both the size of the
  input data and the number of processors
  available
• In a nutshell, it’s a “batch process” and isn’t
  “real-time”
Hive and Pig
• Run on top of MapReduce
• Provide “Table” metaphor familiar to SQL users
• Provide SQL-like (or actually same) syntax
• Store a “schema” in a database, mapping tables
  to HDFS files
• Translate “queries” to MapReduce jobs
• No more real-time than MapReduce
MPP Architectures
• Massively Parallel Processing
• Lots of machines, so also lots of memory
Examples:
• Spark – general purpose data science framework
  sort of like real-time MapReduce for data
  science
• Dremel – columnar approach, geared toward
  answering SQL-like aggregations and BI-style
  questions
Spark

• Originally designed for iterative machine
  learning problems at Berkeley
• MapReduce does not do a great job on iterative
  workloads
• Spark makes more explicit use of memory
  caches than Hadoop
• Spark can load data from any Hadoop input
  source
Effect of Memory Caching in Spark
Is Spark Real-time?
• If data fits in memory, execution time for most
  algorithms still depends on
  – amount of data to be processed
  – number of processors
• So, it still “depends”
• …but definitely more focused on fast time-to-
  answer
• Interactive scala and java shells
Dremel MPP architecture
• MPP architecture for ad-hoc query on nested
  data
• Apache Drill is an OS clone of Dremel
• Dremel originally developed at Google
• Features “in situ” data analysis
• “Dremel is not intended as a replacement for
  MR and is often used in conjunction with it to
  analyze outputs of MR pipelines or rapidly
  prototype larger computations.” -Dremel:
  Interactive Analysis of WebScaleDatasets
In Situ Analysis

• Moving Big Data is a nightmare
• In situ: ability to access data in
  place
  – In HDFS
  – In Big Table
Uses For Dremel At Google
•    Analysis of crawled web documents.
•    Tracking install data for applications on Android
    Market.
•    Crash reporting for Google products.
•    OCR results from Google Books.
•    Spam analysis.
•    Debugging of map tiles on Google Maps.
•    Tablet migrations in managed Bigtable instances.
•    Results of tests run on Google’s distributed build
    system.
•   Etc, etc.
Why so many uses for Dremel?
• On any Big Data problem or application, dev
  team faces these problems:
  – “I don’t know what I don’t know” about data
  – Debugging often requires finding and correlating
    specific needles in the haystack
  – Support and marketing often require segmentation
    analysis (identify and characterize wide swaths of
    data)
• Every developer/analyst wants
  – Faster time to answer
  – Fewer trips around the mulberry bush
Column Oriented Approach
Dremel MPP query execution tree
Is Dremel real-time?
Alternative approaches?
• Both MapReduce and MPP query architectures
  take “throw hardware at the problem”
  approach.
• Alternatives?
  – Use MapReduce to build distributed indexes on data
  – Combine columnar storage and inverted indexes to
    create columnar inverted indexes
  – Aim for the sweet spot for data scientist and
    engineer: Ad-hoc queries with results returned in
    seconds on a single processing node.
Contact Info
               Email:
               geoff@vertascale.com

               Twitter:
               @geoffhendrey
               @vertascale

               www:
               http://vertascale.com
references
• http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-
  the-elephant/
• http://bradhedlund.com/2011/09/10/understanding-hadoop-
  clusters-and-the-network/
• http://yoyoclouds.wordpress.com/tag/hdfs/
• http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
• http://hadoopblog.blogspot.com/
• http://lunarium.info/arc/images/Hbase.gif
• http://www.zdnet.com/i/story/60/01/073451/zdnet-amazon-
  s3_growth_2012_q1_1.png
• http://hstack.org/hbase-performance-testing/
• http://en.wikipedia.org/wiki/File:Mapreduce_Overview.svg
• http://www.rabidgremlin.com/data20/MapReduceWordCountOver
  view1.png
• Dremel: Interactive Analysis of WebScale Datasets

Más contenido relacionado

La actualidad más candente

Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceDerek Chen
 
HBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at XiaomiHBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at XiaomiMichael Stack
 
Capacity Planning For Your Growing MongoDB Cluster
Capacity Planning For Your Growing MongoDB ClusterCapacity Planning For Your Growing MongoDB Cluster
Capacity Planning For Your Growing MongoDB ClusterMongoDB
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop IntroductionSNEHAL MASNE
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5Samuel Rash
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practicesHadoop User Group
 
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at HuaweiHBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at HuaweiMichael Stack
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practicelarsgeorge
 
HBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardHBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardMatthew Blair
 
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...Michael Stack
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS  Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS Dr Neelesh Jain
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
 
Ravi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS ArchitectureRavi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS ArchitectureRavi namboori
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 

La actualidad más candente (20)

Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
HBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at XiaomiHBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at Xiaomi
 
Capacity Planning For Your Growing MongoDB Cluster
Capacity Planning For Your Growing MongoDB ClusterCapacity Planning For Your Growing MongoDB Cluster
Capacity Planning For Your Growing MongoDB Cluster
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practices
 
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at HuaweiHBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
Hadoop
Hadoop Hadoop
Hadoop
 
HBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardHBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ Flipboard
 
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS  Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Ravi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS ArchitectureRavi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS Architecture
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 

Destacado

ACADGILD:: HADOOP LESSON - File formats in apache hive
ACADGILD:: HADOOP LESSON - File formats in apache hiveACADGILD:: HADOOP LESSON - File formats in apache hive
ACADGILD:: HADOOP LESSON - File formats in apache hivePadma shree. T
 
Complex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real TimeComplex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real TimeNati Shalom
 
Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing MeetupReal Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing MeetupCaserta
 
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafkaconfluent
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafkaconfluent
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache KafkaIntroduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache Kafkaconfluent
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitSpark Summit
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
Data Storage Formats in Hadoop
Data Storage Formats in HadoopData Storage Formats in Hadoop
Data Storage Formats in HadoopBotond Balázs
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web ServiceSpark Summit
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Productionconfluent
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsAnju Singh
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBconfluent
 
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Spark Summit
 

Destacado (20)

Apache avro
Apache avroApache avro
Apache avro
 
ACADGILD:: HADOOP LESSON - File formats in apache hive
ACADGILD:: HADOOP LESSON - File formats in apache hiveACADGILD:: HADOOP LESSON - File formats in apache hive
ACADGILD:: HADOOP LESSON - File formats in apache hive
 
Complex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real TimeComplex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real Time
 
Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing MeetupReal Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
 
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafka
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafka
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache KafkaIntroduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache Kafka
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Data Storage Formats in Hadoop
Data Storage Formats in HadoopData Storage Formats in Hadoop
Data Storage Formats in Hadoop
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDB
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 

Similar a 2013 year of real-time hadoop

Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015 clairvoyantllc
 
Big data, Hadoop, NoSQL DB - introduction
Big data, Hadoop, NoSQL DB - introductionBig data, Hadoop, NoSQL DB - introduction
Big data, Hadoop, NoSQL DB - introductionkvaderlipa
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoopyaevents
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceCsaba Toth
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2tcloudcomputing-tw
 

Similar a 2013 year of real-time hadoop (20)

Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Big Data and Hadoop Training in Chandigarh
Big Data and Hadoop Training in ChandigarhBig Data and Hadoop Training in Chandigarh
Big Data and Hadoop Training in Chandigarh
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Big data, Hadoop, NoSQL DB - introduction
Big data, Hadoop, NoSQL DB - introductionBig data, Hadoop, NoSQL DB - introduction
Big data, Hadoop, NoSQL DB - introduction
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 

Último

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Último (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

2013 year of real-time hadoop

  • 1. 2013: year of real-time access to Big Data? Geoffrey Hendrey @geoffhendrey @vertascale
  • 2. Agenda • Motivation • Hadoop stack & data formats • File access times and mechanics • Key-based indexing systems (HBase) • MapReduce, Hive/Pig • MPP approaches & alternatives
  • 3. Motivation • Big Data is more opaque than small data – Spreadsheets choke – BI tools can’t scale – Small samples often fail to replicate issues • Engineers, data scientists, analysts need: – Faster “time to answer” on Big Data – Rapid “find, quantify, extract” • Solve “I don’t know what I don’t know”
  • 4. Survey or real-time capabilities • Real-time, in-situ, self-service is the “Holy Grail” for the business analyst • spectrum of real-time capabilities exists on Hadoop Available in Hadoop Proprietary HDFS HBase Drill Easy Hard
  • 6. Real-time spectrum on Hadoop Use Case Support Real-time Seek to a particular byte in a distributed HDFS YES file Seek to a particular value in a distributed HBase YES file, by key (1-dimensional indexing) Answer complex questions expressible in MapReduce NO code (e.g. matching users to music (Hive, Pig) albums). Data science. Ad-hoc query for scattered records given MPP YES simple constraints (“field*4+==“music” && Architectures field*9+==“dvd”)
  • 7. Hadoop Underpinned By HDFS • Hadoop Distributed File System (HDFS) • inspired by Google FileSystem (GFS) • underpins every piece of data in “Hadoop” • Hadoop FileSystem API is pluggable • HDFS can be replaced with other suitable distributed filesystem – S3 – kosmos – etc
  • 11. HDFS performance characteristics • HDFS was designed for high throughput, not low seek latency • best-case configurations have shown HDFS to perform 92K/s random reads [http://hadoopblog.blogspot.com/] • Personal experience: HDFS very robust. Fault tolerance is “real”. I’ve unplugged machines and never lost data.
  • 13. MapFile for real-time access? – Index file must be loaded by client (slow) – Index file must fit in RAM of client by default – scan an average of 50% of the sampling interval – Large records make scanning intolerable – not a viable “real world” solution for random access
  • 14. Apache HBase • Clone of Google’s Big Table. • Key-based access mechanism • Designed to hold billions of rows • “Tables” stored in HDFS • Supports MapReduce over tables, into tables • Requires you to think hard, and commit to a key design.
  • 16. HBase random read performance http://hstack.org/hbase-performance-testing/ • 7 servers, each with • 8 cores • 32GB DDR3 and • 24 x 146GB SAS 2.0 10K RPM disks. • Hbase table • 3 billion records, • 6600 regions. • data size is between 128-256 bytes per row, spread in 1 to 5 columns.
  • 17. Zoomed-in “Get” time histogram http://hstack.org/hbase-performance-testing/
  • 18. MapReduce • “MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers”-wikipedia • MapReduce is strongly tied to HDFS in Hadoop. • Systems built on HDFS (i.e. HBase) leverage this common foundation for integration with the MR paradigm
  • 19. MapReduce and Data Science • Many complex algorithms can be expressed in the MapReduce paradigm – NLP – Graph processing – Image codecs • The more complex the algorithm, the more Map and Reduce processes become complex programs in their own right. • Often cascade multiple MR jobs in succession
  • 20. A very bad* diagram *this diagram makes it appear that data flows through the master node.
  • 22. Is MapReduce real-time? • MapReduce on Hadoop has certain latencies that are hard to improve – Copy – Shuffle, sort – Iterate • time-dependent on the both the size of the input data and the number of processors available • In a nutshell, it’s a “batch process” and isn’t “real-time”
  • 23. Hive and Pig • Run on top of MapReduce • Provide “Table” metaphor familiar to SQL users • Provide SQL-like (or actually same) syntax • Store a “schema” in a database, mapping tables to HDFS files • Translate “queries” to MapReduce jobs • No more real-time than MapReduce
  • 24. MPP Architectures • Massively Parallel Processing • Lots of machines, so also lots of memory Examples: • Spark – general purpose data science framework sort of like real-time MapReduce for data science • Dremel – columnar approach, geared toward answering SQL-like aggregations and BI-style questions
  • 25. Spark • Originally designed for iterative machine learning problems at Berkeley • MapReduce does not do a great job on iterative workloads • Spark makes more explicit use of memory caches than Hadoop • Spark can load data from any Hadoop input source
  • 26. Effect of Memory Caching in Spark
  • 27. Is Spark Real-time? • If data fits in memory, execution time for most algorithms still depends on – amount of data to be processed – number of processors • So, it still “depends” • …but definitely more focused on fast time-to- answer • Interactive scala and java shells
  • 28. Dremel MPP architecture • MPP architecture for ad-hoc query on nested data • Apache Drill is an OS clone of Dremel • Dremel originally developed at Google • Features “in situ” data analysis • “Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations.” -Dremel: Interactive Analysis of WebScaleDatasets
  • 29. In Situ Analysis • Moving Big Data is a nightmare • In situ: ability to access data in place – In HDFS – In Big Table
  • 30. Uses For Dremel At Google • Analysis of crawled web documents. • Tracking install data for applications on Android Market. • Crash reporting for Google products. • OCR results from Google Books. • Spam analysis. • Debugging of map tiles on Google Maps. • Tablet migrations in managed Bigtable instances. • Results of tests run on Google’s distributed build system. • Etc, etc.
  • 31. Why so many uses for Dremel? • On any Big Data problem or application, dev team faces these problems: – “I don’t know what I don’t know” about data – Debugging often requires finding and correlating specific needles in the haystack – Support and marketing often require segmentation analysis (identify and characterize wide swaths of data) • Every developer/analyst wants – Faster time to answer – Fewer trips around the mulberry bush
  • 33. Dremel MPP query execution tree
  • 35. Alternative approaches? • Both MapReduce and MPP query architectures take “throw hardware at the problem” approach. • Alternatives? – Use MapReduce to build distributed indexes on data – Combine columnar storage and inverted indexes to create columnar inverted indexes – Aim for the sweet spot for data scientist and engineer: Ad-hoc queries with results returned in seconds on a single processing node.
  • 36. Contact Info Email: geoff@vertascale.com Twitter: @geoffhendrey @vertascale www: http://vertascale.com
  • 37. references • http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of- the-elephant/ • http://bradhedlund.com/2011/09/10/understanding-hadoop- clusters-and-the-network/ • http://yoyoclouds.wordpress.com/tag/hdfs/ • http://blog.cloudera.com/blog/2009/02/the-small-files-problem/ • http://hadoopblog.blogspot.com/ • http://lunarium.info/arc/images/Hbase.gif • http://www.zdnet.com/i/story/60/01/073451/zdnet-amazon- s3_growth_2012_q1_1.png • http://hstack.org/hbase-performance-testing/ • http://en.wikipedia.org/wiki/File:Mapreduce_Overview.svg • http://www.rabidgremlin.com/data20/MapReduceWordCountOver view1.png • Dremel: Interactive Analysis of WebScale Datasets