A Big Picture for Developers
 Over 1 terabyte of data is generated by the NYSE every day.
 More than 144.8 billion email messages are sent a day.
 Twitter users send more than 340 million tweets a day.
 Facebook users share more than 684,000 pieces of content a day.
 We upload 72 hours of new video to YouTube a minute.
 We spend $272,000 on Web shopping a day.
 Google receives over 2 million search queries a minute.
 Apple receives around 47,000 app downloads a minute.
 Brands receive more than 34,000 Facebook ‘likes’ a minute.
 Tumblr blog owners publish 27,000 new posts a minute.
 Instagram photographers share 3,600 new photos a minute.
 Flickr photographers upload 3,125 new photos a minute.
 We perform over 2,000 Foursquare check-ins a minute.
 Individuals and organizations launch 571 new websites a minute.
 WordPress bloggers publish close to 350 new blog posts a minute.
[Chart: projected growth of structured, semi-structured (e.g., XML), and unstructured data, 2008–2020]
 Dec 2004: Dean and Ghemawat (Google) publish the MapReduce paper.
 2005: Doug Cutting and Mike Cafarella create Hadoop, at first only to extend Nutch (the name is derived from Doug’s son’s toy elephant).
 2006: Yahoo! runs Hadoop on 5–20 nodes.
 March 2008: Cloudera founded.
 July 2008: Hadoop wins the TeraByte Sort benchmark (the first time a Java program won this competition).
 April 2009: Amazon introduces “Elastic MapReduce” as a service on S3/EC2.
 June 2011: Hortonworks founded.
 27 Dec 2011: Apache Hadoop release 1.0.0.
 June 2012: Facebook claims the “biggest Hadoop cluster”, totalling more than 100 petabytes in HDFS.
 2013: Yahoo! runs Hadoop on 42,000 nodes, computing about 500,000 MapReduce jobs per day.
 15 Oct 2013: Apache Hadoop release 2.2.0 (YARN).
Google calls it:    Hadoop equivalent:
MapReduce           Hadoop MapReduce
GFS                 HDFS
Bigtable            HBase
Chubby              ZooKeeper
 Single namespace for the entire cluster
 Data coherency
– Write-once-read-many access model
– Clients can only append to existing files
 Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
 Intelligent client
– The client can find the location of blocks
– The client accesses data directly from the DataNode
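The block layout described above can be sketched in a few lines of Python; `split_into_blocks` is a hypothetical helper for illustration, not part of HDFS:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical HDFS block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs describing how a file of the
    given size would be broken into fixed-size blocks."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file needs three blocks: 128 MB, 128 MB, and a 44 MB tail.
print(len(split_into_blocks(300 * 1024 * 1024)))  # 3
```

Each of these blocks would then be replicated on multiple DataNodes, and a client reading the file fetches each block directly from one of its replicas.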
 Metadata in memory
– The entire metadata is held in main memory
– No demand paging of metadata
 Types of metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g., creation time, replication factor
 A transaction log
– Records file creations, file deletions, etc.
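The metadata kinds listed above can be modeled as a toy in-memory structure; the `NameNode` class below is an illustrative sketch under simplified assumptions, not the real implementation:

```python
class NameNode:
    """Toy in-memory model of the NameNode's metadata."""
    def __init__(self):
        self.files = {}            # path -> list of block ids
        self.block_locations = {}  # block id -> list of DataNode names
        self.attributes = {}       # path -> file attributes
        self.log = []              # transaction log of namespace changes

    def create_file(self, path, block_ids, replication=3):
        self.files[path] = list(block_ids)
        self.attributes[path] = {"replication": replication}
        self.log.append(("create", path))

    def delete_file(self, path):
        # Drop the file's blocks and attributes, then log the deletion.
        for block in self.files.pop(path, []):
            self.block_locations.pop(block, None)
        self.attributes.pop(path, None)
        self.log.append(("delete", path))

nn = NameNode()
nn.create_file("/logs/day1", ["blk_1", "blk_2"])
nn.block_locations["blk_1"] = ["dn1", "dn2", "dn3"]
print(nn.log)  # [('create', '/logs/day1')]
```

Because all of this lives in main memory, lookups are fast and no demand paging of metadata is needed; the transaction log makes namespace changes durable.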
 A block server
– Stores data in the local file system (e.g., ext3)
– Stores metadata of a block (e.g., CRC)
– Serves data and metadata to clients
 Block report
– Periodically sends a report of all existing blocks to the NameNode
 Facilitates pipelining of data
– Forwards data to other specified DataNodes
 Current strategy
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
 Clients read from the nearest replica
 The aim is to make this policy pluggable
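A rough sketch of this placement strategy in Python; `place_replicas` and its rack map are hypothetical names for illustration, and real HDFS placement handles many more constraints:

```python
import random

def place_replicas(local_node, nodes_by_rack, replication=3):
    """Sketch of the default strategy: first replica on the writer's
    node, second on a node in a different rack, third on another node
    in that same remote rack, any extras placed at random."""
    local_rack = next(r for r, nodes in nodes_by_rack.items()
                      if local_node in nodes)
    replicas = [local_node]
    # Pick a rack other than the writer's for the second replica.
    remote_rack = random.choice([r for r in nodes_by_rack if r != local_rack])
    remote_nodes = list(nodes_by_rack[remote_rack])
    second = random.choice(remote_nodes)
    replicas.append(second)
    # Third replica: a different node on the same remote rack, if any.
    third_choices = [n for n in remote_nodes if n != second]
    if third_choices:
        replicas.append(random.choice(third_choices))
    # Additional replicas are placed randomly on unused nodes.
    all_nodes = [n for nodes in nodes_by_rack.values() for n in nodes]
    while len(replicas) < replication:
        candidate = random.choice(all_nodes)
        if candidate not in replicas:
            replicas.append(candidate)
    return replicas[:replication]
```

Spreading the second and third replicas across a remote rack protects against the loss of an entire rack while keeping two of the three replicas on the same rack to limit cross-rack write traffic.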
 Input: This is the input data / file to be processed.
 Split: Hadoop splits the incoming data into smaller pieces called “splits”.
 Map: In this step, MapReduce processes each split according to the logic
defined in the map() function. Each mapper works on one split at a time. Each
mapper is treated as a task, and multiple tasks are executed across different
TaskTrackers and coordinated by the JobTracker.
 Combine: This is an optional step used to improve performance by
reducing the amount of data transferred across the network. The combiner
applies the same logic as the reduce step to aggregate the output of the map()
function before it is passed to the subsequent steps.
 Shuffle & Sort: In this step, the outputs from all the mappers are shuffled,
sorted to put them in order, and grouped before being sent to the next step.
 Reduce: This step aggregates the outputs of the mappers using the
reduce() function. The output of the reducer is sent to the next and final step. Each
reducer is treated as a task, and multiple tasks are executed across different
TaskTrackers and coordinated by the JobTracker.
 Output: Finally, the output of the reduce step is written to a file in HDFS.
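The steps above can be simulated in plain Python with a word-count example; the `map_phase`, `combine`, `shuffle_sort`, and `reduce_phase` helpers are hypothetical stand-ins for what the framework does, not Hadoop APIs:

```python
from collections import defaultdict
from itertools import groupby

def map_phase(split):
    # Map: emit a (word, 1) pair for every word in the split.
    return [(word.lower(), 1) for word in split.split()]

def combine(pairs):
    # Combine: pre-aggregate one mapper's output to cut network traffic.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def shuffle_sort(all_pairs):
    # Shuffle & sort: order by key and group the values for each key.
    all_pairs.sort(key=lambda kv: kv[0])
    return [(k, [v for _, v in grp])
            for k, grp in groupby(all_pairs, key=lambda kv: kv[0])]

def reduce_phase(key, values):
    # Reduce: sum the counts for each word.
    return (key, sum(values))

splits = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for s in splits for pair in combine(map_phase(s))]
result = dict(reduce_phase(k, vs) for k, vs in shuffle_sort(mapped))
print(result["the"])  # 3
```

In a real job each split would run on a different TaskTracker, and the shuffle would move data across the network rather than sort one in-memory list.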
 Sqoop: a tool designed for efficiently
transferring bulk data between Apache
Hadoop and structured datastores such as
relational databases (RDBMS ↔ HDFS).
 Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of
streaming data into the Hadoop Distributed File System
(HDFS). It has a simple and flexible architecture based on
streaming data flows, and it is robust and fault tolerant with
tunable reliability mechanisms for failover and recovery.
 A data collection system for monitoring large distributed
systems. Chukwa is built on top of the Hadoop Distributed File
System (HDFS) and Map/Reduce framework and inherits
Hadoop’s scalability and robustness. Chukwa also includes a
flexible and powerful toolkit for displaying, monitoring and
analyzing results to make the best use of the collected data.
 A platform for processing and analyzing large
data sets. Pig consists of a high-level language
(Pig Latin) for expressing data analysis
programs paired with the MapReduce
framework for processing these programs.
 Built on the MapReduce framework, Hive is a
data warehouse that enables easy data
summarization and ad-hoc queries via an SQL-
like interface for large datasets stored in HDFS.
 A column-oriented NoSQL data storage system
that provides random real-time read/write
access to big data for user applications.
 Built on ideas from Amazon’s Dynamo and Google’s
BigTable, Cassandra is a distributed database for managing
large amounts of structured data across many
commodity servers, while providing highly
available service and no single point of
failure. Cassandra offers capabilities that
relational databases and other NoSQL
databases cannot match.
 Built on top of the Hive metastore and
incorporates components from the Hive DDL.
HCatalog provides read and write interfaces for
Pig and MapReduce and uses Hive’s command
line interface for issuing data definition and
metadata exploration commands. It also
presents a REST interface to allow external
tools access to Hive DDL (Data Definition
Language) operations, such as “create table”
and “describe table”.
 Lucene is a full-text search library in Java which
makes it easy to add search functionality to an
application or website. It does so by adding
content to a full-text index.
 Also available as Lucene.Net
 Apache Hama is a pure BSP (Bulk Synchronous
Parallel) computing framework on top of HDFS
for massive scientific computations such as
matrix, graph and network algorithms.
 Simple and efficient MapReduce pipelines.
The Apache Crunch Java library provides a
framework for writing, testing, and running
MapReduce pipelines. Its goal is to make
pipelines that are composed of many
user-defined functions simple to write, easy to test,
and efficient to run.
 A very popular data serialization format in the
Hadoop technology stack. Avro data can be read
and written by MapReduce jobs in Java,
Hadoop Streaming, Pig, and Hive.
 For scalable cross-language services
development, Thrift combines a software stack with a
code generation engine to build services that
work efficiently and seamlessly between C++,
Java, Python, PHP, Ruby, Erlang, Perl, Haskell,
C#, Cocoa, JavaScript, Node.js, Smalltalk,
OCaml, and Delphi.
 A framework that supports data-intensive
distributed applications for interactive analysis
of large-scale datasets. Drill is an open source
version of Google's Dremel system, which is
available as an infrastructure service called
Google BigQuery.
 A library of machine learning algorithms focused
primarily on the areas of collaborative filtering,
clustering, and classification. Many of the
implementations use the Apache Hadoop
platform.
 A web-based tool for provisioning, managing, and
monitoring Apache Hadoop clusters which includes
support for Hadoop HDFS, Hadoop MapReduce,
Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and
Sqoop.
 Ambari also provides a dashboard for viewing
cluster health (such as heatmaps) and the ability to view
MapReduce, Pig, and Hive applications visually,
along with features to diagnose their performance
characteristics in a user-friendly manner.
 ZooKeeper is a centralized service for
maintaining configuration information, naming,
providing distributed synchronization, and
providing group services. All of these kinds of
services are used in some form or another by
distributed applications. Each time they are
implemented, there is a lot of work that goes
into fixing the bugs and race conditions that are
inevitable.
 Oozie is a workflow scheduler system to manage
Apache Hadoop jobs.
 Oozie Workflow jobs are Directed Acyclic Graphs
(DAGs) of actions.
 Oozie Coordinator jobs are recurrent Oozie
Workflow jobs triggered by time (frequency) and
data availability.
 Oozie is integrated with the rest of the Hadoop
stack, supporting several types of Hadoop jobs out
of the box (such as Java map-reduce, Streaming
map-reduce, Pig, Hive, Sqoop, and DistCp) as well
as system-specific jobs (such as Java programs and
shell scripts).
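A workflow DAG of actions like the one Oozie runs can be sketched as a topological ordering; this Python fragment uses the standard library's `graphlib` and an entirely hypothetical four-action workflow:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: action -> set of actions it depends on.
workflow = {
    "import": set(),           # e.g., a Sqoop import
    "clean": {"import"},       # e.g., a Pig script
    "aggregate": {"clean"},    # e.g., a Hive query
    "export": {"aggregate"},   # e.g., copying results back out
}

def run_workflow(dag):
    """Execute the actions in an order that respects the DAG's edges."""
    order = list(TopologicalSorter(dag).static_order())
    for action in order:
        print(f"running {action}")
    return order

run_workflow(workflow)
```

A real scheduler adds the parts this sketch omits: triggering by time and data availability, retrying failed actions, and running independent branches of the DAG in parallel.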
 SPARK: ideal for in-memory data processing. It allows data
scientists to implement fast, iterative algorithms for
advanced analytics such as clustering and classification of
datasets.
 STORM: a distributed real-time computation system for
processing fast, large streams of data, adding reliable
real-time data processing capabilities to Apache Hadoop 2.x.
 SOLR: a platform for searches of data stored in Hadoop. Solr
enables powerful full-text search and near real-time indexing
on many of the world’s largest Internet sites.
 TEZ: a generalized data-flow programming framework, built
on Hadoop YARN, which provides a powerful and flexible
engine to execute an arbitrary DAG of tasks to process data
for both batch and interactive use cases.
 Need to process multi-petabyte datasets.
 It is expensive to build reliability into each application.
 Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
 Need common infrastructure
– Efficient, reliable, open source (Apache License)
 The above goals are the same as Condor’s, but
 workloads are I/O bound, not CPU bound.
 Hadoop consists of multiple products. We talk about Hadoop
as if it’s one monolithic thing, but it’s actually a family of open
source products and technologies overseen by the Apache
Software Foundation (ASF). (Some Hadoop products are also
available via vendor distributions; more on that later.) The
Apache Hadoop library includes (in BI priority order): the
Hadoop Distributed File System (HDFS), MapReduce, Pig,
Hive, HBase, HCatalog, Ambari, Mahout, Flume, and so on.
You can combine these in various ways, but HDFS and
MapReduce (perhaps with Pig, Hive, and HBase) constitute a
useful technology stack for applications in BI, DW, DI, and
analytics. More Hadoop projects are coming that will apply to
BI/DW, including Impala, which is a much-needed SQL engine
for low-latency data access to HDFS and Hive data.
 Hadoop is open source but available from
vendors, too. Apache Hadoop’s open source
software library is available from ASF at
www.apache.org. For users desiring a more
enterprise-ready package, a few vendors now
offer Hadoop distributions that include
additional administrative tools, maintenance,
and technical support. A handful of vendors
offer their own non-Hadoop-based
implementations of MapReduce.
 Hadoop is an ecosystem, not a single product. In addition to
products from Apache, the extended Hadoop ecosystem
includes a growing list of vendor products (e.g., database
management systems and tools for analytics, reporting, and
DI) that integrate with or expand Hadoop technologies. One
minute on your favorite search engine will reveal these.
Ignorance of Hadoop is still common in the BI and IT
communities. Hadoop comprises multiple products, available
from multiple sources. [1] This section of the report was
originally published as the expert column “Busting 10 Myths
about Hadoop” in TDWI’s BI This Week newsletter, March 20,
2012 (available at tdwi.org). The column has been updated
slightly for use in this report. (TDWI Research: Integrating
Hadoop Into BI/DW)
 HDFS is a file system, not a database management
system (DBMS). Hadoop is primarily a distributed
file system and therefore lacks capabilities we
associate with a DBMS, such as indexing, random
access to data, support for standard SQL, and
query optimization. That’s okay, because HDFS
does things DBMSs do not do as well, such as
managing and processing massive volumes of file-
based, unstructured data. For minimal DBMS
functionality, users can layer HBase over HDFS and
layer a query framework such as Hive or SQL-based
Impala over HDFS or HBase.
 Hive resembles SQL but is not standard SQL.
Many of us are handcuffed to SQL because we
know it well and our tools demand it. People
who know SQL can quickly learn to hand code
Hive, but that doesn’t solve compatibility issues
with SQL-based tools. TDWI believes that over
time, Hadoop products will support standard
SQL and SQL-based vendor tools will support
Hadoop, so this issue will eventually be moot.
 Hadoop and MapReduce are related but don’t
require each other. Some variations of
MapReduce work with a variety of storage
technologies, including HDFS, other file
systems, and some relational DBMSs. Some
users deploy HDFS with Hive or HBase, but not
MapReduce.
 MapReduce provides control for analytics, not
analytics per se. MapReduce is a general-
purpose execution engine that handles the
complexities of network communication,
parallel programming, and fault tolerance for a
wide variety of hand-coded logic and other
applications—not just analytics.
 Hadoop is about data diversity, not just data
volume. Theoretically, HDFS can manage the
storage and access of any data type as long as you
can put the data in a file and copy that file into
HDFS. As outrageously simplistic as that sounds,
it’s largely true, and it’s exactly what brings many
users to Apache HDFS and related Hadoop
products. After all, many types of big data that
require analysis are inherently file based, such as
Web logs, XML files, and personal productivity
documents.
 Hadoop complements a DW; it’s rarely a replacement. Most
organizations have designed their DWs for structured, relational data,
which makes it difficult to wring BI value from unstructured and
semistructured data. Hadoop promises to complement DWs by
handling the multi-structured data types most DWs simply weren’t
designed for. Furthermore, Hadoop can enable certain pieces of a
modern DW architecture, such as massive data staging areas, archives
for detailed source data, and analytic sandboxes. Some early adopters
offload as many workloads as they can to HDFS and other Hadoop
technologies because they are less expensive than the average DW
platform. The result is that DW resources are freed for the workloads
at which they excel. HDFS is not a DBMS. Oddly enough, that’s an
advantage for BI/DW. Hadoop promises to extend DW architecture to
better handle staging, archiving, sandboxes, and unstructured data.
(tdwi.org: Introduction to Hadoop Products and Technologies)
 Hadoop enables many types of analytics, not just Web
analytics. Hadoop gets a lot of press about how Internet
companies use it for analyzing Web logs and other Web
data, but other use cases exist. For example, consider the
big data coming from sensory devices, such as robotics in
manufacturing, RFID in retail, or grid monitoring in
utilities. Older analytic applications that need large data
samples (such as customer base segmentation, fraud
detection, and risk analysis) can benefit from the
additional big data managed by Hadoop. Likewise,
Hadoop’s additional data can expand 360-degree views
to create a more complete and granular view of
customers, financials, partners, and other business
entities.
Hadoop Big Data: A Big Picture
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Hadoop Big Data A big picture

  • 1. A Big picture for Developers
  • 2.  Over 1 terabyte of data is generated by the NYSE every day.  We send more than 144.8 billion email messages a day.  On Twitter, we send more than 340 million tweets a day.  On Facebook, we share more than 684,000 bits of content a day.  We upload 72 hours of new video to YouTube a minute.  We spend $272,000 on Web shopping a day.  Google receives over 2 million search queries a minute.  Apple receives around 47,000 app downloads a minute.  Brands receive more than 34,000 Facebook ‘likes’ a minute.  Tumblr blog owners publish 27,000 new posts a minute.  Instagram photographers share 3,600 new photos a minute.  Flickr photographers upload 3,125 new photos a minute.  We perform over 2,000 Foursquare check-ins a minute.  Individuals and organizations launch 571 new websites a minute.  WordPress bloggers publish close to 350 new blog posts a minute.
  • 3. XML
  • 4. [Chart: projected growth of structured, semi-structured, and unstructured data, 2008–2020]
  • 5.  Dec 2004: Dean/Ghemawat (Google) publish the MapReduce paper  2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (the name is derived from Doug’s son’s toy elephant)  2006: Yahoo runs Hadoop on 5-20 nodes  March 2008: Cloudera founded  July 2008: Hadoop wins the TeraByte sort benchmark (the first time a Java program won this competition)  April 2009: Amazon introduces “Elastic MapReduce” as a service on S3/EC2  June 2011: Hortonworks founded  27 Dec 2011: Apache Hadoop release 1.0.0  June 2012: Facebook claims the “biggest Hadoop cluster,” totaling more than 100 petabytes in HDFS  2013: Yahoo runs Hadoop on 42,000 nodes, computing about 500,000 MapReduce jobs per day  15 Oct 2013: Apache Hadoop release 2.2.0 (YARN)
  • 6. Google calls it → Hadoop equivalent: MapReduce → Hadoop MapReduce; GFS → HDFS; Bigtable → HBase; Chubby → ZooKeeper
  • 7.  Single namespace for the entire cluster  Data coherency – Write-once-read-many access model – Clients can only append to existing files  Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes  Intelligent client – Client can find the location of blocks – Client accesses data directly from the DataNode
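The file layout described above can be sketched in a few lines. This is a toy model, not real HDFS code; the 128 MB block size is the typical default assumed here.

```python
# Sketch (not real HDFS code): how a file is split into fixed-size
# blocks before being replicated across DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the typical HDFS block size
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the (offset, length) of each block for a file."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that the last block is only as large as the remaining data; HDFS does not pad files out to a full block.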
  • 8.
  • 9.  Meta-data in memory – The entire metadata is in main memory – No demand paging of meta-data  Types of metadata – List of files – List of blocks for each file – List of DataNodes for each block – File attributes, e.g. creation time, replication factor  A transaction log – Records file creations, file deletions, etc.
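The metadata layout above can be modeled as a toy sketch: everything lives in memory, and every mutation is first recorded in an append-only transaction log. The class and field names are illustrative only, not the real NameNode's internals.

```python
# Toy model (not the real NameNode): in-memory metadata plus an
# append-only transaction log, as described above.
class ToyNameNode:
    def __init__(self):
        self.files = {}            # file path -> list of block IDs
        self.block_locations = {}  # block ID -> list of DataNode names
        self.edit_log = []         # transaction log of mutations

    def create_file(self, path, block_ids):
        self.edit_log.append(("create", path, list(block_ids)))  # log first
        self.files[path] = list(block_ids)

    def delete_file(self, path):
        self.edit_log.append(("delete", path))
        for blk in self.files.pop(path, []):
            self.block_locations.pop(blk, None)

nn = ToyNameNode()
nn.create_file("/logs/day1", ["blk_1", "blk_2"])
```

Logging before mutating is what lets a restarted NameNode replay the log to rebuild its in-memory state.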
  • 10.  A Block Server – Stores data in the local file system (e.g. ext3) – Stores meta-data of a block (e.g. CRC) – Serves data and meta-data to Clients  Block Report – Periodically sends a report of all existing blocks to the NameNode  Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  • 11.  Current strategy – One replica on the local node – Second replica on a remote rack – Third replica on the same remote rack – Additional replicas are randomly placed  Clients read from the nearest replica  Would like to make this policy pluggable
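The placement strategy above can be sketched as follows. This is a simplified illustration, not Hadoop's actual placement code; the topology dictionary and node names are made up.

```python
# Sketch of the default placement policy described above: first replica
# on the writer's local node, second on a node in a remote rack, third
# on a different node in that same remote rack.
import random

def place_replicas(local_node, topology, rng=random):
    """topology: dict mapping rack name -> list of node names."""
    local_rack = next(r for r, nodes in topology.items() if local_node in nodes)
    remote_racks = [r for r in topology if r != local_rack]
    remote_rack = rng.choice(remote_racks)          # pick one remote rack
    second, third = rng.sample(topology[remote_rack], 2)  # two distinct nodes
    return [local_node, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
replicas = place_replicas("n1", topology)
```

Putting two replicas on one remote rack (rather than three racks) trades a little failure independence for less cross-rack write traffic.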
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.  Input: This is the input data / file to be processed.  Split: Hadoop splits the incoming data into smaller pieces called "splits".  Map: In this step, MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.  Combine: This is an optional step used to improve performance by reducing the amount of data transferred across the network. The combiner is the same as the reduce step and is used for aggregating the output of the map() function before it is passed to the subsequent steps.  Shuffle & Sort: In this step, outputs from all the mappers are shuffled, sorted to put them in order, and grouped before being sent to the next step.  Reduce: This step aggregates the outputs of the mappers using the reduce() function. The output of the reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.  Output: Finally, the output of the reduce step is written to a file in HDFS.
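The pipeline above can be run in-process as a small sketch, using word count as the classic example. Real Hadoop distributes these steps across nodes; here everything runs in one Python process purely to show the data flow.

```python
# In-process sketch of Input -> Split -> Map -> Combine ->
# Shuffle & Sort -> Reduce, as described above.
from collections import defaultdict
from itertools import groupby

def map_fn(split):                       # Map: emit (word, 1) pairs
    for word in split.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):             # Reduce: sum counts per word
    return (word, sum(counts))

def run_job(splits):
    combined = []
    for split in splits:                 # one "mapper task" per split
        local = defaultdict(int)
        for word, one in map_fn(split):
            local[word] += one           # Combine: aggregate locally
        combined.extend(local.items())
    combined.sort(key=lambda kv: kv[0])  # Shuffle & Sort: group by key
    return dict(reduce_fn(k, (v for _, v in grp))
                for k, grp in groupby(combined, key=lambda kv: kv[0]))

result = run_job(["the cat sat", "the dog sat"])
# result == {"cat": 1, "dog": 1, "sat": 2, "the": 2}
```

The combiner step means each "mapper" ships at most one pair per distinct word, which is exactly the network saving the slide describes.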
  • 17.  Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. RDBMS ↔ HDFS
  • 18.  Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.
  • 19.  A data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
  • 20.  A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs paired with the MapReduce framework for processing these programs.
  • 21.  Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL- like interface for large datasets stored in HDFS
  • 22.  A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
  • 23.  Built on ideas from Amazon’s Dynamo and Google’s BigTable, Cassandra is a distributed database for managing large amounts of structured data across many commodity servers, while providing highly available service and no single point of failure. Cassandra offers capabilities that relational databases and other NoSQL databases cannot match.
  • 24.  Built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands. It also presents a REST interface to allow external tools access to Hive DDL (Data Definition Language) operations, such as “create table” and “describe table”.
  • 25.  Lucene is a full-text search library in Java which makes it easy to add search functionality to an application or website. It does so by adding content to a full-text index.  Also available as Lucene.Net
  • 26.  Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS for massive scientific computations such as matrix, graph and network algorithms.
  • 27.  A Simple and Efficient MapReduce Pipelines. The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user- defined functions simple to write, easy to test, and efficient to run.
  • 28.  A very popular data serialization format in the Hadoop technology stack. MapReduce jobs in Java, Hadoop Streaming, Pig, and Hive can read and/or write data in Avro format.
  • 29.  For scalable cross-language services development, Thrift combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, and Delphi.
  • 30.  A framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery.
  • 31.  A library of machine learning algorithms focused primarily on collaborative filtering, clustering, and classification. Many of the implementations use the Apache Hadoop platform.
  • 32.  A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.  Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • 33.  ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable.
  • 34.  Oozie is a workflow scheduler system to manage Apache Hadoop jobs.  Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.  Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.  Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).
  • 35.  SPARK: ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering and classification of datasets.  STORM: a distributed real-time computation system for processing fast, large streams of data, adding reliable real-time data processing capabilities to Apache Hadoop® 2.x.  SOLR: a platform for searches of data stored in Hadoop. Solr enables powerful full-text search and near real-time indexing on many of the world’s largest Internet sites.  TEZ: a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
  • 36.
  • 37.
  • 38.  Need to process multi-petabyte datasets  Expensive to build reliability into each application  Nodes fail every day – Failure is expected, rather than exceptional – The number of nodes in a cluster is not constant  Need common infrastructure – Efficient, reliable, open source (Apache License)  The above goals are the same as Condor’s, but  Workloads are I/O bound, not CPU bound
  • 39.  Hadoop consists of multiple products. We talk about Hadoop as if it’s one monolithic thing, but it’s actually a family of open source products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop products are also available via vendor distributions; more on that later.) The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, HCatalog, Ambari, Mahout, Flume, and so on. You can combine these in various ways, but HDFS and MapReduce (perhaps with Pig, Hive, and HBase) constitute a useful technology stack for applications in BI, DW, DI, and analytics. More Hadoop projects are coming that will apply to BI/DW, including Impala, which is a much-needed SQL engine for low-latency data access to HDFS and Hive data.
  • 40.  Hadoop is open source but available from vendors, too. Apache Hadoop’s open source software library is available from ASF at www.apache.org. For users desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include additional administrative tools, maintenance, and technical support. A handful of vendors offer their own non-Hadoop-based implementations of MapReduce.
  • 41.  Hadoop is an ecosystem, not a single product. In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of vendor products (e.g., database management systems and tools for analytics, reporting, and DI) that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these. Ignorance of Hadoop is still common in the BI and IT communities. Hadoop comprises multiple products, available from multiple sources. This section of the report was originally published as the expert column “Busting 10 Myths about Hadoop” in TDWI’s BI This Week newsletter, March 20, 2012 (available at tdwi.org). The column has been updated slightly for use in this report.
  • 42.  HDFS is a file system, not a database management system (DBMS). Hadoop is primarily a distributed file system and therefore lacks capabilities we associate with a DBMS, such as indexing, random access to data, support for standard SQL, and query optimization. That’s okay, because HDFS does things DBMSs do not do as well, such as managing and processing massive volumes of file-based, unstructured data. For minimal DBMS functionality, users can layer HBase over HDFS and layer a query framework such as Hive or SQL-based Impala over HDFS or HBase.
  • 43.  Hive resembles SQL but is not standard SQL. Many of us are handcuffed to SQL because we know it well and our tools demand it. People who know SQL can quickly learn to hand code Hive, but that doesn’t solve compatibility issues with SQL-based tools. TDWI believes that over time, Hadoop products will support standard SQL and SQL-based vendor tools will support Hadoop, so this issue will eventually be moot.
  • 44.  Hadoop and MapReduce are related but don’t require each other. Some variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some relational DBMSs. Some users deploy HDFS with Hive or HBase, but not MapReduce.
  • 45.  MapReduce provides control for analytics, not analytics per se. MapReduce is a general- purpose execution engine that handles the complexities of network communication, parallel programming, and fault tolerance for a wide variety of hand-coded logic and other applications—not just analytics.
  • 46.  Hadoop is about data diversity, not just data volume. Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS and related Hadoop products. After all, many types of big data that require analysis are inherently file based, such as Web logs, XML files, and personal productivity documents.
  • 47.  Hadoop complements a DW; it’s rarely a replacement. Most organizations have designed their DWs for structured, relational data, which makes it difficult to wring BI value from unstructured and semistructured data. Hadoop promises to complement DWs by handling the multi-structured data types most DWs simply weren’t designed for. Furthermore, Hadoop can enable certain pieces of a modern DW architecture, such as massive data staging areas, archives for detailed source data, and analytic sandboxes. Some early adopters offload as many workloads as they can to HDFS and other Hadoop technologies because they are less expensive than the average DW platform. The result is that DW resources are freed for the workloads at which they excel. HDFS is not a DBMS; oddly enough, that’s an advantage for BI/DW. Hadoop promises to extend DW architecture to better handle staging, archiving, sandboxes, and unstructured data.
  • 48.  Hadoop enables many types of analytics, not just Web analytics. Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other Web data, but other use cases exist. For example, consider the big data coming from sensory devices, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples—such as customer base segmentation, fraud detection, and risk analysis—can benefit from the additional big data managed by Hadoop. Likewise, Hadoop’s additional data can expand 360-degree views to create a more complete and granular view of customers, financials, partners, and other business entities.