Business intelligence (BI) is a technology-driven process for analyzing data and presenting actionable
information to help corporate executives, business managers and other end users make more informed
business decisions.
BIG DATA MARKET
Big Data Is a Big Market and Big Business: $50 Billion Market by 2017
Big Data refers not only to the data itself but also to a set of
technologies that capture, store, manage and analyze large
and variable collections of data to solve complex problems.
Big Data Mind Map
Hadoop
Apache Hadoop® is an open source framework for distributed storage and processing of large
sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from
massive amounts of structured and unstructured data.
Hadoop is not a database:
Hadoop is an efficient distributed file
system and not a database. It is designed
specifically for information that comes in
many forms, such as social media, server
log files or other documents. Anything
that can be stored as a file can be placed
in a Hadoop repository.
● MapReduce is the processing part of Hadoop, and the MapReduce server on a
typical machine is called a TaskTracker.
● HDFS is the data part of Hadoop, and the HDFS server on a typical machine is called a
DataNode.
A Multi-node Hadoop Cluster Architecture
HDFS Architecture
Hive – A Data Warehouse
on top of Hadoop
Most of us might have already heard of the
history of Hadoop and how Hadoop is being
used in more and more organizations today
for batch processing of large sets of data.
There are many components in the Hadoop
eco-system, each serving a definite purpose.
For example, HDFS is the storage layer and a file
system, and MapReduce is a programming
model and an execution framework.
Hive is another component in the
Hadoop stack; it was built by Facebook
and contributed back to the community.
Hadoop architecture involves many different components
Components of Hadoop
Core Hadoop:
HDFS:
HDFS stands for Hadoop Distributed File System. It manages big data sets with high volume, velocity and variety. HDFS
implements a master/slave architecture: the master is the NameNode and the slaves are DataNodes.
Features:
• Scalable
• Reliable
• Runs on commodity hardware
HDFS is well known for Big Data storage.
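As a rough illustration of the master/slave split, the toy sketch below models the kind of map a NameNode keeps: which DataNodes hold each block of a file. The 128 MB block size, the replication factor of 3, the node names and the round-robin placement are all assumptions made for this example; real HDFS placement is more involved.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size (128 MB)
REPLICATION = 3                  # assumed replication factor


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a NameNode-style map: block index -> DataNodes holding a replica.

    Toy round-robin placement, not the real HDFS policy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(num_blocks):
        # Each replica of a block goes to a different DataNode.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]             # hypothetical DataNode names
    layout = plan_blocks(300 * 1024 * 1024, nodes)   # a 300 MB file
    for block, replicas in layout.items():
        print(f"block {block}: {replicas}")
```

Losing any single DataNode in this layout still leaves two replicas of every block, which is the idea behind the "Reliable" feature.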
MapReduce:
MapReduce is a programming model designed to process high-volume distributed data. The platform is built using Java for better
exception handling. MapReduce includes two daemons, JobTracker and TaskTracker.
Features:
• Functional programming.
• Works very well on Big Data.
• Can process large datasets.
MapReduce is the main component known for processing big data.
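The programming model can be sketched in a few lines of plain Python. This is a single-process word count that only mimics the map, shuffle and reduce phases; it does not use Hadoop, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, shuffle by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is data"]))
```

On a cluster, the map calls would run in parallel on the machines that hold the input, and the shuffle would move data between machines.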
YARN:
YARN stands for Yet Another Resource Negotiator. It is also called MapReduce 2 (MRv2). The two major functions of the
JobTracker in MRv1, resource management and job scheduling/monitoring, are split into separate daemons:
ResourceManager, NodeManager and ApplicationMaster.
Features:
• Better resource management.
• Scalability
• Dynamic allocation of cluster resources.
Data Access:
Pig:
Apache Pig is a high-level language built on top of MapReduce for analyzing large datasets with simple ad hoc data analysis
programs. Pig is also known as a data flow language. It integrates well with Python. It was initially developed at Yahoo.
Salient features of Pig:
• Ease of programming
• Optimization opportunities
• Extensibility.
Pig scripts are converted internally into MapReduce programs.
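To show what "data flow" means here, the sketch below chains filter, group and aggregate steps in plain Python. It is an analogy for the style of a Pig script, not Pig Latin itself, and the records and function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical input records: (user, page, seconds_on_page)
visits = [
    ("ann", "home", 12),
    ("bob", "home", 3),
    ("ann", "cart", 40),
    ("cat", "cart", 25),
]


def filter_rows(rows, predicate):
    """Filter step: keep only rows matching the predicate."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key_index):
    """Group step: collect rows that share the same key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def total_seconds(groups):
    """Aggregate step: produce one output value per group."""
    return {key: sum(row[2] for row in rows) for key, rows in groups.items()}


def run_pipeline(rows):
    """Each step consumes the previous step's output, like statements in a data flow script."""
    long_visits = filter_rows(rows, lambda row: row[2] >= 10)
    by_page = group_by(long_visits, key_index=1)
    return total_seconds(by_page)


if __name__ == "__main__":
    print(run_pipeline(visits))
```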
Hive:
Apache Hive is another high-level query language and data warehouse infrastructure built on top of Hadoop for providing data
summarization, query and analysis. It was initially developed by Facebook and made open source.
Salient features of Hive:
• SQL-like query language called HiveQL (HQL).
• Partitioning and bucketing for faster data processing.
• Integration with visualization tools like Tableau.
Hive queries are converted internally into MapReduce programs.
If you want to become a big data analyst, these two high-level languages are a must-know!
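The partitioning and bucketing feature listed for Hive can be pictured with a small Python model. Rows are grouped first by a partition column and then by a hash of a bucketing column, so a query that filters on the partition column touches only one group. The table, the column names, the bucket count of 4 and the use of CRC32 as the hash are assumptions for this sketch; Hive uses its own hash function and stores partitions as directories.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count

# Hypothetical rows of a sales table.
sales = [
    {"country": "IN", "user_id": 101, "amount": 20},
    {"country": "US", "user_id": 102, "amount": 35},
    {"country": "IN", "user_id": 103, "amount": 15},
    {"country": "UK", "user_id": 101, "amount": 50},
]


def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Pick a bucket by hashing the bucketing column (CRC32 stands in for Hive's hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def build_layout(rows, partition_col, bucket_col):
    """Group rows by partition value, then by bucket number within each partition."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row[partition_col]][bucket_for(row[bucket_col])].append(row)
    return table


def scan_partition(table, partition_value):
    """Read only the rows of one partition, skipping all others."""
    buckets = table.get(partition_value, {})
    return [row for bucket in buckets.values() for row in bucket]


if __name__ == "__main__":
    table = build_layout(sales, "country", "user_id")
    print(scan_partition(table, "IN"))
```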
Data Storage:
HBase:
Apache HBase is a NoSQL database built for hosting large tables with billions of rows and millions of columns on top of Hadoop
commodity hardware machines. Use Apache HBase when you need random, real-time read/write access to your Big Data.
Features:
• Strictly consistent reads and writes. In-memory operations.
• Easy-to-use Java API for client access.
• Well integrated with Pig, Hive and Sqoop.
• Is a consistent and partition-tolerant (CP) system under the CAP theorem.
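Random read/write access by row key can be sketched with a small in-memory class. It is a toy stand-in written for this example, not the HBase client API. It assumes the usual HBase-style model in which a cell is addressed by row key and "family:qualifier" column, each cell keeps timestamped versions, and rows are scanned in sorted key order.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase-style table (illustration only)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        """Write one version of a cell."""
        self._rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Return the newest value for a cell, or None if it is absent."""
        versions = self._rows.get(row_key, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key


if __name__ == "__main__":
    table = ToyTable()
    table.put("user#001", "info:name", "Ann", timestamp=1)
    table.put("user#001", "info:name", "Anne", timestamp=2)
    table.put("user#002", "info:name", "Bob", timestamp=1)
    print(table.get("user#001", "info:name"))
    print(list(table.scan("user#001", "user#010")))
```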
Cassandra:
Cassandra is a NoSQL database designed for linear scalability and high availability. Its wide-column data model builds on a key-value model.
It was developed by Facebook and is known for fast responses to queries.
Features:
• Column indexes
• Support for de-normalization
• Materialized views
• Powerful built-in caching.
Interaction, Visualization, Execution, Development:
HCatalog:
HCatalog is a table management layer which provides integration of Hive metadata for other Hadoop applications. It enables
users with different data processing tools like Apache Pig, Apache MapReduce and Apache Hive to more easily read and write
data.
Features:
• Tabular view for different formats.
• Notifications of data availability.
• REST APIs for external systems to access metadata.
Lucene:
Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology
suitable for nearly any application that requires full-text search, especially cross-platform.
Features:
• Scalable, high-performance indexing.
• Powerful, accurate and efficient search algorithms.
• Cross-platform solution.
Hama:
Apache Hama is a distributed framework based on Bulk Synchronous Parallel (BSP) computing. It is well known for
massive scientific computations such as matrix, graph and network algorithms.
Features:
• Simple programming model
• Well suited for iterative algorithms
• YARN supported
• Collaborative filtering (unsupervised machine learning).
• K-Means clustering.
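The BSP model behind Hama proceeds in supersteps: every peer computes locally, sends messages, and then waits at a barrier before the next superstep begins. The single-process simulation below is a minimal sketch of that loop, assuming peers arranged in a ring that spread the largest value among themselves. The function names and the ring topology are choices made for this example and are not part of Hama's API.

```python
def bsp_run(peers_state, compute, max_supersteps=10):
    """Run supersteps: local compute -> message exchange -> barrier.

    compute(peer_id, state, inbox, num_peers) returns (new_state, outgoing),
    where outgoing is a list of (target_peer_id, message) pairs.
    """
    num_peers = len(peers_state)
    inboxes = [[] for _ in range(num_peers)]
    for _ in range(max_supersteps):
        next_inboxes = [[] for _ in range(num_peers)]
        sent_any = False
        for peer_id in range(num_peers):
            # 1. Local computation on this peer's own state and inbox.
            state, outgoing = compute(
                peer_id, peers_state[peer_id], inboxes[peer_id], num_peers
            )
            peers_state[peer_id] = state
            # 2. Communication: messages become visible only in the next superstep.
            for target, message in outgoing:
                next_inboxes[target].append(message)
                sent_any = True
        # 3. Barrier: all peers have finished before the next superstep starts.
        inboxes = next_inboxes
        if not sent_any:
            break
    return peers_state


def propagate_max(peer_id, state, inbox, num_peers):
    """Keep the largest value seen; forward it to the next peer when it changes."""
    best = max([state["value"]] + inbox)
    changed = best != state["value"] or not state["started"]
    new_state = {"value": best, "started": True}
    outgoing = [((peer_id + 1) % num_peers, best)] if changed else []
    return new_state, outgoing


if __name__ == "__main__":
    peers = [{"value": v, "started": False} for v in [3, 9, 4, 1]]
    print([peer["value"] for peer in bsp_run(peers, propagate_max)])
```

The repeated compute-exchange-synchronize loop is why the model suits iterative algorithms.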
Crunch:
Apache Crunch is built for writing simple and efficient MapReduce pipelines. The framework is used for writing,
testing and running MapReduce pipelines.
Features:
• Developer focused.
• Minimal abstractions
• Flexible data model.
Data Intelligence:
Drill:
Apache Drill is a low-latency SQL query engine for Hadoop and NoSQL.
Features:
• Agility
• Flexibility
• Familiarity.
Mahout:
Apache Mahout is a scalable machine learning library designed for building predictive analytics on Big Data. Mahout now has
implementations on Apache Spark for faster in-memory computing.
Features:
• Collaborative filtering.
• Classification
• Clustering
• Dimensionality reduction
Data Serialization:
Avro:
Apache Avro is a language-neutral data serialization framework. It is designed for language portability, allowing data to
potentially outlive the language used to read and write it.
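Language neutrality comes from describing the data with a schema instead of with code. Avro schemas are written as JSON, so the sketch below parses one with Python's standard json module and checks a record against it by hand. The record and field names are made up, and the checker is a toy written for this example; it does not use the Avro library and covers only two primitive types.

```python
import json

# An Avro-style record schema written as JSON (record and field names are made up).
SCHEMA_JSON = """
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "user", "type": "string"},
    {"name": "page", "type": "string"},
    {"name": "seconds", "type": "int"}
  ]
}
"""

# Toy mapping from schema type names to Python types; not the Avro library.
PYTHON_TYPES = {"string": str, "int": int}


def conforms(record, schema):
    """Check that a dict has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    return all(
        isinstance(record[field["name"]], PYTHON_TYPES[field["type"]])
        for field in fields
    )


if __name__ == "__main__":
    schema = json.loads(SCHEMA_JSON)
    print(conforms({"user": "ann", "page": "home", "seconds": 12}, schema))
    print(conforms({"user": "ann", "page": "home"}, schema))
```

Any language that can parse the JSON schema can apply the same check, which is the point of keeping the schema separate from the code.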
Thrift:
Thrift is an interface definition language for building interfaces to technologies built on Hadoop. It is used to define and create
services for numerous languages.
Data Integration:
Apache Sqoop:
Apache Sqoop is a tool designed for bulk data transfers between relational databases and Hadoop.
Features:
• Import and export to and from HDFS.
• Import and export to and from Hive.
• Import and export to HBase.
Apache Flume:
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Features:
• Robust
• Fault tolerant
• Simple and flexible Architecture based on streaming data flows.
Apache Chukwa:
Chukwa is a scalable log collector used for monitoring large distributed systems.
Features:
• Scales to thousands of nodes.
• Reliable delivery.
• Can store data indefinitely.
Management, Monitoring and Orchestration:
Apache Ambari:
Ambari is designed to make Hadoop management simpler by providing an interface for provisioning, managing and monitoring
Apache Hadoop clusters.
Features:
• Provision a Hadoop Cluster.
• Manage a Hadoop Cluster.
• Monitor a Hadoop Cluster.
Apache ZooKeeper:
ZooKeeper is a centralized service designed for maintaining configuration information, naming, providing distributed
synchronization, and providing group services.
Features:
• Serialization
• Atomicity
• Reliability
• Simple API
Apache Oozie:
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Features:
• Scalable, reliable and extensible system.
• Supports several types of Hadoop jobs such as MapReduce, Hive, Pig and Sqoop.
• Simple and easy to use.
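A workflow scheduler has to run each job only after the jobs it depends on have finished. The sketch below shows that ordering problem with Python's standard graphlib module. The job names and dependencies are hypothetical, and the dictionary is only a stand-in for a workflow definition; it is not Oozie's own format.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each job lists the jobs it depends on.
workflow = {
    "sqoop_import": [],
    "pig_clean": ["sqoop_import"],
    "hive_aggregate": ["pig_clean"],
    "mapreduce_index": ["pig_clean"],
    "sqoop_export": ["hive_aggregate", "mapreduce_index"],
}


def execution_order(jobs):
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(jobs).static_order())


if __name__ == "__main__":
    for step, job in enumerate(execution_order(workflow), start=1):
        print(step, job)
```

The two jobs that depend only on pig_clean have no ordering between them, so a scheduler would be free to run them in parallel.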
Krishnaiah Dasari
Email: dkrishna.hadoop@gmail.com

BIGDATA ppts
