Msquare Systems Inc.,
INFORMATION TECHNOLOGY & CONSULTING FIRM

Visit: http://www.msquaresystems.com/
What is Hadoop?

Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain
insight from massive amounts of structured and unstructured data quickly and without significant investment.

Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists
of three main functions: storage, processing and resource management.
Core services on Hadoop
MapReduce:
MapReduce is a framework for writing applications that process large amounts of structured and
unstructured data in parallel across a cluster of machines in a reliable, fault-tolerant manner.

Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of
large data sets on compute clusters of commodity hardware.

The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks.

The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the
reduce tasks. Typically, both the input and the output of the job are stored in a file system.
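
The map, sort and reduce steps described above can be sketched on a single machine. This is only an illustration in plain Python, not the Hadoop API: the function names (map_phase, reduce_phase, run_job) and the word-count task are assumptions chosen for the example, and a real job would run the map and reduce calls on many machines.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(word, counts):
    """Reduce: sum all counts that were emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    # Map every input record independently (on a cluster this runs in parallel).
    mapped = [pair for line in lines for pair in map_phase(line)]
    # Sort the map output by key so that equal keys become adjacent.
    mapped.sort(key=itemgetter(0))
    # Hand each key and its grouped values to one reduce call.
    return dict(
        reduce_phase(word, [count for _, count in group])
        for word, group in groupby(mapped, key=itemgetter(0))
    )


result = run_job(["Hadoop stores data", "Hadoop processes data"])
print(result)
```

The sort between the two phases is what lets each reduce call see all values for one key together.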

HDFS:
• The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and
reliable data storage across large clusters.

• This Apache Software Foundation project provides a fault-tolerant file system designed to run on
commodity hardware.
• The primary objective of HDFS is to store data reliably even in the presence of failures,
including NameNode failures, DataNode failures and network partitions.

• The NameNode manages the file system metadata, and DataNodes store the actual data blocks. Unless
NameNode high availability is configured, the NameNode is a single point of failure for the HDFS cluster.
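
To make the idea of storing data reliably despite DataNode failures concrete, here is a small sketch in plain Python. It is not HDFS code: the helper names, the 4-byte block size, the four node names and the round-robin placement are all assumptions for illustration (real HDFS placement is rack-aware and uses far larger blocks).

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Assumption: round-robin is only a stand-in for the real placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """A file stays readable while every block has at least one live replica."""
    return all(
        any(node not in failed_nodes for node in nodes)
        for nodes in placement.values()
    )


blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
print(readable_after_failure(placement, {"dn1", "dn2"}))         # two nodes lost
print(readable_after_failure(placement, {"dn1", "dn2", "dn3"}))  # three nodes lost
```

With three replicas per block, losing any two nodes still leaves a live copy of every block; losing three can take out all copies of one block.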

Hadoop YARN:

• YARN is a next-generation framework for Hadoop data processing that extends MapReduce capabilities by
supporting non-MapReduce workloads associated with other programming models.
• It is a resource-management platform responsible for managing compute resources in clusters and using
them to schedule users' applications.

• All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of
individual machines, or racks of machines) are common and thus should be automatically handled in
software by the framework.
• The term "Hadoop" is now commonly considered to include a number of related projects as well.
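
The resource-management role described above can be pictured with a toy allocator. This is a sketch in plain Python, not the YARN API: the class name ToyResourceManager, the first-fit policy, and the node and application names are assumptions made for the example.

```python
class ToyResourceManager:
    """Tracks free memory per node and hands out containers first-fit."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)   # node name -> free memory in MB
        self.containers = []               # (application, node, memory_mb)

    def allocate(self, app, memory_mb):
        """Place one container for `app`; return the node name or None."""
        for node in sorted(self.free):
            if self.free[node] >= memory_mb:
                self.free[node] -= memory_mb
                self.containers.append((app, node, memory_mb))
                return node
        return None  # no node has enough free memory right now

    def release_app(self, app):
        """Give back every container held by a finished application."""
        for owner, node, memory_mb in self.containers:
            if owner == app:
                self.free[node] += memory_mb
        self.containers = [c for c in self.containers if c[0] != app]


rm = ToyResourceManager({"node1": 4096, "node2": 2048})
first = rm.allocate("mapreduce-job", 3072)
second = rm.allocate("other-job", 2048)
third = rm.allocate("late-job", 2048)      # cluster is full at this point
rm.release_app("mapreduce-job")
retry = rm.allocate("late-job", 2048)
print(first, second, third, retry)
```

The allocator does not care what kind of application asks for a container, which is the sense in which workloads other than MapReduce can share the same cluster.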

Apache Tez:
• Tez is a generic data-processing pipeline engine envisioned as a low-level engine for higher
abstractions such as Apache Hadoop MapReduce, Apache Pig and Apache Hive.
• It is a data-processing pipeline engine into which one can plug input, processing and output
implementations to perform arbitrary data processing.

• Every "task" in Tez has the following: an Input to consume key/value pairs from, a Processor to
process them, and an Output to collect the processed key/value pairs.
• Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG
(directed acyclic graph) of tasks for near real-time big data processing.
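
The Input, Processor and Output roles and the DAG of tasks can be sketched as follows. This is plain Python (3.9 or later, for graphlib), not the Tez API: run_dag, the three processor functions and the vertex names are hypothetical, and everything runs in one process.

```python
from graphlib import TopologicalSorter


def tokenize(pairs):
    """Processor: turn (line_number, line) pairs into (word, 1) pairs."""
    for _, line in pairs:
        for word in line.split():
            yield word, 1


def count(pairs):
    """Processor: sum the values for each key."""
    totals = {}
    for word, n in pairs:
        totals[word] = totals.get(word, 0) + n
    return sorted(totals.items())


def frequent(pairs):
    """Processor: keep only keys seen at least twice."""
    return [(word, n) for word, n in pairs if n >= 2]


def run_dag(processors, upstream, source_pairs):
    """Run each vertex after its upstream vertices; `upstream` maps name -> set of names."""
    outputs = {}
    for name in TopologicalSorter(upstream).static_order():
        parents = upstream.get(name, set())
        # Input: the collected output of upstream vertices, or the source data.
        if parents:
            pairs = [pair for parent in sorted(parents) for pair in outputs[parent]]
        else:
            pairs = source_pairs
        # Output: collect whatever key/value pairs the processor produces.
        outputs[name] = list(processors[name](pairs))
    return outputs


processors = {"tokenize": tokenize, "count": count, "frequent": frequent}
upstream = {"tokenize": set(), "count": {"tokenize"}, "frequent": {"count"}}
outputs = run_dag(processors, upstream, [(0, "a b a"), (1, "b c a")])
print(outputs["frequent"])
```

A classic MapReduce job is the special case of a two-vertex graph; the point of a DAG engine is that longer or branching pipelines run as one job.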
Hadoop Data Services

Apache Pig:
• It is a high-level procedural language platform developed to simplify querying large data sets in
Apache Hadoop and MapReduce.
• Apache Pig features a “Pig Latin” language layer that enables SQL-like queries to be performed
on distributed datasets within Hadoop applications.
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs.

• The salient property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data sets.

Apache HBase:

• Apache HBase is the Hadoop database: a distributed, scalable big data store.

• HBase began as a sub-project of Apache Hadoop (it is now a top-level Apache project) and is used to
provide real-time read and write access to your big data.

Apache Hive:
Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in
distributed storage.
Hive provides a mechanism to project structure onto this data and query the data using a SQL-like
language called HiveQL.
At the same time, this language also allows traditional map/reduce programmers to plug in their custom
mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Hive is an open source volunteer project under the Apache Software Foundation. Previously it was a
subproject of Apache Hadoop, but it has now graduated to become a top-level project of its own.
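
"Projecting structure onto data" means the raw files stay as they are and a schema is applied when they are read. A minimal sketch of that idea in plain Python follows; it does not use Hive, and the sample rows, the column names and the project function are assumptions for illustration.

```python
import csv
import io

# Raw delimited text as it might sit in distributed storage (hypothetical sample).
RAW = "1,alice,34\n2,bob,27\n3,carol,41\n"

# The schema lives apart from the data: column name plus a type to cast to.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def project(raw_text, schema):
    """Apply the schema at read time, yielding one dict per row."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


# Roughly what a query such as "SELECT name FROM people WHERE age > 30" asks for;
# the table name `people` is made up for this example.
names = [record["name"] for record in project(RAW, SCHEMA) if record["age"] > 30]
print(names)
```

Because the schema is separate from the stored bytes, the same files could be read with a different schema without rewriting them.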

Apache Flume:
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving
large amounts of log data.
It has a simple and flexible architecture based on streaming data flows.

It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms.
It uses a simple extensible data model that allows for online analytic applications. Flume's high-level
architecture is focused on delivering a streamlined codebase that is easy to use and easy to extend.

Apache Mahout:
• Apache Mahout is an Apache project to produce free implementations of distributed or otherwise
scalable machine learning algorithms focused primarily on collaborative filtering, clustering
and classification, often leveraging, but not limited to, the Hadoop platform.
• Mahout's core algorithms for clustering, classification and collaborative filtering are implemented on top of
Apache Hadoop using the map/reduce paradigm.
• Classification learns from existing categorized documents what documents of a specific category look
like and is able to assign unlabelled documents to the (hopefully) correct category.

Apache Accumulo:

• Apache Accumulo is a sorted, distributed key/value store and is at the core of Sqrrl Enterprise.
• It handles large amounts of structured, semi-structured, and unstructured data as a
robust, scalable, and real-time data storage and retrieval system.
• Fine-grained security controls allow organizations to control data at the cell level and promote a
data-centric security model without degrading performance.
• Accumulo can support a wide variety of real-time analytics, including statistics and graph analytics, via
its server-side programming framework called iterators.
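
The combination of sorted keys and cell-level security can be sketched in a few lines. This is plain Python, not the Accumulo client API: SortedCellStore is a hypothetical class, keys are reduced to (row, column), and a cell's visibility is simplified to a set of labels that must all be held by the reader.

```python
import bisect


class SortedCellStore:
    """Keeps cells ordered by (row, column) and filters them by label on read."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column)
        self._cells = {}   # (row, column) -> (value, required labels)

    def put(self, row, column, value, required_labels=()):
        key = (row, column)
        if key not in self._cells:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        self._cells[key] = (value, frozenset(required_labels))

    def scan(self, start_row, end_row, authorizations=()):
        """Yield visible cells with start_row <= row <= end_row, in key order."""
        held = frozenset(authorizations)
        start = bisect.bisect_left(self._keys, (start_row, ""))
        for key in self._keys[start:]:
            if key[0] > end_row:
                break
            value, required = self._cells[key]
            if required <= held:  # reader holds every label the cell requires
                yield key, value


store = SortedCellStore()
store.put("user2", "name", "Bob")
store.put("user1", "ssn", "123-45-6789", required_labels={"pii"})
store.put("user1", "name", "Ann")

public_view = list(store.scan("user1", "user2"))
pii_view = list(store.scan("user1", "user2", authorizations={"pii"}))
print(public_view)
print(pii_view)
```

Two readers scanning the same row range get different results depending on the labels they hold, which is the essence of controlling access per cell instead of per table.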

Apache Storm:
• Storm is a distributed real-time computation system.
• Storm provides a set of general primitives for doing real-time computation.
• Storm is simple, can be used with any programming language, and is a lot of fun to use!

Apache Sqoop:
• Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational
databases and data warehouses – into Hadoop.
• It allows users to specify the target location inside Hadoop and instruct Sqoop to move data from
Oracle, Teradata or other relational databases to the target.

Apache HCatalog:
HCatalog is a table and storage management layer for Hadoop that enables users with different data
processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data
on the grid.
HCatalog is a set of interfaces that open up access to Hive's metastore for tools inside and outside of the
Hadoop grid.
It provides a shared schema and data type mechanism for Hadoop tools.
HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File
System (HDFS) and ensures that users need not worry about where or in what format their data is stored.
Hadoop Operational Services

Apache ZooKeeper:

• ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical
name space of data registers, known as znodes.
• Every znode is identified by a path, with path elements separated by a slash (“/”). Aside from the
root, every znode has a parent, and a znode cannot be deleted if it has children.
• The service is replicated over a set of machines, and each maintains an in-memory image of the data
tree and transaction logs.
• Clients connect to a single ZooKeeper server and maintain a TCP connection through which they send
requests and receive responses.
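
The znode rules above (slash-separated paths, a parent for every non-root node, no deleting a node that has children) are easy to model. The sketch below is an in-memory toy in plain Python, not a ZooKeeper client: the class ZNodeTree, its method names and the use of ValueError are assumptions for illustration.

```python
class ZNodeTree:
    """In-memory model of a hierarchical namespace of znodes."""

    def __init__(self):
        self._data = {"/": b""}  # path -> data stored at that znode

    @staticmethod
    def _parent(path):
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, data=b""):
        if not path.startswith("/") or path == "/" or path.endswith("/"):
            raise ValueError(f"invalid path: {path!r}")
        if path in self._data:
            raise ValueError(f"znode already exists: {path}")
        if self._parent(path) not in self._data:
            raise ValueError(f"parent does not exist: {self._parent(path)}")
        self._data[path] = data

    def children(self, path):
        prefix = path if path == "/" else path + "/"
        return sorted(
            p for p in self._data
            if p != path and p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def delete(self, path):
        if path == "/":
            raise ValueError("the root cannot be deleted")
        if path not in self._data:
            raise ValueError(f"no such znode: {path}")
        if self.children(path):
            raise ValueError(f"znode has children: {path}")
        del self._data[path]


tree = ZNodeTree()
tree.create("/app")
tree.create("/app/lock", b"owner-1")
print(tree.children("/"), tree.children("/app"))
try:
    tree.delete("/app")  # refused: /app still has a child
except ValueError as error:
    print(error)
tree.delete("/app/lock")
tree.delete("/app")      # allowed once the child is gone
```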

Apache Falcon:
• Falcon is a framework for simplifying data management and pipeline processing in Apache Hadoop.
• It enables users to automate the movement and processing of datasets for ingest, pipelines, disaster
recovery and data retention use cases.

• Instead of hard-coding complex dataset and pipeline processing logic, users can now rely on Apache
Falcon for these functions, maximizing reuse and consistency across Hadoop applications.
• Falcon simplifies the development and management of data processing pipelines by introducing a
higher layer of abstraction for users to work with.

Apache Ambari:
• Apache Ambari is a 100-percent open source operational framework for provisioning, managing and
monitoring Apache Hadoop clusters.
• Ambari includes an intuitive collection of operator tools and a robust set of APIs that hide the
complexity of Hadoop, simplifying the operation of clusters.
• Ambari includes an intuitive Web interface that allows you to easily provision, configure and test all
the Hadoop services and core components.
• Ambari provides tools to simplify cluster management. The Web interface allows you to
start/stop/test Hadoop services, change configurations and manage ongoing growth of your cluster.

Apache Knox:

• The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for
Apache™ Hadoop® services in a cluster.
• The goal of the project is to simplify Hadoop security for users who access the cluster data and
execute jobs, and for operators who control access and manage the cluster.
• Knox runs as a server (or cluster of servers) that serves one or more Hadoop clusters.

Apache Oozie:
Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs.
Oozie combines multiple jobs sequentially into one logical unit of work.
It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache
Pig, Apache Hive, and Apache Sqoop.
Apache Oozie allows Hadoop administrators to build complex data transformations out of multiple
component tasks.
Apache Oozie helps administrators derive more value from their Hadoop investment.
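
Combining several jobs sequentially into one logical unit of work can be pictured as a runner that executes steps in order and stops at the first failure. This is a sketch in plain Python, not Oozie's workflow definition format: run_workflow, the step names and the status strings are illustrative assumptions.

```python
def run_workflow(steps):
    """Run (name, callable) steps in order; stop at the first one that raises.

    Returns (status, completed_step_names, error_message_or_None).
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            # The workflow as a whole fails; later steps never start.
            return "FAILED", completed, f"{name}: {error}"
        completed.append(name)
    return "SUCCEEDED", completed, None


def failing_step():
    raise RuntimeError("input table is empty")


log = []
good = run_workflow([
    ("import", lambda: log.append("import done")),
    ("transform", lambda: log.append("transform done")),
])
bad = run_workflow([
    ("import", lambda: log.append("import done")),
    ("transform", failing_step),
    ("export", lambda: log.append("export done")),
])
print(good)
print(bad)
```

Treating the chain as one unit means a caller only has to check a single status instead of tracking each job separately.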
What Hadoop can, and can't do

What Hadoop can't do
Hadoop is generally a poor fit for
• Structured data that needs relational, low-latency querying
• Transactional data

What Hadoop can do
You can use Hadoop for
• Big Data
Support & Partner
Getting started with Hadoop, or need support? Contact:

Muthu Natarajan

muthu.n@msquaresystems.com

www.msquaresystems.com

Phone: 212-941-6000/703-222-5500

Brief Introduction about Hadoop and Core Services.

  • 1. Msquare Systems Inc., INFORMATION TECHNOLOGY & CONSULTING FIRM Visit: http://www.msquaresystems.com/
  • 2. What is Hadoop? Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment. Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists of three main functions: storage, processing and resource management.
  • 3. Core services on Hadoop MapReduce: MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of machines in a reliable, fault-tolerant manner. Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
  • 4. Core services on Hadoop HDFS: The Hadoop Distributed File System is a Java-based file system that provides scalable and reliable data storage across large clusters. This Apache Software Foundation project is designed to provide a fault-tolerant file system that runs on commodity hardware. The primary objective of HDFS is to store data reliably even in the presence of failures, including NameNode failures, DataNode failures and network partitions. The NameNode is a single point of failure for the HDFS cluster, and a DataNode stores data in the Hadoop file management system.
  • 5. Core services on Hadoop Hadoop YARN: YARN is a next-generation framework for Hadoop data processing that extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models. It is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications. All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. Hadoop is now commonly considered to consist of a number of related projects as well.
  • 6. Core services on Hadoop Apache Tez: Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing. Tez is a generic data-processing pipeline engine envisioned as a low-level engine for higher abstractions such as Apache Hadoop MapReduce, Apache Pig, Apache Hive etc. It is a data-processing pipeline engine in which one can plug in input, processing and output implementations to perform arbitrary data processing. Every 'task' in Tez has the following: an Input to consume key/value pairs from, a Processor to process them, and an Output to collect the processed key/value pairs.
  • 7. Hadoop Data Services Apache Pig: It is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce. Apache Pig features a “Pig Latin” language layer that enables SQL-like queries to be performed on distributed datasets within Hadoop applications. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
  • 8. Hadoop Data Services Apache HBase: HBase is the Hadoop database. It is a distributed, scalable, big data store. HBase is a sub-project of the Apache Hadoop project and is used to provide real-time read and write access to your big data.
  • 9. Hadoop Data Services Apache Hive: Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. Hive is an open source volunteer project under the Apache Software Foundation. Previously it was a subproject of Apache Hadoop, but it has now graduated to become a top-level project of its own.
  • 10. Hadoop Data Services Apache Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. Flume’s high-level architecture is focused on delivering a streamlined codebase that is easy to use and easy to extend.
  • 11. Hadoop Data Services Apache Mahout: Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification, often leveraging, but not limited to, the Hadoop platform. Mahout's core algorithms for clustering, classification and collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
  • 12. Hadoop Data Services Apache Accumulo: Accumulo is a sorted, distributed key/value store and is at the core of Sqrrl Enterprise. It handles large amounts of structured, semi-structured, and unstructured data as a robust, scalable, and real-time data storage and retrieval system. Fine-grained security controls allow organizations to control data at the cell level and promote a data-centric security model without degrading performance. Accumulo can support a wide variety of real-time analytics, including statistics and graph analytics, via Accumulo’s server-side programming framework called iterators.
  • 13. Hadoop Data Services Apache Storm: Storm is a distributed realtime computation system. Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!
  • 14. Hadoop Data Services Apache Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
  • 15. Hadoop Data Services Apache HCatalog: HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid. HCatalog is a set of interfaces that open up access to Hive's metastore for tools inside and outside of the Hadoop grid. It provides a shared schema and data type mechanism for Hadoop tools. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored.
  • 16. Hadoop Operational Services Apache ZooKeeper: ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers, known as znodes. Every znode is identified by a path, with path elements separated by a slash (“/”). Aside from the root, every znode has a parent, and a znode cannot be deleted if it has children. The service is replicated over a set of machines, and each maintains an in-memory image of the data tree and transaction logs. Clients connect to a single ZooKeeper server and maintain a TCP connection through which they send requests and receive responses.
  • 17. Hadoop Operational Services Apache Falcon: Falcon is a framework for simplifying data management and pipeline processing in Apache Hadoop. It enables users to automate the movement and processing of datasets for ingest, pipelines, disaster recovery and data retention use cases. Instead of hard-coding complex dataset and pipeline processing logic, users can now rely on Apache Falcon for these functions, maximizing reuse and consistency across Hadoop applications. Falcon simplifies the development and management of data processing pipelines by introducing a higher layer of abstraction for users to work with.
  • 18. Hadoop Operational Services Apache Ambari: Apache Ambari is a 100-percent open source operational framework for provisioning, managing and monitoring Apache Hadoop clusters. Ambari includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters. Ambari includes an intuitive Web interface that allows you to easily provision, configure and test all the Hadoop services and core components. Ambari provides tools to simplify cluster management. The Web interface allows you to start/stop/test Hadoop services, change configurations and manage ongoing growth of your cluster.
  • 19. Hadoop Operational Services Apache Knox: The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for Apache™ Hadoop® services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster. Knox runs as a server (or cluster of servers) that serves one or more Hadoop clusters.
  • 20. Hadoop Operational Services Apache Oozie: Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Apache Oozie allows Hadoop administrators to build complex data transformations out of multiple component tasks. Apache Oozie helps administrators derive more value from their Hadoop investment.
  • 21. What Hadoop can and can't do. What Hadoop can't do: you can't use Hadoop for structured data or transactional data. What Hadoop can do: you can use Hadoop for Big Data.
  • 22. Support & Partner Getting Hadoop Started or Need Support – Muthu Natarajan muthu.n@msquaresystems.com www.msquaresystems.com Phone: 212-941-6000/703-222-5500
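
As a rough illustration of the MapReduce flow described in slide 3 (map each record, sort the map output by key, then feed each key to a reduce task), here is a minimal single-process word count in Python. It is only a sketch and does not use the Hadoop API; the names map_phase, reduce_phase and run_job are hypothetical and chosen for this example, and a real job would read its input from, and write its output to, a file system such as HDFS.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(word, counts):
    """Reduce step: combine all counts emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Run map, sort-by-key and reduce in one process (illustration only)."""
    # Map: each input record is handled independently, which is what lets
    # a real cluster run these calls in parallel on different machines.
    mapped = [pair for line in lines for pair in map_phase(line)]

    # Sort: the map output is ordered by key before the reduce step.
    mapped.sort(key=itemgetter(0))

    # Reduce: each key is passed to the reducer with all of its values.
    return dict(
        reduce_phase(word, [count for _, count in group])
        for word, group in groupby(mapped, key=itemgetter(0))
    )


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(run_job(sample))
```

The sort step matters because groupby only merges adjacent items with the same key. This sketch leaves out scheduling, monitoring and re-execution of failed tasks, which the slides say the framework handles.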