Getting Started With Big Data
Apache Hadoop
Apache Hadoop
• Apache Hadoop is a popular open-source
framework for storing and
processing large data sets across
clusters of computers.
• HDP 2.2 on Sandbox system
Requirements:
– Now runs on 32-bit and 64-bit OS
(Windows XP, Windows 7,
Windows 8 and Mac OS X)
– Minimum 4 GB RAM; 8 GB required
to run Ambari and HBase
– Virtualization enabled in the BIOS
– Browser: Chrome 25+, IE 9+, Safari
6+ recommended. (Sandbox will
not run on IE 10)
• An ideal way to get started with Enterprise
Hadoop. Sandbox is a self-contained
virtual machine with Apache Hadoop
pre-configured alongside a set of
hands-on, step-by-step Hadoop
tutorials.
• Sandbox is a personal, portable Hadoop
environment that comes with a dozen
interactive Hadoop tutorials.
• It includes many of the most exciting
developments from the latest HDP
distribution, packaged up in a virtual
environment that you can get up and
running in 15 minutes!
Hadoop… Getting Started
Terminologies
• Hadoop
• YARN – the Hadoop Operating system
– enables a user to interact with all data in multiple
ways simultaneously, making Hadoop a true multi-use
data platform and allowing it to take its place in a
modern data architecture.
– A framework for job scheduling and cluster resource
management.
– This means that many different processing engines can
operate simultaneously across a Hadoop cluster, on
the same data, at the same time.
• the Hadoop Distributed File System (HDFS)
– A distributed file system that provides high-
throughput access to application data.
• MapReduce
– A YARN-based system for parallel processing of large
data sets.
• Sqoop
• the Hive ODBC Driver
Hortonworks Data Platform (HDP)
• is a 100% open source
distribution of Apache
Hadoop that is truly
enterprise grade having
been built, tested and
hardened with enterprise
rigor.
Introducing Apache Hadoop to
Developers
• Apache Hadoop is a community driven open-source project
governed by the Apache Software Foundation.
• originally implemented at Yahoo based on papers published
by Google in 2003 and 2004.
• Since then Apache Hadoop has matured into a data platform
that not only processes huge amounts of data in batch but,
with the advent of YARN, also supports many diverse
workloads, such as interactive queries over large data with
Hive on Tez, real-time data processing with Apache Storm,
highly scalable NoSQL storage with HBase, in-memory
processing with Spark, and the list goes on.
Apache Enterprise Hadoop
...
Core of Hadoop
• A set of machines running
HDFS and MapReduce is
known as a Hadoop Cluster.
Individual machines are
known as nodes. A cluster
can have as few as one node
to as many as several
thousands. For most
application scenarios Hadoop
is linearly scalable, which
means you can expect better
performance by simply
adding more nodes.
• The Hadoop
Distributed File
System (HDFS)
• MapReduce
MapReduce
• a method for distributing a task across multiple nodes. Each node
processes data stored on that node to the extent possible.
• A running MapReduce job consists of various phases such as
Map -> Sort -> Shuffle -> Reduce
• Advantages:
– Automatic parallelization and distribution of data in blocks across a
distributed, scale-out infrastructure.
– Fault-tolerance against failure of storage, compute and network
infrastructure
– Deployment, monitoring and security capability
– A clean abstraction for programmers
• Most MapReduce programs are written in Java. They can also be
written in any scripting language using the Streaming API of
Hadoop.
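The phases listed above can be illustrated with a minimal, single-process sketch in Python. It is not Hadoop code: the function names `map_phase`, `reduce_phase` and `run_job` are invented for this example, and the word-count task is an assumption chosen because it is the usual teaching case.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(word, counts):
    """Reducer: sum all counts that were grouped under this key."""
    return word, sum(counts)


def run_job(lines):
    # Map: each input record is processed independently.
    mapped = [pair for line in lines for pair in map_phase(line)]
    # Sort: order the intermediate pairs by key.
    mapped.sort(key=itemgetter(0))
    # Shuffle: bring all values for the same key together.
    grouped = {
        key: [count for _, count in pairs]
        for key, pairs in groupby(mapped, key=itemgetter(0))
    }
    # Reduce: one call per key.
    return dict(reduce_phase(key, counts) for key, counts in grouped.items())


word_counts = run_job(["big data big cluster", "data node"])
print(word_counts)
```

On a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; here everything happens in one process purely to show the order of the phases.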
The MapReduce Concepts and
Terminology
• In classic (pre-YARN) MapReduce, jobs are controlled by a software
daemon known as the JobTracker. The JobTracker resides on a
'master node'. Clients submit MapReduce jobs to the
JobTracker. The JobTracker assigns Map and Reduce tasks to
other nodes on the cluster.
• These nodes each run a software daemon known as the
TaskTracker. The TaskTracker is responsible for actually
instantiating the Map or Reduce task, and reporting
progress back to the JobTracker.
• A job is a complete execution of Mappers and Reducers over a
dataset. A task is the execution of a single Mapper or Reducer
over a slice of data.
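The distinction between a job and its tasks can be sketched as follows. This is a toy model, not the JobTracker's actual scheduler: the `Task` class, the `plan_job` function and the round-robin assignment are all assumptions made for illustration (a real scheduler also considers where the data is stored).

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Task:
    kind: str      # "map" or "reduce"
    slice_id: int  # which slice of the data this task covers
    node: str      # node whose TaskTracker would run the task


def plan_job(num_slices, num_reducers, nodes):
    """Return the tasks that make up one job, assigned round-robin to nodes."""
    assign = cycle(nodes)
    maps = [Task("map", i, next(assign)) for i in range(num_slices)]
    reduces = [Task("reduce", i, next(assign)) for i in range(num_reducers)]
    return maps + reduces


tasks = plan_job(num_slices=4, num_reducers=2, nodes=["node1", "node2", "node3"])
for task in tasks:
    print(task)
```

One job over four slices with two reducers therefore becomes six tasks spread over the three hypothetical nodes.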
Hadoop Distributed File System
• the foundation of the Hadoop cluster.
• manages how the datasets are stored in the
Hadoop cluster.
• responsible for distributing the data across the
data nodes, managing replication for
redundancy, and administrative tasks like
adding, removing and recovering data nodes.
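How a file might be distributed and replicated across data nodes can be sketched in a few lines of Python. The 128 MB block size and replication factor of 3 are assumed values for this example, `place_blocks` is a made-up helper, and the simple rotation below stands in for HDFS's real placement policy, which is more involved.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB)
REPLICATION = 3                 # assumed replication factor


def place_blocks(file_size, data_nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file into blocks and pick `replication` distinct nodes per block."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(num_blocks):
        # Simple rotation over the node list; illustrative only.
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


layout = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in layout.items():
    print(block, nodes)
```

Because every block lives on several nodes, losing one data node in this model still leaves copies of each block elsewhere, which is the redundancy the slide refers to.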
Apache Hive
• provides a data warehouse view of the data in HDFS.
• Using a SQL-like language, Hive lets you create
summarizations of your data, run ad-hoc queries,
and analyze large datasets in the Hadoop cluster.
• The overall approach with Hive is to project a table
structure on the dataset and then manipulate it with
HiveQL.
• Since you are using data in HDFS your operations can
be scaled across all the datanodes and you can
manipulate huge datasets.
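The idea of projecting a table structure onto raw data and then querying it can be tried out locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in, so it is not Hive or HiveQL; the table name `page_views`, its columns and the sample records are invented for the example.

```python
import sqlite3

# Raw, tab-delimited records as they might sit in a file (invented sample data).
raw_lines = [
    "2015-01-01\tweb\t120",
    "2015-01-01\tmobile\t80",
    "2015-01-02\tweb\t200",
]

conn = sqlite3.connect(":memory:")
# Project a table structure onto the raw records.
conn.execute("CREATE TABLE page_views (day TEXT, channel TEXT, views INTEGER)")
rows = [
    (day, channel, int(views))
    for day, channel, views in (line.split("\t") for line in raw_lines)
]
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)

# Summarize the data with an ad-hoc SQL query.
totals = conn.execute(
    "SELECT channel, SUM(views) FROM page_views GROUP BY channel ORDER BY channel"
).fetchall()
print(totals)
```

The difference worth keeping in mind is that this stand-in copies the records into a local database, whereas the slide describes Hive working on data that stays in HDFS.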
Apache HCatalog
• Used to hold location and metadata about the
data in a Hadoop cluster. This allows scripts and
MapReduce jobs to be decoupled from data
location and metadata like the schema.
• Since it supports many tools, like Hive and Pig,
the location and metadata can be shared
between tools. Using the open APIs of HCatalog
other tools like Teradata Aster can also use the
location and metadata in HCatalog.
• How can we reference data by name and inherit
the location and metadata?
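One way to picture referencing data by name is a small lookup table that maps a table name to its location and schema. The sketch below is a plain Python dictionary, not the HCatalog API; `TableInfo`, `CATALOG`, `describe` and the path `/data/page_views` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    location: str  # where the data lives
    schema: tuple  # ordered (column name, type) pairs


# Hypothetical in-memory catalog used only for this illustration.
CATALOG = {
    "page_views": TableInfo(
        location="/data/page_views",  # made-up path
        schema=(("day", "string"), ("channel", "string"), ("views", "int")),
    ),
}


def describe(table_name):
    """Look a table up by name and return its location and column names."""
    info = CATALOG[table_name]
    return info.location, [name for name, _ in info.schema]


location, columns = describe("page_views")
print(location, columns)
```

A script that only knows the name `page_views` keeps working if the data moves or the catalog entry changes, which is the decoupling the slide describes.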
Apache Pig
• a language for expressing data analysis and
infrastructure processes.
• is translated into a series of MapReduce jobs that
are run by the Hadoop cluster.
• is extensible through user-defined functions that
can be written in Java and other languages.
• Pig scripts provide a high level language to create
the MapReduce jobs needed to process data in a
Hadoop cluster.
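The style of a Pig data flow, including a user-defined function, can be approximated in plain Python. This is an analogy rather than Pig Latin: the comments name the Pig operators each step loosely corresponds to, and `normalize` and the sample records are invented for the example.

```python
from collections import defaultdict


def normalize(channel):
    """Stand-in for a user-defined function applied to each record."""
    return channel.strip().lower()


records = [("Web ", 120), ("mobile", 80), ("WEB", 200)]  # invented sample data

# Roughly FOREACH ... GENERATE: apply the function to every record.
cleaned = [(normalize(channel), views) for channel, views in records]
# Roughly FILTER: keep records with at least 100 views.
filtered = [(channel, views) for channel, views in cleaned if views >= 100]
# Roughly GROUP followed by SUM: aggregate views per channel.
totals = defaultdict(int)
for channel, views in filtered:
    totals[channel] += views
result = dict(totals)
print(result)
```

In Pig, a script with these steps would be translated into MapReduce jobs and run on the cluster instead of in a single process.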
