Mayuri Agarwal
Data Management !!!!!!
Big Data: What does it mean?
Velocity:
Often time-sensitive, big data must be used
as it streams into the enterprise in order
to maximize its value to the business.
Batch, near-time, real-time, streams
Volume:
Big data comes in one size: large.
Enterprises are awash with data, easily
amassing terabytes and even petabytes of
information.
TB, records, transactions, tables, files
Variety:
Big data extends beyond structured data
to include semi-structured and unstructured
data of all varieties: text, audio, video, click
streams, log files and more.
Structured, unstructured, semi-structured
Veracity:
Quality and provenance of received data.
Good, undefined, bad, inconsistency,
incompleteness, ambiguity
Value
Big Data
Worldwide data (chart): last 2 years, 90%; since the beginning of time, 10%
What is Hadoop?
Software project that enables the distributed processing of large data sets across clusters of
commodity servers
Works with structured and unstructured data
Open source software + commodity hardware = IT cost reduction
Designed to scale up from a single server to thousands of machines
Very high degree of fault tolerance: the software detects and handles failures at the application
layer
The origin of the name Hadoop
The name Hadoop is not an acronym; it's a
made-up name. The project's creator, Doug
Cutting, explains how the name came about:
"The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell and
pronounce, meaningless, and not used
elsewhere: those are my naming criteria.
Kids are good at generating such. Googol is
a kid's term."
Hadoop Sub-projects
• HDFS
• MapReduce
HDFS: Hadoop Distributed File System
• Distributed, scalable, and portable file system
A Hadoop instance typically has
a single NameNode; a cluster of DataNodes
forms the HDFS cluster.
Asynchronous replication.
Data is divided into 64 MB (default in Hadoop 1.x) or 128 MB
(default in later versions) blocks; each block is replicated 3 times (default).
The NameNode holds the file system metadata.
Files are broken up and spread over the DataNodes.
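The block and replica figures above can be made concrete with a small sketch. It is only an illustration: the function names and the DataNode labels are hypothetical, and the round-robin placement is a deliberate simplification (real HDFS placement is rack-aware).

```python
BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller than the rest
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Assumption: this ignores racks and load, unlike real HDFS.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# A 200 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(200)
layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Here a 200 MB file becomes three full 64 MB blocks plus one 8 MB block, and every block lives on three different nodes, so losing any single node leaves two copies of each block.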
HDFS- Read & Write
MapReduce
Software framework for distributed
computation
Input | Map() | Copy/Sort | Reduce() |
Output
JobTracker schedules and manages
jobs.
TaskTracker executes individual
map() and reduce() tasks on each cluster
node.
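The Input | Map() | Copy/Sort | Reduce() | Output pipeline can be imitated on a single machine. The sketch below is plain Python, not the Hadoop API; the function names are made up for illustration, and word counting is used only because it is the customary example.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Copy/Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce(): combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in order over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_phase(mapped)]


result = run_job(["big data big cluster", "data node"])
```

On a real cluster the map and reduce calls would run as separate tasks on many nodes; here they run in one process so that the data flow between the phases is easy to follow.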
Example : MapReduce
Master – Slave Model
Hadoop Ecosystem
HBase
• HBase is an open source, non-relational, distributed database
• A key-value store
• A value is identified by its key
• Both key and value are byte arrays
• Values are stored in key order
• Thus accessing data by key is very fast
• Users create tables in HBase
• HBase tables have no fixed schema
• Very good for sparse data
• Takes lots of disk space
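Why key order makes access fast can be shown with a toy store. This is a sketch under stated assumptions, not the HBase client API: the class name, its methods and the sample keys are all hypothetical.

```python
import bisect


class SortedByteStore:
    """Toy key-value store that keeps byte-array keys in sorted order."""

    def __init__(self):
        self._keys = []    # sorted list of bytes keys
        self._values = {}  # key -> bytes value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def scan(self, start: bytes, stop: bytes):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedByteStore()
store.put(b"user#002", b"bob")
store.put(b"user#001", b"alice")
store.put(b"video#001", b"intro.mp4")

# All keys with the prefix b"user#": the stop key is the next byte after "#".
users = list(store.scan(b"user#", b"user$"))
```

Because the keys are sorted, a range scan needs two binary searches and then reads neighbouring entries, which is why choosing keys with a shared prefix for related rows pays off.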
HBase Architecture
• Master: responsible for coordinating the region servers
• Region server: serves data for reads and writes
• ZooKeeper: manages the HBase cluster
• Low latency and random access to data
Hive
• A system for managing and querying structured data, built on Hadoop
• SQL-like query language called HiveQL (HQL)
• Main purpose is analysis and ad hoc querying
• DDL operations on databases, tables and partitions
• Not for: small data sets, low-latency queries, OLTP
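Partitions matter because Hive stores each one as a subdirectory named `column=value`, so a query that filters on the partition column only needs to read the matching directories. The sketch below imitates that pruning step in plain Python; the function name, the warehouse paths and the `dt` column are hypothetical.

```python
def prune_partitions(partition_dirs, column, wanted):
    """Keep only directories whose `column=value` segment matches `wanted`."""
    selected = []
    for path in partition_dirs:
        # Parse every "key=value" path segment into a dict.
        spec = dict(
            segment.split("=", 1) for segment in path.split("/") if "=" in segment
        )
        if spec.get(column) in wanted:
            selected.append(path)
    return selected


# Hypothetical warehouse layout for a table partitioned by `dt`.
dirs = [
    "/warehouse/logs/dt=2024-01-01",
    "/warehouse/logs/dt=2024-01-02",
    "/warehouse/logs/dt=2024-01-03",
]
to_scan = prune_partitions(dirs, "dt", {"2024-01-02"})
```

Only one of the three directories is selected for scanning, which is the reason large Hive tables are usually partitioned on the column most queries filter by.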
Hadoop-Hive Architecture
HBase-Hive configuration
HBase as ETL data sink
HBase as Data Source
Low Latency warehouse
Hive and MySQL Database Structure
Hadoop Limitations
• Hadoop is not a high-speed SQL database.
• Hadoop is not a particularly simple technology.
• Hadoop is not easy to connect to legacy systems.
• Hadoop is not a replacement for traditional data warehouses. It is an
adjunct to data warehouses.
• Traditional DBAs will need to learn new skills before they can adopt
Hadoop tools.
• The architecture around the data (the way you store data, the way
you de-normalize data, the way you ingest data, the way you extract
data) is different in Hadoop.
• Linux and Java skills are critical for making a Hadoop environment a
reality.
Hadoop's Capabilities
• Hadoop is a powerful environment that can transform your
understanding of data.
• Hadoop can store vast amounts of data.
• Hadoop can run queries on huge data sets.
• You can archive data on Hadoop and still query it.
• Hadoop allows you to ingest data at high speed and to analyze it and
report on it in near real-time.
• Hadoop massively reduces the latency of data.
Hadoop: A hot skill to acquire on the IT job
circuit
• The market for data technologies, such as databases, is a multi-billion dollar industry.
• Many start-ups are working on technology extensions to Hadoop to make it both analytical
and transactional. That would be big.
• Major companies have a big data strategy and want to build their businesses on top of it.
• Google, whose MapReduce and Google File System papers inspired Hadoop, has already moved on, suggesting that within a decade
either the Hadoop framework will have to be developed beyond all recognition or that
something newer could be on the way to supplant it.
• Many major internet companies, including Twitter, LinkedIn and Facebook, use some form
of Hadoop.
mayuri.enggheads@gmail.com
