SUMMER TRAINING PRESENTATION
BIG DATA HADOOP
BY: SHIVANEE GARG
COMPANY OVERVIEW
►Company name: Linux World Informatics Pvt Ltd
►Linuxworld Informatics Pvt Ltd, a RedHat Awarded Partner, a Cisco
Learning Partner, and an ISO 9001:2008 certified company, is dedicated to
offering a comprehensive set of the most useful open-source and
commercial training programmes that meet today's demands.
NOTE: This organisation is specialized in providing training to
students of B.Tech, M.Tech, MCA, and BCA, and to other students
pursuing courses in computer-related technologies.
COMPANY OVERVIEW
 Core divisions of the
organisation:
 Training & Development
Services
 Technical Support
Services
 Research & Development
Centre
 Courses provided
by the organisation:
 RedHat Linux
Cloud Computing
BigData Hadoop
DevOps
Course:
BigData Hadoop
Technologies Learned:
1. Hadoop
2. MapReduce
3. Single-node & multi-node clusters
4. Docker
5. Ansible
6. Python
WHAT IS BIG DATA?
 Data that is very large in size is called Big Data. Normally we
work on data of size MB (Word docs, Excel files) or at most GB (movies,
code), but data on the order of petabytes, i.e. 10^15 bytes, is called Big Data.
 Big Data describes data sets so large and complex that they are
impractical to manage with traditional software tools.
 Big Data also tends to refer to the use of predictive analytics and
other advanced data analytics methods that extract value from
data.
 The amount of Big Data increases exponentially: more than 500
terabytes of data are uploaded to Facebook alone in a single day, which
represents a real problem in terms of analysis.
Categories of Big Data:
1. Structured data:
 This concerns all data that can be stored in a SQL database, in tables
with rows and columns, in an ordered manner.
 Structured data represents only 5 to 10% of all
data.
2. Unstructured data:
 Unstructured data represents around 80% of all data.
 It often includes text and multimedia content.
 Examples include e-mail messages, word
processing documents, videos, photos, audio files,
presentations, webpages, and many other kinds of
business documents.
 Unstructured data is everywhere.
3. Semi-structured data:
 Semi-structured data is information that doesn't reside in a
relational database,
 but that does have some organizational properties that make it
easier to analyze with some processing.
 Examples of semi-structured data: CSV, XML, and JSON documents are
semi-structured, and NoSQL databases are also considered
semi-structured.
 Semi-structured data represents a small share of all data (5 to 10%).
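To make the distinction concrete, here is a minimal, hypothetical illustration in Python: the same record as a fixed-schema row (structured) and as a JSON document (semi-structured). The field names are invented for the example.

    import json

    # Structured: a fixed-schema row, as it would sit in a SQL table
    # with columns (id, name, email) -- positions and types known up front.
    row = (101, "Asha", "asha@example.com")

    # Semi-structured: a JSON document; it has organizational properties
    # (named fields) but no rigid schema -- records may add or omit fields.
    doc = json.loads('{"id": 101, "name": "Asha", "tags": ["hadoop", "bigdata"]}')

    print(row[1])        # access by position, schema agreed in advance
    print(doc["tags"])   # access by key, structure discovered at read time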
BIG DATA CHALLENGES
1. Data storage and quality:
 Companies and organizations are growing at a very fast pace, and this growth rapidly
increases the amount of data produced.
 The storage of this data is becoming a challenge for everyone.
 Options like data lakes and warehouses are used to collect and store massive quantities of
unstructured data in its native format.
 When a data lake or warehouse tries to combine inconsistent data from disparate
sources,
 it encounters errors. Inconsistent data, duplicates, and missing data all result in data
quality challenges.
2. People who understand Big Data analysis:
 Data analysis is essential for making the huge amount of
data being produced useful.
 There is a huge need for Big Data analysts and data scientists. The
shortage of quality data scientists has made it a job in great demand.
 This is another challenge faced by companies: the number of data
scientists available is very small in comparison to the amount of data
being produced.
4 V's OF BIG DATA
VOLUME :
 The main characteristic that makes data “big” is the sheer volume.
 It makes no sense to focus on minimum storage units because the total amount
of information is growing exponentially every year.
VARIETY :
 Variety refers to the many sources and types of data, both structured
and unstructured.
 We used to store data from sources like spreadsheets and databases;
now data comes in the form of emails, photos, videos, PDFs, audio, etc.
VERACITY :
 Big Data veracity refers to the noise and abnormality in data.
 Veracity is the biggest challenge in data analysis when compared to
aspects like volume and velocity.
VELOCITY :
 Velocity is the frequency of incoming data that needs to be processed.
 Think about how many SMS messages, Facebook status updates, or credit card swipes are
sent over a particular telecom carrier every minute of every day,
 and you will have a good appreciation of velocity.
 A streaming service like Amazon Web Services Kinesis is an example of an application
that handles the velocity of data.
Issues in Big Data
1. YOUR PERSONAL INFORMATION IS AT RISK :
 Data breaches involve the disclosure of customer information to unauthorized people.
 With some breaches involving Social Security numbers, email addresses, contact
information, and debit and credit card numbers, your personal information is really
at risk.
2. E-DISCOVERY PROBLEMS :
 E-discovery refers to the search of electronic data for use as evidence in a
legal proceeding.
 With Big Data, it is now more difficult to search for electronic evidence
because there is so much data to sift through.
3. ANALYTICS ISN'T 100% ACCURATE :
 Since it is very difficult to check the analysis manually,
 your best bet to ensure that your analytics will not produce gravely inaccurate
results is to use a trusted data analysis tool that guarantees a high level of accuracy.
Solutions for Big Data
1. Traditional Approach :
Data is stored in an RDBMS like Oracle Database, MS SQL Server, or DB2,
and sophisticated software can be written to interact with the database,
process the required data, and present it to users for analysis.
Limitation: this approach works well only when the volume of data is small.
2. Google's Solution
 Google solved this problem using an algorithm called MapReduce.
 This algorithm divides the task into small parts, assigns those
parts to many computers connected over the network,
 and collects the results to form the final result dataset.
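As a rough sketch of the idea in plain Python (not Hadoop itself): split the input into parts, apply the same function to every part independently, then combine the partial results into the final dataset.

    from collections import Counter
    from functools import reduce

    # Three "parts" of the input; in MapReduce each part would be
    # processed on a different machine in parallel.
    parts = ["big data", "hadoop big data", "map reduce"]

    # Map: turn each part into a partial result (per-part word counts).
    partials = [Counter(part.split()) for part in parts]

    # Reduce: combine the partial results into the final result dataset.
    total = reduce(lambda a, b: a + b, partials, Counter())
    print(total)  # Counter({'big': 2, 'data': 2, 'hadoop': 1, ...})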
What is Hadoop?
 Hadoop is an open-source framework from Apache used to store,
process, and analyze data that is very huge in volume.
 Hadoop is written in Java and is not OLAP (online analytical processing). It is
used for batch/offline processing. It is used by Facebook, Yahoo, Google,
Twitter, and LinkedIn.
 Hadoop is the core platform for structuring Big Data.
 Big Data is large volumes of structured and unstructured data.
HDFS
 The Hadoop Distributed File System (HDFS) is designed to store very large data sets.
 In a large cluster, thousands of servers both host directly attached storage and execute user
application tasks.
 An important characteristic of Hadoop is the partitioning of data and computation
across many (thousands of) hosts, and the execution of application computations in
parallel, close to their data.
 A Hadoop cluster scales computation capacity, storage capacity, and I/O
bandwidth simply by adding commodity servers.
 HDFS stores filesystem metadata and application data separately.
 HDFS stores metadata on a dedicated server, called the NameNode.
 Application data is stored on other servers called DataNodes.
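For reference, a few standard HDFS shell commands showing the user's view of the filesystem (the paths and file names here are placeholders):

    hadoop fs -mkdir /user/demo                 # create a directory in HDFS
    hadoop fs -put localdata.csv /user/demo/    # copy a local file into HDFS
    hadoop fs -ls /user/demo                    # list the directory
    hadoop fs -cat /user/demo/localdata.csv     # read the file back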
ARCHITECTURE OF HDFS
 The NameNode is the master node in the Apache Hadoop HDFS architecture.
 The NameNode maintains and manages the blocks present on the DataNodes.
 The NameNode controls access to files by clients.
 User data never resides on the NameNode; the data resides on DataNodes.
 It is the master daemon that maintains and manages the DataNodes.
 It records the metadata of all the files stored in the cluster, e.g. the location
of stored blocks, the size of the files, permissions, hierarchy, etc.
 It regularly receives a Heartbeat and a block report from all the DataNodes in
the cluster to ensure that the DataNodes are live.
 It keeps a record of all the blocks in HDFS and of the nodes on which these blocks
are located.
 The NameNode is also responsible for maintaining the replication factor of all the
blocks.
DATA NODE
 Client applications can talk directly to a DataNode once the
NameNode has provided the location of the data.
 A DataNode is usually configured with a lot of hard disk space,
because the actual data is stored on the DataNodes.
 The DataNode is responsible for storing the actual data in HDFS.
 DataNodes are also known as slave nodes.
 File data is replicated on multiple DataNodes for reliability.
 DataNodes send heartbeats to the NameNode periodically to report the overall
health of HDFS; by default, this frequency is set to 3 seconds.
BLOCK
The Hadoop Distributed File System stores data in terms of
blocks.
Compared to ordinary filesystems, the block size in HDFS is very large:
the default HDFS block size is 64 MB.
Files are split into 64 MB blocks and then stored in the Hadoop
filesystem.
The Hadoop framework is responsible for distributing the data
blocks across multiple nodes.
If the size of a file is less than the HDFS block size, the file does not
occupy a whole block; it consumes only as much underlying storage as it needs.
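As an illustration, the block size is a cluster-wide setting; in Hadoop 1 it is governed by the dfs.block.size property in hdfs-site.xml, with the value given in bytes (64 MB shown here):

    <!-- hdfs-site.xml (excerpt) -->
    <property>
      <name>dfs.block.size</name>
      <value>67108864</value>  <!-- 64 MB = 64 * 1024 * 1024 bytes -->
    </property>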
MapReduce
 Hadoop MapReduce is a software framework for distributed processing of
large data sets on computing clusters.
 MapReduce is the core component for data processing in the Hadoop
framework.
 MapReduce splits the input data set into a number of parts and runs a program
on all the parts in parallel at once.
 The term MapReduce refers to two separate and distinct tasks:
1. The map operation takes a set of data and converts it into another set of
data, in which individual elements are broken down into key/value tuples.
2. The reduce operation combines those data tuples based on the key and accordingly
modifies the value of the key.
Mappers and Reducers run as tasks on nodes in the cluster.
For example, in the WordCount application the Mapper functions read blocks of text and
output each word they find. If several Mappers output the same word, those
outputs are all routed to a single Reducer, which can count the number of times
the word occurred.
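A minimal WordCount sketch using Hadoop Streaming, which lets the Mapper and Reducer be written in Python (file names and HDFS paths are illustrative; Streaming sorts the Mapper output by key before it reaches the Reducer):

    #!/usr/bin/env python
    # mapper.py -- emit "word<TAB>1" for every word in the input
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    #!/usr/bin/env python
    # reducer.py -- input arrives sorted by word, so counts can be
    # accumulated in a single pass
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, one = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(one)
    if current is not None:
        print("%s\t%d" % (current, count))

The job would then be submitted along these lines:

    hadoop jar hadoop-streaming.jar \
        -input /user/demo/input -output /user/demo/output \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py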
Job Tracker & Task Tracker
 MapReduce processing in Hadoop 1 is handled by the JobTracker
and TaskTracker daemons.
 Client applications submit jobs to the JobTracker.
 The JobTracker talks to the NameNode to determine the location of
the data.
 The JobTracker locates TaskTracker nodes with available slots at or
near the data.
 The JobTracker submits the work to the chosen TaskTracker nodes.
 The TaskTracker nodes are monitored; a TaskTracker notifies the
JobTracker when a task fails.
 The JobTracker then decides what to do: it may resubmit the job elsewhere.
 When the work is completed, the JobTracker updates its status.
DOCKER
Docker is the world's leading software container
platform.
 Developers use Docker to eliminate "works on my machine"
problems.
 Docker is a tool designed to make it easier to create, deploy, and
run applications by using containers.
 Docker containers are so lightweight that a single server or virtual
machine can run several containers simultaneously.
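For example (the image name is just a placeholder), pulling and running a containerized service takes a couple of commands, and several such containers can run side by side on one host:

    docker pull httpd                # fetch an image from a registry
    docker run -d -p 8080:80 httpd   # run it detached, mapping port 8080 to 80
    docker ps                        # list the running containers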
CONTAINER
 A container image is a lightweight, executable package of a piece of
software that includes everything needed to run it: code, runtime,
system tools, system libraries, and settings.
 Container images are available for both Linux and Windows based apps.
 Containers isolate software from its surroundings.
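A container image is typically described by a Dockerfile; here is a minimal, hypothetical one that packages a Python script together with everything it needs (app.py is an assumed file):

    # Dockerfile -- base runtime plus the application code and its settings
    FROM python:3
    WORKDIR /app
    COPY app.py .
    CMD ["python", "app.py"]

It would be built with "docker build -t myapp ." and run with "docker run myapp".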
Properties of Containers
 LIGHTWEIGHT
 Docker containers running on a single machine share that machine's
operating system kernel;
 Images are constructed from filesystem layers and share common files.
 This minimizes disk usage and makes image downloads much faster.
STANDARD
 Docker containers are based on open standards and run on all major
Linux distributions and on any infrastructure including VMs, bare-metal
and in the cloud.
SECURE
 Docker containers isolate applications from one another and
from the underlying infrastructure.
Hypervisor-Based Virtualization vs. Container Virtualization
ANSIBLE
 Ansible tasks are idempotent without extra coding; bash scripts, by
contrast, usually cannot safely be run again and again.
 Ansible is a simple but powerful server and configuration management
tool. Learn to use Ansible effectively, whether you manage one server
or thousands.
 Ansible has two types of servers: controlling machines and nodes.
 First, there is a single controlling machine. The controlling machine describes the
locations of nodes through its inventory.
 Second, nodes are managed by the controlling machine over SSH.
 Playbooks express configurations and deployments in Ansible.
 The Playbook format is YAML.
 Each Playbook maps a group of hosts to a set of roles.
 Each role is represented by calls to Ansible tasks.
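These points map directly onto a small, hypothetical example: an inventory file names the nodes, and a YAML playbook maps a group of hosts to tasks run over SSH. Because the modules are idempotent, running the playbook twice changes nothing the second time.

    # inventory -- the controlling machine's list of managed nodes
    [webservers]
    node1.example.com
    node2.example.com

    # site.yml -- a Playbook in YAML: a group of hosts mapped to tasks
    - hosts: webservers
      become: yes
      tasks:
        - name: Install Apache
          yum:
            name: httpd
            state: present   # idempotent: a no-op if already installed
        - name: Ensure Apache is running
          service:
            name: httpd
            state: started

It would be run from the controlling machine with "ansible-playbook -i inventory site.yml".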
Created by:
Shivanee Garg
Computer Science
Roll No. 14/261