This presentation describes the company where I did my summer training, and covers what Big Data is, why we use it, Big Data challenges and issues, solutions to those issues, Hadoop, Docker, Ansible, and related topics.
2. COMPANY OVERVIEW
►Company name : LinuxWorld Informatics Pvt Ltd
►LinuxWorld Informatics Pvt Ltd, a RedHat Awarded Partner, Cisco Learning Partner, and an ISO 9001:2008 certified company, is dedicated to offering a comprehensive set of the most useful Open Source and commercial training programmes for today's demands.
NOTE: This organisation specialises in providing training to students of B.Tech, M.Tech, MCA, BCA, and other students pursuing courses in computer-related technologies.
3. COMPANY OVERVIEW
Core divisions of the organisation:
Training & Development Services
Technical Support Services
Research & Development Centre
Courses provided by the organisation:
RedHat Linux
Cloud Computing
Big Data Hadoop
DevOps
8. Data that is very large in size is called Big Data. Normally we work on data of size MB (Word docs, Excel sheets) or at most GB (movies, code), but data on the scale of petabytes, i.e. 10^15 bytes, is called Big Data.
Big Data describes data sets so large and complex that they are impractical to manage with traditional software tools.
Big Data also tends to refer to the use of predictive analytics and other advanced data analytics methods that extract value from data.
The amount of Big Data increases exponentially: more than 500 terabytes of data are uploaded to Facebook alone in a single day, which represents a real problem in terms of analysis.
10. Categories of Big Data:
1. Structured data:
This covers all data that can be stored in an SQL database, in tables with rows and columns, in an ordered manner.
Structured data represents only 5 to 10% of all informatics data.
11. Unstructured data:
Unstructured data represents around 80% of all data.
It often includes text and multimedia content.
Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages, and many other kinds of business documents.
Unstructured data is everywhere.
12. Semi-structured data:
Semi-structured data is information that does not reside in a relational database but does have some organizational properties that make it easier to analyze with some processing.
Examples of semi-structured data: CSV, as well as XML and JSON documents; NoSQL databases are also considered semi-structured.
Semi-structured data represents a small part of all data (5 to 10%).
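To make the distinction concrete, here is a small illustrative Python sketch (not from the original slides; the table, records, and field names are made up) contrasting structured rows with a fixed schema against semi-structured JSON documents whose fields vary:

import json
import sqlite3

# Structured: rows with a fixed schema in a relational table (an in-memory SQLite table for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Asha', 'asha@example.com')")
print(conn.execute("SELECT name, email FROM users").fetchall())

# Semi-structured: JSON has organizational properties (keys) but no rigid schema,
# so different records may carry different fields.
records = [
    '{"id": 2, "name": "Ravi", "tags": ["hadoop", "docker"]}',
    '{"id": 3, "email": "kiran@example.com"}',
]
for raw in records:
    doc = json.loads(raw)
    print(doc.get("name", "<no name>"), sorted(doc.keys()))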
13. BIG DATA CHALLENGES
1. Data storage and quality:
Companies and organisations are growing at a very fast pace, and this rapid growth increases the amount of data produced.
Storing this data is becoming a challenge for everyone.
Options like data lakes and data warehouses are used to collect and store massive quantities of unstructured data in their native format.
When a data lake or warehouse tries to combine inconsistent data from disparate sources, it encounters errors. Inconsistent data, duplicates, and missing data all result in data quality challenges.
14. People who understand Big Data analysis:
1. Data analysis is very important to make the huge amount of data being produced useful.
2. There is a huge need for Big Data analysts and data scientists. The shortage of quality data scientists has made it a job in great demand.
3. This is another challenge faced by companies: the number of data scientists available is very small in comparison to the amount of data being produced.
15. 4 V's of Big Data
VOLUME:
The main characteristic that makes data "big" is the sheer volume.
It makes no sense to focus on minimum storage units because the total amount of information is growing exponentially every year.
VARIETY:
Variety refers to the many sources and types of data, both structured and unstructured.
We used to store data from sources like spreadsheets and databases; now data also comes in the form of emails, photos, videos, PDFs, audio, etc.
16. VERACITY:
Big Data veracity refers to the noise and abnormality in data.
Veracity is the biggest challenge in data analysis when compared to things like volume and velocity.
VELOCITY:
Velocity is the frequency of incoming data that needs to be processed.
Think about how many SMS messages, Facebook status updates, or credit card swipes are being sent on a particular telecom carrier every minute of every day, and you will have a good appreciation of velocity.
A streaming service like Amazon Web Services Kinesis is an example of an application that handles the velocity of data.
17. Issues in Big Data
1. YOUR PERSONAL INFORMATION IS AT RISK:
Data breaches involve the disclosure of customer information to unauthorized people.
With some breaches involving Social Security numbers, email addresses, contact information, and debit and credit card numbers, your personal information is really at risk.
2. E-DISCOVERY PROBLEMS:
E-discovery refers to the search of electronic data for use as evidence in a legal proceeding.
It is now more difficult to search for electronic evidence because there is so much data.
3. ANALYTICS ISN'T 100% ACCURATE:
It is very difficult to check the analysis manually, so your best bet to ensure that your analytics will not produce gravely inaccurate results is to use a trusted data analysis tool that guarantees a high level of accuracy.
18. Solutions for Big Data
1. Traditional approach:
Data is stored in an RDBMS like Oracle Database, MS SQL Server, or DB2, and sophisticated software can be written to interact with the database, process the required data, and present it to the users for analysis.
Limitation: this approach works well only where we have a small volume of data.
19. 2. Google's Solution
Google solved this problem using an algorithm called MapReduce.
This algorithm divides the task into small parts, assigns those parts to many computers connected over the network, and then collects the results to form the final result dataset.
20. What is Hadoop?
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very huge in volume.
Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and others.
Hadoop is the core platform for structuring Big Data, where Big Data means large volumes of structured and unstructured data.
21. HDFS
The Hadoop Distributed File System (HDFS) is designed to store very large data sets.
In a large cluster, thousands of servers both host directly attached storage and execute user application tasks.
An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data.
A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth simply by adding commodity servers.
HDFS stores filesystem metadata and application data separately.
HDFS stores metadata on a dedicated server called the NameNode.
Application data is stored on other servers called DataNodes.
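The metadata/data split can be pictured with a toy Python sketch (purely illustrative and not part of the slides; real HDFS is far more involved, and all paths, block IDs, and node names below are made up):

# NameNode side: only metadata, i.e. which blocks make up a file and where their replicas live.
namenode_metadata = {
    "/logs/access.log": [("blk_1", ["dn1", "dn2", "dn3"]),
                         ("blk_2", ["dn2", "dn3", "dn4"])],
}

# DataNode side: the actual block contents.
datanode_storage = {
    "dn1": {"blk_1": b"first block bytes "},
    "dn2": {"blk_1": b"first block bytes ", "blk_2": b"second block bytes"},
    "dn3": {"blk_1": b"first block bytes ", "blk_2": b"second block bytes"},
    "dn4": {"blk_2": b"second block bytes"},
}

def read_file(path):
    """A client asks the NameNode where the blocks are, then fetches the bytes from DataNodes."""
    data = b""
    for block_id, replicas in namenode_metadata[path]:
        data += datanode_storage[replicas[0]][block_id]  # read from the first listed replica
    return data

print(read_file("/logs/access.log"))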
23. The NameNode is the master node in the Apache Hadoop HDFS architecture.
The NameNode maintains and manages the blocks present on the DataNodes.
The NameNode controls access to files by clients.
User data never resides on the NameNode; the data resides on the DataNodes.
It is the master daemon that maintains and manages the DataNodes.
It records the metadata of all the files stored in the cluster, e.g. the locations of stored blocks, the sizes of the files, permissions, hierarchy, etc.
It regularly receives a heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
It keeps a record of all the blocks in HDFS and of the nodes on which these blocks are located.
The NameNode is also responsible for maintaining the replication factor of all the blocks.
24. DATA NODE
Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
A DataNode is usually configured with a lot of hard disk space, because the actual data is stored on the DataNodes.
The DataNode is responsible for storing the actual data in HDFS.
The DataNode is also known as the slave node.
File data is replicated on multiple DataNodes for reliability.
DataNodes send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this interval is set to 3 seconds.
25. BLOCK
The Hadoop Distributed File System also stores data in terms of blocks.
The block size in HDFS is very large; the default size of an HDFS block is 64 MB.
Files are split into 64 MB blocks and then stored in the Hadoop filesystem.
The Hadoop framework is responsible for distributing the data blocks across multiple nodes.
If the size of a file is less than the HDFS block size, the file occupies only as much space as it actually needs rather than a full block.
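As a quick illustration of the splitting described above, the following Python sketch (not from the slides; the file sizes are hypothetical) computes the blocks a file would occupy with the 64 MB default:

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default HDFS block size described above

def split_into_blocks(file_size_bytes):
    """Return the sizes of the HDFS blocks a file of this size would occupy."""
    full_blocks = file_size_bytes // BLOCK_SIZE
    remainder = file_size_bytes % BLOCK_SIZE
    blocks = [BLOCK_SIZE] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block only holds the leftover bytes
    return blocks

# A 200 MB file -> three 64 MB blocks plus one 8 MB block.
print([b // (1024 * 1024) for b in split_into_blocks(200 * 1024 * 1024)])
# A 10 MB file is smaller than the block size, so it uses a single 10 MB block.
print([b // (1024 * 1024) for b in split_into_blocks(10 * 1024 * 1024)])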
26. MapReduce
Hadoop MapReduce is a software framework for distributed processing of large data sets on computing clusters.
MapReduce is the core component for data processing in the Hadoop framework.
MapReduce splits the input data set into a number of parts and runs a program on all the parts in parallel at once.
The term MapReduce refers to two separate and distinct tasks:
1. The first is the map operation, which takes a set of data and converts it into another set of data.
2. The reduce operation combines those data tuples based on the key and accordingly modifies the value of the key.
Mappers and reducers run as tasks on nodes in the cluster.
For example, in the WordCount application the mapper functions read blocks of text and output each word they find. If several mappers output the same word, each of those outputs is aggregated at a single reducer, which can count the number of times it has seen that word.
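The WordCount flow can be sketched in plain Python (an illustrative simulation only, not Hadoop's actual Java API; the shuffle phase is modelled here with a simple dictionary):

from collections import defaultdict

def mapper(block_of_text):
    """Map step: emit a (word, 1) pair for every word in a block of text."""
    for word in block_of_text.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce step: combine all counts for one key into a single total."""
    return (word, sum(counts))

# Simulate the shuffle phase: group intermediate pairs by key before reducing.
blocks = ["big data needs hadoop", "hadoop stores big data"]
grouped = defaultdict(list)
for block in blocks:
    for word, count in mapper(block):
        grouped[word].append(count)

results = [reducer(word, counts) for word, counts in grouped.items()]
print(sorted(results))  # [('big', 2), ('data', 2), ('hadoop', 2), ('needs', 1), ('stores', 1)]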
28. Job Tracker & Task Tracker
MapReduce processing in Hadoop 1 is handled by the JobTracker and TaskTracker daemons.
Client applications submit jobs to the JobTracker.
The JobTracker talks to the NameNode to determine the location of the data.
The JobTracker locates TaskTracker nodes with available slots at or near the data.
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored; a TaskTracker notifies the JobTracker when a task fails.
The JobTracker then decides what to do: for example, it may resubmit the task.
When the work is completed, the JobTracker updates its status.
31. DOCKER
Docker is the world's leading software container platform.
Developers use Docker to eliminate "works on my machine" problems.
Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.
Docker containers are so lightweight that a single server or virtual machine can run several containers simultaneously.
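As a small illustration of driving Docker from code, the sketch below uses the Docker SDK for Python (assumes `pip install docker` and a running Docker daemon; the image and command are just examples):

import docker  # Docker SDK for Python; requires a running Docker daemon

client = docker.from_env()

# Run a throwaway container from the lightweight alpine image and capture its output.
output = client.containers.run("alpine", ["echo", "hello from a container"], remove=True)
print(output.decode().strip())

# Several containers can run side by side on the same host, all sharing its kernel.
for container in client.containers.list():
    print(container.name, container.status)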
33. CONTAINER
A container image is a lightweight, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, and settings.
Container images are available for both Linux-based and Windows-based apps.
Containers isolate software from its surroundings.
34. Properties of Containers
LIGHTWEIGHT
Docker containers running on a single machine share that machine's operating system kernel.
Images are constructed from filesystem layers and share common files.
This minimizes disk usage, and image downloads are much faster.
STANDARD
Docker containers are based on open standards and run on all major Linux distributions and on any infrastructure, including VMs, bare metal, and the cloud.
SECURE
Docker containers isolate applications from one another and from the underlying infrastructure.
38. ANSIBLE
Ansible is a simple but powerful server and configuration management tool; learning to use it effectively pays off whether you manage one server or thousands.
Ansible tasks are idempotent without extra coding, so they can safely be run again and again, which is not usually true of bash scripts.
Ansible has two types of machines: controlling machines and nodes.
First, there is a single controlling machine; the controlling machine describes the locations of nodes through its inventory.
Second, nodes are managed by the controlling machine over SSH.
Playbooks express configurations and deployments in Ansible.
The playbook format is YAML.
Each playbook maps a group of hosts to a set of roles, and each role is represented by calls to Ansible tasks, as sketched below.
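Here is a minimal, hedged sketch in Python that writes a small example playbook and inventory and then invokes ansible-playbook on them (assumes Ansible is installed and on PATH; the host addresses and package name are hypothetical):

import subprocess
import tempfile
from pathlib import Path

# A minimal playbook: one play mapping a host group ("web") to a short list of tasks.
playbook_text = """\
- hosts: web
  become: yes
  tasks:
    - name: Ensure nginx is installed
      package:
        name: nginx
        state: present
    - name: Ensure nginx is running
      service:
        name: nginx
        state: started
"""

# Inventory: the controlling machine's description of where the nodes are (example addresses).
inventory_text = "[web]\n192.0.2.10\n192.0.2.11\n"

workdir = Path(tempfile.mkdtemp())
(workdir / "site.yml").write_text(playbook_text)
(workdir / "hosts").write_text(inventory_text)

# Because the tasks are idempotent, re-running this playbook only changes what is not
# already in the desired state on the managed nodes (reached over SSH).
subprocess.run(
    ["ansible-playbook", "-i", str(workdir / "hosts"), str(workdir / "site.yml")],
    check=True,
)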