Big Data Analytics and Hadoop is presented. Key points include:
- Big data is large and complex data that is difficult to process using traditional methods. Domains that produce large datasets include meteorology, physics simulations, and internet search.
- The four V's of big data are volume, velocity, variety, and veracity. Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. Its core components are HDFS for storage and MapReduce for processing.
- Apache Hadoop has gained popularity for big data analytics due to its ability to process large amounts of data in parallel using commodity hardware, its scalability, and automatic failover. A Hadoop ecosystem of
4. What is Big data?
Key Challenges
• Capture & Store
• Search
• Sharing & Transfer
• Analysis
Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using traditional data
processing applications.
Domains with Large Datasets:
• Meteorology
• Complex physics simulations
• Biological and environmental
research
• Internet Search
5. Dimensions to Big Data
• Initially, there were three dimensions to big data,
known as Volume, Variety and Velocity.
• These are also called the characteristics of big data or
the 3V’s of Big data.
• A 4th V (Veracity) was added afterwards.
6. Volume(Scale)-Data Volume
• Data volume is increasing exponentially: a projected 44x
increase from 2009 to 2020, from 0.8 zettabytes (ZB) to
35 ZB.
• 1 TB = 1024 GB
• 1 petabyte (PB) = 1024 TB (about 10^15 bytes, the 5th power of 1000)
• 1 exabyte (EB) = 1024 PB (about 10^18 bytes, the 6th power of 1000)
• 1 zettabyte (ZB) = 1024 EB
• 1 yottabyte (YB) = 1024 ZB
• Big Data is a collection of huge volumes of Data.
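The unit ladder above mixes two conventions: decimal prefixes step by 1000 (so a petabyte is 10^15 bytes), while the "= 1024" steps are the binary convention. A minimal Python sketch of both, plus the growth arithmetic, follows. The function names are illustrative only.

```python
# Decimal (SI) prefixes step by 1000; binary prefixes step by 1024.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def si_bytes(unit: str) -> int:
    """Bytes in one decimal unit, e.g. PB -> 1000**5 = 10**15."""
    return 1000 ** (UNITS.index(unit) + 1)


def binary_bytes(unit: str) -> int:
    """Bytes in the matching binary unit, e.g. PB -> 1024**5."""
    return 1024 ** (UNITS.index(unit) + 1)


# Growth quoted on the slide: 0.8 ZB (2009) to 35 ZB (2020).
growth_factor = 35 / 0.8

if __name__ == "__main__":
    print(si_bytes("PB") == 10 ** 15)
    print(binary_bytes("PB") // binary_bytes("TB"))
    print(round(growth_factor, 2))
```

The ratio 35 / 0.8 comes to 43.75, which the slide rounds to 44x.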
7. Velocity (Speed)
• Data is being generated fast and needs to be processed fast.
• Requires online data analytics
• Late decisions mean missed opportunities
• Examples
•E-promotions: based on your current location, your
purchase history, and what you like, send promotions right
now for the store next to you
•Healthcare monitoring: sensors monitor your activities
and body; any abnormal measurements require
immediate reaction
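The healthcare example can be sketched as an online check that looks at each reading as it arrives instead of waiting for a batch. This is only an illustration: the function name, the 5-reading window and the 20% tolerance are assumptions, and positive-valued readings (such as heart rate) are assumed.

```python
from collections import deque


def monitor(readings, window=5, tolerance=0.2):
    """Yield (index, value) for readings far from the rolling mean.

    A reading is flagged when it differs from the mean of the last
    `window` normal readings by more than `tolerance` (a fraction).
    """
    recent = deque(maxlen=window)
    for index, value in enumerate(readings):
        if len(recent) == window:
            baseline = sum(recent) / window
            if abs(value - baseline) > tolerance * baseline:
                yield index, value
                continue  # keep the outlier out of the baseline
        recent.append(value)


if __name__ == "__main__":
    heart_rate = [72, 74, 73, 71, 75, 73, 140, 74]
    for index, value in monitor(heart_rate):
        print(f"reading {index}: {value} needs immediate attention")
```

Because the function is a generator, an alert is produced as soon as the abnormal reading is seen, which is the point of processing at velocity.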
8. Variety (Complexity)
• Text, numerical, images, audio, video, sequences, time
series, social media data, multi-dim arrays, etc…
• Data can be either Static or streaming data
11. Types of Data: Data is categorized as
1. Structured Data
2. Semi-Structured Data
3. Un-Structured Data
Generally, big data consists mostly of unstructured
data
1. Structured Data:
• Uploads neatly into a relational
database
Types of Data
12. 2. Unstructured Data
• Today more than 80% of the data generated is
unstructured.
• Examples:
•Satellite images, Social media data, Mobile
data, Photographs and video: This includes
security, surveillance, and traffic video
•Website content: This comes from any site
delivering unstructured content, like YouTube,
Flickr, or Instagram.
Types of Data -Unstructured Data
13. • Semi-structured data has
some organizational
properties that make it
easier to analyze.
• Examples of semi-structured
data formats:
• CSV (Comma-Separated Values)
• XML (Extensible Markup
Language)
• JSON (JavaScript Object
Notation)
Types of Data – Semi structured Data
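To illustrate the "organizational properties" of these formats, the sketch below reads one made-up record from CSV, JSON and XML text using only the Python standard library. The sample record and function names are illustrative assumptions.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in three semi-structured formats.
CSV_TEXT = "id,name,city\n1,Asha,Pune\n"
JSON_TEXT = '{"id": 1, "name": "Asha", "city": "Pune"}'
XML_TEXT = "<person><id>1</id><name>Asha</name><city>Pune</city></person>"


def from_csv(text):
    """The header row names the fields; every value is read as a string."""
    return dict(next(csv.DictReader(io.StringIO(text))))


def from_json(text):
    """Keys name the fields; JSON also keeps basic types such as numbers."""
    return json.loads(text)


def from_xml(text):
    """Tags name the fields; element text is read as a string."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


if __name__ == "__main__":
    for record in (from_csv(CSV_TEXT), from_json(JSON_TEXT), from_xml(XML_TEXT)):
        print(record["name"], record["city"])
```

In each case the field names travel with the data, which is what makes these formats easier to analyze than free text or images.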
15. What is Big data Analytics?
• “It is the art of finding patterns and insights in
large sets of data that allow you to make
better decisions or learn things you couldn’t
otherwise learn.”
• It makes use of statistics, AI, data mining,
machine learning, pattern recognition, natural
language processing etc
16. Reasons and Benefits of Big data Analytics
• Timely: Gain instant insights from diverse data sources
• Better analytics: Improvement of business performance through
real-time analytics
• Vast data: Big data technologies manage huge amounts of
data
• Insights: Can provide better insights with the help of
unstructured and semi-structured data
• Decision making: Helps mitigate risk and make smart decisions through
proper risk analysis
Why Big data Analytics?
20. Big data in Healthcare
• Customer relationship management
• Electronic Health Record
21. Big data in Healthcare
• Big data reduces the cost of treatment, since there is less chance
of performing unnecessary diagnostic tests.
• It helps in predicting outbreaks of epidemics and in
deciding what preventive measures to take.
• It helps avoid preventable diseases by detecting them at an
early stage.
• Patients can be given evidence-based medicine, which is
identified and prescribed after researching past
medical results.
22. Big data in Insurance
• Analyzing and predicting customer behavior through data
derived from social media, GPS-enabled devices and CCTV
footage.
• In claims management, predictive analytics from
big data has been used to offer faster service and detect
fraud.
• Through massive data from digital channels and social media,
real-time monitoring of claims throughout the claims cycle has
been used to provide insights.
• SBI Life makes use of big data analytics.
23. Big data in Education
• The University of Tasmania, an Australian university with over 26,000
students, has deployed a Learning and Management System that tracks,
among other things, when a student logs onto the system, how much
time is spent on different pages in the system, and the overall
progress of a student over time.
• It is also used to measure teachers’ effectiveness to ensure a good
experience for both students and teachers.
• Click patterns are also being used to detect boredom.
• Adaptive learning: customized learning. Enterprises produce digital
courses that use big-data-driven predictive analytics to determine what a
learner is learning and which components of a lesson plan suit
them best at that point.
24. Big data in Media and Entertainment
• The media and entertainment industry is facing new business models for the way
it creates, markets and distributes content. This is happening because
today’s consumers search for content and expect to access it
anywhere, any time, on any device.
• Big data provides actionable points of information about millions of
individuals. Publishers now tailor advertisements and
content to appeal to consumers. These insights are gathered through
various data-mining activities. Big data applications benefit the media and
entertainment industry by:
• Predicting what the audience wants
• Scheduling optimization
• Increasing acquisition and retention
• Ad targeting
• Content monetization and new product development
25. Big data in Various Other Fields
• Crime Prediction and Prevention
Police departments can leverage advanced, real-time analytics to provide
actionable intelligence that can be used to understand criminal behaviour,
identify crime/incident patterns, and uncover location-based threats.
• Weather Forecasting
The NOAA (National Oceanic and Atmospheric Administration) gathers data
every minute of every day from land, sea, and space-based sensors. Every day, NOAA
uses big data to analyze and extract value from over 20 terabytes of data.
• Tax Compliance
Big data applications can be used by tax organizations to analyze both
unstructured and structured data from a variety of sources in order to identify
suspicious behavior and multiple identities. This helps in identifying
tax fraud.
• Transportation
Route planning: reduces users’ wait times.
Congestion management: by predicting traffic conditions with big data,
real-time estimation of congestion and traffic patterns is now possible. For
example, people use Google Maps to find the routes with the least traffic.
Traffic safety: real-time processing of big data and predictive
analysis can identify accident-prone areas, which helps reduce accidents
and increase the safety level of traffic.
27. Role of Mathematicians in Big data
• Data science is the marriage of statistics and computer
science. It needs:
• Probability
• Statistics
• Distributed Optimization
• Algebra
• Calculus
28. How Physicists can use Big data
• Astrophysics
• Quantum Computing
• Electrical grid analytics
• Simulation of complex systems
• Internet of things
29. How Bio People can use Big data
• The human genome contains roughly 3 billion
DNA base pairs and about 20,000 genes.
• The genetic information acquired globally about
patients and diseases will enable the health-care
providers to offer individual-specific, tailor made
medicines.
• Smart agriculture using IoT
• DNA-sequence data contain insights for the
development of (a) superior, disease-resistant
and high-yielding crop varieties that are resilient
to climate change, and (b) drugs to treat
cancer, HIV, or new strains of influenza
30. For Commerce People
• Supply chain analytics
• Retail Analytics
• Manufacturing analytics
• Bank Analytics
• HR Analytics
• Sales analytics
• Recommender systems
32. APACHE HADOOP
• Hadoop is an open-source framework created by Doug Cutting and Mike Cafarella in 2006
and is managed by the Apache Software Foundation
• The project was named Hadoop after a yellow toy elephant belonging to Doug
Cutting’s son.
• The framework is written in Java and allows storage and processing of large
volumes of data on a cluster of commodity hardware.
• The Apache Hadoop project actively supports multiple projects intended to
extend Hadoop’s capabilities and make it easier to use.
33. Traditional Systems Vs Big data Systems
Traditional Systems
• Schema-On-Write
• Traditional systems use
shared storage
• Cost of Proprietary
Hardware
• Brings Data to the Programs
Hadoop Data Systems
• Schema-On-Read
• Uses the Hadoop Distributed
File System (HDFS)
• Local storage, uses
commodity hardware
• Brings Programs to the Data
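The first contrast, schema-on-write versus schema-on-read, can be sketched in a few lines of Python. This is a simplified illustration, not how an RDBMS or Hadoop implements it: the two-field schema, the function names and the choice to skip non-conforming rows at read time are all assumptions.

```python
import json

# Hypothetical schema: field name -> type used to coerce the value.
SCHEMA = {"user_id": int, "amount": float}


def write_with_schema(table, record):
    """Schema-on-write: validate and coerce BEFORE the record is stored."""
    row = {}
    for field, field_type in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        row[field] = field_type(record[field])
    table.append(row)


def read_with_schema(raw_lines):
    """Schema-on-read: raw text is stored as-is; the schema is applied
    only when the data is queried."""
    for line in raw_lines:
        record = json.loads(line)
        try:
            yield {name: kind(record[name]) for name, kind in SCHEMA.items()}
        except (KeyError, ValueError, TypeError):
            continue  # rows that do not fit this schema are skipped


if __name__ == "__main__":
    table = []
    write_with_schema(table, {"user_id": "7", "amount": "12.5"})
    print(table)

    raw = ['{"user_id": 1, "amount": "3.5"}', '{"note": "free text"}']
    print(list(read_with_schema(raw)))
```

With schema-on-write the bad record is rejected at load time; with schema-on-read everything is loaded and a different schema could be applied to the same raw lines later.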
36. HDFS (Hadoop Distributed File System)
• It is the storage layer of Hadoop. It follows a master-slave architecture.
• In HDFS, the NameNode acts as the master and stores the metadata, such as
which DataNodes hold each block of a file.
• Each DataNode acts as a slave: it stores the actual data on local disk and
performs the actual work on that data in parallel with the other DataNodes.
HADOOP COMPONENTS
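A toy, in-memory sketch of this master-slave split follows. It does not use any Hadoop API: ToyNameNode, store_file and read_file are hypothetical names, and the round-robin placement is an assumption (real HDFS placement is rack-aware). Only the NameNode object holds metadata; the block bytes live in the DataNode dictionaries.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Master: keeps only metadata, i.e. which DataNodes hold each block."""

    def __init__(self, datanode_ids, replication=3):
        self.datanode_ids = list(datanode_ids)
        self.replication = min(replication, len(self.datanode_ids))
        self.block_map = {}  # (filename, block index) -> DataNode ids

    def place(self, filename, num_blocks):
        count = len(self.datanode_ids)
        for block in range(num_blocks):
            # Round-robin placement: a simplifying assumption.
            self.block_map[(filename, block)] = [
                self.datanode_ids[(block + r) % count]
                for r in range(self.replication)
            ]


def store_file(namenode, datanodes, filename, data, block_size):
    """Write every block to each DataNode the NameNode chose for it."""
    blocks = split_into_blocks(data, block_size)
    namenode.place(filename, len(blocks))
    for index, block in enumerate(blocks):
        for node_id in namenode.block_map[(filename, index)]:
            datanodes[node_id][(filename, index)] = block


def read_file(namenode, datanodes, filename, failed=()):
    """Reassemble a file, skipping any DataNodes listed in `failed`."""
    parts, index = [], 0
    while (filename, index) in namenode.block_map:
        holders = [node for node in namenode.block_map[(filename, index)]
                   if node not in failed]
        if not holders:
            raise IOError(f"block {index} of {filename} is unavailable")
        parts.append(datanodes[holders[0]][(filename, index)])
        index += 1
    return b"".join(parts)


if __name__ == "__main__":
    datanodes = {"dn1": {}, "dn2": {}, "dn3": {}, "dn4": {}}
    namenode = ToyNameNode(datanodes, replication=3)
    store_file(namenode, datanodes, "log.txt", b"abcdefghij", block_size=4)
    print(read_file(namenode, datanodes, "log.txt", failed=("dn1", "dn2")))
```

Because every block is stored on three DataNodes, the file can still be read in full when two of them are unavailable.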
37. MapReduce
• It is the data processing layer of Hadoop.
• It processes huge amounts of data in parallel by dividing the submitted
job into a set of independent tasks.
• It consists of four phases: map, shuffle, sort and reduce.
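The four phases can be illustrated with a word count written in plain Python. This runs on one machine and is not Hadoop code; it is only a sketch of the data flow, with illustrative function names. In a real cluster the map and reduce tasks would run in parallel on different nodes.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: order the pairs by key and group the values
    that share a key, so each key reaches one reduce call."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle_and_sort(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

Each call to map_phase depends only on its own line, and each call to reduce_phase only on its own key, which is what makes the tasks independent.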
38. Hbase and Hive
• Hive and HBase are both data stores for storing unstructured data.
• RDBMS professionals like Apache Hive because they can simply map HDFS files to Hive
tables and query the data.
• HBase is a NoSQL database used for real-time read and write access to data, whereas
Hive is not strictly a database but a MapReduce-based SQL engine that runs on top of
Hadoop.
• In short, HBase is a database and Hive is a SQL engine for batch processing of big data.
• Other NoSQL databases include MongoDB, Cassandra, etc.
39. Pig
• It is a high-level scripting platform.
• It enables writing complex data processing operations in Hadoop using the Pig Latin
language.
Sqoop
• It is a data transfer tool designed to move huge volumes of data between Hadoop
and RDBMS.
Mahout
• A library of scalable machine-learning algorithms, implemented on top of Apache
Hadoop and using the MapReduce paradigm.
40. Flume
• It is a reliable system for collecting large amounts of log data from many different
sources in real-time.
Oozie
• It is a workflow scheduler system that is used to schedule Apache Hadoop jobs. It
combines multiple jobs sequentially into one logical unit of work.
Zookeeper
• ZooKeeper is a high-performance coordination service for distributed applications.
It provides a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services.
42. FEATURES OF HADOOP
No expensive hardware is required
Supports large clusters of 100 to 1000 nodes
More computing power and storage capacity
Parallel Processing of Data
Distributed Data
Data Replication
Automatic Failover management
Data Locality Optimization
Supports Heterogeneous Cluster
Scalability