2. What is Big Data?
Big Data is a term used to describe collections of data that are huge
in volume and that keep growing exponentially with time. In short,
such data is so large and complex that none of the traditional data
management tools can store or process it efficiently.
Stock Exchange
The New York Stock Exchange generates about one terabyte of new trade data
per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the
databases of the social media site Facebook every day. This data is mainly
generated through photo and video uploads, message exchanges, comments, etc.
Aviation
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight
time. With many thousands of flights per day, data generation reaches many
petabytes.
3. Types Of Big Data
Big Data can be found in three forms:
•Structured
•Unstructured
•Semi-structured
4. Structured
Any data that can be stored, accessed, and processed in a fixed
format is termed 'structured' data. Over time, computer science has
achieved great success in developing techniques for working with
such data (where the format is well known in advance) and deriving
value from it. However, we now foresee issues when such data grows
to a huge extent; typical sizes are in the range of multiple
zettabytes.
Examples Of Structured Data
An 'Employee' table in a database is an example of structured data:

Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       500000
7500         Shubhojit Das    Male    Finance     500000
7699         Priya Sane       Female  Finance     550000
5. Unstructured
Any data with unknown form or structure is classified as
unstructured data. In addition to its sheer size, unstructured data
poses multiple challenges when it comes to processing it to derive
value. A typical example of unstructured data is a heterogeneous data
source containing a combination of simple text files, images, videos, etc.
Organizations today have a wealth of data available to them, but
unfortunately they don't know how to derive value from it, since this
data is in its raw, unstructured form.
Examples Of Unstructured Data
The output returned by 'Google Search'
6. Semi-structured
Semi-structured data can contain both forms of data. Semi-structured
data looks structured in form, but it is not actually defined by,
e.g., a table definition in a relational DBMS. An example of
semi-structured data is data represented in an XML file.
Examples Of Semi-structured Data
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
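To make the structure concrete, here is a minimal sketch of reading these records in Python with the standard-library xml.etree.ElementTree parser; the enclosing <recs> root element is an assumption added so that the fragment is well-formed XML:

import xml.etree.ElementTree as ET

# The <rec> fragments above need a single root element to be well-formed,
# so we wrap them in an assumed <recs> container before parsing.
xml_data = """<recs>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</recs>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # Each record carries its schema in its tags, unlike a fixed table row.
    print(rec.find("name").text, rec.find("sex").text, rec.find("age").text)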
7. Characteristics Of Big Data
(i) Volume – The name Big Data itself relates to an enormous size.
The size of data plays a crucial role in determining its value.
Whether a particular dataset can actually be considered Big Data
depends on its volume. Hence, 'Volume' is one characteristic that
must be considered when dealing with Big Data.
(ii) Variety – The next aspect of Big Data is its variety.
Variety refers to heterogeneous sources and the nature of the data,
both structured and unstructured. In earlier days, spreadsheets
and databases were the only data sources considered by most
applications. Nowadays, data in the form of emails, photos,
videos, monitoring devices, PDFs, audio, etc. is also considered
in analysis applications. This variety of unstructured data poses
certain issues for storing, mining, and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of data
generation. How fast data is generated and processed to meet demands
determines the real potential of the data.
Big Data velocity deals with the speed at which data flows in from
sources like business processes, application logs, networks,
social media sites, sensors, mobile devices, etc. The flow of data is
massive and continuous.
(iv) Variability – This refers to the inconsistency that data can
show at times, hampering the process of handling and managing the
data effectively.
(v) Value – Value is the most important aspect of Big Data. Although
the potential value of big data is huge, it is all well and good to
have access to big data, but unless we can turn it into value it is
useless. It is very costly to implement the IT infrastructure needed
to store big data, and businesses are going to require a return on
that investment.
9. Big Data Examples: Applications of Big Data in Real Life
Big Data has totally changed and revolutionized the way
businesses and organizations work. Below, we go into the major
Big Data applications in various sectors and industries and
learn how these sectors benefit from those applications.
10. Big Data in Education Industry
The education industry is flooded with huge amounts of data related
to students, faculty, courses, results, and more. We have now
realized that proper study and analysis of this data can provide
insights that can be used to improve the operational effectiveness
and working of educational institutes.
11. Big Data in Healthcare Industry
Healthcare is yet another industry bound to generate a huge
amount of data.
The following are some of the ways in which big data has contributed
to healthcare:
Big data reduces the cost of treatment, since there is less chance
of having to perform unnecessary diagnoses.
It helps in predicting outbreaks of epidemics and in deciding
what preventive measures to take to minimize their effects.
It helps avoid preventable diseases by detecting them in their early
stages and keeping them from getting worse, which in turn makes
treatment easier and more effective.
Patients can be provided with evidence-based medicine, identified
and prescribed after research on past medical results.
13. Big Data in Government Sector
Governments of every country come face to face with a huge amount
of data on an almost daily basis, because they have to keep track of
various records and databases regarding their citizens, their
growth, energy resources, geographical surveys, and much more. All
of this data contributes to big data, so its proper study and
analysis helps governments in endless ways.
A few of them are as follows:
Welfare Schemes
•In making faster and more informed decisions regarding various
political programs
•To identify areas that are in immediate need of attention
•To stay up to date in the field of agriculture by keeping track of all
existing land and livestock.
•To overcome national challenges such as unemployment, terrorism,
energy resources exploration, and much more.
Cyber Security
•Big Data is hugely used for fraud detection.
•It is also used in catching tax evaders.
15. Big Data in Media and Entertainment Industry
With people having access to various digital gadgets, the generation
of large amounts of data is inevitable, and this is the main cause of
the rise of big data in the media and entertainment industry.
Other than this, social media platforms are another way in which a
huge amount of data is generated. Businesses in the media and
entertainment industry have realized the importance of this data,
and they have been able to benefit from it for their growth.
Some of the benefits extracted from big data in the media and
entertainment industry are given below:
•Predicting the interests of audiences
•Optimized or on-demand scheduling of media streams in digital
media distribution platforms
•Getting insights from customer reviews
•Effective targeting of advertisements
17. Big Data in Weather Patterns
There are weather sensors and satellites deployed all around the
globe. A huge amount of data is collected from them, and then this
data is used to monitor the weather and environmental conditions.
All of the data collected from these sensors and satellites contribute
to big data and can be used in different ways such as:
•In weather forecasting
•To study global warming
•In understanding the patterns of natural disasters
•To make necessary preparations in the case of crises
•To predict the availability of usable water around the world
18. Big Data in Transportation Industry
Since the rise of big data, it has been used in various ways to make
transportation more efficient and easy. The following are some of the
areas where big data contributes to transportation.
Route planning: Big data can be used to understand and estimate
users' needs on different routes and on multiple modes of
transportation, and then utilize route planning to reduce their wait
time.
Congestion management and traffic control: Using big data, real-
time estimation of congestion and traffic patterns is now possible.
For example, people use Google Maps to locate the least
traffic-prone routes.
Safety level of traffic: Using real-time processing of big data and
predictive analysis to identify accident-prone areas can help reduce
accidents and increase the safety level of traffic.
19. Big Data in Banking Sector
The amount of data in the banking sector is skyrocketing every
second. According to a GDC forecast, this data was estimated to
grow by 700 percent by the end of the following year. Proper study
and analysis of this data can help detect any and all illegal
activities being carried out, such as:
•Misuse of credit/debit cards
•Venture credit risk management
•Business transparency
•Customer data alteration
•Money laundering
•Risk mitigation
74. Design of Hadoop Distributed File System (HDFS)
• Master-Slave design
• Master Node
– Single NameNode for managing metadata
• Slave Nodes
– Multiple DataNodes for storing data
• Other
– Secondary NameNode for checkpointing the NameNode's metadata (not a hot backup, despite the name)
75. HDFS Architecture
[Diagram: a single NameNode and a Secondary NameNode coordinating many DataNodes; the Client talks to the NameNode for metadata and to the DataNodes for data; heartbeats, commands, and data flow between the nodes.]
The NameNode keeps the metadata: the name, location, and directory of each file.
DataNodes provide storage for blocks of data.
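As a toy illustration of this division of labor (plain Python, not the real HDFS code; all names and sizes are made up for the demo), a NameNode-like object holds only the file-to-block mapping, while DataNode-like objects hold the actual block contents:

# Toy sketch of the NameNode/DataNode split; names and sizes are illustrative.
BLOCK_SIZE = 4  # real HDFS defaults to 128 MB blocks; 4 bytes keeps the demo readable

namenode = {}                                   # metadata only: file -> [(block_id, datanode)]
datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}   # block storage

def put(filename, data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    namenode[filename] = []
    for i, block in enumerate(blocks):
        dn = "dn%d" % (i % 3 + 1)               # naive round-robin placement
        datanodes[dn][(filename, i)] = block
        namenode[filename].append(((filename, i), dn))

put("book.txt", "I am Sam Sam I am")
print(namenode["book.txt"])  # the NameNode knows where blocks live, not their contents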
78. MapReduce Paradigm
• Map and Reduce are based on functional programming: Input → Map → Reduce → Output

Map: apply a function to all the elements of a list.

list1 = [1, 2, 3, 4, 5]
square = lambda x: x * x
list2 = list(map(square, list1))
print(list2)    # [1, 4, 9, 16, 25]

Reduce: combine all the elements of a list into a summary value.

from functools import reduce
list1 = [1, 2, 3, 4, 5]
A = reduce(lambda a, b: a + b, list1)
print(A)        # 15
79. MapReduce Word Count Example
[Diagram: an input file is split into blocks A–D; a Map task on a separate node processes each block; the intermediate (word, 1) pairs are shuffled and sorted by key across the cluster; Reduce tasks on each node then combine the groups into output partitions E–H.]
Input:
I am Sam
Sam I am
Map output: (I,1) (am,1) (Sam,1) and (I,1) (am,1) (Sam,1)
Reduce output after shuffle & sort: (I,2) (am,2) (Sam,2) …
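The same pipeline can be sketched in plain, single-machine Python to make the Map, shuffle-and-sort, and Reduce phases explicit (a minimal illustration, not the distributed implementation):

from collections import defaultdict

lines = ["I am Sam", "Sam I am"]

# Map phase: each line is turned into (word, 1) pairs independently.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & sort: group together all pairs that share a key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each group into a single summary value.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)   # {'I': 2, 'am': 2, 'Sam': 2}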
89. SPARK Outline
• Introduction to Apache Hadoop and Spark for developing
applications
• Components of Hadoop, HDFS, MapReduce and HBase
• Capabilities of Spark and the differences from a typical
MapReduce solution
• Some Spark use cases for data analysis
90. Cloud and Distributed Computing
• The second trend is the pervasiveness of cloud-based storage and
computational resources
– For processing these big datasets
• Cloud characteristics
– Provide a scalable, standard environment
– On-demand computing
– Pay as you need
– Dynamically scalable
– Cheaper
91. One Solution is Apache Spark
• A new general framework, which solves many of the shortcomings of
MapReduce
• It is capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase,
S3, …
• Has many other workflows, i.e. join, filter, flatMap, distinct, groupByKey,
reduceByKey, sortByKey, collect, count, first, …
– (around 30 efficient distributed operations; see the short sketch after this list)
• In-memory caching of data (for iterative, graph, and machine learning
algorithms, etc.)
• Native Scala, Java, Python, and R support
• Supports interactive shells for exploratory data analysis
• Spark API is extremely simple to use
• Developed at AMPLab UC Berkeley, now by Databricks.com
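As a brief sketch of a few of these operations (assuming a running SparkContext named sc, e.g. created with SparkContext("local", "demo")):

pairs = sc.parallelize([("b", 2), ("a", 1), ("b", 3), ("a", 1)])

summed = pairs.reduceByKey(lambda x, y: x + y)   # combine the values for each key
ordered = summed.sortByKey()                     # sort the (key, value) pairs by key
print(ordered.collect())                         # [('a', 2), ('b', 5)]
print(pairs.count())                             # 4
print(pairs.first())                             # ('b', 2)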
92. Spark Uses Memory instead of Disk
Hadoop uses disk for data sharing: each iteration reads its input from HDFS and writes its output back to HDFS.
HDFS read → Iteration 1 → HDFS write → HDFS read → Iteration 2 → HDFS write
Spark uses in-memory data sharing: input is read from HDFS once and intermediate results stay in memory between iterations.
HDFS read → Iteration 1 → Iteration 2
93. Sort Competition
Sort benchmark, Daytona Gray: sort of 100 TB of data (1 trillion records)

                          Hadoop MR Record (2013)         Spark Record (2014)
Data size                 102.5 TB                        100 TB
Elapsed time              72 mins                         23 mins
# Nodes                   2100                            206
# Cores                   50400 physical                  6592 virtualized
Cluster disk throughput   3150 GB/s (est.)                618 GB/s
Network                   dedicated data center, 10Gbps   virtualized (EC2) 10Gbps
Sort rate                 1.42 TB/min                     4.27 TB/min
Sort rate/node            0.67 GB/min                     20.7 GB/min

Spark: 3x faster with 1/10 the nodes.
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
94. Apache Spark
Apache Spark supports data analysis, machine learning, graphs, streaming data, etc. It
can read/write from a range of data types and allows development in multiple
languages.
[Diagram: the Spark stack. Spark Core sits at the base; Spark SQL with DataFrames, Spark Streaming, MLlib with ML Pipelines, and GraphX are built on top. APIs are available in Scala, Java, Python, R, and SQL. Data sources include Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and HPC-style file systems (GlusterFS, Lustre).]
95. Resilient Distributed Datasets (RDDs)
• RDDs (Resilient Distributed Datasets) are data containers
• All the different processing components in Spark share the same
abstraction, called the RDD
• As applications share the RDD abstraction, you can mix different
kinds of transformations to create new RDDs
• Created by parallelizing a collection or reading a file (see the
sketch below)
• Fault tolerant
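For example, the two creation paths above look like this in PySpark (again assuming a SparkContext sc; the file path is a placeholder):

rdd1 = sc.parallelize([1, 2, 3, 4, 5])           # from an in-memory collection
rdd2 = sc.textFile("hdfs://path/to/input.txt")   # from a file (placeholder path)

# Transformations on one RDD create new RDDs; the recorded lineage is what
# makes them fault tolerant (lost partitions can be recomputed).
rdd3 = rdd1.map(lambda x: x * x)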
96. DataFrames & SparkSQL
• DataFrames (DFs) are distributed datasets organized in named
columns
• Similar to a relational database table, a Python pandas DataFrame,
or an R data frame
– Immutable once constructed
– Track lineage
– Enable distributed computations
• How to construct DataFrames
– Read from file(s)
– Transform an existing DF (Spark or pandas)
– Parallelize a Python collection (list)
– Apply transformations and actions
97. DataFrame example
# Create a new DataFrame that contains "students"
students = users.filter(users.age < 21)

# Alternatively, using Pandas-like syntax
students = users[users.age < 21]

# Count the number of student users by gender
students.groupBy("gender").count()

# Join students with another DataFrame called logs
students.join(logs, logs.userId == users.userId, "left_outer")
98. RDDs vs. DataFrames
• RDDs provide a low level interface into Spark
• DataFrames have a schema
• DataFrames are cached and optimized by Spark
• DataFrames are built on top of the RDDs and the core
Spark API
[Figure: example performance comparison of DataFrames vs. RDDs]
99. Spark Operations
Transformations (create a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, intersection,
flatMap, union, join, cogroup, cross, mapValues
Actions (return results to the driver program):
collect, reduce, count, first, take, takeOrdered, takeSample, countByKey,
save, lookup, foreach
100. Directed Acyclic Graphs (DAG)
[Diagram: a directed acyclic graph whose nodes are RDDs (A, B, C, D, E, F, S) and whose arrows are transformations.]
DAGs track dependencies (also known as lineage):
nodes are RDDs
arrows are transformations
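PySpark can print this lineage directly; a small sketch (assuming a SparkContext sc):

rdd = sc.parallelize(range(100))
result = rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
# toDebugString() shows the chain of transformations (the lineage) behind this RDD.
print(result.toDebugString().decode())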
101. Narrow Vs. Wide transformation
[Diagram: a narrow transformation such as map processes each partition independently, e.g. (A,1) stays (A,1) and (A,2) stays (A,2); a wide transformation such as groupByKey shuffles records with the same key across partitions, combining (A,1) and (A,2) into (A,[1,2]).]
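A small sketch of the difference (assuming a SparkContext sc): mapValues is narrow, so each partition is handled independently, while groupByKey is wide and shuffles records with the same key together:

pairs = sc.parallelize([("A", 1), ("A", 2), ("B", 5)])

narrow = pairs.mapValues(lambda v: v * 10)   # narrow: no shuffle needed
wide = pairs.groupByKey()                    # wide: shuffles records by key

print(narrow.collect())                      # [('A', 10), ('A', 20), ('B', 50)]
print(wide.mapValues(list).collect())        # e.g. [('A', [1, 2]), ('B', [5])]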
102. Actions
• What is an action?
– The final stage of the workflow
– Triggers the execution of the DAG
– Returns the results to the driver
– Or writes the data to HDFS or to a file
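For instance (a minimal sketch, again assuming a SparkContext sc), nothing is computed until the action on the last line:

rdd = sc.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)            # transformation: only recorded in the DAG
evens = doubled.filter(lambda x: x % 4 == 0)  # still lazy; nothing has run yet
print(evens.collect())                        # action: triggers execution; [0, 4, 8, 12, 16]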
104. Python RDD API Examples
• Word count

text_file = sc.textFile("hdfs://usr/godil/text/book.txt")
counts = (text_file.flatMap(lambda line: line.split(" "))
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://usr/godil/output/wordCount.txt")

• Logistic Regression

from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])

# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)

# Fit the model to the data.
model = lr.fit(df)

# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()

Examples from http://spark.apache.org/
106. Broadcast Variables and Accumulators
(Shared Variables)
• Broadcast variables allow the programmer to keep a read-only
variable cached on each node, rather than sending a copy of it
with tasks
broadcastV1 = sc.broadcast([1, 2, 3, 4, 5, 6])
broadcastV1.value   # [1, 2, 3, 4, 5, 6]
• Accumulators are variables that are only “added” to through
an associative operation and can be efficiently supported in
parallel
accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
accum.value   # 10
107. Spark’s Main Use Cases
• Streaming Data
• Machine Learning
• Interactive Analysis
• Data Warehousing
• Batch Processing
• Exploratory Data Analysis
• Graph Data Analysis
• Spatial (GIS) Data Analysis
• And many more
108. Spark Use Cases
• Fingerprint Matching
– Developed a Spark based fingerprint minutia
detection and fingerprint matching code
• Twitter Sentiment Analysis
– Developed a Spark based Sentiment Analysis code
for a Twitter dataset
109. Spark in the Real World (I)
• Uber – the online taxi company gathers terabytes of event data from its
mobile users every day.
– Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline
– Converts raw unstructured event data into structured data as it is collected
– Uses it further for more complex analytics and optimization of operations
• Pinterest – uses a Spark ETL pipeline
– Leverages Spark Streaming to gain immediate insight into how users all
over the world are engaging with Pins, in real time
– Can make more relevant recommendations as people navigate the site
– Recommends related Pins
– Helps people determine which products to buy, or destinations to visit
110. Spark in the Real World (II)
Here are a few other real-world use cases:
• Conviva – 4 million video feeds per month
– This streaming video company is second only to YouTube.
– Uses Spark to reduce customer churn by optimizing video streams and
managing live video traffic
– Maintains a consistently smooth, high quality viewing experience.
• Capital One – uses Spark and data science algorithms to understand customers
in a better way.
– Developing the next generation of financial products and services
– Finding attributes and patterns of increased probability for fraud
• Netflix – leverages Spark for insights into user viewing habits and then
recommends movies to users.
– User data is also used for content creation
111. Spark: when not to use
• Even though Spark is versatile, that doesn’t mean Spark’s
in-memory capabilities are the best fit for all use cases:
– For many simple use cases Apache MapReduce and
Hive might be a more appropriate choice
– Spark was not designed as a multi-user environment
– Spark users need to know whether the memory they have
available is sufficient for a dataset
– Adding more users adds complications, since the users
will have to coordinate memory usage to run code
112. HPC and Big Data Convergence
• Clouds and supercomputers are collections of computers
networked together in a datacenter
• Clouds have different networking, I/O, CPU and cost trade-offs
than supercomputers
• Cloud workloads are data-oriented rather than computation-oriented
and are less tightly coupled than supercomputer workloads
• The principles of parallel computing are the same on both
• Apache Hadoop and Spark vs. Open MPI
113. HPC and Big Data K-Means example
MPI definitely outpaces Hadoop, but it can be boosted using a hybrid approach with other
technologies that blend HPC and big data, including Spark and HARP. Dr. Geoffrey Fox,
Indiana University. (http://arxiv.org/pdf/1403.1528.pdf)
114. Conclusion
• Hadoop (HDFS, MapReduce)
– Provides an easy solution for processing Big Data
– Brings a paradigm shift in programming distributed systems
• Spark
– Has extended MapReduce with in-memory computation
– Supports streaming, interactive, iterative, and machine learning
tasks
• Changing the World
– Made data processing cheaper, more efficient, and more scalable
– Is the foundation of many other tools and software