Introduction to Big Data
Omnia Safaan, Senior Data Scientist
Data & AI Expert Labs, IBM
Omnia.safaan@ibm.com
What is Big Data?
Big data is the field concerned with ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex for traditional data-processing application software.
The motivation behind Big
Data Technology
Data has an intrinsic property…it grows and grows
● 1 in 2 business leaders don’t have access to the data they need
● 83% of CIOs cited BI and analytics as part of their visionary plan
● 5.4x more likely that top performers use business analytics
● 80% of the world’s data today is unstructured
● 90% of the world’s data was created in the last two years
● 20% of available data can be processed by traditional systems
Data & Processing loop
More data generated requires more processing speed
More processing speed encourages more consumption
More consumption generates more data!
Exciting Big Data Projects
Largest Radio Telescope (Square Kilometre Array)
● The Square Kilometre Array (SKA) is a large multi-telescope radio project planned for Australia and South Africa. When finished, it will have a total collecting area of about one square kilometre. It will operate over a wide range of frequencies, and its size will make it 50 times more sensitive than any other radio instrument, able to survey 10,000 times faster than ever before.
○ The SKA was projected to produce 20,000 PB/day in 2020 (compared with a current internet volume of about 300 PB/day).
■ Incidentally, IBM is developing hardware specifically to process this astronomical information.
Engine data: aircraft, diesel trucks, generators, …
● Each Airbus A380 engine (there are four) generates 1 PB of data on a flight from London (LHR) to Singapore (SIN).
● GE, one of the world’s largest manufacturers, uses big data analytics on data generated by machine sensors to predict maintenance needs.
● GE uses this analysis to provide services tied to its products, designed to minimize downtime caused by parts failures.
Large Hadron Collider (LHC)
● The Large Hadron Collider (LHC) data center processes about one petabyte of data every day. The center hosts 11,000 servers with 100,000 processor cores, and some 6,000 database changes are performed every second.
○ A global collaboration of computer centers distributes and stores LHC data, giving physicists around the world real-time access.
○ The Grid runs more than two million jobs per day; at peak rates, 10 GB of data may be transferred per second.
Five key Big Data Use Cases
Why do we need to handle big data differently?
Because it’s BIG!?
Classical workflow
Shady is a data engineer at a financial-services company. He has a list of transactions that he needs to store and analyze:
1- Design the database schema.
2- Create tables to store the data.
3- Write a script that reads the transactions from the source files and stores them in the database.
4- Write the queries that return the required analysis results.
5- Run the queries and save the results.
Classical Approach (cont.)
What if we are dealing with data that is too big for our hard disk?
What if we are dealing with something like PayPal’s data, where millions of transactions happen per second?
What if we have a fraud-detection algorithm that should fire real-time alerts on fraudulent transactions?
What if the data source has no uniform schema, i.e. unstructured data (documents, images, audio files, etc.)?
The 4 Vs
Volume: data size exceeding petabytes
Velocity: data generated at high speed
Variety: structured, unstructured & semi-structured data
Veracity: trustworthiness of the data
Evolution of big data systems
Bigger computers → Distributed systems → Hadoop
Inspiration behind Hadoop
Which is the slower factor in data handling?
● I/O operations
● Processing
Moore’s law
According to Moore’s law, processing power doubles roughly every 18 months.
Meanwhile, a typical disk transfer rate is 100–200 MB/s, so transferring 100 GB at 100 MB/s takes about 17 minutes!
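The 17-minute figure can be checked with a quick back-of-the-envelope calculation. A minimal sketch, assuming decimal units (1 GB = 1000 MB), as drive spec sheets use:

```python
# Back-of-the-envelope check of the disk transfer-time claim.
# Assumes decimal units: 1 GB = 1000 MB.

def transfer_minutes(size_gb, rate_mb_per_s=100):
    """Minutes needed to read `size_gb` gigabytes sequentially
    at `rate_mb_per_s` megabytes per second."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 60

print(round(transfer_minutes(100)))        # ~17 minutes at 100 MB/s
print(round(transfer_minutes(100, 200)))   # ~8 minutes at 200 MB/s
```

Even at the upper end of the stated disk range, a single drive needs minutes to scan 100 GB, which is why Hadoop reads many disks in parallel.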
What is Hadoop?
● An Apache open-source software framework for reliable, scalable, distributed computing over massive amounts of data.
● Originally based on papers published by Google around 2004.
● Main concept: minimize data transfer over the network; each data node processes its own local data.
Hadoop Core Concepts
No need for network-programming knowledge
Run computations where the data is stored
Data is replicated multiple times across the data nodes
Scalability is achieved easily by adding commodity hardware as additional nodes
Fault-tolerant and resilient execution of tasks
Hadoop Main Components
● Hadoop Distributed File System (HDFS): storing the data on the cluster.
● MapReduce: processing the data on the cluster.
● YARN (Yet Another Resource Negotiator): responsible for resource management and job scheduling.
Hadoop Ecosystem
● Hadoop is supplemented by an extensive ecosystem of open-source projects.
HDFS
● Hadoop Distributed File System
● Splits data into blocks (typically 128 MB in size)
● Each block is replicated multiple times (default replication factor: 3)
● Replicas are stored on different nodes (different racks if possible)
● The NameNode stores the file system metadata
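As a quick illustration of how the block size and replication factor interact, here is a minimal sketch using the defaults above (the helper name is mine, not part of any Hadoop API):

```python
import math

BLOCK_MB = 128      # default HDFS block size
REPLICATION = 3     # default replication factor

def hdfs_footprint(file_mb):
    """How many blocks a file splits into, and the raw storage
    consumed once each block is stored REPLICATION times."""
    blocks = math.ceil(file_mb / BLOCK_MB)
    raw_storage_mb = file_mb * REPLICATION
    return blocks, raw_storage_mb

# A 1024 MB file splits into 8 blocks and occupies 3072 MB of
# raw cluster storage; even a 1 MB file still costs one full
# block entry in the NameNode's metadata.
print(hdfs_footprint(1024))
print(hdfs_footprint(1))
```

This is also why HDFS dislikes lots of small files: each file costs at least one block's worth of NameNode metadata regardless of size.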
HDFS
[Diagram: a logical file is split into blocks 1–4; each block is stored as three replicas spread across the data nodes, while the NameNode holds the metadata.]
How to access HDFS?
● HDFS command line, e.g.: hdfs dfs -put foo.txt
● Web UI (Ambari, Hue)
● Spark API:
df_example = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv')
● Ecosystem projects (Sqoop, Flume)
YARN
The component responsible for:
● Resource management
● Job scheduling
MapReduce
● A method for distributing tasks across cluster nodes.
● Consists of two phases, Map and Reduce (with shuffle & sort in between).
MapReduce 1 overview
[Diagram: data stored in HDFS blocks flows through Map, Shuffle, and Reduce; results can be written to HDFS or a database.]
Mapper
● Each map task processes one HDFS block.
● A map task ideally runs on the same node where its data block is located.
Shuffle & Sort
● Sorts and consolidates the intermediate data from all mappers
Reducer
● Processes the intermediate data produced by the shuffle & sort step
● Produces the final output
Word Count Example
● Input: Bus, Car, train, bus, car, train, bus, train, bus, plane
● Output (counting case-insensitively): (bus, 4) (car, 2) (train, 3) (plane, 1)
Word Count Example
[Diagram: the word-count pipeline: splitting the input, mapping each word to a (word, 1) pair, shuffling the intermediate pairs by key, and reducing to the final counts.]
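The three phases above can be mimicked in plain Python. This is a local sketch of the map / shuffle & sort / reduce flow, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word, lower-cased so Bus == bus.
    return [(word.lower(), 1) for word in line.split(", ")]

def shuffle_sort(pairs):
    # Group the intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    # Each reducer sums the list of 1s for one word.
    return {word: sum(counts) for word, counts in groups.items()}

line = "Bus, Car, train, bus, car, train, bus, train, bus, plane"
counts = reduce_phase(shuffle_sort(map_phase(line)))
print(counts)  # {'bus': 4, 'car': 2, 'plane': 1, 'train': 3}
```

In a real cluster, many map tasks run this in parallel over different HDFS blocks, and the shuffle moves each key's values to the reducer responsible for it.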
MapReduce disadvantages
● MapReduce established a general batch-processing paradigm, but it had limitations:
■ Programming in MapReduce is difficult
■ Batch processing did not fit many use cases
■ MR spawned a lot of specialized systems (Storm, Impala, Giraph, etc.)
Apache Spark
● Parallel distributed processing, fault tolerance on commodity hardware, scalability, in-memory computing, high-level APIs, etc.
○ Ease of use
○ Wide variety of functionality
○ Mature and reliable
Spark Code Size
Resilient Distributed Dataset (RDD)
● A fault-tolerant collection of elements that can be operated on in parallel
● RDDs are immutable (their content cannot be changed)
● Datasets can come from any storage supported by Hadoop
○ HDFS, Cassandra, HBase, Amazon S3, etc.
● Types of files supported:
○ Text files, SequenceFiles, Hadoop InputFormat, etc.
Resilient Distributed Dataset (RDD)
● Three methods for creating an RDD:
■ Parallelizing an existing collection
■ Referencing a dataset
■ Transformation from an existing RDD
● Two types of RDD operations:
○ Transformations => generate a new RDD
○ Actions => compute results
Lazy Execution
# create a sample list
my_list = [i for i in range(1, 10000000)]
# parallelize the data into 3 partitions
rdd_0 = sc.parallelize(my_list, 3)
# add 4 to each item
rdd_1 = rdd_0.map(lambda x: x + 4)
# multiply each item by 2
rdd_2 = rdd_1.map(lambda x: x * 2)
# keep only items less than 100
rdd_3 = rdd_2.filter(lambda x: x < 100)
# action: return the first 2 items as a list
total = rdd_3.take(2)
[Diagram: elements 1, 2, 3, … flow through rdd_1 (5, 6, …) and rdd_2 (10, 12, …) into the filter rdd_3; the take(2) action returns [10, 12]. map and filter are transformations; take is the action that triggers execution.]
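The same chain can be mimicked with Python's own lazy iterators, which makes the "nothing runs until the action" behavior easy to see. A local sketch, not Spark:

```python
from itertools import islice

# Pure-Python mimic of the PySpark chain above. map/filter objects
# are lazy iterators, so building them computes nothing.
data = range(1, 10_000_000)               # stands in for my_list
step1 = map(lambda x: x + 4, data)        # like rdd_1: nothing runs yet
step2 = map(lambda x: x * 2, step1)       # like rdd_2: still lazy
step3 = filter(lambda x: x < 100, step2)  # like rdd_3: still lazy

# The "action": request two results. Only the first few elements
# of `data` are ever touched, despite the ten-million-item range.
result = list(islice(step3, 2))
print(result)  # [10, 12]
```

Spark's laziness works the same way at cluster scale: the transformation chain is just a plan, and the action decides how much of it actually needs to execute.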
Programming with Spark
● APIs are available for many programming languages, including:
○ Scala
○ Python
○ Java
○ R
spark-submit wordcount.py
Spark libraries
● Extensions of the core Spark API.
● Improvements made to the core are passed on to these libraries.
● Little overhead when used with the Spark core.
spark.apache.org
When to use Hadoop
● Hadoop is good for:
○ Processing massive amounts of data through parallelism
○ Handling a variety of data (structured, unstructured, semi-structured)
○ Using inexpensive commodity hardware
When NOT to use Hadoop
● Hadoop is not good for:
○ Processing transactions (random access)
○ When work cannot be parallelized
○ Low latency data access
○ Processing lots of small files
○ Intensive calculations with small amounts of data
What's next?
● IBM Skills Academy Big Data Track
● https://cognitiveclass.ai/learn/big-data
● https://www.coursera.org/specializations/big-data
● https://www.edx.org/course/big-data-analysis-with-apache-spark
● Try Spark on the local file system.
● Learn PySpark.
● Get familiar with bash scripting.
Thanks
● Any Questions?

More Related Content

What's hot

Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
[KCC oral] 정준영
[KCC oral] 정준영[KCC oral] 정준영
[KCC oral] 정준영Junyoung Jung
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheSandeepTaksande
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovVasil Remeniuk
 
Computer Hardware | 3B
Computer Hardware | 3BComputer Hardware | 3B
Computer Hardware | 3BCMDLMS
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop IntroductionSNEHAL MASNE
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemAnshul Bhatnagar
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisTrieu Nguyen
 
Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS  Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS Dr Neelesh Jain
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013WANdisco Plc
 
Hadoop and Netezza - Co-existence or Competition?
Hadoop and Netezza - Co-existence or Competition?Hadoop and Netezza - Co-existence or Competition?
Hadoop and Netezza - Co-existence or Competition?Krishnan Parasuraman
 
Massive parallel processing database systems mpp
Massive parallel processing database systems mppMassive parallel processing database systems mpp
Massive parallel processing database systems mppDiana Patricia Rey Cabra
 

What's hot (20)

Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
[KCC oral] 정준영
[KCC oral] 정준영[KCC oral] 정준영
[KCC oral] 정준영
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex Gryzlov
 
Computer Hardware | 3B
Computer Hardware | 3BComputer Hardware | 3B
Computer Hardware | 3B
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Spark 101
Spark 101Spark 101
Spark 101
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
 
Tera data
Tera dataTera data
Tera data
 
Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS  Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
Hive
HiveHive
Hive
 
Hadoop and Netezza - Co-existence or Competition?
Hadoop and Netezza - Co-existence or Competition?Hadoop and Netezza - Co-existence or Competition?
Hadoop and Netezza - Co-existence or Competition?
 
Massive parallel processing database systems mpp
Massive parallel processing database systems mppMassive parallel processing database systems mpp
Massive parallel processing database systems mpp
 

Similar to Inroduction to Big Data

Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with SparkArjen de Vries
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.pptSathish24111
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeongYousun Jeong
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 

Similar to Inroduction to Big Data (20)

Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Big data
Big dataBig data
Big data
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Big data
Big dataBig data
Big data
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Final deck
Final deckFinal deck
Final deck
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 

Recently uploaded

Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 

Recently uploaded (20)

Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 

Inroduction to Big Data

  • 1. Inroduction to Big Data Omnia Safaan, Senior Data Scientist Data & AI Expert labs , IBM Omnia.safaan@ibm.com
  • 2. What is Big data? Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software
  • 3. The motivation behind Big Data Technology
  • 4. Data has an intrinsic property…it grows and grows 1 in 2 business leaders don’t have access to data they need 83% of CIO’s cited BI and analytics as part of their visionary plan 5.4X more likely that top performers use business analytics 80% of the world’s data today is unstructured 90% of the world’s data was created in the last two years 20% of available data can be processed by traditional systems
  • 5. Data & Processing loop More data generated requires better processing speed More processing speed encourages more consumption More consumption generates more data!!!
  • 6. Big Data Exciting projects
  • 7. Largest Radio Telescope (Square Kilometer Array) ● The Square Kilometer Array (SKA) is a large multi radio telescope project aimed to be built in Australia and South Africa. When finished, it would have a total collecting area of about one square kilometer - it will operate over a wide range of frequencies and its size will make it 50 times more sensitive than any other radio instrument and survey, 10,000 times faster than ever before… ○ The SKA will produce 20,000 PB / day in 2020 (compared with the current internet volume of 300 PB / day) ■ Incidentally, IBM is developing hardware specifically to process this astronomical information
  • 8. Engine data - aircraft, diesel trucks & generators,… ● Each Airbus A380 engine - there are 4 engines - generates 1 PB of data on a flight from London (LHR) to Singapore (SIN) ● GE, one of the world’s largest manufacturers, is using big data analytics with data generated from machine sensors to predict maintenance needs. ● GE is using the analysis to provide services tied to its products, designed to minimize downtime caused by parts failures
  • 9. Large Hadron Collider (LHC) ● The Large Hadron Collider (LHC) data center process about one petabyte of data every day. The center hosts 11,000 services with 100,000 processor cores. Some 6000 changes in the database are performed every second. ○ A global collaboration of computer centers distributes and stores LHC data, giving real-time access to physicists around the world ○ The Grid runs more than two million jobs per day - at peak rates, 10 GB of data may be transferred / sec.
  • 10. Five key Big Data Use Cases
  • 11. Why do we need to handle big data differently? Because it’s BIG!
  • 12. Classical workflow Shady is a data engineer at a financial services company. He has a list of transactions that he needs to store and analyze: 1- Design the database schema. 2- Create tables to store the data. 3- Write a script to read the transactions from the source files and load them into the database. 4- Write queries to return the required analysis results. 5- Run the queries and save the results.
  • 13. Classical Approach cont.. What if we are dealing with data that is too big for our hard disk? What if we are dealing with something like PayPal, where millions of transactions occur per second? What if you have a fraud detection algorithm that should fire real-time alerts on fraudulent transactions? What if the data source has no uniform schema, or holds unstructured data (documents, images, audio files, etc.)?
  • 14. The 4 Vs: Volume (data size exceeding petabytes), Velocity (data generated at high speed), Variety (structured, unstructured & semi-structured data), Veracity (trustworthiness of data)
  • 15. Evolution of big data systems Bigger computers Distributed Systems Hadoop
  • 17. Which is slower? Which is the slower factor in data handling? ● I/O operations ● Processing
  • 18. Moore’s law According to Moore’s law, processing power doubles every 18 months. Meanwhile, a typical disk transfer rate is 100-200 MB/s, so the time needed to transfer 100 GB is about 17 minutes!
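As a quick sanity check, the 17-minute figure follows directly from the numbers on this slide; a minimal back-of-the-envelope calculation (assuming a sustained 100 MB/s rate and 1 GB = 1024 MB):

```python
# How long does it take to read 100 GB at a typical
# sustained disk transfer rate of 100 MB/s?

data_gb = 100        # data size in GB
rate_mb_per_s = 100  # disk transfer rate in MB/s

seconds = (data_gb * 1024) / rate_mb_per_s  # 102,400 MB / 100 MB/s
minutes = seconds / 60

print(f"{minutes:.1f} minutes")  # 17.1 minutes
```

At the faster end of the quoted range (200 MB/s) the transfer still takes over 8 minutes, which is why Hadoop moves computation to the data rather than the other way around.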
  • 19. What is Hadoop? ● Apache open-source software framework for reliable, scalable, distributed computing over massive amounts of data ● Originally based on papers published by Google around 2004 ● Main concept: minimize data transfer over the network; each data node processes its local data
  • 20. Hadoop Core Concepts No need for network programming knowledge Run Computations where the data is stored Data is replicated multiple times across the data nodes Scalability achieved easily through adding commodity hardware as additional nodes. Fault tolerant and resilient execution of tasks
  • 21. Hadoop Main Components ● Hadoop Distributed File System (HDFS): stores the data on the cluster ● MapReduce: processes the data on the cluster ● YARN (Yet Another Resource Negotiator): responsible for resource management and job scheduling
  • 22. Hadoop Ecosystem ● Hadoop is supplemented by an extensive ecosystem of open-source projects.
  • 23. HDFS ● Hadoop Distributed File System ● Splits data into blocks (typically 128 MB in size) ● Each block is replicated multiple times (default replication factor: 3) ● Replicas are stored on different nodes (on different racks, if possible) ● The name node stores the files’ metadata
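The block size and replication factor above imply a simple storage-footprint calculation. A minimal sketch (the `hdfs_footprint` helper is illustrative only, not part of any HDFS API):

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size cited on the slide
REPLICATION = 3      # default replication factor

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, file_size_mb * REPLICATION

# A 1 GB file is split into 8 blocks and, with 3 replicas,
# consumes 3 GB of raw storage across the cluster.
blocks, raw_mb = hdfs_footprint(1024)
print(blocks, raw_mb)  # 8 3072
```

Note that a 130 MB file already needs two blocks: the last block simply holds the remainder rather than padding to 128 MB.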
  • 25. How to access HDFS? ● HDFS command line, e.g.: hdfs dfs -put foo.txt ● Web UIs (Ambari, Hue) ● Spark API: df_example = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv') ● Ecosystem projects (Sqoop, Flume)
  • 26. YARN The component responsible for: ● Resource management ● Job scheduling
  • 27. MapReduce ● A method for distributing tasks across cluster nodes ● Consists of 2 phases: Map & Reduce (with Shuffle & Sort in between)
  • 28. MapReduce 1 overview: data stored in blocks in HDFS → Map → Shuffle → Reduce; results can be written to HDFS or a database
  • 29. Mapper ● Each Map task processes one HDFS block. ● Map tasks ideally run on the same node where the data block is located.
  • 30. Shuffle & Sort ● Sorts and consolidates the intermediate data from all mappers
  • 31. Reducer ● Processes the intermediate data generated by the Shuffle & Sort step ● Produces the final output
  • 32. Word Count Example ● Input: Bus, Car, train, bus, car, train, bus, train, bus, plane ● Output (counting case-insensitively): (bus,4) (car,2) (train,3) (plane,1)
  • 33. Word Count Example phases: Splitting → Mapping (intermediate pairs) → Shuffling → Reducing
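The phases above can be sketched in plain Python. This only models what Hadoop does across the cluster; it is not actual MapReduce code:

```python
from collections import defaultdict

line = "Bus, Car, train, bus, car, train, bus, train, bus, plane"

# Map: emit a (word, 1) pair for every word (normalized to lowercase)
mapped = [(w.strip().lower(), 1) for w in line.split(",")]

# Shuffle & sort: group the intermediate pairs by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the grouped counts for each key
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'bus': 4, 'car': 2, 'train': 3, 'plane': 1}
```

In a real cluster, the map step runs in parallel on each HDFS block and the framework performs the shuffle over the network before the reducers run.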
  • 35. MapReduce disadvantages ● MapReduce established a general batch-processing paradigm, but has its limitations: ■ Programming in MapReduce is difficult ■ Batch processing does not fit many use cases ■ MR spawned a lot of specialized systems (Storm, Impala, Giraph, etc.)
  • 36. Apache Spark ● Parallel distributed processing, fault tolerance on commodity hardware, scalability, in-memory computing, high level APIs, etc. ○ Ease of use ○ Wide variety of functionality ○ Mature and reliable
  • 38. Resilient Distributed Dataset (RDD) ● Fault-tolerant collection of elements that can be operated on in parallel ● RDDs are immutable (content cannot be changed) ● Dataset from any storage supported by Hadoop ○ HDFS, Cassandra, HBase, Amazon S3, etc. ● Types of files supported: ○ Text files, SequenceFiles, Hadoop InputFormat, etc
  • 39. Resilient Distributed Dataset (RDD) ● Three methods for creating RDD ■ Parallelizing an existing collection ■ Referencing a dataset ■ Transformation from an existing RDD ● Two types of RDD operations ○ Transformations => Generate new RDD ○ Actions => Calculate results
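The transformation/action distinction can be modeled in plain Python with generators (an analogy, not Spark code): like RDD transformations, generators only record a recipe, and nothing runs until a terminal operation pulls results.

```python
data = range(1, 1_000_000)

# "Transformations": build up a lazy pipeline; no element
# has been processed at this point.
doubled = (x * 2 for x in data)
small = (x for x in doubled if x < 10)

# "Action": forces evaluation of the whole pipeline.
result = list(small)
print(result)  # [2, 4, 6, 8]
```

Spark exploits the same idea at cluster scale: because the full chain of transformations is known before any work starts, the engine can plan and optimize execution.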
  • 40. Lazy Execution
    # create a sample list
    my_list = [i for i in range(1, 10000000)]
    # parallelize the data into 3 partitions
    rdd_0 = sc.parallelize(my_list, 3)
    # add 4 to each item
    rdd_1 = rdd_0.map(lambda x: x + 4)
    # multiply each item by 2
    rdd_2 = rdd_1.map(lambda x: x * 2)
    # keep only items less than 100
    rdd_3 = rdd_2.filter(lambda x: x < 100)
    # return the first 2 items of the RDD as an array
    total = rdd_3.take(2)  # [10, 12]
  The map and filter calls are transformations (lazy); take is the action that triggers execution of the whole chain.
  • 41. Programming with Spark ● APIs available for many programming languages, including ○ Scala ○ Python ○ Java ○ R. Example launch: spark-submit wordcount.py
  • 42. Spark libraries ● Extension of the core Spark API. ● Improvements made to the core are passed to these libraries. ● Little overhead to use with the Spark core spark.apache.org
  • 43. When to use Hadoop ● Hadoop is good for: ○ Processing massive amounts of data through parallelism ○ Handling a variety of data (structured, unstructured, semi-structured) ○ Using inexpensive commodity hardware
  • 44. When NOT to use Hadoop ● Hadoop is not good for: ○ Processing transactions (random access) ○ When work cannot be parallelized ○ Low latency data access ○ Processing lots of small files ○ Intensive calculations with small amounts of data
  • 45. What's next? ● IBM skills academy Big Data Track ● https://cognitiveclass.ai/learn/big-data ● https://www.coursera.org/specializations/big-data ● https://www.edx.org/course/big-data-analysis-with-apache-spark ● Try Spark on local file system. ● Learn PySpark ● Get familiar with bash scripting

Editor's Notes

  1. With receiving stations extending out to distance of at least 3,000 kilometres (1,900 mi) from a concentrated central core, it will exploit radio astronomy's ability to provide the highest resolution images in all astronomy. The SKA will be built in the southern hemisphere, in sub-Saharan states with cores in South Africa and Australia, where the view of the Milky Way Galaxy is best and radio interference least. Construction of the SKA is scheduled to begin in 2018 for initial observations by 2020. The SKA will be built in two phases, with Phase 1 (2018-2023) representing about 10% of the capability of the whole telescope. Phase 1 of the SKA was cost-capped at 650 million euros in 2013, while Phase 2's cost has not yet been established. The headquarters of the project are located at the Jodrell Bank Observatory, in the UK. The data collected by the SKA in a single day would take nearly two million years to playback on an iPod The SKA will be so sensitive that it will be able to detect an airport radar on a planet tens of light years away The SKA central computer will have the processing power of about one hundred million PCs The dishes of the SKA will produce 10 times the global internet traffic The SKA will use enough optical fiber to wrap twice around the Earth The aperture arrays in the SKA could produce more than 100 times the global internet traffic References: https://en.wikipedia.org/wiki/Square_Kilometre_Array https://www.skatelescope.org/ (SKA homepage) http://www.ska.gov.au/About/Pages/default.aspx http://www.ska.gov.au/NewZealandSKA/Pages/default.aspx
  2. Both Qantas and Singapore Airlines use Rolls-Royce Trent 900 engines in their A380 aircraft. Qantas Flight 32 was a Qantas scheduled passenger flight that suffered an uncontained engine failure on 4 November 2010 and made an emergency landing at Singapore Changi Airport. The failure was the first of its kind for the Airbus A380, the world's largest passenger aircraft. It marked the first aviation occurrence involving an Airbus A380. On inspection it was found that a turbine disc in the aircraft's No. 2 Rolls-Royce Trent 900 engine (on the port side nearest the fuselage) had disintegrated. The aircraft had also suffered damage to the nacelle, wing, fuel system, landing gear, flight controls, the controls for engine No. 1 and an undetected fire in the left inner wing fuel tank that eventually self-extinguished.[1] The failure was determined to have been caused by the breaking of a stub oil pipe which had been manufactured improperly. GE manufactures jet engines, turbines and medical scanners. It is using operational data from sensors on its machinery and engines for pattern analysis. References: http://www.computerweekly.com/news/2240176248/GE-uses-big-data-to-power-machine-services-business "The airline industry spends $200bn on fuel per year so a 2% saving is $4bn. GE provides software that enables airline pilots to manage fuel efficiency.“ “Another product, Movement Planner, is a cruise control system for train drivers. The technology assesses the terrain and the location of the train to calculate the optimal speed to run the locomotive for fuel economy. ” http://www.infoworld.com/article/2616433/big-data/general-electric-lays-out-big-plans-for-big-data.html “As one of the world's largest companies, GE is a major manufacturer of systems in aviation, rail, mining, energy, healthcare, and more. 
In recognition of the importance of big data to GE, CEO Jeff Immelt launched a new initiative called the “industrial Internet,” which aims to help customers increase efficiency and to create new revenue opportunities for GE through analytics.” “The industrial Internet is GE's spin on “the Internet of things,” where Internet-connected sensors collect vast quantities of data for analysis. According to Immelt, sensors have already been embedded in 250,000 "intelligent machines" manufactured by GE, including jet engines, power turbines, medical devices, and so on. Harvesting and analyzing the data generated by those sensors holds enormous potential for optimization across a broad range of industrial operations.
  3. References: http://www.popularmechanics.com/technology/a20540/300-tb-cern-data-large-hadron-collider/ (April 2016) “The most complex machine in mankind's history just put a gargantuan data trove online for anyone to parse. You think you've got the analytical chops to glean insights about the nature of the cosmos or God or simply the tendencies of muons? Go ahead, man, dig through the 300 terabytes of data that CERN, the European Organization for Nuclear Research, just dropped onto the cloud.” … “But it ain't nothing compared to what the National Security Agency works with. Going by 2013 figures the agency released, the NSA's various activities ’touch’ 300 TB of data every 15 minutes or so.” http://www.theverge.com/2016/4/25/11501078/cern-300-tb-lhc-data-open-access “If you ever wanted to take a look at raw data produced by the Large Hadron Collider, but are missing the necessary physics PhD, here's your chance: CERN has published more than 300 terabytes of LHC data online for free. The data covers roughly half the experiments run by the LHC's CMS detector during 2011, with a press release from CERN explaining that this includes about 2.5 inverse femtobarns of data - around 250 trillion particle collisions. Best not to download this on a mobile connection then.”
  4. Icons made by Freepik from www.flaticon.com