Introduction to Big Data
Omnia Safaan, Senior Data Scientist
Data & AI Expert Labs, IBM
Omnia.safaan@ibm.com
What is Big Data?
Big data is the field concerned with ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex for traditional data-processing application software.
The motivation behind Big
Data Technology
Data has an intrinsic property…it grows and grows
● 1 in 2 business leaders don’t have access to the data they need
● 83% of CIOs cited BI and analytics as part of their visionary plan
● 5.4x more likely that top performers use business analytics
● 80% of the world’s data today is unstructured
● 90% of the world’s data was created in the last two years
● 20% of available data can be processed by traditional systems
Data & Processing loop
More data generated requires more processing speed
More processing speed encourages more consumption
More consumption generates more data!
Exciting Big Data Projects
Largest Radio Telescope (Square Kilometre Array)
● The Square Kilometre Array (SKA) is a large multi-telescope radio project planned for Australia and South Africa. When finished, it will have a total collecting area of about one square kilometre. It will operate over a wide range of frequencies, and its size will make it 50 times more sensitive than any other radio instrument, able to survey 10,000 times faster than ever before.
○ The SKA was projected to produce 20,000 PB/day in 2020 (compared with a current internet volume of about 300 PB/day).
■ Incidentally, IBM is developing hardware specifically to process this astronomical information.
Engine data: aircraft, diesel trucks, generators, …
● Each Airbus A380 engine (there are four) generates 1 PB of data on a flight from London (LHR) to Singapore (SIN).
● GE, one of the world’s largest manufacturers, uses big data analytics on data generated by machine sensors to predict maintenance needs.
● GE uses this analysis to provide services tied to its products, designed to minimize downtime caused by parts failures.
Large Hadron Collider (LHC)
● The Large Hadron Collider (LHC) data center processes about one petabyte of data every day. The center hosts 11,000 servers with 100,000 processor cores, and some 6,000 database changes are performed every second.
○ A global collaboration of computer centers distributes and stores LHC data, giving physicists around the world real-time access.
○ The Grid runs more than two million jobs per day; at peak rates, 10 GB of data may be transferred per second.
Five key Big Data Use Cases
Why do we need to handle big data differently?
Because it’s BIG!?
Classical workflow
Shady is a data engineer at a financial-services company. He has a list of transactions that he needs to store and analyze:
1- Design the database schema.
2- Create tables to store the data.
3- Write a script that reads the transactions from the source files and stores them in the database.
4- Write the queries that return the required analysis results.
5- Run the queries and save the results.
Classical Approach (cont.)
What if we are dealing with data that is too big for our hard disk?
What if we are dealing with something like PayPal’s data, where millions of transactions happen per second?
What if we have a fraud-detection algorithm that should fire real-time alerts on fraudulent transactions?
What if the data source has no uniform schema, i.e. unstructured data (documents, images, audio files, etc.)?
The 4 Vs
Volume: data size exceeding petabytes
Velocity: data generated at high speed
Variety: structured, unstructured & semi-structured data
Veracity: trustworthiness of the data
Evolution of big data systems
Bigger computers → Distributed systems → Hadoop
Inspiration behind Hadoop
Which is the slower factor in data handling?
● I/O operations
● Processing
Moore’s law
According to Moore’s law, processing power doubles roughly every 18 months.
Meanwhile, a typical disk transfer rate is 100–200 MB/s, so transferring 100 GB at 100 MB/s takes about 17 minutes!
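The 17-minute figure can be checked with a quick back-of-the-envelope calculation. A minimal sketch, assuming decimal units (1 GB = 1000 MB), as drive spec sheets use:

```python
# Back-of-the-envelope check of the disk transfer-time claim.
# Assumes decimal units: 1 GB = 1000 MB.

def transfer_minutes(size_gb, rate_mb_per_s=100):
    """Minutes needed to read `size_gb` gigabytes sequentially
    at `rate_mb_per_s` megabytes per second."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 60

print(round(transfer_minutes(100)))        # ~17 minutes at 100 MB/s
print(round(transfer_minutes(100, 200)))   # ~8 minutes at 200 MB/s
```

Even at the upper end of the stated disk range, a single drive needs minutes to scan 100 GB, which is why Hadoop reads many disks in parallel.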
What is Hadoop?
● An Apache open-source software framework for reliable, scalable, distributed computing over massive amounts of data.
● Originally based on papers published by Google around 2004.
● Main concept: minimize data transfer over the network; each data node processes its own local data.
Hadoop Core Concepts
No need for network-programming knowledge
Run computations where the data is stored
Data is replicated multiple times across the data nodes
Scalability is achieved easily by adding commodity hardware as additional nodes
Fault-tolerant and resilient execution of tasks
Hadoop Main Components
● Hadoop Distributed File System (HDFS): storing the data on the cluster.
● MapReduce: processing the data on the cluster.
● YARN (Yet Another Resource Negotiator): responsible for resource management and job scheduling.
Hadoop Ecosystem
● Hadoop is supplemented by an extensive ecosystem of open-source projects.
HDFS
● Hadoop Distributed File System
● Splits data into blocks (typically 128 MB in size)
● Each block is replicated multiple times (default replication factor: 3)
● Replicas are stored on different nodes (different racks if possible)
● The NameNode stores the file system metadata
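As a quick illustration of how the block size and replication factor interact, here is a minimal sketch using the defaults above (the helper name is mine, not part of any Hadoop API):

```python
import math

BLOCK_MB = 128      # default HDFS block size
REPLICATION = 3     # default replication factor

def hdfs_footprint(file_mb):
    """How many blocks a file splits into, and the raw storage
    consumed once each block is stored REPLICATION times."""
    blocks = math.ceil(file_mb / BLOCK_MB)
    raw_storage_mb = file_mb * REPLICATION
    return blocks, raw_storage_mb

# A 1024 MB file splits into 8 blocks and occupies 3072 MB of
# raw cluster storage; even a 1 MB file still costs one full
# block entry in the NameNode's metadata.
print(hdfs_footprint(1024))
print(hdfs_footprint(1))
```

This is also why HDFS dislikes lots of small files: each file costs at least one block's worth of NameNode metadata regardless of size.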
HDFS
[Diagram: a logical file is split into blocks 1–4; each block is stored as three replicas spread across the data nodes, while the NameNode holds the metadata.]
How to access HDFS?
● HDFS command line, e.g.: hdfs dfs -put foo.txt
● Web UI (Ambari, Hue)
● Spark API:
df_example = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv')
● Ecosystem projects (Sqoop, Flume)
YARN
The component responsible for:
● Resource management
● Job scheduling
MapReduce
● A method for distributing tasks across cluster nodes.
● Consists of two phases, Map and Reduce (with shuffle & sort in between).
MapReduce 1 overview
[Diagram: data stored in HDFS blocks flows through Map, Shuffle, and Reduce; results can be written to HDFS or a database.]
Mapper
● Each map task processes one HDFS block.
● A map task ideally runs on the same node where its data block is located.
Shuffle & Sort
● Sorts and consolidates the intermediate data from all mappers
Reducer
● Processes the intermediate data produced by the shuffle & sort step
● Produces the final output
Word Count Example
● Input: Bus, Car, train, bus, car, train, bus, train, bus, plane
● Output (counting case-insensitively): (bus, 4) (car, 2) (train, 3) (plane, 1)
Word Count Example
[Diagram: the word-count pipeline: splitting the input, mapping each word to a (word, 1) pair, shuffling the intermediate pairs by key, and reducing to the final counts.]
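The three phases above can be mimicked in plain Python. This is a local sketch of the map / shuffle & sort / reduce flow, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word, lower-cased so Bus == bus.
    return [(word.lower(), 1) for word in line.split(", ")]

def shuffle_sort(pairs):
    # Group the intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    # Each reducer sums the list of 1s for one word.
    return {word: sum(counts) for word, counts in groups.items()}

line = "Bus, Car, train, bus, car, train, bus, train, bus, plane"
counts = reduce_phase(shuffle_sort(map_phase(line)))
print(counts)  # {'bus': 4, 'car': 2, 'plane': 1, 'train': 3}
```

In a real cluster, many map tasks run this in parallel over different HDFS blocks, and the shuffle moves each key's values to the reducer responsible for it.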
MapReduce disadvantages
● MapReduce established a general batch-processing paradigm, but it had limitations:
■ Programming in MapReduce is difficult
■ Batch processing did not fit many use cases
■ MR spawned a lot of specialized systems (Storm, Impala, Giraph, etc.)
Apache Spark
● Parallel distributed processing, fault tolerance on commodity hardware, scalability, in-memory computing, high-level APIs, etc.
○ Ease of use
○ Wide variety of functionality
○ Mature and reliable
Spark Code Size
Resilient Distributed Dataset (RDD)
● A fault-tolerant collection of elements that can be operated on in parallel
● RDDs are immutable (their content cannot be changed)
● Datasets can come from any storage supported by Hadoop
○ HDFS, Cassandra, HBase, Amazon S3, etc.
● Types of files supported:
○ Text files, SequenceFiles, Hadoop InputFormat, etc.
Resilient Distributed Dataset (RDD)
● Three methods for creating an RDD:
■ Parallelizing an existing collection
■ Referencing a dataset
■ Transformation from an existing RDD
● Two types of RDD operations:
○ Transformations => generate a new RDD
○ Actions => compute results
Lazy Execution
# create a sample list
my_list = [i for i in range(1, 10000000)]
# parallelize the data into 3 partitions
rdd_0 = sc.parallelize(my_list, 3)
# add 4 to each item
rdd_1 = rdd_0.map(lambda x: x + 4)
# multiply each item by 2
rdd_2 = rdd_1.map(lambda x: x * 2)
# keep only items less than 100
rdd_3 = rdd_2.filter(lambda x: x < 100)
# action: return the first 2 items as a list
total = rdd_3.take(2)
[Diagram: elements 1, 2, 3, … flow through rdd_1 (5, 6, …) and rdd_2 (10, 12, …) into the filter rdd_3; the take(2) action returns [10, 12]. map and filter are transformations; take is the action that triggers execution.]
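The same chain can be mimicked with Python's own lazy iterators, which makes the "nothing runs until the action" behavior easy to see. A local sketch, not Spark:

```python
from itertools import islice

# Pure-Python mimic of the PySpark chain above. map/filter objects
# are lazy iterators, so building them computes nothing.
data = range(1, 10_000_000)               # stands in for my_list
step1 = map(lambda x: x + 4, data)        # like rdd_1: nothing runs yet
step2 = map(lambda x: x * 2, step1)       # like rdd_2: still lazy
step3 = filter(lambda x: x < 100, step2)  # like rdd_3: still lazy

# The "action": request two results. Only the first few elements
# of `data` are ever touched, despite the ten-million-item range.
result = list(islice(step3, 2))
print(result)  # [10, 12]
```

Spark's laziness works the same way at cluster scale: the transformation chain is just a plan, and the action decides how much of it actually needs to execute.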
Programming with Spark
● APIs are available for many programming languages, including:
○ Scala
○ Python
○ Java
○ R
spark-submit wordcount.py
Spark libraries
● Extensions of the core Spark API.
● Improvements made to the core are passed on to these libraries.
● Little overhead when used with the Spark core.
spark.apache.org
When to use Hadoop
● Hadoop is good for:
○ Processing massive amounts of data through parallelism
○ Handling a variety of data (structured, unstructured, semi-structured)
○ Using inexpensive commodity hardware
When NOT to use Hadoop
● Hadoop is not good for:
○ Processing transactions (random access)
○ When work cannot be parallelized
○ Low latency data access
○ Processing lots of small files
○ Intensive calculations with small amounts of data
What's next?
● IBM Skills Academy Big Data Track
● https://cognitiveclass.ai/learn/big-data
● https://www.coursera.org/specializations/big-data
● https://www.edx.org/course/big-data-analysis-with-apache-spark
● Try Spark on the local file system.
● Learn PySpark.
● Get familiar with bash scripting.
Thanks
● Any Questions?

More Related Content

What's hot

Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
[KCC oral] 정준영
[KCC oral] 정준영[KCC oral] 정준영
[KCC oral] 정준영Junyoung Jung
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheSandeepTaksande
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovVasil Remeniuk
 
Computer Hardware | 3B
Computer Hardware | 3BComputer Hardware | 3B
Computer Hardware | 3BCMDLMS
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop IntroductionSNEHAL MASNE
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemAnshul Bhatnagar
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisTrieu Nguyen
 
Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS  Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS Dr Neelesh Jain
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013WANdisco Plc
 
Hadoop and Netezza - Co-existence or Competition?
Hadoop and Netezza - Co-existence or Competition?Hadoop and Netezza - Co-existence or Competition?
Hadoop and Netezza - Co-existence or Competition?Krishnan Parasuraman
 
Massive parallel processing database systems mpp
Massive parallel processing database systems mppMassive parallel processing database systems mpp
Massive parallel processing database systems mppDiana Patricia Rey Cabra
 

What's hot (20)

Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
[KCC oral] 정준영
[KCC oral] 정준영[KCC oral] 정준영
[KCC oral] 정준영
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex Gryzlov
 
Computer Hardware | 3B
Computer Hardware | 3BComputer Hardware | 3B
Computer Hardware | 3B
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Spark 101
Spark 101Spark 101
Spark 101
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
 
Tera data
Tera dataTera data
Tera data
 
Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS  Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
Hive
HiveHive
Hive
 
Hadoop and Netezza - Co-existence or Competition?
Hadoop and Netezza - Co-existence or Competition?Hadoop and Netezza - Co-existence or Competition?
Hadoop and Netezza - Co-existence or Competition?
 
Massive parallel processing database systems mpp
Massive parallel processing database systems mppMassive parallel processing database systems mpp
Massive parallel processing database systems mpp
 

Similar to Inroduction to Big Data

Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with SparkArjen de Vries
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.pptSathish24111
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeongYousun Jeong
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 

Similar to Inroduction to Big Data (20)

Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Big data
Big dataBig data
Big data
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Big data
Big dataBig data
Big data
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Final deck
Final deckFinal deck
Final deck
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 

Recently uploaded

Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 

Recently uploaded (20)

Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 

Inroduction to Big Data

  • 1. Inroduction to Big Data Omnia Safaan, Senior Data Scientist Data & AI Expert labs , IBM Omnia.safaan@ibm.com
  • 2. What is Big data? Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software
  • 3. The motivation behind Big Data Technology
  • 4. Data has an intrinsic property…it grows and grows 1 in 2 business leaders don’t have access to data they need 83% of CIO’s cited BI and analytics as part of their visionary plan 5.4X more likely that top performers use business analytics 80% of the world’s data today is unstructured 90% of the world’s data was created in the last two years 20% of available data can be processed by traditional systems
  • 5. Data & Processing loop More data generated requires better processing speed More processing speed encourages more consumption More consumption generates more data!!!
  • 6. Big Data Exciting projects
  • 7. Largest Radio Telescope (Square Kilometer Array) ● The Square Kilometer Array (SKA) is a large multi radio telescope project aimed to be built in Australia and South Africa. When finished, it would have a total collecting area of about one square kilometer - it will operate over a wide range of frequencies and its size will make it 50 times more sensitive than any other radio instrument and survey, 10,000 times faster than ever before… ○ The SKA will produce 20,000 PB / day in 2020 (compared with the current internet volume of 300 PB / day) ■ Incidentally, IBM is developing hardware specifically to process this astronomical information
  • 8. Engine data - aircraft, diesel trucks & generators,… ● Each Airbus A380 engine - there are 4 engines - generates 1 PB of data on a flight from London (LHR) to Singapore (SIN) ● GE, one of the world’s largest manufacturers, is using big data analytics with data generated from machine sensors to predict maintenance needs. ● GE is using the analysis to provide services tied to its products, designed to minimize downtime caused by parts failures
  • 9. Large Hadron Collider (LHC) ● The Large Hadron Collider (LHC) data center process about one petabyte of data every day. The center hosts 11,000 services with 100,000 processor cores. Some 6000 changes in the database are performed every second. ○ A global collaboration of computer centers distributes and stores LHC data, giving real-time access to physicists around the world ○ The Grid runs more than two million jobs per day - at peak rates, 10 GB of data may be transferred / sec.
  • 10. Five key Big Data Use Cases
  • 11. Why do we need to handle big data differently? Because it’s BIG!
  • 12. Classical workflow Shady is a data engineer at a financial services company. He has a list of transactions that he needs to store and analyze: 1- Design the database schema. 2- Create tables to store the data. 3- Write a script to read the transactions from the source files and load them into the database. 4- Write queries to return the required analysis results. 5- Run the queries and save the results.
  • 13. Classical Approach cont.. What if we are dealing with data that is too big for our hard disk? What if we are dealing with something like PayPal, where millions of transactions occur per second? What if you have a fraud detection algorithm that should fire real-time alerts on fraudulent transactions? What if the data source has no uniform schema, or holds unstructured data (documents, images, audio files, etc.)?
  • 14. The 4 Vs: Volume (data size exceeding petabytes), Velocity (data generated at high speed), Variety (structured, unstructured & semi-structured data), Veracity (trustworthiness of data)
  • 15. Evolution of big data systems Bigger computers Distributed Systems Hadoop
  • 17. Which is slower? Which is the slower factor in data handling? ● I/O operations ● Processing
  • 18. Moore’s law According to Moore’s law, processing power doubles every 18 months. Meanwhile, a typical disk transfer rate is 100-200 MB/s, so the time needed to transfer 100 GB is about 17 minutes!
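As a quick sanity check, the 17-minute figure follows directly from the numbers on this slide; a minimal back-of-the-envelope calculation (assuming a sustained 100 MB/s rate and 1 GB = 1024 MB):

```python
# How long does it take to read 100 GB at a typical
# sustained disk transfer rate of 100 MB/s?

data_gb = 100        # data size in GB
rate_mb_per_s = 100  # disk transfer rate in MB/s

seconds = (data_gb * 1024) / rate_mb_per_s  # 102,400 MB / 100 MB/s
minutes = seconds / 60

print(f"{minutes:.1f} minutes")  # 17.1 minutes
```

At the faster end of the quoted range (200 MB/s) the transfer still takes over 8 minutes, which is why Hadoop moves computation to the data rather than the other way around.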
  • 19. What is Hadoop? ● Apache open-source software framework for reliable, scalable, distributed computing over massive amounts of data ● Originally based on papers published by Google around 2004 ● Main concept: minimize data transfer over the network; each data node processes its local data
  • 20. Hadoop Core Concepts No need for network programming knowledge Run Computations where the data is stored Data is replicated multiple times across the data nodes Scalability achieved easily through adding commodity hardware as additional nodes. Fault tolerant and resilient execution of tasks
  • 21. Hadoop Main Components ● Hadoop Distributed File System (HDFS): stores the data on the cluster ● MapReduce: processes the data on the cluster ● YARN (Yet Another Resource Negotiator): responsible for resource management and job scheduling
  • 22. Hadoop Ecosystem ● Hadoop is supplemented by an extensive ecosystem of open-source projects.
  • 23. HDFS ● Hadoop Distributed File System ● Splits data into blocks (typically 128 MB in size) ● Each block is replicated multiple times (default replication factor: 3) ● Replicas are stored on different nodes (on different racks, if possible) ● The name node stores the files’ metadata
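The block size and replication factor above imply a simple storage-footprint calculation. A minimal sketch (the `hdfs_footprint` helper is illustrative only, not part of any HDFS API):

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size cited on the slide
REPLICATION = 3      # default replication factor

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, file_size_mb * REPLICATION

# A 1 GB file is split into 8 blocks and, with 3 replicas,
# consumes 3 GB of raw storage across the cluster.
blocks, raw_mb = hdfs_footprint(1024)
print(blocks, raw_mb)  # 8 3072
```

Note that a 130 MB file already needs two blocks: the last block simply holds the remainder rather than padding to 128 MB.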
  • 25. How to access HDFS? ● HDFS command line, e.g.: hdfs dfs -put foo.txt ● Web UIs (Ambari, Hue) ● Spark API: df_example = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv') ● Ecosystem projects (Sqoop, Flume)
  • 26. YARN The component responsible for: ● Resource management ● Job scheduling
  • 27. MapReduce ● A method for distributing tasks across cluster nodes ● Consists of 2 phases: Map & Reduce (with Shuffle & Sort in between)
  • 28. MapReduce 1 overview: data stored in blocks in HDFS → Map → Shuffle → Reduce; results can be written to HDFS or a database
  • 29. Mapper ● Each Map task processes one HDFS block. ● Map tasks ideally run on the same node where the data block is located.
  • 30. Shuffle & Sort ● Sorts and consolidates the intermediate data from all mappers
  • 31. Reducer ● Processes the intermediate data generated by the Shuffle & Sort step ● Produces the final output
  • 32. Word Count Example ● Input: Bus, Car, train, bus, car, train, bus, train, bus, plane ● Output (counting case-insensitively): (bus,4) (car,2) (train,3) (plane,1)
  • 33. Word Count Example phases: Splitting → Mapping (intermediate pairs) → Shuffling → Reducing
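The phases above can be sketched in plain Python. This only models what Hadoop does across the cluster; it is not actual MapReduce code:

```python
from collections import defaultdict

line = "Bus, Car, train, bus, car, train, bus, train, bus, plane"

# Map: emit a (word, 1) pair for every word (normalized to lowercase)
mapped = [(w.strip().lower(), 1) for w in line.split(",")]

# Shuffle & sort: group the intermediate pairs by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the grouped counts for each key
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'bus': 4, 'car': 2, 'train': 3, 'plane': 1}
```

In a real cluster, the map step runs in parallel on each HDFS block and the framework performs the shuffle over the network before the reducers run.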
  • 35. MapReduce disadvantages ● MapReduce established a general batch-processing paradigm, but has its limitations: ■ Programming in MapReduce is difficult ■ Batch processing does not fit many use cases ■ MR spawned a lot of specialized systems (Storm, Impala, Giraph, etc.)
  • 36. Apache Spark ● Parallel distributed processing, fault tolerance on commodity hardware, scalability, in-memory computing, high level APIs, etc. ○ Ease of use ○ Wide variety of functionality ○ Mature and reliable
  • 38. Resilient Distributed Dataset (RDD) ● Fault-tolerant collection of elements that can be operated on in parallel ● RDDs are immutable (content cannot be changed) ● Dataset from any storage supported by Hadoop ○ HDFS, Cassandra, HBase, Amazon S3, etc. ● Types of files supported: ○ Text files, SequenceFiles, Hadoop InputFormat, etc
  • 39. Resilient Distributed Dataset (RDD) ● Three methods for creating RDD ■ Parallelizing an existing collection ■ Referencing a dataset ■ Transformation from an existing RDD ● Two types of RDD operations ○ Transformations => Generate new RDD ○ Actions => Calculate results
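The transformation/action distinction can be modeled in plain Python with generators (an analogy, not Spark code): like RDD transformations, generators only record a recipe, and nothing runs until a terminal operation pulls results.

```python
data = range(1, 1_000_000)

# "Transformations": build up a lazy pipeline; no element
# has been processed at this point.
doubled = (x * 2 for x in data)
small = (x for x in doubled if x < 10)

# "Action": forces evaluation of the whole pipeline.
result = list(small)
print(result)  # [2, 4, 6, 8]
```

Spark exploits the same idea at cluster scale: because the full chain of transformations is known before any work starts, the engine can plan and optimize execution.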
  • 40. Lazy Execution
    # create a sample list
    my_list = [i for i in range(1, 10000000)]
    # parallelize the data into 3 partitions
    rdd_0 = sc.parallelize(my_list, 3)
    # add 4 to each item
    rdd_1 = rdd_0.map(lambda x: x + 4)
    # multiply each item by 2
    rdd_2 = rdd_1.map(lambda x: x * 2)
    # keep only items less than 100
    rdd_3 = rdd_2.filter(lambda x: x < 100)
    # return the first 2 items of the RDD as an array
    total = rdd_3.take(2)  # [10, 12]
  The map and filter calls are transformations (lazy); take is the action that triggers execution of the whole chain.
  • 41. Programming with Spark ● APIs available for many programming languages, including ○ Scala ○ Python ○ Java ○ R. Example launch: spark-submit wordcount.py
  • 42. Spark libraries ● Extension of the core Spark API. ● Improvements made to the core are passed to these libraries. ● Little overhead to use with the Spark core spark.apache.org
  • 43. When to use Hadoop ● Hadoop is good for: ○ Processing massive amounts of data through parallelism ○ Handling a variety of data (structured, unstructured, semi-structured) ○ Using inexpensive commodity hardware
  • 44. When NOT to use Hadoop ● Hadoop is not good for: ○ Processing transactions (random access) ○ When work cannot be parallelized ○ Low latency data access ○ Processing lots of small files ○ Intensive calculations with small amounts of data
  • 45. What's next? ● IBM skills academy Big Data Track ● https://cognitiveclass.ai/learn/big-data ● https://www.coursera.org/specializations/big-data ● https://www.edx.org/course/big-data-analysis-with-apache-spark ● Try Spark on local file system. ● Learn PySpark ● Get familiar with bash scripting

Editor's Notes

  1. With receiving stations extending out to distance of at least 3,000 kilometres (1,900 mi) from a concentrated central core, it will exploit radio astronomy's ability to provide the highest resolution images in all astronomy. The SKA will be built in the southern hemisphere, in sub-Saharan states with cores in South Africa and Australia, where the view of the Milky Way Galaxy is best and radio interference least. Construction of the SKA is scheduled to begin in 2018 for initial observations by 2020. The SKA will be built in two phases, with Phase 1 (2018-2023) representing about 10% of the capability of the whole telescope. Phase 1 of the SKA was cost-capped at 650 million euros in 2013, while Phase 2's cost has not yet been established. The headquarters of the project are located at the Jodrell Bank Observatory, in the UK. The data collected by the SKA in a single day would take nearly two million years to playback on an iPod The SKA will be so sensitive that it will be able to detect an airport radar on a planet tens of light years away The SKA central computer will have the processing power of about one hundred million PCs The dishes of the SKA will produce 10 times the global internet traffic The SKA will use enough optical fiber to wrap twice around the Earth The aperture arrays in the SKA could produce more than 100 times the global internet traffic References: https://en.wikipedia.org/wiki/Square_Kilometre_Array https://www.skatelescope.org/ (SKA homepage) http://www.ska.gov.au/About/Pages/default.aspx http://www.ska.gov.au/NewZealandSKA/Pages/default.aspx
  2. Both Qantas and Singapore Airlines use Rolls-Royce Trent 900 engines in their A380 aircraft. Qantas Flight 32 was a Qantas scheduled passenger flight that suffered an uncontained engine failure on 4 November 2010 and made an emergency landing at Singapore Changi Airport. The failure was the first of its kind for the Airbus A380, the world's largest passenger aircraft. It marked the first aviation occurrence involving an Airbus A380. On inspection it was found that a turbine disc in the aircraft's No. 2 Rolls-Royce Trent 900 engine (on the port side nearest the fuselage) had disintegrated. The aircraft had also suffered damage to the nacelle, wing, fuel system, landing gear, flight controls, the controls for engine No. 1 and an undetected fire in the left inner wing fuel tank that eventually self-extinguished.[1] The failure was determined to have been caused by the breaking of a stub oil pipe which had been manufactured improperly. GE manufactures jet engines, turbines and medical scanners. It is using operational data from sensors on its machinery and engines for pattern analysis. References: http://www.computerweekly.com/news/2240176248/GE-uses-big-data-to-power-machine-services-business "The airline industry spends $200bn on fuel per year so a 2% saving is $4bn. GE provides software that enables airline pilots to manage fuel efficiency.“ “Another product, Movement Planner, is a cruise control system for train drivers. The technology assesses the terrain and the location of the train to calculate the optimal speed to run the locomotive for fuel economy. ” http://www.infoworld.com/article/2616433/big-data/general-electric-lays-out-big-plans-for-big-data.html “As one of the world's largest companies, GE is a major manufacturer of systems in aviation, rail, mining, energy, healthcare, and more. 
In recognition of the importance of big data to GE, CEO Jeff Immelt launched a new initiative called the “industrial Internet,” which aims to help customers increase efficiency and to create new revenue opportunities for GE through analytics.” “The industrial Internet is GE's spin on “the Internet of things,” where Internet-connected sensors collect vast quantities of data for analysis. According to Immelt, sensors have already been embedded in 250,000 "intelligent machines" manufactured by GE, including jet engines, power turbines, medical devices, and so on. Harvesting and analyzing the data generated by those sensors holds enormous potential for optimization across a broad range of industrial operations.
  3. References: http://www.popularmechanics.com/technology/a20540/300-tb-cern-data-large-hadron-collider/ (April 2016) “The most complex machine in mankind's history just put a gargantuan data trove online for anyone to parse. You think you've got the analytical chops to glean insights about the nature of the cosmos or God or simply the tendencies of muons? Go ahead, man, dig through the 300 terabytes of data that CERN, the European Organization for Nuclear Research, just dropped onto the cloud.” … “But it ain't nothing compared to what the National Security Agency works with. Going by 2013 figures the agency released, the NSA's various activities ’touch’ 300 TB of data every 15 minutes or so.” http://www.theverge.com/2016/4/25/11501078/cern-300-tb-lhc-data-open-access “If you ever wanted to take a look at raw data produced by the Large Hadron Collider, but are missing the necessary physics PhD, here's your chance: CERN has published more than 300 terabytes of LHC data online for free. The data covers roughly half the experiments run by the LHC's CMS detector during 2011, with a press release from CERN explaining that this includes about 2.5 inverse femtobarns of data - around 250 trillion particle collisions. Best not to download this on a mobile connection then.”
  4. Icons made by Freepik from www.flaticon.com