The Age of Big Data: Why and How?
VAHID AMIRI
VAHIDAMIRY.IR
VAHID.AMIRY@GMAIL.COM
Big Data: Data Gathering, Data Storing, Data Processing
Big Data Definition
- No single standard definition exists
"Big Data" is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it.
Big Data: 3V’s
Volume
- 12+ TB of tweet data every day
- 25+ TB of log data every day
- ? TB of data every day
- 2+ billion people on the Web by the end of 2011
- 30 billion RFID tags today (1.3 billion in 2005)
- 4.6 billion camera phones worldwide
- 100s of millions of GPS-enabled devices sold annually
- 76 million smart meters in 2009, 200 million by 2014
Variety (Complexity)
- Relational Data (Tables/Transaction/Legacy Data)
- Text Data (Web)
- Semi-structured Data (XML)
- Graph Data
  - Social Network, Semantic Web (RDF), …
- Streaming Data
  - You can only scan the data once
- Big Public Data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together
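The streaming constraint listed above (the data can be scanned only once) rules out algorithms that need a second pass. As a minimal sketch, assuming a stream of numeric readings, the hypothetical `RunningStats` class below keeps a mean and variance in constant memory using Welford's update; the class and function names are illustrative and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Mean and variance maintained in O(1) memory (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Fold one new value into the summary; the value is not stored.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one value has been seen.
        return self.m2 / self.count if self.count else 0.0


def summarize(stream: Iterable[float]) -> RunningStats:
    """Consume the stream exactly once and return its summary."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

Because `summarize` accepts any iterable, it can be fed a generator that reads from a socket or a log tail, and nothing is buffered beyond the three summary fields.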
A Single View to the Customer
Customer: Social Media, Gaming, Entertain, Banking, Finance, Our Known History, Purchase
Velocity (Speed)
- Data is being generated fast and needs to be processed fast
- Online Data Analytics
- Late decisions lead to missed opportunities
Social media and networks (all of us are generating data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
Some Make it 4V’s
The Model Has Changed…
- The model of generating/consuming data has changed
Old Model: A few companies generate data; all others consume it
New Model: All of us generate data, and all of us consume data
Solution
Big Data, Big Computation, Big Computer
Big Data Solutions
Hadoop
- Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
- Hadoop implements the MapReduce model introduced by Google, with HDFS as its storage layer
- MapReduce divides applications into many small blocks of work
- HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster
Spark Stack
So many NoSQL options
- More than just the elephant in the room
- 120+ types of NoSQL databases
NoSQL
- Relational database (RDBMS) technology
  - Has not fundamentally changed in over 40 years
  - Default choice for holding data behind many web apps
  - Handling more users means adding a bigger server
- Extend the scope of RDBMS
  - Caching
  - Master/Slave
  - Table Partitioning
  - Federated Tables
  - Sharding
NoSQL Movement
RDBMS with Extended Functionality vs. Systems Built from Scratch with Scalability in Mind
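Sharding, one of the RDBMS-extension techniques listed earlier, is also the basic idea behind many systems built for scalability: each key is routed to one of several independent stores. The sketch below is an illustration only, assuming in-memory dictionaries stand in for database servers; `shard_for` and `ShardedStore` are hypothetical names.

```python
import hashlib
from typing import Any, Dict, List, Optional


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key/value store that spreads keys over several shards."""

    def __init__(self, num_shards: int) -> None:
        # Each dict stands in for one database server.
        self.shards: List[Dict[str, Any]] = [dict() for _ in range(num_shards)]

    def put(self, key: str, value: Any) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str) -> Optional[Any]:
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

With plain modulo routing like this, changing the number of shards remaps most keys, which is why production systems often use consistent hashing or a lookup directory instead.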
CAP Theorem
“Of three properties of shared-data systems – data Consistency, system Availability and tolerance to network Partition – only two can be achieved at any given moment in time.”
- CA: Highly-available consistency
- CP: Enforced consistency
- AP: Eventual consistency
Flavors of NoSQL
Key / Value Database
- Schema-less
- State (persistent or volatile)
- Examples:
  - Redis
  - Amazon DynamoDB
Column Database
- Wide, sparse column sets
- Schema-light
- Examples:
  - Cassandra
  - HBase
  - Bigtable
  - GAE HR DS (Google App Engine High Replication Datastore)
Document Database
- Use for data that is
  - document-oriented (a collection of JSON documents) with semi-structured data
    - Encodings include XML, YAML, JSON, and BSON
  - in binary form
    - PDF, Microsoft Office documents (Word, Excel…)
- Examples: MongoDB, CouchDB
Graph Database
Use for data with
- a lot of many-to-many relationships
- a primary objective of quickly finding connections, patterns, and relationships between the objects within lots of data
- Examples: Neo4j, Freebase (Google)
So which type of NoSQL? Back to CAP…
CP = NoSQL/column: Hadoop, Bigtable, HBase, MemcacheDB
AP = NoSQL/document or key/value: DynamoDB, CouchDB, Cassandra, Voldemort
CA = SQL/RDBMS: SQL Server / SQL Azure, Oracle, MySQL
Apache Hadoop Projects
Apache Hadoop
- A framework for storing and processing petabytes of data using commodity hardware and storage
- Apache project
- Implemented in Java
- Community of contributors is growing
  - Yahoo: HDFS and MapReduce
  - Powerset: HBase
  - Facebook: Hive and the Fair Scheduler
  - IBM: Eclipse plugins
Brief History of Hadoop
Organizations Using Hadoop
Hadoop System Principles
- Scale-Out rather than Scale-Up
- Bring code to data rather than data to code
- Deal with failures: they are common
- Abstract complexity of distributed and concurrent applications
Scale-Out Instead of Scale-Up
- It is harder and more expensive to scale up
  - Add additional resources to an existing node (CPU, RAM)
  - New units must be purchased if the required resources cannot be added
  - Also known as scaling vertically
- Scale-Out
  - Add more nodes/machines to an existing distributed application
  - Software layer is designed for node additions or removal
  - Hadoop takes this approach: a set of nodes is bonded together as a single distributed system
  - Very easy to scale down as well
Code to Data
- Traditional data processing architecture
  - Nodes are split into separate processing and storage nodes connected by a high-capacity link
  - Many data-intensive applications are not CPU-demanding, so the network becomes the bottleneck
- Hadoop co-locates processors and storage
  - Code is moved to the data (code size is tiny, usually in KBs)
  - Processors execute the code and access the underlying local storage
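The "code to data" idea can be illustrated with a small scheduling sketch: a task is placed on a node that already stores its input block whenever that node has a free slot, and only otherwise on some other node. This is a simplified greedy illustration, not Hadoop's scheduler; `pick_node`, `schedule`, and the node names are hypothetical.

```python
from typing import Dict, Optional, Set


def pick_node(block_replicas: Set[str], free_slots: Dict[str, int]) -> Optional[str]:
    """Choose a node for a task, preferring nodes that store its input block."""
    local = [n for n in sorted(block_replicas) if free_slots.get(n, 0) > 0]
    if local:
        return local[0]  # data-local: no input bytes cross the network
    remote = [n for n in sorted(free_slots) if free_slots[n] > 0]
    return remote[0] if remote else None  # fall back to any free node


def schedule(tasks: Dict[str, Set[str]], free_slots: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Assign each task (name -> nodes holding its block) to a node, greedily."""
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment: Dict[str, Optional[str]] = {}
    for task in sorted(tasks):
        node = pick_node(tasks[task], slots)
        if node is not None:
            slots[node] -= 1
        assignment[task] = node
    return assignment
```

In the test data used with this sketch, two tasks want the same node; the first gets the local slot and the second falls back to a remote node, which is the case where input data has to travel over the network.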
Failures are Common
- Given a large number of machines, failures are common
  - Large warehouses may see machine failures weekly or even daily
- Hadoop is designed to cope with node failures
  - Data is replicated
  - Tasks are retried
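The "tasks are retried" point can be sketched as a loop that reruns a failed task on another node until it succeeds or an attempt limit is reached. This is a minimal illustration under the assumption that a task is a plain function taking a node name; `run_with_retries` and the limit of 4 attempts are choices made for the sketch, not values taken from the slides.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[str], T], nodes: Sequence[str], max_attempts: int = 4) -> T:
    """Run task on successive nodes until it succeeds or attempts run out."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # move on to the next node each time
        try:
            return task(node)
        except Exception as err:  # a real system would distinguish error types
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

Retrying is only safe when a task can be rerun without side effects, which is one reason map and reduce functions are written as pure transformations of their input.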
Abstract Complexity
- Hadoop abstracts many complexities in distributed and concurrent applications
  - Defines a small number of components
  - Provides simple and well-defined interfaces for interactions between these components
- Frees developers from worrying about system-level challenges
  - processing pipelines, data partitioning, code distribution
- Allows developers to focus on application development and business logic
Distribution Vendors
- Cloudera Distribution for Hadoop (CDH)
- MapR Distribution
- Hortonworks Data Platform (HDP)
- Apache Bigtop Distribution
Components
- Distributed File System
  - HDFS
- Distributed Processing Framework
  - Map/Reduce
The Storage: Hadoop Distributed File System
HDFS is Good for...
- Storing large files
  - Terabytes, petabytes, etc.
  - Millions rather than billions of files
  - 100 MB or more per file
- Streaming data
  - Write-once, read-many-times patterns
  - Optimized for streaming reads rather than random reads
- “Cheap” commodity hardware
  - No need for supercomputers; use less reliable commodity hardware
HDFS Daemons
Files and Blocks
HDFS Component Communication
REPLICA MANAGEMENT
- A common practice is to spread the nodes across multiple racks
- A good replica placement policy should improve data reliability, availability, and network bandwidth utilization
- The NameNode determines replica placement
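As an illustration of rack-aware placement, the sketch below follows the commonly described HDFS default for three replicas: the first on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. It is a simplified model, not NameNode code; `place_replicas` and the rack layout passed to it are hypothetical, and the writer is assumed to be a data node.

```python
import random
from typing import Dict, List, Optional


def place_replicas(writer: str, racks: Dict[str, List[str]],
                   rng: Optional[random.Random] = None) -> List[str]:
    """Pick three nodes: the writer, then two distinct nodes on one other rack."""
    rng = rng or random.Random()
    # Find the rack that contains the writer's node.
    local_rack = next(r for r, nodes in racks.items() if writer in nodes)
    # Candidate remote racks must be able to hold two replicas.
    remote_racks = [r for r, nodes in racks.items()
                    if r != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(racks[remote_rack], 2)
    return [writer, second, third]
```

Keeping two replicas on one remote rack means a write crosses the rack boundary only once, while the block still survives the loss of an entire rack.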
NETWORK TOPOLOGY AND HADOOP
The Execution Engine: Apache YARN
Apache YARN
YARN Components
- ResourceManager:
  - Arbitrates resources among all the applications in the system
- NodeManager:
  - The per-machine agent, responsible for launching the applications’ containers and monitoring their resource usage
- ApplicationMaster:
  - Negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors progress
- Container:
  - Unit of allocation incorporating resource elements such as memory, CPU, disk, and network, used to execute a specific task of the application (similar to map/reduce slots in MRv1)
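To make the container idea concrete, here is a toy model in which a resource manager tracks free memory and cores per node and grants a container to the first node that can hold it. This is only a sketch of the bookkeeping, not the YARN API; `Resources`, `ToyResourceManager`, and the first-fit policy are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Resources:
    memory_mb: int
    vcores: int

    def fits(self, request: "Resources") -> bool:
        """True if this free capacity can hold the requested container."""
        return self.memory_mb >= request.memory_mb and self.vcores >= request.vcores


class ToyResourceManager:
    """Tracks free capacity per node and grants containers first-fit."""

    def __init__(self, nodes: Dict[str, Resources]) -> None:
        # Copy capacities so allocations do not mutate the caller's objects.
        self.free = {name: Resources(r.memory_mb, r.vcores) for name, r in nodes.items()}

    def allocate(self, request: Resources) -> Optional[str]:
        """Reserve a container and return its node, or None if nothing fits."""
        for name, free in self.free.items():
            if free.fits(request):
                free.memory_mb -= request.memory_mb
                free.vcores -= request.vcores
                return name
        return None

    def release(self, node: str, request: Resources) -> None:
        """Return a finished container's resources to its node."""
        self.free[node].memory_mb += request.memory_mb
        self.free[node].vcores += request.vcores
```

Unlike fixed map and reduce slots, a container request carries its own size, so small and large tasks can share the same node.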
YARN Architecture
The Processing Model: MapReduce
Hadoop MapReduce Framework
What is MapReduce?
- Parallel programming model for large clusters
  - User implements Map() and Reduce()
- Parallel computing framework
  - Libraries take care of everything else
    - Parallelization
    - Fault Tolerance
    - Data Distribution
    - Load Balancing
- MapReduce library does most of the hard work for us!
- Takes care of distributed processing and coordination
  - Scheduling
  - Task Localization with Data
  - Error Handling
  - Data Synchronization
MapReduce: Data Flow
Map and Reduce
- Map()
  - Map workers read in the contents of the corresponding input partition
  - Process a key/value pair to generate intermediate key/value pairs
- Reduce()
  - Merge all intermediate values associated with the same key
    - e.g. <key, [value1, value2, ..., valueN]>
  - Output of the user's reduce function is written to an output file on the global file system
- When all tasks have completed, the master wakes up the user program
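The map, group-by-key, and reduce steps described above can be sketched in a few lines of single-process Python. This is an in-memory illustration of the programming model only; `map_reduce`, `parse_reading`, `max_reading`, and the sensor records are hypothetical and stand in for user code and input data.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

Pair = Tuple[Any, Any]


def map_reduce(records: Iterable[Pair],
               mapper: Callable[[Any, Any], Iterable[Pair]],
               reducer: Callable[[Any, List[Any]], Any]) -> Dict[Any, Any]:
    """Apply mapper to every record, group values by key, then reduce each group."""
    # Map phase: each input pair yields zero or more intermediate pairs.
    intermediate: List[Pair] = []
    for key, value in records:
        intermediate.extend(mapper(key, value))
    # Shuffle phase: build <key, [value1, ..., valueN]> groups.
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one call per distinct key.
    return {k: reducer(k, vs) for k, vs in sorted(groups.items())}


def parse_reading(offset: int, line: str) -> List[Pair]:
    """Example mapper: turn 'sensor,value' into (sensor, int(value))."""
    sensor, reading = line.split(",")
    return [(sensor, int(reading))]


def max_reading(sensor: str, readings: List[int]) -> int:
    """Example reducer: keep the largest reading per sensor."""
    return max(readings)
```

Calling `map_reduce([(0, "s1,31"), (1, "s2,28"), (2, "s1,35")], parse_reading, max_reading)` groups the two `s1` readings together before `max_reading` sees them, which is the role the shuffle plays between the two user functions.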
Distributed Processing
- Word count on a huge file
MapReduce Model
Example: Counting Words
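A minimal sketch of the word-count example, assuming the huge file has already been cut into splits (lists of lines): each split is counted independently, and the partial counts are then summed per word. Threads stand in for cluster nodes here, and `count_split` and `word_count` are hypothetical names, so this shows the shape of the computation rather than a real Hadoop job.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def count_split(lines: List[str]) -> Counter:
    """Map side: count the words in one input split."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z0-9']+", line.lower()))
    return counts


def word_count(splits: Iterable[List[str]], workers: int = 4) -> Counter:
    """Count words across all splits, processing the splits concurrently."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_split, splits):
            total.update(partial)  # reduce side: sum the counts per word
    return total
```

Counting inside each split before merging keeps the data exchanged between the two phases small, which is the same motivation as a combiner in MapReduce.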