SlideShare a Scribd company logo
1 of 42
Introduction to
Big Data
AHMED SHOUMAN
Our agenda
 Demystify the term "Big Data"
 Find out what is Hadoop
 Explore the realms of batch and real-time big data processing
 Explore challenges of size, speed and scale in databases
 Skim the surface of big-data technologies
 Provide ways into the big-data world
Big Data
Demystified
What is big data?
 Big data is a collective term for a set technologies designed
for storage, querying and analysis of extremely large data sets,
sources and volumes.
 Big data technologies come in where traditional off-the-shelf
databases, data warehousing systems and analysis tools fall
short.
How did we end up with so much data?
 Data Generation: Human (Internal) ↦ Human (Social) ↦ Machine
 Data Processing: Single Core ↦ Multi-Core ↦ Cluster / Cloud
 An Important Side Note
Big Data technologies are based on the concept of clustering - Many computers
working in sync to process chunks of our data.
Not just size
 Big data isn't just about data size, but also about data volume,
diversity and inter-connectedness.
Big data is
 Any attribute of our data that challenges either technological capabilities
or business needs, like:
 Scaling, moving, storage and retrieval of ever-growing generated data
 Processing many small data points in real-time
 Analysing diverse semi-structured data from multiple sources
 Querying multiple, diverse data sources in real-time
Breath... Let's recap
 Lot's of data due to technological capabilities and social paradigms
 Not just size! Diversity, volume and inter-connectedness also count
 Scale, speed, processing, querying and analysis
 Challenges technological capabilities or business needs
Hadoop The Elephant in the Room
Everyone talks about Hadoop
 Hadoop is a powerful platform for batch analysis of large
volumes of both structured and unstructured data.
From: Conquering Hadoop with Haskell
Hadoop explained
 Hadoop is a horizontally scalable, fault-tolerant, open-source file system
and batch-analysis platform capable of processing large amounts of data.
 HDFS - Hadoop File System
 M/R - Hadoop Map-Reduce platform
Hadoop explained
 HDFS is an ever-growing file system. We can store lots and
lots of data on it for later use.
 HDFS is used as the underlying platform for other
technologies likeHadoop M/R, Apache Mahout or HBase.
Hadoop explained
 Imagine we want to look at 30 days worth of access logs to identify site
usage patterns at a volume of 30M log entries per day.
 Hadoop M/R is a platform that allows us to query HDFS data in parallel for
the purpose of batch (offline) data processing and analysis.
Why is Hadoop so important?
 Scalable and fault-tolerant
 Handles massive amounts of data
 Truly parallel processing
 Data can be semi-structured or unstructured (schemaless)
 Serves as basis for other technologies (Hbase, Mahout, Impala, Shark)
Hadoop - Words of caution
 Complex
 Not for real-time
 Choose a distribution (Cloudera, HW, MapR) for better interoperability
 Requires trained DevOps for day-to-day operations
Breath....
 We demystified the term Big Data and glimpsed at Hadoop. Now What?
 How do I really get into the Big Data world?
The world of big data
 Batch & Data Science
 DBs
 Real-Time
Batch Processing
Hadoop M/R
Batch processing of large data sets
 We collect data for the purpose of providing end-users with better
experience in our business domain. This means we have to constantly
query our data and divine new insights and relevant information.
 The problem is doing that in very large scales is a painful, slow challenge.
How do we do this on Hadoop data?
Source: https://cwiki.apache.org/confluence/display/Hive/Tutorial
Batch processing of large data sets
 Hadoop gives us the basic tools for large data processing in
the form of M/R.
However, Hadoop M/R is pretty annoying to work with
directly as it lacks a lot of relevant tools for the job (statistical
analysis, machine learning etc.)
Source: http://xiaochongzhang.me/blog/?p=338
Hadoop querying and data science
tools
 Tool Purpose
 Hive Write SQL-like M/R queries on top of Hadoop
 Shark Hive-compatible, distributed SQL query engine for Hadoop
 Pig Write scripted M/R queries on top of Hadoop
 Impala Real-time SQL-like queries of Hadoop
 Mahout Scalable machine-learning on top of Hadoop M/R
The gentle way in
 Hive or Shark are a great place to start due to their SQL-like nature
 Shark is faster than Hive - less frustration
 You need some Hadoop data to work with (consider Avro)
 Remember - it's SQL-like, not SQL
 Start small, locally and grow to production later
 Check out Apache Sqoop for moving processed Hadoop data to your DB
Databases In the big data world
Databases in the big data world
 The Problem: Traditional RDBMS were not designed for storing, indexing
and querying growing amounts and volumes of data.
 The 3S Challenge:
 Size - How much data is written and read
 Speed - How fast can we write and read data
 Scale - How easily can our DB scale to accommodate more data
The 3S Challenge
 There's no single, simple solution to the 3S challenge. Instead,
solutions focus on making an informed sacrifice in one area in
order to gain in another area.
NoSQL and C.A.P.
 NoSQL is a term referring to a family of DBMS that attempt to resolve the
3S challenge by sacrificing one of three areas:
 Consistency - All clients have the same view of data
 Availability - Each client can always read and write
 Partition Tolerance - System works despite physical network failures
NoSQL and C.A.P.
 C.A.P. means you have to make an informed choice (and sacrifice)
 No single perfect solution
 Opt for mixed solutions per use-case
 Remember we're talking about read/write volume, not just size
Confused? Let's take a breath and focus
OK, so where do I go from here?
 Identify your needs and limitations
 Choose a few candidates
 Research & Prototype
 Read about NewSQL - VoltDB, InfiniDB, MariaDB, HyperDex, FoundationDB
(omitted due to time constraints).
Real-Time Big Data Now!
Real-Time big data processing
 Processing big data in real-time is about data volumes rather than just size.
For example, given a rate of 100K ops/sec, how do I do the following in
real-time?:
 Find anomalies in a data stream (spam)
 Group check-ins by geo
 Identify trending pages / topics
Hadoop isn't for real-time processing
 When it comes to data processing and analysis, Hadoop's M/R framework
is wonderful for batch (offline) processing.
 However, processing, analysing and querying Hadoop data in real-time is
quite difficult.
Apache Storm and Apache Spark
 Apache Storm and Apache Spark are two frameworks for large-scale,
distributed data processing in real-time.
 One could say that both Storm and Spark are for real-time data processing
what is Hadoop M/R for batch data processing.
Apache Storm - Highlights
 Runs on the JVM (Clojure / Java mix)
 Fully distributed and fault-tolerant
 Highly-scalable and extremely fast
 Interoperability with popular languages (Scala, Python etc.)
 Mature and production ready
 Hadoop interoperability via Storm-YARN
 Stateless / Non-Persistent (Data brought to processors)
Apache Spark - Highlights
 Fully distributed and extremely fast
 Write applications in Java Scala and Python
 Perfect for both batch and real-time
 Combine Hadoop SQL (Shark), Machine Learning and Data streaming
 Native Hadoop interoperability
 HDFS, HBase, Cassandra, Flume as data sources
 Stateful / Persistent (Processors brought to data)
Storm & Spark - Use Cases
 Continuous/Cyclic Computation
 Real-time analytics
 Machine Learning (eg. recommendations, personalisation)
 Graph Processing (eg. social networks) - Only Spark
 Data Warehouse ETL (Extract, Transform, Load)
Recap
Term Purpose
 Big Data Collective term for data-processing solutions at scale
 Hadoop Scalable file-system and batch processing platform
 Batch Processing Sifting and analysing data offline / in background
 M/R Parallel, batch data-processing algorithm
 3S Challenge Size, Speed, Scale of DBs
 C.A.P Consistency, Availability, Partition Tolerance
 NoSQL Family of DBMS that grew due to the 3S Challenge
 NewSQL Family of DBMS that provide ACID at scale
Questions?!
Feel free to drop my a line:
Email: ahmed.sayed.shouman@gmail.com

More Related Content

What's hot

Big Data Storage Challenges and Solutions
Big Data Storage Challenges and SolutionsBig Data Storage Challenges and Solutions
Big Data Storage Challenges and Solutions
WSO2
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
tipanagiriharika
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Simplilearn
 

What's hot (20)

Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
Big Data
Big DataBig Data
Big Data
 
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 
Big Data Storage Challenges and Solutions
Big Data Storage Challenges and SolutionsBig Data Storage Challenges and Solutions
Big Data Storage Challenges and Solutions
 
Big data
Big dataBig data
Big data
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture Capabilities
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 

Viewers also liked

Big data
Big dataBig data
Big data
hsn99
 
How to implement hadoop successfuly
How to implement hadoop successfulyHow to implement hadoop successfuly
How to implement hadoop successfuly
Adir Sharabi
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
Raul Chong
 
Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
University of Hertfordshire
 

Viewers also liked (20)

Big data
Big dataBig data
Big data
 
Big Data in Distributed Analytics,Cybersecurity And Digital Forensics
Big Data in Distributed Analytics,Cybersecurity And Digital ForensicsBig Data in Distributed Analytics,Cybersecurity And Digital Forensics
Big Data in Distributed Analytics,Cybersecurity And Digital Forensics
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation
 
How to implement hadoop successfuly
How to implement hadoop successfulyHow to implement hadoop successfuly
How to implement hadoop successfuly
 
Extreme Salesforce Data Volumes Webinar (with Speaker Notes)
Extreme Salesforce Data Volumes Webinar (with Speaker Notes)Extreme Salesforce Data Volumes Webinar (with Speaker Notes)
Extreme Salesforce Data Volumes Webinar (with Speaker Notes)
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUB
 
High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)
 
Privacy in the Age of Big Data
Privacy in the Age of Big DataPrivacy in the Age of Big Data
Privacy in the Age of Big Data
 
High-level languages for Big Data Analytics (Presentation)
High-level languages for Big Data Analytics (Presentation)High-level languages for Big Data Analytics (Presentation)
High-level languages for Big Data Analytics (Presentation)
 
Big Data World
Big Data WorldBig Data World
Big Data World
 
100 sql queries
100 sql queries100 sql queries
100 sql queries
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysis
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
 
Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
 
In-Memory Database Platform for Big Data
In-Memory Database Platform for Big DataIn-Memory Database Platform for Big Data
In-Memory Database Platform for Big Data
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to Big Data Concepts

Similar to Big Data Concepts (20)

Hadoop
HadoopHadoop
Hadoop
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
finap ppt conference.pptx
finap ppt conference.pptxfinap ppt conference.pptx
finap ppt conference.pptx
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Big data
Big dataBig data
Big data
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Big data with java
Big data with javaBig data with java
Big data with java
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoop
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 

More from Ahmed Salman (10)

IBM Netezza
IBM NetezzaIBM Netezza
IBM Netezza
 
DR_PRESENT 1
DR_PRESENT 1DR_PRESENT 1
DR_PRESENT 1
 
Faas__Food_as_a_Service__project
Faas__Food_as_a_Service__projectFaas__Food_as_a_Service__project
Faas__Food_as_a_Service__project
 
Project_Overview_-_final
Project_Overview_-_finalProject_Overview_-_final
Project_Overview_-_final
 
Cloudera
ClouderaCloudera
Cloudera
 
TECRM 20 Presentation
TECRM 20 PresentationTECRM 20 Presentation
TECRM 20 Presentation
 
TCRM10 Pesentation
TCRM10 PesentationTCRM10 Pesentation
TCRM10 Pesentation
 
Introduction to Dig Data& Hadoop
Introduction to Dig Data& HadoopIntroduction to Dig Data& Hadoop
Introduction to Dig Data& Hadoop
 
BigData HUB Workshop
BigData HUB WorkshopBigData HUB Workshop
BigData HUB Workshop
 
Hadoop Installation
Hadoop InstallationHadoop Installation
Hadoop Installation
 

Recently uploaded

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 

Recently uploaded (20)

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 

Big Data Concepts

  • 2. Our agenda  Demystify the term "Big Data"  Find out what is Hadoop  Explore the realms of batch and real-time big data processing  Explore challenges of size, speed and scale in databases  Skim the surface of big-data technologies  Provide ways into the big-data world
  • 4. What is big data?  Big data is a collective term for a set technologies designed for storage, querying and analysis of extremely large data sets, sources and volumes.  Big data technologies come in where traditional off-the-shelf databases, data warehousing systems and analysis tools fall short.
  • 5. How did we end up with so much data?  Data Generation: Human (Internal) ↦ Human (Social) ↦ Machine  Data Processing: Single Core ↦ Multi-Core ↦ Cluster / Cloud  An Important Side Note Big Data technologies are based on the concept of clustering - Many computers working in sync to process chunks of our data.
  • 6. Not just size  Big data isn't just about data size, but also about data volume, diversity and inter-connectedness.
  • 7. Big data is  Any attribute of our data that challenges either technological capabilities or business needs, like:  Scaling, moving, storage and retrieval of ever-growing generated data  Processing many small data points in real-time  Analysing diverse semi-structured data from multiple sources  Querying multiple, diverse data sources in real-time
  • 8. Breath... Let's recap  Lot's of data due to technological capabilities and social paradigms  Not just size! Diversity, volume and inter-connectedness also count  Scale, speed, processing, querying and analysis  Challenges technological capabilities or business needs
  • 10. Everyone talks about Hadoop  Hadoop is a powerful platform for batch analysis of large volumes of both structured and unstructured data. From: Conquering Hadoop with Haskell
  • 11. Hadoop explained  Hadoop is a horizontally scalable, fault-tolerant, open-source file system and batch-analysis platform capable of processing large amounts of data.  HDFS - Hadoop File System  M/R - Hadoop Map-Reduce platform
  • 12. Hadoop explained  HDFS is an ever-growing file system. We can store lots and lots of data on it for later use.  HDFS is used as the underlying platform for other technologies likeHadoop M/R, Apache Mahout or HBase.
  • 13. Hadoop explained  Imagine we want to look at 30 days worth of access logs to identify site usage patterns at a volume of 30M log entries per day.  Hadoop M/R is a platform that allows us to query HDFS data in parallel for the purpose of batch (offline) data processing and analysis.
  • 14. Why is Hadoop so important?  Scalable and fault-tolerant  Handles massive amounts of data  Truly parallel processing  Data can be semi-structured or unstructured (schemaless)  Serves as basis for other technologies (Hbase, Mahout, Impala, Shark)
  • 15. Hadoop - Words of caution  Complex  Not for real-time  Choose a distribution (Cloudera, HW, MapR) for better interoperability  Requires trained DevOps for day-to-day operations
  • 16. Breath....  We demystified the term Big Data and glimpsed at Hadoop. Now What?  How do I really get into the Big Data world?
  • 17. The world of big data  Batch & Data Science  DBs  Real-Time
  • 19. Batch processing of large data sets  We collect data for the purpose of providing end-users with better experience in our business domain. This means we have to constantly query our data and divine new insights and relevant information.  The problem is doing that in very large scales is a painful, slow challenge.
  • 20. How do we do this on Hadoop data? Source: https://cwiki.apache.org/confluence/display/Hive/Tutorial
  • 21. Batch processing of large data sets  Hadoop gives us the basic tools for large data processing in the form of M/R. However, Hadoop M/R is pretty annoying to work with directly as it lacks a lot of relevant tools for the job (statistical analysis, machine learning etc.)
  • 23. Hadoop querying and data science tools  Tool Purpose  Hive Write SQL-like M/R queries on top of Hadoop  Shark Hive-compatible, distributed SQL query engine for Hadoop  Pig Write scripted M/R queries on top of Hadoop  Impala Real-time SQL-like queries of Hadoop  Mahout Scalable machine-learning on top of Hadoop M/R
  • 24. The gentle way in  Hive or Shark are a great place to start due to their SQL-like nature  Shark is faster than Hive - less frustration  You need some Hadoop data to work with (consider Avro)  Remember - it's SQL-like, not SQL  Start small, locally and grow to production later  Check out Apache Sqoop for moving processed Hadoop data to your DB
  • 25. Databases In the big data world
  • 26. Databases in the big data world  The Problem: Traditional RDBMS were not designed for storing, indexing and querying growing amounts and volumes of data.  The 3S Challenge:  Size - How much data is written and read  Speed - How fast can we write and read data  Scale - How easily can our DB scale to accommodate more data
  • 27. The 3S Challenge  There's no single, simple solution to the 3S challenge. Instead, solutions focus on making an informed sacrifice in one area in order to gain in another area.
  • 28. NoSQL and C.A.P.  NoSQL is a term referring to a family of DBMS that attempt to resolve the 3S challenge by sacrificing one of three areas:  Consistency - All clients have the same view of data  Availability - Each client can always read and write  Partition Tolerance - System works despite physical network failures
  • 29. NoSQL and C.A.P.  C.A.P. means you have to make an informed choice (and sacrifice)  No single perfect solution  Opt for mixed solutions per use-case  Remember we're talking about read/write volume, not just size
  • 30. Confused? Let's take a breath and focus
  • 31.
  • 32. OK, so where do I go from here?  Identify your needs and limitations  Choose a few candidates  Research & Prototype  Read about NewSQL - VoltDB, InfiniDB, MariaDB, HyperDex, FoundationDB (omitted due to time constraints).
  • 34. Real-Time big data processing  Processing big data in real-time is about data volumes rather than just size. For example, given a rate of 100K ops/sec, how do I do the following in real-time?:  Find anomalies in a data stream (spam)  Group check-ins by geo  Identify trending pages / topics
  • 35. Hadoop isn't for real-time processing  When it comes to data processing and analysis, Hadoop's M/R framework is wonderful for batch (offline) processing.  However, processing, analysing and querying Hadoop data in real-time is quite difficult.
  • 36. Apache Storm and Apache Spark  Apache Storm and Apache Spark are two frameworks for large-scale, distributed data processing in real-time.  One could say that both Storm and Spark are for real-time data processing what is Hadoop M/R for batch data processing.
  • 37. Apache Storm - Highlights  Runs on the JVM (Clojure / Java mix)  Fully distributed and fault-tolerant  Highly-scalable and extremely fast  Interoperability with popular languages (Scala, Python etc.)  Mature and production ready  Hadoop interoperability via Storm-YARN  Stateless / Non-Persistent (Data brought to processors)
  • 38. Apache Spark - Highlights  Fully distributed and extremely fast  Write applications in Java Scala and Python  Perfect for both batch and real-time  Combine Hadoop SQL (Shark), Machine Learning and Data streaming  Native Hadoop interoperability  HDFS, HBase, Cassandra, Flume as data sources  Stateful / Persistent (Processors brought to data)
  • 39. Storm & Spark - Use Cases  Continuous/Cyclic Computation  Real-time analytics  Machine Learning (eg. recommendations, personalisation)  Graph Processing (eg. social networks) - Only Spark  Data Warehouse ETL (Extract, Transform, Load)
  • 41. Term Purpose  Big Data Collective term for data-processing solutions at scale  Hadoop Scalable file-system and batch processing platform  Batch Processing Sifting and analysing data offline / in background  M/R Parallel, batch data-processing algorithm  3S Challenge Size, Speed, Scale of DBs  C.A.P Consistency, Availability, Partition Tolerance  NoSQL Family of DBMS that grew due to the 3S Challenge  NewSQL Family of DBMS that provide ACID at scale
  • 42. Questions?! Feel free to drop my a line: Email: ahmed.sayed.shouman@gmail.com