Big Data & APIs
A recon tour on how to successfully do Big Data
More events, users
Facebook users post 4.5 billion
items a day (as of Sep 2013)
Facebook MAU 1.2 billion
(as of Sep 2013)
More messages, transactions
WhatsApp
From 0 to 31 billion
messages sent daily
(as of Aug 2013)
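// Fan out every post to each friend of its author.
// post.stream, getUser, getData, getFriends and notifyFriend are placeholder names.
for {
  x      <- post.stream        // every incoming post
  user    = getUser(x)         // author of the post
  message = getData(x)         // payload of the post
  friend <- getFriends(user)   // one iteration per friend
} yield notifyFriend(friend, user, message.id)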
1 billion posts a day!
Example: Notify all my friends
Pleasingly parallel problems
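The notification example is pleasingly parallel because each post can be handled without looking at any other post. A minimal Scala sketch of that idea, assuming a hypothetical in-memory friend graph and the made-up types Post and Notification (none of these come from the slides):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object NotifyFriends {
  // Hypothetical data model; the slides do not define these types.
  final case class Post(id: Long, author: String, text: String)
  final case class Notification(friend: String, author: String, postId: Long)

  // Hypothetical in-memory friend graph standing in for getFriends.
  val friends: Map[String, List[String]] = Map(
    "alice" -> List("bob", "carol"),
    "bob"   -> List("alice")
  )

  def getFriends(user: String): List[String] = friends.getOrElse(user, Nil)

  // One post is turned into notifications without touching any other post.
  def notifyAll(post: Post): List[Notification] =
    getFriends(post.author).map(f => Notification(f, post.author, post.id))

  // Because posts are independent, each one can run as its own task.
  def fanOut(posts: List[Post]): List[Notification] = {
    val tasks = posts.map(p => Future(notifyAll(p)))
    Await.result(Future.sequence(tasks), 10.seconds).flatten
  }

  def main(args: Array[String]): Unit = {
    val posts = List(Post(1L, "alice", "hello"), Post(2L, "bob", "hi"))
    fanOut(posts).foreach(println)
  }
}
```

No task shares state with another, so adding workers should scale the throughput until the friend lookup itself becomes the bottleneck.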
News filtering
This is a tougher problem.
You cannot read
all that stuff !!!
News filtering:
“a machine feeds you what to read”
// Same fan-out, now filtered by the context gathered for each (post, friend) pair.
for {
  x      <- post.stream
  user    = getUser(x)
  message = getData(x)
  friend <- getFriends(user)
  hustle  = getFriendNonsense(friend)
  weather = getWeather(user)
  mood    = getMood(user)
  vibe    = getMood(friend)
  topics  = getTrendingTopics(friend)
  market  = getChart(Symbol("gold"), Symbol("bigmac"))
  interesting = hal9000(hustle, weather, mood, vibe, topics, market)
  if interesting               // keep only the pairs the model scores as relevant
} yield notifyFriend(friend, user, message.id)
1 billion posts a day!
Notify only those who care.
The context is
much bigger now.
Dealing with context
Machine learning to the rescue
The problem
Constraints
Data science: random forests
from bigml.com
Solve a
classification
problem
Millions of features.
Millions of users and
preferences.
Very large
sparse matrix!
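A dense matrix of a million users by a million features has 10^12 cells, which is about 8 TB at 8 bytes per cell. Storing only the non-zero cells avoids that cost. A minimal sketch, assuming a hypothetical SparseMatrix type backed by a plain Map (an illustration, not how any particular library stores its data):

```scala
object SparseSketch {
  // Sparse user-by-feature matrix: only non-zero cells are stored.
  final case class SparseMatrix(rows: Long, cols: Long, cells: Map[(Long, Long), Double]) {
    // Missing cells are read back as zero.
    def apply(r: Long, c: Long): Double = cells.getOrElse((r, c), 0.0)
    def nonZeros: Int = cells.size
    // Fraction of cells that actually hold a value.
    def density: Double = nonZeros.toDouble / (rows.toDouble * cols.toDouble)
  }

  def main(args: Array[String]): Unit = {
    val m = SparseMatrix(1000000L, 1000000L, Map((0L, 42L) -> 1.0, (7L, 3L) -> 0.5))
    println(m(0L, 42L))   // a stored cell
    println(m(1L, 1L))    // an absent cell, read as zero
    println(m.density)
  }
}
```

Memory use grows with the number of observed preferences rather than with users times features.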
Data science: Time series prediction
Extract features.
Correlate time
series
Very large sparse
matrix!
RAM: 100 terabytes, disk: 100 petabytes, CPU: 100 teraflops
Bummer.
Why?
Nature went that way too.
Ain’t that funny?
“Evolving to multi cellular
organisms”
More resilient:
cells die, the organism lives on
Complex tasks:
cannot be handled by a single cell
Distributed parallel problems
A few distributed computing paradigms
MPI, supercomputing, layered memory architecture, locking, ACID
homogeneous, simpler model
heterogeneous, actor model, state-machine
The Map Reduce computing
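The three phases of Map-Reduce can be sketched on local Scala collections. This is a single-machine word count, assuming the hypothetical function names mapPhase, shuffle and reducePhase; it does not use the Hadoop API, where the framework performs the shuffle across nodes:

```scala
object MapReduceSketch {
  // Map phase: emit one (key, value) pair per word.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle phase: group all values that share a key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Reduce phase: fold each key's values into a single result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (k, vs) => k -> vs.sum }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))

  def main(args: Array[String]): Unit =
    wordCount(Seq("big data", "big APIs")).toSeq.sortBy(_._1).foreach(println)
}
```

The map and reduce steps hold no shared state, so each can be spread over many machines; the shuffle is the step that moves data across the network.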
Map-Reduce: How well are we doing?
CAP theorem: 12 years later
The CAP theorem is largely misunderstood.
High Availability
A system can be up, but not available
(think of a network outage: your system is in P mode)
How to improve it:
Replication / Redundancy:
3 or 5 replicas are common in highly available systems
Dynamic commissioning and decommissioning:
re-balance the cluster for dead/new nodes
Tuning CAP: understand your use cases
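One common way to tune this trade-off in replicated stores is to choose how many of the N replicas must answer a read (R) and acknowledge a write (W). When R + W > N, every read set overlaps the latest acknowledged write set. The sketch below only does that arithmetic; the names quorum, overlaps and tolerated are hypothetical and not part of any datastore's API:

```scala
object QuorumSketch {
  // Majority quorum for n replicas.
  def quorum(n: Int): Int = n / 2 + 1

  // True when any r replicas read must include one of the w replicas written.
  def overlaps(n: Int, r: Int, w: Int): Boolean = r + w > n

  // Replicas that may be down while an operation needing `needed` replies still succeeds.
  def tolerated(n: Int, needed: Int): Int = n - needed

  def main(args: Array[String]): Unit =
    for (n <- List(3, 5)) {
      val q = quorum(n)
      println(s"n=$n quorum=$q overlaps=${overlaps(n, q, q)} tolerated=${tolerated(n, q)}")
    }
}
```

Lower R or W values favor availability and latency; higher values favor consistency, which is why the use case should drive the choice.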
Hadoop Distributed FS
Hadoop Distributed Run-Time (Map-Reduce)
Hive (DB) Python R
Cassandra (distributed low-latency datastore)
Akka (web server, in-memory runtime)
A proven stack today: Functional
Hadoop Distributed FS
Hadoop Distributed Run-Time (Map-Reduce)
Hive (DB) Python R
Cassandra (distributed low-latency datastore)
Akka (web server, in-memory runtime)
A proven stack today: Monitoring-Logging
Atmos
DataStax
OpsCenter
Hue
Ambari
Ganglia
Elasticsearch
Logstash
Kibana
Marvel
Everything Distributed
Latency tradeoffs
Hmm, that's a complex system.
How to manage it?
lazily evaluated
scheduled
APIs are everywhere.
Thanks

Big Data and APIs - a recon tour on how to successfully do Big Data analytics