This document discusses the rise of connected data across domains such as science, consumer, retail, industrial, sports, location, and multi-sensor applications. It notes that the information generated during the first day of a baby's life today is equivalent to 70 times the information contained in the Library of Congress. Examples show how connected data is used for genomics research, video surveillance, recommendation engines, materials research, weather forecasting, and real-time analytics. The document argues that cloud computing provides the infrastructure needed to collect, process, and collaborate on massive connected data sets.
5. The amount of information generated during the first day of a baby’s life today is equivalent to 70 times the information contained in the Library of Congress
9. Human Genome Project
Collaborative project to sequence every single letter of the human genetic code.
13 years and $billions to complete.
Gigabyte scale datasets (transferred between sites on iPods!)
10. Beyond the Human Genome
45+ species sequenced: mouse, rat, gorilla, rabbit, platypus, nematode, zebrafish...
Compare genomes between species to identify biologically interesting areas of the genome.
100 GB scale datasets. Increased computational requirements.
11. The Next Generation
New sequencing instruments lead to a dramatic drop in cost and time required to sequence a genome.
Sequence and compare genetic code of individuals to find areas of variation. Much more interesting.
Terabyte scale datasets. Significant computational requirements.
12. The 1000 Genomes Project
Public/private consortium to build world’s largest collection of human genetic variation.
Hugely important dataset to drive new insight into known genetic traits, and the identification of new ones.
Vast, complex data and computational resources required, beyond reach of most research groups and hospitals.
13. 1000 Genomes in the Cloud
The 1000 Genomes data made available to all on AWS.
Stored for free as part of the Public Datasets program.
Updated regularly.
200 TB. 1700 individual genomes. As much compute and storage as required available to all.
24. Dropcam is the biggest inbound video service on the Web
• More data uploaded per minute than YouTube
• Petabytes of data processed every month
• Billions of motion events detected
40. Who is my customer really?
What do people really like?
What is happening socially with my products?
Where do people consume my product?
How do people really use your product?
42. 75% of users select movies based on recommendations
43. More than 27 million users
~ 30 million plays per day
More than 40 billion events per day
~ 4 million ratings per day
~ 3 million searches per day
Geo-location data
Device information
Time of day and week (it now can verify that users watch more TV shows during the week and more movies during the weekend)
Metadata from third parties such as Nielsen
Social media data from Facebook and Twitter
92. What ___ right now?
• trades are executing
• is the exception rate
• is the ad click-through
• topics are trending
• inventory remains
• queries are slow
• are the high scores
95. Kinesis architecture
Amazon Web Services, across three availability zones (AZ, AZ, AZ)
• Sources: millions of sources producing 100s of terabytes per hour
• Front end: authentication, authorization
• Storage: durable, highly consistent storage replicates data across three data centers (availability zones)
• Ordered stream of events supports multiple readers
Readers:
• Aggregate and archive to S3
• Real-time dashboards and alarms
• Machine learning algorithms or sliding window analytics
• Aggregate analysis in Hadoop or a data warehouse
Inexpensive: $0.028 per million puts
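As a rough illustration of the "sliding window analytics" reader mentioned on this slide, the sketch below counts events inside a moving time window and reports when a threshold is crossed. It is plain Python and does not call Kinesis; the names `SlidingWindowCounter` and `exceeds_threshold`, the window length, and the threshold are all hypothetical, and events are assumed to arrive in timestamp order, as they would from an ordered stream.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        # Assumption: events arrive in non-decreasing timestamp order.
        self.timestamps.append(timestamp)
        # Evict events at or before the start of the window.
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()

    def count(self):
        return len(self.timestamps)


def exceeds_threshold(event_times, window_seconds, threshold):
    """Return the first timestamp at which the window holds more than
    `threshold` events, or None if that never happens."""
    counter = SlidingWindowCounter(window_seconds)
    for t in event_times:
        counter.add(t)
        if counter.count() > threshold:
            return t
    return None
```

A dashboard or alarm reader could use the same pattern, for example raising an alert when `exceeds_threshold` returns a timestamp for a stream of error events.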
96. Voting Demo: High Level Architecture
Clients load S3 Hosted Webpages using AWS JavaScript SDK
• Sentimentizer webpage hosted on S3
• Mobile Client, Tablet Client, Desktop Client
Clients PUT votes directly to Kinesis stream
• Kinesis Stream
Consumers process records from stream
• Kinesis Redshift Connector ASG
• Kinesis Client Library ASG
Persistence and long-term analysis in Redshift
• Redshift Data Warehouse
• Analytics: JasperSoft (AWS Marketplace)
Tallying and live visualization of data
• ElastiCache: Live Tally
• ElasticBeanstalk: Tallyroom App (Sinatra API)
S3 hosted webpages using JavaScript and live calls to API
• Pulse: Real Time Average of Voting Sentiment
• Tealeaves: Real time Totals of Votes Across Sentiment
• Speedo: Realtime Display of Votes Per Second
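The three live displays named on this slide (average sentiment, totals per sentiment, votes per second) can be sketched as one small tally function. This is a minimal illustration, not the demo's actual code: the function name `live_tally`, the `(timestamp_seconds, sentiment)` vote format, the numeric sentiment scores, and the 10-second rate window are all assumptions made for the example.

```python
from collections import Counter


def live_tally(votes, now, window_seconds=10):
    """Summarise votes given as (timestamp_seconds, sentiment) pairs.

    Assumption: sentiment is a numeric score, so an average is meaningful.
    """
    scores = [sentiment for _, sentiment in votes]
    # Votes that fall inside the most recent window, used for the rate.
    recent = [t for t, _ in votes if now - window_seconds < t <= now]
    return {
        # Pulse-style figure: running average of voting sentiment.
        "average_sentiment": sum(scores) / len(scores) if scores else None,
        # Tealeaves-style figure: total votes for each sentiment value.
        "totals": dict(Counter(scores)),
        # Speedo-style figure: votes per second over the recent window.
        "votes_per_second": len(recent) / window_seconds,
    }
```

In a deployment like the one outlined above, the consumers reading from the stream would update such a tally in a shared cache, and the visualization pages would poll an API that returns it.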
97. Sentimentizer Pricing
Service | Pricing | Total Cost Per Hour
Kinesis Stream | 25 shards @ 1.5 cents per shard per hour | $0.38
Kinesis messages | 24 million PUTs (all of Australia) @ 2.8 cents per million PUTs | $0.68
Kinesis Workers | 2 x m3.large | $0.40
Redshift Workers | 2 x m1.medium | $0.24
Redshift Cluster | 2 x dw1.xlarge (4 TB total) | $2.50
ElastiCache Cluster | 2 x cache.m3.xlarge | $1.02
Tallyroom App | Fully redundant deployment with ELB & 2 x m1.small | $0.15
S3 Websites (Sentimentizer, Pulse, Tea Leaves, Speedo) | Cents per GB of storage. 0.44 cents per 10,000 requests for 24 million requests. | $10.56
TOTAL | | $15.93
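The hourly figures on this slide can be checked with a few lines of arithmetic. The sketch below sums the listed line items and recomputes the three costs for which the slide states a unit price; the variable names are illustrative, and the instance prices are taken from the slide as given rather than derived.

```python
from decimal import Decimal

# Hourly line items exactly as listed on the pricing slide (USD).
line_items = {
    "Kinesis Stream": Decimal("0.38"),
    "Kinesis messages": Decimal("0.68"),
    "Kinesis Workers": Decimal("0.40"),
    "Redshift Workers": Decimal("0.24"),
    "Redshift Cluster": Decimal("2.50"),
    "ElastiCache Cluster": Decimal("1.02"),
    "Tallyroom App": Decimal("0.15"),
    "S3 Websites": Decimal("10.56"),
}
total = sum(line_items.values(), Decimal("0"))

# Unrounded costs recomputed from the unit prices stated on the slide.
shard_cost = 25 * Decimal("0.015")   # 25 shards at 1.5 cents per shard-hour
put_cost = 24 * Decimal("0.028")     # 24 million PUTs at 2.8 cents per million
s3_request_cost = (
    Decimal(24_000_000) / Decimal(10_000) * Decimal("0.0044")
)                                    # 0.44 cents per 10,000 requests

print(total, shard_cost, put_cost, s3_request_cost)
```

The line items sum to the slide's total of $15.93, and the S3 request cost comes to $10.56. The recomputed Kinesis figures are $0.375 and $0.672, so the slide's $0.38 and $0.68 appear to have been rounded up to the next cent.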