12. Big data is cool, but...
expensive clusters
hard to set up and monitor
not interactive enough
13. Data analysis as a service
Google BigQuery
javier ramirez @supercoco9 https://datawaki.com
14. javier ramirez @supercoco9 https://datawaki.com
The right-now data analytics platform
for your website, your backend, and your business
datawaki
15. The Challenge
Several thousand requests per second
From many devices/apps
Provide real-time alerts
Analyze billions of rows interactively
Extract graph information
17. Data from many sources
HTTP
Libraries available for virtually any programming language
De facto standard for inter-system communication
Easy to script from command-line tools
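Pushing a data point into the pipeline is just an HTTP POST, which any language or command-line tool can produce. A minimal sketch in Python; the endpoint path and payload fields are made up for illustration, and a throwaway local server stands in for the collector:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

received = []

class CollectorHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a metrics collection endpoint."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(204)  # accepted, no body needed
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), CollectorHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any language or CLI tool (e.g. curl) can emit this same request.
event = {"device": "sensor-42", "metric": "temp", "value": 21.5}
req = Request(
    f"http://127.0.0.1:{server.server_port}/events",
    data=json.dumps(event).encode(),
    headers={"Content-Type": "application/json"},
)
urlopen(req).close()
server.shutdown()
print(received[0]["device"])
```

The same POST is one line of curl from a shell script, which is what makes HTTP so convenient as the single ingestion protocol.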
18. Free, open-source, high-performance HTTP
server and reverse proxy
Nginx is known for its high performance,
stability, rich feature set, simple configuration,
and low resource consumption.
Used by Netflix, Hulu, Pinterest, CloudFlare, Airbnb, WordPress.com, GitHub,
SoundCloud, Zynga, Eventbrite, Zappos, Media Temple, Heroku, RightScale, Engine
Yard and MaxCDN
20. logstash: handle the log data
Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use.
It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want.
23. Redis: an open-source, BSD-licensed, advanced key-value store. It is often referred to as a data structure server, since keys can contain strings, hashes, lists, sets, and sorted sets.
http://redis.io
started in 2009 by Salvatore Sanfilippo @antirez
100+ contributors at
https://github.com/antirez/redis
javier ramirez @supercoco9 https://datawaki.com codemotion 2013
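To make the "data structure server" idea concrete, here is a toy in-memory model of a few Redis list and set commands. This is not a real client (with a real server you would use a library such as redis-py); it only illustrates the semantics, including RPUSHX, which the Twitter example later relies on:

```python
# Toy in-memory model of a few Redis commands, for illustration only.
class MiniRedis:
    def __init__(self):
        self.store = {}

    def rpush(self, key, *values):       # append to a list, creating it
        self.store.setdefault(key, []).extend(values)
        return len(self.store[key])

    def rpushx(self, key, value):        # append ONLY if the key exists
        if key not in self.store:
            return 0
        self.store[key].append(value)
        return len(self.store[key])

    def lrange(self, key, start, stop):  # inclusive range, like Redis
        end = None if stop == -1 else stop + 1
        return self.store.get(key, [])[start:end]

    def sadd(self, key, *members):       # set: returns # of new members
        s = self.store.setdefault(key, set())
        before = len(s)
        s.update(members)
        return len(s) - before

r = MiniRedis()
r.rpush("events", "signup", "login")
r.rpushx("events", "purchase")     # appends: the key exists
r.rpushx("missing", "ignored")     # no-op: the key does not exist
print(r.lrange("events", 0, -1))   # -> ['signup', 'login', 'purchase']
```

The RPUSHX no-op behavior is what lets Twitter skip fanout work for inactive users: if a timeline key is not cached, the push simply does nothing.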
24. Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining)
$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
SET: 552,028 requests per second
GET: 707,463 requests per second
LPUSH: 767,459 requests per second
LPOP: 770,119 requests per second
Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (without pipelining)
$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q
SET: 122,556 requests per second
GET: 123,601 requests per second
LPUSH: 136,752 requests per second
LPOP: 132,424 requests per second
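The 4-5x gap between the two benchmark runs comes from network round trips: without pipelining every command waits for one round trip, while `-P 16` lets 16 commands share one. A back-of-the-envelope model (the latency and per-command cost below are illustrative guesses, not measurements):

```python
# Rough model of why pipelining helps: each batch pays one round trip
# plus the server-side cost of its commands.
rtt_s = 60e-6        # assumed client<->server round-trip time
per_cmd_s = 1.5e-6   # assumed server-side cost per command

def throughput(pipeline_depth):
    """Approximate commands per second for a single connection."""
    batch_time = rtt_s + pipeline_depth * per_cmd_s
    return pipeline_depth / batch_time

print(f"no pipelining : {throughput(1):,.0f} req/s")
print(f"pipeline of 16: {throughput(16):,.0f} req/s")
```

As the pipeline deepens, the fixed round-trip cost is amortized and throughput approaches the server's raw per-command rate.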
26. What Redis is being used for
27. Twitter's timeline pipeline
write API (from browser or client app)
fanout via FlockDB: one delivery per follower
RPUSHX to Redis: user id, tweet id, metadata
user info from gizmoduck (memcached)
tweet info from tweetypie (memcached + MySQL)
result: your Twitter timeline
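The fanout step can be sketched in a few lines: on every tweet, push the tweet id onto each follower's timeline list and cap the list length (the ~800-entries-per-timeline figure appears in the notes later in this talk). The social graph and names here are toy stand-ins:

```python
# Write-time fanout sketch: one timeline insert per follower,
# trimmed to a fixed cap, mirroring RPUSHX + LTRIM on Redis lists.
from collections import defaultdict

TIMELINE_CAP = 800
timelines = defaultdict(list)            # follower id -> list of tweet ids
followers = {"alice": ["bob", "carol"]}  # toy social graph

def publish(author, tweet_id):
    for follower in followers.get(author, []):
        tl = timelines[follower]
        tl.append(tweet_id)
        del tl[:-TIMELINE_CAP]           # keep only the newest entries

publish("alice", "t1")
publish("alice", "t2")
print(timelines["bob"])   # -> ['t1', 't2']
```

This is why a celebrity tweet is expensive: one write becomes tens of millions of list inserts, one per follower.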
28. Products using Redis
Pinterest
Snapchat
World of Warcraft
GitHub
HipChat
SoundCloud
Tumblr
Booking.com
YouPorn...
31. Google BigQuery
Data analysis as a service
http://developers.google.com/bigquery
32. Based on Dremel
Specifically designed for
interactive queries over
petabytes of real-time data
33. What Dremel has been used for at Google
• Analysis of crawled web documents.
• Tracking install data for applications on Android Market.
• Crash reporting for Google products.
• OCR results from Google Books.
• Spam analysis.
• Debugging of map tiles on Google Maps.
• Tablet migrations in managed Bigtable instances.
• Results of tests run on Google’s distributed build system.
• Disk I/O statistics for hundreds of thousands of disks.
• Resource monitoring for jobs run in Google’s data centers.
• Symbols and dependencies in Google’s codebase.
42. Things you always wanted to
try but were too scared to
select count(*) from
publicdata:samples.wikipedia
where REGEXP_MATCH(title, "[0-9]*")
AND wp_namespace = 0;
223,163,387
Query complete (5.6s elapsed, 9.13 GB processed)
43. Global Database of Events,
Language and Tone
quarter billion rows
30 years
updated daily
http://gdeltproject.org/data.html#googlebigquery
44. SELECT Year, Actor1Name, Actor2Name, Count FROM (
SELECT Actor1Name, Actor2Name, Year,
COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY
Count DESC) rank
FROM
(SELECT Actor1Name, Actor2Name, Year FROM
[gdelt-bq:full.events] WHERE Actor1Name < Actor2Name
and Actor1CountryCode != '' and Actor2CountryCode != ''
and Actor1CountryCode!=Actor2CountryCode),
(SELECT Actor2Name Actor1Name, Actor1Name Actor2Name,
Year FROM [gdelt-bq:full.events] WHERE
Actor1Name > Actor2Name and Actor1CountryCode != '' and
Actor2CountryCode != '' and
Actor1CountryCode!=Actor2CountryCode)
WHERE Actor1Name IS NOT null
AND Actor2Name IS NOT null
GROUP EACH BY 1, 2, 3
HAVING Count > 100
)
WHERE rank=1
ORDER BY Year
46. BigQuery pricing
$20 per stored TB
1,000,000 rows => $0.004 / month
$5 per processed TB
1 full scan (1MM rows) ~ 200 MB
1 count = 0 MB
1 full scan over 1 column ~ 15 MB
* the 1st TB processed every month is free of charge
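The slide's numbers check out if you assume ~200 bytes per row (that figure is implied by "1 full scan (1MM rows) ~ 200 MB"): a million rows is about 200 MB, and 200 MB stored at $20 per TB-month is $0.004. A quick verification:

```python
# Verifying the pricing arithmetic on the slide above.
TB = 10**12  # BigQuery bills decimal terabytes

rows = 1_000_000
bytes_per_row = 200                    # implied by "1MM rows ~ 200 MB"
stored_bytes = rows * bytes_per_row    # 200 MB

storage_cost = stored_bytes / TB * 20  # $20 per stored TB per month
scan_cost = stored_bytes / TB * 5      # $5 per processed TB, before free tier

print(f"storage: ${storage_cost:.4f}/month")  # -> $0.0040/month
print(f"1 full scan: ${scan_cost:.5f}")
```

Note that you pay for bytes processed, not rows returned, which is why scanning a single narrow column is so much cheaper than a full-table scan.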
48. Neo4j is a high performance graph store with all the
features expected of a mature and robust database, like a
friendly query language and ACID transactions.
The programmer works with a flexible network structure of
nodes and relationships rather than static tables, yet
enjoys all the benefits of an enterprise-quality database.
For many applications, Neo4j offers orders of magnitude
performance benefits compared to relational DBs.
49. How we are using Neo4j
Define data flows (funnels) for users or devices
Check whether incoming data points are part of a funnel
Store the BigQuery ID on the graph so we can cross-reference analytical queries with data flows
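The funnel check described above boils down to: a funnel is an ordered list of steps, and a device's event stream matches if those steps occur in order. A minimal sketch; the function name, funnel, and event names are illustrative, not the real datawaki schema:

```python
# Subsequence check: do the funnel steps appear, in order,
# anywhere within the event stream?
def matches_funnel(events, funnel):
    """True if every funnel step occurs, in order, in the events."""
    it = iter(events)
    # 'step in it' consumes the iterator up to the first match,
    # so successive steps must appear in order.
    return all(step in it for step in funnel)

signup_funnel = ["visit", "signup", "purchase"]
stream = ["visit", "browse", "signup", "browse", "purchase"]
print(matches_funnel(stream, signup_funnel))                 # -> True
print(matches_funnel(["visit", "purchase"], signup_funnel))  # -> False
```

In the graph database the same question becomes a path query over nodes and relationships, which is what makes Neo4j a natural fit for this part of the system.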
54. Cost of a minimum system
Nginx $5 per server
Logstash $10 per server
Redis $5
Ruby workers $5 per server
BigQuery $5 per 500MM rows
Neo4j $10 per server
Rails $5 per server
total: $45 / month + backups
No one doubts that your API is technically very good, but...
obvious conclusion
this is going to be a big data problem
The problem was that we didn't know big data. Map/reduce, Hadoop, Cassandra rang a bell... but we lacked the background.
master-slave, transactions, atomicity/concurrency
We keep a list into which we insert one entry per API operation
NEXT: it is fast because it runs in memory
We didn't have to think much here, because we were already using Redis for several things in the system, precisely because it supports many operations with very little overhead
but we can configure persistence and redundancy
intermediate storage
cache
index
NEXT: Twitter
Katy Perry, 45MM
400 million tweets per day
4600 tweets per second
30 billion timeline deliveries per day
300K queries per second
7000 tweets per second at peak times
12000 tweets per second at events
143,199 tweets/sec during the Castle in the Sky TV broadcast
800 tweets history per timeline on redis
44% of Twitter accounts have never posted a tweet
each tweet is replicated 3 times
2 terabytes of RAM for the Redis cluster
Katy Perry: 67 million inserts (Justin Bieber: 56MM)
t-bird holds the tweets in MySQL
gizmoduck: all users, in memcached
tweetypie: all tweets from the last 45 days
Snapchat: 400MM daily
YouPorn: 200MM daily
Because the Redis protocol is very simple, Redis can be accessed from any programming language, and since web servers support scripting...
you can set the info directly from the web server, so if you have several backends (Rails, Node...) you can centralize all your logging into a single layer
Not for analytics!
Everything on memory!
So far we have solved the Velocity part of big data, and a bit of the Veracity, but we need more
Apache Drill is the open-source equivalent. It does not run as a service.
BigQuery is a REST wrapper on top of Dremel, usable from any platform that can speak REST. APIs are available for several languages.
Insertions only! No deletes or updates. Often combined with Map/Reduce or Hadoop. In-place analysis: no prior loading, no indexes, no need to plan queries in advance.
full scan!
a typical Solid State Disk reads at 550MBytes/second
The public enemy of data scientist/interactive queries
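The arithmetic behind that claim: at a single SSD's ~550 MB/s, a sequential full scan of 1 TB takes about half an hour, which is nowhere near interactive. Dremel gets sub-10-second latencies by spreading the scan across thousands of disks through its tree of workers:

```python
# How long does a 1 TB full scan take on one SSD at ~550 MB/s?
ssd_mb_per_s = 550
tb_in_mb = 1_000_000

seconds = tb_in_mb / ssd_mb_per_s
print(f"1 TB on one SSD: ~{seconds / 60:.0f} minutes")  # -> ~30 minutes
```

Dividing that work over a few thousand disks brings the same scan down to seconds, which is the whole trick behind interactive queries over petabytes.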
Column data is of uniform type; therefore, there are some opportunities for storage size optimizations available in column-oriented data that are not available in row-oriented data.
also less I/O
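The I/O saving is easy to picture: to scan a single column, a column store reads only that column's bytes, while a row store must read every row in full. The row and column sizes below are illustrative, not real table statistics:

```python
# Bytes scanned for a single-column query: row store vs column store.
rows = 1_000_000
columns = {"title": 50, "views": 8, "lang": 2}  # assumed avg bytes per value

row_store_scan = rows * sum(columns.values())   # reads whole rows
column_store_scan = rows * columns["views"]     # reads just one column

print(f"row store:    {row_store_scan / 1e6:.0f} MB")    # -> 60 MB
print(f"column store: {column_store_scan / 1e6:.0f} MB") # -> 8 MB
```

This is also why the pricing slide earlier shows a full scan over one column at ~15 MB versus ~200 MB for the whole table: BigQuery only bills the columns a query touches.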
Dremel also provides a tree structure for dispatching the queries
batch and real time both for data input (files or streaming) and for output (interactive or batch)
you pay for what you use
read only!!!
web console
api rest
command line
Note the Validate button, which helps avoid unnecessary charges
next: full scan regexp
total
313,797,035
review!
Open source
Open source
Open source
You can combine services on a single server at first