12. Big data is cool, but...
expensive clusters
hard to set up and monitor
not interactive enough
13. Data analysis as a service
Google BigQuery
javier ramirez @supercoco9 https://datawaki.com
14. javier ramirez @supercoco9 https://datawaki.com
The right-now data analytics platform
for your website, your backend, and your business
datawaki
15. The Challenge
Several thousand requests per second
From many devices/apps
Provide real-time alerts
Analyze billions of rows interactively
Extract graph information
17. Data from many sources
HTTP
Libraries available for virtually any programming language
De facto standard for inter-system communication
Easy to script from command-line tools
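Pushing a data point into the pipeline is just an HTTP POST, which any language or command-line tool can produce. A minimal sketch in Python; the endpoint path and payload fields are made up for illustration, and a throwaway local server stands in for the collector:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

received = []

class CollectorHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a metrics collection endpoint."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(204)  # accepted, no body needed
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), CollectorHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any language or CLI tool (e.g. curl) can emit this same request.
event = {"device": "sensor-42", "metric": "temp", "value": 21.5}
req = Request(
    f"http://127.0.0.1:{server.server_port}/events",
    data=json.dumps(event).encode(),
    headers={"Content-Type": "application/json"},
)
urlopen(req).close()
server.shutdown()
print(received[0]["device"])
```

The same POST is one line of curl from a shell script, which is what makes HTTP so convenient as the single ingestion protocol.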
18. Free, open-source, high-performance HTTP
server and reverse proxy
Nginx is known for its high performance,
stability, rich feature set, simple configuration,
and low resource consumption.
Used by Netflix, Hulu, Pinterest, CloudFlare, Airbnb, WordPress.com, GitHub,
SoundCloud, Zynga, Eventbrite, Zappos, Media Temple, Heroku, RightScale, Engine
Yard and MaxCDN
20. logstash: handle the log data
Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use.
It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want.
23. Redis: an open-source, BSD-licensed, advanced key-value store. It is often referred to as a data structure server, since keys can contain strings, hashes, lists, sets, and sorted sets.
http://redis.io
started in 2009 by Salvatore Sanfilippo @antirez
100+ contributors at
https://github.com/antirez/redis
javier ramirez @supercoco9 https://datawaki.com codemotion 2013
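To make the "data structure server" idea concrete, here is a toy in-memory model of a few Redis list and set commands. This is not a real client (with a real server you would use a library such as redis-py); it only illustrates the semantics, including RPUSHX, which the Twitter example later relies on:

```python
# Toy in-memory model of a few Redis commands, for illustration only.
class MiniRedis:
    def __init__(self):
        self.store = {}

    def rpush(self, key, *values):       # append to a list, creating it
        self.store.setdefault(key, []).extend(values)
        return len(self.store[key])

    def rpushx(self, key, value):        # append ONLY if the key exists
        if key not in self.store:
            return 0
        self.store[key].append(value)
        return len(self.store[key])

    def lrange(self, key, start, stop):  # inclusive range, like Redis
        end = None if stop == -1 else stop + 1
        return self.store.get(key, [])[start:end]

    def sadd(self, key, *members):       # set: returns # of new members
        s = self.store.setdefault(key, set())
        before = len(s)
        s.update(members)
        return len(s) - before

r = MiniRedis()
r.rpush("events", "signup", "login")
r.rpushx("events", "purchase")     # appends: the key exists
r.rpushx("missing", "ignored")     # no-op: the key does not exist
print(r.lrange("events", 0, -1))   # -> ['signup', 'login', 'purchase']
```

The RPUSHX no-op behavior is what lets Twitter skip fanout work for inactive users: if a timeline key is not cached, the push simply does nothing.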
24. Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining)
$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
SET: 552,028 requests per second
GET: 707,463 requests per second
LPUSH: 767,459 requests per second
LPOP: 770,119 requests per second
Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (without pipelining)
$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q
SET: 122,556 requests per second
GET: 123,601 requests per second
LPUSH: 136,752 requests per second
LPOP: 132,424 requests per second
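The 4-5x gap between the two benchmark runs comes from network round trips: without pipelining every command waits for one round trip, while `-P 16` lets 16 commands share one. A back-of-the-envelope model (the latency and per-command cost below are illustrative guesses, not measurements):

```python
# Rough model of why pipelining helps: each batch pays one round trip
# plus the server-side cost of its commands.
rtt_s = 60e-6        # assumed client<->server round-trip time
per_cmd_s = 1.5e-6   # assumed server-side cost per command

def throughput(pipeline_depth):
    """Approximate commands per second for a single connection."""
    batch_time = rtt_s + pipeline_depth * per_cmd_s
    return pipeline_depth / batch_time

print(f"no pipelining : {throughput(1):,.0f} req/s")
print(f"pipeline of 16: {throughput(16):,.0f} req/s")
```

As the pipeline deepens, the fixed round-trip cost is amortized and throughput approaches the server's raw per-command rate.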
26. What Redis is being used for
27. Twitter's timeline pipeline
write API (from browser or client app)
fanout via FlockDB: one delivery per follower
RPUSHX to Redis: user id, tweet id, metadata
user info from gizmoduck (memcached)
tweet info from tweetypie (memcached + MySQL)
result: your Twitter timeline
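The fanout step can be sketched in a few lines: on every tweet, push the tweet id onto each follower's timeline list and cap the list length (the ~800-entries-per-timeline figure appears in the notes later in this talk). The social graph and names here are toy stand-ins:

```python
# Write-time fanout sketch: one timeline insert per follower,
# trimmed to a fixed cap, mirroring RPUSHX + LTRIM on Redis lists.
from collections import defaultdict

TIMELINE_CAP = 800
timelines = defaultdict(list)            # follower id -> list of tweet ids
followers = {"alice": ["bob", "carol"]}  # toy social graph

def publish(author, tweet_id):
    for follower in followers.get(author, []):
        tl = timelines[follower]
        tl.append(tweet_id)
        del tl[:-TIMELINE_CAP]           # keep only the newest entries

publish("alice", "t1")
publish("alice", "t2")
print(timelines["bob"])   # -> ['t1', 't2']
```

This is why a celebrity tweet is expensive: one write becomes tens of millions of list inserts, one per follower.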
28. Products using Redis
Pinterest
Snapchat
World of Warcraft
GitHub
HipChat
SoundCloud
Tumblr
Booking.com
YouPorn...
31. Google BigQuery
Data analysis as a service
http://developers.google.com/bigquery
32. Based on Dremel
Specifically designed for
interactive queries over
petabytes of real-time data
33. What Dremel has been used for at Google
• Analysis of crawled web documents.
• Tracking install data for applications on Android Market.
• Crash reporting for Google products.
• OCR results from Google Books.
• Spam analysis.
• Debugging of map tiles on Google Maps.
• Tablet migrations in managed Bigtable instances.
• Results of tests run on Google’s distributed build system.
• Disk I/O statistics for hundreds of thousands of disks.
• Resource monitoring for jobs run in Google’s data centers.
• Symbols and dependencies in Google’s codebase.
42. Things you always wanted to
try but were too scared to
select count(*) from
publicdata:samples.wikipedia
where REGEXP_MATCH(title, "[0-9]*")
AND wp_namespace = 0;
223,163,387
Query complete (5.6s elapsed, 9.13 GB processed)
43. Global Database of Events,
Language and Tone
quarter billion rows
30 years
updated daily
http://gdeltproject.org/data.html#googlebigquery
44. SELECT Year, Actor1Name, Actor2Name, Count FROM (
SELECT Actor1Name, Actor2Name, Year,
COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY
Count DESC) rank
FROM
(SELECT Actor1Name, Actor2Name, Year FROM
[gdelt-bq:full.events] WHERE Actor1Name < Actor2Name
and Actor1CountryCode != '' and Actor2CountryCode != ''
and Actor1CountryCode!=Actor2CountryCode),
(SELECT Actor2Name Actor1Name, Actor1Name Actor2Name,
Year FROM [gdelt-bq:full.events] WHERE
Actor1Name > Actor2Name and Actor1CountryCode != '' and
Actor2CountryCode != '' and
Actor1CountryCode!=Actor2CountryCode)
WHERE Actor1Name IS NOT null
AND Actor2Name IS NOT null
GROUP EACH BY 1, 2, 3
HAVING Count > 100
)
WHERE rank=1
ORDER BY Year
46. BigQuery pricing
$20 per stored TB
1,000,000 rows => $0.004 / month
$5 per processed TB
1 full scan (1MM rows) ~ 200 MB
1 count = 0 MB
1 full scan over 1 column ~ 15 MB
* the 1st TB processed every month is free of charge
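The slide's numbers check out if you assume ~200 bytes per row (that figure is implied by "1 full scan (1MM rows) ~ 200 MB"): a million rows is about 200 MB, and 200 MB stored at $20 per TB-month is $0.004. A quick verification:

```python
# Verifying the pricing arithmetic on the slide above.
TB = 10**12  # BigQuery bills decimal terabytes

rows = 1_000_000
bytes_per_row = 200                    # implied by "1MM rows ~ 200 MB"
stored_bytes = rows * bytes_per_row    # 200 MB

storage_cost = stored_bytes / TB * 20  # $20 per stored TB per month
scan_cost = stored_bytes / TB * 5      # $5 per processed TB, before free tier

print(f"storage: ${storage_cost:.4f}/month")  # -> $0.0040/month
print(f"1 full scan: ${scan_cost:.5f}")
```

Note that you pay for bytes processed, not rows returned, which is why scanning a single narrow column is so much cheaper than a full-table scan.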
48. Neo4j is a high performance graph store with all the
features expected of a mature and robust database, like a
friendly query language and ACID transactions.
The programmer works with a flexible network structure of
nodes and relationships rather than static tables, yet
enjoys all the benefits of an enterprise-quality database.
For many applications, Neo4j offers orders of magnitude
performance benefits compared to relational DBs.
49. How we are using Neo4j
Define data flows (funnels) for users or devices
Check whether incoming data points are part of a funnel
Store the BigQuery ID on the graph so we can cross-reference analytical queries with data flows
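The funnel check described above boils down to: a funnel is an ordered list of steps, and a device's event stream matches if those steps occur in order. A minimal sketch; the function name, funnel, and event names are illustrative, not the real datawaki schema:

```python
# Subsequence check: do the funnel steps appear, in order,
# anywhere within the event stream?
def matches_funnel(events, funnel):
    """True if every funnel step occurs, in order, in the events."""
    it = iter(events)
    # 'step in it' consumes the iterator up to the first match,
    # so successive steps must appear in order.
    return all(step in it for step in funnel)

signup_funnel = ["visit", "signup", "purchase"]
stream = ["visit", "browse", "signup", "browse", "purchase"]
print(matches_funnel(stream, signup_funnel))                 # -> True
print(matches_funnel(["visit", "purchase"], signup_funnel))  # -> False
```

In the graph database the same question becomes a path query over nodes and relationships, which is what makes Neo4j a natural fit for this part of the system.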
54. Cost of a minimum system
Nginx $5 per server
Logstash $10 per server
Redis $5
Ruby workers $5 per server
BigQuery $5 per 500MM rows
Neo4j $10 per server
Rails $5 per server
total: $45 / month + backups
No one doubts that your API is technically very good, but...
obvious conclusion
this is going to be a big data problem
The problem was that we didn't know big data. Map/reduce, Hadoop, Cassandra rang a bell... but we lacked the background.
master-slave, transactions, atomicity/concurrency
We keep a list into which we insert one entry per API operation
NEXT: it is fast because it runs in memory
We didn't have to think much here, because we were already using Redis for several things in the system, precisely because it supports many operations with very little overhead
but we can configure persistence and redundancy
intermediate storage
cache
index
NEXT: Twitter
Katy Perry, 45MM
400 million tweets per day
4600 tweets per second
30 billion timeline deliveries per day
300K queries per second
7000 tweets per second at peak times
12000 tweets per second at events
143,199 tweets/sec during the Castle in the Sky TV broadcast
800 tweets history per timeline on redis
44% of Twitter accounts have never posted a tweet
each tweet is replicated 3 times
2 terabytes of RAM for the Redis cluster
Katy Perry: 67 million inserts (Justin Bieber: 56MM)
t-bird holds the tweets in MySQL
gizmoduck: all users, in memcached
tweetypie: all tweets from the last 45 days
Snapchat: 400MM daily
YouPorn: 200MM daily
Because the Redis protocol is very simple, Redis can be accessed from any programming language, and since web servers support scripting...
you can set the info directly from the web server, so if you have several backends (Rails, Node...) you can centralize all your logging into a single layer
Not for analytics!
Everything on memory!
So far we have solved the Velocity part of big data, and a bit of the Veracity, but we need more
Apache Drill is the open-source equivalent. It does not run as a service.
BigQuery is a REST wrapper on top of Dremel, usable from any platform that can speak REST. APIs are available for several languages.
Insertions only! No deletes or updates. Often combined with Map/Reduce or Hadoop. In-place analysis: no prior loading, no indexes, no need to plan queries in advance.
full scan!
a typical Solid State Disk reads at 550MBytes/second
The public enemy of data scientist/interactive queries
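The arithmetic behind that claim: at a single SSD's ~550 MB/s, a sequential full scan of 1 TB takes about half an hour, which is nowhere near interactive. Dremel gets sub-10-second latencies by spreading the scan across thousands of disks through its tree of workers:

```python
# How long does a 1 TB full scan take on one SSD at ~550 MB/s?
ssd_mb_per_s = 550
tb_in_mb = 1_000_000

seconds = tb_in_mb / ssd_mb_per_s
print(f"1 TB on one SSD: ~{seconds / 60:.0f} minutes")  # -> ~30 minutes
```

Dividing that work over a few thousand disks brings the same scan down to seconds, which is the whole trick behind interactive queries over petabytes.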
Column data is of uniform type; therefore, there are some opportunities for storage size optimizations available in column-oriented data that are not available in row-oriented data.
also less I/O
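The I/O saving is easy to picture: to scan a single column, a column store reads only that column's bytes, while a row store must read every row in full. The row and column sizes below are illustrative, not real table statistics:

```python
# Bytes scanned for a single-column query: row store vs column store.
rows = 1_000_000
columns = {"title": 50, "views": 8, "lang": 2}  # assumed avg bytes per value

row_store_scan = rows * sum(columns.values())   # reads whole rows
column_store_scan = rows * columns["views"]     # reads just one column

print(f"row store:    {row_store_scan / 1e6:.0f} MB")    # -> 60 MB
print(f"column store: {column_store_scan / 1e6:.0f} MB") # -> 8 MB
```

This is also why the pricing slide earlier shows a full scan over one column at ~15 MB versus ~200 MB for the whole table: BigQuery only bills the columns a query touches.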
Dremel also provides a tree structure for dispatching the queries
batch and real time both for data input (files or streaming) and for output (interactive or batch)
you pay for what you use
read only!!!
web console
api rest
command line
Note the Validate button, which helps avoid unnecessary charges
next: full scan regexp
total
313,797,035
review!
Open source
Open source
Open source
You can combine services on a single server at first