Cassandra at Disqus — SF Cassandra Users Group July 31st
1. C* @Disqus · July 31, 2013
Cassandra SF Meetup
1Thursday, August 1, 13
2. INTRO
Who am I?
Software Engineer at Disqus
Built the current Data Pipeline
Enjoy working on large ecosystems
3. SO YOU MADE SOME ANALYTICS
Needed to build a system to track events from across the
Disqus network. On a given day we have
200,000 unique users creating
1,000,000 unique comments on
1,000,000 unique articles on
20,000 unique websites
4*10^21
4,000,000,000,000,000,000,000
4 sextillion (zetta)
potential combinations PER DAY
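The 4*10^21 figure is simply the product of the four daily cardinalities on this slide. A quick check, with variable names chosen here purely for illustration:

```python
# Daily cardinalities quoted on the slide.
unique_users = 200_000
unique_comments = 1_000_000
unique_articles = 1_000_000
unique_websites = 20_000

# Every (user, comment, article, website) tuple is a potential combination.
potential_combinations = (
    unique_users * unique_comments * unique_articles * unique_websites
)

print(f"{potential_combinations:,}")    # 4,000,000,000,000,000,000,000
print(f"{potential_combinations:.0e}")  # 4e+21
```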
7. GOALS
1. SCALABLE AND AVAILABLE DATA PIPELINE
2. ABILITY TO QUERY AND JOIN LARGE DATA SETS
3. ABILITY TO ACCESS A SUBSET IN REAL TIME
This is where Cassandra comes in
11. DATA FORMAT
At Disqus we do comments
{
    "category": "comment",
    "data": {
        "text": "What's going on",
        "author": "gjcourt"
    },
    "meta": {
        "endpoint": "/event.js",
        "useragent": {
            "flavor": { "version": "X" },
            "browser": { "version": "6.0", "name": "Safari" }
        }
    },
    "timestamp": 1375228800
}
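A minimal sketch of reading one such event in Python. It assumes the timestamp is seconds since the epoch in UTC, which the slide does not state; the variable names are illustrative only.

```python
import json
from datetime import datetime, timezone

raw_event = """
{
  "category": "comment",
  "data": {"text": "What's going on", "author": "gjcourt"},
  "meta": {
    "endpoint": "/event.js",
    "useragent": {
      "flavor": {"version": "X"},
      "browser": {"version": "6.0", "name": "Safari"}
    }
  },
  "timestamp": 1375228800
}
"""

event = json.loads(raw_event)

# Assumption: the timestamp is epoch seconds, interpreted explicitly as UTC.
occurred_at = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)

category = event["category"]
author = event["data"]["author"]
browser = event["meta"]["useragent"]["browser"]["name"]

print(category, author, browser, occurred_at.isoformat())
# comment gjcourt Safari 2013-07-31T00:00:00+00:00
```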
17. RANDOM ASIDE
Handling time in Python is a pain in the ass
time.time()
Return the time in seconds since the epoch as a floating point number. Note that even
though the time is always returned as a floating point number, not all systems provide time
with a better precision than 1 second. While this function normally returns non-decreasing
values, it can return a lower value than a previous call if the system clock has been set back
between the two calls.
>>> print time.time(); print time.mktime(time.gmtime())
1375244678.64
1375273478.0
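The two numbers in the transcript differ by 28,800 seconds (eight hours) because time.mktime() interprets its argument as local time, while time.gmtime() produces a UTC struct. A small sketch of the round trip using calendar.timegm(), the standard-library inverse of time.gmtime(); the helper name is made up for this example.

```python
import calendar
import time


def utc_struct_to_epoch(utc_struct):
    """Inverse of time.gmtime(): treat the struct_time as UTC."""
    return calendar.timegm(utc_struct)


now = int(time.time())
utc_struct = time.gmtime(now)

round_trip = utc_struct_to_epoch(utc_struct)  # equals `now` in any timezone
skewed = time.mktime(utc_struct)              # reads the UTC struct as local time

print(round_trip - now)  # 0
print(skewed - now)      # non-zero unless the machine's local time is UTC
```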
19. PICKING A DATABASE
Mainly because there are so many choices
20. PICKING A DATABASE
In an early startup, opportunity cost is king
While the choice of a system is important, there is a
range of workable choices.
A system that provides value is more important than
choosing a local maximum.
22. PICKING A DATABASE
We need a large sparse matrix
Requires horizontal scalability
Fast reads and inserts
High cardinality
Rules out almost every RDBMS
25. PICKING A DATABASE
What made the difference
We wanted counters, and Cassandra 0.8.0 has this capability
Fast inserts and reads
Tunable consistency guarantees
Simple data model
27. DATA THAT SCALES
1. HIGH VOLUME SPARSE MATRIX (billions of dimensions)
2. FAST AND ACCURATE COUNTERS
3. SCALABLE AND AVAILABLE
28. DATA MODEL
How do you store arbitrary dimensionality over time?
Cassandra is a 2D sorted array
29. DATA MODEL
A simple way to build a counter
CREATE TABLE counts (
    key text,
    time_dimension text,
    value counter,
    PRIMARY KEY (key, time_dimension)
);
31. DATA MODEL
A simple way to build a counter
+------------------------+------+--------+-----------+-------------+
| key                    | 2013 | 2013.7 | 2013.7.30 | 2013.7.30.0 |
+------------------------+------+--------+-----------+-------------+
| comment                | 1000 |    100 |        10 |           1 |
+------------------------+------+--------+-----------+-------------+
| comment.author.gjcourt |   23 |     17 |         7 |           1 |
+------------------------+------+--------+-----------+-------------+
Dimensions are easy
32. DATA MODEL
And if you increment the time bucket 2013-07-31
+------------------------+------+--------+-----------+-------------+
| key                    | 2013 | 2013.7 | 2013.7.30 | 2013.7.30.0 |
+------------------------+------+--------+-----------+-------------+
| comment                | 1001 |    101 |        10 |           1 |
+------------------------+------+--------+-----------+-------------+
| comment.author.gjcourt |   24 |     18 |         7 |           1 |
+------------------------+------+--------+-----------+-------------+
Dimensions are easy
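A rough sketch of how one event fans out into the year, month, day and hour buckets shown in these tables. The bucket naming follows the slide (no zero padding); the helper names and the in-memory dictionary standing in for the counts table are assumptions made for illustration, not the pipeline's actual code.

```python
from collections import defaultdict
from datetime import datetime, timezone


def time_buckets(timestamp):
    """Return year, month, day and hour bucket names for a UTC epoch timestamp."""
    t = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return [
        f"{t.year}",
        f"{t.year}.{t.month}",
        f"{t.year}.{t.month}.{t.day}",
        f"{t.year}.{t.month}.{t.day}.{t.hour}",
    ]


def increment(counts, key, timestamp, amount=1):
    """Bump every time bucket of one row, one counter update per bucket."""
    for bucket in time_buckets(timestamp):
        counts[key][bucket] += amount


# In-memory stand-in for the counts table: row key -> {time bucket -> count}.
counts = defaultdict(lambda: defaultdict(int))

event_time = 1375228800  # 2013-07-31 00:00:00 UTC
for key in ("comment", "comment.author.gjcourt"):
    increment(counts, key, event_time)

print(time_buckets(event_time))
# ['2013', '2013.7', '2013.7.31', '2013.7.31.0']
```

Each event therefore costs one counter write per bucket per dimension key, which is why the year and month columns move while the 2013.7.30 columns stay put.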
33. DATA MODEL
Some major disadvantages
All time intervals are in the same row
Queries are nonlinear
Time buckets sort in lexical order
Dimensions cannot be indexed
Rows can grow unbounded
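The lexical-order problem is easy to reproduce: text bucket names compare character by character, so October sorts before July. A short illustration with made-up month buckets:

```python
month_buckets = ["2013.7", "2013.9", "2013.10", "2013.11"]

# Text columns sort lexically, so October and November land before July.
lexical = sorted(month_buckets)

# Sorting on the numeric components restores chronological order.
chronological = sorted(
    month_buckets, key=lambda b: tuple(int(part) for part in b.split("."))
)

print(lexical)        # ['2013.10', '2013.11', '2013.7', '2013.9']
print(chronological)  # ['2013.7', '2013.9', '2013.10', '2013.11']
```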
35. DATA MODEL
This is a large improvement
Efficient range queries
Rollups are possible
36. DATA MODEL
However still has some problems
Dimensions are not indexed
Rows can grow unbounded
39. DATA MODEL
Remember the schema
CREATE TABLE counts (
    key text,
    time_dimension text,
    value counter,
    PRIMARY KEY (key, time_dimension)
);
Should this be a <timestamp>?
40. DATA MODEL
A better version of counters
CREATE TABLE better_counts (
    key text,
    time_dimension timestamp,
    value counter,
    PRIMARY KEY (key, time_dimension)
) WITH CLUSTERING ORDER BY (time_dimension DESC);
-- DESC clustering order stores the newest time buckets first (a ReversedType comparator)
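With real timestamps as column names, a time range becomes a contiguous slice of the row, and a rollup is just a sum over that slice. The sketch below models one row in memory with hypothetical hourly counts; it keeps the columns oldest-first for the binary search, whereas the table above stores them newest-first, which does not change the idea.

```python
from bisect import bisect_left

HOUR = 3600

# One row of better_counts, modelled as (timestamp, count) pairs.
# Hypothetical data: 48 hourly counts starting at 2013-07-30 00:00:00 UTC.
start = 1375142400
row = [(start + h * HOUR, h + 1) for h in range(48)]  # counts 1..48
timestamps = [ts for ts, _ in row]


def range_sum(lo, hi):
    """Sum counts with lo <= timestamp < hi using two binary searches."""
    i = bisect_left(timestamps, lo)
    j = bisect_left(timestamps, hi)
    return sum(count for _, count in row[i:j])


# Daily rollups computed from the hourly columns.
day_one = range_sum(start, start + 24 * HOUR)              # 1 + 2 + ... + 24
day_two = range_sum(start + 24 * HOUR, start + 48 * HOUR)  # 25 + ... + 48
print(day_one, day_two)  # 300 876
```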
41. DATA MODEL
The problem with counters
Operations are NOT Idempotent
Limited protection against overcounting
https://issues.apache.org/jira/browse/CASSANDRA-4775
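To see why non-idempotent increments lead to overcounting, consider a write that is applied on the server while its acknowledgement is lost. The toy store below is an in-memory stand-in invented for this example, not Cassandra's actual behaviour in detail; it only shows what a blind retry does to a counter.

```python
class FlakyCounterStore:
    """In-memory stand-in for a counter whose acknowledgement can get lost."""

    def __init__(self, drop_first_ack=True):
        self.value = 0
        self._drop_next_ack = drop_first_ack

    def increment(self, amount=1):
        self.value += amount        # the write is applied...
        if self._drop_next_ack:     # ...but the client never hears about it
            self._drop_next_ack = False
            raise TimeoutError("no acknowledgement received")


def increment_with_retry(store, retries=1):
    """Naive retry loop: safe for idempotent writes, wrong for counters."""
    for _ in range(retries + 1):
        try:
            store.increment()
            return
        except TimeoutError:
            continue


store = FlakyCounterStore()
increment_with_retry(store)
print(store.value)  # 2, although only one event occurred
```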
42. DATA MODEL
And you end up having to write code like this
from functools import wraps

from pycassa import TimedOutException, UnavailableException
from pycassa.pool import MaximumRetryException

# logger and CassandraError are defined elsewhere in the application.
def swallow_cassandra_timeouts(func):
    @wraps(func)
    def inner(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except TimedOutException:
            # swallowed: log and move on without retrying
            logger.warning("processor.pycassa.exception.timeout")
        except UnavailableException as e:
            # raise so that we retry this batch
            logger.error("processor.pycassa.exception.unavailable")
            raise CassandraError(e)
        except MaximumRetryException:
            logger.warning("processor.pycassa.exception.max_retry")
        except Exception:
            logger.error("processor.pycassa.exception.unknown")
            raise
    return inner
43. DATA MODEL
And this
if LOCAL:
    CASSANDRA_TIMEOUT = 60
    CASSANDRA_RETRIES = 0
elif "prod" in hostname:
    CASSANDRA_TIMEOUT = 2  # Seconds
    CASSANDRA_RETRIES = 0  # None
elif "storm" in hostname:
    CASSANDRA_TIMEOUT = 0.2
    CASSANDRA_RETRIES = 0
else:  # proxy (read only)
    CASSANDRA_TIMEOUT = 60
    CASSANDRA_RETRIES = 3
44. DATA MODEL
And this too
CASSANDRA_CONFIG = {
    'stats': {
        'pool': PoolConfig(CASSANDRA_TIMEOUT, CASSANDRA_RETRIES, CASSANDRA_POOL_SIZE),
        'cf': {
            'counts': ColumnFamilyConfig(ConsistencyLevel.LOCAL_QUORUM, ConsistencyLevel.ONE),
            'durable_counts': ColumnFamilyConfig(ConsistencyLevel.LOCAL_QUORUM, ConsistencyLevel.LOCAL_QUORUM),
            'sets': ColumnFamilyConfig(ConsistencyLevel.LOCAL_QUORUM, ConsistencyLevel.LOCAL_QUORUM),
        }
    }
}
45. DATA MODEL
And operations to Cassandra look like this
@swallow_cassandra_timeouts
def side_effecting_function():
    # insert/update into cassandra
    pass
46. DATA MODEL
Durable counts
CREATE TABLE durable_counts (
    key text,
    time_dimension timestamp,
    random uuid,
    value int,
    PRIMARY KEY (key, time_dimension, random)
) WITH CLUSTERING ORDER BY (time_dimension DESC, random ASC);
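One way to read this schema: each event becomes its own row, identified by the full primary key, and the count is recovered by summing values at read time. The sketch below models that in memory. It assumes the uuid is generated once per event and reused when a write is retried, which the slide does not spell out; the helper names are hypothetical.

```python
import uuid


def record(table, key, time_dimension, event_id, value=1):
    """Idempotent write: the full primary key identifies the event."""
    table[(key, time_dimension, event_id)] = value


def count(table, key, time_dimension):
    """Read side: sum the values of every row in the bucket."""
    return sum(
        v for (k, t, _), v in table.items() if k == key and t == time_dimension
    )


durable_counts = {}
bucket = 1375228800
event_id = uuid.uuid4()  # assumption: generated once per event, reused on retry

record(durable_counts, "comment", bucket, event_id)
record(durable_counts, "comment", bucket, event_id)      # retry after a timeout
record(durable_counts, "comment", bucket, uuid.uuid4())  # a second, distinct event

print(count(durable_counts, "comment", bucket))  # 2
```

Under that assumption a retried write overwrites the same row instead of adding to a total, at the cost of a read that has to scan and sum the bucket.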
48. DATA MODEL
And even doing all that hackery
Hive count | C* counter | % Similar | C* durable counts | % Similar
      8101 |       8179 | 99.046338 |              8179 | 99.046338
      7328 |       7390 | 99.161028 |              7390 | 99.161028
      6255 |       6304 | 99.222715 |              6304 | 99.222715
      6604 |       6665 | 99.084771 |              6665 | 99.084771
      7700 |       7766 | 99.150141 |              7766 | 99.150141
5 weekdays of countable data
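The "% Similar" column is consistent with the Hive count expressed as a percentage of the Cassandra count. A short sketch that recomputes it from the two counts in each row; the function name is chosen here for illustration.

```python
# (Hive count, C* counter) pairs copied from the table above.
daily_counts = [
    (8101, 8179),
    (7328, 7390),
    (6255, 6304),
    (6604, 6665),
    (7700, 7766),
]


def percent_similar(hive_count, cassandra_count):
    """Hive count expressed as a percentage of the Cassandra count."""
    return 100.0 * hive_count / cassandra_count


for hive, cassandra in daily_counts:
    print(hive, cassandra, round(percent_similar(hive, cassandra), 2))
```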
49. DATA MODEL
Over 99% accuracy
100% (allegedly) counter parity
50. DATA MODEL
Since our data is a time series, what if you could view it that way?
53. DATA MODEL
Sets (our first iteration)
CREATE TABLE sets (
    key text,
    time_dimension timestamp,
    element blob,
    value double,
    PRIMARY KEY (key, time_dimension)
);
Insert-only workload. Items are deleted by TTL.
54. DATA MODEL
Better Sets
CREATE TABLE sets (
    key text,
    time_dimension timestamp,
    element blob,
    deleted boolean,
    value double,
    PRIMARY KEY (key, time_dimension)
);
Insert-only workload. To delete, you insert a row with deleted set to true.
Reads require you to iterate over all columns in chronological order. You sum values to calculate a score.
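A sketch of what such a read might look like. The exact semantics are not spelled out on the slide, so this assumes that a row with deleted set to true removes the element and discards its accumulated score, and that any other row adds its value to the element's score. The sample rows are invented.

```python
def fold_set(rows):
    """Replay (time_dimension, element, deleted, value) rows oldest-first.

    Assumed semantics: a deleted row drops the element and its score;
    any other row adds its value to the element's running score.
    """
    scores = {}
    for _, element, deleted, value in sorted(rows, key=lambda r: r[0]):
        if deleted:
            scores.pop(element, None)
        else:
            scores[element] = scores.get(element, 0.0) + value
    return scores


rows = [
    (3, b"thread-2", False, 1.0),
    (1, b"thread-1", False, 1.0),
    (2, b"thread-1", False, 0.5),
    (4, b"thread-2", True, 0.0),   # delete expressed as an insert
    (5, b"thread-3", False, 2.0),
]

print(fold_set(rows))  # {b'thread-1': 1.5, b'thread-3': 2.0}
```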
55. DATA MODEL
Counters with indexable dimensions
CREATE TABLE catalog (
    key text,
    time_dimension timestamp,
    dimension_1 text,
    dimension_1_val text,
    dimension_2 text,
    dimension_2_val text,
    ...
    value counter,
    PRIMARY KEY (key, time_dimension)
) WITH CLUSTERING ORDER BY (time_dimension DESC);
57. DATA MODEL
Dimension Catalog
CREATE TABLE catalog (
    key text,
    dimension text,
    value text,
    PRIMARY KEY (key, dimension)
);
cqlsh:> insert into catalog (key, dimension, value) values ('comment', 'author', 'gjcourt');
cqlsh:> insert into catalog (key, dimension, value) values ('comment', 'forum', 'disqus');
cqlsh:> select dimension from catalog where key='comment';

 dimension
-----------
    author
     forum
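The catalog can be pictured as a nested mapping from key to dimension to value. The sketch below mirrors the two inserts and the select above in memory, then derives counter row keys in the key.dimension.value style seen earlier (comment.author.gjcourt). The helper names, and the idea of deriving keys this way, are assumptions for illustration.

```python
from collections import defaultdict

# In-memory stand-in for the catalog table: key -> {dimension -> value}.
catalog = defaultdict(dict)


def insert(key, dimension, value):
    """Mirror of INSERT INTO catalog (key, dimension, value)."""
    catalog[key][dimension] = value


def dimensions(key):
    """Mirror of SELECT dimension ... WHERE key = ...; sorted like clustering columns."""
    return sorted(catalog[key])


def counter_keys(key):
    """Hypothetical helper: row keys for the counts table, one per dimension."""
    return [key] + [f"{key}.{d}.{catalog[key][d]}" for d in dimensions(key)]


insert("comment", "author", "gjcourt")
insert("comment", "forum", "disqus")

print(dimensions("comment"))    # ['author', 'forum']
print(counter_keys("comment"))  # ['comment', 'comment.author.gjcourt', 'comment.forum.disqus']
```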
59. OUR 2013 MISSIONS
1. EVOLVE CONTENT RECOMMENDATION AND ADVERTISING
2. PRODUCTIZE OUR DATA PIPELINE
3. EXPLORE NEW AND INTERESTING DATA PRODUCTS
62. THE FUTURE
Graph of users and views
g.V('username','gjcourt').out('thread_views').in('thread_views').except('username', 'gjcourt')
The Netflix algorithm:
All the articles also viewed by people who have viewed the thread I’m
currently viewing.
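The same two-hop idea can be sketched without a graph database. The view data below is hypothetical; co_viewers follows the traversal above (out along thread_views, back in, minus the starting user), and also_viewed follows the prose description.

```python
# Hypothetical view graph: username -> set of thread ids viewed.
thread_views = {
    "gjcourt": {"t1", "t2"},
    "alice": {"t1", "t3"},
    "bob": {"t2", "t3", "t4"},
    "carol": {"t5"},
}


def co_viewers(username):
    """Users who viewed any thread `username` viewed, excluding `username`."""
    mine = thread_views[username]
    return {
        user
        for user, threads in thread_views.items()
        if user != username and threads & mine
    }


def also_viewed(thread):
    """Other threads viewed by anyone who viewed `thread`."""
    viewers = {user for user, threads in thread_views.items() if thread in threads}
    related = set()
    for user in viewers:
        related |= thread_views[user]
    return related - {thread}


print(sorted(co_viewers("gjcourt")))  # ['alice', 'bob']
print(sorted(also_viewed("t1")))      # ['t2', 't3']
```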
63. C* @Disqus · July 31, 2013
Cassandra SF Meetup
Thanks for listening
We’re hiring http://disqus.com/jobs/