An introduction to DataStax's Brisk (a distribution of Cassandra, Hadoop and Hive). Includes a back story of my own experience with Cassandra plus a demo of Brisk built around a very simple ad-network-type application.
23. Data model CF = users [userUUID] [segmentID] = 1 CF = segments [segmentID] [userUUID] = 1
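The two column families on this slide are mirror images of each other, so every membership is written twice. A minimal in-memory sketch of that idea, assuming plain Python dictionaries in place of Cassandra (the helper name `add_user_to_segment` and the sample IDs are hypothetical, not from the talk):

```python
from collections import defaultdict

# Each column family is modelled as: row key -> {column name: value}.
users = defaultdict(dict)     # users[userUUID][segmentID] = 1
segments = defaultdict(dict)  # segments[segmentID][userUUID] = 1


def add_user_to_segment(user_uuid, segment_id):
    """Write both sides of the mapping, mirroring the two column families."""
    users[user_uuid][segment_id] = 1
    segments[segment_id][user_uuid] = 1


# Hypothetical sample data.
add_user_to_segment("user-a", "seg-1")
add_user_to_segment("user-a", "seg-2")
add_user_to_segment("user-b", "seg-1")

segments_for_user_a = sorted(users["user-a"])   # what segments is user-a in?
users_in_seg_1 = sorted(segments["seg-1"])      # which users are in seg-1?
user_b_in_seg_2 = "seg-2" in users["user-b"]    # is user-b in seg-2?
```

Storing the relationship in both directions means each real-time question is a single row lookup, at the cost of two writes per membership.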
24. Data model
create keyspace whyk
... with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
... and strategy_options = [{replication_factor:1}];
create column family users
... with comparator = 'AsciiType'
... and rows_cached = 5000;
create column family segments
... with comparator = 'AsciiType'
... and rows_cached = 5000;
27. Real-time access
http://wehaveyourkidneys.com/show.php
$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
// @todo this only gets first 100!
$segments = $users->get($userUuid);
header('Content-Type: application/json');
echo json_encode(array_keys($segments));
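The @todo on this slide notes that a single read returns only the first 100 columns of the user's row. One common way around that is to page through the row, restarting each read at the last column name seen. The sketch below illustrates that paging loop against an in-memory stand-in; `get_slice`, `all_column_names` and the sample row are hypothetical and are not the phpcassa API.

```python
import json

# Hypothetical wide row: one user who is in 250 segments.
ROW = {"seg-%04d" % i: 1 for i in range(250)}


def get_slice(row, column_start="", column_count=100):
    """Return up to column_count (name, value) pairs in name order,
    starting at column_start (inclusive)."""
    names = sorted(name for name in row if name >= column_start)
    return [(name, row[name]) for name in names[:column_count]]


def all_column_names(row, page_size=100):
    """Collect every column name by reading the row one page at a time."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    names = []
    start = ""
    while True:
        page = get_slice(row, start, page_size)
        fetched = len(page)
        if names:
            # The start column is inclusive, so skip the one already seen.
            page = page[1:]
        names.extend(name for name, _value in page)
        if fetched < page_size:
            return names
        start = names[-1]


segment_ids = all_column_names(ROW)
payload = json.dumps(segment_ids)  # same shape as the JSON the PHP script echoes
```

Because the start column is inclusive, every page after the first overlaps the previous one by a single column, which is why the loop drops the first entry of later pages.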
28. Analytics
How many users in each segment?
Launch Hive (very easy!)
root@brisk-01:~# brisk hive
29. CREATE EXTERNAL TABLE whyk.users (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value");
select segmentId, count(1) as total from whyk.users group by segmentId order by total desc;
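The column mapping ":key,:column,:value" presents each Cassandra column as one table row of (userUuid, segmentId, value), so the query is a plain group-by count over those rows. A small Python sketch of the same aggregation, assuming a handful of made-up rows (the data and variable names are hypothetical):

```python
from collections import Counter

# Hypothetical sample rows shaped like the Hive table:
# (userUuid, segmentId, value)
rows = [
    ("user-a", "seg-1", "1"),
    ("user-a", "seg-2", "1"),
    ("user-b", "seg-1", "1"),
    ("user-c", "seg-1", "1"),
    ("user-c", "seg-3", "1"),
]

# group by segmentId, count(1) as total
totals = Counter(segment_id for _user, segment_id, _value in rows)

# order by total desc; ties are broken by segment ID here so the result
# is stable (the Hive query itself does not specify a tie-break).
report = sorted(totals.items(), key=lambda item: (-item[1], item[0]))

for segment_id, total in report:
    print(segment_id, total)
```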
34. Further reading…
Installing the Brisk AMI: http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
Key advantages of Brisk, from Jonathan Ellis: http://hackerne.ws/item?id=2528271
Why I’m very excited about DataStax’s Brisk, by Nathan Milford: http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/
The demo code on GitHub: https://github.com/davegardnerisme/we-have-your-kidneys
Editor's notes
Started at Imagini; May 2010. New ad-targeting product! Lots of users. MySQL DB for profiles, MySQL-based server for events reporting. Profile DB cannot update rows so we only insert; this means clients have to merge together all rows for a user on every read. MySQL DB has a habit of dying, requiring a repair and downtime; having 2 DBs managed to put off total death, but not for long.
Choosing Cassandra after some research; no single point of failure attractive, high write throughput attractive, linear scaling attractive. Welcome to GC hell! Start Cassandra London, like Alcoholics Anonymous; a support network.
Batch analytics; how? No Hive support, no support for streaming jar. Pig input reader. No output reader; require HDFS.
Keep up the meetups. Acunu generous at providing speakers; downside is hearing sales pitch! 0.7 comes along; downside is not compatible with 0.6; Thrift interface changes. 0.8 comes along; CQL, counters. Brisk!
A summary
Some points about “distribution”. Some points about Cloudera and reaction.
Realtime + batch analytics combined. No single point of failure; we don’t need Hadoop’s NameNode anymore. Cross-DC clusters.
No ads. No network. No publishers. Cool domain name.
User segments / buckets; have some ID, can expire (we’ll use Cassandra’s expiring columns). Real-time updates via a simple script (via CQL?). Real-time queries (is user in this segment? What segments is user X in?). Analytics (how many users in each segment? How many segments are users in on average? Std dev?).
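The last two analytics questions in these notes (how many segments users are in on average, and the standard deviation) come down to counting rows per user and summarising the counts. A rough Python sketch, assuming the same (userUuid, segmentId, value) row shape as the Hive table and made-up sample data:

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical sample rows: (userUuid, segmentId, value)
rows = [
    ("user-a", "seg-1", "1"),
    ("user-a", "seg-2", "1"),
    ("user-b", "seg-1", "1"),
    ("user-c", "seg-1", "1"),
    ("user-c", "seg-2", "1"),
    ("user-c", "seg-3", "1"),
]

# Number of segments each user is in.
segments_per_user = Counter(user for user, _segment, _value in rows)
counts = list(segments_per_user.values())

average = mean(counts)   # mean segments per user
spread = pstdev(counts)  # population standard deviation of that count

print(average, spread)
```

Users who are in no segments have no rows at all, so they do not appear in either figure; whether that is the desired definition of "average" is a judgment call the notes leave open.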