Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
1. Handling realtime and analytic
workloads in a single cluster
with Hadoop and Cassandra
Piotr Kołaczkowski
pkolaczk@datastax.com
@pkolaczk
3. ColumnFamilyInputFormat
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
Key: ByteBuffer
Value: SortedMap<ByteBuffer, IColumn>
(column name, value, timestamp)
row key
column name
4. ColumnFamilyInputFormat
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
Input Key: jim
Input Value: age: 36 car: camaro gender: M
5. ColumnFamilyInputFormat
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
Input Key: carol
Input Value: age: 37 car: subaru
6. ColumnFamilyInputFormat
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
Input Key: johnny
Input Value: age: 12 gender: M
7. ColumnFamilyInputFormat
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
Input Key: suzy
Input Value: age: 10 gender: F
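The four slides above step through one input record per row. A minimal plain-Java sketch of that record shape follows. It is a model only, not the Cassandra API: Strings stand in for the ByteBuffer keys and IColumn values, and the class and method names (WholeRowSketch, row, sampleRows) are made up for illustration. On a real cluster the iteration order would depend on the partitioner rather than on insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-Java model of whole-row input records (hypothetical names, not Cassandra classes).
class WholeRowSketch {

    // Turns name/value pairs into a map sorted by column name,
    // mirroring the SortedMap value type shown on the slide.
    static SortedMap<String, String> row(String... nameValuePairs) {
        SortedMap<String, String> columns = new TreeMap<>();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            columns.put(nameValuePairs[i], nameValuePairs[i + 1]);
        }
        return columns;
    }

    // The four example rows from the slides, keyed by row key.
    static Map<String, SortedMap<String, String>> sampleRows() {
        Map<String, SortedMap<String, String>> rows = new LinkedHashMap<>();
        rows.put("jim", row("age", "36", "car", "camaro", "gender", "M"));
        rows.put("carol", row("age", "37", "car", "subaru"));
        rows.put("johnny", row("age", "12", "gender", "M"));
        rows.put("suzy", row("age", "10", "gender", "F"));
        return rows;
    }

    public static void main(String[] args) {
        // Each map entry plays the role of one (Input Key, Input Value) record.
        for (Map.Entry<String, SortedMap<String, String>> record : sampleRows().entrySet()) {
            System.out.println("Input Key: " + record.getKey());
            System.out.println("Input Value: " + record.getValue());
        }
    }
}
```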
8. CFIF – Wide Row Support
Input Key: jim
Input Value: age: 36
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
9. CFIF – Wide Row Support
Input Key: jim
Input Value: car: camaro
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
10. CFIF – Wide Row Support
Input Key: jim
Input Value: gender: M
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
11. CFIF – Wide Row Support
Input Key: carol
Input Value: age: 37
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
12. CFIF – Wide Row Support
Input Key: carol
Input Value: car: subaru
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
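The wide-row slides above show the same row key arriving several times, each time with a single column. A plain-Java sketch of that flattening, under the assumption that one record carries exactly one column as drawn on the slides. WideRowSketch, Record and toWideRowRecords are illustrative names, not part of Cassandra.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of wide-row input records (hypothetical names, not Cassandra classes).
class WideRowSketch {

    // One input record in this model: a row key plus exactly one column.
    static final class Record {
        final String key;
        final String column;
        final String value;

        Record(String key, String column, String value) {
            this.key = key;
            this.column = column;
            this.value = value;
        }

        @Override
        public String toString() {
            return key + " -> " + column + ": " + value;
        }
    }

    // Two of the example rows from the slides; insertion order is preserved.
    static Map<String, Map<String, String>> sampleRows() {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        Map<String, String> jim = new LinkedHashMap<>();
        jim.put("age", "36");
        jim.put("car", "camaro");
        jim.put("gender", "M");
        rows.put("jim", jim);
        Map<String, String> carol = new LinkedHashMap<>();
        carol.put("age", "37");
        carol.put("car", "subaru");
        rows.put("carol", carol);
        return rows;
    }

    // Emits one record per column, so a row with N columns yields N records.
    static List<Record> toWideRowRecords(Map<String, Map<String, String>> rows) {
        List<Record> records = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            for (Map.Entry<String, String> column : row.getValue().entrySet()) {
                records.add(new Record(row.getKey(), column.getKey(), column.getValue()));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        for (Record record : toWideRowRecords(sampleRows())) {
            System.out.println(record);
        }
    }
}
```

The point of this shape is that a mapper never has to hold a whole row in memory, which matters when a row has very many columns.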
13. CFIF – Cassandra Secondary Index Support
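import java.util.Arrays;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.IndexExpression;
import org.apache.cassandra.thrift.IndexOperator;
import org.apache.cassandra.utils.ByteBufferUtil;

// Select only rows whose indexed "car" column equals "subaru"
IndexExpression expr =
    new IndexExpression(
        ByteBufferUtil.bytes("car"),
        IndexOperator.EQ,
        ByteBufferUtil.bytes("subaru")
    );
// Pass the filter to the job as its input range
ConfigHelper.setInputRange(
    job.getConfiguration(),
    Arrays.asList(expr)
);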
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
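To make the effect of the index expression concrete, here is a plain-Java sketch of which rows an equality filter on car = subaru would let through. It only models the result: in a real job the filtering is presumably done by Cassandra using the secondary index, not by client code like this. IndexFilterSketch and keysWhereEquals are illustrative names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of an equality filter over rows (hypothetical names, not Cassandra classes).
class IndexFilterSketch {

    // Builds one row from name/value pairs.
    static Map<String, String> row(String... nameValuePairs) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            columns.put(nameValuePairs[i], nameValuePairs[i + 1]);
        }
        return columns;
    }

    // The four example rows from the slide.
    static Map<String, Map<String, String>> sampleRows() {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        rows.put("jim", row("age", "36", "car", "camaro", "gender", "M"));
        rows.put("carol", row("age", "37", "car", "subaru"));
        rows.put("johnny", row("age", "12", "gender", "M"));
        rows.put("suzy", row("age", "10", "gender", "F"));
        return rows;
    }

    // Returns the keys of rows whose given column exists and equals the expected value.
    static List<String> keysWhereEquals(Map<String, Map<String, String>> rows,
                                        String column, String expected) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : rows.entrySet()) {
            if (expected.equals(entry.getValue().get(column))) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(keysWhereEquals(sampleRows(), "car", "subaru"));
    }
}
```

Rows without a car column (johnny, suzy) are skipped along with rows whose car value differs (jim), so only carol's row would reach the mappers.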
18. Basic Features
● Single, simplified component
● Workload separation
● No SPOF
● Peer to peer
● JobTracker failover
● No additional Cassandra config
19. System Administrator's View
Address          DC         Rack   Workload       Status  State   Load      Owns    Token
                                                                                    148873535527910577765226390751398592512
101.202.204.101  Analytics  rack1  Analytics(JT)  Up      Normal  78,96 GB  12,50%  0
101.202.204.102  Analytics  rack1  Analytics(TT)  Up      Normal  82,65 GB  12,50%  21267647932558653966460912964485513216
101.202.204.103  Analytics  rack1  Analytics(TT)  Up      Normal  74,96 GB  12,50%  42535295865117307932921825928971026432
101.202.204.104  Analytics  rack1  Analytics(TT)  Up      Normal  78,79 GB  12,50%  63802943797675961899382738893456539648
101.202.204.105  Cassandra  rack1  Cassandra      Up      Normal  67,42 GB  12,50%  85070591730234615865843651857942052864
101.202.204.106  Cassandra  rack1  Cassandra      Up      Normal  60,86 GB  12,50%  106338239662793269832304564822427566080
101.202.204.107  Cassandra  rack1  Cassandra      Up      Normal  81,27 GB  12,50%  127605887595351923798765477786913079296
101.202.204.108  Cassandra  rack1  Cassandra      Up      Normal  77,17 GB  12,50%  148873535527910577765226390751398592512
Easy monitoring of your nodes, regardless of their workload type
20. Wait, but where are my files?
[Diagram: Hadoop M/R over HDFS, compared with Hadoop M/R over CFS over Cassandra Server]
21. Cassandra File System Properties
● Decentralized
● Replicated
● HDFS compatible
– compatible with Hadoop filesystem utilities
– allows for running M/R programs on DSE without any change
● Compressed
23. CFS Compaction
● Keeps track of deleted rows (blocks)
● When all blocks in an SSTable have been removed, deletes the whole SSTable
[Diagram: Cassandra Storage, with blocks 1 to 10 grouped under labels ts 1 to ts 4; one group is crossed out (X)]
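The compaction rule on this slide (track deleted blocks, drop an SSTable once none of its blocks remain) can be sketched in plain Java. This is a toy model under stated assumptions, not CFS code: SSTables are just named sets of block ids, and CfsCompactionSketch, addSSTable and deleteBlock are illustrative names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model of "drop the SSTable when all of its blocks are deleted" (hypothetical names).
class CfsCompactionSketch {

    // SSTable name -> ids of the blocks in it that are still live.
    private final Map<String, Set<Integer>> liveBlocks = new LinkedHashMap<>();

    // Registers an SSTable together with the blocks it holds.
    void addSSTable(String name, Integer... blocks) {
        liveBlocks.put(name, new HashSet<>(Arrays.asList(blocks)));
    }

    // Marks one block as deleted. Returns true if this emptied the SSTable,
    // in which case the whole SSTable is dropped without rewriting anything.
    boolean deleteBlock(String sstable, int block) {
        Set<Integer> live = liveBlocks.get(sstable);
        if (live == null) {
            return false; // unknown or already dropped SSTable
        }
        live.remove(Integer.valueOf(block));
        if (live.isEmpty()) {
            liveBlocks.remove(sstable);
            return true;
        }
        return false;
    }

    // Names of the SSTables that still exist.
    Set<String> sstables() {
        return liveBlocks.keySet();
    }

    public static void main(String[] args) {
        CfsCompactionSketch storage = new CfsCompactionSketch();
        storage.addSSTable("sstable-a", 1, 2);
        storage.addSSTable("sstable-b", 3);
        storage.deleteBlock("sstable-a", 1);
        storage.deleteBlock("sstable-a", 2);
        System.out.println(storage.sstables());
    }
}
```

The design choice being modelled is that large file blocks are never copied during compaction; a file is either kept as-is or removed in one step.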
24. Hive Integration
● CassandraHiveMetaStore
– stores Hive database metadata in Cassandra
– no need to run a separate RDBMS
● CassandraStorageHandler
– allows for direct access to C* tables with CFIF and CFOF
25. Hive Integration – Example
CREATE EXTERNAL TABLE MyHiveTable(row_key string, col1 string, col2 string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
TBLPROPERTIES ("cassandra.ks.name" = "MyCassandraKS");
SELECT count(*) FROM MyHiveTable;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201306041030_0001, Tracking URL = http://192.168.123.10:50030/jobdetails.jsp?jobid=job_201306041030_0001
Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=192.168.123.10:8012 -kill job_201306041030_0001
Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
2013-06-04 15:11:54,573 Stage-1 map = 0%, reduce = 0%
2013-06-04 15:11:58,622 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
2013-06-04 15:11:59,691 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
...
2013-06-04 15:12:28,288 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:29,304 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:30,330 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:31,339 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
MapReduce Total cumulative CPU time: 31 seconds 910 msec
Ended Job = job_201306041030_0001
MapReduce Jobs Launched:
Job 0: Map: 9 Reduce: 1 Cumulative CPU: 31.91 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 31 seconds 910 msec
OK
1000000
Time taken: 46.246 seconds
26. Custom Column Mapping
CREATE EXTERNAL TABLE Users(
userid string, name string, email string, phone string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH
SERDEPROPERTIES (
"cassandra.columns.mapping" = ":key,user_name,primary_email,home_phone");
Cassandra:  row key  user_name  primary_email  home_phone
Hive:       userid   name       email          phone
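The slide pairs each Hive column with the Cassandra column in the same position of the mapping string, with ":key" standing for the row key. A plain-Java sketch of that positional pairing, assuming a simple comma-separated mapping as in the example. ColumnMappingSketch and hiveToCassandra are illustrative names, not part of the storage handler.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of positional Hive-to-Cassandra column mapping (hypothetical names).
class ColumnMappingSketch {

    // Pairs the i-th Hive column with the i-th entry of the mapping string.
    static Map<String, String> hiveToCassandra(List<String> hiveColumns, String mapping) {
        String[] cassandraColumns = mapping.split(",");
        if (cassandraColumns.length != hiveColumns.size()) {
            throw new IllegalArgumentException(
                "mapping has " + cassandraColumns.length + " entries but the table has "
                    + hiveColumns.size() + " columns");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < cassandraColumns.length; i++) {
            result.put(hiveColumns.get(i), cassandraColumns[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hiveColumns = Arrays.asList("userid", "name", "email", "phone");
        String mapping = ":key,user_name,primary_email,home_phone";
        for (Map.Entry<String, String> pair : hiveToCassandra(hiveColumns, mapping).entrySet()) {
            System.out.println("Hive " + pair.getKey() + " <- Cassandra " + pair.getValue());
        }
    }
}
```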