4. 0.6 to 1.2
• 1,352 changed files with 235,413 additions and 47,487 deletions
• 7,429 commits
• 1,653 tickets completed
https://github.com/apache/cassandra/compare/cassandra-0.6.0...cassandra-1.2
https://github.com/apache/cassandra/blob/trunk/CHANGES.txt
#CASSANDRAEU
CASSANDRASUMMITEU
5. What this talk is about
Cassandra adoption at Hailo from three perspectives:
1. Development
2. Operational
3. Management
6. What is Hailo?
Hailo is The Taxi Magnet. Use Hailo to get a cab wherever you are, whenever you want.
10. What is Hailo?
• The world’s highest-rated taxi app – over 11,000 five-star reviews
• Over 500,000 registered passengers
• A Hailo hail is accepted around the world every 4 seconds
• In nearly 2 years of operation, Hailo has expanded to 15 cities on 3 continents, from Tokyo to Toronto
11. Hailo is growing
• Hailo is a marketplace that facilitates over $100M in run-rate
transactions and is making the world a better place for passengers
and drivers
• Hailo has raised over $50M in financing from the world's best
investors including Union Square Ventures, Accel, the founder of
Skype (via Atomico), Wellington Partners (Spotify), Sir Richard
Branson, and our CEO's mother, Janice
12. The history
The story behind Cassandra adoption at Hailo
13. Hailo launched in London in November 2011
• Launched on AWS
• Two PHP/MySQL web apps plus a Java backend
• Mostly built by a team of 3 or 4 backend engineers
• MySQL multi-master for single AZ resilience
14. Why Cassandra?
• A desire for greater resilience – “become a utility”
Cassandra is designed for high availability
• Plans for international expansion around a single consumer app
Cassandra is good at global replication
• Expected growth
Cassandra scales linearly for both reads and writes
• Prior experience
I had experience with Cassandra and could recommend it
15. The path to adoption
• Largely unilateral decision by developers – a result of a startup
culture
• Replacement of key consumer app functionality, splitting up the
PHP/MySQL web app into a mixture of global PHP/Java services
backed by a Cassandra data store
• Launched into production in September 2012 – originally just
powering North American expansion, before gradually switching
over Dublin and London
16. One year on...
• Further breakdown of functionality into Go/Java SOA
• Migrating all online databases to Cassandra
21. Considerations for entity storage
• Do not read the entire entity, update one property and then write
back a mutation containing every column
• Only mutate columns that have been set
• This avoids read-before-write race conditions
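The update pattern above can be sketched in a few lines. This is a minimal illustration, not Hailo's actual code: a dict stands in for the Cassandra row, all function and column names are hypothetical, and a real implementation would send the result as column-level mutations through a client library.

```python
# Sketch of the "only mutate what was set" entity-update pattern.
# A dict stands in for the row; a real implementation would issue
# these as column mutations via a Cassandra client. All names are
# illustrative.

def build_mutation(changes):
    """Turn only the explicitly-set properties into a mutation.

    Properties the caller never touched are simply absent, so a
    concurrent writer updating a different property is not clobbered
    by a stale read-modify-write of the whole entity.
    """
    return {col: val for col, val in changes.items() if val is not None}

def apply_mutation(row, mutation):
    # Equivalent of a column-level write: untouched columns survive.
    row.update(mutation)
    return row

# Two writers race on the same entity; neither reads the row first.
row = {"name": "Alice", "phone": "123", "city": "London"}
apply_mutation(row, build_mutation({"phone": "456"}))
apply_mutation(row, build_mutation({"city": "Dublin"}))
assert row == {"name": "Alice", "phone": "456", "city": "Dublin"}
```

Because neither writer sends columns it did not change, the interleaving order of the two updates no longer matters.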
26. Considerations for time series storage
• Choose row key carefully, since this partitions the records
• Think about how many records you want in a single row
• Denormalise on write into many indexes
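These considerations can be sketched as follows, assuming a daily bucket and time-ordered column names. A real implementation would use a type 1 (time-based) UUID as the column name; the tuple here is an illustrative stand-in, and all index names are hypothetical.

```python
# Sketch of daily-bucketed time-series keys with write-time
# denormalisation into multiple indexes. Names are illustrative.
from datetime import datetime, timezone

def row_key(event_time, index="messages"):
    # The row key partitions the records: one row per index per day
    # keeps the number of records in a single row bounded.
    return f"{index}:{event_time.strftime('%Y%m%d')}"

def column_name(event_time, seq):
    # Sorts chronologically within the row, like a TimeUUID would.
    return (event_time.isoformat(), seq)

def index_rows(event):
    # Denormalise on write: emit a key for every index a read path
    # will need (per-day here, plus a per-address index).
    t = event["time"]
    return [row_key(t), f"address:{event['address']}:{t.strftime('%Y%m%d')}"]

t = datetime(2013, 10, 17, 9, 30, tzinfo=timezone.utc)
assert row_key(t) == "messages:20131017"
assert column_name(t, 1) < column_name(t.replace(hour=10), 0)
```

The trade-off is that each logical write fans out into several physical writes, which Cassandra handles cheaply; the reads stay single-row lookups.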
28. Analytics
• With Cassandra we lost the ability to carry out ad-hoc analytics queries, e.g. COUNT, SUM, AVG, GROUP BY
• We use Acunu Analytics to give us this ability in real time, for pre-planned query templates
• It is backed by Cassandra and therefore highly available, resilient
and globally distributed
• Integration is straightforward
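Acunu's internals aside, the general technique behind real-time analytics on Cassandra is to pre-aggregate on write for each pre-planned query template. A minimal sketch, with plain in-memory counters standing in for Cassandra counter-column increments and all names illustrative:

```python
# Sketch of write-time pre-aggregation for pre-planned queries:
# COUNT/SUM/AVG per group, maintained incrementally on ingest
# instead of scanned at query time.
from collections import defaultdict

class RollingAggregate:
    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def ingest(self, group, value):
        # In Cassandra this would be a counter-column increment.
        self.count[group] += 1
        self.total[group] += value

    def avg(self, group):
        return self.total[group] / self.count[group]

agg = RollingAggregate()
for city, fare in [("London", 10.0), ("London", 14.0), ("Dublin", 8.0)]:
    agg.ingest(city, fare)
assert agg.count["London"] == 2
assert agg.avg("London") == 12.0
```

The cost is that the query templates (the GROUP BY dimensions) must be known up front, which is exactly the constraint noted above.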
33. “Allows a team of 2 to achieve things they
wouldn’t have considered before Cassandra
existed”
Chris H, Operations Engineer
37. Stats
• Cluster: AWS VPCs with OpenVPN links, 3 AZs per region, m1.large machines, ~1TB/node, Provisioned IOPS EBS
• Operational cluster: ~200GB/node
38. Backups
• SSTable snapshot
• We used to upload snapshots to S3, but this was taking >6 hours and consuming all our network bandwidth
• Now take EBS snapshot of the data volumes
39. Encryption
• Requirement for NYC launch
• We use dmcrypt to encrypt the entire EBS volume
• Chose dmcrypt because it is uncomplicated
• Our tests show a ~1% hit in disk performance, which concurs with what Amazon suggests
41. Multi DC
• Something that Cassandra makes trivial
• Accomplishing active-active inter-DC replication with a team of 2 would have been very difficult without Cassandra
• Rolling repair needed to make it safe (we use LOCAL_QUORUM)
• We schedule “narrow repairs” on different nodes in our cluster
each night
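A nightly narrow-repair rota like the one described can be sketched as below. The hostnames are hypothetical, the token range assumes the Murmur3 partitioner (adjust for RandomPartitioner), and a real job would shell out to `nodetool repair` with `-st`/`-et` for each subrange.

```python
# Sketch of a nightly "narrow repair" rota: each night one node
# repairs a slice of the token range, cycling through the cluster.
from datetime import date

NODES = ["cass1", "cass2", "cass3"]  # hypothetical hostnames
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1  # Murmur3 token range

def tonights_node(today, nodes=NODES):
    # Rotate nightly so every node is repaired within len(nodes) nights.
    return nodes[today.toordinal() % len(nodes)]

def subranges(n):
    # Split the full token range into n "narrow" repair ranges, so
    # each repair session touches only a fraction of the data.
    span = (MAX_TOKEN - MIN_TOKEN) // n
    ranges = []
    for i in range(n):
        start = MIN_TOKEN + i * span
        end = MAX_TOKEN if i == n - 1 else start + span
        ranges.append((start, end))
    return ranges

assert tonights_node(date(2013, 10, 17)) in NODES
assert subranges(8)[0][0] == MIN_TOKEN
assert subranges(8)[-1][1] == MAX_TOKEN
```

Spreading repair out this way keeps each session short and avoids the cluster-wide load spike of a full repair, which matters when reads rely on LOCAL_QUORUM plus repair for cross-DC safety.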
42. Compression
• Our stats cluster was running at ~1.5TB per node
• We didn’t want to add more nodes
• With compression, we are now back to ~600GB
• Easy to accomplish
• `nodetool upgradesstables` on a rolling schedule
44. “The days of the quick and dirty are over”
Simon V, EVP Operations
45. Technically, everything is fine…
• Our COO feels that C* is “technically good and beautiful”, a
“perfectly good option”
• Our EVPO says that C* reminds him of a time series database in
use at Goldman Sachs that had “very good performance”
…but there are concerns
46. [Figure: relative sizes of "people who can attempt to query MySQL" vs. "people who can attempt to query Cassandra"]
51. Lesson learned
• Have an advocate - get someone who will sell the vision internally
• Learn the theory - teach each team member the fundamentals
• Make an effort to get everyone on board
58. Lesson learned
• Be proactive with Cassandra, even if it seems to be running smoothly
• Peer-review data models, take time to think about them
• Big rows are bad - use cfstats to look for them
• Mixed workloads can cause problems - use cfhistograms and look
out for signs of data modeling problems
• Think about the compaction strategy for each CF
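The cfstats check above can be automated with a small helper. The "Compacted row maximum size" line matches 1.x-era `nodetool cfstats` output; treat the exact wording as an assumption and adjust the pattern for your Cassandra version.

```python
# Sketch: scan `nodetool cfstats` output for oversized rows, which
# usually expose a data modeling problem. The line format is assumed
# from 1.x-era output.
import re

ROW_MAX = re.compile(r"Compacted row maximum size:\s*(\d+)")

def big_row_warnings(cfstats_text, threshold=100 * 1024 * 1024):
    """Yield max-row sizes (bytes) at or above the threshold (default 100MB)."""
    for m in ROW_MAX.finditer(cfstats_text):
        size = int(m.group(1))
        if size >= threshold:
            yield size

sample = """
    Column Family: events
    Compacted row maximum size: 186563160
    Column Family: entities
    Compacted row maximum size: 4768
"""
assert list(big_row_warnings(sample)) == [186563160]
```

Run against the saved output of `nodetool cfstats` on a cron, this turns "look for big rows" into an alert rather than a manual chore.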
60. Lessons learned
• EBS is nearly always the cause of Amazon outages
• EBS is a single point of failure (it will fail everywhere in your
cluster)
• EBS is slow
• EBS is expensive
• EBS is unnecessary!
62. Lessons learned
• Keep the business informed – explain the tradeoffs in simple terms
• Sing from the same hymn sheet
• Make sure there are solutions in place for every use case from the beginning
63. [Figure repeated: relative sizes of "people who can attempt to query MySQL" vs. "people who can attempt to query Cassandra"]
65. We like Cassandra
• Solid design
• HA characteristics
• Easy multi-DC setup
• Simplicity of operation
66. Lessons for successful adoption
• Have an advocate, sell the dream
• Learn the fundamentals, get the best out of Cassandra
• Invest in tools to make life easier
• Keep management in the loop, explain the trade-offs
67. The future
• We will continue to invest in Cassandra as we expand globally
• We will hire people with experience running Cassandra
• We will focus on expanding our reporting facilities
• We aspire to extend our network (1M consumer installs, wallet)
beyond cabs
• We will continue to hire the best engineers in London, NYC and
Asia
I started using Cassandra in 2010, back in version 0.6. Back then it was quite hard work.
I founded the London meetup group in 2010 and have been flying the C* flag over London ever since. My motivation was to connect with others who were using Cassandra. Back then “swapping war stories” was a common theme. Cassandra was not easy to use.
Fast forward to 2013. 7,429 commits later. Cassandra “just works”. Kudos to the team of committers and contributors who have made this happen.
Whilst "it just works" is quite compelling, there are still challenges to successful adoption of C* in an organisation. I am going to talk about our experiences at Hailo, from three perspectives: dev, ops and management.
On iOS and Android, live in London, New York, Chicago, Toronto, Boston, Dublin, Madrid
Founded by 3 taxi drivers and 3 seasoned entrepreneurs.
Built by a small team, in one room, on a boat on the Thames, but with global ambitions. Cloud native from day 1 – run solely on AWS.
My recommendation was based on the solid design principles behind C*, something I’ve talked about in the past.
Row key = entity ID, in this instance a 64-bit integer à la Snowflake. Column name = property name. Value = property value. A key point when using this pattern is to only mutate columns that you change.
Read heavy, demand-driven. Writes consistent.
Time series for storing records of all actions in Hailo. In this instance bucketed by a daily row key, for all messages. The column name is a type 1 UUID.
We also denormalise for other indexes, eg: here we store every message sent to a given address under a single row.
Stats service – insert rate at 5k/sec. Responsible for storing business events from all areas of our system.
We are not using CQL.
We can execute AQL
London, NYC, Tokyo, Osaka, Dublin, Toronto, Boston, Chicago, Madrid, Barcelona, Washington, Montreal
Our rings, plus key stats (m1.large, 18 nodes in cluster A, 12 nodes in cluster B, 100GB per node in cluster A, ~ 600GB in cluster B)
EC2 snitch
I interviewed key people from our management team to gauge their reaction to our C* deployment.
There is a perception that we have made it much harder to get at our data. In the early days at Hailo, when we all worked in one room, developers could execute ad-hoc queries on the fly for management. Nowadays we can't. The reasons behind this are two-fold. Firstly, it is true that ad-hoc queries are harder to execute against C*. But that's not the whole picture: much of our data is still in MySQL, and the queries we used to run against that data do not run smoothly either. The perception, however, is that the "new database" is the cause of the problems.
It’s easy to cause yourself a “Big Data” problem. Developers collect and store data because they can, without being clear about the business implications.
1. Most people have N years of SQL experience where N >= 5
Sometimes C* works too well. Clearly this cluster needs some attention, but our application is still working fine. We are probably at the point where we need a dedicated C* expert.
2. It’s possible to shoot yourself in the foot – but this is true of SQL (eg: joins that work with low data volumes)
Big rows are bad – they expose a data modeling problem
With the right tools, we could change the picture completely.