Couchbase Connect 2016
1. Michael Kehoe
Staff Site Reliability Engineer
LinkedIn
Going all in:
From single use-case to many
2. 2
Overview
• The LinkedIn Story
• Couchbase Use-Cases
• Development & Operations
• Conclusions
• Questions
3. $ whoami
Michael Kehoe
• Staff Site Reliability Engineer (SRE)
• Production-SRE team
• Funny accent = Australian
• Contact
• linkedin.com/in/michaelkkehoe
• @matrixtek
4. $ whatis SRE
Michael Kehoe
• Site Reliability Engineering
• Operations for the production application environment
• Responsibilities include
• Architecture design
• Capacity planning
• Operations
• Tooling
5. $ whatis CBVT
Michael Kehoe
• Couchbase Virtual Team
• ~10 SREs
• 2 Software Engineers
• Sponsored by SRE Director
• Members spend 5-90% of their time supporting Couchbase
• Encourage as many people to contribute as possible
• What do we do?
• Operational work on Couchbase clusters
• Evangelize the use of Couchbase within LinkedIn
• Develop tools for the Couchbase Ecosystem
6. 6
The LinkedIn Story
• Founded in 2002, LinkedIn has grown into the world’s largest professional social media network
• 30 offices in 24 countries, available in 24 languages
• More than 450 million members worldwide
7. 7
The LinkedIn Story
• Growth in Products
• Profiles
• Groups
• Recruiter
• Sales Navigator
• Growth in Internet Traffic
• Billions of page-hits per day
• 100k+ QPS to production services
8. In-Memory Storage Needs
The LinkedIn Story
• LinkedIn started as an Oracle shop
• Hyper-growth = Scaling challenges
• Read-Scaling becomes important
• Applicable use-cases
• Simple cache store
• Pre-warmed
• Read through
• Potential for Source of Truth (SoT) store
9. Enter Couchbase
The LinkedIn Story
• Until 2012, we used only Memcached, as a non-SoT in-memory store
• Drawbacks
• Difficult to pre-warm
• No partitioning/sharding (had to write our own)
• Cold-cache restarts
• Difficult to move data across hosts, clusters, and data centers
10. Enter Couchbase
The LinkedIn Story
• Evaluated replacement systems for Memcached: Mongo, Redis, and others
• Couchbase had distinct advantages:
• Simple replacement for Memcached
• Built-in replication and cluster expansion
• Automatic partitioning
• Low latency
• Async writes to disk
• Building tooling is simple
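The automatic partitioning listed above can be illustrated with a minimal sketch. It assumes a plain CRC32 hash modulo a fixed number of partitions (vBuckets); the real Couchbase SDK hash differs in detail, and the names `vbucket_for_key`, `build_map`, and `node_for_key` are hypothetical helpers, not Couchbase APIs.

```python
import zlib

# Assumption: 1024 partitions (vBuckets) per bucket; treat this as illustrative.
NUM_VBUCKETS = 1024


def vbucket_for_key(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Map a key to a partition id with a simplified CRC32-based hash."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def build_map(nodes, num_vbuckets: int = NUM_VBUCKETS):
    """Assign partitions to nodes round-robin (a stand-in for the cluster map)."""
    return [nodes[i % len(nodes)] for i in range(num_vbuckets)]


def node_for_key(key: str, vbucket_map) -> str:
    """Look up which node owns the partition a key hashes to."""
    return vbucket_map[vbucket_for_key(key, len(vbucket_map))]
```

Because clients hash keys to partitions rather than to hosts, expanding a cluster only changes the partition-to-node map; the application does not need its own sharding logic, which was the Memcached drawback noted earlier.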
11. Enter Couchbase
The LinkedIn Story
• Today we run Couchbase in our Corporate, Staging and Production environments
• Production/Staging statistics:
• 148 buckets
• 2821 hosts
• 10M+ QPS
• Largest Clusters:
• By Hosts: 72 Hosts
• By Documents: 1.4B Documents
• By QPS: 2.5M QPS
13. Simple read-through cache
Use-Cases
• Drop-in replacement for memcache
• Read-scaling
• Protecting backend database from large amounts of traffic
• E.g. 3rd party ingestion credential cache
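The read-through pattern above can be sketched as follows. This is a minimal in-memory illustration, not LinkedIn's implementation: the dict stands in for a Couchbase bucket, and `ReadThroughCache` and `load_from_backend` are hypothetical names.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Serve reads from the cache; on a miss, load from the backend and store."""

    def __init__(self, load_from_backend: Callable[[str], Optional[str]]):
        self._store: Dict[str, str] = {}  # stand-in for a Couchbase bucket
        self._load = load_from_backend    # e.g. a query against the database
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: this is the only path that touches the backend.
        self.misses += 1
        value = self._load(key)
        if value is not None:
            self._store[key] = value
        return value
```

Repeated reads of the same key reach the backend only once, which is how the cache protects the database from large amounts of read traffic.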
14. Counter Store
Use-Cases
• In certain places, we simply need to increment counters from multiple systems and store them
• E.g. Anti-abuse/Anti-scraping systems (Fuse)
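A counter store of this kind relies on increments being atomic when several writers hit the same key. The sketch below shows the idea with an in-process lock; it is an illustration only, with `CounterStore` as a hypothetical stand-in for a bucket's atomic counter operation, and it says nothing about how Fuse itself is built.

```python
import threading
from collections import defaultdict


class CounterStore:
    """In-memory stand-in for a shared counter store with atomic increments."""

    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def incr(self, key: str, delta: int = 1) -> int:
        """Atomically add delta to the counter and return the new value."""
        with self._lock:
            self._counts[key] += delta
            return self._counts[key]

    def get(self, key: str) -> int:
        """Return the current value, or 0 if the counter was never touched."""
        with self._lock:
            return self._counts.get(key, 0)
```

An anti-scraping check could, for example, call `incr` once per request for a client key and compare the returned value against a threshold.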
16. SoT Store for Internal Tools
Use-Cases
• For non-member-facing tools, we use Couchbase as a SoT store.
• Benefits:
• Schema-less
• Short setup time
• Couchbase Python Client works easily in our environment
• Use views for simple map-reduce
• Example Uses:
• Nurse – Auto-remediation system
• TrafficshiftIn – Global traffic automation system
• Availability – Storing and tracking LinkedIn availability data
18. 18
Developing around Couchbase
• Java – li-couchbase-client
• Wrapper around standard Java Couchbase Client
• Custom metrics emission
• Using Spring interface
• Storing data as Java serialized objects
• Python – couchbase-python-client
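The idea of a thin client wrapper with custom metrics emission can be sketched in Python. This is not li-couchbase-client: `InstrumentedClient` and `DictClient` are hypothetical names, and the wrapped client is assumed to expose `get` and `upsert` methods.

```python
import time
from collections import defaultdict


class DictClient:
    """Hypothetical stand-in for a real key-value client."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def upsert(self, key, value):
        self._data[key] = value
        return True


class InstrumentedClient:
    """Wrap a client and record call counts and latencies per operation."""

    def __init__(self, client):
        self._client = client
        self.call_counts = defaultdict(int)
        self.latencies_ms = defaultdict(list)

    def _timed(self, name, fn, *args):
        start = time.perf_counter()
        try:
            return fn(*args)
        finally:
            # Record the metric even if the underlying call raised.
            self.call_counts[name] += 1
            self.latencies_ms[name].append((time.perf_counter() - start) * 1000)

    def get(self, key):
        return self._timed("get", self._client.get, key)

    def upsert(self, key, value):
        return self._timed("upsert", self._client.upsert, key, value)
```

Keeping instrumentation in one wrapper means every application that uses it reports the same metrics without extra code.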
19. 19
Operational Tooling
To use Couchbase efficiently as SREs, we need the following:
• Provisioning
• Installation
• Monitoring & Alerting
• Infrastructure Visibility
20. Provisioning
Operational Tooling
• Provisioning Flow
• Seek estimated usage statistics for cluster
• Size of data to be stored
• QPS
• Redundancy Needs
• Calculate cluster sizing
• Currently done with a template
• Couchbase has a simple calculator available online: http://docs.couchbase.com/prebuilt/calculators/sizing-calc.html
• Request hardware for cluster(s)
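The sizing step in the flow above can be sketched as a small calculation. This is neither LinkedIn's template nor Couchbase's calculator: the function name, the 25% metadata overhead, the 80% RAM quota, and the per-node QPS figure are all assumptions chosen for illustration.

```python
import math


def estimate_nodes(data_gb, replicas, ram_per_node_gb, qps, qps_per_node,
                   ram_quota_fraction=0.8, overhead=1.25):
    """Estimate node count from data size, QPS, and redundancy needs.

    All default factors are illustrative assumptions, not vendor guidance.
    """
    # RAM needed to hold the active data plus every replica copy, with overhead.
    total_ram_gb = data_gb * (1 + replicas) * overhead
    usable_ram_gb = ram_per_node_gb * ram_quota_fraction
    nodes_for_ram = math.ceil(total_ram_gb / usable_ram_gb)
    nodes_for_qps = math.ceil(qps / qps_per_node)
    # Each replica must live on a different node than the active copy.
    nodes_for_redundancy = replicas + 1
    return max(nodes_for_ram, nodes_for_qps, nodes_for_redundancy)
```

For example, under these assumptions 100 GB of data with one replica on 64 GB hosts at 200k QPS (50k QPS per node) is bounded by memory rather than throughput.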
21. Installation
Operational Tooling
• Process
• Enter cluster metadata into our management system (Range)
• Use Salt States to install and configure cluster
• See Issa Fattah’s post for more information:
• https://engineering.linkedin.com/blog/2016/04/leveraging-saltstack-to-scale-couchbase
• Benefits
• Ability to perform ‘state enforcement’
• Using Salt Pillars to encrypt cluster/bucket passwords end-to-end
22. Monitoring & Alerting
Operational Tooling
• We run a daemon on each Couchbase Server that collects metrics every minute via Couchbase APIs
• Use cluster metadata from Range to build dashboards with our own system, InGraphs
• See: ‘Monitoring production deployments’: 4pm - Great America 1
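A single collection pass of such a daemon might look like the sketch below. It assumes the Couchbase REST bucket-stats endpoint on port 8091 and a response shaped as `{"op": {"samples": {...}}}`; authentication, scheduling, and the hand-off to InGraphs are omitted, and the helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote


def bucket_stats_url(host: str, bucket: str, port: int = 8091) -> str:
    """Build the (assumed) REST URL for a bucket's recent statistics."""
    return f"http://{host}:{port}/pools/default/buckets/{quote(bucket, safe='')}/stats"


def latest_samples(payload: dict, names):
    """Pick the most recent sample for each requested stat, skipping absent ones."""
    samples = payload.get("op", {}).get("samples", {})
    return {n: samples[n][-1] for n in names if samples.get(n)}


def collect_once(host, bucket, names, opener=urllib.request.urlopen):
    """Fetch stats once; a daemon would call this every minute per bucket."""
    # Note: a real cluster requires credentials, which this sketch leaves out.
    with opener(bucket_stats_url(host, bucket)) as resp:
        return latest_samples(json.load(resp), names)
```

Passing `opener` as a parameter keeps the parsing logic testable without a live cluster.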
24. Management
Operational Tooling
• We want to see a world-view of all the clusters we run
• Having bucket-, cluster-, and server-level statistics is useful
• Having a global view of who owns and operates each cluster/bucket is useful
26. 26
Conclusions
• Couchbase was a natural fit into our existing infrastructure
• Building an ecosystem around Couchbase was important to us and has helped Couchbase be successful at LinkedIn
• Expanding use of Couchbase
• In the past year we’ve grown the number of buckets by over 50%
• Starting to use Views in production
• Moving Couchbase into LinkedIn standard deployment infrastructure
Site Reliability Engineering
A term coined by Ben Treynor from Google
Hybrid of:
Sysadmin
Network Engineer
Architect
Troubleshooter
Software Engineer
Ninjas – Digital economy
10 SREs, with a tech lead
Sponsored by an SRE Director
Input from Software Engineers on development
LinkedIn started as an Oracle shop
To date, we still run a significant number of Oracle databases
Oracle is fine for writes; scaling reads becomes challenging
Hyper-growth == Scaling challenges
Scaling writes isn’t a common problem in most cases
Scaling reads to 100k+ QPS is challenging
Failures in read-scaling infra can take down back-end systems
Applicable use-cases
Simple cache store
Pre-warmed
Read-through
SoT Store
Until 2012, we were only using Memcache as a non SoT In-Memory store
Drawbacks of memcache:
Difficult to pre-warm, not easy to copy data
No native sharding for clusters, had to write our own
Restarting memcache servers caused problems
Couldn’t copy data across for new DCs, expanding clusters, etc.
Mid-2012, started testing Couchbase
Evaluated replacement systems for Memcached: Mongo, Redis, and others
Couchbase had distinct advantages
Simple replacement for memcache; Java Spring made this simpler
Built-in replication and cluster expansion, which significantly reduces ops workload
Automatic partitioning, so it is no longer a concern
Low latency; reads from disk are still very fast
Async writes to disk; can write a lot of data at once without it being a problem
Lots of APIs that make tooling relatively simple
Insert fuse architecture
We have a deduplication filter in stork that you can take advantage of to make sure we don't send duplicates of your email. This is highly recommended for any email using Kafka (Kafka can potentially deliver your email to our system twice)
Don’t use as a SoT store, as Espresso is our primary key-value store