The talk is about typical mistakes that a Python developer with no experience in high-load systems can make.
Possible issues and preventive actions will be discussed.
Expected audience: developers who are new to an existing high-load service or folks who develop a system from scratch.
Everything is based on my own production experience.
PyCon Ukraine 2016: Maintaining a high load Python project for newcomers
1. Maintaining a high load Python project for newcomers
Viacheslav Kakovskyi
PyCon Ukraine 2016
2. Me!
@kakovskyi
Python Developer at SoftServe
Contributor of Atlassian HipChat — Python 2, Twisted
Maintainer of KPIdata — Python 3, asyncio
3. Agenda
● What project is `high load`?
● High-load projects from my experience
● Case study: show the last 5 feedbacks for a university course
● Developer's checklist
● Tools that help to make customers happy
● Summary
● Further reading
6. What project is `high load`?
a project where an inefficient solution or a tiny bug
has a huge impact on your business →
causes an increase in costs $$$ (due to a lack of resources)
or a loss of reputation (due to performance degradation)
7. High-load Python projects from my experience
● Instant messenger:
○ 100 000+ connected users
○ 100+ nodes
○ 100+ developers
● Embedded system for traffic analysis:
○ scaling and upgrade options are unavailable
8. Some examples of issues from my experience
● usage of a less efficient library: json vs. ujson
● usage of a more complex serialization format: XML vs. JSON
● usage of a wrong data format for a certain case: JPEG vs. BMP
● usage of a wrong protocol: TCP vs. UDP
● usage of legacy code without understanding how it works under the hood: 100 PostgreSQL queries instead of 1
● spawning a lot of objects that are never destroyed by the garbage collector
● ...
● deployment of a new feature which does not fit the load profile of your production environment
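To illustrate the first point, a micro-benchmark along these lines shows the gap between json and ujson. A minimal sketch; the payload is made up, and ujson must be installed separately (pip install ujson):

import json
import timeit

import ujson

# Illustrative payload, roughly the size of one feedback entry
payload = {'course_id': 100500, 'feedback': 'x' * 1000, 'tags': list(range(50))}

for name, dumps in (('json', json.dumps), ('ujson', ujson.dumps)):
    seconds = timeit.timeit(lambda: dumps(payload), number=10000)
    print(f'{name}: {seconds:.3f}s for 10 000 serializations')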
9. Terms
● Elasticsearch - a search server that provides a full-text search engine
● Redis - an in-memory data structure server
● Capacity planning - a process to determine the amount of resources that will be needed over some future period of time
● StatsD - a daemon for stats aggregation
● Feature flag - the ability to turn some functionality of an application on/off without a deployment
10. Case study
Let's imagine some application for assessing the quality of higher education
● A university has faculties
● A faculty has departments
● A department has directions
● A direction has groups
● A group has students
● A student takes courses
11. Case study
● A student leaves feedback about courses
● Feedbacks are stored in Elasticsearch for full-text search
A feedback looks like this:
Introduction to Software Engineering. Faculty of Applied Math
Good for those who don't have any previous experience with programming and
algorithms. Optional for prepared folks; they should request additional tasks to
stay in good shape.
12. Case study: show the last 5 feedbacks for the course
[Slide mockup of the course page: "INTRODUCTION TO SOFTWARE ENGINEERING" (course id 100500) with a "Recent feedbacks" panel showing a sample feedback that cites https://en.wikibooks.org/wiki/Introduction_to_Software_Engineering, and a "Faculties" navigation item.]
13. Case study: obvious solution
Request the last 5 feedbacks directly from Elasticsearch
from elasticsearch import Elasticsearch

es = Elasticsearch()

def fetch_feedback(es, course_id, amount):
    query = _build_es_filter_query(doc_type='course',
                                   id=course_id,
                                   amount=amount)
    # blocking call to Elasticsearch
    entries = es.search(index='kpi', body=query)
    result = _validate_and_adapt(entries)
    return result
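The helper _build_es_filter_query is not shown in the slides; a minimal sketch of what it might return, assuming each feedback document carries course_id and created_at fields:

def _build_es_filter_query(doc_type, id, amount):
    # Hypothetical query body: newest feedbacks for one course, capped at
    # `amount`. doc_type is accepted only for parity with the call site.
    return {
        'query': {'term': {'course_id': id}},
        'sort': [{'created_at': {'order': 'desc'}}],
        'size': amount,
    }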
14. Case study
OK, just implement the solution, test on staging, and deploy to production.
17. Case study: optimization
Hypotheses:
● configure Elasticsearch properly for the case
● cache responses from Elasticsearch for some time
● use double writes:
○ write a feedback to Elasticsearch and to a size-limited Redis queue
○ fetch from Redis first
18. Case study: prerequisites from our domain*
● up to 1000 characters allowed for a feedback
● 50 000 feedbacks expected just for Kyiv Polytechnic Institute every year
● 300 000+ applicants in 2016
● 100+ universities in Ukraine if we decide to scale
*it's just an assumption for the example case study
19. Case study: let's measure the current load on production
Operations:
● add a feedback
● retrieve last 5 feedbacks
● find a feedback by a phrase
20. Case study: let's measure the current load on production
Application metrics:
● add a feedback
○ stats.count.feedback.course.added.es
○ stats.timing.feedback.course.added.es
● retrieve the last 5 feedbacks
○ stats.count.feedback.course.fetched.es
○ stats.timing.feedback.course.fetched.es
● find a feedback by a phrase
○ stats.count.feedback.course.found.es
○ stats.timing.feedback.course.found.es
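For context, the stats.count.* and stats.timing.* names above correspond to StatsD counters and timers, which travel as plain-text UDP datagrams. A minimal sketch of the equivalent raw packets (the 320 ms value is made up):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
statsd_addr = ('127.0.0.1', 8125)  # StatsD listens on UDP 8125 by default

# a counter increment and a timing sample, in StatsD's wire format
sock.sendto(b'feedback.course.fetched.es:1|c', statsd_addr)
sock.sendto(b'feedback.course.fetched.es:320|ms', statsd_addr)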
21. Case study: how to add a metric to your code
from elasticsearch import Elasticsearch
from statsd import StatsClient

es = Elasticsearch()
statsd = StatsClient()

def fetch_feedback(statsd, es, course_id, amount):
    # count how often the operation is performed
    statsd.incr('feedback.course.fetched.es')
    query = _build_es_filter_query(doc_type='course', id=course_id,
                                   amount=amount)
    # measure how long the operation takes
    with statsd.timer('feedback.course.fetched.es'):
        # blocking call to Elasticsearch
        entries = es.search(index='kpi', body=query)
    result = _validate_and_adapt(entries)
    return result
22. Case study: how to add a metric to your code
def write_feedback_to_elasticsearch(statsd, es, course_id, doc):
    statsd.incr('feedback.course.added.es')
    with statsd.timer('feedback.course.added.es'):
        # blocking call to Elasticsearch
        result = es.index(index='kpi', doc_type='course',
                          id=course_id, body=doc)
    return result

def find_feedback(statsd, es, phrase, course_id=None):
    statsd.incr('feedback.course.found.es')
    query = _build_es_search_query(doc_type='course',
                                   id=course_id, phrase=phrase)
    with statsd.timer('feedback.course.found.es'):
        # blocking call to Elasticsearch
        entries = es.search(index='kpi', body=query)
    result = _validate_and_adapt(entries)
    return result
25. Case study: visualize collected metrics
Outcomes:
● we know the frequency of operations
● we know the timing of operations
● we know what to optimize
● we can perform capacity planning for a new flow
26. Optimization: double writes
● continue using Elasticsearch as a storage for feedbacks
● write each feedback to both Elasticsearch and Redis
● store last 5 feedbacks in Redis for faster retrieval
● use Elasticsearch for custom queries and full-text search
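Slides 22 and 33 show the two write paths separately; combined, the double write might look like the sketch below (the wrapper name add_feedback is mine):

def add_feedback(statsd, es, redis, course_id, doc):
    # Double write: Elasticsearch remains the source of truth;
    # Redis keeps the hot "last 5" queue for fast reads.
    _write_feedback_to_elasticsearch(statsd, es, course_id, doc)
    _write_feedback_to_redis(statsd, redis, course_id, doc)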
27. Optimization
from elasticsearch import Elasticsearch

es = Elasticsearch()

def fetch_feedback(es, redis, course_id, amount):
    result = None
    if amount <= REDIS_FEEDBACK_QUEUE_SIZE:  # REDIS_FEEDBACK_QUEUE_SIZE = 5
        result = _fetch_feedback_from_redis(redis, course_id, amount)
    if not result:
        result = _fetch_feedback_from_elasticsearch(es, course_id, amount)
    return result
28. Optimization
def _fetch_feedback_from_elasticsearch(es, course_id, amount):
    query = _build_es_filter_query(doc_type='course', id=course_id,
                                   amount=amount)
    # blocking call to Elasticsearch
    entries = es.search(index='kpi', body=query)
    result = _validate_and_adapt(entries)
    return result

def _fetch_feedback_from_redis(redis, course_id, amount):
    queue = redis.get_queue(entity='course', id=course_id)
    # blocking call to Redis
    result = queue.get(amount)
    return result
32. Measure: timing of insert and fetch operations
def _fetch_feedback_from_elasticsearch(statsd, es, course_id, amount):
    statsd.incr('feedback.course.fetched.es')
    query = _build_es_filter_query(doc_type='course', id=course_id,
                                   amount=amount)
    with statsd.timer('feedback.course.fetched.es'):
        # blocking call to Elasticsearch
        entries = es.search(index='kpi', body=query)
    result = _validate_and_adapt(entries)
    return result

def _fetch_feedback_from_redis(statsd, redis, course_id, amount):
    statsd.incr('feedback.course.fetched.redis')
    queue = redis.get_queue(entity='course', id=course_id)
    with statsd.timer('feedback.course.fetched.redis'):
        # blocking call to Redis
        result = queue.get(amount)
    return result
33. Measure: timing of insert and fetch operations
def _write_feedback_to_elasticsearch(statsd, es, course_id, doc):
    statsd.incr('feedback.course.added.es')
    with statsd.timer('feedback.course.added.es'):
        # blocking call to Elasticsearch
        result = es.index(index='kpi', doc_type='course',
                          id=course_id, body=doc)
    return result

def _write_feedback_to_redis(statsd, redis, course_id, doc):
    statsd.incr('feedback.course.added.redis')
    queue = redis.get_queue(entity='course', id=course_id)
    with statsd.timer('feedback.course.added.redis'):
        # blocking call to Redis
        queue.push(doc)
34. Measure: Redis capacity
● A feedback - up to 1000 characters
● Redis is used for storing 5 feedbacks per course
● 10 000 courses for Kyiv Polytechnic Institute
● Key: feedback:course:<course_id>
● Data structure: List
● Commands:
○ LPUSH - O(1)
○ LRANGE - O(S+N), S=0, N=5
○ LTRIM - O(N)
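The slides keep Redis access behind a get_queue abstraction; in raw redis-py commands, the capped queue described on this slide might look like this sketch (helper names are mine):

import redis

REDIS_FEEDBACK_QUEUE_SIZE = 5
r = redis.Redis()

def push_feedback(course_id, doc):
    key = f'feedback:course:{course_id}'
    pipe = r.pipeline()
    pipe.lpush(key, doc)                                # O(1)
    pipe.ltrim(key, 0, REDIS_FEEDBACK_QUEUE_SIZE - 1)   # keep only the newest 5
    pipe.execute()

def last_feedbacks(course_id, amount=REDIS_FEEDBACK_QUEUE_SIZE):
    key = f'feedback:course:{course_id}'
    return r.lrange(key, 0, amount - 1)                 # O(S+N), S=0, N=5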
35. Measure: Redis capacity
● Don't trust benchmarks from the internet
● Run a benchmark for a production-like environment with your sample data
● Example:
○ FLUSHALL
○ define a sample feedback (string up to 1000 characters)
○ create N=10 000 lists with M=5 sample feedbacks
○ measure allocated memory
● You can run an approximate benchmark and extrapolate the expected memory size
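A minimal sketch of that benchmark with redis-py, following the steps above (the sample feedback and key layout mirror the slides; run it against a disposable Redis instance, since FLUSHALL wipes all data):

import redis

N_COURSES = 10000      # lists to create
QUEUE_SIZE = 5         # feedbacks kept per course
FEEDBACK = 'x' * 1000  # worst-case sample feedback: 1000 characters

r = redis.Redis()
r.flushall()
before = r.info('memory')['used_memory']

pipe = r.pipeline(transaction=False)
for course_id in range(N_COURSES):
    for _ in range(QUEUE_SIZE):
        pipe.lpush(f'feedback:course:{course_id}', FEEDBACK)
pipe.execute()

after = r.info('memory')['used_memory']
print(f'allocated: {(after - before) / 2 ** 20:.1f} MB for {N_COURSES} courses')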
36. Measure: Redis capacity
● 76.3 MB for 10000 courses, Kyiv Polytechnic Institute
● 7 GB for 100 Ukrainian universities
37. Measure: Network traffic for Redis
● Measure network traffic for send/receive operations:
○ add_feedback → LPUSH
○ fetch_feedback → LRANGE
● Review the Redis protocol (RESP)
● Calculate the expected sent/received data for the new Redis operations:
○ how much data is sent for LPUSH
○ how much data is received for LRANGE
38. Measure: Network traffic for Redis
from aioredis.util import encode_command

# size in bytes of one RESP-encoded LPUSH command
add_feedback = len(encode_command(
    b'LPUSH', b'feedback:course:100500', b'MY_AWESOME_FEEDBACK'))
https://github.com/aio-libs/aioredis/blob/master/aioredis/util.py
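Turning per-command byte counts into bandwidth is then simple arithmetic. The MAX/AVG figures on the next slide came from measurements; the sketch below only shows the calculation, with placeholder numbers:

# Illustrative values only: rates would come from your StatsD counters,
# sizes from encode_command() and captured LRANGE replies.
ADD_RPS, FETCH_RPS = 200, 500
ADD_BYTES, FETCH_REPLY_BYTES = 1060, 5200

add_mbps = ADD_RPS * ADD_BYTES * 8 / 1e6
fetch_mbps = FETCH_RPS * FETCH_REPLY_BYTES * 8 / 1e6
print(f'LPUSH: {add_mbps:.1f} Mbps sent, LRANGE: {fetch_mbps:.1f} Mbps received')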
39. Measure: Network traffic for Redis *
● MAX add_feedback_traffic = 1.5 Mbps
● AVG add_feedback_traffic = 0.8 Mbps
● MAX fetch_feedback_traffic = 30 Mbps
● AVG fetch_feedback_traffic = 10 Mbps
* This step is optional and depends on your architecture
40. Summary of the investigation around double writes
● 90% of fetch feedback requests could be served by Redis
● the initial issue (Elasticsearch running out of queue capacity) should be avoided
41. Summary of the investigation
● Reduced:
○ fetch feedback time
■ 2 ms per fetch for 90% of cases
● Increased:
○ insert feedback time
■ 16 ms per insert
○ Redis capacity
■ 76.3 MB for 10000 courses, Kyiv Polytechnic Institute
■ 7 GB for 100 Ukrainian universities
○ network traffic for Redis
■ 11 Mbps
42. Making a decision
● Implement a prototype
● Discuss collected stats with Ops
● and with business stakeholders
● Implement the solution
● Deploy under a feature flag
43. Adding a feature flag
from feature import Feature

feature = Feature()

def fetch_feedback(feature, statsd, es, redis, course_id, amount):
    result = None
    if (feature.is_enabled('fetch_feedback_from_redis')
            and amount <= REDIS_FEEDBACK_QUEUE_SIZE):  # 5 feedbacks in queue
        result = _fetch_feedback_from_redis(statsd, redis, course_id, amount)
    if feature.is_enabled('fetch_feedback_from_elasticsearch') and not result:
        result = _fetch_feedback_from_elasticsearch(statsd, es, course_id,
                                                    amount)
    return result
46. RPS. Feature "Fetch last 5 feedbacks about a course". Rolled out for 1% of users.
[Graph: requests per second, fetch from Elasticsearch vs. fetch from Redis]
47. Incremental rollout prevented the incident
EsRejectedExecutionException[rejected execution (queue capacity 1000)
on org.elasticsearch.search.action.SearchServiceTransportAction]
48. Investigation
● Disable the feature
● Run an investigation
○ Only recent feedbacks are retrieved from Redis
○ Legacy feedbacks are fetched directly from Elasticsearch
● Solution
○ Write legacy feedbacks to Redis using a background job
49. Fixing missed data in Redis
def fetch_feedback(feature, statsd, es, redis, course_id, amount):
    fetched_from_redis, result = False, None
    if (feature.is_enabled('fetch_feedback_from_redis')
            and amount <= REDIS_FEEDBACK_QUEUE_SIZE):
        fetched_from_redis = True
        result = _fetch_feedback_from_redis(statsd, redis, course_id, amount)
    redis_was_empty = fetched_from_redis and not result
    if feature.is_enabled('fetch_feedback_from_elasticsearch') and not result:
        result = _fetch_feedback_from_elasticsearch(statsd, es, course_id,
                                                    amount)
    if redis_was_empty:  # backfill Redis with the data fetched from Elasticsearch
        fill_redis(redis, result, amount=REDIS_FEEDBACK_QUEUE_SIZE)
    return result
50. RPS. Feature "Fetch last 5 feedbacks about a course". Fixed and rolled out for 1% of users.
[Graph: requests per second, fetch from Elasticsearch vs. fetch from Redis]
51. RPS. Feature "Fetch last 5 feedbacks about a course". Fixed and rolled out for 100% of users.
[Graph: requests per second, fetch from Elasticsearch vs. fetch from Redis]
53. Developer's checklist for adding a feature to a high-load project
● discover which services are hit by the feature
○ database
○ cache
○ storage
○ whatever
● measure the impact of the feature on the existing environment
○ call frequency
○ amount of memory
○ traffic
○ latency
54. Developer's checklist for adding a feature to a high-load project (2)
● calculate allowed load for the feature
○ requests per second for the existing environment
○ the timing of request processing
● calculate the additional load for the feature
○ latency for additional requests
○ how to deal with a lack of resources
55. Developer's checklist for adding a feature to a high-load project (3)
● discuss the acceptability of the solution
○ with peers
○ with Ops
○ with business owners
● consider alternatives if needed
● perform load testing on staging
● roll out the feature to production incrementally
56. Tools that help to make customers happy
● profiling:
○ cProfile
○ kcachegrind
○ memory_profiler
○ guppy
○ objgraph
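For example, a quick look at where fetch_feedback spends its time with the stdlib cProfile (the call uses the slide-27 signature, purely for illustration):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
fetch_feedback(es, redis, course_id=100500, amount=5)
profiler.disable()

# print the top 10 calls by cumulative time
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)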
57. Tools that help to make customers happy (2)
● metrics
○ StatsD
● graphs and dashboards
○ Grafana
○ Graphite
● logging
○ Elasticsearch
○ Logstash
○ Kibana
58. Tools that help to make customers happy (3)
● feature flags:
○ Gargoyle and Gutter from Disqus
○ Flask-FeatureFlags
○ Switchboard
● alerting:
○ elastalert
○ monit
○ graphite-beacon
○ cabot
60. Summary
● Be careful with calls to external services
● Collect metrics about the state of your production environment
● Perform capacity planning for "serious" changes
● Use application metrics and measure the potential load
● Roll out new code incrementally with feature flags
● Set up proper monitoring; it can prevent the majority of incidents
● Use the tools; it's really easy
● Be ready to roll back fast
61. To be continued
● asynchronous programming
● infrastructure as a service
● testing
● monitoring and alerting
● dealing with bursty traffic
● OS and hardware metrics
● scaling
● distributed applications
● continuous integration
62. Further reading
● How HipChat Stores and Indexes Billions of Messages Using ElasticSearch and Redis
● Continuous Deployment at Instagram
● How Twitter Uses Redis To Scale
● Why Leading Companies Dark Launch - LaunchDarkly Blog
● Lessons Learned From A Year Of Elasticsearch ... - Tech blog
● Notes on Redis Memory Usage
● Using New Relic to Understand Redis Performance: The 7 Key Metrics
● A guide to analyzing Python performance