High performance Web Applications @ Minted - Notes
How many people here work for a startup? How many people have their own startups?
<slide>
For those of us who are lucky enough to work for a successful startup, one of the stories you’ll
hear over and over again is that of an idea finally taking off and gaining traction in the market.
The startup works hard and iterates on their product over and over again until it finally gains the
recognition that it deserves, more and more customers start to pour into their service, and they
find their initial implementation unable to scale past their pool of early adopters. This is the
problem we seek to understand better today.
<slide>
For those of you here with scientific computing and analytics backgrounds, a lot of the same
concepts that we use in high-performance e-commerce web applications also apply to other
fields of computing. I invite you to be open-minded about the analogies we will make, and to
extend these ideas for your benefit in your field of work.
<explain slide>
<slide>
Let’s start with a story. You’re on your way home after a long day of work and, instead of trying your luck again with more “culinary experimentation” at home, you decide to pick up some food at the grocery store. After pacing back and forth between the refrigerators in the frozen foods aisle, you finally settle on a choice of “just get it over with” pizza over “not unhealthy enough to say no” ramen. Feeling defeated, you retreat to the checkout section where, to your surprise, you find only two cash registers open, with long lines of other disappointed customers.
<slide>
As your fingers slowly accumulate frostbite from the pizza, you can clearly see the six
unattended cash registers on the side. You start to wonder. “Why do they only have two lines?”
“Why would they make a decision to hire only two workers?” “Are they trying to make my life
more miserable?” “I could do a better job of managing this store than them.” And as the arrival
schedule for the last bus home slowly slips away, you start to contemplate the life decisions you’ve made up until that point.
Why did I bring up this story? No, of course, I do not speak from personal experience. These are the thoughts that you do not want your customer to have, regardless of whether you’re running a grocery store or a web application. There are real needs to be addressed and real value to be delivered to the customer by your business, whether it’s shopping online or just trying to find a low-calorie solution for a high-calorie meal. If you want to deliver as much value as you can to your customers, you should serve more customers or deliver more value per customer. You do not want your customers to be waiting in line to give you money. You definitely do not want the line to grow so large that no more customers can visit the store. The experience may tarnish your brand. Demand might not come back again.
<slide>
Let’s look at some analogies to be drawn between our grocery store example and a web
application. In both instances, there is a necessary task to be performed for the customer, which
requires both resources and time. For the grocery store, it’s the task of scanning each one of
the items, swiping the customer’s credit card, and handing them a receipt while wishing them to
“have a nice day”. For a web application, you’re receiving an HTTP request, doing some
processing from the information in the request, and replying with a 200 OK. Sure, maybe one
transaction takes minutes while the other takes milliseconds, but the idea is analogous. I will
refer to this timing between the start and the end of the task as the “latency”.
<slide>
Similarly, for a single worker, the inverse of the latency is the throughput. If you take 12 seconds to complete each task, then you can complete 5 tasks a minute. The throughput of a system determines how many customers you can serve per unit of time, and it’s a very important business metric, because how well you’re doing depends on how many customers can go through the checkout line. So by that logic, decreasing your latency is the first step in increasing your throughput.
<slide>
However, there’s a limit to decreasing your latency. No matter how much you train your cashiers to move their hands faster, there are thermodynamic limits to how fast a human can press buttons on a cash register. Similarly, while it is possible to just keep scaling up the size of your EC2 instances, or get bigger machines with more RAM or faster CPUs, there are also practical limits to how large of a box you can run before the hardware costs of a single box outweigh the performance benefits. <slide> But you don’t need nanosecond-scale latency in order to reach high throughput. And this is also the reason why, instead of hiring cashiers with faster hands, we would rather hire more cashiers.
Throughput is inversely proportional to latency, but it’s directly proportional to concurrency.
<slide>
Concurrency is the number of workers you have processing requests. It’s the number of cashiers, so your store doesn’t grind to a halt when a guy comes down the lane with a cart full of barbecue ribs. It’s the number of web workers you set up for your application, so that your website doesn’t come to a halt when a single visitor runs into a slow route.
<slide>
Your concurrency, divided by your latency, is equal to your throughput.
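That relationship is easy to sanity-check with a toy calculation. A minimal sketch (the numbers are made up for illustration, scaled to per-minute throughput):

```python
# throughput = concurrency / latency
# Toy numbers: cashiers taking 12 seconds per customer.

def throughput_per_minute(concurrency, latency_seconds):
    """Customers served per minute: concurrency / latency, scaled to a minute."""
    return concurrency * 60 / latency_seconds

print(throughput_per_minute(1, 12))  # one cashier: 5.0 customers per minute
print(throughput_per_minute(4, 12))  # four cashiers: 20.0 customers per minute
```

Doubling either the number of workers or the speed of each worker doubles the throughput; the formula treats them symmetrically.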
<slide>
Generally, I refer to the highest throughput that you can achieve with a system as the “throughput capacity”, and the rate of requests currently arriving at your system as the “throughput demand”.
Well, what happens when your capacity does not meet your demand? <slide> Queueing. A queue only forms when there are not enough workers to satisfy all the requests. There are some interesting properties about the existence of queues.
<slide>
First, once a queue forms, it will grow longer unless something changes. This is because once
the input to a system is greater than the output of a system, the excess can only accumulate
within the system. Think of this as turning up the faucet above a leaky bucket.
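The leaky-bucket intuition is easy to simulate. A minimal sketch with made-up rates: requests arrive faster than the workers can drain them, and the excess accumulates every second.

```python
# Toy leaky-bucket simulation: 10 requests arrive per second,
# but the workers can only complete 8 per second.
arrival_rate = 10
service_rate = 8

queue = 0
for second in range(1, 6):
    queue += arrival_rate              # requests pour into the system
    queue -= min(queue, service_rate)  # workers drain what they can
    print(f"after {second}s: {queue} waiting")
# The backlog grows by arrival_rate - service_rate = 2 requests every second.
```

As long as demand exceeds capacity, the backlog grows linearly and without bound.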
<slide>
Second, even after the demand has settled, your workers will stay busy for an extended amount of time processing the rest of the queue until it dissipates.
<slide>
Third, because the size of the queue continues to grow during high demand, the average latency will also increase. The customers at the front of the line not only accumulate extra waiting time themselves, but also force the customers at the back of the line to wait even longer. An unresolved queue will not just bump your average latency once; it will keep increasing latency until the queue disappears.
<slide>
The only conclusion for an over-capacity system is that the queue will eventually overflow, at which point additional demand is turned away. This is true for grocery stores, and it is true for web servers. At that point, the business is bottlenecked on your application’s throughput. A healthy system with enough capacity to handle its workload should always have at least one worker who is doing nothing.
Now that we’ve established that the formation of queues is generally an ill omen, what can we do to avoid them? Well, we can increase the throughput capacity by decreasing latency and increasing concurrency. These are the core concepts for high-performance web applications.
Let’s start with latency.
The general principle for optimizing latency is, <slide> “Do fewer things.” Let’s look at some examples.
<celery example>
We can use frameworks like Celery to help us build asynchronous applications. Breaking your application down into its subcomponents also helps you build systems that are separately scalable. For example, if the first part of a request is CPU-heavy, but the second part is memory-heavy, you can have them deployed to machines with different resource allocations to get more cost-efficient use of your boxes. Efficiency is also a part of performance.
<caching example>
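Caching is the purest form of “doing fewer things”: skip work you have already done. A minimal sketch using the standard library (the price lookup is a stand-in for any expensive database or API call):

```python
from functools import lru_cache

CALLS = {"count": 0}  # instrumentation, just to show the cache working

@lru_cache(maxsize=1024)
def lookup_price(sku):
    """Stand-in for an expensive database or API call."""
    CALLS["count"] += 1
    return len(sku) * 2  # fake price for illustration

lookup_price("frozen-pizza")
lookup_price("frozen-pizza")  # served from the cache; no second "query"
print(CALLS["count"])  # 1
```

The second request costs almost nothing, which lowers average latency without touching concurrency at all.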
Now, let’s look at concurrency.
The question you want to ask to determine scalability for your function is: <slide> What do I
need to do my job. If the answer is “not a lot”, then your function is highly scalable. If there are
dependencies that you’re required to wait on, then the chain of scripts must be executed in
order, leading to longer latency and lower throughput. If you *can* act on multiple resources
independently of each other and still come up with a correct answer, you should. Let’s look at
some examples.
<thread pool>
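When tasks are independent, a thread pool from the standard library is the simplest way to run them concurrently. A minimal sketch (the `fetch` function is a placeholder for I/O-bound work such as an HTTP call):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(item):
    """Placeholder for I/O-bound work, e.g. an HTTP request."""
    return f"result-{item}"

items = ["a", "b", "c", "d"]

# Four workers handle the four independent tasks concurrently;
# map() returns results in the original order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, items))

print(results)  # ['result-a', 'result-b', 'result-c', 'result-d']
```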
This is another call for breaking down your application into dumb, functional tasks. Scalability is easy when you build a simple app that does one thing and does it well.
<multiprocessing> / <multithreading>
Generally, multithreading is not encouraged for CPU-bound work in Python: the global interpreter lock (GIL) allows only one thread to execute Python bytecode at a time, and a lot of libraries are not thread-safe. Threads still help when the work is I/O-bound, but for CPU-bound work you want separate processes.
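A minimal multiprocessing sketch for CPU-bound work (the `cpu_heavy` function is a made-up stand-in): each worker is a separate process with its own interpreter, so the GIL is not a bottleneck.

```python
from multiprocessing import Pool

def cpu_heavy(n):
    """Stand-in for CPU-bound work: sum of squares below n."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Four processes, each with its own interpreter and its own GIL.
    with Pool(processes=4) as pool:
        results = pool.map(cpu_heavy, [10, 100, 1000])
    print(results)  # [285, 328350, 332833500]
```

The `if __name__ == "__main__"` guard matters: on platforms that spawn rather than fork, each child re-imports the module, and the guard keeps it from recursively creating pools.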
<slide>
If you want to scale your app at a higher level than code-level implementation, there are various
tools for helping with that.
<uwsgi>
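With uWSGI, concurrency is configuration rather than code. A hedged sketch of a `uwsgi.ini` (the module path and port are placeholders; tune the worker counts to your own workload):

```ini
; Hypothetical uwsgi.ini sketch: concurrency is set declaratively.
[uwsgi]
module = myapp.wsgi:application   ; placeholder module path
master = true
processes = 4     ; 4 worker processes
threads = 2       ; 2 threads each -> up to 8 concurrent requests
socket = 127.0.0.1:8000
```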
<docker>
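Docker takes the same idea one level higher: once the app is containerized, adding concurrency means adding identical containers. A hedged `docker-compose.yml` sketch (image and service names are placeholders), scaled with something like `docker compose up --scale web=4`:

```yaml
# Hypothetical docker-compose.yml: each "web" replica is one more worker.
services:
  web:
    image: myapp:latest      # placeholder image name
    expose:
      - "8000"
  proxy:
    image: nginx:alpine      # fronts the web replicas as a load balancer
    ports:
      - "80:80"
```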
<throughput>
So these are the ways of increasing your concurrency. Remember, throughput is equal to
concurrency divided by latency. For those of you here who are not interested in scaling
e-commerce web applications, I hope you still found this presentation helpful. I hope everyone
took away at least some general concepts in queueing theory. With that, let’s all go code
something.
Any questions?