News events led to dramatically increased traffic, causing the ACLU’s donation platform to go down under load, impacting revenue and supporter engagement at a critical time for the organization. Performance tuning under normal circumstances is difficult, but even more so while under extreme load and experiencing downtime with millions of dollars being lost by the hour.
The ACLU called on Tag1’s Technical Architecture and Leadership to perform emergency support and rescue work to get the ACLU Action website back online as quickly as possible and to help it withstand even bigger traffic spikes in the future. The results of Tag1’s efforts were a 3,000% increase in donations, from a yearly average of $4M to $120M; $24M in donations on a single weekend; 57% faster database response times; and a 900% increase in throughput (requests per minute). In addition, systems stay online and perform quickly under extreme loads.
ACLU Partners with Tag1 to Raise Most-Ever $120M in Donations at Mission-Critical Moments
1. ACLU.org in 2017
Patrick Jensen (ACLU), Narayan Newton (Tag1 Consulting), & Matthew Cheney (Pantheon)
Handling a Big Year
2. ACLU
● Nonprofit founded 1920 with over 3 million supporters
● Defend individual rights and liberties
● Famous cases
○ Led fight against Japanese-American internment camps
○ 1996 Communications Decency Act
○ Marriage equality
3. ACLU Action Website
● Act
○ Sign petitions
○ Send messages
○ Request legal aid
● Support
○ Donate
○ Sign up to volunteer
● Accomplished via form submissions
● Drupal 6 (now Drupal 7)
4. Before Pantheon
Instability and Uncertainty
2013
● Database Strain
○ Using core Drupal search
● Hardware upgrades took weeks
● Maintenance was onerous
○ test and development environments
○ infrastructure (e.g. varnish)
5. Hosting Websites is Hard Work
● Need to Know Lots of Technology
○ Linux, LXC, NGINX, MariaDB, PHP, Redis, Solr, Git, Varnish, New Relic
● Need to Do Lots of Things
○ Workflow, Branches, Backups, Scalability, Performance, Security
● 24 hours a day, 7 days a week
7. Putting Organizational Mission at Top of Stack
There is already so much to do!
● The World is Already Full of Challenges
● Don’t be “ambitious” about a backup system or your load balancers
● Leverage the Experience of Others
● Be the Pyramidion you want to be in the world!
8. That Is Why Folks Like the ACLU Use Drupal
Stand on the Shoulders of Giants
● Leverage the Expertise of Others
○ Drupal Core
○ Contrib Modules
○ External Libraries
● Benefit from Community of Practice
○ Best Practices, Security Process, Performance, Documentation
9. And Why Folks Use Managed Cloud Services
Free up Time & Resources to Focus
● Drupal is Getting More Complicated & The Web is Getting More Ambitious
● Leverage Pre-Built Feature Sets
○ Redis (Object Caching), Solr (Search Indexing), Dev->Test->Live (Workflow)
● Use Best In Class Security Processes + Performance/Scalability Tooling
10. And Be Prepared. Now and in the Future.
Behold the Power of Containerization!
13. “Be Prepared. You Never Know What Is Going to Happen”
Andrew Lowery
14. Donald Trump Elected
● Donations in the 5 days after election
■ 2012: $25,000
■ 2016: $7,200,000
● Page views Nov. 9 - 13
■ 2015: 400,000
■ 2016: 4,250,000
15. Nov 16, 2016: The wake-up call
Site outage
Form submissions per minute
18. Outage Review
Tag1 Consulting brought in to review outage after Rachel Maddow interview
Specifically:
● Fabian Franz (d.o.: fabianx)
● Narayan Newton (d.o.: nnewton)
● Jeremy Andrews (d.o.: Jeremy)
Overall issue was clear and was somewhat ongoing.
Immediately transitioned into developing and deploying fixes.
19. Example Query Fix
+----+-------------+-------+--------+----------------------+---------+---------+---------------------+--------+----------------------------------------------+
| id | select_type | table | type   | possible_keys        | key     | key_len | ref                 | rows   | Extra                                        |
+----+-------------+-------+--------+----------------------+---------+---------+---------------------+--------+----------------------------------------------+
| 1  | SIMPLE      | fo    | ALL    | NULL                 | NULL    | NULL    | NULL                | 282880 | Using where; Using temporary; Using filesort |
| 1  | SIMPLE      | o     | eq_ref | PRIMARY,order_status | PRIMARY | 4       | aclu.fo.oid         | 1      | Using where                                  |
| 1  | SIMPLE      | os    | eq_ref | PRIMARY              | PRIMARY | 98      | aclu.o.order_status | 1      |                                              |
+----+-------------+-------+--------+----------------------+---------+---------+---------------------+--------+----------------------------------------------+
SELECT o.order_id, o.uid, o.billing_first_name, o.billing_last_name,
       o.order_total, o.order_status, o.created, os.title
FROM uc_orders o
INNER JOIN fundraiser_og fo ON fo.oid = o.order_id AND fo.gid IN (8888,9999)
LEFT JOIN uc_order_statuses os ON o.order_status = os.order_status_id
WHERE o.order_status IN ('refunded', 'pending', 'processing', 'payment_received', 'completed')
ORDER BY o.order_id DESC
LIMIT 0, 30;
20. Index Solution
+----+-------------+-------+--------+----------------------+--------------+---------+---------------------+------+---------------------------------------+
| id | select_type | table | type   | possible_keys        | key          | key_len | ref                 | rows | Extra                                 |
+----+-------------+-------+--------+----------------------+--------------+---------+---------------------+------+---------------------------------------+
| 1  | SIMPLE      | o     | range  | PRIMARY,order_status | order_status | 98      | NULL                | 76   | Using index condition; Using filesort |
| 1  | SIMPLE      | os    | eq_ref | PRIMARY              | PRIMARY      | 98      | aclu.o.order_status | 1    |                                       |
| 1  | SIMPLE      | fo    | ref    | test                 | test         | 4       | aclu.o.order_id     | 1    | Using where; Using index              |
+----+-------------+-------+--------+----------------------+--------------+---------+---------------------+------+---------------------------------------+
+ db_add_primary_key($ret, 'fundraiser_og', array('oid', 'gid', 'nid'));
+ db_add_index($ret, 'fundraiser_og', 'idx_gid', array('gid'));
+ db_add_index($ret, 'fundraiser_og', 'idx_nid', array('nid'));
ALTER TABLE fundraiser_og ADD INDEX test (oid,gid,nid);
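The effect of the composite index, and the reason separate indexes on gid and nid still matter, can be reproduced on a toy table. This is a minimal sketch that uses Python's built-in sqlite3 instead of the MariaDB database from the talk. The row data is made up, and SQLite's planner output differs from MariaDB's EXPLAIN, so it only illustrates the leftmost-prefix behaviour, not the actual ACLU query plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for fundraiser_og: same column names, invented data, no indexes.
conn.execute("CREATE TABLE fundraiser_og (oid INTEGER, gid INTEGER, nid INTEGER)")
conn.executemany(
    "INSERT INTO fundraiser_og VALUES (?, ?, ?)",
    [(i, i % 50, i % 7) for i in range(10_000)],
)


def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


by_oid = "SELECT gid FROM fundraiser_og WHERE oid = 42"
by_gid = "SELECT oid FROM fundraiser_og WHERE gid = 8"

# With no index at all, the lookup has to scan the table.
plan_before = plan(by_oid)

# Composite index as on the slide: a lookup by oid can now seek into it.
conn.execute("CREATE INDEX test ON fundraiser_og (oid, gid, nid)")
plan_oid = plan(by_oid)

# gid is not the leftmost column, so a gid-only filter still has to scan.
plan_gid_composite_only = plan(by_gid)

# A dedicated index on gid turns that scan into a seek.
conn.execute("CREATE INDEX idx_gid ON fundraiser_og (gid)")
plan_gid = plan(by_gid)

for label, text in [
    ("oid, no index", plan_before),
    ("oid, composite index", plan_oid),
    ("gid, composite index only", plan_gid_composite_only),
    ("gid, with idx_gid", plan_gid),
]:
    print(f"{label}: {text}")
```

The exact wording of each plan line depends on the SQLite version; the point is the switch from a scan to an index search once a usable index exists.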
23. It Works on My Local (cluster)
Performance Testing For Complex Sites
● Performance Testing is Complicated
○ Varnish/CDN
○ Redis/APC
○ PHP, MariaDB
● Production Parity Testing!
● But Replicating a Cluster is Hard Work
○ Nobody has time for that!
24. Let the Robots Do the Work!
They already do so much. What’s a little more SysAdmin?
33. Payment Gateway Toolkit
● curl_log
○ Adding verbose logging to the curl requests
○ Logging to a table in the DB
○ In-flight sanitization of user information
● curl_loadbalance
○ Decaying ticket-based curl endpoints load balancer
○ Removes failing endpoints for a window of time after X failures
○ Specifically designed to always have at least one endpoint
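The three curl_loadbalance properties listed above (decaying tickets, temporary removal after X failures, never running out of endpoints) can be sketched as follows. This is an illustration in Python, not the module's code: the class name, the halving decay, the default thresholds, and the example URLs are all assumptions made for the sketch.

```python
import random
import time


class EndpointBalancer:
    """Hypothetical ticket-based balancer; not the actual curl_loadbalance logic."""

    def __init__(self, endpoints, max_failures=3, cooldown=60.0, tickets=10,
                 clock=time.monotonic, rng=None):
        self.max_failures = max_failures  # the "X failures" threshold
        self.cooldown = cooldown          # seconds an endpoint stays benched
        self.full_tickets = tickets
        self.clock = clock
        self.rng = rng or random.Random()
        self.state = {
            url: {"tickets": tickets, "failures": 0, "benched_until": 0.0}
            for url in endpoints
        }

    def pick(self):
        now = self.clock()
        candidates = [u for u, s in self.state.items() if s["benched_until"] <= now]
        if not candidates:
            # Always return something: use the endpoint whose bench expires soonest.
            return min(self.state, key=lambda u: self.state[u]["benched_until"])
        weights = [self.state[u]["tickets"] for u in candidates]
        return self.rng.choices(candidates, weights=weights, k=1)[0]

    def report_success(self, url):
        s = self.state[url]
        s["failures"] = 0
        s["tickets"] = min(self.full_tickets, s["tickets"] + 1)  # slow recovery

    def report_failure(self, url):
        s = self.state[url]
        s["failures"] += 1
        s["tickets"] = max(1, s["tickets"] // 2)  # decay: halve the weight
        if s["failures"] >= self.max_failures:
            s["benched_until"] = self.clock() + self.cooldown
            s["failures"] = 0


# Example with placeholder gateway URLs.
balancer = EndpointBalancer(["https://gw1.example", "https://gw2.example"])
chosen = balancer.pick()
balancer.report_failure(chosen)
print(chosen, balancer.state[chosen])
```

Halving tickets on failure shifts traffic away from a struggling endpoint gradually, before the hard cutoff benches it entirely.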
34. Performance Next Steps
● query_cache
○ Caching “shim” for adding db_query caching to contrib modules without patching them
○ Ability to map queries to a single base query
○ Moves read-only traffic from the DB to the object cache
● rate_limit
○ An in-Drupal solution to rate limiting specific types of requests
○ Webform protection
○ Search protection
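The query_cache idea on this slide, a shim that sits in front of the query function so callers need no changes, might look roughly like this. It is a Python sketch under stated assumptions, not Fabian's module: the class name, the TTL-only expiry, the in-memory dict standing in for the object cache, and the whitespace normalization used as the "base query" are all hypothetical.

```python
import re
import sqlite3
import time


class QueryCache:
    """Hypothetical read-through cache wrapped around a query function."""

    def __init__(self, run_query, ttl=30.0, clock=time.monotonic):
        self.run_query = run_query
        self.ttl = ttl
        self.clock = clock
        self.store = {}  # stand-in for an object cache such as Redis
        self.hits = 0
        self.misses = 0

    @staticmethod
    def base_query(sql):
        # Map formatting variants of a query to one key.
        # Assumes values are passed as parameters, not inlined as literals.
        return re.sub(r"\s+", " ", sql).strip().rstrip(";").strip()

    def query(self, sql, args=()):
        base = self.base_query(sql)
        if not base.upper().startswith("SELECT"):
            return self.run_query(sql, args)  # writes go straight to the DB
        key = (base, tuple(args))
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        rows = self.run_query(sql, args)
        self.store[key] = (now + self.ttl, rows)
        return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uc_order_statuses (order_status_id TEXT, title TEXT)")
conn.execute("INSERT INTO uc_order_statuses VALUES ('completed', 'Completed')")


def run_query(sql, args=()):
    return conn.execute(sql, args).fetchall()


cache = QueryCache(run_query)
first = cache.query(
    "SELECT title FROM uc_order_statuses WHERE order_status_id = ?", ("completed",)
)
second = cache.query(
    "SELECT title\n  FROM uc_order_statuses WHERE order_status_id = ?;", ("completed",)
)
print(first, second, cache.hits, cache.misses)
```

In this sketch entries only expire by TTL; writes do not invalidate them, which is acceptable only for data that tolerates being slightly stale.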
41. Join us for contribution sprints
Friday, April 13, 2018
● Mentored Core sprint: 9:00-12:00, Room: Stolz 2
● First time sprinter workshop: 9:00-12:00, Room: Stolz 2
● General sprint: 9:00-12:00, Room: Stolz 2
#drupalsprint
42. What did you think?
Locate this session at the DrupalCon Nashville website:
http://nashville2018.drupal.org/schedule
Take the Survey!
https://www.surveymonkey.com/r/DrupalConNashville
Editor's notes
Narayan Newton, Lead systems engineer at Tag1 Consulting
Matthew Cheney, Chaos Wizard at Pantheon
A non-profit founded almost 100 years ago and we have over 3 million supporters
Our mission is to defend and preserve the individual rights and liberties guaranteed by the Constitution and laws of the United States.
To put it succinctly, we consider ourselves to be the first responders for the Constitution
We take on issues like:
Voting rights
Reproductive Freedom
the intersection of privacy and technology
For example,
Led fight against Japanese-American internment camps during WWII
took on and defeated 1996 Communications Decency Act, which censored the Internet by banning "indecent" speech
Marriage equality - We brought the first lawsuit in the country seeking the freedom to marry for same-sex couples in 1970.
We appear before the Supreme Court more than any other organization except the Department of Justice.
We maintain about 40 Drupal websites at the ACLU
but today we’re going to talk about just one really important website: action.aclu.org
Take action online
Sign an online petition
Send a letter to an elected official
Request legal aid from the ACLU
Where our members can go to support us
Fundraising
Sign up to volunteer
action.aclu.org is currently on Drupal 7, but for the time period we’re talking about it was on Drupal 6
Critical for our organization that our websites are available and performant.
Before Pantheon, around 2013, we were on dedicated hosting
There was an initiative at the ACLU to build our online presence
But we found our infrastructure wasn’t quite up to the task of handling the increased traffic
site slowness
some site outages
Problems with our infra
Database strain (using core drupal search bc Solr wasn’t set up)
hardware upgrades took weeks and weeks
maintaining test and development environments and varnish involved a lot of developer time
ACLU CTO Marco Carbone, who is an old-school Drupal dev, heard about Pantheon by attending a DrupalCon event
We did our research and decided they’d be a great host for us. Matt’s going to tell you why.
-- This may not be surprising, but hosting websites is hard work.
-- Not as hard as resisting executive overreach through constitutional law, of course.
-- Need to Use Lots of Technologies and Do Lots of Things 365 days a year
-- Plus you need to keep it all up to date and adopt NEW stuff when it comes
-- What does knowing Git have to do with civil rights?
-- It's about as necessary as this guinea pig wearing sunglasses.
-- I mean it's great to know how Git works, but it's not necessary
-- The world is full of challenges, why add to things you need to do!
-- Things move quickly. Organizations need to be able to respond.
-- Time/Resources need to be focused on organizational goals.
-- Even more true with “Ambitious Digital Experiences”.
-- Be the Pyramidion you want to be in the world
-- Leverage the expertise of others through reusable modules/libraries
-- Benefit from a community of practice around web development
-- Leveraging the expertise of others is why people use CLOUD
-- Drupal is getting more complicated. Web is getting more ambitious.
-- Features You Need Require Specialized Knowledge to Make. Even More to Maintain.
-- Security is an Ongoing Challenge Requiring Lots of Knowledgeable People
-- Performance/Scalability Takes a Village
-- Horizontally Scaling PHP is Hard Work
-- Hosting Platforms That Have This Tech Work Really Hard To Make it Awesome
-- It Won't Solve All Your Performance Problems
-- But It Will Provide You a SOLID Starting Point
-- Be prepared, you never know what is going to happen
-- It's Not About Having All The Answers, It’s About Having the Right Tools
Pat:
After switching to Pantheon, our site was quite stable… until Nov. 8th 2016
We received $7.2 million in the 5 days after the 2016 election.
Compare that to the $25k in donations we received in the 5 days after the 2012 election
In the 5 days after the election our websites saw over 4 million page views
Compare that to 400,000 page views the year before
Our web traffic increased to more than 10x what we were used to seeing in the days after the election, essentially overnight
This was a great outpouring of support for our organization
but we started seeing small performance issues
Those small performance issues turned into a really big performance issue on Nov 16, 2016
The ACLU’s executive director appeared on the Rachel Maddow show.
Rachel Maddow Appearance Nov. 16 2016
500 peak form submissions per minute
~15 minutes site outage
Only able to sustain ~300 submissions per minute
This graph shows the spike in HTTP 500 errors our site was returning during the Maddow appearance
Huge missed opportunity for us.
Supporters were trying to donate to us, send letters to their elected officials via our site and sign up for our email lists, but they were being met by errors
Luckily ACLU mgt realized this wouldn’t be a one-time spike
They realized the Trump era meant that we’d be seeing spikes like this on the regular for the next 4 - 8 years
But we didn’t have 4 - 8 years to fix these performance issues
The next spike could come at ANY time
so we called in Tag1 to do emergency weekend work
Tag1 Brought in to look at outage period
Issue was clearly that we were DB-bound; brought in 3 engineers, including myself, to review New Relic traces
Developed indexes, fixed queries, worked in concert with the ACLU team to deploy fixes.
An example of what type of thing we were doing.
This is a fairly typical Ubercart-esque query, with the addition of an og table.
An interesting quirk of this additional table is that it lacks all indexes. This is more common than you might think.
Looked at the table to find the dataset's natural key and pushed a primary key and some additional keys for filtering and joining.
Note that we have a key on (oid, gid, nid), but I also added separate indexes on gid and nid. Why? Because of column order: gid and nid are not the leading column of the primary key, so a query that filters on gid or nid alone cannot seek into that index.
As you can see, the first step of the plan went from 282,880 rows to 76.
And here is the result of just that change. The green query marked fundraiser_og is this query, and you can see it basically dropping out of the graph.
Put together our fixes as a patchset, tested against multidev
at this point wanted to ensure that the ACLU site would survive larger traffic spikes and find other issues
Turned to Pantheon to set up a production-like environment to enable testing at that capacity
-- performance testing is complicated. just ask narayan.
-- important to test in as “close a production parity” as possible
-- but setting all this stuff up is hard!
-- robots will drive our cars. raise our children on ipads. tell us what to believe politically
-- is it really too much to ask that they can create production parity development environments on demand?
-- on demand environments are the answer
-- at pantheon we call this “Multidev”, but it's basically ONE ENVIRONMENT PER GIT BRANCH
---- integrated with new relic, production parity
-- made possible by Containers and Robots
-- Allowed Tag1 and ACLU to quickly iterate and test features
The emergency improvements Tag1 put in over that weekend in late Nov 2016 were very effective.
Made it through:
Giving Tuesday 2016
end of year fundraising pushes
received 15x more donations in our end of year fundraising than previous year (20,548 gifts)
But we weren’t out of the woods yet
Jan 27, 2017: Executive Order 13769 issued (AKA the Muslim travel ban)
Barred people from 7 Muslim-majority countries from entering the US
Thousands protested the executive order at airports across the country
The ACLU fulfilled our reputation as first responders for the Constitution
Within hours, the ACLU—and partnering organizations nationwide—obtained the first injunction to block the order
When news broke of what the ACLU had accomplished
People rushed to our websites
That top line on the graph there shows page views in the days before, during and after the executive order
The line at the bottom of the screen shows the same dates from the previous year
The big spike is at almost 4 million hits; the same day the previous year is at 44,000
85x traffic spike… almost 2 orders of magnitude
Donations over the weekend after the executive order were six times the organization’s yearly average
So how did our websites hold up during this crazy post-executive order weekend?
Rachel Maddow Appearance
Able to sustain 300 submissions per minute
~15 minutes site outage
Executive Order
900 peak form submissions per minute
Sustained 500 submissions per minute for ~8 hours
We did have a 10 minute ‘site outage’
We did 2 smart things to mitigate this outage
New Relic alerts when traffic got high or response times increased
Static CDN-hosted donation page
After the dust settled, we took some time and confirmed what we previously suspected
slow responses from one of our payment gateways were the root cause of the site outage
we still had some issues with database performance to address
Once again, we handed the reins over to Tag1
So at this point we know things are better, but that we are still having issues at very high load. We are past the easy fixes you can detect at low load situations and need actual traffic
I built a botnet to generate that traffic
initial results of the patchset
started seeing issues with DB and external request/curl requests to the payment gateway
We took a two pronged approach to fix these issues
First, we started looking at the external requests. It was very unclear what was actually happening with our payment gateways
Developed curl_log to log the actual responses from the gateway, but also to sanitize
Finally found that there was an issue with CDN
curl_loadbalance was developed
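The in-flight sanitization that curl_log performs before writing gateway traffic to the log could work along these lines. This is a Python sketch only: the field names in SENSITIVE_KEYS, the card-number pattern, and the record layout are assumptions chosen for illustration, not the module's actual rules.

```python
import json
import re

# Hypothetical list of fields that must never reach the log.
SENSITIVE_KEYS = {"card_number", "cvv", "email",
                  "billing_first_name", "billing_last_name"}
# Anything that looks like a 13 to 19 digit card number inside free text.
CARD_PATTERN = re.compile(r"\b\d{13,19}\b")


def sanitize(value, key=None):
    """Recursively redact sensitive fields and card-like numbers."""
    if key in SENSITIVE_KEYS:
        return "[redacted]"
    if isinstance(value, dict):
        return {k: sanitize(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(v, key) for v in value]
    if isinstance(value, str):
        return CARD_PATTERN.sub("[redacted-number]", value)
    return value


def log_record(url, status, payload):
    """Build the JSON string that would be stored in the log table."""
    return json.dumps(
        {"url": url, "status": status, "payload": sanitize(payload)},
        sort_keys=True,
    )


record = log_record(
    "https://gateway.example/charge",  # placeholder URL
    502,
    {
        "card_number": "4111111111111111",
        "amount": "25.00",
        "note": "card 4111111111111111 declined",
    },
)
print(record)
```

Redacting before the record is built means no unsanitized copy is ever written, which matters when the log lives in the same database as everything else.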
Turned towards DB issues, which were more about overall load than specific bad queries
Legacy deployment, don’t want to patch every module. Fabian developed query_cache
We also developed rate_limit, which is sort of a performance tool and sort of a security tool. It allows us to rate_limit specific actions in Drupal itself.
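Rate limiting specific actions inside the application, as rate_limit does, can be illustrated with a sliding-window counter per action and client. This Python sketch is not the module's implementation: the class, the action names, and the limits are hypothetical, and a real deployment would keep the counters in a shared store rather than in process memory.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Hypothetical sliding-window limiter keyed by (action, client)."""

    def __init__(self, limits, clock=time.monotonic):
        # limits maps an action name to (max_requests, window_seconds).
        self.limits = limits
        self.clock = clock
        self.events = defaultdict(deque)

    def allow(self, action, client):
        if action not in self.limits:
            return True  # actions without a configured limit are not throttled
        max_requests, window = self.limits[action]
        now = self.clock()
        recent = self.events[(action, client)]
        # Drop timestamps that have fallen out of the window.
        while recent and recent[0] <= now - window:
            recent.popleft()
        if len(recent) >= max_requests:
            return False
        recent.append(now)
        return True


# Example: at most 3 webform submissions per client per minute.
limiter = RateLimiter({"webform_submit": (3, 60.0), "search": (10, 60.0)})
results = [limiter.allow("webform_submit", "203.0.113.7") for _ in range(4)]
print(results)
```

Limiting per action rather than per request lets expensive paths such as webform submissions and search be throttled without affecting ordinary page views.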
How well did this second and final round of changes serve us?
We got a chance to find out when the Trump administration’s FCC repealed 'Net Neutrality' rules for internet providers in mid-December 2017
The internet reacted with outrage and once again the action.aclu.org website was a conduit for that outrage
This time, we nailed it.
Nov 2016 Before changes
Rachel Maddow Appearance
Maxed out at 300 form submissions per minute
~15 minute outage
January 2017 After first round of changes
Executive Order
900 peak form submissions per minute
~10 minute mitigated outage
After the second round of changes
we were able to hit a peak of 1,900 form submissions per minute
and easily sustained 500 submissions per minute for 10 hours (probably indefinitely)
This was a big victory for us… 100s of thousands
In his Nov 2016 appearance on the Maddow show our exec dir said about Pres. Trump’s election
While the rest of the organization was ready, the website wasn’t quite prepared.
But after a year of:
work by the developers and management at the ACLU
leveraging Tag1’s expertise
and having Pantheon’s infrastructure at our backs
We’re now confident that our websites are really ready to be used in their full capacity to defend civil liberties in the US