5. You make a product! ...now you have to sell it.
6. To advertise the product, you need an ad...
...so you talk to an ad agency.
7. But placing ads has challenges
Need to find websites with visitors who:
● Would want to buy your product
● Are able to buy your product
● Would be drawn in by the creative you have designed
12. Focal points of ad monitoring
● Number of times ad landed on a page (impressions)
● Where on the page did it land?
● Did it fit the space allotted?
● How long did the page stay up?
● Did the viewer interact with the ad in any way?
● Was the viewer a human?
● How do these numbers compare with the claims of the website?
● How do these numbers compare with the claims of the broker?
13. This creates a lot of data
● Not all impressions phone home (sampling rate varies by contract)
● Sampling events recorded per day (approx): 50 billion
● Sampling events are chained together to tell the story of that impression.
● Impression data is then aggregated by date, ad campaign, browser
● After aggregation, about 500M rows are left per day.
● Each row has > 125 measures of viewability metrics
14. Capturing The Events
● Pixel servers
● Need to be fast so as not to slow down the user experience, or risk losing event data
● Need to get log data off of machine ASAP
● Approximately 500 machines
○ Low CPU workload
○ Low disk I/O workload
○ High network bandwidth
○ Low latency
○ Generously over-provisioned
15. Real-time accumulation and aggregation
● Consumes event logs from pixel servers as fast as possible.
● Each server is effectively a shard of the whole "today" database
● Custom in-memory database updating continuously
● Serving API calls continuously
● Approximately 450 machines
○ CPU load nearly 100%
○ To swap is to die
○ High network bandwidth
○ Low latency
○ Generously over-provisioned
16. What Didn't Work: MySQL
● Original DB choice
● Performed adequately when daily volume was < 1% of current volume
● Impossible to add new columns to tables
● Easier to create a new shard than to modify an existing one.
● New metrics being added every few weeks, or even days
● Dozens of shards, no consistency in their size
17. What Didn't Work: Redshift
● Intended to complement MySQL
● Performed adequately when daily volume was < 1% of current volume
● Needed sub-second responses, but was getting 30s+
● Was the only machine that had a copy of the data across all time
● HDDs were slow; tried SSD instances, but they had limited space
● Eventually grew to a 26-node cluster with 32 cores per node
● Could not distinguish a large query from a small one
● Had no insight into how the data was partitioned
● Reorganizing data according to AWS suggestions would have resulted in vacuums taking several days
18. What Didn't Work: Vertica
● Intended to complement MySQL
● Good response times over larger data volumes
● Needed local disk to perform adequately, which limited disk size
● Each cluster could only hold a few months of data
● 5-node clusters, 32 cores each
● Could only have K-safety of 1, or else load took too long (2 hrs vs 10)
● Nodes failed daily until a glibc bug was fixed
● Expensive
19. What Did Work: Postgres
● Migrated OLTP MySQL DB (which held some DW tables)
● Conversion took 2 weeks with 2 programmers
● Used mysql_fdw to create migration tables (see the sketch after this list)
● Triggers on tables to identify modified rows
● Moved read-only workloads to postgres instance
● Migrated read-write apps in stages
● Only downtime was in final cut-over
● Single 32-core EC2 instance with 1-2 physical read replicas
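A minimal sketch of the mysql_fdw bridge described above; the server, credentials, and table names here are illustrative assumptions, not the actual migration code:

    -- Expose a MySQL table to Postgres so it can be copied with plain SQL
    CREATE EXTENSION mysql_fdw;
    CREATE SERVER legacy_mysql FOREIGN DATA WRAPPER mysql_fdw
        OPTIONS (host 'mysql.internal', port '3306');
    CREATE USER MAPPING FOR CURRENT_USER SERVER legacy_mysql
        OPTIONS (username 'migrator', password 'secret');
    CREATE FOREIGN TABLE mysql_orders (
        id         bigint,
        client_id  bigint,
        updated_at timestamp
    ) SERVER legacy_mysql OPTIONS (dbname 'prod', table_name 'orders');

    -- Bulk copy into a local 'orders' table of matching shape; rows later
    -- flagged by the MySQL-side triggers get re-copied the same way
    INSERT INTO orders SELECT * FROM mysql_orders;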
20. What Did Work: Zipfian workloads
● Customers primarily care about data from today and the last seven days
● About 85% of all API requests were in that date range
● Vanilla PostgreSQL instance, 32 cores, ample RAM, 5TB disk
● Data partitioned by day. Drop any partitions > 10 days old (see the sketch after this list)
● Stores derivative data, so no need for a backup-and-recovery strategy
● Focus on loading the data as quickly as possible each morning.
● Adjust apps to be aware that certain clients' data is available earlier than others
● Codename: L7
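A sketch of the L7 retention scheme, using PostgreSQL 10+ declarative partitioning as a stand-in (the deck itself relied on partitioning modules; table and column names are assumptions):

    -- One partition per day; dropping a partition is instant,
    -- unlike DELETE followed by a long VACUUM
    CREATE TABLE impressions (
        day         date NOT NULL,
        campaign_id bigint,
        measures    numeric[]
    ) PARTITION BY RANGE (day);

    CREATE TABLE impressions_2016_01_01 PARTITION OF impressions
        FOR VALUES FROM ('2016-01-01') TO ('2016-01-02');

    -- Nightly retention job: discard anything older than 10 days
    DROP TABLE impressions_2016_01_01;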
21. What Did Work: Getting cute with RAID
● Engineer discovered a quirk in AWS pricing of disk by size
● Could maximize IOPS by combining 30 small drives into a RAID-0
● Same hardware as an L7 could now store ~40 days of data, but data growth meant that figure would shrink over time
● Same strategy as L7, just adjusted for longer date coverage
● Codename:
○ L-Month? Would sound silly when X fell below 30
○ L-More? Accurate but not catchy.
○ L-mo?
○ Elmo
22. What Did Work: Typeahead search
● "Type-ahead" queries must return in < 100ms
● Such queries can be across arbitrary time range
● Scope of response is limited (screen real estate)
● Engineer discovered that our data compresses really well with TOAST
● Specialized instance to store all data at highest grain level, TOASTed
● Pseudo-materialized views that aggregate data in search-friendly forms
● Use of "dimension" tables as a form of compression on the matviews
● Heavy btree_gin indexing on searchable terms and tokens in dimensions (see the sketch after this list)
● Single 32-core machine, abundant memory, 2 read replicas
● Rebuild from scratch would take days, so a backup-and-recovery strategy was needed
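A hedged sketch of btree_gin indexing on a dimension table; the table layout and token scheme are assumptions based on the bullets above:

    -- btree_gin supplies GIN operator classes for scalar types, letting a
    -- scalar key share one GIN index with an array of search tokens
    CREATE EXTENSION btree_gin;

    CREATE TABLE campaign_dim (
        campaign_id   bigint PRIMARY KEY,
        campaign_name text,
        tokens        text[]   -- searchable terms extracted from the name
    );

    CREATE INDEX campaign_dim_search
        ON campaign_dim USING gin (campaign_id, tokens);

    -- Typeahead lookup: token containment is satisfied from the GIN index
    SELECT campaign_id, campaign_name
    FROM campaign_dim
    WHERE tokens @> ARRAY['shoe']
    LIMIT 10;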
23. What Did Work: TOASTing the Kitchen Sink
● Data usage patterns guaranteed that a client usually wants most of the data across their org for whatever date range is requested
● Putting such data in arrays guarantees TOASTing and compression (see the sketch after this list)
● Compression shifts workload from scarce IOPS to abundant CPU
● Size of array chunks was heavily tuned for the EC2 instance type.
● Same RAID-0 as used in Elmo instance could now hold all customer data
● 5 32-core machines with an ETL load-sharing feature such that each one processes a client/day, then shares it with the other nodes
● Replaced all Redshift and Vertica instances
● Codename: Marjory (the all-seeing, all-knowing trash heap)
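A sketch of the array-packing trick; the ~2 KB TOAST threshold is standard PostgreSQL, but the table shape and chunk scheme are assumptions:

    -- Packing a client-day's measures into arrays pushes each row past the
    -- TOAST threshold, so it gets compressed (and moved out of line when
    -- large); that trades scarce IOPS for abundant CPU
    CREATE TABLE client_day_measures (
        client_id     bigint,
        day           date,
        chunk_no      int,       -- array chunk size tuned per instance type
        metric_values numeric[]
    );

    -- EXTENDED storage (the default for arrays) = compress, then TOAST
    ALTER TABLE client_day_measures
        ALTER COLUMN metric_values SET STORAGE EXTENDED;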
24. What Did Work: Foreign Data Wrappers
● One FDW converted queries into API calls to the in-memory "today" database
● Another one used query quals to determine the set of client-dates that must be fetched
● All client data stored on S3 as both .csv.gz and a compressed SQLite db
● FDW starts a web service, launches one lambda per SQLite file
● Lambda queries its SQLite file, sends results to the web service
● Web service re-issues lambdas as needed, returns results to the FDW (see the sketch after this list)
● Very good for queries across long date ranges
● Codename: Frackles (the name for the background monster Muppets)
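A purely hypothetical view of what querying this custom wrapper might look like; the wrapper name, server options, and table shape are invented for illustration:

    CREATE SERVER frackles_srv FOREIGN DATA WRAPPER frackles_fdw
        OPTIONS (bucket 'client-archives');   -- hypothetical FDW and options

    CREATE FOREIGN TABLE impressions_archive (
        client_id bigint,
        day       date,
        measures  numeric[]
    ) SERVER frackles_srv;

    -- The client_id/day quals tell the FDW which SQLite files on S3 the
    -- lambdas need to open; results stream back through the web service
    SELECT day, count(*)
    FROM impressions_archive
    WHERE client_id = 42 AND day BETWEEN '2015-01-01' AND '2015-12-31'
    GROUP BY day;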
25. What Did Work: PMPP
● Poor Man's Parallel Processing
● Allows an application to issue multiple queries in parallel to multiple servers, provided all the queries have the same shape
● Returns data via a set-returning function, which can then do secondary aggregation, joins, etc.
● Any machine that talks libpq could be queried (PgSQL, Vertica, Redshift)
● Allows for partial aggregation on DW boxes
● Secondary aggregation can occur on the local machine (see the sketch after this list)
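A sketch of PMPP's fan-out/secondary-aggregation pattern; the connection string, row type, and remote table are assumptions, and pmpp.distribute()'s exact signature may differ by extension version:

    CREATE TYPE daily_hits AS (day date, hits bigint);

    -- Fan the same-shaped partial aggregate out in parallel,
    -- then finish the aggregation locally
    SELECT day, sum(hits)
    FROM pmpp.distribute(
            null::daily_hits,
            'host=elmo1 dbname=stats',
            ARRAY['SELECT day, sum(hits) FROM t
                   WHERE day = CURRENT_DATE GROUP BY day',
                  'SELECT day, sum(hits) FROM t
                   WHERE day = CURRENT_DATE - 1 GROUP BY day'])
    GROUP BY day;   -- secondary aggregation on the local machine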
26. What Did Work: Decanters
● A place to let the data "breathe"
● Abundant CPUs, abundant memory per CPU, minimal disk
● Very small lookup tables replicated for performance reasons
● All other local tables are foreign tables pointing at the OLTP database
● Mostly executes aggregation queries that use PMPP to access Statscache, Elmo, Marjory, and Frackles, each one doing a local aggregation
● Final aggregation happens on the decanter
● Can occasionally experience OOM kills (better there than on an important machine)
● New decanter can spin up and enter load balancer in 5 minutes
● No engineering time to be spent rescuing failed decanters
27. Putting it all together with PostgreSQL
[Architecture diagram: tagged ads emit viewable events to the pixel servers; stats aggregators consume the event logs, which are also log-shipped to S3 as CSVs and SQLite files; daily ETLs build the daily summaries that load the Elmo, Marjory, and search clusters.]
28. Putting it all together with PostgreSQL
[Architecture diagram: user stats requests hit the decanters, which issue PMPP requests to the Elmo clusters, Marjory clusters, the live stats aggregators (via a Stats-Cache FDW), and the S3 SQLite files (via the Frackles FDW); searches go to the search clusters; Pg FDWs link the decanters to the OLTP DB and a third-party DW.]
29. Why Not RDS?
● No ability to install custom extensions (esp. partitioning modules)
● No place to do local copy operations
● Reduced insight into the server load
● Reduced ability to tune pg server
● No ability to try beta versions
● Expense
30. Why Not Aurora?
● Had early adopter access
● AWS Devs said that it wasn't geared for DW workloads
● Seems nice on I/O
● Nice not having to worry about which servers are read only
● Wasn't there yet
● Data volumes necessitate advanced partitioning
● Expense
31. Why Not Athena?
● Athena had no concept of constraint exclusion to avoid reading irrelevant files
● Costs $5/TB of data read
● Most queries would cost > $100 each
● Running thousands of queries per hour