[@IndeedEng] Managing Experiments and Behavior Dynamically with Proctor

Proctor
Managing A/B Tests and More

Tom Bergman
Product Manager
Aggregation

Matt Schemmel
Software Engineer
Resume

What's best for the
job seeker?

A/B Testing: Definition

A/B testing is an experimental
methodology comparing at least
two variants, a control group A
and test group B, in a controlled
experiment

A/B Testing Key Points
Test and Control Groups should be:
1. Unbiased
2. Independent
3. Representative

103 tests
315 variations
2^147 combinations

Control

10% test

10% test

10% test

10% test

10% test

10% test

Control

+2.9%

+2.0%

+2.3%

+12.8%

+5.2%

+9.6%

+614M emails

Before and After
Before and After is bad science.

Visitors

Weekly Traffic

Mon

Tues

Wed

Thur

Fri

Mid Year Test

B
Visitors

A

A<B

History of A/B Testing
@Indeed

Next we tried ...
● Multiple Code Versions
● Separate Configuration
● "Sampling by Load Balancer"

Load Balancer: Multiple Versions

Load Balancer

CONTROL

(Old Version Code)

TEST

(New Version Code)

Load Balancer: Multiple Versions
It worked, but ...
1. Tedious
2. Expensive
3. Inflexible

Finally ...
Built libraries and hand-wrote code per test to:
1. Arbitrarily Group Users
2. Select Test Groups
3. Implement Variations

Custom Coded Tests
Allowed us:
1. Sophisticated Tests
2. Scientifically Valid Methods
3. Low Operational Overhead

Goals:

1. Increase Engineering Velocity
2. Standardize Representation
3. Work Seamlessly Across Products

Proctor
Indeed's Open Source Java Framework for

github.com/indeedeng/proctor

Using Proctor

1. Background and Design
2. Running A/B Tests with Proctor
3. Beyond the Basics

Running a Test

1. Define the Experiment
2. Select Groups
3. Implement the Behavior
4. Log the Results

Define the Experiment
Key characteristics:
1. Buckets
2. Sample Sizes

50%
Control: Gray

50%
Test: Blue

Division of Responsibilities

Apply the Experiment

(global)

(each product)
Proctor Library

Test
Definition

Test
Specification

Buckets Enumerate the Test Variations
● ID, for code
● Long Description, for people
● Short Name, for people

0
"Control Group"
Gray

1
"Test Group"
Blue

Sizing the Buckets

1. Buckets
2. Sample Sizes

Selecting a Test Bucket
Good science requires good sampling:
● Independent
● Unbiased
Good user experience does, too:
● Fast
● Consistent

Round robin assignment
Assign each subsequent visitor to the next
bucket.
● Requires global state for "next bucket"
● Requires state for assigned buckets

✘Fast
~ Consistent
✓Independent ✓Unbiased

Randomized Assignment
At small scale, you might need round-robin
to ensure equal sample sizes.

At large scale, randomized assignment is
uniform enough.

? Fast
? Consistent

Roll the dice as needed
Select a bucket at random at the point of
execution.

✓Fast
✘Consistent

Roll Once and Cache in a Cookie
● Single-domain, Single-device
● N cookies: Hard to evolve
● One cookie: Fragile to edit
● Size scales with # experiments

~ Fast
~ Consistent

Roll Once and Cache in Session
● Consistent only to length of session
● Tied to one server / data-center
● Many apps don’t use sessions

~ Fast
✘Consistent

Roll Dice and Cache in DB
● DB hit on every request
● More infrastructure

✘Fast
~ Consistent

We can do better
Flaws stem from the need to record selected
buckets.

What if we didn't?

Don’t Record. Recalculate.
1. Assign each user a unique ID
2. Map that ID to a bucket
3. Store the ID, not the assignments

? Fast
? Consistent
? Independent ? Unbiased

Simple Mapping: Mod N

id mod N => bucket
Doesn’t work:
● A mapping should produce a uniform distribution;
mod N only assumes the IDs already have one.
● Limited bucket distributions

Range Mapping
id / MAX_ID => bucket

control
0

test
0.5

1

Buckets can be any size
control

test

0

control
0

0.5

1

(inactive) 1

test
0.5

1

Sequential IDs No Longer Uniform
MAX_ID
2

control
0

test
0.5

✘Unbiased

1

Hashed Range Mapping

hash ( id ) => bucket
control
MIN_INT

test
0

MAX_INT

Kept:
● Arbitrary bucket allocations ok
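
A minimal sketch of hashed range mapping, for illustration only. The class name, the choice of SHA-256, and the use of the first four digest bytes are assumptions made for this example, not Proctor's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: maps an ID to a bucket by hashing it onto [0, 1). */
final class HashedRangeMapping {

    /** Hashes the id and scales the first 4 digest bytes onto [0, 1). */
    static double position(String id) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
            long unsigned = ((digest[0] & 0xFFL) << 24) | ((digest[1] & 0xFFL) << 16)
                    | ((digest[2] & 0xFFL) << 8) | (digest[3] & 0xFFL);
            return unsigned / 4294967296.0; // divide by 2^32 so the result is in [0, 1)
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Walks the cumulative range lengths; returns the index of the range holding the id. */
    static int bucketFor(String id, double[] lengths) {
        double p = position(id);
        double upper = 0.0;
        for (int i = 0; i < lengths.length; i++) {
            upper += lengths[i];
            if (p < upper) {
                return i;
            }
        }
        return lengths.length - 1; // guards against rounding when lengths sum to ~1.0
    }
}
```

Because the hash spreads any ID scheme across the range, the range lengths can be arbitrary (for example `{0.1, 0.1, 0.8}`) and sequential IDs still split close to the requested proportions.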

Unbiased Distribution for Any ID
50 / 50:

33 / 33 / 33:

✓Unbiased

But is it independent?

Sign Up

vs

Activate

Sign Up

vs

Sign Up

Should look like this
25%

Sign Up
25%

Sign Up

25%

Activate
25%

Activate

But our inputs are consistent

hash ( id ) => bucket

control
MIN_INT

test
0

MAX_INT

Text

Color

So our buckets are identical

S A S S A S A A A S A S A

S A S S A S A A A S A S A

And we look like this
50%

0%

Sign Up
0%

Activate
50%

Sign Up

Activate

✘Independent

Add Salt to Test

hash ( id + test.salt ) => bucket

Kept:
● Arbitrary bucket allocations
● Uniform distribution
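
A rough sketch of why the salt matters, under stated assumptions: the class, the SHA-256 hash, the 50/50 split on the top bit, and the salt strings are all hypothetical. It measures how often two tests place the same user in the same bucket number.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: compares bucket assignments of two tests for the same IDs. */
final class SaltedBuckets {

    /** Returns 0 (control) or 1 (test) for a 50/50 split of hash(id + salt). */
    static int bucket(String id, String salt) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest((id + salt).getBytes(StandardCharsets.UTF_8));
            return (d[0] & 0x80) == 0 ? 0 : 1; // top bit splits the hashed range in half
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Fraction of ids that land in the same bucket number in both tests. */
    static double agreement(String saltA, String saltB, int users) {
        int same = 0;
        for (int i = 0; i < users; i++) {
            String id = "user-" + i;
            if (bucket(id, saltA) == bucket(id, saltB)) {
                same++;
            }
        }
        return (double) same / users;
    }
}
```

With identical (or empty) salts the agreement is exactly 1.0, which is the correlated case. With distinct salts it should hover around 0.5, which is what independent 50/50 tests look like.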

Text

Color

Uncorrelated Distribution

A S A S A S S A A A S S A

A S A S A S S A A A S S A

✓Independent

But is it fast?
0.90
0.85
0.80
0.75
0.70
0.65
0.60

Resume Editor

Resume Search

But is it consistent?

Consistency bounded only by ID

We Usually Use Tracking Cookies
● Easy
● Ubiquitous on the web
● Require no server-side storage
● Best we can do with no user action

~ Consistent

Best we’ve seen so far…

✓Fast
~ Consistent

Definitions Map Buckets to ID Range
Each bucket maps to a % of the hashed range

Bucket

Range

gray

0.50

blue

0.50

Sometimes, Though, Cookies Won't Do
● Some People Block Cookies
● Cross-Domain
● Cross-Device
● Cookies are Web-Only

Many Ways to ID a User
Session ID
557206C363F…

Email Address
me@indeed.com

Tracking cookie:
UID#1

Access Token:
4/rymOMYE…

Account #
12345

Proctor Uses Any Set of IDs
We use…
ID Type...

Tracked By...

USER

Tracking Cookie

ACCOUNT

Account ID

EMAIL

Email Address

…

…

Account ID
● Authenticated
● Consistent across domains
● Consistent across devices
● Consistent across visits

Email Address
● Sometimes available without account
● Identified, though not authenticated

Each Test Applies to One ID Type
● Test groups split by that identifier
● Visitors without that identifier are ignored

Test Definitions Encoded in JSON
● Compact
● Simple and Flexible
● Editable by Humans
● Editable by Machines

Basic Data in the Test Definition

"description": "Button colors",
"salt": "buttonBgColorTst",
"type": "USER"

Buckets in the Test Definition
"buckets": [{
"id": 0,
"name": "gray",
"description": "Control group"
}, {
"id": 1,
"name": "blue",
"description": "Test group"
}]

Mapping Buckets to Ranges
"ranges": [{
"bucketValue": 0,
"length": 0.5
}, {
"bucketValue": 1,
"length": 0.5
}]

Complete Test Definition
{
"type": "USER",
"buckets": […],
"allocations": [{
"ranges": […]
}]
}

Division of Responsibilities


proctor data

(each product)
Proctor Library

Test
Definition

Test
Specification

Proctor includes several modules
Common
Maven Builder
Proctor
Ant Builder
Codegen

Product Test Specification lists active tests
References into the global pool:
"tests": [{
"buttonBgColorTst": {
"buckets": {
"gray": 0,
"blue": 1
}
}
}]

On every request…
1. Select Groups
2. Render the Response
3. Log the Action

Determining Buckets in Code
On every request…
1. Collect identifiers
2. Select buckets for opted-in tests

Collect identifiers for all ID Types
// Product code
String cookie = getTrackingCookie(request);
String accountId = getAccountIdOrNull(request);

// Proctor preparation
Identifiers identifiers = Identifiers.of(
TestType.USER, cookie,
TestType.ACCOUNT, accountId
);

Select Buckets for Opted-In Tests
// Proctor preparation
Identifiers identifiers = Identifiers.of(
TestType.USER, trackingCookie,
TestType.ACCOUNT, accountId
);

// Proctor assignments
ProctorResult assignments =
proctor.determineBuckets(identifiers);

Choose behavior for selected bucket
int bgColorBucket;
/* … */
// Choose a background color for templates
if (bgColorBucket == 1) {
// Test
model.put("buttonBgColor", "#00f");
} else {
// Control group
model.put("buttonBgColor", "#ccc");
}

ProctorResult exposes buckets… verbosely
// Proctor assignments
ProctorResult assignments =
proctor.determineBuckets(identifiers);
// Get selected bucket for this user
int bgColorBucket = assignments
// Map<String, TestBucket>: All tests
.getBuckets()
// TestBucket: This assignment
.get("buttonBgColorTst")
// int: Enumerated ID
.getValue();

"Redundant" names in test spec…
"buttonBgColorTest": {
"buckets": {
"gray": 0,
"blue": 1
}
}

… are used to generate helper methods
ResumeSearchGroups groups =
new ResumeSearchGroups(assignments);
// Enumerated value by test name
groups.getButtonBgColorTstValue();
// Boolean accessors for each test & bucket
groups.isButtonBgColorTstGray();
groups.isButtonBgColorTstBlue();
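
For a sense of what such a helper might boil down to, here is a hand-written sketch. It is not the output of Proctor's code generator: the class name, the `Map<String, Integer>` input, and the `-1` "no assignment" value are assumptions for illustration.

```java
import java.util.Map;

/** Illustrative only: a hand-rolled stand-in for a generated groups helper. */
final class ButtonColorGroups {
    private final int bucketValue;

    ButtonColorGroups(Map<String, Integer> assignments) {
        // -1 stands for "no assignment" in this sketch,
        // e.g. when the visitor lacks the identifier the test needs.
        this.bucketValue = assignments.getOrDefault("buttonBgColorTst", -1);
    }

    /** Enumerated bucket value by test name. */
    int getButtonBgColorTstValue() {
        return bucketValue;
    }

    /** Boolean accessors, one per bucket name in the test specification. */
    boolean isButtonBgColorTstGray() {
        return bucketValue == 0;
    }

    boolean isButtonBgColorTstBlue() {
        return bucketValue == 1;
    }
}
```

The bucket names in the specification are what make the boolean accessors possible; the numeric IDs alone would only support the value getter.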

Helper designed for use in UI layer
This immutable bean is trivial to:
● Read from JSP/JSF
● Read from Templates
○ Freemarker, Velocity, Closure, etc
● Serialize as JSON

Logging Bucket Assignments
Proctor just selects the buckets.
When and how you log are up to you:
● On related events only
● On every event

Test Definitions in Source Control
● No new infrastructure
● Lots of desirable features for free
History
Diff
Access Control

Proctor Data
Test
Definitions

Publish

Artifact

Periodic
Refresh

App
App Servers

Publication is also via Source Control
Individual test changes pushed to a named
branch:
/trunk

/branches/production

Overwriting Tests on a Named Branch
Not required to use proctor, but beneficial:
● Same features for free
History, Diff, ACL
● No merging
● Easy roll-back, roll-forward

Proctor Data

Project

Test
Definitions

Test
Specifications
Compile

Publish

Deliverable

Build Servers

Deploy

Artifact

Periodic
Refresh

App
App Servers

Segmentation
Tests often apply to only certain users:
● Specific markets
● Specific languages
● Specific devices

Segmentation through Test Rules
● Test definition allows one optional rule
● A rule is simply a boolean expression
● If the rule passes, the user is assigned to a test
bucket

Rules are written in Unified EL

Simple Things are Simple
● No deployment needed
● Changes live within minutes
{
"rule": "country == 'CA'",
"buckets": […]
}

Primitive and rich data types

"userAgent.phone || userAgent.tablet"
"userAgent.supports.html5"
"userAgent.supports.geolocation"
"userAgent.supports.fileUpload"

Commons EL is Easily Extended
JSTL Standard Functions
"rule":
"fn:endsWith(
account.email, '@indeed.com')"

Custom code
"rule":
"proctor:contains(
['US', 'CA'], country)"

Arbitrary Complexity
Sometimes rules are unavoidably complex:
"Android v2.1+":
userAgent.android && (
userAgent.OS.majorVersion gt 2 || (
userAgent.OS.majorVersion == 2
&&
userAgent.OS.minorVersion ge 1
)
)
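
The same version check written as plain Java, as a sketch for reading the rule. The class and parameter names are hypothetical stand-ins for the `userAgent` properties used above.

```java
/** Illustrative only: the "Android v2.1+" rule as a Java predicate. */
final class AndroidVersionRule {

    /** True for Android 2.1 and later; compares major first, then minor. */
    static boolean matches(boolean android, int majorVersion, int minorVersion) {
        return android
                && (majorVersion > 2
                    || (majorVersion == 2 && minorVersion >= 1));
    }
}
```

Android 2.0 fails the check, 2.1 and 4.0 pass, and any non-Android device fails regardless of version.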

What context is available?
So far we've seen:
● country
● language
● userAgent
● account
What's the full list of available context variables?

Context Defined in Test Specification
● Test spec declares available context variables
● This is a contract to provide values at runtime
{
"tests": […],
"providedContext": {
"country": "String",
"language": "String",
"userAgent":
"com.indeed.web.UserAgent"
}
}

Provided While Determining Buckets
Also generated from test specification:
private ResumeSearchProctor proctor;

proctor.determineBuckets(
identifiers,
country,
language,
userAgent);

Even Tiny Changes Need Deploys
if (bgColorBucket == 1) {
// Test
model.put("btnBgcolor", "#00f");
} else {
// Control group
model.put("btnBgcolor", "#ccc");
}

Some Tests Just Vary Data
Many tests have no behavioral change:
● CSS Colors
● Display Text
● Algorithm Weights

Payloads
● Values added for each bucket in a test
● Proctor verifies payloads are "all or none"
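
A sketch of what an "all or none" check could look like. This is not Proctor's validation code: the class name and the convention of `null` for "bucket has no payload" are assumptions for this example.

```java
import java.util.List;

/** Illustrative only: checks that either every bucket has a payload or none does. */
final class PayloadCheck {

    /** Each entry is one bucket's payload, or null if that bucket has none. */
    static boolean allOrNone(List<String> payloads) {
        long withPayload = payloads.stream().filter(p -> p != null).count();
        return withPayload == 0 || withPayload == payloads.size();
    }
}
```

A test with payloads `"#ccc"` and `"#00f"` passes, as does a test with no payloads at all. A test where only one bucket carries a value fails, since code reading the payload would have nothing to use for the other bucket.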

Control: Gray
"#ccc"

Test: Blue
"#00f"

Part of Test Definition
● No deployment needed
● Changes live within minutes
"buckets": [{
"id": 0, "name": "gray",
"description": "Control group",
"payload": {
"stringValue": "#ccc"
}
}, …]

Declared in Project Test Specification
● Type definition only
● Must match test definition
"buttonBgColorTst": {
"buckets": […],
"payload": {
"type": "stringValue"
}
}

Cleaner Code, Only Data Deploy

// Choose a background color
model.put(
"btnBgcolor",
groups.getButtonBgColorTstPayload()
);

Cross-Product Tests
Many flavors of cross-product test, including
● Peer webapps
● Client / Service
● Mobile Native / Web

Cross-Product Tests
Even more ways to coordinate tests
● Tracking parameters on links, requests
● Service response metadata
● Different service calls

Proctor offers an interesting alternative

Two products can share test groups
As long as both products
● Share the test’s identifier
● Provide the context variables it uses

Deterministic selection guarantees
identical bucket assignment.

Evolving Tests

control

test

(inactive)

10%

Changed allocations, not ID mapping

control

OOPS!

test

● Inconsistent experience
● Polluted results

Evolving Tests Smoothly

control

test

(inactive)

[ 10%, 10%]
[ 10%, 10%, 80% ]

control

test

(inactive)

[ 10%, 10%, 80% ]

[ 10%, 10%, 40%, 40%]
control

test

control

test

control

(inactive)

test

[ 10%, 80%, 10% ]

[ 50%, 50% ]
control

test
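
A small sketch of why appending ranges keeps existing users in place while resizing them does not. The class and method names are hypothetical; positions are the hashed values in [0, 1) described earlier.

```java
/** Illustrative only: looks up the bucket for a hashed position in an ordered range list. */
final class RangeLayout {

    /** Returns the bucket whose range contains position p in [0, 1). */
    static String bucketAt(double p, String[] buckets, double[] lengths) {
        double upper = 0.0;
        for (int i = 0; i < lengths.length; i++) {
            upper += lengths[i];
            if (p < upper) {
                return buckets[i];
            }
        }
        return buckets[buckets.length - 1]; // guards against rounding at the top end
    }
}
```

Take a user hashed to 0.15. Under `[control 10%, test 10%, inactive 80%]` that user is in test. Growing the test as `[control 10%, test 10%, control 40%, test 40%]` leaves the user in test, while a naive `[control 50%, test 50%]` moves the same user to control, which is the inconsistent experience shown above.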

Evolving Tests… Turbulently
hash ( uid + test.salt ) => bucket
Test range:
control

test1

Any ID:

test1

After re-salt:

test

Contextual Allocation
10% (US):
control

(inactive)

test

(inactive)

test

20% (CA):
control

50% (Rest of World):
control

test

Allocations
Each test definition
● has one or more allocations

Each allocation
● has a rule and ranges totaling 1.0
● except the last, which has no rule.

Allocation Rules
● Use Unified EL, same as test rules.
● Use the same context variables as test rules.
● Choose the first matching allocation.
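
The selection order can be sketched as follows. This is an illustration, not Proctor's code: the class names are hypothetical, a `Predicate<String>` over the country stands in for a Unified EL rule, and `null` marks the rule-less final allocation.

```java
import java.util.List;
import java.util.function.Predicate;

/** Illustrative only: picks the first allocation whose rule matches the context. */
final class AllocationSelector {

    /** One allocation: an optional rule plus the lengths of its ranges. */
    static final class Allocation {
        final Predicate<String> rule; // null means "no rule": always matches
        final double[] rangeLengths;

        Allocation(Predicate<String> rule, double... rangeLengths) {
            this.rule = rule;
            this.rangeLengths = rangeLengths;
        }
    }

    /** Returns the first allocation that has no rule or whose rule accepts the country. */
    static Allocation select(List<Allocation> allocations, String country) {
        for (Allocation a : allocations) {
            if (a.rule == null || a.rule.test(country)) {
                return a;
            }
        }
        throw new IllegalStateException("the last allocation must have no rule");
    }

    /** Checks that an allocation's range lengths total 1.0, within rounding. */
    static boolean totalsOne(Allocation a) {
        double sum = 0.0;
        for (double length : a.rangeLengths) {
            sum += length;
        }
        return Math.abs(sum - 1.0) < 1e-9;
    }
}
```

Because the last allocation has no rule, every visitor matches something, so the order of the list is what decides which ranges apply.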

Allocations in the Test Definition
{ "description": "Button colors",
"type": "USER",
"buckets": [ … ],
"allocations": [{
"rule": "country == 'US'",
"ranges": [ … ]
}, {
"ranges": [ … ]
}]
}

Environments
Local
commit

Integration
push

QA
push

Production

Show test matrix

/private/showTestMatrix

Show test bucket assignments

/private/showGroups

Privileged users can force assignments

?prforceGroups=buttonColorTst1

Beyond A/B Testing
Proctor Patterns for Managing Behavior

Kill Switch
When

● New Feature
How

● 'Active' bucket @ 100%

Phased Rollout
When

● Experimental Feature
How

● 'Active' → 1% → 5% → 100%

Throttle
When

● Downsampling
○ trace logging
○ survey
How

● 'Active' → 1% → 10% → 5% → ??

Feature Toggles
When

● Localized Behavior
● Device-Specific Behavior
● Logged-in, w/ Resume, etc.
How

● Multiple Allocations
● Targeted Rules

Dark Deploys
When

● Partial Implementations
● Additional QA is needed
How

● 'Active' → 100%

Cross-Product Coordination
When

● Dependencies between products
○ Resume Wizard feature
How

● 'Active' bucket at 0%
● Resume Wizard allocation: → 100%
● Home page promo allocation: → 100%

Post-Proctor Tests
103

Proctor
42

Post-Proctor Tests + Toggles
103

Proctor

65

42
10

Proctor Webapp
A/B Test Change Management
(Coming Soon to github)

Building On Proctor
(Not Open Source)

Description:
Group 0: control - Job alert label: Save Alert (control)
Group 1: labelSubscribe - Job alert label: Subscribe
Group 2: labelSignUp - Job alert label: Sign up
Group 3: labelGetJobs - Job alert label: Get jobs
Group 4: labelSendMeNewJobs - Job alert label: Send me new jobs
Group 5: labelActivate - Job alert label: Activate
Group 6: labelSave - Job alert label: Save

History:
jack @ 2013-03-12 (r203267): Promoting jasxjabtnlbltst (trunk r203089) to
production JASX-11365: jasxjabtnlbltst disabled
ketan @ 2012-12-11 (r190675): merged r190418: JASX-10663: Stop
jasxjabtnlbltst in all languages except nl
will @ 2012-11-29 (r188801): merged r187452: JASX-10457: exclude US from
jasxjabtnlbltst
ketan @ 2012-10-25 (r182881): merged r182688: JASX-10234 - Adding new
langauges to job alert button label test
ketan @ 2012-10-25 (r182876): merged r181938: JASX-10234 - Adding test
definition and allocations for job alert button label test

DEMO
Get out your Phones and Tablets

http://go.indeed.com/demo
Simple: test different background colors
25%
25%
50%

25%

25%

Let’s increase our bucket size...

50%

50%

We have a winner!

50%

100%

Let’s do something wacky!

Android

iOS

Android >= 4

iOS >= 7

Also a reference implementation
Running on heroku -- feel free to clone!
http://indeedeng-hello-proctor.herokuapp.com
Source:

github.com/indeedeng/proctor-demo

Q&A
Source:
github.com/indeedeng/proctor
Docs:
indeedeng.github.io/proctor

Next @IndeedEng Talk

Boxcar
Self-balancing distributed services
Wednesday, October 30
R.B. Boyer
