The document discusses how to avoid bad database surprises through early simulation and scalability testing. It provides examples of web and analytics apps that did not scale due to unanticipated database issues. It recommends using Python classes and JSON schema to define data models and generate synthetic test data. This allows simulating the full system early in development to identify potential performance bottlenecks before real data is involved.
3. WEB APP DOESN’T SCALE
You’ve got a brilliant app
You’ve got a brilliant cloud deployment
EC2, ALB — all the right moving parts
It still doesn't scale
SLOW
4. ANALYTICS DON’T SCALE
You’ve studied the data
You’ve got a model that’s hugely important
It explains things
It predicts things
But it’s SLOW
11. A/K/A SYNTHETIC DATA
Why You Don’t Necessarily Need Data for Data Science
https://medium.com/capital-one-tech
12. HORROR MOVIE TROPES
Look behind the door
Don’t run into the dark barn alone
Avoid the trackless forest after dark
Stay with your friends
Don’t dismiss the funny noises
13. DATA IS THE ONLY THING THAT MATTERS
Foundational Concept
14. SIDE-BAR ARGUMENT
UX is important, but secondary
But it’s never the kind of bottleneck that app and DB servers are
You will also experiment with UX
I’m not saying ignore the UX
Data lasts forever. Data is converted and preserved.
UX is transient. Next release will have a better, more modern experience.
15. SIMULATE EARLY
Build your data models first
Build the nucleus of your application processing
Build a performance testing environment
With realistic volumes of data
16. SIMULATE OFTEN
When your data model changes
When you add features
When you acquire more sources of data
Rerun the nucleus of your application processing
With realistic volumes of data
18. WHAT YOU’LL NEED
A candidate data model
The nucleus of processing
RESTful API CRUD elements
Analytical Extract-Transform-Load-Map-Reduce steps
A data generator to populate the database
Measurements using time.perf_counter()
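The measurement step can be as small as a timing wrapper around the processing nucleus. The following is a minimal sketch, assuming a context manager is a convenient shape; `timed` and `process_all` are hypothetical names that do not come from the slides.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record elapsed seconds for the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def process_all(documents):
    # Hypothetical stand-in for the nucleus of application processing.
    return sum(doc["rank"] for doc in documents)


results = {}
documents = [{"suit": "H", "rank": (i % 13) + 1} for i in range(100_000)]
with timed("process_all", results):
    total = process_all(documents)
print(f"process_all: {results['process_all']:.4f} s for {len(documents)} documents")
```

Collecting timings in a dictionary makes it easy to compare runs after each data model change.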
19. DATA MODEL
SQLAlchemy or Django ORM (or others, SQLObject, etc.)
Data Classes
Plain Old Python Objects (POPO) and JSON serialization
If we use JSON Schema validation, we can do cool stuff
20. THE PROBLEM: FLEXIBILITY
SQL provides minimal type information (string, decimal, date, etc.)
No ranges, enumerated values or other domain details (e.g., name vs. address)
Does provide Primary Key and Foreign Key information
Data classes provide more detailed type information
Still doesn’t include ranges or other domain details
No PK/FK help at all
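A short illustration of the data class limitation, as a sketch only: the `Card` fields and their intended domains are assumptions based on the card examples elsewhere in these slides.

```python
from dataclasses import dataclass, fields


@dataclass
class Card:
    suit: str  # intended domain: one of "H", "S", "D", "C"
    rank: int  # intended domain (assumed): 1 through 13


# The annotations record only str and int, and the generated __init__ does
# not check them, so out-of-domain values are accepted silently.
bad = Card(suit="hearts", rank=-12)
print(bad)
print([(f.name, f.type) for f in fields(Card)])
```

Nothing in the class definition tells a data generator which suits or ranks are legal, which is the gap JSON Schema is meant to fill.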
23. HOW DOES THIS WORK?
A metaclass parses the schema YAML and builds a validator
An abstract superclass provides __init__() to validate the document
24.
import json

import jsonschema
import yaml


class SchemaMeta(type):
    def __new__(mcs, name, bases, namespace):
        # pylint: disable=protected-access
        result = type.__new__(mcs, name, bases, dict(namespace))
        # The class docstring holds the schema as YAML text.
        # safe_load is used because plain yaml.load() needs an explicit Loader
        # in current PyYAML releases.
        result.SCHEMA = yaml.safe_load(result.__doc__)
        jsonschema.Draft4Validator.check_schema(result.SCHEMA)
        result._validator = jsonschema.Draft4Validator(result.SCHEMA)
        return result
Builds a JSON Schema validator from the __doc__ string
25.
class Model(dict, metaclass=SchemaMeta):
    """
    title: Model
    description: abstract superclass for Model
    """

    @classmethod
    def from_json(cls, document):
        # The YAML parser also accepts typical JSON text.
        return cls(yaml.safe_load(document))

    @property
    def json(self):
        return json.dumps(self)

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        # Reject documents that do not satisfy the class schema.
        if not self._validator.is_valid(self):
            raise TypeError(list(self._validator.iter_errors(self)))
Validates the object and raises TypeError on failure
26.
>>> h1 = Card.from_json('{"suit": "H", "rank": 1}')
>>> h1['suit']
'H'
>>> h1.json
'{"suit": "H", "rank": 1}'
Deserialize POPO from JSON text
Serialize POPO into JSON text
27.
>>> d = Card.from_json('{"suit": "hearts", "rank": -12}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 8, in from_json
File "<stdin>", line 15, in __init__
TypeError: [<ValidationError: "'hearts' is not one of
['H', 'S', 'D', 'C']">, <ValidationError: '-12 is less
than the minimum of 1'>]
Fail to deserialize invalid POPO from JSON text
28. WHY?
JSON Schema allows us to provide
Type (string, number, integer, boolean, array, or object)
Ranges for numerics
Enumerated values (for numbers or strings)
Format for strings (e.g., email, uri, date-time)
Text Patterns for strings (more general regular expression handling)
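To make these constraints concrete, here is a sketch of what a Card schema might look like, written as a Python dict. The enum values and the minimum come from the error messages shown in the slides; the maximum of 13 and the `required` list are assumptions. The `violations` function is a tiny hand-rolled stand-in that checks only three keywords; it is not the jsonschema library and is included only so the example runs with the standard library.

```python
import json

# Hypothetical schema: enum and minimum follow the slides, the rest is assumed.
CARD_SCHEMA = {
    "title": "Card",
    "type": "object",
    "properties": {
        "suit": {"type": "string", "enum": ["H", "S", "D", "C"]},
        "rank": {"type": "integer", "minimum": 1, "maximum": 13},
    },
    "required": ["suit", "rank"],
}


def violations(schema, document):
    """List enum, minimum, and maximum violations for a flat object."""
    problems = []
    for name, rules in schema["properties"].items():
        if name not in document:
            if name in schema.get("required", []):
                problems.append(f"{name}: missing required property")
            continue
        value = document[name]
        if "enum" in rules and value not in rules["enum"]:
            problems.append(f"{name}: {value!r} is not one of {rules['enum']}")
        if "minimum" in rules and value < rules["minimum"]:
            problems.append(f"{name}: {value!r} is less than {rules['minimum']}")
        if "maximum" in rules and value > rules["maximum"]:
            problems.append(f"{name}: {value!r} is greater than {rules['maximum']}")
    return problems


print(json.dumps(CARD_SCHEMA, indent=2))
print(violations(CARD_SCHEMA, {"suit": "hearts", "rank": -12}))
```

The same dict is what a data generator can walk to produce only legal suits and ranks.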
31. THERE ARE SIX SCHEMA TYPES
null — Always None
integer — Use const, enum, minimum, maximum constraints
number — Use const, enum, minimum, maximum constraints
string — Use const, enum, format, or pattern constraints
There are 17 defined formats to narrow the constraints
array — recursively expand items to build an array
object — recursively expand properties to build a document
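The recursive expansion described above can be sketched as a small dispatch on the schema type. This is an illustration under stated assumptions, not the generator from the talk: `SketchGenerator`, `HAND_SCHEMA`, and the default ranges are hypothetical, and string formats and patterns are ignored here.

```python
import random


class SketchGenerator:
    """Hypothetical generator: walks a schema and builds one random document."""

    def __init__(self, schema, seed=None):
        self.schema = schema
        self.rng = random.Random(seed)

    def generate(self, schema=None):
        schema = self.schema if schema is None else schema
        if "const" in schema:
            return schema["const"]
        if "enum" in schema:
            return self.rng.choice(schema["enum"])
        kind = schema.get("type")
        if kind == "null":
            return None
        if kind == "integer":
            return self.rng.randint(schema.get("minimum", 0), schema.get("maximum", 100))
        if kind == "number":
            return self.rng.uniform(schema.get("minimum", 0.0), schema.get("maximum", 1.0))
        if kind == "string":
            return "string"  # placeholder; formats are not handled in this sketch
        if kind == "array":
            low = schema.get("minItems", 0)
            high = schema.get("maxItems", low + 5)
            size = self.rng.randint(low, high)
            return [self.generate(schema["items"]) for _ in range(size)]
        if kind == "object":
            return {
                name: self.generate(sub)
                for name, sub in schema["properties"].items()
            }
        raise ValueError(f"unsupported schema: {schema!r}")


HAND_SCHEMA = {
    "type": "object",
    "properties": {
        "player": {"type": "string"},
        "cards": {
            "type": "array",
            "minItems": 1,
            "maxItems": 5,
            "items": {
                "type": "object",
                "properties": {
                    "suit": {"type": "string", "enum": ["H", "S", "D", "C"]},
                    "rank": {"type": "integer", "minimum": 1, "maximum": 13},
                },
            },
        },
    },
}

print(SketchGenerator(HAND_SCHEMA, seed=42).generate())
```

Passing a seed keeps generated data repeatable between performance runs.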
33.
def make_documents(model_class, count=100, domains=None):
    generator = Generator(model_class.SCHEMA, domains)
    docs_iter = (generator.generate() for i in range(count))
    for doc in docs_iter:
        print(model_class(**doc))
Or write to a file
Or load a database
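The "load a database" option might look like the sketch below, which bulk-inserts generated documents into an in-memory SQLite database and times the load. The `card` table, `load_cards`, and the inline document generation are assumptions made for illustration; a real test would target the actual database engine and data model.

```python
import random
import sqlite3
import time


def load_cards(connection, documents):
    """Bulk-insert card documents into a hypothetical `card` table."""
    connection.execute("CREATE TABLE IF NOT EXISTS card (suit TEXT, rank INTEGER)")
    connection.executemany(
        "INSERT INTO card (suit, rank) VALUES (:suit, :rank)", documents
    )
    connection.commit()


rng = random.Random(1)
documents = [
    {"suit": rng.choice("HSDC"), "rank": rng.randint(1, 13)}
    for _ in range(10_000)
]

connection = sqlite3.connect(":memory:")
start = time.perf_counter()
load_cards(connection, documents)
elapsed = time.perf_counter() - start

(row_count,) = connection.execute("SELECT COUNT(*) FROM card").fetchone()
print(f"loaded {row_count} rows in {elapsed:.4f} s")
```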
35. WHAT ABOUT?
More sophisticated data domains?
Name, Address, Comments, etc.
More than text. No simple format.
Primary Key and Foreign Key Relationships
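For key relationships, one possible approach is to generate parent rows first and then draw each child's foreign key from the parent keys that actually exist. The sketch below assumes that approach; `make_related`, `player_id`, and `hand_id` are hypothetical names, not part of the slides.

```python
import random


def make_related(parent_count, child_count, seed=None):
    """Build parent rows with sequential keys and children that reference them."""
    rng = random.Random(seed)
    players = [
        {"player_id": n, "name": f"player-{n}"}
        for n in range(1, parent_count + 1)
    ]
    parent_keys = [player["player_id"] for player in players]
    hands = [
        # Every foreign key is drawn from existing primary keys.
        {"hand_id": n, "player_id": rng.choice(parent_keys)}
        for n in range(1, child_count + 1)
    ]
    return players, hands


players, hands = make_related(parent_count=10, child_count=50, seed=7)
print(players[0], hands[0])
```

A uniform choice gives every parent roughly the same number of children; real data is often skewed, so a weighted choice may be closer to production behavior.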
36. HANDLING FORMATS
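# A method of the generator class; needs `import random` and a FORMATS mapping.
def gen_string(self, schema):
    if 'const' in schema:
        return schema['const']
    elif 'enum' in schema:
        return random.choice(schema['enum'])
    elif 'format' in schema:
        # Look up a generator function by format name and call it.
        return FORMATS[schema['format']]()
    else:
        return "string"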
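The FORMATS lookup used above is not shown in the slides. One plausible shape is a dict from format name to a zero-argument function, sketched here for three formats; the helper names and the example.com values are assumptions.

```python
import datetime
import random
import string


def gen_email():
    user = "".join(random.choices(string.ascii_lowercase, k=8))
    return f"{user}@example.com"


def gen_date_time():
    # A random instant within one year of an arbitrary base date, in ISO 8601 form.
    base = datetime.datetime(2020, 1, 1, tzinfo=datetime.timezone.utc)
    offset = datetime.timedelta(seconds=random.randrange(365 * 24 * 60 * 60))
    return (base + offset).isoformat()


def gen_uri():
    path = "".join(random.choices(string.ascii_lowercase, k=6))
    return f"https://example.com/{path}"


# Hypothetical mapping: format name -> function that returns a sample value.
FORMATS = {
    "email": gen_email,
    "date-time": gen_date_time,
    "uri": gen_uri,
}

print(FORMATS["email"](), FORMATS["date-time"](), FORMATS["uri"]())
```

Adding a format then means adding one function and one dict entry, with no change to gen_string().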
38. DATA DOMAINS
String format (and enum) may not be enough to characterize data
Doing Text Search or Indexing? You want text-like data
Using Names or Addresses? Random strings may not be appropriate.
Credit card numbers? You want 16-digit strings
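How the `domains` argument is consumed is not shown in these slides. One possibility, sketched below as an assumption, is a mapping from a domain name to a function that produces realistic-looking values. The word lists and function names are hypothetical, and the card numbers are random digits with no check digit, so they are not valid card numbers.

```python
import random

FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing", "Dijkstra"]
VOCABULARY = ["data", "model", "scale", "query", "index", "slow", "fast", "test"]


def gen_name():
    return f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}"


def gen_card_number():
    # 16 random digits; no Luhn check digit is computed.
    return "".join(random.choices("0123456789", k=16))


def gen_text(words=20):
    # Text-like data for search and indexing tests.
    return " ".join(random.choices(VOCABULARY, k=words))


# Hypothetical domain registry: domain name -> value generator.
DOMAINS = {
    "name": gen_name,
    "credit_card": gen_card_number,
    "text": gen_text,
}

print({domain: make() for domain, make in DOMAINS.items()})
```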
42. HOW TO GET INTO TROUBLE
Faith
Have faith the best practices you read in a blog really work
Assume
Assume you understand best practices you read in a blog
Hope
Hope you will somehow avoid scaling problems
43. SOME TROPES
Look behind the door
Don’t run into the dark barn alone
Avoid the trackless forest after dark
Stay with your friends
Don’t dismiss the funny noises
44. TO DO CHECKLIST
Simulate Early and Often
Define Python Classes
Use JSON Schema to provide fine-grained definitions
With ranges, formats, enums
Build a generator to populate instances in bulk
Gather Performance Data
Profit