AVOIDING BAD DATABASE SURPRISES
Simulation and Scalability
SOME HORROR STORIES
WEB APP DOESN’T SCALE
You’ve got a brilliant app
You’ve got a brilliant cloud deployment
EC2, ALB — all the right moving parts
It still doesn't scale
SLOW
ANALYTICS DON’T SCALE
You’ve studied the data
You’ve got a model that’s hugely important
It explains things
It predicts things
But. It’s SLOW
YOU BLAMED PYTHON
Web:
NGINX,
uWSGI,
the proxy server,
the coffee shop
Analytics:
Pandas,
Scikit Learn,
Jupyter Notebook,
open office layouts
And More…
STACK OVERFLOW SAYS “PROFILE”
So you profiled
and you profiled
And…
It turns out it’s the database
HORROR MOVIE TROPE
There’s a monster
And it’s in your base
And it’s killing your dudes
It’s Your Database
KILLING THE MONSTER
Hard work
Lots of stress
Many Techniques
Indexing
Denormalization
I/O Concurrency (i.e., more devices)
Compression
CAN WE PREVENT ALL THIS?
Spoiler Alert: Yes
TO AVOID BECOMING A HORROR STORY
Simulate
Early
Often
A/K/A SYNTHETIC DATA
Why You Don’t Necessarily Need Data for Data Science
https://medium.com/capital-one-tech
HORROR MOVIE TROPES
Look behind the door
Don’t run into the dark barn alone
Avoid the trackless forest after dark
Stay with your friends
Don’t dismiss the funny noises
DATA IS THE ONLY THING THAT MATTERS
Foundational Concept
SIDE-BAR ARGUMENT
UX is important, but secondary
But it’s never the kind of bottleneck app and DB servers are
You will also experiment with UX
I’m not saying ignore the UX
Data lasts forever. Data is converted and preserved.
UX is transient. Next release will have a better, more modern experience.
SIMULATE EARLY
Build your data models first
Build the nucleus of your application processing
Build a performance testing environment
With realistic volumes of data
SIMULATE OFTEN
When your data model changes
When you add features
When you acquire more sources of data
Rerun the nucleus of your application processing
With realistic volumes of data
PYTHON MAKES THIS EASY
WHAT YOU’LL NEED
A candidate data model
The nucleus of processing
RESTful API CRUD elements
Analytical Extract-Transform-Load-Map-Reduce steps
A data generator to populate the database
Measurements using time.perf_counter()
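The measurement step can be very small. A minimal sketch, assuming a hypothetical timed() helper and a stand-in nucleus() workload (neither name comes from the talk); swap in the real CRUD or ETL calls:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record elapsed wall-clock seconds for the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

def nucleus(rows):
    """Stand-in for the nucleus of processing (hypothetical workload)."""
    return sum(len(str(r)) for r in rows)

results = {}
with timed("nucleus, 100k rows", results):
    total = nucleus(range(100_000))

for label, seconds in results.items():
    print(f"{label}: {seconds:.4f}s")
```

Collecting the timings in a dictionary makes it easy to compare runs after each change to the data model.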
DATA MODEL
SQLAlchemy or Django ORM (or others, SQLObject, etc.)
Data Classes
Plain Old Python Objects (POPO) and JSON serialization
If we use JSON Schema validation, we can do cool stuff
THE PROBLEM: FLEXIBILITY
SQL provides minimal type information (string, decimal, date, etc.)
No ranges, enumerated values or other domain details (e.g., name vs. address)
Does provide Primary Key and Foreign Key information
Data classes provide more detailed type information
Still doesn’t include ranges or other domain details
No PK/FK help at all
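A short illustration of the data class limitation, using a hypothetical playing-card class. The annotations name the types, but nothing states the allowed suits or the rank range, and nothing is checked at run time:

```python
from dataclasses import dataclass

@dataclass
class Card:
    suit: str   # intended: one of H, S, D, C (not expressible here)
    rank: int   # intended: 1 through 13 (not expressible here)

# Both values are outside the intended domain, yet construction succeeds.
bad = Card(suit="hearts", rank=-12)
print(bad)
```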
A SOLUTION
Narrow type specifications using JSON Schema
Examples to follow
class Card(Model):
    """
    title: Card
    description: "Simple Playing Cards"
    type: object
    properties:
      suit:
        type: string
        enum: ["H", "S", "D", "C"]
      rank:
        type: integer
        minimum: 1
        maximum: 13
    """
JSON Schema Definition in YAML Notation
HOW DOES THIS WORK?
A metaclass parses the schema YAML and builds a validator
An abstract superclass provides __init__() to validate the document
import yaml
import json
import jsonschema

class SchemaMeta(type):
    def __new__(mcs, name, bases, namespace):
        # pylint: disable=protected-access
        result = type.__new__(mcs, name, bases, dict(namespace))
        # safe_load() parses plain YAML; yaml.load() requires an explicit Loader
        result.SCHEMA = yaml.safe_load(result.__doc__)
        jsonschema.Draft4Validator.check_schema(result.SCHEMA)
        result._validator = jsonschema.Draft4Validator(result.SCHEMA)
        return result
Builds JSONSchema validator from __doc__ string
class Model(dict, metaclass=SchemaMeta):
    """
    title: Model
    description: abstract superclass for Model
    """
    @classmethod
    def from_json(cls, document):
        # The YAML parser also reads ordinary JSON documents like these
        return cls(yaml.safe_load(document))

    @property
    def json(self):
        return json.dumps(self)

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        if not self._validator.is_valid(self):
            raise TypeError(list(self._validator.iter_errors(self)))
Validates object and raises TypeError
>>> h1 = Card.from_json('{"suit": "H", "rank": 1}')
>>> h1['suit']
'H'
>>> h1.json
'{"suit": "H", "rank": 1}'
Deserialize POPO from JSON text
Serialize POPO into JSON text
>>> d = Card.from_json('{"suit": "hearts", "rank": -12}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 8, in from_json
File "<stdin>", line 15, in __init__
TypeError: [<ValidationError: "'hearts' is not one of
['H', 'S', 'D', 'C']">, <ValidationError: '-12 is less
than the minimum of 1'>]
Fail to deserialize invalid POPO from JSON text
WHY?
JSON Schema allows us to provide
Type (string, number, integer, boolean, array, or object)
Ranges for numerics
Enumerated values (for numbers or strings)
Format for strings (e.g., email, uri, date-time)
Text Patterns for strings (more general regular expression handling)
DATABASE SIMULATION
AHA
With JSON schema we can build simulated data
THERE ARE SIX SCHEMA TYPES
null — Always None
integer — Use const, enum, minimum, maximum constraints
number — Use const, enum, minimum, maximum constraints
string — Use const, enum, format, or pattern constraints
There are 17 defined formats to narrow the constraints
array — recursively expand items to build an array
object — recursively expand properties to build a document
import random

class Generator:
    def __init__(self, parent_schema, domains=None):
        self.schema = parent_schema

    def gen_null(self, schema):
        return None

    def gen_string(self, schema): ...
    def gen_integer(self, schema): ...
    def gen_number(self, schema): ...

    def gen_array(self, schema):
        # Length bounds come from the schema; the fallback values are assumptions
        lo = schema.get('minItems', 0)
        hi = schema.get('maxItems', lo + 5)
        doc = [self.generate(schema.get('items'))
               for _ in range(random.randint(lo, hi))]
        return doc

    def gen_object(self, schema):
        doc = {
            name: self.generate(subschema)
            for name, subschema in schema.get('properties', {}).items()
        }
        return doc

    def generate(self, schema=None):
        schema = schema or self.schema
        schema_type = schema.get('type', 'object')
        method = getattr(self, f"gen_{schema_type}")
        return method(schema)
Finds gen_* methods
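The bodies of gen_integer and gen_number are elided on the slide. One plausible shape, written here as standalone functions, honors const, enum, minimum and maximum. The DEFAULT_SPAN fallback and the _bounds() helper are assumptions for illustration, not the talk's code:

```python
import math
import random

DEFAULT_SPAN = 1000  # assumed range width when the schema omits a bound

def _bounds(schema):
    """Return (lo, hi), filling in whichever bound the schema leaves out."""
    lo = schema.get('minimum')
    hi = schema.get('maximum')
    if lo is None and hi is None:
        return 0, DEFAULT_SPAN
    if lo is None:
        return hi - DEFAULT_SPAN, hi
    if hi is None:
        return lo, lo + DEFAULT_SPAN
    return lo, hi

def gen_integer(schema):
    if 'const' in schema:
        return schema['const']
    if 'enum' in schema:
        return random.choice(schema['enum'])
    lo, hi = _bounds(schema)
    # ceil/floor keep the result inside fractional bounds such as 0.5
    return random.randint(math.ceil(lo), math.floor(hi))

def gen_number(schema):
    if 'const' in schema:
        return schema['const']
    if 'enum' in schema:
        return random.choice(schema['enum'])
    lo, hi = _bounds(schema)
    return random.uniform(lo, hi)

rank_schema = {'type': 'integer', 'minimum': 1, 'maximum': 13}
ranks = [gen_integer(rank_schema) for _ in range(1000)]
print(min(ranks), max(ranks))
```

A uniform distribution is the simplest choice. Real data is often skewed, so a different distribution may give more realistic index and query behavior.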
def make_documents(model_class, count=100, domains=None):
    generator = Generator(model_class.SCHEMA, domains)
    docs_iter = (generator.generate() for i in range(count))
    for doc in docs_iter:
        print(model_class(**doc))
Or write to a file
Or load a database
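Loading a database and timing a query might look like the following. This is a sketch using an in-memory SQLite database from the standard library; the table layout, column names, and the load_cards() and time_query() helpers are assumptions for illustration, and the randomly built documents stand in for generator output:

```python
import random
import sqlite3
import time

def load_cards(conn, docs):
    """Create a table and bulk-insert the generated documents."""
    conn.execute("CREATE TABLE card (card_suit TEXT, card_rank INTEGER)")
    conn.executemany(
        "INSERT INTO card (card_suit, card_rank) VALUES (:suit, :rank)", docs
    )
    conn.commit()

def time_query(conn, sql, params=()):
    """Return (elapsed seconds, rows) for one execution of the query."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return time.perf_counter() - start, rows

# Stand-in for documents produced by a schema-driven generator
docs = [
    {"suit": random.choice("HSDC"), "rank": random.randint(1, 13)}
    for _ in range(50_000)
]

conn = sqlite3.connect(":memory:")
load_cards(conn, docs)

query = "SELECT COUNT(*) FROM card WHERE card_rank = ?"
before, rows = time_query(conn, query, (13,))
conn.execute("CREATE INDEX card_rank_ix ON card (card_rank)")
after, rows_indexed = time_query(conn, query, (13,))

print(f"no index: {before:.5f}s  with index: {after:.5f}s")
```

Whether the index helps at a given volume is exactly what the measurement is for; rerun with larger counts to see how the numbers move.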
NOW YOU CAN SIMULATE
Early Often
WHAT ABOUT?
More sophisticated data domains?
Name, Address, Comments, etc.
More than text. No simple format.
Primary Key and Foreign Key Relationships
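For key relationships, one common approach is to generate the parent rows first and then draw each child's foreign key from the parent keys already created. A minimal sketch; the customer and order names and both helper functions are hypothetical, not taken from the talk:

```python
import itertools
import random

def make_parents(count):
    """Parent documents with sequential primary keys."""
    ids = itertools.count(1)
    return [{"customer_id": next(ids)} for _ in range(count)]

def make_children(parents, count):
    """Child documents whose foreign key always points at an existing parent."""
    parent_ids = [p["customer_id"] for p in parents]
    ids = itertools.count(1)
    return [
        {"order_id": next(ids), "customer_id": random.choice(parent_ids)}
        for _ in range(count)
    ]

customers = make_parents(100)
orders = make_children(customers, 1000)
print(orders[0])
```

Choosing parents uniformly gives every customer roughly the same number of orders. A skewed choice (for example, random.choices with weights) is closer to data where a few parents own most of the children.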
HANDLING FORMATS
def gen_string(self, schema):
    if 'const' in schema:
        return schema['const']
    elif 'enum' in schema:
        return random.choice(schema['enum'])
    elif 'format' in schema:
        return FORMATS[schema['format']]()
    else:
        return "string"
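import datetime
import random

# Whole-second bounds: random.randrange() needs integer arguments
TS_RANGE = (int(datetime.datetime(1900, 1, 1).timestamp()),
            int(datetime.datetime(2100, 12, 31).timestamp()))

FORMATS = {
    'date-time': (
        lambda: datetime.datetime.fromtimestamp(
            random.randrange(*TS_RANGE), tz=datetime.timezone.utc
        ).isoformat()
    ),
    'date': (
        lambda: datetime.datetime.fromtimestamp(
            random.randrange(*TS_RANGE), tz=datetime.timezone.utc
        ).date().isoformat()
    ),
    'time': (
        lambda: datetime.datetime.fromtimestamp(
            random.randrange(*TS_RANGE), tz=datetime.timezone.utc
        ).time().isoformat()
    ),
    # entries for the remaining formats are not shown on the slide
}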
DATA DOMAINS
String format (and enum) may not be enough to characterize data
Doing Text Search or Indexing? You want text-like data
Using Names or Addresses? Random strings may not be appropriate.
Credit card numbers? You want 16-digit strings
EXAMPLE DOMAIN: DIGITS
def digits(n):
    # Random string of n decimal digits
    return ''.join(random.choice('0123456789') for _ in range(n))
EXAMPLE DOMAIN: NAMES
class LoremIpsum:
    _phrases = [
        "Lorem ipsum dolor sit amet",
        "consectetur adipiscing elit",
        # …etc.…
        "mollis eleifend leo venenatis"
    ]

    @staticmethod
    def word():
        return random.choice(random.choice(LoremIpsum._phrases).split())

    @staticmethod
    def name():
        return ' '.join(LoremIpsum.word() for _ in range(3)).title()
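Domain generators like these still have to be connected to the schema. One possible wiring, sketched under a loud assumption: each property carries a non-standard "domain" keyword that names an entry in a registry. The keyword, the DOMAINS table, and its entries are all hypothetical and are not shown in the talk:

```python
import random

def digits(n):
    """Random string of n decimal digits."""
    return ''.join(random.choice('0123456789') for _ in range(n))

# Hypothetical registry mapping a domain name to a zero-argument generator
DOMAINS = {
    "card_number": lambda: digits(16),
    "zip_code": lambda: digits(5),
}

def gen_string(schema, domains=DOMAINS):
    if 'const' in schema:
        return schema['const']
    if 'enum' in schema:
        return random.choice(schema['enum'])
    if 'domain' in schema:  # assumed extension keyword, not part of JSON Schema
        return domains[schema['domain']]()
    return "string"

value = gen_string({"type": "string", "domain": "card_number"})
print(value)
```

Keeping the registry separate from the generator lets each project supply its own domains without touching the traversal code.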
RECAP
HOW TO GET INTO TROUBLE
Faith
Have faith the best practices you read in a blog really work
Assume
Assume you understand best practices you read in a blog
Hope
Hope you will somehow avoid scaling problems
SOME TROPES
Look behind the door
Don’t run into the dark barn alone
Avoid the trackless forest after dark
Stay with your friends
Don’t dismiss the funny noises
TO DO CHECKLIST
Simulate Early and Often
Define Python Classes
Use JSON Schema to provide fine-grained definitions
With ranges, formats, enums
Build a generator to populate instances in bulk
Gather Performance Data
Profit
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
