Sea of Data
1. Sea of data
Story of data, scale and how we evolve
architecture to handle it.
Daniel Marchant (@driedtoast)
2. What do you think of when
you hear the word “data”?
3. Setting the stage
Data – Things known or assumed as facts,
making the basis of reasoning or calculation
Time – the indefinite continued progress of
existence and events in the past, present and
future regarded as a whole
5. Types of data
● Customer Data - Data the customer provides,
the lifeblood of your application
● Business Data - Metrics on growth,
customer attrition, marketing, etc...
● Operation Data - Metrics and log messages
that help troubleshoot / monitor your
application
7. Once upon a time...
A company was founded to
produce the best seamonkey
management application ever
produced. (purely fictional for
now)
More details: http://www.seamonkey.xyz (eventually)
8. A hypothetical system timeline
● Launch of application
● Reddit posts promote application
● Hacker News promotes application
● Product Hunt promotes application
10. Initial dataset
● Operation Data
○ cpu / memory / disk metrics
○ error messages in logs
● Business Data
○ Signup metrics
○ Access usage
● Customer Data
○ User
○ Seamonkey info
12. Architecture
● Load balancer - route traffic to application
● Application - handles requests and manages
data to the database
● Database - data storage
So simple, life is good! Some reads and writes!
13. Integrations
● Metric Service - google analytics,
kilometer.io, kissmetrics, mixpanel, etc...
● Operation Events - datadog, graylog,
newrelic, etc…
14. Troubleshooting
● Pretty straightforward
● Check application can write to DB
● Make sure database user can access tables
● Make sure the transactions scoped in the
application make sense
● Check rollback scenarios
15. A little about ACID
● Atomicity: all task(s) within a transaction are performed or
none of them are. An all-or-none principle.
● Consistency: a transaction does not violate the database's
integrity rules, and the data must remain in a consistent
state at the beginning and end of a transaction; no
half-completed transactions.
● Isolation: each transaction is independent unto itself for
both performance and consistency of transactions.
● Durability: Once complete the transaction will persist as
complete; it will survive system failure, power loss and other
types of system breakdowns.
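The ACID guarantees above can be sketched with Python's built-in sqlite3 module. The table and the failed-transfer scenario are made up for illustration; the point is that `with conn:` opens a transaction that either commits both updates or rolls both back:

```python
import sqlite3

# In-memory database for the sketch; the accounts table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Atomicity: both updates happen, or neither does."""
    try:
        with conn:  # transaction: commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # Consistency: reject a state that violates our invariant
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # rolled back; the data is untouched

transfer(conn, "alice", "bob", 150)  # fails mid-transaction -> rolled back
print(conn.execute("SELECT balance FROM accounts WHERE name='alice'").fetchone())
# -> (100,): no half-completed transfer is visible
```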
17. Data evolution
● Operation Data Additions
○ Timers on critical logic
○ Customer requests
● Business Data Additions
○ Customer emails on problems
● Customer Data Additions
○ Seamonkey Tank
○ Seamonkey Social interactions
19. Architecture
● Load balancer - route traffic to application
● Application - still managing data, more
nodes added
● Worker - handles work from the db ‘queue’
table
● Cache - used to taper database reads
● Database - data storage master
● Read Only Database - slave data storage
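The cache-tapering read path above can be sketched like this. A plain dict stands in for the real cache (Redis, memcached, ...) and the table name is invented for the example:

```python
import sqlite3

# Minimal cache-aside sketch: reads check the cache first, writes invalidate.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE seamonkeys (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO seamonkeys VALUES (1, 'Finn')")
db.commit()

cache = {}

def get_seamonkey(monkey_id):
    if monkey_id in cache:            # cache hit: database read avoided
        return cache[monkey_id]
    row = db.execute("SELECT name FROM seamonkeys WHERE id = ?",
                     (monkey_id,)).fetchone()
    if row:
        cache[monkey_id] = row[0]     # fill the cache for later reads
        return row[0]
    return None

def update_seamonkey(monkey_id, name):
    db.execute("UPDATE seamonkeys SET name = ? WHERE id = ?", (name, monkey_id))
    db.commit()
    cache.pop(monkey_id, None)        # invalidate, or reads go stale

print(get_seamonkey(1))  # miss -> reads DB, fills cache
print(get_seamonkey(1))  # hit  -> served from cache
```

The troubleshooting slide that follows is exactly about the `cache.pop` line: forget it on one write path and the application serves stale data.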
20. More integrations
● Gmail - customer emails
● DataLoop - Timers and statsd data
● Open Tracing - distributed event tracing
http://opentracing.io/
21. More Troubleshooting
● If the application isn't displaying the right
data, is the cache invalidated properly?
● Has the worker applied its updates as
changes happen within the queued
process?
● Is replication working from master to
slave?
23. Data evolution
● Operation Data / Business Data convergence
○ Customer requests
○ Customer emails to support cases
○ Customer usage to product roadmap
● Customer Data requirements stabilize
25. Architecture
● Application - still managing data, more
nodes added, application pushes writes to a
queue for non-critical work
● Worker - handles work coming from queue
vs db, and writes from application. Also
invalidates cache now.
● Cache - used to taper database reads. App is
getting more complex invalidation logic
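A minimal sketch of this write path, assuming an in-process `queue.Queue` standing in for a real broker, and invented table/field names. The application enqueues non-critical writes; the worker drains the queue, writes to the database, and now also owns cache invalidation:

```python
import queue
import sqlite3

events = queue.Queue()  # stand-in for a real message queue
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tanks (id INTEGER PRIMARY KEY, temp REAL)")
db.execute("INSERT INTO tanks VALUES (1, 72.0)")
db.commit()
cache = {1: 72.0}

def app_set_temp(tank_id, temp):
    """Application: enqueue the change instead of writing synchronously."""
    events.put({"tank_id": tank_id, "temp": temp})

def worker_drain():
    """Worker: apply queued writes, then invalidate affected cache entries."""
    while not events.empty():
        ev = events.get()
        db.execute("UPDATE tanks SET temp = ? WHERE id = ?",
                   (ev["temp"], ev["tank_id"]))
        db.commit()
        cache.pop(ev["tank_id"], None)  # invalidation moved to the worker

app_set_temp(1, 75.5)     # app returns immediately; DB not yet updated
worker_drain()            # worker catches up
print(db.execute("SELECT temp FROM tanks WHERE id = 1").fetchone())  # (75.5,)
```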
26. CAPs off to you!
● Consistency: same idea presented in ACID.
All data storage nodes see the data.
● Availability: every request receives a
response; the data is available
● Partition Tolerance: system continues to
operate even under circumstances of data
loss or system failure. A single node failure
should not cause the entire system to
collapse.
27. Troubleshooting
● Oh boy, more systems more debugging
“opportunities”
● If data isn’t updated, has the queue gotten
the event from the application? Has the
worker processed the change event and
written to db?
● Is the queue up? Is the worker up?
29. Data evolution
● Operation Data
○ Hopes for attrition
● Business Data
○ Monitors customer attrition
○ Hopes for NO attrition
● Customer Data
○ Grows insane
○ Working out archive strategies
31. Architecture
● Lifecycle service / database - added a service
to migrate some of the monolith app, service
just handles seamonkey growth and lifecycle
● Worker - still listens for events, writes to
lifecycle service
● Stream - swapped out the queue with an
immutable stream, better data recovery
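The queue-to-stream swap can be sketched as below. The append-only list stands in for Kafka/Kinesis, and the event shape is invented for the example. The worker tracks an offset it can rewind for recovery, and dedupes so replays are safe:

```python
# Immutable stream sketch: events are only appended, never changed.
stream = []          # stand-in for a Kafka/Kinesis topic
applied = {}         # worker's materialized state
seen_ids = set()     # dedupe: events may be re-read after an offset reset

def publish(event_id, key, value):
    stream.append({"id": event_id, "key": key, "value": value})

def consume_from(offset):
    """Replay the stream from `offset`; duplicates are skipped (idempotent)."""
    for event in stream[offset:]:
        if event["id"] in seen_ids:
            continue                  # already applied on a previous pass
        applied[event["key"]] = event["value"]
        seen_ids.add(event["id"])
    return len(stream)                # new offset = end of stream

publish("e1", "monkey:1", "egg")
publish("e2", "monkey:1", "hatched")
offset = consume_from(0)
# Recovery: rewind to 0 and replay -- state is unchanged thanks to dedupe.
consume_from(0)
print(applied)  # {'monkey:1': 'hatched'}
```

This is the "better data recovery" of the bullet above: because the stream is immutable, a broken worker can always rewind and replay, as long as it handles duplicates.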
32. BASE
● Basically Available: the system guarantees availability of
the data in the CAP Theorem sense; there will be a
response to any request. The response could be a failure to
find the data, or the data could be in an inconsistent state.
● Soft state: state of the data could change over time, there
may be changes going on due to ‘eventual consistency’
● Eventual consistency: data will eventually become
consistent once it stops receiving changes. The system will
continue to receive changes and is not checking the
consistency of every transaction before it moves onto the
next one.
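Eventual consistency can be simulated in a few lines. The primary/replica dicts and the manual replication step are invented for illustration; a real system ships changes asynchronously over the network:

```python
# Tiny eventual-consistency simulation: writes land on the primary, and a
# later replication pass copies them to the replica.
primary = {}
replica = {}
replication_backlog = []

def write(key, value):
    primary[key] = value
    replication_backlog.append((key, value))  # shipped asynchronously

def replicate():
    """Drain the backlog; once it is empty, the nodes agree."""
    while replication_backlog:
        key, value = replication_backlog.pop(0)
        replica[key] = value

write("monkey:1", "hatched")
print(replica.get("monkey:1"))  # None -- stale read: soft state
replicate()
print(replica.get("monkey:1"))  # 'hatched' -- eventually consistent
```

The stale `None` read in the middle is exactly the "soft state" bullet: the replica is wrong for a while, and becomes right once changes stop arriving and replication catches up.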
33. Troubleshooting
● If seamonkeys aren't progressing, debug the
new service: is it up? Is the service's
database up?
● If an event isn't processed, reset the stream
pointer to catch up, and handle duplicate
events on the worker vs the stream.
● UI not finding events? Check that the
service is up.
35. Time and data
As you see through the growth patterns, time
and data start to have trade offs. With
questions such as:
● How fast does the data update?
● How do we support a backup and restore?
● How do we ensure no data loss?
36. Immutability and Time
● If point in time never changes, immutability
is achieved
● Pointer vs point in time, current data version
is a pointer to the latest point in time
● A timeline of data changes provides for
restoration and easier debugging
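The pointer-vs-point-in-time idea above can be sketched as an append-only version log; the state shapes are invented for the example. "Current" is just a pointer to the latest entry, and restoring means moving the pointer:

```python
# Immutability sketch: every change appends a new version; nothing is
# overwritten, so any past point in time can be restored.
versions = []        # append-only timeline of states

def save(state):
    versions.append(state)
    return len(versions) - 1          # the new point in time

def restore(pointer):
    return versions[pointer]          # any past state is still intact

v0 = save({"name": "Finn", "stage": "egg"})
v1 = save({"name": "Finn", "stage": "hatched"})
current = v1                          # pointer to the latest point in time

print(restore(current))               # latest state
print(restore(v0))                    # time travel: the original state survives
```

Because no version is ever mutated, "restore a backup" and "reproduce the bug as of Tuesday" are the same operation: pick a pointer.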
37. Distributed immutability
● Database transaction log is an immutable
stream of changes
○ Used for replication, most database /
datastores use this approach
● Immutable stream (Kafka, Kinesis) provides
an incoming change log; the latest changes
are a pointer into the stream. The reverse of
the db approach
39. Some plankton for thought
● If you have any idea where you'll end up,
you'll have a better idea where to start
● Understanding reactions to growth will help
with setting up services as you grow
● Misery loves company, knowing everyone
has these pain points somehow makes you
happier
● Knowing where you've been helps you now