Sea of Data
1. Sea of data
Story of data, scale and how we evolve
architecture to handle it.
Daniel Marchant (@driedtoast)
2. What do you think of when
you hear the word “data”?
3. Setting the stage
Data – Things known or assumed as facts,
making the basis of reasoning or calculation
Time – the indefinite continued progress of
existence and events in the past, present and
future regarded as a whole
5. Types of data
● Customer Data - Data the customer provides,
the lifeblood of your application
● Business Data - Metrics on growth,
customer attrition, marketing, etc...
● Operation Data - Metrics and log messages
that help troubleshoot / monitor your
application
7. Once upon a time...
A company was founded to
produce the best seamonkey
management application ever
produced. (purely fictional for
now)
More details: http://www.seamonkey.xyz (eventually)
8. A hypothetical system timeline
● Launch of application
● Reddit posts promote application
● Hacker News promotes application
● Product Hunt promotes application
10. Initial dataset
● Operation Data
○ cpu / memory / disk metrics
○ error messages in logs
● Business Data
○ Signup metrics
○ Access usage
● Customer Data
○ User
○ Seamonkey info
12. Architecture
● Load balancer - route traffic to application
● Application - handles requests and manages
data to the database
● Database - data storage
So simple, life is good! Some reads and writes!
13. Integrations
● Metric Service - google analytics,
kilometer.io, kissmetrics, mixpanel, etc...
● Operation Events - datadog, graylog,
newrelic, etc…
14. Troubleshooting
● Pretty straightforward
● Check application can write to DB
● Make sure database user can access tables
● Make sure the transactions scoped in the
application make sense
● Check rollback scenarios
15. A little about ACID
● Atomicity: all task(s) within a transaction are performed or
none of them are. An all-or-none principle.
● Consistency: a transaction does not violate the database's
integrity rules, and the data must remain in a consistent
state at the beginning and end of a transaction; no
half-completed transactions.
● Isolation: each transaction is independent unto itself for
both performance and consistency of transactions.
● Durability: Once complete the transaction will persist as
complete; it will survive system failure, power loss and other
types of system breakdowns.
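The ACID guarantees above can be sketched with Python's built-in sqlite3 module. The table and the failed-transfer scenario are made up for illustration; the point is that `with conn:` opens a transaction that either commits both updates or rolls both back:

```python
import sqlite3

# In-memory database for the sketch; the accounts table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Atomicity: both updates happen, or neither does."""
    try:
        with conn:  # transaction: commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # Consistency: reject a state that violates our invariant
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # rolled back; the data is untouched

transfer(conn, "alice", "bob", 150)  # fails mid-transaction -> rolled back
print(conn.execute("SELECT balance FROM accounts WHERE name='alice'").fetchone())
# -> (100,): no half-completed transfer is visible
```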
17. Data evolution
● Operation Data Additions
○ Timers on critical logic
○ Customer requests
● Business Data Additions
○ Customer emails on problems
● Customer Data Additions
○ Seamonkey Tank
○ Seamonkey Social interactions
19. Architecture
● Load balancer - route traffic to application
● Application - still managing data, more
nodes added
● Worker - handles work from the db ‘queue’
table
● Cache - used to taper database reads
● Database - data storage master
● Read Only Database - slave data storage
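The cache-tapering read path above can be sketched like this. A plain dict stands in for the real cache (Redis, memcached, ...) and the table name is invented for the example:

```python
import sqlite3

# Minimal cache-aside sketch: reads check the cache first, writes invalidate.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE seamonkeys (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO seamonkeys VALUES (1, 'Finn')")
db.commit()

cache = {}

def get_seamonkey(monkey_id):
    if monkey_id in cache:            # cache hit: database read avoided
        return cache[monkey_id]
    row = db.execute("SELECT name FROM seamonkeys WHERE id = ?",
                     (monkey_id,)).fetchone()
    if row:
        cache[monkey_id] = row[0]     # fill the cache for later reads
        return row[0]
    return None

def update_seamonkey(monkey_id, name):
    db.execute("UPDATE seamonkeys SET name = ? WHERE id = ?", (name, monkey_id))
    db.commit()
    cache.pop(monkey_id, None)        # invalidate, or reads go stale

print(get_seamonkey(1))  # miss -> reads DB, fills cache
print(get_seamonkey(1))  # hit  -> served from cache
```

The troubleshooting slide that follows is exactly about the `cache.pop` line: forget it on one write path and the application serves stale data.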
20. More integrations
● Gmail - customer emails
● DataLoop - Timers and statsd data
● Open Tracing - distributed event tracing
http://opentracing.io/
21. More Troubleshooting
● If the application isn't displaying the right
data, is the cache invalidated properly?
● Has the worker applied its updates as
changes happen within the queued
process?
● Is replication working from master to
slave?
23. Data evolution
● Operation Data / Business Data convergence
○ Customer requests
○ Customer emails to support cases
○ Customer usage to product roadmap
● Customer Data requirements stabilize
25. Architecture
● Application - still managing data, more
nodes added, application pushes writes to a
queue for non-critical work
● Worker - handles work coming from queue
vs db, and writes from application. Also
invalidates cache now.
● Cache - used to taper database reads. App is
getting more complex invalidation logic
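A minimal sketch of this write path, assuming an in-process `queue.Queue` standing in for a real broker, and invented table/field names. The application enqueues non-critical writes; the worker drains the queue, writes to the database, and now also owns cache invalidation:

```python
import queue
import sqlite3

events = queue.Queue()  # stand-in for a real message queue
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tanks (id INTEGER PRIMARY KEY, temp REAL)")
db.execute("INSERT INTO tanks VALUES (1, 72.0)")
db.commit()
cache = {1: 72.0}

def app_set_temp(tank_id, temp):
    """Application: enqueue the change instead of writing synchronously."""
    events.put({"tank_id": tank_id, "temp": temp})

def worker_drain():
    """Worker: apply queued writes, then invalidate affected cache entries."""
    while not events.empty():
        ev = events.get()
        db.execute("UPDATE tanks SET temp = ? WHERE id = ?",
                   (ev["temp"], ev["tank_id"]))
        db.commit()
        cache.pop(ev["tank_id"], None)  # invalidation moved to the worker

app_set_temp(1, 75.5)     # app returns immediately; DB not yet updated
worker_drain()            # worker catches up
print(db.execute("SELECT temp FROM tanks WHERE id = 1").fetchone())  # (75.5,)
```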
26. CAPs off to you!
● Consistency: same idea presented in ACID.
All data storage nodes see the data.
● Availability: every request receives a
response; the data is available
● Partition Tolerance: system continues to
operate even under circumstances of data
loss or system failure. A single node failure
should not cause the entire system to
collapse.
27. Troubleshooting
● Oh boy, more systems more debugging
“opportunities”
● If data isn’t updated, has the queue gotten
the event from the application? Has the
worker processed the change event and
written to db?
● Is the queue up? Is the worker up?
29. Data evolution
● Operation Data
○ Hopes for attrition
● Business Data
○ Monitors customer attrition
○ Hopes for NO attrition
● Customer Data
○ Grows insane
○ Working out archive strategies
31. Architecture
● Lifecycle service / database - added a service
to migrate some of the monolith app, service
just handles seamonkey growth and lifecycle
● Worker - still listens for events, writes to
lifecycle service
● Stream - swapped out the queue with an
immutable stream, better data recovery
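The queue-to-stream swap can be sketched as below. The append-only list stands in for Kafka/Kinesis, and the event shape is invented for the example. The worker tracks an offset it can rewind for recovery, and dedupes so replays are safe:

```python
# Immutable stream sketch: events are only appended, never changed.
stream = []          # stand-in for a Kafka/Kinesis topic
applied = {}         # worker's materialized state
seen_ids = set()     # dedupe: events may be re-read after an offset reset

def publish(event_id, key, value):
    stream.append({"id": event_id, "key": key, "value": value})

def consume_from(offset):
    """Replay the stream from `offset`; duplicates are skipped (idempotent)."""
    for event in stream[offset:]:
        if event["id"] in seen_ids:
            continue                  # already applied on a previous pass
        applied[event["key"]] = event["value"]
        seen_ids.add(event["id"])
    return len(stream)                # new offset = end of stream

publish("e1", "monkey:1", "egg")
publish("e2", "monkey:1", "hatched")
offset = consume_from(0)
# Recovery: rewind to 0 and replay -- state is unchanged thanks to dedupe.
consume_from(0)
print(applied)  # {'monkey:1': 'hatched'}
```

This is the "better data recovery" of the bullet above: because the stream is immutable, a broken worker can always rewind and replay, as long as it handles duplicates.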
32. BASE
● Basically Available: the system guarantees availability of
the data in the CAP Theorem sense; there will be a
response to any request. The response could be a failure to
find the data, or the data could be in an inconsistent state.
● Soft state: state of the data could change over time, there
may be changes going on due to ‘eventual consistency’
● Eventual consistency: data will eventually become
consistent once it stops receiving changes. The system will
continue to receive changes and is not checking the
consistency of every transaction before it moves onto the
next one.
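Eventual consistency can be simulated in a few lines. The primary/replica dicts and the manual replication step are invented for illustration; a real system ships changes asynchronously over the network:

```python
# Tiny eventual-consistency simulation: writes land on the primary, and a
# later replication pass copies them to the replica.
primary = {}
replica = {}
replication_backlog = []

def write(key, value):
    primary[key] = value
    replication_backlog.append((key, value))  # shipped asynchronously

def replicate():
    """Drain the backlog; once it is empty, the nodes agree."""
    while replication_backlog:
        key, value = replication_backlog.pop(0)
        replica[key] = value

write("monkey:1", "hatched")
print(replica.get("monkey:1"))  # None -- stale read: soft state
replicate()
print(replica.get("monkey:1"))  # 'hatched' -- eventually consistent
```

The stale `None` read in the middle is exactly the "soft state" bullet: the replica is wrong for a while, and becomes right once changes stop arriving and replication catches up.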
33. Troubleshooting
● If seamonkeys aren't progressing, debug the
new service: is it up? Is the service's
database up?
● If an event isn't processed, reset the stream
pointer to catch up, and handle duplicate
events on the worker vs the stream.
● UI not finding events? Check that the
service is up.
35. Time and data
As you see through the growth patterns, time
and data start to have trade offs. With
questions such as:
● How fast does the data update?
● How do we support a backup and restore?
● How do we ensure no data loss?
36. Immutability and Time
● If point in time never changes, immutability
is achieved
● Pointer vs point in time, current data version
is a pointer to the latest point in time
● A timeline of data changes provides for
restoration and easier debugging
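The pointer-vs-point-in-time idea above can be sketched as an append-only version log; the state shapes are invented for the example. "Current" is just a pointer to the latest entry, and restoring means moving the pointer:

```python
# Immutability sketch: every change appends a new version; nothing is
# overwritten, so any past point in time can be restored.
versions = []        # append-only timeline of states

def save(state):
    versions.append(state)
    return len(versions) - 1          # the new point in time

def restore(pointer):
    return versions[pointer]          # any past state is still intact

v0 = save({"name": "Finn", "stage": "egg"})
v1 = save({"name": "Finn", "stage": "hatched"})
current = v1                          # pointer to the latest point in time

print(restore(current))               # latest state
print(restore(v0))                    # time travel: the original state survives
```

Because no version is ever mutated, "restore a backup" and "reproduce the bug as of Tuesday" are the same operation: pick a pointer.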
37. Distributed immutability
● Database transaction log is an immutable
stream of changes
○ Used for replication, most database /
datastores use this approach
● Immutable stream (Kafka, Kinesis) provides
an incoming change log; the latest changes
are a pointer into the stream. The reverse of
the db approach
39. Some plankton for thought
● If you have any idea where you'll end up,
you'll have a better idea where to start
● Understanding reactions to growth will help
with setting up services as you grow
● Misery loves company, knowing everyone
has these pain points somehow makes you
happier
● Knowing where you've been helps you now