About VisualDNA Architecture @ Rubyslava 2014

Michal Hariš : michal.haris@visualdna.com
- Technical Architect, joined VisualDNA in 2012

Where were we 3 years ago
● 10 people working around one MySQL table holding 50M+ user profiles
● LAMP Architecture
● SCALABILITY ISSUES

DECISION TO GO BIG (DATA)!

Where were we 18 months ago
● 30-strong team, of which a single tech team of roughly 15 people
● Basically a batch architecture
  ● just not MySQL but CASSANDRA + HADOOP at the back
  ● http+php trackers with piped custom log batch process
  ● s3 upload every 5 min
  ● daily hdfs distcp
  ● POC = daily hadoop inference > 6 node cassandra -> batch integrations
  ● POC was a daily batch job which on bad days took 30 hours
● One of the first commercial Cassandra clusters in the world
  ● very unstable

Where are we today
● Stack
● Java
● Scala
● Hadoop
● Cassandra
● Kafka
● Redis
● R
● AngularJS for the front-end

Where are we today
● Auto-scaling geo-located Tracker Clusters - well, almost auto-scaling
● Robust Streaming Infrastructure - aggregation of all data streams in central infrastructure
  ● bringing in 8.5k events/second at peak
  ● These are primary events which get multiplied by various speed-layer
● Real-time end-user products, scoring services, integrations with third parties where possible, pre-computation infrastructure that scales more predictably
● ETL Pipeline - offloading data streams and pre-computing materialised views onto HDFS > 30TB of primary data
  ● some data we keep only for the last 60 or 90 days, other data we keep forever
● Decision Analytics Pipeline (or RD Pipe) > 100TB+ of secondary data
  ● Using feature-extraction machine learning methods

Where are we today
● Still one Cassandra ring, just bigger and more stable, 16 nodes, 250M+ active user profiles
● Lambda Architecture for real-time products like WHY Analytics
  ● RD Pipe is the "batch" layer (daily) that generates active profiles in Cassandra ("view layer")
  ● Primary Events are enriched with the user profiles produced daily, by the Enrichment service ("speed layer")
  ● Combination of probabilistic counters and Redis cubes calculates the current audience profiles for subscribed websites ("speed layer")
  ● API on top of the Redis cubes serves the current audience profiles for the front-end suite of real-time analytics products ("serving layer")
  ● Audience Analytics product suite is the good-looking bit - http://www.visualdna.com/why/
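
The slides do not say which probabilistic counters sit next to the Redis cubes. As one possible illustration only, here is a minimal linear-counting sketch that estimates the number of distinct visitors in fixed memory. The class name `LinearCounter`, the bitmap size and the hash mixing are all assumptions made for this example, not VisualDNA's implementation.

```java
import java.util.BitSet;

/** Hypothetical sketch: linear-counting estimator of distinct items (not VisualDNA's code). */
final class LinearCounter {
    private final int bits;      // bitmap size m (assumed; larger m = more memory, less error)
    private final BitSet bitmap;

    LinearCounter(int bits) {
        this.bits = bits;
        this.bitmap = new BitSet(bits);
    }

    /** Records one observation; a repeated item sets the same bit, so it is not double counted. */
    void add(String item) {
        bitmap.set(Math.floorMod(mix(item.hashCode()), bits));
    }

    /** Estimates the distinct count as m * ln(m / number of bits still zero). */
    double estimate() {
        int zeros = bits - bitmap.cardinality();
        if (zeros == 0) {
            return Double.POSITIVE_INFINITY; // bitmap saturated; a larger m is needed
        }
        return bits * Math.log((double) bits / zeros);
    }

    /** Bit-mixing finalizer (as used in MurmurHash3) to spread String.hashCode() values. */
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        LinearCounter visitors = new LinearCounter(1 << 14);
        for (int i = 0; i < 1000; i++) {
            visitors.add("user-" + i);
            visitors.add("user-" + i); // a repeat visit does not change the estimate
        }
        System.out.printf("estimated unique visitors: %.0f%n", visitors.estimate());
    }
}
```

The estimate is approximate by design: memory stays constant per counter regardless of traffic, which is what makes this family of structures attractive in a speed layer.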

Where are we today
● 120-strong team, of which tech is roughly 60:
  ● Sysadmin Team
  ● Architecture Tech Team
  ● Decision Analytics Tech Team
  ● Consumer Tech Team
  ● WHY Analytics Team

What have we learned
● Architecture:
  ● Updating JSON blobs in Cassandra columns is a trap
    ● Logging is better: http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  ● Metrics are crucial in large distributed systems
    ● yammer metrics + graphite + icinga works well for infrastructure
    ● but complex event/anomaly detection and pattern analysis gives the edge
  ● Real-time processing of data streams is not only cool, but scales well ... until you find a bottleneck in a single component which will limit the entire system
  ● Batch still matters
    ● but could be much faster than Hadoop, which suffers from too much redundant I/O and requires a coordinated ETL pipeline
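
To make the "logging is better" point concrete, here is a minimal sketch in which profile changes are appended as immutable events and the current profile is derived by replaying them, instead of reading a blob, patching it and writing it back. Everything here is hypothetical: `ProfileLog`, `ProfileEvent` and their fields are invented for illustration, and the in-memory list merely stands in for a durable log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: profile state derived from an append-only event log. */
final class ProfileLog {
    /** One immutable fact about a user; the field names are illustrative assumptions. */
    static final class ProfileEvent {
        final String userId;
        final String attribute;
        final String value;

        ProfileEvent(String userId, String attribute, String value) {
            this.userId = userId;
            this.attribute = attribute;
            this.value = value;
        }
    }

    // In-memory stand-in for a durable, ordered log.
    private final List<ProfileEvent> events = new ArrayList<>();

    /** Writers only append; nothing is read, patched and written back. */
    void append(ProfileEvent event) {
        events.add(event);
    }

    /** Rebuilds a user's current profile by replaying the log in order (last write wins). */
    Map<String, String> currentProfile(String userId) {
        Map<String, String> profile = new TreeMap<>();
        for (ProfileEvent e : events) {
            if (e.userId.equals(userId)) {
                profile.put(e.attribute, e.value);
            }
        }
        return profile;
    }

    public static void main(String[] args) {
        ProfileLog log = new ProfileLog();
        log.append(new ProfileEvent("u1", "segment", "sports"));
        log.append(new ProfileEvent("u1", "segment", "travel"));
        System.out.println(log.currentProfile("u1")); // derived view, rebuilt from the log
    }
}
```

Because the log is the source of truth, a derived view can be thrown away and rebuilt, and concurrent writers never overwrite each other's partial updates.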

What have we learned
● Engineering:
  ● the unix philosophy of building short, simple, clear, modular, and extendable code also applies to the design of distributed systems, not just an OS
  ● bad tests are better than no tests, but they are still bad, and most tests only test the positive outcome
    ● the story of Math.abs() -> it can actually return a negative number -> but none of the unit tests anticipated this -> which is why metrics and systems with feedback control are crucial
● Process:
  ● It is possible to co-operate remotely even on complex and not-well-defined systems - atm some of the architecture team is working remotely on a permanent basis
  ● QA is intrinsic to Architecture and local to products
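
The Math.abs() story from the engineering lessons above can be reproduced in a few lines. The slides do not say where the call bit the team; hash-based partitioning is a common place for it and is assumed here, and the class and method names are hypothetical. What is certain is the Java behaviour itself: `Math.abs(Integer.MIN_VALUE)` overflows and returns `Integer.MIN_VALUE`.

```java
/** Hypothetical sketch of the Math.abs() pitfall when picking a partition from a hash. */
final class PartitionDemo {
    /** Buggy: Math.abs(Integer.MIN_VALUE) stays negative, so the result can be negative too. */
    static int naivePartition(int hash, int partitions) {
        return Math.abs(hash) % partitions;
    }

    /** Safer: Math.floorMod returns a value in [0, partitions) for any hash and positive partitions. */
    static int safePartition(int hash, int partitions) {
        return Math.floorMod(hash, partitions);
    }

    public static void main(String[] args) {
        int hash = Integer.MIN_VALUE; // the one input a happy-path test rarely tries
        System.out.println(Math.abs(hash) < 0);       // true
        System.out.println(naivePartition(hash, 6));  // -2
        System.out.println(safePartition(hash, 6));   // 4
    }
}
```

Only one hash value in roughly four billion triggers the bug, which is why it tends to surface in production metrics rather than in unit tests.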

Interesting issues we’re facing
1. SLAs vs. Start-up dynamics - Separate process (and to some degree architecture) for different levels of guarantee of service
2. Globally-distributed highly-available API for random access to our profiles - enabling decisions based on VDNA profiles on-demand
3. Our Lambda has a bottleneck at the enrichment point
   - although if we solve (2.) we will be half-way through
4. Complex data pooling attribution model
5. Cassandra still gives us some pain - it's the drivers! - interesting about consistency: http://aphyr.com/posts/294-call-me-maybe-cassandra/
6. Preserving start-up dynamics and culture in a company of 200+ with offices in several cities

We’re hiring for the Bratislava office!
● We’re looking for engineers, analysts and more to be based in Bratislava

careers-cee@visualdna.com
