6. Different Types of Data
Activity Data
Application Data
– Accounts
– Lists: who received which email
Images
– Built our own de-duplication
Emails
– 1m emails @ 100KB each = 100GB
– We don’t Keep all emails
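The slides do not say how the image de-duplication was built. One common approach is content addressing: hash each image's bytes and store a given payload only once. The sketch below assumes SHA-256 and an in-memory store; the class name `DedupStore` and the file names are hypothetical, chosen only for illustration.

```python
import hashlib


class DedupStore:
    """Keeps each distinct image payload once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}   # digest -> image bytes (one copy per distinct payload)
        self.names = {}   # logical name -> digest

    def put(self, name: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the first copy and ignores later identical payloads
        self.blobs.setdefault(digest, data)
        self.names[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]


# Hypothetical example data: two campaigns reuse the same logo bytes.
store = DedupStore()
store.put("campaign-a/logo.png", b"same logo bytes")
store.put("campaign-b/logo.png", b"same logo bytes")
store.put("campaign-b/banner.png", b"different banner bytes")
print(len(store.names), "names,", len(store.blobs), "stored blobs")
```

Three names map onto two stored payloads, so a logo reused across many campaigns costs the space of a single copy.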
9. Production System
Architecture - New Architecture
Microservices: Docker/Kubernetes
Cluster: Spark/Cassandra
Diagram components: Zookeeper, Cassandra, Spark, Graphite, Grafana, Kafka, Bamboo, Ansible, (Git), Docker Registry (the Cassandra/Spark pair appears three times in the diagram)
10. MySQL - Analytic Data
Not fast for Analytics even with 99.9% buffer hit rate
Single Threaded Queries
We have a lot of Idle Cores
SSDs 20 times faster than HD on some queries
Relational Database Schema
Placement of data difficult
Difficult to update DB schema
Struggled doing Email Analytics
More indexes than data
Data is scattered across HD/SSDs
13. Cassandra/Spark - Analytics
Linear Scalability - Easy to add more servers
Replication
Protects against machine failure
We take 3 Copies of all data
Acts as backup
Data Centre Aware - Business Continuity
No Single Point of Failure
Spark on top for Fast Analytics
Hand optimised parallel queries
SSTable knowledge is essential
Backups - Snapshots
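The slides do not show what a hand-optimised parallel query looks like. One common pattern is to split Cassandra's token ring into contiguous slices and scan each slice as a separate task. The sketch below assumes the default Murmur3 partitioner, whose tokens span -2^63 to 2^63 - 1; the keyspace, table and column names (`analytics.email_events`, `account_id`) are hypothetical.

```python
MIN_TOKEN = -2**63        # lowest Murmur3 partitioner token
MAX_TOKEN = 2**63 - 1     # highest Murmur3 partitioner token


def split_token_range(parts: int):
    """Split the full token range into `parts` contiguous, inclusive slices."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN + 1   # number of tokens in the ring
    bounds = [MIN_TOKEN + (total * i) // parts for i in range(parts + 1)]
    # Slice i covers bounds[i] .. bounds[i + 1] - 1, so slices never overlap.
    return [(bounds[i], bounds[i + 1] - 1) for i in range(parts)]


def slice_query(lo: int, hi: int) -> str:
    """Build a CQL range scan for one slice (hypothetical table and key)."""
    return (
        "SELECT * FROM analytics.email_events "
        f"WHERE token(account_id) >= {lo} AND token(account_id) <= {hi}"
    )


slices = split_token_range(4)
queries = [slice_query(lo, hi) for lo, hi in slices]
for q in queries:
    print(q)
```

Each query touches a disjoint part of the ring, so the scans can run on separate cores or Spark tasks without coordinating with each other.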
14. Image Storage
GlusterFS
Not as stable as we would like
Data Centre Replication is problematic
Difficult to upgrade
Difficult to deal with Millions of Individual Files
Mistakes are costly
re-applying permissions on millions of files
Files are scattered across HD drives
Difficult to back up reliably
15. Facebook Haystack
Created by Facebook
SeaweedFS is an open source implementation
Writes files to Volumes
Volumes are replicated
Volumes are max 32GB in size
Seaweed is Data Centre Aware
No Single point of Failure
Volumes can be set to read-only
Again we know how data is stored on disk
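The volume idea can be sketched in a few lines: many small files are appended to one large file, and an index records where each one starts and how long it is. This is a minimal illustration of the layout, not SeaweedFS's actual on-disk format or API; the `Volume` class and its methods are hypothetical, and the index here lives only in memory.

```python
import os
import tempfile


class Volume:
    """One large append-only file plus an index of key -> (offset, size)."""

    def __init__(self, path: str, max_bytes: int = 32 * 1024**3):
        self.path = path
        self.max_bytes = max_bytes   # size cap, 32GB as on the slide
        self.index = {}              # in-memory only in this sketch
        self.read_only = False
        open(path, "ab").close()     # create the volume file if missing

    def put(self, key: str, data: bytes) -> None:
        if self.read_only:
            raise OSError("volume is read-only")
        offset = os.path.getsize(self.path)
        if offset + len(data) > self.max_bytes:
            raise OSError("volume is full")
        with open(self.path, "ab") as f:   # always append, never overwrite
            f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key: str) -> bytes:
        offset, size = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)                 # one seek, one read per file
            return f.read(size)


with tempfile.TemporaryDirectory() as tmp:
    vol = Volume(os.path.join(tmp, "volume-1.dat"))
    vol.put("a.png", b"first image")
    vol.put("b.png", b"second image")
    first, second = vol.get("a.png"), vol.get("b.png")
    second_location = vol.index["b.png"]
    vol.read_only = True                   # sealed volumes reject writes
    try:
        vol.put("c.png", b"third image")
        rejected = False
    except OSError:
        rejected = True
```

Backing up or replicating such a volume means copying one large file rather than millions of small ones.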
16. ActiveMQ - Kafka
ActiveMQ not great for High Volume Activity Events
Heavy Weight
Will continue to Use ActiveMQ - Job Tasks
Kafka to handle all activity events
Netflix - Handles Billions of Events per Day in Kafka
Resilience
Replicates data between machines
No single point of failure
Kafka’s main benefit is speed
Appends data to the end of sequential log files
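The append-only log can be sketched as a list that only grows, with each consumer tracking its own read position. This is an in-memory illustration of the idea, not Kafka's client API; the class names `AppendOnlyLog` and `Consumer` and the event strings are hypothetical.

```python
class AppendOnlyLog:
    """Records are only ever appended; existing entries are never changed."""

    def __init__(self):
        self._records = []

    def append(self, record: str) -> int:
        self._records.append(record)
        return len(self._records) - 1      # offset of the new record

    def read(self, offset: int, max_records: int = 100):
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer keeps its own offset, so reading never alters the log."""

    def __init__(self, log: AppendOnlyLog):
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 100):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = AppendOnlyLog()
for event in ["open:1", "click:1", "open:2"]:   # hypothetical activity events
    log.append(event)

fast, slow = Consumer(log), Consumer(log)
fast_batch = fast.poll()                  # reads all three events
slow_batch = slow.poll(max_records=1)     # reads only the first event
last_offset = log.append("click:2")
```

Because consumers only move an offset forward, a slow consumer does not hold back a fast one, and reads proceed sequentially through the log.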
17. Conclusion on Storage
Best Solutions write to files sequentially
Data is never modified once written
For Performance
Need to know how the data is stored
Access the data Linearly
Maximises
HD/SSD read speed
RAM Cache Hits
End up with large files that are easy to back up