Lookout is a mobile cybersecurity company that ingests telemetry from hundreds of millions of mobile devices to provide security scanning and apply corporate policies. Facing scaling issues with its existing data pipeline and storage as the device count grew, the company adopted Apache Kafka and Confluent Platform for scalable data ingestion and ScyllaDB as the persistent store. Testing showed the new architecture could handle their target of 1 million devices with low latency and at significantly lower cost than the previous DynamoDB-based solution. Key learnings included replacing Kafka's default partitioner and working through issues uncovered during proof-of-concept testing with ScyllaDB.
Scaling Security on 100s of Millions of Mobile Devices Using Kafka and Scylla
1. 1
Scaling Security on 100s of
Millions of Mobile Devices
Using Kafka and Scylla
Mobile cybersecurity leader Lookout talks through their data ingestion journey
2. 2
Speakers
Richard Ney
Principal Engineer
Over 30 years of experience, predominantly
with event pipelines and data retrieval.
Currently a platform architect and
principal developer at Lookout Inc on the
Ingestion Pipeline and Query Services
team, working on the next scale of data
ingestion.
Eyal Gutkind
VP of Solutions
A solution architect for Scylla. Prior to
Scylla, Eyal held product management
roles at Mirantis and DataStax and spent
12 years at Mellanox Technologies in
various engineering management and
product marketing roles.
Jeff Bean
Partner Solutions Architect
An experienced software engineer
and technical evangelist with many
years in the open source software
ecosystem. He leads the Confluent
Verified Integrations Program that
helps partners build and verify
integrations with Confluent Platform.
7. Confluent Platform
The Event Streaming Platform Built by the Original Creators of Apache Kafka®
Operations and Security
Development & Stream Processing
Apache Kafka
Confluent Platform
Support, Services,
Training & Partners
Mission-critical Reliability
Complete Event
Streaming Platform
Freedom of Choice
Datacenter Public Cloud Confluent Cloud
Self-Managed Software Fully-Managed Service
8. Confluent Platform
Operations and Security
Development & Stream Processing
Support, Services, Training & Partners
Apache Kafka
Security plugins | Role-Based Access Control
Control Center | Replicator | Auto Data Balancer | Operator
Connectors
Clients | REST Proxy
MQTT Proxy | Schema Registry
KSQL
Connect | Continuous Commit Log | Streams
Complete Event
Streaming Platform
Mission-critical
Reliability
Freedom of Choice
Datacenter Public Cloud Confluent Cloud
Self-Managed Software Fully-Managed Service
10. 10
About ScyllaDB
+ The Real-Time Big Data Database
+ Drop-in replacement for Apache Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ New: Scylla Cloud, DBaaS
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzliya, Israel
12. 12
Scylla Benefits
● Lower Node Count: Millions of OPS per node reduces the number of nodes your application requires
● Predictable, Low Latencies: Consistent single-digit millisecond p99 latencies
● Less Complexity: Smaller footprint, self-optimizing, works out-of-the-box
● Cassandra Compatibility: Drop-in replacement, compatible with the full C* ecosystem (drivers, etc.)
14. 14
Scylla Alternator
+ DynamoDB compatible API
for Scylla
+ Scale-out system with
complete observability
+ Deployment flexibility
+ On prem, Multi-Cloud, Hybrid
+ Open Source or Scylla Cloud
+ No vendor lock-in
+ Better performance and
less expensive
scylladb.com/alternator
16. 16
What does Lookout do?
● Founded in 2004 when the original founders discovered a Bluetooth vulnerability in
Nokia phones
● Demonstrated the need for mobile security at the 2005 Academy Awards by
downloading information from celebrity phones 1.5 miles away from the venue
● Provides security scanning for mobile devices in the Enterprise and Consumer
markets
● Enterprise customers can apply corporate policies to devices registered in their
enterprise
● To apply these policies, Lookout ingests data about device configuration and the
applications installed on devices
18. 18
What is the Common Device Service?
● Functions as a proxy for all mobile devices in the Lookout fleet
● Device telemetry is sent at various intervals for these categories
○ Binary Artifacts
○ Risky Configuration
○ Personal Content Protection
(safe browsing)
○ Device Settings
○ Device Permissions
○ Software
○ Hardware
○ Client
○ Filesystem
○ Configuration
21. 21
Benefits of Current Design
● Easy to set up and maintain
● Scaling is easy
● Cost effective
● Simple to handle the unexpected
22. 22
Long Term Issue with Current Design
● Some of the components are "single region" (EMR)
● As the system grows, the costs increase significantly (DynamoDB)
● Limits on the PK and SK in DynamoDB - not designed for time-series data
25. 25
What is needed to scale to 1 Million Devices?
A highly scalable and fault-tolerant streaming framework that can process
messages (for example, Device Telemetry Messages) and persist them into a
scalable, fault-tolerant persistent store that supports operational queries.
Key Requirements:
● Infrastructure should scale to support 100M devices
● Cost effective ingestion, storage and querying at this scale
● Low Latency, High Availability at scale (up/down)
● Failure handling (no loss of data)
● Ease of deployment and management
26. 26
Why we considered Scylla
● A NoSQL database that implements almost all of Cassandra's features
● Written in C++14 instead of Java for higher performance
● Uses a shared-nothing approach and the Seastar framework to shard
requests by core - http://seastar.io/
● Scylla's close-to-the-hardware design significantly reduces the number of
instances needed
● Scales out horizontally and is fault-tolerant like Apache Cassandra, but
delivers 10X the throughput and consistent, low single-digit latencies
● Supports tunable job prioritization to sustain extremely high read and
write throughput (a problem Cassandra has not yet solved)
● Delivers very high throughput on instances with NVMe volumes (compared to
EBS or non-NVMe volumes)
27. 27
Why we considered Kafka and
Confluent Platform?
● Schema Registry
● Kafka Connect
● Confluent Control Center
● Ability to create new message flows using JSON
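As a rough illustration of how these pieces come together on the producer side, a minimal configuration might look like the sketch below. The serializer class name is Confluent's real `KafkaAvroSerializer`, and `partitioner.class` is a standard Kafka producer setting; the hostnames and the `com.example.DeviceTelemetryPartitioner` class are placeholders, not Lookout's actual values.

```java
import java.util.Properties;

// Sketch: a producer configuration wiring together Avro serialization backed
// by Schema Registry, plus a custom partitioner. Hostnames and the
// partitioner class name below are illustrative assumptions only.
public class ProducerConfigSketch {

    static Properties baseProducerConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092,broker-2:9092");
        // Confluent's Avro serializer registers/fetches schemas automatically
        props.put("key.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");
        // Plug in a custom Partitioner implementation (class name hypothetical)
        props.put("partitioner.class", "com.example.DeviceTelemetryPartitioner");
        return props;
    }

    public static void main(String[] args) {
        baseProducerConfig().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```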
29. 29
Final Environment Setup
● Kafka
○ 6 Kafka Brokers - R5.xlarge
○ 6 Zookeepers - M5.large
○ 3 Schema Registries - M5.large
○ 6 Kafka Connect Workers - C5.xlarge
○ 1 Control Center - M5.2xlarge
○ Split over 3 AZs
○ # partitions
■ Loaded Libraries - 120 partitions
■ Device Settings - 150 partitions
■ Other topics - 60 partitions
● ScyllaDB
○ 4 ScyllaDB instances - I3.4xlarge
○ Split over 2 AZs
● Load
○ 12 different device telemetry emulated
○ Messages sent in Avro format
○ 14 instances generating load -
C5d.4xlarge
30. 30
Interesting Finding During This Journey
● The default partitioner (<murmur2 hash> mod <# partitions>) that ships with Kafka does not shard
efficiently as the number of partitions grows (approximately 50% of the partitions were idle).
● We replaced it with a murmur3 hash fed through a consistent hashing algorithm (jump hash) to get an
even distribution across all partitions, using Google's Guava library. See "A Fast, Minimal Memory,
Consistent Hash Algorithm" - https://arxiv.org/pdf/1406.2294.pdf
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();
    if (keyBytes == null) {
        // No key: round-robin a per-topic counter across the available partitions
        int nextValue = nextValue(topic);
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() > 0) {
            int part = Utils.toPositive(nextValue) % availablePartitions.size();
            return availablePartitions.get(part).partition();
        } else {
            // No partitions currently available: fall back to jump hash over all
            // partitions (replaces the original `Utils.toPositive(nextValue) % numPartitions`)
            return Hashing.consistentHash(Utils.toPositive(nextValue), numPartitions);
        }
    } else {
        // Keyed message: murmur3 hash of the key bytes, then jump consistent hash
        int hash = Hashing.murmur3_128().hashBytes(keyBytes).asInt();
        return Hashing.consistentHash(Utils.toPositive(hash), numPartitions);
    }
}
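For readers without the Guava dependency at hand, the jump consistent hash behind `Hashing.consistentHash` is small enough to sketch directly from the paper cited above. The distribution check below (key count and partition count are arbitrary choices for illustration) shows why it avoids idle partitions:

```java
import java.util.Arrays;

// Sketch of the jump consistent hash from Lamping & Veach, "A Fast, Minimal
// Memory, Consistent Hash Algorithm" (arXiv:1406.2294); this is the algorithm
// Guava's Hashing.consistentHash implements.
public class JumpHashDemo {

    static int jumpConsistentHash(long key, int numBuckets) {
        long b = -1, j = 0;
        while (j < numBuckets) {
            b = j;
            // Advance a linear congruential generator seeded by the key
            key = key * 2862933555777941757L + 1;
            // Jump forward to the next bucket index where the key would move
            j = (long) ((b + 1) * ((double) (1L << 31) / ((key >>> 33) + 1)));
        }
        return (int) b;
    }

    // Count how many of `numKeys` sequential keys land in each partition
    static int[] distribute(int numKeys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (long k = 0; k < numKeys; k++) {
            counts[jumpConsistentHash(k, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = distribute(120_000, 120);
        int min = Arrays.stream(counts).min().getAsInt();
        int max = Arrays.stream(counts).max().getAsInt();
        // Every partition receives a share of the keys (none sit idle)
        System.out.println("min=" + min + " max=" + max);
    }
}
```

In production the key would first be hashed (e.g. with murmur3, as in the partitioner above) rather than used raw, so that similar keys do not cluster.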
32. 32
The Crash!
● Part of the testing included a Destructive Test
○ Loaded the Scylla Cluster to 90%+ CPU usage
○ Executed a “Repair” against the cluster
● Repair ran for over 8 hours
● One of the three nodes crashed
33. 33
The Aftermath and Retest
● Worked closely with Scylla to identify the root cause of the node crash
● Identified a configuration issue that was introduced during a software upgrade
● CPU reservation parameters were lost, allowing all vCores to be allocated to
DB operations, leaving none reserved for maintenance tasks
● Corrected the CPU reservation configuration and repeated the crash test
● The repair still took over 8 hours to run, but it succeeded the second time
● Several other issues occurred during the proof-of-concept run; each time, we
worked closely with Scylla to find the root cause and fix the issue
34. 34
Test Results
● Message latency averaged in the
milliseconds, unless the system was
overtaxed.
● Repairs added load and were
generally taxing on the system (CPU
at 100%), but the cluster continued to
function.
● Latency increased when Kafka
Connect tasks failed (while repairs
were running on ScyllaDB).
● The ScyllaDB cluster was running near
capacity (CPU between 75% and 90%).
● Overall, the results were very
positive.