Lookout is a mobile cybersecurity company that ingests telemetry from hundreds of millions of mobile devices to provide security scanning and apply corporate policies. Facing scaling issues with its existing data pipeline and storage as the device count grew, the company adopted Apache Kafka and Confluent Platform for scalable data ingestion and ScyllaDB as the persistent store. Testing showed the new architecture could handle their target of 1 million devices with low latency and at significantly lower cost than the previous DynamoDB-based solution. Key learnings included replacing Kafka's default partitioner and working through issues uncovered during proof-of-concept testing with ScyllaDB.
Scaling Security on 100s of Millions of Mobile Devices Using Kafka and Scylla
1. 1
Scaling Security on 100s of
Millions of Mobile Devices
Using Kafka and Scylla
Mobile cybersecurity leader Lookout talks through their data ingestion journey
2. 2
Speakers
Richard Ney
Principal Engineer
Over 30 years of experience, predominantly
with event pipelines and data retrieval.
Currently a platform architect and
principal developer at Lookout Inc on the
Ingestion Pipeline and Query Services
team, working on the next scale of data
ingestion.
Eyal Gutkind
VP of Solutions
A solution architect for Scylla. Prior to
Scylla, Eyal held product management
roles at Mirantis and DataStax and spent
12 years at Mellanox Technologies in
various engineering management and
product marketing roles.
Jeff Bean
Partner Solutions Architect
An experienced software engineer
and technical evangelist with many
years in the open source software
ecosystem. He leads the Confluent
Verified Integrations Program that
helps partners build and verify
integrations with Confluent Platform.
7. Confluent Platform
The Event Streaming Platform Built by the Original Creators of Apache Kafka®
Operations and Security
Development & Stream Processing
Apache Kafka
Confluent Platform
Support, Services,
Training & Partners
Mission-critical Reliability
Complete Event
Streaming Platform
Freedom of Choice
Datacenter Public Cloud Confluent Cloud
Self-Managed Software Fully-Managed Service
8. Confluent Platform
Operations and Security
Development & Stream Processing
Support, Services, Training & Partners
Apache Kafka
Security plugins | Role-Based Access Control
Control Center | Replicator | Auto Data Balancer | Operator
Connectors
Clients | REST Proxy
MQTT Proxy | Schema Registry
KSQL
Connect | Continuous Commit Log | Streams
Complete Event
Streaming Platform
Mission-critical
Reliability
Freedom of Choice
Datacenter Public Cloud Confluent Cloud
Self-Managed Software Fully-Managed Service
10. 10
About ScyllaDB
+ The Real-Time Big Data Database
+ Drop-in replacement for Apache Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ New: Scylla Cloud, DBaaS
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzliya, Israel
12. 12
Scylla Benefits
● Lower Node Count: Millions of OPS per node reduces the number of nodes your application requires
● Predictable, Low Latencies: Consistent single-digit millisecond p99 latencies
● Less Complexity: Smaller footprint, self-optimizing, works out-of-the-box
● Cassandra Compatibility: Drop-in replacement, compatible with the full C* ecosystem (drivers, etc.)
14. 14
Scylla Alternator
+ DynamoDB compatible API
for Scylla
+ Scale-out system with
complete observability
+ Deployment flexibility
+ On prem, Multi-Cloud, Hybrid
+ Open Source or Scylla Cloud
+ No vendor lock-in
+ Better performance and
less expensive
scylladb.com/alternator
16. 16
What does Lookout do?
● Founded in 2004 when the original founders discovered a Bluetooth vulnerability in
Nokia phones
● Demonstrated the need for mobile security at the 2005 Academy Awards by
downloading information from celebrity phones 1.5 miles away from the venue
● Provides security scanning for mobile devices in the Enterprise and Consumer
markets
● Enterprise customers can apply corporate policies to devices registered in their
enterprise
● To apply these policies, Lookout ingests data about device configuration and the
applications installed on devices
18. 18
What is the Common Device Service?
● Functions as a proxy for all mobile devices in the Lookout fleet
● Device telemetry is sent at various intervals for these categories
○ Binary Artifacts
○ Risky Configuration
○ Personal Content Protection
(safe browsing)
○ Device Settings
○ Device Permissions
○ Software
○ Hardware
○ Client
○ Filesystem
○ Configuration
21. 21
Benefits of Current Design
● Easy to set up and maintain
● Scaling is easy
● Cost effective
● Simple to handle the unexpected
22. 22
Long Term Issue with Current Design
● Some of the components are "single region" (EMR)
● As the system grows, the costs increase significantly (DynamoDB)
● Limits on the PK and SK in DynamoDB - not designed for time-series data
25. 25
What is needed to scale to 1 Million Devices?
A highly scalable and fault-tolerant streaming framework that can process
messages (for example, Device Telemetry Messages) and persist them into a
scalable, fault-tolerant persistent store that supports operational queries.
Key Requirements:
● Infrastructure should scale to support 100M devices
● Cost effective ingestion, storage and querying at this scale
● Low Latency, High Availability at scale (up/down)
● Failure handling (no loss of data)
● Ease of deployment and management
26. 26
Why we considered Scylla
● A NoSQL database that implements almost all of Cassandra's features
● Written in C++14 instead of Java for higher performance
● Uses a shared-nothing approach and the Seastar framework to shard
requests by core - http://seastar.io/
● Scylla's close-to-the-hardware design significantly reduces the number of
instances needed
● Scales out horizontally and is fault-tolerant like Apache Cassandra, but
delivers 10X the throughput and consistent, low single-digit latencies
● Supports tunable job prioritization to sustain extremely high read and
write throughput (a problem Cassandra has not yet solved)
● Delivers very high throughput on instances with NVMe volumes (compared to
EBS or non-NVMe volumes)
27. 27
Why we considered Kafka and
Confluent Platform?
● Schema Registry
● Kafka Connect
● Confluent Control Center
● Ability to create new message flows using JSON
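As a rough illustration of how these pieces come together on the producer side, a minimal configuration might look like the sketch below. The serializer class name is Confluent's real `KafkaAvroSerializer`, and `partitioner.class` is a standard Kafka producer setting; the hostnames and the `com.example.DeviceTelemetryPartitioner` class are placeholders, not Lookout's actual values.

```java
import java.util.Properties;

// Sketch: a producer configuration wiring together Avro serialization backed
// by Schema Registry, plus a custom partitioner. Hostnames and the
// partitioner class name below are illustrative assumptions only.
public class ProducerConfigSketch {

    static Properties baseProducerConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092,broker-2:9092");
        // Confluent's Avro serializer registers/fetches schemas automatically
        props.put("key.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");
        // Plug in a custom Partitioner implementation (class name hypothetical)
        props.put("partitioner.class", "com.example.DeviceTelemetryPartitioner");
        return props;
    }

    public static void main(String[] args) {
        baseProducerConfig().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```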
29. 29
Final Environment Setup
● Kafka
○ 6 Kafka Brokers - R5.xlarge
○ 6 Zookeepers - M5.large
○ 3 Schema Registries - M5.large
○ 6 Kafka Connect Workers - C5.xlarge
○ 1 Control Center - M5.2xlarge
○ Split over 3 AZs
○ # partitions
■ Loaded Libraries - 120 partitions
■ Device Settings - 150 partitions
■ Other topics - 60 partitions
● ScyllaDB
○ 4 ScyllaDB instances - I3.4xlarge
○ Split over 2 AZs
● Load
○ 12 different device telemetry emulated
○ Messages sent in Avro format
○ 14 instances generating load -
C5d.4xlarge
30. 30
Interesting Finding During This Journey
● The default partitioner (<murmur2 hash> mod <# partitions>) that ships with Kafka does not shard
efficiently as the number of partitions grows (approximately 50% of the partitions were idle).
● We replaced it with a murmur3 hash fed through a consistent hashing algorithm (jump hash) to get an
even distribution across all partitions, using Google's Guava library. See "A Fast, Minimal Memory,
Consistent Hash Algorithm" - https://arxiv.org/pdf/1406.2294.pdf
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();
    if (keyBytes == null) {
        // No key: round-robin a per-topic counter across the available partitions
        int nextValue = nextValue(topic);
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() > 0) {
            int part = Utils.toPositive(nextValue) % availablePartitions.size();
            return availablePartitions.get(part).partition();
        } else {
            // No partitions currently available: fall back to jump hash over all
            // partitions (replaces the original `Utils.toPositive(nextValue) % numPartitions`)
            return Hashing.consistentHash(Utils.toPositive(nextValue), numPartitions);
        }
    } else {
        // Keyed message: murmur3 hash of the key bytes, then jump consistent hash
        int hash = Hashing.murmur3_128().hashBytes(keyBytes).asInt();
        return Hashing.consistentHash(Utils.toPositive(hash), numPartitions);
    }
}
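For readers without the Guava dependency at hand, the jump consistent hash behind `Hashing.consistentHash` is small enough to sketch directly from the paper cited above. The distribution check below (key count and partition count are arbitrary choices for illustration) shows why it avoids idle partitions:

```java
import java.util.Arrays;

// Sketch of the jump consistent hash from Lamping & Veach, "A Fast, Minimal
// Memory, Consistent Hash Algorithm" (arXiv:1406.2294); this is the algorithm
// Guava's Hashing.consistentHash implements.
public class JumpHashDemo {

    static int jumpConsistentHash(long key, int numBuckets) {
        long b = -1, j = 0;
        while (j < numBuckets) {
            b = j;
            // Advance a linear congruential generator seeded by the key
            key = key * 2862933555777941757L + 1;
            // Jump forward to the next bucket index where the key would move
            j = (long) ((b + 1) * ((double) (1L << 31) / ((key >>> 33) + 1)));
        }
        return (int) b;
    }

    // Count how many of `numKeys` sequential keys land in each partition
    static int[] distribute(int numKeys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (long k = 0; k < numKeys; k++) {
            counts[jumpConsistentHash(k, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = distribute(120_000, 120);
        int min = Arrays.stream(counts).min().getAsInt();
        int max = Arrays.stream(counts).max().getAsInt();
        // Every partition receives a share of the keys (none sit idle)
        System.out.println("min=" + min + " max=" + max);
    }
}
```

In production the key would first be hashed (e.g. with murmur3, as in the partitioner above) rather than used raw, so that similar keys do not cluster.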
32. 32
The Crash!
● Part of the testing included a Destructive Test
○ Loaded the Scylla Cluster to 90%+ CPU usage
○ Executed a “Repair” against the cluster
● Repair ran for over 8 hours
● One of the three nodes crashed
33. 33
The Aftermath and Retest
● Worked closely with Scylla to identify the root cause of the node crash
● Identified a configuration issue that was introduced during a software upgrade
● CPU reservation parameters were lost, allowing all vCores to be allocated to
DB operations, leaving none reserved for maintenance tasks
● Corrected the CPU reservation configuration and repeated the crash test
● The repair still took over 8 hours to run, but it succeeded the second time
● Several other issues occurred during the proof-of-concept run; each time, we
worked closely with Scylla to find the root cause and fix the issue
34. 34
Test Results
● Message latency averaged in the
milliseconds, unless the system was
overtaxed.
● Repairs added load and were
generally taxing on the system (CPU
at 100%), but the cluster continued to
function.
● Latency increased when Kafka
Connect tasks failed (while repairs
were running on ScyllaDB).
● The ScyllaDB cluster was running near
capacity (CPU between 75% and 90%).
● Overall, the results were very
positive.