Spark & Cassandra Use Case at Telefónica CyberSecurity (CBS) Antonio Alcocer antonio@stratio.com Oscar Mendez oscar@stratio.com @omendezsoto #CassandraSummit 2014 1
Apache Spark & Cassandra use case at Telefónica CBS by Antonio Alcocer
1. Spark Use Case at Telefónica CyberSecurity (CBS)
Antonio Alcocer
antonio@stratio.com
Oscar Mendez
oscar@stratio.com
@omendezsoto
#CassandraSummit 2014 1
2. Who are we?
STRATIO
• Stratio is a Big Data Company
• Founded in 2013
• Commercially launched in 2014
• 50+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
4. General info
o 1924-2014: 317+ million customers and 130,000+ employees
o 2nd European operator by revenues
o 4th global integrated operator by accesses
o 9th Telco in the Global ranking by market capitalization
o 2nd global operator by investment in R&D
15. What is Cybersecurity?
What does it mean for us?
“Cybersecurity is the collection of tools, policies… capabilities to protect
the cyber environment and organization and user’s assets. Cybersecurity
strives to prevent unauthorized access to information, manipulation of its
integrity, confidentiality, or availability, and unauthorized
exfiltration of information.”
No rules, just guidelines.
16. An example of threats
Cassandra OpsCenter
World map
Wordpress
28. Use Case Architecture
We have three phases:
• Ingestion: based on Apache Kafka
• Data fusion: based on Apache Storm
• Batch & Analytics: based on Cassandra and Spark
29. Data Acquisition
• The data come from several sources:
• DNS traffic
• IP
• Social media
• Underground sources
• Government sources
• …
[Diagram: multiple data sources, an API, and Kafka]
• Several source consumers pull the data and push it into a Kafka cluster.
• The sources are heterogeneous and deliver data at variable speeds.
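The consumer pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: an in-memory queue stands in for the Kafka topic, and the source names, record fields, and the consume helper are all hypothetical.

```python
import json
from queue import Queue

# Stand-in for a Kafka topic (assumption: a real deployment would use a Kafka producer).
topic = Queue()

def consume(source_name, records):
    """Pull whatever a source currently offers and publish it, tagged with its origin."""
    for record in records:
        # Wrapping each record in a common envelope lets later stages
        # tell heterogeneous sources apart.
        topic.put(json.dumps({"source": source_name, "payload": record}))

# Hypothetical sources that deliver different shapes and volumes of data.
consume("dns", [{"qname": "example.com"}])
consume("social", [{"user": "alice", "text": "hello"}, {"user": "bob", "text": "hi"}])

# Drain the stand-in topic the way a downstream stage would.
messages = []
while not topic.empty():
    messages.append(json.loads(topic.get()))
```

Tagging every message with its source is one simple way to let the fusion stage choose the right normalization per feed.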
30. Data fusion
• We use Storm to process and normalize the information.
• The system must fire alerts to the analysts.
• This use case required a Big Data component capable of processing the data and extracting information from it in real time.
• Warnings and alerts are time-sensitive: analysts need them promptly to deal efficiently with security attacks.
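A rough sketch of the normalize-then-alert step follows. It is plain Python rather than a Storm topology, and the raw field names, the severity score, and the alert threshold are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class Event:
    source: str     # feed that produced the record
    key: str        # entity the record is about, e.g. an IP or a domain
    timestamp: int  # seconds since the epoch
    severity: int   # hypothetical 0-10 score

# One normalizer per source; the raw field names are invented for this sketch.
def from_dns(raw: dict) -> Event:
    return Event("dns", raw["qname"].lower().rstrip("."), int(raw["ts"]), int(raw.get("score", 0)))

def from_ip_feed(raw: dict) -> Event:
    return Event("ip", raw["addr"].strip(), int(raw["seen_at"]), int(raw.get("risk", 0)))

NORMALIZERS: Dict[str, Callable[[dict], Event]] = {"dns": from_dns, "ip": from_ip_feed}
ALERT_THRESHOLD = 7  # assumption: events at or above this score are sent to analysts

def process(source: str, raw: dict, alerts: List[Event]) -> Event:
    """Normalize one raw record and queue an alert if it looks severe."""
    event = NORMALIZERS[source](raw)
    if event.severity >= ALERT_THRESHOLD:
        alerts.append(event)
    return event

alerts: List[Event] = []
normalized = [
    process("dns", {"qname": "Bad.Example.COM.", "ts": "1410000000", "score": 9}, alerts),
    process("ip", {"addr": " 203.0.113.7 ", "seen_at": 1410000060, "risk": 2}, alerts),
]
```

Alerting inside the same pass that normalizes the record keeps the delay between arrival and warning as short as possible.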
31. Batch
• The data are saved in Cassandra.
• We use Cassandra directly for the easy queries.
• We use Spark to extract the information that is not accessible to Cassandra directly.
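The split between "easy" queries and Spark jobs can be illustrated with a small in-memory stand-in. This is a sketch under assumptions: a dict keyed by (ip, timestamp) plays the role of a Cassandra table, and the sample rows and function names are hypothetical.

```python
from collections import Counter

# Stand-in for a Cassandra table: partition key (ip, timestamp) -> row.
ip_table = {
    ("203.0.113.7", 1410000000): {"domain": "bad.example.com"},
    ("203.0.113.7", 1410003600): {"domain": "other.example.net"},
    ("198.51.100.4", 1410000000): {"domain": "bad.example.com"},
}

def lookup(ip, timestamp):
    """Easy query: the full partition key is known, so a single partition is read."""
    return ip_table.get((ip, timestamp))

def domains_by_distinct_ips():
    """No partition key available: every row is scanned and grouped,
    which is the kind of work handed to a batch engine such as Spark."""
    ips_per_domain = {}
    for (ip, _timestamp), row in ip_table.items():
        ips_per_domain.setdefault(row["domain"], set()).add(ip)
    return Counter({domain: len(ips) for domain, ips in ips_per_domain.items()})
```

The first function maps to a key lookup that Cassandra serves directly; the second touches all partitions, so it fits a distributed batch job better.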
[Diagram: data process, with three integration steps]
32. Why did we use C*?
Because we needed its features:
• P2P architecture
• Read/write performance
• Fault tolerance
• Easy to deploy
• CQL
33. Why did we use C*?
• And we needed a data model:
• Storm normalizes the data by source.
• The primary key combines the source key (e.g. IP) with a timestamp as the partition key; a time split serves as the clustering key.
• Every main table has view tables that hold the relationships between entities: IP, DNS, Domain…
Table name: IP
Columns: IP | timestamp | timesplit | … | Domain | …
Primary Key ((IP, timestamp), timesplit)
Table name: IP_Domain
Columns: Domain | timestamp | timesplit | IP1 | … | IPn
Primary Key ((Domain, timestamp), timesplit)
34. Why did we use C*?
IP main table
Table name: IP
Columns: IP | timestamp | timesplit | … | Domain | …
Primary Key ((IP, timestamp), timesplit)
IP view for domain
Table name: IP_Domain
Columns: Domain | timestamp | timesplit | IP1 | … | IPn
Primary Key ((Domain, timestamp), timesplit)
Domain main table
Table name: domain
Columns: Domain | timestamp | timesplit | … | IP | …
Primary Key ((domain, timestamp), timesplit)
Domain view for IP
Table name: Domain_IP
Columns: IP | timestamp | timesplit | domain1 | … | domainn
Primary Key ((IP, timestamp), timesplit)
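In CQL, the inner parentheses of a primary key name the partition key and the remaining columns are clustering columns. The helper below is a minimal sketch that renders the tables above as CREATE TABLE statements; the column types, the set<text> column standing in for IP1 … IPn, and the create_table helper itself are assumptions, since the slides do not give the actual schema.

```python
from typing import Dict, Sequence

def create_table(name: str, columns: Dict[str, str],
                 partition_key: Sequence[str], clustering_key: Sequence[str]) -> str:
    """Render a CQL CREATE TABLE statement with a composite partition key."""
    column_defs = ", ".join(f"{col} {cql_type}" for col, cql_type in columns.items())
    partition = ", ".join(partition_key)
    clustering = ", ".join(clustering_key)
    return (f"CREATE TABLE {name} ({column_defs}, "
            f"PRIMARY KEY (({partition}), {clustering}));")

# Main table: one row set per (ip, timestamp) partition, ordered by timesplit.
ip_main = create_table(
    "ip",
    {"ip": "text", "timestamp": "timestamp", "timesplit": "int", "domain": "text"},
    partition_key=["ip", "timestamp"],
    clustering_key=["timesplit"],
)

# View table: the same relationship, keyed by domain instead of IP.
ip_domain_view = create_table(
    "ip_domain",
    {"domain": "text", "timestamp": "timestamp", "timesplit": "int", "ips": "set<text>"},
    partition_key=["domain", "timestamp"],
    clustering_key=["timesplit"],
)
```

Keeping one table per access path means each lookup direction (IP to domains, domain to IPs) is a partition-key read rather than a scan.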
39. One stack to rule them all
Why Spark
• Batch
• Interactive [SQL]
• Streaming
• Machine Learning
• Learn just one system
• Develop within one framework
• Deploy/Manage just one system
[Diagram: batch processing, interactive, stream processing, RDD-based matrices]
Databricks co-founder & CTO Matei Zaharia
41. The only Pure Spark processing
No Hadoop elements
10+ year-old constraints
42. Lean simplicity
Pure Spark Platform
Former Hadoop or Hybrid Hadoop-Spark Platforms
Lean = Easier deployment, management, and use of the system
43. Not a POC: a real project for a big company is very demanding
• STRATIO ADMIN
• STRATIO DATAVIS
• STRATIO INGESTION
• STRATIO CROSSDATA (SPARK)
• CASSANDRA
• MONGO DB
• ELASTICSEARCH
• HDFS
• STRATIO STREAMING (SPARK STREAMING, SIDDHI)
• SPARK CERTIFIED
45. Full text search + queries
[Diagram: five C* nodes, each with a Lucene index]
SELECT * FROM logs
WHERE description
MATCH '*Exception';
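One way to read the diagram is that each node indexes only its own rows and a query is answered by collecting matches from every node. The sketch below illustrates that reading only: a linear scan with shell-style wildcards stands in for a real Lucene index lookup, and the sample rows and the search function are hypothetical.

```python
from fnmatch import fnmatchcase

# Each inner list stands for the rows held (and indexed) by one node.
nodes = [
    [{"id": 1, "description": "NullPointerException"},
     {"id": 2, "description": "disk full"}],
    [{"id": 3, "description": "TimeoutException"}],
]

def search(pattern):
    """Ask every node for its local matches and merge the results."""
    hits = []
    for local_rows in nodes:
        # A real index would not scan rows; the scan is a placeholder for the lookup.
        hits.extend(row for row in local_rows if fnmatchcase(row["description"], pattern))
    return sorted(hits, key=lambda row: row["id"])

matching_ids = [row["id"] for row in search("*Exception")]
```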
46. Stratio Streaming
• Start using Spark Streaming for some Complex Event Processing operations.
https://github.com/Stratio/stratio-streaming
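As an illustration of what a Complex Event Processing rule can look like, here is a minimal sliding-window rate rule. It is plain Python, not the Spark Streaming or Stratio Streaming API; the RateRule class, its parameters, and the assumption that events arrive in timestamp order are all hypothetical.

```python
from collections import defaultdict, deque

class RateRule:
    """Flag a key that is seen more than max_events times within window_seconds."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # key -> timestamps still inside the window

    def observe(self, key, timestamp):
        """Record one event (timestamps assumed non-decreasing per key)
        and return True when the rule fires."""
        window = self._seen[key]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.max_events

rule = RateRule(max_events=2, window_seconds=60)
fired = [rule.observe("203.0.113.7", t) for t in (0, 10, 20, 100)]
```

Here the third event fires the rule because three events fall inside one 60-second window, while the event at t=100 starts a fresh window.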
47. DATA JOURNEY THROUGH TIME
PAST PRESENT FUTURE
Stored data
Real-time data
Streaming
ML algorithms
Ephemeral tables
Stored tables
SQL combination: Done
SQL combination: In progress
Quantum tables