Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
How to teach your data scientist to leverage an analytics cluster with Presto, Spark, and Alluxio
Katarzyna Orzechowska, Data Scientist (ING Tech)
Mariusz Derela, DevOps Engineer (ING Tech)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
1. How to teach your data scientist to leverage an analytics cluster with Presto, Spark, and Alluxio
December 2020
Katarzyna Orzechowska, Mariusz Derela
4. Who are we?
We provide comprehensive IT and operational services at ING around the globe. Our IT services include IT security, hosting, remote services, and application services.
5. Who are we?
Hunt team – finding new ways to monitor and mitigate security risks.
• R&D
• POCs
• Aiding analysis
• Development and maintenance of the Security Analytics Platform
• Hunting for security incidents
6. How do we work?
• Mostly independent from other teams – our platform, our tools
• Short time from an idea to implementation
• Whole team has full access to the platform
• Everyone is encouraged to test the solutions (often in production)
• Everyone is always looking for a better solution
• Want to try something new? Sure!
• If someone messes up – no worries
7. How do we work?
Everyone is always working a bit outside their comfort zone, and that is okay.
Over time the roles blur – everyone does a bit of everything: managing the platform, writing queries, hunting.
8. Challenges
• Many users
• High availability (disaster recovery)
• Critical data – extra security restrictions
• Scaling vs. resource utilization
• Frameworks!
• Data integrity and delivery guarantees
• Multi-region (on-premise)
10. Data ingestion
75,000 data sources all over the world
200,000 events per second
700 B * 150,000 EPS = 14 TB/d
Data sources:
• Applications: 400
• OS: 4,200
• DB: 900
• NET: 90
• MDW: 80
▪ Data source availability monitoring process
▪ Component standardization in place (event normalization)
▪ Monitor All initiative based on stack configuration (all new assets automatically added to and removed from scope)
▪ Experience in building distributed multi-tenant cloud computing and file systems.
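A quick back-of-envelope check of the ingestion volume above, assuming the ~700 bytes per event and the 200,000 events per second quoted on the slide (the slide's own formula uses 150,000 EPS, so treat the result as an order-of-magnitude figure):

```python
# Rough daily ingestion volume from event rate and average event size.
event_bytes = 700          # ~700 B per event (from the slide)
eps = 200_000              # events per second (from the slide)
seconds_per_day = 86_400

tb_per_day = event_bytes * eps * seconds_per_day / 1e12
print(round(tb_per_day, 1))  # -> 12.1, same order as the ~14 TB/d on the slide
```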
11. Data processing technology summary
[Architecture diagram: data producers (infrastructure and application logs, network logs) feed a common data bus, which writes to persistent storage; a data access layer serves the data consumers – machine learning and analytics, central analytics and reporting, and data indexing (search) and visualization.]
13. SIEM? Not enough
SIEM – Security Information and Event Management
• Pros of SIEM:
• Fast rule-based system
• Fast correlation engine
• Real-time alerting system
• Cons of SIEM:
• It is mostly a rule-based system
• Correlation can be done only within a "short" timeframe
• Slow searching mechanism
17. ArcSight scoring
Problems:
• Too many diverse data sources
• Too much data
• Performance
• CEF (Common Event Format)
• Fields populated differently depending on vendor
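To make the CEF point concrete: a CEF line has seven pipe-delimited header fields (version, device vendor, device product, device version, signature ID, name, severity) followed by a free-form extension of key=value pairs – and, as the slide notes, vendors populate those fields differently. A minimal parser sketch (the sample event and all values in it are invented for illustration):

```python
import re

def parse_cef(line):
    """Split a CEF line into its 7 header fields and its extension pairs."""
    # Strip anything (e.g. a syslog prefix) before the "CEF:" marker.
    body = line[line.index("CEF:") + len("CEF:"):]
    # The first 7 pipe-delimited fields are the header; the rest is extension.
    parts = body.split("|", 7)
    header = dict(zip(
        ["version", "deviceVendor", "deviceProduct", "deviceVersion",
         "signatureId", "name", "severity"],
        parts[:7]))
    extension = parts[7] if len(parts) > 7 else ""
    # key=value pairs; a value runs until the next "key=" token or end of line.
    ext = dict(re.findall(r"(\w+)=(.*?)(?=\s\w+=|$)", extension))
    return header, ext

header, ext = parse_cef(
    "CEF:0|VendorA|ProductX|1.0|100|Port scan detected|5|"
    "src=10.0.0.1 dst=10.0.0.2")
```

Note that the (deviceVendor, deviceProduct, name) header fields are exactly the "trio" the job on the later slide groups by.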
21. It didn't work
Execution took too long to be feasible while using Spark.
A run for 1 app on 1 day of data took about 7 minutes.
If we wanted to do it for 1,000 apps over 30 days:
7 * 60 s * 1000 * 30 = 12,600,000 s -> 3,500 h -> 145 days
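The estimate on the slide, spelled out step by step:

```python
# ~7 minutes per (app, day) pair, scaled to 1000 apps and 30 days of data.
per_run_s = 7 * 60                    # seconds for 1 app on 1 day
total_s = per_run_s * 1000 * 30       # all apps, all days
hours = total_s / 3600
days = hours / 24
print(total_s, hours, int(days))      # 12600000 s -> 3500 h -> ~145 days
```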
22. How to make it run better?
Platform engineer: "Hey, there's this new tool that will make your life easier!"
Data scientists: still figuring out how to use the last cool thing.
24. Adding Presto
Job code:
trios_rdd = trios_df.rdd.map(lambda r: ((r[0], r[1], r[2]), 1)).groupByKey(5)
result_rdd = trios_rdd.map(lambda x: get_trio_columns_stats(x, conn))
result_df = result_rdd.toDF(schema_selected_columns)
• groupByKey(5): group deviceVendor, deviceProduct, and name into one cell; we want it to run on 5 executors
• get_trio_columns_stats: connect to Presto and get the significant columns for one trio
Spark submit:
--conf spark.cores.max=10
--conf spark.executor.cores=2
Spark will run on 5 executors -> 5 concurrent queries to Presto
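The arithmetic behind those spark-submit settings, assuming simple static allocation (no dynamic executor scaling): the executor count is the total core cap divided by cores per executor, and Presto concurrency is bounded by both that count and the number of partitions from groupByKey(5).

```python
# How the spark-submit settings bound concurrency toward Presto.
cores_max = 10        # --conf spark.cores.max
executor_cores = 2    # --conf spark.executor.cores
executors = cores_max // executor_cores   # 5 executors

partitions = 5        # groupByKey(5) -> 5 partitions, one Presto conn each
concurrent_presto_queries = min(partitions, executors)
print(executors, concurrent_presto_queries)  # 5 5
```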
28. What now?
Are we done? No!
Tools give us new solutions with each release so the product is always changing for the better.
What we want to implement:
• Save to Alluxio from Presto – using Alluxio as a main data proxy (catalog service)
• Transformation service
29. So why does it work so well?
What makes our work easier?
From the vendor side:
- Support from the vendor and open communication
- Informal communication channels
Within the team:
- Data scientists are involved from the beginning in the deployment process of new technologies
- We work closely together
- Issues are reported and resolved as soon as possible (we don't use tickets, incidents, etc.)
- We have space to play around with stuff (people tend not to use things they're not comfortable with)