Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Building a scalable analytics environment to support diverse workloads
Tom Panozzo, Chief Technology Officer (Aunalytics)
About Alluxio: alluxio.io
Engage with the open source community on Slack: alluxio.io/slack
2. WHO WE ARE
Aunalytics
Key Stats
Aunalytics provides a leading-edge cloud platform
that helps companies use data, algorithms, and
high-performance computing so their teams can
answer questions and perform tasks more
efficiently.
Our side-by-side digital transformation model
provides on-demand access to technology, data
science, and AI experts to help transform the
way our clients work.
> 200 Employees
> 1,000 Customers
Financial Institution partners
4. Daybreak is a data platform powered by financial
industry intelligence and smart features that enable a
variety of analytics solutions across the enterprise.
6. UNIVERSAL ACCESS TO DATA
Access all your data in one
shared location
Securely connect your existing systems with a
data-source-agnostic product, and then quickly put
your data to use with everything you need in one
place.
Give everyone on the team access to the latest and
most accurate data, so they can answer their pressing
questions.
Use Daybreak as a single source of information.
Whether you use Tableau, Power BI, or feed data into
a third-party system, you can pull from a single source.
Simplify the information. Get everyone on the
same page.
7. FASTER INSIGHTS
SQL
Get the right data at the
right time
Get the updated data you need, delivered promptly and
consistently every day.
Convert rich, transactional data about your
customers into actionable insights.
Avoid wasting time wrangling data or straining your
IT department and focus on advancing your strategic
business priorities.
Make it easier to quickly understand your data and
save time with automated reporting and clean data.
Scale insights across the organization quickly
Leverage data insights and efficiently answer your
daily questions.
8. SMART FEATURES
[Diagram labels: Data Marts; Artificial Intelligence/Machine Learning; Member Library; Services Library; Transaction Library; Account Library; Core; Lending; Mobile Banking; ATM/ITM; Wealth and Trust; CRM; Member-Centric View; Daybreak Data Warehouse; Insights]
10. SIDE-BY-SIDE CLIENT SUCCESS
Support from a team of
data experts
Get tools, resources, and support throughout
our end-to-end process.
Integrate, enrich, and utilize data marts with
our team beside you, so you can get better
answers to the questions you have.
Be ready for your AI, machine learning, and
predictive analytics journey with the right
foundation.
Our talented team of data scientists and
analysts is here to help.
[Diagram labels: Data Scientists; Client Success Manager; Business Analysts; Client Advisory Team; Relationship Manager; Data Engineers; Engineers; Client; Infrastructure; Ingestion; Software; Security; Project Manager]
13. Based on
Requirement: Parallel and scalable data access layer required, but not for all data all of the time
Typical Parallel File System
All fast, all the time. Tiering cost/benefit is negligible and overhead cost is high.
Alluxio as deployed
• Data in use is fast
• Invisible Upstream
• Scale based on performance
• Scale de-coupled from amount of storage
15. CLOUD HOSTING/ANALYTICS
Legacy Hadoop Platform
Hadoop Cluster ONE
Hadoop Cluster TWO
Hadoop Cluster THREE
Small Containerization Platform Kubernetes
Job Controller: low volume workloads (low lift activity)
Limitations
Data Stored in triplicate
Requires high speed storage
Requires high IOPS storage
Requires many spindles
Costly Hadoop nodes
Storage is still performant even when you are not using it !!!
Heavy Lift Area
Lots of performant storage
Lots of performant LAN
Legacy Platform
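To put the "data stored in triplicate" limitation in numbers, here is a minimal sketch of the raw capacity that three-way replication consumes. The replication factor of 3 follows the slide; the 100 TB dataset size is a hypothetical figure chosen for illustration and does not come from the talk.

```python
def raw_capacity_tb(logical_tb: float, replication: int = 3) -> float:
    """Raw disk needed to hold `logical_tb` of data at a given replication factor."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return logical_tb * replication


# Hypothetical dataset size, for illustration only.
logical = 100.0
raw = raw_capacity_tb(logical)
overhead = raw - logical
print(f"{logical:.0f} TB logical -> {raw:.0f} TB raw ({overhead:.0f} TB of replicas)")
```

Because every replica on the legacy platform lives on high-speed, high-IOPS disks, the replica overhead is paid at the performant-storage price whether or not the data is being processed.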
16. CLOUD HOSTING/ANALYTICS
Commercial Boutique Storage Proposal
Diskless Physical Hadoop Nodes
Hadoop processing nodes connected to remote boutique storage
Limitations
Extreme cost storage
All nodes have singular purpose
Requires high speed dedicated LAN/FIBER
Requires many spindles
Storage vendor lock in
Storage vendor support
All data on HP storage always
Storage is still performant even when you are not using it !!!
Heavy Lift Storage Area
Lots of performant storage
Lots of performant LAN (Fiber possibly)
Lots of replication
Extreme performance storage
Commercial performance storage
Option ONE
17. CLOUD HOSTING/ANALYTICS
Open-Source Storage Proposal
Diskless Physical Hadoop Nodes
Hadoop processing nodes connected to remote boutique storage
Limitations
Learning Curve
Internal Staff cost/training
All nodes have singular purpose
Requires high speed dedicated LAN/FIBER
Requires many spindles
All data on HP storage always
Storage is still performant even when you are not using it !!!
Heavy Lift Storage Area
Lots of performant storage
Lots of performant LAN (Fiber possibly)
Lots of replication
Extreme performance storage
CEPH, Gluster, Lustre, DPFS
Open-Source Storage
Option TWO
18. CLOUD HOSTING/ANALYTICS
Data Cache Layer Extreme Speed Storage (Abstraction Layer)
200 Cores
6TB ALL FLASH
12 million read IOPS
40 GB per second sustained read performance
Cost effective
Average Transfer Speeds
Low IOPS requirement
Highly Available
Built in DR functionality
NFS
● Scalable Caching Layer
● RAM/FLASH based
● Compensates for lower speed/cost underlying storage
● Supports Spark/MR
● Replaces Physical HDFS
Kubernetes Heavy Lift Platform
Alluxio Caching Layer
Final Design Choice
20 Hadoop Clusters on the Same Hardware as 2 Legacy Clusters
19. CLOUD HOSTING/ANALYTICS
Kubernetes Platform Handles Heavy Lift
Object Store or NFS
Alluxio Data Cache
GPU
Containerization Platform (DC/OS) Kubernetes
High Volume Transient Workloads
Enterprise Cloud Services
Static Critical Management Workloads
25 Servers (Can scale to thousands)
1400 Cores
100% Memory
No spinning Disk
Hadoop
MapReduce
Spark
Aunsight Tasks
Apache Drill
All heavy lifting data processing
20. Adaptive Read/Write Methods
Local Object Store (S3 Compatible)
NFS
Cloud Object Store (Amazon/Azure)
• All Flash
• 600GB Aggregate LAN Speed
• Extreme IOPS
• Low Latency
• Temp storage for processing loads
• All NVMe/Flash
• High RAM nodes
• High Core Density
21. Pre Staged Read Methodology
NFS
1) Data written to NFS
2) Alluxio copies data into Flash to pre-stage for processing
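The two steps above can be sketched with a toy model. This is not Alluxio's API: the `prestage` and `read` helpers, the directory names, and the sample file are all hypothetical, and two temporary directories stand in for the NFS under store and the flash tier. The sketch only shows the effect of pre-staging, namely that a read after staging is served from the fast tier.

```python
import shutil
import tempfile
from pathlib import Path


def prestage(ufs_dir: Path, cache_dir: Path) -> list[str]:
    """Copy every file under ufs_dir into cache_dir; return the relative paths copied."""
    staged = []
    for src in sorted(ufs_dir.rglob("*")):
        if src.is_file():
            rel = src.relative_to(ufs_dir)
            dst = cache_dir / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(src, dst)
            staged.append(rel.as_posix())
    return staged


def read(rel_path: str, ufs_dir: Path, cache_dir: Path) -> tuple[bytes, str]:
    """Read from the cache tier when the file is staged, otherwise from the slow tier."""
    cached = cache_dir / rel_path
    if cached.exists():
        return cached.read_bytes(), "cache"
    return (ufs_dir / rel_path).read_bytes(), "ufs"


with tempfile.TemporaryDirectory() as tmp:
    nfs = Path(tmp) / "nfs"      # stands in for the NFS under store
    flash = Path(tmp) / "flash"  # stands in for the flash tier
    nfs.mkdir()
    flash.mkdir()

    # 1) Data written to NFS (hypothetical sample file)
    (nfs / "daily").mkdir()
    (nfs / "daily" / "transactions.csv").write_bytes(b"id,amount\n1,9.50\n")

    before = read("daily/transactions.csv", nfs, flash)[1]
    # 2) Data copied into the fast tier ahead of processing
    staged = prestage(nfs, flash)
    after = read("daily/transactions.csv", nfs, flash)[1]
    print(before, staged, after)
```

In a real deployment the copy would be performed by the caching layer itself rather than by application code; the model only illustrates why the processing job never waits on the slower store.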
22. Adaptive Write Methods
• All Flash
• 600GB Aggregate LAN Speed
• Extreme IOPS
• Low Latency
• Temp storage for processing loads
NFS
Write to Alluxio only (Must Cache): Any Temp File (High Use)
Write through to UFS (Cache Through): (Rare Use)
Write Back to UFS (Async Through): Cache/Persist Later (High Use)
Write to UFS Only (Through): (Rare Use)
Write modes embedded into each write provide maximum efficiency
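As a rough illustration of how the four write modes differ, the sketch below models a cache tier and an under file system (UFS) as two dictionaries. The enum member names mirror the mode names on the slide; the `ToyStore` class, its methods, and the file paths are hypothetical and are not Alluxio's client API.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "must_cache"        # cache only
    CACHE_THROUGH = "cache_through"  # cache and UFS, synchronously
    ASYNC_THROUGH = "async_through"  # cache now, UFS later
    THROUGH = "through"              # UFS only


class ToyStore:
    """In-memory stand-in for a cache tier in front of an under file system (UFS)."""

    def __init__(self) -> None:
        self.cache: dict[str, bytes] = {}
        self.ufs: dict[str, bytes] = {}
        self.pending: list[str] = []  # paths waiting for async persist

    def write(self, path: str, data: bytes, write_type: WriteType) -> None:
        # Every mode except THROUGH lands the data in the cache tier.
        if write_type is not WriteType.THROUGH:
            self.cache[path] = data
        # Synchronous persistence for the two "through" modes.
        if write_type in (WriteType.CACHE_THROUGH, WriteType.THROUGH):
            self.ufs[path] = data
        # ASYNC_THROUGH only queues the path; the UFS copy happens on flush().
        if write_type is WriteType.ASYNC_THROUGH:
            self.pending.append(path)

    def flush(self) -> None:
        """Persist everything queued by ASYNC_THROUGH writes."""
        for path in self.pending:
            self.ufs[path] = self.cache[path]
        self.pending.clear()


store = ToyStore()
store.write("/tmp/shuffle.bin", b"scratch", WriteType.MUST_CACHE)
store.write("/out/report.parquet", b"result", WriteType.ASYNC_THROUGH)
store.write("/out/ledger.csv", b"critical", WriteType.CACHE_THROUGH)
store.write("/archive/raw.log", b"cold", WriteType.THROUGH)

print(sorted(store.ufs))  # UFS contents before the async flush
store.flush()
print(sorted(store.ufs))  # UFS contents after the async flush
```

Choosing the mode per write, as the slide describes, lets temporary files stay entirely in the fast tier while results that must survive are persisted either immediately or in the background.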
24. Aunalytics Use Case Conclusions
• We have mass quantities of historical data that must be
stored but a much smaller amount of data that must be
processed daily
• The (relatively) small amount of data that we must
process daily requires parallelism from its underlying
storage in order to run in our required time frame
• ALL data must be quickly available for high speed
processing if required
• Alluxio allows for in-memory storage performance levels in a
controlled, tunable, and independently scalable way.
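The conclusions describe a small daily working set sitting in front of a large historical archive. A minimal sketch of that access pattern, assuming a simple least-recently-used read-through cache, is shown below. The `ReadThroughCache` class, the file counts, and the cache capacity are hypothetical and chosen only for illustration; the eviction policy of a real deployment may differ.

```python
from collections import OrderedDict


class ReadThroughCache:
    """LRU cache that falls back to a slower backing store on a miss."""

    def __init__(self, backing: dict[str, bytes], capacity: int) -> None:
        self.backing = backing
        self.capacity = capacity
        self.cache: OrderedDict[str, bytes] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        data = self.backing[key]  # slow path: fetch from the backing store
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return data


# Hypothetical data: 1,000 historical files, of which 10 are processed each day.
history = {f"file-{i}": b"x" for i in range(1000)}
cache = ReadThroughCache(history, capacity=10)
daily_working_set = [f"file-{i}" for i in range(990, 1000)]

for _day in range(5):
    for name in daily_working_set:
        cache.read(name)

cache.read("file-0")  # an old file is still reachable, just slower on first touch
print(cache.hits, cache.misses)
```

The cache here is sized to the working set rather than to the archive, which is the sense in which performance scales independently of the amount of data stored.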