From ingest to insights with AWS

From Ingestion to Insights:
How to Deliver Business Value at Scale
Qlik Data Integration & Analytics Summit for Financial Services
February 26, 2020
Misha Goussev
Principal Solutions Architect
Financial Services Partner Technology – AWS
goussev@amazon.com
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Financial Services Industry data trends

Financial institutions collect both structured data, including CRM and transaction data, and unstructured data, including chat log transcriptions, social media, and mobile interactions.
2.2 million terabytes of new data are created every day [1].
90% of data worldwide has been generated in the last five years.
1. Axis Corporate, Understanding Big Data in Financial Services, https://axiscorporate.com/us/infographic/understanding-big-data-in-financial-services-infographic/
Financial institutions are collecting an unprecedented amount of data

The challenges of Big Data
Data silos: having pockets of data in different places, controlled by different groups,
inherently obscures data.
Analyzing diverse datasets: when different systems and approaches to data management are in use,
data structures and information vary; different systems may also hold the same type of
information under different labels.
Managing data access: with data stored in so many locations, it's difficult both to access all
of it and to link it to external tools for analysis.
Accelerating machine learning: requires a powerful data lake foundation, because ML and AI
thrive on large, diverse datasets.
Reference: Werner Vogels, "How Amazon is solving big-data challenges with data lakes," 30 January 2020, https://www.allthingsdistributed.com/2020/01/aws-datalake.html
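The "same information, labeled differently" challenge above can be illustrated with a minimal sketch. The field names and the alias table are hypothetical, chosen only for illustration; a real mapping would come from a data catalog or a data governance process.

```python
from typing import Any, Dict

# Hypothetical aliases: two systems that label the same information differently.
FIELD_ALIASES: Dict[str, str] = {
    "cust_id": "customer_id",
    "CustomerNumber": "customer_id",
    "txn_amt": "amount",
    "TransactionAmount": "amount",
}


def harmonize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename known aliases to the shared name; keep unknown fields unchanged."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}


# Two hypothetical source rows describing the same customer.
crm_row = {"cust_id": "C-1", "txn_amt": 25.0}
core_banking_row = {"CustomerNumber": "C-1", "TransactionAmount": 40.0, "branch": "NYC"}

unified = [harmonize(crm_row), harmonize(core_banking_row)]
```

After harmonization both rows expose `customer_id` and `amount`, so they can be analyzed together regardless of which system produced them.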

AWS Technology Stack for
Data Storage and Analytics

Purpose-built databases – Working Backwards
Lift and shift, ERP, CRM, finance
Real-time bidding, shopping cart, customer preferences
Content management, personalization, mobile
Leaderboards, real-time analytics
Fraud detection, social networking, recommendation engine
IoT applications, event tracking
Systems of record, supply chain, health care, registrations, financial
Industrial equipment maintenance, fleet management, and route optimization

S3 data lakes provide a flexible foundation for analytics and innovation
Structured data: data that are highly normalized with a common schema and stored in relational databases, powering transactional line-of-business applications (ERP, CRM, LOB applications)
Semistructured data: data that contain identifiers without conforming to a predefined schema (mobile, social, sensors, POS terminals)
Unstructured data: data that do not conform to a data model and are typically stored as individual files (phone calls, images, videos, email)
Batch load (AWS Glue): extracts data from various data sources at periodic intervals and moves them to the data lake
Streaming (Amazon Kinesis): ingests data that are generated from multiple sources such as log files, telemetry, mobile applications, and social networks
Amazon S3 data lake (Amazon S3): cloud-scale, centralized, and scalable architecture that enables enterprise data science
Analytics (Amazon Redshift, Amazon Athena, Amazon QuickSight, Amazon EMR, Amazon MSK): data warehouses are repositories of normalized data and provide the foundational technology for BI; data stored in the data lake can also be made directly searchable and queryable
Machine Learning (Amazon SageMaker, AWS Deep Learning AMIs, Amazon EMR): storing data in an Amazon S3 data lake enables customers to leverage predictive or prescriptive analytics, perform ad hoc analyses, and use AI/ML for automation and efficiency

Data lakes are the future of data management
Used for all use cases, including machine learning, real-time streaming analytics, data discovery, and business intelligence
Data is stored as-is, without having to first structure the data
Centralized repository that allows structured and unstructured data to be stored at any scale
Access to historical data within seconds without the cost of managing infrastructure
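Storing data as-is and structuring it only when it is read is often called schema-on-read. The following is a minimal sketch of that idea using only the Python standard library; the sample events, the `type` field, and the trade schema are hypothetical and stand in for files that would live in the data lake.

```python
import json
from typing import Any, Callable, Dict, Iterator, Tuple

# Raw events stored as-is (newline-delimited JSON); records need not share fields.
RAW_LINES = """\
{"type": "trade", "symbol": "ABC", "price": "101.5", "qty": 10}
{"type": "chat", "text": "Where is my statement?"}
{"type": "trade", "symbol": "XYZ", "price": "7.25"}
"""

# The schema is chosen at read time: column name -> (converter, default value).
Schema = Dict[str, Tuple[Callable[[Any], Any], Any]]
TRADE_SCHEMA: Schema = {"symbol": (str, None), "price": (float, None), "qty": (int, 0)}


def read_with_schema(lines: str, record_type: str, schema: Schema) -> Iterator[Dict[str, Any]]:
    """Yield only records of the requested type, projected onto the schema."""
    for line in lines.splitlines():
        record = json.loads(line)
        if record.get("type") != record_type:
            continue  # other record types stay in the lake untouched
        yield {
            column: convert(record[column]) if column in record else default
            for column, (convert, default) in schema.items()
        }


trades = list(read_with_schema(RAW_LINES, "trade", TRADE_SCHEMA))
```

The chat record is ignored by this reader but remains available to any other reader that applies a different schema to the same raw data.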

How financial institutions are using
AWS data lakes

Industry-leading financial institutions are building data lakes on AWS

Nasdaq moves mountains of data to an AWS data lake every day
Nasdaq needed to provide greater accessibility to data for internal groups and regulators.
Nasdaq built a data lake on Amazon S3 and chose Amazon Redshift to realize cost efficiencies and fulfill security and regulatory requirements.
Nasdaq moves an average of 30 billion rows into Amazon Redshift every day (with 60 billion on a peak day), and uses the service to power its data analytics applications.
Services: Amazon S3, Amazon Redshift
“Nasdaq has been a user of Amazon Redshift since it was released and we are extremely happy with it. Currently, our system is moving an average of 5.5 billion rows into Amazon Redshift every day.”
– Nate Sammons, Principal Architect, Nasdaq

Dow Jones is using an AWS data lake to better serve its customers
Dow Jones was looking to develop new products and craft targeted customer communications.
The company built a data lake on Amazon S3 that also relies heavily on Amazon Redshift to enable cost-effective, ad hoc querying of large data sets, including anonymized clickstream data generated by multiple products.
By building a data lake on AWS, Dow Jones enabled more than 100 data scientists to access multiple data sets, build dashboards that generate custom insights, and experiment with machine learning using Amazon SageMaker.
Services: Amazon S3, Amazon Redshift
“It wasn’t enough to just put the raw data into the platform. We needed to make it useable to our users, so our version of the data lake is cleansed, it’s performance-optimized, and it’s keyed so that it can be used as a standalone item.”
– Colleen Camuccio, VP, Program Management, Dow Jones

FINRA uses an AWS data lake to oversee over 3,000 securities firms
FINRA needed a platform that could ingest, process, and store 36 billion market events on an average day and dynamically scale up to handle 100 billion events on a peak day.
FINRA built a data lake on AWS using Amazon S3 and EMR to store and analyze data from 3,700 broker dealers and 12 exchanges.
FINRA’s flexible platform can adapt to changing market dynamics while providing analysts with the tools needed to query the data set.
Services: Amazon S3, Amazon EMR
“We got some huge pleasant surprises out of [going all in on AWS] that we weren’t expecting at all. First of those is amazing performance improvements. On average, 400 times improvement to interactive queries. The investigative capacity to our surveillance team has expanded dramatically.”
– Steve Randich, CIO, FINRA

Demo:
‘The Art of the Possible’

Financial Data Innovation Workshop: A glimpse into the art of the possible
Data sources: traditional data, proprietary data, alternative data
Services: Amazon S3, AWS Data Exchange, AWS Glue, Amazon Athena, Amazon Redshift, Amazon QuickSight, SageMaker, Comprehend, Translate, Forecast
Users: data science teams, traders, developers, risk managers, analysts

Demo: Amazon S3 and AWS Glue data catalog
AWS Glue data catalog
Amazon S3 data files

Demo: Interactive query using Athena & data visualization using QuickSight
Amazon QuickSight
Amazon Athena query

Sample Reference Architecture
with Qlik products

Server-based data lake ingest pipeline with Qlik and EC2 (or EMR)
Diagram (Ingest, Transform, and Consumer stages within one AWS Region): sources (ISV data, databases, files, legacy files); raw layer (S3); EC2 via Qlik Data Catalyst (transform); processed layer (S3); EC2 via Qlik Data Catalyst (enrich); consumption layer (S3); Hive Metastore; Redshift Spectrum
1. Data from an ISV and legacy systems is brought in raw form into the raw layer S3 buckets. Attunity Replicate may be used to ingest data from RDBMS databases with the continuous CDC option.
2. Data from the raw layer buckets is processed and stored in the processed layer using a transient EC2 instance (or EMR cluster).
3. Data from the processed layer is transformed based on business rules (derived, flattened, enriched) and copied into the consumption layer by a transient EC2 instance (or EMR cluster).
4. Hive tables are created for data landing in the different layers as needed for consumption.
5. Data in the consumption layer is used by analytics teams via an EMR analytics cluster and visualized via BI tools such as Qlik.
Additional diagram labels: Attunity Replicate (CDC); Catalog (Hive Metastore or Glue Catalog); Databases; Attunity Compose
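The layered flow in steps 1 to 5 can be sketched in plain Python. This is an illustrative sketch only, not how Qlik Data Catalyst or EMR implement it: the layer prefixes, the `dt=` partition naming, the sample trade record, and the `notional` business rule are all assumptions made for the example.

```python
from datetime import date
from typing import Any, Dict

# Assumed naming: one S3 prefix per layer of the lake.
LAYERS = ("raw", "processed", "consumption")


def layer_key(layer: str, dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned object key for one layer of the lake."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}/{dataset}/dt={day.isoformat()}/{filename}"


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested dicts into underscore-joined columns (the 'flattened' rule)."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def enrich(record: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical 'derived' business rule: notional = price * quantity."""
    out = dict(record)
    out["notional"] = out["price"] * out["quantity"]
    return out


# Hypothetical record as it might arrive in the processed layer.
trade = {
    "id": "T1",
    "price": 10.0,
    "quantity": 3,
    "counterparty": {"name": "ACME", "country": "US"},
}

row = enrich(flatten(trade))
key = layer_key("consumption", "trades", date(2020, 2, 26), "part-0000.json")
```

Keeping the date in a `dt=` path segment is one common convention that lets Hive-compatible catalogs treat each day as a table partition.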

Inquire about the Financial Data Innovation Workshop
Financial Data Innovation
Workshop Description
Learn how to ingest, catalog, and integrity-check structured and semi-structured
financial data sourced from AWS Data Exchange and real-time market data
providers, and how to run analytics and apply ML and AI capabilities against it.
Contacts:
Balaji Gopalan Nichole Brown
Sr. Partner Solutions Architect Sr. Partner Development Manager
Financial Services Industry Analytics
Amazon Web Services Amazon Web Services
balajgop@amazon.com nicbrwn@amazon.com

Leverage AWS resources to start building a data lake on AWS today
Ready to start building?
Work with your account team to schedule a Big Data Immersion Day.
Work with an APN Partner to implement solutions on AWS.
Work with the AWS Professional Services team to set up an AWS Data Lake Workshop, AWS Data Lake Assessment, or AWS Data Lake Accelerator.

How to build a data lake on AWS

Essential elements of a data lake and analytics solution
Analytics
Machine
Learning
Real-time Data
Movement
On-premises
Data Movement
Data lake
on AWS
Data Movement: import your data from on-premises systems and in real time
Data Lake: store any type of data securely, from gigabytes to exabytes
Analytics: analyze your data with the broadest selection of analytics services
Machine Learning: predict future outcomes and prescribe actions for rapid response

Step 1: Data movement
Data Movement
The first step to building data lakes on AWS is to move data
to the cloud. AWS makes data transfer simple by providing
the widest range of options to transfer data to the cloud.
On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Storage Gateway
Real-time data movement: AWS IoT Core, Amazon Kinesis Video Streams, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams
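As one illustration of real-time data movement, the sketch below prepares events for Amazon Kinesis Data Firehose. It assumes the boto3 SDK and an existing delivery stream; the stream name and the sample events are hypothetical. `PutRecordBatch` accepts at most 500 records per call, so the events are split into batches first. The sketch ignores the per-call size limit and does not retry failed records, both of which a production sender would need to handle.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(events: Iterable[Dict[str, Any]]) -> List[Dict[str, bytes]]:
    """Encode each event as one JSON line so delivered objects split cleanly into rows."""
    return [{"Data": (json.dumps(event) + "\n").encode("utf-8")} for event in events]


def batches(records: List[Dict[str, bytes]], size: int = MAX_BATCH) -> Iterator[List[Dict[str, bytes]]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send(stream_name: str, events: Iterable[Dict[str, Any]]) -> int:
    """Send events to a delivery stream and return the number of failed records.

    Requires boto3 and AWS credentials, so it is defined here but not called.
    """
    import boto3  # imported lazily so the helpers above run without the SDK

    client = boto3.client("firehose")
    failed = 0
    for batch in batches(to_firehose_records(events)):
        response = client.put_record_batch(DeliveryStreamName=stream_name, Records=batch)
        failed += response["FailedPutCount"]
    return failed


# Hypothetical market events, used only to show how batching behaves.
events = [{"symbol": "ABC", "price": 100 + i} for i in range(1200)]
batch_sizes = [len(batch) for batch in batches(to_firehose_records(events))]
```

With 1,200 events the helper produces three batches, the last one partially filled.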

Step 2: Data lake
Data lake
Once data is ready for the cloud, AWS makes it easy to store in
any format, securely and at massive scale with Amazon S3 and
Amazon Glacier. To make it easy for end users to discover the
relevant data to use in their analysis, AWS Glue automatically
creates a single searchable and queryable catalog.
Object Storage: Amazon S3
Backup and Archive: Amazon Glacier
Data Catalog: AWS Glue
Amazon S3 Glacier Deep Archive
Amazon S3 Object Lock
Amazon S3 Intelligent-Tiering
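One way to combine Amazon S3 with the archive storage classes named above is a bucket lifecycle rule. The sketch below builds such a rule as a plain dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the `raw/` prefix, the 90-day and 365-day thresholds, and the rule ID format are assumptions for illustration, not recommendations from the presentation.

```python
from typing import Any, Dict


def archive_rule(prefix: str, glacier_after_days: int, deep_archive_after_days: int) -> Dict[str, Any]:
    """Build one lifecycle rule that moves objects under `prefix` to colder storage."""
    if deep_archive_after_days <= glacier_after_days:
        raise ValueError("Deep Archive transition must come after the Glacier transition")
    return {
        "ID": f"archive-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            {"Days": deep_archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }


# Hypothetical policy: raw-layer objects go to Glacier after 90 days
# and to Glacier Deep Archive after one year.
lifecycle = {"Rules": [archive_rule("raw/", 90, 365)]}


def apply_lifecycle(bucket: str, configuration: Dict[str, Any]) -> None:
    """Apply the configuration to a bucket.

    Requires boto3 and AWS credentials, so it is defined here but not called.
    """
    import boto3  # imported lazily so the rule builder runs without the SDK

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )
```

Building the rule separately from the API call makes the policy easy to review and test before it is applied to a real bucket.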

Step 3: Perform analytics
Analytics
AWS provides the broadest and most cost-effective set of analytics services that run on data lakes.
Interactive analytics: Amazon Athena
Big data processing: Amazon EMR
Data warehousing: Amazon Redshift
Real-time analytics: Amazon Kinesis
Operational analytics: Amazon Elasticsearch Service
Dashboards and visualizations: Amazon QuickSight
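For interactive analytics, a query against data in the lake might be submitted to Amazon Athena roughly as follows. This is a sketch assuming the boto3 SDK; the database name, table, columns, and results bucket are hypothetical. The helper only assembles the parameters for `start_query_execution`, so it can be inspected without an AWS account.

```python
from typing import Any, Dict


def athena_request(database: str, sql: str, output_location: str) -> Dict[str, Any]:
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("Athena writes query results to an S3 location")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical table and columns, partitioned by a dt column.
SQL = """
SELECT symbol, avg(price) AS avg_price
FROM trades
WHERE dt = '2020-02-26'
GROUP BY symbol
""".strip()

request = athena_request("market_data", SQL, "s3://example-athena-results/queries/")


def run(query_request: Dict[str, Any]) -> str:
    """Submit the query and return its execution ID.

    Requires boto3 and AWS credentials, so it is defined here but not called.
    """
    import boto3  # imported lazily so the request builder runs without the SDK

    response = boto3.client("athena").start_query_execution(**query_request)
    return response["QueryExecutionId"]
```

Athena runs queries asynchronously, so a caller would poll the returned execution ID for completion before reading results from the output location.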

Step 4: Machine learning
Machine learning
For predictive analytics use cases, AWS provides a broad set of machine learning services and tools that run on your AWS data lake.
Frameworks and interfaces: AWS Deep Learning AMIs
Platform services: Amazon SageMaker
Application services: AWS provides solution-oriented APIs for computer vision and natural language processing

AWS Lake Formation will make setting up a data lake as simple as defining where your
data sits and what data access and security policies you want to apply.
• Collects and catalogs data from databases and object storage
• Moves the data into your new Amazon S3 data lake
• Cleans and classifies data using ML algorithms
• Secures access to your sensitive data
• Makes data sets available to Amazon analytics and ML services
AWS Lake Formation will allow FIs to build a secure data lake in days
AWS Lake Formation
