This document provides an outline for a presentation on analytics solutions powered by AWS. It introduces the presenter and their background in business intelligence. It then discusses the role of analytics, an overview of Abebooks, innovation and data, DW modernization at Abebooks using Matillion ETL and Redshift, use cases and challenges, example pricing models, and free learning resources. The document aims to provide an overview of analytics solutions and the presenter's experience implementing solutions on AWS.
7. Other Activities
Jumpstart Snowflake: A Step-by-Step Guide to Modern Cloud Analytics.
• BI TechTalk (100+ BI teams globally)
• Amazon Tableau User Group (2000+ users)
• Conferences (EDW 2018, 2019, Data Summit, SQLPass)
• Amazon internal conferences
10. BI Value Chain
Stakeholders: Employees, Customers
Data → Decisions → Value
• Value creation is based on effective decisions
• Effective decisions are based on accurate information
12. About Abebooks
• Online marketplace for books, art & collectibles.
• Amazon subsidiary since 2008; a marketplace for used books and, increasingly, non-book collectibles
• 350M listings
• 3 people on the Data Engineering Team out of 120 employees
• 2 locations: Victoria, BC and Düsseldorf
31. For Data to be a differentiator, customers need to be able to…
• Capture and store new non-relational data at PB–EB scale in real time
• Discover value in new types of analytics that go beyond batch reporting to incorporate real-time, predictive, voice, and image recognition
• Democratize access to data in a secure and governed way
New types of analytics: Dashboards, Predictive, Image Recognition, Voice, Real-time
New types of data
32. Data & analytics partners extend the traditional approach
Traditional: Data Warehouse + Business Intelligence (sources: OLTP, ERP, CRM, LOB)
Extended: Data Lake + Big Data processing, real-time, Machine Learning (sources: Devices, Web, Sensors, Social)
• Relational and non-relational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
37. Cloud Migration Strategy
Lift & Shift
• Typical Approach
• Move all-at-once
• Move to the target platform, then evolve
• Approach gets you to the cloud quickly
• Relatively small barrier to learning new technology
since it tends to be a close fit
Split & Flip
• Split application into logical functional data layers
• Match the data functionality with the right
technology
• Leverage the wide selection of tools on AWS to
best fit the need
• Move data in phases — prototype, learn and
perfect
38. Choosing an ETL Tool for the Cloud
Use Cases
• OLTP to Redshift
• SFTP/API to Redshift
• Data Transformation
• Dimensional Modelling
• AWS Integration
• Big Data
Tools
• Informatica
• AWS Glue
• Talend
• Fivetran
• Alooma
• Stitch
• Matillion
39. ETL Criteria
High:
• Security
• Support Redshift
• CDC
• Ease of Use for
BI/DW
• Cover use cases
• On-Premise and
full control
Medium:
• Support NoSQL
• Deployment/Architecture
• Encryption
• Ease of Use for non-BI/DW
• Data Transformations
• Management
• Pricing
• Performance
Low:
• Version Control
• Linux OS
• ETL Monitoring
• Logging
• R/Python
40. Why We Picked Matillion
• Was built for Redshift and Cloud
• Speed of ELT operations
• Speed of development
• Wide range of data sources supported
• Ease of use outside of DE/DBA expertise
• Native with AWS
• $$$
After 2 years of usage, it has lived up to our expectations.
45. OLTP to Dimensional Modelling
Problem: Heavy transformations, lots of dependencies. Users like to consume a classical star
schema. Lots of tables with CDC.
Solution: Using Matillion, we implemented a CDC pattern and reused it across all tables,
visualized all jobs and dependencies, and used built-in components to easily create the
Dimensional Model.
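The per-table CDC pattern the slide describes can be sketched as SQL generation. The sketch below (illustrative table and key names) emits the classic Redshift upsert for a staging table: delete matched target rows, then insert, inside one transaction, since older Redshift lacked MERGE.

```python
def cdc_upsert_sql(target, staging, keys):
    """Generate a Redshift-style CDC upsert for one table:
    delete rows already present in the staging set, then insert
    the fresh versions. Run as a single transaction."""
    on = " AND ".join(f"{target}.{k} = {staging}.{k}" for k in keys)
    return (
        "BEGIN;\n"
        f"DELETE FROM {target} USING {staging} WHERE {on};\n"
        f"INSERT INTO {target} SELECT * FROM {staging};\n"
        "END;"
    )

# Example: apply the pattern to a hypothetical customer dimension.
print(cdc_upsert_sql("dim_customer", "stg_customer", ["customer_id"]))
```

Because the statement is generated from (target, staging, keys) triples, one template job can be reused across all CDC tables, which is the point of the pattern.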
46. Self-Service BI
Problem: Business users want an interactive, self-service tool, with fast time to market and
less dependency on IT.
Solution: We chose Tableau, a BI leader that is highly adopted across Amazon.
47. Integration with BI
Problem: Having the best BI tool doesn't guarantee a good SLA.
Solution: Build an integration between Matillion ETL and Tableau based on triggers, and add
data quality checks.
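One hedged sketch of the trigger side: after the load and quality checks pass, the ETL job can call Tableau's REST API to kick off an extract refresh so dashboards only refresh against verified data. The endpoint shape follows Tableau's documented "run extract refresh task" call; the server URL, API version, site ID, and task ID below are placeholders, not values from the deck.

```python
def refresh_request(server, api_version, site_id, task_id):
    """Build the Tableau REST API request that runs an extract
    refresh task immediately. Returns (method, url, body); an
    authenticated HTTP client would then send it."""
    url = (f"{server}/api/{api_version}/sites/{site_id}"
           f"/tasks/extractRefreshes/{task_id}/runNow")
    return ("POST", url, "<tsRequest />")

# Example with placeholder identifiers:
method, url, body = refresh_request(
    "https://tableau.example.com", "3.4", "site-abc", "task-42")
```

Triggering the refresh from the ETL side (rather than on a fixed Tableau schedule) is what ties the dashboard SLA to actual data readiness.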
48. Lack of Notification
Problem: Users miss emails, or the emails land in spam.
Solution: Leverage a messenger with webhooks (Slack, Chime, and so on).
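For Slack specifically, an incoming webhook takes a small JSON body posted over HTTP. A minimal sketch (job name and row count are illustrative) that an ETL job could call on success or failure:

```python
import json
from urllib import request

def build_payload(job, status, rows=None):
    """Build the Slack incoming-webhook JSON body for an ETL status ping."""
    text = f"ETL job *{job}* finished with status {status}"
    if rows is not None:
        text += f" ({rows:,} rows)"
    return json.dumps({"text": text})

def send(webhook_url, payload):
    """POST the payload to the webhook URL (requires a real Slack webhook)."""
    req = request.Request(webhook_url, data=payload.encode(),
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)

# Example payload (not sent here):
payload = build_payload("orders_load", "SUCCESS", rows=1234)
```

Chime and similar messengers follow the same pattern with a different payload schema.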
49. Lack of Logging
Problem: We didn't have any detailed logs about our ETL performance, and no insights.
Solution: Matillion lets us run our own logging and audit engine. In addition, we are able to
collect logs at any level of ETL jobs and transformations.
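The kind of job/component-level audit detail described above can be sketched as a small collector that both logs each step and keeps table-ready records for later analysis. Field names here are illustrative, not Matillion's schema.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

class EtlAuditLog:
    """Collect per-step ETL records (job, component, rows, duration)
    so they can be loaded into an audit table and analyzed."""

    def __init__(self):
        self.records = []

    def step(self, job, component, rows, seconds):
        rec = {"job": job, "component": component,
               "rows": rows, "seconds": seconds}
        self.records.append(rec)
        logging.info("job=%s component=%s rows=%d took=%.1fs",
                     job, component, rows, seconds)
        return rec

# Example: record two steps of a hypothetical load.
audit = EtlAuditLog()
audit.step("orders_load", "extract", 10000, 12.5)
audit.step("orders_load", "transform", 10000, 4.2)
```

Loading `audit.records` into a DW table is what turns raw logs into the missing performance insights.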
50. Marketing Automation
Problem: The marketing team wants to "Move Fast and Break Things".
Solution: Using Matillion, we gave Marketing template jobs, and they run the jobs themselves
using built-in marketing data connectors (for example, Affiliates, Insights).
51. DW Slowdown
Problem: After some time, the Redshift DW starts hitting concurrency and performance issues.
Solution: Scale the Redshift cluster based on current needs (takes a couple of minutes), and
implement automation for Vacuum and Compression. Added WLM.
Amazon Redshift Utils: https://github.com/awslabs/amazon-redshift-utils
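The vacuum automation can be sketched as a filter over Redshift's `svv_table_info` system view, which exposes an `unsorted` percentage per table; only tables above a threshold get maintenance, which is the approach the awslabs utilities take. The rows below stand in for a query result.

```python
def maintenance_sql(table_stats, unsorted_pct=10.0):
    """Given svv_table_info-style rows ({'table': name, 'unsorted': pct}),
    emit VACUUM/ANALYZE statements only for tables that need them."""
    stmts = []
    for row in table_stats:
        if (row["unsorted"] or 0) > unsorted_pct:
            stmts.append(f"VACUUM FULL {row['table']};")
            stmts.append(f"ANALYZE {row['table']};")
    return stmts

# Example with hypothetical stats: only 'orders' exceeds the threshold.
stats = [{"table": "orders", "unsorted": 42.0},
         {"table": "dim_date", "unsorted": 1.0}]
print(maintenance_sql(stats))
```

Scheduling this nightly keeps sort order and table statistics fresh without vacuuming everything, which matters once the cluster is under concurrency pressure.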
52. NoSQL (DynamoDB) to DW
Problem: Our main inventory database moved to NoSQL (DynamoDB), and getting incremental
changes into Redshift with the default functionality is both a challenge and costly.
Solution: Using a mix of AWS tools (Amazon Kinesis Data Firehose, AWS Glue, and Matillion),
we are able to capture changes every hour.
Pipeline: Inventory changes → DynamoDB → Kinesis Firehose (store changes) → Glue converts to Parquet → Redshift DW
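One step such a pipeline needs is flattening DynamoDB's typed attribute values into plain JSON before Firehose delivery. A minimal sketch (field names are hypothetical inventory attributes):

```python
def flatten(image):
    """Convert a DynamoDB-typed item image, e.g.
    {'price': {'N': '9.50'}, 'sku': {'S': 'B000123'}},
    into a plain dict ready to be JSON-serialized for Firehose."""
    out = {}
    for key, typed in image.items():
        (t, v), = typed.items()  # each attribute has exactly one type tag
        if t == "N":             # DynamoDB numbers arrive as strings
            out[key] = float(v) if "." in v else int(v)
        elif t == "BOOL":
            out[key] = v
        elif t == "NULL":
            out[key] = None
        else:                    # S and other string-like types kept as-is
            out[key] = v
    return out

# Example change record from a hypothetical inventory stream:
row = flatten({"sku": {"S": "B000123"}, "qty": {"N": "3"},
               "price": {"N": "9.50"}, "active": {"BOOL": True}})
```

Downstream, Glue converts the delivered JSON to Parquet so Redshift (or Spectrum) can query it cheaply.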
53. Clickstream Logs (Big Data)
Problem: The business wants to analyze bot traffic and discover broken URLs. Access logs are
~50 GB per day, across 7,000 log files per day.
Solution: Leveraging Elastic MapReduce (EMR) and Spark to produce Parquet files. Using
AWS Glue, we built a serverless data lake with Amazon Redshift Spectrum.
Pipeline: Access Logs → EMR+Spark processing → Parquet into S3 → Crawler with Glue → Query via Spectrum
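The parsing stage of such a Spark job can be sketched as a plain function over one log line; the same function would be mapped over the raw files before writing Parquet. The sketch assumes Apache combined log format and keeps just the fields needed to spot bots (user agent) and broken URLs (status code).

```python
import re

# Apache combined log format, simplified to the fields we keep.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Parse one access-log line into a dict, or None if malformed.
    Flags likely bot traffic via a naive user-agent check."""
    m = LOG_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["is_bot"] = "bot" in rec["agent"].lower()
    return rec
```

In the real pipeline this logic runs inside Spark across all ~7,000 daily files, and the resulting rows are written as Parquet to S3 for Glue to crawl.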
54. Security Audit
Problem: We built the solution fast and often stored plaintext passwords or other critical data.
Solution: Used AWS Secrets Manager and Amazon Macie, and updated data transformation logic
to exclude sensitive data.
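The "exclude sensitive data in transformation" part can be sketched as a scrub step applied to every record before it reaches the DW. The field names below are illustrative; in practice the list would come from the audit findings.

```python
# Illustrative set of fields flagged as sensitive by the audit.
SENSITIVE = {"password", "ssn", "card_number"}

def scrub(record):
    """Replace sensitive fields with a fixed mask so plaintext
    secrets never land in warehouse tables or logs."""
    return {k: ("***" if k.lower() in SENSITIVE else v)
            for k, v in record.items()}

# Example:
clean = scrub({"user": "dmitry", "password": "hunter2", "city": "Victoria"})
```

Credentials the ETL itself needs (connection strings, API keys) are the part that moves into AWS Secrets Manager rather than into job definitions.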
55. Machine Learning
Problem: We are a marketplace; buyers search for products by category, and sellers do a poor
job of category labeling, making products difficult for buyers to find.
Solution: Leveraging Amazon SageMaker image classification, a deep learning
Convolutional Neural Network (CNN). Example category: Sheet Music.
59. Coursera and edX:
• Data Warehouse for Business Intelligence Specialization
• Data Engineering on Google Cloud Platform
• Architecting with Google Cloud Platform
AWS Tutorials:
• Getting Started with Amazon Redshift
• Sizing Amazon Redshift
• Getting Started with Amazon Redshift Spectrum, Athena, Glue, EMR
• AWS Free Tier (for example, 2 months of Redshift)
AWS Trainings:
• AWS Technical Essentials
Other:
• Google Machine Learning Crash Course (Deep Learning with TensorFlow)
• Matillion Trial and Learning Materials
• Tableau Trial and Learning Materials
The first Industrial Revolution started around 1780, and this was the beginning of the Age of Machines.
Steam and water power was used to power engines.
For the first time, we were able to produce goods using machines.
Electric power fueled the second Industrial Revolution [which started around 1870].
Here’s the Chicago World’s Fair in 1893 – and that was the place to be if you wanted to see the latest electricity-based inventions, such as lighting systems and elevators.
And electric power also enabled mass production. The assembly line made it possible to mass-manufacture new inventions such as the automobile and telephones.
Now, up to this point, data was still kept by hand. What data integration looked like was two accountants comparing paper ledgers with each other.
It was only during the Third Industrial Revolution – better known as the Digital Revolution – that digital data as we know it came to be.
The Digital Revolution started in the 1960s, it was fueled by electricity, and it introduced computerized automation.
It has brought the innovations we know and love, such as computers, smartphones, and the Internet.
The ability to manufacture a large variety of products + the internet gave rise to Amazon.
Here's how the Amazon.com website looked when it launched in 1995.
And here’s the website this year.
This is what customers see.
The changes are not just in the user experience; everything else has changed too.
This is what is happening behind the scenes.
This is the floor plan for just one of our datacenters in Virginia.
Amazon developed a very sophisticated data integration platform. We were running the largest Oracle data warehouse in the world.
The 4th Industrial Revolution (for example, Alexa) is driven by data.
The size and complexity of the data that needs to be analyzed today means the same technology and approaches that worked in the past don’t work anymore. First, the volume of data is growing exponentially with machine-generated data from internet-connected devices growing 10x faster than data from business applications. This makes it impractical for customers to purchase and install larger, more powerful hardware each time storage and compute capacity limits are reached, and also limits moving massive amounts of data to a separate analytics system prior to analyzing it. Second, the types of available data are changing from traditional operational data that are structured as tables and columns to data being generated by new sources like social media, mobile apps, websites, and devices. Customers can no longer constrain their analytics to relational data, but now need to be able to store and analyze any type of data, including non-relational data without defined relationships or schema. Third, as data is generated in real-time, customers need to go beyond analyzing historical data to analyzing data as it becomes available.
To get the most value from their data, customers need a scalable, secure, and comprehensive data storage and analytics platform. Customers need to be able to securely store data coming from applications and devices in its native format, with high availability, durability, at low cost, and at any scale. In short, they need a data lake. Customers need to easily access and analyze data in a variety of ways using the tools and frameworks of their choice in a high performance, cost effective way without having to move large amounts of data between their storage and analytics systems. And, customers need to go beyond visualization and insights from operational reporting on historical data, to being able to perform machine learning and real-time analytics to accurately predict future outcomes.
• Is the company a 'winner'?
• Will this tool be supported and fully usable in 3–5 years?
• Will this be adopted by Amazon? Will there be a community of use?
• Recommendations within Amazon (such as AWS SAs)
• Years in business, customers, profitability
• Management: scheduling built in; intuitive views of DW processes, models, schedules; does it help someone understand DW data flows?
• Deployment/architecture: AWS better than local; Linux better than Windows; must be a patchable platform within Amazon guidelines
• The biggest risk was the investment in a tool from a small player.
• Porting ETL processes from Matillion would be no less expensive than from PL/SQL and dblinks.