This document provides an outline for a presentation on analytics solutions powered by AWS. It introduces the presenter and their background in business intelligence. It then discusses the role of analytics, an overview of Abebooks, innovation and data, DW modernization at Abebooks using Matillion ETL and Redshift, use cases and challenges, example pricing models, and free learning resources. The document aims to provide an overview of analytics solutions and the presenter's experience implementing solutions on AWS.
7. Other Activities
Jumpstart Snowflake: A Step-by-Step Guide to Modern Cloud Analytics.
• BI TechTalk (100+ BI teams globally)
• Amazon Tableau User Group (2000+ users)
• Conferences (EDW 2018, 2019, Data Summit, SQLPass)
• Amazon internal conferences
10. BI Value Chain
Stakeholders: Employees, Customers
Data → Decisions → Value
• Value creation is based on effective decisions
• Effective decisions are based on accurate information
12. About Abebooks
• Online marketplace for books, art & collectibles.
• Amazon subsidiary since 2008; a marketplace for used books and, increasingly, non-book collectibles
• 350M listings
• 3 people on the Data Engineering Team out of 120 employees
• 2 locations: Victoria, BC and Düsseldorf
31. For Data to be a differentiator, customers need to be able to…
• Capture and store new non-relational data at PB–EB scale in real time
• Discover value in new types of analytics that go beyond batch reporting to incorporate real-time, predictive, voice, and image recognition
• Democratize access to data in a secure and governed way
New types of analytics: Dashboards, Predictive, Image Recognition, Voice, Real-time
New types of data
32. Data & analytics partners extend the traditional approach
Traditional: Data Warehouse + Business Intelligence (sources: OLTP, ERP, CRM, LOB)
Extended: Data Lake + Big Data processing, real-time, Machine Learning (sources: Devices, Web, Sensors, Social)
• Relational and non-relational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
37. Cloud Migration Strategy
Lift & Shift
• Typical Approach
• Move all-at-once
• Move to the target platform, then evolve
• Approach gets you to the cloud quickly
• Relatively small barrier to learning new technology
since it tends to be a close fit
Split & Flip
• Split application into logical functional data layers
• Match the data functionality with the right
technology
• Leverage the wide selection of tools on AWS to
best fit the need
• Move data in phases — prototype, learn and
perfect
38. Choosing an ETL Tool for the Cloud
Use Cases
• OLTP to Redshift
• SFTP/API to Redshift
• Data Transformation
• Dimensional Modelling
• AWS Integration
• Big Data
Tools
• Informatica
• AWS Glue
• Talend
• Fivetran
• Alooma
• Stitch
• Matillion
39. ETL Criteria
High:
• Security
• Support Redshift
• CDC
• Ease of Use for
BI/DW
• Cover use cases
• On-Premise and
full control
Medium:
• Support NoSQL
• Deployment/Architecture
• Encryption
• Ease of Use for non-BI/DW
• Data Transformations
• Management
• Pricing
• Performance
Low:
• Version Control
• Linux OS
• ETL Monitoring
• Logging
• R/Python
40. Why We Picked Matillion
• Was built for Redshift and Cloud
• Speed of ELT operations
• Speed of development
• Wide range of data sources supported
• Ease of use outside of DE/DBA expertise
• Native with AWS
• $$$
After 2 years of usage, it has lived up to our expectations.
45. OLTP to Dimensional Modelling
Problem: Heavy transformations, lots of dependencies. Users like to consume a classical star
schema. Lots of tables with CDC.
Solution: Using Matillion, we implemented a CDC pattern and reused it across all tables,
visualized all jobs and dependencies, and used built-in components to easily create the
Dimensional Model.
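The per-table CDC pattern the slide describes can be sketched as SQL generation. The sketch below (illustrative table and key names) emits the classic Redshift upsert for a staging table: delete matched target rows, then insert, inside one transaction, since older Redshift lacked MERGE.

```python
def cdc_upsert_sql(target, staging, keys):
    """Generate a Redshift-style CDC upsert for one table:
    delete rows already present in the staging set, then insert
    the fresh versions. Run as a single transaction."""
    on = " AND ".join(f"{target}.{k} = {staging}.{k}" for k in keys)
    return (
        "BEGIN;\n"
        f"DELETE FROM {target} USING {staging} WHERE {on};\n"
        f"INSERT INTO {target} SELECT * FROM {staging};\n"
        "END;"
    )

# Example: apply the pattern to a hypothetical customer dimension.
print(cdc_upsert_sql("dim_customer", "stg_customer", ["customer_id"]))
```

Because the statement is generated from (target, staging, keys) triples, one template job can be reused across all CDC tables, which is the point of the pattern.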
46. Self-Service BI
Problem: Business users want an interactive, self-service tool, with fast time to market and
less dependency on IT.
Solution: We chose Tableau, a BI leader that is highly adopted across Amazon.
47. Integration with BI
Problem: Having the best BI tool doesn't guarantee a good SLA.
Solution: Build an integration between Matillion ETL and Tableau based on triggers, and add
data quality checks.
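One hedged sketch of the trigger side: after the load and quality checks pass, the ETL job can call Tableau's REST API to kick off an extract refresh so dashboards only refresh against verified data. The endpoint shape follows Tableau's documented "run extract refresh task" call; the server URL, API version, site ID, and task ID below are placeholders, not values from the deck.

```python
def refresh_request(server, api_version, site_id, task_id):
    """Build the Tableau REST API request that runs an extract
    refresh task immediately. Returns (method, url, body); an
    authenticated HTTP client would then send it."""
    url = (f"{server}/api/{api_version}/sites/{site_id}"
           f"/tasks/extractRefreshes/{task_id}/runNow")
    return ("POST", url, "<tsRequest />")

# Example with placeholder identifiers:
method, url, body = refresh_request(
    "https://tableau.example.com", "3.4", "site-abc", "task-42")
```

Triggering the refresh from the ETL side (rather than on a fixed Tableau schedule) is what ties the dashboard SLA to actual data readiness.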
48. Lack of Notification
Problem: Users miss emails, or the emails land in spam.
Solution: Leverage a messenger with webhooks (Slack, Chime, and so on).
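For Slack specifically, an incoming webhook takes a small JSON body posted over HTTP. A minimal sketch (job name and row count are illustrative) that an ETL job could call on success or failure:

```python
import json
from urllib import request

def build_payload(job, status, rows=None):
    """Build the Slack incoming-webhook JSON body for an ETL status ping."""
    text = f"ETL job *{job}* finished with status {status}"
    if rows is not None:
        text += f" ({rows:,} rows)"
    return json.dumps({"text": text})

def send(webhook_url, payload):
    """POST the payload to the webhook URL (requires a real Slack webhook)."""
    req = request.Request(webhook_url, data=payload.encode(),
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)

# Example payload (not sent here):
payload = build_payload("orders_load", "SUCCESS", rows=1234)
```

Chime and similar messengers follow the same pattern with a different payload schema.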
49. Lack of Logging
Problem: We didn't have any detailed logs about our ETL performance, and no insights.
Solution: Matillion lets us run our own logging and audit engine. In addition, we are able to
collect logs at any level of ETL jobs and transformations.
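The kind of job/component-level audit detail described above can be sketched as a small collector that both logs each step and keeps table-ready records for later analysis. Field names here are illustrative, not Matillion's schema.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

class EtlAuditLog:
    """Collect per-step ETL records (job, component, rows, duration)
    so they can be loaded into an audit table and analyzed."""

    def __init__(self):
        self.records = []

    def step(self, job, component, rows, seconds):
        rec = {"job": job, "component": component,
               "rows": rows, "seconds": seconds}
        self.records.append(rec)
        logging.info("job=%s component=%s rows=%d took=%.1fs",
                     job, component, rows, seconds)
        return rec

# Example: record two steps of a hypothetical load.
audit = EtlAuditLog()
audit.step("orders_load", "extract", 10000, 12.5)
audit.step("orders_load", "transform", 10000, 4.2)
```

Loading `audit.records` into a DW table is what turns raw logs into the missing performance insights.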
50. Marketing Automation
Problem: The marketing team wants to "Move Fast and Break Things".
Solution: Using Matillion, we gave Marketing template jobs, and they run the jobs themselves
using built-in marketing data connectors (for example, Affiliates, Insights).
51. DW Slowdown
Problem: After some time, the Redshift DW starts hitting concurrency and performance issues.
Solution: Scale the Redshift cluster based on current needs (takes a couple of minutes), and
implement automation for Vacuum and Compression. Added WLM.
Amazon Redshift Utils: https://github.com/awslabs/amazon-redshift-utils
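The vacuum automation can be sketched as a filter over Redshift's `svv_table_info` system view, which exposes an `unsorted` percentage per table; only tables above a threshold get maintenance, which is the approach the awslabs utilities take. The rows below stand in for a query result.

```python
def maintenance_sql(table_stats, unsorted_pct=10.0):
    """Given svv_table_info-style rows ({'table': name, 'unsorted': pct}),
    emit VACUUM/ANALYZE statements only for tables that need them."""
    stmts = []
    for row in table_stats:
        if (row["unsorted"] or 0) > unsorted_pct:
            stmts.append(f"VACUUM FULL {row['table']};")
            stmts.append(f"ANALYZE {row['table']};")
    return stmts

# Example with hypothetical stats: only 'orders' exceeds the threshold.
stats = [{"table": "orders", "unsorted": 42.0},
         {"table": "dim_date", "unsorted": 1.0}]
print(maintenance_sql(stats))
```

Scheduling this nightly keeps sort order and table statistics fresh without vacuuming everything, which matters once the cluster is under concurrency pressure.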
52. NoSQL (DynamoDB) to DW
Problem: Our main inventory database moved to NoSQL (DynamoDB), and getting incremental
changes into Redshift with the default functionality is both a challenge and costly.
Solution: Using a mix of AWS tools (Amazon Kinesis Data Firehose, AWS Glue, and Matillion),
we are able to capture changes every hour.
Pipeline: Inventory changes → DynamoDB → Kinesis Firehose (store changes) → Glue converts to Parquet → Redshift DW
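One step such a pipeline needs is flattening DynamoDB's typed attribute values into plain JSON before Firehose delivery. A minimal sketch (field names are hypothetical inventory attributes):

```python
def flatten(image):
    """Convert a DynamoDB-typed item image, e.g.
    {'price': {'N': '9.50'}, 'sku': {'S': 'B000123'}},
    into a plain dict ready to be JSON-serialized for Firehose."""
    out = {}
    for key, typed in image.items():
        (t, v), = typed.items()  # each attribute has exactly one type tag
        if t == "N":             # DynamoDB numbers arrive as strings
            out[key] = float(v) if "." in v else int(v)
        elif t == "BOOL":
            out[key] = v
        elif t == "NULL":
            out[key] = None
        else:                    # S and other string-like types kept as-is
            out[key] = v
    return out

# Example change record from a hypothetical inventory stream:
row = flatten({"sku": {"S": "B000123"}, "qty": {"N": "3"},
               "price": {"N": "9.50"}, "active": {"BOOL": True}})
```

Downstream, Glue converts the delivered JSON to Parquet so Redshift (or Spectrum) can query it cheaply.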
53. Clickstream Logs (Big Data)
Problem: The business wants to analyze bot traffic and discover broken URLs. Access logs are
~50 GB per day, across 7,000 log files per day.
Solution: Leveraging Elastic MapReduce (EMR) and Spark to produce Parquet files. Using
AWS Glue, we built a serverless data lake with Amazon Redshift Spectrum.
Pipeline: Access Logs → EMR+Spark processing → Parquet into S3 → Crawler with Glue → Query via Spectrum
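The parsing stage of such a Spark job can be sketched as a plain function over one log line; the same function would be mapped over the raw files before writing Parquet. The sketch assumes Apache combined log format and keeps just the fields needed to spot bots (user agent) and broken URLs (status code).

```python
import re

# Apache combined log format, simplified to the fields we keep.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Parse one access-log line into a dict, or None if malformed.
    Flags likely bot traffic via a naive user-agent check."""
    m = LOG_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["is_bot"] = "bot" in rec["agent"].lower()
    return rec
```

In the real pipeline this logic runs inside Spark across all ~7,000 daily files, and the resulting rows are written as Parquet to S3 for Glue to crawl.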
54. Security Audit
Problem: We built the solution fast and often stored plaintext passwords or other critical data.
Solution: Used AWS Secrets Manager and Amazon Macie, and updated data transformation logic
to exclude sensitive data.
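The "exclude sensitive data in transformation" part can be sketched as a scrub step applied to every record before it reaches the DW. The field names below are illustrative; in practice the list would come from the audit findings.

```python
# Illustrative set of fields flagged as sensitive by the audit.
SENSITIVE = {"password", "ssn", "card_number"}

def scrub(record):
    """Replace sensitive fields with a fixed mask so plaintext
    secrets never land in warehouse tables or logs."""
    return {k: ("***" if k.lower() in SENSITIVE else v)
            for k, v in record.items()}

# Example:
clean = scrub({"user": "dmitry", "password": "hunter2", "city": "Victoria"})
```

Credentials the ETL itself needs (connection strings, API keys) are the part that moves into AWS Secrets Manager rather than into job definitions.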
55. Machine Learning
Problem: We are a marketplace; buyers search for products by category, and sellers do a poor
job of category labeling, making products difficult for buyers to find.
Solution: Leveraging Amazon SageMaker image classification, a deep learning
Convolutional Neural Network (CNN). Example category: Sheet Music.
59. Coursera and edX:
• Data Warehouse for Business Intelligence Specialization
• Data Engineering on Google Cloud Platform
• Architecting with Google Cloud Platform
AWS Tutorials:
• Getting Started with Amazon Redshift
• Sizing Amazon Redshift
• Getting Started with Amazon Redshift Spectrum, Athena, Glue, EMR
• AWS Free Tier (for example, 2 months of Redshift)
AWS Trainings:
• AWS Technical Essentials
Other:
• Google Machine Learning Crash Course (Deep Learning with TensorFlow)
• Matillion Trial and Learning Materials
• Tableau Trial and Learning Materials
The first Industrial Revolution started around 1780, and this was the beginning of the Age of Machines.
Steam and water power was used to power engines.
For the first time, we were able to produce goods using machines.
Electric power fueled the second Industrial Revolution [which started around 1870].
Here’s the Chicago World’s Fair in 1893 – and that was the place to be if you wanted to see the latest electricity-based inventions, such as lighting systems and elevators.
And electric power also enabled mass production. The assembly line made it possible to mass-manufacture new inventions such as the automobile and telephones.
Now, up to this point, data was still kept by hand. What data integration looked like was two accountants comparing paper ledgers with each other.
It was only during the Third Industrial Revolution – better known as the Digital Revolution – that digital data as we know it came to be.
The Digital Revolution started in the 1960s, it was fueled by electricity, and it introduced computerized automation.
It has brought the innovations we know and love, such as computers, smartphones, and the Internet.
The ability to manufacture a large variety of products + the internet gave rise to Amazon.
Here's how the Amazon.com website looked when it launched in 1995.
And here’s the website this year.
This is what customers see.
The changes are not just in the user experience; everything else has changed too.
This is what is happening behind the scenes.
This is the floor plan for just one of our datacenters in Virginia.
Amazon developed a very sophisticated data integration platform. We were running the largest Oracle data warehouse in the world.
The 4th Industrial Revolution (for example, Alexa) is driven by data.
The size and complexity of the data that needs to be analyzed today means the same technology and approaches that worked in the past don’t work anymore. First, the volume of data is growing exponentially with machine-generated data from internet-connected devices growing 10x faster than data from business applications. This makes it impractical for customers to purchase and install larger, more powerful hardware each time storage and compute capacity limits are reached, and also limits moving massive amounts of data to a separate analytics system prior to analyzing it. Second, the types of available data are changing from traditional operational data that are structured as tables and columns to data being generated by new sources like social media, mobile apps, websites, and devices. Customers can no longer constrain their analytics to relational data, but now need to be able to store and analyze any type of data, including non-relational data without defined relationships or schema. Third, as data is generated in real-time, customers need to go beyond analyzing historical data to analyzing data as it becomes available.
To get the most value from their data, customers need a scalable, secure, and comprehensive data storage and analytics platform. Customers need to be able to securely store data coming from applications and devices in its native format, with high availability, durability, at low cost, and at any scale. In short, they need a data lake. Customers need to easily access and analyze data in a variety of ways using the tools and frameworks of their choice in a high performance, cost effective way without having to move large amounts of data between their storage and analytics systems. And, customers need to go beyond visualization and insights from operational reporting on historical data, to being able to perform machine learning and real-time analytics to accurately predict future outcomes.
• Is the company a 'winner'?
• Will this tool be supported and fully usable in 3–5 years?
• Will this be adopted by Amazon? Will there be a community of use?
• Recommendations within Amazon (such as AWS SAs)
• Years in business, customers, profitability
• Management: scheduling built in; intuitive views of DW processes, models, schedules; does it help someone understand DW data flows?
• Deployment/architecture: AWS better than local; Linux better than Windows; must be a patchable platform within Amazon guidelines
• The biggest risk was the investment in a tool from a small player.
• Porting ETL processes from Matillion would be no less expensive than from PL/SQL and dblinks.