
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution





This session covers building a modern data warehouse by migrating from a traditional DW platform into the cloud, using Amazon Redshift and the cloud ETL tool Matillion to provide self-service BI for a business audience. It walks through the technical migration path from a DW with PL/SQL ETL to Amazon Redshift via Matillion ETL, with a detailed comparison of modern ETL tools. The talk also focuses on working backward through the process, i.e. starting from the business audience and the needs that drive changes in the old DW. Finally, it covers the idea of self-service BI, and the author shares a step-by-step plan for building an efficient self-service environment using the modern BI platform Tableau.


  1. Building Cloud Self-Service Analytical Solutions. By Dmitry Anoshin, Data Engineer, Abebooks (Amazon Subsidiary)
  2. Outline • About Myself • About Abebooks • Choosing ETL for the Cloud • Data Acquisition Patterns with Matillion ETL • Setting Up Self-Service BI • Lessons Learned During the Journey to the Cloud
  3. About Myself • Working with BI since 2007 • Implemented BI in Russia/Europe/Canada
  4. Technical Skills Matrix (2007-2018) • Databases (Oracle, Teradata, Vertica, Snowflake, Redshift, MySQL, PostgreSQL, MS SQL Server) • ETL (Pentaho DI, Informatica, Matillion ETL) • BI (SAP BusinessObjects, Tableau, MicroStrategy, Pentaho BI, SAS BI) • Big Data (Cloudera Hadoop, Hive, Hue, Splunk, Hunk, ElasticSearch) • Digital Marketing (GA, Piwik, Tealium, Adjust, Adobe) • Data Analytics (R, Python)
  5. My Books
  6. #dimaworkplace
  7. About Abebooks • Online marketplace for books, art & collectibles • Amazon subsidiary since 2008; a marketplace for used books and, increasingly, non-book collectibles • 350 million listings • 3 people on the 'DB Team' • 2 locations: Victoria, BC and Dusseldorf
  8. Abebooks Data Flows • Built by DBAs: db links, PL/SQL, external tables, shell scripts • Even before 2015, Redshift was a strategic target, but an ETL rewrite was too expensive • Diagram: Source Layer (SALES, INVENTORY, CS, SFTP) → ETL (PL/SQL) → DW Storage Layer → Access Layer (ad-hoc SQL)
  9. Choosing an ETL Tool for the Cloud • Use cases: OLTP to S3, S3 to Redshift, SFTP/API to Redshift, data transformation, dimensional modelling • Tools considered: Pentaho DI, Informatica, AWS Data Pipeline, Talend, Matillion
  10. ETL Criteria • High: support for the native Redshift driver; easy capture from relational DBs, CDC; ease of use for BI/DW; covers our use cases; on-premise • Medium: NoSQL support; company a 'winner'; deployment/architecture; encryption; ease of use for non-BI/DW; data transformations; management; pricing; performance • Low: version control; Linux OS; ETL monitoring; logging; R/Python
  11. Why We Picked Matillion • Specific Redshift support, built around the Redshift platform • Speed of ETL operations • Speed of development • Wide range of data sources supported • Ease of use outside of DE/DBA expertise • Native to AWS • $$$ • The biggest risk: putting our eggs in Matillion's basket, betting on a small and new player
  12. Data Acquisition Patterns with Matillion ETL
  13. Abebooks Cloud Analytics Architecture (diagram) • Source systems: DynamoDB, Amazon RDS, external APIs, SFTP, apps • S3 Data Lake • Matillion ETL on EC2 (m4.large, 2 vCPU, 8 GB RAM) behind Amazon Elastic Load Balancing • Abebooks DW account: Amazon Redshift, Redshift Spectrum, Amazon Athena, Amazon EMR • Event/notification services: SQS, SNS, Amazon Chime • End-user access: Tableau Server, Tableau Web, Tableau Desktop, ad-hoc SQL
  14. Pattern 1: Getting Data via SFTP • Scan the SFTP server, get all file names, load the list into Redshift • Identify only new files • Load one ${file_name} at a time (using IF we can choose the right stream) • Insert the processed ${file_name} into Redshift • Load the next file. Takeaways: • Python boto library for managing S3 • Matillion variables ${variable} • Using Matillion iterators • Execute SQL via Python • If a file is missing, try again later (a sketch of the new-file detection step follows below)
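The new-file detection step in Pattern 1 can be sketched in Python, in the spirit of the boto takeaway above. This is a minimal sketch, not the deck's actual job: the bucket, schema, table, and IAM role names are hypothetical placeholders, and it assumes the SFTP files have already landed in S3 and that a bookkeeping table etl.processed_files exists in Redshift.

```python
import boto3
import psycopg2

s3 = boto3.client("s3")
conn = psycopg2.connect(host="redshift-host", port=5439, dbname="dw",
                        user="etl_user", password="...")

# 1. Scan the landing prefix and collect all file names.
resp = s3.list_objects_v2(Bucket="abebooks-landing", Prefix="sftp/")
all_files = {obj["Key"] for obj in resp.get("Contents", [])}

# 2. Identify only new files by comparing against the bookkeeping table.
with conn.cursor() as cur:
    cur.execute("SELECT file_name FROM etl.processed_files")
    seen = {row[0] for row in cur.fetchall()}

# 3. Load one file at a time, then record it as processed.
for file_name in sorted(all_files - seen):
    with conn.cursor() as cur:
        cur.execute(f"""
            COPY staging.sftp_feed
            FROM 's3://abebooks-landing/{file_name}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            CSV GZIP""")
        cur.execute("INSERT INTO etl.processed_files (file_name) VALUES (%s)",
                    (file_name,))
    conn.commit()
```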
  15. Pattern 2: Getting Data via an API • Connect to the API via a Python script • Get data via calls and save it to CSV on EC2 • Upload the CSV into S3 • Load the CSV into Redshift. Takeaways: • Using Python to connect to an external API • Using AWS KMS to encrypt credentials • Using SNS for email notification • Using Matillion system variables for ETL logs
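Pattern 2 might look like the following Python sketch, covering the API call, the local CSV, the S3 upload, and the KMS-encrypted credential from the takeaways. The URL, file paths, bucket, and key material are all assumptions for illustration.

```python
import csv
import boto3
import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical endpoint

# Decrypt the API token with AWS KMS rather than storing it in plain text.
kms = boto3.client("kms")
with open("/etc/etl/api_token.enc", "rb") as f:
    token = kms.decrypt(CiphertextBlob=f.read())["Plaintext"].decode()

rows = requests.get(API_URL, headers={"Authorization": f"Bearer {token}"},
                    timeout=60).json()

# Save the response to CSV on local disk (the Matillion EC2 instance here).
with open("/tmp/orders.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

# Upload the CSV into S3; a Redshift COPY then loads it from there.
boto3.client("s3").upload_file("/tmp/orders.csv", "abebooks-landing",
                               "api/orders.csv")
```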
  16. Pattern 3: Getting Data from DynamoDB. Takeaways: • Using the DynamoDB component (generates the COPY command for you) • You can't easily get incremental changes, i.e. full reload • Speed depends on two things, the "read ratio" and the per-table "read capacity"; the actual rows-per-hour value is based on readRatio * tableReadCapacity • 51m rows with 35% read ratio and 300 read capacity = 9 hours • 211m rows with 66% read ratio and 1500 read capacity = 4 hours • Reloading once a week
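The COPY that the Matillion DynamoDB component generates can be approximated as below; this is a hedged sketch, with the staging table and IAM role names as placeholders. READRATIO is the Redshift COPY parameter that caps the share of the DynamoDB table's provisioned read capacity the load may consume (35 means 35%, matching the first timing example above).

```python
import psycopg2

conn = psycopg2.connect(host="redshift-host", port=5439, dbname="dw",
                        user="etl_user", password="...")
with conn.cursor() as cur:
    # Full reload: COPY pulls every item from the DynamoDB table.
    cur.execute("""
        COPY staging.listings
        FROM 'dynamodb://listings'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        READRATIO 35""")
conn.commit()
```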
  17. Pattern 4: Getting Data from External S3* • Getting data from another VPC: change the bucket policy and the bucket appears in the list of buckets through Matillion
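On the owning account's side, that bucket-policy change can be sketched with boto3 as below. The account ID and bucket name are hypothetical; the point is that granting the DW account read access is enough for the bucket to show up in Matillion.

```python
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},  # DW account
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::partner-bucket",
                     "arn:aws:s3:::partner-bucket/*"],
    }],
}
boto3.client("s3").put_bucket_policy(Bucket="partner-bucket",
                                     Policy=json.dumps(policy))
```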
  18. Pattern 5: Matillion Connectors for Apps
  19. Pattern 6: Using SQS for Triggering Jobs • Using the SQS service, we can trigger almost anything in Matillion or AWS
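The producer side of this trigger is a one-liner with boto3, sketched below. The queue URL and message fields are illustrative assumptions; the listening side is configured in Matillion itself.

```python
import json
import boto3

sqs = boto3.client("sqs")
# Drop a message on the queue Matillion listens to; the payload names the
# project and job to run (field names are assumptions for illustration).
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/etl-triggers",
    MessageBody=json.dumps({"group": "DW", "project": "Abebooks",
                            "job": "load_sales"}),
)
```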
  20. Improving the End-User Experience
  21. BI Survey • ETL was a black box • A lack of notifications • A lack of documentation and training • A lack of automation • No dependency tracking between reports and the ETL process • Heavy dependence on the BI/DW team
  22. BI Champions • The BI champion is the sheriff, ensuring the townspeople (business users) are productive and can do analytics quickly and smoothly • The BI champion is meant to be both an evangelist and a subject matter expert for BI within the organization • The champion should be well versed in the data important to their team, and knowledgeable in the core BI technologies and patterns used within AbeBooks
  23. ETL Monitoring and Notifications • An SNS topic sends email; in addition, we can attach any number of Matillion variables • Using an Amazon Chime webhook, we can execute a curl command via a bash script and send a message to business users (a Python sketch of the same call follows below)
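The Python equivalent of that curl call might look like the sketch below. Amazon Chime incoming webhooks accept a JSON payload with a Content field; the webhook URL here is a placeholder, and the job name and status would come from Matillion variables in practice.

```python
import requests

WEBHOOK_URL = "https://hooks.chime.aws/incomingwebhooks/abc123?token=..."

def notify(job_name: str, status: str) -> None:
    # Chime renders the "Content" field as the chat-room message.
    message = f"ETL job {job_name} finished with status: {status}"
    requests.post(WEBHOOK_URL, json={"Content": message}, timeout=10)

notify("load_sales", "SUCCESS")
```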
  24. ETL Monitor • Using Matillion system variables, we track all events, visualize them via Tableau for end users, and allow alerts to be created in case of failure (see the audit-table sketch below)
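The audit-table write can be sketched as a SQL statement run from a Matillion component. The table is hypothetical, and ${job_name}, ${component_name}, and ${detailed_error} are assumed here to be among Matillion's automatic variables; Tableau then reads etl.job_audit for the dashboard and alerts.

```python
# SQL run from a Matillion SQL component on failure; Matillion expands the
# ${...} variables before execution (variable names are assumptions).
AUDIT_SQL = """
INSERT INTO etl.job_audit (job_name, component_name, status, detail, logged_at)
VALUES ('${job_name}', '${component_name}', 'FAILED', '${detailed_error}',
        GETDATE());
"""
```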
  25. ETL Trigger for Tableau • Task: refresh a Tableau data source (semantic layer) and workbooks when FACT tables are refreshed • Solution: deploy the Tableau CLI tool (tabcmd) on the Matillion EC2 instance and run it via a bash script, as sketched below
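A sketch of that trigger, assuming tabcmd is installed on the Matillion EC2 instance. The deck runs it from a bash script; a Python wrapper is shown here for consistency, and the server URL, credentials, and data source name are placeholders.

```python
import subprocess

def refresh_datasource(name: str) -> None:
    # Authenticate against Tableau Server, refresh the shared data source
    # (the semantic layer), then log out.
    subprocess.run(["tabcmd", "login", "-s", "https://tableau.example.com",
                    "-u", "etl_user", "-p", "..."], check=True)
    subprocess.run(["tabcmd", "refreshextracts", "--datasource", name,
                    "--synchronous"], check=True)
    subprocess.run(["tabcmd", "logout"], check=True)

refresh_datasource("Sales Fact")
```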
  26. Self-Service BI • Change management: from a report-writing culture to a data-driven company • Clear authority: executive support • The analytic culture: business executives must have a vision for analytics and the willingness to invest in the people, processes, and technologies for the long haul to ensure a successful outcome • The right people (data engineers, BI engineers, business analysts) • The right organizational structure: a BI Center of Excellence that establishes and inculcates best practices for building analytical applications • The right data and architecture • The right tools: Redshift, Matillion and Tableau are best for self-service
  27. Report Automation • Before (TL;DR: Ctrl+C, Ctrl+V, IT dependency): lots of SQL and Excel routine; each team defines its own report style and format; multiple metric definitions; no visualization, no alerts; slow data discovery and hypothesis evaluation • After: central BI portal; reusable Tableau data sources, a.k.a. the business layer; common WBR format; eliminated manual work; no spreadsheets or ad-hoc SQL queries; data discovery; ETL integration; friendly drag-and-drop GUI
  28. Lessons Learned from Moving a DW into AWS (Cloud)
  29. Five Points of Guidance for Redshift (SET DW)
      1. Sort keys: choose up to 3 columns, ordered in increasing order of specificity, balanced with likelihood of use; leave INTERLEAVED sort keys for the 1-year anniversary.
      2. Column encoding: compress all columns except (at least) the first sort key.
      3. Table maintenance: VACUUM and ANALYZE tables weekly (use STL_ALERT_EVENT_LOG as a guide for frequency); ANALYZE PREDICATE COLUMNS is very useful for a quick daily stats refresh.
      4. Choose a distribution key that follows the common join pattern for the table and evenly distributes the data across the database slices on the cluster; DISTSTYLE ALL is a great go-to for dimension tables < ~3 million rows; DISTSTYLE EVEN is a good fail-safe, but guarantees inter-node data redistribution.
      5. Workload Management (WLM) and Query Monitoring Rules (QMR): start with up to 3 queues (in addition to what Redshift provides automatically); put ETL in its own queue with a very low active_statement count (perhaps as low as 1 or 2) and monitor commit queuing; split the memory across the queues and monitor the percent of each queue's workload going to disk; expect to change WLM settings to match workload changes (day|night, weekday|weekend). An illustrative DDL follows below.
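An illustrative DDL under these five points, with hypothetical table and column names: a compound sort key of up to three columns ordered by specificity, compression everywhere except the leading sort key, and a distribution key on the common join column.

```python
# Runnable via psycopg2 or a Matillion SQL component; names are placeholders.
FACT_DDL = """
CREATE TABLE dw.fact_sales (
    order_date   DATE          ENCODE RAW,   -- leading sort key stays raw
    customer_id  BIGINT        ENCODE zstd,  -- dist key: common join column
    book_id      BIGINT        ENCODE zstd,
    quantity     INTEGER       ENCODE zstd,
    amount       DECIMAL(12,2) ENCODE zstd
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date, customer_id, book_id);
"""
```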
  30. Lesson One: CHOOSE THE RIGHT MIGRATION STRATEGY • Lift & Shift • Typical approach • Move all at once • Move to the target platform, then evolve • This approach gets you to the cloud quickly • Relatively small barrier to learning new technology, since it tends to be a close fit
  31. Lesson One: CHOOSE THE RIGHT MIGRATION STRATEGY • Split & Flip • Split the application into logical functional data layers • Match the data functionality with the right technology • Leverage the wide selection of tools on AWS to best fit the need • Move data in phases: prototype, learn and perfect
  32. Lesson Two: CHANGE YOUR MINDSET • Take the time to learn • It is critical to train on and learn the new technologies being used • It is easy to think only about translating or converting • We made many such changes: relational vs non-relational, batch vs streaming, service-based vs procedural, etc.
  33. Lesson Two: CHANGE YOUR MINDSET • Traditional DW: faster runtime is better • Cloud: if runtime is slower, it is easy to scale • Reality: Query #1 uses 64 cores and runs in 1 min; Query #2 uses 1 core and runs in 2 mins • Practical limitation to scale: a fixed budget
  34. Lesson Two: CHANGE YOUR MINDSET • We optimized for cost in Redshift • What is the most work that can be done using the given fixed budget? • The focus is on the total amount of work versus optimizing for a single user • Everything you use comes at a cost in the cloud: DynamoDB performance, Redshift vs Spectrum (S3) • Cost is just one example of the many mindset changes that we made
  35. Lesson Three: DON'T BE SCARED, OPEN THE BLACK BOX • All business logic is hidden in legacy ETL scripts • There is a tradeoff between a fast project and business users' expectations • Learn about your business • Discover and fix the issues
  36. Lesson Four: BE AGILE AND INVOLVE THE BUSINESS • Agile benefits: see results earlier; constant feedback; serves your users; flexibility; quality assurance
  37. Lesson Five: PLAN YOUR EVOLUTION • Handling less efficient queries • Provide a separate cluster as a sandbox • App developers design new queries that fit the constraints of hands-off operations • Example: create roll-up summary tables in Redshift (a sketch follows below)
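The roll-up example can be sketched as a Redshift CTAS, pre-aggregating the fact table so sandbox queries hit a small summary table instead. Table and column names are hypothetical.

```python
# CTAS in Redshift; DISTKEY/SORTKEY attributes go before AS in CTAS syntax.
SUMMARY_SQL = """
CREATE TABLE dw.sales_daily_summary
DISTKEY (book_id)
SORTKEY (order_date)
AS
SELECT order_date,
       book_id,
       SUM(quantity) AS units_sold,
       SUM(amount)   AS revenue
FROM dw.fact_sales
GROUP BY order_date, book_id;
"""
```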
  38. Q&A • Contact details: anoshind@amazon.com

Editor's notes

  • Company a 'winner':
    Will this tool be supported and fully usable in 3-5 years?
    Will it be adopted by Amazon? Will there be a community of use?
    Recommendations within Amazon (such as AWS SAs)
    Years in business, customers, profitability

    Management: scheduling built in; intuitive views of DW processes, models, and schedules; does it help someone understand DW data flows?

    Deployment/architecture: AWS better than local; Linux better than Windows; must be a patchable platform within Amazon guidelines
  • The biggest risk was the investment in a tool from a small player:
    Porting ETL processes away from Matillion would be no less expensive than porting away from PL/SQL and db links
