SlideShare una empresa de Scribd logo
1 de 67
Descargar para leer sin conexión
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sourabh Bajaj, Software Engineer, Coursera
October 2015
BDT404
Large-Scale ETL Data Flows
With Data Pipeline and Dataduct
What to Expect from the Session
●Learn about:
• How we use AWS Data Pipeline to manage ETL at Coursera
• Why we built Dataduct, an open source framework from
Coursera for running pipelines
• How Dataduct enables developers to write their own ETL
pipelines for their services
• Best practices for managing pipelines
• How to start using Dataduct for your own pipelines
Coursera
Coursera
Coursera
120
partners
2.5 million
course completions
Education at Scale
15 million
learners worldwide
1300
courses
Data Warehousing at Coursera
Amazon Redshift
167 Amazon Redshift users
1200 EDW tables
22 source systems
6 dc1.8xlarge instances
30,000,000 queries run
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
ETL at Coursera
150 Active pipelines 44 Dataduct developers
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Dataduct
●Open source wrapper around AWS Data Pipeline
Dataduct
Dataduct
●Open source wrapper around AWS Data Pipeline
●It provides:
• Code reuse
• Extensibility
• Command line interface
• Staging environment support
• Dynamic updates
Dataduct
●Repository
• https://github.com/coursera/dataduct
●Documentation
• http://dataduct.readthedocs.org/en/latest/
●Installation
• pip install dataduct
Let’s build some pipelines
Pipeline 1: Amazon RDS → Amazon Redshift
●Let’s start with a simple pipeline of pulling data from a
relational store to Amazon Redshift
Amazon
Redshift
Amazon
RDS
Amazon S3
Amazon EC2
AWS Data Pipeline
Pipeline 1: Amazon RDS → Amazon Redshift
● Definition in YAML
● Steps
● Shared Config
● Visualization
● Overrides
● Reusable code
Pipeline 1: Amazon RDS → Amazon Redshift
Pipeline 1: Amazon RDS → Amazon Redshift
(Steps)
●Extract RDS
• Fetch data from Amazon RDS and output to Amazon S3
Pipeline 1: Amazon RDS → Amazon Redshift
(Steps)
●Create-Load-Redshift
• Create table if it doesn’t exist and load data using COPY.
Pipeline 1: Amazon RDS → Amazon Redshift
●Upsert
• Update and insert into the production table from staging.
Pipeline 1: Amazon RDS → Amazon Redshift
(Tasks)
●Bootstrap
• Fully automated
• Fetch latest binaries from Amazon S3 for Amazon EC2 /
Amazon EMR
• Install any updated dependencies on the resource
• Make sure that the pipeline would run the latest version of code
Pipeline 1: Amazon RDS → Amazon Redshift
● Quality assurance
• Primary key violations in the warehouse.
• Dropped rows: By comparing the number of rows.
• Corrupted rows: By comparing a sample set of rows.
• Automatically done within UPSERT
Pipeline 1: Amazon RDS → Amazon Redshift
●Teardown
• Amazon SNS alerting for failed tasks
• Logging of task failures
• Monitoring
• Run times
• Retries
• Machine health
Pipeline 1: Amazon RDS → Amazon Redshift
(Config)
●Visualization
• Automatically generated
by Dataduct
• Allows easy debugging
Pipeline 1: Amazon RDS → Amazon Redshift
●Shared Config
• IAM roles
• AMI
• Security group
• Retries
• Custom steps
• Resource paths
Pipeline 1: Amazon RDS → Amazon Redshift
●Custom steps
• Open-sourced steps can easily be shared across multiple
pipelines
• You can also create new steps and add them using the config
Deploying a pipeline
●Command line interface for all operations
usage: dataduct pipeline activate [-h] [-m MODE] [-f] [-t TIME_DELTA] [-b]
pipeline_definitions
[pipeline_definitions ...]
Pipeline 2: Cassandra → Amazon Redshift
Amazon
Redshift
Amazon EMR
(Scalding)
Amazon S3
Amazon S3 Amazon EMR
(Aegisthus)
Cassandra
AWS Data Pipeline
●Shell command activity to rescue
Pipeline 2: Cassandra → Amazon Redshift
●Shell command activity to rescue
●Priam backups of Cassandra to Amazon S3
Pipeline 2: Cassandra → Amazon Redshift
●Shell command activity to rescue
●Priam backups of Cassandra to S3
●Aegisthus to parse SSTables into Avro dumps
Pipeline 2: Cassandra → Amazon Redshift
●Shell command activity to rescue
●Priam backups of Cassandra to Amazon S3
●Aegisthus to parse SSTables into Avro dumps
●Scalding to process Aegisthus output
• Extend the base steps to create more patterns
Pipeline 2: Cassandra → Amazon Redshift
Pipeline 2: Cassandra → Amazon Redshift
● Custom steps
• Aegisthus
• Scalding
Pipeline 2: Cassandra → Amazon Redshift
● EMR-Config overrides the defaults
Pipeline 2: Cassandra → Amazon Redshift
● Multiple output nodes from transform step
Pipeline 2: Cassandra → Amazon Redshift
● Bootstrap
• Save every pipeline definition
• Fetching new jar for the Amazon EMR jobs
• Specify the same Hadoop / Hive metastore installation
Pipeline 2: Cassandra → Amazon Redshift
Data products
●We’ve talked about data into the warehouse
●Common pattern:
• Wait for dependencies
• Computation inside redshift to create derived tables
• Amazon EMR activities for more complex process
• Load back into MySQL / Cassandra
• Product feature queries MySQL / Cassandra
●Used in recommendations, dashboards, and search
Recommendations
●Objective
• Connecting the learner to right content
●Use cases:
• Recommendations email
• Course discovery
• Reactivation of the users
Recommendations
●Computation inside Amazon Redshift to create derived
tables for co-enrollments
●Amazon EMR job for model training
●Model file pushed to Amazon S3
●Prediction API uses the updated model file
●Contract between the prediction and the training layer is
via model definition.
Internal Dashboard
●Objective
• Serve internal dashboards to create a data driven culture
●Use Cases
• Key performance indicators for the company
• Track results for different A/B experiments
Internal Dashboard
● Do:
• Monitoring (run times, retries, deploys, query times)
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
• Test environment in staging should mimic prod
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
• Test environment in staging should mimic prod
• Shared library to democratize writing of ETL
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
• Test environment in staging should mimic prod
• Shared library to democratize writing of ETL
• Using read-replicas and backups
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
• Non version controlled pipelines
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
• Non version controlled pipelines
• Really huge pipelines instead of modular small pipelines with
dependencies
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
• Non version controlled pipelines
• Really huge pipelines instead of modular small pipelines with
dependencies
• Not catching resource timeouts or load delays
Learnings
Dataduct
●Code reuse
●Extensibility
●Command line interface
●Staging environment support
●Dynamic updates
Dataduct
●Repository
• https://github.com/coursera/dataduct
●Documentation
• http://dataduct.readthedocs.org/en/latest/
●Installation
• pip install dataduct
Questions?
Also, we are hiring!
https://www.coursera.org/jobs
Remember to complete
your evaluations!
Thank you!
Also, we are hiring!
https://www.coursera.org/jobs
Sourabh Bajaj
sb2nov
@sb2nov

Más contenido relacionado

La actualidad más candente

AWS Summit Seoul 2023 | AWS에서 OpenTelemetry 기반의 애플리케이션 Observability 구축/활용하기
AWS Summit Seoul 2023 | AWS에서 OpenTelemetry 기반의 애플리케이션 Observability 구축/활용하기AWS Summit Seoul 2023 | AWS에서 OpenTelemetry 기반의 애플리케이션 Observability 구축/활용하기
AWS Summit Seoul 2023 | AWS에서 OpenTelemetry 기반의 애플리케이션 Observability 구축/활용하기Amazon Web Services Korea
 
클라우드 기반 AWS 데이터베이스 선택 옵션 - AWS Summit Seoul 2017
클라우드 기반 AWS 데이터베이스 선택 옵션 - AWS Summit Seoul 2017 클라우드 기반 AWS 데이터베이스 선택 옵션 - AWS Summit Seoul 2017
클라우드 기반 AWS 데이터베이스 선택 옵션 - AWS Summit Seoul 2017 Amazon Web Services Korea
 
AWS CLOUD 2017 - Amazon Aurora를 통한 고성능 데이터베이스 운용하기 (박선용 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Aurora를 통한 고성능 데이터베이스 운용하기 (박선용 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Aurora를 통한 고성능 데이터베이스 운용하기 (박선용 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Aurora를 통한 고성능 데이터베이스 운용하기 (박선용 솔루션즈 아키텍트)Amazon Web Services Korea
 
금융 회사를 위한 클라우드 이용 가이드 – 신은수 AWS 솔루션즈 아키텍트, 김호영 AWS 정책협력 담당:: AWS Cloud Week ...
금융 회사를 위한 클라우드 이용 가이드 –  신은수 AWS 솔루션즈 아키텍트, 김호영 AWS 정책협력 담당:: AWS Cloud Week ...금융 회사를 위한 클라우드 이용 가이드 –  신은수 AWS 솔루션즈 아키텍트, 김호영 AWS 정책협력 담당:: AWS Cloud Week ...
금융 회사를 위한 클라우드 이용 가이드 – 신은수 AWS 솔루션즈 아키텍트, 김호영 AWS 정책협력 담당:: AWS Cloud Week ...Amazon Web Services Korea
 
Amazon Aurora Deep Dive (김기완) - AWS DB Day
Amazon Aurora Deep Dive (김기완) - AWS DB DayAmazon Aurora Deep Dive (김기완) - AWS DB Day
Amazon Aurora Deep Dive (김기완) - AWS DB DayAmazon Web Services Korea
 
[AWSマイスターシリーズ] Amazon Elastic Compute Cloud (EC2)
 [AWSマイスターシリーズ] Amazon Elastic Compute Cloud (EC2) [AWSマイスターシリーズ] Amazon Elastic Compute Cloud (EC2)
[AWSマイスターシリーズ] Amazon Elastic Compute Cloud (EC2)Amazon Web Services Japan
 
AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020
AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020 AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020
AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020 AWSKRUG - AWS한국사용자모임
 
Lambda@Edge를통한멀티리전기반글로벌트래픽길들이기::이상현::AWS Summit Seoul 2018
Lambda@Edge를통한멀티리전기반글로벌트래픽길들이기::이상현::AWS Summit Seoul 2018Lambda@Edge를통한멀티리전기반글로벌트래픽길들이기::이상현::AWS Summit Seoul 2018
Lambda@Edge를통한멀티리전기반글로벌트래픽길들이기::이상현::AWS Summit Seoul 2018Amazon Web Services Korea
 
만들자! 데이터 기반의 스마트 팩토리 - 문태양 AWS 솔루션즈 아키텍트 / 배권 팀장, OCI 정보통신 :: AWS Summit Seou...
만들자! 데이터 기반의 스마트 팩토리 - 문태양 AWS 솔루션즈 아키텍트 / 배권 팀장, OCI 정보통신 :: AWS Summit Seou...만들자! 데이터 기반의 스마트 팩토리 - 문태양 AWS 솔루션즈 아키텍트 / 배권 팀장, OCI 정보통신 :: AWS Summit Seou...
만들자! 데이터 기반의 스마트 팩토리 - 문태양 AWS 솔루션즈 아키텍트 / 배권 팀장, OCI 정보통신 :: AWS Summit Seou...Amazon Web Services Korea
 
20200721 AWS Black Belt Online Seminar AWS App Mesh
20200721 AWS Black Belt Online Seminar AWS App Mesh20200721 AWS Black Belt Online Seminar AWS App Mesh
20200721 AWS Black Belt Online Seminar AWS App MeshAmazon Web Services Japan
 
Amazon Redshift로 데이터웨어하우스(DW) 구축하기
Amazon Redshift로 데이터웨어하우스(DW) 구축하기Amazon Redshift로 데이터웨어하우스(DW) 구축하기
Amazon Redshift로 데이터웨어하우스(DW) 구축하기Amazon Web Services Korea
 
Amazon SageMaker 推論エンドポイントを利用したアプリケーション開発
Amazon SageMaker 推論エンドポイントを利用したアプリケーション開発Amazon SageMaker 推論エンドポイントを利用したアプリケーション開発
Amazon SageMaker 推論エンドポイントを利用したアプリケーション開発Amazon Web Services Japan
 
[Spring Camp 2018] 11번가 Spring Cloud 기반 MSA로의 전환 : 지난 1년간의 이야기
[Spring Camp 2018] 11번가 Spring Cloud 기반 MSA로의 전환 : 지난 1년간의 이야기[Spring Camp 2018] 11번가 Spring Cloud 기반 MSA로의 전환 : 지난 1년간의 이야기
[Spring Camp 2018] 11번가 Spring Cloud 기반 MSA로의 전환 : 지난 1년간의 이야기YongSung Yoon
 
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...Amazon Web Services Korea
 
AWS Finance Symposium_바로 도입할 수 있는 금융권 업무의 클라우드 아키텍처 알아보기
AWS Finance Symposium_바로 도입할 수 있는 금융권 업무의 클라우드 아키텍처 알아보기AWS Finance Symposium_바로 도입할 수 있는 금융권 업무의 클라우드 아키텍처 알아보기
AWS Finance Symposium_바로 도입할 수 있는 금융권 업무의 클라우드 아키텍처 알아보기Amazon Web Services Korea
 
[오픈소스컨설팅]Java Performance Tuning
[오픈소스컨설팅]Java Performance Tuning[오픈소스컨설팅]Java Performance Tuning
[오픈소스컨설팅]Java Performance TuningJi-Woong Choi
 
[AKIBA.AWS] VPCをネットワーク図で理解してみる
[AKIBA.AWS] VPCをネットワーク図で理解してみる[AKIBA.AWS] VPCをネットワーク図で理解してみる
[AKIBA.AWS] VPCをネットワーク図で理解してみるShuji Kikuchi
 
20190821 AWS Black Belt Online Seminar AWS AppSync
20190821 AWS Black Belt Online Seminar AWS AppSync20190821 AWS Black Belt Online Seminar AWS AppSync
20190821 AWS Black Belt Online Seminar AWS AppSyncAmazon Web Services Japan
 
AWS KMS를 활용하여 안전한 AWS 환경을 구축하기 위한 전략::임기성::AWS Summit Seoul 2018
AWS KMS를 활용하여 안전한 AWS 환경을 구축하기 위한 전략::임기성::AWS Summit Seoul 2018AWS KMS를 활용하여 안전한 AWS 환경을 구축하기 위한 전략::임기성::AWS Summit Seoul 2018
AWS KMS를 활용하여 안전한 AWS 환경을 구축하기 위한 전략::임기성::AWS Summit Seoul 2018Amazon Web Services Korea
 

La actualidad más candente (20)

Oracle on AWS RDS Migration - 성기명
Oracle on AWS RDS Migration - 성기명Oracle on AWS RDS Migration - 성기명
Oracle on AWS RDS Migration - 성기명
 
AWS Summit Seoul 2023 | AWS에서 OpenTelemetry 기반의 애플리케이션 Observability 구축/활용하기
AWS Summit Seoul 2023 | AWS에서 OpenTelemetry 기반의 애플리케이션 Observability 구축/활용하기AWS Summit Seoul 2023 | AWS에서 OpenTelemetry 기반의 애플리케이션 Observability 구축/활용하기
AWS Summit Seoul 2023 | AWS에서 OpenTelemetry 기반의 애플리케이션 Observability 구축/활용하기
 
클라우드 기반 AWS 데이터베이스 선택 옵션 - AWS Summit Seoul 2017
클라우드 기반 AWS 데이터베이스 선택 옵션 - AWS Summit Seoul 2017 클라우드 기반 AWS 데이터베이스 선택 옵션 - AWS Summit Seoul 2017
클라우드 기반 AWS 데이터베이스 선택 옵션 - AWS Summit Seoul 2017
 
AWS CLOUD 2017 - Amazon Aurora를 통한 고성능 데이터베이스 운용하기 (박선용 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Aurora를 통한 고성능 데이터베이스 운용하기 (박선용 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Aurora를 통한 고성능 데이터베이스 운용하기 (박선용 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Aurora를 통한 고성능 데이터베이스 운용하기 (박선용 솔루션즈 아키텍트)
 
금융 회사를 위한 클라우드 이용 가이드 – 신은수 AWS 솔루션즈 아키텍트, 김호영 AWS 정책협력 담당:: AWS Cloud Week ...
금융 회사를 위한 클라우드 이용 가이드 –  신은수 AWS 솔루션즈 아키텍트, 김호영 AWS 정책협력 담당:: AWS Cloud Week ...금융 회사를 위한 클라우드 이용 가이드 –  신은수 AWS 솔루션즈 아키텍트, 김호영 AWS 정책협력 담당:: AWS Cloud Week ...
금융 회사를 위한 클라우드 이용 가이드 – 신은수 AWS 솔루션즈 아키텍트, 김호영 AWS 정책협력 담당:: AWS Cloud Week ...
 
Amazon Aurora Deep Dive (김기완) - AWS DB Day
Amazon Aurora Deep Dive (김기완) - AWS DB DayAmazon Aurora Deep Dive (김기완) - AWS DB Day
Amazon Aurora Deep Dive (김기완) - AWS DB Day
 
[AWSマイスターシリーズ] Amazon Elastic Compute Cloud (EC2)
 [AWSマイスターシリーズ] Amazon Elastic Compute Cloud (EC2) [AWSマイスターシリーズ] Amazon Elastic Compute Cloud (EC2)
[AWSマイスターシリーズ] Amazon Elastic Compute Cloud (EC2)
 
AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020
AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020 AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020
AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020
 
Lambda@Edge를통한멀티리전기반글로벌트래픽길들이기::이상현::AWS Summit Seoul 2018
Lambda@Edge를통한멀티리전기반글로벌트래픽길들이기::이상현::AWS Summit Seoul 2018Lambda@Edge를통한멀티리전기반글로벌트래픽길들이기::이상현::AWS Summit Seoul 2018
Lambda@Edge를통한멀티리전기반글로벌트래픽길들이기::이상현::AWS Summit Seoul 2018
 
만들자! 데이터 기반의 스마트 팩토리 - 문태양 AWS 솔루션즈 아키텍트 / 배권 팀장, OCI 정보통신 :: AWS Summit Seou...
만들자! 데이터 기반의 스마트 팩토리 - 문태양 AWS 솔루션즈 아키텍트 / 배권 팀장, OCI 정보통신 :: AWS Summit Seou...만들자! 데이터 기반의 스마트 팩토리 - 문태양 AWS 솔루션즈 아키텍트 / 배권 팀장, OCI 정보통신 :: AWS Summit Seou...
만들자! 데이터 기반의 스마트 팩토리 - 문태양 AWS 솔루션즈 아키텍트 / 배권 팀장, OCI 정보통신 :: AWS Summit Seou...
 
20200721 AWS Black Belt Online Seminar AWS App Mesh
20200721 AWS Black Belt Online Seminar AWS App Mesh20200721 AWS Black Belt Online Seminar AWS App Mesh
20200721 AWS Black Belt Online Seminar AWS App Mesh
 
Amazon Redshift로 데이터웨어하우스(DW) 구축하기
Amazon Redshift로 데이터웨어하우스(DW) 구축하기Amazon Redshift로 데이터웨어하우스(DW) 구축하기
Amazon Redshift로 데이터웨어하우스(DW) 구축하기
 
Amazon SageMaker 推論エンドポイントを利用したアプリケーション開発
Amazon SageMaker 推論エンドポイントを利用したアプリケーション開発Amazon SageMaker 推論エンドポイントを利用したアプリケーション開発
Amazon SageMaker 推論エンドポイントを利用したアプリケーション開発
 
[Spring Camp 2018] 11번가 Spring Cloud 기반 MSA로의 전환 : 지난 1년간의 이야기
[Spring Camp 2018] 11번가 Spring Cloud 기반 MSA로의 전환 : 지난 1년간의 이야기[Spring Camp 2018] 11번가 Spring Cloud 기반 MSA로의 전환 : 지난 1년간의 이야기
[Spring Camp 2018] 11번가 Spring Cloud 기반 MSA로의 전환 : 지난 1년간의 이야기
 
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...
 
AWS Finance Symposium_바로 도입할 수 있는 금융권 업무의 클라우드 아키텍처 알아보기
AWS Finance Symposium_바로 도입할 수 있는 금융권 업무의 클라우드 아키텍처 알아보기AWS Finance Symposium_바로 도입할 수 있는 금융권 업무의 클라우드 아키텍처 알아보기
AWS Finance Symposium_바로 도입할 수 있는 금융권 업무의 클라우드 아키텍처 알아보기
 
[오픈소스컨설팅]Java Performance Tuning
[오픈소스컨설팅]Java Performance Tuning[오픈소스컨설팅]Java Performance Tuning
[오픈소스컨설팅]Java Performance Tuning
 
[AKIBA.AWS] VPCをネットワーク図で理解してみる
[AKIBA.AWS] VPCをネットワーク図で理解してみる[AKIBA.AWS] VPCをネットワーク図で理解してみる
[AKIBA.AWS] VPCをネットワーク図で理解してみる
 
20190821 AWS Black Belt Online Seminar AWS AppSync
20190821 AWS Black Belt Online Seminar AWS AppSync20190821 AWS Black Belt Online Seminar AWS AppSync
20190821 AWS Black Belt Online Seminar AWS AppSync
 
AWS KMS를 활용하여 안전한 AWS 환경을 구축하기 위한 전략::임기성::AWS Summit Seoul 2018
AWS KMS를 활용하여 안전한 AWS 환경을 구축하기 위한 전략::임기성::AWS Summit Seoul 2018AWS KMS를 활용하여 안전한 AWS 환경을 구축하기 위한 전략::임기성::AWS Summit Seoul 2018
AWS KMS를 활용하여 안전한 AWS 환경을 구축하기 위한 전략::임기성::AWS Summit Seoul 2018
 

Similar a (BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct

Large-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and DataductLarge-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and DataductSourabh Bajaj
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...Amazon Web Services
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWSPaolo latella
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Sid Anand
 
AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722Amazon Web Services
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructureharendra_pathak
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
London Redshift Meetup - July 2017
London Redshift Meetup - July 2017London Redshift Meetup - July 2017
London Redshift Meetup - July 2017Pratim Das
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaHelen Rogers
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Running Business Critical Workloads on AWS – Nam Je Cho
Running Business Critical Workloads on AWS – Nam Je ChoRunning Business Critical Workloads on AWS – Nam Je Cho
Running Business Critical Workloads on AWS – Nam Je ChoAmazon Web Services
 
Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)
Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)
Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)Amazon Web Services
 
Serverless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBServerless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBAmazon Web Services
 
AWS re:Invent 2016: Building a Platform for Collaborative Scientific Research...
AWS re:Invent 2016: Building a Platform for Collaborative Scientific Research...AWS re:Invent 2016: Building a Platform for Collaborative Scientific Research...
AWS re:Invent 2016: Building a Platform for Collaborative Scientific Research...Amazon Web Services
 

Similar a (BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct (20)

Large-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and DataductLarge-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and Dataduct
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
 
AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
London Redshift Meetup - July 2017
London Redshift Meetup - July 2017London Redshift Meetup - July 2017
London Redshift Meetup - July 2017
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon Elisha
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Running Business Critical Workloads on AWS – Nam Je Cho
Running Business Critical Workloads on AWS – Nam Je ChoRunning Business Critical Workloads on AWS – Nam Je Cho
Running Business Critical Workloads on AWS – Nam Je Cho
 
Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)
Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)
Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)
 
Serverless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBServerless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDB
 
Managing Your Cloud Assets
Managing Your Cloud AssetsManaging Your Cloud Assets
Managing Your Cloud Assets
 
AWS re:Invent 2016: Building a Platform for Collaborative Scientific Research...
AWS re:Invent 2016: Building a Platform for Collaborative Scientific Research...AWS re:Invent 2016: Building a Platform for Collaborative Scientific Research...
AWS re:Invent 2016: Building a Platform for Collaborative Scientific Research...
 

Más de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Último

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Último (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sourabh Bajaj, Software Engineer, Coursera October 2015 BDT404 Large-Scale ETL Data Flows With Data Pipeline and Dataduct
  • 2. What to Expect from the Session ●Learn about: • How we use AWS Data Pipeline to manage ETL at Coursera • Why we built Dataduct, an open source framework from Coursera for running pipelines • How Dataduct enables developers to write their own ETL pipelines for their services • Best practices for managing pipelines • How to start using Dataduct for your own pipelines
  • 6. 120 partners 2.5 million course completions Education at Scale 15 million learners worldwide 1300 courses
  • 7. Data Warehousing at Coursera Amazon Redshift 167 Amazon Redshift users 1200 EDW tables 22 source systems 6 dc1.8xlarge instances 30,000,000 queries run
  • 8. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 9. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 10. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 11. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 12. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 13. ETL at Coursera 150 Active pipelines 44 Dataduct developers
  • 14. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 15. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 16. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 17. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 18. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 19. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 20. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 22. ●Open source wrapper around AWS Data Pipeline Dataduct
  • 23. Dataduct ●Open source wrapper around AWS Data Pipeline ●It provides: • Code reuse • Extensibility • Command line interface • Staging environment support • Dynamic updates
  • 25. Let’s build some pipelines
  • 26. Pipeline 1: Amazon RDS → Amazon Redshift ●Let’s start with a simple pipeline of pulling data from a relational store to Amazon Redshift Amazon Redshift Amazon RDS Amazon S3 Amazon EC2 AWS Data Pipeline
  • 27. Pipeline 1: Amazon RDS → Amazon Redshift
  • 28. ● Definition in YAML ● Steps ● Shared Config ● Visualization ● Overrides ● Reusable code Pipeline 1: Amazon RDS → Amazon Redshift
  • 29. Pipeline 1: Amazon RDS → Amazon Redshift (Steps) ●Extract RDS • Fetch data from Amazon RDS and output to Amazon S3
  • 30. Pipeline 1: Amazon RDS → Amazon Redshift (Steps) ●Create-Load-Redshift • Create table if it doesn’t exist and load data using COPY.
  • 31. Pipeline 1: Amazon RDS → Amazon Redshift ●Upsert • Update and insert into the production table from staging.
  • 32. Pipeline 1: Amazon RDS → Amazon Redshift (Tasks) ●Bootstrap • Fully automated • Fetch latest binaries from Amazon S3 for Amazon EC2 / Amazon EMR • Install any updated dependencies on the resource • Make sure that the pipeline would run the latest version of code
  • 33. Pipeline 1: Amazon RDS → Amazon Redshift ● Quality assurance • Primary key violations in the warehouse. • Dropped rows: By comparing the number of rows. • Corrupted rows: By comparing a sample set of rows. • Automatically done within UPSERT
  • 34. Pipeline 1: Amazon RDS → Amazon Redshift ●Teardown • Amazon SNS alerting for failed tasks • Logging of task failures • Monitoring • Run times • Retries • Machine health
  • 35. Pipeline 1: Amazon RDS → Amazon Redshift (Config) ●Visualization • Automatically generated by Dataduct • Allows easy debugging
  • 36. Pipeline 1: Amazon RDS → Amazon Redshift ●Shared Config • IAM roles • AMI • Security group • Retries • Custom steps • Resource paths
  • 37. Pipeline 1: Amazon RDS → Amazon Redshift ●Custom steps • Open-sourced steps can easily be shared across multiple pipelines • You can also create new steps and add them using the config
  • 38. Deploying a pipeline ●Command line interface for all operations usage: dataduct pipeline activate [-h] [-m MODE] [-f] [-t TIME_DELTA] [-b] pipeline_definitions [pipeline_definitions ...]
  • 39. Pipeline 2: Cassandra → Amazon Redshift Amazon Redshift Amazon EMR (Scalding) Amazon S3 Amazon S3 Amazon EMR (Aegisthus) Cassandra AWS Data Pipeline
  • 40. ●Shell command activity to rescue Pipeline 2: Cassandra → Amazon Redshift
  • 41. ●Shell command activity to rescue ●Priam backups of Cassandra to Amazon S3 Pipeline 2: Cassandra → Amazon Redshift
  • 42. ●Shell command activity to rescue ●Priam backups of Cassandra to S3 ●Aegisthus to parse SSTables into Avro dumps Pipeline 2: Cassandra → Amazon Redshift
  • 43. ●Shell command activity to rescue ●Priam backups of Cassandra to Amazon S3 ●Aegisthus to parse SSTables into Avro dumps ●Scalding to process Aegisthus output • Extend the base steps to create more patterns Pipeline 2: Cassandra → Amazon Redshift
  • 44. Pipeline 2: Cassandra → Amazon Redshift
  • 45. ● Custom steps • Aegisthus • Scalding Pipeline 2: Cassandra → Amazon Redshift
  • 46. ● EMR-Config overrides the defaults Pipeline 2: Cassandra → Amazon Redshift
  • 47. ● Multiple output nodes from transform step Pipeline 2: Cassandra → Amazon Redshift
  • 48. ● Bootstrap • Save every pipeline definition • Fetching new jar for the Amazon EMR jobs • Specify the same Hadoop / Hive metastore installation Pipeline 2: Cassandra → Amazon Redshift
  • 49. Data products ●We’ve talked about data into the warehouse ●Common pattern: • Wait for dependencies • Computation inside redshift to create derived tables • Amazon EMR activities for more complex process • Load back into MySQL / Cassandra • Product feature queries MySQL / Cassandra ●Used in recommendations, dashboards, and search
  • 50. Recommendations ●Objective • Connecting the learner to right content ●Use cases: • Recommendations email • Course discovery • Reactivation of the users
  • 51. Recommendations ●Computation inside Amazon Redshift to create derived tables for co-enrollments ●Amazon EMR job for model training ●Model file pushed to Amazon S3 ●Prediction API uses the updated model file ●Contract between the prediction and the training layer is via model definition.
  • 52. Internal Dashboard ●Objective • Serve internal dashboards to create a data driven culture ●Use Cases • Key performance indicators for the company • Track results for different A/B experiments
  • 54. ● Do: • Monitoring (run times, retries, deploys, query times) Learnings
  • 55. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline Learnings
  • 56. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline • Test environment in staging should mimic prod Learnings
  • 57. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline • Test environment in staging should mimic prod • Shared library to democratize writing of ETL Learnings
  • 58. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline • Test environment in staging should mimic prod • Shared library to democratize writing of ETL • Using read-replicas and backups Learnings
  • 59. ● Don’t: • Same code passed to multiple pipelines as a script Learnings
  • 60. ● Don’t: • Same code passed to multiple pipelines as a script • Non version controlled pipelines Learnings
  • 61. ● Don’t: • Same code passed to multiple pipelines as a script • Non version controlled pipelines • Really huge pipelines instead of modular small pipelines with dependencies Learnings
  • 62. ● Don’t: • Same code passed to multiple pipelines as a script • Non version controlled pipelines • Really huge pipelines instead of modular small pipelines with dependencies • Not catching resource timeouts or load delays Learnings
  • 63. Dataduct ●Code reuse ●Extensibility ●Command line interface ●Staging environment support ●Dynamic updates
  • 65. Questions? Also, we are hiring! https://www.coursera.org/jobs
  • 67. Thank you! Also, we are hiring! https://www.coursera.org/jobs Sourabh Bajaj sb2nov @sb2nov