ML IN DATA PLATFORM
A Case Study with NLP Application
US Office
2150 Ringwood Ave, San Jose,
CA 95131
UK Office
3 Beeston Place, Belgravia,
London SW1W 0JJ, UK
Vietnam Office
Floor #1-4, 302 Le Van Sy,
Ward 1, Tan Binh District, HCMC,
Vietnam
SG Office
6A Shenton Way #04-08 OUE
Downtown Gallery Singapore 068815
Table of contents
No Content
1 Introduction
2 Data Platform – ETL Process
3 Data Platform – Analytics Workflow
4 Afterthoughts
INTRODUCTION
01
1. Introduction to Case Study
2. Introduction to Data Platform
1.1.1. Potential Values of ML/NLP Application
- ML applications can bring new-found values
- Case study: Online Review Analytics
- Opinions from others increasingly guide customers' purchases
=> Growth, Improvement, Investment implications
Refs
- https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/five-star-growth-using-online-ratings-to-design-better-products
- https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/customer-review-preference-statistics/
1.1.2. Dealing with text data
- An insight-mining platform for review text is highly valuable, but it is difficult to build
- Engineering challenges
- Getting the reviews => web-scraping, data collection
- Storing reviews => moving, maintaining, deduplicating large volumes of text
- Processing reviews => text cleaning, processing, and analytics at scale
- Analytics challenges
- Natural Language Processing – NLP
- Insight communication: dashboards and visualization
1.2.1. Data Platform overall architecture
1.2.2. Example: output from ETL Process
1.3. Example: output from Analytics Workflow
1.3. Example: insight communication – Web Application
ETL PROCESS
02
1. Extract, Transform, Load
2. Data Collection
3. Data Storage
2.1. Extract, Transform, Load
- Extract:
- Data Collector: collect data from websites
- Extract and Map from raw data collected
- Transform: clean up data (trimming, special characters,…), deduplication, etc.
- Load: to databases for storage and analysis: MongoDB, BigQuery
- Batching: split large amounts of data into batches for parallel processing
- Worker: a container that moves/processes data -> Mini-ETL
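The Transform and Batching steps above can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: the function names (`clean_text`, `transform`, `batches`), the record shape (`id`, `text`), and the choice to deduplicate on lower-cased cleaned text are all assumptions made for the example.

```python
import re
from typing import Iterable, Iterator


def clean_text(text: str) -> str:
    """Drop non-printable characters, collapse whitespace runs, and trim."""
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()


def transform(reviews: Iterable[dict]) -> list[dict]:
    """Clean each review, skip empty ones, and drop duplicates.

    Assumption: two reviews are duplicates when their cleaned,
    lower-cased text is identical.
    """
    seen = set()
    out = []
    for review in reviews:
        text = clean_text(review["text"])
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            out.append({**review, "text": text})
    return out


def batches(items: list, size: int) -> Iterator[list]:
    """Split a list into consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical raw records, as a collector might hand them over.
raw = [
    {"id": 1, "text": "  Great   phone!\n"},
    {"id": 2, "text": "great phone!"},
    {"id": 3, "text": "Battery died\tafter a week"},
    {"id": 4, "text": "   "},
    {"id": 5, "text": "Fast delivery"},
]

cleaned = transform(raw)
for batch in batches(cleaned, size=2):
    # Each batch could be handed to a separate worker.
    print([r["id"] for r in batch])
```

Each batch is independent, so batches can be processed by separate workers in parallel; the load step (MongoDB, BigQuery) is left out here because it depends on the target client library.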
2.1. Data Collection: web-scraping
Web Scraper
2.1. Data Collection: Benefit & Challenge
Benefits:
- It’s free
- It’s big data
Challenges:
- Fake data
- Captcha
- IP blocking
- Hard to collect
- JavaScript rendering
2.1. Data Collection: How to deal with challenges?
- Proxy: to avoid IP blocking and Captcha
- Selenium: controls a web browser by code, to overcome JavaScript rendering
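One way to picture the proxy idea is a small rotation loop: when a request is blocked, retry it through the next proxy. The sketch below is only an assumption about how such a loop might look. `Blocked`, `fetch_with_rotation`, the proxy addresses, and `fake_fetch` are hypothetical names; in a real collector the `fetch` callable would drive a Selenium-controlled browser configured with the given proxy, which is not shown here.

```python
import itertools
from typing import Callable, Sequence


class Blocked(Exception):
    """Raised by a fetcher when the site returns a block page or a Captcha."""


def fetch_with_rotation(
    url: str,
    proxies: Sequence[str],
    fetch: Callable[[str, str], str],
    max_attempts: int = 3,
) -> str:
    """Try `fetch(url, proxy)` with successive proxies until one succeeds."""
    pool = itertools.cycle(proxies)
    last_error = None
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            return fetch(url, proxy)
        except Blocked as exc:
            last_error = exc  # this proxy is blocked, move on to the next one
    raise RuntimeError(f"all {max_attempts} attempts blocked for {url}") from last_error


def fake_fetch(url: str, proxy: str) -> str:
    """Stand-in fetcher for the demo: pretends the first proxy is blocked."""
    if proxy == "proxy-a:8080":
        raise Blocked(proxy)
    return f"<html>{url} via {proxy}</html>"


page = fetch_with_rotation(
    "https://example.com/reviews",
    ["proxy-a:8080", "proxy-b:8080"],
    fake_fetch,
)
print(page)
```

Passing the fetcher in as a parameter keeps the rotation logic independent of the browser automation, so it can be exercised without a browser or network access.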
2.2. Data Storage
- PostgreSQL: store process metadata (used by orchestrator)
- Google Cloud Storage: store intermediary CSV files
- MongoDB: flexible, persistent storage for text documents; allows easy and frequent edits
- Google BigQuery: analytics data storage and distributed processing engine using SQL, a familiar language for data analysts
ANALYTICS WORKFLOW
03
1. First Implementation
2. Inference Services
3.1.1 Analytics Workflow
- After ETL process, data is available for further processing and analysis
- Analytics Workflow:
- A part of Data Platform
- Extract information from data for insights
- Machine Learning models are an integral part of text analytics
- Information is extracted, and pushed to BigQuery for queries
3.1.2 First implementation
- Implement each model as a worker
- Advantages:
- Easy to implement
- Suitable for early stages: fast implementation and acceptable performance
- Several drawbacks: technical debt
- Mixing of concerns
- Low flexibility
- Limited scalability
3.1.3 First implementation: mixing of concerns
- Data Platform’s intended purpose: moving data, processing, and interacting with various APIs on the way => mostly I/O operations
- Computationally-heavy tasks are usually delegated: e.g. to BigQuery
- Mixing I/O and computations
3.1.4 First implementation: scalability
- Everything seems OK until we must process many reviews (100,000s to 1,000,000s, of various lengths, some very long)
- Manual scaling: replicate workers -> VM resource/cost constraints
- GPU acceleration? -> ETL workers don’t need GPUs
3.1.5 First implementation: monitoring and maintenance
- No real monitoring components for performance degradation
- Data drift, concept drift?
- If needed, model is inspected manually
- Collect, process, re-train models manually
- Upload trained model to GCS, re-deploy workers
3.2.1 Inference Services: separation of concerns
- Income Inference Services
- No direct I/O for data; services only accept HTTP requests with input and respond with computed results
=> Easier to maintain and optimize both ends
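The request/response contract described above can be illustrated with Python's standard library alone. This is a sketch under stated assumptions, not the services from the talk: the `/predict` path, the JSON shape (`{"text": ...}` in, `{"label": ..., "score": ...}` out), and the keyword rule standing in for a real NLP model are all hypothetical.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(text: str) -> dict:
    """Placeholder model: a keyword rule standing in for a real NLP model."""
    positive = {"good", "great", "love"}
    score = sum(word.strip(".,!?").lower() in positive for word in text.split())
    return {"label": "positive" if score > 0 else "neutral", "score": score}


class InferenceHandler(BaseHTTPRequestHandler):
    """Accepts a JSON POST with the input and answers with the computed result.

    The service touches no database or file storage: input in, result out.
    """

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["text"])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the OS pick a free port; the server runs in a background thread.
server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()


def call_service(text: str) -> dict:
    """What an ETL worker would do: send the text, read back the result."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{server.server_port}/predict",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # An empty ProxyHandler bypasses any system proxy for this local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())


result = call_service("Great battery, love it!")
print(result)

server.shutdown()
server.server_close()
```

With this split, the ETL worker keeps only the I/O (`call_service`), while the compute-heavy part lives behind the HTTP boundary and can be scaled or replaced on its own.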
3.2.2 Inference Services: overall architecture
23
3.2.3 Inference Services: solving redundancy and reusability
- Each ML model is treated as a microservice
- Several ML models can be connected as an inference pipeline for complex tasks
- Promote reusability and flexibility => save resources
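To make the reuse argument concrete, here is a small sketch of two pipelines sharing one step. It is illustrative only: each function stands in for a call to a separate model service, and the names (`split_sentences`, `score_sentiment`, `count_sentences`, `make_pipeline`) and the toy logic inside them are assumptions, not the talk's actual models.

```python
from typing import Callable

# A step takes a document dict and returns an enriched copy of it.
Step = Callable[[dict], dict]


def split_sentences(doc: dict) -> dict:
    """Stand-in for a sentence-splitting service."""
    sentences = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return {**doc, "sentences": sentences}


def score_sentiment(doc: dict) -> dict:
    """Stand-in for a sentiment service: counts sentences containing 'good'."""
    positive = sum("good" in s.lower() for s in doc["sentences"])
    return {**doc, "positive_sentences": positive}


def count_sentences(doc: dict) -> dict:
    """Stand-in for a simple statistics service."""
    return {**doc, "n_sentences": len(doc["sentences"])}


def make_pipeline(*steps: Step) -> Step:
    """Chain steps so that each one receives the previous step's output."""
    def run(doc: dict) -> dict:
        for step in steps:
            doc = step(doc)
        return doc
    return run


# Both pipelines reuse the same sentence-splitting step.
sentiment_pipeline = make_pipeline(split_sentences, score_sentiment)
stats_pipeline = make_pipeline(split_sentences, count_sentences)

doc = {"text": "Good screen. Bad battery. Good price."}
sentiment = sentiment_pipeline(doc)
stats = stats_pipeline(doc)
print(sentiment["positive_sentences"], stats["n_sentences"])
```

Because the shared step is deployed once and called by both pipelines, a new task can often be assembled from existing services rather than by packaging another copy of the same model.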
3.2.4 Inference Services: solving scalability
- Services are containerized, run, and deployed independently
- Can be migrated to any environment with relative ease
- For maximum scalability => K8s cluster (GKE) with autoscaling
- Thanks to K8s, deployment is easier.
- Rollout deployments: no/minimal downtime
3.2.5. Inference Services: monitoring
- Metrics are logged to a central data lake and visualized in a dashboard.
Image from https://www.datarobot.com/wiki/machine-learning-operations-mlops/
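As a rough idea of what logging metrics per request could look like, the sketch below wraps a prediction function and emits one JSON record per call. The record fields (`model`, `latency_ms`, `input_chars`, `ts`), the `with_metrics` helper, and the in-memory list used as a sink are all assumptions for illustration; the talk does not specify which metrics are collected or how they reach the data lake.

```python
import json
import time
from typing import Callable


def with_metrics(
    model_name: str,
    predict: Callable[[str], str],
    sink: Callable[[str], None],
) -> Callable[[str], str]:
    """Wrap `predict` so every call also emits one JSON metrics record."""
    def wrapped(text: str) -> str:
        start = time.perf_counter()
        result = predict(text)
        record = {
            "model": model_name,
            "latency_ms": (time.perf_counter() - start) * 1000,
            "input_chars": len(text),  # input size, useful for spotting drift
            "prediction": result,
            "ts": time.time(),
        }
        sink(json.dumps(record))
        return result
    return wrapped


# In-memory sink for the demo; a real sink would ship records to central storage.
records: list[str] = []

classify = with_metrics(
    "sentiment-v1",  # hypothetical model name
    lambda text: "positive" if "good" in text.lower() else "neutral",
    records.append,
)

classify("Good value")
classify("Arrived late")

logged = [json.loads(r) for r in records]
print([(r["model"], r["input_chars"], r["prediction"]) for r in logged])
```

Tracking input size and prediction distribution alongside latency gives a dashboard something to compare over time, which is one simple way to notice the data drift and concept drift mentioned earlier.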
3.2.6. Inference Services: results and drawbacks
- Results
- A more flexible and effective solution
- More resilient ETL process: less complex
- Reduced ETL resource consumption and processing time
- New system of services can be developed and maintained separately
- Drawbacks
- More infrastructure and tools -> management overhead
- Complex inter-dependency between inference services as the system expands
- Requires more expertise in managing K8s clusters and deployment
WHAT WE LEARNED
04
4.1. What We Learned
- ML applications can be tricky to get right
- Few resources and best practices are available
- Solved by: thorough analysis of use-cases
- Solved by: proper scoping and sizing
- Separate I/O-intensive tasks from computationally intensive tasks
- ETL components
- ML components
- Good architecture design from the beginning can save time and cost later
- Over-engineered vs under-engineered
- Easy in hindsight, difficult in practice
We hope these ideas help you design your next ML application
THANK YOU – Q&A

 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Último (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

ML in Data Platform: Case Study with NLP Application

  • 1. ML IN DATA PLATFORM: A Case Study with NLP Application
      - US Office: 2150 Ringwood Ave, San Jose, CA 95131
      - UK Office: 3 Beeston Place, Belgravia, London SW1W 0JJ, UK
      - Vietnam Office: Floor #1-4, 302 Le Van Sy, Ward 1, Tan Binh District, HCMC, Vietnam
      - SG Office: 6A Shenton Way #04-08 OUE Downtown Gallery Singapore 068815
  • 2. Table of contents
      1. Introduction
      2. Data Platform – ETL Process
      3. Data Platform – Analytics Workflow
      4. Afterthoughts
  • 3. 01 INTRODUCTION
      1. Introduction to Case Study
      2. Introduction to Data Platform
  • 4. 1.1.1. Potential Values of ML/NLP Application
      - ML applications can bring new-found value
      - Case study: Online Review Analytics
      - Opinions from others increasingly guide customers' purchases
        => Growth, Improvement, Investment implications
      - Refs
        - https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/five-star-growth-using-online-ratings-to-design-better-products
        - https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/customer-review-preference-statistics/
  • 5. 1.1.2. Dealing with text data
      - An insight-mining platform for review text is highly valuable, but it is difficult to build
      - Engineering challenges
        - Getting the reviews => web scraping, data collection
        - Storing reviews => moving, maintaining, and deduplicating large amounts of text
        - Processing reviews => text cleaning, processing, and analytics at scale
      - Analytics challenges
        - Natural Language Processing (NLP)
        - Insight communication: dashboards and visualization
  • 6. 1.2.1. Data Platform overall architecture
  • 7. 1.2.2. Example: output from ETL Process
  • 8. 1.3. Example: output from Analytics Workflow
  • 9. 1.3. Example: insight communication – Web Application
  • 10. 02 ETL PROCESS
      1. Extract, Transform, Load
      2. Data Collection
      3. Data Storage
  • 11. 2.1. Extract, Transform, Load
      - Extract:
        - Data Collector: collect data from websites
        - Extract and map fields from the raw data collected
      - Transform: clean up data (trimming, special characters, ...), deduplication, etc.
      - Load: write to databases for storage and analysis: MongoDB, BigQuery
      - Batching: split large amounts of data into batches for parallel processing
      - Worker: a container that moves/processes data -> Mini-ETL
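The Transform and Batching steps on this slide can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the platform's actual code: the function names `clean_text`, `deduplicate`, and `batched`, the case-insensitive duplicate rule, and the sample reviews are all assumptions made for the example.

```python
import re
from itertools import islice
from typing import Iterable, Iterator


def clean_text(raw: str) -> str:
    """Replace control characters with spaces, collapse whitespace, and trim."""
    no_control = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", " ", raw)
    return re.sub(r"\s+", " ", no_control).strip()


def deduplicate(reviews: Iterable[str]) -> list[str]:
    """Clean each review and keep the first occurrence (case-insensitive match)."""
    seen: set[str] = set()
    unique: list[str] = []
    for review in reviews:
        text = clean_text(review)
        key = text.casefold()
        if text and key not in seen:  # empty reviews are dropped as well
            seen.add(key)
            unique.append(text)
    return unique


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` items, one batch per worker."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Hypothetical sample data, for illustration only.
raw_reviews = ["  Great   phone!\n", "great phone!", "Battery\tdies fast", ""]
cleaned = deduplicate(raw_reviews)
batches = list(batched(cleaned, 1))
```

Each batch could then be handed to a separate worker container; how batches are dispatched is up to the orchestrator and is not shown here.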
  • 12. 2.1. Data Collection: web scraping
      - Web Scraper
  • 13. 2.1. Data Collection: Benefit & Challenge
      - Benefit
        - It's free
        - It's big data
      - Challenge
        - Fake data
        - Captcha
        - IP blocking
        - Hard to collect
        - JavaScript rendering
  • 14. 2.1. Data Collection: How to deal with challenges?
      - Proxy: to avoid IP blocking & Captcha
      - Web browser: to overcome JavaScript rendering
      - Selenium: control the browser by code
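The proxy half of this slide can be illustrated with a simple round-robin rotation. The sketch below uses only the Python standard library, so it covers proxy selection but not JavaScript rendering, which still needs a real browser driven by a tool such as Selenium. The proxy addresses are placeholders from a documentation-only IP range, and `proxy_cycle` and `opener_for` are hypothetical helper names.

```python
import itertools
import urllib.request
from typing import Iterator

# Placeholder proxy endpoints (192.0.2.0/24 is reserved for documentation).
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


def proxy_cycle(proxies: list[str]) -> Iterator[str]:
    """Return an endless round-robin iterator over the configured proxies."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    return itertools.cycle(proxies)


def opener_for(proxy: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


rotation = proxy_cycle(PROXIES)
# Each outgoing request takes the next proxy, spreading load across IPs.
assigned = [next(rotation) for _ in range(4)]
opener = opener_for(assigned[0])  # building the opener sends no request
```

Rotating per request is only one possible policy; rotating per session or on a block response are equally plausible choices that the slides do not specify.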
  • 15. 2.2. Data Storage
      - PostgreSQL: stores process metadata (used by the orchestrator)
      - Google Cloud Storage: stores intermediary CSV files
      - MongoDB: flexible, persistent storage for text documents; allows easy and frequent edits
      - Google BigQuery: analytics data storage and distributed processing engine using SQL, a familiar language for data analysts
  • 16. 03 ANALYTICS WORKFLOW
      1. First Implementation
      2. Inference Services
  • 17. 3.1.1 Analytics Workflow
      - After the ETL process, data is available for further processing and analysis
      - Analytics Workflow:
        - A part of the Data Platform
        - Extracts information from data for insights
        - Machine Learning models are an integral part of text analytics
        - Information is extracted and pushed to BigQuery for queries
  • 18. 3.1.2 First implementation
      - Implement each model as a worker
      - Advantages:
        - Easy to implement
        - Suitable for early stages: fast implementation and acceptable performance
      - Several drawbacks (technical debt):
        - Mixing of concerns
        - Low flexibility
        - Limited scalability
  • 19. 3.1.3 First implementation: mixing of concerns
      - Data Platform's intended purpose: moving data, processing it, and interacting with various APIs along the way => mostly I/O operations
      - Computationally heavy tasks are usually delegated, e.g. to BigQuery
      - Implementing models as workers mixes I/O and computation
  • 20. 3.1.4 First implementation: scalability
      - Everything seems OK until we must process many reviews (100,000s to 1,000,000s, of various lengths, some very long)
      - Manual scaling: replicate workers -> limited by VM resource/cost constraints
      - GPU acceleration? -> ETL workers don't need GPUs
  • 21. 3.1.5 First implementation: monitoring and maintenance
      - No real monitoring components for performance degradation
        - Data drift, concept drift?
      - If needed, the model is inspected manually
      - Data collection, processing, and model re-training are done manually
      - Upload the trained model to GCS, then re-deploy workers
  • 22. 3.2.1 Inference Services: separation of concerns
      - Income Inference Services
      - No direct I/O for data: they only accept HTTP requests with input and respond with computed results
        => Easier to maintain and optimize both ends
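The idea of a service that does no data I/O of its own and only answers HTTP requests can be sketched with the standard library's `http.server`. Everything specific here is an assumption for illustration: the `/predict` path, the `{"texts": [...]}` request shape, the port, and above all `score_sentiment`, which is a toy word-list scorer standing in for a real ML model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy lexicon; a real service would load a trained model instead.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def score_sentiment(text: str) -> float:
    """Placeholder model: (positive hits - negative hits) / token count."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


def handle_predict(body: bytes) -> tuple[int, dict]:
    """Pure request logic: parse the JSON input, return (status, JSON result)."""
    try:
        texts = json.loads(body)["texts"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'texts' list"}
    if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
        return 400, {"error": "'texts' must be a list of strings"}
    return 200, {"scores": [score_sentiment(t) for t in texts]}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_predict(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Start the service; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Keeping `handle_predict` free of any socket code means the ETL side only needs an HTTP client, and the model logic can be tested without starting a server.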
  • 23. 3.2.2 Inference Services: overall architecture
  • 24. 3.2.3 Inference Services: solving redundancy and reusability
      - Each ML model is treated as a microservice
      - Several ML models can be connected as an inference pipeline for complex tasks
      - Promotes reusability and flexibility => saves resources
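Chaining models into an inference pipeline can be sketched as function composition. In the deployed system each step would presumably be an HTTP call to a separate microservice; here local functions stand in for those calls. The step names `normalize`, `extract_keywords`, and `count_tokens` and the dict-based document format are assumptions made for the example.

```python
from typing import Callable

# A step takes a document dict and returns an enriched copy of it.
Step = Callable[[dict], dict]


def normalize(doc: dict) -> dict:
    """Collapse whitespace and lowercase the text."""
    return {**doc, "text": " ".join(doc["text"].split()).lower()}


def extract_keywords(doc: dict) -> dict:
    """Toy keyword step: unique words longer than four characters."""
    words = [w.strip(".,!?") for w in doc["text"].split()]
    return {**doc, "keywords": sorted({w for w in words if len(w) > 4})}


def count_tokens(doc: dict) -> dict:
    """Toy statistics step: number of whitespace-separated tokens."""
    return {**doc, "n_tokens": len(doc["text"].split())}


def make_pipeline(*steps: Step) -> Step:
    """Compose steps so the output of one becomes the input of the next."""
    def run(doc: dict) -> dict:
        for step in steps:
            doc = step(doc)
        return doc
    return run


# Two pipelines share the same `normalize` step instead of duplicating it.
keyword_pipeline = make_pipeline(normalize, extract_keywords)
stats_pipeline = make_pipeline(normalize, count_tokens)

review = {"text": "  Battery life is   EXCELLENT. "}
keyword_result = keyword_pipeline(review)
stats_result = stats_pipeline(review)
```

Because every step has the same signature, a new pipeline is just a new ordering of existing steps, which is the reuse benefit the slide points to.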
  • 25. 3.2.4 Inference Services: solving scalability
      - Services are containerized, run, and deployed independently
      - Can be migrated to any environment with relative ease
      - For maximum scalability => K8s cluster (GKE) with autoscaling
      - Thanks to K8s, deployment is easier
        - Rollout deployments: no/minimal downtime
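The slide does not say which autoscaler is used. Assuming the Kubernetes Horizontal Pod Autoscaler, its documented core rule is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below reproduces only that rule plus min/max clamping; the real controller also applies a tolerance band and stabilization windows, which are omitted, and the default bounds here are arbitrary.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Core HPA scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 pods averaging 90% CPU against a 60% target -> scale out to 6 pods.
scaled_out = desired_replicas(4, 90.0, 60.0)
# 4 pods averaging 30% CPU against a 60% target -> scale in to 2 pods.
scaled_in = desired_replicas(4, 30.0, 60.0)
```

This is the contrast with the first implementation: replica counts follow load automatically instead of being set by hand.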
  • 26. 3.2.5. Inference Services: monitoring
      - Metrics are logged to a central data lake and visualized in a dashboard
      - Image from https://www.datarobot.com/wiki/machine-learning-operations-mlops/
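One way to produce such metrics is to wrap every prediction call and emit a structured record per request. The sketch below is an assumption-heavy illustration: the record fields, the JSON-lines format, and the helper names `timed_call`, `to_json_line`, and `summarize` are invented for the example, and shipping the lines to a data lake is left out.

```python
import json
import statistics
import time
from typing import Callable


def timed_call(
    model_name: str, predict: Callable[[str], float], text: str
) -> tuple[float, dict]:
    """Run one prediction and build a metric record describing the call."""
    start = time.perf_counter()
    output = predict(text)
    latency_ms = (time.perf_counter() - start) * 1000
    record = {
        "model": model_name,
        "ts": time.time(),
        "latency_ms": latency_ms,
        "input_chars": len(text),
        "output": output,
    }
    return output, record


def to_json_line(record: dict) -> str:
    """Serialize a record as one JSON line, ready for a log sink."""
    return json.dumps(record, sort_keys=True)


def summarize(records: list[dict]) -> dict:
    """Aggregate records into values a dashboard might plot."""
    return {
        "count": len(records),
        "mean_latency_ms": statistics.fmean(r["latency_ms"] for r in records),
        "mean_input_chars": statistics.fmean(r["input_chars"] for r in records),
    }


# Toy model (text length as the "score"), used only to exercise the helpers.
_, first = timed_call("toy", lambda t: float(len(t)), "hello")
_, second = timed_call("toy", lambda t: float(len(t)), "abc")
summary = summarize([first, second])
```

Tracking an input statistic such as mean review length alongside latency gives a rough early signal of data drift, though it is no substitute for a proper drift test.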
  • 27. 3.2.6. Inference Services: results and drawbacks
      - Results
        - A more flexible and effective solution
        - A more resilient, less complex ETL process
        - Reduced ETL resource consumption and processing time
        - The new system of services can be developed and maintained separately
      - Drawbacks
        - More infrastructure and tools -> management overhead
        - Complex inter-dependencies among inference services as the system expands
        - Requires more expertise in managing K8s clusters and deployment
  • 29. 4.1. What We Learned
      - ML applications can be tricky to get right
        - Few resources and best practices are available
        - Solved by: thorough analysis of use cases
        - Solved by: proper scoping and sizing
      - Separate I/O-intensive tasks from computationally intensive tasks
        - ETL components
        - ML components
      - Good architecture design from the beginning can save time and cost later
        - Over-engineered vs under-engineered
        - Easy in hindsight, difficult in practice
      Hope these ideas help you in designing your next ML application