Training on a Pluggable Machine Learning Platform: Machine Learning on Hadoop at Huffington Post | AOL
A Little Bit about Us: Core Services Team at HPMG | AOL. Thu Kyaw (thu.kyaw@teamaol.com), Principal Software Engineer, worked on machine learning, data mining, and natural language processing. Sang Chul Song, Ph.D. (sangchul.song@teamaol.com), Senior Software Engineer, worked on data-intensive computing (data archiving / information retrieval).
Machine Learning: Supervised Classification. 1. Learning Phase: Train → Model. 2. Classifying Phase: text (“capital gains to be taxed …”) → Classify with the Model → Result. Example labels: “Business”, “Entertainment”, “Politics”.
Two Machine Learning Use Cases at HuffPost | AOL. Comment Moderation: evaluate all new HuffPost user comments every day; identify abusive / aggressive comments; auto delete / publish ~25% of comments every day. Article Classification: tag articles for advertising, e.g. scary, salacious, …
Our Classification Tasks. Comment Moderation: abusive vs. non-abusive. Article Classification: scary, sexy, …
In Order to Meet Our Needs, We Require… Support for important algorithms, including SVM, Perceptron / Winnow, Bayesian, Decision Tree, AdaBoost, … The ability to build tons of models on a regular basis and pick the best, because in general it is difficult to know in advance which algorithm / parameter set will work best.
However, with N algorithms, K parameters each, and L values for each parameter, there are N × L^K combinations, which is often too many to deal with sequentially. For example, N=5, K=5, L=10 → 500K.
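As a rough check on that arithmetic, the sketch below counts the combinations and enumerates a small grid. The algorithm and parameter names passed in are hypothetical placeholders, and it assumes every algorithm shares the same parameter grid, which the slides do not state.

```python
from itertools import product

def count_combinations(n_algorithms: int, k_params: int, l_values: int) -> int:
    """Number of (algorithm, parameter setting) combinations: N * L**K."""
    return n_algorithms * l_values ** k_params

def enumerate_grid(algorithms, param_values):
    """Yield (algorithm, {param: value}) for every combination.

    param_values maps each parameter name to its list of candidate values.
    Assumption for illustration: all algorithms share one parameter grid.
    """
    names = sorted(param_values)
    for algo in algorithms:
        for combo in product(*(param_values[name] for name in names)):
            yield algo, dict(zip(names, combo))

print(count_combinations(5, 5, 10))  # 500000
```

Enumerating the grid lazily with a generator keeps memory flat even when the count runs into the hundreds of thousands.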
So, We Parallelize on Hadoop. Good news: Mahout, a parallel machine learning tool, is already available. There are also Mallet, libsvm, Weka, … that support the necessary algorithms. Bad news: Mahout doesn’t support the necessary algorithms yet, and the other tools do not run natively on Hadoop.
Therefore, We Do… We build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations. On top of our platform, we generate / test hundreds of thousands of models and choose the best. We use Pig for the Hadoop implementation.
Our Approach. OUR APPROACH: more algorithms (thus a better model) and faster parallel processing, with AdaBoost, SVM, Decision Tree, Bayesian, and many others. [Diagram labels: Train Request, Return, CONVENTIONAL, 1000s of Models (one for each param set), Best Model, Training Data, Select, Train (sequential)]
What Parallelization? [Diagram: five Training Task boxes]
General Processing Flow: TrainingDocs → Preprocess → VectorizedDocs → Train → Model. Preprocess Parameters: stopword use, n-gram size, stemming, etc. Train Parameters: algorithm and algorithm-specific parameters (e.g. for SVM: C, ε, and other kernel parameters).
Our Parallel Processing Flow: TrainingDocs fan out into several sets of Vectorized Docs, and each set of Vectorized Docs fans out into several Models.
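The two-level fan-out can be sketched locally as below. This is only an illustration: a thread pool stands in for Hadoop, and `preprocess`, `train`, and their parameters (`ngram_size`, `min_count`) are toy placeholders rather than anything from the platform itself.

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(docs, ngram_size):
    """Toy vectorization: the set of word n-grams in each document."""
    out = []
    for label, text in docs:
        words = text.lower().split()
        grams = {" ".join(words[i:i + ngram_size])
                 for i in range(len(words) - ngram_size + 1)}
        out.append((label, grams))
    return out

def train(vectors, min_count):
    """Toy trainer: per label, keep features seen in >= min_count documents."""
    counts = {}
    for label, grams in vectors:
        per_label = counts.setdefault(label, {})
        for gram in grams:
            per_label[gram] = per_label.get(gram, 0) + 1
    return {label: {g for g, c in feats.items() if c >= min_count}
            for label, feats in counts.items()}

def run_all(docs, ngram_sizes, min_counts, workers=4):
    """One vectorized set per preprocess setting, one model per train setting."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        vec_futures = {n: pool.submit(preprocess, docs, n) for n in ngram_sizes}
        vectors = {n: f.result() for n, f in vec_futures.items()}
        model_futures = {(n, m): pool.submit(train, vectors[n], m)
                         for n in ngram_sizes for m in min_counts}
        return {key: f.result() for key, f in model_futures.items()}
```

Each vectorized set is computed once and reused by every training task that depends on it, which is the point of splitting the flow into two stages.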
Preprocessing on Hadoop. Training Data:
business	Investments are taxed as capital gains.....
business	It was the overleveraged and underregulated banks …
none	I am afraid we may be headed for …
none	In the famous words of Homer Simpson, “it takes 2 to lie …”
…
Preprocessing Request (a parameter set per line):
279	68	ngram_stem_stopword	1	snowball	true
279	68	ngram_stem_stopword	2	snowball	true
279	68	ngram_stem_stopword	3	snowball	true
279	68	ngram_stem_stopword	1	porter	true
279	68	ngram_stem_stopword	2	porter	true
279	68	ngram_stem_stopword	3	none	false
…
Preprocessing on Hadoop (see next slide) turns each parameter set into a vector set: Vector 1, Vector 2, Vector 3, Vector 4, Vector 5, …, Vector k.
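A request file of this shape could be generated along the following lines. The meaning of each column is an assumption made for illustration: the first two fields are treated as opaque IDs, and the last field is guessed to be a stemming flag because it is `false` only where the stemmer is `none` in the sample above.

```python
from itertools import product

def preprocessing_requests(id_a, id_b, pipeline, ngram_sizes, stemmers):
    """Return one tab-separated parameter set per line.

    Column meanings are hypothetical; only the layout follows the sample.
    """
    lines = []
    for stemmer, n in product(stemmers, ngram_sizes):
        use_stemming = "false" if stemmer == "none" else "true"
        fields = [str(id_a), str(id_b), pipeline, str(n), stemmer, use_stemming]
        lines.append("\t".join(fields))
    return lines
```

Writing `"\n".join(preprocessing_requests(...))` to a file would give something a Pig `LOAD` could read one parameter set at a time.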
Preprocessing on Hadoop: Big Picture. The Pig script calls a UDF once per parameter set:
par = LOAD 'param_file' AS (par1, par2, …);
run = FOREACH par GENERATE RunPreprocess(par1, par2, …);
STORE run ..;
Through the UDF call, RunPreprocess() runs the Preprocessors (Pluggable Pipes): Stemmer, Tokenizer, StopwordFilter, Vectorizer, FeatureSelector, … and produces Vector 1, Vector 2, …, Vector k.
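The idea of pluggable pipes might look roughly like the sketch below, where each pipe is a plain function and a parameter set decides which pipes are chained. Everything here is a simplified stand-in: the stopword list is hypothetical, the stemmer is a crude suffix stripper rather than Snowball or Porter, and the pipe order is an assumption.

```python
import re

STOPWORDS = {"a", "an", "the", "are", "as", "is", "of", "to"}  # hypothetical list

def tokenizer(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def stopword_filter(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def naive_stemmer(tokens):
    """Crude plural stripping; a real setup would plug in Snowball or Porter."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def make_ngrams(n):
    def pipe(tokens):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return pipe

def vectorizer(tokens):
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts

def build_pipeline(ngram_size, stem, use_stopwords):
    """Assemble the list of pipes selected by one parameter set."""
    pipes = [tokenizer]
    if use_stopwords:
        pipes.append(stopword_filter)
    if stem:
        pipes.append(naive_stemmer)
    pipes.append(make_ngrams(ngram_size))
    pipes.append(vectorizer)
    return pipes

def run_preprocess(text, ngram_size=1, stem=True, use_stopwords=True):
    """Feed the text through each selected pipe in turn."""
    data = text
    for pipe in build_pipeline(ngram_size, stem, use_stopwords):
        data = pipe(data)
    return data
```

Because every pipe has the same shape (one value in, one value out), swapping a stemmer or adding a feature selector only changes `build_pipeline`.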
Training on Hadoop. Vectors:
010101101020101100010101110100010101011100…
010111010100010100100010101011100110110101…
011101011010101011101011011010001010010101…
010010111010100010101010001010111010101010…
111010110001110101011010100101011010001011…
Train Request (a parameter set per line):
73	923	balanced_winnow	5	1	10	…
73	923	balanced_winnow	5	2	10	…
73	923	balanced_winnow	5	3	10	…
73	923	balanced_winnow	5	1	20	…
73	923	balanced_winnow	5	2	20	…
73	923	balanced_winnow	5	3	20	…
…
Training on Hadoop (see next slide), using Mahout, Weka, Mallet or libsvm, turns each parameter set into a model: Model 1, Model 2, Model 3, Model 4, Model 5, …, Model k.
Training on Hadoop: Big Picture. The Pig script calls a UDF once per parameter set:
par = LOAD 'param_file' AS (par1, par2, …);
run = FOREACH par GENERATE RunTrainer(par1, par2, …);
STORE run ..;
Through the UDF call, RunTrainer() invokes the underlying library and produces Model 1, Model 2, …
Mallet: Bagging, Balanced Winnow, C45, Decision Tree, …
Mahout
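One way to let a single entry point dispatch to many algorithm implementations is a name-to-trainer registry, sketched below. The registry, the decorator, and both toy trainers are hypothetical; in the platform described here the entries would wrap Mallet, Mahout, Weka, or libsvm calls instead.

```python
TRAINERS = {}

def register(name):
    """Decorator that records a trainer function under an algorithm name."""
    def deco(fn):
        TRAINERS[name] = fn
        return fn
    return deco

@register("majority_class")
def train_majority(vectors, params):
    """Toy trainer: always predict the most frequent training label."""
    labels = [label for label, _ in vectors]
    best = max(sorted(set(labels)), key=labels.count)
    return lambda features: best

@register("keyword")
def train_keyword(vectors, params):
    """Toy trainer: predict params['label'] when params['word'] is present."""
    word, label, default = params["word"], params["label"], params["default"]
    return lambda features: label if word in features else default

def run_trainer(algorithm, vectors, params):
    """Look up the trainer named in the request line and run it."""
    try:
        trainer = TRAINERS[algorithm]
    except KeyError:
        raise ValueError(f"no trainer registered for {algorithm!r}")
    return trainer(vectors, params)
```

Adding support for another library then amounts to registering one more function, without touching the dispatch code.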

Machine Learning with Hadoop

  • 28. Training on Hadoop: Trick #2. We call ML functions from a UDF. Some functions can take too long to return, and Hadoop will kill the job if they do. [Diagram labels: RunTrainer(), Main Thread, “Pig Heartbeat” Thread]
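A common way to keep a long-running call from being treated as hung is to report progress from a second thread while the main thread does the work. The slide's “Pig Heartbeat” thread suggests something along these lines, although the details are not given; the sketch below uses a plain callback (`report_progress`, a hypothetical name) in place of whatever progress hook the framework provides.

```python
import threading
import time

def run_with_heartbeat(task, report_progress, interval=1.0):
    """Run task() in the calling thread while a helper thread calls
    report_progress() every `interval` seconds until the task finishes."""
    done = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once `done` is set.
        while not done.wait(interval):
            report_progress()

    heartbeat = threading.Thread(target=beat, daemon=True)
    heartbeat.start()
    try:
        return task()
    finally:
        done.set()        # stop the heartbeat even if task() raised
        heartbeat.join()

def slow_training():
    """Stand-in for a library call that takes a long time to return."""
    time.sleep(0.3)
    return "model"
```

Using an `Event` rather than a sleep loop lets the heartbeat stop immediately when training ends instead of waiting out the last interval.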
  • 29. As a result, we now see… We are now able to build tens of thousands of models within an hour and choose the best. Previously, the same task took us days. As we can generate more models more frequently, we become more adaptive to the fast-changing Internet community, catching up with newly-coined terms, etc.
  • 30. Useful Resources
      Mahout: http://mahout.apache.org/
      Mallet: http://mallet.cs.umass.edu/
      Weka: http://www.cs.waikato.ac.nz/ml/weka/
      libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
      OpenNLP: http://incubator.apache.org/opennlp/
      Pig: http://pig.apache.org/