Hadoop Based Intelligent Text Processing System
October 12, 2010
Hadoop World, NYC
Page 2
Who are we?
•Vaijanath N. Rao
•AOL
•Contact: vaijanath.rao@teamaol.com
•Rohini Uppuluri
•AOL
•Contact: rohini.uppuluri@teamaol.com
Page 3
Agenda
1. Introduction
2. Problem Statement
3. Our Intelligent Text Processing System
1. Overview
2. Detailed
3. Application(s)
4. Q and A
Page 4
Introduction
Page 5
Introduction (Continued…)
• Information Extraction - Extracting information from text
• Part of Speech Analysis
Ex: BlackBeauty<noun> is<verb> a<det> pretty<adjective> horse<noun>
• Named Entity Extraction
Ex: The CEO <Person>Mr. A</Person> of <Location>New York</Location> based Firm
<Organization>Foo.Inc</Organization> announced its new Product
<date>today</date>
• Sentiment Analysis
Ex: Watch this film. AVATAR is an achievement in many technical departments. It is a
beautiful experience
• Sentence Detection
Ex: <Start Sentence>BlackBeauty is a pretty horse <End Sentence>
• Some tools: OpenNLP [5], LingPipe [6], GATE [7], NLTK [8], etc.
• Categorization/Classification - Assign each item to one of a set of predefined
classes
Ex: An article about a baseball game is a “Sports” article.
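As a rough illustration of the categorization example above, here is a toy keyword-overlap classifier. It is not the approach described in the talk (which points to machine learning methods); the class names, keyword lists, and the `categorize` function are all hypothetical and exist only to show what "assign an item to a predefined class" means.

```python
from typing import Dict, Set

# Hypothetical keyword lists; a trained model would replace these.
CLASS_KEYWORDS: Dict[str, Set[str]] = {
    "Sports": {"baseball", "game", "inning", "pitcher", "score"},
    "Finance": {"stock", "market", "earnings", "shares", "bank"},
}


def categorize(text: str) -> str:
    """Return the class whose keywords overlap most with the text."""
    tokens = {t.strip(".,!?\"'").lower() for t in text.split()}
    scores = {label: len(tokens & kws) for label, kws in CLASS_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # No keyword matched at all: refuse to guess.
    return best if scores[best] > 0 else "Unknown"


print(categorize("The pitcher dominated the baseball game."))  # Sports
```

A keyword table like this is domain dependent and does not scale to many classes, which is one reason such systems are usually trained on large amounts of data.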
Page 6
Introduction (Continued…)
• Challenges
• Processing large amounts of data
• Most approaches use machine learning methods
• Models need to be trained on large amounts of data
• Need a way to perform the computations in a scalable manner
• Domain dependency
Page 7
Problem Statement
• What do we want to do?
• Build large-scale applications (processing text)
• Why is this useful?
• Analyze the large volume of content available at AOL
• Applications: user interest mining, ad targeting, personalization, etc.
• We need
• A large-scale NLP system
• A pipeline-style architecture in which users can plug components in or out
• Abstraction or transparency of the algorithms used, as requested by the user
Page 8
Our Intelligent Text Processing System
• Overview
• Pipelined Architecture
• Pluggable components
• Work Flow Manager
• Recovery Manager
• Job Manager
• Applications
• Large-scale applications that apply NLP models in a scalable way
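One way to picture the pipelined architecture with pluggable components is the minimal sketch below. The `Pipeline` class, its `plug_in`/`plug_out` methods, and the two stand-in components are hypothetical names chosen for illustration; the slides do not show the actual interfaces.

```python
from typing import Callable, Dict, List, Tuple

# A component takes a document (a dict of annotations) and returns it enriched.
Component = Callable[[Dict], Dict]


class Pipeline:
    """Ordered list of named components that can be plugged in or out."""

    def __init__(self) -> None:
        self._stages: List[Tuple[str, Component]] = []

    def plug_in(self, name: str, component: Component) -> None:
        self._stages.append((name, component))

    def plug_out(self, name: str) -> None:
        self._stages = [(n, c) for n, c in self._stages if n != name]

    def names(self) -> List[str]:
        return [n for n, _ in self._stages]

    def run(self, doc: Dict) -> Dict:
        # Each stage sees the annotations added by the stages before it.
        for _, component in self._stages:
            doc = component(doc)
        return doc


# Hypothetical stand-in components; a real system would call NLP models here.
def split_sentences(doc: Dict) -> Dict:
    doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc


def count_tokens(doc: Dict) -> Dict:
    doc["token_count"] = len(doc["text"].split())
    return doc


pipeline = Pipeline()
pipeline.plug_in("sentences", split_sentences)
pipeline.plug_in("tokens", count_tokens)
result = pipeline.run({"text": "BlackBeauty is a pretty horse. It runs fast."})
print(result["sentences"], result["token_count"])
```

Because every component shares the same signature, removing or reordering a stage does not require changes to the others.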
Page 9
Overview
Page 10
Job Manager
•Creates a series of parallel and sequentially dependent jobs (takes a configuration
file)
•Example:
Jobs A, B, C, D, E and F
Job B depends on Job A; Job E depends on Job D
•Job Manager creates the following tree
•Jobs A, D and F are executed in parallel
•Jobs B and E are executed in parallel once their parent jobs have
completed.
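The dependency example on this slide can be sketched as a grouping of jobs into "waves", where every job in a wave has all of its parents finished and can therefore run in parallel. This is an illustrative sketch, not the Job Manager's actual code; `execution_waves` is a hypothetical name, and Job C is assumed to have no dependency because the slide text does not state one.

```python
from typing import Dict, List, Set


def execution_waves(depends_on: Dict[str, List[str]]) -> List[List[str]]:
    """Group jobs into waves; every job in a wave can run in parallel."""
    remaining: Dict[str, Set[str]] = {
        job: set(parents) for job, parents in depends_on.items()
    }
    done: Set[str] = set()
    waves: List[List[str]] = []
    while remaining:
        # A job is ready once all of its parents have completed.
        ready = sorted(job for job, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("cyclic or unknown dependency among: %s" % sorted(remaining))
        waves.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return waves


# Assumption: C has no parent, since the slide lists none for it.
jobs = {"A": [], "B": ["A"], "C": [], "D": [], "E": ["D"], "F": []}
print(execution_waves(jobs))  # [['A', 'C', 'D', 'F'], ['B', 'E']]
```

A stricter scheduler could start B as soon as A finishes instead of waiting for the whole first wave; the wave form is used here only because it is the simplest to read.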
Page 11
Recovery Manager
•Each job writes its configuration, start time, and end time
(if completed) to the status file
•The Recovery Manager periodically checks the status file for updates
to see whether any job has failed; if so, it restarts the job by calling
the Job Manager with the required configuration
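A single recovery check might look like the sketch below. The slides do not say how a failure is detected or how the status file is laid out, so two things here are assumptions: status records are modeled as dictionaries with `name`, `config`, `start_time`, and `end_time` fields, and a job with no end time after `timeout_s` seconds is treated as failed. The `restart` callback stands in for the call to the Job Manager.

```python
from typing import Callable, Dict, List


def recover(status: List[Dict], now: float, timeout_s: float,
            restart: Callable[[Dict], None]) -> List[str]:
    """Restart every job that looks failed and return the restarted names."""
    restarted = []
    for record in status:
        finished = record.get("end_time") is not None
        # Assumption: an unfinished job older than timeout_s counts as failed.
        overdue = now - record["start_time"] > timeout_s
        if not finished and overdue:
            restart(record["config"])  # stands in for the call to the Job Manager
            restarted.append(record["name"])
    return restarted


# Hypothetical status records; times are seconds on an arbitrary clock.
status_file = [
    {"name": "postagger", "config": {"jar": "postagger.jar"},
     "start_time": 0.0, "end_time": 40.0},
    {"name": "nounphrase", "config": {"jar": "chunker.jar"},
     "start_time": 50.0, "end_time": None},
]
resubmitted: List[Dict] = []
print(recover(status_file, now=500.0, timeout_s=300.0, restart=resubmitted.append))
```

Running this check on a timer gives the periodic behavior described on the slide.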
Page 12
Sample Configuration
<job name="keyphrase">
  <mapreduce depends="none" name="postagger">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputLocation</output>
    <jar>postagger.jar</jar>
    <mainClass>com.aol.datalayer.nlp.postagger</mainClass>
  </mapreduce>
  <mapreduce depends="postagger" name="nounphrase">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputlocation</output>
    <jar>chunker.jar</jar>
    <mainClass>com.aol.datalayer.nlp.chunker</mainClass>
  </mapreduce>
</job>
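To show how a configuration like this could drive job ordering, the sketch below reads the `name` and `depends` attributes with Python's standard `xml.etree.ElementTree` module. The `read_dependencies` function is hypothetical, the embedded XML is a trimmed copy of the sample, and it is assumed that `depends` holds either `none` or a single job name, since the slides do not show how several parents would be written.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Trimmed copy of the sample configuration (inputargs/output omitted).
CONFIG = """
<job name="keyphrase">
  <mapreduce depends="none" name="postagger">
    <jar>postagger.jar</jar>
    <mainClass>com.aol.datalayer.nlp.postagger</mainClass>
  </mapreduce>
  <mapreduce depends="postagger" name="nounphrase">
    <jar>chunker.jar</jar>
    <mainClass>com.aol.datalayer.nlp.chunker</mainClass>
  </mapreduce>
</job>
"""


def read_dependencies(xml_text: str) -> Dict[str, List[str]]:
    """Map each mapreduce step name to the list of steps it depends on."""
    root = ET.fromstring(xml_text.strip())
    deps: Dict[str, List[str]] = {}
    for step in root.findall("mapreduce"):
        parent = step.get("depends", "none")
        # Assumption: "none" means no parent; otherwise one parent name.
        deps[step.get("name")] = [] if parent == "none" else [parent]
    return deps


print(read_dependencies(CONFIG))  # {'postagger': [], 'nounphrase': ['postagger']}
```

Here the part-of-speech tagger runs first and the noun-phrase chunker runs after it, which matches the `depends` attributes in the sample.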
Page 13
Overview
Page 14
NLP Modeling Engine
Page 15
Detailed
Page 16
Applications
Page 17
Application 1 - Location Aware Contextual Advertising - Example
Page 18
Location Aware Contextual Advertising - Overview
Page 19
Application 2 - User Aware Ad Targeting - Example
This is an illustrative example and does not represent any real user.
Page 20
User Aware Ad Targeting
Page 21
Conclusions
• Pipelined Architecture
• NLP System
• Large Scale Applications
• Location Aware Contextual Ad Targeting
• User Aware Ad Targeting
Page 22
Future Work
• Developing distributed algorithms for
• POS Tagger
• Sentiment Analyzer models
• Exploring whether integrating with an open-source distributed
ML/TM framework would be useful
Page 23
References
1. Part-of-Speech Tagging: en.wikipedia.org/wiki/Part-of-speech_tagging
2. Coreference Resolution: en.wikipedia.org/wiki/Coreference
3. Named Entity Recognition: en.wikipedia.org/wiki/Named_entity_recognition
4. Sentiment Analysis: en.wikipedia.org/wiki/Sentiment_analysis
5. OpenNLP: http://opennlp.sourceforge.net/
6. LingPipe: http://alias-i.com/lingpipe/
7. GATE: http://gate.ac.uk/ie/
8. NLTK: www.nltk.org
Page 24
Q & A
Thank You 
