SlideShare a Scribd company logo
1 of 36
Download to read offline
Large Scale Processing of Text
Suneel Marthi
DataWorks Summit 2017,
San Jose, California
@suneelmarthi
$WhoAmI
● Principal Software Engineer in the Office of Technology, Red Hat
● Member of Apache Software Foundation
● Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache
Streams
What is a Natural Language?
What is a Natural Language?
Is any language that has evolved naturally in humans through
use and repetition without conscious planning or
premeditation
(From Wikipedia)
What is NOT a Natural Language?
Characteristics of Natural Language
Unstructured
Ambiguous
Complex
Hidden semantic
Ironic
Informal
Unpredictable
Rich
Most updated
Noise
Hard to search
and it holds most of human knowledge
and it holds most of human knowledge
and but it holds most of human knowledge
As information overload grows
ever worse, computers may
become our only hope for
handling a growing deluge of
documents.
MIT Press - May 12, 2017
What is Natural Language Processing?
NLP is a field of computer science, artificial intelligence and
computational linguistics concerned with the interactions
between computers and human (natural) languages, and, in
particular, concerned with programming computers to fruitfully
process large natural language corpora.(From Wikipedia)
???
How?
By solving small problems each time
A pipeline where an ambiguity type is solved, incrementally.
Sentence Detector
Mr. Robert talk is today at room num. 7. Let's go?
| | | | ❌
| | ✅
Tokenizer
Mr. Robert talk is today at room num. 7. Let's go?
|| | | | | | | || || | ||| | | ❌
| | | | | | | | || | | | | | ✅
By solving small problems each time
Each step of a pipeline solves one ambiguity problem.
Name Finder
<Person>Washington</Person> was the first president of the USA.
<Place>Washington</Place> is a state in the Pacific Northwest region
of the USA.
POS Tagger
Laura Keene brushed by him with the glass of water .
| | | | | | | | | | |
NNP NNP VBD IN PRP IN DT NN IN NN .
By solving small problems each time
A pipeline can be long and resolve many ambiguities
Lemmatizer
He is better than many others
| | | | | |
He be good than many other
Apache OpenNLP
Apache OpenNLP
Mature project (> 10 years)
Actively developed
Machine learning
Java
Easy to train
Highly customizable
Fast
Language Detector (soon)
Sentence detector
Tokenizer
Part of Speech Tagger
Lemmatizer
Chunker
Parser
....
Training Models for English
Corpus - OntoNotes (https://catalog.ldc.upenn.edu/ldc2013t19)
bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir
~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-ontonotes.bin
bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir
~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin
Training Models for Portuguese
Corpus - Amazonia (http://www.linguateca.pt/floresta/corpus.html)
bin/opennlp TokenizerTrainer.ad -lang por -data amazonia.ad -model por-tokenizer.bin -detokenizer
lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1
bin/opennlp POSTaggerTrainer.ad -lang por -data amazonia.ad -model por-pos.bin -encoding
ISO-8859-1 -includeFeatures false
bin/opennlp ChunkerTrainerME.ad -lang por -data amazonia.ad -model por-chunk.bin -encoding
ISO-8859-1
bin/opennlp TokenNameFinderTrainer.ad -lang por -data amazonia.ad -model por-ner.bin -encoding
ISO-8859-1
Name Finder API - Detect Names
NameFinderME nameFinder = new NameFinderME(new
TokenNameFinderModel(
OpenNLPMain.class.getResource("/opennlp-models/por-ner.bin”)));
for (String document[][] : documents) {
for (String[] sentence : document) {
Span nameSpans[] = nameFinder.find(sentence);
// do something with the names
}
nameFinder.clearAdaptiveData()
}
Name Finder API - Train a model
ObjectStream<String> lineStream =
new PlainTextByLineStream(new
FileInputStream("en-ner-person.train"), StandardCharsets.UTF8);
TokenNameFinderModel model;
try (ObjectStream<NameSample> sampleStream = new
NameSampleDataStream(lineStream)) {
model = NameFinderME.train("en", "person", sampleStream,
TrainingParameters.defaultParams(),
TokenNameFinderFactory nameFinderFactory);
}
model.serialize(modelFile);
Name Finder API - Evaluate a model
TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new
NameFinderME(model));
evaluator.evaluate(sampleStream);
FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());
Name Finder API - Cross Evaluate a model
FileInputStream sampleDataIn = new FileInputStream("en-ner-person.train");
ObjectStream<NameSample> sampleStream = new
PlainTextByLineStream(sampleDataIn.getChannel(),
StandardCharsets.UTF_8);
TokenNameFinderCrossValidator evaluator = new
TokenNameFinderCrossValidator("en", 100, 5);
evaluator.evaluate(sampleStream, 10);
FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());
Language
Detector
Sentence
Detector
Tokenizer
POS
Tagger
Lemmatizer
Name
Finder
Chunker
Language 1
Language 2
Language N
Index
.
.
.
Apache Flink
Apache Flink
Mature project - 320+ contributors, > 11K commits
Very Active project on Github
Java/Scala
Streaming first
Fault-Tolerant
Scalable - to 1000s of nodes and more
High Throughput, Low Latency
Apache Flink - Pos Tagger and NER
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> portugeseText =
env.readTextFile(OpenNLPMain.class.getResource(
"/input/por_newscrawl.txt").getFile());
DataStream<String> engText = env.readTextFile(
OpenNLPMain.class.getResource("/input/eng_news.txt").getFile());
DataStream<String> mergedStream = inputStream.union(portugeseText);
SplitStream<Tuple2<String, String>> langStream = mergedStream.split(new
LanguageSelector());
Apache Flink - Pos Tagger and NER
DataStream<Tuple2<String, String>> porNewsArticles = langStream.select("por");
DataStream<Tuple2<String, String[]>> porNewsTokenized = porNewsArticles.map(new
PorTokenizerMapFunction());
DataStream<POSSample> porNewsPOS = porNewsTokenized.map(new
PorPOSTaggerMapFunction());
DataStream<NameSample> porNewsNamedEntities = porNewsTokenized.map(new
PorNameFinderMapFunction());
Apache Flink - Pos Tagger and NER
private static class LanguageSelector implements OutputSelector<Tuple2<String, String>> {
public Iterable<String> select(Tuple2<String, String> s) {
List<String> list = new ArrayList<>();
list.add(languageDetectorME.predictLanguage(s.f1).getLang());
return list;
}
}
private static class PorTokenizerMapFunction implements MapFunction<Tuple2<String, String>,
Tuple2<String, String[]>> {
public Tuple2<String, String[]> map(Tuple2<String, String> s) {
return new Tuple2<>(s.f0, porTokenizer.tokenize(s.f0));
}
}
Apache Flink - Pos Tagger and NER
private static class PorPOSTaggerMapFunction implements MapFunction<Tuple2<String, String[]>,
POSSample> {
public POSSample map(Tuple2<String, String[]> s) {
String[] tags = porPosTagger.tag(s.f1);
return new POSSample(s.f0, s.f1, tags);
}
}
private static class PorNameFinderMapFunction implements MapFunction<Tuple2<String, String[]>,
NameSample> {
public NameSample map(Tuple2<String, String[]> s) {
Span[] names = engNameFinder.find(s.f1);
return new NameSample(s.f0, s.f1, names, null, true);
}
}
What’s Coming ??
What’s Coming ??
● DL4J: Mature Project: 114 contributors, ~8k commits
● Modular: Tensor library, reinforcement learning, ETL,..
● Focused on integrating with JVM ecosystem while
supporting state of the art like gpus on large clusters
● Implements most neural nets you’d need for language
● Named Entity Recognition using DL4J with LSTMs
● Language Detection using DL4J with LSTMs
● Possible: Translation using Bidirectional LSTMs with embeddings
● Computation graph architecture for more advanced use cases
Credits
Joern Kottmann — PMC Chair, Apache OpenNLP
Tommaso Teofili --- PMC - Apache Lucene, Apache OpenNLP
William Colen --- Head of Technology, Stilingue - Inteligência Artificial,
Sao Paulo, Brazil
PMC - Apache OpenNLP
Till Rohrmann --- Engineering Lead, Data Artisans, Berlin, Germany
Committer and PMC, Apache Flink
Fabian Hueske --- Data Artisans, Committer and PMC on Apache Flink
Questions ???

More Related Content

What's hot

ブロックチェーン技術が拓くオープンサイエンスの未来.pdf
ブロックチェーン技術が拓くオープンサイエンスの未来.pdfブロックチェーン技術が拓くオープンサイエンスの未来.pdf
ブロックチェーン技術が拓くオープンサイエンスの未来.pdf
Hiro Hamada
 
Predictive Biases in Natural Language Processing Models: A Conceptual Framewo...
Predictive Biases in Natural Language Processing Models: A Conceptual Framewo...Predictive Biases in Natural Language Processing Models: A Conceptual Framewo...
Predictive Biases in Natural Language Processing Models: A Conceptual Framewo...
Deven Shah
 

What's hot (20)

TIS 戦略技術センター AI技術推進室紹介
TIS 戦略技術センター AI技術推進室紹介TIS 戦略技術センター AI技術推進室紹介
TIS 戦略技術センター AI技術推進室紹介
 
Python Web Development Tutorial | Web Development Using Django | Edureka
Python Web Development Tutorial | Web Development Using Django | EdurekaPython Web Development Tutorial | Web Development Using Django | Edureka
Python Web Development Tutorial | Web Development Using Django | Edureka
 
Logstashを愛して5年、370ページを超えるガチ本を書いてしまった男の話.
Logstashを愛して5年、370ページを超えるガチ本を書いてしまった男の話.Logstashを愛して5年、370ページを超えるガチ本を書いてしまった男の話.
Logstashを愛して5年、370ページを超えるガチ本を書いてしまった男の話.
 
[CB19] S-TIP: サイバー脅威インテリジェンスのシームレスな活用プラットフォーム by 山田 幸治, 里見 敏孝
[CB19] S-TIP: サイバー脅威インテリジェンスのシームレスな活用プラットフォーム by 山田 幸治, 里見 敏孝[CB19] S-TIP: サイバー脅威インテリジェンスのシームレスな活用プラットフォーム by 山田 幸治, 里見 敏孝
[CB19] S-TIP: サイバー脅威インテリジェンスのシームレスな活用プラットフォーム by 山田 幸治, 里見 敏孝
 
OpenAI の音声認識 AI「Whisper」をテストしてみた
OpenAI の音声認識 AI「Whisper」をテストしてみたOpenAI の音声認識 AI「Whisper」をテストしてみた
OpenAI の音声認識 AI「Whisper」をテストしてみた
 
Soresa Spa - Regione Campania - Progetto SINFONIA
Soresa Spa - Regione Campania - Progetto SINFONIASoresa Spa - Regione Campania - Progetto SINFONIA
Soresa Spa - Regione Campania - Progetto SINFONIA
 
Identity based encryption from the weil pairing
Identity based encryption from the weil pairingIdentity based encryption from the weil pairing
Identity based encryption from the weil pairing
 
軟體接案自由職業者 (Freelancer) 意想不到的風險
軟體接案自由職業者 (Freelancer) 意想不到的風險軟體接案自由職業者 (Freelancer) 意想不到的風險
軟體接案自由職業者 (Freelancer) 意想不到的風險
 
ブロックチェーン技術が拓くオープンサイエンスの未来.pdf
ブロックチェーン技術が拓くオープンサイエンスの未来.pdfブロックチェーン技術が拓くオープンサイエンスの未来.pdf
ブロックチェーン技術が拓くオープンサイエンスの未来.pdf
 
data Structure
data Structuredata Structure
data Structure
 
LineairDBの紹介
LineairDBの紹介LineairDBの紹介
LineairDBの紹介
 
Debugging Python with Pdb!
Debugging Python with Pdb!Debugging Python with Pdb!
Debugging Python with Pdb!
 
社内ドキュメント検索システム構築のノウハウ
社内ドキュメント検索システム構築のノウハウ社内ドキュメント検索システム構築のノウハウ
社内ドキュメント検索システム構築のノウハウ
 
Predictive Biases in Natural Language Processing Models: A Conceptual Framewo...
Predictive Biases in Natural Language Processing Models: A Conceptual Framewo...Predictive Biases in Natural Language Processing Models: A Conceptual Framewo...
Predictive Biases in Natural Language Processing Models: A Conceptual Framewo...
 
F strings
F stringsF strings
F strings
 
Word embeddings
Word embeddingsWord embeddings
Word embeddings
 
NumPy.pptx
NumPy.pptxNumPy.pptx
NumPy.pptx
 
Streaming with Varnish
Streaming with VarnishStreaming with Varnish
Streaming with Varnish
 
Natural Language Processing (NLP) - Introduction
Natural Language Processing (NLP) - IntroductionNatural Language Processing (NLP) - Introduction
Natural Language Processing (NLP) - Introduction
 
資料視覺化 (科智企業股份有限公司 內訓課程)
資料視覺化 (科智企業股份有限公司 內訓課程)資料視覺化 (科智企業股份有限公司 內訓課程)
資料視覺化 (科智企業股份有限公司 內訓課程)
 

Similar to Large Scale Text Processing

Python Intro For Managers
Python Intro For ManagersPython Intro For Managers
Python Intro For Managers
Atul Shridhar
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003
butest
 
Generative programming (mostly parser generation)
Generative programming (mostly parser generation)Generative programming (mostly parser generation)
Generative programming (mostly parser generation)
Ralf Laemmel
 

Similar to Large Scale Text Processing (20)

DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf EremyanDataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
 
Ai meetup Neural machine translation updated
Ai meetup Neural machine translation updatedAi meetup Neural machine translation updated
Ai meetup Neural machine translation updated
 
The Mystery of Natural Language Processing
The Mystery of Natural Language ProcessingThe Mystery of Natural Language Processing
The Mystery of Natural Language Processing
 
Natural Language Processing - Research and Application Trends
Natural Language Processing - Research and Application TrendsNatural Language Processing - Research and Application Trends
Natural Language Processing - Research and Application Trends
 
Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
 
Open nlp presentationss
Open nlp presentationssOpen nlp presentationss
Open nlp presentationss
 
Python Intro For Managers
Python Intro For ManagersPython Intro For Managers
Python Intro For Managers
 
AIMeetup #4: Neural-machine-translation
AIMeetup #4: Neural-machine-translationAIMeetup #4: Neural-machine-translation
AIMeetup #4: Neural-machine-translation
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Python
 
programming language.pdf
programming language.pdfprogramming language.pdf
programming language.pdf
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003
 
Using Stanza NLP and TensorFlow to create a summary of a book
Using Stanza NLP and TensorFlow to create a summary of a bookUsing Stanza NLP and TensorFlow to create a summary of a book
Using Stanza NLP and TensorFlow to create a summary of a book
 
Distributed tracing with erlang/elixir
Distributed tracing with erlang/elixirDistributed tracing with erlang/elixir
Distributed tracing with erlang/elixir
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extraction
 
ppt
pptppt
ppt
 
Practical NLP with Lisp
Practical NLP with LispPractical NLP with Lisp
Practical NLP with Lisp
 
Generative programming (mostly parser generation)
Generative programming (mostly parser generation)Generative programming (mostly parser generation)
Generative programming (mostly parser generation)
 
Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...
Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...
Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...
 
Nltk - Boston Text Analytics
Nltk - Boston Text AnalyticsNltk - Boston Text Analytics
Nltk - Boston Text Analytics
 

More from Suneel Marthi

More from Suneel Marthi (9)

Measuring vegetation health to predict natural hazards
Measuring vegetation health to predict natural hazardsMeasuring vegetation health to predict natural hazards
Measuring vegetation health to predict natural hazards
 
Large scale landuse classification of satellite imagery
Large scale landuse classification of satellite imageryLarge scale landuse classification of satellite imagery
Large scale landuse classification of satellite imagery
 
Streaming topic model training and inference
Streaming topic model training and inferenceStreaming topic model training and inference
Streaming topic model training and inference
 
Large scale landuse classification of satellite imagery
Large scale landuse classification of satellite imageryLarge scale landuse classification of satellite imagery
Large scale landuse classification of satellite imagery
 
Building streaming pipelines for neural machine translation
Building streaming pipelines for neural machine translationBuilding streaming pipelines for neural machine translation
Building streaming pipelines for neural machine translation
 
Moving beyond moving bytes
Moving beyond moving bytesMoving beyond moving bytes
Moving beyond moving bytes
 
Embracing diversity searching over multiple languages
Embracing diversity  searching over multiple languagesEmbracing diversity  searching over multiple languages
Embracing diversity searching over multiple languages
 
Distributed Machine Learning with Apache Mahout
Distributed Machine Learning with Apache MahoutDistributed Machine Learning with Apache Mahout
Distributed Machine Learning with Apache Mahout
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 

Recently uploaded

Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 

Recently uploaded (20)

Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 

Large Scale Text Processing

  • 1. Large Scale Processing of Text Suneel Marthi DataWorks Summit 2017, San Jose, California @suneelmarthi
  • 2. $WhoAmI ● Principal Software Engineer in the Office of Technology, Red Hat ● Member of Apache Software Foundation ● Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache Streams
  • 3. What is a Natural Language?
  • 4. What is a Natural Language? Is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation (From Wikipedia)
  • 5. What is NOT a Natural Language?
  • 6. Characteristics of Natural Language Unstructured Ambiguous Complex Hidden semantic Ironic Informal Unpredictable Rich Most updated Noise Hard to search
  • 7. and it holds most of human knowledge
  • 8. and it holds most of human knowledge
  • 9. and but it holds most of human knowledge
  • 10. As information overload grows ever worse, computers may become our only hope for handling a growing deluge of documents. MIT Press - May 12, 2017
  • 11. What is Natural Language Processing? NLP is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.(From Wikipedia)
  • 12. ???
  • 13.
  • 14. How?
  • 15. By solving small problems each time A pipeline where an ambiguity type is solved, incrementally. Sentence Detector Mr. Robert talk is today at room num. 7. Let's go? | | | | ❌ | | ✅ Tokenizer Mr. Robert talk is today at room num. 7. Let's go? || | | | | | | || || | ||| | | ❌ | | | | | | | | || | | | | | ✅
  • 16. By solving small problems each time Each step of a pipeline solves one ambiguity problem. Name Finder <Person>Washington</Person> was the first president of the USA. <Place>Washington</Place> is a state in the Pacific Northwest region of the USA. POS Tagger Laura Keene brushed by him with the glass of water . | | | | | | | | | | | NNP NNP VBD IN PRP IN DT NN IN NN .
  • 17. By solving small problems each time A pipeline can be long and resolve many ambiguities Lemmatizer He is better than many others | | | | | | He be good than many other
  • 19. Apache OpenNLP Mature project (> 10 years) Actively developed Machine learning Java Easy to train Highly customizable Fast Language Detector (soon) Sentence detector Tokenizer Part of Speech Tagger Lemmatizer Chunker Parser ....
  • 20. Training Models for English Corpus - OntoNotes (https://catalog.ldc.upenn.edu/ldc2013t19) bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-ontonotes.bin bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin
  • 21. Training Models for Portuguese Corpus - Amazonia (http://www.linguateca.pt/floresta/corpus.html) bin/opennlp TokenizerTrainer.ad -lang por -data amazonia.ad -model por-tokenizer.bin -detokenizer lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1 bin/opennlp POSTaggerTrainer.ad -lang por -data amazonia.ad -model por-pos.bin -encoding ISO-8859-1 -includeFeatures false bin/opennlp ChunkerTrainerME.ad -lang por -data amazonia.ad -model por-chunk.bin -encoding ISO-8859-1 bin/opennlp TokenNameFinderTrainer.ad -lang por -data amazonia.ad -model por-ner.bin -encoding ISO-8859-1
  • 22. Name Finder API - Detect Names NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel( OpenNLPMain.class.getResource("/opennlp-models/por-ner.bin”))); for (String document[][] : documents) { for (String[] sentence : document) { Span nameSpans[] = nameFinder.find(sentence); // do something with the names } nameFinder.clearAdaptiveData() }
  • 23. Name Finder API - Train a model ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream("en-ner-person.train"), StandardCharsets.UTF8); TokenNameFinderModel model; try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream)) { model = NameFinderME.train("en", "person", sampleStream, TrainingParameters.defaultParams(), TokenNameFinderFactory nameFinderFactory); } model.serialize(modelFile);
  • 24. Name Finder API - Evaluate a model TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new NameFinderME(model)); evaluator.evaluate(sampleStream); FMeasure result = evaluator.getFMeasure(); System.out.println(result.toString());
  • 25. Name Finder API - Cross Evaluate a model FileInputStream sampleDataIn = new FileInputStream("en-ner-person.train"); ObjectStream<NameSample> sampleStream = new PlainTextByLineStream(sampleDataIn.getChannel(), StandardCharsets.UTF_8); TokenNameFinderCrossValidator evaluator = new TokenNameFinderCrossValidator("en", 100, 5); evaluator.evaluate(sampleStream, 10); FMeasure result = evaluator.getFMeasure(); System.out.println(result.toString());
  • 28. Apache Flink Mature project - 320+ contributors, > 11K commits Very Active project on Github Java/Scala Streaming first Fault-Tolerant Scalable - to 1000s of nodes and more High Throughput, Low Latency
  • 29. Apache Flink - Pos Tagger and NER final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); DataStream<String> portugeseText = env.readTextFile(OpenNLPMain.class.getResource( "/input/por_newscrawl.txt").getFile()); DataStream<String> engText = env.readTextFile( OpenNLPMain.class.getResource("/input/eng_news.txt").getFile()); DataStream<String> mergedStream = inputStream.union(portugeseText); SplitStream<Tuple2<String, String>> langStream = mergedStream.split(new LanguageSelector());
  • 30. Apache Flink - Pos Tagger and NER DataStream<Tuple2<String, String>> porNewsArticles = langStream.select("por"); DataStream<Tuple2<String, String[]>> porNewsTokenized = porNewsArticles.map(new PorTokenizerMapFunction()); DataStream<POSSample> porNewsPOS = porNewsTokenized.map(new PorPOSTaggerMapFunction()); DataStream<NameSample> porNewsNamedEntities = porNewsTokenized.map(new PorNameFinderMapFunction());
  • 31. Apache Flink - Pos Tagger and NER private static class LanguageSelector implements OutputSelector<Tuple2<String, String>> { public Iterable<String> select(Tuple2<String, String> s) { List<String> list = new ArrayList<>(); list.add(languageDetectorME.predictLanguage(s.f1).getLang()); return list; } } private static class PorTokenizerMapFunction implements MapFunction<Tuple2<String, String>, Tuple2<String, String[]>> { public Tuple2<String, String[]> map(Tuple2<String, String> s) { return new Tuple2<>(s.f0, porTokenizer.tokenize(s.f0)); } }
  • 32. Apache Flink - Pos Tagger and NER private static class PorPOSTaggerMapFunction implements MapFunction<Tuple2<String, String[]>, POSSample> { public POSSample map(Tuple2<String, String[]> s) { String[] tags = porPosTagger.tag(s.f1); return new POSSample(s.f0, s.f1, tags); } } private static class PorNameFinderMapFunction implements MapFunction<Tuple2<String, String[]>, NameSample> { public NameSample map(Tuple2<String, String[]> s) { Span[] names = engNameFinder.find(s.f1); return new NameSample(s.f0, s.f1, names, null, true); } }
  • 34. What’s Coming ?? ● DL4J: Mature Project: 114 contributors, ~8k commits ● Modular: Tensor library, reinforcement learning, ETL,.. ● Focused on integrating with JVM ecosystem while supporting state of the art like gpus on large clusters ● Implements most neural nets you’d need for language ● Named Entity Recognition using DL4J with LSTMs ● Language Detection using DL4J with LSTMs ● Possible: Translation using Bidirectional LSTMs with embeddings ● Computation graph architecture for more advanced use cases
  • 35. Credits Joern Kottmann — PMC Chair, Apache OpenNLP Tommaso Teofili --- PMC - Apache Lucene, Apache OpenNLP William Colen --- Head of Technology, Stilingue - Inteligência Artificial, Sao Paulo, Brazil PMC - Apache OpenNLP Till Rohrmann --- Engineering Lead, Data Artisans, Berlin, Germany Committer and PMC, Apache Flink Fabian Hueske --- Data Artisans, Committer and PMC on Apache Flink