SlideShare una empresa de Scribd logo
1 de 3
Descargar para leer sin conexión
Linguistic component: Lemmatizer for
the Russian language
Technical description
SemanticAnalyzer Group, 2013-08-29
www.semanticanalyzer.info
This document describes technical details of lemmatizer for the Russian language.
It is assumed, that prior to using this component an input text has been preprocessed with Tokenizer
component (see the corresponding Technical Description).
Demo package sent upon request contains the following:
 Java library of tokenizer in a form of a binary
 run_lemmatizer.sh script for swift checking the functionality of the module
 messages_to_lemmatize.txt file containing examples of generic text and tweets for tokenization
using the run_lemmatizer.sh script
Algorithm is based on combination of the following:
 dictionary search
 algorithm calculating morphological properties of unknown words
 compound word analyzer
 analyzer of numbers
 rule-based analyzer
Speed of processing
Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
Operating system: ubuntu 10.04, Java 1.7.0_21 64 bit server
5037 characters/ms
880 tokens/ms
Tests were conducted in a single thread.
Format of the messages_to_lemmatize.txt file
This file describes input data for the tokenizer module for demo purposes.Формат:
Format:
TexttText type
Text contains textual data in Russian for lemmatization
t – tab symbol
Text type: supported values are GENERAL_TEXT and TWITTER.
Examples of lemmatization
The run_lemmatizer.sh script will generate the following file: messages_to_lemmatize.out.
For the following input file messages_to_tokenize.txt:
Прекрасный вечер))) прогулка по Набережной - самое то;) только маккафе подпортило настроение(
TWITTER
This output gets generated:
Прекрасный, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ый,endings=[а, ая, ее, ей, о, ого, ое, ой, ом,
ому, ою, ую, ы, ые, ым, ыми, ых],lemma=прекрасный,pos=ADJECTIVE,weight=14317,stem=прекрасн]
вечер, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[а, ам, ами, ах, е, ов, ом,
у],lemma=вечер,pos=NOUN,weight=39101,stem=вечер]
emopostkn, type: ALPHANUM
emopostkn, type: ALPHANUM
emopostkn, type: ALPHANUM
прогулка, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=а,endings=[ам, ами, ах, е, и, ой, ою,
у],lemma=прогулка,pos=NOUN,weight=3054,stem=прогулк]
по, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=по,pos=PREPOSITION,weight=573
564,stem=по]
Набережной, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ая,endings=[ой, ою, ую, ые, ым, ыми,
ых],lemma=набережная,pos=NOUN,weight=2908,stem=набережн]
-, type: PUNCT
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=-
,pos=NUMERAL,weight=0,stem=-]
самое, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ый,endings=[ая, ого, ое, ой, ом, ому, ою,
ую, ые, ым, ыми, ых],lemma=самый,pos=ADJECTIVE,weight=0,stem=сам]
то, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=то,pos=CONJUNCTION,weight=0,s
tem=то]
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=то,pos=ADVERB,weight=0,stem=т
о]
MorphDesc[removeNum=0,lemmaEnding=от,endings=[а, е, ем, еми, ех, о, ого, ой, ом,
ому, ою, у],lemma=тот,pos=PRONOUN_ADJECTIVE,weight=1139844,stem=т]
MorphDesc[removeNum=0,lemmaEnding=о,endings=[е, ем, еми, ех, ого, ом,
ому],lemma=то,pos=NOUN,weight=0,stem=т]
emopostkn, type: ALPHANUM
только, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=только,pos=PARTICLE,weight=0,s
tem=только]
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=только,pos=ADVERB,weight=0,st
em=только]
маккафе, type: ALPHANUM
подпортило, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ть,endings=[в, вший, вшего, вшему, вшим,
вшем, вшая, вшей, вшую, вшею, вшее, вшие, вших, вшими, вши, л, ла, ли,
ло],lemma=подпортить,pos=VERB,weight=190,stem=подпорти]
настроение, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=е,endings=[ем, и, ю,
я],lemma=настроение,pos=NOUN,weight=8416,stem=настроени]
MorphDesc[removeNum=0,lemmaEnding=е,endings=[ем, и, й, ю, я, ям, ями,
ях],lemma=настроение,pos=NOUN,weight=8416,stem=настроени]
emonegtkn, type: ALPHANUM
Examples of using the library from the Java code
MorphAnalyzer morphAnalyzer = MorphAnalyzerLoader.loadDefault();
System.out.println(morphAnalyzer.analyzeBest("русского"));
output:
MorphDesc[removeNum=0,lemmaEnding=ий,endings=[ая, ие, им, ими, их, ого, ое, ой, ом, ому, ою,
ую],lemma=русский,pos=ADJECTIVE,weight=36739,stem=русск]

Más contenido relacionado

La actualidad más candente

Batch file programming
Batch file programmingBatch file programming
Batch file programmingswapnil kapate
 
Linux commd
Linux commdLinux commd
Linux commdragav03
 
Linux commd
Linux commdLinux commd
Linux commdragav03
 
In just one hour i will make you a power shell ninja
In just one hour i will make you a power shell ninjaIn just one hour i will make you a power shell ninja
In just one hour i will make you a power shell ninjaJason Brown
 
Command line for the beginner - Using the command line in developing for the...
Command line for the beginner -  Using the command line in developing for the...Command line for the beginner -  Using the command line in developing for the...
Command line for the beginner - Using the command line in developing for the...Jim Birch
 
Automating with ansible (Part c)
Automating with ansible (Part c) Automating with ansible (Part c)
Automating with ansible (Part c) iman darabi
 

La actualidad más candente (11)

Batch file programming
Batch file programmingBatch file programming
Batch file programming
 
Linux commd
Linux commdLinux commd
Linux commd
 
Linux commd
Linux commdLinux commd
Linux commd
 
Sahul
SahulSahul
Sahul
 
In just one hour i will make you a power shell ninja
In just one hour i will make you a power shell ninjaIn just one hour i will make you a power shell ninja
In just one hour i will make you a power shell ninja
 
Command line for the beginner - Using the command line in developing for the...
Command line for the beginner -  Using the command line in developing for the...Command line for the beginner -  Using the command line in developing for the...
Command line for the beginner - Using the command line in developing for the...
 
Linux introduction Class 03
Linux introduction Class 03Linux introduction Class 03
Linux introduction Class 03
 
Linux
LinuxLinux
Linux
 
Automating with ansible (Part c)
Automating with ansible (Part c) Automating with ansible (Part c)
Automating with ansible (Part c)
 
Apache
ApacheApache
Apache
 
Intro_Unix_Ppt
Intro_Unix_PptIntro_Unix_Ppt
Intro_Unix_Ppt
 

Destacado

Linguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageLinguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageDmitry Kan
 
Starget sentiment analyzer for English
Starget sentiment analyzer for EnglishStarget sentiment analyzer for English
Starget sentiment analyzer for EnglishDmitry Kan
 
Lucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupLucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupDmitry Kan
 
Social spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupSocial spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupDmitry Kan
 
Solr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsSolr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsDmitry Kan
 
Introduction To Machine Translation 1
Introduction To Machine Translation 1Introduction To Machine Translation 1
Introduction To Machine Translation 1Dmitry Kan
 
Semantic feature machine translation system
Semantic feature machine translation systemSemantic feature machine translation system
Semantic feature machine translation systemDmitry Kan
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)Dmitry Kan
 
Automatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryAutomatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryDmitry Kan
 
MTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationMTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationDmitry Kan
 
Introduction To Machine Translation
Introduction To Machine TranslationIntroduction To Machine Translation
Introduction To Machine TranslationDmitry Kan
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopDmitry Kan
 
Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Dmitry Kan
 
Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Dmitry Kan
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesDmitry Kan
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageDmitry Kan
 
Semantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesSemantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesDmitry Kan
 
IR: Open source state
IR: Open source stateIR: Open source state
IR: Open source stateDmitry Kan
 
SAS University Edition - Getting Started
SAS University Edition - Getting StartedSAS University Edition - Getting Started
SAS University Edition - Getting StartedCraig Trim
 

Destacado (19)

Linguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageLinguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian language
 
Starget sentiment analyzer for English
Starget sentiment analyzer for EnglishStarget sentiment analyzer for English
Starget sentiment analyzer for English
 
Lucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupLucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeup
 
Social spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupSocial spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer Group
 
Solr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsSolr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwords
 
Introduction To Machine Translation 1
Introduction To Machine Translation 1Introduction To Machine Translation 1
Introduction To Machine Translation 1
 
Semantic feature machine translation system
Semantic feature machine translation systemSemantic feature machine translation system
Semantic feature machine translation system
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)
 
Automatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryAutomatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational Dictionary
 
MTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationMTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine Translation
 
Introduction To Machine Translation
Introduction To Machine TranslationIntroduction To Machine Translation
Introduction To Machine Translation
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache Hadoop
 
Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011
 
Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slides
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian language
 
Semantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesSemantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use cases
 
IR: Open source state
IR: Open source stateIR: Open source state
IR: Open source state
 
SAS University Edition - Getting Started
SAS University Edition - Getting StartedSAS University Edition - Getting Started
SAS University Edition - Getting Started
 

Similar a Linguistic component Lemmatizer for the Russian language

Chapter 3 Using Unix Commands
Chapter 3 Using Unix CommandsChapter 3 Using Unix Commands
Chapter 3 Using Unix CommandsMeenalJabde
 
Introduction to Assembly Language Programming
Introduction to Assembly Language ProgrammingIntroduction to Assembly Language Programming
Introduction to Assembly Language ProgrammingRahul P
 
55 best linux tips, tricks and command lines
55 best linux tips, tricks and command lines55 best linux tips, tricks and command lines
55 best linux tips, tricks and command linesArif Wahyudi
 
Developing web apps using Erlang-Web
Developing web apps using Erlang-WebDeveloping web apps using Erlang-Web
Developing web apps using Erlang-Webfanqstefan
 
role of lexical anaysis
role of lexical anaysisrole of lexical anaysis
role of lexical anaysisSudhaa Ravi
 
Introduction to Compilers
Introduction to CompilersIntroduction to Compilers
Introduction to Compilersvijaya603274
 
Compiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaCompiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaSrikanth Vanama
 
Ruby 1.9.3 Basic Introduction
Ruby 1.9.3 Basic IntroductionRuby 1.9.3 Basic Introduction
Ruby 1.9.3 Basic IntroductionPrabu D
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
Matlab: Procedures And Functions
Matlab: Procedures And FunctionsMatlab: Procedures And Functions
Matlab: Procedures And Functionsmatlab Content
 
Procedures And Functions in Matlab
Procedures And Functions in MatlabProcedures And Functions in Matlab
Procedures And Functions in MatlabDataminingTools Inc
 
Generating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonGenerating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonTristan Penman
 
phases of compiler-analysis phase
phases of compiler-analysis phasephases of compiler-analysis phase
phases of compiler-analysis phaseSuyash Srivastava
 
Bioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeBioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeProf. Wim Van Criekinge
 
Cd ch2 - lexical analysis
Cd   ch2 - lexical analysisCd   ch2 - lexical analysis
Cd ch2 - lexical analysismengistu23
 

Similar a Linguistic component Lemmatizer for the Russian language (20)

Chapter 3 Using Unix Commands
Chapter 3 Using Unix CommandsChapter 3 Using Unix Commands
Chapter 3 Using Unix Commands
 
Introduction to Assembly Language Programming
Introduction to Assembly Language ProgrammingIntroduction to Assembly Language Programming
Introduction to Assembly Language Programming
 
55 best linux tips, tricks and command lines
55 best linux tips, tricks and command lines55 best linux tips, tricks and command lines
55 best linux tips, tricks and command lines
 
Developing web apps using Erlang-Web
Developing web apps using Erlang-WebDeveloping web apps using Erlang-Web
Developing web apps using Erlang-Web
 
role of lexical anaysis
role of lexical anaysisrole of lexical anaysis
role of lexical anaysis
 
Introduction to Compilers
Introduction to CompilersIntroduction to Compilers
Introduction to Compilers
 
Parsing
ParsingParsing
Parsing
 
Compiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaCompiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_Vanama
 
Ruby 1.9.3 Basic Introduction
Ruby 1.9.3 Basic IntroductionRuby 1.9.3 Basic Introduction
Ruby 1.9.3 Basic Introduction
 
AD102 - Break out of the Box
AD102 - Break out of the BoxAD102 - Break out of the Box
AD102 - Break out of the Box
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Matlab: Procedures And Functions
Matlab: Procedures And FunctionsMatlab: Procedures And Functions
Matlab: Procedures And Functions
 
Procedures And Functions in Matlab
Procedures And Functions in MatlabProcedures And Functions in Matlab
Procedures And Functions in Matlab
 
COMPILER DESIGN- Introduction & Lexical Analysis:
COMPILER DESIGN- Introduction & Lexical Analysis: COMPILER DESIGN- Introduction & Lexical Analysis:
COMPILER DESIGN- Introduction & Lexical Analysis:
 
Generating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonGenerating parsers using Ragel and Lemon
Generating parsers using Ragel and Lemon
 
phases of compiler-analysis phase
phases of compiler-analysis phasephases of compiler-analysis phase
phases of compiler-analysis phase
 
Bioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeBioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekinge
 
Cd ch2 - lexical analysis
Cd   ch2 - lexical analysisCd   ch2 - lexical analysis
Cd ch2 - lexical analysis
 
Linux commands
Linux commandsLinux commands
Linux commands
 
Compiler Design
Compiler DesignCompiler Design
Compiler Design
 

Más de Dmitry Kan

London IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use casesLondon IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use casesDmitry Kan
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural searchDmitry Kan
 
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...Dmitry Kan
 
SentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social mediaSentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social mediaDmitry Kan
 
Icsoft 2011 51_cr
Icsoft 2011 51_crIcsoft 2011 51_cr
Icsoft 2011 51_crDmitry Kan
 
Computer Semantics And Machine Translation
Computer Semantics And Machine TranslationComputer Semantics And Machine Translation
Computer Semantics And Machine TranslationDmitry Kan
 

Más de Dmitry Kan (6)

London IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use casesLondon IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use cases
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
 
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
 
SentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social mediaSentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social media
 
Icsoft 2011 51_cr
Icsoft 2011 51_crIcsoft 2011 51_cr
Icsoft 2011 51_cr
 
Computer Semantics And Machine Translation
Computer Semantics And Machine TranslationComputer Semantics And Machine Translation
Computer Semantics And Machine Translation
 

Último

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 

Último (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

Linguistic component Lemmatizer for the Russian language

  • 1. Linguistic component: Lemmatizer for the Russian language Technical description SemanticAnalyzer Group, 2013-08-29 www.semanticanalyzer.info This document describes technical details of lemmatizer for the Russian language. It is assumed, that prior to using this component an input text has been preprocessed with Tokenizer component (see the corresponding Technical Description). Demo package sent upon request contains the following:  Java library of tokenizer in a form of a binary  run_lemmatizer.sh script for swift checking the functionality of the module  messages_to_lemmatize.txt file containing examples of generic text and tweets for tokenization using the run_lemmatizer.sh script Algorithm is based on combination of the following:  dictionary search  algorithm calculating morphological properties of unknown words  compound word analyzer  analyzer of numbers  rule-based analyzer Speed of processing Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz Operating system: ubuntu 10.04, Java 1.7.0_21 64 bit server 5037 characters/ms 880 tokens/ms Tests were conducted in a single thread. Format of the messages_to_lemmatize.txt file This file describes input data for the tokenizer module for demo purposes.Формат: Format: TexttText type Text contains textual data in Russian for lemmatization t – tab symbol Text type: supported values are GENERAL_TEXT and TWITTER. Examples of lemmatization The run_lemmatizer.sh script will generate the following file: messages_to_lemmatize.out. For the following input file messages_to_tokenize.txt:
  • 2. Прекрасный вечер))) прогулка по Набережной - самое то;) только маккафе подпортило настроение( TWITTER This output gets generated: Прекрасный, type: ALPHANUM MorphDesc[removeNum=0,lemmaEnding=ый,endings=[а, ая, ее, ей, о, ого, ое, ой, ом, ому, ою, ую, ы, ые, ым, ыми, ых],lemma=прекрасный,pos=ADJECTIVE,weight=14317,stem=прекрасн] вечер, type: ALPHANUM MorphDesc[removeNum=0,lemmaEnding=,endings=[а, ам, ами, ах, е, ов, ом, у],lemma=вечер,pos=NOUN,weight=39101,stem=вечер] emopostkn, type: ALPHANUM emopostkn, type: ALPHANUM emopostkn, type: ALPHANUM прогулка, type: ALPHANUM MorphDesc[removeNum=0,lemmaEnding=а,endings=[ам, ами, ах, е, и, ой, ою, у],lemma=прогулка,pos=NOUN,weight=3054,stem=прогулк] по, type: ALPHANUM MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=по,pos=PREPOSITION,weight=573 564,stem=по] Набережной, type: ALPHANUM MorphDesc[removeNum=0,lemmaEnding=ая,endings=[ой, ою, ую, ые, ым, ыми, ых],lemma=набережная,pos=NOUN,weight=2908,stem=набережн] -, type: PUNCT MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=- ,pos=NUMERAL,weight=0,stem=-] самое, type: ALPHANUM MorphDesc[removeNum=0,lemmaEnding=ый,endings=[ая, ого, ое, ой, ом, ому, ою, ую, ые, ым, ыми, ых],lemma=самый,pos=ADJECTIVE,weight=0,stem=сам] то, type: ALPHANUM MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=то,pos=CONJUNCTION,weight=0,s tem=то] MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=то,pos=ADVERB,weight=0,stem=т о] MorphDesc[removeNum=0,lemmaEnding=от,endings=[а, е, ем, еми, ех, о, ого, ой, ом, ому, ою, у],lemma=тот,pos=PRONOUN_ADJECTIVE,weight=1139844,stem=т] MorphDesc[removeNum=0,lemmaEnding=о,endings=[е, ем, еми, ех, ого, ом, ому],lemma=то,pos=NOUN,weight=0,stem=т] emopostkn, type: ALPHANUM только, type: ALPHANUM
  • 3. MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=только,pos=PARTICLE,weight=0,s tem=только] MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=только,pos=ADVERB,weight=0,st em=только] маккафе, type: ALPHANUM подпортило, type: ALPHANUM MorphDesc[removeNum=0,lemmaEnding=ть,endings=[в, вший, вшего, вшему, вшим, вшем, вшая, вшей, вшую, вшею, вшее, вшие, вших, вшими, вши, л, ла, ли, ло],lemma=подпортить,pos=VERB,weight=190,stem=подпорти] настроение, type: ALPHANUM MorphDesc[removeNum=0,lemmaEnding=е,endings=[ем, и, ю, я],lemma=настроение,pos=NOUN,weight=8416,stem=настроени] MorphDesc[removeNum=0,lemmaEnding=е,endings=[ем, и, й, ю, я, ям, ями, ях],lemma=настроение,pos=NOUN,weight=8416,stem=настроени] emonegtkn, type: ALPHANUM Examples of using the library from the Java code MorphAnalyzer morphAnalyzer = MorphAnalyzerLoader.loadDefault(); System.out.println(morphAnalyzer.analyzeBest("русского")); output: MorphDesc[removeNum=0,lemmaEnding=ий,endings=[ая, ие, им, ими, их, ого, ое, ой, ом, ому, ою, ую],lemma=русский,pos=ADJECTIVE,weight=36739,stem=русск]