Jan Žižka, Ústav informatiky, PEF, Mendelova universita, Brno




Natural Language Processing
   by Advanced Artificial
    Intelligence Methods

                  Jan Žižka
         Department of Informatics
    Faculty of Business and Economics
  Mendel University in Brno, Czech Republic

  zizka.jan@gmail.com, zizka@mendelu.cz


                      (Text Mining)


●   Data, information, knowledge
●   Electronic text data
●   Inductive machine learning (ML)
●   Pre-processing of data and its representation
●   Methods of searching, similarity, pattern recognition
●   Algorithms (just some examples)
●   Application areas






●   Data, information, knowledge
    - data here means all (text) values obtained in some way
      (relevant or irrelevant, with or without noise, exact or
      inexact, approximate, and so on)

    - information is the part of the data that is interesting from
      the viewpoint of the specific problem being solved

    - knowledge is generalized information

    - metaknowledge is “knowledge about knowledge” (for
      example, knowing which knowledge is applicable to
      a specific problem)




●   Electronic text data
    Text in electronic form (ASCII/ANSI, Unicode, etc.).
    Typical text data can be found, e.g., on the Internet.
    Electronic text is used in many areas.


    Electronic text data are created in any common natural
    language (not only in the prevailing English).
    Machine processing of such “human-like” data is
    extraordinarily complicated and often depends on
    the specific language.




●   Inductive machine learning
    - learning from a limited set of examples;

    - the examples generally cover only a part of reality;

    - values that would sufficiently describe the data are missing
      (for example, their distribution);

    - a mathematical model cannot be created for reliable
      prediction or classification;

    - knowledge is obtained by generalizing information.






●   Inductive machine learning

    What is the color of a crow?

    Black? And why?

    Has any of you seen
    a crow that was not black?

    Has anyone seen absolutely all the crows that have ever
    existed anywhere on Earth? (No, surely nobody has.)

    To what degree is the generalization “a crow is black”
    correct and acceptable? Can you say?



●   Inductive machine learning
    How many specific crows do we need
    to see to generalize that “a crow is black”?






●   Inductive machine learning
    The hooded crow: [photograph on the original slide]






●   Inductive machine learning
    The generalization of specific available examples is one of
    the possible learning methods.

    Machines (computers), unlike human beings, usually need
    a significantly larger number of specific examples to
    generalize and thereby to obtain knowledge.

    A method for determining the degree of similarity plays
    a big role, for example, when assigning an unknown
    example to a certain group of known samples.






●   Inductive machine learning
    Machine learning algorithms set their relevant parameters
    automatically during the training phase. The quality of
    the training is verified during testing. If the results of
    testing are acceptable, the trained algorithm can be used
    for the given application.

    The training phase requires suitable learning examples
    because an algorithm’s properties (parameters) are
    ultimately determined by the training data applied.

    The testing phase uses examples that were not used
    by the algorithm during its training phase.
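
    The separation of training and testing examples can be sketched
    as follows. This is only a minimal illustration: the function
    names, the 25 % test fraction, and the toy examples are
    assumptions, not part of the original slides.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into disjoint training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(classify, test_set):
    """Fraction of test examples whose predicted label equals the true label."""
    correct = sum(1 for text, label in test_set if classify(text) == label)
    return correct / len(test_set)

# Hypothetical toy data: (text, label) pairs.
examples = [(f"document {i}", "+" if i % 2 else "-") for i in range(20)]
train_set, test_set = train_test_split(examples)
```

    A classifier trained only on train_set would then be evaluated
    with accuracy(classifier, test_set), so that no test example
    influences the learned parameters.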




●   Pre-processing of data, their representation
    The typical way to obtain knowledge from unstructured
    electronic texts consists of the following steps:


    - source → a necessary volume of (generally noisy) data
    - removing noise → clean data
    - interesting part of the data from the application viewpoint →
      information
    - information generalization → knowledge





●   Pre-processing of data, their representation
    Representing text documents: bag of words (BOW).

    Machine learning methods mostly treat text documents
    as files containing symbolic values (terms, words) without
    analyzing their meaning (at most, only shallowly) or their
    mutual dependence.

    The word order in a document is therefore considered
    “meaningless”. Naturally, this discards some of the
    information content. However, it significantly simplifies
    the processing of natural languages from, for example, the
    classification point of view.
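
    The loss of word order can be seen in a small sketch. The
    helper name bag_of_words is an assumption made for this
    illustration only.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as word counts, ignoring word order."""
    return Counter(text.lower().split())

a = bag_of_words("machine learning")
b = bag_of_words("learning machine")
print(a == b)  # True: both phrases map to the same bag of words
```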




●   Pre-processing of data, their representation
    Pre-processing significantly affects the quality of the results:

    - excluding common words that have no specific
      meaning from the application viewpoint (prepositions,
      abbreviations, definite/indefinite articles, etc.);
    - excluding words with a very low or very high frequency
      across all processed documents;
    - excluding punctuation, spaces, and so on;
    - converting alphabetic characters to lower case;
    - eliminating insignificant characters and words reduces
      the dimensionality of the problem (e.g., from 10^4 to 10^3)
      because each unique word is one dimension.
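
    The steps listed above might be sketched as follows. The tiny
    stop-word list, the function names, and the frequency thresholds
    are hypothetical choices for illustration; a real application
    would use a language-specific stop-word list.

```python
import string
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is"}  # hypothetical tiny list

def tokenize(text):
    """Lower-case the text, replace ASCII punctuation by spaces, drop stop words."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]

def build_dictionary(documents, min_df=1, max_df_ratio=1.0):
    """Keep words whose document frequency lies within the given bounds."""
    tokenized = [tokenize(d) for d in documents]
    df = Counter(w for words in tokenized for w in set(words))
    limit = max_df_ratio * len(documents)
    return sorted(w for w, n in df.items() if min_df <= n <= limit)

docs = ["The cat sat.", "A cat ran!", "The dog ran."]
dictionary = build_dictionary(docs, min_df=2)
```

    Each word kept in the dictionary is one dimension, so raising
    min_df or lowering max_df_ratio directly shrinks the
    representation.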




●   Pre-processing of data, their representation
    An example of text representation in which we ignore
    punctuation, the layout of the text (new lines, paragraphs,
    chapters, etc.), upper and lower case, bilingualism
    (English terms in a Czech sentence), and word order, which
    can be very significant (for example, machine learning versus
    learning machine), and in which we exclude general words
    (“stop words”). We obtain a dictionary (a list of symbols)
    that is used to train a chosen algorithm:






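●   Pre-processing of data, their representation
    The Czech source text of the example (with embedded English terms):

    Příklad representace textu, kde se ignoruje interpunkce, členění
    textu do řádků, velká a malá písmena, dvojjazyčnost (anglické
    termíny v české větě), pořadí slov, které může mít velký
    význam (např. machine learning – strojové učení a learning
    machine – učící stroj má zcela odlišný význam), a vynechají se
    obecná slova.

    [English: An example of text representation in which punctuation,
    the division of the text into lines, upper and lower case,
    bilingualism (English terms in a Czech sentence), and word order,
    which can be highly significant (e.g., machine learning and
    learning machine have completely different meanings), are ignored,
    and general words are omitted.]

    The resulting dictionary, sorted alphabetically:

    anglické české členění dvojjazyčnost ignoruje interpunkce
    learning machine má malá metody mít může obecná odlišný
    písmena pomocí pořadí příklad representace řádků slov stroj
    strojové termíny textu učení učící velká velký větě vynechají
    význam words zcela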





●   Pre-processing of data, their representation
    A further dimensionality reduction can be obtained, for
    example, by converting words to their stems. In the previous
    example, we could reduce the generated dictionary (by unifying
    infinitives, grammatical case, number, voice, and so on), so
    that for these words the dimensionality decreases from 8 to 4:

             mít má stroj strojové učení učící velká velký
                         mít stroj učit velký

    (mít/má: to have/has; stroj/strojové: machine; učení/učící:
    learning; velká/velký: big)

    Stemming, of course, depends on the language. For English,
    there is a simplified system, Porter stemming, in which the
    machine simply cuts off word endings. This is far from
    perfect, but in practice it is very effective.
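
    The idea of cutting off word endings can be illustrated by the
    following crude sketch. It is not the Porter algorithm, which
    applies several ordered rule phases; the suffix list and the
    minimum stem length are assumptions chosen only for this example.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical and far from complete

def crude_stem(word, min_stem=3):
    """Cut off the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["learning", "learned", "learns", "learn", "machines", "machine"]
stems = sorted({crude_stem(w) for w in words})
print(stems)  # ['learn', 'machine']: six dictionary entries reduced to two
```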




●   Pre-processing of data, their representation
    Word incidence can be represented in several ways:

    - binary: 1/0 means a word is/isn’t in a document (the word
      weight is 1 or 0);

    - frequency: the word weight is given by the word's frequency
      in a document;

    - tf-idf (term frequency, inverse document frequency):
      the frequency of a word in a document (how strongly the word
      represents the document) is weighted against the number of
      documents containing that word (the more documents contain
      the word, the lower the word’s discrimination value).
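
    The three weighting schemes can be compared in a short sketch.
    The tf-idf formula used here, tf × log(N / df), is one common
    variant and an assumption of this example; the slides do not
    fix a particular formula. The function name and the toy
    documents are likewise illustrative.

```python
import math
from collections import Counter

def weights(documents):
    """Return binary, frequency, and tf-idf vectors for tokenized documents."""
    vocab = sorted({w for doc in documents for w in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for doc in documents for w in set(doc))
    result = []
    for doc in documents:
        tf = Counter(doc)
        binary = [1 if tf[w] else 0 for w in vocab]
        frequency = [tf[w] for w in vocab]
        tfidf = [tf[w] * math.log(n_docs / df[w]) for w in vocab]
        result.append((binary, frequency, tfidf))
    return vocab, result

docs = [["cold", "cold", "weather"], ["nice", "weather"], ["cold"]]
vocab, vectors = weights(docs)
```

    With this variant, a word that occurs in every document gets
    the weight 0, since log(N / N) = 0, which matches the idea
    that such a word has no discrimination value.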




●   Methods of searching, similarity
    The general task is to find the similarity between an unlabeled
    document and a labeled one. It can be used, for example, for
    classification (interesting/uninteresting, and so on).

    Unsupervised learning (clustering): learning without a teacher.

    Supervised learning: learning with a teacher.

    Semi-supervised learning: a small number of labeled samples
    can significantly improve clustering.






●   Methods of searching, similarity
    Supervised learning:

    - k-NN (k-nearest neighbors);
    - generation of decision trees;
    - disjunctive normal form (generating rules);
    - support vector machines;
    - naïve Bayes classifier (using conditional probability);
    - etc. (there are really many possibilities).
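
    As an illustration of the first item, a k-NN classifier over
    bag-of-words vectors might look like the sketch below. Cosine
    similarity is an assumed choice of similarity measure, and the
    function names, labels, and toy documents are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, labeled, k=3):
    """Label a document by majority vote among its k most similar labeled documents."""
    query = Counter(text.lower().split())
    ranked = sorted(
        labeled,
        key=lambda item: cosine(query, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

labeled = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("monday project meeting notes", "ham"),
]
print(knn_classify("buy cheap pills", labeled, k=3))  # spam
```

    Unlike the other listed methods, k-NN has no separate training
    step: the labeled documents themselves serve as the model.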




    Training texts:

        w1        w2        w3        cj
        je        pěkné     počasí    +
        je        chladno             -
        není      velmi     chladno   +
        není      pěkné               -
        velmi     chladno             -
        chladno                       -
        ...       ...       ...       ...

    (je: is; není: is not; pěkné: nice; počasí: weather;
    chladno: cold; velmi: very)

    + texts: 6 words in total
    - texts: 7 words in total
    Number of unique words: 6

    The document to be classified: “to není pěkné chladno”. Is it + or –?
    (The word “to” is not in the dictionary and is ignored.)
    After creating the dictionary from the unique words (here 6),
    computing the a priori probabilities (2 + texts and 4 – texts
    among the 6 texts), computing the a posteriori (conditional)
    probabilities of the words in + and –, and the subsequent
    normalization, we can determine the result:
                             w1        w2     w3     w4      w5       w6
    sorted dictionary:       chladno   je     není   pěkné   počasí   velmi
    frequency of wi in +     1         1      1      1       1        1
    frequency of wi in -     3         1      1      1       0        1
    p(wi | +)                1/6       1/6    1/6    1/6     1/6      1/6
    p(wi | -)                3/7       1/7    1/7    1/7     0/7      1/7

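    \[ p_{NBK} = p(\text{'není'}, \text{'pěkné'}, \text{'chladno'} \mid +/-)
       = p(\text{'není'} \mid +/-) \times p(\text{'pěkné'} \mid +/-) \times p(\text{'chladno'} \mid +/-) \]

    “w3 w4 w1” = “není pěkné chladno”

    \[ P_{+} = p(+)\, p(w_3 = \text{'není'} \mid +)\, p(w_4 = \text{'pěkné'} \mid +)\, p(w_1 = \text{'chladno'} \mid +)
       = \frac{2}{6} \times \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} \approx 0.00154 \]

    \[ P_{-} = p(-)\, p(w_3 = \text{'není'} \mid -)\, p(w_4 = \text{'pěkné'} \mid -)\, p(w_1 = \text{'chladno'} \mid -)
       = \frac{4}{6} \times \frac{1}{7} \times \frac{1}{7} \times \frac{3}{7} \approx 0.00583 \]

    \[ P_{n}^{+} = \frac{0.00154}{0.00154 + 0.00583} \approx 0.21 \]

    \[ P_{n}^{-} = \frac{0.00583}{0.00154 + 0.00583} \approx 0.79 \]

    \[ P_{n}^{-} > P_{n}^{+} \Rightarrow \text{negative} \]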
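
    The calculation above can be reproduced with the following
    sketch. It uses exact fractions and, like the slide, applies no
    smoothing, so a word never seen in a class (such as “počasí”
    in –) has probability 0 there. The function names are
    assumptions made for this illustration.

```python
from collections import Counter
from fractions import Fraction

# Training texts and labels taken from the table above.
training = [
    ("je pěkné počasí", "+"),
    ("je chladno", "-"),
    ("není velmi chladno", "+"),
    ("není pěkné", "-"),
    ("velmi chladno", "-"),
    ("chladno", "-"),
]

def train(examples):
    """Count texts per class and word occurrences per class."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    return class_counts, word_counts

def classify(text, class_counts, word_counts):
    """Return normalized class scores; words outside the dictionary are skipped."""
    vocabulary = set().union(*word_counts.values())
    n_texts = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = Fraction(n, n_texts)  # a priori probability of the class
        for word in text.split():
            if word in vocabulary:
                score *= Fraction(word_counts[label][word], total_words)
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}

result = classify("to není pěkné chladno", *train(training))
print({label: round(float(p), 2) for label, p in result.items()})
# {'+': 0.21, '-': 0.79}
```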


●   Application areas
    Many applications exist in the various areas where massive
    amounts of electronic text data are available. Typical examples
    are browsing the Internet or filtering email spam.
    Contemporary application areas include, for example:

      - grouping similar blog posts;
      - determining subjectivity in text;
      - detecting opinions/feelings/moods/attitudes/meanings in text;
      - detecting plagiarism in text;
      - analyzing opinions;
      - business intelligence (legal commercial “espionage”);
    and so on.

                                    END
               Natural Language Processing by Advanced
                     Artificial Intelligence Methods

Más contenido relacionado

La actualidad más candente

Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficultiesijtsrd
 
A Survey on Word Sense Disambiguation
A Survey on Word Sense DisambiguationA Survey on Word Sense Disambiguation
A Survey on Word Sense DisambiguationIOSR Journals
 
To Infinite Possibilities and Beyond...
To Infinite Possibilities and Beyond...To Infinite Possibilities and Beyond...
To Infinite Possibilities and Beyond...Valeria de Paiva
 
About the authors
About the authorsAbout the authors
About the authorsbutest
 
Simulation of Language Acquisition Walter Daelemans
Simulation of Language Acquisition Walter DaelemansSimulation of Language Acquisition Walter Daelemans
Simulation of Language Acquisition Walter Daelemansbutest
 
Vl3.cultureplex presentation
Vl3.cultureplex presentationVl3.cultureplex presentation
Vl3.cultureplex presentationCameliaN
 
Vl3.lab presentation
Vl3.lab presentationVl3.lab presentation
Vl3.lab presentationCameliaN
 
Digital logic introduction using fpg as resume
Digital logic introduction using fpg as resumeDigital logic introduction using fpg as resume
Digital logic introduction using fpg as resumeRochmatDiantoro
 
Text-Analysis-Orange.pdf
Text-Analysis-Orange.pdfText-Analysis-Orange.pdf
Text-Analysis-Orange.pdfAkuhuruf
 
2.ganiyu rafiu adesina 14 21
2.ganiyu rafiu adesina 14 212.ganiyu rafiu adesina 14 21
2.ganiyu rafiu adesina 14 21Alexander Decker
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2Yuriy Guts
 
CV_Egorova_Ekaterina
CV_Egorova_EkaterinaCV_Egorova_Ekaterina
CV_Egorova_EkaterinaArenel
 
download
downloaddownload
downloadbutest
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)Sumit Raj
 
Lec 15,16,17 NLP.machine translation
Lec 15,16,17  NLP.machine translationLec 15,16,17  NLP.machine translation
Lec 15,16,17 NLP.machine translationguest873a50
 

La actualidad más candente (18)

Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficulties
 
A Survey on Word Sense Disambiguation
A Survey on Word Sense DisambiguationA Survey on Word Sense Disambiguation
A Survey on Word Sense Disambiguation
 
To Infinite Possibilities and Beyond...
To Infinite Possibilities and Beyond...To Infinite Possibilities and Beyond...
To Infinite Possibilities and Beyond...
 
About the authors
About the authorsAbout the authors
About the authors
 
10.1.1.35.8376
10.1.1.35.837610.1.1.35.8376
10.1.1.35.8376
 
Simulation of Language Acquisition Walter Daelemans
Simulation of Language Acquisition Walter DaelemansSimulation of Language Acquisition Walter Daelemans
Simulation of Language Acquisition Walter Daelemans
 
Vl3.cultureplex presentation
Vl3.cultureplex presentationVl3.cultureplex presentation
Vl3.cultureplex presentation
 
Sintec
SintecSintec
Sintec
 
Vl3.lab presentation
Vl3.lab presentationVl3.lab presentation
Vl3.lab presentation
 
Digital logic introduction using fpg as resume
Digital logic introduction using fpg as resumeDigital logic introduction using fpg as resume
Digital logic introduction using fpg as resume
 
Text-Analysis-Orange.pdf
Text-Analysis-Orange.pdfText-Analysis-Orange.pdf
Text-Analysis-Orange.pdf
 
2.ganiyu rafiu adesina 14 21
2.ganiyu rafiu adesina 14 212.ganiyu rafiu adesina 14 21
2.ganiyu rafiu adesina 14 21
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2
 
Ijet journal
Ijet journalIjet journal
Ijet journal
 
CV_Egorova_Ekaterina
CV_Egorova_EkaterinaCV_Egorova_Ekaterina
CV_Egorova_Ekaterina
 
download
downloaddownload
download
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)
 
Lec 15,16,17 NLP.machine translation
Lec 15,16,17  NLP.machine translationLec 15,16,17  NLP.machine translation
Lec 15,16,17 NLP.machine translation
 

Similar a Text mining

Introduction to natural language processing, history and origin
Introduction to natural language processing, history and originIntroduction to natural language processing, history and origin
Introduction to natural language processing, history and originShubhankar Mohan
 
NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptOlusolaTop
 
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic AmbiguityEffective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic AmbiguityIDES Editor
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA DATASCIENCE
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003butest
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language ProcessingMichel Bruley
 
saito22research_talk_at_NUS
saito22research_talk_at_NUSsaito22research_talk_at_NUS
saito22research_talk_at_NUSYuki Saito
 
Jawaharlal Nehru Technological University Natural Language Processing Capston...
Jawaharlal Nehru Technological University Natural Language Processing Capston...Jawaharlal Nehru Technological University Natural Language Processing Capston...
Jawaharlal Nehru Technological University Natural Language Processing Capston...write4
 
Jawaharlal Nehru Technological University Natural Language Processing Capston...
Jawaharlal Nehru Technological University Natural Language Processing Capston...Jawaharlal Nehru Technological University Natural Language Processing Capston...
Jawaharlal Nehru Technological University Natural Language Processing Capston...write5
 
Cross Model.pptx
Cross Model.pptxCross Model.pptx
Cross Model.pptxKomal526846
 
IRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP TechniquesIRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP TechniquesIRJET Journal
 
Vl3.culture plex presentation
Vl3.culture plex presentationVl3.culture plex presentation
Vl3.culture plex presentationCameliaN
 
Vl3.culture plex presentation
Vl3.culture plex presentationVl3.culture plex presentation
Vl3.culture plex presentationCameliaN
 
An Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingAn Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingScott Faria
 

Similar a Text mining (20)

Esa act
Esa actEsa act
Esa act
 
Introduction to natural language processing, history and origin
Introduction to natural language processing, history and originIntroduction to natural language processing, history and origin
Introduction to natural language processing, history and origin
 
ppt
pptppt
ppt
 
Soft computing01
Soft computing01Soft computing01
Soft computing01
 
NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.ppt
 
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic AmbiguityEffective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
 
Nltk
NltkNltk
Nltk
 
CV
CVCV
CV
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
 
saito22research_talk_at_NUS
saito22research_talk_at_NUSsaito22research_talk_at_NUS
saito22research_talk_at_NUS
 
Jawaharlal Nehru Technological University Natural Language Processing Capston...
Jawaharlal Nehru Technological University Natural Language Processing Capston...Jawaharlal Nehru Technological University Natural Language Processing Capston...
Jawaharlal Nehru Technological University Natural Language Processing Capston...
 
Jawaharlal Nehru Technological University Natural Language Processing Capston...
Jawaharlal Nehru Technological University Natural Language Processing Capston...Jawaharlal Nehru Technological University Natural Language Processing Capston...
Jawaharlal Nehru Technological University Natural Language Processing Capston...
 
Cross Model.pptx
Cross Model.pptxCross Model.pptx
Cross Model.pptx
 
BEA12_sakaguchi
BEA12_sakaguchiBEA12_sakaguchi
BEA12_sakaguchi
 
IRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP TechniquesIRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP Techniques
 
Vl3.culture plex presentation
Vl3.culture plex presentationVl3.culture plex presentation
Vl3.culture plex presentation
 
Vl3.culture plex presentation
Vl3.culture plex presentationVl3.culture plex presentation
Vl3.culture plex presentation
 
An Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingAn Overview Of Natural Language Processing
An Overview Of Natural Language Processing
 

Más de Natalia Ostapuk

Más de Natalia Ostapuk (20)

Gromov
GromovGromov
Gromov
 
Aist academic writing
Aist academic writingAist academic writing
Aist academic writing
 
Aist academic writing
Aist academic writingAist academic writing
Aist academic writing
 
Ponomareva
PonomarevaPonomareva
Ponomareva
 
Nlp seminar.kolomiyets.dec.2013
Nlp seminar.kolomiyets.dec.2013Nlp seminar.kolomiyets.dec.2013
Nlp seminar.kolomiyets.dec.2013
 
Tomita одесса
Tomita одессаTomita одесса
Tomita одесса
 
Mt engine on nlp semniar
Mt engine on nlp semniarMt engine on nlp semniar
Mt engine on nlp semniar
 
Tomita 4марта
Tomita 4мартаTomita 4марта
Tomita 4марта
 
Konyushkova
KonyushkovaKonyushkova
Konyushkova
 
Braslavsky 13.12.12
Braslavsky 13.12.12Braslavsky 13.12.12
Braslavsky 13.12.12
 
Клышинский 8.12
Клышинский 8.12Клышинский 8.12
Клышинский 8.12
 
Zizka synasc 2012
Zizka synasc 2012Zizka synasc 2012
Zizka synasc 2012
 
Zizka immm 2012
Zizka immm 2012Zizka immm 2012
Zizka immm 2012
 
Zizka aimsa 2012
Zizka aimsa 2012Zizka aimsa 2012
Zizka aimsa 2012
 
Analysis by-variants
Analysis by-variantsAnalysis by-variants
Analysis by-variants
 
место онтологий в современной инженерии на примере Iso 15926 v1
место онтологий в современной инженерии на примере Iso 15926 v1место онтологий в современной инженерии на примере Iso 15926 v1
место онтологий в современной инженерии на примере Iso 15926 v1
 
Additional2
Additional2Additional2
Additional2
 
Additional1
Additional1Additional1
Additional1
 
Seminar1
Seminar1Seminar1
Seminar1
 
2011 04 troussov_graph_basedmethods-weakknowledge
2011 04 troussov_graph_basedmethods-weakknowledge2011 04 troussov_graph_basedmethods-weakknowledge
2011 04 troussov_graph_basedmethods-weakknowledge
 

Último

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Text mining

  • 1. Jan Žižka, Ústav informatiky, PEF, Mendelova universita, Brno Natural Language Processing by Advanced Artificial Intelligence Methods Jan Žižka Department of Informatics Faculty of Business and Economics Mendel University in Brno, Czech Republic zizka.jan@gmail.com, zizka@mendelu.cz (Text Mining)
  • 2. ● Data, information, knowledge ● Electronic text data ● Inductive machine learning (ML) ● Pre-processing of data and its representation ● Methods of searching, similarity, pattern recognition ● Algorithms (just some examples) ● Application areas
  • 3. ● Data, information, knowledge - data here means all (text) values obtained in some way (relevant, irrelevant, with or without noise, exact and inexact, approximate, and so on) - information is the part of the data that is interesting from the viewpoint of the specific problem being solved - knowledge is generalized information - metaknowledge is “knowledge about knowledge” (for example, knowing which knowledge is applicable to a specific problem)
  • 4. ● Electronic text data Text in an electronic form (ASCII/ANSI, Unicode, etc.). Typical text data can be found, e.g., on the Internet. Electronic text is used in many areas. Electronic text data are created in any common natural language (not only in the prevailing English). Processing such “human-like” data by machines is extraordinarily complicated and often depends on the specific language.
  • 5. ● Inductive machine learning - learning from a limited set of examples; - the examples generally cover only a part of reality; - values needed to describe the data sufficiently (for example, their distribution) are missing; - a mathematical model for reliable prediction or classification cannot be created; - knowledge is obtained by generalizing information.
  • 6. ● Inductive machine learning What is the color of a crow? Black? And why? Has any of you seen a crow that was not black? Has anyone seen every crow that has ever existed anywhere on Earth? (Surely not.) To what degree is the generalization “a crow is black” correct and acceptable? Can you say?
  • 7. ● Inductive machine learning How many specific crows do we need to see to generalize that “a crow is black”?
  • 8. ● Inductive machine learning The hooded crow (pictured on the original slide) is grey and black, not entirely black.
  • 9. ● Inductive machine learning Generalizing from the specific available examples is one of the possible learning methods. Unlike human beings, machines (computers) usually need a significantly larger number of specific examples to generalize and thus to obtain knowledge. A method for determining the degree of similarity plays a big role, for example, when assigning an unknown example to a certain group of known samples.
  • 10. ● Inductive machine learning Machine learning algorithms set their relevant parameters automatically during the training phase. The quality of the training is verified during testing. If the testing results are acceptable, the trained algorithm can be used for the given application. The training phase requires suitable learning examples because the algorithm’s properties (parameters) are ultimately determined by the training data. The testing phase uses examples that were not used by the algorithm during its training phase.
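The separation of training and testing examples described on this slide can be sketched as follows. This is only an illustrative sketch, not code from the lecture: the function names `train_test_split` and `accuracy`, the 25 % test fraction, and the fixed random seed are all assumptions.

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into disjoint train/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: repeatable split
    n_test = max(1, int(len(shuffled) * test_fraction))
    # Test examples are held out and never shown to the learner during training.
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, test_set):
    """Fraction of held-out (text, label) pairs that `predict` labels correctly."""
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set)


examples = [(f"document {i}", "+" if i % 2 else "-") for i in range(8)]
train_set, test_set = train_test_split(examples)
```

Any classifier trained only on `train_set` can then be passed to `accuracy` as a function mapping a text to a label.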
  • 11. ● Pre-processing of data, their representation The typical way to get knowledge from electronic unstructured texts consists of the following steps: - source → a necessary volume of (generally noisy) data - removing noise → clean data - the part of the data that is interesting from the application viewpoint → information - information generalization → knowledge
  • 12. ● Pre-processing of data, their representation Representing text documents: bag of words (BOW). Machine learning methods mostly treat text documents as files containing symbolic values (terms, words) without analyzing their meaning (at most, only shallowly) or their mutual dependence. The word order in a document is therefore considered “meaningless”, which naturally discards some of the information content. However, it significantly simplifies the processing of natural languages, for example, from the classification point of view.
  • 13. ● Pre-processing of data, their representation Pre-processing significantly affects the quality of the results: - excluding common words that have no specific meaning from the application viewpoint (prepositions, abbreviations, definite/indefinite articles, etc.); - excluding words with a very low or very high frequency across all processed documents; - excluding punctuation, spaces, and so on; - converting alphabetic characters to lower-case letters; - eliminating insignificant characters and words reduces the problem dimensionality (e.g., from 10^4 to 10^3) because each unique word is one dimension.
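The pre-processing steps listed on this slide can be sketched in a few lines. The stop-word set, the function names, and the use of document frequency (the number of documents containing a word) as the frequency criterion are illustrative assumptions, not part of the original lecture.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}


def tokenize(text):
    """Lower-case the text and keep only runs of letters (drops punctuation, spaces, digits)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_dictionary(documents, stop_words=STOP_WORDS, min_df=1, max_df=None):
    """Return the sorted list of unique words that survive the filtering.

    min_df / max_df bound the number of documents a word may occur in,
    which removes very rare and very frequent words.
    """
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    if max_df is None:
        max_df = len(documents)
    return sorted(
        word for word, df in doc_freq.items()
        if word not in stop_words and min_df <= df <= max_df
    )


docs = ["The weather is nice.", "The weather is cold!", "Nice, nice day"]
dictionary = build_dictionary(docs)
frequent_only = build_dictionary(docs, min_df=2)
```

Each word remaining in `dictionary` is one dimension of the document representation, so every filter that shortens the list reduces the dimensionality.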
  • 14. ● Pre-processing of data, their representation An example of text representation in which we ignore punctuation, spatial zoning (new lines, paragraphs, chapters, etc.), upper- and lower-case letters, the mixing of two languages (English terms in a Czech sentence), and word order, although word order can be very significant (for example, machine learning versus learning machine), and in which we exclude general words (“stop words”). We get a dictionary (a list of symbols) that is used to train a chosen algorithm:
  • 15. ● Pre-processing of data, their representation The Czech example text: “Příklad representace textu, kde se ignoruje interpunkce, členění textu do řádků, velká a malá písmena, dvojjazyčnost (anglické termíny v české větě), pořadí slov, které může mít velký význam (např. machine learning – strojové učení a learning machine – učící stroj má zcela odlišný význam), a vynechají se obecná slova.” (In English: an example of text representation in which punctuation, the division of text into lines, upper- and lower-case letters, bilingualism (English terms in a Czech sentence), and word order, which can be very significant (e.g., machine learning and learning machine have completely different meanings), are ignored, and general words are omitted.) The resulting sorted dictionary: anglické české členění dvojjazyčnost ignoruje interpunkce learning machine má malá metody mít může obecná odlišný písmena pomocí pořadí příklad representace řádků slov stroj strojové termíny textu učení učící velká velký větě vynechají význam words zcela
  • 16. ● Pre-processing of data, their representation A further dimensionality reduction can be obtained, for example, by reducing words to their stems. In the previous example, we could reduce the generated dictionary (by unifying the infinitive, grammatical case, number, voice, and so on), so that a dimensionality of 8 decreases to 4: mít, má, stroj, strojové, učení, učící, velká, velký → mít, stroj, učit, velký. Stemming, of course, depends on the language. For English, there is a simplified procedure, Porter stemming, in which the machine simply cuts off word endings; this is far from perfect, but it is very effective in practice.
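The reduction from 8 to 4 dimensions in the Czech example can be illustrated with a simple lookup table. This is not a real stemmer and not the Porter algorithm: the hand-written `STEMS` table and the function name `stem_dictionary` are hypothetical, and the word-to-stem pairing is inferred from the order of the words on the slide.

```python
# Hypothetical hand-made mapping from word forms to stems (inferred from the slide).
STEMS = {
    "mít": "mít", "má": "mít",
    "stroj": "stroj", "strojové": "stroj",
    "učení": "učit", "učící": "učit",
    "velká": "velký", "velký": "velký",
}


def stem_dictionary(words, stems=STEMS):
    """Replace each word by its stem (unknown words stay as they are) and deduplicate."""
    return sorted({stems.get(word, word) for word in words})


original = ["mít", "má", "stroj", "strojové", "učení", "učící", "velká", "velký"]
reduced = stem_dictionary(original)
```

A practical system would replace the lookup table with a language-specific stemming or lemmatization procedure.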
  • 17. ● Pre-processing of data, their representation Word occurrence can be represented in several ways: - binary: 1/0 means that a word is/is not in a document (the word weight is 1 or 0); - frequency: the word weight is given by the word’s frequency in a document; - tf-idf (term frequency-inverse document frequency): the frequency of a word in a document (how well the word represents the document) related to the number of documents containing that word (the higher the number of documents with the word, the lower the word’s discrimination value).
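The three weighting schemes can be compared on a toy collection. The sketch below assumes one common tf-idf variant, tf × ln(N / df), where N is the number of documents and df the number of documents containing the word; the slide does not specify a formula, and other variants exist. The function name `word_weights` is an assumption.

```python
import math
from collections import Counter


def word_weights(documents):
    """documents: list of token lists. Returns (vocabulary, per-document weight vectors)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    vocabulary = sorted(doc_freq)
    rows = []
    for tokens in documents:
        tf = Counter(tokens)
        rows.append({
            "binary": [1 if tf[w] else 0 for w in vocabulary],
            "frequency": [tf[w] for w in vocabulary],
            # Assumed variant: term frequency times log of inverse document frequency.
            "tfidf": [tf[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary],
        })
    return vocabulary, rows


vocabulary, rows = word_weights([["cold", "cold", "day"], ["nice", "day"]])
```

In this toy collection the word "day" occurs in every document, so its tf-idf weight is zero: it does not help to discriminate between the documents.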
  • 18. ● Methods of searching, similarity The general task is to find the similarity between an unlabeled document and a labeled one. This can be used, for example, for classification: interesting/uninteresting, and so on. Unsupervised learning (clustering): learning without a teacher. Supervised learning: learning with a teacher. Semi-supervised learning: a small number of labeled samples significantly improves clustering.
  • 19. ● Methods of searching, similarity Supervised learning: - k-NN (k-nearest neighbors); - generation of decision trees; - disjunctive normal form (generating rules); - support vector machines; - naïve Bayes classifier (using conditional probability); - etc. (there are really many possibilities).
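As one example from this list, k-NN can be sketched on top of a bag-of-words representation. The choice of cosine similarity, whitespace tokenization, the toy English training texts, and the names `cosine` and `knn_classify` are all assumptions made for illustration; the slides do not prescribe a similarity measure.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two bag-of-words vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, labeled, k=3):
    """Label `text` by majority vote among its k most similar labeled documents."""
    query = Counter(text.lower().split())
    ranked = sorted(
        labeled,
        key=lambda item: cosine(query, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


labeled = [
    ("nice warm weather", "+"),
    ("nice sunny day", "+"),
    ("cold rainy weather", "-"),
    ("cold windy day", "-"),
]
prediction = knn_classify("very cold rainy", labeled, k=3)
```

The unknown document is assigned to the class that dominates among its nearest labeled neighbors, which is the similarity-based categorization mentioned earlier in the lecture.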
  • 20. Training texts (words w1, w2, w3 and class cj):

        w1        w2        w3        cj
        je        pěkné     počasí    +
        je        chladno             -
        není      velmi     chladno   +
        není      pěkné               -
        velmi     chladno             -
        chladno                       -
        . . .

    (je = is, není = is not, pěkné = nice, počasí = weather, chladno = cold, velmi = very.) + texts: 6 words in total; - texts: 7 words in total; the number of unique words: 6. A document to be classified: “to není pěkné chladno”: + or -?
  • 21. After creating the dictionary from the unique words (here 6), computing the a priori probabilities (2 texts + and 4 texts - out of 6 texts), computing the conditional probabilities of the words given + and -, and the subsequent normalization, we can determine the result:

                               w1        w2    w3     w4      w5       w6
        sorted dictionary:     chladno   je    není   pěkné   počasí   velmi
        frequency of wi in +   1         1     1      1       1        1
        frequency of wi in -   3         1     1      1       0        1
        p(wi | +)              1/6       1/6   1/6    1/6     1/6      1/6
        p(wi | -)              3/7       1/7   1/7    1/7     0/7      1/7

    $p_{NBK} = p(\text{není}, \text{pěkné}, \text{chladno} \mid \pm) = p(\text{není} \mid \pm) \times p(\text{pěkné} \mid \pm) \times p(\text{chladno} \mid \pm)$
  • 22. “w3 w4 w1” = “není pěkné chladno”

    $P^{+} = p(+)\, p(w_3 = \text{není} \mid +)\, p(w_4 = \text{pěkné} \mid +)\, p(w_1 = \text{chladno} \mid +) = \frac{2}{6} \times \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} \approx 0.00154$

    $P^{-} = p(-)\, p(w_3 = \text{není} \mid -)\, p(w_4 = \text{pěkné} \mid -)\, p(w_1 = \text{chladno} \mid -) = \frac{4}{6} \times \frac{1}{7} \times \frac{1}{7} \times \frac{3}{7} \approx 0.00583$

    $P_n^{+} = \frac{0.00154}{0.00154 + 0.00583} \approx 0.21, \qquad P_n^{-} = \frac{0.00583}{0.00154 + 0.00583} \approx 0.79$

    $P_n^{-} > P_n^{+} \Rightarrow$ negative
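The worked naïve Bayes example from slides 20 to 22 can be reproduced with a short script. The function names `train` and `classify` are assumptions; the training texts and the unsmoothed relative frequencies follow the slides. Words that are not in the dictionary (such as "to") are skipped, as in the slides.

```python
from collections import Counter
from fractions import Fraction

# Training texts and classes exactly as on slide 20.
training = [
    ("je pěkné počasí", "+"),
    ("je chladno", "-"),
    ("není velmi chladno", "+"),
    ("není pěkné", "-"),
    ("velmi chladno", "-"),
    ("chladno", "-"),
]


def train(examples):
    """Count texts per class and word occurrences per class."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    return class_counts, word_counts


def classify(text, class_counts, word_counts):
    """Return normalized class scores p(class) * product of p(word | class)."""
    n_texts = sum(class_counts.values())
    vocabulary = set().union(*word_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = Fraction(class_counts[label], n_texts)  # a priori probability
        for word in text.split():
            if word in vocabulary:  # words outside the dictionary are ignored
                score *= Fraction(counts[word], total_words)  # no smoothing, as on the slide
        scores[label] = score
    norm = sum(scores.values())
    return {label: float(score / norm) for label, score in scores.items()}


result = classify("to není pěkné chladno", *train(training))
print({label: round(p, 2) for label, p in result.items()})  # {'+': 0.21, '-': 0.79}
```

Because no smoothing is applied, a word with zero frequency in a class (here "počasí" in the - texts) would make the whole product for that class zero; additive (Laplace) smoothing is a common remedy, although it is not part of the slides.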
  • 23. ● Application areas Many applications exist in the various areas where massive amounts of electronic text data are available. Typical examples are browsing the Internet and filtering email spam. Contemporary application areas include, for example: - grouping similar blog submissions; - determining subjectivity in text; - opinions/feelings/moods/attitudes/meanings in text; - detecting plagiarism in text; - analyzing opinions; - business intelligence (legal commercial “espionage”); and so on. END