Identifying similar text documents

            André Santos
         andrefs@cpan.org




            November 2011
What we get




Duplicated versions




Candidate pairs




What this is really about




                    similarity



It’s all LIEs!


  Language Independent Element (LIE)
  Terms which are usually kept untouched during
  translation.

      Year references (e.g. “1977”)
      Proper names (e.g. “Sherlock Holmes”)
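
A minimal sketch of how LIEs of these two kinds might be pulled out of a text. The regular expressions and the function name `extract_lies` are assumptions made for illustration; the slides do not say how pairbooks detects LIEs.

```python
import re

# Assumption: a year reference is a standalone four-digit number from 1000 to 2099.
YEAR_RE = re.compile(r"\b(?:1\d{3}|20\d{2})\b")
# Assumption: a proper name is a run of two or more capitalized ASCII words.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_lies(text):
    """Return the set of candidate LIEs (years and multi-word names) found in text."""
    years = set(YEAR_RE.findall(text))
    names = set(NAME_RE.findall(text))
    return years | names


english = "In 1977 a new Sherlock Holmes story was found."
portuguese = "Em 1977 foi encontrada uma nova história de Sherlock Holmes."

# Both sentences yield the same LIEs even though the languages differ.
print(extract_lies(english))
print(extract_lies(portuguese))
```

The ASCII-only name pattern is a deliberate simplification: names with accented letters or single-word names would be missed.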




Measuring similarity




  \[
    \mathrm{similarity}(A, B) =
      \frac{|A_{\mathrm{LIEs}} \cap B_{\mathrm{LIEs}}|}{|A_{\mathrm{LIEs}} \cup B_{\mathrm{LIEs}}|}
  \]
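
The ratio above (the size of the intersection of the two LIE sets over the size of their union) can be sketched in a few lines. The function name and the handling of two empty sets are assumptions, not taken from pairbooks.

```python
def similarity(lies_a, lies_b):
    """Ratio of shared LIEs to all LIEs seen in either document."""
    a, b = set(lies_a), set(lies_b)
    union = a | b
    if not union:
        # Assumption: two documents without any LIEs are scored 0.0.
        return 0.0
    return len(a & b) / len(union)


doc_a = {"1977", "Sherlock Holmes", "Baker Street"}
doc_b = {"1977", "Sherlock Holmes", "1891"}

# 2 shared elements out of 4 distinct elements.
print(similarity(doc_a, doc_b))  # 0.5
```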




pairbooks

  Similarity values
     < 0.2   Documents are not related
     > 0.4   Documents are candidate pairs
     > 0.9   Documents are near duplicates
       1.0   Documents are duplicates

  Languages
  High similarity, same language: (Near) duplicates
  High similarity, different language: Candidate pairs
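
One possible way to combine the thresholds with the language rule, as a sketch only. The slide leaves the range from 0.2 to 0.4 unlabeled and does not say how the two rules interact, so the "undetermined" label and the order of the checks below are assumptions.

```python
def classify(score, same_language):
    """Label a document pair from its similarity score and language match."""
    if score < 0.2:
        return "not related"
    if score <= 0.4:
        # Assumption: the slide gives no label for scores from 0.2 to 0.4.
        return "undetermined"
    if same_language and score >= 1.0:
        return "duplicates"
    if same_language and score > 0.9:
        return "near duplicates"
    # Assumption: everything else above 0.4 is reported as a candidate pair.
    return "candidate pair"


print(classify(0.95, same_language=True))   # near duplicates
print(classify(0.95, same_language=False))  # candidate pair
```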

Behold, pairbooks!

  ~ $ pairbooks PT_list.txt ES_list.txt
  PTBR__Umberto_EcoO_nome_da_rosa.txt
    (0.227) [6954,7382]   ES__Umberto_EcoEl_Nombre_de_la_Rosa(...)
    (0.018) [6954,11408]  ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
    (0.018) [6954,5604]   ES__Umberto_EcoDiario_Minimo__2.txt(...)

  PTBR__Umberto_EcoO_Pendulo_de_Focault.txt
    (0.391) [11276,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
    (0.042) [11276,6024]  ES__Umberto_EcoLa_busqueda_de_la_Le(...)
    (0.035) [11276,5604]  ES__Umberto_EcoDiario_Minimo__2.txt
  (...)
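
A ranking of this shape (each document from the first list followed by its best matches in the second) could be produced roughly as follows. This is a sketch with toy data, not the pairbooks implementation: the file names and LIE sets are hypothetical, and the bracketed numbers of the real output are left out because the slides do not explain them.

```python
def jaccard(a, b):
    """Shared elements divided by all elements; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(source_docs, target_docs, top=3):
    """For each source document, list the `top` most similar target documents."""
    ranking = {}
    for src_name, src_lies in source_docs.items():
        scored = [
            (jaccard(src_lies, tgt_lies), tgt_name)
            for tgt_name, tgt_lies in target_docs.items()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        ranking[src_name] = scored[:top]
    return ranking


# Hypothetical LIE sets, keyed by hypothetical file names.
pt_docs = {"pt_a.txt": {"1327", "1980", "Umberto Eco"}}
es_docs = {
    "es_a.txt": {"1327", "1980", "1985", "Umberto Eco"},
    "es_b.txt": {"1988", "Umberto Eco"},
}

ranking = rank_candidates(pt_docs, es_docs)
for src_name, matches in ranking.items():
    print(src_name)
    for score, tgt_name in matches:
        print(f"  ({score:.3f}) {tgt_name}")
```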




Perfect LIEs do not exist
  Year references
      Can be confused with page numbers
      Headers/footers can contain them
      (publishing year, copyright, ...)
  Proper names
      Sometimes are translated (e.g. “São Tomé”, “Judas Tomé”, etc.)
      Some languages use different scripts (e.g. Russian)
      Some languages have declensions
  ...
How to improve LIEs (future work)



     accept a list of equivalent words
     accept a list of stop words
     ...
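
Since both items are listed as future work, the following is only a guess at how such lists might be applied before comparing LIE sets. The function name `normalize_lies`, its parameters, and the sample entries are all hypothetical.

```python
def normalize_lies(lies, equivalents=None, stop_words=None):
    """Map each LIE to a canonical form, then drop LIEs that are stop words."""
    equivalents = equivalents or {}
    stop_words = stop_words or set()
    normalized = set()
    for lie in lies:
        canonical = equivalents.get(lie, lie)  # unchanged if no equivalent is listed
        if canonical not in stop_words:
            normalized.add(canonical)
    return normalized


# Hypothetical lists: a translated place name and a publishing year to ignore.
equivalents = {"Londres": "London"}
stop_words = {"2011"}

lies_pt = {"Londres", "1977", "2011"}
print(normalize_lies(lies_pt, equivalents, stop_words))
```

Applying the same normalization to both documents before computing the similarity would let translated names count as shared elements and keep header or footer years from inflating the score.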




Give me one of those!


  CPAN
   http://search.cpan.org/perldoc?pairbooks

     Developer version
     Requires Linux and Perl
     Incomplete documentation




