SINMIN
CORPUS FOR SINHALA LANGUAGE
Literature Review
Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.
Supervisors :
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. De Silva
Sinmin is a corpus for the Sinhala language that is
➢Continuously updated
➢Dynamic (scalable)
➢Broad in coverage of the language (structured and
unstructured)
OUTLINE
● Literature Review
● Introduction to corpus linguistics and What is a Corpus
● Usages of a corpus
● Existing Corpus Implementations
● Identifying Sinhala Sources and Crawling
● Data Storage and Information Retrieval from Corpus
● Information Visualization
● Extracting Linguistic Features
● Current Progress
INTRODUCTION TO CORPUS LINGUISTICS
AND WHAT IS A CORPUS
Handford, M. and McCarthy, M. J. (2004) “Invisible to us”: A preliminary corpus-based study of spoken
Business English, Discourse in the Professions: Perspectives from Corpus Linguistics, 167-201
WHAT IS A CORPUS?
“A corpus is a principled collection of authentic texts
stored electronically that can be used to discover
information about language that may not have been
noticed through intuition alone.” - Bennett (2010)
Bennett, G. R. (2010) Using Corpora in the Language Learning Classroom, Michigan ELT.
● There are eight main kinds of corpora.
● They are generalized corpora, specialized corpora,
learner corpora, pedagogic corpora, historical
corpora, parallel corpora, comparable corpora,
and monitor corpora.
● The broadest type of corpus is the generalized
corpus.
“Sinmin” will
● be a generalized corpus.
● cover all types of Sinhala language.
USAGES OF A CORPUS
● Implementing translators, spell checkers and grammar
checkers.
● Identifying lexical and grammatical features of a language.
● Identifying how language varies with context of usage and
over time.
● Retrieving statistical details of a language.
● Providing backend support for tools like OCR, POS Tagger,
etc.
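As a rough illustration of "retrieving statistical details of a language", the sketch below counts word frequencies over a small sample. The function name and the two sample sentences are assumptions made for illustration; a real corpus would stream the crawled texts and would likely need a proper Sinhala tokenizer rather than whitespace splitting.

```python
from collections import Counter


def word_frequencies(texts):
    """Count whitespace-separated tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


# Hypothetical two-sentence sample, not taken from the corpus itself.
sample = ["මම ගෙදර යමි", "මම පාසල් යමි"]
freq = word_frequencies(sample)
for word, count in freq.most_common():
    print(word, count)
```

The same counts could feed a spell checker's word list or a frequency table in the web interface.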
EXISTING CORPUS IMPLEMENTATIONS
● There is an existing corpus for the Sinhala language,
known as the UCSC Text Corpus of
Contemporary Sinhala.
● It consists of about 10 million words, but it covers
only a small part of the language and is no longer updated.
CORPUS FOR SINHALA LANGUAGE?
COMPOSITION OF THE CORPUS
● Language comprising the corpus cannot be random
but chosen according to specific characteristics.
● It must use authentic texts. The language it contains
is not made up for the sole purpose of creating the
corpus
EXAMPLE - COMPOSITION OF COCA
● COCA contains more than 385 million words
from 1990–2008 (roughly 20 million words each year).
● Texts are evenly divided between 5 genres, spoken
(20%), fiction (20%), popular magazines (20%),
newspapers (20%) and academic journals (20%).
COMPOSITION OF UCSC TEXT CORPUS OF
CONTEMPORARY SINHALA
DATA STORAGE AND INFORMATION
RETRIEVAL FROM CORPUS
Existing corpora use two main technologies for data
storage:
● Relational Databases
● Indexed file Systems
INDEXED FILE SYSTEMS AS STORAGE
● BNC uses this mechanism.
● Data is stored as XML-like files that follow a
scheme known as the Corpus Document Interchange
Format (CDIF).
● This makes it possible to store a great deal of detail about the
structure of each text, such as its division into
sections or chapters, paragraphs, verse lines, etc.
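To make the idea of structural markup concrete, here is a minimal sketch that reads an XML-like document and counts the paragraphs in each division. The element and attribute names (`text`, `div`, `p`, `n`) are assumptions chosen for illustration and are not claimed to match the actual BNC schema.

```python
import xml.etree.ElementTree as ET

# Illustrative markup only; names are assumptions, not the BNC schema.
DOC = """
<text id="T1">
  <div type="chapter" n="1">
    <p>First paragraph of chapter one.</p>
    <p>Second paragraph of chapter one.</p>
  </div>
  <div type="chapter" n="2">
    <p>Only paragraph of chapter two.</p>
  </div>
</text>
"""


def paragraphs_per_division(xml_string):
    """Map each division number to the count of its direct <p> children."""
    root = ET.fromstring(xml_string.strip())
    return {div.get("n"): len(div.findall("p")) for div in root.iter("div")}


counts = paragraphs_per_division(DOC)
print(counts)
```

Because the structure lives in the files themselves, queries like this need no database, but every new annotation layer means rewriting the files.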
RELATIONAL DATABASE AS STORAGE
● COCA, Corpus del Español use relational databases.
DATA MODEL IN COCA
CORPUS DEL ESPAÑOL USES SEPARATE
TABLES FOR BIGRAMS AND TRIGRAMS
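A separate n-gram table can be sketched as follows. This is a minimal example using an in-memory SQLite database as a stand-in; the table name, column names, and sample tokens are assumptions, and the real Corpus del Español schema may differ.

```python
import sqlite3
from collections import Counter


def build_bigram_table(tokens):
    """Create an in-memory table with one row per distinct bigram."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bigram ("
        "w1 TEXT, w2 TEXT, freq INTEGER, PRIMARY KEY (w1, w2))"
    )
    counts = Counter(zip(tokens, tokens[1:]))
    conn.executemany(
        "INSERT INTO bigram (w1, w2, freq) VALUES (?, ?, ?)",
        [(w1, w2, n) for (w1, w2), n in counts.items()],
    )
    conn.commit()
    return conn


# Hypothetical token stream for illustration.
tokens = "the cat sat on the mat and the cat slept".split()
conn = build_bigram_table(tokens)
rows = conn.execute(
    "SELECT w2, freq FROM bigram WHERE w1 = ? ORDER BY freq DESC, w2",
    ("the",),
).fetchall()
print(rows)
```

Precomputing bigram and trigram rows trades storage for speed: a collocation lookup becomes a single indexed query instead of a scan over the running text.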
RELATIONAL DB VS INDEXED FILE
SYSTEMS
● Indexed file systems make extensive use of indexes.
● Relational database models are relatively fast.
● In indexed file systems, it is difficult to add additional
layers of annotation.
No study has been done on
how NoSQL performs in
implementing Corpora.
INFORMATION VISUALIZATION
Most popular corpora, such as BNC, COCA, Corpus
del Español and the Google Books corpus, use a similar kind of
web interface.
USER INTERFACE
OF COCA
GOOGLE BOOKS NGRAM VIEWER UI
EXTRACTING LINGUISTIC FEATURES
● A main usage of a language corpus is extracting
linguistic features of a language.
● Linguistic features of many languages have been
identified using corpora.
● Example - A corpus-based linguistics analysis on
written corpus: colligation of “TO” and “FOR.”
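A colligation analysis looks at which grammatical categories a word tends to occur with. The sketch below counts the part-of-speech tag that follows each target word. It is an assumption-laden illustration, not the method of the cited study: the tag set and the hand-tagged sample sentence are hypothetical.

```python
from collections import Counter, defaultdict


def following_tag_counts(tagged_tokens, targets):
    """For each target word, count the POS tags of the token that follows it."""
    counts = defaultdict(Counter)
    for (word, _), (_, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        key = word.lower()
        if key in targets:
            counts[key][next_tag] += 1
    return counts


# Hypothetical hand-tagged sentence: "I want to go to school for lunch"
tagged = [
    ("I", "PRON"), ("want", "VERB"), ("to", "PART"), ("go", "VERB"),
    ("to", "ADP"), ("school", "NOUN"), ("for", "ADP"), ("lunch", "NOUN"),
]
result = following_tag_counts(tagged, {"to", "for"})
for word, tags in result.items():
    print(word, dict(tags))
```

On a real corpus the tagged tokens would come from a POS tagger, and the counts would be normalized before comparing the two words.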
CURRENT PROGRESS
IDENTIFIED SINHALA RESOURCES
● Online Newspapers
● News Websites
● School Textbooks
● Sinhala Wikipedia
● Online Mahawansaya
● Subtitles
● Sinhala Fiction
● Sinhala Blogs
● Sinhala Magazines
● Gazette
DIVIDED INTO 5 MAIN GENRES
Genres: News, Academic, Creative Writing, Spoken, Gazette
Sources: Newspapers, News Items, Textbooks, Religious, Wikipedia, Mahawansa,
Fiction, Blogs, Magazines, Subtitles, Gazette
Implemented crawlers for the different sources,
all adhering to the same output format.
https://github.com/madurangasiriwardena/corpus.sinhala.crawler
FINISHED CRAWLERS
CRAWLED DATA SAVED TO XML FILES WITH
THE FOLLOWING METADATA
● Post Name
● Author
● Link
● Published Date
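A crawler output record carrying these four metadata fields might look like the sketch below. The element names (`post`, `name`, `author`, `link`, `date`, `content`) and the sample values are assumptions for illustration; the actual format is defined in the crawler repository.

```python
import xml.etree.ElementTree as ET


def post_to_xml(name, author, link, published, content):
    """Serialize one crawled post, with its metadata, to an XML string."""
    post = ET.Element("post")
    ET.SubElement(post, "name").text = name
    ET.SubElement(post, "author").text = author
    ET.SubElement(post, "link").text = link
    ET.SubElement(post, "date").text = published
    ET.SubElement(post, "content").text = content
    return ET.tostring(post, encoding="unicode")


# Hypothetical values; the URL is a placeholder, not a crawled source.
xml_text = post_to_xml(
    name="Sample post",
    author="Unknown",
    link="http://example.org/post/1",
    published="2014-06-01",
    content="මම ගෙදර යමි",
)
print(xml_text)
```

Keeping every crawler on one record shape means the loader that feeds the database needs only a single parser.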
CRAWLER CONTROLLER
Crawler controller monitors and handles the status of
the web crawlers.
Crawler controller address -
http://Sinhala-corpus.projects.uom.lk:8080/CrawlerControllerWeb
We tested the performance of several database
systems to determine which one we should use
to store the data.
WE CONSIDERED THE FOLLOWING DATA
STORAGE SYSTEMS
We measured performance for inserting
data and for retrieving data for 12 different
information needs.
Data set and source code -
https://github.com/madurangasiriwardena/performance-test
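An insertion-time measurement of this kind can be sketched as below. SQLite is used purely as a stand-in so that the example runs anywhere; it is an assumption, not one of the systems the slides report on, and the table layout and generated tokens are hypothetical. The actual test code is in the repository linked above.

```python
import sqlite3
import time


def time_inserts(n_rows):
    """Insert n_rows tokens and return (elapsed_seconds, rows_stored)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE word (id INTEGER PRIMARY KEY, token TEXT)")
    start = time.perf_counter()
    conn.executemany(
        "INSERT INTO word (token) VALUES (?)",
        ((f"w{i}",) for i in range(n_rows)),
    )
    conn.commit()
    elapsed = time.perf_counter() - start
    stored = conn.execute("SELECT COUNT(*) FROM word").fetchone()[0]
    conn.close()
    return elapsed, stored


# Repeat at growing sizes to see how insertion time scales.
for size in (1_000, 10_000):
    elapsed, stored = time_inserts(size)
    print(f"{stored} rows in {elapsed:.4f} s")
```

Running the same function against each candidate system, with the same data set, gives comparable curves of insertion time against data size.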
DATA INSERTION TIME COMPARISON
INFORMATION RETRIEVAL PERFORMANCE
COMPARISON - PART 1
INFORMATION RETRIEVAL PERFORMANCE
COMPARISON - PART 2
Cassandra performed better than the others in
most of the scenarios, and its insertion
time increased linearly with data size.
We therefore chose it for implementing the
corpus.
USER INTERFACE DESIGN AND
IMPLEMENTATION
● The web interface of Sinmin is designed for users
who prefer a visualized and summarized view
of Sinmin's statistical data.
● The visual design lets any user, even one with
no prior experience of the
interface, fulfill their information
requirements with little effort.
http://sinhala-corpus.projects.uom.lk/sinmin-web/
CORPUS API DESIGN AND IMPLEMENTATION
• REST API to expose corpus services
• More complex and customizable data retrieval and
filtering
• Interface for third-party applications to consume
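A third-party client could build filtered requests along the lines of the sketch below. Everything specific here is hypothetical: the base URL is a placeholder, and the `wordFrequency` path and the `word`, `genre`, and `year` parameters are assumptions, since the slides do not document the actual endpoints.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder base URL; the real API address is not given in the slides.
BASE_URL = "http://example.org/corpus-api"


def frequency_query_url(word, genre=None, year=None):
    """Build a URL for a hypothetical word-frequency endpoint with filters."""
    params = {"word": word}
    if genre is not None:
        params["genre"] = genre
    if year is not None:
        params["year"] = str(year)
    return f"{BASE_URL}/wordFrequency?{urlencode(params)}"


url = frequency_query_url("මම", genre="NEWS", year=2014)
print(url)
# Decode the query string again to check the round trip.
print(parse_qs(urlsplit(url).query))
```

Optional filters map naturally onto query parameters, so one endpoint can serve both the web front end and external tools.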
PUBLICATIONS
● Comparison between performance of various
database systems for implementing a language
corpus - 11th Beyond Databases, Architectures and
Structures conference (Pending)
● Implementing a Corpus for Sinhala Language -
Symposium on Language Technology for South Asia
(Pending)
REMAINING WORK FOR THE NEXT PHASE
• Finish writing the crawlers
• Feed data into the Cassandra database
• Connect the front end to the API
Questions?
Thank you!

Editor's notes

  1. Uses include extracting linguistic features, testing hypotheses, and validating linguistic rules.
  2. Our user interfaces are designed for users who prefer a simple, visualized and summarized view of the data in Sinmin, while keeping all the required functionality.