This document provides an overview of the Sinmin corpus project for the Sinhala language. Sinmin aims to be a continuously updated, dynamic corpus that covers both structured and unstructured Sinhala language data. The document reviews corpus linguistics and existing Sinhala corpora. The project has identified Sinhala data sources, built crawlers to extract data, and evaluated several database systems for data storage. A user interface and an API have been designed. Future work includes completing the crawlers, loading data into Cassandra, and connecting the front end to the API. The goal is to build a large, freely available corpus to support NLP research and applications for Sinhala.
Sinmin Literature Review Presentation
1. SINMIN
CORPUS FOR SINHALA LANGUAGE
Literature Review
Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.
Supervisors :
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. De Silva
2. Sinmin is a corpus for the Sinhala language that is
➢ Continuously updated
➢ Dynamic (scalable)
➢ Broad in coverage (structured and unstructured language)
3. OUTLINE
● Literature Review
● Introduction to corpus linguistics and What is a Corpus
● Usages of a corpus
● Existing Corpus Implementations
● Identifying Sinhala Sources and Crawling
● Data Storage and Information Retrieval from Corpus
● Information Visualization
● Extracting Linguistic Features
● Current Progress
4. INTRODUCTION TO CORPUS LINGUISTICS
AND WHAT IS A CORPUS
Handford, M. and McCarthy, M. J. (2004) “Invisible to us” - A preliminary corpus-based study of spoken
business English, Discourse in the Professions: Perspectives from Corpus Linguistics, 167-201.
5. WHAT IS A CORPUS?
“A corpus is a principled collection of authentic texts
stored electronically that can be used to discover
information about language that may not have been
noticed through intuition alone.” - Bennett (2010)
Bennett, G. R. (2010) Using Corpora in the Language Learning Classroom, Michigan ELT.
6. ● There are 8 main kinds of corpora.
● They are generalized corpora, specialized corpora,
learner corpora, pedagogic corpora, historical
corpora, parallel corpora, comparable corpora,
and monitor corpora.
● The broadest type of corpus is the generalized
corpus.
7. “Sinmin” will
● be a generalized corpus.
● cover all types of Sinhala language.
8. USAGES OF A CORPUS
● Implementing translators, spell checkers and grammar
checkers.
● Identifying lexical and grammatical features of a language.
● Identifying varieties of language by context of usage and
time.
● Retrieving statistical details of a language.
● Providing backend support for tools like OCR, POS Tagger,
etc.
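As a small illustration of "retrieving statistical details of a language", the following sketch counts word frequencies in a few sentences. The sample sentences, the whitespace tokenizer and the function name `word_frequencies` are assumptions made for illustration; they are not taken from Sinmin.

```python
from collections import Counter


def word_frequencies(texts):
    """Count how often each whitespace-separated token occurs across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


# Hypothetical sample sentences (not taken from the corpus).
sample = ["මම පාසල් යමි", "මම ගෙදර යමි"]
freq = word_frequencies(sample)
print(freq.most_common(2))
```

A real corpus would need a tokenizer that handles Sinhala punctuation and spelling variants; splitting on whitespace is only the simplest starting point.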
10. CORPUS FOR SINHALA LANGUAGE?
● There is an existing corpus for the Sinhala language,
known as the UCSC Text Corpus of
Contemporary Sinhala.
● It consists of about 10 million words, but it covers
only a small part of the language and is not updated.
11. COMPOSITION OF THE CORPUS
● The language comprising the corpus cannot be random;
it must be chosen according to specific characteristics.
● It must use authentic texts. The language it contains
is not made up for the sole purpose of creating the
corpus.
12. EXAMPLE - COMPOSITION OF COCA
● COCA (Corpus of Contemporary American English) contains
more than 385 million words from 1990–2008 (20 million words each year).
● Texts are evenly divided between 5 genres, spoken
(20%), fiction (20%), popular magazines (20%),
newspapers (20%) and academic journals (20%).
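The even genre split can be turned into rough word counts. The sketch below takes the slide's rounded figures at face value (385 million words as a lower bound for the total, 20 million words per year); the variable and function names are illustrative assumptions.

```python
TOTAL_WORDS = 385_000_000    # "more than 385 million words" (lower bound from the slide)
WORDS_PER_YEAR = 20_000_000  # "20 million words each year" (rounded figure)

# Genre shares in percent, as listed on the slide.
GENRE_SHARE_PERCENT = {
    "spoken": 20,
    "fiction": 20,
    "popular magazines": 20,
    "newspapers": 20,
    "academic journals": 20,
}


def words_for(total, percent):
    """Return the number of words corresponding to `percent` of `total`."""
    return total * percent // 100


per_genre_total = {g: words_for(TOTAL_WORDS, p) for g, p in GENRE_SHARE_PERCENT.items()}
per_genre_year = {g: words_for(WORDS_PER_YEAR, p) for g, p in GENRE_SHARE_PERCENT.items()}

for genre in GENRE_SHARE_PERCENT:
    print(genre, per_genre_total[genre], per_genre_year[genre])
```

Under these figures each genre contributes at least 77 million words overall and about 4 million words per year.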
14. DATA STORAGE AND INFORMATION
RETRIEVAL FROM CORPUS
Existing corpora use two main technologies for data
storage:
● Relational databases
● Indexed file systems
15. INDEXED FILE SYSTEMS AS STORAGE
● BNC uses this mechanism.
● Data is stored as XML-like files that follow a
scheme known as the Corpus Data Interchange
Format.
● This makes it possible to store a great deal of detail about the
structure of each text, such as its division into
sections or chapters, paragraphs, verse lines, etc.
20. RELATIONAL DB VS INDEXED FILE
SYSTEMS
● Indexed file systems make extensive use of indexes.
● Relational database models are relatively fast.
● In indexed file systems, it is difficult to add additional
layers of annotation.
21. No study has been done on
how NoSQL performs in
implementing Corpora.
22. INFORMATION VISUALIZATION
Most popular corpora, such as BNC, COCA, Corpus
del Español and the Google Books corpus, use a similar kind of
web interface.
26. EXTRACTING LINGUISTIC FEATURES
● A main use of a language corpus is extracting
linguistic features of a language.
● Linguistic features for many languages have been
identified using corpora.
● Example - A corpus-based linguistics analysis on
written corpus: colligation of “TO” and “FOR.”
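A very rough first step toward this kind of analysis is to count which words follow "to" and "for" in a text. A proper colligation study looks at grammatical categories and would need part-of-speech tags; the sketch below uses the following word as a simple proxy. The sample sentence and the function name `following_words` are assumptions for illustration.

```python
from collections import Counter


def following_words(tokens, target):
    """Count the tokens that appear immediately after `target` (case-insensitive)."""
    lowered = [t.lower() for t in tokens]
    target = target.lower()
    return Counter(nxt for cur, nxt in zip(lowered, lowered[1:]) if cur == target)


# Hypothetical sample sentence, tokenized on whitespace.
tokens = "I went to the shop to buy a gift for my friend".split()

after_to = following_words(tokens, "TO")
after_for = following_words(tokens, "FOR")
print(after_to, after_for)
```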
31. DIVIDED INTO 5 MAIN GENRES
● News: News Paper, News Items
● Academic: Text books, Religious, Wikipedia, Mahawansa
● Creative Writing: Fiction, Blogs, Magazine
● Spoken: Subtitle
● Gazette: Gazette
32. Implemented crawlers for the different sources,
all adhering to the same output format.
https://github.com/madurangasiriwardena/corpus.sinhala.crawler
34. CRAWLED DATA SAVED TO XML FILES WITH
THE FOLLOWING METADATA
● Post Name
● Author
● Link
● Published Date
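To make the format concrete, here is a minimal sketch of writing one crawled post with these four metadata fields to XML. The element names (`post`, `name`, `author`, `link`, `date`, `content`), the function name and the sample values are assumptions; the actual schema used by the crawlers may differ.

```python
import xml.etree.ElementTree as ET


def post_to_xml(name, author, link, published, content):
    """Serialize one crawled post and its metadata to an XML string."""
    post = ET.Element("post")                  # hypothetical root element
    ET.SubElement(post, "name").text = name
    ET.SubElement(post, "author").text = author
    ET.SubElement(post, "link").text = link
    ET.SubElement(post, "date").text = published
    ET.SubElement(post, "content").text = content
    return ET.tostring(post, encoding="unicode")


# Hypothetical sample values.
xml_text = post_to_xml(
    name="Sample post",
    author="A. Author",
    link="http://example.com/post/1",
    published="2014-01-31",
    content="සිංහල පෙළ",
)
print(xml_text)
```

Keeping every crawler on the same element layout means a single loader can read the output of all sources.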
35. CRAWLER CONTROLLER
The crawler controller monitors and manages the status of
the web crawlers.
Crawler controller address -
http://Sinhala-corpus.projects.uom.lk:8080/CrawlerControllerWeb
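The idea of a controller that tracks crawler status can be sketched as follows. This is not the project's controller: the status values, class name and method names are all hypothetical, chosen only to show the bookkeeping involved.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical crawler states."""
    IDLE = "idle"
    RUNNING = "running"
    FAILED = "failed"


class CrawlerController:
    """Keeps the latest known status of each registered crawler."""

    def __init__(self):
        self._status = {}

    def register(self, name):
        # New crawlers start out idle.
        self._status[name] = Status.IDLE

    def update(self, name, status):
        if name not in self._status:
            raise KeyError(f"unknown crawler: {name}")
        self._status[name] = status

    def with_status(self, status):
        """Return the sorted names of crawlers currently in `status`."""
        return sorted(n for n, s in self._status.items() if s is status)


controller = CrawlerController()
for crawler in ("news", "blogs", "wikipedia"):
    controller.register(crawler)
controller.update("news", Status.RUNNING)
controller.update("blogs", Status.FAILED)
print(controller.with_status(Status.FAILED))
```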
37. We tested the performance of several database
systems to determine which one we should use
to store the data.
39. We considered performance for inserting
data and for retrieving data for 12 different
information needs.
Data set and source code -
https://github.com/madurangasiriwardena/performance-test
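The general shape of such a test, timing insertion and retrieval against a store, can be sketched as below. This is not the code in the repository above: it uses a hypothetical in-memory stand-in (`InMemoryStore`) instead of a real database, and a single word-frequency lookup stands in for the information needs.

```python
import time


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


class InMemoryStore:
    """Hypothetical stand-in for a database client."""

    def __init__(self):
        self.word_counts = {}

    def insert(self, words):
        for word in words:
            self.word_counts[word] = self.word_counts.get(word, 0) + 1

    def frequency(self, word):
        return self.word_counts.get(word, 0)


store = InMemoryStore()
words = ["w%d" % (i % 100) for i in range(10_000)]  # synthetic data, 100 distinct words

# Insert once (repeats=1) so the stored counts are not inflated by re-runs.
insert_seconds = time_call(store.insert, words, repeats=1)
lookup_seconds = time_call(store.frequency, "w7")
print(f"insert: {insert_seconds:.6f}s, lookup: {lookup_seconds:.6f}s")
```

Repeating the measurement with growing input sizes is one way to observe how insertion time scales for each candidate system.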
43. Cassandra performed better than the other systems in
most scenarios, and its insertion
time increased linearly.
We therefore chose it for implementing the
corpus.
44. USER INTERFACE DESIGN AND
IMPLEMENTATION
● The web interface of Sinmin has been designed for users
who prefer a visualized and summarized view
of Sinmin's statistical data.
● The visual design of the interface lets
a user without prior experience of the
interface fulfill their information
requirements with little effort.
http://sinhala-corpus.projects.uom.lk/sinmin-web/
48. CORPUS API DESIGN AND IMPLEMENTATION
• REST API to expose corpus services
• More complex and customizable data retrieval and
filtering
• Interface for third-party applications to consume
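A third-party client of such an API would typically build a request URL with optional filters. The sketch below shows that step only; the base URL, the `wordFrequency` path and the `word`, `genre` and `year` parameters are hypothetical and do not describe the actual Sinmin API.

```python
from urllib.parse import urlencode


def build_frequency_url(base_url, word, genre=None, year=None):
    """Build a query URL for a hypothetical word-frequency endpoint.

    Optional filters are included only when they are provided.
    """
    params = {"word": word}
    if genre is not None:
        params["genre"] = genre
    if year is not None:
        params["year"] = year
    return f"{base_url}/wordFrequency?{urlencode(params)}"


# Placeholder host; not the project's server.
BASE_URL = "http://example.org/api"

unfiltered = build_frequency_url(BASE_URL, "corpus")
filtered = build_frequency_url(BASE_URL, "corpus", genre="news", year=2014)
print(unfiltered)
print(filtered)
```

`urlencode` also percent-encodes non-ASCII values, which matters when the query word is in Sinhala script.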
49. PUBLICATIONS
● Comparison between performance of various
database systems for implementing a language
corpus - 11th Beyond Databases, Architectures and
Structures conference (Pending)
● Implementing a Corpus for Sinhala Language -
Symposium on Language Technology for South Asia
(Pending)
50. REMAINING WORK FOR THE NEXT PHASE
• Finish writing the crawlers
• Feed data into the Cassandra database
• Connect the front end to the API
• Extract linguistic features, test hypotheses and validate linguistic rules
So our user interfaces are designed for users who prefer a simple, visualized and summarized view of the data in Sinmin, while keeping all the required functionality.