Textbooks are educational documents created, structured, and formatted by domain experts with the main purpose of explaining the knowledge in the domain to a novice. Authors use their understanding of the domain when structuring and formatting the content of a textbook to facilitate this explanation. As a result, the formatting and structural elements of textbooks carry elements of domain knowledge implicitly encoded by their authors. Our paper presents an extensible approach towards automated extraction of this knowledge from textbooks, taking into account their formatting rules and internal structure. We focus on PDF as the most common textbook representation format; however, the overall method is applicable to other formats as well. The evaluation experiments examine the accuracy of the approach, as well as the pragmatic quality of the obtained knowledge models, using one of their possible applications --- semantic linking of textbooks in the same domain. The results indicate high accuracy of model construction on the symbolic, syntactic, and structural levels across textbooks and domains, and demonstrate the added value of the extracted models on the semantic level.
Presented at Document Engineering 2020
Order out of Chaos: Construction of Knowledge Models from PDF Textbooks
1. Isaac Alpizar-Chacon and Sergey Sosnovsky
Utrecht University
Utrecht, The Netherlands
2. Motivation
• Textbooks are high-quality textual resources
• Textbooks are non-structured resources
• The Table of Contents provides a browsing aid
• The Index provides a searching aid
• Authors use their understanding of the domain while creating textbooks
• Formatting and structuring conventions provide meaningful information
3. Goal
The automated extraction of machine-readable textbook models
Q1: can knowledge be automatically extracted from textbooks?
Q2: what would be the quality and the value of such models?
6. Example Rule
• REPEATED_LINES:
1. Create a sample of pages: P_s = {p_a, p_b, ..., p_m}, where P_s ⊂ P.
2. If the first line(s) are identical across P_s: a header is detected and removed in all pages p ∈ P.
3. If the last line(s) are identical across P_s: a footer is detected and removed in all pages p ∈ P.
9. Accuracy of the extraction of the models
Domains: Statistics, Computer Science, History, Literature
10. Accuracy of the extraction of the models: Results
Averages over all domains
• Text extraction: our approach 93.85%, PDFBox 89.72%, PdfAct 84.19%
• TOC recognition: precision 99.92%, recall 99.92%
• Index recognition: precision 98.56%, recall 98.13%
12. Application of the textbook models
• Linking model:
• A term-based Vector Space Model (VSM) with 1611 terms from two books
• VSM applied to all chapters and sub-chapters of both books
• Measure:
• NDCG (normalized discounted cumulative gain) at 1, 3, and 5.
• Baselines:
• TFIDF model
• LDA model
13. Application of the textbook models: Results
[Bar chart: NDCG@1, NDCG@3, and NDCG@5 scores for TFIDF, LDA, TFIDF+LDA, and our model; y-axis from 0 to 0.9]
14. Summary
• Our rule-based approach allows the automated extraction of knowledge models (Q1)
• Our first evaluation experiment shows that the approach is capable of processing PDF textbooks with high accuracy (Q2)
• The linking of sections across textbooks within the same domain demonstrates the added value of the extracted models (Q2)
Q1: can knowledge be automatically extracted from textbooks?
Q2: what would be the quality and the value of such models?
15. Related work
• We have integrated individual textbooks within the same domain with each other and with the Linked Open Data cloud using DBpedia (e.g., the terms Mean and Venn Diagram are linked to their corresponding DBpedia resources)
• Our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources
16. Future work
• We plan to use the information in both the Table of Contents and the Index more extensively:
• Each chapter/subchapter can potentially be treated as a topic/subtopic annotated with terms in the domain, thanks to the explicit connections between the terms in the index section and the different content sections
(pause: 2)
Hello and welcome to this presentation. My name is Isaac, I am a PhD student at Utrecht University and I will be describing our work:
(pause: 1)
Order out of Chaos: construction of knowledge models from PDF textbooks.
(pause: 2)
I will start by saying that textbooks are high-quality textual resources, but they are often considered to be non-structured. However, if we look carefully at how textbooks are made, they provide a lot of information. The Table of Contents provides a browsing aid, and the index provides a searching aid and terms in the domain. The authors use their understanding of the domain while creating textbooks, and we use these formatting and structuring conventions to extract meaningful information.
(pause: 2)
Our goal is to achieve the automated extraction of machine-readable textbook models. This goal involves two research questions:
(pause: 1)
First, can knowledge be automatically extracted from textbooks? And second, what would be the quality and the value of such models?
Our work seeks to answer these questions.
(pause: 2)
We developed a rule-based approach for the extraction of the knowledge models. We focus on PDF as the most common and challenging digital textbook format. Our workflow has 4 stages, 9 steps, and 39 rules.
(pause: 1)
The modular nature of the rule-based approach supports its gradual refinement. Each time we encounter a new variation of a formatting or structural pattern, we extend the approach by modifying an existing rule or adding a new one.
(pause: 2)
In the diagram we can see the complete workflow. The first stage is text extraction, which reconstructs all the words, lines, and pages from the PDF. In the second stage, the workflow assigns role labels, such as section heading, subheading, important text, and body text, to each text fragment. This process facilitates the subsequent recognition of the different logical elements of the textbook. The third, and largest, stage of the workflow recognizes all the different logical elements within a textbook. First, auxiliary elements such as page numbers and headers are filtered out. Then, the individual entries of the table of contents are recognized and processed. Later, each index term is identified. Finally, individual sections are recognized. In the final stage, we construct the textbook model, which can later be enriched with external information.
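A toy, end-to-end sketch of the workflow's shape in Python (every stage here is a drastic simplification of the actual 9 steps and 39 rules; the role-labelling and recognition logic below are placeholders standing in for the real rules):

def run_workflow(raw_pages):
    # Stage 1: text extraction - reconstruct lines and pages.
    pages = [page.splitlines() for page in raw_pages]
    # Stage 2: role labelling - mark each line as heading or body text
    # (placeholder heuristic only).
    labelled = [[("heading" if line.isupper() else "body", line) for line in page]
                for page in pages]
    # Stage 3: logical-element recognition - collected headings stand in
    # for the TOC, index, and section recognition steps.
    sections = [line for page in labelled for role, line in page if role == "heading"]
    # Stage 4: model construction.
    return {"structure": sections, "content": pages}

print(run_workflow(["CHAPTER 1\nSome body text.", "MORE HEADINGS\nMore text."]))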
(pause: 2)
To give you one example of what the rules look like, consider the _repeated lines_ rule, which is used to detect general page headers and footers. This rule is part of the auxiliary elements filtering step.
(pause: 1)
First, we create a sample of consecutive pages from all the pages in the textbook. Then, if the first lines in each page of the sample are the same, a header is detected and removed in all the pages of the textbook. Footers are detected in a similar way, but by comparing the last lines in the pages of the sample.
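A minimal Python sketch of this rule, assuming each page is already a non-empty list of text lines (the function name and the sample size are illustrative):

def remove_repeated_lines(pages, sample_size=5):
    # Take a sample of consecutive pages from the middle of the book.
    start = len(pages) // 2
    sample = pages[start:start + sample_size]
    # Identical first lines across the sample: header detected, removed everywhere.
    if all(p[0] == sample[0][0] for p in sample):
        pages = [p[1:] for p in pages]
    # Identical last lines across the sample: footer detected, removed everywhere.
    if all(p[-1] == sample[0][-1] for p in sample):
        pages = [p[:-1] for p in pages]
    return pages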
(pause: 2)
The rules are used to identify different elements in the textbooks. In the table of contents, we use them to detect the pages that belong to the TOC, non-content sections like notation or preface, chapter and subchapter entries, entries that are split across multiple lines, and to identify one of three possible types of TOCs: flat, flat-ordered, or indented.
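Purely as an illustration (not a rule from the paper), a Python regular expression for one common TOC entry layout, with entry number, title, and dot leaders before the page number:

import re

# Matches entries like "2.3 Conditional probability . . . . . . 45"
TOC_ENTRY = re.compile(
    r'^(?P<number>\d+(?:\.\d+)*)\s+'  # chapter/subchapter number
    r'(?P<title>.+?)'                 # entry title
    r'[ .]*\s'                        # dot leaders / padding
    r'(?P<page>\d+)$'                 # target page
)

m = TOC_ENTRY.match("2.3 Conditional probability . . . . . . 45")
print(m.group("number"), "|", m.group("title"), "|", m.group("page"))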
(pause: 1)
For the index sections, the rules identify the pages that belong to the section, the heading and page references of the terms, multiline terms, different types of terms like cross-references, and nested groups of terms.
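Again as an illustration only, a pattern for the simplest index-entry layout (a term followed by comma-separated page references); cross-references, multiline terms, and nested groups would each need their own rules:

import re

# Matches entries like "standard deviation, 88, 102-104"
INDEX_ENTRY = re.compile(
    r'^(?P<term>[^,]+),\s*(?P<pages>\d+(?:-\d+)?(?:,\s*\d+(?:-\d+)?)*)$'
)

m = INDEX_ENTRY.match("standard deviation, 88, 102-104")
print(m.group("term"), "->", m.group("pages"))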
(pause: 2)
At the end of the workflow we construct a textbook model using the Text Encoding Initiative, which is a standard for digital representation of texts. In the model we group the information in 3 categories: structure, content, and domain knowledge.
(pause: 1)
The structure section contains the name and precise start and end page of each chapter and subchapter of the textbook. The content includes the textual information structured as words, lines, fragments, and pages for each chapter and subchapter. Finally, the domain knowledge contains all the important terms in the domain extracted from the index section.
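A sketch of how such a model might be assembled as a TEI document, using Python's standard library (the element layout below is illustrative, not the paper's exact schema):

import xml.etree.ElementTree as ET

tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
header = ET.SubElement(tei, "teiHeader")
# Domain knowledge: index terms, here placed as keywords (illustrative choice).
text_class = ET.SubElement(ET.SubElement(header, "profileDesc"), "textClass")
keywords = ET.SubElement(text_class, "keywords")
for term in ["mean", "standard deviation"]:
    ET.SubElement(keywords, "term").text = term
# Structure and content: one div per chapter/subchapter with heading and text.
body = ET.SubElement(ET.SubElement(tei, "text"), "body")
div = ET.SubElement(body, "div", type="chapter", n="1")
ET.SubElement(div, "head").text = "Descriptive Statistics"
ET.SubElement(div, "p").text = "The mean is a measure of central tendency."
print(ET.tostring(tei, encoding="unicode"))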
(pause: 2)
To test the accuracy of the extraction of the models, we extracted the models from the PDF versions using our rule-based approach and from the EPUB versions of the same textbooks. In the EPUB textbooks the information is already structured and marked up, so it is easy to extract and it is accurate. We hypothesize that if the information obtained from the two versions of a textbook matches, the approach processes PDFs correctly.
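As a stand-in for the actual similarity measure (which the talk does not detail), a character-level ratio such as Python's difflib conveys the idea of the comparison:

import difflib

def extraction_similarity(pdf_text, epub_text):
    # Ratio in [0, 1]; 1.0 means the two extracted texts match exactly.
    return difflib.SequenceMatcher(None, pdf_text, epub_text).ratio()

print(extraction_similarity("The mean is the average.", "The mean is the averge."))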
(pause: 1)
We used textbooks from 4 different domains: Statistics, Computer Science, History, and Literature.
(pause: 2)
Results from this first evaluation show that our approach has high accuracy.
(pause: 1)
For the text extraction aspect, we also compared our approach against two other tools as baselines. Our approach achieved the highest similarity, followed by PDFBox and then PdfAct. We don’t reach 100 percent similarity mostly because of formulae, charts, and tables that are images in the EPUB version but text in the PDF version. An additional effect of the rules that improve text extraction, along with the rules for recognizing page elements, is a cleaner textual version of the textbook, as seen when our approach is compared against the out-of-the-box PDFBox tool, which lacks these features.
(pause: 1)
For the recognition of the individual entries in the Table of Contents, we reach a precision and recall of almost 100%.
(pause: 1)
Precision and recall are also very high for the recognition of the index terms.
(pause: 2)
We also studied one of the possible knowledge-driven applications of the extracted models: we used the models of two textbooks to cross-link relevant sections. The idea is that any chapter or subchapter from the first textbook can be linked to any chapter or subchapter of the second textbook to identify similar sections.
(pause: 2)
We constructed a linking model using a term-based Vector Space Model (VSM) with one thousand six hundred eleven terms from the two books. Then, the VSM was applied to all chapters and sub-chapters of both books. The sections were annotated with terms according to the knowledge models extracted from the textbooks’ indices. The inner product of these annotations was used to compute the similarity between all sections of book 1 and all sections of book 2.
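A minimal sketch of this linking step, assuming each section is annotated with a set of index terms and using binary term vectors (the actual weighting may differ):

def term_vector(section_terms, vocabulary):
    # Binary annotation vector over the shared vocabulary of index terms.
    return [1 if term in section_terms else 0 for term in vocabulary]

def section_similarity(vec_a, vec_b):
    # Inner product of the two annotation vectors.
    return sum(a * b for a, b in zip(vec_a, vec_b))

vocabulary = ["mean", "median", "variance", "histogram"]
sec_book1 = term_vector({"mean", "variance"}, vocabulary)
sec_book2 = term_vector({"mean", "variance", "histogram"}, vocabulary)
print(section_similarity(sec_book1, sec_book2))  # 2 shared terms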
We used the normalized discounted cumulative gain (NDCG) to measure the quality of the documents ranked by relevance. NDCG@1 measures the effectiveness of retrieving the most relevant document, while NDCG@3 and NDCG@5 measure the capability of the retrieval system to find the first three and five most relevant documents, respectively. We also used a manual linking produced by experts as the ground truth for the NDCG measures.
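For reference, a minimal NDCG@k computation in Python (the relevance grades below are toy values; in the experiment the ground truth comes from the expert linking):

import math

def ndcg_at_k(relevances, k):
    # relevances: ground-truth relevance of each retrieved section, in ranked order.
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 1, 2, 0], k=3))  # ~0.97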
Finally, we used two baselines for comparison: the standard TFIDF model and an LDA model. Both baselines used the textual content of each part of the textbooks with basic preprocessing (lowercasing, stop-word removal, and stemming).
(pause: 2)
The results show that the proposed model consistently outperforms all baselines, as seen with the yellow bar in the graph.
(pause: 2)
The difference between our model and the baselines is the highest for NDCG@1.
The semantic information placed by the authors of textbooks in the index sections, and extracted by our approach, helps our linking model find 72% of the best possible matches between the textbook sections. As the number of potential matches increases, the difference between NDCG scores diminishes due to the ceiling effect.
(pause: 2)
(pause: 2)
In summary, we developed a rule-based approach that allows the automated extraction of knowledge models. This answers our first research question.
Our first evaluation experiment shows that the approach is capable of processing PDF textbooks with high accuracy.
And the linking of sections across textbooks within the same domain demonstrates the added value of the extracted models.
The two evaluation experiments answer our second research question.
(pause: 2)
(pause: 2)
Related to this work, we have taken individual textbooks within the same domain and integrated them with each other and with the Linked Open Data cloud using DBpedia. For example, individual terms like mean and Venn diagram are linked to their corresponding resources in DBpedia.
(pause: 2)
Also, our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources.
(pause: 2)
(pause: 2)
As future work, we plan to use the information in both the Table of Contents and the Index more extensively:
Each chapter/subchapter can potentially be treated as a topic/subtopic annotated with terms in the domain thanks to the explicit connections between the terms in the index section and the different content sections.
(pause: 2)
Finally, I invite you to check out our GitHub project and to use our web service to create textbook models.
Thank you for your attention!
(pause: 2)