Textbooks are educational documents created, structured, and formatted by domain experts with the main purpose of explaining the knowledge in the domain to a novice. Authors use their understanding of the domain when structuring and formatting the content of a textbook to facilitate this explanation. As a result, the formatting and structural elements of textbooks carry elements of domain knowledge implicitly encoded by their authors. Our paper presents an extensible approach towards automated extraction of this knowledge from textbooks, taking into account their formatting rules and internal structure. We focus on PDF as the most common textbook representation format; however, the overall method is applicable to other formats as well. The evaluation experiments examine the accuracy of the approach, as well as the pragmatic quality of the obtained knowledge models, using one of their possible applications --- semantic linking of textbooks in the same domain. The results indicate high accuracy of model construction on the symbolic, syntactic, and structural levels across textbooks and domains, and demonstrate the added value of the extracted models on the semantic level.
Presented at Document Engineering 2020
Order out of Chaos: Construction of Knowledge Models from PDF Textbooks
1. Isaac Alpizar-Chacon and Sergey Sosnovsky
Utrecht University
Utrecht, The Netherlands
2. Motivation
• Textbooks are high-quality textual resources
• Textbooks are non-structured resources
• The Table of Contents provides a browsing aid
• The Index provides a searching aid
• Authors use their understanding of the domain while creating textbooks
• Formatting and structuring conventions provide meaningful information
3. Goal
The automated extraction of machine-readable textbook models
Q1: can knowledge be automatically extracted from textbooks?
Q2: what would be the quality and the value of such models?
6. Example Rule
• REPEATED_LINES:
1. Create a sample of pages: P_s = {p_a, p_b, ..., p_m}, where P_s ⊂ P.
2. If the first line(s) are identical across P_s: a header is detected and removed in all pages p ∈ P.
3. If the last line(s) are identical across P_s: a footer is detected and removed in all pages p ∈ P.
9. Accuracy of the extraction of the models
Domains: Statistics, Computer Science, History, Literature
10. Accuracy of the extraction of the models: Results
Averages over all domains
• Text extraction: our approach 93.85%, PDFBox 89.72%, PdfAct 84.19%
• TOC recognition: precision 99.92%, recall 99.92%
• Index recognition: precision 98.56%, recall 98.13%
12. Application of the textbook models
• Linking model:
• A term-based Vector Space Model (VSM) with 1611 terms from two books
• VSM applied to all chapters and sub-chapters of both books
• Measure:
• NDCG (normalized discounted cumulative gain) at 1, 3, and 5.
• Baselines:
• TFIDF model
• LDA model
13. Application of the textbook models: Results
[Bar chart: NDCG@1, NDCG@3, and NDCG@5 scores for TFIDF, LDA, TFIDF+LDA, and our model; y-axis from 0 to 0.9]
14. Summary
• Our rule-based approach allows the automated extraction of knowledge models (Q1)
• Our first evaluation experiment shows that the approach is capable of processing PDF textbooks with high accuracy (Q2)
• The linking of sections across textbooks within the same domain demonstrates the added value of the extracted models (Q2)
Q1: can knowledge be automatically extracted from textbooks?
Q2: what would be the quality and the value of such models?
15. Related work
• We have integrated individual textbooks within the same domain with each other and with the Linked Open Data cloud using DBpedia (e.g., the terms Mean and Venn Diagram are linked to their corresponding DBpedia resources)
• Our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources
16. Future work
• We plan to use the information in both the Table of Contents and the Index more extensively:
• Each chapter/subchapter can potentially be treated as a topic/subtopic annotated with terms in the domain, thanks to the explicit connections between the terms in the index section and the different content sections
(pause: 2)
Hello and welcome to this presentation. My name is Isaac, I am a PhD student at Utrecht University and I will be describing our work:
(pause: 1)
Order out of Chaos: construction of knowledge models from PDF textbooks.
(pause: 2)
I will start by saying that textbooks are high-quality textual resources, but they are often considered to be non-structured. However, if we look carefully at how textbooks are made, they provide a lot of information. The Table of Contents provides a browsing aid, and the index provides a searching aid and terms in the domain. The authors use their understanding of the domain while creating textbooks, and we use these formatting and structuring conventions to extract meaningful information.
(pause: 2)
Our goal is to achieve the automated extraction of machine-readable textbook models. This goal involves two research questions:
(pause: 1)
First, can knowledge be automatically extracted from textbooks? And second, what would be the quality and the value of such models?
Our work seeks to answer these questions.
(pause: 2)
We developed a rule-based approach for the extraction of the knowledge models. We focus on PDF as the most common and challenging digital textbook format. Our workflow has 4 stages, 9 steps, and 39 rules.
(pause: 1)
The modular nature of the rule-based approach supports its gradual refinement. Each time we encounter a new variation of a formatting or structural pattern, we extend the approach by modifying an existing rule or adding a new one.
(pause: 2)
In the diagram we can see the complete workflow. The first stage is text extraction, which reconstructs all the words, lines, and pages from the PDF. In the second stage, the workflow assigns role labels, such as section heading, subheading, important text, and body text, to each text fragment. This process facilitates the subsequent recognition of the different logical elements of the textbook. The third, and largest, stage of the workflow recognizes all the different logical elements within a textbook. First, auxiliary elements such as page numbers and headers are filtered out. Then, the individual entries of the table of contents are recognized and processed. Later, each index term is identified. Finally, individual sections are recognized. In the final stage, we construct the textbook model, which can later be enriched with external information.
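A toy, end-to-end sketch of the workflow's shape in Python (every stage here is a drastic simplification of the actual 9 steps and 39 rules; the role-labelling and recognition logic below are placeholders standing in for the real rules):

def run_workflow(raw_pages):
    # Stage 1: text extraction - reconstruct lines and pages.
    pages = [page.splitlines() for page in raw_pages]
    # Stage 2: role labelling - mark each line as heading or body text
    # (placeholder heuristic only).
    labelled = [[("heading" if line.isupper() else "body", line) for line in page]
                for page in pages]
    # Stage 3: logical-element recognition - collected headings stand in
    # for the TOC, index, and section recognition steps.
    sections = [line for page in labelled for role, line in page if role == "heading"]
    # Stage 4: model construction.
    return {"structure": sections, "content": pages}

print(run_workflow(["CHAPTER 1\nSome body text.", "MORE HEADINGS\nMore text."]))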
(pause: 2)
To give you one example of what the rules look like, consider the _repeated lines_ rule, which is used to detect general page headers and footers. This rule is part of the auxiliary elements filtering step.
(pause: 1)
First, we create a sample of consecutive pages from all the pages in the textbook. Then, if the first lines in each page of the sample are the same, a header is detected and removed in all the pages of the textbook. Footers are detected in a similar way, but by comparing the last lines in the pages of the sample.
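A minimal Python sketch of this rule, assuming each page is already a non-empty list of text lines (the function name and the sample size are illustrative):

def remove_repeated_lines(pages, sample_size=5):
    # Take a sample of consecutive pages from the middle of the book.
    start = len(pages) // 2
    sample = pages[start:start + sample_size]
    # Identical first lines across the sample: header detected, removed everywhere.
    if all(p[0] == sample[0][0] for p in sample):
        pages = [p[1:] for p in pages]
    # Identical last lines across the sample: footer detected, removed everywhere.
    if all(p[-1] == sample[0][-1] for p in sample):
        pages = [p[:-1] for p in pages]
    return pages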
(pause: 2)
The rules are used to identify different elements in the textbooks. In the table of contents, we use them to detect the pages that belong to the TOC, non-content sections like notation or preface, chapter and subchapter entries, entries that are split across multiple lines, and to identify one of three possible types of TOCs: flat, flat-ordered, or indented.
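Purely as an illustration (not a rule from the paper), a Python regular expression for one common TOC entry layout, with entry number, title, and dot leaders before the page number:

import re

# Matches entries like "2.3 Conditional probability . . . . . . 45"
TOC_ENTRY = re.compile(
    r'^(?P<number>\d+(?:\.\d+)*)\s+'  # chapter/subchapter number
    r'(?P<title>.+?)'                 # entry title
    r'[ .]*\s'                        # dot leaders / padding
    r'(?P<page>\d+)$'                 # target page
)

m = TOC_ENTRY.match("2.3 Conditional probability . . . . . . 45")
print(m.group("number"), "|", m.group("title"), "|", m.group("page"))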
(pause: 1)
For the index sections, the rules identify the pages that belong to the section, the heading and page references of the terms, multiline terms, different types of terms like cross-references, and nested groups of terms.
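Again as an illustration only, a pattern for the simplest index-entry layout (a term followed by comma-separated page references); cross-references, multiline terms, and nested groups would each need their own rules:

import re

# Matches entries like "standard deviation, 88, 102-104"
INDEX_ENTRY = re.compile(
    r'^(?P<term>[^,]+),\s*(?P<pages>\d+(?:-\d+)?(?:,\s*\d+(?:-\d+)?)*)$'
)

m = INDEX_ENTRY.match("standard deviation, 88, 102-104")
print(m.group("term"), "->", m.group("pages"))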
(pause: 2)
At the end of the workflow we construct a textbook model using the Text Encoding Initiative, which is a standard for digital representation of texts. In the model we group the information in 3 categories: structure, content, and domain knowledge.
(pause: 1)
The structure section contains the name and precise start and end page of each chapter and subchapter of the textbook. The content includes the textual information structured as words, lines, fragments, and pages for each chapter and subchapter. Finally, the domain knowledge contains all the important terms in the domain extracted from the index section.
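A sketch of how such a model might be assembled as a TEI document, using Python's standard library (the element layout below is illustrative, not the paper's exact schema):

import xml.etree.ElementTree as ET

tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
header = ET.SubElement(tei, "teiHeader")
# Domain knowledge: index terms, here placed as keywords (illustrative choice).
text_class = ET.SubElement(ET.SubElement(header, "profileDesc"), "textClass")
keywords = ET.SubElement(text_class, "keywords")
for term in ["mean", "standard deviation"]:
    ET.SubElement(keywords, "term").text = term
# Structure and content: one div per chapter/subchapter with heading and text.
body = ET.SubElement(ET.SubElement(tei, "text"), "body")
div = ET.SubElement(body, "div", type="chapter", n="1")
ET.SubElement(div, "head").text = "Descriptive Statistics"
ET.SubElement(div, "p").text = "The mean is a measure of central tendency."
print(ET.tostring(tei, encoding="unicode"))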
(pause: 2)
To test the accuracy of the extraction of the models, we extracted the models from the PDF versions using our rule-based approach and from the EPUB versions of the same textbooks. In the EPUB textbooks the information is already structured and marked up, so it is easy to extract and it is accurate. We hypothesize that if the information obtained from the two versions of a textbook matches, the approach processes PDFs correctly.
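As a stand-in for the actual similarity measure (which the talk does not detail), a character-level ratio such as Python's difflib conveys the idea of the comparison:

import difflib

def extraction_similarity(pdf_text, epub_text):
    # Ratio in [0, 1]; 1.0 means the two extracted texts match exactly.
    return difflib.SequenceMatcher(None, pdf_text, epub_text).ratio()

print(extraction_similarity("The mean is the average.", "The mean is the averge."))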
(pause: 1)
We used textbooks from 4 different domains: Statistics, Computer Science, History, and Literature.
(pause: 2)
Results from this first evaluation show that our approach has high accuracy.
(pause: 1)
For the text extraction aspect, we also compared our approach against two other tools as baselines. Our approach achieved the highest similarity, followed by PDFBox and then PdfAct. We don’t reach 100 percent similarity mostly because of formulae, charts, and tables that are images in the EPUB version but text in the PDF version. An additional effect of the rules that improve text extraction, along with the rules for recognizing page elements, is a cleaner textual version of the textbook, as seen when our approach is compared against the out-of-the-box PDFBox tool, which lacks these features.
(pause: 1)
For the recognition of the individual entries in the Table of Contents, we reach a precision and recall of almost 100%.
(pause: 1)
Precision and recall are also very high for the recognition of the index terms.
(pause: 2)
We also studied one of the possible knowledge-driven applications of the extracted models: we used the models of two textbooks to cross-link relevant sections. The idea is that any chapter or subchapter from the first textbook can be linked to any chapter or subchapter of the second textbook to identify similar sections.
(pause: 2)
We constructed a linking model using a term-based Vector Space Model (VSM) with one thousand six hundred eleven terms from the two books. Then, the VSM was applied to all chapters and sub-chapters of both books. The sections were annotated with terms according to the knowledge models extracted from the textbooks’ indices. The inner product of these annotations was used to compute the similarity between all sections of book 1 and all sections of book 2.
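A minimal sketch of this linking step, assuming each section is annotated with a set of index terms and using binary term vectors (the actual weighting may differ):

def term_vector(section_terms, vocabulary):
    # Binary annotation vector over the shared vocabulary of index terms.
    return [1 if term in section_terms else 0 for term in vocabulary]

def section_similarity(vec_a, vec_b):
    # Inner product of the two annotation vectors.
    return sum(a * b for a, b in zip(vec_a, vec_b))

vocabulary = ["mean", "median", "variance", "histogram"]
sec_book1 = term_vector({"mean", "variance"}, vocabulary)
sec_book2 = term_vector({"mean", "variance", "histogram"}, vocabulary)
print(section_similarity(sec_book1, sec_book2))  # 2 shared terms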
We used the normalized discounted cumulative gain (NDCG) to measure the quality of the documents ranked by relevance. NDCG@1 measures the effectiveness of retrieving the most relevant document, while NDCG@3 and NDCG@5 measure the capability of the retrieval system to find the first three and five most relevant documents, respectively. We also used a manual linking produced by experts as the ground truth for the NDCG measures.
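For reference, a minimal NDCG@k computation in Python (the relevance grades below are toy values; in the experiment the ground truth comes from the expert linking):

import math

def ndcg_at_k(relevances, k):
    # relevances: ground-truth relevance of each retrieved section, in ranked order.
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 1, 2, 0], k=3))  # ~0.97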
Finally, we used two baselines for comparison: the standard TFIDF model and an LDA model. Both baselines used the textual content of each part of the textbooks with basic preprocessing (lowercasing, stop-word removal, and stemming).
(pause: 2)
The results show that the proposed model consistently outperforms all baselines, as seen with the yellow bar in the graph.
(pause: 2)
The difference between our model and the baselines is the highest for NDCG@1.
The semantic information placed by the authors of textbooks in the index sections, and extracted by our approach, helps our linking model find 72% of the best possible matches between the textbook sections. As the number of potential matches increases, the difference between NDCG scores diminishes due to the ceiling effect.
(pause: 2)
(pause: 2)
In summary, we developed a rule-based approach that allows the automated extraction of knowledge models. This answers our first research question.
Our first evaluation experiment shows that the approach is capable of processing PDF textbooks with high accuracy.
And the linking of sections across textbooks within the same domain demonstrates the added value of the extracted models.
The two evaluation experiments answer our second research question.
(pause: 2)
(pause: 2)
Related to this work, we have taken individual textbooks within the same domain and integrated them with each other and with the Linked Open Data cloud using DBpedia. For example, individual terms like mean and Venn diagram are linked to their corresponding resources in DBpedia.
(pause: 2)
Also, our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources.
(pause: 2)
(pause: 2)
As future work, we plan to use the information in both the Table of Contents and the Index more extensively:
Each chapter/subchapter can potentially be treated as a topic/subtopic annotated with terms in the domain thanks to the explicit connections between the terms in the index section and the different content sections.
(pause: 2)
Finally, I invite you to check out our GitHub project and to use our web service to create textbook models.
Thank you for your attention!
(pause: 2)