What's in a textbook

Sergey Sosnovsky
What’s in a textbook?

Architecture of an AES
[Diagram: components of an AES: Instructional Content, Interaction, User Model, Adaptation Model, Adaptation, Metadata, Domain Model]

Math-Bridge: Rich Adaptive and Intelligent Textbooks
Sosnovsky, S., Dietrich, M., Andrès, E., Goguadze, G., Winterstein, S., Libbrecht, P., Siekmann, J., & Melis, E. (2014). Math-Bridge: Bridging the gaps in European remedial mathematics
with technology-enhanced learning. In T. Wassong, D. Frischemeier, P. R. Fischer, R. Hochmuth, & P. Bender (Eds.), Mit Werkzeugen Mathematik und Stochastik lernen – Using Tools for
Learning Mathematics and Statistics (pp. 437-451). Berlin/Heidelberg, Germany: Springer.

Intelligent Problem Solving Support

Personalized Course Generation

Metadata annotation
• error-prone
• time-consuming
• limited support of tools
• often many authors
• often non-expert authors
• difficult
• Math-Bridge metadata schema has more than 30 elements
• Math-Bridge content collection contains more than 10 000 learning objects
• About 50 people were involved in preparing this collection

The Burden of Authoring
• Learning content authoring has always been tedious, expertise-demanding, and poorly supported
• Content and knowledge authoring for adaptive intelligent systems requires a lot of extra effort
• Information and knowledge existing in the system should become not the authoring burden but the vehicle for authoring support!
[Figure: three panels: "Authoring for e-Learning" (Instructional Content); "Authoring for Adaptive e-Learning" (Metadata, Instructional Content); "Authoring for Adaptive e-Learning as It Should Be" (Instructional Content)]

Semantic Gap Detection
Four main steps:
1. Conversion of Metadata to OWL2
2. Detection of Ontology Inconsistencies
3. Isolation of Causing Axioms
4. Generation of Verbal Explanations
Sosnovsky, S., & Alpizar-Chacon, I. (2014). Semantic gap detection in metadata of adaptive learning environments. In Proceedings of ICALT'2014: 14th International
Conference on Advanced Learning Technologies (pp. 548-552). IEEE Computer Society.

Math-Bridge Metadata Schema

Step 1: Conversion of Metadata to OWL2
[Diagram: OMDoc -> XSLT Stylesheet -> OWL2]

Step 2: Detection of Ontology Inconsistencies
[Diagram: owl:ObjectProperty activemath:hasDomainPrerequisite with its rdfs:domain and rdfs:range; individuals intro_bikers_slope (rdf:type activemath:Text) and ex_tour_de_fr (rdf:type activemath:Example); classes activemath:KnowledgeItem, activemath:ConceptItem, activemath:SateliteItem; result: Inconsistent!]
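A minimal sketch of how a domain/range inconsistency of this kind could be detected. The class hierarchy, the disjointness axiom, the domain and range of hasDomainPrerequisite, and the direction of the example triple are all assumptions made for illustration; the real Math-Bridge ontology and the reasoner used in the project may differ.

```python
# Hypothetical fragment of the ontology (assumed, not taken from the slides).
SUBCLASS_OF = {
    "Text": "SateliteItem",
    "Example": "SateliteItem",
    "SateliteItem": "KnowledgeItem",
    "ConceptItem": "KnowledgeItem",
}
DISJOINT = {frozenset({"ConceptItem", "SateliteItem"})}
PROPERTY = {"hasDomainPrerequisite": {"domain": "KnowledgeItem", "range": "ConceptItem"}}
TYPES = {"intro_bikers_slope": "Text", "ex_tour_de_fr": "Example"}


def ancestors(cls):
    """Return the class itself plus all of its superclasses."""
    result = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.add(cls)
    return result


def inferred_classes(individual, triples):
    """Collect asserted classes plus classes implied by rdfs:domain / rdfs:range."""
    classes = ancestors(TYPES[individual])
    for subject, prop, obj in triples:
        if subject == individual:
            classes |= ancestors(PROPERTY[prop]["domain"])
        if obj == individual:
            classes |= ancestors(PROPERTY[prop]["range"])
    return classes


def find_inconsistencies(triples):
    """Report individuals that end up in two disjoint classes."""
    problems = []
    for individual in TYPES:
        classes = inferred_classes(individual, triples)
        for pair in DISJOINT:
            if pair <= classes:
                problems.append((individual, tuple(sorted(pair))))
    return problems


# Assumed metadata statement: the example lists a text item as its prerequisite.
triples = [("ex_tour_de_fr", "hasDomainPrerequisite", "intro_bikers_slope")]
print(find_inconsistencies(triples))
```

Under these assumptions the text item is inferred to be a ConceptItem through the property range while also being a SateliteItem through its type, which is the kind of clash a reasoner flags.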

Step 3: Isolation of Causing Axioms

Step 4: Generation of Verbal Explanations

The Scale of the Problem
[Diagram: Interaction, Adaptation]

Textbooks as a source of (extractable) knowledge
• Focus (narrow, cohesive domain)
• Quality (created by domain experts)
• Purpose (content explains domain knowledge to a novice)
• Structure: sections / subsections
• Order: easy to complex
• Formatting: of content and headers
• Additional structural elements: indices, tables of contents
• Topics/subtopics: underlying content, textual labels
• Pedagogical relations: prerequisites <-> outcomes
• Text types/roles and relations: header vs important vs regular; same format = same role
• Meaningful labels: glossary of curated meaningful terms, set of important domain categories
• If automatically extracted and formally represented,
these elements will form the model of the textbook and
the model of the domain as the author understands it

Linking Textbooks to Ontologies
Topic-based model of an HTML-based Java
textbook automatically extracted and mapped
to a central ontology already linked to a set of
Java exercises
• Mapping serves as a bridge to jointly
interpret learner’s reading and exercise
attempts in terms of ontology and adapt
access to textbook pages accordingly
Project 1
Sosnovsky, S., Hsiao, I-H., & Brusilovsky, P. (2012). Adaptation “in the wild”: Ontology-based personalization of open-corpus learning material. In A. Ravenscroft, S.
Lindstaedt, C. Delgado Kloos, & D. Hernández-Leo (Eds.), Proceedings of EC-TEL'2012: 7th European Conference on Technology Enhanced Learning (pp. 425-431).
Berlin/Heidelberg, Germany: Springer.

Linking Textbooks to Textbooks
• Several LDA-based techniques are used to interlink sections from a set of HTML-based textbooks in a domain
• A manual mapping by experts is used as a gold standard
Project 2
Guerra, J., Sosnovsky, S., & Brusilovsky, P. (2013). When one textbook is not enough: Linking multiple textbooks using probabilistic topic models. In D.
Hernández-Leo, T. Ley, R. Klamma, & A. Harrer (Eds.), Proceedings of EC-TEL'2013: 8th European Conference on Technology Enhanced Learning (pp.
125-138). Berlin/Heidelberg, Germany: Springer.

Interlingua: linking textbooks
across languages
Project 3
[Diagram: textbooks in DE, EN, and FR, each represented by a semantic model of the textbook (Chapter / Section / Subsection structure plus index entries of the form term -> page#), all linked through a Statistics ontology]
Alpizar-Chacon, I., van der Hart, M., Wiersma, Z., Theunissen, L., & Sosnovsky, S. (2020). Transformation of PDF Textbooks into Interactive Educational
Resources. In Proceedings of the Workshop on Intelligent Textbooks at AIEd'2020 (pp. 4-16). Online, July 6, 2020.

Relevant Content in One’s Mother Tongue
Project 3

intextbooks
Isaac Alpizar Chacon
Alpizar-Chacon, I., & Sosnovsky, S. (2020). Knowledge models from PDF textbooks. New Review of Hypermedia and Multimedia, (in press).

Model extraction from PDF
textbooks
• PDF as the most common and challenging format
• 4 stages, 9 steps, 39 rules
Alpizar-Chacon, I., & Sosnovsky, S. (2020). Order out of Chaos: Construction of Knowledge Models from PDF Textbooks. In Proceedings of
DocEng’2020: The 20th ACM Symposium on Document Engineering, (Article No.: 8, pp 1–10). New York, NY, USA: ACM Press.

Example Rule
• REPEATED_LINES:
1. Select a sample of pages: $P_s = \{p_a, p_b, \ldots, p_m\}$, $P_s \subset P$.
2. If the first line(s) are identical across $P_s$: header is detected and removed in all pages $p \in P$.
3. If the last line(s) are identical across $P_s$: footer is detected and removed in all pages $p \in P$.
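The rule can be sketched roughly as follows. This is an illustrative reading of the three steps, not the intextbooks implementation: the function name, the sample size, and the representation of a page as a list of text lines are assumptions.

```python
import random


def remove_repeated_lines(pages, sample_size=5, seed=0):
    """pages: list of pages, each page a list of text lines (assumed format)."""
    if len(pages) < 2:
        return pages
    # Step 1: select a sample of pages P_s from P (at least two pages).
    rng = random.Random(seed)
    sample = rng.sample(pages, min(max(sample_size, 2), len(pages)))
    shortest = min(len(page) for page in sample)

    def shared_lines(from_end):
        """Count leading (or trailing) lines that are identical across the sample."""
        count = 0
        while count < shortest:
            index = -1 - count if from_end else count
            if len({page[index] for page in sample}) != 1:
                break
            count += 1
        return count

    header = shared_lines(from_end=False)  # Step 2: identical first line(s)
    footer = shared_lines(from_end=True)   # Step 3: identical last line(s)
    # Detection uses the sample; removal is applied to every page in P.
    return [page[header:len(page) - footer] for page in pages]


pages = [
    ["Book Title", "text a", "page footer"],
    ["Book Title", "text b", "more", "page footer"],
    ["Book Title", "text c", "page footer"],
]
print(remove_repeated_lines(pages))
```

Real headers and footers often contain changing page numbers, so a practical version would probably need to normalise digits before comparing lines; that refinement is left out here.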

2. Role labeling of fragments
Style  Font Family      Font Size  Font Face  Font Color  Occurrences
1      Liberation Sans  35         Bold       Blue        3
2      Liberation Sans  18         Bold       Blue        1
3      Liberation Sans  9          -          Black       153
4      Liberation Sans  9          Bold       Black       2
Resulting role labels: Body text, Chapter, Subchapter
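One plausible way to derive such labels from style statistics is sketched below. The heuristic (most frequent style is body text, larger fonts are headings ordered by size) and all names are assumptions for illustration; the actual rules used by intextbooks are not given on the slide.

```python
from collections import Counter


def label_styles(fragments):
    """fragments: one (font_family, font_size, font_face, font_color) tuple per text fragment."""
    counts = Counter(fragments)
    # Assumption: the most frequent style is the body text.
    body_style, _ = counts.most_common(1)[0]
    body_size = body_style[1]
    # Assumption: styles with a larger font than the body are headings, biggest first.
    heading_styles = sorted((s for s in counts if s[1] > body_size), key=lambda s: -s[1])
    labels = {body_style: "body text"}
    names = ["chapter", "subchapter"]
    for level, style in enumerate(heading_styles):
        labels[style] = names[level] if level < len(names) else f"subchapter level {level}"
    return labels


style1 = ("Liberation Sans", 35, "Bold", "Blue")
style2 = ("Liberation Sans", 18, "Bold", "Blue")
style3 = ("Liberation Sans", 9, "-", "Black")
style4 = ("Liberation Sans", 9, "Bold", "Black")
fragments = [style1] * 3 + [style2] * 1 + [style3] * 153 + [style4] * 2
for style, role in label_styles(fragments).items():
    print(style, "->", role)
```

Styles that share the body font size, such as the bold variant above, stay unlabeled in this sketch; they would need a separate rule (for example, emphasis inside body text).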

3. Processing Table of Contents

[Figure: annotated table-of-contents page showing the TOC section, a textbook part, chapters, subchapters, and level-2 subchapters, with individual page numbers for each section]
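A small sketch of turning TOC lines into a hierarchy with page numbers. It assumes numbered section titles and dot leaders, which is only one of many TOC layouts; the regular expression and field names are hypothetical and are not the rules used by intextbooks.

```python
import re

# Assumed line shape: "<section number> <title> <dot leaders> <page>"
TOC_LINE = re.compile(r"^(?P<number>\d+(?:\.\d+)*)\s+(?P<title>.+?)[\s.]*(?P<page>\d+)$")


def parse_toc(lines):
    """Return one dict per recognised TOC entry; unrecognised lines are skipped."""
    entries = []
    for line in lines:
        match = TOC_LINE.match(line.strip())
        if not match:
            continue
        number = match.group("number")
        entries.append({
            "number": number,
            "level": number.count(".") + 1,  # 1 = chapter, 2 = subchapter, ...
            "title": match.group("title").strip(),
            "page": int(match.group("page")),
        })
    return entries


toc = [
    "Contents",
    "1 Introduction ........ 1",
    "1.1 Basic concepts . . . 3",
    "1.1.1 Variables 5",
    "2 Probability ........ 21",
]
for entry in parse_toc(toc):
    print(entry)
```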

3. Processing Index
[Figure: annotated index page showing the index section, the multi-column layout and its reading order, and the entry types: index term + page number, "see" case, multiline term, nested term, range of page numbers]
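The entry types named here can be illustrated with a simplified parser. It assumes one entry per line, indentation for nested terms, and comma-separated page references; multiline terms and multi-column reading order are not handled. All names and patterns are illustrative assumptions.

```python
import re

SEE = re.compile(r"^(?P<term>.+?),?\s+see\s+(?P<target>.+)$")
PAGES = re.compile(
    r"^(?P<term>.+?),?\s+"
    r"(?P<pages>\d+(?:\s*[-–]\s*\d+)?(?:\s*,\s*\d+(?:\s*[-–]\s*\d+)?)*)$"
)


def expand_pages(text):
    """Turn '92, 95-97' into [92, 95, 96, 97]."""
    pages = []
    for part in text.split(","):
        bounds = [int(number) for number in re.split(r"[-–]", part)]
        pages.extend(range(bounds[0], bounds[-1] + 1))
    return pages


def parse_index(lines):
    entries, parent = [], None
    for raw in lines:
        if not raw.strip():
            continue
        nested = raw[0].isspace()  # assumption: nested terms are indented
        text = raw.strip()
        entry = {"term": text, "parent": parent if nested else None, "pages": [], "see": None}
        see, paged = SEE.match(text), PAGES.match(text)
        if see:  # "see" case: a cross-reference instead of page numbers
            entry["term"], entry["see"] = see.group("term"), see.group("target")
        elif paged:  # term + page number(s), possibly with ranges
            entry["term"] = paged.group("term")
            entry["pages"] = expand_pages(paged.group("pages"))
        if not nested:
            parent = entry["term"]
        entries.append(entry)
    return entries


index = ["distribution, 85", "  gamma, 106", "  normal, 92, 95-97", "variance, see dispersion"]
for entry in parse_index(index):
    print(entry)
```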

4. Textbook model
• Structure (sections)
• Content (words, lines, etc.)
• Domain Knowledge (terms)

Potential Problems of These Models
• Variability: Structure, Labels, Order, Focus, Coverage
• Subjectivity: Same domain + Different authors = Different textbooks => Different models
• Quality: Completeness, Granularity, Consistency
• Lack of semantics: More structure than knowledge; Lack of links; Cohesiveness of topics and index terms
Levels: Textbook-level, Model-level

...nevertheless
• They are automatically extracted models of high-quality resources and underlying domains
• Their individual quality might not be enough, but they can be aggregated
• Linking models to the existing ontologies should help filter out less relevant terms and extend them with additional semantic information
• Interlinking multiple models within the same domain should improve the coverage

Evaluation 1 (Accuracy of model extraction)
Domains: Statistics, Computer Science, History, Literature

Evaluation 1 (Accuracy of model extraction): Results
Averages over all domains
• Text Extraction: Our approach 93.85%; PDFBox 89.72%; PdfAct 84.19%
• TOC Recognition: Precision 99.92%; Recall 99.92%
• Index Recognition: Precision 98.56%; Recall 98.13%

Evaluation 2 (Value of Extracted Models – Semantic Linking of Textbooks)
[Figure: chapter and subchapter hierarchies of Book#1 and Book#2, with sections of one book linked to sections of the other]

Evaluation 2 (Value of Extracted Models – Semantic Linking of Textbooks): Method
• Ground truth
• Average of manual linking of two textbooks by three experts in statistics
• Measure:
• NDCG (normalized discounted cumulative gain) at 1, 3, and 5.
• Baselines:
• TFIDF model
• LDA model
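For reference, NDCG at a cutoff k can be computed as in the sketch below. It uses the common log2 discount and computes the ideal ranking from the relevance values of the ranked list itself; whether the evaluation used exactly this variant is an assumption, and the relevance values in the example are made up.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of the first k items (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(ranked_relevances, k):
    """DCG of the produced ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical expert relevance scores of the sections a model ranked 1st, 2nd, 3rd, ...
ranking = [2, 3, 0, 1, 0]
for k in (1, 3, 5):
    print(f"NDCG@{k} = {ndcg(ranking, k):.3f}")
```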

Evaluation 2 (Value of Extracted Models – Semantic Linking of Textbooks): Results
[Bar chart: NDCG@1, NDCG@3, and NDCG@5 (y-axis from 0 to 0.9) for TFIDF, LDA, TFIDF+LDA, and Our model]

Model linking to
Alpizar-Chacon, I., & Sosnovsky, S. (2019). Expanding the Web of Knowledge: One Textbook at a Time. In Proceedings of ACM Hypertext’2019: 30th
International Conference on Hypertext and Social Media (pp. 9-18). New York, NY, USA: ACM Press.

1. Construction of the Glossary
a) Index parsing
b) Term recognition
c) Glossary creation
• Preparation for the next phase
[Example: index entries under "D": Distribution 85, with nested entries Gamma 106 and Normal 92. Glossary terms (with candidate labels): "Distribution" (page 85); "Gamma Distribution" / "Distribution Gamma" (page 106); "Normal Distribution" / "Distribution Normal" (page 92)]
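The glossary step can be pictured as generating candidate labels for nested index entries by combining them with their parent term in both orders. The sketch below is an assumption about how that might look; the function name and the tuple format are hypothetical.

```python
def glossary_from_index(entries):
    """entries: list of (term, parent_term_or_None, page) tuples from the parsed index."""
    glossary = []
    for term, parent, page in entries:
        if parent is None:
            labels = [term]
        else:
            # A nested term is ambiguous on its own, so both word orders are kept as candidates.
            labels = [f"{term} {parent}", f"{parent} {term}"]
        glossary.append({"labels": labels, "page": page})
    return glossary


index_entries = [
    ("Distribution", None, 85),
    ("Gamma", "Distribution", 106),
    ("Normal", "Distribution", 92),
]
for item in glossary_from_index(index_entries):
    print(item)
```

Keeping both orders postpones the choice of the correct label to the later linking phase, where only one of the candidates is likely to match an external resource.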

2.a Core set construction
• We use index terms to query DBpedia => find matching resources
• DBpedia resources can have categories (e.g., Statistics)
• Categories form a hierarchy (e.g., Statistics / Statistical_models / ...)
• In the beginning, we select the target top category (define the domain)
  • This is the only manual input required
  • The algorithm looks 2 more levels deeper
• If a query retrieves only 1 DBpedia resource and it belongs to one of the target categories (dct:subject), this resource becomes part of the core set
• The dbo:abstract values of all core set resources are concatenated to form the domain context (used at Step 2.c)
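A minimal sketch of the core set selection, working on already retrieved query results rather than on a live DBpedia endpoint. The dictionary layout, the function name, and the sample resources are assumptions made for illustration.

```python
def build_core_set(query_results, target_categories):
    """query_results: term -> list of resources, each a dict with
    'uri', 'subjects' (dct:subject categories) and 'abstract' (dbo:abstract)."""
    core = {}
    for term, resources in query_results.items():
        # Only unambiguous terms (exactly one resource) inside the target categories qualify.
        if len(resources) == 1 and target_categories & set(resources[0]["subjects"]):
            core[term] = resources[0]
    # The domain context is the concatenation of the abstracts of all core resources.
    domain_context = " ".join(resource["abstract"] for resource in core.values())
    return core, domain_context


# Hypothetical query results for three index terms.
results = {
    "standard score": [
        {"uri": "dbr:Standard_score", "subjects": {"Statistical_ratios"}, "abstract": "number of standard deviations"},
    ],
    "normal": [
        {"uri": "dbr:Normal_distribution", "subjects": {"Statistics"}, "abstract": "probability distribution"},
        {"uri": "dbr:Normal_(album)", "subjects": {"Albums"}, "abstract": "studio album"},
    ],
    "euro coin": [
        {"uri": "dbr:Euro_coins", "subjects": {"Coins"}, "abstract": "coins of the eurozone"},
    ],
}
core, context = build_core_set(results, {"Statistics", "Statistical_ratios"})
print(sorted(core), "|", context)
```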

2.b Candidate set construction
• If a query retrieves several DBpedia resources
they form the candidate set of the term
• Context is gathered for every candidate resource:
• dbo:abstract of this resource +
• dbo:abstract’s of all resources linked to it
• Context helps during the next step

2.c Resource disambiguation
• For each resource from a candidate set
• Cosine similarity is computed between
the context of the resource and
the domain context
• The resource with the highest cosine similarity (and > threshold) is
matched to the term
• Newly obtained resources help to extend the domain context
• Step 2.c repeats until no more new terms can be matched
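The disambiguation loop can be sketched with plain bag-of-words vectors. The tokenisation, the threshold value, and the data layout are assumptions; the slides do not say how the contexts are vectorised or which threshold is used.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term frequencies (assumed representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def disambiguate(candidate_sets, domain_context, threshold=0.1):
    """candidate_sets: term -> {resource_uri: context_text}. Returns term -> chosen uri."""
    matched = {}
    changed = True
    while changed:  # repeat until a full pass matches no new term
        changed = False
        domain_vector = vectorize(domain_context)
        for term, candidates in candidate_sets.items():
            if term in matched or not candidates:
                continue
            uri, context = max(
                candidates.items(),
                key=lambda item: cosine(vectorize(item[1]), domain_vector),
            )
            if cosine(vectorize(context), domain_vector) > threshold:
                matched[term] = uri
                domain_context += " " + context  # newly matched resources extend the context
                changed = True
    return matched


candidates = {
    "normal": {
        "dbr:Normal_distribution": "probability distribution in statistics",
        "dbr:Normal_(album)": "studio album by a rock band",
    },
    "java": {"dbr:Java": "island of indonesia"},
}
print(disambiguate(candidates, "statistics probability distribution sample variance"))
```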

3. Model Enrichment
• Abstract
• Wikipedia link
• Categories
• Relation to other terms
• Multilingual information
• …
[Example: the term "standard score" linked to http://en.wikipedia.org/wiki/Standard_score, with abstracts in EN ("In statistics, the standard score is the (signed) number of standard deviations an observation or…"), FR ("En probabilités et statistiques, une variable centrée réduite est une variable aléatoire…"), and DE ("Unter Standardisierung oder z-Transformation versteht man in der mathematischen Statistik eine …"); dct:subject: Statistical Ratios; rdf:type: yago:WikicatStatisticalRatios; related term sharing the dct:subject: t-statistics]

03-12-2020
TEI Textbook Model
• Structure (sections)
• Content (words, lines, titles, etc.)
• Domain Knowledge (terms)
• + RDFa attributes

Evaluation: Linking to DBpedia
• Question: Are the index terms linked to the right DBpedia resources?
• Task: validate the resource disambiguation procedure
• BL1 (random baseline): a random resource in the candidate list is selected as the right resource
• BL2 (default sense baseline): the most linked/popular resource in the candidate list is selected as the right resource
• Ground truth was created manually
[Chart: results for Statistics#1, Statistics#2, and Information Retrieval]

Evaluation: Aggregation of Models
• Question: Would aggregation of additional textbooks move the model closer
to the ideal domain model (all relevant resources)?
• Ground truth: constructed based on the Glossary of statistical terms
• > 1000 terms
• Task: compare the matching between textbooks and DBpedia with the “ideal”
matching between the Glossary and the DBpedia
[Chart: Average single textbook, Average 5 textbooks, 10 textbooks]

Transformation of PDF textbooks into
interactive HTML
• Structure (sections)
• Content (words, lines, titles, etc.)
• Domain Knowledge (terms)
• + RDFa attributes
Alpizar-Chacon, I., van der Hart, M., Wiersma, Z., Theunissen, L., & Sosnovsky, S. (2020). Transformation of PDF Textbooks into Interactive Educational
Resources. In Proceedings of the Workshop on Intelligent Textbooks at AIEd'2020 (pp. 4-16). Online, July 6, 2020.

PDF to HTML converter
• Several open libraries available:
• pdf2htmlEX, PDFMiner, pdf2html, Xpdf, etc.
• pdf2htmlEX:
• preserves the layout perfectly across very different types of documents
• produces the same structure across different documents
• fast, stable, and scalable

TEI-HTML synchronizer

Validation
Test the accuracy of the matching algorithm for the TEI-HTML synchronization
• 70 university-level textbooks
• Domains: statistics, computer science, web programming, literature, history
• Evaluation metric: percentage of words that were matched between the TEI and HTML representations
• Results: 87-90%
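The evaluation metric can be illustrated with a simple word-overlap computation. This is only a stand-in: the actual synchronizer matches words by position in the two documents, and its algorithm is not described on the slide, so the bag-of-words overlap below is an assumption.

```python
from collections import Counter


def matched_word_percentage(tei_words, html_words):
    """Share of TEI words (in percent) that also occur in the HTML representation.

    Words are compared as multisets, so a repeated word only counts as matched
    as many times as it appears in both representations.
    """
    if not tei_words:
        return 0.0
    overlap = Counter(tei_words) & Counter(html_words)
    return 100.0 * sum(overlap.values()) / len(tei_words)


tei = "the standard score is the number of standard deviations".split()
html = "the standard score is the nurnber of standard deviations".split()  # one garbled word
print(f"{matched_word_percentage(tei, html):.1f}% of words matched")
```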

Current Work (1):
Extraction of accurate domain models from textbook indices
• Index entries have different roles
(different domain specificity):
- introduce core domain terms
<hypotheses testing>
- introduce related domain terms
<factorial>, <sample space>
- serve various pedagogical purposes (examples, use-cases,
data, etc.)
<Euro coin>, <Bovine Spongiform Encephalopathy>

Current Work (1):
Extraction of accurate domain models from textbook indices
Approach:
1. Use DBpedia to infer the domain specificity of matched index terms
2. Utilise DBpedia structure (categories and resources) and associated
textual content
3. Integrate indices from multiple textbooks to discover a "better"
domain model
Domains:
1. Statistics
2. Classic Philosophy

Current Work (2):
From tables of contents to topics
• Add rules for filtering out non-topical sections / TOC entries
• Explore how hierarchy, order and labels of topics can help
domain model extraction
• Create a global table of contents of the domain from
multiple textbooks
• Personalised textbook generation

Current Work (3):
assessment generation
• Use the rich intextbooks models (structured textual content annotated
with domain models, linked to DBpedia, linked to other textbooks) to
• generate self-assessment questions on demand
• targeting a specific subset of the model/content
- adaptive assessment generation

Thank you!
https://github.com/intextbooks/ITCore
https://intextbooks.science.uu.nl
Contact:
Isaac Alpizar-Chacon <i.alpizarchacon@uu.nl>
