2. Architecture of an AES
[Figure: architecture of an AES. Components: Instructional Content, Interaction, User Model, Adaptation Model, Adaptation, Metadata, Domain Model]
3. Math-Bridge: Rich Adaptive and Intelligent Textbooks
Sosnovsky, S., Dietrich, M., Andrès, E., Goguadze, G., Winterstein, S., Libbrecht, P., Siekmann, J., & Melis, E. (2014). Math-Bridge: Bridging the gaps in European remedial mathematics
with technology-enhanced learning. In T. Wassong, D. Frischemeier, P. R. Fischer, R. Hochmuth, & P. Bender (Eds.), Mit Werkzeugen Mathematik und Stochastik lernen – Using Tools for
Learning Mathematics and Statistics (pp. 437-451). Berlin/Heidelberg, Germany: Springer.
8. The Burden of Authoring
• Learning content authoring has always been tedious, expertise-demanding, and poorly supported
• Content and knowledge authoring for Adaptive Intelligent Systems requires a lot of extra effort
• Information and knowledge existing in the system should become not the authoring burden but the vehicle for authoring support
[Figure: three panels. "Authoring for e-Learning" (Instructional Content); "Authoring for Adaptive e-Learning" (Metadata, Instructional Content); "Authoring for Adaptive e-Learning as It Should Be" (Instructional Content)]
9. Semantic Gap Detection
Four main steps:
1. Conversion of Metadata to OWL2
2. Detection of Ontology Inconsistencies
3. Isolation of Causing Axioms
4. Generation of Verbal Explanations
Sosnovsky, S., & Alpizar-Chacon, I. (2014). Semantic gap detection in metadata of adaptive learning environments. In Proceedings of ICALT'2014: 14th International
Conference on Advanced Learning Technologies (pp. 548-552). IEEE Computer Society.
15. The Scale of the Problem
[Figure: Interaction, Adaptation]
16. Textbooks as a source of (extractable) knowledge
• Focus (narrow, cohesive domain)
• Quality (created by domain experts)
• Purpose (content explains domain knowledge to a novice)
• Structure: sections / subsections
• Order: easy to complex
• Formatting: of content and headers
• Additional structural elements: indices, tables of contents
• Topics/subtopics: underlying content, textual labels
• Pedagogical relations: prerequisites <-> outcomes
• Text types/roles and relations: header vs important vs regular; same format = same role
• Meaningful labels: glossary of curated meaningful terms; set of important domain categories
• If automatically extracted and formally represented
these elements will form the model of the textbook and
the model of the domain as the author understands it
17. Linking Textbooks to Ontologies
Topic-based model of an HTML-based Java
textbook automatically extracted and mapped
to a central ontology already linked to a set of
Java exercises
• Mapping serves as a bridge to jointly
interpret learner’s reading and exercise
attempts in terms of ontology and adapt
access to textbook pages accordingly
Project 1: Sosnovsky, S., Hsiao, I-H., & Brusilovsky, P. (2012). Adaptation “in the wild”: Ontology-based personalization of open-corpus learning material. In A. Ravenscroft, S.
Lindstaedt, C. Delgado Kloos, & D. Hernández-Leo (Eds.), Proceedings of EC-TEL'2012: 7th European Conference on Technology Enhanced Learning (pp. 425-431).
Berlin/Heidelberg, Germany: Springer.
18. Linking Textbooks to Textbooks
Several LDA-based techniques are used to interlink
sections from a set of HTML-based textbooks in a
domain
A manual mapping by experts is used as a gold standard
Project 2
Guerra, J., Sosnovsky, S., & Brusilovsky, P. (2013). When one textbook is not enough: Linking multiple textbooks using probabilistic topic models. In D.
Hernández-Leo, T. Ley, R. Klamma, & A. Harrer (Eds.), Proceedings of EC-TEL'2013: 8th European Conference on Technology Enhanced Learning (pp.
125-138). Berlin/Heidelberg, Germany: Springer.
19. Interlingua: linking textbooks
across languages
Project 3
[Figure: textbooks in DE, EN, and FR are each represented by a semantic model of the textbook, built from the table of contents (Chapter 1 > Section 1.1 > Subsection 1.1.1, Subsection 1.1.2, …) and the index (term -> page#); the models are connected through a Statistics ontology]
Alpizar-Chacon, I., van der Hart, M., Wiersma, Z., Theunissen, L., & Sosnovsky, S. (2020). Transformation of PDF Textbooks into Interactive Educational
Resources. In Proceedings of the Workshop on Intelligent Textbooks at AIEd'2020 (pp. 4-16). Online, July 6, 2020.
22. Model extraction from PDF textbooks
• PDF as the most common and challenging format
• 4 stages, 9 steps, 39 rules
Alpizar-Chacon, I., & Sosnovsky, S. (2020). Order out of Chaos: Construction of Knowledge Models from PDF Textbooks. In Proceedings of
DocEng’2020: The 20th ACM Symposium on Document Engineering, (Article No.: 8, pp 1–10). New York, NY, USA: ACM Press.
23. Example Rule
• REPEATED_LINES:
1. Select a sample of pages: $P_s = \{p_a, p_b, \ldots, p_m\}$, $P_s \subset P$.
2. If the first line(s) are identical across $P_s$: a header is detected and removed from all pages $p \in P$.
3. If the last line(s) are identical across $P_s$: a footer is detected and removed from all pages $p \in P$.
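The REPEATED_LINES rule can be illustrated with a minimal Python sketch. It assumes each page is already available as a list of text lines; the function name, the `sample_size` and `seed` parameters, and the exact-equality test are illustrative choices, not details taken from the published approach (real headers that contain page numbers would need a looser comparison).

```python
import random


def remove_repeated_lines(pages, sample_size=5, seed=0):
    """Strip leading/trailing lines that repeat across a sample of pages.

    pages: list of pages, each page a list of text lines.
    sample_size and seed are hypothetical parameters for this sketch.
    """
    if len(pages) < 2:
        return [list(p) for p in pages]
    rng = random.Random(seed)
    # Step 1: select a sample of pages P_s from P.
    sample = rng.sample(pages, min(sample_size, len(pages)))
    shortest = min(len(p) for p in sample)

    def count_repeated(from_end):
        # Count how many consecutive lines (from the top or from the bottom)
        # are identical on every sampled page.
        count = 0
        while count < shortest:
            idx = -(count + 1) if from_end else count
            if len({p[idx] for p in sample}) != 1:
                break
            count += 1
        return count

    n_head = count_repeated(from_end=False)  # Step 2: header lines
    n_foot = count_repeated(from_end=True)   # Step 3: footer lines
    # Remove the detected header/footer from all pages p in P.
    return [p[n_head:len(p) - n_foot] for p in pages]
```

Sampling keeps the check cheap on long books, at the cost of occasionally stripping a line from a page that was not in the sample and did not actually carry the header.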
25. 2. Role labeling of fragments
Style  Font Family      Font Size  Font Face  Font Color  Occurrences
1      Liberation Sans  35         Bold       Blue        3
2      Liberation Sans  18         Bold       Blue        1
3      Liberation Sans  9          -          Black       153
4      Liberation Sans  9          Bold       Black       2
Role labels shown on the slide: => Body text, Chapter, Subchapter
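One way to read the style table is that the most frequent style is body text and larger styles are headings. The sketch below encodes that reading; it is an assumption made for illustration (including the `label_styles` name and the tuple layout), not the rule set of the published method.

```python
from collections import Counter


def label_styles(fragments):
    """Assign roles to text styles by frequency and font size.

    fragments: list of (font_family, font_size, font_face, font_color)
    tuples, one per text fragment. Heuristic assumed for this sketch:
    the most frequent style is body text; styles with a larger font
    size are headings, ordered from largest to smallest.
    """
    counts = Counter(fragments)
    body_style, _ = counts.most_common(1)[0]
    body_size = body_style[1]
    heading_styles = sorted(
        (s for s in counts if s[1] > body_size),
        key=lambda s: s[1],
        reverse=True,
    )
    names = ["Chapter", "Subchapter"]
    labels = {body_style: "Body text"}
    for level, style in enumerate(heading_styles):
        # Levels beyond the two named on the slide get a generic label.
        labels[style] = names[level] if level < len(names) else f"Heading level {level + 1}"
    return labels
```

Styles that share the body font size (such as the bold variant in the table) are left unlabeled here, since the slide does not assign them a role.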
31. Potential Problems of These Models
• Variability: Structure, Labels, Order, Focus, Coverage
• Subjectivity: Same domain + Different authors = Different textbooks => Different models
• Quality: Completeness, Granularity, Consistency
• Lack of semantics: More structure than knowledge; Lack of links; Cohesiveness of topics and index terms
[Figure axis: Textbook-level, Model-level]
32. ..nevertheless
• They are automatically extracted models of high-quality resources and underlying domains
• Their individual quality might not be enough, but they can be aggregated
• Linking models to the existing ontologies should help filter out less relevant terms and extend them with additional semantic information
• Interlinking multiple models within the same domain should improve the coverage
33. Evaluation 1 (Accuracy of model extraction)
Domains: Statistics, Computer Science, History, Literature
34. Evaluation 1 (Accuracy of model extraction): Results
Averages over all domains
• Text Extraction: Our approach 93.85%; PDFBox 89.72%; PdfAct 84.19%
• TOC Recognition: Precision 99.92%; Recall 99.92%
• Index Recognition: Precision 98.56%; Recall 98.13%
36. Evaluation 2 (Value of Extracted Models – Semantic Linking of Textbooks): Method
• Ground truth
• Average of manual linking of two textbooks by three experts in statistics
• Measure:
• NDCG (normalized discounted cumulative gain) at 1, 3, and 5.
• Baselines:
• TFIDF model
• LDA model
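For readers unfamiliar with the measure, NDCG at a cutoff k can be computed as in the sketch below. It uses the common linear-gain formulation and derives the ideal ranking from the judged list itself; whether the evaluation used this exact variant is not stated in the slides, so treat both choices as assumptions.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(ranked_relevances, k):
    """NDCG@k for one query.

    ranked_relevances: relevance scores in the order the system ranked
    the candidate sections. The ideal ranking is obtained by sorting the
    same scores, which assumes the list holds every judged candidate.
    """
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

Calling `ndcg(scores, k)` for k in (1, 3, 5) and averaging over all source sections gives figures comparable in form to those reported at cutoffs 1, 3, and 5.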
38. Model linking to
Alpizar-Chacon, I., & Sosnovsky, S. (2019). Expanding the Web of Knowledge: One Textbook at a Time. In Proceedings of ACM Hypertext’2019: 30th
International Conference on Hypertext and Social Media (pp. 9-18). New York, NY, USA: ACM Press.
39. 1. Construction of the Glossary
a) Index parsing
b) Term recognition
c) Glossary creation
• Preparation for the next phase
[Figure: example. Index entries under "D": Distribution (p. 85) with subentries Gamma (p. 106) and Normal (p. 92). Resulting glossary terms with candidate labels: Distribution (85); Gamma Distribution / Distribution Gamma (106); Normal Distribution / Distribution Normal (92)]
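The Distribution example suggests how nested index entries can be expanded into glossary terms with candidate labels. The following is a minimal sketch under that reading; the input layout and the `build_glossary` name are hypothetical, and the actual term-recognition step is likely more involved.

```python
def build_glossary(index_entries):
    """Expand parsed index entries into glossary terms with candidate labels.

    index_entries: list of (main_term, page_or_None, subentries), where
    subentries is a list of (subterm, page). This layout is an assumption
    made for the sketch.
    """
    glossary = []
    for main, page, subentries in index_entries:
        if page is not None:
            glossary.append({"labels": [main], "page": page})
        for sub, sub_page in subentries:
            # Both word orders are kept as candidate labels, because the
            # index alone does not say which one is the proper term.
            glossary.append({
                "labels": [f"{sub} {main}", f"{main} {sub}"],
                "page": sub_page,
            })
    return glossary
```

Keeping both orderings defers the choice to the linking phase, where only the label that matches an external resource needs to survive.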
40. 2.a Core set construction
• We use index terms to query DBpedia => find matching resources
• DBpedia resources can have categories (e.g., Statistics)
• Categories form a hierarchy (e.g., Statistics / Statistical_models / ...)
• In the beginning, we select the target top category (define the domain)
  • The algorithm looks 2 more levels deeper
  • This is the only manual input required
• If a query retrieves only 1 DBpedia resource and it belongs to one of the target categories (dct:subject), this resource becomes part of the core set
• The dbo:abstract values of all core set resources are concatenated to form the domain context (used at Step 2.c)
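The core-set logic can be sketched in Python without touching the live DBpedia endpoint. In the sketch, `lookup`, `resources`, and `subcategories` are in-memory stand-ins for query results (hypothetical structures, not a DBpedia API), so only the selection logic is illustrated.

```python
def target_categories(top, subcategories, depth=2):
    """Collect the top category plus its descendants up to `depth` levels."""
    selected, frontier = {top}, {top}
    for _ in range(depth):
        frontier = {
            child
            for parent in frontier
            for child in subcategories.get(parent, [])
        } - selected
        selected |= frontier
    return selected


def build_core_set(terms, lookup, resources, targets):
    """Keep terms whose query returns exactly one resource in a target category.

    lookup: term -> list of resource ids (stand-in for a DBpedia query).
    resources: id -> {"subjects": [...], "abstract": "..."} (stand-in for
    dct:subject and dbo:abstract values).
    """
    core = {}
    for term in terms:
        hits = lookup.get(term, [])
        if len(hits) == 1 and targets & set(resources[hits[0]]["subjects"]):
            core[term] = hits[0]
    # The domain context is the concatenation of all core-set abstracts.
    domain_context = " ".join(resources[r]["abstract"] for r in core.values())
    return core, domain_context
```

Terms with several hits are deliberately skipped here; they fall to the candidate-set and disambiguation steps.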
41. 2.b Candidate set construction
• If a query retrieves several DBpedia resources
they form the candidate set of the term
• Context is gathered for every candidate resource:
• dbo:abstract of this resource +
• dbo:abstract’s of all resources linked to it
• Context helps during the next step
42. 2.c Resource disambiguation
• For each resource from a candidate set, cosine similarity is computed between the context of the resource and the domain context
• The resource with the highest cosine similarity (and > threshold) is matched to the term
• Newly obtained resources help to extend the domain context
• Step 2.c repeats until no more new terms can be matched
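The disambiguation loop can be sketched as follows. The bag-of-words vectors, the whitespace tokenizer, and the `threshold` default are assumptions made for illustration; the slides state only that cosine similarity against the domain context is used and that the step repeats as the context grows.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def disambiguate(candidates, domain_context, threshold=0.1):
    """Match each term to its best candidate resource, iterating to a fixpoint.

    candidates: term -> {resource_id: context_text}. The threshold value
    is a hypothetical default, not one reported by the authors.
    """
    matched = {}
    changed = True
    while changed:
        changed = False
        for term, options in candidates.items():
            if term in matched or not options:
                continue
            best = max(options, key=lambda r: cosine(options[r], domain_context))
            if cosine(options[best], domain_context) > threshold:
                matched[term] = best
                # A newly matched resource extends the domain context,
                # which may let previously unmatched terms pass later.
                domain_context += " " + options[best]
                changed = True
    return matched
```

The loop stops once a full pass adds no new match, mirroring the "repeats until no more new terms can be matched" condition.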
43. 3. Model Enrichment
• Abstract
• Wikipedia link
• Categories
• Relation to other terms
• Multilingual information
• …
[Figure: example enrichment of the term "standard score":
  EN abstract: "In statistics, the standard score is the (signed) number of standard deviations an observation or…"
  FR abstract: "En probabilités et statistiques, une variable centrée réduite est une variable aléatoire…"
  DE abstract: "Unter Standardisierung oder z-Transformation versteht man in der mathematischen Statistik eine …"
  Wikipedia link: http://en.wikipedia.org/wiki/Standard_score
  dct:subject: Statistical Ratios (shared with t-statistics, ……)
  rdf:type: yago:WikicatStatisticalRatios]
45. Evaluation: Linking to DBpedia
• Question: Are the index terms linked to the right DBpedia
resources?
• Task: validate the resource disambiguation procedure
• BL1 (random baseline): a random resource in the candidate list is selected as the right resource
• BL2 (default sense baseline): the most linked/popular resource in the candidate list is selected as the right resource
• Ground truth was created manually
[Figure: results for Statistics#1, Statistics#2, and Information Retrieval]
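The two baselines and the accuracy check against the manual ground truth can be sketched as below. Using an incoming-link count as the measure of "most linked/popular" is an assumption for this sketch, as are the function names and data layout.

```python
import random


def random_baseline(candidate_list, rng):
    """BL1: pick a random resource from the candidate list."""
    return rng.choice(candidate_list)


def default_sense_baseline(candidate_list, inlinks):
    """BL2: pick the most linked resource (inlink counts are assumed input)."""
    return max(candidate_list, key=lambda r: inlinks.get(r, 0))


def accuracy(predicted, ground_truth):
    """Share of ground-truth terms whose predicted resource is correct."""
    correct = sum(predicted.get(term) == res for term, res in ground_truth.items())
    return correct / len(ground_truth)
```

Running each baseline over every ambiguous term and passing the resulting term-to-resource mapping to `accuracy` yields numbers directly comparable with the disambiguation procedure's own score.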
46. Evaluation: Aggregation of Models
• Question: Would aggregation of additional textbooks move the model closer
to the ideal domain model (all relevant resources)?
• Ground truth: constructed based on the Glossary of statistical terms
• > 1000 terms
• Task: compare the matching between textbooks and DBpedia with the “ideal” matching between the Glossary and DBpedia
[Figure: results for Average single textbook, Average 5 textbooks, 10 textbooks]
47. Transformation of PDF textbooks into interactive HTML
• Structure (sections)
• Content (words, lines, titles, etc.)
• Domain Knowledge (terms)
+ RDFa attributes
Alpizar-Chacon, I., van der Hart, M., Wiersma, Z., Theunissen, L., & Sosnovsky, S. (2020). Transformation of PDF Textbooks into Interactive Educational
Resources. In Proceedings of the Workshop on Intelligent Textbooks at AIEd'2020 (pp. 4-16). Online, July 6, 2020.
48. PDF to HTML converter
03-12-2020
• Several open libraries available:
• pdf2htmlEX, PDFMiner, pdf2html, Xpdf, etc.
• pdf2htmlEX:
• preserves the layout perfectly across very different types of documents
• produces the same structure across different documents
• fast, stable, and scalable
51. Validation
Test the accuracy of the matching algorithm for the TEI-HTML synchronization
• 70 university-level textbooks
• Domains: statistics, computer science, web programming, literature, history
• Evaluation metric: percentage of words that were matched between the TEI and HTML representations
• Results: 87-90%
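The evaluation metric can be made concrete with a short sketch. The slides do not describe the matching algorithm itself, so Python's standard `difflib.SequenceMatcher` is used here purely as a stand-in aligner; only the final percentage corresponds to the metric named above.

```python
from difflib import SequenceMatcher


def matched_word_percentage(tei_words, html_words):
    """Percentage of TEI words aligned to a word in the HTML representation.

    SequenceMatcher is a stand-in for the actual matching algorithm,
    which the slides do not specify.
    """
    if not tei_words:
        return 0.0
    matcher = SequenceMatcher(a=tei_words, b=html_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(tei_words)
```

A hyphenation split in the HTML (for example `distri-` and `bution` as separate tokens) leaves the corresponding TEI word unmatched, which is one plausible reason for scores below 100%.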
52. Current Work (1):
Extraction of accurate domain models from textbook indices
• Index entries have different roles
(different domain specificity):
- introduce core domain terms
<hypotheses testing>
- introduce related domain terms
<factorial>, <sample space>
- serve various pedagogical purposes (examples, use-cases, data, etc.)
<Euro coin>, <Bovine Spongiform Encephalopathy>
53. Current Work (1):
Extraction of accurate domain models from textbook indices
Approach:
1. Use DBpedia to infer the domain specificity of matched index terms
2. Utilise DBpedia structure (categories and resources) and associated textual content
3. Integrate indices from multiple textbooks to discover a "better" domain model
Domains:
1. Statistics
2. Classic Philosophy
54. Current Work (2):
From tables of contents to topics
• Add rules for filtering out non-topical sections / TOC entries
• Explore how hierarchy, order and labels of topics can help
domain model extraction
• Create a global table of contents of the domain from
multiple textbooks
• Personalised textbook generation
55. Current Work (3):
assessment generation
• Use the rich Intextbooks models (structured textual content annotated with domain models, linked to DBpedia, linked to other textbooks) to
  • generate self-assessment questions on demand
  • targeting a specific subset of the model/content
  - adaptive assessment generation