Isaac Alpizar-Chacon and Sergey Sosnovsky
Utrecht University
Utrecht, The Netherlands
Order out of Chaos: Construction of Knowledge
Models from PDF Textbooks
Motivation
• Textbooks are high-quality textual resources
• Textbooks are non-structured resources
• Table of Contents provides browsing aid
• Index provides searching aid
• Authors use their understanding of the domain while creating textbooks
• Formatting and structuring conventions provide meaningful information
Goal
The automated extraction of machine-readable textbook models
Q1: can knowledge be automatically extracted from textbooks?
Q2: what would be the quality and the value of such models?
Rule-based workflow
PDF as the most common and challenging format
4 stages, 9 steps, 39 rules
Example Rule
• REPEATED_LINES:
1. Create a sample of pages: $P_s = \{p_a, p_b, \ldots, p_m\}$, with $P_s \subset P$.
2. If the first line(s) are identical across $P_s$: header is detected and removed in all pages $p \in P$.
3. If the last line(s) are identical across $P_s$: footer is detected and removed in all pages $p \in P$.
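The REPEATED_LINES steps above could be sketched in Python roughly as follows. This is an illustrative sketch, not the authors' implementation: the names `count_repeated` and `remove_repeated_lines`, the default sample size of 5, and taking the sample from the middle of the book are all assumptions.

```python
from typing import List

Page = List[str]  # a page is represented as a list of text lines


def count_repeated(sample: List[Page], from_top: bool) -> int:
    """Count how many leading (or trailing) lines are identical across the sample."""
    if not sample:
        return 0
    shortest = min(len(page) for page in sample)
    repeated = 0
    for offset in range(shortest):
        index = offset if from_top else -1 - offset
        # The line at this position must be the same on every sampled page.
        if len({page[index] for page in sample}) != 1:
            break
        repeated += 1
    return repeated


def remove_repeated_lines(pages: List[Page], sample_size: int = 5) -> List[Page]:
    """Detect header/footer lines on a sample P_s and strip them from all pages in P."""
    if len(pages) < 2:
        return [list(page) for page in pages]
    # Assumption: the sample is a run of consecutive pages from the middle of the book.
    start = max(0, (len(pages) - sample_size) // 2)
    sample = pages[start:start + sample_size]
    header = count_repeated(sample, from_top=True)
    footer = count_repeated(sample, from_top=False)
    cleaned = []
    for page in pages:
        end = len(page) - footer
        cleaned.append(page[header:end] if end > header else [])
    return cleaned
```

As on the slide, the detected lines are removed from every page. A production rule would probably also check that each page actually carries the detected header or footer before stripping it.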
Elements identified in TOC and Index sections
Textbook model
• Structure (sections)
• Content (words, lines, etc.)
• Domain Knowledge (terms)
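One way to picture the three parts of the textbook model (structure, content, domain knowledge) is as a small set of data classes. The class and field names below (`Section`, `IndexTerm`, `TextbookModel`, `flat_sections`) are hypothetical and chosen only for illustration; the slides do not specify the actual representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """Structure: a chapter or sub-chapter, holding its own content lines."""
    title: str
    start_page: int
    lines: List[str] = field(default_factory=list)              # content
    subsections: List["Section"] = field(default_factory=list)  # structure


@dataclass
class IndexTerm:
    """Domain knowledge: an index term and the pages it refers to."""
    label: str
    pages: List[int] = field(default_factory=list)


@dataclass
class TextbookModel:
    title: str
    sections: List[Section] = field(default_factory=list)
    terms: List[IndexTerm] = field(default_factory=list)

    def flat_sections(self) -> List[Section]:
        """Return all sections and sub-sections in reading order."""
        result = []
        stack = list(reversed(self.sections))
        while stack:
            section = stack.pop()
            result.append(section)
            stack.extend(reversed(section.subsections))
        return result
```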
Accuracy of the extraction of the models
Domains: Statistics, Computer Science, History, Literature
Accuracy of the extraction of the models: Results
Averages over all domains
Text Extraction: Our approach 93.85%, PDFBox 89.72%, PdfAct 84.19%
TOC Recognition: Precision 99.92%, Recall 99.92%
Index Recognition: Precision 98.56%, Recall 98.13%
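For readers unfamiliar with the two recognition measures, the sketch below shows the standard set-based definitions of precision and recall. The slides do not state how the extracted entries were matched against the reference entries, so exact string matching and the function name `precision_recall` are assumptions.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision = correct / extracted; recall = correct / expected."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, if 4 index terms are extracted and 3 of them appear among the 5 terms in the reference index, precision is 3/4 and recall is 3/5.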
Application of the textbook models
[Figure: chapter and sub-chapter hierarchies of Book#1 (Chap1 to Chap3) and Book#2 (Chap1 to Chap4), shown twice]
Application of the textbook models
• Linking model:
• A term-based Vector Space Model (VSM) with 1611 terms from two books
• VSM applied to all chapters and sub-chapters of both books
• Measure:
• NDCG (normalized discounted cumulative gain) at 1, 3, and 5.
• Baselines:
• TFIDF model
• LDA model
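A term-based Vector Space Model of this kind could be sketched as follows: each section becomes a vector of term counts over a shared vocabulary, and sections are ranked by cosine similarity. The raw-count weighting, the naive substring matching, and the function names are assumptions made for illustration; the slides do not describe the weighting scheme actually used.

```python
import math
from collections import Counter
from typing import Dict, List


def term_vector(text: str, vocabulary: List[str]) -> Counter:
    """Count occurrences of each vocabulary term in the text (naive substring match)."""
    lowered = text.lower()
    return Counter({t: lowered.count(t.lower()) for t in vocabulary if t.lower() in lowered})


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sections(query: str, candidates: Dict[str, str], vocabulary: List[str]) -> List[str]:
    """Rank candidate sections of the other book by similarity to the query section."""
    query_vec = term_vector(query, vocabulary)
    scores = {name: cosine(query_vec, term_vector(text, vocabulary))
              for name, text in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Here the vocabulary plays the role of the 1611 terms taken from the two books, the query is a section of one book, and the candidates are the sections of the other.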
Application of the textbook models: Results
[Figure: bar chart of NDCG@1, NDCG@3, and NDCG@5 (y-axis from 0 to 0.9) for TFIDF, LDA, TFIDF+LDA, and our model]
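The NDCG@k measure used in this evaluation can be computed as in the sketch below. It uses the common formulation with a log2 discount and takes the ideal ranking from the same list of graded relevance judgments; whether the authors used exactly this variant is an assumption.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: List[float], k: int) -> float:
    """DCG of the produced ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

The input is the list of relevance grades of the linked sections, in the order the model ranked them.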
Summary
• Our rule-based approach allows the automated extraction of knowledge models
(Q1)
• Our first evaluation experiment shows that the approach is capable of
processing PDF textbooks with high accuracy (Q2)
• The linking of sections across textbooks within the same domain demonstrates
the added value of the extracted models (Q2)
Q1: can knowledge be automatically extracted from textbooks?
Q2: what would be the quality and the value of such models?
Related work
• We have integrated individual textbooks within the same domain with each other and with the Linked Open Data cloud using DBpedia
[Figure labels: Mean, Venn Diagram, …]
• Our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources
Future work
• We plan to use the information in both the Table of Contents and the Index
more extensively:
• Each chapter/subchapter can potentially be treated as a topic/subtopic
annotated with terms in the domain thanks to the explicit connections
between the terms in the index section and the different content sections
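The idea of annotating each chapter or sub-chapter with index terms could look roughly like this: a term is attached to every section whose page range contains one of the term's index page references. The page-range representation and the name `annotate_sections` are hypothetical; this is only a sketch of the planned direction, not an existing feature.

```python
from typing import Dict, List, Tuple


def annotate_sections(
    sections: List[Tuple[str, int, int]],  # (title, first_page, last_page)
    index: Dict[str, List[int]],           # term -> pages listed in the index
) -> Dict[str, List[str]]:
    """Attach each index term to the sections whose page range it falls into."""
    annotations: Dict[str, List[str]] = {title: [] for title, _, _ in sections}
    for term in sorted(index):
        for title, first, last in sections:
            if any(first <= page <= last for page in index[term]):
                annotations[title].append(term)
    return annotations
```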
Thank you!
https://github.com/intextbooks/ITCore
https://intextbooks.science.uu.nl
Contact:
Isaac Alpizar-Chacon <i.alpizarchacon@uu.nl>



Editor's notes

  1. Hello and welcome to this presentation. My name is Isaac, I am a PhD student at Utrecht University, and I will be describing our work, Order out of Chaos: Construction of Knowledge Models from PDF Textbooks.
  2. I will start by saying that textbooks are high-quality textual resources, but they are often considered to be unstructured. However, if we look carefully at how textbooks are made, they provide a lot of information. The Table of Contents provides a browsing aid, and the Index provides a searching aid and terms in the domain. The authors use their understanding of the domain while creating textbooks, and we use their formatting and structuring conventions to extract meaningful information.
  3. (pause: 2) Our goal is to achieve the automated extraction of machine-readable textbooks models. This goal involves two research questions: (pause: 1) First, can knowledge be automatically extracted from textbooks? And second, what would be the quality and the value of such models? Our work seeks to answer these questions.
  4. (pause: 2) We developed a rule-based approach for the extraction of the knowledge models. We focus on PDF as the most common and challenging digital textbook format. Our workflow has 4 stages, 9 steps, and 39 rules. (pause: 1) The modular nature of the rule-based approach support its gradual refinement. Each time we encounter a new variation of a formatting or structural pattern, we extend the approach by modifying an existing rule or adding a new one.
  5. (pause: 2) In the diagram we can see the complete workflow. The first stage is the text extraction to reconstruct all the words, lines, and pages from the PDF. In the second stage, the workflow assigns role labels, such as section heading, subheading, important text, and body text, to each text fragment. This process facilitates the subsequent recognition of different logical elements of the textbook. The third large stage of the workflow is to recognize all different logical elements within a textbook. First, auxiliary elements such as page numbers and headers are filtered out. Then, the individual entries of the table of contents are recognized and processed. Later, each index term is identified. Finally, individual sections are recognized. In the final stage we construct the textbook model, which can be later enriched with external information.
  6. (pause: 2) To give you one example of what the rules look like, we have the _repeated lines_ rule, which is used to detect general page headers and footers. This rule is part of the auxiliary elements filtering step. (pause: 1) First, we create a sample of consecutive pages from all the pages in the textbook. Then, if the first lines in each page of the sample are the same, a header is detected and removed from all the pages of the textbook. Footers are detected in a similar way, but by comparing the last lines of the pages in the sample.
  7. (pause: 2) The rules are used to identify different elements in the textbooks. In the table of contents, we use them to detect the pages that belong to the TOC, non-content sections like notation or preface, chapter and subchapter entries, and entries that are split across multiple lines, and to identify one of three possible types of TOC: flat, flat-ordered, or indented. (pause: 1) For the index sections, the rules identify the pages that belong to the section, the heading and page references of the terms, multiline terms, different types of terms like cross-references, and nested groups of terms.
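
The repeated lines rule described in this note can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names, the `sample_size` and `max_lines` parameters, and the representation of a page as a list of text lines are all assumptions made for the example.

```python
from typing import List

Page = List[str]  # assumption: a page is a list of text lines, top to bottom


def count_repeated(sample: List[Page], from_top: bool, max_lines: int = 3) -> int:
    """Count how many leading (or trailing) lines are identical across the sample."""
    repeated = 0
    for offset in range(max_lines):
        # Stop if some page in the sample is too short to have this line.
        if any(len(page) <= offset for page in sample):
            break
        idx = offset if from_top else -(offset + 1)
        if len({page[idx].strip() for page in sample}) != 1:
            break
        repeated += 1
    return repeated


def remove_repeated_lines(pages: List[Page], sample_start: int = 0,
                          sample_size: int = 5) -> List[Page]:
    """Detect headers/footers on a sample of pages and strip them from all pages."""
    sample = pages[sample_start:sample_start + sample_size]
    if len(sample) < 2:  # nothing to compare against
        return [list(page) for page in pages]
    n_header = count_repeated(sample, from_top=True)
    n_footer = count_repeated(sample, from_top=False)
    # As in the rule on the slide, the detected lines are removed from every page.
    return [page[n_header:len(page) - n_footer] for page in pages]


pages = [
    ["Book Title", "text a", "Page footer"],
    ["Book Title", "text b", "Page footer"],
    ["Book Title", "text c", "text d", "Page footer"],
]
cleaned = remove_repeated_lines(pages, sample_size=3)
```

Detecting on a small sample and then removing everywhere keeps the check cheap, at the cost of also stripping a line from pages that happen not to carry the header or footer.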
  8. (pause: 2) At the end of the workflow we construct a textbook model using the Text Encoding Initiative, which is a standard for digital representation of texts. In the model we group the information in 3 categories: structure, content, and domain knowledge. (pause: 1) The structure section contains the name and precise start and end page of each chapter and subchapter of the textbook. The content includes the textual information structured as words, lines, fragments, and pages for each chapter and subchapter. Finally, the domain knowledge contains all the important terms in the domain extracted from the index section.
  9. (pause: 2) To test the accuracy of the extraction of the models, we extracted the models using our rule-based approach and using the EPUB version of the same textbooks. In the EPUB textbooks the information is already structured and marked up, so it is easy to extract and it is accurate. We hypothesize that if the information obtained from the two versions of a textbook matches, the approach processes PDFs correctly. (pause: 1) We used textbooks from 4 different domains: Statistics, Computer Science, History, and Literature.
  10. (pause: 2) Results from this first evaluation show that our approach has high accuracy. (pause: 1) For the text extraction aspect, we also compared our approach against 2 other tools as baselines. Our approach achieved the highest similarity, followed by PDFBox and then PdfAct. We don’t reach 100 percent similarity mostly because of formulae, charts, and tables that are images in the EPUB but text in the PDF version. An additional effect of the rules that improve textual extraction, along with the rules for recognizing auxiliary page elements, is a cleaner textual version of the textbook, as seen when our approach is compared against the out-of-the-box PDFBox tool that lacks these features. (pause: 1) For the recognition of the individual entries in the Table of Contents, we reach a precision and recall of almost 100%. (pause: 1) Precision and recall are also very high for the recognition of the index terms.
  11. (pause: 2) We also studied one of the possible knowledge-driven applications of the extracted models: we used the models of two textbooks to cross-link relevant sections. The idea is that any chapter or subchapter from the first textbook can be linked to any chapter or subchapter of the second textbook to identify similar sections.
  12. (pause: 2) We constructed a linking model using a term-based Vector Space Model (VSM) with 1,611 terms from the two books. Then, the VSM was applied to all chapters and sub-chapters of both books. The sections were annotated with the terms according to the knowledge models extracted from the textbooks’ indices. The inner product of these annotations was used to compute the similarity between all sections of book 1 and all sections of book 2. We used the normalized discounted cumulative gain (NDCG) to measure the quality of the relevance ranking. NDCG@1 measures the effectiveness of retrieving the most relevant document, while @3 and @5 measure the capability of the retrieval system to find the first three and five most relevant documents, respectively. We used a manual linking produced by experts as the ground truth for the NDCG measures. Finally, we used two baselines for comparison: the standard TFIDF model and an LDA model. Both baselines used the textual content of each part of the textbooks with basic preprocessing (lowercasing, stop-word removal, and stemming).
  13. (pause: 2) The results show that the proposed model consistently outperforms all baselines, as seen with the yellow bar in the graph. (pause: 2) The difference between our model and the baselines is the highest for NDCG@1. The semantic information placed by the authors of textbooks in the index sections and extracted by our approach helps our linking model find 72% of the best possible matches between the textbook sections. As the number of potential matches increases, the difference between NDCG scores diminishes due to the ceiling effect. (pause: 2)
  14. (pause: 2) In summary, we developed a rule-based approach that allows the automated extraction of knowledge models. This answers our first research question. Our first evaluation experiment shows that the approach is capable of processing PDF textbooks with high accuracy. And the linking of sections across textbooks within the same domain demonstrates the added value of the extracted models. The two evaluation experiments answer our second research question. (pause: 2)
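
As a reminder of how precision and recall are computed for recognized entries, here is a minimal sketch. The talk does not say how an extracted entry is matched to its EPUB counterpart, so the exact string match used here, the function name, and the sample entries are assumptions for illustration only.

```python
from typing import Iterable, Tuple


def precision_recall(extracted: Iterable[str], reference: Iterable[str]) -> Tuple[float, float]:
    """Compare entries recognized in the PDF against the EPUB reference entries."""
    extracted_set, reference_set = set(extracted), set(reference)
    true_positives = len(extracted_set & reference_set)
    # Precision: share of extracted entries that are correct.
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    # Recall: share of reference entries that were found.
    recall = true_positives / len(reference_set) if reference_set else 0.0
    return precision, recall


# Hypothetical TOC entries, not taken from the evaluated textbooks.
pdf_entries = ["1 Introduction", "1.1 Data", "2 Probability", "Preface"]
epub_entries = ["1 Introduction", "1.1 Data", "2 Probability", "2.1 Events"]
precision, recall = precision_recall(pdf_entries, epub_entries)
```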
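
The linking and its evaluation can be illustrated with a small sketch. It assumes binary term weights (the talk does not state the weighting scheme) and the common log2-discount form of NDCG; the section names, term vectors, and expert relevance grades below are made up for the example.

```python
import math
from typing import Dict, List

TermVector = Dict[str, float]  # term -> weight (binary weights assumed here)


def inner_product(a: TermVector, b: TermVector) -> float:
    """Similarity of two sections as the inner product of their term vectors."""
    return sum(weight * b[term] for term, weight in a.items() if term in b)


def rank_sections(query: TermVector, candidates: Dict[str, TermVector]) -> List[str]:
    """Rank candidate sections of the other book by similarity to the query section."""
    return sorted(candidates,
                  key=lambda name: inner_product(query, candidates[name]),
                  reverse=True)


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], k: int) -> float:
    """DCG of the produced ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical section of book 1 and candidate sections of book 2.
query = {"mean": 1.0, "variance": 1.0}
candidates = {
    "B2.Ch1": {"mean": 1.0, "median": 1.0},
    "B2.Ch2": {"mean": 1.0, "variance": 1.0},
    "B2.Ch3": {"venn diagram": 1.0},
}
expert_relevance = {"B2.Ch1": 2.0, "B2.Ch2": 1.0, "B2.Ch3": 0.0}  # made-up ground truth

ranking = rank_sections(query, candidates)
ranked_relevances = [expert_relevance[name] for name in ranking]
ndcg1 = ndcg_at_k(ranked_relevances, 1)
ndcg3 = ndcg_at_k(ranked_relevances, 3)
```

In this toy case the expert's best match is ranked second, so NDCG@1 is penalized more heavily than NDCG@3.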
  15. (pause: 2) Related to this work, we have taken individual textbooks within the same domain and integrated them with each other and with the Linked Open Data cloud using DBpedia. For example, individual terms like mean and Venn diagram are linked to their corresponding resources in DBpedia. (pause: 2) Also, our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources. (pause: 2)
  16. (pause: 2) As future work, we plan to use the information in both the Table of Contents and the Index more extensively: Each chapter/subchapter can potentially be treated as a topic/subtopic annotated with terms in the domain thanks to the explicit connections between the terms in the index section and the different content sections.
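
One way to picture the connection between index terms and content sections is the sketch below: each index term carries page references, each section has a page range, and a section is annotated with every term referenced inside its range. The `Section` class, the `annotate_sections` function, and the sample data are hypothetical and are not taken from the authors' model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Section:
    title: str
    start_page: int
    end_page: int  # inclusive
    terms: Set[str] = field(default_factory=set)


def annotate_sections(sections: List[Section], index: Dict[str, List[int]]) -> None:
    """Attach each index term to every section whose page range contains a reference."""
    for term, page_refs in index.items():
        for page in page_refs:
            for section in sections:
                if section.start_page <= page <= section.end_page:
                    section.terms.add(term)


# Hypothetical structure and index entries.
sections = [
    Section("1 Descriptive statistics", 1, 20),
    Section("2 Probability", 21, 40),
]
index = {"mean": [3, 25], "venn diagram": [22]}
annotate_sections(sections, index)
```

Because a page can fall inside both a chapter and one of its subchapters, the same loop annotates both levels when nested sections are supplied.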
  17. (pause: 2) Finally, I invite you to check out our GitHub project, and to use our web service to create textbook models. Thank you for your attention! (pause: 2)