2. Mathematical Language Is Everywhere
- textbooks
- academic papers
- Wikipedia articles
Difficult to extract and synthesize information from such a massive body of content
How to efficiently find relevant mathematical content?
3. The Mathematical Content Retrieval Problem
Difficult to extract and synthesize information from such a massive body of content
Desired: an efficient, automated system to aid in indexing, searching, and organizing mathematical content
We focus on formula retrieval:
- Search for and retrieve similar equations, given a query equation
4. The Mathematical Content Retrieval Problem
Current search engines lack the ability to effectively search for mathematical content
[Figure: query example from a machine learning textbook]
5. The Mathematical Content Retrieval Problem
Current search engines lack the ability to effectively search for mathematical content
A query equation in a machine learning textbook
Search results contain only specific characters that match the input query, but NOT the entire equation
7. Our Solution: Formula Representation via Tree Embeddings
A novel framework that learns a good representation of mathematical formulae
Based on the encoder-decoder architecture
● A novel encoding scheme: encode equations as trees
● A novel decoding scheme: generate equations as trees
[Diagram: formula → encoder → formula embedding → decoder → reconstructed formula; minimize this reconstruction loss]
8. Our Solution, part #1: Equation Encoding
Explicitly capture the semantic and syntactic information in an equation
Encoder (GRU)
9. Our Solution, part #1: Equation Encoding
Encoder (GRU)
The formula embedding that we will use in the formula retrieval task
10. Our Solution, part #1: Equation Encoding
Encoder (GRU)
After the encoding step:
- Decode to recover the input formula tree, using the formula embedding
- Tree beam search to improve reconstruction quality
11. Formula Retrieval Experiment
- 18 query formulae
- Train (and search) on 770k equations
- Compute the embedding of all equations and queries
- Compute the cosine similarity between all equations and each query
- For each query, choose the top 25 most relevant equations
- Human evaluation: compute % of relevant equations for each query
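The retrieval procedure above (embed everything, score by cosine similarity, keep the top 25 per query) can be sketched as follows. This is an illustrative sketch, not the paper's code: the names (`top_k`, `corpus_emb`) are hypothetical, and random vectors stand in for the real formula embeddings.

```python
import numpy as np

# Rank a corpus of equation embeddings against one query embedding by
# cosine similarity and return the indices of the k most similar items.
def top_k(corpus_emb: np.ndarray, query_emb: np.ndarray, k: int = 25):
    # Normalize rows so a plain dot product equals cosine similarity.
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = c @ q                    # cosine similarity of each item to the query
    idx = np.argsort(-sims)[:k]     # indices of the k most similar items
    return idx, sims[idx]

rng = np.random.default_rng(1)
corpus = rng.normal(size=(1000, 16))             # stand-in for 770k equation embeddings
query = corpus[42] + 0.01 * rng.normal(size=16)  # near-duplicate of item 42
idx, sims = top_k(corpus, query, k=25)           # item 42 should rank first
```

In practice the same normalized corpus matrix is reused across all queries, so each additional query costs one matrix-vector product.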
17. Summary
Framework to process equations via tree embeddings
- Novel encoder + decoder + beam search
- State-of-the-art formula retrieval performance
- Application to textbook math content search and beyond
Future work
- Joint math and text processing
- Deploy and pilot study at OpenStax
- Open-ended math solution feedback
Zhang et al. Math Operation Embeddings for Open-ended Solution Analysis and Feedback. To appear @EDM’21
https://arxiv.org/abs/2104.12047
Speaker notes
Hello my name is Jack Wang and today I am going to present my project on mathematical language processing.
The question we focus here is: how do we efficiently find relevant mathematical content?
In this talk, I will primarily focus on the problem of formula retrieval as a representative problem. Namely, given an equation, we would like to find the most relevant ones. You can think of this as a search engine such as Google but it is devoted to mathematical formulae.
The ability to search for formulae is useful for a number of education-related applications. For example, a student might want to search for relevant assessment questions given a query question, or for relevant content in a textbook given a query formula.
Here is a concrete hypothetical example. Say you have a machine learning textbook and you are searching for relevant formulae given a query formula.
Current search engines lack the ability to effectively search for formulae.
If you look at the retrieval results, you will find that they contain specific components that match the query, but not the entire formula.
This observation suggests that we need a method that better captures the semantics of a math formula such that a search engine can return the most relevant ones.
For example, this retrieval result is a good match to the query
In this project, we present a solution from a representation learning perspective. The starting point is that, we want to learn a good representation of math formulae, such that we can use this representation for the formula retrieval task.
Our solution is a novel framework that processes math formulae in the form of trees. This is because every formula can be inherently represented as a tree structure, and by explicitly learning their tree representations, our framework retains the inherent properties of formulae and therefore improves retrieval performance.
More specifically, the framework contains three key components. The first component is a tree encoder, which encodes the formula in its tree format into a vector representation, or embedding. The second component is a generator, which reconstructs the input formula tree. The entire pipeline is optimized end-to-end by minimizing the reconstruction error between the input formula tree and the reconstructed formula tree.
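As a toy illustration of this reconstruction objective, the following minimizes a reconstruction error end-to-end by gradient descent. This is an assumption-laden stand-in for exposition: a linear autoencoder on random vectors, not the paper's GRU tree encoder and tree decoder.

```python
import numpy as np

# Toy linear autoencoder (hypothetical stand-in for the tree encoder/decoder):
# minimize the mean squared reconstruction error ||X - decode(encode(X))||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))            # 64 "formulae", each an 8-dim feature vector
W_enc = 0.1 * rng.normal(size=(8, 3))   # encoder: 8-dim input -> 3-dim embedding
W_dec = 0.1 * rng.normal(size=(3, 8))   # decoder: 3-dim embedding -> 8-dim output

lr, losses = 0.05, []
for _ in range(200):
    Z = X @ W_enc                        # "formula embeddings"
    X_hat = Z @ W_dec                    # reconstructed "formulae"
    losses.append(float(np.mean((X - X_hat) ** 2)))
    G = 2.0 * (X_hat - X) / X.size       # gradient of the loss w.r.t. X_hat
    grad_dec = Z.T @ G                   # chain rule through the decoder
    grad_enc = X.T @ (G @ W_dec.T)       # chain rule through the encoder
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```

The reconstruction loss decreases over training, which is the same end-to-end signal that shapes the formula embeddings in the real framework.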
As I mentioned earlier, this step allows us to explicitly capture the semantic and syntactic information in an equation.
This embedding is what we will use for the formula retrieval task.
To complete the pipeline, after the encoding step we use a decoder that reconstructs the input formula in its tree format. To improve reconstruction quality, we also develop a beam search algorithm specifically for tree-structured data. I'll skip the technical details, but you can find them in the paper.
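The core idea of beam search is to keep only the B highest-scoring partial hypotheses at each decoding step. A minimal sketch in its standard sequence form (the paper's tree-structured variant is not shown here), using a hypothetical toy model `toy_step`:

```python
import heapq
from math import log

# Generic beam search over token sequences: expand every live hypothesis,
# then keep only the beam_width highest-scoring candidates.
def beam_search(step_fn, start, beam_width=3, max_len=5, end_token="<end>"):
    beams = [(0.0, [start])]  # each hypothesis: (cumulative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:
                candidates.append((score, seq))  # finished: carry over as-is
                continue
            for tok, p in step_fn(seq):          # step_fn yields (token, prob)
                candidates.append((score + log(p), seq + [tok]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda b: b[0])

# Hypothetical toy "model": after "a", prefer "b" over "c"; then always end.
def toy_step(seq):
    if seq[-1] == "a":
        return [("b", 0.7), ("c", 0.3)]
    return [("<end>", 1.0)]

best_score, best_seq = beam_search(toy_step, "a")
```

With `beam_width=1` this degenerates to greedy decoding; a wider beam trades compute for reconstruction quality, which is the motivation for using it here.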
We validate our framework on a formula retrieval task. In this task, we have 18 query formulae.
Here are some examples of queries. You can see that they are diverse in appearance and subject domain.
First, we observe that our method outperforms the other data-driven baseline on both metrics.
So we develop a new method that combines the strengths of both our method and Approach0. We can see that this method achieves state-of-the-art performance on this formula retrieval task.
We can see that our method retrieves equations that are semantically and structurally more similar to the query, whereas the tangentCFT baseline fails to do so in some cases.
I also want to visualize what the learnt formula representations look like. Here, we choose a small number of formulae from different math topics and plot their two-dimensional t-SNE embeddings.
We can see that these embeddings form nice clusters, which indicates that our model learns meaningful representations of these formulae.
And finally, we can apply our method to analyze students' step-wise answers to open-ended math questions. We have a paper that is going to appear at the Educational Data Mining conference later this month; the arXiv version is already out. If you are interested, you are welcome to check out the paper and attend our talk at EDM to learn more. Thanks.