2. Mathematical Language Is Everywhere
- textbooks
- academic papers
- Wikipedia articles
Difficult to extract and synthesize information from such a massive body of content
How to efficiently find relevant mathematical content?
3. The Mathematical Content Retrieval Problem
Difficult to extract and synthesize information from such a massive body of content
Desired: an efficient, automated system to aid in indexing, searching, and organizing mathematical content
We focus on formula retrieval:
- Search for and retrieve similar equations, given a query equation
4. The Mathematical Content Retrieval Problem
Current search engines lack the ability to effectively search for mathematical content
[Figure: query example from a machine learning textbook]
5. The Mathematical Content Retrieval Problem
Current search engines lack the ability to effectively search for mathematical content
A query equation in a machine learning textbook
Search results contain only specific characters that match the input query, but NOT the entire equation
7. Our Solution: Formula Representation via Tree Embeddings
A novel framework that learns a good representation of mathematical formulae
Based on the encoder-decoder architecture
● A novel encoding scheme: encode equations as trees
● A novel decoding scheme: generate equations as trees
[Diagram: formula → encoder → formula embedding → decoder → reconstructed formula; minimize this reconstruction loss]
8. Our Solution, part #1: Equation Encoding
Explicitly capture the semantic and syntactic information in an equation
Encoder (GRU)
9. Our Solution, part #1: Equation Encoding
Encoder (GRU)
The formula embedding that we will use in the formula retrieval task
10. Our Solution, part #1: Equation Encoding
Encoder (GRU)
After the encoding step:
- Decode to recover the input formula tree, using the formula embedding
- Tree beam search to improve reconstruction quality
11. Formula Retrieval Experiment
- 18 query formulae
- Train (and search) on 770k equations
- Compute the embedding of all equations and queries
- Compute the cosine similarity between all equations and each query
- For each query, choose the top 25 most relevant equations
- Human evaluation: compute % of relevant equations for each query
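The retrieval procedure above (embed everything, score by cosine similarity, keep the top 25 per query) can be sketched as follows. This is an illustrative sketch, not the paper's code: the names (`top_k`, `corpus_emb`) are hypothetical, and random vectors stand in for the real formula embeddings.

```python
import numpy as np

# Rank a corpus of equation embeddings against one query embedding by
# cosine similarity and return the indices of the k most similar items.
def top_k(corpus_emb: np.ndarray, query_emb: np.ndarray, k: int = 25):
    # Normalize rows so a plain dot product equals cosine similarity.
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = c @ q                    # cosine similarity of each item to the query
    idx = np.argsort(-sims)[:k]     # indices of the k most similar items
    return idx, sims[idx]

rng = np.random.default_rng(1)
corpus = rng.normal(size=(1000, 16))             # stand-in for 770k equation embeddings
query = corpus[42] + 0.01 * rng.normal(size=16)  # near-duplicate of item 42
idx, sims = top_k(corpus, query, k=25)           # item 42 should rank first
```

In practice the same normalized corpus matrix is reused across all queries, so each additional query costs one matrix-vector product.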
17. Summary
Framework to process equations via tree embeddings
- Novel encoder + decoder + beam search
- State-of-the-art formula retrieval performance
- Application to textbook math content search and beyond
Future work
- Joint math and text processing
- Deploy and pilot study at OpenStax
- Open-ended math solution feedback
Zhang et al. Math Operation Embeddings for Open-ended Solution Analysis and Feedback. To appear @EDM’21
https://arxiv.org/abs/2104.12047
Speaker notes
Hello my name is Jack Wang and today I am going to present my project on mathematical language processing.
The question we focus here is: how do we efficiently find relevant mathematical content?
In this talk, I will primarily focus on the problem of formula retrieval as a representative problem. Namely, given an equation, we would like to find the most relevant ones. You can think of this as a search engine such as Google but it is devoted to mathematical formulae.
The ability to search for formulae is useful for a number of education-related applications. For example, a student might want to search for relevant assessment questions given a query question, or for relevant content in a textbook given a query formula.
Here is a concrete hypothetical example. Say you have a machine learning textbook and you are searching for relevant formulae given a query formula.
Current search engines lack the ability to effectively search for formulae.
If you look at the retrieval results, you will find that they contain specific components that match the query, but not the entire formula.
This observation suggests that we need a method that better captures the semantics of a math formula such that a search engine can return the most relevant ones.
For example, this retrieval result is a good match to the query
In this project, we present a solution from a representation learning perspective. The starting point is that, we want to learn a good representation of math formulae, such that we can use this representation for the formula retrieval task.
Our solution is a novel framework that processes math formulae in the form of trees. This is because every formula can be inherently represented as a tree structure, and by explicitly learning their tree representations, our framework retains the inherent properties of formulae and therefore improves retrieval performance.
More specifically, the framework contains three key components. The first component is a tree encoder, which encodes the formula in its tree format into a vector representation, or embedding. The second component is a generator, which reconstructs the input formula tree. The entire pipeline is optimized end-to-end by minimizing the reconstruction error between the input formula tree and the reconstructed formula tree.
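As a toy illustration of this reconstruction objective, the following minimizes a reconstruction error end-to-end by gradient descent. This is an assumption-laden stand-in for exposition: a linear autoencoder on random vectors, not the paper's GRU tree encoder and tree decoder.

```python
import numpy as np

# Toy linear autoencoder (hypothetical stand-in for the tree encoder/decoder):
# minimize the mean squared reconstruction error ||X - decode(encode(X))||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))            # 64 "formulae", each an 8-dim feature vector
W_enc = 0.1 * rng.normal(size=(8, 3))   # encoder: 8-dim input -> 3-dim embedding
W_dec = 0.1 * rng.normal(size=(3, 8))   # decoder: 3-dim embedding -> 8-dim output

lr, losses = 0.05, []
for _ in range(200):
    Z = X @ W_enc                        # "formula embeddings"
    X_hat = Z @ W_dec                    # reconstructed "formulae"
    losses.append(float(np.mean((X - X_hat) ** 2)))
    G = 2.0 * (X_hat - X) / X.size       # gradient of the loss w.r.t. X_hat
    grad_dec = Z.T @ G                   # chain rule through the decoder
    grad_enc = X.T @ (G @ W_dec.T)       # chain rule through the encoder
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```

The reconstruction loss decreases over training, which is the same end-to-end signal that shapes the formula embeddings in the real framework.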
As I mentioned earlier, this step allows us to explicitly capture the semantic and syntactic information in an equation.
This embedding is what we will use for the formula retrieval task.
To complete the pipeline, after the encoding step we use a decoder that reconstructs the input formula in its tree format. To improve reconstruction quality, we also develop a beam search algorithm specifically for tree-structured data. I'll skip the technical details, but you can find them in the paper.
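The core idea of beam search is to keep only the B highest-scoring partial hypotheses at each decoding step. A minimal sketch in its standard sequence form (the paper's tree-structured variant is not shown here), using a hypothetical toy model `toy_step`:

```python
import heapq
from math import log

# Generic beam search over token sequences: expand every live hypothesis,
# then keep only the beam_width highest-scoring candidates.
def beam_search(step_fn, start, beam_width=3, max_len=5, end_token="<end>"):
    beams = [(0.0, [start])]  # each hypothesis: (cumulative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:
                candidates.append((score, seq))  # finished: carry over as-is
                continue
            for tok, p in step_fn(seq):          # step_fn yields (token, prob)
                candidates.append((score + log(p), seq + [tok]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda b: b[0])

# Hypothetical toy "model": after "a", prefer "b" over "c"; then always end.
def toy_step(seq):
    if seq[-1] == "a":
        return [("b", 0.7), ("c", 0.3)]
    return [("<end>", 1.0)]

best_score, best_seq = beam_search(toy_step, "a")
```

With `beam_width=1` this degenerates to greedy decoding; a wider beam trades compute for reconstruction quality, which is the motivation for using it here.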
We validate our framework on a formula retrieval task. In this task, we have 18 query formulae.
Here are some examples of queries. You can see that they are diverse in appearance and subject domain.
First, we observe that our method outperforms the other data-driven baseline on both metrics.
So we develop a new method that combines the strengths of both our method and Approach0. We can see that this method achieves state-of-the-art performance on this formula retrieval task.
We can see that our method retrieves equations that are semantically and structurally more similar to the query, whereas the tangentCFT baseline fails to do so in some cases.
I also want to visualize what the learnt formula representations look like. Here, we choose a small number of formulae from different math topics and plot their two-dimensional t-SNE embeddings.
We can see that these embeddings form nice clusters, which indicates that our model learns meaningful representations of these formulae.
And finally, we can apply our method to analyze students' step-wise answers to open-ended math questions. We have a paper that is going to appear at the Educational Data Mining conference later this month; the arXiv version is already out. If you are interested, you are welcome to check out the paper and attend our talk at EDM to learn more. Thanks.