Machine Learning
Using machine learning to
enable students to search
educational video content
September 2016
Objective
Our client, BTechGuru, is the leading provider in India of
resources for candidates preparing for the GATE
examination. As part of its revision material, it has more
than 16,000 hours of video lectures by professors from
India’s leading universities on a range of engineering
and related subjects. Each video is between 40 minutes
and an hour long – but research indicates that students
using online video to prepare for an exam have an
average attention span of 12 minutes.
The objective was to develop a highly accurate semantic
search engine that could take students straight to the
most relevant section of a video to help with their
revision. (The videos have no metadata other than
lecturer name and course and lecture titles, and the
NPTEL website allows search only with keywords chosen
from a limited, predefined list.)
Data set
The search corpus consisted of transcripts of the video
lectures provided by NPTEL.
Since not all the video content is relevant to the GATE
exam, we used GATE question papers for the past five
years to assess the relevance of search results to the
GATE syllabus.
In the absence of suitable ontologies, we collected
Creative Commons textbook content to see whether this
content could be used to seed ontology creation.
About NPTEL
NPTEL (National Programme on Technology-Enhanced Learning) is a joint initiative by the
premier universities in India – the Indian Institutes of Technology (IITs) and Indian
Institute of Science (IISc) – to provide free online learning through 750 Web and video
courses in engineering, science, and the humanities. Since its launch in 2007, NPTEL content
has come to be seen as high-quality reference material for students and faculty both in India
and overseas. Its 16,000 hours of video lectures from the IITs and IISc currently have as many
views on YouTube (197,000,000) as the MIT and Stanford channels combined.
About BTechGuru
BTechGuru is an India-based online education portal established in February 2008, providing
resources related to engineering careers, higher education, competitive exams, and internships
and jobs in India for students of engineering, soft skills and entrepreneurship. It is an
official distributor of the NPTEL courseware and India’s leading provider of revision and
assessment resources to students preparing for the GATE exam.
About the GATE exam
The Graduate Aptitude Test in Engineering (GATE) is an all-India examination administered
and conducted jointly by the IISc and the IITs on behalf of the government of India. GATE tests
a candidate’s understanding of undergraduate-level engineering, technology, and architecture
and postgraduate-level science subjects. More than 1 million students appear annually for
GATE, and exam results are used both for admission to postgraduate programs and by some public
and private sector companies for employment screening.
Technical challenges
§ We needed to break the 40–60 minute transcripts into meaningful chunks of less than 12 minutes
to retain user interest.
§ Some transcripts had been manually polished by the lecturer, but the majority bore the hallmarks
of spoken English: hesitation, repetition, fragmented sentences, and poor grammar. As a result,
conventional part-of-speech tagging and frequency-based keyword extraction algorithms perform
poorly on them. In addition, the transcripts are available only in PDF, which itself presents a
challenge for content extraction.
§ The lecture content is extremely technical. Our solution had to be generic enough to work across
multiple streams of engineering, yet able to retrieve relevant content based on deep domain
knowledge. The requirement of depth in each engineering field meant that shallow ontologies such
as DBpedia could not be leveraged effectively.
Approach
§ We took slide transitions in the lecturers’ PowerPoint presentations as a proxy for topic breaks to
chunk the videos. Timestamp information allowed us to align the video with the transcript, and
search result URLs were constructed to include this timestamp so that revising students can jump
directly to the relevant portion of the lecture on YouTube.
§ We extracted content from the PDF files using Apache PDFBox, an open source Java tool, which
coped well with the technical content and with the variability in input formats, such as the mix
of single- and double-column layouts.
§ The quality of information in the Creative Commons textbooks proved too poor to create deep
ontologies without significant manual curation. So we extracted keywords from the transcripts
themselves using a combination of natural language processing tools and statistical algorithms.
§ Our search engine was built on the Apache Lucene library and used keyword-based query expansion
combined with lemmatization and query boosting to improve the accuracy of search results. The
results were ranked for relevance using keywords extracted from the corpus of GATE question
papers.
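The chunking and deep-linking step described in the first bullet can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the code used in the project: the function names, the `Chunk` type, and the video ID are hypothetical, and the rule of splitting an over-long slide interval into equal parts is our assumption, since the source says only that chunks should stay under about 12 minutes.

```python
from dataclasses import dataclass
from typing import List

MAX_CHUNK_SECONDS = 12 * 60  # attention-span figure quoted in the Objective


@dataclass
class Chunk:
    start: int  # seconds from the start of the video
    end: int    # seconds from the start of the video


def chunk_by_slide_transitions(transitions: List[int], duration: int,
                               max_len: int = MAX_CHUNK_SECONDS) -> List[Chunk]:
    """Cut a lecture at slide transitions (hypothetical helper).

    Any interval longer than max_len is split into equal parts; that
    splitting rule is an assumption made for this sketch.
    """
    # Keep only transitions that fall strictly inside the video.
    inside = [t for t in transitions if 0 < t < duration]
    bounds = sorted({0, duration, *inside})
    chunks: List[Chunk] = []
    for start, end in zip(bounds, bounds[1:]):
        length = end - start
        parts = -(-length // max_len)  # ceiling division
        step = length / parts
        for i in range(parts):
            chunks.append(Chunk(start + round(i * step),
                                start + round((i + 1) * step)))
    return chunks


def youtube_url(video_id: str, start_seconds: int) -> str:
    """Build a link that opens the video at the chunk's start time."""
    return f"https://www.youtube.com/watch?v={video_id}&t={start_seconds}s"


if __name__ == "__main__":
    # Slide changes at 5 and 25 minutes in a 50-minute lecture (made-up data).
    for chunk in chunk_by_slide_transitions([300, 1500], 3000):
        print(chunk, youtube_url("VIDEO_ID", chunk.start))
```

Each search hit would then carry the start time of the chunk it matched, so the result link opens the lecture at that point rather than at the beginning.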
Results and potential future work
The platform was tested extensively by subject matter experts to ensure that search results were
appropriate to the query. In initial tests, students gave positive feedback on relevance to GATE
requirements and ease of use: a textual search produces a page of results that are ranked for
relevance, categorized as beginner, intermediate, or advanced level, annotated with GATE keywords,
and linked to the exact position where the query topic is covered in the video lecture.
We have identified several potential areas for future development of both the technology approach and
the user experience:
§ Enabling students to rate search results on the website so that the search algorithms can learn
over time
§ Improving engagement by linking from the search results to related GATE exam questions
§ Hosting institutions’ course syllabuses with search results already populated for each topic
§ Using word vectors from a larger dataset for query expansion and keyword display
§ Extracting triples from the dataset itself to create an ontology
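As a rough illustration of the word-vector idea in the list above, the Python sketch below expands a query term with its nearest neighbours by cosine similarity and writes the result in Lucene's classic query syntax, where `term^n` boosts a term. The vectors are toy values invented for the example, and the function names, the similarity threshold, and the choice to use each similarity score as a boost are all assumptions, not details of the deployed system.

```python
import math
from typing import Dict, List, Tuple

# Toy 3-dimensional vectors, invented for illustration only. Real vectors
# would be trained on a large corpus and have hundreds of dimensions.
VECTORS: Dict[str, List[float]] = {
    "entropy": [0.9, 0.1, 0.0],
    "enthalpy": [0.8, 0.2, 0.1],
    "thermodynamics": [0.7, 0.3, 0.0],
    "transistor": [0.0, 0.1, 0.9],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand(term: str, vectors: Dict[str, List[float]],
           top_k: int = 2, min_sim: float = 0.8) -> List[Tuple[str, float]]:
    """Return up to top_k neighbours of term with similarity >= min_sim."""
    if term not in vectors:
        return []
    scored = [(other, cosine(vectors[term], vec))
              for other, vec in vectors.items() if other != term]
    scored = [pair for pair in scored if pair[1] >= min_sim]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def to_lucene_query(term: str, neighbours: List[Tuple[str, float]],
                    boost: float = 2.0) -> str:
    """Boost the original term above its expansions (single-word terms only)."""
    parts = [f"{term}^{boost}"]
    parts += [f"{word}^{sim:.2f}" for word, sim in neighbours]
    return " OR ".join(parts)


if __name__ == "__main__":
    neighbours = expand("entropy", VECTORS)
    print(to_lucene_query("entropy", neighbours))
```

Keeping the original term's boost higher than any expansion means an exact match still outranks a merely related one. Multi-word terms would need to be quoted as phrases before being placed in the query string.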
For further information contact Jason@newgen.co