Machine Learning
Using machine learning to
enable students to search
educational video content
September 2016
Objective
Our client, BTechGuru, is the leading provider in India of
resources for candidates preparing for the GATE
examination. As part of its revision material, it has more
than 16,000 hours of video lectures by professors from
India’s leading universities on a range of engineering
and related subjects. Each video is between 40 minutes
and an hour long – but research indicates that students
using online video to prepare for an exam have an
average attention span of 12 minutes.
The objective was to develop a highly accurate semantic
search engine that could take students straight to the
most relevant section of a video to help with their
revision. (The videos have no metadata other than
lecturer name and course and lecture titles, and the
NPTEL website allows search only with keywords chosen
from a limited, predefined list.)
Data set
The search corpus consisted of transcripts of the video
lectures provided by NPTEL.
Since not all the video content is relevant to the GATE
exam, we used GATE question papers for the past five
years to assess the relevance of search results to the
GATE syllabus.
In the absence of suitable ontologies, we collected
Creative Commons textbook content to see whether this
content could be used to seed ontology creation.
About NPTEL
NPTEL (National Programme on Technology-Enhanced Learning) is a joint initiative by the
premier universities in India – the Indian Institutes of Technology (IITs) and the Indian
Institute of Science (IISc) – to provide free online learning through 750 Web and video
courses in engineering, science, and the humanities. Since its launch in 2007, NPTEL content
has come to be seen as high-quality reference material for students and faculty both in India
and overseas. Its 16,000 hours of video lectures from the IITs and IISc currently have as many
views on YouTube (197,000,000) as the MIT and Stanford channels combined.
About BTechGuru
BTechGuru is an India-based online education portal established in February 2008, providing
resources related to engineering careers, higher education, competitive exams, and internships
and jobs in India for students of engineering, soft skills, and entrepreneurship. It is an
official distributor of the NPTEL courseware and India’s leading provider of revision and
assessment resources to students preparing for the GATE exam.
About the GATE exam
The Graduate Aptitude Test in Engineering (GATE) is an all-India examination administered
and conducted jointly by the IISc and the IITs on behalf of the government of India. GATE tests
a candidate’s understanding of undergraduate-level engineering, technology, and architecture
and postgraduate-level science subjects. More than 1 million students appear annually for
GATE, and exam results are used both for admission to postgraduate programs and by some
public and private sector companies for employment screening.
Technical challenges
§ We needed to break the 40–60 minute transcripts into meaningful chunks of less than 12
minutes to retain user interest.
§ Some transcripts had been manually polished by the lecturer, but the majority bore the
hallmarks of spoken English: hesitation, repetition, fragmented sentences, and poor grammar.
As a result, conventional part-of-speech tagging and frequency-based keyword extraction
algorithms underperform on them. In addition, the transcripts are available only in PDF,
which itself presents a challenge for content extraction.
§ The lecture content is extremely technical. Our solution had to be generic enough to work across
multiple streams of engineering, yet able to retrieve relevant content based on deep domain
knowledge. The requirement of depth in each engineering field meant that shallow ontologies such
as DBpedia could not be leveraged effectively.
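To make the second challenge concrete, the sketch below shows how the fillers and repetitions of spoken English dominate a raw frequency count, and how weighting each term by how rare it is across chunks (TF-IDF) pushes them down. This is an illustration only: the three sample sentences are invented, and TF-IDF is an assumed stand-in, since the case study does not name the statistical algorithms that were used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def top_by_frequency(doc, n=3):
    """Rank terms by raw count within a single transcript chunk."""
    return [word for word, _ in Counter(tokenize(doc)).most_common(n)]


def top_by_tfidf(doc, corpus, n=3):
    """Rank terms by count times log(inverse document frequency).

    `doc` must be one of the chunks in `corpus`, so every term has a
    document frequency of at least 1.
    """
    doc_sets = [set(tokenize(d)) for d in corpus]
    tf = Counter(tokenize(doc))

    def score(word):
        df = sum(word in s for s in doc_sets)
        return tf[word] * math.log(len(doc_sets) / df)

    # Sort by descending score, then alphabetically to break ties.
    return sorted(tf, key=lambda w: (-score(w), w))[:n]


# Invented examples of unpolished spoken transcript text.
corpus = [
    "so um the the entropy you know the entropy um the system entropy increases",
    "so um you know the the voltage across the the capacitor um the voltage",
    "um so the the the stack you know the stack pointer um so the stack",
]

print(top_by_frequency(corpus[0]))      # fillers compete with real terms
print(top_by_tfidf(corpus[0], corpus))  # terms found in every chunk score 0
```

Words that occur in every chunk ("the", "um", "so") get a TF-IDF score of zero here, which is the effect a production keyword extractor would have to achieve by other means on a large, noisy corpus.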
Approach
§ We took slide transitions in the lecturers’ PowerPoint presentations as a proxy for topic breaks to
chunk the videos. Timestamp information allowed us to align the video with the transcript, and
search result URLs were constructed to include this timestamp so that revising students can jump
directly to the relevant portion of the lecture on YouTube.
§ We extracted content from the PDF files using Apache PDFBox, an open source Java tool. It coped
well with both the technical content and the variability in input formats, such as the mix of
single- and double-column content.
§ The quality of information in the Creative Commons textbooks proved too poor to create deep
ontologies without significant manual curation. So we extracted keywords from the transcripts
themselves using a combination of natural language processing tools and statistical algorithms.
§ Our search engine was built on the Apache Lucene library and used keyword-based query expansion
combined with lemmatization and query boosting to improve the accuracy of search results. The
results were ranked for relevance using keywords extracted from the corpus of GATE question
papers.
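The chunking and deep-linking steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the greedy merge rule, and the sample video ID are all assumptions. It merges consecutive slide intervals into chunks of at most 12 minutes and builds a YouTube URL that starts playback at the beginning of a chunk.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

MAX_CHUNK_SECONDS = 12 * 60  # ceiling taken from the 12-minute attention span


@dataclass
class Chunk:
    start: int  # seconds from the start of the video
    end: int


def chunk_by_slides(slide_times, video_length, max_len=MAX_CHUNK_SECONDS):
    """Greedily merge slide intervals without exceeding max_len.

    slide_times: slide-transition timestamps in seconds.
    A single slide that runs longer than max_len is kept whole, because
    there is no transition inside it to cut on.
    """
    boundaries = sorted({0, *slide_times, video_length})
    chunks = []
    start = prev = boundaries[0]
    for b in boundaries[1:]:
        # Close the current chunk at the previous boundary if adding
        # this slide would push it over the limit.
        if b - start > max_len and prev > start:
            chunks.append(Chunk(start, prev))
            start = prev
        prev = b
    chunks.append(Chunk(start, prev))
    return chunks


def youtube_url(video_id, start_seconds):
    """Build a watch URL that starts playback at start_seconds."""
    query = urlencode({"v": video_id, "t": f"{start_seconds}s"})
    return "https://www.youtube.com/watch?" + query


# Hypothetical 45-minute lecture with five slide transitions.
chunks = chunk_by_slides([300, 600, 900, 1500, 2400], 2700)
for c in chunks:
    print(c, youtube_url("abc123", c.start))
```

A greedy merge keeps topic boundaries on slide transitions at the cost of uneven chunk lengths; splitting an over-long slide would need a second signal, such as pauses in the transcript.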
Results and potential future work
The platform was tested extensively by subject matter experts to ensure that search results were
appropriate to the query, and initial tests have received positive feedback from students for relevance to
GATE requirements and ease of use: a textual search produces a page of results that are ranked for
relevance, categorized as beginner, intermediate, or advanced level, annotated with GATE keywords,
and linked to the exact position where the query topic is covered in the video lecture.
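The ranking behaviour described above (lemmatized queries, keyword-based expansion, and a boost for GATE keywords) can be approximated in a few lines. The sketch below is a toy in plain Python, not the Lucene configuration used in the project: the lemma table, the expansion table, the boost factor of 2.0, and the chunk identifiers are all hypothetical.

```python
# Hypothetical lookup tables standing in for real NLP resources.
LEMMAS = {"laws": "law", "increases": "increase"}
EXPANSIONS = {"entropy": ["disorder"]}
GATE_KEYWORDS = {"entropy", "thermodynamics"}


def lemmatize(token):
    """Map a token to its base form using the toy lemma table."""
    return LEMMAS.get(token, token)


def expand(query):
    """Return {term: weight}; expansion terms get a lower weight."""
    terms = [lemmatize(t) for t in query.lower().split()]
    weighted = {t: 1.0 for t in terms}
    for t in terms:
        for extra in EXPANSIONS.get(t, []):
            weighted.setdefault(extra, 0.5)
    return weighted


def score(chunk_text, weighted_terms, gate_boost=2.0):
    """Sum weighted term hits, boosting terms that are GATE keywords."""
    tokens = [lemmatize(t) for t in chunk_text.lower().split()]
    total = 0.0
    for term, weight in weighted_terms.items():
        if term in GATE_KEYWORDS:
            weight *= gate_boost
        total += weight * tokens.count(term)
    return total


def search(query, chunks):
    """Return the ids of matching chunks, best match first."""
    weighted = expand(query)
    scored = [(score(text, weighted), cid) for cid, text in chunks.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [cid for s, cid in ranked if s > 0]


# Chunk ids pair a lecture with a start offset in seconds (invented data).
chunks = {
    "lecture1@0": "entropy measures disorder in a system",
    "lecture1@600": "the second law says entropy increases",
    "lecture2@0": "a stack pointer tracks the call stack",
}
results = search("entropy laws", chunks)
print(results)
```

Giving expansion terms a lower weight than the words the student typed keeps expanded matches from outranking direct ones, while the GATE boost favours chunks that cover exam-relevant vocabulary.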
We have identified several potential areas for future development of both the technology approach and
the user experience:
§ Enabling students to rate search results on the website so that the search algorithms can learn over
time
§ Improving engagement by linking from the search results to related GATE exam questions
§ Hosting institutions’ course syllabuses with search results already populated for each topic
§ Using word vectors from a larger dataset for query expansion and keyword display
§ Extracting triples from the dataset itself to create an ontology
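As a rough illustration of the word-vector idea in the list above, query expansion can pick the terms whose vectors lie closest to a query term by cosine similarity. The three-dimensional vectors below are invented for the example; real embeddings would be trained on a large corpus and have hundreds of dimensions, and the similarity threshold is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_terms(term, vectors, k=2, min_sim=0.8):
    """Return up to k other terms with similarity of at least min_sim."""
    target = vectors[term]
    sims = [(cosine(target, vec), word)
            for word, vec in vectors.items() if word != term]
    ranked = sorted(sims, key=lambda pair: (-pair[0], pair[1]))
    return [word for sim, word in ranked if sim >= min_sim][:k]


# Invented toy vectors, for illustration only.
vectors = {
    "entropy": [0.9, 0.1, 0.0],
    "disorder": [0.8, 0.2, 0.0],
    "enthalpy": [0.7, 0.3, 0.1],
    "stack": [0.0, 0.1, 0.9],
}
print(nearest_terms("entropy", vectors))
```

The threshold matters as much as k: without it, an unrelated term such as "stack" would still be returned whenever the vocabulary is small.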
For further information contact Jason@newgen.co
