Prediction of Student Learning Interests Using Text Analytics
Prethiviraj Elango1, Mithun Rajkumar Antony2 and Krishna Ramanathan3
Faculty of Engineering and IT
University of Technology Sydney
Sydney, Australia
{1Prethiviraj.Elango, 2Mithun.RajkumarAntony, 3Krishna.Ramanathan}@student.uts.edu.au
Roberto Martinez-Maldonado
Connected Intelligence Centre
University of Technology Sydney
Sydney, Australia
Roberto.Martinez-Maldonado@uts.edu.au
Abstract – Online collaborative learning is popular because of its advantages over traditional classroom learning. Certain benefits can be achieved with this mode of learning, provided the approach is well designed. However, there are limitations in using the vast amount of available student data, and there is little evidence of this data being put to use for other purposes. The existing literature reports various advantages of using text analytics to enhance educational patterns of learning. Building on this literature, this paper proposes a process model to collect and analyze student data from an online learning environment. The proposed approach uses a data analytics tool called RapidMiner for text processing to indicate students’ interest in various areas of study based on their available data. Furthermore, this report is based on a proof of concept of a project that is simple enough to target the University of Technology Sydney (UTS) and other educational stakeholders.
Keywords – Text analytics, online learning, prediction accuracy
I. INTRODUCTION
The online learning environment is a significant change in present-day education. The University of Technology Sydney (UTS) gives students an opportunity for online engagement through the UTS Online software, which supports collaboration between students and professors. UTS Online has some limitations: students can participate only in the discussion boards of their enrolled subjects, and professors can only post subject updates, publish student marks and upload subject materials. Professors cannot monitor student activities, interests, intentions and so on, as UTS Online does not provide any means to do so. These limitations can be overcome by a project called CIC Around, which is currently in progress and is handled by the UTS Connected Intelligence Centre (CIC).
CIC operates under UTS and handles multiple projects for the university, of which CIC Around is one. CIC works at the intersection of human sense-making and computational analysis. Its research covers various domains such as education, learning analytics, human-centred research, research analytics and transdisciplinary work. Its main aim is to conduct research that answers open questions in these domains.
In the CIC Around project, UTS CIC is running a participatory design process to build an online WordPress multisite environment that supports student learning, online collaboration with peers, collaboration with industry partners and community building among students. Students can create groups for various study purposes, and professors are given the opportunity to monitor students’ progress. The project will overcome the existing limitations of UTS Online because it is a participatory process that helps students engage more with the online learning environment. It will be especially helpful to students studying block-mode subjects. Understanding the interests of the wider UTS student community in the online learning environment will help in analyzing student data for the future enhancement of CIC Around.
A proof of concept of UTS CIC Around has been carried out with the WordPress plugins BuddyPress and bbPress. Following the proof of concept, a process model for predicting students’ interests in this online learning environment is proposed. The prediction accuracy is based on the level of interest students show in other areas of learning. This proposal will help university authorities refine particular courses based on the interest level of students. The data analytics tool RapidMiner has been used, as explained in detail in the following sections.
The rest of the paper is organized as follows: Section II Motivation, Section III Methodology, Section IV Related work, Section V Existing process, Section VI Proposed process, Section VII Conclusion.
II. MOTIVATION FOR THIS RESEARCH
The main aim of this research is to give educational stakeholders a clear insight into using text analytics effectively and efficiently. The objective of this paper is to improve the efficiency of online learning, which connects students across distances, based on their interests. As technology advances, educational stakeholders need to use it to enhance existing patterns of learning. Universities have their own processes and norms for enhancing their students’ learning. However, in many cases universities cannot predict students’ interests or know what students really think about their selected subjects.
A student can choose from a large number of subjects depending on the selected course. As a simple example, the University of Technology Sydney has refined its Information Technology course into four majors (Business Information Systems, Data Analytics, Networking and Software Development) based on student participation in its Subject Feedback Survey (SFS). Beyond this, UTS may carry out further internal work to enhance its courses. In the survey, however, students give feedback only on their enrolled subjects. This is a direct, non-technical approach that does not reveal students’ actual interests, because the feedback is limited to enrolled subjects. Conducting surveys to learn student interests is also a tedious process, as university authorities cannot feasibly collect survey data manually in order to refine their courses. This is the starting point for research in this area, which will be useful for various educational stakeholders.
This research follows a similar idea but collects student data from the online learning environment instead. It will also help overcome the problem of not knowing a student’s interests. A data analytics tool is used to make the process involved in this research clear.
On the whole, the ultimate motivation of this research is to predict student interests in various domain areas accurately and to use these predictions to refine the subjects in their course. The research will also provide a better understanding of student data, which will help in analyzing various patterns in the future.
III. METHODOLOGY
In this research, many articles related to text analytics were reviewed. Based on the findings, an enhancement is proposed to the existing approaches for managing student data and for what can be done with that data given students’ online participation. The proposed idea was first shaped around the available student data, taking various factors into account.
The research rests on two processes. The first is collecting student data from the online learning environment in order to enhance student learning as a whole. The data should be retrieved from the back end through reporting, according to the specifications set by the stakeholders. After examining various criteria, text analytics was chosen because it helps in collecting, measuring and analyzing student data and in finding similar patterns in it.
The second is the data analytics tool RapidMiner, in which the bulk student data is processed using its keyword search option. The main aim of applying text analytics to student data is to derive high-quality information from the text students enter. Similar patterns of text are thereby structured, which supports interpretation of the output. Both processes are useful in enhancing student learning, so they were adopted and then demonstrated. They are described in detail in the Proposed Process section.
IV. RELATED WORK
Early applications of text mining in higher education were less effective than later ones because the tools were not user friendly and were very expensive. Text mining has several applications, and users tend to prefer different ways of working with a mining tool depending on their level of knowledge. Text mining also has a great effect in higher education, where teachers can analyze learners’ activities and help learners efficiently. It is also used as a major tool to refine the curriculum of a course at a university or other educational institution.
In Qualitative Text Mining in Student’s Service Learning Diary, the author analyzed students’ service-learning activities in order to assess the outcomes students achieve from e-learning and to give students feedback based on their interaction with e-learning tools such as online discussion boards and online exams. The author also notes that the curriculum of a course can be updated by using text mining technologies, which makes the course more refined than a large syllabus with unrelated content. The author also introduced computer technologies such as the following (Hsu, 2012).
Instructional design
Instructional design provides a blueprint for teaching and a means of examining each teacher’s teaching standards. It is used to identify learners who are at high risk of dropping out of a subject. Once such a learner is identified, tailored approaches and strategies are used to make teaching more effective. The authors narrowed down the concept of instructional design in their book “Designing instructional feedback for different learning outcomes”. The book states that an instructional event in which a particular student is selected for motivation has to follow a pretest, practice and a post-test (Smith et al., 1993).
Text mining prediction
The authors of the book Text Mining: Predictive Methods for Analyzing Unstructured Information indicate that data mining technology is used to find patterns in structured databases but not in semi-structured ones. Hearst identified that data mining alone would not satisfy the human need for learning and teaching information. However, when text mining is applied with appropriate linguistic and statistical techniques to analyze text data, it helps us obtain new information (Weiss et al., 2005).
The professor described the research method followed in that study: “Initially apply the instructional design model followed by text mining procedures. The model has to combine 3 aspects of view: professor in action research, student teacher in curriculum and instructional development and design students in motivational learning evaluation”, as illustrated in Figure 1.
Figure 1: Research models in three points of view
The authors (Ai et al., 2006), in their paper “The Application of Data Mining Technology in Distance Learning Evaluation”, list the kinds of knowledge that can be gathered through text mining:
A. Generalized knowledge
A general description of the characteristics of the text that a mining tool can generate (in our case, the mining tool is RapidMiner). It typically reflects the common nature of similar things, refines abstract data and so on.
B. Related knowledge
This knowledge is gathered when one piece of data depends on, or is associated with, other similar data.
C. Category knowledge
This is similar to related knowledge, but the gathered texts are categorized according to different characteristics of the knowledge. The most widely used form of classification is a tree view.
D. Predictive knowledge
Also called future knowledge, this is predicted from past and current data. Common predictive methods are statistical methods, neural networks and machine learning.
E. Bias-based knowledge
This is exceptional knowledge, gathered as a description of the differences between the characteristics of attributes.
They also describe the use of E-Portfolio with text mining as an application to evaluate students’ learning behavior. Used by itself, E-Portfolio is an inefficient technique for evaluating learning behavior because it is evaluated manually by the teacher, and it cannot easily handle a large number of students. As Figure 2 shows, text mining used together with E-Portfolio helps the teacher gather knowledge about the learning objectives associated with the analysis. Through the recorded set of mined data, the teacher can easily understand the regulatory standards and analyze the results of students’ learning behaviors, which further increases the efficiency of learning evaluation (Ai et al., 2006).
Figure 2: Application of data mining technology in
E-Portfolio
The MCMS (Mining Course Management Systems) project at Thames Valley University recommends building a knowledge management system based on data mining. Data mining techniques are applied to track individual student performance and to refine the curriculum according to student activities. Text mining is used to present the data mined by MCMS in a human-understandable way for better decision making (Oussena, 2008).
Model-driven data integration is applied in MCMS to bring data from different systems into a single data warehouse for analysis (Kim et al., 2009). The data in the warehouse should always be pre-processed and transformed before any mining technique is applied; having the data ready increases the efficiency of the data mining process. The knowledge gathered from data mining is then used by the university to take a more advanced approach to predicting individual behavior and instructing students. Text mining is applied here to narrow down students’ interaction with the online learning (e-learning) tool. When a knowledge management system and a text mining process are used together, a university achieves a high level of data efficiency, which in turn helps it choose the most suitable approach to understanding its students’ needs.
Figure 3: Workflow of MCMS
One study predicts students’ test scores with a data mining prediction technique that uses an effective factor, which is later adjusted according to the students’ performance in the succeeding year (Gabrilson, 2003). Luan groups students into two categories: those who can easily deal with their courses and those who take longer to complete a course (Luan, 2002). Such grouping helps universities make better decisions on refining their curriculum, teaching time and so on.
To understand the factors that determine student retention, universities usually collect data about a student’s academic history, behavior and perceptions. For instance, one study used different classifiers to predict student characteristics, which resulted in low accuracy (Superby et al., 2006).
In their paper “Use Data Mining To Improve Student Retention In Higher Education”, the authors describe student retention as the biggest challenge, as it determines better academic programs and better revenue for universities (Oussena et al., 2010). A simple formula for student retention was developed by Seidman (Seidman, 1996):
Retention = Early Identification + (Early + Intensive + Continuous) Intervention
This formula shows that detecting at-risk students early and maintaining regular interaction with them is the recommended way to increase student retention.
Tinto has provided five strategies to increase student retention:
• Understanding the expectations of the student.
• Conducting counselling sessions to help students choose their courses.
• Providing academic and social support, especially before the start of the first semester.
• Motivating students by explaining their capabilities.
• Active interaction with the available learning sources.
One group of authors introduces the idea of opinion mining from student feedback data. As stakeholders’ opinions are a major factor in individual decision making, the authors use this technique to understand their students better and to refine the curriculum. The result of opinion mining depends on how well the data is preprocessed, that is, on the stages the data goes through when it is prepared before classification (Dhanalakshmi et al., 2016).
Another study used a linear regression classifier to identify the variables associated with academic performance and found that previous academic performance was the most important variable (Oussena, 2008).
V. EXISTING PROCESS
The existing text analytics process is generally used to convert unstructured information into structured form and to extract meaningful information from the entered text, which can then be used by various data mining algorithms. Information is extracted by counting the words in the document. The word counts can then be analyzed to find similarities and relationships between documents. The most common method in text analytics is to convert text to numbers for use in clustering and predictive data mining projects, and this representation is also useful in other kinds of analysis. Text mining also includes sentiment analysis, document summarization, entity relation modelling, text clustering and text categorization. Figure 4 gives an overview of the text analytics process:
Figure 4: Text analytics process
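As a rough illustration of converting text to numbers, the following Python sketch counts the words in two short documents and compares the resulting count vectors. The sample sentences, the function names and the use of cosine similarity are assumptions made for this illustration; the process described above does not prescribe a particular similarity measure.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, split on non-letters and count each word."""
    words = [w for w in re.split(r"[^a-z]+", text.lower()) if w]
    return Counter(words)


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, invented for this example.
doc_a = "Data mining finds patterns in student data"
doc_b = "Text mining finds patterns in student posts"

vec_a = word_counts(doc_a)
vec_b = word_counts(doc_b)
similarity = cosine_similarity(vec_a, vec_b)
```

A similarity close to 1 suggests the two documents use largely the same words, while a value close to 0 suggests they share very few.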
VI. PROPOSED PROCESS
In this proposal, we illustrate the use of text analytics with student data, building on the existing text analytics process. As we are dealing with student data from the online learning environment, the first step is to collect student information such as posts, comments, participation in discussions and micro-level information like the pages they visit, the pages they like and the topics they are most interested in. The data we collect from the relational databases is unstructured and is retrieved in document format. To convert it into a structured format we can use a vector representation: each document is represented as a vector so that all documents can be brought into a common, structured data set.
Collecting this structured data is very important because we are going to look for similar patterns and relationships in it. The main purpose is to identify a student’s interest in subjects that are not part of their course.
Example: Suppose a student is enrolled in the Information Technology course but has more interest in marketing-related topics. If that student participates in more and more marketing-related activities, we can conclude that this Information Technology student is also interested in marketing subjects. Many other Information Technology students might share this interest. If a considerable number of Information Technology students turn out to be interested in marketing, then by identifying these similarities and patterns the university has the opportunity to refine the Information Technology course by including marketing subjects. Likewise, many students in any one course will have interests in other areas as well. With the help of text analytics, the course can therefore be refined periodically according to present trends, scenarios and student behavior.
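The reasoning in the example above can be sketched in Python. The student records, the topic keyword lists and the `min_mentions` threshold are hypothetical values chosen for illustration; they are not taken from any UTS data.

```python
from collections import defaultdict

# Hypothetical keyword lists that define each topic of interest.
TOPIC_KEYWORDS = {
    "marketing": {"marketing", "branding", "advertising"},
    "networking": {"router", "network", "protocol"},
}

# Hypothetical posts: (student id, enrolled course, post text).
POSTS = [
    ("s1", "Information Technology", "Great talk on digital marketing and branding"),
    ("s1", "Information Technology", "Any advertising case studies to share"),
    ("s2", "Information Technology", "How do I configure a router"),
    ("s3", "Information Technology", "Marketing analytics looks interesting"),
    ("s4", "Telecommunication", "Which network protocol is faster"),
]


def interested_students(posts, topic_keywords, min_mentions=1):
    """Map (course, topic) to the set of students who mention the topic."""
    mentions = defaultdict(int)  # (student, course, topic) -> keyword hits
    for student, course, text in posts:
        words = set(text.lower().split())
        for topic, keywords in topic_keywords.items():
            mentions[(student, course, topic)] += len(words & keywords)
    result = defaultdict(set)
    for (student, course, topic), hits in mentions.items():
        if hits >= min_mentions:
            result[(course, topic)].add(student)
    return result


interest = interested_students(POSTS, TOPIC_KEYWORDS)
it_marketing = interest[("Information Technology", "marketing")]
```

Counting distinct students rather than raw keyword hits keeps one very active student from dominating the picture for a whole course.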
Demonstration: To predict students’ interest in different areas in the online learning environment, we use the RapidMiner software platform. It is open source software that is useful for machine learning, business analysis, text analysis, predictive analysis and data mining. On this platform, we demonstrate how the text mining process can be applied to data from the online learning environment. Once RapidMiner is installed, we load the student information extracted from the online learning environment into RapidMiner. The extraction can be done with any Business Intelligence tool, such as online analytical processing or data warehousing. Before loading the extracted file into RapidMiner, we look for the required text processing extension by clicking the Extensions icon, as in the screenshot below:
Figure 5: RapidMiner Extensions
On clicking the Extensions icon, we install the text processing package. Once it is installed, the next step is to drag Process Documents from Files from the text processing package and drop it into the work area, as shown below:
Figure 6: Dropping Process Documents from Files
to the RapidMiner workspace
After this, we select the parameters for Process Documents from Files. This selection is shown in the screenshot below:
Figure 7: Parameter selection
In the above screenshot, under text directories we provide the file path on the local computer. Here, we compare two files of student data extracted from the online learning environment. The data used here is dummy data for demonstration purposes. One file contains the student data of the Information Technology department, and the other contains the student data of the Telecommunication department. The extraction is based on student information, their online participation, their intentions, the topics they are most interested in, the pages they like and so on. Both sets of student data are loaded into RapidMiner as shown in the screenshot below:
Figure 8: Loading dummy student data
Once the dummy student data has been loaded, we need to select the option for vector creation. Figure 4 shows the vector creation. Once the file is loaded, we need to specify which kind of vector creation is to be done. Documents are represented by vectors. When the texts are processed, each is an unstructured, ordered list of pairs, which is converted into structured form with the help of the document vector model. This conversion is done by counting the words in the documents. There are four options for counting words, which are explained below:
Binary Term Occurrences: This is the simplest option; it records whether or not the selected word is present in the document.
Term Occurrences: This option is related to binary term occurrences; it counts how often a word occurs in a document.
Term Frequency: This gives the fraction of the document’s length that is accounted for by the particular term.
TF-IDF: This is the most advanced option in RapidMiner and stands for term frequency-inverse document frequency. Term frequency is as explained above. Inverse document frequency is based on the document frequency, which is the number of documents that a word occurs in. It is used to determine how characteristic a word is. In our demonstration we have selected this option, which combines the two measures.
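The four weightings can be sketched in Python as follows. This follows the descriptions above and is not RapidMiner's own implementation, which may normalize the values differently. The function name, the sample documents and the IDF form log(N / df) with a natural logarithm are assumptions made for this illustration.

```python
import math
from collections import Counter


def build_vectors(tokenized_docs):
    """Return the four weightings for each document as dicts keyed by term."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents each term occurs in.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        length = len(tokens)
        binary = {t: 1 for t in counts}
        occurrences = dict(counts)
        term_freq = {t: c / length for t, c in counts.items()}
        # Assumed IDF form: natural log of (documents / document frequency).
        tf_idf = {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in counts}
        vectors.append({
            "binary": binary,
            "occurrences": occurrences,
            "term_frequency": term_freq,
            "tf_idf": tf_idf,
        })
    return vectors


# Hypothetical, already tokenized documents.
docs = [
    ["marketing", "course", "marketing", "plan"],
    ["network", "course", "design", "lab"],
]
vectors = build_vectors(docs)
```

With this IDF form, a word that occurs in every document (here "course") receives a TF-IDF weight of zero, which reflects the idea that it is not characteristic of any single document.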
The next step is to specify which process should run inside the loop. The process we have selected is Tokenization. Its main purpose is to cut the texts into individual terms or words. Different separators can be used, as highlighted in the screenshot below:
Figure 9: Selection of a separator
A number of separators are available in RapidMiner. The first is non letters, which splits on white space, punctuation, symbols and so on. The next is specify characters, in which we can choose the separator characters ourselves. Apart from these two, we can also use separators such as regular expression, linguistic sentences and linguistic tokens. In our demonstration we have selected the non letters separator.
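The effect of the first two separator options can be approximated with regular expressions in Python. This is only a sketch of the idea, not RapidMiner's tokenizer: the function names and sample strings are made up for illustration, and the pattern below treats only ASCII letters as letters.

```python
import re


def tokenize_non_letters(text):
    """Split text wherever one or more non-letter characters occur."""
    # [^A-Za-z]+ matches runs of anything that is not an ASCII letter.
    return [token for token in re.split(r"[^A-Za-z]+", text) if token]


def tokenize_on_characters(text, separators):
    """Split text on a user-chosen set of separator characters."""
    pattern = "[" + re.escape(separators) + "]+"
    return [token for token in re.split(pattern, text) if token]


sample = "I like marketing, e-commerce & data-mining (2nd year)!"
tokens = tokenize_non_letters(sample)
comma_tokens = tokenize_on_characters("IT,Telecom;Marketing", ",;")
```

Note that splitting on non-letters also breaks hyphenated words and drops digits, so "e-commerce" becomes two tokens and "2nd" becomes "nd".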
We can also perform further operations under text processing. For instance, we have selected the filtering option called Filter Stopwords (English), which removes articles, conjunctions, pronouns and so on. As we perform multiple operations in the text processing within RapidMiner, we have to set the break after option on our second and third operations, that is, Tokenization and Filter Stopwords respectively.
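Stop word filtering can be sketched in Python as a simple lookup against a word list. The list below is a small hypothetical sample; the list used by RapidMiner is not reproduced here and is considerably longer.

```python
# A small, hypothetical stop word list; real lists are much longer.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "but", "i", "he", "she",
    "it", "they", "we", "in", "on", "of", "to", "is", "are",
}


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["I", "like", "the", "marketing", "and", "the", "networking", "subjects"]
filtered = filter_stop_words(tokens)
```

Filtering after tokenization keeps the remaining terms focused on content words such as subject names, which are the ones that indicate interest.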
The next step is to run the selected operations in RapidMiner. Once we run them, we can see the original text alongside the processed text, as in the screenshot below:
Figure 10: Outcome of text analytics
The color changes from word to word because we have used the tokenizer option, which separates the individual words and terms. The same procedure can be repeated for every document. In our demonstration we have used two files containing the student data of the Information Technology and Telecommunication departments. The output also includes the example set, which consists of one row for each document and one column for each word. In addition, some meta information is provided, such as file information, file date, extension path and the group or class the document belongs to, given by the label attribute.
If we want to generate a classification model, this is possible with the classification models available within RapidMiner. In our demonstration we have used the Naïve Bayes classification model, which is available in the modelling package of RapidMiner. Once we select our classification model, it appears in the RapidMiner working area as in the screenshot below:
Figure 11: Selecting a classification model
After adding the classification model, we can view the output in different ways, such as table view, plot view and distribution table. The view we have selected here is plot view, which compares the word counts from the Information Technology student data and the Telecommunication student data. From the overall extraction of student data, we have compared the student data of only two departments to learn their interest in the Marketing area. On selecting the word marketing, we can conclude that more Information Technology students are interested in the marketing area, as the graph shows. Knowing this, university authorities can refine the Information Technology subjects by adding some Marketing subjects to the curriculum.
Figure 12: Plot view of processed text data
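The comparison of the word marketing across the two departments can be sketched in Python. The department texts below are hypothetical dummy data written for this illustration, in the same spirit as the dummy files used in the demonstration, and the function names are assumptions.

```python
import re
from collections import Counter

# Hypothetical dummy posts for two departments.
DEPARTMENT_TEXT = {
    "Information Technology": (
        "Marketing for apps. I enjoy marketing and coding. "
        "Digital marketing rocks."
    ),
    "Telecommunication": (
        "Antenna design lab. Marketing elective was fine. "
        "Signal processing notes."
    ),
}


def term_counts(text):
    """Count lowercase words after splitting on non-letters."""
    words = [w for w in re.split(r"[^a-z]+", text.lower()) if w]
    return Counter(words)


def compare_term(texts_by_group, term):
    """Return how often the term occurs in each group's text."""
    return {
        group: term_counts(text)[term.lower()]
        for group, text in texts_by_group.items()
    }


marketing_counts = compare_term(DEPARTMENT_TEXT, "marketing")
most_interested = max(marketing_counts, key=marketing_counts.get)
```

Raw counts like these depend on how much text each department contributes, so a normalized weighting such as term frequency or TF-IDF may give a fairer comparison when the files differ greatly in size.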
Thus, with the help of text processing it is easier to identify students’ interests in the online learning environment. Similarly, we can compare various patterns among students according to the university’s specifications.
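To make the classification step of the demonstration more concrete, the following Python sketch implements a small multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is not the RapidMiner operator and may differ from it in detail; the training documents, labels and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Train on (tokens, label) pairs; return the counts needed to predict."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary


def predict(tokens, model):
    """Return the label with the highest log posterior for the tokens."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (
                total_words + len(vocabulary)
            )
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled training documents.
training = [
    (["marketing", "brand", "campaign"], "marketing"),
    (["customer", "marketing", "survey"], "marketing"),
    (["network", "router", "protocol"], "networking"),
    (["signal", "network", "antenna"], "networking"),
]
model = train_naive_bayes(training)
label = predict(["brand", "survey", "network"], model)
```

Working with log probabilities rather than multiplying raw probabilities avoids numerical underflow when documents contain many words.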
VII. CONCLUSION & FUTURE IMPLICATIONS
The proposed research gives a clear insight into using the available student data effectively and efficiently. The attributes discussed in this research will benefit educational stakeholders by letting them focus more on students’ academics based on the students’ predicted interests. The prediction largely depends on students’ online participation, which will also help produce valuable outcomes if further research is done in similar areas. Future implications include evaluating the performance of students individually, so that lecturers can provide the needed assistance to particular students; improving study materials; and, in some cases, evaluating the performance of the lecturer. For this purpose, learning analytics tools focused solely on individual enhancement of learning can be used.
VIII. REFERENCES
Ai Yubing, Zhang Jianping, 2010. ‘The Application of Data Mining Technology in Distance Learning Evaluation’, International Forum on Information Technology in Distance Learning Evaluation.
Cristianini, N., Shawe-Taylor, J., 2000. ‘An
Introduction to Support Vector Machines and
other kernel-based learning methods’.
Cambridge University Press.
Dhanalakshmi, V., Dhivya Bino, 2016. ‘Opinion mining from student feedback data using supervised learning algorithms’, 3rd MEC International Conference on Big Data and Smart City.
Gabrilson, S., Fabro, D. D. M., Valduriez, P., 2008. ‘Towards the efficient development of model transformations using model weaving and matching transformations’, Office of Information Technology, Georgia Department of Education.
Hsu Chia-Ling., 2012. ‘Qualitative Text
Mining in Student’s Service Learning Diary’.
Third International Conference on
Innovations in Bio-Inspired Computing and
Applications
Kim, H., Zhang, Y., Oussena, S., and Clark,
T., 2009. A Case Study on Model Driven
Data Integration for Data Centric Software
Development, In Proceedings of ACM First
International Workshop on Data-intensive
Software Management and Mining
Luan, J. 2002. ‘Data mining and knowledge
management in higher education –
potential applications’. In Proceedings of
AIR Forum, Toronto, Canada.
Mazon, J. N., Trujillo, J., Serrano, M.,
Piattini, M., 2005. ‘Applying MDA to the
development of data warehouses’. DOLAP
2005
Oussena, S., 2008. ‘Mining Courses
Management Systems’. Thames Valley
University.
Smith, P. L., Ragan, T. J., 1993. ‘Instructional design’. Macmillan, New York.
Pathros Ibarra García, E. 2011, ‘Model
Prediction of Academic Performance for
First Year Students’, Mexican International
Conference.
Weiss, S. M., Indurkhya, N., Zhang, T., Damerau, F., 2005. ‘Text mining: predictive methods for analyzing unstructured information’. Springer Science+Business Media, Inc., New York.
Schönbrunn, K., Hilbert, A., 2006. ‘Data Mining in Higher Education’. Advances in Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, Proceedings of the 30th Annual Conference of the Gesellschaft für Klassifikation e.V., Berlin.
Seidman, A., 1996. ‘Retention Revisited: RET = E Id + (E + I + C)Iv’. College and University, 71(4), Spring, pp. 18-20.
National Audit Office, 2007. ‘Staying the course: the retention of students in higher education’.
Superby, J.F., Vandamme, J-P., Meskens,
N., 2006. ‘Determination of factors
influencing the achievement of the first-
year university students using data mining
Methods’. Workshop on Educational Data
Mining.
Tinto, V., 2000. ‘Taking student retention
seriously: rethinking the first year of
college’, NACADA Journal, Vol. 19 No. 2,
pp. 5-10.
Thomas, L., 2002. ‘Student retention in
higher education: the role of institutional
habitus’, Journal of Education Policy, Vol.
17 No. 4, August, pp. 423-442.
Yorke, M., Longden, B., 2004. ‘Retention
and student success in higher education’ ,
Society for Research in Higher Education.