The document outlines recommendations to improve the search experience on a library website. It discusses analyzing log data from the library search engine and EZProxy to help enhance the user experience through a new recommender system. The project will collect and analyze activity data to provide personalized recommendations and improve search capabilities. Privacy and ethics are also addressed.
2. Why?
“The search engine on the library is not very user friendly. I had to find a specific article recommended in the text and it took several attempts to locate it.”
“The search facility is poor and doesn’t find stuff that is supposed to be there”
http://www.flickr.com/photos/james_lumb/3921968993/sizes/z/in/photostream
3. New search system
New generation Discovery System from EBSCO
http://www.flickr.com/photos/jiscimages/435135071/sizes/m/in/photostream/
4. Could we do more?
http://www.flickr.com/photos/davepattern/5808712333/sizes/z/in/photostream/
5. Recommendations Improve the Search Experience?
“That recommender systems can enhance the student experience in new generation e-resource discovery services”
6. Recommendations Improve the Search Experience?
Can you use search data to make recommendations?
Are recommendations useful in Discovery systems?
http://www.flickr.com/photos/davepattern/3473326634/sizes/z/in/photostream/
7. JISC Activity Data Programme
JISC funded project
February – July 2011
One of eight projects [list at http://bit.ly/gwCmNS]
http://www.open.ac.uk/blogs/rise
8. Why activity data?
"Every day I wake up and ask, 'how can I flow data better, manage data better, analyse data better?'"
Rollin Ford, the CIO of Wal-Mart
http://www.flickr.com/photos/zerimski/5215633183/sizes/z/in/photostream/
11. Library systems environment
Athens DA authentication built into local (SAMS) login system
EZProxy remote resource access
SFX knowledge base and OpenURL link resolver
Ebsco Discovery Solution
12. Scope of the project
Activity data
Algorithms & recommender code
Search interface
14. So what is in the EZProxy logs?
• Remote host
• Date/Time
• Oucu
• Request
• Status
• Size of response
• Referrer
• User agent
• Session
http://www.flickr.com/photos/vixon/116447718/sizes/m/in/photostream/
15. So what is in the EZProxy logs?
"0"|||"137.108.143.168"|||20110115235421|||"nn1234"|||"GET http://libezproxy.open.ac.uk:80/connect?Session=st3ShtizgtrS7tU5&url=http://search.ebscohost.com/login.aspx?direct=true&site=edslive&scope=site&type=0&cli0=FT&clv0=Y&cli1=FT1&clv1=Y&authtype=ip&group=VCStud&bquery=War%20Against%20the%20Panthers HTTP/1.1"|||302|||0|||http://library.open.ac.uk/|||"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13"|||"t3ShtizgtrS7tU5"
16. So what is in the EZProxy logs?
(Same record as on the previous slide, with the date and time field highlighted)
17. So what is in the EZProxy logs?
(Same record as on the previous slide, with the user name field highlighted)
18. So what is in the EZProxy logs?
(Same record as on the previous slide, with the request field highlighted)
22. What can the data tell us?
People on course ‘A’ viewed resource ‘B’
People who looked at resource ‘C’ also looked at resource ‘D’
Which are the most popular resources
This resource is being used by people studying this course
23. But what isn’t there?
ISSNs
DOI
Article information
Subject terms
http://www.flickr.com/photos/kevharb/5466661946/sizes/z/in/photostream/
24. So how do you improve your data?
EZProxy: Remote host | Date/Time | Oucu | request | status | size of response | referrer | user agent | session
CIRCE: user type | course code(s)
EDS, Crossref: Bibliographic data matching
25. So what about collecting more data?
http://library.open.ac.uk/rise
www.open.ac.uk/libraryservices/rise/
http://www.open.ac.uk/blogs/rise
30. So how do you improve your data?
EZProxy: Remote host | Date/Time | Oucu | request | status | size of response | referrer | user agent | session
CIRCE: user type | course code(s)
EDS, Crossref: Bibliographic data matching
RISE: Searches in RISE
32. What can the data tell us?
People on course ‘A’ viewed resource ‘B’
People who looked at resource ‘C’ also looked at resource ‘D’
People who searched for subject ‘E’ looked at resource ‘F’
People are looking at resources on this subject
This resource is being used by people studying this course
34. Getting a recommendation
User A views Resource B (Module A123): views +1, RV=14 → RV=15
User C views recommended Resource B (Module A123): views +1, RV=15 → RV=16
User C rates Resource B (Module A123) Useful: +1, RV=17
User C rates Resource B (Module A123) Not Useful: -2, RV=14
35. Data Protection and privacy
Added a privacy policy to RISE,
EDS and SFX interfaces
Provided an opt-out feature
Privacy and opt-out URL
http://library.open.ac.uk/rise/?page=privacy
If we go back to 2009, it became obvious that library search simply didn’t work as well as users expected it to. We were getting the sort of comments you see on screen, which showed that library users were struggling with the federated search system that we were using. So the library embarked on some work to improve search, with a new discovery search system and other changes.
We changed the search system to a new generation of library search system from EBSCO. Instead of searching library resources individually and telling you how many results are in each database, it now searches one index and shows the results in a single list.
We started thinking about whether there was more that we could do to improve the user experience. For a while we’d been following with interest some JISC work, in projects such as TILE and MOSAIC, looking at whether activity data could be used by libraries to improve services. So we started to think about whether using activity data could improve the user experience of library search.
So when we knew that JISC were going to be funding some more work on activity data, we thought about what we’d want to do, and came up with this hypothesis.
The project we came up with was RISE (Recommendations Improve the Search Experience). We set out to test two things. Can you use search data to make recommendations? And are recommendations useful for these new systems?
RISE was funded as part of the Activity Data strand of the JISC Infrastructure for Education and Research programme. It was a very short project, just six months, with a small team (developer, project manager). There were seven other projects in the programme. Some were working with libraries, such as SALT and LIDP. Others were looking at activity data in a range of other areas, from VLEs, through repositories and student systems, to video-conferencing data, and including the UCIAD project in KMi, which was looking at a user-centred approach to web clickstream data.
The business sector, particularly companies such as Tesco, Amazon and Wal-Mart, exploits the data it has about customer activities to support decision making. Some early research by JISC, in the TILE and MOSAIC projects, identified that the HE sector also had extensive user data and that there was some potential to make use of it, but that it was greatly underused. So this JISC programme has set out to explore this area in more detail. Across the sector we are being told to be more business-like, and the use of customer data is one of the areas that businesses seem to be exploiting far more than we do.
For a traditional ‘bricks and mortar’ university these are some of the ways that you’d typically interact with your customers. Well, for the OU things are a bit different.
We don’t really loan many books to students or have many accessing the library. All our students are distance learners, so they interact with us online and use our resources electronically. And with more than 450,000 unique users of our website and over 100,000 unique users of our e-resources each year, there’s a fair amount of activity data for us to use.
So, if we are concentrating on our e-resources, the systems we use are as follows. There is SAMS single sign-on. There is the EZProxy system from OCLC, which allows students to access our resources as if they were locally within the library. We are using SFX from Ex Libris as our resources knowledge base and as the OpenURL link resolver. And finally there is the Ebsco Discovery Solution, in place of an older federated search system.
The stages of the project were to build the database, fill it with activity data, write some software to create the recommendations, create a search interface to show the recommendations, and test it with some users.
We push as much as possible through EZProxy, so we use it for access through our discovery solution, for links from SFX, and for links placed in our VLE. So it seemed the obvious choice as the place to start to look at e-resource activity data. We didn’t have access to the Ebsco Discovery log files and we hadn’t been using that system for long, whereas we did have a few months of log files from EZProxy. So we started with the EZProxy log files as the core dataset.
So when we start to look in detail at what data is contained within the log files, you’ve got some useful data and other data that isn’t so useful for activity data purposes. We know the user name: that’s the oucu, the Open University Computer User account name. We know the request, that is, the website that is being accessed. So when you look at the detail of the record, what you get is…
Something that looks like this (we’ve anonymised the oucu for obvious reasons). This is one record out of tens of thousands of rows, but with a bit of work you can break it down.
So you’ve got the date and time – useful to be able to know when something happened
And the oucu of the user
And the request that has been made – in this case an ebsco host search
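A minimal sketch of how a record like this could be broken down, in Python. It is an illustration rather than the code used in RISE. The field names are assumptions taken from the slide listing, the first column ("0") is not labelled on the slides so it is simply called `leading` here, and the sample is the anonymised record joined onto one line.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

# Assumed field order; "leading" is a placeholder name for the unlabelled first column.
FIELD_NAMES = [
    "leading", "remote_host", "timestamp", "oucu", "request",
    "status", "size", "referrer", "user_agent", "session",
]

# The anonymised sample record from the slides, joined onto one line.
SAMPLE = (
    '"0"|||"137.108.143.168"|||20110115235421|||"nn1234"|||'
    '"GET http://libezproxy.open.ac.uk:80/connect?Session=st3ShtizgtrS7tU5'
    '&url=http://search.ebscohost.com/login.aspx?direct=true&site=edslive'
    '&scope=site&type=0&cli0=FT&clv0=Y&cli1=FT1&clv1=Y&authtype=ip'
    '&group=VCStud&bquery=War%20Against%20the%20Panthers HTTP/1.1"|||'
    '302|||0|||http://library.open.ac.uk/|||'
    '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 '
    'Ubuntu/10.10 (maverick) Firefox/3.6.13"|||"t3ShtizgtrS7tU5"'
)


def parse_record(line):
    """Split one '|||'-delimited log line into a dict of typed fields."""
    parts = [p.strip().strip('"') for p in line.strip().split("|||")]
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(parts)}")
    rec = dict(zip(FIELD_NAMES, parts))
    rec["timestamp"] = datetime.strptime(rec["timestamp"], "%Y%m%d%H%M%S")
    rec["status"] = int(rec["status"])
    rec["size"] = int(rec["size"])
    # The request field has the form "<method> <url> <protocol>".
    method, rest = rec["request"].split(" ", 1)
    url, protocol = rest.rsplit(" ", 1)
    rec["method"], rec["url"], rec["protocol"] = method, url, protocol
    return rec


def search_terms(url):
    """Return the decoded 'bquery' parameter from a proxied EBSCOhost URL, if any."""
    terms = parse_qs(urlsplit(url).query).get("bquery")
    return terms[0] if terms else None


record = parse_record(SAMPLE)
print(record["timestamp"], record["oucu"], search_terms(record["url"]))
```

Splitting on the `|||` delimiter and stripping the surrounding quotes is enough for this sample; a real log may need more defensive handling of malformed rows.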
So our database starts to build up with details of users and resources
So we can get data about the courses that students were studying from our internal student information system
So that added a bit more to the mix
So, the data we have so far can tell us which courses people are on, so we can make recommendations based on that, i.e. these are the most popular resources that people on your course are looking at. We can also start to say that if you looked at resource C and then straightaway looked at resource D, there is a likelihood that there is some relationship between resource C and resource D. And we can also say which overall are the most popular articles or journals. But there are limitations to the EZProxy data: we don’t have the search terms that are used to find these resources.
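The two kinds of relationship described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `views` list is made-up data standing in for parsed log rows already matched to a course, and the function names are hypothetical rather than anything from the RISE code.

```python
from collections import Counter, defaultdict

# Hypothetical view events in time order: (oucu, course, resource).
views = [
    ("u1", "A123", "B"),
    ("u2", "A123", "B"),
    ("u3", "A123", "B"),
    ("u1", "A123", "C"),
    ("u1", "A123", "D"),
    ("u2", "A123", "C"),
    ("u2", "A123", "D"),
]


def popular_by_course(events):
    """Count how often each resource was viewed by people on each course."""
    counts = defaultdict(Counter)
    for _user, course, resource in events:
        counts[course][resource] += 1
    return counts


def viewed_next(events):
    """Count C -> D pairs where the same user viewed D straight after C."""
    last_seen = {}
    pairs = defaultdict(Counter)
    for user, _course, resource in events:
        previous = last_seen.get(user)
        if previous is not None and previous != resource:
            pairs[previous][resource] += 1
        last_seen[user] = resource
    return pairs


print(popular_by_course(views)["A123"].most_common(1))
print(viewed_next(views)["C"].most_common(1))
```

Counting only immediately consecutive views is a deliberately crude reading of "straightaway"; a time window per session would be an obvious refinement.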
But there are limitations. From the logs you don’t always know what search terms were used or have much information about the item that is being accessed. And if you want to make a recommendation, you don’t even have an article or journal title to show as the recommendation. So we looked at how we could improve the data. At the moment we use another EDS API call to extract bibliographic details, which are then used to look up data from Crossref that we can store in the database.
So we decided that we could use the EDS API to retrieve some bibliographic data. Originally we’d hoped that we would be able to store basic metadata from EBSCO in the system, but after discussion with them we realised that the license terms wouldn’t let us do that. So we had to look for other metadata sources that we could use. So we set the system up to retrieve data keys from EBSCO and use them to search Crossref. The Crossref data license allows you to store that data locally.
We created a test search interface to test recommendations with users using the Ebsco Discovery Solution API.
And when you get your search results, you also get recommendations based on the articles viewed by people who used similar search terms
If you view one of the recommended resources it will open the record in another window and you are given the chance to rate the usefulness of the recommendation.
We also built a second interface – this one is a Google Gadget version with pretty much the same functions as the main interface.
Login was sorted out by working with the SocialLearn team
We also then started to capture search terms used in the RISE interface
Now we can add search terms that are being used
So we’ve ended up with a set of data that can give us a range of different types of recommendations, from ‘people on your course are looking at these articles’, through ‘people who looked at this article also looked at this article’, to ‘people using this search term looked at these resources’. And we are sure that you could put the data to other types of use.
When we were looking at recommendations we thought that the simplest approach was just to start with something very basic. What drives the recommendations is a set of relationship values. Values are assigned based on resource views and subsequent ratings by users. The relationships are ranked according to value, so the top ones get shown as recommendations.
Each relationship starts at value 0, then:
• +1 each time the resource is viewed
• +1 each time the recommendation is viewed
• +1 each time the recommendation is rated as ‘Useful’
• -2 each time the recommendation is rated as ‘Not Useful’
Recommendations are displayed in value order.
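The scoring rule above is simple enough to sketch directly. The following Python is an illustration, not the RISE implementation: the class, method and event names are hypothetical, and only the four weights come from the slides. The worked values reuse the RV=14 example for Resource B on Module A123.

```python
from collections import defaultdict

# Weights as stated on the slides; the event names are illustrative.
EVENT_WEIGHTS = {
    "resource_viewed": 1,
    "recommendation_viewed": 1,
    "rated_useful": 1,
    "rated_not_useful": -2,
}


class RelationshipValues:
    """Keeps a running value for each (context, resource) relationship."""

    def __init__(self):
        # Every relationship implicitly starts at 0.
        self.values = defaultdict(int)

    def record(self, context, resource, event):
        """Apply one event and return the updated relationship value."""
        self.values[(context, resource)] += EVENT_WEIGHTS[event]
        return self.values[(context, resource)]

    def top(self, context, n=5):
        """Return up to n resources for a context, highest value first."""
        ranked = [(value, res) for (ctx, res), value in self.values.items()
                  if ctx == context]
        ranked.sort(key=lambda pair: (-pair[0], pair[1]))
        return [res for _value, res in ranked[:n]]


rv = RelationshipValues()
rv.values[("A123", "B")] = 14                          # starting point from the slide
print(rv.record("A123", "B", "resource_viewed"))        # 15
print(rv.record("A123", "B", "recommendation_viewed"))  # 16
print(rv.record("A123", "B", "rated_not_useful"))       # 14
```

Here the context is a module code, but the same structure would serve for "resource C to resource D" or "search term to resource" relationships by changing what is used as the key.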
Any system that deals with personal data has to be mindful of privacy and data protection requirements. After discussion within the Activity Data programme, and some helpful information particularly from EDINA’s OpenURL project, we put together a specific privacy policy and discussed it with our data protection people at the University. The policy explicitly covered activity data and we have linked to it from the RISE interfaces, from our main EDS page and from SFX. The policy gives people an opt-out to have their data removed from the recommendations, even though they aren’t identified personally in any of the recommendations. With the new EU ‘cookies’ legislation we are doing some more work to ensure that we are legally compliant. Ideally we would want any institutional ‘cookie’ policy and agreement to cover permission to use data for this type of activity.
The original plan with the project was to be able to release an open data set of search data. And we spent quite a lot of time looking at methods of anonymising the data, by removing oucus, genericising courses to broad subjects and looking at whether there was a threshold of students that we needed on a course to be able to release any data from that course. We faced a major challenge because the activity data we had was fairly meaningless without some article metadata, and at the time we could only find data we could use ourselves and nothing we could make available in an open data set. So unfortunately it wasn’t possible to release the data. But others at EDINA, LIDP and SALT were able to do so.
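The three anonymisation steps mentioned here (removing oucus, genericising courses to broad subjects, and applying a minimum number of students per course) could be sketched as below. This is a hypothetical Python illustration only: the course-to-subject mapping, the threshold value and the function name are all assumptions, and passing a filter like this is not by itself a guarantee that data is safe to release.

```python
from collections import defaultdict


def anonymise(events, course_subject, min_students=5):
    """Drop oucus, map courses to broad subjects, and suppress small courses.

    events: iterable of (oucu, course, resource) tuples.
    course_subject: assumed mapping from course code to a broad subject label.
    min_students: assumed threshold of distinct students needed on a course.
    """
    events = list(events)
    students = defaultdict(set)
    for oucu, course, _resource in events:
        students[course].add(oucu)

    released = []
    for _oucu, course, resource in events:
        if len(students[course]) < min_students:
            continue  # too few students on this course to release anything
        released.append((course_subject.get(course, "Other"), resource))
    return released


sample = [
    ("u1", "A123", "B"),
    ("u2", "A123", "C"),
    ("u3", "A123", "B"),
    ("u4", "B456", "D"),
]
print(anonymise(sample, {"A123": "Arts", "B456": "Science"}, min_students=2))
```

In this sample the single-student course is suppressed entirely, and the remaining rows carry only a subject label and a resource.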
The Google Gadget will go into the list of tools for students alongside those being developed by DOULS. We are migrating the database so we can use it for more mainstream use. We plan to use it for the new MACON mobiles search project. And we’re interested in how this data could be used by Learning Analytics.
We are also looking at how we can use these approaches to provide personalised services to users through the library website, so we have been looking at being able to show people what articles are being looked at and have been developing some beta services to demonstrate this.
With EZProxy data on its own, there are limits to the recommendations you can make: they would mostly be about which are the most popular resources. Our main issue is to get access to bibliographic data about the articles being accessed and recommended. You need to combine the EZProxy data with other data, such as CIRCE data. The more data you can get hold of, the better you can make the recommendations. License restrictions on article-level metadata limit what you can store in your database.
I’m now going to hand over to Liz, who will take you through the findings of the testing with users.