Kirsty Meddings discusses CrossCheck plagiarism screening tool from CrossRef at the 2010 Society for Scholarly Publishing Annual Meeting in San Francisco on an ethics panel
2. "CrossRef is a not-for-profit membership
association whose mission is to enable easy
identification and use of trustworthy electronic
content by promoting the cooperative
development and application of a sustainable
infrastructure."
8. ACTA Press ● American Academy of Pediatrics ● American Association for the Advancement of Science ● American Association on
Intellectual and Developmental Disabilities ● American Diabetes Association ● American Geophysical Union ● American Institute
of Physics ● American Physical Society ● American Psychological Association ● American Roentgen Ray Society ● American
Statistical Association ● American Society for Microbiology ● American Society for Nutrition ● American Society of Neuroradiology
● American Society of Plant Biologists ● American Thoracic Society ● Ammons Scientific ● Annual Reviews ● Association for
Computing Machinery ● Australian Academic Press ● BioMed Central ● BioScientifica ● BMJ Publishing Group ● British Institute of
Non-Destructive Testing ● Cambridge University Press ● Cleveland Clinic Journal of Medicine ● Commonwealth Forestry
Association ● Croatian Medical Journal ● CSIRO ● Digital Science Press (Urotoday International Journal) ● EDP Sciences ● Elsevier ●
Environmental Health Perspectives ● European Respiratory Society Journals ● Expert Reviews Ltd ● Fundacion Infancia &
Aprendizaje ● Future Medicine Ltd ● Future Science Ltd ● Geological Society of America ● Hindawi Publishing Corporation ● IM
Publications ● IMAPS ● Inderscience ● INFORMS ● Institute of Electrical & Electronics Engineers ● International Union of
Crystallography ● IOP Publishing ● IWA Publishing ● Journal of Bone and Joint Surgery ● Journal of Histochemistry ● Journal of
Neurosurgery Publishing Group ● Journal of Rehabilitation Research & Development ● Journal of Zhejiang University SCIENCE ●
King Abdulaziz University Scientific Publishing Centre ● Korean Institute of Science and Technology Information ● Mary Ann
Liebert ● Nature Publishing Group ● New England Journal of Medicine ● Oncology Nursing Society ● Optical Society of America ●
Oxford University Press ● Palgrave Macmillan ● Poultry Science Association ● Professional Engineering Publishing ● RMIT
Publishing ● Rockefeller University Press ● Royal College of Physicians of Edinburgh ● Royal Irish Academy ● Sage Publications ●
ScienceAsia Mahidol University ● Society for Endocrinology ● Society for General Microbiology ● Society for Industrial & Applied
Mathematics ● Society of Exploration Geophysicists ● Springer Science + Business Media ● Taylor & Francis (Informa) ● The Royal
Society ● TUBITAK ● Versita (CESJ) ● Vilnius Gediminas Technical University ● Wiley-Blackwell ● Wolters Kluwer Health
21. [Diagram: editorial workflow - Submission, Triage, Acceptance, with Yes/No decision point]
22. [Same diagram, annotated with the possible checking points: Author? On submission? Triage? Prior to acceptance?]
28. [Chart: Documents Checked per month, May 2009 through April 2010; vertical axis 0 to 8,000]
31. CrossCheck Survey, October 2009
At what point in the editorial process are you checking manuscripts?
[Bar chart of responses: Pre-submission (author checking) / On submission / Prior to acceptance / Not checking yet / Other]
33. CrossCheck Survey, October 2009
For your particular publication(s), what percentage of manuscripts are you checking or planning to check?
[Bar chart of responses: All submitted manuscripts / A percentage of manuscripts / Only those that arouse suspicion / Only those that are accepted / Other]
35. CrossCheck Survey, October 2009
Have you detected any plagiarised content using CrossCheck?
[Bar chart of responses: Yes / No / Not sure / No response]
36. Publisher Pilots
At what stage of the editorial process are you using CrossCheck?
[Bar chart, Publisher A: On submission / By reviewers / Acceptance / Post-acceptance / Only if suspicious / More than one of these]
[Bar chart, Publisher B: On submission / After acceptance / Other]
42. Positive Feedback
“This is an invaluable tool and much
appreciated by our Editors.”
“By far the most effective and financially
feasible software that I have found.”
43. “CrossCheck is a valuable tool... Previously I would use Google
Scholar, then need to access the journal article to confirm
suspicions of plagiarism, which was very time consuming.”
44. Issues
Title: Example Article Number One
Authors: S. Smith
8,274 words - 163 matches - 38 sources
45. “In the long run it has saved
enormous amounts of time.”
I’m going to talk about plagiarism and the more practical aspects of plagiarism detection, and specifically the CrossCheck initiative that aims to help publishers and editors tackle this type of misconduct. Over the next 15 minutes or so I’ll give a little background on the project but also an update on how publishers are using CrossCheck and what they are finding.
So I work for CrossRef - I’m pretty sure you all know CrossRef - we’re the DOI people. We work for our membership, and we work to address issues and challenges affecting publishers. And to give some background as to why we launched a plagiarism detection service which is quite distinct from our core linking service, I want to share our mission statement with you. I won’t read it out but...
..I would draw your attention to these words in the middle - part of our remit is to enable identification and use of trustworthy electronic content. All CrossRef members do of course strive to produce trustworthy content, and one aspect of content being trustworthy is knowing that it is original and isn’t plagiarised.
And CrossRef is also about publishers working together to do things that it would be much harder to do individually. CrossCheck, like our core reference linking system, relies on the participation of many publishers to work effectively, as I’ll explain.
So CrossCheck is two years old this month - it was launched in June 2008. I know that quite a few of you here are already CrossCheck participants and will be familiar with how the system works, but for those who aren’t I’ll give a brief explanation. There are two parts to CrossCheck. One is a piece of software called iThenticate which does sophisticated text analysis to identify passages of text that are similar to other passages of text. The other piece is the database of content against which text is checked.
I’ll talk about the latter of these two first, as the database that you screen content against is extremely important and one of the distinguishing features of CrossCheck. You can put text into a search engine and compare it against whatever that engine can find out there on the web, but if you’re screening a research manuscript that probably isn’t going to be especially helpful.
To effectively screen research material you need to compare it with other research material, and most of that is in publications that are on many different publisher platforms and often behind access control. So even if you find a match using Google Scholar you will still need to go to the publisher’s website to see the abstract, which may or may not contain the matching text. If it doesn’t, you need to get access to the full text, which may or may not involve paying, and so on and so forth.
CrossCheck changes this, by giving you access to a large and growing database of scholarly publications to screen manuscripts against.
Every publisher that joins CrossCheck agrees to add the full text of their electronic publications to the database, making this the only service that lets you screen manuscripts against such a large repository of relevant material.
And just very briefly, this is how it works. You submit your manuscript to the iThenticate system, and it is by default checked against three databases. It is checked against web content - iThenticate indexes web pages in much the same way as a search engine, but with the added advantage that they keep an archive of web pages going back eight years.
The manuscript is checked against the CrossCheck database, which contains the content from all of the participating publishers.
And it’s also checked against a growing repository of online and offline content that iThenticate is gathering and indexing, including databases from Gale and Ebsco, and sites such as PubMed and Arxiv.org.
Matches retrieved by comparison with these databases are pulled into a report for an editor to examine in more detail, and the process usually takes two to three minutes for an average sized journal article.
The last step of this process - having an editor look at the report - is critical. iThenticate is an extremely helpful tool, but it is only a tool and in and of itself it can’t detect plagiarism. The technology is excellent at spotting overlapping or similar text, but it’s not always the case that matching text equals plagiarised content. There are legitimate reasons why the same text might appear in two pieces of content - reasons that may be very obvious to a human being but too subtle for a computer. So the use of tools such as iThenticate must always be combined with the domain expertise of an editor who can interpret the results and make a call on the author’s intent.
This is the screen that you see when you’ve uploaded one or more manuscripts to iThenticate. You can see the article titles on the left, author and date processed on the right. The Report column with the square buttons beneath tells you what percentage of text within the manuscript has been found to match text in other documents. The percentages are usually made up of a number of smaller matches, and the different coloured buttons indicate which manuscripts have got matches above or below the threshold that I’ve set for my account - this can of course be varied for each user.
If you see a high percentage match that you want to look at more closely you click on the button
And you get to this, which is the first of four different report manipulations available - this one is called the Similarity Report: Manuscript on left, matches on right from highest to lowest. You may not be able to see on this screen shot but for every match you are given a link on the right hand side to a web page or an article, depending on where the match has been found. Scroll up and down to compare, and you can exclude a match if it’s not relevant. If one of the matches does look suspicious and you want to look at it more closely, you click on the passage of text in the left hand window...
...and you can see the two matching pieces of content side by side. On the left is the manuscript I uploaded, and on the right is the matching article. Importantly you can see the entire article or piece of content on the right, rather than just the matching passage and snippets surrounding it. We feel that it’s important with the kind of specialist content that our members publish that editors are able to see more than that in order to establish context. You can scroll up and down in both screens and start to get a pretty good idea of whether the overlap is legitimate or otherwise.
You might have spotted in the previous examples that the technology isn’t just looking for word for word matches. The way that it breaks the text down allows it to spot passages of text with word substitutions, so it is looking for similar as well as identical text. In this example you can see that some of the words have been very subtly substituted or moved but iThenticate still picks them up.
You can screen a manuscript at any point in the editorial process - it doesn’t have to be done immediately on submission, for example, although many publishers are opting to do this. It might be that you prefer to check just prior to acceptance, or it could be that you use the system to back up or refute suspicions that are raised by reviewers. We have publishers taking all of these approaches as I’ll explain in a moment. We also have one or two publishers who are having their authors do the check ahead of submitting their manuscripts, although this approach is the least common.
The main manuscript tracking systems have all integrated or are in the process of integrating iThenticate so that you can submit manuscripts directly as part of your existing workflow...
...Important to note that none of these systems are dictating when in the process you do the check - they have all left it very open and up to the publisher or user to decide at which point the checking should be done.
The progress of CrossCheck to date. 82 publishers of all sizes and covering all disciplines - not restricted to STM in any way. Most of the large publishers are on board.
48,000 titles - journals, books and conference proceedings.
It’s a very comprehensive database - you can download a list of the titles it contains from our website, where you can also see the list of participating publishers.
This graph shows how many documents have been run through iThenticate month on month. As you can see, early last year there were hardly any being screened, and it’s only really in the past six months that things have started to pick up, with the numbers almost doubling in the last couple of months. The slow start is because it took some time to get things up and running with the indexing in the first year or so of the project, and there was the issue of critical mass too - when the database was still quite small publishers weren’t getting as many matches. But as more publishers have run pilots with their titles and started using iThenticate as a production service we’re seeing numbers really start to climb, and I expect these numbers to continue to rise quite significantly over the coming months.
So now I’d like to share with you some of the results and feedback that we’ve been getting from those using CrossCheck. I’ve drawn this from three sources - a survey that I sent to CrossCheck members in October last year, results from several pilot projects that some of our publishers were kind enough to share with me, and finally feedback from publishers who are up and running and using the system as part of their editorial process.
We did a short survey of CrossCheck members last October, to which 24 organisations responded. Obviously with a relatively low response rate the results aren’t necessarily representative, but I think that they are interesting because they do show that different organisations - at least in the early days of the project - were all taking quite different approaches.
We asked when in the editorial process manuscripts were being checked, and the responses were evenly split between on submission and prior to acceptance, with a further 25% unable to say because they hadn’t started using iThenticate at that point.
Similarly, when asked how many manuscripts they were checking the answers varied - 25% checking all submissions, 20% checking only those that aroused editors’ or reviewers’ suspicions, and others spot-checking a percentage. So we didn’t see any patterns emerging back in October, but this was perhaps to be expected at such an early stage, and I’m hoping that a repeat of the exercise later this year will be quite different.
One result that was encouraging - or discouraging, depending on how you look at it - was that 45% of those who responded reported that they had detected plagiarised content as a result of using CrossCheck.
By comparison, a couple of publisher pilots that were run towards the end of the year show more of a trend. For these two sizable publishers, 63 and 66 percent of their pilot journals were checking manuscripts on submission, although they weren’t necessarily checking all manuscripts and in many cases were looking at a percentage.
I should explain also that although I’m talking about pilots here, these are fully-signed up members, and the pilot projects were to help them work out their plans for wider rollout across many more titles.
One of the publishers asked their testers how they found the iThenticate interface, and the feedback echoed previous comments that I’ve heard about how it’s very user-friendly, with almost half of the users being comfortable with it after a single use.
Again, another encouraging and discouraging result. At two large publishers, 50% of testers discovered cases of plagiarism using CrossCheck and iThenticate.
I don’t have a breakdown of percentages and it’s only anecdotal, but talking to several members recently I’ve been hearing that they are actually uncovering more cases of self-plagiarism, salami-slicing and duplicate submission than they are outright plagiarism.
This was from another publisher: over 70% found the CrossCheck service and iThenticate interface useful enough to want to continue using it, with a further 20% undecided. Only 8% said no.
A few quotes from participating publishers....
I do want to give a balanced view of course, and there are some issues that we’re encountering - one of the main complaints I hear is one of information overload. The matching is quite sensitive and can bring back a lot of results, which can be quite daunting at first, with people unsure how to decide which matches are significant and which aren’t. For the most part we’re finding that the solution to this is experience, and that people do start to get a feel for what constitutes a significant match fairly quickly - and also the definition of a “significant match” varies between disciplines and between titles. There are also some features that iThenticate have introduced to help filter out background noise, such as the ability to exclude matches below a certain percentage or number of words, and to exclude reference sections.
We’ve also had some feedback that the overall similarity score you see on the iThenticate homepage is misleading as it’s a total of all matches - and it’s true you do have to look at the reports that go with the overall score, but again familiarity with the system makes this easier.
We are starting to hear feedback about the savings that publishers are making. The obvious one here is that the service is saving time for editors when compared to alternative ways of investigating suspect papers. The quote here reflects this but does also acknowledge the issue I mentioned on the last slide - that there is an initial investment to get used to the system and the reports. After the initial training it does save time.
And finally some interesting feedback that I got from one journal just recently. I’ll read from the email:
For us, Cross Check is a game changer... We're mostly using it to identify self-plagiarism and repetition, rather than plagiarism of other people's work. Although that happens too.
This has allowed us to really implement pre-refereeing, with the effect that acceptance rates fell from 39% in 2008 to 27% in 2009, and in fact 23% for the second half of 2009 when we started pre-refereeing seriously.
The author community cannot have it both ways. They cannot publish multiple papers from one piece of research and still publish in high impact factor journals. The two things are incompatible. And Cross Check lets us find them out.
So to summarise, we’re really pleased with the progress that this initiative is making two years on. I think it’s fair to say that it has taken a little longer than we expected for things to get up and running and for publishers to start routinely screening documents, but now we’ve really got some momentum going and some results coming in. We hope that as more and more publishers join and use the service we’ll see a rising awareness amongst authors and something of a deterrence factor emerging, and to this end we’re encouraging members to use CrossCheck logos on their websites and content. And of course we welcome new members - the more organisations that join and add their content to the database, the more useful the service becomes for everyone involved in the project.
So these are my contact details and our website if you’d like to find out more. Thank you.