Some collected uses of the British Library Flickr collection, illustrating how a new presentation changed its usage.
It also outlines the bias present in collections, especially in digitised material.
7. Getting to the heart of it
British Library Labs works with researchers on their specific problems, trying to assess how widely each problem is felt.
With their help, we talk to communities of researchers and try to pinpoint what they actually need, as opposed to what they think they need to ask us for.
8. One theme keeps appearing:
All projects to date would have been far easier if every “item” were accessible and citable in a way that a computer can follow (see the sketch below).
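A minimal sketch of what that could mean in practice, assuming a hypothetical resolver service (the URL, identifier scheme, and JSON fields below are invented for illustration; they are not a real British Library API):

import json
import urllib.request

ITEM_ID = "example:item/12345"              # a persistent, citable identifier
RESOLVER = "https://resolver.example.org/"  # hypothetical resolver service

def fetch_item_metadata(item_id):
    # A stable identifier that dereferences to machine-readable metadata
    # lets a program, not just a human reader, follow the citation.
    with urllib.request.urlopen(RESOLVER + item_id) as response:
        return json.load(response)

metadata = fetch_item_metadata(ITEM_ID)
print(metadata.get("title"), metadata.get("source"))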
13. Impact?
Hard to measure but:
- 13-20 million hits on average every month, over 500,000,000 hits to date.
- Over 450,000 tags added by volunteers and machine algorithms.
- Iterative crowdsourcing is key to making the collection more useful to more people.
14. Iterative crowdsourcing?
(The term is borrowed from Mia Ridge.)
1. Crowdsource broad facts, and subcollections of related items emerge.
2. No 'one-size-fits-all': subcollections allow for more focussed curation.
GOTO 1
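Read as pseudocode, that loop is easy to express. Here is a minimal runnable sketch in Python; the ask_crowd_* functions are invented stand-ins for a real crowdsourcing platform, not a British Library Labs pipeline:

from collections import defaultdict

def ask_crowd_for_tags(item):
    # Placeholder for step 1: a broad task such as "tag this image".
    # A real system would post the task to volunteers and collect answers.
    return set(item.get("seed_tags", []))

def ask_crowd_focused_question(subcollection, about):
    # Placeholder for step 2: a narrower task that only makes sense within
    # one subcollection, e.g. "georeference this" for items tagged 'map'.
    print(f"focussed pass on {len(subcollection)} item(s) tagged {about!r}")

def cluster_by_tag(items):
    # Subcollections of related items emerge from the broad tags.
    subcollections = defaultdict(list)
    for item in items:
        for tag in item["tags"]:
            subcollections[tag].append(item)
    return subcollections

def iterative_crowdsourcing(items, rounds=2):
    for _ in range(rounds):                  # the "GOTO 1"
        for item in items:                   # 1. crowdsource broad facts
            item.setdefault("tags", set()).update(ask_crowd_for_tags(item))
        for tag, sub in cluster_by_tag(items).items():
            ask_crowd_focused_question(sub, about=tag)  # 2. focussed curation
    return items

items = [{"id": 1, "seed_tags": ["map"]}, {"id": 2, "seed_tags": ["portrait"]}]
iterative_crowdsourcing(items)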
41. Infancy of understanding
Large-scale analysis of text is evolving but still young.
This creates an exasperating situation where algorithmic ‘black boxes’ are used to draw conclusions.
http://www.scottbot.net/HIAL/?p=41271
42. “Black Boxes”: a misnomer
It is legitimate and useful to use code that you could not write.
It is not legitimate to simply believe the ‘label’ on the side of the box.
E.g. “Sentiment Analysis” is often nothing of the sort.
43. Quoting Scott Weingart: (emphasis mine)
● Do sentiment analysis algorithms agree with one another enough to be considered valid?
● Do sentiment analysis results agree with humans performing the same task enough to be considered valid?
● Is Jockers’ instantiation of aggregate sentiment analysis validly measuring anything besides random fluctuations?
● Is aggregate sentiment analysis, by human or machine, a valid method for revealing plot arcs?
● If aggregate sentiment analysis finds common but distinct patterns and they don’t seem to map onto plot arcs, can they still be valid measurements of anything at all?
● Can a subjective concept, whether measured by people or machines, actually be considered invalid or valid?
(again from http://www.scottbot.net/HIAL/?p=41271)
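Weingart's first question can be probed directly. A minimal sketch, assuming the off-the-shelf vaderSentiment and TextBlob packages (my choice of tools, not anything named in the talk), checks whether two popular analyzers even agree on the sign of a few sentences:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

sentences = [
    "The journey through the Alps was breathtaking.",
    "The inn was damp, cold, and the food barely edible.",
    "We departed at dawn.",  # neutral-ish: where analyzers often diverge
]

vader = SentimentIntensityAnalyzer()
agreements = 0
for text in sentences:
    v = vader.polarity_scores(text)["compound"]  # in [-1, 1]
    t = TextBlob(text).sentiment.polarity        # in [-1, 1]
    same_sign = (v > 0) == (t > 0)
    agreements += same_sign
    print(f"{text!r}: vader={v:+.2f} textblob={t:+.2f} agree={same_sign}")

print(f"sign agreement: {agreements}/{len(sentences)}")

If two analyzers routinely disagree on such simple cases, simply believing the ‘label’ on either box is clearly risky.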
44. “I am interested in travel accounts in Europe during the 19th Century”
46. Bias in digitisation
The tool was made to give a statistically valid sample.
Because so little has been digitised, it showed how skewed the digital corpus is compared to the overall holdings.
Allen B. Riddell, in “Where are the novels?”*, estimates that using HathiTrust’s corpus:
“... about 58%—somewhere between 47% and 68%—of the 2,903 novels [all publications in English between 1800 and 1836] have publicly accessible scans.”
* (2012) https://ariddell.org/where-are-the-novels.html
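Interval estimates like Riddell's “47% and 68%” come from treating the checked titles as a random sample and computing a confidence interval for a proportion. A minimal sketch of a Wilson score interval (the counts below are invented for illustration; they are not Riddell's data):

import math

def wilson_interval(successes, n, z=1.96):
    # 95% Wilson score interval for a binomial proportion.
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# e.g. 58 of 100 sampled novels found with publicly accessible scans
low, high = wilson_interval(58, 100)
print(f"point estimate 58%, 95% CI: {low:.0%} to {high:.0%}")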
47. In Summary:
- Context about how a digitised image came to be and why it was scanned is both crucial to understand and sometimes crucial to hide.
- i.e. opening up large collections brings its own issues.
- Presentation shapes perception.
- We place too much trust in black-box algorithms, like search engines or social feed suggestions.
- So little of our history is online that there is a natural bias. The gaps are being filled in with less credible sources.
- It still might have happened even if you cannot google it, and vice versa!