This document discusses the need to manage and preserve research data. It notes that scholars increasingly generate large amounts of data in their work but that support for managing, storing, and ensuring future access to research data is still developing. It argues that all researchers, regardless of field or the size of their work, should consider how to ensure their data can be discovered, understood, and reused over time. Saving and caring for research data properly requires cooperation across researchers, libraries, and other stakeholders.
2. Cyberinfrastructure
[Word-cloud slide: Grid computing · Data mining · Terabytes · Petabytes · Exabytes · E-Research or E-Science · Collaboration · Identity · Data Curation · Metadata · Standards · IT? Faculty? Libraries?]
9. What I will not talk about today
• Collaboration technology
• Identity management, authentication, authorization, etc.
• Grid computing
• Instrument science
• Open Notebook Science
Of course these are important.
I'm just not competent to opine.
Fortunately, you have Melissa!
14. In case you're wondering...
"Converting PDF to XML is a bit like converting hamburgers into cows."
—Michael Kay
<http://lists.xml.org/archives/xml-dev/200607/msg00509.html>
15. Do we have to keep data?
SOMETIMES.
(but it's often a good idea even if you don't have to)
20. What can be done with data?
• Experimental validation
• Meta-analysis, data-mining, mashups
• Interdisciplinary investigation
• Historical investigation
• Modeling and model validation
• ... the possibilities are endless -- IF we have the cows (the data).
32. Librarians
["But what I see happening is..." -- slide quotation garbled in extraction; its recoverable fragments speak of the "hybrid" people who help others find, access, and understand information and make it usable, and of librarians' part in that success.]
33. Grant administrators
Cows don't corral themselves.
Neither do researchers.
34. The big gray area
Informaticists?
Researchers who code?
IT pros who grok metadata?
Librarians who model data?
40. Ten Questions
1. What is the story of your data?
2. What form and format are the data in?
3. What is the expected lifecycle of your data?
4. How could your data be used, reused, and repurposed?
5. How large is your dataset, and what is its rate of
growth?
6. Who are the potential audiences for your data?
7. Who owns the data?
8. Does the dataset include any sensitive information?
9. What publications or discoveries have resulted from the
data?
10. How should the data be made accessible?
—Michael Witt and Jake Carlson, Purdue University
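As a thought experiment, the answers to Witt and Carlson's ten questions could travel alongside the dataset itself as a small machine-readable record. A minimal sketch in Python; every field name and value below is illustrative, not any standard:

```python
import json

# Hypothetical answers to the ten data-interview questions, kept as a
# plain dictionary so the record can live next to the data it describes.
data_record = {
    "story": "Water-quality samples from three Wisconsin lakes, 2005-2008",
    "form_and_format": "CSV tables exported from lab instruments",
    "expected_lifecycle": "active analysis 3 years, archival retention 10+",
    "reuse_potential": ["meta-analysis", "historical comparison"],
    "size_and_growth": {"size_gb": 12, "growth_gb_per_year": 4},
    "audiences": ["limnologists", "state environmental agencies"],
    "owner": "PI and home institution (check grant terms)",
    "sensitive_information": False,
    "resulting_publications": [],
    "access_method": "public download after embargo",
}

# Serializing the record keeps the context and the bits together.
print(json.dumps(data_record, indent=2))
```

A record like this is exactly the header row that, chopped off, would render the dataset meaningless later.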
Good morning, and thank you for coming. My name is Dorothea Salo, and I work for the University of Wisconsin System as an odd sort of digital archivist. I do have strong interests in the area of cyberinfrastructure, as I hope to prove to you today, and so Melissa asked me to come here and talk to you a little bit about my angle on the whole cyberinfrastructure thing. And I promise you will understand the title by the time I'm done talking. Cross my heart.
So, when we say the word cyberinfrastructure, some of the first things that come to mind are grid computing, in which we throw a whole lot of little computers working together at huge, massive computational problems, and data mining, in which we throw those computing resources at huge amounts of data on a scale we could never have considered before. (CLICK) Of course, these processes create new data. Terabytes and petabytes of it. And now all the librarians listening to me are wincing, because our shock-and-awe sensors tripped as soon as you could fit the Library of Alexandria on a USB thumb drive, you know what I'm saying? (CLICK) And then the grid computing people start tossing around exabytes, and look, my brain just shuts down. (CLICK) In the UK, what we call cyberinfrastructure is often called "e-science." This, of course, betrays an assumption. (CLICK) So we don't use "e-science" here, because it's not just the physicists and the astronomers and the climatologists; (CLICK) we say "e-research" instead, because it's certainly true that the social sciences, the arts, and the humanities are joining the party too. And with that, we add concerns over collaboration, especially across institutions and across disciplines -- and doing cross-disciplinary collaboration creates sticky issues around identity and authorization, and it all gets very evil and nasty and complicated very quickly. (CLICK) And while we're at it, let's not forget the data I mentioned. An emerging professional specialty, though exactly *where* it's emerging is a really good question, is that of data curation. This brings up questions of metadata, a thing dear to librarian hearts that just made the IT professionals here cringe, and data standards. We have a few of those, in a few disciplines, but not nearly enough, and unstandardized, non-uniform data is something that I think we can all agree makes us ALL cringe! (CLICK) And then there's the question of who's going to do data curation.
Is it an IT function? Are faculty responsible? After all, it's their data! And what about those libraries? (CLICK) And by this time much screaming has ensued and much hair is being torn out. Not least because wow, that is one ugly, ugly slide.
Scholars are using computers, in a number of different form factors, including big old server racks like this one, in their research. This, I am sure, is not news to anyone!
All this computation produces data, sometimes as the point of the exercise, sometimes as a sort of side effect. Data takes all kinds of forms; it's not just numbers. Word-clouds, scanned manuscripts, maps, images on wildly different scales -- it's all bits and bytes; it's all reusable and recomputable -- it's all data!
This is in addition to the books and journals that librarians are familiar with and already care for. Interestingly, as these materials go digital themselves (CLICK), they too can be treated as data, as grist for the computational mill. This doesn't happen as much as it should, honestly, and the reason for that is that even when these materials are digital, they're locked up behind pay-access firewalls to protect the current scholarly-publishing business model, so the computers can't get in to crunch on them. This is a major argument for open access to the literature -- and for those of you who know me and what I do, I hereby reassure you that it's the only open-access argument I'm going to make in this presentation. So to recap a bit, we have our researchers, and they're using computers, and they're generating data.
And that support, librarians, has to happen throughout the entire data lifecycle. And that support, IT professionals, is absolutely not limited to providing computational horsepower and storage. And that support, scholars and researchers, has to include verification and documentation of data-gathering methods, so that everyone knows that everything's on the level, and it's got to include ways to refer back to other people's data that you've used; that's what I mean by "certification" here.
So that's the cyberinfrastructure puzzle as I see it. There are large swathes of it that I'm not going to talk about today...
Now here we are. This is data, right? Nice bar graphs and charts, with a nice key in the corner; you can imagine this on a web page or equally well on a print journal page. (CLICK) NO. No. Not data. This is not data in the sense I mean it.
For optimum reusability, we need to save data before it's distilled into charts and graphs and tables. We need to save the cows before they become hamburger!
So in tight budget times, a very good question to ask is whether it's actually necessary to solve this problem. Even if it is, do we have to solve it now? Do we have to keep all these data? (CLICK) The answer is a resounding -- sometimes. But I do want to add that even when it's not absolutely required, it's often a really good idea. On the Madison campus, we have collected a number of stories of researchers who wish they'd done a better job keeping their data, because a new use turned up for it, often years or decades later! So in what cases is it mandatory?
(mention NIH, distinguish articles from data)
Most of the funders requiring open data are in Europe at the moment, but that's not true of journals. I can't give you a laundry list, because it's very discipline-dependent and also very volatile, but we are seeing more and more science journals instituting data-retention policies. Now, the ones I've seen have usually been time-limited; five or ten years is common. My question is this: if you're going to do it for five or ten years, why not plan for longer? Sure, it makes sense to reassess every now and again, because some datasets do become obsolete. But don't let your thinking be governed by journal requirements; most of the work of keeping a dataset happens before the bits hit storage, so keeping them longer often costs very little at the margin.
There's nothing stopping a journal or a funder from creating an unfunded mandate to keep and preserve data. A few have. And we, collectively, researchers and librarians and IT professionals, are left dangling on the hook, figuring out how to comply. Okay. So that's the stick. Now for the carrot. We're keeping all these data. Why? What's the use?
I've answered this already, for those who were listening at the beginning, but for anybody who came late, and just to reiterate: there's an image of cyberinfrastructure that assumes it's all about the Higgs bosons of this world. Physics, astronomy, and biomedicine. That's who's got all the data, just like they've got all the money.
A broader concern is so-called "small science," which is science without the big bucks -- which is, frankly, most scientists, not that that surprises anyone. The big guns have mostly worked out their data issues, as I've said. The small-science folks -- a lot of them hardly seem to know where to begin. (CLICK) And the sting in the tail here is that there are a lot MORE small-science researchers than big-science ones. This means that if you pile up all their data, there's probably a lot more of it! Each individual data herd is pretty small by comparison with the Large Hadron Collider, granted. But add all those herds together, and we are talking a LOT of cows.
And my dearest loves, the arts and humanities, are hardly devoid of data. A digitized image is data. A digitized book is data, and can be computed upon. The performing arts are pushing out huge amounts of audio and video -- and while we're talking storage capacity, digital video is an unbelievable headache because of file sizes. I like to think about folklorists and ethnographers while I consider digital data in the arts and humanities. Anything you can imagine is grist for their analysis mill, and yes, they are both analyzing digital data and recording their conclusions digitally. So we've all got data, one way or another.
And here's the other thing... We don't have a service-provision model for this. Not in libraries. Not in IT. Not in most regular research practice. Nobody's sure how it's going to get done yet. This is part of why I'm here today. UW Milwaukee is busily trying to sort out how to do all this, in addition to all the other cyberinfrastructure-related things I told you at the beginning I wasn't going to talk about.
We know that apathy is not a solution. And here we often hear someone grumbling that if this were all just paper, it'd be fine; it's this stupid digital stuff that's the problem. Leaving aside that data on paper are completely useless as data, we shouldn't ignore the incredibly complex safety net that libraries have built around paper. Paper doesn't preserve itself either; librarians preserve it! Digital data are no different. We have to take intentional action to keep data viable.
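One concrete form that intentional action takes is routine fixity checking: record a checksum for each file when it enters storage, then recompute and compare later to catch silent corruption. A minimal sketch in Python (the manifest format here is my own invention, not any preservation standard):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 checksum, reading in chunks to handle large files."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def make_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (the deposit step)."""
    return {str(p.relative_to(root)): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files whose bits no longer match the recorded checksums."""
    return [name for name, digest in manifest.items()
            if sha256_of(root / name) != digest]
```

Run `make_manifest` at deposit time, `verify` on a schedule; a non-empty result means it's time to restore from a second copy, which is why preservation always means more than one copy.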
Right, so who's "we"? Okay. Show of hands. Librarians? IT pros? Faculty and researchers? Research support, grant administrators and the like? Right. If you raised your hand at any point, part of this is probably your problem. Which part, I don't know, and anybody who tells you they know is lying and probably trying to sell you something.
So, can you tell a Holstein from an Angus? (I'm just going to die if there's a dairy researcher in the room.) (CLICK) No, I can't either. I can tell you that the Anguses are on the left, because I dug up the photos, but I swear that's the only reason I know. The point of this little parable is that we know absolutely that data curation can't happen without researchers helping and cooperating with other people in the village. This is because data without context and interpretation are meaningless, like a spreadsheet with the header row chopped off -- and researchers are the people with the context and with the ability to interpret. Librarians and IT pros don't automatically understand how a given dataset fits together, how it was created, how other people will expect to search for it or use it, what different parts of it even MEAN. Researchers will have to learn to express these things, if they don't already know how!
IT pros, you're going to be running the big iron. No surprises there. But there are surprises for you in this, such as time horizons you're not used to, mass file-format migrations, and metadata -- internal, external, and relational -- that we can hardly imagine yet... and so on. Don't panic, we're all in this together, and we have examples to work from, especially at the larger scales -- but by the same token, don't make the mistake of thinking you can just sail in and solve this one. It's complicated.
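A first step toward planning those format migrations is simply knowing what formats you hold. A rough sketch, using file extensions as a crude proxy (real format-identification tools go by file signatures, not names):

```python
from collections import Counter
from pathlib import Path

def format_inventory(root: Path) -> Counter:
    """Tally file extensions under root as a rough proxy for file formats."""
    return Counter(p.suffix.lower() or "(no extension)"
                   for p in root.rglob("*") if p.is_file())

# The most common extensions show where migration effort will go first.
```

An inventory like this is only reconnaissance, but it turns "migrate everything someday" into a ranked to-do list.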
Librarians, this is your call to arms. Step up and sit at the table, or the table is going to forget that we exist. This isn't good for the table, and it's not good for us, either. Sure, we're used to dealing with the published literature, and we're fond of its authority and finality. (CLICK) But we're going to have to look earlier in the lifecycle for our greatest impact.
And then there's the big gray area. When I said I didn't know who would do all this? This is what I meant. Some researchers say that the solution is to teach themselves -- or up-and-coming newcomers -- information-management skills so that they become informaticists. Some researchers say that the answer is for researchers to learn to code. All of this will probably happen, in some fields and at some levels. I don't know how it will all shake out, in the long run. But cross-functional training, no matter what end of the research enterprise you're on, is probably the wave of the future.
Infrastructure is more than computers. It's also a policy-and-procedures infrastructure, without which none of this can happen. And finally, as I dearly hope I've made clear, infrastructure is people. Fancy supercomputers aren't worth a penny without people to use them, care for them, and take care of what they compute.
Everyone in this room can do this, and I hope you will. So -- what do you say?
mention Educause
I used so many Creative Commons-licensed photos that I have to actually roll the credits here... while that's happening, let me ask if there are any questions?