Data Curation Notes

Notes on Data Curation Session
Session #2, Friday, 3-4:15pm

Highly recommended Symmetry article about big data in particle physics: http://www.symmetrymagazine.org/article/august-2012/particle-physics-tames-big-data

Some relevant literature:

  • Carlson, J., Fosmire, M., Miller, C. C., & Nelson, M. S. (2011). Determining Data Information Literacy Needs: A Study of Students and Research Faculty. portal: Libraries and the Academy, 11(2), 629–657. http://dx.doi.org/10.1353/pla.2011.0022
  • Cragin, M. H., Palmer, C. L., Carlson, J. R., & Witt, M. (2010). Data sharing, small science and institutional repositories. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 368(1926), 4023–4038. http://dx.doi.org/10.1098/rsta.2010.0165
  • Heidorn, P. B. (2008). Shedding Light on the Dark Data in the Long Tail of Science. Library Trends, 57(2), 280–299. http://dx.doi.org/10.1353/lib.0.0036
  • High Level Expert Group On Scientific Data. (2010). Riding the wave. How Europe can gain from the rising tide of scientific data. (J. Wood, Ed.) Communication. European Union. Retrieved from http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf

Interprofessional configurations within institutions:

  • Where does the subject liaison’s job end and the data librarian’s begin?
  • How do librarians/libraries at various institutions build and maintain relationships with offices of research and with researchers in various academic departments/divisions?
  • interactions/division of labor between librarians (subject-specialist and data), other “i-people” (e.g. “informationists”), researchers, IT people, graduate students
  • developing protocols/best practices for all stakeholders/contributors to follow

Intellectual-property issues associated with publishing datasets or making them available to a wider public:

  • getting buy-in from researchers who often jealously guard their datasets, fearing others will free-ride on their data-collecting efforts and beat them to the publishing finish line; solutions proposed to this problem include:
    • systems for reserving submitters’ rights to first publication on datasets deposited in a repository
    • dataset publication/dataset depositing/dataset disclosure being considered a contribution to be weighed in tenure and promotion decisions
    • using this as a carrot for data openness
      • tabulation of data-citation statistics, weighing data-citation statistics in the calculation of h-index, etc.
      • need for author/researcher ID#s? (unique identifiers for authors/researchers/data collectors)
        • look at ORCID?
  • when research is funded by a university or other research institution, that institution owns the data; if the research is publicly funded (e.g. NIH, NSF), the data has to be made freely available
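
One way to picture the data-citation idea above is to compute an h-index twice, once from paper citations alone and once with dataset citations counted as well. The sketch below is only an illustration: the function name `h_index` and all citation counts are made up for this example, and treating a dataset citation as equal in weight to a paper citation is an assumption, not something the session settled on.

```python
def h_index(citation_counts):
    """Return the largest h such that h items each have at least h citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts, invented for illustration only.
paper_citations = [12, 7, 5, 3, 1]
dataset_citations = [9, 6]

# h-index from papers alone versus papers plus deposited datasets
# (assumption: each dataset citation counts the same as a paper citation).
papers_only = h_index(paper_citations)
with_datasets = h_index(paper_citations + dataset_citations)

print(papers_only, with_datasets)
```

With these made-up numbers, counting the two deposited datasets raises the index, which is the kind of incentive the "carrot" bullet has in mind. A real scheme would also need reliable identifiers for datasets and depositors before any such tally could be trusted.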

How the NSF mandate and other data-transparency/data-availability requirements are affecting relationships between libraries and faculty, academic departments, and other offices

  • grant writing and the data-management-plan requirements of NSF, NIH and other grant-giving agencies/organizations
  • journals’ requirements for timely submission of datasets associated with publications
  • look at Purdue’s PURR (Purdue University Research Repository: https://research.hub.purdue.edu/); Cornell’s data management consultancies (https://confluence.cornell.edu/display/rdmsgweb/consultants); “LICO”? “LYCO”? “LAICO”? at Syracuse

Problems associated with applying data-management/data-curation solutions across disciplines

Logistical problems

  • storage capacity issues with large datasets
  • budgetary issues: e.g. distribution of fiscal burden in interdepartmental initiatives – who pays the data curation/data management bills? The library? Research offices/departments? Academic departments/divisions?
  • hammering out data-management plans for interinstitutional collaborative projects where each institution has its own policies/practices
  • emergency preparedness and data security
  • distinguishing between different levels or stages of data and determining which are to be preserved/submitted and which are to be discarded
    • distinctions between “raw data”, final “data products” associated with a publication or research outcome, and all intermediary stages of data
      • should all datasets (from the original raw data, through all intermediary datasets, to the final “data products”) be preserved?
        • when answering this question one should keep in mind potential externalities of datasets or the possibility that the data could be repurposed in the context of other projects
      • should copies of the code for all software/programs used to process the raw/intermediary data also be preserved?
      • NASA has a scale of different levels of data (where ‘0’ represents the data coming straight off detectors)
      • the aforementioned Symmetry article discusses ways in which data curation practices used in physics could provide instructive examples for developing algorithms etc. to help people in other disciplines distinguish between different levels of data and to determine which data to keep and which to throw out

Ethical/Accountability issues

  • privacy concerns (really only an issue for research in particular disciplines – biomedical, social sciences, possibly for economic research)
  • data provenance

Professional-development issues/questions

  • most of our jobs (i.e. most librarian/informationist jobs) are going to change to require some data chops, but are library-school/i-school programs adapting to produce librarians/informationists with the requisite skill sets? No.
  • if you’re not really working with data in your position currently, how do you keep up with all this?