Princeton held its first research data management workshop Jan. 28-Feb. 1 for graduate students and other researchers. Attendees spent five afternoons in short lectures and hands-on exercises on a wide array of topics related to best practices in research data management, from practical advice on file naming and reproducibility to the legal and ethical considerations in sharing, publishing and reusing data. Some 35 graduate students and postdocs signed up for this training.
The program was organized and sponsored by staff from almost a dozen departments: Princeton University Library (PUL), Princeton Institute for Computational Science and Engineering (PICSciE) and OIT Research Computing; and the offices of the General Counsel, Information Technology, Research and Project Administration, and Research Integrity and Assurance. The event was co-sponsored by the Office of the Dean for Research, Graduate School, Center for Statistics and Machine Learning, and Center for Digital Humanities.
Twenty-two staff members, led by Curtis Hillegas, associate CIO, Research Computing, OIT and PICSciE, and Ma. Florevel (Floe) Fusin-Wischusen, PICSciE institute manager, worked closely with Engineering Librarian Willow Dressel and Scholarly Communications Librarian Yuan Li to develop the research data management curriculum. Topics included the factors necessary for successful research reproducibility; software toolkits aimed at recording workflow information and improving computational transparency; and best practices around data sharing policies, including when to check with the funding agency or journal.
“If your study is receiving funding, the answer is always check with them before you share,” noted presenter April Clyburne-Sherin of Code Ocean, a cloud-based computational reproducibility platform.
In his opening remarks for the workshop, Daniel Marlow, Princeton’s Evans Crawford 1911 Professor of Physics, spoke of the challenges facing researchers in computationally intensive fields. “In big science disciplines, data management problems can be massive,” he explained. “For example, CMS, one of the large experiments at CERN’s Large Hadron Collider, accumulates data sets of multiple exabytes [an exabyte is a billion gigabytes]. Managing this data and making sure that analysts have ready access to it is a huge undertaking, requiring dozens of physicists and software professionals from all over the world. The same management principles are relevant to the data management problems encountered by graduate students and researchers.”
Presenter Bill Wichser, PICSciE’s associate director of Research Computing Systems and Storage, said, “The Research Data Management workshop is a first step to address some of the many issues we have been seeing, including storage.”
Storage is a particularly important phase in the life cycle of research data since it is connected to how well information can be retrieved, shared and disseminated. The rapid growth in how we generate and store data can be compared to the advent of the digital camera, explained Wichser. “With digital cameras, people filled up a memory card, transferred it all to a computer, and went off again taking more pictures. Each image consumed more space, yet it never stopped the influx. Nor did it ever become an issue to go back and delete the ones already taken, since storage was infinite.”
The same digital data explosion has played out in the scientific and research realms, as instruments have become higher in resolution and sensors have become less expensive. “And nothing is ever deleted,” added Wichser. “Why? Because you just never can tell what might be hidden there in the raw format for future reference and research. Today’s data management requires a level of sophistication from our researchers to properly tag and curate their files, so that there is value to it in years to come. Understanding what data researchers have will become ever more important as we move forward. Value can be realized through organization. But more importantly, with finite resources, knowing when to discard something is just as important as keeping it.”
Explained Curtis Hillegas, associate CIO, Research Computing, OIT and PICSciE: “When researchers don’t apply best practices to managing their data, it jeopardizes the significant investment of time, resources and funding behind their work.”
He commended Princeton for creating collaborative initiatives such as this workshop “to provide our researchers at all levels with the data management training and infrastructure that will maximize the impact and benefit of the research conducted here.”
Maximizing the impact and benefit of his research is on Jeffrey Bush’s mind, and is the reason he registered for the data management workshop. A participant in the Emerging Scholars in Political Science post-baccalaureate program, Bush is interested in political economy, public finance and socio-economic inequality research.
“I’m glad I attended the workshop, because I wanted to gain the necessary skills and knowledge to properly structure the different data sources for both my own research as well as when I am working within my research specialist role,” Bush said. “I would say documentation practices will be very useful to me moving forward. Being able to have coding scripts, alongside detailed research notes, that are functional and reproducible would allow others to participate in the debate on the analysis of the project’s findings, thus adding greater depth to the topic while increasing academic engagement with the general public, which is my ultimate goal.”