Data Curation at the University of Michigan
What I've learned as a data curator with the Institute for Social Research
{July 2024 — Present}
The Inter-university Consortium for Political and Social Research (ICPSR) is an international consortium of academic institutions and research organizations, maintaining the world's largest social science data archive. ICPSR is a unit of the Institute for Social Research (ISR) at the University of Michigan. 

Researchers submit studies to ICPSR for inclusion in the data archive, which makes the data available to other researchers in the field. Before a study is added to ICPSR's data archive, it must be curated.

I have been working as a data curator since July of 2024 (ah yes, the joy of two jobs: this one, and my fellowship program!). Data curation is the process of enhancing, organizing, cleaning, and documenting data. This involves a meticulous process of reviewing data content, running scripts, poring over documentation, manipulating data, writing code, and checking for quality. ICPSR's own description of curation (AKA "data enhancement") lists even more tasks.
Examples of my work:
Note: Studies submitted to ICPSR often contain numerous datasets (at times over 100). The traditional approach at ICPSR is to process each of these datasets manually. My goal has been to make curation tasks more efficient by writing code for batch processing.
◈  I wrote an R script to streamline the task of generating "processing history files." Based on the number of datasets and variables specific to the study, the script exports all of the processing history files needed for data processing.
            ➩  A processing history file is an essential part of data curation: it prepares the working data for conversion to multiple statistical package formats. It is written in SPSS syntax.

◈  I wrote a suite of Python scripts to batch process data, running SAS syntax in a loop from Python. I wrote multiple functions to retrieve data, apply variable definition syntax from SAS, and execute the SAS scripts. 
◈  I have learned the syntax of various statistical packages, such as SAS and SPSS, in order to navigate a range of deposited data files. I regularly use Bash on Unix to run scripts.
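To give a rough idea of the batch pattern in the Python example above, here is a minimal sketch. It is an illustration only, not the actual curation code: the function names, the one-program-per-dataset folder layout, and the file names are all hypothetical, and it assumes a `sas` executable on the PATH that accepts the standard batch options `-sysin` and `-log`.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_sas_command(program: Path, log_dir: Path) -> list[str]:
    """Build the command line that runs one SAS program in batch mode."""
    log_file = log_dir / (program.stem + ".log")
    return ["sas", "-sysin", str(program), "-log", str(log_file)]


def run_batch(
    program_dir: Path, log_dir: Path, dry_run: bool = True
) -> list[tuple[list[str], int | None]]:
    """Loop over every .sas file in program_dir and run it (or just plan it).

    Returns one (command, return_code) pair per program. With dry_run=True
    nothing is executed and every return_code is None.
    """
    results = []
    for program in sorted(program_dir.glob("*.sas")):
        command = build_sas_command(program, log_dir)
        return_code = None
        if not dry_run:
            log_dir.mkdir(parents=True, exist_ok=True)
            # A nonzero return code generally signals warnings or errors,
            # so the matching log file should be reviewed.
            return_code = subprocess.run(command).returncode
        results.append((command, return_code))
    return results


if __name__ == "__main__":
    # Hypothetical folder names; dry_run=True only prints the planned commands.
    for command, _ in run_batch(Path("sas_programs"), Path("sas_logs")):
        print(" ".join(command))
```

Keeping the command-building step separate from the execution step makes it possible to check the planned commands for 100+ datasets before anything is actually run.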
Coming Soon: Anonymized excerpts of my Python and R code, and data topic examples.