Data Management Best Practices

Review common guidelines for managing research data. Some recommendations apply better to particular disciplines or research projects, but overall, following these guidelines will save you time and prevent data loss well into the future.


Basic storage

  • Computers and shared servers can be good places for temporary storage of your working files.
  • Cloud storage can be a convenient way to store and share temporary working files, but keep in mind there are many issues with putting data into the cloud. Campus IT Services has prepared helpful guidelines on these services, including important privacy and security concerns: Cloud Services Security Requirements.
  • For long-term storage, data should be put into preservation systems that are well-managed.
  • Use flash drives only for file transfer - they are too easy to lose!
  • Store copies of data in open, stable formats (e.g., ASCII, .txt, .csv, .pdf) for long-term accessibility, but keep a copy in the original format as well, since formatting can be lost during conversion.
  • Consider using an electronic laboratory notebook (ELN) for maintaining and sharing data files within your research group.

Backup

  • Rule of 3: keep three copies of your data, two onsite and one offsite.
  • Back up regularly and frequently, and automate the process if possible (a minimal sketch follows this list).
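
As one illustration of an automated backup, here is a minimal Python sketch that copies a working directory into a date-stamped folder. The source and destination paths are hypothetical placeholders, not a recommended layout:

    # Minimal sketch of a scripted, date-stamped backup. The source and
    # destination paths below are hypothetical placeholders.
    import shutil
    from datetime import date
    from pathlib import Path

    def backup(source: str, backup_root: str) -> Path:
        """Copy the source directory into a date-stamped backup folder."""
        dest = Path(backup_root) / f"{Path(source).name}_{date.today():%Y%m%d}"
        shutil.copytree(source, dest)  # raises if today's backup already exists
        return dest

    backup("Project001", "/mnt/backup_drive")  # e.g., schedule daily via cron

Scheduling a script like this (via cron, Task Scheduler, or your backup software) removes the most common point of failure: forgetting to run the backup.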

Preservation

  • Long-term preservation is not merely storage or backing up of your data. Preservation is the act of making sure your data are secure and accessible for future generations.
  • Identify data with long-term value. Preserve the raw data and any intermediate/derived products that are expensive to reproduce or can be directly used for analysis. Preserve any scripted code that was used to clean and transform the raw data.
  • If data files were created with custom code, provide a software program to enable the user to read the files.
  • If the data can only be read by outdated software, preserve a copy of that software (and record its version) along with the data.
  • Save tabular data in a delimited text format (see the conversion sketch after this list).
  • Save data in uncompressed and unencrypted formats, where possible.
  • Read more about digital preservation and Chronopolis, UC San Diego's solution.
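
Converting a spreadsheet to delimited text can be a one-step script. The sketch below assumes pandas (and openpyxl, for .xlsx files) is installed; the file names are hypothetical:

    # Minimal sketch: export a proprietary spreadsheet to an open, delimited,
    # uncompressed text format for preservation. File names are hypothetical.
    import pandas as pd

    df = pd.read_excel("SiteB_2010_rawdata.xlsx")     # original copy, kept as-is
    df.to_csv("SiteB_2010_rawdata.csv", index=False)  # open-format copy

Keep the original .xlsx alongside the .csv, since formulas and formatting do not survive the conversion.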

Organization

File and Folder organization

  • Choose a consistent filing system that will make sense to you or someone else five years from now.
  • Don't rely on directory structure to provide critical information about file contents, since individual files copied elsewhere will lose the context of their folder structure (e.g., Project001/SiteB/SiteB_2010_rawdata.txt may be better than Project001/SiteB/2010/rawdata.txt).

File-naming conventions

  • Assign descriptive file names. Describe relevant and meaningful aspects of your study – such as what, when, where, how, who, why, scale, and version. The types of useful details will vary across disciplines. Examples:
    • "SiteMDO_PheromoneExpt_2001"
    • "CellJurkat_TreatmentB01_Rep008"
    • "DOLInterview_DoeJane_20061207"
  • Use consistent file names and formats within a project, and if possible, from project to project.
  • If your discipline recommends particular naming conventions, use them!
  • Use capital letters or underscores between words, rather than spaces.
  • When file names include personal names, put the surname first, followed by the first name or initials.
  • Use the date format: YYYY-MM-DD, YYYY_MM_DD, or YYYYMMDD.
  • Keep track of different file versions with a suffix representing the date (e.g., "file_20140620") or a version number (e.g., "file_v001"). Document changes where possible. A small helper sketch follows this list.
  • Several free file-renaming tools are available: Bulk Rename Utility, Renamer, and PSRenamer.
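
To make naming consistent and automatic, a short helper can assemble the pieces. This is a minimal sketch; the naming pattern and field names are illustrative, not a standard:

    # Minimal sketch of a helper that builds consistent, descriptive file
    # names; the pattern (site_experiment_date_version) is illustrative only.
    from datetime import date

    def data_filename(site: str, experiment: str, day: date,
                      version: int, ext: str = "csv") -> str:
        """Assemble a name like 'SiteMDO_PheromoneExpt_20010415_v001.csv'."""
        return f"{site}_{experiment}_{day:%Y%m%d}_v{version:03d}.{ext}"

    print(data_filename("SiteMDO", "PheromoneExpt", date(2001, 4, 15), 1))
    # -> SiteMDO_PheromoneExpt_20010415_v001.csv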

File and data granularity for tabular data

  • Disaggregate your data along columns. For example: record address information in separate variables for street number, street name, city, state, etc., rather than within a single variable containing all of this information as a whole.
  • Record a single piece of data only once. Likewise, make sure there is only one piece of data in each entry of a data table.
  • Minimize redundant data entry by creating a relational database, where information collected at different scales lives in different tables. Then link the tables with unique keys (identification codes) for each record; see the sketch after this list.
  • Keep figures and analyses in separate, companion documents from the source data. Don't place figures or statistics in the data file, for example, summary statistics as the last row/record. Heterogeneity in the data table records will interfere with readability of the data in analytical tools, such as statistical software.
  • Use a single table (with its rows and columns) per spreadsheet. Do not create multiple tables per spreadsheet.
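
As a concrete illustration of the relational approach, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are hypothetical; site-level information is recorded once and linked to measurements by a unique key:

    # Minimal sketch of a relational layout: site details live in one table,
    # measurements in another, linked by a unique site_id key. All names here
    # are hypothetical examples.
    import sqlite3

    con = sqlite3.connect("project.db")
    con.executescript("""
        CREATE TABLE IF NOT EXISTS sites (
            site_id TEXT PRIMARY KEY,   -- unique key, entered once per site
            city    TEXT,
            state   TEXT
        );
        CREATE TABLE IF NOT EXISTS measurements (
            sample_id INTEGER PRIMARY KEY,
            site_id   TEXT REFERENCES sites(site_id),  -- link, not repeated data
            obs_date  TEXT,                            -- ISO 8601: YYYY-MM-DD
            value     REAL
        );
    """)
    con.execute("INSERT OR IGNORE INTO sites VALUES ('SiteB', 'San Diego', 'CA')")
    con.execute("INSERT INTO measurements (site_id, obs_date, value) VALUES (?, ?, ?)",
                ("SiteB", "2010-06-20", 4.2))
    con.commit()
    con.close()

Each measurement row carries only the site_id; the city and state are stored once in the sites table, so a correction there propagates everywhere.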

Documentation and description

Document your work

  • At the project level: When beginning a project, create a text file or spreadsheet with information on personnel and their roles, sponsors/funding sources, methods/techniques/protocols/standards used, instrumentation, software (with versions), references used, and any applicable restrictions on distribution or use - all of the background information about a project that seems so obvious now but which you may very well forget in 5 years. Maintain a list of all data files associated with each project (names and file extensions). Update these documentation files throughout the project!
  • At the file level: Take consistent notes on file changes, name changes, dates of changes, etc. Keep a record of these changes as an internal worksheet within the file or in an external README.txt located in the same folder as the files to which they pertain.
  • Store the uncorrected data file with all its errors. Make corrections within a scripting language that you run on the data; that way, one mistake in data transformation or cleanup doesn't compound another in an untraceable way (a minimal sketch follows this list). Likewise, use a scripted program for analysis, rather than a GUI-driven application. Comment your script heavily, so you can recall which analyses you performed.
  • Describe the method used to create derived data products. Describe data processing performed, software (including version number) used, and analyses applied to data.
  • Consider creating templates for data collection. This helps you to ensure completeness of data collection, promotes early consideration of data description, and can help you to think forward about analysis.
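
Here is a minimal sketch of such scripted corrections, assuming pandas is installed; the file names, column names, and the corrections themselves are hypothetical examples:

    # Minimal sketch of scripted cleanup: the raw file is never edited in
    # place, and every fix is a commented, re-runnable step. All file and
    # column names are hypothetical.
    import pandas as pd

    raw = pd.read_csv("SiteB_2010_rawdata.csv")  # uncorrected original, kept as-is
    clean = raw.copy()

    # Correction 1: sensor read 0.5 degrees C high before 2010-07-01
    # (hypothetical calibration note from the field log).
    early = clean["obs_date"] < "2010-07-01"     # ISO dates compare correctly as text
    clean.loc[early, "temp_c"] -= 0.5

    # Correction 2: standardize the missing-value code before analysis.
    clean = clean.replace(-9999, pd.NA)

    clean.to_csv("SiteB_2010_cleaned.csv", index=False)

Re-running the script regenerates the cleaned file from scratch, so the raw data remain the single authoritative source.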

Describe file contents

  • Describe the scientific context: why the data were collected (questions, hypotheses), environmental conditions during collection, where and when they were collected, the spatial and temporal resolution of the data, and any standards or calibrations used.
  • Include critical information, such as date or location, in the data table, not just as metadata embedded in the file name.
  • Within each data file, include one or more header rows at the top that identify the parameters. Do not use spaces or special characters in headings, as many databases and applications do not allow them.
  • When creating datasets, also create a data dictionary. This is a document that describes the contents of your data files: variables used (including formats), units of measure, and definitions of coded values (including missing values). The data dictionary can be included as a separate tab within your spreadsheet file, or as a companion text file with a similar name (a sketch of one follows this list). It should contain all or most of the following:
    • A complete list of the parameter names used in the dataset. Use standardized naming across files and projects, when possible. Include any abbreviations for those variables in codebooks. Keep abbreviations for variable names consistent, including capitalization.
    • Description of each parameter. What quantity does the parameter represent? How was each measured or produced? If relevant or not mentioned elsewhere, when and where was the quantity measured?
    • Data format, such as data type (text, integer, flag), spatial coordinates, date/time, etc. Use consistent formatting throughout the file and among projects, if possible.
      • For dates, use the format: YYYY-MM-DD or YYYYMMDD.
      • For time, use 24-hr notation, HH:MM:SS or HHMMSS, and include local time zone.
      • For spatial coordinates, report in decimal degrees format to at least 4 significant digits (5 or 6 preferable). Make sure all location information in a file uses the same coordinate system. Include coordinate type, datum, and spheroid.
    • Units of measurement (e.g., number per m^3, deaths per 10,000 individuals, % increase per year). When possible, use standards. If using abbreviations in the dataset, spell out the complete units in the documentation.
    • Description of what a missing value signifies and how missing values are represented (e.g., -9999, n/a, FALSE, NULL, NaN, nodata, None). Leaving an entry blank may cause misregistration of the data in many applications.
    • An attribute/variable that describes data quality or certainty using coded values. Describe precision, accuracy, and uncertainty, and the quality control methods used. Some repositories may have standardized data quality levels.
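
A data dictionary can be as simple as one delimited row per variable. The sketch below writes a companion file with Python's standard csv module; the variables, formats, and missing-value code are hypothetical examples:

    # Minimal sketch of a companion data dictionary; the variables, formats,
    # and missing-value code are hypothetical examples.
    import csv

    dictionary = [
        ("site_id",  "Unique site identifier",        "text",       "",      ""),
        ("obs_date", "Date of observation",           "YYYY-MM-DD", "",      ""),
        ("temp_c",   "Air temperature at 2 m height", "decimal",    "deg C", "-9999"),
    ]

    with open("SiteB_2010_rawdata_dictionary.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["variable", "description", "format", "units", "missing_value"])
        writer.writerows(dictionary)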

Identifiers

  • Assign unique identifiers to your data. Unique persistent identifiers provide a way to link to and cite your data easily from publications, websites, and other resources.
  • Describe how to cite the dataset in your documentation. DataCite.org recommends the following format: Creator (PublicationYear). Title. Version [optional]. Publisher. ResourceType [optional]. Identifier.
  • Read more about why identifiers are useful and how to obtain them.

Metadata

  • Metadata is data about your data. Creating metadata, i.e., information about your data's contents, structure, and permissions, makes it possible for others to find and use your data properly. Without good metadata, you might not be able to reuse your own data five years from now!
  • Use metadata schemas and standards when possible so that your data are described according to a common or known language. Many schemas have been developed for certain types of data, and use of these schemas when appropriate will result in the best metadata for your data. The ones that apply broadly can be relatively simple, whereas data- or discipline-specific schemas can be very complex.

Data clean-up

Do-it-yourself

  • Double-check data that are manually entered, either by entering them twice and running comparisons, or by having a second person verify the entries (see the sketch after this list).
  • Compute summary statistics or create visualizations of your data to identify potential outliers or erroneous values.
  • Sorting data by different fields can help to spot empty cells and impossible values. But use caution when sorting within a spreadsheet application: sort entire records (rows), not individual columns, to keep each record intact.
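
The sketch below shows these checks with pandas; the two input files (independent double entries of the same data) are hypothetical:

    # Minimal sketch of basic quality checks, assuming pandas is installed
    # and two hypothetical files from independent double entry.
    import pandas as pd

    entry1 = pd.read_csv("entry_person1.csv")
    entry2 = pd.read_csv("entry_person2.csv")

    # Double-entry check: list the cells where the two entries disagree.
    print(entry1.compare(entry2))

    # Summary statistics: scan min/max for impossible values and outliers.
    print(entry1.describe())

    # Count empty cells per column without sorting (and risking) the data.
    print(entry1.isna().sum())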

Available tools

  • Software programs are available online for cleaning up data with consistency errors, formatting problems, and the like. Although we don't provide technical support for specific clean-up software, below is a tool you might find useful:
    • OpenRefine (http://openrefine.org/), for making sure records and variables are consistently coded, filling in known blanks, replacing text selectively, transforming data, and more.