- Basic Data Hygiene
- Metadata
- File system organisation
- File naming
Link to handout: http://bit.ly/NHM_RDM_introduction
Hans Rosling on open data in 2006
How do we get there?
Better digital curation of the workhorses of modern science: code & data
aim to create secure materials that are easy to use and REUSE
We describe nine simple ways to make it easy to reuse the data that you share and also make it easier to work with it yourself. Our recommendations focus on making your data understandable, easy to analyze, and readily available to the wider community of scientists.
This guide for early career researchers explains what data and data management are, and provides advice and examples of best practices in data management, including case studies from researchers currently working in ecology and evolution.
Most university libraries have assistants dedicated to Research Data Management:
@tomjwebb @ScientificData Talk to their librarian for data management strategies #datainfolit
— Yasmeen Shorish (@yasmeen_azadi) January 16, 2015
Act as though every short term study will become a long term one @tomjwebb. Needs to be reproducible in 3, 20, 100 yrs
— oceans initiative (@oceansresearch) January 16, 2015
@tomjwebb stay away from excel at all costs?
— Timothée Poisot (@tpoi) January 16, 2015
Or at the very least, treat Excel files as read only.
@tomjwebb @tpoi excel is fine for data entry. Just save in plain text format like csv. Some additional tips: pic.twitter.com/8fUv9PyVjC
— Jaime Ashander (@jaimedash) January 16, 2015
@jaimedash just don’t let excel anywhere near dates or times. @tomjwebb @tpoi @larysar
— Dave Harris (@davidjayharris) January 16, 2015
@tomjwebb databases? @swcarpentry has a good course on SQLite
— Timothée Poisot (@tpoi) January 16, 2015
@tomjwebb @tpoi if the data are moderately complex, or involve multiple people, best to set up a database with well designed entry form 1/2
— Luca Borger (@lucaborger) January 16, 2015
@tomjwebb Entering via a database management system (e.g., Access, Filemaker) can make entry easier & help prevent data entry errors @tpoi
— Ethan White (@ethanwhite) January 16, 2015
@tomjwebb it also prevents a lot of different bad practices. It is possible to do some of this in Excel. @tpoi
— Ethan White (@ethanwhite) January 16, 2015
@ethanwhite +1 Enforcing data types, options from selection etc, just some useful things a DB gives you, if you turn them on @tomjwebb @tpoi
— Gavin Simpson (@ucfagls) January 16, 2015
- `.csv`: comma separated values
- `.tsv`: tab separated values
- `.txt`: no formatting specified

@tomjwebb It has to be interoperability/openness - can I read your data with whatever I use, without having to convert it?
— Paul Swaddle (@paul_swaddle) January 16, 2015
More unusual formats will need instructions on use, and a `.csv` or `.tsv` copy would need to be saved alongside them.
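If data entry does happen in Excel, a plain-text copy can be exported from R. A minimal sketch, assuming the `readxl` package is installed and using a hypothetical workbook name:

```r
library(readxl)

# read the first sheet of the (hypothetical) data entry workbook
dat <- read_excel("data-entry.xlsx", sheet = 1)

# save a plain-text copy alongside the original
write.csv(dat, "data-entry.csv", row.names = FALSE)
```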
For missing values, `NA` or `NULL` are also good options. Avoid numbers like `-999`.
`read.csv()` utilities:

- `na.strings`: character vector of values to be coded missing and replaced with `NA`, e.g. `na.strings = c("NA", "-999")`
- `strip.white`: logical; if `TRUE`, strips leading and trailing white space from unquoted character fields
- `blank.lines.skip`: logical; if `TRUE`, blank lines in the input are ignored
- `fileEncoding`: if you're getting funny characters, you probably need to specify the correct encoding

```r
read.csv(file, na.strings = c("NA", "-999"), strip.white = TRUE,
         blank.lines.skip = TRUE, fileEncoding = "mac")
```
`readr::read_csv()` utilities:

- `na`: character vector of values to be coded missing and replaced with `NA`, e.g. `na = c("", "NA", "-999")`
- `trim_ws`: logical; if `TRUE`, strips leading and trailing white space from unquoted character fields
- `col_types`: allows for column data type specification (see more)
- `locale`: controls things like the default time zone, encoding, decimal mark, big mark, and day/month names
- `skip`: number of lines to skip before reading data
- `n_max`: maximum number of records to read

```r
read_csv(file, col_names = TRUE, col_types = NULL, locale = default_locale(),
         na = c("", "NA", "-999"), trim_ws = TRUE, skip = 0, n_max = Inf)
```
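As a hedged illustration of the `locale` argument, reading a hypothetical file that uses European-style comma decimal marks:

```r
library(readr)

# decimal_mark = "," handles numbers written like "3,14";
# the file name here is hypothetical
df <- read_csv("data/euro_measurements.csv",
               locale = locale(decimal_mark = ","))
```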
Always check your data after import: `View(df)` opens it in a data viewer, `summary(df)` summarises each column, and printing `df`, `head(df)` (see top few rows) and `str(df)` (see object structure) are all useful.
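A minimal sketch of these post-import checks, assuming a hypothetical file `data/my_data.csv`:

```r
# hypothetical file, using the read.csv() arguments from above
df <- read.csv("data/my_data.csv", na.strings = c("NA", "-999"),
               strip.white = TRUE)

str(df)      # object structure: dimensions and column types
summary(df)  # per-column summaries flag outliers and unexpected NAs
head(df)     # eyeball the top few rows
```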
@tomjwebb don't, not even with a barge pole, not for one second, touch or otherwise edit the raw data files. Do any manipulations in script
— Gavin Simpson (@ucfagls) January 16, 2015
@tomjwebb @srsupp Keep one or a few good master data files (per data collection of interest), and code your formatting with good annotation.
— Desiree Narango (@DLNarango) January 16, 2015
Keep a master copy of your files, and script all manipulations in `R`.
@tomjwebb Back it up
— Ben Bond-Lamberty (@BenBondLamberty) January 16, 2015
- Version control provides the most solid backup.
- Keep everything in one project folder.
- It can be problematic with really large files.
@tomjwebb I see tons of spreadsheets that i don't understand anything (or the stduent), making it really hard to share.
— Erika Berenguer (@Erika_Berenguer) January 16, 2015
@tomjwebb @ScientificData “Document. Everything.” Data without documentation has no value.
— Sven Kochmann (@indianalytics) January 16, 2015
@tomjwebb Annotate, annotate, annotate!
— CanJFishAquaticSci (@cjfas) January 16, 2015
Document all the metadata (including protocols).@tomjwebb
— Ward Appeltans (@WrdAppltns) January 16, 2015
You download a zip file of #OpenData. Apart from your data file(s), what else should it contain?
— Leigh Dodds (@ldodds) February 6, 2017
It’s out there somewhere:
Information that describes, explains, locates, or in some way makes it easier to find, access, and use a resource (in this case, data).
This usually takes the form of a structured set of elements.
Start, at the very least, by creating a metadata tab within your raw data spreadsheets.
Ideally, set up a system of normalised tables (see section 3 in this post) and README documents to manage and document metadata.
Ensure everything someone might need to understand your data is documented.
Different types of data require different metadata. Ultimately, metadata can be compiled into an `XML` file, a searchable, shareable file.
- temporal (time of day, day, month, year, season)
- geography (lat, lon, postcode)
- species name; authority / source
@tomjwebb record every detail about how/where/why it is collected
— Sal Keith (@Sal_Keith) January 16, 2015
I'm using data from a project in which we compiled a large dataset of bird reproductive, morphological, physiological, life history and ecological traits across as many bird species as possible, in order to perform a network analysis on associations between trait pairs.

I'll use a simplified subset of the data to show a simple metadata (attribute) structure that can easily form the basis of a more formal EML (Ecological Metadata Language, an ecological XML standard) document, using functions in the package EML.
species | max.altitude | dev.mode | courtship.feed.m | song.dur | breed.system |
---|---|---|---|---|---|
Acridotheres_tristis | NA | 2 | 0 | NA | 1 |
Aix_galericulata | NA | 1 | NA | NA | 2 |
Anas_americana | NA | 1 | NA | NA | 2 |
Anas_clypeata | NA | 1 | NA | NA | 2 |
Anthracothorax_nigricollis | NA | 2 | NA | NA | 2 |
Anthus_hodgsoni | NA | 2 | NA | NA | 1 |
Aphelocoma_coerulescens | NA | 2 | NA | NA | 4 |
Aphelocoma_ultramarina | NA | 2 | NA | NA | NA |
Ardea_cinerea | NA | 2 | NA | NA | 1 |
Like many real data sets, column headings are convenient for data entry and manipulation, but not particularly descriptive to a user not already familiar with the data.
More importantly, they don’t let us know what units they are measured in (or in the case of categorical / factor data, what the factor abbreviations refer to). So let us take a moment to be more explicit:
I use functions in `eml_utils.R` to:

- create an `attr_tbl` shell in which to complete all the info required
- supply the completed `attr_tbl` to EML generating functions.

```r
library(RCurl)
eval(parse(text = getURL(
  "https://raw.githubusercontent.com/annakrystalli/ACCE_RDM/master/R/eml_utils.R",
  ssl.verifypeer = FALSE)))
```
The `attr_tbl` shell

Read in the data:

```r
dt <- read.csv("data/bird_trait_db-v0.1.csv")
```

Create the `attr_tbl` shell from your data (`dt`) using `get_attr_shell()` from `eml_utils.R`:

```r
attr_shell <- get_attr_shell(dt)
```

Check the `attr_tbl` shell structure:

```r
str(attr_shell)
```
```
## 'data.frame':    6 obs. of  11 variables:
##  $ attributeName      : chr  "species" "max.altitude" "dev.mode" "courtship.feed.m" ...
##  $ attributeDefinition: logi  NA NA NA NA NA NA
##  $ columnClasses      : chr  "character" "numeric" "numeric" "numeric" ...
##  $ numberType         : logi  NA NA NA NA NA NA
##  $ unit               : logi  NA NA NA NA NA NA
##  $ minimum            : logi  NA NA NA NA NA NA
##  $ maximum            : logi  NA NA NA NA NA NA
##  $ formatString       : logi  NA NA NA NA NA NA
##  $ definition         : logi  NA NA NA NA NA NA
##  $ code               : logi  NA NA NA NA NA NA
##  $ levels             : logi  NA NA NA NA NA NA
```
`attributes` df columns

I use the recognized column headers shown here to make it easier to create an EML object down the line. I focus on the core columns required, but you can add additional ones for your own purposes.
Attributes associated with all variables:

- `attributeName`
- `attributeDefinition`
- `columnClasses` (`"numeric"`, `"character"`, `"factor"`, `"ordered"`, or `"Date"`; case sensitive)

`columnClasses`-dependent attributes:

- `numeric` (ratio or interval) data: `numberType`, `unit`, `minimum`, `maximum`
- `character` (textDomain) data: `definition`
- `dateTime` data: `formatString`, e.g. for dates like `11-03-2001` the formatString would be `"DD-MM-YYYY"`

Use `code` and `levels` to store information on factors. Use `";"` to separate code and level descriptions. These can be extracted later on by the `eml_utils.R` function `get_attr_factors()`.
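As an illustration of that `";"` convention (a minimal sketch, not the actual `get_attr_factors()` implementation), the `dev.mode` codes and levels from the table below could be split like this:

```r
# ";"-separated code and level strings, as in the dev.mode row below
code_str  <- "1;2;3"
level_str <- "Altricial;Semiprecocial;Precocial"

# split into aligned vectors of codes and their definitions
data.frame(code  = strsplit(code_str, ";")[[1]],
           level = strsplit(level_str, ";")[[1]])
```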
Complete the `attr_tbl`

Write the `attr_shell` to `.csv`, fill in the required information (for example in a spreadsheet editor), save the completed version as `attr_tbl.csv`, then read it back in:

```r
write.csv(attr_shell, file = "data/attr_shell.csv")
attr_tbl <- read.csv(file = "data/attr_tbl.csv")
attr_tbl
```
attributeName | attributeDefinition | columnClasses | numberType | unit | minimum | maximum | formatString | definition | code | levels |
---|---|---|---|---|---|---|---|---|---|---|
species | species | character | NA | NA | NA | NA | NA | species | NA | NA |
max.altitude | Maximum altitudinal distribution | numeric | integer | meter | NA | NA | NA | NA | NA | NA |
dev.mode | Developmental mode | ordered | NA | NA | NA | NA | NA | NA | 1;2;3 | Altricial;Semiprecocial;Precocial |
courtship.feed.m | Courtship feeding (by the male) | factor | NA | NA | NA | NA | NA | Courtship feeding (by the male) | 0;1 | FALSE;TRUE |
song.dur | Song duration | numeric | real | second | 0 | NA | NA | NA | NA | NA |
breed.system | Which adult(s) provides the majority of care: | factor | NA | NA | NA | NA | NA | Breeding system | 1;2;3;4;5 | Pair;Female;Male;Cooperative;Occassional |
Do not manually edit raw data
Keep a clean pipeline of data processing from raw to analytical.
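A minimal sketch of such a pipeline, with hypothetical file names; the raw file is only ever read, never written:

```r
# read the raw master copy (never modified)
raw <- read.csv("data/raw/survey.csv")

# all cleaning lives in code, so it is documented and repeatable
clean <- raw[!is.na(raw$species), ]
clean$species <- trimws(clean$species)

# write the analytical version to a separate location
write.csv(clean, "data/clean/survey.csv", row.names = FALSE)
```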
There are going to be files
LOTS of files
The files will change over time
The files will have relationships to each other
The more things are self-explanatory, the better
READMEs are great, but don’t document something if you could just make that thing self-documenting by definition
A place for everything, everything in its place.
Benjamin Franklin
source: https://nicercode.github.io/blog/2013-04-05-projects/
Pick a strategy, any strategy, just pick one! For data, for example: a single `data/` folder, sibling `data-raw/` and `data-clean/` folders, or a nested layout:

```
data/
  raw/
  clean/
```
Pick a strategy, any strategy, just pick one! The same goes for code: common folder names include `R/`, `code/`, `scripts/`, `analysis/` and `bin/`.
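A minimal sketch of creating one such structure from R, using the folder names from the examples above:

```r
# set up the chosen project skeleton once, at the start of the project
dir.create("data/raw", recursive = TRUE)
dir.create("data/clean", recursive = TRUE)
dir.create("R")
dir.create("results")
```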
An example of a real, mid-analysis project folder:

```
/Users/jenny/research/bohlmann/White_Pine_Weevil_DE:
total used in directory 246648 available 131544558
drwxr-xr-x  14 jenny  staff        476 Jun 23  2014 .
drwxr-xr-x   4 jenny  staff        136 Jun 23  2014 ..
-rw-r--r--@  1 jenny  staff      15364 Apr 23 10:19 .DS_Store
-rw-r--r--   1 jenny  staff  126231190 Jun 23  2014 .RData
-rw-r--r--   1 jenny  staff      19148 Jun 23  2014 .Rhistory
drwxr-xr-x   3 jenny  staff        102 May 16  2014 .Rproj.user
drwxr-xr-x  17 jenny  staff        578 Apr 29 10:20 .git
-rw-r--r--   1 jenny  staff         50 May 30  2014 .gitignore
-rw-r--r--   1 jenny  staff       1003 Jun 23  2014 README.md
-rw-r--r--   1 jenny  staff        205 Jun  3  2014 White_Pine_Weevil_DE.Rproj
drwxr-xr-x  20 jenny  staff        680 Apr 14 15:44 analysis/
drwxr-xr-x   7 jenny  staff        238 Jun  3  2014 data/
drwxr-xr-x  22 jenny  staff        748 Jun 23  2014 model-exposition/
drwxr-xr-x   4 jenny  staff        136 Jun  3  2014 results/
```
- Ready to analyze data
- Raw data
- `R` scripts + the Markdown files
- The figures created in those `R` scripts and linked in those Markdown files
- Linear progression of `R` scripts, and a Makefile to run the entire analysis
- Tab-delimited files with one row per gene of parameter estimates, test statistics, etc.
- Files to help collaborators understand the model we fit: some markdown docs, a Keynote presentation, Keynote slides exported as PNGs for viewability on GitHub

*(screenshots: sample_raw_data, sample_ready_to_analyze_data, sample_scripts, sample_results, sample_expository)*
- This project is nowhere near done, i.e. no manuscript or publication-ready figs
- File naming has inconsistencies due to three different people being involved
- Code and reports/figures all sit together because it's just much easier that way with `knitr` & `rmarkdown`
- Someone can walk away from the project and come back to it a year later and resume work fairly quickly
- Collaborators (the two other people: the post-doc whose project it is + the bioinformatician for that lab) were able to figure out what I did and decide which files they needed to look at, etc.
- Be consistent: when developing a naming scheme for folders, it is important that once you have decided on a method, you stick to it. If you can, try to agree on a naming scheme from the outset of your research project.
- Structure folders hierarchically: start with a limited number of folders for the broader topics, and create more specific folders within these.
- Separate ongoing and completed work: as you start to create lots of folders and files, it is a good idea to think about separating older documents from those you are currently working on.
A `from_joe` directory

Let's say your collaborator and data producer is Joe.
He will send you data with weird space-containing file names, data in Microsoft Excel workbooks, etc.
It is futile to fight this; just quarantine all the crazy in a `from_joe` directory.
Rename things and/or export to plain text and put those files in your data directory.
Record whatever you do to those inputs in a README or in comments in your R code.
It’s a good idea to revoke your own write permission to the raw data file.
Then you can’t accidentally edit it.
It also makes it harder to do manual edits in a moment of weakness, when you know you should just add a line to your data cleaning script.
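A minimal sketch of revoking write permission from `R`, using the raw data file from earlier (the shell equivalent is `chmod a-w <file>`):

```r
# mode "0444" makes the file read-only for everyone, owner included
Sys.chmod("data/bird_trait_db-v0.1.csv", mode = "0444", use_umask = FALSE)
```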
Sometimes you need a place to park key emails, internal documentation and explanations, random Word and PowerPoint docs people send, etc.
This is kind of like `from_joe`, where I don't force myself to keep the same standards with respect to file names and open formats.
File organization should reflect inputs vs outputs and the flow of information
```
/Users/jenny/research/bohlmann/White_Pine_Weevil_DE:
drwxr-xr-x  20 jenny  staff  680 Apr 14 15:44 analysis
drwxr-xr-x   7 jenny  staff  238 Jun  3  2014 data
drwxr-xr-x  22 jenny  staff  748 Jun 23  2014 model-exposition
drwxr-xr-x   4 jenny  staff  136 Jun  3  2014 results
```
The `R` scripts:

```
01_marshal-data.r
02_pre-dea-filtering.r
03_dea-with-limma-voom.r
04_explore-dea-results.r
90_limma-model-term-name-fiasco.r
```

The figures they create:

```
02_pre-dea-filtering-preDE-filtering.png
03-dea-with-limma-voom-voom-plot.png
04_explore-dea-results-focus-term-adjusted-p-values1.png
04_explore-dea-results-focus-term-adjusted-p-values2.png
...
90_limma-model-term-name-fiasco-first-voom.png
90_limma-model-term-name-fiasco-second-voom.png
```
NO

```
myabstract.docx
Joe’s Filenames Use Spaces and Punctuation.xlsx
figure 1.png
fig 2.png
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt
```

YES

```
2014-06-08_abstract-for-sla.docx
joes-filenames-are-getting-better.xlsx
fig01_scatterplot-talk-length-vs-interest.png
fig02_histogram-talk-attendance.png
1986-01-28_raw-data-from-challenger-o-rings.txt
```
"-"
and "_"
allows recovery of meta-data from the filenames:"_"
underscore used to delimit units of meta-data I want later"-"
hyphen used to delimit words so my eyes don’t bleedThis happens to be R
but also possible in the shell
, Python
, etc.
e.g. I'm saving a number of files of extracted environmental data at different resolutions (`res`) and for a number of months (`month`):

```r
# produces file names like "variable_0.5_january.csv"
write.csv(df, paste0(paste("variable", res, month, sep = "_"), ".csv"))
df <- read.csv(paste0(paste("variable", res, month, sep = "_"), ".csv"))
```
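And a hedged sketch of recovering the meta-data later by splitting such a name on `"_"` (the file name here is hypothetical):

```r
fname <- "variable_0.5_january.csv"

# drop the extension, then split on "_" to recover the units of meta-data
parts <- strsplit(tools::file_path_sans_ext(fname), "_")[[1]]
res   <- parts[2]  # "0.5"
month <- parts[3]  # "january"
```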
Which set of file(name)s do you want at 3 a.m. before a deadline?
- Chronological order: use the ISO 8601 standard for dates (YYYY-MM-DD)
- Logical order: put something numeric first
If you don't left pad, you get this:

```
10_final-figs-for-publication.R
1_data-cleaning.R
2_fit-model.R
```

which is just sad :(
Put something numeric first
Use the ISO 8601 standard for dates
Left pad other numbers with zeros
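A quick sketch of generating such names in `R` (the script name is only illustrative):

```r
# left-padded numbers sort correctly under default (alphabetical) ordering
sprintf("%02d_fit-model.R", c(1, 2, 10))
#> [1] "01_fit-model.R" "02_fit-model.R" "10_fit-model.R"

# ISO 8601 dates also sort chronologically as plain strings
format(Sys.Date(), "%Y-%m-%d")
```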
In short, good file names are machine readable, human readable, and play well with default ordering.
Setting up a project and `data` folder in RStudio:

File -> New Project -> New Directory

- In the Project Type screen, click on Empty Project.
- In the Create New Project screen, give your project a name and ensure that "create a git repository" is checked. Click on Create Project.

RStudio will create a new folder containing an empty project and set R's working directory to within it.
Two files are created in the otherwise empty project: `.gitignore` and the project's `.Rproj` file. There is no need to worry about the contents of either of these for now.
ACCE Research Data Management workshop materials
Data Carpentry File Organization workshop materials