How should the database be versioned and referenced? #9

Open
bryanwweber opened this issue Nov 2, 2017 · 9 comments
Comments

@bryanwweber
Member

I guess there are two related questions here: how do we define a version number for the database, and how should people reference/cite it?

@bryanwweber
Member Author

bryanwweber commented Nov 2, 2017

Regarding the version number, we need to decide on a format.

  • Do we want semantic versioning (v0.0.0) and if so, based on what criteria are the various numbers changed? What constitutes a patch version vs. a major version change?
  • If we want date-based versions, what should the format be? YYYY-MM-XX (where XX increments for each release in a particular month, if there are several)? YYYY-MM? YYYY-MM-DD? YY.MM?
  • In either case, how frequently will new versions be released? Will it be time based, a la Ubuntu, or will it be every time a new dataset is added, or something in between?
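To make the YYYY-MM-XX option concrete, here is a minimal sketch of how the counter could work, assuming XX is a two-digit number that restarts at 01 each month. The function name `next_version` and the example release list are made up for illustration; nothing like this exists in the repository.

```python
from datetime import date


def next_version(today: date, existing: list[str]) -> str:
    """Return the next YYYY-MM-XX version string for a release made on `today`.

    XX is a two-digit counter that restarts at 01 each month (an assumption,
    one possible reading of the YYYY-MM-XX proposal).
    """
    prefix = f"{today.year:04d}-{today.month:02d}-"
    # Collect the counters already used for releases in this month.
    used = [int(v[len(prefix):]) for v in existing if v.startswith(prefix)]
    counter = max(used, default=0) + 1
    return f"{prefix}{counter:02d}"


# Hypothetical list of earlier releases.
releases = ["2017-08-01", "2017-11-01"]
print(next_version(date(2017, 11, 30), releases))  # second release in November
print(next_version(date(2017, 12, 15), releases))  # first release in December
```

A nice property of this format is that plain string sorting puts the versions in chronological order, which is not true of something like "Month YYYY".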

It seems like DOI is the way to go to reference a formal release like this, and that seems easy to set up with Zenodo. This raises some related questions:

  • What should people do if we don't make a release for a long time and they want to use updated files from the database?
  • What about when we have a website and allow people to download individual data files?
  • Once the database is big enough, what if people don't want to download the whole several MB or GB of files, to get at one small (kB-sized, say) set of the files?

@kyleniemeyer
Member

FYI, I actually already made a formal release for the database in August 2017 using Zenodo (the GitHub-Zenodo integration is set up to mint a DOI with a release here): 10.5281/zenodo.838833.

(This was for a conference paper)

@bryanwweber
Member Author

So it looks like you used Month YYYY as the format. That seems fine.

@kyleniemeyer
Member

I did, and I agree this might be a reasonable approach. Perhaps we can just issue a new monthly release at the end of a given month if there were any changes made during that month?

In the (probably rare) cases where someone wants to cite the database before that release has been archived, it would be like citing the dev version of software that hasn't been released... seems fine to me.

@kyleniemeyer
Member

This still leaves open my question about a CHANGELOG: to me, it makes sense to have an easy-to-read document detailing the differences between releases.

Yes, this info is in the git commit messages (though sometimes not in particularly easy-to-find ways), but many of the people we want to use and cite the database may not be familiar with git, or may find the history challenging to navigate, whereas the CHANGELOG entries are associated with the release itself and written in plain language.

@bryanwweber
Member Author

So the CHANGELOG would essentially be

[Unreleased]
* Added Methyl Valerate data

[August 2017]
* Added butanol data

That seems fine, although I'm not sure what people would do with that information, because I don't care when the data was added, just that it was added. I guess it would also indicate when we made database version upgrades touching all the files, which would probably be valuable.
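If the CHANGELOG were kept as structured data rather than hand-edited text, the layout above could be generated from it. This is only a sketch under that assumption; `render_changelog` and the dictionary layout are hypothetical, and the entries are the two from the example.

```python
def render_changelog(releases: dict[str, list[str]]) -> str:
    """Render release notes using bracketed release labels and bulleted entries.

    `releases` maps a release label to its entries, listed newest first.
    """
    sections = []
    for label, entries in releases.items():
        lines = [f"[{label}]"] + [f"* {entry}" for entry in entries]
        sections.append("\n".join(lines))
    # Blank line between releases, matching the layout in the example.
    return "\n\n".join(sections)


changelog = {
    "Unreleased": ["Added Methyl Valerate data"],
    "August 2017": ["Added butanol data"],
}
print(render_changelog(changelog))
```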

However, I still foresee scaling issues as the database grows, and I think we should put some more thought into how people should reference individual files or subsets of the files, if they want to. See here (and the links at the bottom) for more info about referencing data files: http://libguides.lib.msu.edu/citedata (I haven't looked through those references yet, but will do so soon).

@kyleniemeyer
Member

Ugh... my preference would be just to reference a release of the database, since generally that would be done when using one or a subset of the files anyways.

To me, this would be analogous to citing SciPy when using one or two subroutines—you mention what you used, but cite the whole package/library.

In our case, the recommended practice could be to cite the database, and then the reference(s) associated with the file(s) used.

@bryanwweber
Member Author

bryanwweber commented Nov 5, 2017

The case that I'm concerned about is

  1. User downloads file from website (or single file from GH) in November 2017, but doesn't note the most recent release (doesn't know they should, forgets to, etc.)
  2. User uses that file for several months, during which time we make several releases, and change the file they are using
  3. The user didn't note the release they grabbed the file from, and the most recent release isn't appropriate because the file is different

Do you think that situation is going to occur decently often or do you think it'll be an outlier? Personally, I can see that happening fairly frequently once we get a decent-sized database (and a decent-sized user base...). I would prefer to fix problems like this now rather than when it becomes a real problem, particularly since I think part of the solution is our permanent identifier scheme, which we really really shouldn't change once it's in place (and also it would probably be better not to change the versioning scheme either).

The question then is how should the user cite the database version in that case? Should the user retool their application to support the newer format so they can cite the most recent database version? Would the CHANGELOG be detailed enough for the user to go back and look? (Probably not, if their file has been through several revisions before they downloaded it.) They could go back through the git history and find the release that most closely matches their version of the file (probably using something like git blame), but that seems error-prone and requires users to be fairly expert git users.
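One way to avoid the git archaeology might be a per-release manifest of file checksums, so a user could hash their local copy and look up which releases contain exactly that file. This is a sketch only: the database does not publish such manifests, and the function names, the `butanol.yaml` filename, and the file contents are all hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def releases_containing(file_bytes: bytes, filename: str,
                        manifests: dict[str, dict[str, str]]) -> list[str]:
    """List the releases whose recorded digest for `filename` matches `file_bytes`."""
    digest = sha256_hex(file_bytes)
    return [
        release
        for release, files in manifests.items()
        if files.get(filename) == digest
    ]


# Hypothetical contents of one data file at two points in time.
old_file = b"file-version: 0\n"
new_file = b"file-version: 1\n"

# Hypothetical manifests: release label -> {filename: digest}.
manifests = {
    "August 2017": {"butanol.yaml": sha256_hex(old_file)},
    "November 2017": {"butanol.yaml": sha256_hex(old_file)},
    "December 2017": {"butanol.yaml": sha256_hex(new_file)},
}

# A user holding the old copy can see which releases it belongs to.
print(releases_containing(old_file, "butanol.yaml", manifests))
```

The match is byte-exact, so a copy whose line endings or whitespace were changed on download would not be found; that is a limitation of this approach rather than something the sketch handles.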

Unfortunately, I don't really have a good idea for a solution... which means I'm just being a pain in the butt. Ugh, I hate it when I do that. I'll try to come up with something (and maybe one of those references I linked will have something...)

And, FWIW, I don't think that assigning a DOI or similar identifier to individual files will work (like some of the other databases do), because then a user would have to cite every one of those, which really doesn't scale for even moderately sized datasets.

@bryanwweber
Member Author

bryanwweber commented Nov 5, 2017

Even DataCite recognizes this isn't an easy problem to solve. On Pg. 11 of the PDF here: https://schema.datacite.org/meta/kernel-4.1/doc/DataCite-MetadataKernel_v4.1.pdf they mention

A special note regarding citation of dynamic datasets:
For datasets that are continuously and rapidly updated, there are special challenges both in citation and preservation. For citation, four approaches are possible:

a. Cite a specific slice or subset (the set of updates to the dataset made during a particular period of time or to a particular area of the dataset); Example: Data Request T.Jansen; SAHFOS; Work published 2014 via SAHFOS ; Area Def: 54-65°N, 0-45°W. Temporal Def: 1980-2012 (April-August) Taxonomic Def: All zooplankton; (dataset). https://doi.org/10.7487/2014.15.1.1

b. Cite a specific snap-shot (a copy of the entire dataset made at a specific time); Example: König-Langlo, G., & Sieger, R. (2010). BSRN snapshot 2010-01 as ISO image file (3.75 GB) [Data set]. PANGAEA - Data Publisher for Earth & Environmental Science. (dataset). https://doi.org/10.1594/pangaea.833424

c. Cite the continuously updated dataset, but add an Access Date and Time to the citation. Example: Doe, J. and R. Roe. 2001. The FOO Data Set. Version 2.3. The FOO Data Center. (dataset). https://doi.org/10.xxxx/notfoo.547983. Accessed 1 May 2011.

d. Cite a query, time-stamped for re-execution against a versioned database. The RDA recommended citation for this approach is:
R. Roe. 2017. "The Moo Data Query" created at 2017-07-21 10:25:30 PID https://doi.org/10.xxxx/notmoo.857988. Subset of Moo Database (dataset). PID https://doi.org/10.xxxx/bigmoo.360873.

Notes:
The “slice,” “snap-shot” and "query" options require unique identifiers. Be aware that the third option (c) necessarily means that following the citation does not result in access to the resource as cited. This limits reproducibility of the work that uses this form of citation.
In addition, please note that access date and time may be combined with the first (a), second (b) and fourth (d) options, but it must be used with the third option (c).

The fourth option (d) may shift more work onto repositories to store database versions for all the queries, so not all repositories will be able to support this alternative.
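For what it's worth, option (c) is easy to produce mechanically. Below is a rough sketch that rebuilds the DataCite placeholder example from its parts; the function name and parameters are made up, and the layout is simply copied from the quoted example rather than taken from any official citation style.

```python
from datetime import date


def cite_with_access_date(authors: str, year: int, title: str, version: str,
                          publisher: str, doi: str, accessed: date) -> str:
    """Build an option (c) style citation: dataset reference plus access date."""
    # Day without zero padding, e.g. "1 May 2011" (month name assumes the
    # default English/C locale).
    accessed_text = f"{accessed.day} {accessed.strftime('%B %Y')}"
    return (
        f"{authors}. {year}. {title}. Version {version}. {publisher}. "
        f"(dataset). https://doi.org/{doi}. Accessed {accessed_text}."
    )


# Values taken from the placeholder example in the DataCite text above.
print(cite_with_access_date(
    authors="Doe, J. and R. Roe",
    year=2001,
    title="The FOO Data Set",
    version="2.3",
    publisher="The FOO Data Center",
    doi="10.xxxx/notfoo.547983",
    accessed=date(2011, 5, 1),
))
```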


2 participants