OAI Identifier uses outdated validation for domain name #126

Closed
abollini opened this issue Dec 16, 2021 · 6 comments

Comments

@abollini
Contributor

OAI identifiers are usually generated using the system's domain name as the repository identifier, which leads to identifiers such as this one: oai:dspace-cris.4science.cloud:e9ed438e-c7f7-4a18-95e5-3f635ea65fee

Unfortunately, the oai-identifier.xsd schema (http://www.openarchives.org/OAI/2.0/oai-identifier.xsd), which the guidelines cache here:
https://github.com/openaire/guidelines-cris-managers/blob/master/schemas/cached/oai-identifier.xsd#L36-L42

does not allow a digit as the first character of a domain name component. This makes the identifier above invalid, even though the domain dspace-cris.4science.cloud is perfectly valid.

@hvdsomp

hvdsomp commented Dec 17, 2021

(I am cc-ing @zimeon and @phonedude)

Well, indeed, it looks like something went a tad wrong with the definition of oai-identifier and, AFAIK, this is the first time the problem was brought up.

My interpretation:

The constructs related to domain names in the oai-identifier syntax definition of Section 2.1 of the OAI Identifier Format guideline build upon Section 3.2.2 of RFC2396, specifically these construction rules:

      hostname      = *( domainlabel "." ) toplabel [ "." ]
      domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
      toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

These rules show that digits are allowed in every position of a domain name component, with the exception of the toplabel (the TLD, e.g. org, com), whose first character must be a letter. But by using the following constructs:

  namespace-identifier = domainname-word "." domainname
  domainname = domainname-word [ "." domainname ]
  domainname-word = alpha *( alphanum | "-" )

the OAI Identifier Format guideline forbids digits in the first position of every component of a domain name. I can't imagine this was the intention, and, if it was, I do not recall what the motivation could have been.
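To make the difference concrete, here is a minimal sketch that transcribes both sets of construction rules quoted above into Python regular expressions and checks the domain from the original report against each. The regexes are my own translation of the ABNF, not the pattern used in oai-identifier.xsd, and the function names are made up for illustration.

```python
import re

ALPHA = "[A-Za-z]"
ALPHANUM = "[A-Za-z0-9]"

# RFC 2396, Section 3.2.2 (transcribed from the rules quoted above):
#   domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
#   toplabel    = alpha    | alpha    *( alphanum | "-" ) alphanum
#   hostname    = *( domainlabel "." ) toplabel [ "." ]
DOMAINLABEL = rf"{ALPHANUM}(?:(?:{ALPHANUM}|-)*{ALPHANUM})?"
TOPLABEL = rf"{ALPHA}(?:(?:{ALPHANUM}|-)*{ALPHANUM})?"
RFC2396_HOSTNAME = re.compile(rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}\.?")

# OAI Identifier Format guideline (transcribed from the rules quoted above):
#   domainname-word      = alpha *( alphanum | "-" )
#   namespace-identifier = domainname-word "." domainname
# i.e. two or more dot-separated words, each starting with a letter.
DOMAINNAME_WORD = rf"{ALPHA}(?:{ALPHANUM}|-)*"
OAI_NAMESPACE = re.compile(rf"{DOMAINNAME_WORD}(?:\.{DOMAINNAME_WORD})+")


def is_rfc2396_hostname(name: str) -> bool:
    """True if the whole string matches the RFC 2396 hostname rule."""
    return RFC2396_HOSTNAME.fullmatch(name) is not None


def is_oai_namespace(name: str) -> bool:
    """True if the whole string matches the guideline's namespace-identifier rule."""
    return OAI_NAMESPACE.fullmatch(name) is not None


if __name__ == "__main__":
    for domain in ("example.org", "dspace-cris.4science.cloud"):
        print(domain, is_rfc2396_hostname(domain), is_oai_namespace(domain))
```

Under this transcription, dspace-cris.4science.cloud is a valid RFC 2396 hostname but is rejected by the guideline's rules, because the component 4science starts with a digit.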

Possible solutions:

@phonedude

Yes, I have no memory of this being by design. So I can only assume it was an oversight.

@zimeon

zimeon commented Dec 17, 2021

Looks like an error to me. I also have no memory of an intention here.

I agree that quietly adjusting the schema is relatively easy. However, the schema would then no longer match the guideline, which is weird, so maybe we should edit the 20-year-old guideline too??

@ACz-UniBi
Member

Dear,

I believe this reflects how the Internet has evolved. RFC2396, published in August 1998, has since been superseded by later releases such as RFC3986 (January 2005; see its Section 3.2).

@hvdsomp

hvdsomp commented Dec 18, 2021

@zimeon, I think you’re right that the guideline should be updated too. I convinced myself when noticing that there’s a Document History section in which details of changes can be conveyed. I’m thinking that the only changes needed are in the construction rules (and document version, of course). I would prefer sticking to the reference to RFC2396 instead of more recent RFCs re URI syntax, because, after all, 2396 was the law of the land when oai-identifier was spec-ed and we’re merely correcting the spec, not creating a new one.

@jdvorak001
Collaborator

This is done, thanks @ACz-UniBi !
