How to request a protected dataset from a script ? #92

Open
gmaze opened this issue Feb 28, 2023 · 19 comments

@gmaze

gmaze commented Feb 28, 2023

Hi!
I recently ran into an issue I can't solve myself, so I would like to ask for your feedback and/or help here.

I maintain the argopy Python library. It can be used to fetch Argo data from several sources (ftp, http, files), in particular the Ifremer erddap instance.
Everything works very well as long as datasets are public (congratulations on your work: erddap is really a game changer for easy data access).

But recently we came across a new user requirement: using argopy to access protected data. We therefore set up an erddap server with the recommended ORCID authentication process. It works well through the web browser interface.

However, even when a user is logged in on the erddap and can see and access the protected data in a web browser, I cannot manage to request this data with the argopy library from a CLI script, or even from a Jupyter notebook running in the same web browser.

Do you have any idea how to solve this issue, please?

ps: I'm not even sure that having argopy authenticated by ORCID would make the erddap server allow requests to the protected dataset (euroargodev/argopy#243).

ps: Maybe the issue is to know which http request header parameters the erddap server requires to consider the client request as authenticated.

ps: I'm aware of the "Scripts" instructions at "https://coastwatch.pfeg.noaa.gov/erddap/download/AccessToPrivateDatasets.html", but they do not address the issue here (ORCID login).

@BobSimons
Collaborator

That is a challenging problem that affects a growing number of people.

See the "Scripts" section of https://coastwatch.pfeg.noaa.gov/erddap/download/AccessToPrivateDatasets.html It probably isn't directly applicable, but may give you a hint at how to solve the problem.

If it doesn't help, then one solution (already on the To Do list) is for us to add a feature to ERDDAP where a logged-in user can request a 24-hour (or user-specified duration?) temporary password, and where ERDDAP accepts this temporary password when a script passes it as a parameter. The downside is that this is much less secure than OAuth authentication, and so it weakens ERDDAP's protection of the data.

But I haven't kept up with how other software handles this problem. It is worth looking around for better solutions. I'll try to get Chris John involved.

@gmaze
Author

gmaze commented Feb 28, 2023

Thanks for your quick answer!
Indeed, the temporary password solution would be much less secure than OAuth; basically, the point is for our library users to be able to run a data fetching script in bash mode.

In a perfect world, we could imagine the erddap server having a settings page where registered users can request and manage secret keys.
Users could assign a key to a specific program/client that sends requests to the erddap.
On the client library side (e.g. argopy), we would let users provide this key and automatically add it to the http requests sent to the erddap (as an x-param in the header, for instance).
The erddap server would then check the validity of this key and allow or block the request.

but I'm afraid now that this is just paraphrasing your temporary password suggestion !
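
The client side of this key-in-header idea could look roughly like the sketch below. It is only an illustration: the header name `X-ERDDAP-Key` and the environment variable `ERDDAP_KEY` are made up here (ERDDAP defines neither), and the request is built with the standard library without being sent.

```python
import os
import urllib.request

# Hypothetical header name: ERDDAP does not define it, it only illustrates the idea.
KEY_HEADER = "X-ERDDAP-Key"


def build_request(url, key=None):
    """Return a urllib Request carrying the user's key in a custom header."""
    # Fall back to a (hypothetical) environment variable so the key stays out of scripts.
    if key is None:
        key = os.environ.get("ERDDAP_KEY")
    request = urllib.request.Request(url)
    if key:
        request.add_header(KEY_HEADER, key)
    return request


req = build_request(
    "https://erddap-val.ifremer.fr/erddap/info/index.json", key="my-secret-key"
)
# urllib stores header names capitalized; HTTP header names are case-insensitive anyway.
print(req.header_items())
```

A client library could call such a helper for every request, so the user only has to supply the key once.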

@BobSimons
Collaborator

I'm not so keen on a settings page and having ERDDAP manage secrets for the long run. There are security advantages to having the password be valid for a short time rather than a long time. And there are security advantages if ERDDAP just has to keep secret info in memory and not store it to disk (for longer term use, and in case ERDDAP is restarted).

I'll add to your idea: the password could be tied to a specified IP address (not necessarily the computer the user is using to request the password). But I know that with some, e.g., Amazon setups, the script might run on multiple servers and you might not know the IP address of any of them.
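
To make the short-lived, memory-only, optionally IP-bound password idea concrete, here is a minimal Python sketch. It is not how ERDDAP works or would implement it (ERDDAP is written in Java); the class `TemporaryPasswordStore` and its methods are hypothetical names used only for illustration.

```python
import secrets
import time


class TemporaryPasswordStore:
    """In-memory store of short-lived passwords; nothing is written to disk."""

    def __init__(self, max_lifetime_s=24 * 3600):
        self.max_lifetime_s = max_lifetime_s
        self._entries = {}  # password -> (user, expiry timestamp, allowed IP or None)

    def issue(self, user, lifetime_s=None, allowed_ip=None, now=None):
        """Create a password for a logged-in user, optionally bound to one IP address."""
        now = time.time() if now is None else now
        # Never exceed the maximum lifetime, whatever the user asks for.
        lifetime_s = min(lifetime_s or self.max_lifetime_s, self.max_lifetime_s)
        password = secrets.token_urlsafe(32)
        self._entries[password] = (user, now + lifetime_s, allowed_ip)
        return password

    def check(self, password, client_ip, now=None):
        """Return the user name if the password is valid for this client, else None."""
        now = time.time() if now is None else now
        entry = self._entries.get(password)
        if entry is None:
            return None
        user, expiry, allowed_ip = entry
        if now >= expiry:
            del self._entries[password]  # forget expired passwords
            return None
        if allowed_ip is not None and client_ip != allowed_ip:
            return None
        return user


store = TemporaryPasswordStore()
pw = store.issue("alice", lifetime_s=3600, allowed_ip="192.0.2.10", now=0)
print(store.check(pw, "192.0.2.10", now=10))    # alice
print(store.check(pw, "198.51.100.7", now=10))  # None (wrong IP address)
print(store.check(pw, "192.0.2.10", now=7200))  # None (expired)
```

Because everything lives in a dictionary, a restart simply invalidates all passwords, which matches the "keep secret info in memory" preference. Leaving `allowed_ip` unset covers the multi-server case where the script's IP address is not known in advance.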

@gmaze
Author

gmaze commented Feb 28, 2023

I think I understand your concern and design vision for ERDDAP

About attaching the IP address: indeed, this would prevent requests from being sent from the computing nodes of HPC or cloud computing providers, or at least make it much more complicated.

To be sure I understand your suggestion, the implied workflow would be:

  1. in a browser, the user logs in to the erddap server using any login service the erddap provides (e.g. ORCID),
  2. in a browser, the user visits a dedicated webpage where they can request a temporary password (maximum duration 24 hours),
  3. in a script, the user provides the temporary password to argopy (using our option mechanism or method arguments),
  4. in a script, the user sends an argopy data fetching request to the erddap server, with argopy sending the temporary password in the http request header,
  5. the erddap server checks the validity of the password and, whatever the login status of the user, goes on processing the request if the password is valid.

If it works like this, then from the erddap server's point of view, access to a dataset depends either on the logged-in user's credentials (when visiting the protected dataset webpage) or on the validity of the password (when getting the protected dataset in a downloadable format like json or netcdf).
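
Read this way, the server-side decision in step 5 could be sketched as follows. The function `may_serve` and its arguments are hypothetical and only restate the either/or rule above; they do not describe ERDDAP's actual code.

```python
def may_serve(dataset_is_protected, session_user=None, password_user=None,
              allowed_users=frozenset()):
    """Decide whether a request may read a dataset.

    session_user:  user identified by the browser login (e.g. ORCID), or None.
    password_user: user identified by a valid temporary password, or None.
    """
    if not dataset_is_protected:
        return True
    # Either credential is enough: a valid password works whatever the login status.
    candidates = (session_user, password_user)
    return any(user is not None and user in allowed_users for user in candidates)


# A script with a valid temporary password but no browser session:
print(may_serve(True, password_user="alice", allowed_users={"alice"}))  # True
# An anonymous request to the same protected dataset:
print(may_serve(True, allowed_users={"alice"}))  # False
```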

@rmendels
Collaborator

@gmaze This is not something I know a lot about, but I am interested in looking into it. Can you tell me what you are using at present to handle the ORCID ID and authentication within the Python program?

@ChrisPJohn
Collaborator

I need to read more about ORCID and exactly how ERDDAP handles it. That said, I do think the access to private datasets page is a useful resource here, mostly for its general approach of using curl (or some other tool) to make requests to the ERDDAP server. The requests for ORCID will be different from those in that example (which is for Google login). As mentioned on the access to private datasets page, a useful way to understand which requests ORCID authentication requires is to monitor the network tab of the browser's developer console while going through the login flow on the web.

There is a potential feature request to better support scripting authenticated access. I need to investigate what that would entail and how complex those changes would be though.

@rmendels
Collaborator

@ChrisPJohn @gmaze My experience with R suggests there is not a whole lot more that can be done in ERDDAP, though I may be wrong. The issue in a script is you need something that mimics logging into ORCID, storing the cookie, and then have a communication protocol that allows that cookie to be used in the request. R now has some packages that can do that (usually providing some way to mimic a login and a front-end to curl). I would imagine Python has that capability somewhere; I am just not certain in which packages. I believe ORCID has an API that could perhaps be used for the first step (as well as a Python wrapper for it); I would have to look up which Python libraries can include that cookie.
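
For the "store the cookie, then reuse it" part, Python's standard library might already be enough. The sketch below assumes the session cookie value was obtained some other way (the ORCID login itself is not covered) and that the cookie is named `JSESSIONID` (the usual name in Java servlet containers, an assumption here). The host `erddap.example.org`, the file name and the helper `session_cookie` are made up for illustration, and nothing is sent over the network.

```python
import http.cookiejar
import os
import tempfile
import urllib.request


def session_cookie(name, value, domain):
    """Build a session cookie (no expiry date) for the given host."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=True, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Store the cookie in a file so that later script runs can reuse it.
cookie_file = os.path.join(tempfile.mkdtemp(), "erddap_cookies.txt")
jar = http.cookiejar.MozillaCookieJar(cookie_file)
jar.set_cookie(session_cookie("JSESSIONID", "value-copied-after-login", "erddap.example.org"))
jar.save(ignore_discard=True)  # session cookies are skipped unless ignore_discard is set

# Later (possibly in another process): reload the jar and attach it to an opener.
reloaded = http.cookiejar.MozillaCookieJar(cookie_file)
reloaded.load(ignore_discard=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(reloaded))
# Requests made with opener.open(...) to that host over https should then carry the cookie.
print([(c.name, c.value) for c in reloaded])
```

The stored value is only useful as long as the server-side session is alive, so a script would still need a way to notice an expired cookie and ask the user to log in again.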

@rmendels
Collaborator

@ChrisPJohn @gmaze For example the following package should allow you to get the ORCID programmatically:

https://github.com/ORCID/python-orcid

Then, if any of the URL packages like urllib allow the header to be set, include that in the header. But of course, since I haven't actually implemented it, these could be famous last words, and since I don't have an ORCID account I have no way of testing.

@rmendels
Collaborator

@gmaze @ChrisPJohn see also:

https://orcid.github.io/orcid-api-tutorial/get/

@gmaze
Author

gmaze commented Mar 1, 2023

> @gmaze This is not something I know a lot about, but I am interested in looking into it. Can you tell me what you are using at present to handle the ORCID ID and authentication within the Python program?

At present, we don't have any authentication mechanism in argopy; it's being discussed here: euroargodev/argopy#243

@gmaze
Author

gmaze commented Mar 1, 2023

> There is a potential feature request to better support scripting authenticated access. I need to investigate what that would entail and how complex those changes would be though.

Surely, that would be great!

@gmaze
Author

gmaze commented Mar 1, 2023

> @ChrisPJohn @gmaze For example the following package should allow you to get the ORCID programmatically:
>
> https://github.com/ORCID/python-orcid

This package does not look supported anymore: for instance, it is not compatible with the latest ORCID API version (ORCID/python-orcid#32).
So I would not rely on it.

@gmaze
Author

gmaze commented Mar 1, 2023

> The issue in a script is you need something that mimics logging into ORCID, storing the cookie, and then have a communication protocol that allows that cookie to be used in the request.

Indeed, this looks like the key issue! Especially the 1st part (logging in and storing the cookie)...

Here is a small procedure that works on our test server and demonstrates how to do the 2nd part:

  1. Go to the erddap webpage and log in with ORCID
  2. Open the devtools and get the value of the cookie named JSESSIONID
  3. Now you can send a request to the erddap using this cookie:

    # Top-level "async with"/"await" works in a Jupyter/IPython session;
    # in a plain script, wrap these lines in an async function and call asyncio.run().
    import aiohttp
    import pandas as pd

    url = 'https://erddap-val.ifremer.fr/erddap/info/index.json'
    cookies = {'JSESSIONID': '<COOKIEVALUE>'}  # replace with the value copied from the devtools
    async with aiohttp.ClientSession(cookies=cookies) as session:
        async with session.get(url) as resp:
            data = await resp.json()
    # Build a table of the datasets listed by the server
    df = pd.DataFrame(data['table']['rows'], columns=data['table']['columnNames'])
    df = df[['Accessible', 'Dataset ID', 'Title']]
    df

    Accessible  Dataset ID           Title
    public      allDatasets          * The List of All Active Datasets in this ERDD...
    yes         Argo-ref-ctd         CTD Reference Measurements
    public      Argo-ref-ctd-public  CTD Reference Measurements

The request above will indeed return all the datasets on the server, including the protected one named "Argo-ref-ctd".

The same request with an empty cookie:

    import aiohttp
    import pandas as pd

    url = 'https://erddap-val.ifremer.fr/erddap/info/index.json'
    cookies = {'JSESSIONID': None}  # no valid session cookie
    async with aiohttp.ClientSession(cookies=cookies) as session:
        async with session.get(url) as resp:
            data = await resp.json()
    df = pd.DataFrame(data['table']['rows'], columns=data['table']['columnNames'])
    df = df[['Accessible', 'Dataset ID', 'Title']]
    df

    Accessible  Dataset ID           Title
    public      allDatasets          * The List of All Active Datasets in this ERDD...
    public      Argo-ref-ctd-public  CTD Reference Measurements
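
A script can use the same comparison to check whether its session cookie is (still) accepted: datasets that appear only in the authenticated listing are the protected ones. A minimal sketch, assuming both responses have the `table`/`columnNames`/`rows` layout used above; the helper names are hypothetical and the payloads are trimmed-down copies of the two listings:

```python
def table_to_records(payload):
    """Turn an ERDDAP .json table into a list of {column name: value} dicts."""
    table = payload["table"]
    return [dict(zip(table["columnNames"], row)) for row in table["rows"]]


def protected_dataset_ids(authenticated, anonymous):
    """Dataset IDs listed with the session cookie but missing without it."""
    public_ids = {rec["Dataset ID"] for rec in table_to_records(anonymous)}
    return [
        rec["Dataset ID"]
        for rec in table_to_records(authenticated)
        if rec["Dataset ID"] not in public_ids
    ]


# Trimmed-down payloads mirroring the two listings (titles shortened to "...").
columns = ["Accessible", "Dataset ID", "Title"]
with_cookie = {"table": {"columnNames": columns, "rows": [
    ["public", "allDatasets", "..."],
    ["yes", "Argo-ref-ctd", "CTD Reference Measurements"],
    ["public", "Argo-ref-ctd-public", "CTD Reference Measurements"],
]}}
without_cookie = {"table": {"columnNames": columns, "rows": [
    ["public", "allDatasets", "..."],
    ["public", "Argo-ref-ctd-public", "CTD Reference Measurements"],
]}}

print(protected_dataset_ids(with_cookie, without_cookie))  # ['Argo-ref-ctd']
```

An empty result would suggest either that the cookie has expired or that the user has no access to any protected dataset.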

@rmendels
Collaborator

rmendels commented Mar 1, 2023

@gmaze Nice. Thanks for posting this.

@gmaze
Author

gmaze commented Apr 26, 2023

@rmendels is it ok if I put some of this content into a Discussion/Q&A post?
I now also have another code snippet showing how to retrieve protected data when the erddap server uses a simple user/password login protection (not OAuth2 as above).

@rmendels
Collaborator

@gmaze I'm not quite certain that I understand what you are asking, and I don't control the group either, but it would be great to get some of that content posted.

@gmaze
Author

gmaze commented Apr 27, 2023

I mean that these code examples are not the solution to this "issue"; they are more "quick and dirty" workarounds that could fit into a FAQ. That's why I'd like to copy them over here: https://github.com/ERDDAP/erddap/discussions/categories/q-a

@BobSimons
Collaborator

I think that in general we are encouraging using GitHub for programmer-related discussions and issues (e.g., bugs, new features) and are encouraging using the ERDDAP Google Group for end-user-related discussions. Certainly, there are far more users in the ERDDAP Google Group than here. Since this information is useful for users, maybe the appropriate place to post it is in the Google Group.

@ChrisJohnNOAA
Contributor

I think having more documentation/information in the GitHub repo is a good thing. I'd be happy for you to post your code examples in the Q&A section. If you were to send a message to the erddap users group, you could link to that post.
