
Question about cache_age #48

Open
pasqualina-vonlanthendinenna opened this issue Dec 9, 2021 · 7 comments

Comments

@pasqualina-vonlanthendinenna

We have been using your package argodata in R to download data from BGC-Argo floats, which works great and has been extremely useful! However, we wanted to ask: what is the difference between specifying the argument max_global_cache_age = Inf (or max_data_cache_age = Inf) within the function argo_update_global() (or argo_update_data()) and specifying it when defining our cache directory with argo_set_cache_dir()?
And did we understand correctly that the difference between the functions to update the cache is that argo_update_global() updates the index file while argo_update_data() updates the data files?

Thank you!

@paleolimbot (Collaborator)

Hi! Sorry for the slow response here.

Glad the package is useful!

You're correct: argo_update_global() updates the index file while argo_update_data() updates the data files. The maximum cache age checks the file's modified time and skips downloading if the file is less than max_*_cache_age hours old. The default value is -Inf so that when you call argo_update_*(), everything gets updated by default.

The default setup for argodata is to never use a cache that persists between R sessions. This is because caching Argo files is hard: realtime files get replaced by delayed-mode files, so if your analysis includes any realtime files, they may no longer exist later and your code will stop running because files the index pointed to are gone.

Other types of files work really well for caching and age-based invalidation, such as Sprof or delayed-mode files. If you're using these files, you can use argo_set_cache_dir() to set the cache directory and:

  • Use options(argodata.max_global_cache_age = number_in_hours, argodata.max_data_cache_age = number_in_hours) to automatically download updated data files after a certain number of hours (see the sketch after this list). I don't like this method because, inevitably, the moment argodata chooses to update your cache will be the worst possible moment for you to download possibly thousands of files.

...OR...

  • Periodically call argo_update_global() when you get errors about missing realtime files, then call argo_update_data() when you're ready to run your analysis on new data. Before submitting a paper, delete the cache directory entirely and re-run everything to make sure it is all fresh.

...OR...

  • You can pass download = TRUE to many functions to just force those files to download. This is nice when you only want to update part of your download cache, which might contain a lot of files you don't need anymore.
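
For reference, a rough sketch of what the first two workflows might look like (the cache directory path and the age values here are illustrative, not package defaults):

library(argodata)

# a persistent cache directory that survives between R sessions
argo_set_cache_dir("~/argo-cache")

# option-based invalidation: treat the cached index as stale after 24 hours
# and cached data files as stale after a week (both values are in hours)
options(
  argodata.max_global_cache_age = 24,
  argodata.max_data_cache_age = 24 * 7
)

# ...or update explicitly on demand: refresh the index when you hit
# missing-file errors, then refresh the cached data files before re-running
argo_update_global()
argo_update_data()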

If none of those options work for you let me know what you would like! I'd be happy to consider other workflows. I hope that helps!

@paleolimbot (Collaborator)

Flagging @richardsc who might have other ideas!

@pasqualina-vonlanthendinenna (Author)

We're using the delayed-mode S-files for our analysis, so I think periodically updating the cache works best for us!

Thank you again for your help!

@mlarriere

I am facing problems with mirroring data from a server to my local repository.
I have already downloaded all the data up to December 2023. Now I would like to update my database with the files added from December to today. I thought that the mirror would only download the missing files to my local repository, but it turns out that, with the following code, the entire dataset is being re-downloaded.
The number of hours since the last update is not known (the data was downloaded at different times), so I can't simply set max_global_cache_age and max_data_cache_age to a number of hours.
I would like to update the dataset every month.

library(argodata)

# TRUE = refresh the cache
opt_refresh_cache <- TRUE

# avoids reprocessing files
opt_qc_only <- FALSE

if (!opt_qc_only) {
  # set the cache directory
  argo_set_cache_dir(cache_dir = path_argo_core)
  
  # check the cache directory
  argo_cache_dir()
  
  # server
  argo_set_mirror("https://usgodae.org/pub/outgoing/argo")
  argo_mirror()
  
  if (opt_refresh_cache) {
    print("Updating files from website -- in progress")
    argo_update_global(max_global_cache_age = -Inf)
    argo_update_data(max_data_cache_age = -Inf)
  } else {
    print("Retrieving files from server")
    argo_update_global(max_global_cache_age = Inf)
    argo_update_data(max_data_cache_age = Inf)
  }
}

@richardsc

Hi @mlarriere!

First of all, I think that the behaviour you are seeing is in fact the expected behaviour -- the package doesn't know explicitly which files have been added or changed since the last refresh, because it only has the information contained in the index file, so exceeding the cache age (or forcing a re-download by using -Inf) will always re-download the data files from the server.

Part of the reason for this is that the full archive changes not only because of new files being added, but also because of the reprocessing of older files -- either due to changes in data QC (e.g. real time to delayed mode), or blacklisted floats, or changes to metadata, etc. The "archive" obtained from the Argo DAC is not a static archive of files, but a dynamic archive that is continuously updated.

If what you want is an up-to-date mirror of the entire Argo DAC, I recommend you use the synchronization service provided by IFREMER, e.g. see:

https://argo.ucsd.edu/data/data-from-gdacs/

@richardsc

Another way to approach this would be to update your local archive by only adding the files that have been added in, say, the last month, by subsetting the index in time before you download them. This isn't really a complete update: as I described above, it will not update any files from outside that time window. It should keep all the older files that you'd already downloaded, though.
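
In case it helps, a minimal sketch of that idea, assuming the global profile index can be read with argo_global_prof() and that its date_update column is a parsed date-time; the one-month cutoff, the reuse of path_argo_core from the snippet above, and the call to argo_prof_levels(..., download = TRUE) are illustrative rather than prescriptive:

library(argodata)
library(dplyr)

# reuse the cache directory from the earlier snippet
argo_set_cache_dir(path_argo_core)

# refresh only the index so it lists newly added or changed files
argo_update_global()

# keep index rows whose files changed in roughly the last 30 days
recent <- argo_global_prof() %>%
  filter(date_update >= Sys.time() - 30 * 24 * 3600)

# download just those files into the existing cache and read them
prof_levels <- argo_prof_levels(recent, download = TRUE)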

@mlarriere

Hi @richardsc
Thank you for your quick reply and for the useful explanations about the archives and how to update them!
I used the synchronization service provided by IFREMER (terminal command) and it worked. Updating the files by setting a specific number of hours for the index also works, but as I'm using files in "delayed" mode, I think it is better to reload the entire database in case files have been reprocessed (variable adjustments, quality control, ...).

Thank you again for your help!
