Status of Machine Learning for Argo QC #6

Open
gmaze opened this issue Dec 4, 2019 · 3 comments
Labels
enhancement (New feature or request), procedure (About a specific procedure)

Comments

@gmaze
Member

gmaze commented Dec 4, 2019

I'd like to open a discussion thread to get the status of developments with regard to the use of Machine Learning techniques in Argo QC procedures.

Different groups may have started to explore this possibility, and it would be constructive to collect the status of these efforts here, to avoid duplication and to gather feedback.

This could include a description of:

  • the target variables (e.g. QC flag for one TEMP measurement, QC flag for one PSAL profile, ...)
  • the choice of features, explanatory variables
  • the ML method (e.g. random forest)
  • the dataset used
  • the overall performance or difficulties encountered
  • anything else you think is relevant to this topic
@gmaze added the enhancement and procedure labels on Dec 4, 2019
@gmaze
Member Author

gmaze commented Dec 4, 2019

At Ifremer/LOPS, we've tried the following:

Target variables:

Alarm status (True, False) of the ISAS13 test against climatology for one PSAL measurement

Features:

A "patch" of variables from the same profile as the target as well as from profiles before and after (+/- 2). Variables used: TEMP, PSAL, SIG0 and PRES.
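As a rough illustration of the "patch" idea, here is a minimal sketch of how such a feature vector could be assembled. It is not the Ifremer/LOPS code: the function name `build_patch`, the dict-of-lists profile layout, the restriction to a single vertical level, and the NaN padding at the ends of a float's time series are all assumptions made for the example.

```python
# Hypothetical layout: one float's profiles in time order, each profile a dict
# mapping a variable name to a list of values on a common set of levels.
VARIABLES = ("TEMP", "PSAL", "SIG0", "PRES")


def build_patch(profiles, center, level, half_width=2):
    """Flatten VARIABLES at one level for profiles center-half_width..center+half_width.

    Profiles that fall outside the series are padded with NaN (an assumption;
    the thread does not say how edges were handled).
    """
    features = []
    for offset in range(-half_width, half_width + 1):
        idx = center + offset
        for var in VARIABLES:
            if 0 <= idx < len(profiles):
                features.append(profiles[idx][var][level])
            else:
                features.append(float("nan"))
    return features


# Five toy profiles with a single level each (made-up values).
profiles = [
    {"TEMP": [10.0 + i], "PSAL": [35.0], "SIG0": [26.0], "PRES": [5.0]}
    for i in range(5)
]
patch = build_patch(profiles, center=2, level=0)
```

With +/- 2 profiles and four variables, each target measurement gets 20 features, which can then be passed to any classifier.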

ML method:

Random forest

Dataset used:

Argo snapshot from 2016/02 and ISAS team QC logs.

Overall performance or difficulties encountered:

  • Performance is not stable. We wanted to use a "balanced" training set with as many True as False samples. But because there are many more False than True samples, we need to sub-sample the False alarm set. We then run into the difficulty of selecting statistically "similar" sub-samples. Overall performance is highly sensitive to this sub-sampling.
  • The True/False alarms training set is highly imbalanced simply because the ISAS13 test against climatology is not an effective test and raises too many False alarms.
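The sub-sampling step described above can be sketched as follows. This is only an illustration under stated assumptions: `balanced_subsample` is a hypothetical helper, the labels are toy data, and the draw is a plain uniform random sample rather than whatever selection was actually used. Repeating the draw with several seeds and retraining on each one is one simple way to measure how much the scores depend on the sub-sample.

```python
import random


def balanced_subsample(labels, seed):
    """Return sorted indices of a balanced set: equal numbers of True and False.

    The minority class is kept in full; the majority class is sampled
    uniformly at random with a seeded generator so each draw is reproducible.
    """
    rng = random.Random(seed)
    true_idx = [i for i, y in enumerate(labels) if y]
    false_idx = [i for i, y in enumerate(labels) if not y]
    n = min(len(true_idx), len(false_idx))
    return sorted(rng.sample(true_idx, n) + rng.sample(false_idx, n))


# Toy imbalanced label set: 5 True alarms against 95 False alarms.
labels = [True] * 5 + [False] * 95

# One balanced training set per seed; train and score on each to see the spread.
draws = [balanced_subsample(labels, seed) for seed in range(3)]
```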

@gaelforget
Member

  • The True/False alarms training set is highly imbalanced simply because the ISAS13 test against climatology is not an effective test and raises too many False alarms.

Not sure if that helps or if I totally understand, but would it make sense to consider using several climatology products, e.g. counting the number of alarms (0/6 vs 6/6) and setting a threshold? I used to do something like that in the MITprof QC for ECCO (I was using the min of cost functions, if I recall).
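A minimal sketch of this alarm-counting idea is given below. It does not reproduce the ISAS13 test or the MITprof QC: the alarm rule (absolute departure from a reference value larger than a tolerance), the function names, and the numbers are all assumptions chosen for illustration.

```python
def count_alarms(value, references, tolerances):
    """Count how many climatologies flag the value.

    Assumed alarm rule: |value - reference| > tolerance, with one
    (reference, tolerance) pair per climatology product.
    """
    return sum(abs(value - ref) > tol for ref, tol in zip(references, tolerances))


def is_bad(value, references, tolerances, min_alarms):
    """Declare the value bad when at least min_alarms climatologies raise an alarm."""
    return count_alarms(value, references, tolerances) >= min_alarms


# Made-up PSAL value checked against three made-up climatological references.
n_alarms = count_alarms(36.0, [35.0, 35.1, 35.2], [0.5, 0.5, 1.0])
```

Here two of the three references raise an alarm, so the value is flagged with `min_alarms=2` but not with `min_alarms=3`.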

@gmaze
Member Author

gmaze commented Jan 25, 2020

@gaelforget this is a good suggestion that we have started to experiment with as well: taking a final decision on the basis of several QC test outcomes.
But the choice of acceptable distance to the climatology is as important as the climatology value itself. One would need an "optimization" approach where, based on the historical dataset, we would determine the best combination of distance/reference to detect bad data.
This, however, points to another problem: the distance beyond which a data point would be declared "bad" depends in practice on the user's application. This is particularly true for data assimilation, where data need to be somehow compatible with the numerical ocean simulation produced by the model.
This finally leads us to the conclusion that the best we could do would be to compute a goodness probability for the data and leave it up to the user to define a threshold.
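To make the last point concrete, here is a minimal sketch of a goodness probability with a user-chosen threshold. The thread does not say how the probability would be computed; taking the fraction of ensemble members (for example, trees in a random forest) that vote "good" is an assumption, and `goodness_probability` and `keep_good` are hypothetical names.

```python
def goodness_probability(votes):
    """Fraction of ensemble members voting 'good' (assumed definition)."""
    return sum(votes) / len(votes)


def keep_good(probabilities, threshold):
    """Apply a user-defined threshold: True where the data point is kept."""
    return [p >= threshold for p in probabilities]


# Made-up probabilities for three measurements.
probabilities = [0.9, 0.6, 0.2]

# Two users with different tolerance for doubtful data.
permissive_user = keep_good(probabilities, threshold=0.5)
strict_user = keep_good(probabilities, threshold=0.8)
```

The same probabilities yield different selections for the two users, which is the reason for publishing the probability rather than a fixed flag.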
