
What is the purpose of the 10-iteration failure when updating GLM vectors/Laplace approx in MixedModels.jl #780

Closed
coverton-usgs opened this issue Aug 12, 2024 · 16 comments

Comments

@coverton-usgs

I am trying to deduce why my data set is producing the error from line 614 throw(ErrorException("number of averaging steps > 10"))

I suspect it is due to my random structure (nested individuals in months) having sparse representation of the fixed factors in a Bernoulli GLMM.

Does 10 'mean' anything or is it just a suitable number of times to be certain that random values for u produce a deviance that is greater than the previously estimated deviance?

I don't understand the Julia language very well, so I likely misunderstand how the code functions. But inside the while-loop starting at line 610, I don't see how obj varies across the 10 iterations of nhalf. I see where each value of u is averaged and mapped to beta, but not where they are reassigned. It looks like obj would be the same for each iteration.

Unfortunately, I am running my Julia model via JuliaCall so I don't think I get the verbose output to verify.

@palday
Member

palday commented Aug 13, 2024

If you can share data + code, even privately, that is always useful for debugging the particulars of your situation. Or even a simulated dataset with the problem-causing structure. 😄

As for the reason for the number 10 -- that's the number of halving steps, which means that you've reduced the step size to 2^-10 ≈ 0.000977 of the original step. Given that (penalized) iteratively re-weighted least squares typically converges very quickly, shrinking the step that much usually indicates a problem with the model that won't be addressed by additional steps.
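To make the mechanism concrete, here is a rough, illustrative sketch of that kind of step-halving loop (simplified names, not the actual pirls! source): when a proposed update of the random effects u makes the objective worse, u is averaged with its previous value (halving the step) and the objective is re-evaluated, so obj does change on every pass; after 10 halvings the routine gives up with the error in the issue title.

# illustrative sketch only -- not the MixedModels.jl source
function halving_step!(u, u0, obj0, objective)
    obj = objective(u)
    nhalf = 0
    while obj > obj0                  # the proposed step made things worse
        nhalf += 1
        if nhalf > 10                 # step is now 2^-10 of the original
            throw(ErrorException("number of averaging steps > 10"))
        end
        u .= (u .+ u0) ./ 2           # move halfway back toward the previous u
        obj = objective(u)            # obj changes because u has changed
    end
    return obj
end

# toy case that can never improve on obj0 = 0, so it hits the limit and throws
halving_step!([1.0, 1.0], [0.0, 0.0], 0.0, u -> sum(abs2, u))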

Have you tried fast=true? Or, if you're already using fast=true, have you tried fast=false? This makes a difference in how much of the work is done by PIRLS vs. the non-linear optimizer.

@coverton-usgs
Author

Thank you for your response and the clear description of the process.

In a previous iteration I used fast=true, but I don't recall exactly what prompted me to stop. Either I was not able to get the output returned to R (I am using JuliaCall in R, and RCall and JellyMe4 in Julia, to port data between programs), or I was receiving nonsensical results when using fast=true (which could have been due to earlier model mis-specification as well).

testdataforjulia.csv

Datafile uploaded. Sorry for the R code, but this should be interpretable:

julia_assign("m0form", formula(case ~ 0 + Analysisclass + (1|cropyear/individual_local_identifier)))
julia_assign("data", testdataforjulia)
julia_command(sprintf('Jmodel = fit(GeneralizedLinearMixedModel, m0form, data, Bernoulli(), wts=float.(data.weights), contrasts= Dict(:Analysisclass => DummyCoding(base="aRice_Wet_day")))'))
juliamodel<- julia_eval("robject(:glmerMod, Tuple([Jmodel, data]));",need_return ="R")

I am relatively certain that the problem is due to rank deficiency in the fixed effects for elements of the random effect structure. See:

testdataforjulia_case1<- testdataforjulia%>% filter(case==1)
table(testdataforjulia_case1$Analysisclass,testdataforjulia_case1$individual_local_identifier)

which shows three problems for this species/dataset: 1) some individuals only have a very small number of records, 2) some only have values for 1 or 2 of the 9 factors comprising Analysisclass, 3) most have no values for the reference class aRice_Wet_day (this is a group that prefers wetlands to rice).

There is a lot to go wrong there, so I am not surprised by a convergence failure. I was just struggling to interpret the error message.

Many thanks

@coverton-usgs
Author

For completeness, here is a working R script and an updated data file that includes both 1s and 0s for the dependent variable (the file above only had case=1; my apologies if you tried to run a GLMM with that).

The script included here wraps the model call in a tryCatch to provide a longer stack trace as well.

testdataforjulia_bothcase.zip

I very much appreciate the related discussion in the Julia forum and am thankful to @palday and @dmbates for their comments and attention to my question.

@palday
Member

palday commented Aug 13, 2024

I am able to fit the model by setting fast=true:
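(The exact call isn't quoted here; as a sketch, reusing the object names from the snippet above and with model_fast as an illustrative name, it was something along these lines.)

model_fast = fit(MixedModel, m0form, data, Bernoulli();
                 wts=float.(data.weights),
                 contrasts=Dict(:Analysisclass => DummyCoding(; base="aRice_Wet_day")),
                 fast=true)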

Generalized Linear Mixed Model fit by maximum likelihood (nAGQ = 1)
  case ~ 0 + Analysisclass + (1 | cropyear) + (1 | cropyear & individual_local_identifier)
  Distribution: Bernoulli{Float64}
  Link: LogitLink()

   logLik     deviance      AIC         AICc        BIC     
 -52283.8716 104567.7433 104595.7433 104595.7530 104717.1024

Variance components:
                                          Column   Variance Std.Dev.
cropyear & individual_local_identifier (Intercept)  40.63670 6.37469
cropyear                               (Intercept)   0.00000 0.00000

 Number of obs: 42981; levels of grouping factors: 78, 11

Fixed-effects parameters:
─────────────────────────────────────────────────────────────────────────────
                                         Coef.   Std. Error       z  Pr(>|z|)
─────────────────────────────────────────────────────────────────────────────
Analysisclass: Nonhabitat_Dry_day    -526.944   8.07256e109   -0.00    1.0000
Analysisclass: Nonhabitat_Dry_night   -20.1344  1.07266      -18.77    <1e-77
Analysisclass: Nonhabitat_Wet_day    -523.382   5.97722e109   -0.00    1.0000
Analysisclass: Nonhabitat_Wet_night  -523.382   5.97722e109   -0.00    1.0000
Analysisclass: Wetland_Dry_day        -14.4728  0.81342      -17.79    <1e-70
Analysisclass: Wetland_Dry_night      -14.9585  0.817631     -18.29    <1e-74
Analysisclass: Wetland_Wet_day        -11.5421  0.806793     -14.31    <1e-45
Analysisclass: Wetland_Wet_night      -11.5783  0.8068       -14.35    <1e-45
Analysisclass: aRice_Dry_day          -18.2457  0.852996     -21.39    <1e-99
Analysisclass: aRice_Dry_night        -14.0891  0.807393     -17.45    <1e-67
Analysisclass: aRice_Wet_day          -15.0857  0.818055     -18.44    <1e-75
Analysisclass: aRice_Wet_night        -11.9868  0.807174     -14.85    <1e-49
─────────────────────────────────────────────────────────────────────────────

The model is singular -- the estimated variance for cropyear is zero, so you could probably drop it.
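(As a quick sketch, MixedModels.jl can report this directly; model_fast is the illustrative name used above for the fast=true fit.)

issingular(model_fast)   # true when any variance component estimate is on the boundary (zero)
VarCorr(model_fast)      # prints the variance-component table shown above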

One thing that interests me is that the estimates for Nonhabitat_Wet_day and Nonhabitat_Wet_night are identical. Given the large standard errors, I suspect multicollinearity, which with dummy coding corresponds to very few observations in the relevant table cells. Looking a bit closer, it looks like they individually make up less than 1% of observations, which is pretty sparse.

12×3 DataFrame
 Row │ Analysisclass         n      %       
     │ String31              Int64  Float64 
─────┼──────────────────────────────────────
   1 │ Nonhabitat_Wet_day      355      0.8
   2 │ Nonhabitat_Wet_night    355      0.8
   3 │ Wetland_Dry_night       951      2.2
   4 │ Wetland_Dry_day         986      2.3
   5 │ aRice_Wet_day          2624      6.1
   6 │ aRice_Wet_night        3767      8.8
   7 │ Wetland_Wet_night      4917     11.4
   8 │ Wetland_Wet_day        5021     11.7
   9 │ aRice_Dry_day          5098     11.9
  10 │ aRice_Dry_night        5915     13.8
  11 │ Nonhabitat_Dry_day     6417     14.9
  12 │ Nonhabitat_Dry_night   6575     15.3

Depending on your research question, it might make sense to split Analysisclass into separate predictors for

  1. nonhabitat vs wetland vs arice
  2. wet vs dry
  3. day vs night
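As a rough sketch of that reparameterization (the new column names are illustrative; the formula would then use these columns in place of Analysisclass):

parts = split.(String.(data.Analysisclass), "_")      # e.g. ["Wetland", "Dry", "night"]
data.habitat     = first.(parts)
data.condition   = getindex.(parts, 2)
data.photoperiod = last.(parts)
m1form = @formula(case ~ 0 + habitat * condition * photoperiod +
                         (1 | cropyear / individual_local_identifier))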

I tried that and even the fast fit fails. Since that's just a reparameterization of your current model, that tells me the model is very sensitive to parameterization, which is usually not a good sign for being able to fit the desired model to the data in question.

@coverton-usgs
Author

Thank you for looking that closely at things.

This particular subset of the data is the most limited, and I expect you are correct that the data at hand is just not sufficient in this case.
Only a minority of cropyears have multiple individual_ids, so I think that explains the low variance estimate. But cropyear is important in other subsets, so I felt it was more important to keep the model structure similar across models for interpretation.

I've argued myself in circles about dummy coding the habitat x condition x photoperiod combinations or including them as interacting effects as you suggest. The primary reason I settled on dummy coding is that some subsets have no use (case==1) at all for some variable combinations, leading to rank deficiency, which creates its own problems with convergence. I have found the models to be more stable if I dummy code, and I feel the results have rational interpretations, whereas interaction coding is harder to wrap my head around.

@palday
Member

palday commented Aug 13, 2024

This particular subset of the data is the most limited, and I expect you are correct that the data at hand is just not sufficient in this case. Only a minority of cropyears have multiple individual_ids, so I think that explains the low variance estimate. But cropyear is important in other subsets, so I felt it was more important to keep the model structure similar across models for interpretation.

That makes sense! And one of the nice things about MixedModels.jl is that it can fit singular models. 😄

I've argued myself in circles about dummy coding the habitat x condition x photoperiod combinations or including them as interacting effects as you suggest. The primary reason I settled on dummy coding is that some subsets have no use (case==1) at all for some variable combinations, leading to rank deficiency, which creates its own problems with convergence. I have found the models to be more stable if I dummy code, and I feel the results have rational interpretations, whereas interaction coding is harder to wrap my head around.

That sounds like a good reason to stick with your current contrast coding. 😄

I tried one more thing, using an experimental feature that uses a linear mixed model to initialize the GLMM (instead of using a GLM for the initialization).

julia> model = fit(MixedModel, m0form, data, Bernoulli();
                   wts=float.(data.weights),
                   contrasts= Dict(:Analysisclass => DummyCoding(; base="aRice_Wet_day")),
                   init_from_lmm=[],
                   fast=false,
                   progress=true,
                   verbose=false)
Minimizing 900    Time: 0:01:12 (80.47 ms/it)
Generalized Linear Mixed Model fit by maximum likelihood (nAGQ = 1)
  case ~ 0 + Analysisclass + (1 | cropyear) + (1 | cropyear & individual_local_identifier)
  Distribution: Bernoulli{Float64}
  Link: LogitLink()

   logLik     deviance      AIC         AICc        BIC     
 -52299.4230 104598.8461 104626.8461 104626.8559 104748.2053

Variance components:
                                          Column   Variance Std.Dev.
cropyear & individual_local_identifier (Intercept)  10.78505 3.28406
cropyear                               (Intercept)   0.02275 0.15082

 Number of obs: 42981; levels of grouping factors: 78, 11

Fixed-effects parameters:
────────────────────────────────────────────────────────────────────────────
                                        Coef.   Std. Error       z  Pr(>|z|)
────────────────────────────────────────────────────────────────────────────
Analysisclass: Nonhabitat_Dry_day    -23.1087     4.72024    -4.90    <1e-06
Analysisclass: Nonhabitat_Dry_night  -20.1036     1.11847   -17.97    <1e-71
Analysisclass: Nonhabitat_Wet_day    -33.0436  2967.99       -0.01    0.9911
Analysisclass: Nonhabitat_Wet_night  -17.5231     1.33543   -13.12    <1e-38
Analysisclass: Wetland_Dry_day       -13.6633     0.439363  -31.10    <1e-99
Analysisclass: Wetland_Dry_night     -14.1096     0.446209  -31.62    <1e-99
Analysisclass: Wetland_Wet_day       -10.7518     0.427207  -25.17    <1e-99
Analysisclass: Wetland_Wet_night     -10.7884     0.427222  -25.25    <1e-99
Analysisclass: aRice_Dry_day         -17.6945     0.529162  -33.44    <1e-99
Analysisclass: aRice_Dry_night       -13.2977     0.428339  -31.04    <1e-99
Analysisclass: aRice_Wet_day         -14.2877     0.447925  -31.90    <1e-99
Analysisclass: aRice_Wet_night       -11.1991     0.427925  -26.17    <1e-99
────────────────────────────────────────────────────────────────────────────

There are a few things to note here, and most of them can be boiled down to a single caveat emptor: the maximum likelihood estimate of this model seems to be poorly defined (because the likelihood surface is very flat), which is usually a sign that you're trying to fit a model that's more complex than your data can support. In this case, that extra complexity is in the coefficients related to the very rare cases.

  • We are working within the maximum likelihood framework, so the 'correct' fit is generally the one where the (log) likelihood is larger / the deviance is smaller (a short snippet after this list shows how to pull those values out of the two fits).
  • In this particular case, you can see that comparing the log likelihoods and the deviances give you slightly different answers. I suspect that this comes down to the difference in the singularity of the random effects and some fine details.
  • The two different models here use the same approximation to the log likelihood, so the estimated log likelihoods are comparable. They differ in how they're allowed to explore the parameter space. The fixed effects in the latter fit seem to make more sense and show fewer signs of issues with collinearity, so I would tend to prefer that model.
  • I'm aware of a few examples where the fast and slow fits deliver quite different results despite similar log likelihoods, but I've never seen one where I felt confident in the resulting inferences afterwards.
  • It would be a good follow-up (but beyond the amount of time I've allocated this evening 😉) to check which model makes more sense when plotting effects, fitted vs observed, etc.
  • All that said, I would not place a lot of faith in either fit / would be very careful in interpreting the models. Going back to the core caveat emptor, the fine details here all speak to a lot of underlying uncertainty.
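A minimal sketch of extracting those quantities, where model_fast and model_lmm are illustrative names for the fast=true fit and the init_from_lmm fit above:

loglikelihood(model_fast), loglikelihood(model_lmm)   # larger is better
deviance(model_fast), deviance(model_lmm)             # smaller is better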

@coverton-usgs
Author

Wow, @palday, you are really going above and beyond. One aspect of this analysis that probably should be mentioned, if it is not already clear, is that this model is an animal "resource selection function": the rows where case=1 carry the covariates that are "used" at the observed locations, while rows where case=0 are "available" habitats whose covariates are proportional to their occurrence. The absolute number of records used to define availability is subjective, so the absolute values of the parameter estimates aren't actually of interest; what matters are the relative values. I'd have to look closer to see how different those are, but at least the order of selection seems consistent between the models (and in line with what I would expect). But I agree, any interpretation is probably pretty minimal.

If you don't mind me asking one follow up question... when you say "the likelihood surface is very flat"... what are you looking at that tells you that?

@palday
Copy link
Member

palday commented Aug 14, 2024

If you don't mind me asking one follow up question... when you say "the likelihood surface is very flat"... what are you looking at that tells you that?

Ah, so the ideal way to do this check would be with profiling, which could then be used for plotting the likelihood function at various values of the parameters. But the profiling code is comparatively new and still fails in a few cases (to be fair, those cases are often numerically challenging cases like the models here) and I don't recall off the top of my head if we've tested it on GLMMs.

A second approach would be computing the Hessian or Fisher information, but that would almost certainly fail here because of the (near) singularity.

So what I've actually done is pieced together information from a few different sources:

  • the big standard errors in the one fit
  • watching the progress output and seeing the behavior of the optimizer as it bounces around. Given that this is a relatively small model in terms of parameters and not especially big in terms of observations (we've fit models with more than a million observations 😎), seeing the optimizer take so many steps where the deviance changes so little tells me that it is struggling to determine where the optimum is, which is indicative of a very flat likelihood surface.

@coverton-usgs
Author

That makes a lot of sense. Thank you for taking the time to explain the inner workings to me. It almost makes me feel like I know what I am doing.
I'll pass along the 72 million observation file when that one fails and see what sort of mess we can make of that one 😱

In seriousness, I appreciate your observations and thoughts immensely.

I am happy to consider this closed if you are.

@palday
Member

palday commented Aug 14, 2024

I'll pass along the 72 million observation file when that one fails and see what sort of mess we can make of that one 😱

Sounds fun. 😄

@palday palday closed this as completed Aug 14, 2024
@dmbates
Collaborator

dmbates commented Aug 15, 2024

Is there a reason that all the individual_local_identifier values end in .1? It would be easier to store and convert those values to category labels if they were rounded to integers. But I don't want to do that if future values would not end in .1.

@dmbates
Collaborator

dmbates commented Aug 15, 2024

Following up on what @palday said about imbalance in the Analysisclass variable, the proportion of case == 1 values within a particular Analysisclass value ranges from 0% to over 60%. The zeros mean that you would not be able to fit a GLM with case as the response and Analysisclass as a fixed-effect.

julia> pr = leftjoin!(
           combine(groupby(df, :Analysisclass), nrow => :n),
           combine(groupby(subset(df, :case), :Analysisclass), nrow => :pos);
           on=:Analysisclass,
       )
       pr.pos = coalesce.(pr.pos, 0)  # replace missing values by zeros
       prop = transform!(pr, [:pos, :n] => ByRow(/) => :prop)
12×4 DataFrame
 Row │ Analysisclass         n      pos    prop        
     │ String31              Int64  Int64  Float64     
─────┼─────────────────────────────────────────────────
   1 │ Wetland_Wet_night      4917   2910  0.591824
   2 │ Wetland_Dry_night       951     56  0.0588854
   3 │ Wetland_Wet_day        5021   3014  0.600279
   4 │ Wetland_Dry_day         986     91  0.0922921
   5 │ aRice_Dry_night        5915    830  0.140321
   6 │ aRice_Dry_day          5098     13  0.00255002
   7 │ aRice_Wet_night        3767   1197  0.317759
   8 │ aRice_Wet_day          2624     54  0.0205793
   9 │ Nonhabitat_Dry_night   6575      2  0.000304183
  10 │ Nonhabitat_Dry_day     6417      0  0.0
  11 │ Nonhabitat_Wet_day      355      0  0.0
  12 │ Nonhabitat_Wet_night    355      0  0.0

@coverton-usgs
Author

Thank you @dmbates, that is what I anticipated also.

Normally, when I have a scenario where a habitat class is not used, I would group that class in with Nonhabitat, which would at least have SOME use since it is an amalgam of several things anyway. That approach failed here because it was Nonhabitat itself that had no use (despite it being painfully obvious to me that there was a high likelihood of that).
Alternatively, I have in the past simply removed the offending class (for both case==1 and case==0) and 'interpreted' the relative selection for that class as -Inf. It isn't a very robust estimate, but it is functionally what was observed. Is there anything incorrect about that interpretation that is apparent to you?

@coverton-usgs
Author

Is there a reason that all the individual_local_identifier values end in .1? It would be easier to store and convert those values to category labels if they were rounded to integers. But I don't want to do that if future values would not end in .1.

My apologies, I did not see your first response. The ".1" indicates that the individual is the first deployment of the transmitter whose serial number precedes it (we can recover and redeploy transmitters). In this dataset, that "number" should be treated as a factor. There are other values in the larger dataset, derived from different transmitters, that contain characters, so the field should be read as character/categorical when feeding it into the model.
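(For what it's worth, a minimal sketch of forcing that interpretation on the Julia side, assuming the column arrives as numeric; converting to strings, optionally wrapped in a CategoricalArray, is one option among several.)

using CategoricalArrays
data.individual_local_identifier = categorical(string.(data.individual_local_identifier))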

@dmbates
Collaborator

dmbates commented Aug 16, 2024

I have a shorter script to check the counts and especially the missing cells.

julia> unstack(combine(groupby(data, [:Analysisclass, :case]), nrow => :n), :case, :n)
12×3 DataFrame
 Row │ Analysisclass         false   true    
     │ String31              Int64?  Int64?  
─────┼───────────────────────────────────────
   1 │ Wetland_Wet_night       2007     2910
   2 │ Wetland_Dry_night        895       56
   3 │ Wetland_Wet_day         2007     3014
   4 │ Wetland_Dry_day          895       91
   5 │ aRice_Dry_night         5085      830
   6 │ aRice_Dry_day           5085       13
   7 │ aRice_Wet_night         2570     1197
   8 │ aRice_Wet_day           2570       54
   9 │ Nonhabitat_Dry_night    6573        2
  10 │ Nonhabitat_Dry_day      6417  missing 
  11 │ Nonhabitat_Wet_day       355  missing 
  12 │ Nonhabitat_Wet_night     355  missing

Checking with cropyear and with individual_local_identifier and their interaction also shows missing cells. There is very little that can be done about missing cells for the fixed effects. The random effects can tolerate a small number of missing cells, but it is not surprising that there are difficulties here. The interaction of cropyear and individual_local_identifier has 78 levels and 54 missing cells.
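A quick sketch along the same lines for counting the occupied cells of that interaction (combinations that never occur simply don't appear as rows; the 78 observed levels plus the 54 empty cells make up the full cross):

cells = combine(groupby(data, [:cropyear, :individual_local_identifier]), nrow => :n)
sort!(cells, :n)    # 78 rows; the sparsest combinations are the problematic ones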

@coverton-usgs
Copy link
Author

Thank you once again for your response and insights. I suppose I am somewhat fortunate that this problem is not very widespread across the different species and regions I am modelling, which paradoxically makes it more difficult to determine how to deal with unused covariates when they do exist. There is always an oddball; usually it's me.
