Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qceff dartlab #740

Open
wants to merge 45 commits into
base: main
Choose a base branch
from
Open

qceff dartlab #740

wants to merge 45 commits into from

Conversation

hkershaw-brown
Copy link
Member

@hkershaw-brown hkershaw-brown commented Sep 25, 2024

Description:

DARTLAB qceff from Jceff

Fixes issue

Fixes #739
Note the slides are not in this pull request (yet!)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Documentation changes needed?

  • My change requires a change to the documentation.
    • I have updated the documentation accordingly.

Tests

Run by a group of workshop attendees at AGOS 2024

Checklist for merging

  • Updated changelog entry
  • Documentation updated
  • Update conf.py

Checklist for release

  • Merge into main
  • Create release from the main branch with appropriate tag
  • Delete feature-branch

Testing Datasets

  • Dataset needed for testing available upon request
  • Dataset download instructions included
  • No dataset needed

jlaucar added 30 commits May 1, 2024 10:33
THis involved adding optional arguments to return information about the weights
in the prior and posterior bins to obs_increment_rhf.m
…ne the

pdf for a bounded normal rank histogram given the ensemble members and
information about the tail normals.
distribution plot is updated whenever the buttons selecte a new
distribution.
…instead

of using hard coded bounds from oned_ensemble's axes.
…se where

bnrh is bounded on the left at 0. oned_ensemble is believed to be working
for all cases with the improved plotting. bounded_oned_ensemble is believed
to be functioning correctly except for the case of applying inflation with
the gamma or BNRH distributions.
in which the inflated prior mean value was not being correctly computed.
This was also changed in oned_ensemble.m where it did not produce a
wrong answer (since inflated mean is always the same as prior mean)
but could have led to a misunderstanding when looking at the code.
These are directly modified versions of the fortran routines of the same names.
They include slightly less error checking at the moment and should be further
polished for use in actual matlab assimilations as opposed to graphical demonstrations.
These have been tested for direct inversion and appear to be bitwise for the
small cases tested.
…unction

weighted_norm_inv.m out of obs_increment_rhf.m since it is needed for the
recently added cdf and inverse cdf of the bnrh. All inflation variants now work
in bounded_oned_ensemble.
exisiting guis can use it without modification.
an additional plot box for the transformed ensemble and new button
for the unobserved variable transform.
that observed variable is nonnegative. Put in appropriate error handling
if a negative value is clicked. Fixed the error messages for the observation
and obs error sd specification.
…ies.

The PPI space plots are now appropriately cleared in the same fashion
as the standard state space.
…rved

variable. Added PPI transforms for the observed variable. Options
limited to normal and unbounded RH since the observed variable is unbounded
in this demonstration.
distribution. Made correct use of SCALAR function for transforming
observed posterior with BNRH.
…nd the

existing oned_ensemble.m and twod_ensemble.m
… from

the update_ens callback. Updates the observation at each cycle and plots
the updated obs, prior and posterior.
the continuous distributions with solid curves. Added approximate
fitted continuous distributions for the EnKF for comparison.
Matlab guis. Sections 1, 2, 4 and 5 contain revised versions of the
same material as in the previous versions. Section 4 is new and describes
QCEFF filters. The EnKF is no longer described although the GUIs still
support it.
@hkershaw-brown
Copy link
Member Author

I can't review this because it is my pull request.

Changes needed:

guide/DART_LAB/DART_LAB.rst needs to have the new section 4. & probably the new tutorial exercises.

Also web: https://dart.ucar.edu/tutorials/dart-lab/

@hkershaw-brown
Copy link
Member Author

section 3 different title, needs updating in readme
section 4 different style (some previous version of the CISL branding, not the latest NSF NCAR), also weird characters. Note I have not (and will not - this pull request is a dump of content) look super closely at these slides this just stuck out:
Screenshot 2024-10-03 at 9 09 39 AM
Don't have the original slides to make any changes.

@hkershaw-brown
Copy link
Member Author

still waiting for slides, closing for now.
Probably will have someone else reopen this if there is need/desire to have the qceff DARTLAB released.

@hkershaw-brown
Copy link
Member Author

@jlaucar read this

@jlaucar jlaucar reopened this Dec 10, 2024
…tandards. Other

minor corrections have been made. The DART_LAB.rst has been updated to include
the new section titles and to list all of the new matlab main scripts.
@kdraeder kdraeder self-requested a review December 13, 2024 14:30
@kdraeder
Copy link
Contributor

I didn't get too far before stumbling on something that I need to resolve before continuing.

Section 1, slide 4 ("1:4") brings up the issue that I'm often confused in discussions of conditional probability because people seem to read the notation P(A|S) in a variety of ways, some of which seem contradictory.
I was taught to read that conditional probability as
"The probability of the measurement being T_O, given that the true temp is T"
This is consistent with Freund's Modern Elementary Statistics; a conditional probability P(A|S) is the
"probability of event A relative to the sample space S", aka the "probability of A given S".
So the thing before the | is a specific event or value.
Its probability depends on the space in which it happens,
or "all of the values which it could theoretically have", which is listed after the |.

It doesn't make sense to me to say "the probability of this space given that a specific event happened"

I'm guessing that all of the math is correct, but the descriptions confuse me (and maybe others).
If I'm not wrong I'll suggest alternatives.

@jlaucar
Copy link
Contributor

jlaucar commented Dec 13, 2024 via email

@kdraeder
Copy link
Contributor

kdraeder commented Dec 13, 2024

I've worked through Section 1 and have a list of other suggestions and questions,
but I'm hoping to clear up this question in order to make the list clearer and shorter.

I just want to dig into it enough to be able to interpret the figures
and descriptions using a consistent system of notation.
When I look at the graphs I interpret Temperature to be the T in the math;
the thing that varies or the space of possible values.
T_O and T_True are specific values of T. And that's how they're introduced.
But this interpretation trips me up in the definitions which use "( | )",
so I need a way to know when to reinterpret them and how to do it.

In a description of "likelihood" I dug up (Wikipedia) it says that
for x an instance of random variable X, and theta a parameter of a function f,
f(x|theta) is interpreted as a probability density function if x is fixed,
but as a likelihood if theta is fixed
(rather different from the conditional probability translation of that notation).
So maybe your "=T" notation is signalling which is fixed.
If that's the case, it would be clearer to me if the notation were T=T_True or T=T_O (like an assignment)
L(T) = P(T_O | T=T_True) likelihood
and
E(T) = P(T=T_O | T_True) probability density function:
E is harder to interpret; the probabilities of seeing specific value T_O given all the possible T_True values?
Then could that be E(T_True) = ... and L(T_O) = ... ? Those would clearly signal what's varying.

Later, in 1:20, things are labeled oppositely from this interpretation.
P(T_O,2 | T) is labeled the likelihood, but the thing before the | is what's fixed,
instead of the thing after the |.


Kevin, It's fine for you to dig in just a little on this topic, but... Some of the issues are related to the difference between discrete and continuous distributions for conditional probabilities and Bayes. There is a note in the slides that indicates that professional statisticians would be concerned about how we discuss things in a discrete way while applying to continuous. Doing this formally would drag into hours of basic probs and stats, so all we really need is to get to Bayes in some format that makes some sense. The formal distinction between an observational error distribution and a likelihood is challenging to even the best DA folks. For instance, Craig Bishop totally gums this up in his gamma/inverse gamma papers. There is no generally accepted notation that easily handles both views. Feel free to suggest some alternative wording that makes more sense and we can discuss. Jeff

@kdraeder
Copy link
Contributor

I'll stop trying to make all the notation consistent with my limited understanding.
I'm suggesting a few things that would help me understand the notation
in individual slides. There are non-notation suggestions too.

1:2 It would fit the picture better to use a red line at 1 C to represent the ob,
since the axes imply that T_O is happening with 0 probability.
Better yet would be have the star on a number line with no probability axis.

1:3 I think the picture is a little misleading. I'd omit the star (or line),
since it's being replaced by a more complete description; the curve.

1:4 The likelihood equation would be easier for me to read if it were
L(T_True) = P(T_O | T_True)
This tells me that the distribution is a function of T_True, for a value of the parameter T_O.
This would be clarified even more if the x-axis were labeled T_True.
Then I'd change "This is"... to "This is the relative probability of seeing T_O
given the space of possible true temperatures."

1:9 The notation here implies to me that this distribution is some kind of inversion of the likelihood.
But I could be missing the point.
The equation would be clearer to me as
E(T_O) = P(T_O | T_true)
and the x-axis would be T_O, with the red * now being T_True.
I'd be even more comfortable with
E(T_O) = P(T_True | T_O).
To me that would be "the relative probability of T_True being the true temperature,
given all of the possible T_O measurements."

1:12 Is the figure (still) showing T_O (now T_O,1) and its likelihood (or error) distribution?
1:13 And is it used, with some unspecified information, to create this prior dist.?

1:18 Do students ever ask why independent random errors is sufficient to
make measurements T_2 independent of T_1?
Is it because we've assumed unbiased obs?

Why is denominator unaffected by this assumption?
I don't know how to interpret P(T_1 | T_2).
Since they're independent obs, can I interpret it as P(T_1 , T_2), the probability of seeing T_1 and T_2 ?

1:19 Since T = T_True, the equation would be clearer to me with T replaced by T_True.
The expressions do get cumbersome with lots of T_Trues in them,
So T_True could be replaced (throughout) by T_T.

1;22 the T_O,2 likelihood looks identical to the T_O in previous slides.
I think T_O,1 is also that T_O. It shouldn't matter, but it's confusing if the 2 obs are the same (see 1:29).

1:24 Is this a definition of the normalization we want to use,
which is maybe unrelated to the denominator in the 2 ob Bayes' (1:18)?

1:29 It looks like T_O is being redefined here to describe T_O,2,
which is equal to T_O,1 (unless that's changed in 1:22).

1:44 Do you care about spaces in equations in 2.? sigma has more than T.

1:45 A nitpick; the subscripts look like zeros. Should they be capital o?

1:50 Notational consistency; the T subscripts in earlier equations have
the kind of T (update, prior, ...) noted first, then the time (1,2....).
Here time is first, and member, which could be seen as a kind, is second.
It would be more consistent if these new ones were T_{n,time}.

@kdraeder
Copy link
Contributor

I'm suggesting a lot of notation changes.
I'd be willing to do the grunt work of changing them,
so that your time constraints don't prevent useful,
but maybe not essential, updates from happening.

2:2 "additional variable" ->? "unobserved additional variable"

2:32 first slide where 'h' appears, but the topic has moved beyond it.
Should it appear in 2:29 or 2:30?

2:38 Mixed notation convention here; some variables have subscript k to denote a time,
x has subscript t_k to denote the same thing.
? Does x need the more complete description for later equations?
Related to Y_{t_k} in 2:42.

2:39 Obs. error covariance is used as the standard deviation (variance?) of the normal.
Is this a common (good? useful?) generalized way of describing it?
When I see 'covariance' I start looking for things that it co-varies with,
but I think there aren't any here.

2:45 x in likelihood doesn't need time subscript?

2:47 "All zeros except a single 1 in the last number of obs columns."
But each column contains an exended state element. Obs are in rows.
Also, wouldn't the identity obs operator have 1s in a different column
for each row, so that each ob is picking out a different state variable?
Maybe leave this description for 2:48.

2:48 This makes me think that the description (and in 2:47) should be
"Each row of Hhat_k is all zeros except for a single 1
in a unique column in the observation section." or similar.
Are the 1s always on the obs-section diagonal,
or would it show the true nature better by scrambling the rows a bit?

2:49 The x argument in the H is the extended state here (^),
but wasn't in earlier equations.

2:50 Might want to highlight "Extended" to point out that the picture is not redundant.
Is the prior x also an extended state?

2:51 (and 2:52) Does this picture want an h (or H)?

2:54 Would it be useful to have a picture (or exercize) showing some stage(s)
of the assimilation operating on the regular state and estimated obs
parts of the extended state (different colors)?
This would show that the estimated obs are being processed the same
as the conventional state.

2:55 If we want to go the extra mile, this slide could have a different distribution
of estimated obs, to point out the variation with each successive ob.

2:56 "next time with observations" ->? "next time having observations"
to prevent "with" from referring to "model state" instead of "time"

@kdraeder
Copy link
Contributor

I didn't test scripts and exercizes in this section as of 2024-12-20

3:4 "1/5 chance that this is in any given bin." ->
"On average, 1/5 chance that this is in any given bin"
Or maybe
"It may be in any of the 5 bins"

3:5 Nitpick; the rank histogram Count axis would be more accurate and less cluttered
if only whole numbers were displayed. (and in subsequent slides).
Do the comments about this slide make the transition that
the ensemble members of an assimilation are draws from a distribution and
the observation serves as the truth?

3:9 "Rank histograms for good ensembles" ->? "Rank histograms for good assimilations"
since the RH comes from many ensembles, some good, some bad.
? Do you want to keep the implicit definition in these slides that "truth" is a series
of observations and "ensemble" is a series of draws or members from a sequence
of assimilations? I worry that this further muddies peoples understanding of "ensemble".
A simple change to clarify this might be
"Want truth"... -> "On average, want truth"...

3:10 Then "A biased ensemble" -> "Biased ensembles"

3:11 And so forth.

3:13 Locating the truth in a bin is OK for an idealized experiment,
but wouldn't it communicate the same ideas to use the observation instead,
which would translate directly to real obs assimilations?

3:21 In the Note; "inflation" -> "variance inflation"
would provide an immediate explanation and reinforce that inflation is applied to the variance.

3:34 "Weakly correlated observations" ->? "weak correlations"
since it's discussing correlations between state variables and observations.

3:39 "no weight is being given to state variables on the opposite side of the domain from an observation"
->? "the regression factor falls to 0 near the opposite side of the domain from an observation"...

@kdraeder
Copy link
Contributor

4:6 This equation isn't used in the rest of the slides, so maybe it's not worth making the notation consistent.
But if it is, the Y still uses the double subscript notation, while x uses a new 'x,t_k',
which is confusing for people who read it as a conditional probability;
the probability of time k happening (and x) given all the previous obs.

4:15 1. 'an ensemble' -> 'a prior ensemble' would motivate the F^p notation in 2.
4. 'Modify the PDF' Should this be the CDF?

4:22 Nitpick; 'section 1' -> 'Section 1'

4:23 Is 'conjugate pairs' just an interesting aside for people who recognize it?

4:24-34 I didn't (can't) confirm the content, but I didn't see anything that looks like a typo
My only suggestion is to put something in the header of the 4th column; Application? Notes?

4:42 I don't know the U(0, 1) abbreviation.
I gathered later that it means "spans the space, exclusive of the end values".

4:49 I'm guessing that the numbers closest to the green stars are the number of members
which have that value, but it's a little challenging to apply that to the curve.
The dots andor circles don't seem to denote members.
I can decipher it by noting that each increase in CDF of 1/(8+1) represents 1 member,
regardless of the dots and circles.

4:53 'Correct distribution contours' -> 'Contours for correct distribution'
would be less misinterpretable.

4:56 I don't follow what 'This is the probability integral transform' refers to.

4:57 Up to here quantiles have been discreet, taken from members representing a distribution.
'Quantile function' seems to be a new thing; a continuous function of a continuous quantile variable.
Is this just the inversion of 4:20?

4:59 Do you want a legend in the second figure?

4:64 What do the red arrows point to?
If the top and bottom rows are the unobserved posterior and prior, what are the other rows?
Is the vertical axis the cutoff value of the localization?

@kdraeder
Copy link
Contributor

Again, I'm suggesting notation changes. Tell me if I should implement any of them
and point me to the source files.

A motivation slide could be helpful; why and when is inflation needed?

5.2 'adaptive error tolerant filter' -> 'adaptive error-tolerant filter' throughout.
This is what I think is shown here, but it took me a while to (mis)understand it,
based on the current content. If the background assumptions are in the verbal comments
near the beginning, then I probably would have followed it fine.
But if this should be a free-standing slide, then I think the figure
should be smaller and there should be more description.
I think that 'expected separation' brings up questions of what we expect and why,
which don't need to be asked or answered here. If that term needs to be kept,
then in the figure '4.714 SDs' should be changed to '4.7 ESs' (and in following slides).
The text could be

 1. Tiny probabilities at overlap; unlikely they describe the same thing.
 2. The 'consistency threshhold' of 2 distributions is  [sqrt(sigmi_p^2 + sigma_o^2)] = ~0.85
 3. Actual separation is 4.7 times larger; one or both are poor representations of the truth
 Model error? Obs error? Random chance?

5.3 'expected separation' causes additional confusion here; to me, increasing
the expected separation means "I expect them to appear more separated",
while inflation makes them appear less separated. So
3. ->? 'Inflating increases the consistency threshhold (combined uncertainty)
and makes them appear more consistent.'

5.4 > The second comment (and prior usage) makes me think that
'prior mean y to obs is' should be 'prior mean to obs y is'
but 'y' isn't used, so it could be omitted. How about
'Distance from prior mean to obs distribution; D = N(0, ...) == N(0,theta)'
> Then there's y_O or y_0. It's ambiguous and later slides make me think
that it's y_0 (zero) because of the y_k (time index) variables.
Can we omit the subscript here?

5.5 'Use Bayesian"... looks like magic. I assume it doesn't matter at this point.
(Again) does the 'lambda,t_k' notation mean lambda at time t_k
and not 'the probability of t_k and lambda, given all the previous obs'?

5.6 If y_k is sufficient notation, how about simplifying Y_{t_k} to Y_k? (throughout)
And use lambda_k instead of lambda,t_k?

5.7 > Is one intention of the top figure to show that variance "inflation" can reduce the variance?
> It's not clear if this slide shows combining distributions or specific probabilities.
The 2 stars in the second figure make me think "specific".
But plotting them on a probability graph is implies that the posterior has 0 probability,
which is probably not what's intended. And the prior and likelihood are both 0.75,
which also doesn't make sense.
Ah! Reading ahead to 5.8 clarifies it somewhat. How about switching the lambda values
in 5.8 and 5.7, so that it's easier to interpret what's happening?
And then the first "inflation" that readers see is actually inflating the prior.

5.13 'One option is to use Gaussian prior for lambda.'
Wasn't this already assumed in 5.10 (5.6)? Should this be 'posterior'?

5.15 This version would be easier for me to follow.
1. Evaluate the numerator at mean (p_m == p(lambda_u^bar))
and second point (p_sigma == p(lambda_u^bar + sigma_{lambda,p})
2. Find sigma_{lambda,u}^2 so N(...) goes through p_m and p_sigma.
3. Compute as sigma.... where r = p_sigma/p_m

5.16 Labeling the x-axis would emphasize that the picture is not showing
the updated inflation distribution.

5.18 3. would be clearer to me (and still correct?) as
"Use inflated prior to compute posterior ensemble of estimated y."

5.19 Nitpick; spacing of p() is messed up.

5.20 It could be helpful to make the connection between the 'inflation mean' and 'value';
'Minimum value of inflation (mean), often'...

5.26 2. Shouldn't there be the step of applying the inflation between b. and c.?

@jlaucar
Copy link
Contributor

jlaucar commented Dec 23, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: DARTLAB QCEFF
3 participants