Template v2 (plus Pint/Pydantic cleanups) #159
Conversation
Force-pushed from b8f8915 to e6ed78f.
My biggest concern in this PR is: from my perspective, it is not among the "best practices" of software development.
```python
if isinstance(y, pd.DataFrame):
    # Something went wrong with the GroupBy operation and we got a pd.DataFrame
    # instead of being called column-by-column.
    breakpoint()
```
Do we want a breakpoint in production code? This might have been helpful for debugging; maybe you just forgot it here?
Yes, I did forget it. It's also a draft PR for discussion mainly. But you are correct that all breakpoints should be removed.
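(For what it's worth, a minimal sketch of what the leftover debugging hook could become once removed; the function name and message below are illustrative, not code from this PR:)

```python
import pandas as pd

def _by_column(y):
    # Hypothetical guard: if GroupBy hands us the whole DataFrame instead of
    # calling us column-by-column, fail loudly rather than calling breakpoint().
    if isinstance(y, pd.DataFrame):
        raise TypeError(f"expected a pd.Series, got {type(y).__name__}")
    ...  # normal column-by-column handling
```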
ITR/data/base_providers.py (outdated)
```python
try:
    production_units = company_dict[self.column_config.PRODUCTION_METRIC]['units']
except TypeError:
    breakpoint()
```
We don't propagate the TypeError any further; is that intended?
This is debugging code that should be removed: vestiges of the big Pydantic cleanup that still needs some small cleanups. Good catch.
Should we wait until the Pint team merges the proposed …
```python
try:
    return benchmark_production_projections.add(1).cumprod(axis=1).mul(
        company_production, axis=0)
except TypeError:
    breakpoint()
```
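(For context, a toy illustration of what that return line computes, with made-up growth rates and base production; the variable names below are placeholders:)

```python
import pandas as pd

# Benchmark projections are per-year growth rates; cumprod of (1 + rate)
# compounds them, and mul scales each row by that company's base production.
growth = pd.DataFrame({2021: [0.10, 0.05], 2022: [0.10, 0.05]},
                      index=["c1", "c2"])
base_production = pd.Series([100.0, 200.0], index=["c1", "c2"])
projected = growth.add(1).cumprod(axis=1).mul(base_production, axis=0)
# c1: 110.0, 121.0; c2: 210.0, 220.5
```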
In your last commit, you forgot to un-mute the TypeError exceptions.
Note that V2 data templates are not yet supported by the ITR tool. Signed-off-by: MichaelTiemann <[email protected]>
WIP -- production metrics working, emissions metrics next. Signed-off-by: [email protected] Signed-off-by: MichaelTiemann <[email protected]>
Another WIP check-in: we are now failing in attempting to re-normalize a series with uncertainties + Pint types in base_providers _standardize. Progress! Signed-off-by: [email protected] Signed-off-by: MichaelTiemann <[email protected]>
WIP check-in... production/emission metrics still confused between types and dicts. Signed-off-by: [email protected] Signed-off-by: MichaelTiemann <[email protected]>
With this code we successfully complete most of the V2 test case (there are numbers that have to be recalculated due to uncertainties math). This also fixes fundamentally wrong assumptions/coding from MichaelTiemannOSC's first attempt at using Pydantic and Pint together. Finally! Signed-off-by: [email protected] Signed-off-by: MichaelTiemann <[email protected]>
For a long time the Pint and Pydantic code has had a messy relationship (pint_ify). That's now fixed (and gone). Breaking changes to JSON schema. Signed-off-by: [email protected] Signed-off-by: MichaelTiemann <[email protected]>
…ncertainties. We do not yet actually plot error bars... that's next. But we can do all the calculations the GUI requires, which took some work. Signed-off-by: MichaelTiemann <[email protected]>
Force-pushed from b86bcbc to 50b6f3a.
```
@@ -269,12 +277,20 @@ def _get_projection(self, company_ids: List[str], projections: pd.DataFrame, pro

        return projected_emissions_s1s2

    def _convert_target_data(self, target_data: pd.DataFrame) -> List[ITargetData]:
        """
```
Convert target data to what? May I suggest adding a comment or a docstring to the method explaining what it does?
It's really about taking a row of "target data" (meaning data about published company targets for decarbonization, i.e. achieving a specific reduction by a given date) and converting it into the ITargetData model.
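(A hand-wavy sketch of that shape, purely for illustration; the field names on this ITargetData stand-in are invented, not the real model's:)

```python
from typing import List
import pandas as pd
from pydantic import BaseModel

class ITargetData(BaseModel):
    # Stand-in model; the real ITargetData lives in ITR's interfaces.
    target_scope: str
    base_year: int
    end_year: int
    reduction_ambition: float

def _convert_target_data(target_data: pd.DataFrame) -> List[ITargetData]:
    # One published decarbonization target per row.
    return [ITargetData(**row) for row in target_data.to_dict(orient="records")]
```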
```python
# Weighted average of UFloats: the nominal value is the inverse-variance
# weighted mean; epsilon keeps exact values (s == 0) from dividing by zero.
epsilon = 1e-7
wavg = ufloat(sum([v.n/(v.s**2+epsilon) for v in values])/sum([1/(v.s**2+epsilon) for v in values]),
              np.sqrt(len(values)/sum([1/(v.s**2+epsilon) for v in values])))
if wavg.s == 0.0:
```
The fact that you pushed a fix in a separate commit tells me that we need unit test coverage for this method.
Completely correct. I edited #156 to include creating the test case.
I believe unit tests need to be added now; otherwise we cannot guarantee the quality of this method.
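(To make the suggestion concrete, a minimal pytest sketch; `umean` here is a hypothetical standalone extraction of the weighted-average logic above, not a function in this PR:)

```python
import numpy as np
import pytest
from uncertainties import ufloat

def umean(values, epsilon=1e-7):
    # Inverse-variance weighted mean, extracted from the snippet above.
    weights = [1 / (v.s**2 + epsilon) for v in values]
    nominal = sum(v.n * w for v, w in zip(values, weights)) / sum(weights)
    return ufloat(nominal, np.sqrt(len(values) / sum(weights)))

def test_low_uncertainty_value_dominates():
    # The value with the smaller standard deviation should dominate the mean.
    result = umean([ufloat(10.0, 0.1), ufloat(20.0, 10.0)])
    assert result.n == pytest.approx(10.0, abs=0.1)

def test_equal_uncertainties_reduce_to_plain_mean():
    # With equal uncertainties the weighted mean is just the plain mean.
    result = umean([ufloat(1.0, 1.0), ufloat(3.0, 1.0)])
    assert result.n == pytest.approx(2.0)
```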
Change benchmark structure to permit better analyses. (Note that TPI has not been similarly changed, so it is broken at the moment.) New notebook showing S1S2 vs. S1S2S3 benchmarks and cumulative corp emissions. This is more of a WIP handoff; not yet complete (due to the above). Signed-off-by: Michael Tiemann <[email protected]>
```python
u_pa_data = [_ufloat_nan if pn else pt if isinstance(pt, UFloat) else ufloat(pt, 0)
             for pt, pn in zip(pa.data, pa_nans)]
intensities[col] = pd.Series(PA_(u_pa_data, pa_units), dtype=pa.dtype,
                             index=intensities[col].index, name=intensities[col].name)
```
Looks like, this way, we completely skip `intensities` if HAS_UNCERTAINTIES is False.
This code is merely doing by hand a kind of numerical promotion that only needs to happen if we find that our incoming series has some uncertainty values mixed with NaNs. The code could be written so that ALL numbers, upon initial parsing, are promoted to uncertainties (+/- 0.0), but that adds a lot of overhead to cases where we don't need it.
Presently, the code uses NaN to represent NULL values (a hold-over from pre-1.0 versions of Pandas). Lots of Pandas code does this, though in the modern era pd.NA (or pd.NaT for datetimes) is a perfectly reasonable NULL value. The Pint-Pandas folks are just about finished with changes so that the PintArray ExtensionArray will have a fully-working pd.NA value for its type, and when that happens we won't need to create a Quantity(NaN, unit) but may be able to just use pd.NA as a generic NULL value.
In any case, this code is all about normalizing the NaNs, representing NULL, between the np.nan world (pure float) and the UFloat world. We don't need to do any work if there are no NaNs in the particular series.
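(A standalone sketch of the promotion logic being described, with hypothetical names; the PR's actual code operates on PintArray data, not plain lists:)

```python
import numpy as np
from uncertainties import UFloat, ufloat

_ufloat_nan = ufloat(np.nan, 0.0)

def promote_if_mixed(values):
    # No uncertainties present: leave the pure-float series alone.
    if not any(isinstance(v, UFloat) for v in values):
        return values
    # Otherwise normalize: float NaN -> UFloat NaN, plain float -> ufloat(x, 0).
    return [_ufloat_nan if isinstance(v, float) and np.isnan(v)
            else (v if isinstance(v, UFloat) else ufloat(v, 0.0))
            for v in values]
```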
```
@@ -14,6 +11,7 @@

from dataclasses import dataclass

import pint
import ITR
```
Also, from a dependencies standpoint, I believe importing everything (`import ITR`) from anything (any ITR/...py) is not a healthy pattern.
I think the unhealthy pattern would be `from ITR import *`. `import ITR` just makes it possible for us to reference ITR.xyz when we need xyz. But I'm open to discussing better ways to do this.
How about `from ITR import xyz`?
Well... do we really want to bring `isnan` or `ufloat` or `UFloat` into the namespace? Perhaps. By leaving it as `import ITR`, the few references we do make can be prefixed as `ITR.xyz`. Since it seems pretty common for people to use `np.nan` or `pd.NA`, I thought it would make sense to require the `ITR.` prefix for utility/math things that are not really ITR use-case-specific. But I'm open to doing things otherwise.
```python
company_data = data[
    [self.c.COLS.COMPANY_ID, self.c.COLS.TIME_FRAME, self.c.COLS.SCOPE, self.c.COLS.GHG_SCOPE12,
     self.c.COLS.GHG_SCOPE3, self.c.COLS.TEMPERATURE_SCORE, self.c.SCORE_RESULT_TYPE]
].groupby([self.c.COLS.COMPANY_ID, self.c.COLS.TIME_FRAME, self.c.COLS.SCOPE]).mean()
```
We have a wide copy-paste here; maybe we could optimize it.
Yes, it's all about deciding how much we want to make `mean` a polymorphic function across Float64 and UFloat. I didn't want to think too hard about it, but welcome suggestions.
Indeed, this is the direction we are going: get everything else working and then drop in uncertainties afterward.
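(One possible shape for that polymorphism, sketched with hypothetical names; since the uncertainties package overloads arithmetic, a single aggregator can serve both dtypes:)

```python
import pandas as pd

def mean_any(series: pd.Series):
    # Works for float64 columns and for object columns holding UFloats,
    # because sum() relies only on +, which uncertainties overloads.
    vals = list(series)
    return sum(vals) / len(vals) if vals else float("nan")

# usage sketch (column names are placeholders):
# company_data = data[cols].groupby(group_cols).agg(mean_any)
```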
There's still quite some work to do to get the V2 input data to align with the specific scope expectations of both OECM and TPI, but with these changes "easy" cases can be visualized. Signed-off-by: MichaelTiemann <[email protected]>
…, and without uncertainties. Signed-off-by: Michael Tiemann <[email protected]>
This has been superseded by #166.
This pull request enables processing of a "V2" template, in which corp data is split between fundamental financial data on one sheet and ESG data on a second sheet. Other sheets from the V1 template remain as-is. The ESG data is tidier than in the V1 template, and makes it really easy to work with multi-year reports that may contain/create restatements. It also makes it much more practical to deal with the many scopes that may be reported (not just equity vs. operational vs. full for S1, and not just market vs. location for S2, but all 15 categories of S3, which would have been a nightmare in the V1 template).
And because the V2 template makes it practical to deal with detailed S3 emissions, which also bring uncertainties into the mix, this changeset brings uncertainties into the mix as well. That required creating custom versions of Pint and Pint-Pandas, which you can grab from:
In the process of fixing things to work with uncertainties, the creaky work I had done last year with Pydantic and Pint became untenable. It is now 80% cleaned up, and I'd be happy to get it to 100% in the course of this code review.
I'm also generating JSON files with a slightly simplified/improved format, which I expect to improve further as @kmarinushkin's S3 changes are merged. Don't be alarmed to see JSON file changes here. And we really should have just a single place where our benchmark files live, not scattered across test, examples, etc. Another reason this is a draft!
Sources for generating the OECM benchmark files are here: https://github.com/os-climate/itr-data-pipeline/tree/template-v2
@joriscram and @dp90 feel free to add yourselves as reviewers if you like!