
Template v2 (plus Pint/Pydantic cleanups) #159

Closed

Conversation

MichaelTiemannOSC
Contributor

@MichaelTiemannOSC MichaelTiemannOSC commented Oct 14, 2022

This pull request enables processing of a "V2" template, in which corporate data is split between fundamental financial data on one sheet and ESG data on a second sheet. Other sheets from the V1 template remain as-is. The ESG data is tidier than in the V1 template, which makes it much easier to work with multi-year reports that may contain or create restatements. It also makes it far more practical to deal with the many scopes that may be reported (not just equity vs. operational vs. full for S1, and not just market vs. location for S2, but all 15 categories of S3, which would have been a nightmare in the V1 template).

And because the V2 template makes it practical to deal with detailed S3 emissions, which also bring uncertainties into the mix, this changeset brings uncertainties into the mix. Which required creating a custom version of Pint and Pint-Pandas, which you can grab from:

In the process of fixing things to work with uncertainties, the creaky work I had done last year with Pydantic and Pint had become untenable. It is now 80% cleaned up, and I'd be happy to get it to 100% in the course of this code review.

I'm also generating JSON files with a slightly simplified/improved format, which I expect to improve further as @kmarinushkin 's S3 changes are merged. Don't be alarmed to see JSON file changes here. And we really should have just a single place where our benchmark files live, not scattered across test, examples, etc. Another reason this is a draft!
Sources for generating the OECM benchmark files are here: https://github.com/os-climate/itr-data-pipeline/tree/template-v2

@joriscram and @dp90 feel free to add yourselves as reviewers if you like!

@kmarinushkin
Collaborator

My biggest concern in this PR is:

Which required creating a custom version of Pint and Pint-Pandas

From my perspective, that is not among the best practices of software development.
Could we instead create classes inside the ITR project that derive from Pint, adding the desired uncertainty support on top?

ITR/data/__init__.py
if isinstance(y, pd.DataFrame):
    # Something went wrong with the GroupBy operation and we got a pd.DataFrame
    # instead of being called column-by-column.
    breakpoint()
Collaborator


Do we want a breakpoint in production code?
It might have been helpful for debugging; maybe you just forgot it here?

Contributor Author


Yes, I did forget it. It's also a draft PR for discussion mainly. But you are correct that all breakpoints should be removed.

ITR/data/template.py
ITR/interfaces.py
ITR/interfaces.py
ITR/interfaces.py
try:
    production_units = company_dict[self.column_config.PRODUCTION_METRIC]['units']
except TypeError:
    breakpoint()
Collaborator


We don't propagate the TypeError further; is that intended?

Contributor Author


This is debugging code that should be removed, a vestige of the big Pydantic cleanup that still needs some small cleanups. Good catch.
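One way to remove the breakpoint while keeping the failure informative, sketched as a standalone function (the dict shape and column name below are illustrative, not the PR's actual code):

```python
# Sketch only: instead of trapping TypeError with a breakpoint, let it
# propagate with added context about what shape of data was expected.
def get_production_units(company_dict: dict, production_metric_col: str):
    try:
        return company_dict[production_metric_col]["units"]
    except TypeError as exc:
        value = company_dict.get(production_metric_col)
        raise TypeError(
            f"expected a mapping with a 'units' key under "
            f"{production_metric_col!r}, got {type(value).__name__}"
        ) from exc
```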

ITR/data/base_providers.py
ITR/data/template.py
ITR/interfaces.py
@kmarinushkin
Collaborator

My biggest concern in this PR is:

Which required creating a custom version of Pint and Pint-Pandas
From my perspective, it is not among the best practices of software development.

Understood, but I have been in touch with the Pint team

Should we wait until the Pint team merges the proposed Pint changes upstream?
If they propose modifications, we might need to re-adjust the content of this PR.

    return benchmark_production_projections.add(1).cumprod(axis=1).mul(
        company_production, axis=0)
except TypeError:
    breakpoint()
Collaborator


In your last commit, you forgot to un-mute the TypeError exceptions.

ITR/data/base_providers.py
ITR/data/template.py
ITR/data/template.py
Note that V2 data templates are not yet supported by the ITR tool.

Signed-off-by: MichaelTiemann <[email protected]>
WIP -- production metrics working.  emissions metrics next.

Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
Another WIP check-in: we are now failing in attempting to re-normalize a series with uncertainties + Pint types in base_providers _standardize.  Progress!

Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
WIP checkin...production/emission metrics still confused between types and dicts.

Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
With this code we successfully complete most of the V2 test case (there are numbers that have to be recalculated due to uncertainties math).  This also fixes fundamentally wrong assumptions and coding from MichaelTiemannOSC's first attempt at using Pydantic and Pint together.  Finally!

Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
For a long time the Pint and Pydantic codes have had a messy relationship (pint_ify).  That's now fixed (and gone).  Breaking changes to JSON schema.

Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
…ncertaintains.

We do not yet actually plot error bars...that's next.  But we can do all the calculations
the GUI requires, which took some work.

Signed-off-by: MichaelTiemann <[email protected]>
ITR/data/excel.py
@@ -269,12 +277,20 @@ def _get_projection(self, company_ids: List[str], projections: pd.DataFrame, pro

return projected_emissions_s1s2

def _convert_target_data(self, target_data: pd.DataFrame) -> List[ITargetData]:
"""
Collaborator


Convert target data to what?
May I suggest adding a comment to the method explaining what it does?

Contributor Author


It's really about taking a row of "target data" (meaning data about published company targets for decarbonization, i.e. achieving a specific reduction by a given date) and converting it into the ITargetData model.

epsilon = 1e-7
wavg = ufloat(sum([v.n/(v.s**2+epsilon) for v in values])/sum([1/(v.s**2+epsilon) for v in values]),
              np.sqrt(len(values)/sum([1/(v.s**2+epsilon) for v in values])))
if wavg.s == 0.0:
Collaborator


The fact that you pushed a fix in a separate commit tells me that we need unit test coverage for this method.

Contributor Author

@MichaelTiemannOSC MichaelTiemannOSC Oct 25, 2022


Completely correct. I edited #156 to include creating the test case.

Collaborator


I believe unit tests need to be added now;
otherwise we cannot guarantee the quality of this method.

Change benchmark structure to permit better analyses
(Note that TPI has not been similarly changed, so it is broken at the moment)
New notebook showing S1S2 vs. S1S2S3 benchmarks, and cumulative corp emissions
This is more of a WIP handoff...not yet complete (due to above)

Signed-off-by: Michael Tiemann <[email protected]>
u_pa_data = [_ufloat_nan if pn else pt if isinstance(pt, UFloat) else ufloat(pt, 0)
             for pt, pn in zip(pa.data, pa_nans)]
intensities[col] = pd.Series(PA_(u_pa_data, pa_units), dtype=pa.dtype,
                             index=intensities[col].index, name=intensities[col].name)
Collaborator


It looks like, this way, we completely skip intensities if HAS_UNCERTAINTIES is False.

Contributor Author


This code is merely doing by hand a kind of numerical promotion that only needs to be done if we find that our incoming series has some uncertainty values mixed with NaNs. The code could be written so that ALL numbers, upon initial parsing, are promoted to uncertainties (+/- 0.0), but that adds a lot of overhead to cases where we don't need it.

Presently, the code uses NaN to represent NULL values (a hold-over from pre-1.0 versions of Pandas). Lots of Pandas code does this, though in the modern era pd.NA (or pd.NaT for datetimes) is a perfectly reasonable NULL value. The Pint-Pandas folks are just about finished with changes so that the PintArray ExtensionArray will have a fully-working pd.NA value for its type, and when that happens, we won't need to create a Quantity(NaN, unit) but may be able to just use pd.NA as a generic NULL value.

In any case, this code is all about normalizing the NaNs, representing NULL, between the pd.nan world (pure float) and the UFloat world. We don't need to do any work if there are no NaNs in the particular series.

@@ -14,6 +11,7 @@
from dataclasses import dataclass

import pint
import ITR
Collaborator


Also, from a dependencies standpoint, I believe importing everything (import ITR) from anything (any ITR/*.py file) is not a healthy pattern.

Contributor Author


I think the unhealthy pattern would be:

from ITR import *

import ITR is just making it possible for us to reference ITR.xyz when we need xyz. But I'm open to discuss better ways to do this.

Collaborator


How about from ITR import xyz ?

Contributor Author


Well...do we really want to bring isnan or ufloat or UFloat into the namespace? Perhaps. By leaving it as import ITR the few references we do make can be prefixed as ITR.xyz. Since it seems pretty common for people to use np.nan or pd.NA I thought it would make sense to require the ITR. prefix for utility/math things that are not really ITR use-case-specific. But I'm open to do things otherwise.

company_data = data[
    [self.c.COLS.COMPANY_ID, self.c.COLS.TIME_FRAME, self.c.COLS.SCOPE, self.c.COLS.GHG_SCOPE12,
     self.c.COLS.GHG_SCOPE3, self.c.COLS.TEMPERATURE_SCORE, self.c.SCORE_RESULT_TYPE]
].groupby([self.c.COLS.COMPANY_ID, self.c.COLS.TIME_FRAME, self.c.COLS.SCOPE]).mean()
Collaborator


We have a wide copy-paste here; maybe we could optimize.

Contributor Author


Yes, it's all about deciding how much we want to make mean a polymorphic function across Float64 and UFloat. I didn't want to think too hard about it, but welcome suggestions.

@MichaelTiemannOSC
Contributor Author

My biggest concern in this PR is:

Which required creating a custom version of Pint and Pint-Pandas

From my perspective, it is not among the best practices of software development. Could we instead create classes inside the ITR project that derive from Pint, adding the desired uncertainty support on top?

Indeed this is the direction we are going: get everything else working and then drop in uncertainties afterward.

There's still quite some work to do to get the V2 input data to align with the specific scope expectations of both OECM and TPI, but with these changes "easy" cases can be visualized.

Signed-off-by: MichaelTiemann <[email protected]>
…, and without uncertainties

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC MichaelTiemannOSC marked this pull request as ready for review November 18, 2022 19:06
@MichaelTiemannOSC MichaelTiemannOSC added the enhancement New feature or request label Nov 18, 2022
@MichaelTiemannOSC MichaelTiemannOSC marked this pull request as draft November 28, 2022 19:24
@MichaelTiemannOSC
Contributor Author

This has been superseded by #166

@MichaelTiemannOSC MichaelTiemannOSC deleted the template-v2 branch September 7, 2023 09:46