
Template v2 (plus Pint/Pydantic cleanups) #159

Closed

Conversation

MichaelTiemannOSC
Contributor

@MichaelTiemannOSC MichaelTiemannOSC commented Oct 14, 2022

This pull request enables processing of a "V2" template, in which corporate data is split between fundamental financial data on one sheet and ESG data on a second sheet. Other sheets from the V1 template remain as-is. The ESG data is tidier than in the V1 template, which makes it much easier to work with multi-year reports that may contain or create restatements. It also makes it far more practical to deal with the many scopes that may be reported (not just equity vs. operational vs. full for S1, and not just market vs. location for S2, but all 15 categories of S3, which would have been a nightmare in the V1 template).

And because the V2 template makes it practical to deal with detailed S3 emissions, which also bring uncertainties into the mix, this changeset brings uncertainties into the mix. Which required creating a custom version of Pint and Pint-Pandas, which you can grab from:

In the process of fixing things to work with uncertainties, the creaky work I had done last year with Pydantic and Pint had become untenable. It is now 80% cleaned up, and I'd be happy to get it to 100% in the course of this code review.

I'm also generating JSON files with a slightly simplified/improved format, which I expect to improve further as @kmarinushkin 's S3 changes are merged. Don't be alarmed to see JSON file changes here. And we really should have just a single place where our benchmark files live, not scattered across test, examples, etc. Another reason this is a draft!
Sources for generating the OECM benchmark files are here: https://github.com/os-climate/itr-data-pipeline/tree/template-v2

@joriscram and @dp90 feel free to add yourselves as reviewers if you like!

@kmarinushkin
Collaborator

My biggest concern in this PR is:

Which required creating a custom version of Pint and Pint-Pandas

From my perspective, that is not among the best practices of software development.
Could we instead create classes inside the ITR project that derive from Pint, adding the desired uncertainty support on top?

ITR/data/__init__.py
if isinstance(y, pd.DataFrame):
    # Something went wrong with the GroupBy operation and we got a pd.DataFrame
    # instead of being called column-by-column.
    breakpoint()
Collaborator


Do we want a breakpoint in production code?
It might have been helpful for debugging; maybe you just forgot it here?

Contributor Author


Yes, I did forget it. It's also a draft PR for discussion mainly. But you are correct that all breakpoints should be removed.

ITR/data/template.py
ITR/interfaces.py
ITR/interfaces.py
ITR/interfaces.py
try:
    production_units = company_dict[self.column_config.PRODUCTION_METRIC]['units']
except TypeError:
    breakpoint()
Collaborator


We don't propagate the TypeError further; is that intended?

Contributor Author


This is debugging code that should be removed, a vestige of the big Pydantic cleanup that still needs some small cleanups. Good catch.
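One way to remove the breakpoint while keeping the failure informative, sketched as a standalone function (the dict shape and column name below are illustrative, not the PR's actual code):

```python
# Sketch only: instead of trapping TypeError with a breakpoint, let it
# propagate with added context about what shape of data was expected.
def get_production_units(company_dict: dict, production_metric_col: str):
    try:
        return company_dict[production_metric_col]["units"]
    except TypeError as exc:
        value = company_dict.get(production_metric_col)
        raise TypeError(
            f"expected a mapping with a 'units' key under "
            f"{production_metric_col!r}, got {type(value).__name__}"
        ) from exc
```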

ITR/data/base_providers.py
ITR/data/template.py
ITR/interfaces.py
@kmarinushkin
Collaborator

My biggest concern in this PR is:

Which required creating a custom version of Pint and Pint-Pandas
From my perspective, it is not among the best practices of software development.

Understood, but I have been in touch with the Pint team

Should we wait until the Pint team merges the proposed Pint changes upstream?
If they propose modifications, we might need to re-adjust the content of this PR.

    return benchmark_production_projections.add(1).cumprod(axis=1).mul(
        company_production, axis=0)
except TypeError:
    breakpoint()
Collaborator


In your last commit, you forgot to un-mute the TypeError exceptions.

ITR/data/base_providers.py
ITR/data/template.py
ITR/data/template.py
Note that V2 data templates are not yet supported by the ITR tool.

Signed-off-by: MichaelTiemann <[email protected]>
WIP -- production metrics working.  emissions metrics next.

Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
Another WIP check-in: we are now failing in attempting to re-normalize a series with uncertainties + Pint types in base_providers _standardize.  Progress!

Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
WIP checkin...production/emission metrics still confused between types and dicts.

Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
With this code we successfully complete most of the V2 test case (there are numbers that have to be recalculated due to uncertainties math).  This also fixes fundamentally wrong assumptions and coding from MichaelTiemannOSC's first attempt at using Pydantic and Pint together.  Finally!

Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
For a long time the Pint and Pydantic codes have had a messy relationship (pint_ify).  That's now fixed (and gone).  Breaking changes to JSON schema.

Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
…ncertaintains.

We do not yet actually plot error bars...that's next.  But we can do all the calculations
the GUI requires, which took some work.

Signed-off-by: MichaelTiemann <[email protected]>
ITR/data/excel.py
@@ -269,12 +277,20 @@ def _get_projection(self, company_ids: List[str], projections: pd.DataFrame, pro

return projected_emissions_s1s2

def _convert_target_data(self, target_data: pd.DataFrame) -> List[ITargetData]:
"""
Collaborator


Convert target data to what?
May I suggest adding a comment to the method explaining what it does?

Contributor Author


It's really about taking a row of "target data" (meaning data about published company targets for decarbonization, i.e. achieving a specific reduction by a given date) and converting it into the ITargetData model.

epsilon = 1e-7
wavg = ufloat(sum([v.n/(v.s**2+epsilon) for v in values])/sum([1/(v.s**2+epsilon) for v in values]),
              np.sqrt(len(values)/sum([1/(v.s**2+epsilon) for v in values])))
if wavg.s == 0.0:
Collaborator


The fact that you pushed a fix in a separate commit tells me that we need unit test coverage for this method.

Contributor Author

@MichaelTiemannOSC MichaelTiemannOSC Oct 25, 2022


Completely correct. I edited #156 to include creating the test case.

Collaborator


I believe unit tests need to be added now;
otherwise we cannot guarantee the quality of this method.

Change benchmark structure to permit better analyses
(Note that TPI has not been similarly changed, so it is broken at the moment)
New notebook showing S1S2 vs. S1S2S3 benchmarks, and cumulative corp emissions
This is more of a WIP handoff...not yet complete (due to above)

Signed-off-by: Michael Tiemann <[email protected]>
u_pa_data = [_ufloat_nan if pn else pt if isinstance(pt, UFloat) else ufloat(pt, 0)
             for pt, pn in zip(pa.data, pa_nans)]
intensities[col] = pd.Series(PA_(u_pa_data, pa_units), dtype=pa.dtype,
                             index=intensities[col].index, name=intensities[col].name)
Collaborator


It looks like, this way, we completely skip intensities if HAS_UNCERTAINTIES is False.

Contributor Author


This code is merely doing by hand a kind of numerical promotion that only needs to be done if we find that our incoming series has some uncertainty values mixed with NaNs. The code could be written so that ALL numbers, upon initial parsing, are promoted to uncertainties (+/- 0.0), but that adds a lot of overhead to cases where we don't need it.

Presently, the code uses NaN to represent NULL values (a hold-over from pre-1.0 versions of Pandas). Lots of Pandas code does this, though in the modern era pd.NA (or pd.NaT for datetimes) is a perfectly reasonable NULL value. The Pint-Pandas folks are just about finished with changes so that the PintArray ExtensionArray will have a fully-working pd.NA value for its type, and when that happens, we won't need to create a Quantity(NaN, unit) but may be able to just use pd.NA as a generic NULL value.

In any case, this code is all about normalizing the NaNs, representing NULL, between the pd.nan world (pure float) and the UFloat world. We don't need to do any work if there are no NaNs in the particular series.

@@ -14,6 +11,7 @@
from dataclasses import dataclass

import pint
import ITR
Collaborator


Also, from a dependencies standpoint, I believe importing everything (import ITR) from anything (any ITR/*.py file) is not a healthy pattern.

Contributor Author


I think the unhealthy pattern would be:

from ITR import *

import ITR is just making it possible for us to reference ITR.xyz when we need xyz. But I'm open to discuss better ways to do this.

Collaborator


How about from ITR import xyz ?

Contributor Author


Well...do we really want to bring isnan or ufloat or UFloat into the namespace? Perhaps. By leaving it as import ITR the few references we do make can be prefixed as ITR.xyz. Since it seems pretty common for people to use np.nan or pd.NA I thought it would make sense to require the ITR. prefix for utility/math things that are not really ITR use-case-specific. But I'm open to do things otherwise.

company_data = data[
    [self.c.COLS.COMPANY_ID, self.c.COLS.TIME_FRAME, self.c.COLS.SCOPE, self.c.COLS.GHG_SCOPE12,
     self.c.COLS.GHG_SCOPE3, self.c.COLS.TEMPERATURE_SCORE, self.c.SCORE_RESULT_TYPE]
].groupby([self.c.COLS.COMPANY_ID, self.c.COLS.TIME_FRAME, self.c.COLS.SCOPE]).mean()
Collaborator


We have a wide copy-paste here; maybe we could optimize.

Contributor Author


Yes, it's all about deciding how much we want to make mean a polymorphic function across Float64 and UFloat. I didn't want to think too hard about it, but welcome suggestions.

@MichaelTiemannOSC
Contributor Author

My biggest concern in this PR is:

Which required creating a custom version of Pint and Pint-Pandas

From my perspective, it is not among the best practices of software development. Could we instead create classes inside the ITR project that derive from Pint, adding the desired uncertainty support on top?

Indeed this is the direction we are going: get everything else working and then drop in uncertainties afterward.

There's still quite some work to do to get the V2 input data to align with the specific scope expectations of both OECM and TPI, but with these changes "easy" cases can be visualized.

Signed-off-by: MichaelTiemann <[email protected]>
…, and without uncertainties

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC MichaelTiemannOSC marked this pull request as ready for review November 18, 2022 19:06
@MichaelTiemannOSC MichaelTiemannOSC added the enhancement New feature or request label Nov 18, 2022
@MichaelTiemannOSC MichaelTiemannOSC marked this pull request as draft November 28, 2022 19:24
@MichaelTiemannOSC
Contributor Author

This has been superseded by #166

@MichaelTiemannOSC MichaelTiemannOSC deleted the template-v2 branch September 7, 2023 09:46