
base_providers.py/_get_intensity_benchmarks is badly factored #152

Closed · MichaelTiemannOSC opened this issue Sep 14, 2022 · 4 comments
Labels: enhancement (New feature or request)

@MichaelTiemannOSC (Contributor):

As part of #146 and reviewing @kmarinushkin's PR (initial_load_all_scopes), I traced what would happen when benchmark data for Autos was properly classified as S3 rather than S1S2. While there are a few pitfalls along the way, the big problem is these functions in base_providers.py:

    def _get_projected_intensities(self, scope: EScope = EScope.S1S2) -> pd.DataFrame:
        """
        Converts IEmissionIntensityBenchmarkScopes into dataframe for a scope
        :param scope: a scope
        :return: pd.DataFrame
        """
        results = []
        for bm in self._EI_benchmarks.__getattribute__(str(scope)).benchmarks:
            results.append(self._convert_benchmark_to_series(bm))
        with warnings.catch_warnings():
            # pd.DataFrame.__init__ (in pandas/core/frame.py) ignores the beautiful dtype information adorning the pd.Series list elements we are providing.  Sad!
            warnings.simplefilter("ignore")
            df_bm = pd.DataFrame(results)
        df_bm.index.names = [self.column_config.REGION, self.column_config.SECTOR]
        return df_bm

    def _get_intensity_benchmarks(self, company_sector_region_info: pd.DataFrame,
                                  scope: EScope = EScope.S1S2) -> pd.DataFrame:
        """
        Overrides subclass method
        returns a Dataframe with production benchmarks per company_id given a region and sector.
        :param company_sector_region_info: DataFrame with at least the following columns :
        ColumnsConfig.COMPANY_ID, ColumnsConfig.SECTOR and ColumnsConfig.REGION
        :param scope: a scope
        :return: A DataFrame with company and intensity benchmarks per calendar year per row
        """
        benchmark_projection = self._get_projected_intensities(scope)  # TODO optimize performance
        reg_sec = company_sector_region_info[[self.column_config.REGION, self.column_config.SECTOR]].copy()
        # Check which region/sector combinations are missing from the benchmark
        merged_df = (reg_sec.reset_index()
                     .merge(benchmark_projection.reset_index()[[self.column_config.REGION, self.column_config.SECTOR]],
                            how='left', indicator=True)
                     .set_index('index'))
        # Remap the missing combinations to the "Global" region
        reg_sec.loc[merged_df._merge == 'left_only', self.column_config.REGION] = "Global"
        sectors = reg_sec.sector
        regions = reg_sec.region
        benchmark_projection = benchmark_projection.loc[list(zip(regions, sectors))]
        benchmark_projection.index = sectors.index
        return benchmark_projection
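To make the merge-and-fallback logic in `_get_intensity_benchmarks` concrete, here is a toy, self-contained illustration (the company ids, sectors, and regions are invented, and the `column_config` indirection is dropped):

```python
import pandas as pd

# Toy illustration of the "left_only" fallback above: region/sector pairs
# absent from the benchmark are remapped to the "Global" region.
benchmark = pd.DataFrame({"region": ["Global", "Europe"],
                          "sector": ["Steel", "Autos"]})
reg_sec = pd.DataFrame({"region": ["Asia", "Europe"],
                        "sector": ["Steel", "Autos"]},
                       index=["C1", "C2"])

merged = (reg_sec.reset_index()
          .merge(benchmark, how="left", indicator=True)
          .set_index("index"))
reg_sec.loc[merged["_merge"] == "left_only", "region"] = "Global"
print(reg_sec)  # C1 (Asia, Steel) falls back to "Global"; C2 keeps "Europe"
```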

Within these methods, self._EI_benchmarks does contain the full complexity of S1S2, S3, and S1S2S3 benchmark data. However, the per-company company_sector_region_info dataframe covers a heterogeneous set of companies: Electricity Utilities and Steel have meaningful S1 and S2 benchmark data (but no S3 benchmark data), while Autos benchmark data is dominated (by a factor of more than 100x) by S3. Because it is so small, the S1 and S2 data for Autos would not be missed.

From this perspective, we can see that it's not really right to ask for a single scope's benchmarks to be matched against a list of companies when sector so strongly determines which scope data is even available. One way to solve this is to call into these functions with only the company_sector_region_info companies that are relevant for the given scope.
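That filtering idea could be sketched as follows, assuming a hypothetical sector-to-scope mapping (in practice the mapping would be derived from the benchmark data itself, not hard-coded; all names and data here are invented):

```python
import pandas as pd

# Hypothetical sector -> relevant-scopes mapping for illustration only.
SECTOR_SCOPES = {
    "Electricity Utilities": {"S1S2"},
    "Steel": {"S1S2"},
    "Autos": {"S3"},
}

def filter_companies_for_scope(company_sector_region_info: pd.DataFrame,
                               scope: str) -> pd.DataFrame:
    """Keep only companies whose sector has benchmark data for `scope`."""
    mask = company_sector_region_info["sector"].map(
        lambda s: scope in SECTOR_SCOPES.get(s, set()))
    return company_sector_region_info[mask]

companies = pd.DataFrame({
    "company_id": ["C1", "C2", "C3"],
    "sector": ["Autos", "Steel", "Electricity Utilities"],
    "region": ["Europe", "Global", "North America"],
}).set_index("company_id")

print(filter_companies_for_scope(companies, "S3").index.tolist())  # ['C1']
```

The benchmark-matching functions would then only ever see companies for which the requested scope is meaningful.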

Any thoughts on where might be the right place to address these problems? And whether Kirill's changes to handle scopes actually points the way to a better factoring and implementation?

@MichaelTiemannOSC added the enhancement (New feature or request) label on Sep 14, 2022
@kmarinushkin (Collaborator):

If I understand correctly, we still want to calculate a temperature score for all geographical regions reported by a company, right?

If so, what we might want is DataFrames with not just company_id as index, but [company_id, region] as the index. I am new to pandas, but it looks like it has a MultiIndex class for such use cases. Please correct me if it doesn't fit our goal.

In my "scopes" work, I am investigating ways to calculate scores per company per scope, and experimenting with MultiIndex in this context.

If you find this option promising, we could turn our DataFrames into MultiIndex [company_id, region, scope], to support both work packages.

What do you think?
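A minimal sketch of that [company_id, region, scope] MultiIndex idea (the company ids, regions, and score values here are invented for illustration):

```python
import pandas as pd

# Build the proposed three-level index and select along its levels.
idx = pd.MultiIndex.from_tuples(
    [("C1", "Europe", "S3"),
     ("C2", "Global", "S1S2"),
     ("C2", "Global", "S3")],
    names=["company_id", "region", "scope"])
scores = pd.DataFrame({"temperature_score": [1.8, 2.1, 2.4]}, index=idx)

# All scores for one company, or one scope across all companies:
print(scores.loc["C2"])
print(scores.xs("S3", level="scope"))
```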

@MichaelTiemannOSC (Contributor, Author):

Kirill, your understanding of MultiIndex is correct, and I was thinking along similar lines. In fact, benchmarks are already pd.Series that use sector and region as a MultiIndex (the name parameter is set to a tuple in this constructor):

    def _convert_benchmark_to_series(self, benchmark: IBenchmark) -> pd.Series:
        """
        extracts the company projected intensities or targets for a given scope
        :param scope: a scope
        :return: pd.Series
        """
        return pd.Series({p.year: p.value for p in benchmark.projections}, name=(benchmark.region, benchmark.sector),
                         dtype=f'pint[{benchmark.benchmark_metric.units}]')

Note also that the docstring retains a vestigial reference to a scope parameter, though no such parameter exists any longer. Perhaps putting the scope into the benchmark's MultiIndex is indeed the right fix.
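What that might look like, sketched with scope added to the Series name tuple (toy invented data; the pint dtype used by the real _convert_benchmark_to_series is omitted to keep the example self-contained):

```python
import pandas as pd

# Including scope in each Series' name tuple lets the assembled benchmark
# DataFrame carry a (region, sector, scope) MultiIndex.
def convert_benchmark_to_series(projections, region, sector, scope):
    return pd.Series({p["year"]: p["value"] for p in projections},
                     name=(region, sector, scope))

s1 = convert_benchmark_to_series(
    [{"year": 2020, "value": 0.5}, {"year": 2025, "value": 0.3}],
    "Global", "Steel", "S1S2")
s2 = convert_benchmark_to_series(
    [{"year": 2020, "value": 0.2}, {"year": 2025, "value": 0.1}],
    "Europe", "Autos", "S3")

df_bm = pd.DataFrame([s1, s2])
# Made explicit here; the real code assigns df_bm.index.names directly.
df_bm.index = pd.MultiIndex.from_tuples(
    df_bm.index, names=["region", "sector", "scope"])
print(df_bm.loc[("Europe", "Autos", "S3")])
```

Callers could then select the scope level with .xs(scope, level="scope") instead of being handed a single scope's benchmarks up front.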

@MichaelTiemannOSC (Contributor, Author):

xref #162

@MichaelTiemannOSC (Contributor, Author):

This issue is obsolete. We now keep lots of data in company/scope columns, which preserves their Pint-like nature.
