
base_providers.py/_get_intensity_benchmarks is badly factored #152

Closed · MichaelTiemannOSC opened this issue Sep 14, 2022 · 4 comments
Labels: enhancement (New feature or request)

@MichaelTiemannOSC (Contributor):

As part of #146 and reviewing @kmarinushkin's PR (initial_load_all_scopes), I traced what would happen when benchmark data for Autos was properly classified as S3 rather than S1S2. While there are a few pitfalls along the way, the big problem is these functions in base_providers.py:

    def _get_projected_intensities(self, scope: EScope = EScope.S1S2) -> pd.DataFrame:
        """
        Converts IEmissionIntensityBenchmarkScopes into dataframe for a scope
        :param scope: a scope
        :return: pd.DataFrame
        """
        results = []
        for bm in self._EI_benchmarks.__getattribute__(str(scope)).benchmarks:
            results.append(self._convert_benchmark_to_series(bm))
        with warnings.catch_warnings():
            # pd.DataFrame.__init__ (in pandas/core/frame.py) ignores the beautiful dtype information adorning the pd.Series list elements we are providing.  Sad!
            warnings.simplefilter("ignore")
            df_bm = pd.DataFrame(results)
        df_bm.index.names = [self.column_config.REGION, self.column_config.SECTOR]
        return df_bm

    def _get_intensity_benchmarks(self, company_sector_region_info: pd.DataFrame,
                                  scope: EScope = EScope.S1S2) -> pd.DataFrame:
        """
        Overrides subclass method
        returns a Dataframe with production benchmarks per company_id given a region and sector.
        :param company_sector_region_info: DataFrame with at least the following columns :
        ColumnsConfig.COMPANY_ID, ColumnsConfig.SECTOR and ColumnsConfig.REGION
        :param scope: a scope
        :return: A DataFrame with company and intensity benchmarks per calendar year per row
        """
        benchmark_projection = self._get_projected_intensities(scope)  # TODO optimize performance
        reg_sec = company_sector_region_info[[self.column_config.REGION, self.column_config.SECTOR]].copy()
        # Check which region/sector combinations are missing from the benchmark
        merged_df = (reg_sec.reset_index()
                     .merge(benchmark_projection.reset_index()[[self.column_config.REGION, self.column_config.SECTOR]],
                            how='left', indicator=True)
                     .set_index('index'))
        # Remap the missing combinations to the "Global" region
        reg_sec.loc[merged_df._merge == 'left_only', self.column_config.REGION] = "Global"
        sectors = reg_sec.sector
        regions = reg_sec.region
        benchmark_projection = benchmark_projection.loc[list(zip(regions, sectors))]
        benchmark_projection.index = sectors.index
        return benchmark_projection
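To make the merge-and-fallback logic in `_get_intensity_benchmarks` concrete, here is a toy, self-contained illustration (the company ids, sectors, and regions are invented, and the `column_config` indirection is dropped):

```python
import pandas as pd

# Toy illustration of the "left_only" fallback above: region/sector pairs
# absent from the benchmark are remapped to the "Global" region.
benchmark = pd.DataFrame({"region": ["Global", "Europe"],
                          "sector": ["Steel", "Autos"]})
reg_sec = pd.DataFrame({"region": ["Asia", "Europe"],
                        "sector": ["Steel", "Autos"]},
                       index=["C1", "C2"])

merged = (reg_sec.reset_index()
          .merge(benchmark, how="left", indicator=True)
          .set_index("index"))
reg_sec.loc[merged["_merge"] == "left_only", "region"] = "Global"
print(reg_sec)  # C1 (Asia, Steel) falls back to "Global"; C2 keeps "Europe"
```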

Within these methods, self._EI_benchmarks does contain the full complexity of S1S2, S3, and S1S2S3 benchmark data. However, the per-company company_sector_region_info dataframe covers a heterogeneous set of companies: Electricity Utilities and Steel have meaningful S1 and S2 benchmark data (but no S3 benchmark data), while Autos benchmark data is dominated (by a factor of more than 100x) by S3. Because it is so small, the S1 and S2 data for Autos would not be missed.

From this perspective, we can see that it's not really right to ask for a single scope's benchmarks to be matched against a list of companies when sector so strongly determines which scope data is even available. One way to solve this is to call into these functions with only the company_sector_region_info companies that are relevant for the given scope.
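That filtering idea could be sketched as follows, assuming a hypothetical sector-to-scope mapping (in practice the mapping would be derived from the benchmark data itself, not hard-coded; all names and data here are invented):

```python
import pandas as pd

# Hypothetical sector -> relevant-scopes mapping for illustration only.
SECTOR_SCOPES = {
    "Electricity Utilities": {"S1S2"},
    "Steel": {"S1S2"},
    "Autos": {"S3"},
}

def filter_companies_for_scope(company_sector_region_info: pd.DataFrame,
                               scope: str) -> pd.DataFrame:
    """Keep only companies whose sector has benchmark data for `scope`."""
    mask = company_sector_region_info["sector"].map(
        lambda s: scope in SECTOR_SCOPES.get(s, set()))
    return company_sector_region_info[mask]

companies = pd.DataFrame({
    "company_id": ["C1", "C2", "C3"],
    "sector": ["Autos", "Steel", "Electricity Utilities"],
    "region": ["Europe", "Global", "North America"],
}).set_index("company_id")

print(filter_companies_for_scope(companies, "S3").index.tolist())  # ['C1']
```

The benchmark-matching functions would then only ever see companies for which the requested scope is meaningful.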

Any thoughts on where might be the right place to address these problems? And whether Kirill's changes to handle scopes actually points the way to a better factoring and implementation?

@MichaelTiemannOSC added the enhancement (New feature or request) label on Sep 14, 2022
@kmarinushkin (Collaborator):

If I understand correctly, we still want to calculate a temperature score for all geographical regions reported by a company, right?

If so, what we might want is DataFrames with not just company_id as index, but [company_id, region] as the index. I am new to pandas, but it looks like it has a MultiIndex class for such use cases. Please correct me if it doesn't fit our goal.

In my "scopes" work, I am investigating ways to calculate scores per company per scope, and experimenting with MultiIndex in this context.

If you find this option promising, we could turn our DataFrames into MultiIndex [company_id, region, scope], to support both work packages.

What do you think?
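A minimal sketch of that [company_id, region, scope] MultiIndex idea (the company ids, regions, and score values here are invented for illustration):

```python
import pandas as pd

# Build the proposed three-level index and select along its levels.
idx = pd.MultiIndex.from_tuples(
    [("C1", "Europe", "S3"),
     ("C2", "Global", "S1S2"),
     ("C2", "Global", "S3")],
    names=["company_id", "region", "scope"])
scores = pd.DataFrame({"temperature_score": [1.8, 2.1, 2.4]}, index=idx)

# All scores for one company, or one scope across all companies:
print(scores.loc["C2"])
print(scores.xs("S3", level="scope"))
```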

@MichaelTiemannOSC (Contributor, Author):

Kirill, your understanding of MultiIndex is correct, and I was thinking along similar lines. In fact, benchmarks are already pd.Series that use sector and region as a MultiIndex (the name parameter is set to a tuple in this constructor):

    def _convert_benchmark_to_series(self, benchmark: IBenchmark) -> pd.Series:
        """
        extracts the company projected intensities or targets for a given scope
        :param scope: a scope
        :return: pd.Series
        """
        return pd.Series({p.year: p.value for p in benchmark.projections}, name=(benchmark.region, benchmark.sector),
                         dtype=f'pint[{benchmark.benchmark_metric.units}]')

Note also that the docstring retains a vestigial reference to a scope parameter, though no such parameter exists any longer. Perhaps putting the scope into the benchmark's MultiIndex is indeed the right fix.
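What that might look like, sketched with scope added to the Series name tuple (toy invented data; the pint dtype used by the real _convert_benchmark_to_series is omitted to keep the example self-contained):

```python
import pandas as pd

# Including scope in each Series' name tuple lets the assembled benchmark
# DataFrame carry a (region, sector, scope) MultiIndex.
def convert_benchmark_to_series(projections, region, sector, scope):
    return pd.Series({p["year"]: p["value"] for p in projections},
                     name=(region, sector, scope))

s1 = convert_benchmark_to_series(
    [{"year": 2020, "value": 0.5}, {"year": 2025, "value": 0.3}],
    "Global", "Steel", "S1S2")
s2 = convert_benchmark_to_series(
    [{"year": 2020, "value": 0.2}, {"year": 2025, "value": 0.1}],
    "Europe", "Autos", "S3")

df_bm = pd.DataFrame([s1, s2])
# Made explicit here; the real code assigns df_bm.index.names directly.
df_bm.index = pd.MultiIndex.from_tuples(
    df_bm.index, names=["region", "sector", "scope"])
print(df_bm.loc[("Europe", "Autos", "S3")])
```

Callers could then select the scope level with .xs(scope, level="scope") instead of being handed a single scope's benchmarks up front.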

@MichaelTiemannOSC (Contributor, Author):

xref #162

@MichaelTiemannOSC (Contributor, Author):

This issue is obsolete. We now keep lots of data in company/scope columns, which preserves their Pint-like nature.
