You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I've been looking into using OSeMOSYS PuLP to facilitate a project which needs some Monte Carlo runs. I've met a roadblock when trying to load data from another model programmatically. It's been quite difficult to understand the internal data structure currently used, but once I figured it out, it has proved difficult to modify cleanly.
Recommend using a dict of pandas DataFrames for parameters and sets
Currently, the data is read into a giant ragged pandas DataFrame which contains a lot of empty columns, and different parameters are listed on rows. This is not memory efficient, as you're reserving 13 columns of memory for each row.
I would recommend using a dictionary of dataframes, where the key is the parameter or set name, and the value the pandas DataFrame. Use long format for the dataframe (as you are currently using) e.g. for AccumulatedAnnualDemand you would have four columns: REGION, FUEL, YEAR and VALUE.
This would also match the format used by otoole which provides a number of handy import functions to convert between GNU MathProg models, and would make it very easy to begin using PuLP from other implementations.
Code structure
The createParameter, createTuple and createVariable helper functions exist primarily to deal with the storage of all the model elements in this giant DataFrame. These functions also obscure the otherwise straightforward syntax offered by PuLP.
The SETS and PARAMETERS AND DATA section are essentially more data munging - getting your data in shape before passinging it into the model. This could all be placed into a load_data function.
I'm a bit suspicious of PERMUTATION OF SETS because it looks like it creates the dense cross-product of all the sets, which means that you're then generating a dense matrix of constraints later on. If so, this is going to be extremely memory inefficient, and there will be many constraints that are not being used. The solver will then need to remove these before going onto solve the model - hurting performance and ability of PuLP to scale to larger models.
Model construction
In this section, you create a big dict containing all the specifications of the variables, and then create the variables using the createVariable function. Why not just call the PuLP function pulp.LpVariable directly? Do you need to pass in the cross-product of the sets to the variable creation function?
For the constraint generation, is it necessary to iterate over the cross-product of the indexes? In many cases, you only need to generate constraints for the indices for which you have data.
Also, it might simplify things to create functions for each constraint returning a pulp.LPConstraint type. This would also allow you to document what's happening in each constraint.
I also recommend wrapping the entire model in a create_model function which returns the model, and takes a package of data (e.g. the dict of DataFrames) as an argument. This would allow the model to be placed in a separate file, and make a clearer distinction between all the data management, running of the model, and model formulation itself.
The text was updated successfully, but these errors were encountered:
I've been looking into using OSeMOSYS PuLP to facilitate a project which needs some Monte Carlo runs. I've met a roadblock when trying to load data from another model programmatically. It's been quite difficult to understand the internal data structure currently used, but once I figured it out, it has proved difficult to modify cleanly.
Recommend using a dict of pandas DataFrames for parameters and sets
Currently, the data is read into a giant ragged pandas DataFrame which contains a lot of empty columns, and different parameters are listed on rows. This is not memory efficient, as you're reserving 13 columns of memory for each row.
I would recommend using a dictionary of dataframes, where the key is the parameter or set name, and the value the pandas DataFrame. Use long format for the dataframe (as you are currently using) e.g. for
AccumulatedAnnualDemand
you would have four columns:REGION, FUEL, YEAR
andVALUE
.This would also match the format used by
otoole
which provides a number of handy import functions to convert between GNU MathProg models, and would make it very easy to begin using PuLP from other implementations.Code structure
The
createParameter
,createTuple
andcreateVariable
helper functions exist primarily to deal with the storage of all the model elements in this giant DataFrame. These functions also obscure the otherwise straightforward syntax offered by PuLP.The
SETS
andPARAMETERS AND DATA
section are essentially more data munging - getting your data in shape before passinging it into the model. This could all be placed into aload_data
function.I'm a bit suspicious of
PERMUTATION OF SETS
because it looks like it creates the dense cross-product of all the sets, which means that you're then generating a dense matrix of constraints later on. If so, this is going to be extremely memory inefficient, and there will be many constraints that are not being used. The solver will then need to remove these before going onto solve the model - hurting performance and ability of PuLP to scale to larger models.Model construction
In this section, you create a big dict containing all the specifications of the variables, and then create the variables using the
createVariable
function. Why not just call the PuLP functionpulp.LpVariable
directly? Do you need to pass in the cross-product of the sets to the variable creation function?For the constraint generation, is it necessary to iterate over the cross-product of the indexes? In many cases, you only need to generate constraints for the indices for which you have data.
Also, it might simplify things to create functions for each constraint returning a
pulp.LPConstraint
type. This would also allow you to document what's happening in each constraint.I also recommend wrapping the entire model in a
create_model
function which returns the model, and takes a package of data (e.g. the dict of DataFrames) as an argument. This would allow the model to be placed in a separate file, and make a clearer distinction between all the data management, running of the model, and model formulation itself.The text was updated successfully, but these errors were encountered: