Establish new python package "swmm-pandas" #85

Closed
michaeltryby opened this issue Aug 17, 2021 · 19 comments

Comments

@michaeltryby
Contributor

michaeltryby commented Aug 17, 2021

Purpose of this package would be to marshal swmm data into and out of pandas data frames.

@michaeltryby
Contributor Author

michaeltryby commented Aug 17, 2021

@jennwuu so I understand that moving data from pyswmm to pandas is pretty straightforward. The thought I have is that if a lot of people are interested in doing it and rolling their own code for that purpose, then it would make sense to offer a module.

This was the same thought process that led to the creation of pyswmm. Everyone was writing their own wrappers, but no one was extending SWMM's C API. That was the hook.

When I look at the other packages out there, a lot of them are using pandas to keep data organized. We are getting inquiries about it from the community at large so I think pandas should be on our radar. But we need a hook.

What I would like to create is a family of easy to use python modules in the "swmm" namespace that work together and allow developers to leverage other packages and standard formats to construct powerful and interoperable workflows.

@karosc
Member

karosc commented Aug 19, 2021

I like the idea of a swmm-pandas oriented package, but I'd rather see a general tool for post processing outputs that leverages the pandas/numpy api in places than something stamped with the pandas brand; such a stamp might limit the scope of a potentially powerful project in the long run. More than just providing the single attribute timeseries per model element as a dataframe, I think there might be value in the following:

  1. Pulling multiple attributes for multiple elements of a given type into a single dataframe a la:

    link   datetime        attribute          value
    COND1  1/1/2021 00:00  flow_rate_cfs      1
    COND1  1/1/2021 00:00  flow_velocity_fps  0.01
    COND2  1/1/2021 00:00  flow_rate_cfs      2
    COND2  1/1/2021 00:00  flow_velocity_fps  0.02

    or

    datetime        COND1_flow_rate_cfs  COND2_flow_rate_cfs
    1/1/2021 00:00  1                    2
    1/1/2021 01:00  2                    3
    1/1/2021 02:00  3                    4
  2. Grouping output data into "events" for the purpose of comparing stats (think multiple design storms in a single run)

  3. A cli for pulling output data directly into a sqlite database or HDF file for the purpose of connecting to other things (think dashboard tools or spreadsheets)

  4. Adding report file module to allow quick assessment of model stats already put out by the model (I realize there are existing tools for this, but I think there is value in having a one-stop-shop package for swmm output post processing)

  5. Integrated plotting tools to quickly compare simulated flows and velocities in a link to those estimated by conventional engineering methods (e.g. manning's equation)

  6. Adding a more object-based api for pulling single element, single attribute timeseries (e.g. outputs.links["COND1"].flow -> pd.Series), to improve interactive coding experience via auto-completion tools.
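
The two layouts in item 1 are the usual long and wide forms, and pandas can convert between them. A minimal sketch, using the sample values from the tables in item 1; the column names come from that example, and nothing here reads a real SWMM output file:

```python
import pandas as pd

# Long layout: one row per (link, datetime, attribute) combination.
long_df = pd.DataFrame(
    {
        "link": ["COND1", "COND1", "COND2", "COND2"],
        "datetime": pd.to_datetime(["2021-01-01 00:00"] * 4),
        "attribute": ["flow_rate_cfs", "flow_velocity_fps"] * 2,
        "value": [1.0, 0.01, 2.0, 0.02],
    }
)

# Wide layout: one row per datetime, one column per link/attribute pair.
wide_df = long_df.pivot(index="datetime", columns=["link", "attribute"], values="value")
# Flatten the two-level column index into names like "COND1_flow_rate_cfs".
wide_df.columns = ["_".join(col) for col in wide_df.columns]

# And back again: melt the wide table into one row per series and datetime.
round_trip = wide_df.reset_index().melt(
    id_vars="datetime", var_name="series", value_name="value"
)
print(wide_df)
```

Which form is more convenient probably depends on the consumer: plotting and spreadsheets tend to want the wide form, while databases and group-by style statistics tend to want the long form.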

I'd start developing something like this under my own account, but think there is more value in housing it within the OpenWaterAnalytics org. The brand might attract users and centralize development of the vision. Do you find the above features too specific to include in an OWA repo? Or do you see a place for them in the canon?

I think much of @jennwuu's pyswmm fork could be leveraged in such a package, providing a solid launching point.

@michaeltryby
Contributor Author

I like your thinking @karosc. You have identified several compelling use cases. I certainly think the software you are envisioning has a home here at OWA.

I have an interest in developing a Python based UI for SWMM and EPANET. I have been thinking about a suitable data model for such an application. Current thinking is geopandas. Do you think it would be possible to design the software so that your application and a UI could share code? How could we go about doing that?

@cbuahin

cbuahin commented Aug 20, 2021

@michaeltryby, how would the UI you are envisioning be different from the effort being undertaken here?

@michaeltryby
Contributor Author

@cbuahin First off I want to say that I was asking @karosc some leading questions to gently encourage him to think about some software design objectives that he may not have been considering. One of the problems is that there are so many duplicative Python packages in the SWMM space that don't work together. So I'm interested in strategies to get around that. I agree with @karosc that there needs to be centralization of development and vision. Achieving it is the fun part and the hard part.

I do have a desire to complete what was started in the SWMM-EPANET_UI project. That is one of the reasons for creating the swmm-toolkit and epanet-toolkit packages. We made several major mistakes in the prosecution of that work that I have an ambition to correct ...

@karosc
Member

karosc commented Aug 21, 2021

I have been thinking about a suitable data model for such an application. Current thinking is geopandas. Do you think it would be possible to design the software so that your application and a UI could share code? How could we go about doing that?

@michaeltryby, I like the idea of having a geopandas-like way of visualizing model data. Such an api would also allow easy exporting of model features to shapefiles and geodatabases, which I think provides lots of value to modelers like myself who use GIS to map models for reports. A proper and complete SWMM data model has been on my wish list for the longest time.

Although I know there must be some SWMM object model buried in the SWMM-EPANET_UI Project mentioned above, the only existing one I've used is swmmio, which I think is pretty good, but has limited use cases in my job duties.

It is great for quickly QAQC'ing someone else's SWMM model. I can have preexisting logic that checks object attributes against typical values or assumptions, flagging things that are significantly different. This process streamlines model review significantly and better allows me to ask the right questions during review.
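
As a rough illustration of that kind of review logic, here is a minimal sketch in plain Python. The attribute names, the ranges, and the `flag_outliers` helper are all hypothetical placeholders, not swmmio's API and not recommended engineering limits:

```python
# Illustrative bounds only; a real review would use project-specific values.
TYPICAL_RANGES = {
    "roughness": (0.009, 0.030),
    "slope": (0.0005, 0.10),
}


def flag_outliers(elements, ranges):
    """Return (element, attribute, value) for every value outside its typical range."""
    flags = []
    for name, attributes in elements.items():
        for attribute, (low, high) in ranges.items():
            value = attributes.get(attribute)
            if value is not None and not (low <= value <= high):
                flags.append((name, attribute, value))
    return flags


# Hypothetical conduit attributes as they might be pulled from an input file.
conduits = {
    "C1": {"roughness": 0.013, "slope": 0.004},
    "C2": {"roughness": 0.13, "slope": 0.004},   # likely a misplaced decimal
    "C3": {"roughness": 0.013, "slope": 0.0},    # flat pipe
}

flags = flag_outliers(conduits, TYPICAL_RANGES)
for flag in flags:
    print(flag)
```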

swmmio is also great for running many alternative model configurations that don't alter model geometry (e.g. alternative inflows, hydrology, or attributes of existing elements). However, all the SWMM models I build are georeferenced and are usually derived from GIS databases that show the exact location of pipes in the ground. With faster computers and better data management, clients are asking to include every single pipe in their system in the model, while most of the older legacy models we maintain are simplified representations of the actual system. While swmmio allows me to add and remove model components for alternatives, it is not easy to add new geometries without having things up on an interactive basemap; this is where GUIs come in really handy.

The problem with most GUIs is that they are not easily programmable. In my ideal GUI, every change that is made to the model is logged by the GUI, and the python commands necessary to make those changes are also extractable from said GUI. This kind of feature is available in ArcGIS software, in which past geoprocessing commands started from the GUI are shown in a "results" list, and you can right click them, click a "copy python command" button, and paste the command into a script.

This kind of feature would make scenario management buttery smooth. The act of managing the baseline model, alternative models, and children of alternative models would come down to managing the order of a bunch of python commands, which could be documented in detail with python comments. I personally am not fully satisfied with any of the SWMM scenario managers I've tried (mike urban, InfoSWMM, infoworks ICM, PCSWMM). I want to be able to edit a model in a clean, unobtrusive UI with a fast GIS mapping engine, then log my changes to the model in a python script so I can regenerate all my scenarios should I make a change to the baseline.
This might be a huge ask from a GUI, and a feature that might not be fully appreciated by the whole community, but it's a dream of mine.
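
To make the command-logging idea a bit more concrete, here is a minimal sketch. The `CommandLog` class and the nested-dict model are hypothetical stand-ins; no existing SWMM package is assumed to work this way:

```python
from dataclasses import dataclass, field


@dataclass
class CommandLog:
    """Hypothetical recorder: applies an edit and keeps the Python call that reproduces it."""

    commands: list = field(default_factory=list)

    def set_attribute(self, model, section, name, attribute, value):
        # Apply the edit to the in-memory model ...
        model[section][name][attribute] = value
        # ... and record the equivalent Python command as text.
        self.commands.append(
            f"log.set_attribute(model, {section!r}, {name!r}, {attribute!r}, {value!r})"
        )

    def as_script(self):
        return "\n".join(self.commands)


def make_baseline():
    return {"CONDUITS": {"COND1": {"roughness": 0.013, "length": 400.0}}}


# Edits made interactively (here, by hand) are logged as they happen.
edited = make_baseline()
log = CommandLog()
log.set_attribute(edited, "CONDUITS", "COND1", "roughness", 0.015)
log.set_attribute(edited, "CONDUITS", "COND1", "length", 425.0)
print(log.as_script())

# Replaying the logged script against a fresh baseline regenerates the scenario.
# exec() is used only to keep the demonstration short.
fresh = make_baseline()
exec(log.as_script(), {"log": CommandLog(), "model": fresh})
```

With something like this, a scenario is just the baseline plus an ordered script, which is the property that makes regenerating alternatives after a baseline change cheap.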

Finally, something that swmmio can't do natively, which I think any swmm object model should do, is clearly and concisely diffing models. I want to be able to load two input files and have python tell me the difference between them. Furthermore, if the diffing engine could then produce the python commands necessary to transform model1 into model2, that would be super cool and contribute to the scenario management use case above.
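
A diff of this kind could be sketched as follows, assuming each model has already been parsed into nested dicts of the form `{section: {element: {attribute: value}}}`. That structure and the `diff_models` helper are assumptions for illustration, not an existing API:

```python
def diff_models(old, new):
    """List the (action, section, name, attribute, value) steps that turn `old` into `new`.

    Simplification: attributes that exist in `old` but not in `new` are ignored.
    """
    changes = []
    for section in sorted(set(old) | set(new)):
        old_items = old.get(section, {})
        new_items = new.get(section, {})
        for name in sorted(set(old_items) | set(new_items)):
            if name not in new_items:
                changes.append(("remove", section, name, None, None))
            elif name not in old_items:
                changes.append(("add", section, name, None, dict(new_items[name])))
            else:
                for attribute, value in sorted(new_items[name].items()):
                    if old_items[name].get(attribute) != value:
                        changes.append(("set", section, name, attribute, value))
    return changes


model1 = {
    "JUNCTIONS": {"J1": {"invert": 100.0}, "J2": {"invert": 98.0}},
    "CONDUITS": {"C1": {"roughness": 0.013}},
}
model2 = {
    "JUNCTIONS": {"J1": {"invert": 100.0}, "J3": {"invert": 95.0}},
    "CONDUITS": {"C1": {"roughness": 0.015}},
}

changes = diff_models(model1, model2)
for change in changes:
    print(change)
```

Each tuple maps naturally onto one editing command, so the same list could be rendered as a script that transforms model1 into model2.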

I think the end game features I have in mind are rather ambitious and would come later in the life of the tool, after the core API is stable and reasonably final. With respect to module organization, I'd like to see something like swmm.inp, swmm.out, and swmm.rpt, with each module providing a python api for interacting with the respective SWMM file. out and rpt would be read only modules of course, while inp would provide read/write capabilities. With respect to sharing a code base, I'm not sure how these modules could share much code since the three files with which they interact all inherently store different data in different formats and would require different data models. However, I do see potential for an even higher level module that ties the three together into a holistic model module. Tools that require knowing things about the model inputs and outputs would live in the higher level module, using the lower level modules to access data, then manipulating and summarizing that data to a more general report (think the model vs manning's example in my last post, or perhaps rapidly pulling up the model's unstable/time-step critical elements listed in the rpt file into a single table with their attributes and output stats).
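
For the model-versus-Manning comparison mentioned above, the estimate itself is a one-line formula. A minimal sketch for a circular pipe flowing full in US customary units; the function names and sample numbers are illustrative assumptions, and 1.49 is the commonly rounded unit constant:

```python
import math


def manning_full_pipe_flow(diameter_ft, slope, roughness):
    """Full-flow capacity (cfs) of a circular pipe from Manning's equation."""
    area = math.pi * diameter_ft**2 / 4.0  # flow area, ft^2
    hydraulic_radius = diameter_ft / 4.0   # area / wetted perimeter for a full circle, ft
    return (1.49 / roughness) * area * hydraulic_radius ** (2.0 / 3.0) * math.sqrt(slope)


def percent_difference(simulated_cfs, estimated_cfs):
    """Difference of the simulated flow relative to the Manning estimate, in percent."""
    return 100.0 * (simulated_cfs - estimated_cfs) / estimated_cfs


# A 2 ft pipe at 0.5% slope with n = 0.013 (made-up inputs).
capacity = manning_full_pipe_flow(diameter_ft=2.0, slope=0.005, roughness=0.013)
print(round(capacity, 1), "cfs")
```

In the higher-level module described above, the pipe geometry would come from the input file and the simulated flow from the output file, with a helper like this tying the two together.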

There are lots of possibilities here, but I think the swmm.out module is very attainable in the near future, as well as the swmm.rpt module. The swmm.inp module and accompanying object model is in my opinion far more ambitious than the others, and would require some careful design keeping end goals in mind (perhaps some pseudocode or a flow chart of some sort is in order to help map out api capability). I'm not a software engineer by training, so I may not have the foresight necessary to design an elegant and capable api that doesn't cause problems/limitations down the road (if that's even possible for anyone at all 😉), but I'm on board to help in any way I can!

@karosc
Member

karosc commented Aug 23, 2021

Out of curiosity, would you propose that the swmm.out and swmm.rpt packages above live in their own github repositories? Or would you suggest they be python packages within separate directories of the swmm-python repository?

@michaeltryby
Contributor Author

michaeltryby commented Aug 24, 2021

@karosc You have laid out an ambitious vision. Can we agree that a columnar data format such as pandas / geopandas would make an appropriate data model for a UI? Such a package would also be of utility to many other smaller purpose-built applications that want to work with pandas.

Like you, what I am envisioning is a package containing modules for marshaling swmm input, report, output, and other data into and out of data frames. It would also provide the ability to filter and reshape data frames to facilitate tailoring data for custom applications.

At the moment, we are organizing python packages in this repo. As the number of packages grows we may have to revisit that decision, but we can cross that bridge when we come to it.

Finally, I would recommend studying Markus Pichler's work found here:
https://gitlab.com/markuspichler/swmm_api

@michaeltryby
Contributor Author

As I recall swmm-toolkit was reorganized several times before it settled into its current form.

Right now what we need is simply a place for these modules to live while they evolve realizing that we should remain flexible as our shared understanding of the use case grows.

So I propose swmm-python/swmm-pandas as a straw man location for the package realizing that it may get changed at some point in the future.

Any other suggestions or objections?

@karosc
Member

karosc commented Aug 28, 2021

I did some thinking about how best to pull multiple time series into a single pandas dataframe and found that numpy provides its own swig wrapper. I am wondering what you think about adding some functions to output.c that would allow pulling multiple time series into multidimensional arrays. Personally, I think it would make the python code a bit more concise to be able to ask the C extension for an array, then just build the dataframe from that array (maybe it would provide some speed benefits as well?).

I pulled together a working example in my forks of the outputAPI and the swig wrapper.

The function works like below:

In [1]: from swmm.toolkit import output, shared_enum, output_metadata
   ...: pth = 'test_Example1.out'
   ...: handle = output.init()
   ...: nodeIDXs=[1,2]
   ...: output.open(handle,pth)

In [2]: output.get_node_array(handle,nodeIDXs,shared_enum.NodeAttribute.INVERT_DEPTH,0,10)
Out[2]: 
array([[0.        , 0.        ],
       [0.5213659 , 0.20744564],
       [3.        , 0.29623413],
       [3.        , 0.40630534],
       [3.        , 0.30344424],
       [0.3971988 , 0.15370804],
       [0.12404095, 0.02930042],
       [0.05183569, 0.01554774],
       [0.03520615, 0.01035316],
       [0.02636687, 0.00766949]], dtype=float32)

The naming convention and actual functionalities are still TBD; I just wanted to see if I could pull together a working example of getting a 2D numpy array from C, then get your input on potentially incorporating this kind of thing into the main repo for swmm-pandas.

@michaeltryby
Contributor Author

michaeltryby commented Aug 30, 2021

@karosc I appreciate your initiative on this. Couple of thoughts ...

Have you heard the phrase, "avoid premature optimization"? It looks like your objective is to build multiple time series as quickly as possible. Unfortunately, the data layout in the binary file is not amenable to building time series. To build one, the entire file needs to be traversed and the series built up one value at a time, and this is repeated for every series you want to create. The data layout is the bottleneck.

Even in C, file IO is relatively slow compared to working with variables in memory. If you want to build many time series quickly I think you would be better off working with the variables in memory rather than through file IO. This would involve slurping all the output data into a data frame and then filtering and reshaping it to create the time series you want.
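
To make the layout point concrete: because results are written one reporting period at a time, a single element's time series is scattered across the whole file. A minimal sketch with numpy, assuming a heavily simplified layout (a fixed number of float32 values per period, with no header and no date stamps); the real .out format has more to it:

```python
import numpy as np

n_periods, n_nodes = 4, 3

# Synthetic results: value = period * 10 + node index, so entries are easy to recognise.
by_period = np.array(
    [[period * 10 + node for node in range(n_nodes)] for period in range(n_periods)],
    dtype=np.float32,
)

# Bytes as they would sit on disk in this simplified layout:
# all of period 0's values, then all of period 1's, and so on.
raw = by_period.tobytes()

# "Slurp" the whole buffer once, then view it as a (period, node) table in memory.
table = np.frombuffer(raw, dtype=np.float32).reshape(n_periods, n_nodes)

# One node's time series is a column: a single value taken from every period record.
node1_series = table[:, 1]
print(node1_series.tolist())

# The stride between consecutive values of one series is a full period record,
# which is why reading a series from disk touches every part of the file.
print(table.strides)
```

Once the table is in memory, any number of series are just column selections, which is the trade-off being proposed here against repeated file traversal.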

I understand that Pandas is slow, so perhaps there is a faster library that offers data frame filtering and reshaping that you can work with. There are several out there. Thoughts?

@karosc
Member

karosc commented Aug 30, 2021

Maybe I'm not understanding you, but I am not sure about slurping an entire outfile. Some models have loads of objects and a small reporting step; I've seen .out files >10GB in size, and reading one of those into memory could be a resource hungry strategy. I understand I can adjust the SWMM simulation to keep the outfile size down (limiting the reporting timestep and reported elements), but sometimes I'm not the one who ran the model, and re-running it might take hours. I would want to be able to control which data I load into memory (i.e. which model elements and attributes).

As for premature optimization, I've not yet come into that realm of software development. Nevertheless, I'm not sure why this would fall under that umbrella since my main goal is to provide a more feature rich C-API so the python code can be more concise and readable.

I think you are right though, the C code traverses the outfile from the beginning each time it wants to pull a value. Maybe we can optimize a routine to only traverse from the beginning once, pulling requested data along the way. Perhaps this thought falls under the premature optimization umbrella though.
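
A single-pass routine of that kind might look like the sketch below. It uses only the standard library and a simplified synthetic layout (fixed-size records of little-endian float32 values, no header); `read_selected` is a hypothetical helper, not part of the output API:

```python
import io
import struct


def read_selected(stream, n_values, wanted):
    """Scan fixed-size period records once, keeping only the requested value positions."""
    record = struct.Struct(f"<{n_values}f")
    series = {index: [] for index in wanted}
    while True:
        chunk = stream.read(record.size)
        if len(chunk) < record.size:
            break  # end of file (or a truncated final record)
        values = record.unpack(chunk)
        for index in wanted:
            series[index].append(values[index])
    return series


# Synthetic "file": 4 periods x 3 values, value = period * 10 + position.
buffer = io.BytesIO()
for period in range(4):
    buffer.write(struct.pack("<3f", *(float(period * 10 + i) for i in range(3))))
buffer.seek(0)

result = read_selected(buffer, n_values=3, wanted=[0, 2])
print(result)
```

Only one record is held in memory at a time, so memory use scales with the number of requested series rather than with the size of the file.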

@michaeltryby
Contributor Author

michaeltryby commented Aug 30, 2021

Many computers have > 32GB ram and can easily fit the entire contents of an output file in memory.

If your goal is a clean Python API wouldn’t it be easier to write a function that takes a list of nodes, calls output.get_node_series() for each, builds up the numpy array, then returns it? Why mess with C and SWIG at all?

@karosc
Member

karosc commented Aug 30, 2021

I didn't mean to imply that computers today wouldn't be able to handle loading that amount of data into memory, only that it could be needlessly resource hungry.

In my opinion, the less data "wrangling" that needs to be written in python, the cleaner the code will appear. Sure it would be easier to just write the whole thing in python, but I got the sense from your second post in this thread that extending the C API is desirable; and wouldn't using C naturally provide some amount of speed benefit? Maybe not though? I'm not sure.

@michaeltryby
Contributor Author

michaeltryby commented Aug 30, 2021

Actually what I consider desirable is minimizing the amount of data wrangling code written in both Python and C.

I think you may have misunderstood my second post in this thread.

To be sure C does provide a speed benefit. Python is a lot easier to work in though.

@karosc
Member

karosc commented Aug 30, 2021

Alrighty, if we want to avoid touching the C-code, then we could achieve what I want with code similar to this.

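import numpy as np
import pandas as pd
from swmm.toolkit import output, shared_enum

pth = 'outfile.out'
_handle = output.init()
output.open(_handle, pth)

# number of reporting periods stored in the output file
periods = output.get_times(_handle, shared_enum.Time.NUM_PERIODS)
nodes = [1, 2]  # indices of the nodes to pull


# pull wide array: one column per node, one row per reporting period
df_wide = pd.DataFrame(
    np.stack(
        [
            output.get_node_series(
                _handle,
                node,
                shared_enum.NodeAttribute.INVERT_DEPTH,
                0,
                periods,
            )
            for node in nodes
        ],
        axis=1,
    ),
    columns=nodes,
)

# pull long array: all series stacked end to end in a single "value" column
df_long = pd.DataFrame(
    np.hstack(
        [
            output.get_node_series(
                _handle,
                node,
                shared_enum.NodeAttribute.INVERT_DEPTH,
                0,
                periods,
            )
            for node in nodes
        ]
    ),
    columns=["value"],
)
# label every value with the node it belongs to
df_long['node'] = np.repeat(nodes, periods)

# release the output file handle when done
output.close(_handle)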
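
The same pattern can be wrapped in a small helper so callers pass a list of element indices and get a frame back. A sketch under assumptions: `series_frame` and its `get_series` argument are hypothetical names, and a stand-in generator replaces the real reader so the example runs without an output file. In practice `get_series` could be a thin wrapper around `output.get_node_series` with the handle and attribute filled in:

```python
import numpy as np
import pandas as pd


def series_frame(get_series, elements, start, end, wide=True):
    """Collect one series per element and return a wide or long DataFrame."""
    columns = [np.asarray(get_series(element, start, end)) for element in elements]
    if wide:
        # One column per element, one row per period.
        return pd.DataFrame(np.stack(columns, axis=1), columns=list(elements))
    # One row per (element, period) pair.
    n_periods = end - start
    return pd.DataFrame(
        {
            "node": np.repeat(list(elements), n_periods),
            "period": np.tile(np.arange(start, end), len(elements)),
            "value": np.concatenate(columns),
        }
    )


def fake_series(element, start, end):
    # Stand-in for a real reader: value = element * 100 + period.
    return [element * 100 + period for period in range(start, end)]


wide_frame = series_frame(fake_series, [1, 2], 0, 3)
long_frame = series_frame(fake_series, [1, 2], 0, 3, wide=False)
print(wide_frame)
print(long_frame)
```

Passing the reader in as a callable keeps the reshaping logic independent of the file format, so the same helper could serve nodes, links, or subcatchments.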

@karosc
Member

karosc commented Sep 20, 2021

In developing swmm-pandas, I've come across a couple things I'd like to raise:

  1. It appears the function SMO_getSystemAttribute might be broken in the C code. I consistently get segmentation fault errors when trying to run it, and it appears @jennwuu may have already noticed this when writing her code. Is the team aware of this? Do we need a pull request, or is there already an effort underway?
  2. In SWMM 5.1.010, PET was added as a system output variable. However, the output api enum has this value commented out and the swmm-python shared_enum excludes it entirely. Is there any reason why it's missing? Could we possibly add this in? I only ask because the swmm output api SMO_getSystemResult method returns 15 values, but the shared_enum (which I am using to build a pandas index) only has 14 values.
  3. I am building package documentation using sphinx and am modeling it after the pandas docs, using the pydata theme and sphinx autosummary layout. It produces what I think is a clean look for an API reference.
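
Until the enum question in point 2 is settled, one way to build labels without an index-length mismatch is to fall back to a placeholder name for any position the enum does not cover. A minimal sketch; the three-member enum is a stand-in rather than the real shared_enum definition, and `label_results` is a hypothetical helper:

```python
from enum import IntEnum


class SystemAttribute(IntEnum):
    # Stand-in with three members only; not the real shared_enum definition.
    AIR_TEMP = 0
    RAINFALL = 1
    SNOW_DEPTH = 2


def label_results(values, enum_cls):
    """Pair each value with its enum name, using a placeholder for unknown positions."""
    names = {member.value: member.name for member in enum_cls}
    return {names.get(i, f"UNKNOWN_{i}"): value for i, value in enumerate(values)}


# One more value than the enum knows about, mirroring the 15-versus-14 case.
labelled = label_results([70.0, 0.25, 0.0, 0.1], SystemAttribute)
print(labelled)
```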

Cheers!

@jennwuu
Contributor

jennwuu commented Sep 22, 2021

Hi @karosc,

Regarding point 1: yes, I ran into segmentation fault errors when trying to add this function in pyswmm. I forgot to open an issue for this bug. I will do it now. Thanks for reminding me.

@abhiramm7
Collaborator

@jennwuu @michaeltryby @karosc Closing this issue as it seems to have been resolved. If you all need this for something, please feel free to reopen it.

5 participants