Collections with many items saving time issue #1207
My timings on pystac (master branch) are as follows:
It's weird that the times look reasonable (nearly linear) for my tests, but not for you. The differences are a different machine, installing the validation requirements, and installing from PyPI, while I'm on master.
Strange, I was getting the same slow saving on the GitHub Actions workflow (which uses ubuntu-latest). My machine is also Ubuntu; what OS are you on?
I'm running Ubuntu 22.04.2 LTS through WSL. I also just ran pystac 1.8.3 installed from PyPI with validation requirements and the timings are similar to the ones I reported above. Weird...
Thanks for the report, and thanks @m-mohr for also taking a look. My timings (macOS) working from main w/ Python 3.11:
So I'm seeing performance similar to @santilland's. I will note that there are some known issues around serializing a
Weird. We were just thinking maybe it's an issue with memory consumption / swapping? My timings are from Python 3.10.6.
Yeah, I'm seeing my run get CPU bound, but not memory bound. 🤔
I think @m-mohr commented on a possible swap issue because my machine in general is close to the edge already, but I did not see a notable increase in memory use. One core is always at 100% while saving, though.
Yeah, maybe it's something else. I'm on WSL and not on native Ubuntu, so maybe some kind of IO thing? Or something completely different :-) I guess profiling on an affected machine may give some insights...
I've started profiling this and will attempt to take a crack at it, assuming no one else is too deep in it yet. Echoing what @santilland said above, I'm seeing very little memory pressure and a pretty engaged CPU. Here's what I've found so far via profiling, and some thoughts about performance that I'm curious about now:
Sounds great @moradology, thanks for digging in. I agree generally w/ your assessment that caching Feels Bad™. FYI
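To make the caching idea concrete, here's a rough sketch of what memoizing `get_single_link`-style lookups could look like. The `LinkHolder` class and `_link_cache` attribute are invented for illustration; none of this is pystac's actual API:

```python
# Hypothetical sketch only: a rel-keyed cache in front of a linear link
# scan, with invalidation on any mutation of the link list.
from typing import Optional


class Link:
    def __init__(self, rel: str, href: str):
        self.rel = rel
        self.href = href


class LinkHolder:
    def __init__(self):
        self.links: list[Link] = []
        self._link_cache: dict[str, Optional[Link]] = {}

    def add_link(self, link: Link) -> None:
        self.links.append(link)
        # Any mutation of the link list must invalidate the cache,
        # otherwise lookups can return stale results.
        self._link_cache.clear()

    def get_single_link(self, rel: str) -> Optional[Link]:
        # Fall back to the linear scan only on a cache miss.
        if rel not in self._link_cache:
            self._link_cache[rel] = next(
                (link for link in self.links if link.rel == rel), None
            )
        return self._link_cache[rel]
```

The cost is the invalidation bookkeeping: every code path that mutates `links` has to remember to clear the cache, which is exactly the part that tends to feel bad.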
If we have a machine that can reproduce the dramatic increase in runtimes here, I'd be interested to see how `__slots__` fares. For now, this branch should work for testing slots behavior: https://github.com/moradology/pystac/tree/feature/use-slots

Here's some code I've been using to check things out:

```python
import argparse
import cProfile
import pstats
import time
from datetime import datetime, timedelta

from pystac import (
    Catalog,
    CatalogType,
    Collection,
    Extent,
    Item,
    SpatialExtent,
    TemporalExtent,
)
from pystac.layout import TemplateLayoutStrategy


def parse_args():
    parser = argparse.ArgumentParser(description="STAC Catalog performance test.")
    parser.add_argument(
        "--numdays",
        type=int,
        required=True,
        help="Number of days to generate items for.",
    )
    return parser.parse_args()


def build_items(numdays):
    """Builds items for the catalog."""
    base = datetime.today()
    times = [base - timedelta(days=x) for x in range(numdays)]
    items = [
        Item(
            id=t.isoformat(),
            bbox=[-180.0, -90.0, 180.0, 90.0],
            properties={},
            geometry=None,
            datetime=t,
        )
        for t in times
    ]
    return items


def run_test(numdays):
    items = build_items(numdays)
    catalog = Catalog(
        id="test",
        description="catalog to test performance",
        title="performance test catalog",
        catalog_type=CatalogType.RELATIVE_PUBLISHED,
    )
    spatial_extent = SpatialExtent([[-180.0, -90.0, 180.0, 90.0]])
    # Placeholder extent; it gets replaced by update_extent_from_items below.
    temporal_extent = TemporalExtent([[datetime.now(), None]])
    extent = Extent(spatial=spatial_extent, temporal=temporal_extent)
    collection = Collection(
        id="big_collection",
        title="collection for items",
        description="some desc",
        extent=extent,
    )
    for item in items:
        collection.add_item(item)
    collection.update_extent_from_items()
    catalog.add_child(collection)
    strategy = TemplateLayoutStrategy(item_template="${collection}/${year}")
    catalog.normalize_hrefs("https://exampleurl.com/", strategy=strategy)

    # Profile the catalog save operation
    pr = cProfile.Profile()
    pr.enable()
    start_time = time.perf_counter()
    catalog.save(dest_href="../test_build/")
    end_time = time.perf_counter()
    pr.disable()

    save_time = end_time - start_time
    print(f"Saving Time with {numdays} items: {save_time:.6f} seconds")
    return pr


def main():
    args = parse_args()
    numdays = args.numdays
    profiler = run_test(numdays)

    stats = pstats.Stats(profiler)
    stats.strip_dirs()

    print("\n=================================")
    stats.print_callers("get_self_href")
    stats.print_callees("get_self_href")

    print("\n=================================")
    stats.print_callers("get_single_link")
    stats.print_callees("get_single_link")

    print("\n=================================")
    print("default stats output")
    print("=================================")
    stats.sort_stats("cumulative")
    stats.print_stats()


if __name__ == "__main__":
    main()
```

And here's what the outputs ought to look like:
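As a quick aside on what the `feature/use-slots` branch is exercising, here's a minimal sketch of the `__slots__` idea. The `PlainLink` and `SlottedLink` classes are illustrative stand-ins, not pystac classes:

```python
# Minimal illustration of __slots__: slotted instances have no per-object
# __dict__, which cuts memory per instance and makes attribute access a
# bit cheaper; relevant when a catalog holds many thousands of Link-like
# objects.
class PlainLink:
    def __init__(self, rel: str, href: str):
        self.rel = rel
        self.href = href


class SlottedLink:
    __slots__ = ("rel", "href")

    def __init__(self, rel: str, href: str):
        self.rel = rel
        self.href = href


plain = PlainLink("self", "catalog.json")
slotted = SlottedLink("self", "catalog.json")
print(hasattr(plain, "__dict__"))    # True
print(hasattr(slotted, "__dict__"))  # False: no dict overhead per instance
```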
One thing to note is that we're iterating through links very regularly (which happens here: https://github.com/stac-utils/pystac/blob/main/pystac/stac_object.py#L177-L206). It isn't entirely clear that this is the best possible move, as it means iterating through the whole list and doing a bunch of comparison logic for every lookup.
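To make the trade-off concrete, here's a toy comparison of that linear scan against a rel-keyed index. The dict-based index is a hypothetical alternative, not how pystac stores links today:

```python
# Toy comparison: O(n) scan per lookup vs. O(1) dict lookup by rel.
links = [{"rel": f"item-{i}", "href": f"{i}.json"} for i in range(10_000)]
links.append({"rel": "self", "href": "collection.json"})


# Current-style linear scan: every get_single_link-style call walks the list.
def scan(rel):
    return next((link for link in links if link["rel"] == rel), None)


# Alternative: build a rel-keyed index once, then look up in constant time.
# (Multiple links can share a rel, so a real index maps rel -> list.)
index = {}
for link in links:
    index.setdefault(link["rel"], []).append(link)

assert scan("self") is index["self"][0]
```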
I don't hate this idea.
This feels ok-ish. My initial worry is around unintended mutation: as a rule, pystac isn't too careful about what gets changed when, so if we just work on pointers I'm a little worried we might twiddle bits of a STAC tree without intending to. But again, I'd be open to looking at an implementation.
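The shared-pointer worry is easy to demonstrate in miniature with plain Python lists:

```python
# Two "objects" sharing one list of links: mutating through one alias
# silently changes the other, which is the kind of accidental STAC-tree
# twiddling described above.
links = [{"rel": "self", "href": "a.json"}]
alias = links                 # a pointer, not a copy
alias.append({"rel": "parent", "href": "catalog.json"})
print(len(links))             # 2: the original changed too
safe = list(links)            # a shallow copy avoids this particular foot-gun
```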
Carrying over some results from the discussion. These are the top offenders in terms of cumulative time for smallish (1,000 item) catalog saves:
Here are the top offenders with 100x more (100,000) items:
Using pystac[validation] 1.8.3
I am creating collections with a large number of items and was surprised by the time it took to save them. I have been doing some very preliminary tests, and it somehow seems that the save time increases exponentially with the number of items in a collection.
For example, saving a catalog with one collection takes, depending on item count:
If I create 5 collections with 2,000 items each, the saving time is 25s. So the same number of items is being saved in total, but it takes a quarter of the time when they are separated into multiple collections.
Any ideas why this could be happening?
Here is a very rough testing script:
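The script itself didn't survive the copy here; a minimal sketch along the lines described above (item counts, ids, and hrefs are illustrative guesses, not the original script) might look like:

```python
import time
from datetime import datetime, timedelta

from pystac import (
    Catalog,
    CatalogType,
    Collection,
    Extent,
    Item,
    SpatialExtent,
    TemporalExtent,
)


def make_collection(n_items: int) -> Collection:
    """Build a single collection holding n_items trivial items."""
    extent = Extent(
        SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        TemporalExtent([[datetime.now(), None]]),
    )
    collection = Collection(id="col", description="timing test", extent=extent)
    base = datetime.now()
    for i in range(n_items):
        collection.add_item(
            Item(
                id=f"item-{i}",
                geometry=None,
                bbox=None,
                datetime=base - timedelta(days=i),
                properties={},
            )
        )
    return collection


# Time catalog.save() at a few sizes to see how it scales with item count.
for n in (500, 1000, 2000):
    catalog = Catalog(
        id="cat",
        description="timing catalog",
        catalog_type=CatalogType.SELF_CONTAINED,
    )
    catalog.add_child(make_collection(n))
    catalog.normalize_hrefs(f"./test_build_{n}")
    t0 = time.perf_counter()
    catalog.save()
    print(f"{n} items: {time.perf_counter() - t0:.2f}s")
```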