
Collections endpoint #386

Open
wants to merge 15 commits into base: develop

Conversation

jvita
Member

@jvita jvita commented Sep 22, 2021

Starting point for discussing a general "Collections" entry

optimade.rst Outdated
- The keys should be short strings describing the type of metadata being supplied.
- The values can be any string, which may be human-readable.

aggregated_fields
Member


Maybe properties instead of fields, or am I missing something?

Member


I think I need an example to see what this field is used for, this just aggregates field/property names but not values? Does every entry have to have a value of each field listed here?

Member Author

@jvita jvita Sep 24, 2021


I had only said "fields" (or "properties" seems fine too) instead of also including values because:

  1. I wasn't sure what the best way would be to specify the reduction operations over the values (for example, sometimes you might want to sum the values, other times you might want to build a set from a list of values, etc.)
  2. I wasn't sure when the values should be reduced. Should the reduction occur before the collection is uploaded, meaning the reduced values wouldn't change even if the linked entries were edited? This wouldn't seem ideal, but I also don't know if it's acceptable (or possible) to specify that the reduction would be performed every time the collection was accessed.

A basic example of this, which we've been using for the OpenKIM/ColabFit project, is to have a "StructuresCollection" that aggregates all of the attributes.elements fields of the linked structures to provide a single set of elements present in the collection. Something like structure1.attributes.elements = ['C', 'Fe'], structure2.attributes.elements = ['Al'], collection.attributes.elements = ['Al', 'C', 'Fe']. Another simple example would be to aggregate attributes.nsites to count the total number of sites in the collection.
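The two aggregations just described (a set union over `attributes.elements`, a sum over `attributes.nsites`) can be sketched as follows. The entry layout here is illustrative only, not the exact OPTIMADE schema:

```python
# Sketch of set-union and sum aggregations over a collection's entries.
# The entry dicts below are simplified stand-ins for OPTIMADE entries.
def aggregate_elements(entries):
    """Union of the per-entry element lists, returned sorted."""
    found = set()
    for entry in entries:
        found.update(entry["attributes"]["elements"])
    return sorted(found)

def aggregate_nsites(entries):
    """Total number of sites across the collection."""
    return sum(entry["attributes"]["nsites"] for entry in entries)

entries = [
    {"attributes": {"elements": ["C", "Fe"], "nsites": 2}},
    {"attributes": {"elements": ["Al"], "nsites": 1}},
]
print(aggregate_elements(entries))  # ['Al', 'C', 'Fe']
print(aggregate_nsites(entries))    # 3
```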

Does every entry have to have a value of each field listed here?

Though it's a bit restrictive, I think that I'd say yes, every entry should have a value for each of the aggregated fields. I think that a collection should be assumed to be homogeneous, but perhaps that could use some discussion.

Contributor


I think you should update the collection every time one of the entries gets updated.
If entries get updated regularly, you could place relationships in each entry pointing to the collections they belong to. If this rarely happens, you could probably query all the collections to check whether a particular entry is in each collection and then update it. We don't have to specify how to update the data belonging to the collections in the specification, though, only that it SHOULD be updated.
In some cases it could, however, be worthwhile to create a new collection entry rather than update the existing one, for example when you want a collection you refer to in an article to stay the same.

Perhaps you could make a dictionary for each OPTIMADE property, which could, depending on the property, hold a set of values or the minimum, average, and maximum value in the collection.

When making the properties for these collections I think it would be good to think about how you would search for collections.

The number of entries in your collection would probably also be a good property to include.

There is also the info endpoint, where you can specify which properties are shared for each endpoint.
For collections, it would be /info/collections. You therefore do not have to specify which properties are available for the collections. (I do have a field like that in the trajectories endpoint, because in that case the fields do not need to be queryable.) If they are queryable, you could use the IS KNOWN query to check whether an entry has the particular field.

Contributor


I have difficulty seeing the utility of aggregated_fields being an OPTIMADE standardized property. There is of course no problem for OpenKIM serving, e.g., an _openkim_aggregated_elements that aggregates the values of the elements, etc.; but it just seems the definition means this field anyway needs to be interpreted differently depending on which database is being queried.

Member

@ml-evs ml-evs left a comment


Thanks for kicking this off @jvita! I invited you to the Materials-Consortia GitHub organization - lmk if it didn't work.

My comments here echo those in #360 - it would be good to have a general discussion about the preferred way of dealing with relationships to more complicated entry types (like files and collections). I guess the overall aim here is to provide a simple container for any entry types that averts the need to provide all-to-all relationships between a set of related entries, so it might be nice to mock up some example responses that showcase this.

optimade.rst Outdated
Comment on lines 199 to 200
**Collection**
A Collection defines a relationship between a group of Entry resources. A Collection can be used to store metadata that applies to all of the entries in the group, and to aggregate metadata from each entry in the group.
Member


I'm not sure I would define collections here, as the list is primarily external definitions of terminology that we are going to use in OPTIMADE, not terminology we are really defining ourselves (e.g. structure is not in this list).

optimade.rst Outdated
Comment on lines 2339 to 2341
- **Query**: support for queries on this property is OPTIONAL. If supported, only a subset of the filter features MAY be supported.
- The keys should be short strings describing the type of metadata being supplied.
- The values can be any string, which may be human-readable.
Member


I would prefer to go down the route of some defined (optional) fields for e.g. description, name, then let provider-specific fields do the rest of the work (which can then be described in /info/collections), e.g. _exmpl_dft_parameters if the collection defines a consistent set of DFT calculations.


@JPBergsma JPBergsma self-requested a review September 27, 2021 09:59
@JPBergsma
Contributor

Thank you for starting this PR!
I always find it helpful to have an example. It would be nice if you could add one if you have the time.

Contributor

@JPBergsma JPBergsma left a comment


Sorry for giving you this feedback so late. I did not realize my comments were still pending.


@jvita
Member Author

jvita commented Oct 14, 2021

Sorry I haven't responded to this in a while; I've been a bit overwhelmed with other projects and haven't found the time to work on this PR.

If entries get updated regularly, you could place relationships in each entry pointing to the collections they belong to. If this
rarely happens, you could probably query all the collections to check whether a particular entry is in that collection and then update it. We don't have to specify how to update the data belonging to the collections in the specification though, only that it SHOULD be updated.
In some cases it could however be worth while to create a new structure rather than to update the existing one. For example, when you want a collection you refer to in an article to stay the same.

I've been struggling with deciding what the best approach to this would be. It seems like it would be difficult and/or expensive to make sure to update everything whenever an entry is changed, especially in the case where you have collections of collections. An alternative could be to not automatically update a collection when one of its entries changes, but to instead provide some kind of functionality to "re-synchronize" the collection to aggregate any desired properties up to the collection. This option would mean the collection isn't guaranteed to be up-to-date unless it's re-synchronized regularly, but it would also avoid running the potentially expensive task of updating everything on every change.

I always find it helpful to have an example. It would be nice if you could add one if you have the time.

I'll try to add an example soon. We (the OpenKIM team) are still working on ironing out our data structures, so I've been holding off on providing an example until I have one that would be representative of our use case.

@JPBergsma
Contributor

I've been struggling with deciding what the best approach to this would be. It seems like it would be difficult and/or expensive to make sure to update everything whenever an entry is changed, especially in the case where you have collections of collections. An alternative could be to not automatically update a collection when one of its entries changes, but to instead provide some kind of functionality to "re-synchronize" the collection to aggregate any desired properties up to the collection. This option would mean the collection isn't guaranteed to be up-to-date unless it's re-synchronized regularly, but it would also avoid running the potentially expensive task of updating everything on every change.

I think that it is most important that the database as presented to the outside world is consistent.
That a new entry appears immediately is less important.

So I think you could create a backlog of all the changes that need to be made to the database and execute them at an appropriate time, for example when the server load is low. This would allow you to accumulate multiple changes, so you would only have to update a collection once.
At that moment you could create an updated collection entry and then write both the updated collection and the updated entries belonging to it to the database, thereby minimizing the time the database is in an inconsistent state.

I am not sure whether it will be that much work to update a collection though. In most cases you will not need to access the other entries in a collection.
For example: If the property in the collection is an average, you can update it without needing to access the other entries as long as you know the old value and the number of entries in the collection.
Adding an entry should also be possible without needing to access any of the other entries in the collection.
Removing an entry would be more work, as you have to check whether any of the properties in a list (for example, which elements occur in any of the structures in the collection) also occur in another entry. But in many cases these values will occur in another entry, and you can stop as soon as you have found them.

Therefore, I am not sure if it is necessary to implement a log of all the changes that need to be executed.
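The constant-time updates described above (adjusting an average without visiting the other entries) can be written down concretely. This is a sketch under the assumption that the collection stores the current average and the entry count:

```python
def add_to_average(avg, n, x):
    """Update a stored average when one new entry with value x joins a
    collection of n entries, without touching the other entries."""
    return (avg * n + x) / (n + 1)

def update_in_average(avg, n, old_x, new_x):
    """Update the stored average of n entries when one entry's value
    changes from old_x to new_x, again without visiting the others."""
    return avg + (new_x - old_x) / n

# Collection of values [1, 2, 3]: average 2.0, n = 3.
print(add_to_average(2.0, 3, 6.0))        # 3.0  (values [1, 2, 3, 6])
print(update_in_average(2.0, 3, 3.0, 6.0))  # 3.0  (values [1, 2, 6])
```

Removing an entry from such a running average is equally cheap; only set-valued aggregations (like the element sets) may require looking at other entries on removal, as noted above.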

@jvita
Member Author

jvita commented Nov 5, 2021

I think that it is most important that the database as presented to the outside world is consistent.
That a new entry appears immediately is less important.

So I think you could create a backlog with all the changes that need to be made to the database and execute these at an appropriate time. For example when the server load is low. This would allow you to accumulate multiple changes, so you would only have to update a collection once.

I like your idea of creating a backlog of pending changes that can be applied at chosen moments rather than automatically. We have implemented something similar in our software package, where our collections have a boolean flag that tracks if any of their entries have been changed, that way they can apply those changes before doing any critical operations (like saving to a file, or providing an aggregated statistic).
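A minimal sketch of such a dirty-flag collection follows; all names here are hypothetical, not taken from any actual OpenKIM/ColabFit code:

```python
class Collection:
    """Collection that defers aggregation until it is actually needed."""

    def __init__(self, entries):
        self.entries = list(entries)
        self._aggregated = None
        self._stale = True          # the "changed" flag

    def mark_changed(self):
        """Called whenever a member entry is edited, added, or removed."""
        self._stale = True

    def aggregated_fields(self):
        """Re-synchronize lazily: recompute only if something changed."""
        if self._stale:
            self._aggregated = {
                "x": sorted({e["x"] for e in self.entries}),
                "y": sorted({e["y"] for e in self.entries}),
            }
            self._stale = False
        return self._aggregated

c = Collection([{"x": 0, "y": 0}, {"x": 0, "y": 1}, {"x": 0, "y": 2}])
print(c.aggregated_fields())  # {'x': [0], 'y': [0, 1, 2]}
```

The same pattern extends to any "critical operation": check the flag, re-aggregate if needed, then save or serve the result.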

I am not sure whether it will be that much work to update a collection though. In most cases you will not need to access the other entries in a collection.

I agree with you that updating a single collection with multiple changes might not be that bad, but what I'm also worried about is having to update many collections that all share a common entry (or, equivalently, nested collections). In this case you would need to propagate the changes to all collections containing the changed entry, then to all collections containing those collections, and so on; for a single changed entry, the number of collections that need to be updated may grow exponentially, and keeping the database synchronized could become a computationally demanding task.

Here is a basic example that I think will be useful for these discussions. Perhaps we can edit this example as we go:

# A basic collection of (x, y) points
collection_1 = {
	'A': {'x': 0, 'y': 0},  # A, B, and C are low-level entries
	'B': {'x': 0, 'y': 1},
	'C': {'x': 0, 'y': 2},
	'aggregated_fields': {   # these are the fields that have been chosen to be aggregated
		'x': [0],
		'y': [0, 1, 2]
	}
}

# A second collection
collection_2 = {
	'D': {'x': 0, 'y': 3},
	'E': {'x': 1, 'y': 3},
	'F': {'x': 2, 'y': 3},
	'aggregated_fields':  {
		'x': [0, 1, 2],
		'y': [3],
	}
}

# And now, a collection of collections
collection_3 = {
	'collection_1': <pointer_to_collection_1>,
	'collection_2': <pointer_to_collection_2>,
	'aggregated_fields': {
		'x': [0, 1, 2],  # "double aggregation"
		'y': [0, 1, 2, 3],
	}
}

# When a user queries for collections with 'aggregated_fields.x',
# it's important that the database not have any backlogged changes,
# that way the correct results are returned by the query

Here is a second example if collections are allowed to be non-homogeneous (i.e., not all entries are required to have the same fields). Again, I think we should discuss if we want this to be allowed:

# A basic collection of (x, y) points
collection_1 = {
	'A': {'x': 0, 'y': 0},  # A, B, and C are low-level entries
	'B': {'x': 0, 'y': 1},
	'C': {'x': 0, 'y': 2},
	'aggregated_fields': {   # these are the fields that have been chosen to be aggregated
		'x': [0],
		'y': [0, 1, 2]
	}
}

# A collection of (x, y, z) points
collection_2 = {
	'D': {'x': 1, 'y': 3, 'z': 0},
	'E': {'x': 1, 'y': 3, 'z': 1},
	'F': {'x': 1, 'y': 3, 'z': 2},
	'aggregated_fields':  {  # not all fields are being aggregated
		'x': [1],
		'z': [0, 1, 2],
	}
}

# And now, a collection of collections
collection_3 = {
	'collection_1': <pointer_to_collection_1>,
	'collection_2': <pointer_to_collection_2>,
	'aggregated_fields': {
		'x': [0, 1],  # "double aggregation"
		'y': [0, 1, 2],  # note that 'y' wasn't aggregated in collection_2, so we don't get the value '3'
		'z': [0, 1, 2],
	}
}

@JPBergsma
Contributor

JPBergsma commented Nov 8, 2021

I like your idea of creating a backlog of pending changes that can be applied at chosen moments rather than automatically. We have implemented something similar in our software package, where our collections have a boolean flag that tracks if any of their entries have been changed, that way they can apply those changes before doing any critical operations (like saving to a file, or providing an aggregated statistic).

Another option could be to have “two” databases. Each time you want to do an update on the database, you can perform the update on a copy. Once you finish updating, you switch the database you serve. If you include a version number in the pagination link, you can inform the clients that the database has been updated and the result they get may not be consistent.

I agree with you that updating a single collection with multiple changes might not be that bad, but what I'm also worried about is having to update many collections that all share a common entry (or equivalently, nested collections). In this case you would need to propagate the changes to all collections containing the changed entry, and to all collections containing those collections, and so on and so forth... in this way, for a single changed entry the number of collections that need to be updated may be growing exponentially and it could begin to be a computationally demanding task to keep the database synchronized.

It depends on what kind of aggregated properties you have, but for the aggregated properties I can think of, you would not need to visit the members of other collections when you update one of the collections in a super-collection. So unless you have more than 10,000 or so collections in which the changed entry occurs, I do not think it would be a problem to update one entry and all the collections it is in.

collection_1 = {
	'A': {'x': 0, 'y': 0},  # A, B, and C are low-level entries
	'B': {'x': 0, 'y': 1},
	'C': {'x': 0, 'y': 2},
	'aggregated_fields': {   # these are the fields that have been chosen to be aggregated
		'x': [0],
		'y': [0, 1, 2]
	}
}

In the example above, you have placed the data ('A': {'x': 0, 'y': 0}) inside your collection. This looks a bit strange, as normally your collection would contain a reference to the structure.

I would discourage not aggregating all the fields, as you do in the second example, because the aggregated fields are then inconsistent: a user could look for a collection with 'y': 3 but would not find it.

Contributor

@rartino rartino left a comment


I think the collection proposal is sound. I've thumbed up a few comments by others which I agree with (and it would be great if @jvita can address them in jvita:collections_endpoint).

One thing I also want to discuss is, in addition to the proposal to remove additional_metadata and aggregated_fields from the PR, would it make sense to add a category string property to make it easier to think of different types of collections? We're probably going to end up with collections describing very different things, e.g., sets of structures that are part of one investigation, sets of structures that are brought together because they are structurally similar, sets of calculations done as part of a project, etc. However, I realize that I can achieve the same thing by making collections of collections - but is that ok, or does it get too abstract? And how does a user find all the "collection types" there are to ask for?

(One idea is to let the standard propose, at least on SHOULD level, a specialized collection with a specific title of collections of collections that the database wants users to be aware of. This is getting quite abstract, though.)


@ml-evs ml-evs added the type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus. label Jun 1, 2022
@rartino
Contributor

rartino commented Jun 17, 2022

@jvita

Thanks again for the work with this. I'm seeing some need of collections in my work and would like this PR to be added. I see that the PR has been idling for a while (and I very much understand being busy with other things). Do you plan to address the concerns eventually, or would you be ok with me pushing commits to address the outstanding issues?

@jvita
Member Author

jvita commented Jun 17, 2022

Hi @rartino; please feel free to make any changes that you wish -- I'm not sure that I'll be able to dedicate much time to this right now, unfortunately.

Thanks!

@ml-evs
Member

ml-evs commented Jun 18, 2022

As this PR was made from @jvita's fork, perhaps we should remake it as a branch that others have access to and address the initial comments above? I'm happy to make the PR but can't really commit to owning the PR overall right now (maybe in a few weeks).

@rartino
Contributor

rartino commented Jun 21, 2022

@ml-evs

No need to remake the branch - if someone forks a GitHub repo and then creates a PR against it from a branch, the default is to grant the owners of the original repo push access to that branch for precisely this reason.

I've pushed the edits I wanted to do. Please review.

@ml-evs
Member

ml-evs commented Jun 21, 2022

@ml-evs

No need to remake the branch - if someone forks a GitHub repo and then creates a PR against it from a branch, the default is to grant the owners of the original repo push access to that branch for precisely this reason.

I've pushed the edits I wanted to do. Please review.

Great! I must have been misremembering why we had to add collaborator access on forks outside of the org in the past, guess it was probably before the PR had been made.

Member

@merkys merkys left a comment


I think the (draft) proposal presents a sufficiently low-level method to represent groups of OPTIMADE entries. I left a few comments which I would like to see answered before merge.

@merkys
Member

merkys commented Jun 30, 2022

@rartino

I had the idea that the contents of a collections should all go in a "contents" relationship, i.e., that the tag directly under relationship should be contents rather than structures, but then I messed up the example.

After seeing your comments and re-reading the specification, I realize that not all output formats need to support the facility to "name" relationships the way that JSON API allows, so we shouldn't rely on that feature (or, we should add that feature to the definition of a relationship). You also remark that the relationship "name" must be the name of the entry type (is that in the specification?).

So, I've re-done this now, so that "any" "normal" "named-as-the-entry-type" relationship defines the set of entries in the collection.

I seemingly misunderstood the whole issue with relationship naming. I also thought JSON API had some requirements for the names (apparently it does not). So now I envisage three suggestions:

  1. Do not mandate the relationship names (this is what the PR is doing now). The advantage of this option is that we do not need to make any JSON API-specific provisions in the specification, but having no recommendation apart from an example may leave implementers of both clients and servers wondering.
  2. Use contents relationship name for JSON API. This would put all related entries to the same list.
  3. Separate relationships per entry type (as in example). Not sure if this is actually beneficial.

@rartino
Contributor

rartino commented Jun 30, 2022

@merkys

I think we first need to clarify this ambiguity of JSON API relationship naming in general...

The most straightforward solution seems to be to edit 5.3.2 Entry Listing JSON Response Schema and, for the definition of the "relationships" field, say: "Relationships MUST be grouped into a single JSON API relationships object for each entry type and placed in the relationships dictionary with that entry type name as key (e.g., structures)."

If we rather want to preserve the JSON API relationship "name" as a means to group related things - possibly of different types - together, this feature of a 'relationship name' needs to be added to the output-format-agnostic description of relationships in 3.6 Relationships. Preferably with some hint in the implementers note about how one can encode these names in more straightforward key-value based output formats (e.g. "type name + '_' + relationship name"?). It is probably also a good idea with a clarification in 5.3.2 that the key in this dictionary field represents the relationship name.

  3. Separate relationships per entry type (as in example). Not sure if this is actually beneficial.

The most straightforward solution above means going with this choice.
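Under that reading, a collection entry's relationships object might look like the following sketch (shown as a Python dict mirroring the JSON; the ids and attributes are hypothetical, not taken from the specification):

```python
# Hypothetical JSON API document for a single collection entry, with one
# relationships object per entry type, keyed by that entry-type name.
collection_response = {
    "data": {
        "type": "collections",
        "id": "42",
        "attributes": {"name": "Example collection"},
        "relationships": {
            "structures": {
                "data": [
                    {"type": "structures", "id": "structure_1"},
                    {"type": "structures", "id": "structure_2"},
                ]
            },
            "references": {
                "data": [{"type": "references", "id": "ref_1"}]
            },
        },
    }
}

# Each relationship key names the entry type of all its members:
rels = collection_response["data"]["relationships"]
assert all(
    link["type"] == entry_type
    for entry_type, rel in rels.items()
    for link in rel["data"]
)
```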

Member

@ml-evs ml-evs left a comment


Some specific comments below, and one general comment here:

It would be good to have examples of the two-step approach of querying collections (following the example of references elsewhere in the spec), e.g., find the collection ID and then do /v1/structures?filter=collections.id = "42".
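The two-step flow could be sketched as follows; the base URL and the name filter are placeholders, and a real client would URL-encode the filter value as shown:

```python
from urllib.parse import quote

BASE = "https://example.com/v1"  # placeholder provider base URL

def collection_query_url(name):
    """Step 1: find candidate collections, here by a hypothetical name field."""
    return f"{BASE}/collections?filter=" + quote(f'name = "{name}"')

def members_query_url(collection_id):
    """Step 2: list the structures belonging to the collection found in step 1."""
    return f"{BASE}/structures?filter=" + quote(f'collections.id = "{collection_id}"')

print(members_query_url("42"))
# https://example.com/v1/structures?filter=collections.id%20%3D%20%2242%22
```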

Re: my comment about included below, I drafted an issue (probably years ago now) about our non-compliance with JSON:API relationships (but never got around to posting it). The Fetching relationships part of the JSON:API spec states that we should also serve relationships at e.g., https://example.com/v1/collections/42/relationships/structures. This would then be a normal entry listing endpoint that could be paginated. For such one-to-many relationships this is probably preferable to using included, which is primarily for condensing many-to-one relationships into a single response (e.g., all structures pointing at the same reference).

This is actually part of a bigger non-compliance, as JSON:API mandates in Fetching resources that you should be able to pick out particular attributes of an entry via e.g., https://example.com/v1/structures/123/elements. This would break our allowed IDs (which can have slashes, as I currently do with odbx/1... hence why I never posted the issue 😅). Looking at my draft (which I am happy to post), my suggestion was to just add a note in the specification that this aspect of JSON:API is not supported, but maybe it is now actually required for collections functionality to work properly...

It's also the case that very few implementations are serving relationships at all at the moment... and support for the include query parameter/included is even lower.

The set of entries that belong to the collection is defined using relationships from the collection to each entry (OPTIMADE relationships are defined in `Relationships`_).
A collection can contain other collections.
Furthermore, implementations are suggested to add database-specific properties for additional metadata they want to store about the collections.
An OPTIMADE response representing a collection with all referenced entries included via the JSON API field :field:`included` (or equivalent in other response formats) can be used as a universal format for storage or transfer of a subset of (or all) data in an OPTIMADE database.
Member


I have some concerns about pagination of included values, in the case of e.g., 100,000 structures in the same collection. Do we need to worry about that? included is only an optional field at the moment.

Contributor

@rartino rartino Jul 6, 2022


@ml-evs

included is mentioned here only as a suggestion of the potential use of a collection as an export format - in which case the whole idea would be to put everything you want to export (e.g., all 100,000 structures) in the same stream. Nothing is said here that indicates mandatory support for included?

I thought we intended for clients to generally just get the list of ids from the relationship and then request entry data by further queries using the endpoint + id format. (Or, for efficiency when there are many, perhaps via filter=id=X OR id=Y OR id=Z OR ...)
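Building that batched filter string could look like this (a sketch; a real client would also need to keep the resulting URL within length limits):

```python
def batched_id_filter(ids):
    """Combine many entry ids into one OPTIMADE filter string of the form
    id="X" OR id="Y" OR id="Z", for fetching several entries in one query."""
    return " OR ".join(f'id="{i}"' for i in ids)

print(batched_id_filter(["X", "Y", "Z"]))  # id="X" OR id="Y" OR id="Z"
```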

Member


> included is mentioned here only as a suggestion of the potential use of a collection as an export format - in which case the whole idea would be to put everything you want to export (e.g., all 100'000 structures) in the same stream. Nothing is said here that indicates mandatory support for included?
>
> I thought we intended for clients to generally just get the list of ids from the relationship and then request entry data by further queries using the endpoint + id format. (Or, for efficiency when there are many, perhaps via filter=id=X OR id=Y OR id=Z OR ...)

I understand that this is the intention, but I am a bit nervous that we have a field that can now grow unboundedly in a response if requested (or even if not, you cannot disable relationships from the response, I don't think?). I guess you could argue the same for a 1 million site structure object but here it feels well within the design that even the list of IDs could be very large.

I think the larger comment now addressed in #420 would be the best mechanism around this (if we are going to support it anyway).

Contributor

> I am a bit nervous that we have a field that can now grow unboundedly in a response if requested (or even if not, you cannot disable relationships from the response, I don't think?)

Are you talking about included or relationships now? Both of them can grow to arbitrary size, although we are talking about extremely large collections before relationships becomes unduly large.

Possibly repeating myself a bit, but to be clear: I don't see a problem with included. Implementations probably should avoid it except as recommended in OPTIMADE for references, unless a client somehow explicitly requests it (which we don't have a standard mechanism for yet). If an implementation decides to include included anyway, while simultaneously having "unboundedly large" relationships, it would be silly to not implement a limit on the number of entries included this way.

The situation is more tricky with huge relationships. I think JSON:API silently is built on the assumption that the list of IDs for all relationships of a resource must be small enough to handle in a single request. Sure, one can use the articles/1/relationships/comments syntax to get something paginated, but how does one know in the first place which JSON:API relationship keys to fetch without first fetching the unboundedly large articles/1?

Hence, I think we have to look at this list of IDs as a single "datum" where our default OPTIMADE JSON:API output format isn't equipped to handle arbitrarily large data. This is echoed by our recommendation for other output formats to simply encode relationships alongside other data.

If we are concerned about this limitation, I don't see any other way to address it than to implement an alternative output format that can handle pagination on individual properties, including the relationships.

Member

> Possibly repeating myself a bit, but to be clear: I don't see a problem with included. Implementations probably should avoid it except as recommended in OPTIMADE for references, unless a client somehow explicitly requests it (which we don't have a standard mechanism for yet). If an implementation decides to include included anyway, while simultaneously having "unboundedly large" relationships, it would be silly to not implement a limit on the number of entries included this way.

Understood

> The situation is more tricky with huge relationships. I think JSON:API silently is built on the assumption that the list of IDs for all relationships of a resource must be small enough to handle in a single request.

I think my concern is that some of the intended use cases for collections might cross this boundary already (would 10,000/100,000 IDs to structures that define a training set break this?) I'm also not sure that relationships can be excluded from the request using response_fields, so you can't even hit /collections to browse/filter them without getting these potentially large responses. I understand that this is already the case with e.g., COD's mythical 100k atom structure, but at least you could choose which fields you wanted to be returned!

> Sure, one can use the articles/1/relationships/comments syntax to get something paginated, but how does one know in the first place which JSON:API relationship keys to fetch without first fetching the unboundedly large articles/1?

I'm leaning towards this being the correct approach. The relationships can be included as a self link to articles/1/relationships/comments rather than as data, which I think solves your problem. Perhaps we could say something like: "It is RECOMMENDED that implementations use self links instead of explicit relationships for collections with a number of entries significantly in excess of the implementation's page limit."

> If we are concerned about this limitation, I don't see any other way to address it than to implement an alternative output format that can handle pagination on individual properties, including the relationships.

Let's see how the discussion in #419 goes...

Contributor

> The relationships can be included as a self link to articles/1/relationships/comments rather than as a data which I think solves your problem. Perhaps we could say something like. "It is RECOMMENDED that implementations use self links instead of explicit relationships for collections with a number of entries in significantly in excess of the implementation's page limit."

You seem to be right - excluding the data key if it contains too many entries, and instead relying on a link that returns a paginated result set of the entries, is a straightforward solution. So, if we try to follow the JSON API examples, I think this link should point to, e.g.: /collections/42/structures, and in reference to the ongoing discussion in #420 this would thus be the first "required" use of third-level URLs.
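A small sketch of the approach discussed here: inline the linkage data for small collections, but fall back to a related link (to be served paginated) when the member list far exceeds the page limit. The third-level URL layout and the threshold are assumptions from this discussion, not settled spec text:

```python
def members_relationship(collection_id, member_ids, page_limit=20):
    """Sketch: build a JSON:API relationship object for a collection's members.

    Small collections get explicit linkage data; large ones get only a
    related link such as /collections/42/structures (hypothetical layout),
    which the server would serve as a paginated result set.
    """
    if len(member_ids) <= page_limit:
        return {"data": [{"type": "structures", "id": i} for i in member_ids]}
    return {"links": {"related": f"/collections/{collection_id}/structures"}}
```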

@rartino
Contributor

rartino commented Jul 6, 2022

I drafted an issue (probably years ago now) about our non-compliance with JSON:API relationships (but never got around to posting it). The Fetching relationships part of the JSON:API spec states that we should also serve relationships at e.g., https://example.com/v1/collections/42/relationships/structures.

Hmm... Reading Section 1 I'd say that strictly speaking, this construct is indeed required to be supported:


"The API specification described in this document builds on top of the JSON API v1.0 specification. In particular, the JSON API specification is assumed to apply wherever it is stricter than what is formulated in this document. Exceptions to this rule are stated explicitly (e.g. non-compliant responses are tolerated if a non-standard response format is explicitly requested)."


This is actually part of a bigger non-compliance, as JSON:API mandates in Fetching resources that you should be able to pick out particular attributes of an entry via e.g., https://example.com/v1/structures/123/elements. This would break our allowed IDs (which can have slashes, as I currently do with odbx/1... hence why I never posted the issue 😅). Looking at my draft (which I am happy to post), my suggestion was to just add a note in the specification that this aspect of JSON:API is not supported, but maybe it is now actually required for collections functionality to work properly...

This is quite something to drop on this poor PR :-), and IMO a very important issue, but with very little to do with the acceptance or not of this PR, given that we already generally allow relationships - it should just be posted as a general issue. (In relation to what I commented above about included: I didn't even reflect over any other way of fetching the resources than by specifically querying them from their respective endpoints from the list of ids; so I wouldn't say this feature is essential for collections.)

Nevertheless, strictly formally, based on the quote above, I think the implication is that we MUST support this. How your implementation works out the separation between your "/"-based IDs and these fields is up to you, but you can certainly do it by, e.g., relying on your ids having a fixed format so you know the first slash is always part of the id. That said, the realization here is that one probably should avoid "/" in IDs... (Very helpful of us to give examples showing "/"-based ids though.)
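The fixed-format idea can be sketched as a tiny path router. The `<prefix>/<number>` ID pattern here is entirely hypothetical (it merely stands in for "an ID format the implementation can recognize"); the point is only that a known format lets the server tell `structures/odbx/1` apart from `structures/123/elements`:

```python
import re

# Hypothetical assumption: slash-containing IDs always look like
# "<lowercase prefix>/<number>", e.g. "odbx/1". Anything after a
# recognized ID is treated as a JSON:API sub-resource name.
ID_PATTERN = re.compile(r"^[a-z]+/\d+$")

def split_path(segments):
    """Split the path segments after an endpoint into (entry_id, subresource)."""
    candidate = "/".join(segments[:2])
    if len(segments) >= 2 and ID_PATTERN.fullmatch(candidate):
        rest = "/".join(segments[2:]) or None
        return candidate, rest
    # Otherwise treat the first segment alone as the ID
    return segments[0], "/".join(segments[1:]) or None
```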

@ml-evs
Member

ml-evs commented Jul 6, 2022

> This is quite something to drop on this poor PR :-), and IMO a very important issue, but with very little to do with the acceptance or not of this PR, given that we already generally allow relationships - it should just be posted as a general issue. (In relation to what I commented above about included: I didn't even reflect over any other way of fetching the resources than by specifically querying them from their respective endpoints from the list of ids; so I wouldn't say this feature is essential for collections.)

Indeed, I'll dig out my draft issue and post it. I think I mentioned it at a meeting once but perhaps I did not explain myself well.

@ml-evs
Member

ml-evs commented Jul 6, 2022

I don't think a discussion of #420 is blocking this PR, if we are happy with the format otherwise.

@rartino
Contributor

rartino commented Jul 7, 2022

> I don't think a discussion of #420 is blocking this PR, if we are happy with the format otherwise.

I agree - but I think this PR is now blocked by #419, so if it may be helpful if you have any thoughts on that...

@ml-evs
Member

ml-evs commented Jul 8, 2022

Just another quick thought that there is some enforced directionality to the collections relationship, in that each member of the collection does not have a relationship back to its collection. If it did, you would be able to do /structures?filter=collections.id = "Si-potential". Do we need to enforce either mono- or bi-directional links, or should we just suggest that each database is consistent in its application (e.g., if one structure has a link to its collection, then all other structures that are part of a collection in that database should also have the same link?)

@ml-evs
Member

ml-evs commented Jul 8, 2022

> Just another quick thought that there is some enforced directionality to the collections relationship, in that each member of the collection does not have a relationship back to its collection. If it did, you would be able to do /structures?filter=collections.id = "Si-potential". Do we need to enforce either mono- or bi-directional links, or should we just suggest that each database is consistent in its application (e.g., if one structure has a link to its collection, then all other structures that are part of a collection in that database should also have the same link?)

The suggested direction of the mono-directional relationships could also be reversed, i.e. instead of collections->relationships->structures, we enforce structures->relationships->collections (and ban the opposite direction). Each relationship then becomes one-to-few (a structure will be in fewer collections than the number of structures per collection) and the overall collection metadata is still accessible at the collections endpoint.

@jvita
Member Author

jvita commented Jul 8, 2022

> Do we need to enforce either mono- or bi-directional links, or should we just suggest that each database is consistent in its application (e.g., if one structure has a link to its collection, then all other structures that are part of a collection in that database should also have the same link?)

Just to throw in my two cents...

In our work with ColabFit we originally started off only requiring mono-directional links (in the collections->relationships->structures direction), but we quickly found that this made it unwieldy/inefficient to do many important queries that would have been simple if we had instead just used bi-directional links. The trade-off is of course that it might be more challenging for data providers to maintain the bi-directional links, especially if the given application requires that changes be propagated along the links whenever content at one end of the link changes.

So to me, the question of whether to use mono-/bi-directional links, and in which direction if you're only doing mono-directional, is just too application-dependent. Each application would want to make a different choice depending on 1) what types of queries they want to support, and 2) what direction information needs to flow if they're doing data aggregation.

My suggestion:

  1. Don't enforce any constraints on the link directions, but add some method of letting users specify which directions are available for their given application
  2. OR just choose one of the MONO-directional links

Personally, option 1 seems better to me. Though I can't think of an example off the top of my head, I could imagine that there might be some application where it's easier for the data provider to maintain one direction than it is to maintain the other, in which case option 2 would be overly restrictive.

@rartino
Contributor

rartino commented Jul 8, 2022

The way I edited the PR, it is meant to say that the direction collection -> members is mandatory. The way I read the relationships part of the OPTIMADE specification, databases are allowed (but not required) to also provide relationships in the other direction.

I don't like the idea of exposing clients to the possibility of databases providing only members -> collection, since it effectively means they have to run one query for every endpoint of the database before they are sure to have all the members of a collection.

With enforced collection -> members you get all members from a single query.

Even if your database deep down encodes collections as member -> collection, I expect you to have implemented some efficient way of collecting all members in a collection (otherwise, what are you doing with your collections?...), so you should be able to answer OPTIMADE queries on this form efficiently.
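The point about serving the mandatory collection -> members direction even when the database stores the opposite can be illustrated by a trivial index inversion. This is only a sketch; a real implementation would maintain such an index incrementally rather than rebuilding it per query:

```python
from collections import defaultdict

def invert_membership(member_to_collections):
    """Invert a member -> collections mapping so that the mandatory
    collection -> members direction can be answered in a single lookup."""
    collection_to_members = defaultdict(list)
    for member, colls in member_to_collections.items():
        for c in colls:
            collection_to_members[c].append(member)
    return dict(collection_to_members)

# Hypothetical internal storage keyed by member
index = invert_membership({"s1": ["c1"], "s2": ["c1", "c2"]})
```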

@rartino
Contributor

rartino commented Dec 19, 2022

@ml-evs I was about to say that with #438 merged this should be good to go. But, I realize your concerns about very large collections aren't resolved yet.

It seems issue #428, created in response of the need to sharing very large sets as part of trajectories in #377, should be relevant here; but I don't see how it can be directly applied to JSON:API relationships.

What do you think: should we re-edit collections to not use JSON:API relationships and instead use a regular attribute named, e.g., ranged_members which uses the mechanisms in #428 to access the contents?

@JPBergsma
Contributor

Using a ranged_members field instead of JSON:API relationships would enable returning all the ids and types of the members of the collection, even when there are more than ~ 100,000 members.

It would also solve the included problem. If we used the normal JSON:API relationships the client could specify in the include field that the structures belonging to the collection should be returned as well. This would result in a very large response.
(I did notice in the specification that the included field is optional, but I'm wondering whether it would be strange if we support it for the references but not for e.g. structures.)

Instead, the entries in the collection could be retrieved via a query that checks if the id of an entry is in a list of ids supplied by the client. This would use the same paging mechanism that we already use. A disadvantage could however be that there may be a maximum size for the query parameter (depending on the software and settings, it could be as small as 1024B). So the number of entries per query could be limited.

An alternative solution would be to make bidirectional associations and query the individual entries for membership.

So in summary, I think it would be better to use a separate variable for storing the members, as this gives more flexibility to handle them.
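The query-size concern raised above can be worked around on the client side by splitting the ID list across several requests. A sketch, using the conservative ~1024-byte limit mentioned in the comment (real limits vary by server software and configuration):

```python
def chunk_id_filters(ids, max_query_bytes=1024):
    """Split member IDs into several OPTIMADE filter strings, each short
    enough that "filter=<chunk>" stays within a conservative query-string
    size budget. Sketch only; ignores percent-encoding overhead."""
    chunks, current = [], []
    for i in ids:
        clause = f'id="{i}"'
        trial = " OR ".join(current + [clause])
        if current and len(("filter=" + trial).encode()) > max_query_bytes:
            chunks.append(" OR ".join(current))
            current = [clause]
        else:
            current.append(clause)
    if current:
        chunks.append(" OR ".join(current))
    return chunks
```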

@rartino
Contributor

rartino commented Dec 20, 2022

> It would also solve the included problem. [...]

To try to stay on point, I do not think we need to solve the "included problem" to add collections to the specification. The specification clearly allows an implementation not to ever include anything in a collections response, and then there is no problem. If someone wants to improve the general handling of this (which applies throughout all of OPTIMADE), it can go in a separate PR.

> (I did notice in the specification that the included field is optional, but I'm wondering whether it would be strange if we support it for the references but not for e.g. structures.)

It is specifically encouraged for references because it is a sensitive issue to include data without sourcing it. That only applies to references. I'd be happy with implementations never using or supporting included beyond that.

> Instead, the entries in the collection could be retrieved via a query that checks if the id of an entry is in a list of ids supplied by the client. This would use the same paging mechanism that we already use. A disadvantage could however be that there may be a maximum size for the query parameter (depending on the software and settings, it could be as small as 1024B). So the number of entries per query could be limited.

I was trying to force MUST level support for POST in OPTIMADE using the exact same semantics as for GET, to avoid issues like this. I'm sad that was rejected. Nevertheless, a client can work around the limitations by splitting up its request, so the issue can be worked around. I think that is good enough to accept the proposed version of collections.

> An alternative solution would be to make bidirectional associations and query the individual entries for membership.

I am very strongly opposed to a MUST level requirement for bi-directional collections (and, if clients cannot rely on it being there, i.e, a MUST, then it is not very useful). In the httk implementation we generally want to view stored data as immutable, and hence adding something existing to a collection would need some kind of side storage / querying tricks to handle the bi-directionality.

> So in summary, I think it would be better to use a separate variable for storing the members, as this gives more flexibility to handle them.

Right, so, lets get the solution of #428 implemented and merged, and then use it here.

@merkys
Member

merkys commented Dec 21, 2022

> > Instead, the entries in the collection could be retrieved via a query that checks if the id of an entry is in a list of ids supplied by the client. This would use the same paging mechanism that we already use. A disadvantage could however be that there may be a maximum size for the query parameter (depending on the software and settings, it could be as small as 1024B). So the number of entries per query could be limited.
>
> I was trying to force MUST level support for POST in OPTIMADE using the exact same semantics as for GET, to avoid issues like this. I'm sad that was rejected. Nevertheless, a client can work around the limitations by splitting up its request, so the issue can be worked around. I think that is good enough to accept the proposed version of collections.

This is yet another argument in favor of supporting POST (#114). It may be worth revisiting the discussion there.

@rartino rartino mentioned this pull request Jan 13, 2023
@rartino
Contributor

rartino commented Jan 9, 2024

I note that our merging of #467 probably should have closed #428. I don't see why we cannot complete and merge this PR now. (It would of course be nice to also have #481, but it isn't necessary). I'll take a look at the text and check if I can update it.

@rartino
Contributor

rartino commented Jan 17, 2024

After discussions: we realize that the partial data format in OPTIMADE may not cover the case of a relationship list too long to fit in a response. So we have to do one of:

  • Put the members of the collection in the data part instead
  • Amend the partial data handling to work with a too long relationship list

@merkys
Member

merkys commented Jun 15, 2024

Some pros and cons of having the collection members in the data part, from my PoV.

Pros:

  • easier to paginate
  • easier to define additional metadata for collection participation (otherwise such metadata would have to go to meta of relationships objects)
  • allows relationships between collections and other entries (otherwise it would be impossible to distinguish between collection membership and their relationships)

Cons:

Labels: PR/ready-for-review, type/proposal