Nested triples maps #6

Open
dachafra opened this issue Oct 21, 2020 · 9 comments
@dachafra
Member

issue: it is not possible to scope joins so that they are resolved only within a single iteration

suggestion: add rml:subTriplesMap (or rml:nestedTriplesMap) as a property of rml:RefObjectMap. This would allow engines to resolve joins within a single iteration, covering the frequently occurring case where certain related objects only occur within the sub-hierarchy / sub-graph of other objects. It would also improve the streamability of mappings.

@dachafra added the rml (rml issues) and representation (representation issues) labels on Oct 21, 2020
@bjdmeest
Member

bjdmeest commented Mar 2, 2022

Related discussion: kg-construct/rml-questions#11

@bjdmeest
Member

bjdmeest commented Mar 2, 2022

Copying the conclusion of kg-construct/rml-questions#11 here: what makes this trivial in, e.g., SPARQL-Anything, is the fact that values from the same iteration can be linked with some kind of 'iteration identifier' (that is independent of the actual data values).

The issue is that a join needs some kind of join condition, whereas here, the join condition is basically 'be the same iteration'. If there were a way to define this in RML, there would be no need for the current join condition.

Maybe it's a good idea to think of assigning each iteration some kind of ID by default, and to have some way to refer to that ID in the mapping? Or maybe even some additional functionality so we can do things like 'do join condition with the previous iteration' (in cases where the order in a data source is actually meaningful), 'do join condition with the parent iteration' (in cases where data sources actually have a hierarchy), etc.? That way, we keep a single construct in the RML language (i.e., join conditions), and the engines can optimize however they want. (No one's to say we can't provide shortcuts as well, of course.)

In fact, in JSONPath for example, part of the spec is the option to return the normalized path expression as an alternative to the actual values, which we could see as the iteration ID? (see the bottom of https://goessner.net/articles/JsonPath/ )
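To make the 'normalized path as iteration ID' idea concrete, here is a minimal Python sketch. It is not an RML engine and does not use any JSONPath library: the helper names `iterate` and `parent_iteration`, the restricted path syntax (plain keys and `*` over lists), and the sample document are all assumptions made for illustration.

```python
import re
from typing import Any, Iterator

def iterate(doc: Any, steps: list[str], path: str = "$") -> Iterator[tuple[str, Any]]:
    """Yield (normalized_path, value) for a path made of key names and '*' steps."""
    if not steps:
        yield path, doc
        return
    step, rest = steps[0], steps[1:]
    if step == "*":
        # Wildcard over a list: the index becomes part of the identifier.
        for index, item in enumerate(doc):
            yield from iterate(item, rest, f"{path}[{index}]")
    else:
        yield from iterate(doc[step], rest, f"{path}.{step}")

def parent_iteration(iteration_id: str, levels: int = 1) -> str:
    """Drop trailing `.key[index]` segments to get an enclosing iteration's ID."""
    for _ in range(levels):
        iteration_id = re.sub(r"\.[^.\[\]]+\[\d+\]$", "", iteration_id)
    return iteration_id

# Hypothetical sample document, shaped like the parent/child case in this thread.
doc = {"parent": [
    {"parentID": "p1", "child": [{"childID": "c1"}, {"childID": "c2"}]},
    {"parentID": "p2", "child": [{"childID": "c3"}]},
]}

# Rough equivalents of the iterators $.parent[*].child[*] and $.parent[*]
child_iterations = list(iterate(doc, ["parent", "*", "child", "*"]))
parent_iterations = dict(iterate(doc, ["parent", "*"]))

# "Join on parent iteration": the join key is the iteration ID, not a data value.
for iteration_id, child in child_iterations:
    parent = parent_iterations[parent_iteration(iteration_id)]
    print(iteration_id, child["childID"], "->", parent["parentID"])
```

Under these assumptions, the child iteration `$.parent[1].child[0]` resolves to the parent iteration `$.parent[1]` without comparing any data values.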

@bjdmeest
Member

bjdmeest commented Mar 2, 2022

I'm now wondering how much #20 (access fields in parent iteration) is actually this (join on iteration) + #29 (join to get literal instead of term). @frmichel, do you think pushDown can be seen as syntactic sugar over 'join on parent iteration + get literal from join'?

@justin2004
Copy link
Contributor

justin2004 commented Mar 2, 2022

do join condition with the parent iteration

@bjdmeest

I need that often with JSON sources. When I tried to use RML, I was able to do it by preprocessing the JSON and adding id keys ("_id") to each object, which I can then reference in the mapping to find the appropriate parent.

e.g. I changed this to this.

Here is the RML that uses those id keys I added.
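The linked before/after files and mapping are not reproduced here, so the following is only a sketch of what such a preprocessing step could look like, not the script that was actually used. The function name `add_ids`, the sample document, and the extra `_parent_id` key are assumptions; the comment above only mentions `_id`.

```python
import itertools
from typing import Any, Iterator, Optional

def add_ids(node: Any, counter: Iterator[int], parent_id: Optional[int] = None) -> None:
    """Add "_id" (and, as an assumption, "_parent_id") to every JSON object, in place."""
    if isinstance(node, dict):
        children = list(node.values())  # snapshot taken before the new keys are added
        node["_id"] = next(counter)
        if parent_id is not None:
            node["_parent_id"] = parent_id
        for value in children:
            add_ids(value, counter, node["_id"])
    elif isinstance(node, list):
        # Lists get no ID of their own; their items belong to the enclosing object.
        for item in node:
            add_ids(item, counter, parent_id)

# Hypothetical sample document.
doc = {"parent": [
    {"parentID": "p1", "child": [{"childID": "c1"}, {"childID": "c2"}]},
    {"parentID": "p2", "child": [{"childID": "c3"}]},
]}

add_ids(doc, itertools.count(1))
print(doc["parent"][0]["child"][1])
```

With keys like these in the data, an ordinary value-based join condition (child `_parent_id` equals parent `_id`) can stand in for a 'join on parent iteration'.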

@frmichel

frmichel commented Mar 2, 2022

I'm now wondering how much kg-construct/rml-cc#5 (access fields in parent iteration) is actually this (join on iteration) + kg-construct/rml-core#29 (join to get literal instead of term) @frmichel do you think pushDown can be seen as syntactic sugar over 'join on parent iteration + get literal from join'?

Hi @bjdmeest, I'm not sure I get the whole picture here as I did not follow each of the issues, but I would say yes. The pushDown can be used not only in the rml:iterator but also in any sub-iteration made in a nested term map. Yet if you could send me an example that would help me be more specific.

@bjdmeest
Member

bjdmeest commented Mar 4, 2022

Hmmm, coming back to this: it is getting very complex (see below for some pseudocode), so it's probably better to take this into account when thinking about joins (as being discussed at https://github.com/kg-construct/rml-fno-spec/issues/2 )

- triplesmap
  - logicalsource
    - iterator: $.parent[*]
  - subjectmap: "ex:{parentID}"
    class: "ex:Parent"
- triplesmap2
  - logicalsource
    - iterator: $.parent[*].child[*]
  - subjectmap: "ex:{childID}"
  - po
    - predicate: "ex:nestedName"
      object:
        function: joinvalues
        parameters:
          value1: "{childID}_"
          value2:
            referencingobject:
              parenttriplesmap: triplesmap
              # joinOnSameParentIteration
              # from ParentIteration, take value "parentID"

@frmichel

frmichel commented Mar 9, 2022

Ok I think I got it. So the answer is no, pushDown will not be sufficient in this case. I'll try to explain but I'm afraid that's not gonna be clear ;).

When you evaluate the iterator $.parent[*].child[*] on input documents, you mix up all parents and all children. You don't keep the association of parents to children. So this has to be done one hierarchical level at a time: first an iterator on parent[], then on child[].
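A small Python sketch of that difference, using a hypothetical sample document: the flat evaluation returns only child objects, while iterating one level at a time keeps the parent in scope. The `childID_parentID` value mirrors the `joinvalues` pseudocode earlier in the thread; nothing here is RML engine behaviour.

```python
# Hypothetical sample document.
doc = {"parent": [
    {"parentID": "p1", "child": [{"childID": "c1"}, {"childID": "c2"}]},
    {"parentID": "p2", "child": [{"childID": "c3"}]},
]}

# Result of evaluating $.parent[*].child[*] as one iterator: only the child
# objects come back, with no record of which parent each one belonged to.
flat_children = [child for parent in doc["parent"] for child in parent["child"]]

# One hierarchical level at a time: the parent is still in scope
# while its children are being iterated.
nested_names = []
for parent in doc["parent"]:        # iterator on parent[*]
    for child in parent["child"]:   # then on child[*], within that parent
        nested_names.append(f'{child["childID"]}_{parent["parentID"]}')

print(flat_children)
print(nested_names)
```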

The pushDown makes it possible to work one hierarchical level at a time, but it has a limitation:
- The first iteration level can be set in the logical source, so that the fields pushed down from the logical source will be available for all term maps (subject, predicate, and object).
- The next levels will be set in (nested) term maps. But the fields pushed down from a (nested) term map will be available only for subsequent nested term maps, that is, within the context of either a single subject map or a single object map, but not for both at the same time. Since the iterations in the subject map and the object map are not "in sync", that fails.

The example below will mix up all children from a given parent: the subject map will generate all terms "ex:{childID}" of a given parent, and those will be mixed up with all (predicate, object) pairs for the same parent.

- triplesmap2
  - logicalsource
    - iterator: $.parent[*]
    - pushdown:
      - reference: $.parent[*]
      - as: theParent
  - subjectmap:
    reference: $.child[*]
    nestedTermMap: "ex:{childID}"
  - po
    - predicate: "ex:nestedName"
      object:
        - reference: $.child[*]
        - pushdown:
          - reference: $.theParent
          - as: theParent
        - nestedTermMap:
          - parenttriplesmap: triplesmap
          ... # some join condition involving theParent and parentID

I hope this answers your question; I'm still not sure it does ;).

@bjdmeest
Member

It does for me :). I assume a possible approach to tackle this is the Fields approach, where you basically create your own iterations so you can make sure that they remain 'in sync'? Another solution would be to do everything via join conditions, but then we need some kind of 'iteration identifier' (and in this case, nested iteration identifiers, so (grand)children can refer to (grand)parent iterations), so you can join on iteration instead of on data values, cf. my previous comment. Maybe it's a good idea to pursue both?

@bjdmeest
Member

bjdmeest commented Feb 9, 2023

Additional use case in favor of 'iteration identifiers': being able to get the 'accessing' reference formulation per reference is needed at RMLio/yarrrml-parser#184.

So, I can imagine for CSV, per iteration you need to be able to identify that iteration (in this case, the row index would be enough), and per reference you need to identify that reference formulation (in this case, the combination row index / column index would be enough).

So a CSV file like the one below

| lastname | firstname |
| --- | --- |
| De Meester | Ben |
| Chaves | David |

could have a CSV iteration like the one below

| firstname |
| --- |
| David |

which could actually have the following references (relying on https://w3c.github.io/csvw/metadata/#uri-template-properties)

| _sourceRow | firstname | firstname_sourceColumn |
| --- | --- | --- |
| 2 | David | 1 |
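As a rough illustration of how an engine could expose such identifier fields next to the data, here is a Python sketch using only the standard `csv` module. The function name `iterations`, the comma-separated sample text, and the numbering convention (data rows counted from 1, columns from 0) are assumptions chosen to reproduce the values in the example above; other conventions are equally possible.

```python
import csv
import io
from typing import Any, Iterator

# The sample file from this comment, written out as comma-separated text (assumption).
CSV_TEXT = "lastname,firstname\nDe Meester,Ben\nChaves,David\n"

def iterations(text: str) -> Iterator[dict[str, Any]]:
    """Yield one dict per data row, with identifier fields added next to the values."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Data rows are counted from 1 here (assumption).
    for row_index, row in enumerate(reader, start=1):
        record: dict[str, Any] = {"_sourceRow": row_index}
        # Columns are counted from 0 here (assumption).
        for column_index, (name, value) in enumerate(zip(header, row)):
            record[name] = value
            record[f"{name}_sourceColumn"] = column_index
        yield record

rows = list(iterations(CSV_TEXT))
print(rows[1])
```

A mapping could then refer to `_sourceRow` to identify the iteration and to `firstname_sourceColumn` to identify the reference, without depending on the data values themselves.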

For JSONPath, you could include the actual path used for each iteration and reference, e.g., iteration $.persons[*] with reference * would give, for the first iteration, identifier $.persons[0] and reference identifiers lastname and firstname.

It won't create the most elegant mappings, but it gives users a lot of context to hack stuff together.
