[similarity] add similarity distance #1176

barendgehrels · 2023-07-12T18:01:31Z

This is a new similarity distance, comparable to Hausdorff or Frechet.

It calculates the area between two lines and divides them by the line length. This is an average distance, measuring quite well how similar two lines are. It is less sensible to outliers (such as Hausdorff/Frechet) and not discrete.

For example this case (west1/west2):

They are quite similar. Distance impressions are:
Frechet: 53.4426 Hausdorff: 53.4426 Average: 5.61308

How it works in detail:

it projects all points on the other line
and vice versa
it adds their own points too. Then both collections have an equal amount of points, and because of the projections, their locations are comparable (especially if they are already similar)
it checks if one of the lines should be reversed
then it calculates areas of all four-corner rings and sums them
and it divides by the shortest length

(See the comment below for an enhanced description)

Zoomed in:

Apart from this, this PR also adds an experimental geojson_writer

If you like it, I'll also add a version for rings. For polygons/multi it is a bit more complex (especially if the number of objects are unequal), to be worked out.

The geojson_writer is really useful for visualizing results (or debug steps). But it is not completely finished either.

Complete output of the test program:

Simplex cases: 
Frechet: 0.1 Hausdorff: 0.1 Average: 0.1
Frechet: 0.509902 Hausdorff: 0.509902 Average: 0.1
Frechet: 1.58114 Hausdorff: 1.58114 Average: 0.114

Real cases: 
Frechet: 53.4426 Hausdorff: 53.4426 Average: 5.61308
Frechet: 81.0897 Hausdorff: 81.0897 Average: 3.98684

Reversed:
Frechet: 990.622 Hausdorff: 53.4426 Average: 5.61308
Frechet: 1199.04 Hausdorff: 81.0897 Average: 3.98684

Simplified (west): 
Frechet: 494.575 Hausdorff: 494.575 Average: 61.4674
Frechet: 475.107 Hausdorff: 475.107 Average: 24.5124
Frechet: 54.7065 Hausdorff: 54.7065 Average: 0.733346
Frechet: 23.9904 Hausdorff: 23.9904 Average: 0

Simplified (east): 
Frechet: 520.245 Hausdorff: 520.245 Average: 124.301
Frechet: 74.5731 Hausdorff: 74.5731 Average: 7.53191
Frechet: 74.5731 Hausdorff: 74.5731 Average: 0.511786
Frechet: 0 Hausdorff: 0 Average: 0

Not similar: 
Frechet: 8760.18 Hausdorff: 7737.14 Average: 4925.39
Frechet: 8771.21 Hausdorff: 7746.79 Average: 4848.01

Parts (TODO):
Frechet: 831.744 Hausdorff: 831.744 Average: 7.25765
Frechet: 1094.28 Hausdorff: 1094.28 Average: 199.363

Identical: 
Frechet: 0 Hausdorff: 0 Average: 0
Frechet: 0 Hausdorff: 0 Average: 0

Performance:
Frechet: 105 ms
Hausdorff: 78 ms
Average: 498 ms

Showing that:

hausdorff was not symmetric (I created an issue for that)
frechet gives different results for reversed cases
frechet and hausdorff are often, but not always, the same
the new method is much slower (we can use an index to enhance this)
because frechet/hausdorff are discrete, its behavior is not good when there are points in between (see simplex 2, or the simplified cases)
because frechet/hausdorff use a max distance, a spike large influence (simplex 3)

I also tried to give a "partial" result, which happens if one line is partly identical to the other line. This is possible, but not yet finished. It uses the projections and then recursively finds the best division point. But this is not part of the PR.

barendgehrels · 2023-07-13T17:52:35Z

A rephrased description, with a bit of AI help:

To compute the Average Similarity Distance (ASD) between two linestrings, let's consider linestring P and linestring Q. The ASD is derived by comparing the geometric characteristics of the two linestrings.

Projection: Each point of linestring Q is projected onto linestring P, identifying the shortest distance for projection. This projection process creates a modified version of linestring P, denoted as P'. Similarly, each point of linestring P is projected onto linestring Q, resulting in a modified version of linestring Q called Q'.
Augmentation: All points of linestring P are added to P', maintaining their original position, with an offset value of 0 and increasing segment distance. Similarly, all points of linestring Q are added to Q'.
Alignment: P' and Q' are sorted based on their segment index and offset values in increasing order. This sorting ensures that both collections are comparable.
Reversal: If necessary, one of the linestrings is reversed to achieve better alignment between corresponding points in P' and Q'. This step ensures consistency in the comparison.
Area Calculation: The areas of all quadrilaterals, formed by connecting each point with its subsequent point in both P' and Q', are calculated. The sum of these areas is computed.
Normalization: The total sum of areas is divided by the shortest length of the two linestrings P and Q to normalize the ASD value.

The resulting ASD value provides an average distance measure that captures the geometric similarity between the linestrings P and Q. By incorporating projection, augmentation, alignment, and area calculation steps, the ASD offers an enhanced understanding of the similarity between the two linestrings.

vissarion

Great!! That's a new algorithm right? There is a similar work where they compute the area between two linestrings but in a slightly different way https://link.springer.com/article/10.1007/s12289-018-1421-8 (ends up in a different area).

A did a first parse and put some comments in the code.

Is there cases where other measures (Fréchet or Hausdorff) is better than the new one? How the new method perform when the inputs differ a lot between them?

vissarion · 2023-07-19T11:33:13Z

include/boost/geometry/algorithms/similarity.hpp

+auto get_projection(std::size_t index, Point const& point, Geometry const& target)
+{
+    using coor_t = typename coordinate_type<Point>::type;
+    using closest_strategy = geometry::strategy::closest_points::detail::compute_closest_point_to_segment<coor_t>;


I think this should passed as a function parameter following a similar mechanism as in all other algorithms.

vissarion · 2023-07-19T11:42:19Z

include/boost/geometry/algorithms/similarity.hpp

+        auto const& current = target[i];
+        auto const& next = target[i + 1];
+        projection<Point> other;
+        auto const pp = closest_strategy::apply(point, current, next);


This is very similar to how closest points are computed for a linestring. I guess you are not using that algorithm because you want to keep track also the index. Maybe the closest_points procedure can be enhanced with this extra information (e.g. providing the index) and reused here?

From performance point of view there is some duplication in the computation of the closest point here with the distance computed in two lines below. Also, you may consider using the comparable distance instead of the real distance.

I looked at it. It seems that some code duplication is also there in the detail::closest_feature::point_to_point_range function, where the closest segment is determined based on (comparable) distance. Which, internally, already finds the projection - but it is discarded (in the strategy itself).
Then the closest point is calculated.

Here I do it the other way round, calling first closest point strategy (which already does some calculations to find it - but does not calculate the distance yet) and then calculate its distance (I could use comparable instead - changed it right now).

I'm not sure if it is worth the effort to harmonize this code for this reason only, it will make it more complex.

What could be useful is a strategy which gives all the internal information to avoid code duplication in other places. Something like that is already in extensions (calculate_distance_info) but never made it. So maybe also that is not really worth the effort...

vissarion · 2023-07-19T13:24:07Z

include/boost/geometry/algorithms/similarity.hpp

+    auto const sum_area = get_areal_sum(boost::begin(p_enriched), boost::end(p_enriched), 
+        boost::begin(q_enriched), boost::end(q_enriched), ring, visitor);
+
+    auto const length = (std::min)(geometry::length(p), geometry::length(q));


at least for the reversed linestring the length is already computed (maybe this can be reused here)

I don't see that, unless you mean the distance calculation done for reversal.

vissarion · 2023-07-21T16:22:43Z

include/boost/geometry/algorithms/similarity.hpp

+    int const side_p0_q = side_strategy::apply(q0.point, q1.point, p0.point);
+    int const side_p1_q = side_strategy::apply(q0.point, q1.point, p1.point);
+
+    if (side_q0_p == 0 && side_q1_p == 0 && side_p0_q == 0 && side_p1_q == 0)


shouldn't side_q0_p == 0 && side_q1_p == 0 be enough?

That should indeed be enough, at least for the cases I tested. However, I'm not sure about FP-precision where there are sometimes weird results (such as 3 times 0 and one non-zero). And the sides have to be calculated later (used later) anyway.
But with those FP-precision errors the area is probably minimal - so it probably won't make a real difference.

I will think about what my final preference is here.

vissarion · 2023-07-21T16:23:17Z

include/boost/geometry/algorithms/similarity.hpp

+}
+
+template <typename Projection, typename Ring>
+bool fill_quadrilateral(Projection const& p0, Projection const& p1, Projection const& q0, Projection const& q1, Ring& ring)


is it always a quadrileteral? In some cases could be a triangle or even degenerated to a segment.

Indeed, it can. But sorting that out is probably not worth it, because the area will be the same.

awulkiew · 2023-07-21T22:03:50Z

Thanks, that's interesting!

AFAIU the CS-specific operations are closest points, side and area so technically this distance should work in all coordinate systems correct?

Do I understand correctly that you wanted to show a prototype to start a discussion and you plan to implement it with Boost.Geometry structure (strategy, dispatch, etc.) if we agree that this is a good addition to the library?

Is ASD an official name for this measure?
Would giving the function a name more specific than similarity_distance be a good idea?
Or do you think that this should be baseline, default similarity measure in the library?

awulkiew · 2023-07-21T22:08:29Z

include/boost/geometry/algorithms/discrete_hausdorff_distance.hpp

+#include <algorithm>
+#include <iterator>
+#include <utility>
+#include <limits>


I recall that this order of headers is prefered in general. In our case this would be Boost.Geometry headers, then other Boost headers, then standard headers. Do we want to make a guideline about this?

I'd add one thing. The headers are not sorted alphabetically. They probably should.

I see that in other file the order is different so I'm curious why were they moved here?

awulkiew · 2023-07-21T22:09:39Z

include/boost/geometry/algorithms/similarity.hpp

+#include <cstdint>
+#include <vector>
+
+#include <boost/geometry/core/point_type.hpp>
+#include <boost/geometry/algorithms/distance.hpp>
+#include <boost/geometry/algorithms/length.hpp>


Here standard headers are before Boost.Geometry headers though.

barendgehrels · 2023-07-22T13:27:42Z

Thank you both for your comments! Cool!
Yes, I think similarity_distance would be a good name, it also crossed my mind.
I will come back to your comments in more detail next week on Thursday, probably.

awulkiew · 2023-07-22T14:45:59Z

To clarify, I'm considering whether or not giving it such general name is a good idea. There are various measures of similarity after all. And AFAIU they are not interchangeable. Unless it would be possible to define some kind of standard of similarity allowing to generate comparable results between various algorithms? So the question is which one if any should be considered as the default one in the library? If it's not possible to answer this question then maybe this algorithm should have a specific name like the other ones.

barendgehrels · 2023-07-27T12:09:21Z

AFAIU the CS-specific operations are closest points, side and area so technically this distance should work in all coordinate systems correct?

Yes, correct, I added a call for geographic coordinate system as well.

Do I understand correctly that you wanted to show a prototype to start a discussion and you plan to implement it with Boost.Geometry structure (strategy, dispatch, etc.) if we agree that this is a good addition to the library?

Yes, indeed

Is ASD an official name for this measure? Would giving the function a name more specific than similarity_distance be a good idea? Or do you think that this should be baseline, default similarity measure in the library?

Okay, I now have average_distance - which is, in fact, what it is. And I think it's a better measure for similarity than hausdorff or frechet, because that is more a "max distance" which might have a large value while all other points are close or equal, apart from the problems because it is discrete. But this might be biased and have to be researched a bit more.

Fixes: issue boostorg#1186

…nguishable

awulkiew · 2023-09-25T12:01:05Z

Okay, I now have average_distance - which is, in fact, what it is.

Would it be possible to calculate the similarity of areal geometries (rings or polygons) this way, i.e. pass them into this algorithm and internally calculate the similarity of their boundaries? If it is then this name would not work for them because the average distance of two intersecting polygons would always be 0.

[similarity] add similarity distance

0380f75

barendgehrels added the experimental Experimental PR - might need work or discussion label Jul 12, 2023

barendgehrels requested review from awulkiew and vissarion July 12, 2023 18:01

barendgehrels self-assigned this Jul 12, 2023

barendgehrels requested a review from mloskot July 13, 2023 18:07

vissarion reviewed Jul 21, 2023

View reviewed changes

awulkiew reviewed Jul 21, 2023

View reviewed changes

barendgehrels closed this Jul 27, 2023

fixup! [similarity] add similarity distance

5043795

barendgehrels reopened this Jul 27, 2023

barendgehrels added 6 commits July 28, 2023 18:37

fixup! [similarity] add similarity distance

e51ded3

similarity additional, geojson additional

f93673f

[fix] overlay threshold for sort_by_side

d872912

Fixes: issue boostorg#1186

temp

f387baa

[fix] avoid current get_distance_measure which can make sides indisti…

a5771cc

…nguishable

[test] add unit test for get_distance_measure showing its failures

d557b5a

[similarity] add similarity distance #1176

Are you sure you want to change the base?

[similarity] add similarity distance #1176

Uh oh!

Conversation

barendgehrels commented Jul 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

barendgehrels commented Jul 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vissarion left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

awulkiew commented Jul 21, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

barendgehrels commented Jul 22, 2023

Uh oh!

awulkiew commented Jul 22, 2023

Uh oh!

barendgehrels commented Jul 27, 2023

Uh oh!

awulkiew commented Sep 25, 2023

Uh oh!

Uh oh!

barendgehrels commented Jul 12, 2023 •

edited

Loading

barendgehrels commented Jul 13, 2023 •

edited

Loading