Speedup PriorityQueue a little #13936

original-brownbear · 2024-10-19T20:39:48Z

Saving some field accesses results in small but visible savings:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                         LowTerm      617.46      (8.2%)      615.47      (7.4%)   -0.3% ( -14% -   16%) 0.855
                    OrHighNotLow      488.90      (9.5%)      491.20      (7.8%)    0.5% ( -15% -   19%) 0.808
                    OrHighNotMed      384.07      (7.9%)      386.54      (6.4%)    0.6% ( -12% -   16%) 0.688
                       OrHighMed      269.60      (4.9%)      271.50      (4.5%)    0.7% (  -8% -   10%) 0.504
                         Prefix3      397.59      (7.4%)      401.40      (6.8%)    1.0% ( -12% -   16%) 0.546
                      AndHighMed      156.59      (4.2%)      158.14      (4.2%)    1.0% (  -7% -    9%) 0.290
                      AndHighLow     1711.42      (7.5%)     1729.95      (6.7%)    1.1% ( -12% -   16%) 0.496
                       OrHighLow      564.40      (4.6%)      570.66      (5.3%)    1.1% (  -8% -   11%) 0.315
             LowIntervalsOrdered      255.12      (5.7%)      258.02      (7.6%)    1.1% ( -11% -   15%) 0.448
                        HighTerm      688.93      (6.5%)      697.31      (6.4%)    1.2% ( -10% -   15%) 0.400
                        PKLookup      242.28      (1.5%)      245.33      (2.5%)    1.3% (  -2% -    5%) 0.006
            HighTermTitleBDVSort       37.87      (3.3%)       38.37      (3.3%)    1.3% (  -5% -    8%) 0.073
             MedIntervalsOrdered       48.08      (5.5%)       48.72      (6.7%)    1.3% ( -10% -   14%) 0.335
                   OrHighNotHigh      419.24      (6.8%)      424.98      (6.3%)    1.4% ( -10% -   15%) 0.350
                         MedTerm      668.97      (7.0%)      678.42      (5.4%)    1.4% ( -10% -   14%) 0.311
               HighTermTitleSort       76.23      (6.0%)       77.32      (5.0%)    1.4% (  -9% -   13%) 0.254
                    OrNotHighLow      993.21      (7.3%)     1007.93      (5.6%)    1.5% ( -10% -   15%) 0.310
                      TermDTSort      207.44      (3.7%)      210.59      (4.2%)    1.5% (  -6% -    9%) 0.086
                          IntNRQ      216.19      (6.9%)      219.48      (7.3%)    1.5% ( -11% -   16%) 0.340
                          Fuzzy1       98.73      (2.1%)      100.28      (2.7%)    1.6% (  -3% -    6%) 0.004
            BrowseDateTaxoFacets        5.31      (8.6%)        5.40      (7.8%)    1.7% ( -13% -   19%) 0.368
            MedTermDayTaxoFacets       34.37      (4.4%)       34.94      (4.2%)    1.7% (  -6% -   10%) 0.086
       BrowseDayOfYearTaxoFacets        5.39      (8.4%)        5.48      (7.9%)    1.7% ( -13% -   19%) 0.343
                      OrHighHigh       79.75      (7.3%)       81.16      (5.7%)    1.8% ( -10% -   15%) 0.227
                   OrNotHighHigh      355.03      (6.4%)      361.39      (5.8%)    1.8% (  -9% -   14%) 0.188
                    OrNotHighMed      384.05      (6.4%)      391.18      (7.6%)    1.9% ( -11% -   16%) 0.237
            HighIntervalsOrdered       34.09      (4.9%)       34.76      (6.3%)    2.0% (  -8% -   13%) 0.118
               HighTermMonthSort     1504.80      (6.1%)     1537.64      (6.1%)    2.2% (  -9% -   15%) 0.111
                       LowPhrase      144.85      (4.6%)      148.15      (4.7%)    2.3% (  -6% -   12%) 0.029
           HighTermDayOfYearSort      359.83      (6.6%)      368.10      (6.9%)    2.3% ( -10% -   16%) 0.127
         AndHighMedDayTaxoFacets       22.54      (3.7%)       23.08      (4.0%)    2.4% (  -5% -   10%) 0.006
     BrowseRandomLabelTaxoFacets        4.41      (3.4%)        4.52      (3.8%)    2.4% (  -4% -    9%) 0.003
                HighSloppyPhrase       27.97      (5.9%)       28.70      (6.2%)    2.6% (  -9% -   15%) 0.055
                     LowSpanNear       18.23      (7.2%)       18.71      (8.1%)    2.7% ( -11% -   19%) 0.120
                      HighPhrase      279.12      (7.9%)      286.62      (7.2%)    2.7% ( -11% -   19%) 0.112
                 MedSloppyPhrase        5.58      (4.1%)        5.73      (4.9%)    2.7% (  -6% -   12%) 0.007
                     AndHighHigh       61.57      (4.6%)       63.31      (4.2%)    2.8% (  -5% -   12%) 0.004
           BrowseMonthTaxoFacets       12.23     (28.7%)       12.58     (31.2%)    2.8% ( -44% -   88%) 0.672
                        Wildcard       84.87      (3.2%)       87.37      (3.0%)    2.9% (  -3% -    9%) 0.000
                         Respell       56.85      (1.9%)       58.60      (2.4%)    3.1% (  -1% -    7%) 0.000
            BrowseDateSSDVFacets        1.24      (6.0%)        1.28      (6.4%)    3.2% (  -8% -   16%) 0.020
                          Fuzzy2       69.18      (3.5%)       71.42      (3.4%)    3.2% (  -3% -   10%) 0.000
                    HighSpanNear       18.56      (6.8%)       19.19      (6.9%)    3.4% (  -9% -   18%) 0.028
          OrHighMedDayTaxoFacets        8.58      (7.4%)        8.88      (7.0%)    3.5% ( -10% -   19%) 0.031
                 LowSloppyPhrase       20.16      (8.0%)       20.88      (9.9%)    3.5% ( -13% -   23%) 0.079
                     MedSpanNear       30.28      (2.6%)       31.46      (4.4%)    3.9% (  -3% -   11%) 0.000
        AndHighHighDayTaxoFacets        8.28      (5.8%)        8.60      (5.9%)    3.9% (  -7% -   16%) 0.003
           BrowseMonthSSDVFacets        4.48      (9.6%)        4.66      (9.1%)    4.1% ( -13% -   25%) 0.048
                       MedPhrase       32.89      (4.5%)       34.28      (6.1%)    4.3% (  -6% -   15%) 0.000
     BrowseRandomLabelSSDVFacets        3.29      (3.8%)        3.43      (5.4%)    4.4% (  -4% -   14%) 0.000
       BrowseDayOfYearSSDVFacets        4.41      (4.2%)        4.65      (6.5%)    5.5% (  -5% -   16%) 0.000

Saving some field accesses.

rmuir · 2024-10-20T11:37:54Z

This results in a lot more code complexity, which makes maintenance difficult. Maybe the version of java you are testing with has a bug in its register allocator or something?

seriously? I think we should take a step back before making all of our code more complex for a 1% benefit which might just be an upstream compiler bug.

original-brownbear · 2024-10-20T14:39:30Z

I think the queue methods changed here in isolation get a far bigger improvement than 1% in many cases. Plus making methods like the ones adjusted here smaller and easier on the CPU cache tends to help the performance of "neighboring" code as well in many case (hence the across the board speedup in the luceneutil run).

I don't think this is the result of a JVM bug, it's just something that is hard to optimize by the compiler with Java so dynamic. It's a combination of two things.

Current JIT implementations don't really take advantage of final fields in normal objects https://openjdk.org/jeps/8132243, https://bugs.openjdk.org/browse/JDK-8058164 and so on. That's what makes caching stuff like size or the heap array so helpful.
Field loads simple result in larger methods when measuring byte code size than caching in a local variable. Unless JIT implementations become more sophisticated (looking at e.g. https://mail.openjdk.org/pipermail/core-libs-dev/2023-July/109461.html it doesn't look like that's happening anytime soon), avoiding field access tends to result in deeper inlining here and there.

dweiss · 2024-10-20T15:06:00Z

lucene/core/src/java/org/apache/lucene/util/PriorityQueue.java

@@ -117,26 +117,29 @@ public PriorityQueue(int maxSize, Supplier<T> sentinelObjectSupplier) {
   * ArrayIndexOutOfBoundsException} is thrown.
   */
  public void addAll(Collection<T> elements) {
-    if (this.size + elements.size() > this.maxSize) {
+    int s = size;


Can we at least rename "s" to "size" and use this.size as the right hand side of this assignment?

Right that was a little weird sorry :) renamed now.

dweiss · 2024-10-20T15:08:03Z

lucene/core/src/java/org/apache/lucene/util/PriorityQueue.java

@@ -270,7 +280,7 @@ public final boolean remove(T element) {
    return false;
  }

-  private final boolean upHeap(int origPos) {
+  private boolean upHeap(int origPos, T[] heap) {


I'd create a local heap variable (var heap = this.heap) locally in this method, not pass it as an argument. It is confusing why you'd want it as an argument. I agree with Robert here that we should perhaps see long-term maintenance as worth the tiny performance benefit (although I think assigning to a local variable within the method would yield the same result).

although I think assigning to a local variable within the method would yield the same result

Not quite, the idea was that I already have heap in a local in the caller, so if I pass it as an argument I save a field read and as an added bonus get a smaller method that inlines better.

long-term maintenance as worth the tiny performance benefit

With this class in particular I'm not sure the argument holds. Isn't the whole point of it the ability to mutate top and resort via updateTop as an optimization over the JDKs priority queue? If the implementation is slower than java.util.PriorityQueue, then what's the point? :) Also, I'm still not sure I agree with the "tiny" part :)
Granted there's limits to the benchmark data provided, but it's more likely than not that a couple things improved by 3%+ isn't it? Plus, I could see a possible compounding effect with further optimizations in the users of the PQ if those can be reduced in size enough to have lessThan inline and not be a megamorphic callsite here and there.

Not quite, the idea was that I already have heap in a local in the caller, so if I pass it as an argument I save a field read and as an added bonus get a smaller method that inlines better.

I did understand the intention but I think the difference, if any, will be noticeable only if the loop doesn't hoist out the field read (which, I think it should?). My suggestion keeps the variables local, which helps in understanding of what it does. But anyway. I'm not entirely sold on these low-level optimizations that target c2/hotspot. There is so many moving parts here... operating system and CPU included. Eh.

Isn't the whole point of it the ability to mutate top and resort via updateTop as an optimization over the JDKs priority queue? If the implementation is slower than java.util.PriorityQueue, then what's the point? :)

I believe the differences were also functional - insertWithOverflow is one particular example that comes to mind and would require more complex logic in the JDK's PQ. Another is resigning from one level of indirection (method instead of Comparator) - these choices predate a lot of newer Java's offerings - perhaps it could be implemented in a different way now.

Whether or not it's worth doing this kind of optimization for the observed gain is a tricky question. From the perspective of a user of a large (and read-heavy) ES, Opensearch or similar deployment, an O(1%) gain might translate into a lot of dollars saved and this kind of thing is well worth the effort.
Personally, an extra 10 lines of code for the observed speedups seems like a reasonable deal, but that's admittedly quite subjective. Maybe a stronger argument would be: optimizing this kind of thing in core hot code removes potential bottlenecks from the system, enabling other optimizations. If the core logic puts massive pressure on e.g. the CPU cache then optimizations (or regressions!) in higher-level code are masked on CPUs with smaller caches. So doing a 1% optimization and living with slightly more complicated code makes more sense here than a 1% gain would in more "peripheral" code. Also, you could use that same angle and argue that this code hardly ever gets touched, so the maintenance burden added matters less than it would elsewhere.

That said :) as far as the technical details go I don't think it can hoist out those reads and it's not an exclusively C2/hotstpot specific thing either. Since Java allows using reflection to update final field values (except for fields that are either static, on a record or on a hidden classs) the compiler can't hoist the field access out of the loop I think (maybe in some happy cases escape analysis helps here).
You can make the JIT hoist these things via -XX:+TrustFinalNonStaticFields which gives me a result like (main vs main with that flag set).

results

TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value HighTermTitleBDVSort 25.72 (7.0%) 25.77 (6.5%) 0.2% ( -12% - 14%) 0.897 BrowseRandomLabelSSDVFacets 3.37 (6.0%) 3.40 (4.2%) 1.0% ( -8% - 11%) 0.409 OrHighMed 214.06 (3.7%) 216.37 (3.8%) 1.1% ( -6% - 8%) 0.206 OrHighNotHigh 353.55 (8.4%) 358.52 (8.9%) 1.4% ( -14% - 20%) 0.475 AndHighHigh 111.32 (4.9%) 113.08 (5.5%) 1.6% ( -8% - 12%) 0.179 OrNotHighHigh 567.88 (4.8%) 577.94 (4.9%) 1.8% ( -7% - 12%) 0.108 PKLookup 241.21 (2.1%) 245.53 (2.1%) 1.8% ( -2% - 6%) 0.000 HighTerm 455.94 (6.6%) 464.35 (7.5%) 1.8% ( -11% - 17%) 0.250 MedTerm 590.06 (6.5%) 601.24 (6.0%) 1.9% ( -9% - 15%) 0.182 AndHighMed 156.22 (3.1%) 159.19 (2.9%) 1.9% ( -3% - 8%) 0.005 LowTerm 750.87 (4.6%) 765.45 (4.2%) 1.9% ( -6% - 11%) 0.052 BrowseRandomLabelTaxoFacets 4.48 (8.6%) 4.57 (3.9%) 2.0% ( -9% - 15%) 0.182 OrNotHighMed 479.29 (4.6%) 489.00 (5.4%) 2.0% ( -7% - 12%) 0.074 HighTermMonthSort 1515.68 (6.4%) 1546.97 (7.0%) 2.1% ( -10% - 16%) 0.171 OrHighHigh 85.48 (4.6%) 87.32 (5.3%) 2.2% ( -7% - 12%) 0.055 MedTermDayTaxoFacets 19.13 (3.0%) 19.55 (4.1%) 2.2% ( -4% - 9%) 0.007 MedIntervalsOrdered 28.59 (6.3%) 29.23 (4.7%) 2.2% ( -8% - 14%) 0.079 OrHighLow 610.70 (5.0%) 624.94 (5.0%) 2.3% ( -7% - 13%) 0.040 OrHighNotMed 474.52 (5.5%) 485.78 (5.7%) 2.4% ( -8% - 14%) 0.061 Fuzzy2 66.51 (3.2%) 68.09 (3.0%) 2.4% ( -3% - 8%) 0.001 BrowseDateSSDVFacets 1.24 (7.7%) 1.27 (8.1%) 2.4% ( -12% - 19%) 0.181 MedSpanNear 119.05 (4.4%) 121.94 (4.4%) 2.4% ( -6% - 11%) 0.016 HighTermTitleSort 76.83 (4.8%) 78.72 (3.7%) 2.5% ( -5% - 11%) 0.011 AndHighHighDayTaxoFacets 14.60 (3.8%) 14.96 (3.5%) 2.5% ( -4% - 10%) 0.003 BrowseMonthTaxoFacets 11.04 (38.5%) 11.32 (40.3%) 2.5% ( -55% - 132%) 0.778 OrNotHighLow 1089.24 (4.0%) 1117.30 (4.0%) 2.6% ( -5% - 10%) 0.004 TermDTSort 188.79 (4.6%) 193.74 (4.9%) 2.6% ( -6% - 12%) 0.015 Wildcard 426.59 (4.2%) 437.79 (4.2%) 2.6% ( -5% - 11%) 0.006 MedPhrase 78.10 (3.4%) 80.38 (3.2%) 2.9% ( -3% - 9%) 0.000 Prefix3 1068.70 (7.7%) 1100.07 (7.7%) 2.9% ( -11% - 19%) 0.094 AndHighLow 1546.10 (5.3%) 1591.97 (6.0%) 3.0% ( -7% - 15%) 0.020 LowIntervalsOrdered 134.11 (6.2%) 138.10 (5.0%) 3.0% ( -7% - 15%) 0.019 MedSloppyPhrase 47.07 (4.5%) 48.49 (3.7%) 3.0% ( -5% - 11%) 0.001 AndHighMedDayTaxoFacets 65.36 (2.3%) 67.38 (2.3%) 3.1% ( -1% - 7%) 0.000 LowSpanNear 175.93 (3.7%) 181.36 (4.7%) 3.1% ( -5% - 11%) 0.001 HighPhrase 131.54 (7.2%) 135.70 (5.8%) 3.2% ( -9% - 17%) 0.033 Fuzzy1 108.08 (3.4%) 111.62 (2.1%) 3.3% ( -2% - 9%) 0.000 BrowseDayOfYearSSDVFacets 4.52 (7.7%) 4.67 (7.9%) 3.4% ( -11% - 20%) 0.056 OrHighNotLow 550.21 (7.0%) 569.01 (7.9%) 3.4% ( -10% - 19%) 0.043 HighTermDayOfYearSort 380.03 (7.6%) 393.27 (6.6%) 3.5% ( -9% - 19%) 0.030 HighSpanNear 11.37 (4.5%) 11.77 (6.0%) 3.5% ( -6% - 14%) 0.004 Respell 54.77 (1.6%) 56.69 (1.7%) 3.5% ( 0% - 6%) 0.000 HighSloppyPhrase 30.28 (5.2%) 31.40 (4.8%) 3.7% ( -5% - 14%) 0.001 LowPhrase 76.63 (5.6%) 79.65 (5.6%) 3.9% ( -6% - 16%) 0.002 OrHighMedDayTaxoFacets 6.78 (6.2%) 7.05 (6.9%) 4.0% ( -8% - 18%) 0.007 IntNRQ 78.26 (6.5%) 81.38 (7.2%) 4.0% ( -9% - 18%) 0.010 LowSloppyPhrase 65.45 (6.6%) 68.14 (6.0%) 4.1% ( -7% - 17%) 0.004 HighIntervalsOrdered 9.16 (6.5%) 9.59 (5.9%) 4.6% ( -7% - 18%) 0.001 BrowseMonthSSDVFacets 4.48 (10.4%) 4.70 (12.4%) 4.8% ( -16% - 30%) 0.062 BrowseDateTaxoFacets 5.38 (10.8%) 5.67 (12.7%) 5.4% ( -16% - 32%) 0.043 BrowseDayOfYearTaxoFacets 5.44 (10.6%) 5.74 (12.7%) 5.5% ( -16% - 32%) 0.039

So to me it feels like manually hoisting field access is a generally valid optimization in a world that has reflective writes to final fields. To me, reducing field access is not in the same category as e.g. extracting cold paths artifically to make a method inline or other such tricks that are specific to C2 and hardware. This is just giving the compiler input that it cannot practically work out with the constraints imposed by the language and the JIT's runtime cost needing

the compiler can't hoist the field access out of the loop I think (maybe in some happy cases escape analysis helps here).

I don't think there's anything in the spec preventing it from doing so. The final keyword is indeed for the java compiler, not for the jvm, but... you know - it's easy to show that c2 can happily hoist out field reads, try it.

public final class SuperSoft { private static boolean ready; public static void startThread() { new Thread() { public void run() { try { sleep(2000); } catch (Exception e) { /* ignore */ } System.out.println("Marking loop exit."); ready = true; } }.start(); } public static void main(String[] args) { startThread(); System.out.println("Entering the loop..."); while (!ready) { // Do nothing. } System.out.println("Done, I left the loop!"); } }

This aside, I am not rejecting the change - I just suggested to rename one local variable (s) and to remove method parameter in favor of a single local variable read - this should result in identical code to what your 1% gain was producing, if my gut feeling is right.

Whether or not it's worth doing this kind of optimization for the observed gain is a tricky question

We've done such optimizations in the past for very hot hotspots in Lucene, e.g. readVInt, all the carefully gen'd code for decoding int[] blocks in different bit widths, etc. But it clearly is a tricky judgement call in each case...

mikemccand

The gains measured by luceneutil are quite surprising ... Lucene's PQ is clearly a hot hotspot.

mikemccand · 2024-10-21T13:52:48Z

lucene/core/src/java/org/apache/lucene/util/PriorityQueue.java

@@ -117,7 +117,8 @@ public PriorityQueue(int maxSize, Supplier<T> sentinelObjectSupplier) {
   * ArrayIndexOutOfBoundsException} is thrown.
   */
  public void addAll(Collection<T> elements) {
-    if (this.size + elements.size() > this.maxSize) {
+    int size = this.size;


Could you add comments explaining that the local variable assignment is done on purpose for performance reasons? We don't want a future refactoring to "simplify" this code and cut back to this.size.

mikemccand · 2024-10-21T13:55:26Z

lucene/core/src/java/org/apache/lucene/util/PriorityQueue.java

@@ -283,7 +293,7 @@ private final boolean upHeap(int origPos) {
    return i != origPos;
  }

-  private final void downHeap(int i) {
+  private void downHeap(int i, T[] heap, int size) {


Why are we removing final on upHeap and downHeap? Does that somehow help performance?

mikemccand · 2024-10-21T13:56:54Z

lucene/core/src/java/org/apache/lucene/util/PriorityQueue.java

@@ -270,7 +280,7 @@ public final boolean remove(T element) {
    return false;
  }

-  private final boolean upHeap(int origPos) {
+  private boolean upHeap(int origPos, T[] heap) {


Whether or not it's worth doing this kind of optimization for the observed gain is a tricky question

We've done such optimizations in the past for very hot hotspots in Lucene, e.g. readVInt, all the carefully gen'd code for decoding int[] blocks in different bit widths, etc. But it clearly is a tricky judgement call in each case...

github-actions · 2024-11-05T00:22:15Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

Speedup PriorityQueue a little

d60183b

Saving some field accesses.

dweiss reviewed Oct 20, 2024

View reviewed changes

rename

aacd539

original-brownbear requested a review from dweiss October 20, 2024 15:50

mikemccand reviewed Oct 21, 2024

View reviewed changes

github-actions bot added the Stale label Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speedup PriorityQueue a little #13936

Speedup PriorityQueue a little #13936

original-brownbear commented Oct 19, 2024

rmuir commented Oct 20, 2024

original-brownbear commented Oct 20, 2024

dweiss Oct 20, 2024

original-brownbear Oct 20, 2024

dweiss Oct 20, 2024

original-brownbear Oct 20, 2024

original-brownbear Oct 20, 2024

dweiss Oct 20, 2024

original-brownbear Oct 20, 2024

dweiss Oct 21, 2024

mikemccand Oct 21, 2024

mikemccand left a comment

mikemccand Oct 21, 2024

mikemccand Oct 21, 2024

mikemccand Oct 21, 2024

github-actions bot commented Nov 5, 2024

Speedup PriorityQueue a little #13936

Are you sure you want to change the base?

Speedup PriorityQueue a little #13936

Conversation

original-brownbear commented Oct 19, 2024

rmuir commented Oct 20, 2024

original-brownbear commented Oct 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mikemccand left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Nov 5, 2024