@fwyzard
Contributor

@fwyzard fwyzard commented Jan 11, 2026

-Ofast implies -funsafe-math-optimizations, which affects the global FP state for all programs: when used at link time, it may include libraries or startup files that change the default FPU control word or other similar optimisations.

-Ofast enables, directly or indirectly:

We should not use -fallow-store-data-races, -ffinite-math-only.

We should not use -funsafe-math-optimizations at link time, as it affects the global FP state. It's probably easier to replace it with the individual options -fno-signed-zeros, -fno-trapping-math and -fassociative-math.

We may revisit -fcx-limited-range if we do use complex arithmetic.
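To make the -fcx-limited-range trade-off concrete, here is a minimal Python sketch of complex division with and without a range-reduction step. It is only an illustration: the function names are hypothetical, the scaled version uses Smith's algorithm as a stand-in for "a range reduction step", and nothing here reflects the code GCC actually emits.

```python
import math


def naive_divide(a, b, c, d):
    """(a + bi) / (c + di) by the textbook formula, with no range reduction."""
    denom = c * c + d * d
    return ((a * c + b * d) / denom, (b * c - a * d) / denom)


def scaled_divide(a, b, c, d):
    """Same quotient using Smith's scaling to keep intermediates in range.

    Division by 0 + 0i is not handled and raises ZeroDivisionError.
    """
    if abs(c) >= abs(d):
        ratio = d / c
        denom = c + d * ratio
        return ((a + b * ratio) / denom, (b - a * ratio) / denom)
    ratio = c / d
    denom = c * ratio + d
    return ((a * ratio + b) / denom, (b * ratio - a) / denom)


big = 1e200
# c*c overflows to inf in the naive formula, so both parts become NaN.
naive = naive_divide(big, big, big, big)
# The scaled version never squares the operands and returns 1 + 0i.
scaled = scaled_divide(big, big, big, big)
print(naive, scaled)
```

Whether this matters for us depends on whether any complex division sees operands near the overflow or underflow range.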

-freciprocal-math was already disabled explicitly, see #8280.

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard for branch IB/CMSSW_16_1_X/master.

@akritkbehera, @cmsbuild, @iarspider, @raoatifshad, @smuzaffar can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Jan 11, 2026

cms-bot internal usage

@fwyzard
Contributor Author

fwyzard commented Jan 11, 2026

  • -fallow-store-data-races

    Allow the compiler to perform optimizations that may introduce new data races on stores, without proving that the variable cannot be concurrently accessed by other threads. Does not affect optimization of local data. It is safe to use this option if it is known that global data will not be accessed by multiple threads.

    Examples of optimizations enabled by -fallow-store-data-races include hoisting or if-conversions that may cause a value that was already in memory to be re-written with that same value. Such re-writing is safe in a single threaded context but may be unsafe in a multi-threaded context.
    Note that on some processors, if-conversions may be required in order to enable vectorization.

  • -fno-semantic-interposition

    Some object formats, like ELF, allow interposing of symbols by the dynamic linker. This means that for symbols exported from the DSO, the compiler cannot perform interprocedural propagation, inlining and other optimizations in anticipation that the function or variable in question may change. While this feature is useful, for example, to rewrite memory allocation functions by a debugging implementation, it is expensive in the terms of code quality.
    With -fno-semantic-interposition the compiler assumes that if interposition happens for functions the overwriting function will have precisely the same semantics (and side effects). Similarly if interposition happens for variables, the constructor of the variable will be the same.
    The flag has no effect for functions explicitly declared inline (where it is never allowed for interposition to change semantics) and for symbols explicitly declared weak.

  • -ffast-math

    Sets the options -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math (default), -fno-signaling-nans (default), -fcx-limited-range and -fexcess-precision=fast.

    This option causes the preprocessor macro __FAST_MATH__ to be defined.

  • -fno-math-errno

    Do not set errno after calling math functions that are executed with a single instruction, e.g., sqrt. A program that relies on IEEE exceptions for math error handling may want to use this flag for speed while maintaining IEEE arithmetic compatibility.

  • -funsafe-math-optimizations

    Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards.
    When used at link time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.

    Enables -fno-signed-zeros, -fno-trapping-math, -fassociative-math and
    -freciprocal-math.

  • -ffinite-math-only

    Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs.

  • -frounding-math (disabled by default)

    Disable transformations and optimizations that assume default floating-point rounding behavior. This is round-to-zero for all floating point to integer conversions, and round-to-nearest for all other arithmetic truncations. This option should be specified for programs that change the FP rounding mode dynamically, or that may be executed with a non-default rounding mode. This option disables constant folding of floating-point expressions at compile time (which may be affected by rounding mode) and arithmetic transformations that are unsafe in the presence of sign-dependent rounding modes.

  • -fsignaling-nans (disabled by default)

    Compile code assuming that IEEE signaling NaNs may generate user-visible traps during floating-point operations. Setting this option disables optimizations that may change the number of exceptions visible with signaling NaNs. This option implies -ftrapping-math.

    This option causes the preprocessor macro __SUPPORT_SNAN__ to be defined.

  • -fcx-limited-range (complex numbers)

    When enabled, this option states that a range reduction step is not needed when performing complex division. Also, there is no checking whether the result of a complex multiplication or division is NaN + I*NaN, with an attempt to rescue the situation in that case.

  • -fexcess-precision=fast

    This option allows further control over excess precision on machines where floating-point operations occur in a format with more precision or range than the IEEE standard and interchange floating-point types. By default, -fexcess-precision=fast is in effect; this means that operations may be carried out in a wider precision than the types specified in the source if that would result in faster code, and it is unpredictable when rounding to the types specified in the source code takes place.
    When compiling C or C++, if -fexcess-precision=standard is specified then excess precision follows the rules specified in ISO C99 or C++; in particular, both casts and assignments cause values to be rounded to their semantic types (whereas -ffloat-store only affects assignments).
    The -fexcess-precision=standard setting is enabled by default for C or C++ if a strict conformance option such as -std=c99 or -std=c++17 is used.

  • -fno-signed-zeros

    Allow optimizations for floating-point arithmetic that ignore the signedness of zero. IEEE arithmetic specifies the behavior of distinct +0.0 and −0.0 values, which then prohibits simplification of expressions such as x+0.0 or 0.0*x (even with -ffinite-math-only). This option implies that the sign of a zero result isn't significant.

  • -fno-trapping-math

    Compile code assuming that floating-point operations cannot generate user-visible traps. These traps include division by zero, overflow, underflow, inexact result and invalid operation. This option requires that -fno-signaling-nans be in effect. Setting this option may allow faster code if one relies on "non-stop" IEEE arithmetic, for example.

  • -fassociative-math

    Allow re-association of operands in series of floating-point operations. This violates the ISO C and C++ language standard by possibly changing computation result. NOTE: re-ordering may change the sign of zero as well as ignore NaNs and inhibit or create underflow or overflow (and thus cannot be used on code that relies on rounding behavior like (x + 2**52) - 2**52). May also reorder floating-point comparisons and thus may not be used when ordered comparisons are required. This option requires that both -fno-signed-zeros and -fno-trapping-math be in effect. Moreover, it doesn't make much sense with -frounding-math.

  • -freciprocal-math

    Allow the reciprocal of a value to be used instead of dividing by the value if this enables optimizations. For example x / y can be replaced with x * (1/y), which is useful if (1/y) is subject to common subexpression elimination. Note that this loses precision and increases the number of flops operating on the value.
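As a quick illustration of three of the descriptions above (-fno-signed-zeros, -fassociative-math and -freciprocal-math), the following Python sketch shows the strict IEEE double results that these options allow the compiler to change. Python floats always follow IEEE semantics, so this shows the reference behaviour only, not what GCC generates; the helper names are made up for the example.

```python
import math


def sign_of(x):
    """Return +1.0 or -1.0 according to the sign bit, so -0.0 is visible."""
    return math.copysign(1.0, x)


# -fno-signed-zeros: x + 0.0 cannot be simplified to x under IEEE rules,
# because (-0.0) + 0.0 is +0.0, and 0.0 * x is -0.0 for negative x.
sum_sign = sign_of(-0.0 + 0.0)
prod_sign = sign_of(0.0 * -1.0)


# -fassociative-math: the (x + 2**52) - 2**52 trick rounds x to an integer
# only if the additions are evaluated in the written order.
def round_via_magic(x):
    magic = 2.0 ** 52
    return (x + magic) - magic


def reassociated(x):
    magic = 2.0 ** 52
    return x + (magic - magic)  # what a re-association could turn it into


# -freciprocal-math: x / y versus x * (1 / y) can differ in the last bit.
quotient = 3.0 / 10.0
via_reciprocal = 3.0 * (1.0 / 10.0)

print(sum_sign, prod_sign)
print(round_via_magic(0.7), reassociated(0.7))
print(quotient, via_reciprocal)
```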

@fwyzard
Contributor Author

fwyzard commented Jan 11, 2026

enable gpu

@fwyzard
Contributor Author

fwyzard commented Jan 11, 2026

please test

@cmsbuild
Contributor

-1

Failed Tests: UnitTests nvidia_l40sUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-919f35/50502/summary.html
COMMIT: 6a383ac
CMSSW: CMSSW_16_1_X_2026-01-11-0000/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10267/50502/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed Unit Tests

I found 2 errors in the following unit tests:

---> test Miscellanea had ERRORS
---> test createDBObjects had ERRORS

Comparison Summary

The workflows 2025.0010001, 2024.0050001, 2023.0020001 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

Summary:

  • You potentially removed 13 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 48986 differences found in the comparisons
  • Reco comparison had 2 failed jobs
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4071839
  • DQMHistoTests: Total failures: 32646
  • DQMHistoTests: Total nulls: 100
  • DQMHistoTests: Total successes: 4039073
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -20.945 KiB( 51 files compared)
  • DQMHistoSizes: changed ( 2023.0020001 ): -20.484 KiB Hcal/DigiRunHarvesting
  • DQMHistoSizes: changed ( 2023.0020001 ): -0.539 KiB RPC/DCSInfo
  • DQMHistoSizes: changed ( 2023.0020001 ): 0.043 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 2023.0020001 ): 0.005 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 2024.0000001 ): 0.004 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 2024.0040001 ): -0.012 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 2024.0050001 ): 0.031 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 2024.0060001 ): 0.008 KiB JetMET/SUSYDQM
  • Checked 227 log files, 198 edm output root files, 53 DQM output files
  • TriggerResults: found differences in 2 / 51 workflows

AMD_MI300X Comparison Summary

Summary:

AMD_W7900 Comparison Summary

Summary:

NVIDIA_H100 Comparison Summary

Summary:

NVIDIA_L40S Comparison Summary

Summary:

NVIDIA_T4 Comparison Summary

Summary:

@makortel
Contributor

While I agree with the premise of the PR description, I wonder if these options

We should not use -fallow-store-data-races, -ffinite-math-only.

We should not use -funsafe-math-optimizations at link time, as it affects the global FP state. It's probably easier to replace it with the individual options -fno-signed-zeros, -fno-trapping-math and -fassociative-math.

impact the GCC auto-vectorization? (that, IIRC, was one major motivation for using -ffast-math in the first place)
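For context on why re-association and vectorization are linked: a vectorized sum accumulates several partial sums in separate lanes and combines them at the end, which changes the order of the additions. The Python sketch below mimics that with a hypothetical two-lane reduction; it is a model of the arithmetic only and says nothing about what GCC actually does with a given loop.

```python
def sequential_sum(values):
    """Add the values strictly left to right, as the source loop is written."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes):
    """Accumulate `lanes` strided partial sums, then combine them."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


values = [1e16, 1.0, -1e16, 1.0]
# Left to right, the first 1.0 is absorbed by 1e16 and lost.
print(sequential_sum(values))
# With two lanes, 1e16 cancels against -1e16 and both 1.0 terms survive.
print(lane_sum(values, 2))
```

Under strict IEEE rules the compiler has to preserve the first result, so a floating-point reduction like this can only be vectorized if re-association is allowed.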

@fwyzard
Contributor Author

fwyzard commented Jan 12, 2026

IMHO -funsafe-math-optimizations should not be used, and should be replaced by (a subset of) the options it enables.

We could look into enabling it at compile time and not at link time, but I'm a bit wary of this mixed approach :-(

@fwyzard
Contributor Author

fwyzard commented Jan 12, 2026

-fallow-store-data-races seems dangerous from its description, but indeed I don't have any concrete evidence against it.
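To spell out what the description seems to be warning about, here is a deterministic Python simulation of an if-conversion that rewrites a value already in memory. The interleaving is written out by hand, the function names are hypothetical, and this is one reading of the GCC documentation rather than observed compiler output.

```python
def run_original(cond):
    """Source as written: thread A stores to x only when cond is true."""
    shared = {"x": 0}
    shared["x"] = 5          # thread B: x = 5
    if cond:                 # thread A: if (cond) x = 1;
        shared["x"] = 1
    return shared["x"]


def run_if_converted(cond):
    """After if-conversion: thread A always stores, possibly the old value."""
    shared = {"x": 0}
    tmp = shared["x"]        # thread A: load x (reads 0)
    shared["x"] = 5          # thread B: x = 5, between A's load and store
    shared["x"] = 1 if cond else tmp   # thread A: unconditional store
    return shared["x"]


# With cond false, the original never touches x, so B's store survives.
print(run_original(False))
# The converted form writes the stale 0 back and B's store is lost.
print(run_if_converted(False))
```

In the original program thread A never writes x when cond is false, so there is no race in the source; the store only appears after the transformation.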

@fwyzard
Contributor Author

fwyzard commented Jan 12, 2026

-ffinite-math-only is pretty annoying, because it also disables explicit checks like isnan().
While we do have a workaround in CMSSW, isnan() is also used in externals such as Eigen, and we cannot easily patch those.
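For readers unfamiliar with this kind of workaround: a NaN test can be written on the bit pattern instead of on the floating-point value, so it does not depend on any assumption about finite arithmetic. The Python sketch below shows the idea for IEEE doubles; `is_nan_bits` is a hypothetical name and this is not the CMSSW implementation.

```python
import struct


def is_nan_bits(x):
    """Detect NaN from the raw IEEE 754 double encoding.

    A double is NaN when all 11 exponent bits are set and the 52-bit
    mantissa is non-zero; an all-zero mantissa means infinity instead.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0x7FF and mantissa != 0


print(is_nan_bits(float("nan")), is_nan_bits(float("inf")), is_nan_bits(1.0))
```

The catch is the one noted above: a check like this has to be used explicitly, so it does not help code in externals that calls isnan() directly.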

@dan131riley

We could look into enabling it at compile time and not at link time, but I'm a bit wary of this mixed approach :-(

Does that even work with LTO? (OTOH, I thought mucking with the global FP state in the shared library init was removed in gcc13.)

@fwyzard
Contributor Author

fwyzard commented Jan 12, 2026

thought mucking with the global FP state in the shared library init was removed in gcc13

that would be good :-)

It's still mentioned in the 13.4 documentation, though 🤔

@fwyzard
Contributor Author

fwyzard commented Jan 12, 2026

Let's see what the impact is of removing -Ofast only from libxz in #10273 ...

@dan131riley

thought mucking with the global FP state in the shared library init was removed in gcc13

that would be good :-)

It's still mentioned in the 13.4 documentation, though 🤔

'-fast' at link time of an executable (e.g., cmsRun) will include initialization code that changes the FP state, but from the gcc13 release notes:

-Ofast, -ffast-math and -funsafe-math-optimizations will no longer add startup code to alter the floating-point environment when producing a shared object with -shared.
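As background on what "alter the floating-point environment" can mean in practice: as far as I understand, on x86 the startup code in question turns on flush-to-zero handling of subnormal numbers, although that detail is not stated in this thread and should be treated as an assumption. The Python sketch below only models the numerical effect; `flush_to_zero` is a hypothetical helper, since Python itself cannot change the FP control state.

```python
import math
import sys


def is_subnormal(x):
    """True for non-zero doubles smaller in magnitude than the smallest normal."""
    return x != 0.0 and abs(x) < sys.float_info.min


def flush_to_zero(x):
    """Model of flush-to-zero: subnormal values are replaced by a signed zero."""
    return math.copysign(0.0, x) if is_subnormal(x) else x


half_min = sys.float_info.min / 2
# Under default IEEE behaviour the subnormal is kept and the product is exact.
print(half_min > 0.0, half_min * 2 == sys.float_info.min)
# Under the modelled flush-to-zero mode the same value is treated as zero.
print(flush_to_zero(half_min))
```

Because such a mode is process-wide, every library loaded into the process is affected, not just the code compiled with the option.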

@dan131riley

-fallow-store-data-races seems dangerous from its description, but indeed I don't have any concrete evidence against it.

I'm not absolutely sure, but from my reading I believe what the description means is that it can introduce data races for programs that assume a strongly-ordered memory model. It should be ok for programs that conform to the C++ memory model with appropriate use of atomics, and it may allow some vectorization optimizations that would otherwise be forbidden.

@srlantz

srlantz commented Jan 12, 2026

FWIW, vectorization of trig functions has been important to mkFit timing performance. The most recent versions of mkFit accomplish this vectorization through VDT, which is a header-only library. The latest version of VDT (0.4.6, April 2025) changed the preferred optimization flags from -O3 -ffast-math to just -O3. Therefore I wonder if @dpiparo has any insights on this.

@fwyzard
Contributor Author

fwyzard commented Jan 12, 2026

Therefore I wonder if @dpiparo has any insights on this.

Or @VinInn .

@makortel
Contributor

The latest version of VDT (0.4.6, April 2025) changed the preferred optimization flags from -O3 -ffast-math to just -O3.

I guess we should update VDT anyhow

### RPM cms vdt 0.4.3

@srlantz

srlantz commented Jan 13, 2026

Single-threaded performance tests with mkFit standalone (VDT included) confirm that deleting -Ofast and instead substituting -O3 and the specific options selected by this PR results in a 40% slowdown on the step that is most sensitive to vectorization: initialStep or iteration 0, where all Si strip hits are available for track finding. (Test input: 100 ttbar events at 14 TeV, pileup 50, phase 1 detector.)

If the -Ofast suboptions that are omitted by this PR are re-added to the list, i.e., -fallow-store-data-races -funsafe-math-optimizations -ffinite-math-only, then the prior timing performance is recovered. Of these, -funsafe-math-optimizations is observed to restore more than half of the missing performance, while -fallow-store-data-races gives an improvement of only 1-2%.

Note that the suboptions of -Ofast that are selected for this PR seem somewhat inconsistent:

  • The option -fno-signed-zeros appears twice in the list.
  • The option -fassociative-math requires both -fno-signed-zeros and -fno-trapping-math, but only the first of these appears. Therefore, -fassociative-math is ignored, and warnings about it are issued.
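Both inconsistencies could be caught mechanically. The Python sketch below is a hypothetical checker, not part of cmsdist or cms-bot: it encodes only the prerequisite and implication quoted from the GCC documentation in this thread, and it ignores option order and -fno-... negations. The example flag list is reconstructed from the two points above, not copied from the spec file.

```python
from collections import Counter

# Prerequisites as stated in the GCC option descriptions quoted in this thread.
REQUIRES = {
    "-fassociative-math": ["-fno-signed-zeros", "-fno-trapping-math"],
}

# Options that enable others, also taken from the quoted descriptions.
IMPLIES = {
    "-funsafe-math-optimizations": [
        "-fno-signed-zeros",
        "-fno-trapping-math",
        "-fassociative-math",
        "-freciprocal-math",
    ],
}


def check_flags(flags):
    """Return a list of human-readable problems found in a flag list."""
    problems = []
    for flag, count in Counter(flags).items():
        if count > 1:
            problems.append(f"{flag} appears {count} times")
    effective = set(flags)
    for flag in flags:
        effective.update(IMPLIES.get(flag, []))
    for flag, needed in REQUIRES.items():
        if flag in effective:
            for dep in needed:
                if dep not in effective:
                    problems.append(f"{flag} requires {dep}")
    return problems


first_push = ["-O3", "-fno-signed-zeros", "-fno-signed-zeros", "-fassociative-math"]
for problem in check_flags(first_push):
    print(problem)
```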

Finally, according to gcc-13 documentation, the option -funsafe-math-optimizations enables -fno-signed-zeros, -fno-trapping-math, -fassociative-math, and -freciprocal-math. However, substituting -fno-signed-zeros -fno-trapping-math -fassociative-math for -funsafe-math-optimizations -fno-reciprocal-math (as required by #8280) does not restore full performance. So these sets of options are not equivalent. It seems that the combination -funsafe-math-optimizations -fno-reciprocal-math does something more, and whatever it does seems to be important to mkFit's timing performance.

@fwyzard
Contributor Author

fwyzard commented Jan 13, 2026

Ah, thanks, I wrongly put -fno-signed-zeros -fno-signed-zeros instead of -fno-signed-zeros -fno-trapping-math.

-Ofast implies -funsafe-math-optimizations, which affects the global FP state
for all programs: when used at link time, it may include libraries or startup
files that change the default FPU control word or other similar optimizations.

-Ofast enables (directly or indirectly):
  - ❌ -fallow-store-data-races
  - ✔ -fno-semantic-interposition
  - ✔ -fno-math-errno
  - ❌ -funsafe-math-optimizations
  - ❌ -ffinite-math-only
  - ✔ -fno-rounding-math (_default_)
  - ✔ -fno-signaling-nans (_default_)
  - ❔ -fcx-limited-range
  - ✔ -fexcess-precision=fast
  - ✔ -fno-signed-zeros
  - ✔ -fno-trapping-math
  - ✔ -fassociative-math
  - ❌ -freciprocal-math (disabled in cms-sw#8280)

We should not use -fallow-store-data-races, -ffinite-math-only.

We should not use -funsafe-math-optimizations at link time, as it affects the
global FP state. It's probably easier to replace it with the individual options
-fno-signed-zeros, -fno-trapping-math and -fassociative-math.

We may revisit -fcx-limited-range if we do use Complex arithmetics.

-freciprocal-math was already disabled explicitly, see cms-sw#8280.
@fwyzard fwyzard force-pushed the IB/CMSSW_16_1_X/master_no_Ofast branch from 6a383ac to 70c5c86 Compare January 13, 2026 22:54
@fwyzard
Contributor Author

fwyzard commented Jan 13, 2026

please test

@cmsbuild
Contributor

Pull request #10267 was updated.

@fwyzard
Contributor Author

fwyzard commented Jan 13, 2026

OK, yes, browsing through the GCC code, -funsafe-math-optimizations sets its own internal flag, which enables additional transformations (e.g. a × (b + c) <-> a × b + a × c, etc).
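To show why such a rewrite is not value-preserving under IEEE rules, here is a small Python sketch comparing the factored and distributed forms on inputs chosen to overflow. The function names are made up for the example, and it illustrates the arithmetic only, not the actual GCC transformation.

```python
import math


def factored(a, b, c):
    """a * (b + c): the sum is formed first."""
    return a * (b + c)


def distributed(a, b, c):
    """a * b + a * c: each product is formed first and may overflow."""
    return a * b + a * c


a, b, c = 2.0, 1e308, -1e308
# b + c is exactly 0.0, so the factored form gives 0.0.
print(factored(a, b, c))
# Each product overflows to +inf and -inf, and their sum is NaN.
print(distributed(a, b, c))
```

For inputs well inside the representable range the two forms typically differ by at most a rounding error, which is why the rewrite is tolerable for some code and not for other code.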

@fwyzard
Contributor Author

fwyzard commented Jan 13, 2026

@srlantz how does -O3 -ffast-math -fno-finite-math-only -fno-reciprocal-math look like ?

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-919f35/50589/summary.html
COMMIT: 70c5c86
CMSSW: CMSSW_16_1_X_2026-01-13-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10267/50589/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed External Build

I found compilation warning when building: See details on the summary page.

@srlantz

srlantz commented Jan 13, 2026

how does -O3 -ffast-math -fno-finite-math-only -fno-reciprocal-math look like ?

About 20% slower, similar to omitting -funsafe-math-optimizations

@srlantz

srlantz commented Jan 13, 2026

@fwyzard note also that we generally include -fno-math-errno in all standalone compilations as I believe this is common to CMSSW builds

@fwyzard
Contributor Author

fwyzard commented Jan 13, 2026

@fwyzard note also that we generally include -fno-math-errno in all standalone compilations as I believe this is common to CMSSW builds

Sure, but that's already implied by -ffast-math.

@fwyzard
Contributor Author

fwyzard commented Jan 13, 2026

how does -O3 -ffast-math -fno-finite-math-only -fno-reciprocal-math look like ?

About 20% slower, similar to omitting -funsafe-math-optimizations

Pity, because -ffinite-math-only is really annoying.

@srlantz

srlantz commented Jan 13, 2026

how does -O3 -ffast-math -fno-finite-math-only -fno-reciprocal-math look like ?

About 20% slower, similar to omitting -funsafe-math-optimizations

Pity, because -ffinite-math-only is really annoying.

Yeah, isnan() is a hard habit to break, one naturally inserts it and expects it to work instead of silently doing nothing 😝

@cmsbuild
Contributor

-1

Failed Tests: RelVals RelVals-NVIDIA_H100 nvidia_l40sUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-919f35/50606/summary.html
COMMIT: 70c5c86
CMSSW: CMSSW_16_1_X_2026-01-13-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S,NVIDIA_T4
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10267/50606/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed RelVals

----- Begin Fatal Exception 14-Jan-2026 15:03:57 CET-----------------------
An exception of category 'OutOfBound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 4 stream: 0
   [1] Running path 'HLTriggerFinalPath'
   [2] Prefetching for module TriggerSummaryProducerAOD/'hltTriggerSummaryAOD'
   [3] Prefetching for module L1HPSPFTauProducer/'l1tHPSPFTauProducer'
   [4] Prefetching for module L1TPFCandMultiMerger/'l1tLayer1'
   [5] Prefetching for module L1TCorrelatorLayer1Producer/'l1tLayer1HGCal'
   [6] Calling method for module HGCalBackendLayer2Producer/'l1tHGCalBackEndLayer2Producer'
Exception Message:
TC X1 = 0.0713466 out of the seeding histogram bounds 0.076 - 0.58
----- End Fatal Exception -------------------------------------------------

Failed RelVals-NVIDIA_H100

  • 34634.403 34634.403_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation/step2_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation.log
  • 34634.402 34634.402_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka/step2_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka.log
  • 34634.751 34634.751_TTbar_14TeV+Run4D121PU_HLT75e33TimingAlpaka/step2_TTbar_14TeV+Run4D121PU_HLT75e33TimingAlpaka.log
