<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>amlug</title><description>Personal technical writing by Ralph Kuepper (amlug). Deep notes on compilers, systems, and the occasional side project.</description><link>https://amlug.net/</link><item><title>Your default compiler flags are leaving 8× on the table</title><link>https://amlug.net/posts/default-compiler-flags-8x/</link><guid isPermaLink="true">https://amlug.net/posts/default-compiler-flags-8x/</guid><description>Five compiled languages agree on a numeric loop to within 2%. A compiled-TypeScript experiment is 8× faster. This isn&apos;t a story about TypeScript — it&apos;s about what the other five lost by default.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Here is a number that stopped me.&lt;/p&gt;
&lt;p&gt;I ran a tight loop on an Apple M1 Max. One hundred million iterations,
adding &lt;code&gt;1.0&lt;/code&gt; to a double each time. The program was compiled and run in
eight languages. The timings, in milliseconds, best of five runs:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;loop_overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rust (&lt;code&gt;rustc -O&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C++ (&lt;code&gt;g++ -O3&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Java (OpenJDK 21, JIT)&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Swift (&lt;code&gt;swiftc -O&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go (&lt;code&gt;go build&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node.js 25 (V8)&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bun 1.3.5 (JavaScriptCore)&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compiled TypeScript (Perry, LLVM)&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Sit with the top five for a moment. Five different compilers, five
different language designs, two decades apart in age. They agree on this
loop to within 2%. That alone is worth noticing — the raw-throughput
edge that conventional wisdom grants C++ over Go is, on this benchmark,
simply not visible.&lt;/p&gt;
&lt;p&gt;And then the bottom row. Twelve. Eight times faster than the
five &quot;fast&quot; languages — built, of all things, by compiling TypeScript.&lt;/p&gt;
&lt;p&gt;This isn&apos;t a story about TypeScript being fast. It&apos;s a story about why
the five compiled languages are identical to each other, and why their
shared default output is eight times slower than it has to be.&lt;/p&gt;
&lt;h2&gt;The setup&lt;/h2&gt;
&lt;p&gt;The compiled-TypeScript entry is &lt;a href=&quot;https://perry.sh/&quot;&gt;Perry&lt;/a&gt;, an ahead-of-time compiler I
work on. It parses TS with SWC and generates native code through LLVM,
the same backend clang and rustc use. For this article Perry is a
measuring instrument — a way to isolate one specific thing: LLVM&apos;s
optimizer, when you hand it identical IR but with different flags.&lt;/p&gt;
&lt;p&gt;The benchmark suite is a set of eight compute microbenchmarks, each
ported to every language; the loop above is one of them. Full source and raw numbers are in &lt;a href=&quot;https://github.com/ralphkuepper/perry/tree/main/benchmarks/polyglot&quot;&gt;the
polyglot benchmark suite&lt;/a&gt;. I ran every benchmark in every language at
the flags its documentation recommends for release builds — nothing
extra, nothing turned off. Best of five; best of twenty on the one
benchmark (&lt;code&gt;fibonacci&lt;/code&gt;) sensitive to branch-predictor state.&lt;/p&gt;
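&lt;p&gt;For concreteness, the best-of-N protocol looks like this (a Python sketch of the harness logic, not the suite&apos;s actual scripts):&lt;/p&gt;

```python
import time

def best_of(runs, fn):
    """Best-of-N wall-clock timing in milliseconds (harness sketch)."""
    best = None
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - t0) * 1e3
        best = elapsed_ms if best is None else min(best, elapsed_ms)
    return best

ms = best_of(5, lambda: sum(range(100_000)))
assert ms >= 0.0
```

Best-of rather than mean is deliberate here: for a pure compute loop, the fastest run is the one least disturbed by the OS.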
&lt;p&gt;These are compute microbenchmarks. Before we continue: do not generalize
them to &quot;language X is 8× slower than language Y on real workloads.&quot; On
a realistic application — one that spends its time in I/O, allocation,
a scheduler, a database driver — the programming-language choice drops
to the noise floor. What these numbers probe is narrow: the compiler&apos;s
output on numeric loops with &lt;code&gt;double&lt;/code&gt; / &lt;code&gt;f64&lt;/code&gt; arithmetic. That narrow
probe, it turns out, is where the defaults get interesting.&lt;/p&gt;
&lt;p&gt;Three specific optimization choices account for every case where the
compiled-TypeScript column looks strange. I&apos;ll walk through each with
the LLVM IR to back the claim.&lt;/p&gt;
&lt;h2&gt;Optimization 1: IEEE 754 strict addition is really slow&lt;/h2&gt;
&lt;p&gt;The 99 ms Rust number is not laziness, and it&apos;s not because Rust is
worse than C. Here is what clang emits from a vanilla &lt;code&gt;-O3&lt;/code&gt; build of the
same C++ loop:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;; clang -O3 bench.cpp -S -emit-llvm, inside bench_loop_overhead:
2:                                    ; preds = %2, %0
  %3 = phi i32    [ 0,           %0 ], [ %9, %2 ]
  %4 = phi double [ 0.000000e+00, %0 ], [ %8, %2 ]
  %5 = fadd double %4, 1.000000e+00    ; serialized
  %6 = fadd double %5, 1.000000e+00    ; waits for %5
  %7 = fadd double %6, 1.000000e+00    ; waits for %6
  %8 = fadd double %7, 1.000000e+00    ; waits for %7
  %9 = add nuw i32 %3, 4
  %10 = icmp eq i32 %9, 100000000
  br i1 %10, label %11, label %2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;clang unrolled the loop four times — four &lt;code&gt;fadd&lt;/code&gt;s in the body, counter
incrementing by 4. But look at the data dependencies: each &lt;code&gt;fadd&lt;/code&gt; takes
the result of the previous &lt;code&gt;fadd&lt;/code&gt; as its input. &lt;code&gt;%6&lt;/code&gt; cannot start
until &lt;code&gt;%5&lt;/code&gt; finishes. &lt;code&gt;%7&lt;/code&gt; has to wait on &lt;code&gt;%6&lt;/code&gt;. Every instruction in the
body sits in a serial latency chain.&lt;/p&gt;
&lt;p&gt;On an M1 Max a single &lt;code&gt;fadd&lt;/code&gt; has a latency of about 3 cycles.
Four serialized fadds per loop body × 3 cycles each = 12 cycles per body.
With the 4× unrolling, 100M iterations becomes 25M loop bodies. 25M × 12
cycles = 300M cycles. At 3.2 GHz that&apos;s 94 ms. Measured: 98 ms. Close enough.&lt;/p&gt;
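&lt;p&gt;The latency arithmetic, written out (the 3-cycle latency and 3.2 GHz clock are the approximate figures used above, not exact hardware specifications):&lt;/p&gt;

```python
# Back-of-envelope model of the serialized loop above.
fadd_latency_cycles = 3          # approximate M1 fadd latency
unroll = 4                       # clang unrolled the body 4x
iterations = 100_000_000
clock_hz = 3.2e9                 # approximate performance-core clock

bodies = iterations // unroll                     # loop bodies executed
cycles = bodies * unroll * fadd_latency_cycles    # serial fadd chain
predicted_ms = cycles / clock_hz * 1e3

assert bodies == 25_000_000
assert cycles == 300_000_000
assert round(predicted_ms) == 94   # vs. 98 ms measured
```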
&lt;p&gt;&lt;strong&gt;clang cannot collapse this chain.&lt;/strong&gt; Not because it doesn&apos;t see the
pattern — it obviously does — but because IEEE 754 forbids the
transformation. Floating-point addition is not associative. For
arbitrary inputs, &lt;code&gt;(a + b) + c&lt;/code&gt; can differ from &lt;code&gt;a + (b + c)&lt;/code&gt;, because
a large intermediate in one order rounds away bits that would have
survived in the other. Programs that care about that — numerical
simulations, interval arithmetic, reproducibility guarantees — need
the result. The compiler must preserve it.&lt;/p&gt;
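&lt;p&gt;The non-associativity is easy to reproduce. In Python, which uses the same IEEE 754 doubles:&lt;/p&gt;

```python
# Near 1e16 the gap between adjacent doubles is 2.0, so a lone +1.0
# rounds away while +2.0 survives. Grouping changes the answer.
a, b, c = 1e16, 1.0, 1.0

left = (a + b) + c      # each +1.0 rounds back to 1e16
right = a + (b + c)     # +2.0 is exactly representable here

assert left == 1e16
assert right == 1e16 + 2.0
assert left != right    # floating-point addition is not associative
```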
&lt;p&gt;Now the same function with one flag added:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;; clang -O3 -ffast-math bench.cpp:
2:                                    ; preds = %2, %0
  %3 = phi i32        [ 0, %0 ],              [ %6, %2 ]
  %4 = phi &amp;lt;2 x double&amp;gt; [ zeroinitializer, %0 ], [ %5, %2 ]
  %5 = fadd fast &amp;lt;2 x double&amp;gt; %4, splat (double 4.000000e+00)
  %6 = add nuw i32 %3, 8
  %7 = icmp eq i32 %6, 100000000
  br i1 %7, label %8, label %2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One &lt;code&gt;fadd fast &amp;lt;2 x double&amp;gt;&lt;/code&gt; per iteration. Two parallel lanes, each
adding &lt;code&gt;4.0&lt;/code&gt; (because LLVM folded &lt;code&gt;(((x+1)+1)+1)+1&lt;/code&gt; into &lt;code&gt;x+4&lt;/code&gt;). Eight
additions per iteration, one vector instruction. No dependency between
iterations except the accumulator itself.&lt;/p&gt;
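&lt;p&gt;What the rewrite buys is independent dependency chains. In scalar form (a Python sketch with a tiny N so it actually runs; the values here are exact, so the reassociated answer happens to match the serial one bit for bit):&lt;/p&gt;

```python
def serial_sum(n):
    # one accumulator: every add waits on the previous one
    s = 0.0
    for _ in range(n):
        s += 1.0
    return s

def two_lane_sum(n):
    # what reassociation permits: independent accumulators a CPU can
    # run in parallel, combined once at the end
    a = 0.0
    b = 0.0
    for _ in range(n // 2):
        a += 1.0
        b += 1.0
    return a + b

assert serial_sum(1000) == two_lane_sum(1000) == 1000.0
```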
&lt;p&gt;LLVM needed &lt;code&gt;fast&lt;/code&gt; to permit the rewrite — the &lt;code&gt;fast&lt;/code&gt; flag is a bundle
that includes &lt;code&gt;reassoc&lt;/code&gt; (&quot;may reorder&quot;), &lt;code&gt;contract&lt;/code&gt; (&quot;may fuse mul+add
into fma&quot;), and five more properties covering NaN, infinity, signed zero,
reciprocal arithmetic, and approximate library functions. Turning it on says &quot;I don&apos;t care about
strict IEEE 754 anywhere in this compilation unit.&quot; clang&apos;s measured
result with the flag: &lt;strong&gt;12 ms&lt;/strong&gt;. Eight times faster than the default.&lt;/p&gt;
&lt;p&gt;Perry&apos;s generated IR for the same function carries &lt;code&gt;reassoc contract&lt;/code&gt;
on every float instruction by default — a subset of &lt;code&gt;fast&lt;/code&gt; that permits
reordering and fma contraction but preserves NaN, Inf, and &lt;code&gt;-0.0&lt;/code&gt;
semantics (which JS programs can observe). After LLVM&apos;s standard
optimization pipeline runs on Perry&apos;s naïve load/fadd/store IR, it
becomes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;vector.body:
  %vec.phi   = phi &amp;lt;2 x double&amp;gt; [...], [ %0, %vector.body ]
  %vec.phi14 = phi &amp;lt;2 x double&amp;gt; [...], [ %1, %vector.body ]
  %0 = fadd reassoc contract &amp;lt;2 x double&amp;gt; %vec.phi,   splat (double 1.0)
  %1 = fadd reassoc contract &amp;lt;2 x double&amp;gt; %vec.phi14, splat (double 1.0)
  %index.next = add nuw i32 %index, 4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two parallel &lt;code&gt;&amp;lt;2 x double&amp;gt;&lt;/code&gt; accumulators instead of clang-fast&apos;s one —
LLVM&apos;s interleave pass picked a different unroll factor here, but the
result is structurally identical: parallel fadd lanes, no serial chain.
Final disassembly shows Perry&apos;s binary running four independent
&lt;code&gt;fadd.2d&lt;/code&gt; NEON instructions per loop iteration, consuming the two FP
issue pipes M1 has available. Measured: &lt;strong&gt;12 ms&lt;/strong&gt;, the same number C++
gets with &lt;code&gt;-ffast-math&lt;/code&gt;, by a different route.&lt;/p&gt;
&lt;p&gt;Two things follow.&lt;/p&gt;
&lt;p&gt;First: &lt;strong&gt;the thing Rust and C++ lost by default was never compiler
quality. It was one bit of metadata on every fadd instruction.&lt;/strong&gt; Perry
turns that bit on in its emitter. clang turns it on when you pass
&lt;code&gt;-ffast-math&lt;/code&gt;. Both end up at the same 12 ms because both are routing
through the same LLVM optimizer. LLVM is doing the work. The languages
differ only in whether they hand LLVM the permission slip.&lt;/p&gt;
&lt;p&gt;Second: &lt;strong&gt;Go cannot participate.&lt;/strong&gt; Go&apos;s compiler has no &lt;code&gt;-ffast-math&lt;/code&gt;,
no &lt;code&gt;reassoc&lt;/code&gt; flag, and its backend does not ship a floating-point
reassociation pass. Writing the same loop in Go and building with
&lt;code&gt;go build&lt;/code&gt; — with any flags, any compiler version — produces something
indistinguishable from the 97–99 ms default cluster. This is intentional: Go&apos;s
design prioritizes predictable compiler output over absolute
throughput. It&apos;s also the cleanest instance in this whole investigation
of &quot;the default is the ceiling.&quot;&lt;/p&gt;
&lt;p&gt;For Rust, the situation is halfway. Stable Rust has no flag to toggle
&lt;code&gt;reassoc&lt;/code&gt; on individual fadd instructions. Nightly exposes
&lt;code&gt;std::intrinsics::fadd_fast&lt;/code&gt;, which takes the same loop from 99 ms to
12 ms — matching clang-fast. Manual 4-way unrolling in stable Rust
reaches 24 ms, good but not great. On this benchmark, &quot;use nightly&quot; is
a real answer if you need parity.&lt;/p&gt;
&lt;h2&gt;Optimization 2: the benchmark that fooled me&lt;/h2&gt;
&lt;p&gt;Here is &lt;code&gt;accumulate&lt;/code&gt;: loop 100 million times, do &lt;code&gt;sum += i % 1000&lt;/code&gt; on
&lt;code&gt;double&lt;/code&gt; values, report the elapsed time. My prior belief going in was
straightforward: on ARM64 there is no hardware instruction for &lt;code&gt;fmod&lt;/code&gt;
on f64. The default C++ benchmark uses &lt;code&gt;double&lt;/code&gt;, so the modulo
lowers to a libm function call — roughly 30 ns per call, 30 ns × 100M
iterations = three full seconds theoretical, something under a second
in practice once clang vectorizes around the call. Perry&apos;s type
inference recognizes the operands are integer-valued and emits &lt;code&gt;srem&lt;/code&gt;
— one hardware instruction, one cycle — which is why Perry reports 24
ms while the other languages sit at 96–99 ms.&lt;/p&gt;
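&lt;p&gt;For reference, the shape of the benchmark (a Python transliteration of the suite&apos;s loop, small n so it runs quickly):&lt;/p&gt;

```python
def accumulate(n):
    # sum += i % 1000, accumulating into a float as the JS/TS source does
    s = 0.0
    for i in range(n):
        s += i % 1000
    return s

# spot-check: two full 0..999 cycles
assert accumulate(2000) == 2 * (999 * 1000 // 2)
```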
&lt;p&gt;That story is wrong in an interesting way.&lt;/p&gt;
&lt;p&gt;Here is what clang actually emits, with default flags, for the C++
version of &lt;code&gt;accumulate&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;%9  = urem &amp;lt;4 x i32&amp;gt; %5, splat (i32 1000)
%10 = urem &amp;lt;4 x i32&amp;gt; %6, splat (i32 1000)
%11 = urem &amp;lt;4 x i32&amp;gt; %7, splat (i32 1000)
%12 = urem &amp;lt;4 x i32&amp;gt; %8, splat (i32 1000)
%13 = uitofp nneg &amp;lt;4 x i32&amp;gt; %9  to &amp;lt;4 x double&amp;gt;
; ... uitofp for the other three lanes ...
%17 = tail call double @llvm.vector.reduce.fadd.v4f64(double %4,  &amp;lt;4 x double&amp;gt; %13)
%18 = tail call double @llvm.vector.reduce.fadd.v4f64(double %17, &amp;lt;4 x double&amp;gt; %14)
%19 = tail call double @llvm.vector.reduce.fadd.v4f64(double %18, &amp;lt;4 x double&amp;gt; %15)
%20 = tail call double @llvm.vector.reduce.fadd.v4f64(double %19, &amp;lt;4 x double&amp;gt; %16)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because &lt;code&gt;i&lt;/code&gt; is declared &lt;code&gt;int&lt;/code&gt; in &lt;code&gt;bench.cpp&lt;/code&gt;, clang was free to lower
&lt;code&gt;i % 1000&lt;/code&gt; to &lt;strong&gt;vectorized integer remainder&lt;/strong&gt; — &lt;code&gt;urem &amp;lt;4 x i32&amp;gt;&lt;/code&gt;. No
&lt;code&gt;fmod&lt;/code&gt; anywhere. The C++ benchmark isn&apos;t paying the libm tax I assumed
it was.&lt;/p&gt;
&lt;p&gt;So what is the 97 ms? Look at the bottom: four &lt;code&gt;llvm.vector.reduce.fadd&lt;/code&gt;
calls, chained, each feeding the next. Without &lt;code&gt;reassoc&lt;/code&gt;, a
&lt;code&gt;vector.reduce.fadd.v4f64&lt;/code&gt; is an ordered reduction — semantically
the start value plus each of the four lanes in sequence, a serial chain
of four &lt;code&gt;fadd&lt;/code&gt;s inside. Four of those chained per iteration is
sixteen serial &lt;code&gt;fadd&lt;/code&gt;s. That&apos;s the bottleneck.&lt;/p&gt;
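&lt;p&gt;The ordering constraint is not pedantry; an ordered reduction and a reassociated one can genuinely disagree. A minimal Python demonstration:&lt;/p&gt;

```python
vals = [1e16, 1.0, 1.0, 1.0, 1.0]

# strict semantics: fold left, one fadd at a time
ordered = 0.0
for v in vals:
    ordered += v

# reassociated semantics: a pairwise tree, the shape vectorizers prefer
def pairwise(xs):
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise(xs[:mid]) + pairwise(xs[mid:])

assert ordered == 1e16               # each lone 1.0 rounded away
assert pairwise(vals) >= 1e16 + 2.0  # grouped 1.0s survive rounding
assert ordered != pairwise(vals)
```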
&lt;p&gt;Perry, on the same benchmark, compiles down to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;vector.body:
  %1 = urem i64 %index, 1000
  %2 = urem i64 %0, 1000
  %3 = uitofp nneg i64 %1 to double
  %4 = uitofp nneg i64 %2 to double
  %5 = fadd reassoc contract double %vec.phi,   %3
  %6 = fadd reassoc contract double %vec.phi15, %4
  %index.next = add nuw i64 %index, 2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two parallel scalar accumulators. Two &lt;code&gt;urem&lt;/code&gt;s, two &lt;code&gt;uitofp&lt;/code&gt;s, two
&lt;code&gt;fadd&lt;/code&gt;s, no reductions. The &lt;code&gt;urem&lt;/code&gt; was always going to be there — both
compilers found the integer remainder. The difference is that Perry&apos;s
&lt;code&gt;reassoc&lt;/code&gt; flag let LLVM hoist the accumulate out into parallel lanes
instead of a vector-reduce chain.&lt;/p&gt;
&lt;p&gt;The original story I told about Perry vs C++ on this benchmark — that
it&apos;s the &lt;code&gt;fmod&lt;/code&gt; libm call versus an &lt;code&gt;srem&lt;/code&gt; hardware instruction — turns
out to be a story about Perry vs &lt;em&gt;naïvely-compiled TypeScript&lt;/em&gt;. Perry
does have an integer-mod fast path, and it&apos;s a real optimization: if a
future TypeScript compiler on this benchmark emitted &lt;code&gt;frem double&lt;/code&gt;, it
would sit around 600 ms (Node&apos;s number: 602 ms, which is exactly this —
V8 didn&apos;t inline the fmod call). The fast path matters against that
reference point.&lt;/p&gt;
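&lt;p&gt;The soundness of that fast path is easy to sanity-check: for integer-valued doubles with magnitudes below 2^53, floating-point remainder and integer remainder agree exactly:&lt;/p&gt;

```python
import math

# For integer-valued doubles (magnitudes below 2**53), fmod and the
# integer remainder agree exactly -- the premise of an srem fast path.
for i in (0, 7, 999, 1000, 1001, 123_456_789):
    assert math.fmod(float(i), 1000.0) == float(i % 1000)
```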
&lt;p&gt;But against &lt;code&gt;clang -O3&lt;/code&gt; on the same algorithm, the fast path isn&apos;t
what&apos;s making the difference. It&apos;s the reassociation flag, again.&lt;/p&gt;
&lt;p&gt;C++ with &lt;code&gt;-O3 -ffast-math&lt;/code&gt; on &lt;code&gt;accumulate&lt;/code&gt; clocks in at 26 ms,
virtually identical to Perry&apos;s 24 ms. Rust&apos;s stable-toolchain opt variant
(switch the accumulator to &lt;code&gt;i64&lt;/code&gt; so the benchmark stays in the integer
domain) gets to 41 ms — the integer change helps, but reaching 24 ms
would require breaking up the fadd reduce chain, and that needs full
&lt;code&gt;fast&lt;/code&gt; FMF, which stable Rust doesn&apos;t expose. Nightly Rust&apos;s
&lt;code&gt;fadd_fast&lt;/code&gt; doesn&apos;t help on this benchmark either, because the
bottleneck is the shape of the reduce chain, not the permissions on
individual &lt;code&gt;fadd&lt;/code&gt;s.&lt;/p&gt;
&lt;p&gt;Go with its opt variant (&lt;code&gt;int64&lt;/code&gt; accumulator) goes from 99 ms to 70 ms,
the biggest improvement any Go benchmark saw in the opt sweep. The
delta came entirely from avoiding the &lt;code&gt;uitofp&lt;/code&gt; per iteration, not from
vectorizing the remainder. Go&apos;s compiler emitted one iteration per loop
pass, scalar &lt;code&gt;SMULH + MSUB&lt;/code&gt; for the modulo, scalar integer add. No
vectorization. 70 ms is what you get when nothing auto-parallelizes the
accumulator.&lt;/p&gt;
&lt;h2&gt;Optimization 3: it&apos;s reassoc all the way down&lt;/h2&gt;
&lt;p&gt;The third optimization the plan called for was bounds-check elimination
and i32 loop counter promotion on the &lt;code&gt;array_read&lt;/code&gt; benchmark — sum 10
million &lt;code&gt;double&lt;/code&gt;s from an array. Perry&apos;s codegen detects the classic
&lt;code&gt;for (let i = 0; i &amp;lt; arr.length; i++)&lt;/code&gt; pattern, caches &lt;code&gt;arr.length&lt;/code&gt; at
loop entry, maintains a parallel i32 counter alongside the f64 one, and
skips the JS runtime bounds check. Measured: 3 ms.&lt;/p&gt;
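&lt;p&gt;The loop shape in question, transliterated (a Python stand-in for the TS pattern, with &lt;code&gt;len(arr)&lt;/code&gt; playing the role of &lt;code&gt;arr.length&lt;/code&gt;):&lt;/p&gt;

```python
def array_read(arr):
    # the classic indexed pattern: in JS the length and bounds check
    # are nominally per-iteration; Perry caches the length at entry
    # and keeps an i32 counter alongside
    total = 0.0
    for i in range(len(arr)):
        total += arr[i]
    return total

assert array_read([0.5] * 8) == 4.0
```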
&lt;p&gt;The prediction was that the other languages would be meaningfully
slower on the default benchmarks and would snap close to 3 ms when
given the right idiom: Rust&apos;s &lt;code&gt;.iter().sum()&lt;/code&gt;, Swift&apos;s
&lt;code&gt;withUnsafeBufferPointer&lt;/code&gt;, C++&apos;s already-no-bounds &lt;code&gt;std::vector&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here&apos;s what actually happened:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;default&lt;/th&gt;
&lt;th&gt;opt&lt;/th&gt;
&lt;th&gt;delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C++&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;-89%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;-10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Swift&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Only C++ moved materially. And C++ moved because of &lt;code&gt;-ffast-math&lt;/code&gt;
(again), not bounds-elim — there are no C++ bounds to eliminate. With
&lt;code&gt;fast&lt;/code&gt; on, LLVM interleaves the array-sum reduction four lanes wide
(four parallel &lt;code&gt;&amp;lt;2 x double&amp;gt;&lt;/code&gt; accumulators, eight &lt;code&gt;f64&lt;/code&gt; per
iteration) and gets to 1 ms. That&apos;s faster than Perry&apos;s 3 ms.&lt;/p&gt;
&lt;p&gt;Rust&apos;s &lt;code&gt;.iter().sum()&lt;/code&gt; vs the indexed &lt;code&gt;for i in 0..arr.len()&lt;/code&gt; form gave
about one millisecond — within run-to-run noise. rustc at &lt;code&gt;-O&lt;/code&gt; already
proves &lt;code&gt;i &amp;lt; arr.len()&lt;/code&gt; for that classic loop shape and strips the
bounds check as dead code. There was nothing to eliminate.&lt;/p&gt;
&lt;p&gt;Swift&apos;s &lt;code&gt;UnsafeBufferPointer&lt;/code&gt; produced an identical 9 ms. The safe
indexed form was already efficient.&lt;/p&gt;
&lt;p&gt;So the third &quot;Perry optimization&quot; I set out to document turns out to be
real in Perry&apos;s source — the code in &lt;code&gt;stmt.rs&lt;/code&gt; does track &lt;code&gt;i32 counter&lt;/code&gt;
promotion and &lt;code&gt;bounded_index_pairs&lt;/code&gt; — but it isn&apos;t load-bearing on this
benchmark. The loop vectorizer&apos;s interleave factor is what separates 9
ms from 1 ms. That&apos;s an LLVM heuristic, not a bounds thing.&lt;/p&gt;
&lt;p&gt;The honest takeaway is smaller than the plan suggested: bounds-check
elimination is mostly already happening, at least in Rust and C++, for
the straight-line loops these benchmarks exercise. What isn&apos;t already
happening is aggressive vectorization under strict IEEE 754, which is
the same optimization we discussed in section 1.&lt;/p&gt;
&lt;h2&gt;Where the compiled-TypeScript side loses&lt;/h2&gt;
&lt;p&gt;Two benchmarks where Perry loses cleanly. They matter to the argument
— if the thesis were just &quot;TypeScript is faster,&quot; they&apos;d be awkward.
Since the thesis is &quot;defaults matter,&quot; they&apos;re consistent with it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;object_create&lt;/code&gt;: 0 ms (Rust, C++, Go, Swift) vs 2 ms (Perry).&lt;/strong&gt; The
benchmark allocates a million &lt;code&gt;Point{x, y}&lt;/code&gt; structs, sums fields, and
reports the time. In statically typed compiled languages, the optimizer
stack-allocates the struct, inlines the constructor, proves the struct
never escapes the loop, and eliminates the whole thing as dead code.
The measured result is zero because the work is zero. Perry cannot
match this without abandoning its dynamic value model. A recent Perry
pass (v0.5.17) does scalar-replacement for objects whose only uses are
field get/set, which is why Perry measures 2 ms and not 10 — but any
method call on the object defeats it. This is the shape of workload
where ahead-of-time compiling a dynamic language pays a real tax
against languages with static types, and no amount of flag-tuning
closes the gap.&lt;/p&gt;
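&lt;p&gt;The transformation the static languages get for free looks like this when done by hand (a Python sketch of the benchmark shape and of scalar replacement; the real benchmark uses a million points):&lt;/p&gt;

```python
def object_create(n):
    total = 0.0
    for i in range(n):
        p = {"x": float(i), "y": float(i) * 2.0}  # allocated, then dead
        total += p["x"] + p["y"]
    return total

def scalar_replaced(n):
    # what escape analysis proves equivalent: no object at all
    total = 0.0
    for i in range(n):
        x = float(i)
        y = float(i) * 2.0
        total += x + y
    return total

assert object_create(1000) == scalar_replaced(1000)
```

A static compiler can then notice that if the result is unused, the whole loop is dead; an AOT compiler for a dynamic value model has to prove much more before it can do the same.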
&lt;p&gt;&lt;strong&gt;&lt;code&gt;nested_loops&lt;/code&gt;: Perry 9 ms vs C++ opt 1 ms.&lt;/strong&gt; Same story as
&lt;code&gt;array_read&lt;/code&gt;, same cause: &lt;code&gt;-ffast-math&lt;/code&gt; enables a more aggressive
interleave factor than Perry&apos;s &lt;code&gt;reassoc contract&lt;/code&gt; subset does. Perry&apos;s
3 ms on &lt;code&gt;array_read&lt;/code&gt; and 9 ms on &lt;code&gt;nested_loops&lt;/code&gt; are both beaten by C++
opt, because &lt;code&gt;fast&lt;/code&gt; includes &lt;code&gt;nnan&lt;/code&gt; and &lt;code&gt;ninf&lt;/code&gt; permissions that the
loop vectorizer uses to pick a higher unroll. Perry deliberately does
not emit those, because JavaScript programs can observe NaN and
infinity, and dropping signed-zero semantics would break observable facts like &lt;code&gt;Object.is(Math.min(-0, 0), -0)&lt;/code&gt;. That&apos;s a real
correctness tradeoff — the ceiling Perry could hit if it stopped caring
about NaN/Inf semantics is several milliseconds faster on flat-array
sums. Right now it doesn&apos;t hit it.&lt;/p&gt;
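&lt;p&gt;Signed zero really is observable, even though ordinary equality cannot see it (Python, same IEEE 754 doubles):&lt;/p&gt;

```python
import math

assert -0.0 == 0.0                        # equality is sign-blind
assert math.copysign(1.0, -0.0) == -1.0   # but the sign is still there
assert str(-0.0) == "-0.0"                # and it prints
```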
&lt;h3&gt;An aside: where JIT beats AOT&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;fibonacci&lt;/code&gt;: Java 280 ms vs Perry 311 ms. Recursive &lt;code&gt;fib(40)&lt;/code&gt; runs
about 330 million real calls. Java&apos;s C2 JIT observes the recursion at
runtime and applies aggressive inlining based on actual hot-call
frequencies — something no AOT compiler can match without whole-program
profile data. Perry, C++, and Rust all cluster at ~310–319 ms through
LLVM; Swift at 360 ms and Go at 450 ms lose at the recursion-folding
stage inside their own backends. This benchmark is essentially a
compiler-pass quality test, not a flag-tuning target. No flag changes
any of these numbers materially.&lt;/p&gt;
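&lt;p&gt;The call count is worth pinning down, since it is what the JIT gets to observe. For the naive recursion the standard identity is &lt;code&gt;calls(n) = 2*fib(n+1) - 1&lt;/code&gt;:&lt;/p&gt;

```python
import functools

def fib(n):
    # the naive recursion the benchmark uses
    if n > 1:
        return fib(n - 1) + fib(n - 2)
    return n

@functools.lru_cache(maxsize=None)
def fib_memo(n):
    # memoized copy, used only to evaluate the call-count identity
    if n > 1:
        return fib_memo(n - 1) + fib_memo(n - 2)
    return n

calls_fib40 = 2 * fib_memo(41) - 1

assert fib(20) == 6765
assert calls_fib40 == 331_160_281   # roughly 330 million calls
```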
&lt;h2&gt;The meta-point&lt;/h2&gt;
&lt;p&gt;Rust, C++, Go, and Swift picked conservative defaults for a reason.
Their users care about reproducibility, IEEE 754 correctness, and about
not having to audit every numeric operation for the possibility the
compiler silently reassociated it. A 3D renderer that reads back the
same color from two parallel paths, a simulation that needs bit-exact
replay for debugging, a financial calculation that must be verifiably
deterministic — all of these care, and they&apos;d be angry if
&lt;code&gt;(a + b) + c != a + (b + c)&lt;/code&gt;. The languages&apos; compile defaults reflect
that population.&lt;/p&gt;
&lt;p&gt;Compiling TypeScript for a JavaScript audience is a different tradeoff.
JS programs mostly don&apos;t treat &lt;code&gt;-0.0&lt;/code&gt; distinctly from &lt;code&gt;0.0&lt;/code&gt; even when
they could. Most TS code that hits a numeric loop is a game tick, a
compiler pass, a canvas renderer — workloads where a bit of
reassociation is fine. So Perry turns &lt;code&gt;reassoc&lt;/code&gt; on by default. It isn&apos;t
braver or smarter than Rust; it serves a different population.&lt;/p&gt;
&lt;p&gt;What&apos;s interesting isn&apos;t that Perry made the call. It&apos;s that the call
is invisible in most comparisons. The numbers people see when they
benchmark &quot;Rust vs TypeScript&quot; or &quot;C++ vs JavaScript&quot; reflect the
defaults both sides picked, with no indication that one side spent
those defaults on numerical robustness and the other spent them on
throughput. The benchmarks look like they&apos;re comparing languages. They
are actually comparing flag choices.&lt;/p&gt;
&lt;p&gt;There&apos;s no meta-rule for which default is right. &quot;Enable reassoc by
default&quot; is good for numeric loops and bad for scientific simulations.
&quot;Strict IEEE by default&quot; is the opposite. Both are defensible. What
isn&apos;t defensible is concluding from benchmark tables alone that one
language is faster than another. The defaults are the experiment.&lt;/p&gt;
&lt;h2&gt;Closing&lt;/h2&gt;
&lt;p&gt;Every claim in this post is reproducible with the code at the link
below. The four &lt;code&gt;bench_opt&lt;/code&gt; files showed that the &quot;Perry wins&quot; column
closes to within noise on all three flag-sensitive benchmarks when the
other languages are given the equivalent optimization path — except on
Go, where the path doesn&apos;t exist. None of this required anything
exotic. &lt;code&gt;-ffast-math&lt;/code&gt; is a flag you can type today. Nightly Rust&apos;s
&lt;code&gt;fadd_fast&lt;/code&gt; intrinsic is &lt;code&gt;#![feature(core_intrinsics)]&lt;/code&gt; plus one use
statement. Whether either should be your default is a judgment call
about what you&apos;re building.&lt;/p&gt;
&lt;p&gt;Perry exists because some of us wanted to compile TypeScript to
something that isn&apos;t a JavaScript engine. It uses LLVM. You&apos;ve seen
other LLVM-based compilers in this post: clang, rustc, swiftc. They all
produce similar output when you ask them for similar things. The
experiment this article documented is what they do when you don&apos;t.&lt;/p&gt;
&lt;h2&gt;Reproduction&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/ralphkuepper/perry
cd perry/benchmarks/polyglot
cargo build --release --manifest-path=../../Cargo.toml -p perry
bash run_all.sh 5          # default-flags numbers — produces RESULTS.md
bash run_opt.sh 5 20       # opt variants — produces RESULTS_OPT.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Hardware used for the numbers in this post: Apple M1 Max (10 cores),
64 GB RAM, macOS 26.4. Perry commit &lt;code&gt;e1cbd37&lt;/code&gt; (v0.5.22). rustc 1.92.0
stable, 1.97.0-nightly 2026-04-14, Apple clang 21.0, Swift 6.3, Go
1.21.3, Node 25.8, Bun 1.3.5, Python 3.14.&lt;/p&gt;
&lt;p&gt;All LLVM IR snippets in this article are in &lt;code&gt;assets/&lt;/code&gt; as full &lt;code&gt;.ll&lt;/code&gt;
files, reproducible with &lt;code&gt;clang -S -emit-llvm&lt;/code&gt; (C++), &lt;code&gt;rustc -O --emit=llvm-ir&lt;/code&gt; (Rust), and &lt;code&gt;PERRY_SAVE_LL=&amp;lt;dir&amp;gt; perry compile&lt;/code&gt; (Perry).
The accompanying &lt;code&gt;METHODOLOGY.md&lt;/code&gt; in that directory has the exact
iteration counts, clocks, and timing methodology.&lt;/p&gt;
</content:encoded></item><item><title>On publishing slowly</title><link>https://amlug.net/posts/on-publishing-slowly/</link><guid isPermaLink="true">https://amlug.net/posts/on-publishing-slowly/</guid><description>Why this site will only see four to eight posts a year — and why that&apos;s the point.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I have a drafts folder full of technical writing that never shipped. Some of
it was too half-formed to post; most of it was fine, but I kept pushing it
to &quot;after I run one more benchmark,&quot; and after enough rounds of that, the
piece stops feeling urgent to anyone, myself included.&lt;/p&gt;
&lt;p&gt;This site is an attempt to fix the second failure mode without giving in to
the first. The plan is narrow: four to eight posts a year, each one about
something I actually had to work out, and nothing in between. No weekly
cadence, no &quot;thinking out loud&quot; threads, no TIL dumps. If I have a
half-formed thought worth sharing, a Mastodon post is the right shape for
it, not a page on a domain with my name on it.&lt;/p&gt;
&lt;h2&gt;What goes here&lt;/h2&gt;
&lt;p&gt;Most of what I&apos;ll write about lives near compilers and systems. I spend
my days on &lt;a href=&quot;https://perry.dev/&quot;&gt;Perry&lt;/a&gt;, a TypeScript-to-native compiler,
and the questions I get stuck on tend to be the kind that take a week of
measurement to answer honestly. Things like: what does &lt;code&gt;for...of&lt;/code&gt; actually
cost in a lowered IR, where is the line between &quot;clever abstraction&quot; and
&quot;extra two memory loads per iteration,&quot; and when is the answer to a
performance question &quot;the benchmark was wrong.&quot;&lt;/p&gt;
&lt;p&gt;Before Perry I spent years on Swift server-side work and contributed to
&lt;a href=&quot;https://vapor.codes/&quot;&gt;Vapor&lt;/a&gt;; some of that will show up here too, when I
have something to add that isn&apos;t already in the docs.&lt;/p&gt;
&lt;h2&gt;What does not&lt;/h2&gt;
&lt;p&gt;No company updates. No product announcements. No &quot;10 things I learned
this year.&quot; If you&apos;re looking for Skelpo news or Perry release notes,
those live on their own sites and will stay there.&lt;/p&gt;
&lt;h2&gt;The first real post&lt;/h2&gt;
&lt;p&gt;There&apos;s a benchmark investigation in the queue — a longer piece on the
cost of abstraction in the Perry front-end, with enough numbers attached
that I&apos;d rather over-verify before publishing than push it and have to
correct it in public. That article is the reason this site exists now
instead of in six months.&lt;/p&gt;
&lt;p&gt;Until then, this is the note on the door.&lt;/p&gt;
</content:encoded></item></channel></rss>