# Your default compiler flags are leaving 8× on the table
Here is a number that stopped me.
I ran a tight loop on an Apple M1 Max. One hundred million iterations,
adding 1.0 to a double each time. The program was compiled and run in
eight languages. The timings, in milliseconds, best of five runs:
| Language | loop_overhead (ms) |
|---|---|
| Rust (rustc -O) | 99 |
| C++ (g++ -O3) | 98 |
| Java (OpenJDK 21, JIT) | 98 |
| Swift (swiftc -O) | 97 |
| Go (go build) | 97 |
| Node.js 25 (V8) | 53 |
| Bun 1.3.5 (JavaScriptCore) | 40 |
| Compiled TypeScript (Perry, LLVM) | 12 |
Sit with the top five for a moment. Five different compilers, five different language designs, two decades apart in age. They agree on this loop to within 2%. That alone is worth noticing — the raw-throughput edge that conventional wisdom grants C++ over Go is, on this benchmark, simply not visible.
And then the bottom row. Twelve. Eight times faster than the five “fast” languages — built, of all things, by compiling TypeScript.
This isn’t a story about TypeScript being fast. It’s a story about why the five compiled languages are identical to each other, and why their shared default output is eight times slower than it has to be.
## The setup
The compiled-TypeScript entry is Perry, an ahead-of-time compiler I work on. It parses TS with SWC and generates native code through LLVM, the same backend clang and rustc use. For this article Perry is a measuring instrument — a way to isolate one specific thing: LLVM’s optimizer, when you hand it identical IR but with different flags.
The benchmark is a ported set of eight compute microbenchmarks, one of
which is the loop above. Full source and raw numbers are in the
polyglot benchmark suite. I ran every benchmark in every language at
the flags its documentation recommends for release builds — nothing
extra, nothing turned off. Best of five; best of twenty on the one
benchmark (fibonacci) sensitive to branch-predictor state.
These are compute microbenchmarks. Before we continue: do not generalize
them to “language X is 8× slower than language Y on real workloads.” On
a realistic application — one that spends its time in I/O, allocation,
a scheduler, a database driver — the programming-language choice drops
to the noise floor. What these numbers probe is narrow: the compiler’s
output on numeric loops with double / f64 arithmetic. That narrow
probe, it turns out, is where the defaults get interesting.
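To make the probe concrete, the loop_overhead kernel in its C++ form is roughly the following (a sketch; the exact harness lives in the benchmark repo):

```cpp
// loop_overhead (sketch): add 1.0 to a double accumulator n times.
// The benchmark runs n = 100,000,000 and times this function.
double loop_overhead(int n) {
    double x = 0.0;
    for (int i = 0; i < n; ++i)
        x += 1.0;
    return x;
}
```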
Three specific optimization choices account for every case where the compiled-TypeScript column looks strange. I’ll walk through each with the LLVM IR to back the claim.
## Optimization 1: IEEE 754 strict addition is really slow
The 99 ms Rust number is not laziness, and it’s not because Rust is
worse than C. Here is what clang emits from a vanilla -O3 build of the
same C++ loop:
```llvm
; clang -O3 bench.cpp -S -emit-llvm, inside bench_loop_overhead:
2:                                      ; preds = %2, %0
  %3 = phi i32 [ 0, %0 ], [ %9, %2 ]
  %4 = phi double [ 0.000000e+00, %0 ], [ %8, %2 ]
  %5 = fadd double %4, 1.000000e+00    ; serialized
  %6 = fadd double %5, 1.000000e+00    ; waits for %5
  %7 = fadd double %6, 1.000000e+00    ; waits for %6
  %8 = fadd double %7, 1.000000e+00    ; waits for %7
  %9 = add nuw i32 %3, 4
  %10 = icmp eq i32 %9, 100000000
  br i1 %10, label %11, label %2
```
clang unrolled the loop four times — four fadds in the body, counter
incrementing by 4. But look at the data dependencies: each fadd takes
the result of the previous fadd as its input. %6 cannot start
until %5 finishes. %7 has to wait on %6. Every instruction in the
body sits in a serial latency chain.
On an M1 Max a single fadd takes about 3 cycles. Four serialized
fadds per loop body × 3 cycles each = 12 cycles per iteration. With the
4× unrolling, 100M iterations becomes 25M loop bodies. 25M × 12 cycles
= 300M cycles. At 3.2 GHz that’s 94 ms. Measured: 98 ms. Close enough.
clang cannot collapse this chain. Not because it doesn’t see the
pattern — it obviously does — but because IEEE 754 forbids the
transformation. Floating-point addition is not associative. For
arbitrary inputs, (a + b) + c can differ from a + (b + c), because
a large intermediate in one order rounds away bits that would have
survived in the other. Programs that care about that — numerical
simulations, interval arithmetic, reproducibility guarantees — need
the result. The compiler must preserve it.
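The non-associativity is easy to demonstrate. At 2^53 the gap between adjacent doubles is 2.0, so adding 1.0 twice loses both units to rounding, while a pre-folded 2.0 survives exactly. A minimal illustration (compile without -ffast-math, which would be licensed to change the answer):

```cpp
// At 2^53 the spacing between representable doubles is 2.0.
// (big + 1.0) rounds back down to big each time, so the left-
// associated sum never moves; big + 2.0 is exactly representable.
double sum_left(double big)  { return (big + 1.0) + 1.0; }
double sum_right(double big) { return big + (1.0 + 1.0); }

const double kBig = 9007199254740992.0;  // 2^53
```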
Now the same function with one flag added:
```llvm
; clang -O3 -ffast-math bench.cpp:
2:                                      ; preds = %2, %0
  %3 = phi i32 [ 0, %0 ], [ %6, %2 ]
  %4 = phi <2 x double> [ zeroinitializer, %0 ], [ %5, %2 ]
  %5 = fadd fast <2 x double> %4, splat (double 4.000000e+00)
  %6 = add nuw i32 %3, 8
  %7 = icmp eq i32 %6, 100000000
  br i1 %7, label %8, label %2
```
One fadd fast <2 x double> per iteration. Two parallel lanes, each
adding 4.0 (because LLVM folded (((x+1)+1)+1)+1 into x+4). Eight
additions per iteration, one vector instruction. No dependency between
iterations except the accumulator itself.
LLVM needed fast to permit the rewrite — the fast flag is a bundle
that includes reassoc (“may reorder”), contract (“may fuse mul+add
into fma”), and five more properties covering NaN, infinity, signed
zero, reciprocal arithmetic, and approximate library functions.
Turning it on says “I don’t care about strict IEEE 754 anywhere in
this compilation unit.” clang’s measured result with the flag: 12 ms.
Eight times faster than the default.
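If whole-unit -ffast-math is too blunt, clang also accepts a scoped form: a block-level pragma that sets the reassociation bit only where you ask for it. This is a clang extension (GCC warns and ignores it); a hedged sketch:

```cpp
// Scoped alternative to -ffast-math: grant LLVM reassociation
// permission inside this function only (clang-specific pragma;
// other float code in the translation unit stays strict IEEE 754).
double sum_reassoc(const double* a, int n) {
    #pragma clang fp reassociate(on)
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```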
Perry’s generated IR for the same function carries reassoc contract
on every float instruction by default — a subset of fast that permits
reordering and fma contraction but preserves NaN, Inf, and -0.0
semantics (which JS programs can observe). After LLVM’s standard
optimization pipeline runs on Perry’s naïve load/fadd/store IR, it
becomes:
```llvm
vector.body:
  %vec.phi = phi <2 x double> [...], [ %0, %vector.body ]
  %vec.phi14 = phi <2 x double> [...], [ %1, %vector.body ]
  %0 = fadd reassoc contract <2 x double> %vec.phi, splat (double 1.0)
  %1 = fadd reassoc contract <2 x double> %vec.phi14, splat (double 1.0)
  %index.next = add nuw i32 %index, 4
```
Two parallel <2 x double> accumulators instead of clang-fast’s one —
LLVM’s interleave pass picked a different unroll factor here, but the
result is structurally identical: parallel fadd lanes, no serial chain.
Final disassembly shows Perry’s binary running four independent
fadd.2d NEON instructions per loop iteration, consuming the two FP
issue pipes M1 has available. Measured: 12 ms, the same number C++
gets with -ffast-math, by a different route.
Two things follow.
First: the thing Rust and C++ lost by default was never compiler
quality. It was one bit of metadata on every fadd instruction. Perry
turns that bit on in its emitter. clang turns it on when you pass
-ffast-math. Both end up at the same 12 ms because both are routing
through the same LLVM optimizer. LLVM is doing the work. The languages
differ only in whether they hand LLVM the permission slip.
Second: Go cannot participate. Go’s compiler has no -ffast-math,
no reassoc flag, and its backend does not ship a floating-point
reassociation pass. Writing the same loop in Go and building with
go build — with any flags, any compiler version — produces something
indistinguishable from clang’s default 97 ms. This is intentional: Go’s
design prioritizes predictable compiler output over absolute
throughput. It’s also the cleanest instance in this whole investigation
of “the default is the ceiling.”
For Rust, the situation is halfway. Stable Rust has no flag to toggle
reassoc on individual fadd instructions. Nightly exposes
std::intrinsics::fadd_fast, which takes the same loop from 99 ms to
12 ms — matching clang-fast. Manual 4-way unrolling in stable Rust
reaches 24 ms, good but not great. On this benchmark, “use nightly” is
a real answer if you need parity.
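The manual-unroll workaround is language-agnostic; in C++ terms it means keeping independent accumulators so the additions no longer form one serial chain (a sketch of the technique, not the repo's exact code):

```cpp
// Break the serial fadd chain by hand: four independent accumulators
// can issue in parallel across the FP pipes; combine once at the end.
// Assumes n is a multiple of 4, for brevity.
double loop_overhead_unrolled(int n) {
    double a = 0.0, b = 0.0, c = 0.0, d = 0.0;
    for (int i = 0; i < n; i += 4) {
        a += 1.0; b += 1.0; c += 1.0; d += 1.0;
    }
    return (a + b) + (c + d);
}
```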
## Optimization 2: the benchmark that fooled me
Here is accumulate: loop 100 million times, do sum += i % 1000 on
double values, report the elapsed time. My prior belief going in was
straightforward: on ARM64 there is no hardware instruction for fmod
on f64. The default C++ benchmark uses double, so the modulo
lowers to a libm function call — roughly 30 ns per call, 30 ns × 100M
iterations = three full seconds theoretical, something under a second
in practice once clang vectorizes around the call. Perry’s type
inference recognizes the operands are integer-valued and emits srem
— one hardware instruction, one cycle — which is why Perry reports 24
ms while the other languages sit at 96–99 ms.
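For reference, the C++ form of accumulate is roughly this (a sketch; the exact source is in the repo):

```cpp
// accumulate (sketch): sum i % 1000 into a double accumulator.
// Note the loop variable is an int, which matters below.
double accumulate(int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += i % 1000;
    return sum;
}
```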
That story is wrong in an interesting way.
Here is what clang actually emits, with default flags, for the C++
version of accumulate:
```llvm
  %9 = urem <4 x i32> %5, splat (i32 1000)
  %10 = urem <4 x i32> %6, splat (i32 1000)
  %11 = urem <4 x i32> %7, splat (i32 1000)
  %12 = urem <4 x i32> %8, splat (i32 1000)
  %13 = uitofp nneg <4 x i32> %9 to <4 x double>
  ; ... uitofp for the other three lanes ...
  %17 = tail call double @llvm.vector.reduce.fadd.v4f64(double %4, <4 x double> %13)
  %18 = tail call double @llvm.vector.reduce.fadd.v4f64(double %17, <4 x double> %14)
  %19 = tail call double @llvm.vector.reduce.fadd.v4f64(double %18, <4 x double> %15)
  %20 = tail call double @llvm.vector.reduce.fadd.v4f64(double %19, <4 x double> %16)
```
Because i is declared int in bench.cpp, clang was free to lower
i % 1000 to vectorized integer remainder — urem <4 x i32>. No
fmod anywhere. The C++ benchmark isn’t paying the libm tax I assumed
it was.
So what is the 97 ms? Look at the bottom: four llvm.vector.reduce.fadd
calls, chained, each feeding the next. Without reassoc, a
vector.reduce.fadd.v4f64 must happen in a specific order — it’s
semantically a serial chain of four fadds (the start value plus four
lanes, folded left to right). Four of those chained per iteration is
sixteen serial fadds. That’s the bottleneck.
Perry, on the same benchmark, compiles down to:
```llvm
vector.body:
  %1 = urem i64 %index, 1000
  %2 = urem i64 %0, 1000
  %3 = uitofp nneg i64 %1 to double
  %4 = uitofp nneg i64 %2 to double
  %5 = fadd reassoc contract double %vec.phi, %3
  %6 = fadd reassoc contract double %vec.phi15, %4
  %index.next = add nuw i64 %index, 2
```
Two parallel scalar accumulators. Two urems, two uitofps, two
fadds, no reductions. The urem was always going to be there — both
compilers found the integer remainder. The difference is that Perry’s
reassoc flag let LLVM split the accumulator into parallel lanes
instead of feeding a serial vector-reduce chain.
The original story I told about Perry vs C++ on this benchmark — that
it’s the fmod libm call versus an srem hardware instruction — turns
out to be a story about Perry vs naïvely-compiled TypeScript. Perry
does have an integer-mod fast path, and it’s a real optimization: if a
future TypeScript compiler on this benchmark emitted frem double, it
would sit around 600 ms (Node’s number: 602 ms, which is exactly this —
V8 didn’t inline the fmod call). The fast path matters against that
reference point.
But against clang -O3 on the same algorithm, the fast path isn’t
what’s making the difference. It’s the reassociation flag, again.
C++ with -O3 -ffast-math on accumulate clocks in at 26 ms — virtually
identical to Perry’s 24 ms. Rust’s stable-toolchain opt variant
(switching the accumulator to i64 so the benchmark stays in the
integer domain) gets to 41 ms. The integer change helps, but closing
the rest of the gap to 24 ms would require full fast FMF, which
stable Rust doesn’t expose. Nightly Rust’s fadd_fast doesn’t help on
this benchmark either, because the bottleneck is the shape of the
reduce chain, not the permissions on individual fadds.
Go with its opt variant (int64 accumulator) goes from 99 ms to 70 ms,
the biggest improvement any Go benchmark saw in the opt sweep. The
delta came entirely from avoiding the uitofp per iteration, not from
vectorizing the remainder. Go’s compiler emitted one iteration per loop
pass, scalar SMULH + MSUB for the modulo, scalar integer add. No
vectorization. 70 ms is what you get when nothing auto-parallelizes the
accumulator.
## Optimization 3: it’s reassoc all the way down
The third optimization the plan called for was bounds-check elimination
and i32 loop counter promotion on the array_read benchmark — sum 10
million doubles from an array. Perry’s codegen detects the classic
for (let i = 0; i < arr.length; i++) pattern, caches arr.length at
loop entry, maintains a parallel i32 counter alongside the f64 one, and
skips the JS runtime bounds check. Measured: 3 ms.
The prediction was that the other languages would be meaningfully
slower on the default benchmarks and would snap close to 3 ms when
given the right idiom: Rust’s .iter().sum(), Swift’s
withUnsafeBufferPointer, C++’s std::vector, which carries no bounds
checks to begin with.
Here’s what actually happened:
| Language | default (ms) | opt (ms) | delta |
|---|---|---|---|
| C++ | 9 | 1 | -89% |
| Rust | 10 | 9 | -10% |
| Go | 10 | 11 | +10% |
| Swift | 9 | 9 | 0% |
Only C++ moved materially. And C++ moved because of -ffast-math
(again), not bounds-elim — there are no C++ bounds to eliminate. With
fast on, LLVM interleaves the array-sum reduction four lanes wide
(four parallel <2 x double> accumulators, 8 f64 per load, 4 × 8 =
32-wide unroll) and gets to 1 ms. That’s faster than Perry’s 3 ms.
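The C++ opt variant here is not exotic code; it is the reduction you'd write anyway, and only the flag changes (a sketch under that assumption):

```cpp
#include <vector>
#include <numeric>

// array_read (sketch): sum a vector of doubles. Under strict IEEE 754
// this is one serial fadd chain; under -ffast-math LLVM may
// reassociate, vectorize, and interleave the reduction.
double array_read(const std::vector<double>& a) {
    return std::accumulate(a.begin(), a.end(), 0.0);
}
```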
Rust’s .iter().sum() and the indexed for i in 0..arr.len() form
differed by about one millisecond — within run-to-run noise. rustc at
-O already proves i < arr.len() for that classic loop shape and strips
the bounds check as dead code. There was nothing to eliminate.
Swift’s UnsafeBufferPointer produced an identical 9 ms. The safe
indexed form was already efficient.
So the third “Perry optimization” I set out to document turns out to be
real in Perry’s source — the code in stmt.rs does track i32 counter
promotion and bounded_index_pairs — but it isn’t load-bearing on this
benchmark. The loop vectorizer’s interleave factor is what separates 9
ms from 1 ms. That’s an LLVM heuristic, not a bounds thing.
The honest takeaway is smaller than the plan suggested: bounds-check elimination is mostly already happening, at least in Rust and C++, for the straight-line loops these benchmarks exercise. What isn’t already happening is aggressive vectorization under strict IEEE 754, which is the same optimization we discussed in section 1.
## Where the compiled-TypeScript side loses
Two benchmarks where Perry loses cleanly. They matter to the argument — if the thesis were just “TypeScript is faster,” they’d be awkward. Since the thesis is “defaults matter,” they’re consistent with it.
object_create: 0 ms (Rust, C++, Go, Swift) vs 2 ms (Perry). The
benchmark allocates a million Point{x, y} structs, sums fields, and
reports the time. In statically typed compiled languages, the optimizer
stack-allocates the struct, inlines the constructor, proves the struct
never escapes the loop, and eliminates the whole thing as dead code.
The measured result is zero because the work is zero. Perry cannot
match this without abandoning its dynamic value model. A recent Perry
pass (v0.5.17) does scalar-replacement for objects whose only uses are
field get/set, which is why Perry measures 2 ms and not 10 — but any
method call on the object defeats it. This is the shape of workload
where ahead-of-time compiling a dynamic language pays a real tax
against languages with static types, and no amount of flag-tuning
closes the gap.
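What the statically typed compilers are eliminating looks roughly like this shape (a sketch; the real benchmark is in the repo). Because the struct never escapes the loop body, scalar replacement plus dead-code elimination can collapse the whole thing:

```cpp
struct Point { double x, y; };

// object_create (sketch): construct n points, sum their fields.
// p never escapes, so an optimizer can scalar-replace it; the loop
// then folds to closed form and the measured time rounds to zero.
double object_create(int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
        Point p{static_cast<double>(i), static_cast<double>(i) * 2.0};
        s += p.x + p.y;
    }
    return s;
}
```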
nested_loops: Perry 9 ms vs C++ opt 1 ms. Same story as
array_read, same cause: -ffast-math enables a more aggressive
interleave factor than Perry’s reassoc contract subset does. Perry’s
3 ms on array_read and 9 ms on nested_loops are both beaten by C++
opt, because fast includes nnan and ninf permissions that the
loop vectorizer uses to pick a higher unroll. Perry deliberately does
not emit those, because JavaScript programs can observe NaN, infinity,
and the sign of zero: Object.is(Math.min(0, -0), -0) must stay true.
That’s a real correctness tradeoff — the ceiling Perry could hit if it
stopped caring about NaN/Inf semantics is several milliseconds faster
on flat-array sums. Right now it doesn’t hit it.
## An aside: where JIT beats AOT
fibonacci: Java 280 ms vs Perry 311 ms. Recursive fib(40) makes
roughly 330 million real calls. Java’s C2 JIT observes the recursion
at runtime and applies aggressive inlining based on actual hot-call
frequencies — something no AOT compiler can match without whole-program
profile data. Perry, C++, and Rust all cluster at ~310–319 ms through
LLVM; Swift at 360 ms and Go at 450 ms lose at the recursion-folding
stage inside their own backends. This benchmark is essentially a
compiler-pass quality test, not a flag-tuning target. No flag changes
any of these numbers materially.
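For reference, the recursive kernel, sketched in C++ (every language's version is the same few lines): there is no loop body to vectorize, so the timings measure call overhead and how far each backend is willing to inline the recursion.

```cpp
// fibonacci (sketch): naive double recursion; the benchmark times fib(40).
long fib(int n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}
```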
## The meta-point
Rust, C++, Go, and Swift picked conservative defaults for a reason.
Their users care about reproducibility, IEEE 754 correctness, and about
not having to audit every numeric operation for the possibility the
compiler silently reassociated it. A 3D renderer that reads back the
same color from two parallel paths, a simulation that needs bit-exact
replay for debugging, a financial calculation that must be verifiably
deterministic — all of these care, and they’d be angry if
(a + b) + c != a + (b + c). The languages’ compile defaults reflect
that population.
Compiling TypeScript for a JavaScript audience is a different tradeoff.
JS programs mostly don’t treat -0.0 distinctly from 0.0 even when
they could. Most TS code that hits a numeric loop is a game tick, a
compiler pass, a canvas renderer — workloads where a bit of
reassociation is fine. So Perry turns reassoc on by default. It isn’t
braver or smarter than Rust; it serves a different population.
What’s interesting isn’t that Perry made the call. It’s that the call is invisible in most comparisons. The numbers people see when they benchmark “Rust vs TypeScript” or “C++ vs JavaScript” reflect the defaults both sides picked, with no indication that one side spent those defaults on numerical robustness and the other spent them on throughput. The benchmarks look like they’re comparing languages. They are actually comparing flag choices.
There’s no meta-rule for which default is right. “Enable reassoc by default” is good for numeric loops and bad for scientific simulations. “Strict IEEE by default” is the opposite. Both are defensible. What isn’t defensible is concluding from benchmark tables alone that one language is faster than another. The defaults are the experiment.
## Closing
Every claim in this post is reproducible with the code at the link
below. The four bench_opt files showed that the “Perry wins” column
closes to within noise on all three flag-sensitive benchmarks when the
other languages are given the equivalent optimization path — except on
Go, where the path doesn’t exist. None of this required anything
exotic. -ffast-math is a flag you can type today. Nightly Rust’s
fadd_fast intrinsic is #![feature(core_intrinsics)] plus one use
statement. Whether either should be your default is a judgment call
about what you’re building.
Perry exists because some of us wanted to compile TypeScript to something that isn’t a JavaScript engine. It uses LLVM. You’ve seen other LLVM-based compilers in this post: clang, rustc, swiftc. They all produce similar output when you ask them for similar things. The experiment this article documented is what they do when you don’t.
## Reproduction
```shell
git clone https://github.com/ralphkuepper/perry
cd perry/benchmarks/polyglot
cargo build --release --manifest-path=../../Cargo.toml -p perry
bash run_all.sh 5     # default-flags numbers — produces RESULTS.md
bash run_opt.sh 5 20  # opt variants — produces RESULTS_OPT.md
```
Hardware used for the numbers in this post: Apple M1 Max (10 cores),
64 GB RAM, macOS 26.4. Perry commit e1cbd37 (v0.5.22). rustc 1.92.0
stable, 1.97.0-nightly 2026-04-14, Apple clang 21.0, Swift 6.3, Go
1.21.3, Node 25.8, Bun 1.3.5, Python 3.14.
All LLVM IR snippets in this article are in assets/ as full .ll
files, reproducible with clang -S -emit-llvm (C++), rustc -O --emit=llvm-ir (Rust), and PERRY_SAVE_LL=<dir> perry compile (Perry).
The accompanying METHODOLOGY.md in that directory has the exact
iteration counts, clocks, and timing methodology.