<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>amlug</title><description>Personal technical writing by Ralph Kuepper (amlug). Deep notes on compilers, systems, and the occasional side project.</description><link>https://amlug.net/</link><item><title>Your default compiler flags are leaving 8× on the table</title><link>https://amlug.net/posts/default-compiler-flags-8x/</link><guid isPermaLink="true">https://amlug.net/posts/default-compiler-flags-8x/</guid><description>Five compiled languages agree on a numeric loop to within 2%. A compiled-TypeScript experiment is 8× faster. This isn&apos;t a story about TypeScript — it&apos;s about what the other five lost by default.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Here is a number that stopped me.&lt;/p&gt;
&lt;p&gt;I ran a tight loop on an Apple M1 Max. One hundred million iterations,
adding &lt;code&gt;1.0&lt;/code&gt; to a double each time. The program was compiled and run in
eight languages. The timings, in milliseconds, best of five runs:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;loop_overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rust (&lt;code&gt;rustc -O&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C++ (&lt;code&gt;g++ -O3&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Java (OpenJDK 21, JIT)&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Swift (&lt;code&gt;swiftc -O&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go (&lt;code&gt;go build&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node.js 25 (V8)&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bun 1.3.5 (JavaScriptCore)&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compiled TypeScript (Perry, LLVM)&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Sit with the top five for a moment. Five different compilers, five
different language designs, two decades apart in age. They agree on this
loop to within 2%. That alone is worth noticing — the raw-throughput
edge that conventional wisdom grants C++ over Go is, on this benchmark,
simply not visible.&lt;/p&gt;
&lt;p&gt;And then the bottom row. Twelve. Eight times faster than the
five &quot;fast&quot; languages — built, of all things, by compiling TypeScript.&lt;/p&gt;
&lt;p&gt;This isn&apos;t a story about TypeScript being fast. It&apos;s a story about why
the five compiled languages are identical to each other, and why their
shared default output is eight times slower than it has to be.&lt;/p&gt;
&lt;h2&gt;The setup&lt;/h2&gt;
&lt;p&gt;The compiled-TypeScript entry is &lt;a href=&quot;https://perry.sh/&quot;&gt;Perry&lt;/a&gt;, an ahead-of-time compiler I
work on. It parses TS with SWC and generates native code through LLVM,
the same backend clang and rustc use. For this article Perry is a
measuring instrument — a way to isolate one specific thing: LLVM&apos;s
optimizer, when you hand it identical IR but with different flags.&lt;/p&gt;
&lt;p&gt;The benchmark suite is a set of eight compute microbenchmarks, each
ported to every language; the loop above is one of them. Full source and raw numbers are in &lt;a href=&quot;https://github.com/ralphkuepper/perry/tree/main/benchmarks/polyglot&quot;&gt;the
polyglot benchmark suite&lt;/a&gt;. I ran every benchmark in every language at
the flags its documentation recommends for release builds — nothing
extra, nothing turned off. Best of five; best of twenty on the one
benchmark (&lt;code&gt;fibonacci&lt;/code&gt;) sensitive to branch-predictor state.&lt;/p&gt;
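&lt;p&gt;For concreteness, the best-of-N protocol looks like this (a Python sketch of the harness logic, not the suite&apos;s actual scripts):&lt;/p&gt;

```python
import time

def best_of(runs, fn):
    """Best-of-N wall-clock timing in milliseconds (harness sketch)."""
    best = None
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - t0) * 1e3
        best = elapsed_ms if best is None else min(best, elapsed_ms)
    return best

ms = best_of(5, lambda: sum(range(100_000)))
assert ms >= 0.0
```

Best-of rather than mean is deliberate here: for a pure compute loop, the fastest run is the one least disturbed by the OS.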
&lt;p&gt;These are compute microbenchmarks. Before we continue: do not generalize
them to &quot;language X is 8× slower than language Y on real workloads.&quot; On
a realistic application — one that spends its time in I/O, allocation,
a scheduler, a database driver — the programming-language choice drops
to the noise floor. What these numbers probe is narrow: the compiler&apos;s
output on numeric loops with &lt;code&gt;double&lt;/code&gt; / &lt;code&gt;f64&lt;/code&gt; arithmetic. That narrow
probe, it turns out, is where the defaults get interesting.&lt;/p&gt;
&lt;p&gt;Three specific optimization choices account for every case where the
compiled-TypeScript column looks strange. I&apos;ll walk through each with
the LLVM IR to back the claim.&lt;/p&gt;
&lt;h2&gt;Optimization 1: IEEE 754 strict addition is really slow&lt;/h2&gt;
&lt;p&gt;The 99 ms Rust number is not laziness, and it&apos;s not because Rust is
worse than C. Here is what clang emits from a vanilla &lt;code&gt;-O3&lt;/code&gt; build of the
same C++ loop:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;; clang -O3 bench.cpp -S -emit-llvm, inside bench_loop_overhead:
2:                                    ; preds = %2, %0
  %3 = phi i32    [ 0,           %0 ], [ %9, %2 ]
  %4 = phi double [ 0.000000e+00, %0 ], [ %8, %2 ]
  %5 = fadd double %4, 1.000000e+00    ; serialized
  %6 = fadd double %5, 1.000000e+00    ; waits for %5
  %7 = fadd double %6, 1.000000e+00    ; waits for %6
  %8 = fadd double %7, 1.000000e+00    ; waits for %7
  %9 = add nuw i32 %3, 4
  %10 = icmp eq i32 %9, 100000000
  br i1 %10, label %11, label %2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;clang unrolled the loop four times — four &lt;code&gt;fadd&lt;/code&gt;s in the body, counter
incrementing by 4. But look at the data dependencies: each &lt;code&gt;fadd&lt;/code&gt; takes
the result of the previous &lt;code&gt;fadd&lt;/code&gt; as its input. &lt;code&gt;%6&lt;/code&gt; cannot start
until &lt;code&gt;%5&lt;/code&gt; finishes. &lt;code&gt;%7&lt;/code&gt; has to wait on &lt;code&gt;%6&lt;/code&gt;. Every instruction in the
body sits in a serial latency chain.&lt;/p&gt;
&lt;p&gt;On an M1 Max a single &lt;code&gt;fadd&lt;/code&gt; has a latency of about 3 cycles.
Four serialized fadds per loop body × 3 cycles each = 12 cycles per body.
With the 4× unrolling, 100M iterations becomes 25M loop bodies. 25M × 12
cycles = 300M cycles. At 3.2 GHz that&apos;s 94 ms. Measured: 98 ms. Close enough.&lt;/p&gt;
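&lt;p&gt;The latency arithmetic, written out (the 3-cycle latency and 3.2 GHz clock are the approximate figures used above, not exact hardware specifications):&lt;/p&gt;

```python
# Back-of-envelope model of the serialized loop above.
fadd_latency_cycles = 3          # approximate M1 fadd latency
unroll = 4                       # clang unrolled the body 4x
iterations = 100_000_000
clock_hz = 3.2e9                 # approximate performance-core clock

bodies = iterations // unroll                     # loop bodies executed
cycles = bodies * unroll * fadd_latency_cycles    # serial fadd chain
predicted_ms = cycles / clock_hz * 1e3

assert bodies == 25_000_000
assert cycles == 300_000_000
assert round(predicted_ms) == 94   # vs. 98 ms measured
```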
&lt;p&gt;&lt;strong&gt;clang cannot collapse this chain.&lt;/strong&gt; Not because it doesn&apos;t see the
pattern — it obviously does — but because IEEE 754 forbids the
transformation. Floating-point addition is not associative. For
arbitrary inputs, &lt;code&gt;(a + b) + c&lt;/code&gt; can differ from &lt;code&gt;a + (b + c)&lt;/code&gt;, because
a large intermediate in one order rounds away bits that would have
survived in the other. Programs that care about that — numerical
simulations, interval arithmetic, reproducibility guarantees — need
the result. The compiler must preserve it.&lt;/p&gt;
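&lt;p&gt;The non-associativity is easy to reproduce. In Python, which uses the same IEEE 754 doubles:&lt;/p&gt;

```python
# Near 1e16 the gap between adjacent doubles is 2.0, so a lone +1.0
# rounds away while +2.0 survives. Grouping changes the answer.
a, b, c = 1e16, 1.0, 1.0

left = (a + b) + c      # each +1.0 rounds back to 1e16
right = a + (b + c)     # +2.0 is exactly representable here

assert left == 1e16
assert right == 1e16 + 2.0
assert left != right    # floating-point addition is not associative
```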
&lt;p&gt;Now the same function with one flag added:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;; clang -O3 -ffast-math bench.cpp:
2:                                    ; preds = %2, %0
  %3 = phi i32        [ 0, %0 ],              [ %6, %2 ]
  %4 = phi &amp;lt;2 x double&amp;gt; [ zeroinitializer, %0 ], [ %5, %2 ]
  %5 = fadd fast &amp;lt;2 x double&amp;gt; %4, splat (double 4.000000e+00)
  %6 = add nuw i32 %3, 8
  %7 = icmp eq i32 %6, 100000000
  br i1 %7, label %8, label %2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One &lt;code&gt;fadd fast &amp;lt;2 x double&amp;gt;&lt;/code&gt; per iteration. Two parallel lanes, each
adding &lt;code&gt;4.0&lt;/code&gt; (because LLVM folded &lt;code&gt;(((x+1)+1)+1)+1&lt;/code&gt; into &lt;code&gt;x+4&lt;/code&gt;). Eight
additions per iteration, one vector instruction. No dependency between
iterations except the accumulator itself.&lt;/p&gt;
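&lt;p&gt;What the rewrite buys is independent dependency chains. In scalar form (a Python sketch with a tiny N so it actually runs; the values here are exact, so the reassociated answer happens to match the serial one bit for bit):&lt;/p&gt;

```python
def serial_sum(n):
    # one accumulator: every add waits on the previous one
    s = 0.0
    for _ in range(n):
        s += 1.0
    return s

def two_lane_sum(n):
    # what reassociation permits: independent accumulators a CPU can
    # run in parallel, combined once at the end
    a = 0.0
    b = 0.0
    for _ in range(n // 2):
        a += 1.0
        b += 1.0
    return a + b

assert serial_sum(1000) == two_lane_sum(1000) == 1000.0
```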
&lt;p&gt;LLVM needed &lt;code&gt;fast&lt;/code&gt; to permit the rewrite — the &lt;code&gt;fast&lt;/code&gt; flag is a bundle
that includes &lt;code&gt;reassoc&lt;/code&gt; (&quot;may reorder&quot;), &lt;code&gt;contract&lt;/code&gt; (&quot;may fuse mul+add
into fma&quot;), and five more properties covering NaN, infinity, signed zero,
reciprocal arithmetic, and approximate library functions. Turning it on says &quot;I don&apos;t care about
strict IEEE 754 anywhere in this compilation unit.&quot; clang&apos;s measured
result with the flag: &lt;strong&gt;12 ms&lt;/strong&gt;. Eight times faster than the default.&lt;/p&gt;
&lt;p&gt;Perry&apos;s generated IR for the same function carries &lt;code&gt;reassoc contract&lt;/code&gt;
on every float instruction by default — a subset of &lt;code&gt;fast&lt;/code&gt; that permits
reordering and fma contraction but preserves NaN, Inf, and &lt;code&gt;-0.0&lt;/code&gt;
semantics (which JS programs can observe). After LLVM&apos;s standard
optimization pipeline runs on Perry&apos;s naïve load/fadd/store IR, it
becomes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;vector.body:
  %vec.phi   = phi &amp;lt;2 x double&amp;gt; [...], [ %0, %vector.body ]
  %vec.phi14 = phi &amp;lt;2 x double&amp;gt; [...], [ %1, %vector.body ]
  %0 = fadd reassoc contract &amp;lt;2 x double&amp;gt; %vec.phi,   splat (double 1.0)
  %1 = fadd reassoc contract &amp;lt;2 x double&amp;gt; %vec.phi14, splat (double 1.0)
  %index.next = add nuw i32 %index, 4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two parallel &lt;code&gt;&amp;lt;2 x double&amp;gt;&lt;/code&gt; accumulators instead of clang-fast&apos;s one —
LLVM&apos;s interleave pass picked a different unroll factor here, but the
result is structurally identical: parallel fadd lanes, no serial chain.
Final disassembly shows Perry&apos;s binary running four independent
&lt;code&gt;fadd.2d&lt;/code&gt; NEON instructions per loop iteration, consuming the two FP
issue pipes M1 has available. Measured: &lt;strong&gt;12 ms&lt;/strong&gt;, the same number C++
gets with &lt;code&gt;-ffast-math&lt;/code&gt;, by a different route.&lt;/p&gt;
&lt;p&gt;Two things follow.&lt;/p&gt;
&lt;p&gt;First: &lt;strong&gt;the thing Rust and C++ lost by default was never compiler
quality. It was one bit of metadata on every fadd instruction.&lt;/strong&gt; Perry
turns that bit on in its emitter. clang turns it on when you pass
&lt;code&gt;-ffast-math&lt;/code&gt;. Both end up at the same 12 ms because both are routing
through the same LLVM optimizer. LLVM is doing the work. The languages
differ only in whether they hand LLVM the permission slip.&lt;/p&gt;
&lt;p&gt;Second: &lt;strong&gt;Go cannot participate.&lt;/strong&gt; Go&apos;s compiler has no &lt;code&gt;-ffast-math&lt;/code&gt;,
no &lt;code&gt;reassoc&lt;/code&gt; flag, and its backend does not ship a floating-point
reassociation pass. Writing the same loop in Go and building with
&lt;code&gt;go build&lt;/code&gt; — with any flags, any compiler version — produces something
indistinguishable from the 97–99 ms default cluster. This is intentional: Go&apos;s
design prioritizes predictable compiler output over absolute
throughput. It&apos;s also the cleanest instance in this whole investigation
of &quot;the default is the ceiling.&quot;&lt;/p&gt;
&lt;p&gt;For Rust, the situation is halfway. Stable Rust has no flag to toggle
&lt;code&gt;reassoc&lt;/code&gt; on individual fadd instructions. Nightly exposes
&lt;code&gt;std::intrinsics::fadd_fast&lt;/code&gt;, which takes the same loop from 99 ms to
12 ms — matching clang-fast. Manual 4-way unrolling in stable Rust
reaches 24 ms, good but not great. On this benchmark, &quot;use nightly&quot; is
a real answer if you need parity.&lt;/p&gt;
&lt;h2&gt;Optimization 2: the benchmark that fooled me&lt;/h2&gt;
&lt;p&gt;Here is &lt;code&gt;accumulate&lt;/code&gt;: loop 100 million times, do &lt;code&gt;sum += i % 1000&lt;/code&gt; on
&lt;code&gt;double&lt;/code&gt; values, report the elapsed time. My prior belief going in was
straightforward: on ARM64 there is no hardware instruction for &lt;code&gt;fmod&lt;/code&gt;
on f64. The default C++ benchmark uses &lt;code&gt;double&lt;/code&gt;, so the modulo
lowers to a libm function call — roughly 30 ns per call, 30 ns × 100M
iterations = three full seconds theoretical, something under a second
in practice once clang vectorizes around the call. Perry&apos;s type
inference recognizes the operands are integer-valued and emits &lt;code&gt;srem&lt;/code&gt;
— one hardware instruction, one cycle — which is why Perry reports 24
ms while the other languages sit at 96–99 ms.&lt;/p&gt;
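&lt;p&gt;For reference, the shape of the benchmark (a Python transliteration of the suite&apos;s loop, small n so it runs quickly):&lt;/p&gt;

```python
def accumulate(n):
    # sum += i % 1000, accumulating into a float as the JS/TS source does
    s = 0.0
    for i in range(n):
        s += i % 1000
    return s

# spot-check: two full 0..999 cycles
assert accumulate(2000) == 2 * (999 * 1000 // 2)
```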
&lt;p&gt;That story is wrong in an interesting way.&lt;/p&gt;
&lt;p&gt;Here is what clang actually emits, with default flags, for the C++
version of &lt;code&gt;accumulate&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;%9  = urem &amp;lt;4 x i32&amp;gt; %5, splat (i32 1000)
%10 = urem &amp;lt;4 x i32&amp;gt; %6, splat (i32 1000)
%11 = urem &amp;lt;4 x i32&amp;gt; %7, splat (i32 1000)
%12 = urem &amp;lt;4 x i32&amp;gt; %8, splat (i32 1000)
%13 = uitofp nneg &amp;lt;4 x i32&amp;gt; %9  to &amp;lt;4 x double&amp;gt;
; ... uitofp for the other three lanes ...
%17 = tail call double @llvm.vector.reduce.fadd.v4f64(double %4,  &amp;lt;4 x double&amp;gt; %13)
%18 = tail call double @llvm.vector.reduce.fadd.v4f64(double %17, &amp;lt;4 x double&amp;gt; %14)
%19 = tail call double @llvm.vector.reduce.fadd.v4f64(double %18, &amp;lt;4 x double&amp;gt; %15)
%20 = tail call double @llvm.vector.reduce.fadd.v4f64(double %19, &amp;lt;4 x double&amp;gt; %16)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because &lt;code&gt;i&lt;/code&gt; is declared &lt;code&gt;int&lt;/code&gt; in &lt;code&gt;bench.cpp&lt;/code&gt;, clang was free to lower
&lt;code&gt;i % 1000&lt;/code&gt; to &lt;strong&gt;vectorized integer remainder&lt;/strong&gt; — &lt;code&gt;urem &amp;lt;4 x i32&amp;gt;&lt;/code&gt;. No
&lt;code&gt;fmod&lt;/code&gt; anywhere. The C++ benchmark isn&apos;t paying the libm tax I assumed
it was.&lt;/p&gt;
&lt;p&gt;So what is the 97 ms? Look at the bottom: four &lt;code&gt;llvm.vector.reduce.fadd&lt;/code&gt;
calls, chained, each feeding the next. Without &lt;code&gt;reassoc&lt;/code&gt;, a
&lt;code&gt;vector.reduce.fadd.v4f64&lt;/code&gt; is an ordered reduction — semantically
the start value plus each of the four lanes in sequence, a serial chain
of four &lt;code&gt;fadd&lt;/code&gt;s inside. Four of those chained per iteration is
sixteen serial &lt;code&gt;fadd&lt;/code&gt;s. That&apos;s the bottleneck.&lt;/p&gt;
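&lt;p&gt;The ordering constraint is not pedantry; an ordered reduction and a reassociated one can genuinely disagree. A minimal Python demonstration:&lt;/p&gt;

```python
vals = [1e16, 1.0, 1.0, 1.0, 1.0]

# strict semantics: fold left, one fadd at a time
ordered = 0.0
for v in vals:
    ordered += v

# reassociated semantics: a pairwise tree, the shape vectorizers prefer
def pairwise(xs):
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise(xs[:mid]) + pairwise(xs[mid:])

assert ordered == 1e16               # each lone 1.0 rounded away
assert pairwise(vals) >= 1e16 + 2.0  # grouped 1.0s survive rounding
assert ordered != pairwise(vals)
```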
&lt;p&gt;Perry, on the same benchmark, compiles down to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;vector.body:
  %1 = urem i64 %index, 1000
  %2 = urem i64 %0, 1000
  %3 = uitofp nneg i64 %1 to double
  %4 = uitofp nneg i64 %2 to double
  %5 = fadd reassoc contract double %vec.phi,   %3
  %6 = fadd reassoc contract double %vec.phi15, %4
  %index.next = add nuw i64 %index, 2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two parallel scalar accumulators. Two &lt;code&gt;urem&lt;/code&gt;s, two &lt;code&gt;uitofp&lt;/code&gt;s, two
&lt;code&gt;fadd&lt;/code&gt;s, no reductions. The &lt;code&gt;urem&lt;/code&gt; was always going to be there — both
compilers found the integer remainder. The difference is that Perry&apos;s
&lt;code&gt;reassoc&lt;/code&gt; flag let LLVM hoist the accumulate out into parallel lanes
instead of a vector-reduce chain.&lt;/p&gt;
&lt;p&gt;The original story I told about Perry vs C++ on this benchmark — that
it&apos;s the &lt;code&gt;fmod&lt;/code&gt; libm call versus an &lt;code&gt;srem&lt;/code&gt; hardware instruction — turns
out to be a story about Perry vs &lt;em&gt;naïvely-compiled TypeScript&lt;/em&gt;. Perry
does have an integer-mod fast path, and it&apos;s a real optimization: if a
future TypeScript compiler on this benchmark emitted &lt;code&gt;frem double&lt;/code&gt;, it
would sit around 600 ms (Node&apos;s number: 602 ms, which is exactly this —
V8 didn&apos;t inline the fmod call). The fast path matters against that
reference point.&lt;/p&gt;
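&lt;p&gt;The soundness of that fast path is easy to sanity-check: for integer-valued doubles with magnitudes below 2^53, floating-point remainder and integer remainder agree exactly:&lt;/p&gt;

```python
import math

# For integer-valued doubles (magnitudes below 2**53), fmod and the
# integer remainder agree exactly -- the premise of an srem fast path.
for i in (0, 7, 999, 1000, 1001, 123_456_789):
    assert math.fmod(float(i), 1000.0) == float(i % 1000)
```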
&lt;p&gt;But against &lt;code&gt;clang -O3&lt;/code&gt; on the same algorithm, the fast path isn&apos;t
what&apos;s making the difference. It&apos;s the reassociation flag, again.&lt;/p&gt;
&lt;p&gt;C++ with &lt;code&gt;-O3 -ffast-math&lt;/code&gt; on &lt;code&gt;accumulate&lt;/code&gt; clocks in at 26 ms,
virtually identical to Perry&apos;s 24 ms. Rust&apos;s stable-toolchain opt variant
(switch the accumulator to &lt;code&gt;i64&lt;/code&gt; so the benchmark stays in the integer
domain) gets to 41 ms — the integer change helps, but reaching 24 ms
would require breaking up the fadd reduce chain, and that needs full
&lt;code&gt;fast&lt;/code&gt; FMF, which stable Rust doesn&apos;t expose. Nightly Rust&apos;s
&lt;code&gt;fadd_fast&lt;/code&gt; doesn&apos;t help on this benchmark either, because the
bottleneck is the shape of the reduce chain, not the permissions on
individual &lt;code&gt;fadd&lt;/code&gt;s.&lt;/p&gt;
&lt;p&gt;Go with its opt variant (&lt;code&gt;int64&lt;/code&gt; accumulator) goes from 99 ms to 70 ms,
the biggest improvement any Go benchmark saw in the opt sweep. The
delta came entirely from avoiding the &lt;code&gt;uitofp&lt;/code&gt; per iteration, not from
vectorizing the remainder. Go&apos;s compiler emitted one iteration per loop
pass, scalar &lt;code&gt;SMULH + MSUB&lt;/code&gt; for the modulo, scalar integer add. No
vectorization. 70 ms is what you get when nothing auto-parallelizes the
accumulator.&lt;/p&gt;
&lt;h2&gt;Optimization 3: it&apos;s reassoc all the way down&lt;/h2&gt;
&lt;p&gt;The third optimization the plan called for was bounds-check elimination
and i32 loop counter promotion on the &lt;code&gt;array_read&lt;/code&gt; benchmark — sum 10
million &lt;code&gt;double&lt;/code&gt;s from an array. Perry&apos;s codegen detects the classic
&lt;code&gt;for (let i = 0; i &amp;lt; arr.length; i++)&lt;/code&gt; pattern, caches &lt;code&gt;arr.length&lt;/code&gt; at
loop entry, maintains a parallel i32 counter alongside the f64 one, and
skips the JS runtime bounds check. Measured: 3 ms.&lt;/p&gt;
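&lt;p&gt;The loop shape in question, transliterated (a Python stand-in for the TS pattern, with &lt;code&gt;len(arr)&lt;/code&gt; playing the role of &lt;code&gt;arr.length&lt;/code&gt;):&lt;/p&gt;

```python
def array_read(arr):
    # the classic indexed pattern: in JS the length and bounds check
    # are nominally per-iteration; Perry caches the length at entry
    # and keeps an i32 counter alongside
    total = 0.0
    for i in range(len(arr)):
        total += arr[i]
    return total

assert array_read([0.5] * 8) == 4.0
```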
&lt;p&gt;The prediction was that the other languages would be meaningfully
slower on the default benchmarks and would snap close to 3 ms when
given the right idiom: Rust&apos;s &lt;code&gt;.iter().sum()&lt;/code&gt;, Swift&apos;s
&lt;code&gt;withUnsafeBufferPointer&lt;/code&gt;, C++&apos;s already-no-bounds &lt;code&gt;std::vector&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here&apos;s what actually happened:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;default&lt;/th&gt;
&lt;th&gt;opt&lt;/th&gt;
&lt;th&gt;delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C++&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;-89%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;-10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Swift&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Only C++ moved materially. And C++ moved because of &lt;code&gt;-ffast-math&lt;/code&gt;
(again), not bounds-elim — there are no C++ bounds to eliminate. With
&lt;code&gt;fast&lt;/code&gt; on, LLVM interleaves the array-sum reduction four lanes wide
(four parallel &lt;code&gt;&amp;lt;2 x double&amp;gt;&lt;/code&gt; accumulators, eight &lt;code&gt;f64&lt;/code&gt; per
iteration) and gets to 1 ms. That&apos;s faster than Perry&apos;s 3 ms.&lt;/p&gt;
&lt;p&gt;Rust&apos;s &lt;code&gt;.iter().sum()&lt;/code&gt; vs the indexed &lt;code&gt;for i in 0..arr.len()&lt;/code&gt; form gave
about one millisecond — within run-to-run noise. rustc at &lt;code&gt;-O&lt;/code&gt; already
proves &lt;code&gt;i &amp;lt; arr.len()&lt;/code&gt; for that classic loop shape and strips the
bounds check as dead code. There was nothing to eliminate.&lt;/p&gt;
&lt;p&gt;Swift&apos;s &lt;code&gt;UnsafeBufferPointer&lt;/code&gt; produced an identical 9 ms. The safe
indexed form was already efficient.&lt;/p&gt;
&lt;p&gt;So the third &quot;Perry optimization&quot; I set out to document turns out to be
real in Perry&apos;s source — the code in &lt;code&gt;stmt.rs&lt;/code&gt; does track &lt;code&gt;i32 counter&lt;/code&gt;
promotion and &lt;code&gt;bounded_index_pairs&lt;/code&gt; — but it isn&apos;t load-bearing on this
benchmark. The loop vectorizer&apos;s interleave factor is what separates 9
ms from 1 ms. That&apos;s an LLVM heuristic, not a bounds thing.&lt;/p&gt;
&lt;p&gt;The honest takeaway is smaller than the plan suggested: bounds-check
elimination is mostly already happening, at least in Rust and C++, for
the straight-line loops these benchmarks exercise. What isn&apos;t already
happening is aggressive vectorization under strict IEEE 754, which is
the same optimization we discussed in section 1.&lt;/p&gt;
&lt;h2&gt;Where the compiled-TypeScript side loses&lt;/h2&gt;
&lt;p&gt;Two benchmarks where Perry loses cleanly. They matter to the argument
— if the thesis were just &quot;TypeScript is faster,&quot; they&apos;d be awkward.
Since the thesis is &quot;defaults matter,&quot; they&apos;re consistent with it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;object_create&lt;/code&gt;: 0 ms (Rust, C++, Go, Swift) vs 2 ms (Perry).&lt;/strong&gt; The
benchmark allocates a million &lt;code&gt;Point{x, y}&lt;/code&gt; structs, sums fields, and
reports the time. In statically typed compiled languages, the optimizer
stack-allocates the struct, inlines the constructor, proves the struct
never escapes the loop, and eliminates the whole thing as dead code.
The measured result is zero because the work is zero. Perry cannot
match this without abandoning its dynamic value model. A recent Perry
pass (v0.5.17) does scalar-replacement for objects whose only uses are
field get/set, which is why Perry measures 2 ms and not 10 — but any
method call on the object defeats it. This is the shape of workload
where ahead-of-time compiling a dynamic language pays a real tax
against languages with static types, and no amount of flag-tuning
closes the gap.&lt;/p&gt;
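&lt;p&gt;The transformation the static languages get for free looks like this when done by hand (a Python sketch of the benchmark shape and of scalar replacement; the real benchmark uses a million points):&lt;/p&gt;

```python
def object_create(n):
    total = 0.0
    for i in range(n):
        p = {"x": float(i), "y": float(i) * 2.0}  # allocated, then dead
        total += p["x"] + p["y"]
    return total

def scalar_replaced(n):
    # what escape analysis proves equivalent: no object at all
    total = 0.0
    for i in range(n):
        x = float(i)
        y = float(i) * 2.0
        total += x + y
    return total

assert object_create(1000) == scalar_replaced(1000)
```

A static compiler can then notice that if the result is unused, the whole loop is dead; an AOT compiler for a dynamic value model has to prove much more before it can do the same.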
&lt;p&gt;&lt;strong&gt;&lt;code&gt;nested_loops&lt;/code&gt;: Perry 9 ms vs C++ opt 1 ms.&lt;/strong&gt; Same story as
&lt;code&gt;array_read&lt;/code&gt;, same cause: &lt;code&gt;-ffast-math&lt;/code&gt; enables a more aggressive
interleave factor than Perry&apos;s &lt;code&gt;reassoc contract&lt;/code&gt; subset does. Perry&apos;s
3 ms on &lt;code&gt;array_read&lt;/code&gt; and 9 ms on &lt;code&gt;nested_loops&lt;/code&gt; are both beaten by C++
opt, because &lt;code&gt;fast&lt;/code&gt; includes &lt;code&gt;nnan&lt;/code&gt; and &lt;code&gt;ninf&lt;/code&gt; permissions that the
loop vectorizer uses to pick a higher unroll. Perry deliberately does
not emit those, because JavaScript programs can observe NaN and
infinity, and dropping signed-zero semantics would break observable facts like &lt;code&gt;Object.is(Math.min(-0, 0), -0)&lt;/code&gt;. That&apos;s a real
correctness tradeoff — the ceiling Perry could hit if it stopped caring
about NaN/Inf semantics is several milliseconds faster on flat-array
sums. Right now it doesn&apos;t hit it.&lt;/p&gt;
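&lt;p&gt;Signed zero really is observable, even though ordinary equality cannot see it (Python, same IEEE 754 doubles):&lt;/p&gt;

```python
import math

assert -0.0 == 0.0                        # equality is sign-blind
assert math.copysign(1.0, -0.0) == -1.0   # but the sign is still there
assert str(-0.0) == "-0.0"                # and it prints
```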
&lt;h3&gt;An aside: where JIT beats AOT&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;fibonacci&lt;/code&gt;: Java 280 ms vs Perry 311 ms. Recursive &lt;code&gt;fib(40)&lt;/code&gt; runs
about 330 million real calls. Java&apos;s C2 JIT observes the recursion at
runtime and applies aggressive inlining based on actual hot-call
frequencies — something no AOT compiler can match without whole-program
profile data. Perry, C++, and Rust all cluster at ~310–319 ms through
LLVM; Swift at 360 ms and Go at 450 ms lose at the recursion-folding
stage inside their own backends. This benchmark is essentially a
compiler-pass quality test, not a flag-tuning target. No flag changes
any of these numbers materially.&lt;/p&gt;
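&lt;p&gt;The call count is worth pinning down, since it is what the JIT gets to observe. For the naive recursion the standard identity is &lt;code&gt;calls(n) = 2*fib(n+1) - 1&lt;/code&gt;:&lt;/p&gt;

```python
import functools

def fib(n):
    # the naive recursion the benchmark uses
    if n > 1:
        return fib(n - 1) + fib(n - 2)
    return n

@functools.lru_cache(maxsize=None)
def fib_memo(n):
    # memoized copy, used only to evaluate the call-count identity
    if n > 1:
        return fib_memo(n - 1) + fib_memo(n - 2)
    return n

calls_fib40 = 2 * fib_memo(41) - 1

assert fib(20) == 6765
assert calls_fib40 == 331_160_281   # roughly 330 million calls
```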
&lt;h2&gt;The meta-point&lt;/h2&gt;
&lt;p&gt;Rust, C++, Go, and Swift picked conservative defaults for a reason.
Their users care about reproducibility, IEEE 754 correctness, and about
not having to audit every numeric operation for the possibility the
compiler silently reassociated it. A 3D renderer that reads back the
same color from two parallel paths, a simulation that needs bit-exact
replay for debugging, a financial calculation that must be verifiably
deterministic — all of these care, and they&apos;d be angry if
&lt;code&gt;(a + b) + c != a + (b + c)&lt;/code&gt;. The languages&apos; compile defaults reflect
that population.&lt;/p&gt;
&lt;p&gt;Compiling TypeScript for a JavaScript audience is a different tradeoff.
JS programs mostly don&apos;t treat &lt;code&gt;-0.0&lt;/code&gt; distinctly from &lt;code&gt;0.0&lt;/code&gt; even when
they could. Most TS code that hits a numeric loop is a game tick, a
compiler pass, a canvas renderer — workloads where a bit of
reassociation is fine. So Perry turns &lt;code&gt;reassoc&lt;/code&gt; on by default. It isn&apos;t
braver or smarter than Rust; it serves a different population.&lt;/p&gt;
&lt;p&gt;What&apos;s interesting isn&apos;t that Perry made the call. It&apos;s that the call
is invisible in most comparisons. The numbers people see when they
benchmark &quot;Rust vs TypeScript&quot; or &quot;C++ vs JavaScript&quot; reflect the
defaults both sides picked, with no indication that one side spent
those defaults on numerical robustness and the other spent them on
throughput. The benchmarks look like they&apos;re comparing languages. They
are actually comparing flag choices.&lt;/p&gt;
&lt;p&gt;There&apos;s no meta-rule for which default is right. &quot;Enable reassoc by
default&quot; is good for numeric loops and bad for scientific simulations.
&quot;Strict IEEE by default&quot; is the opposite. Both are defensible. What
isn&apos;t defensible is concluding from benchmark tables alone that one
language is faster than another. The defaults are the experiment.&lt;/p&gt;
&lt;h2&gt;Closing&lt;/h2&gt;
&lt;p&gt;Every claim in this post is reproducible with the code at the link
below. The four &lt;code&gt;bench_opt&lt;/code&gt; files showed that the &quot;Perry wins&quot; column
closes to within noise on all three flag-sensitive benchmarks when the
other languages are given the equivalent optimization path — except on
Go, where the path doesn&apos;t exist. None of this required anything
exotic. &lt;code&gt;-ffast-math&lt;/code&gt; is a flag you can type today. Nightly Rust&apos;s
&lt;code&gt;fadd_fast&lt;/code&gt; intrinsic is &lt;code&gt;#![feature(core_intrinsics)]&lt;/code&gt; plus one use
statement. Whether either should be your default is a judgment call
about what you&apos;re building.&lt;/p&gt;
&lt;p&gt;Perry exists because some of us wanted to compile TypeScript to
something that isn&apos;t a JavaScript engine. It uses LLVM. You&apos;ve seen
other LLVM-based compilers in this post: clang, rustc, swiftc. They all
produce similar output when you ask them for similar things. The
experiment this article documented is what they do when you don&apos;t.&lt;/p&gt;
&lt;h2&gt;Reproduction&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/ralphkuepper/perry
cd perry/benchmarks/polyglot
cargo build --release --manifest-path=../../Cargo.toml -p perry
bash run_all.sh 5          # default-flags numbers — produces RESULTS.md
bash run_opt.sh 5 20       # opt variants — produces RESULTS_OPT.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Hardware used for the numbers in this post: Apple M1 Max (10 cores),
64 GB RAM, macOS 26.4. Perry commit &lt;code&gt;e1cbd37&lt;/code&gt; (v0.5.22). rustc 1.92.0
stable, 1.97.0-nightly 2026-04-14, Apple clang 21.0, Swift 6.3, Go
1.21.3, Node 25.8, Bun 1.3.5, Python 3.14.&lt;/p&gt;
&lt;p&gt;All LLVM IR snippets in this article are in &lt;code&gt;assets/&lt;/code&gt; as full &lt;code&gt;.ll&lt;/code&gt;
files, reproducible with &lt;code&gt;clang -S -emit-llvm&lt;/code&gt; (C++), &lt;code&gt;rustc -O --emit=llvm-ir&lt;/code&gt; (Rust), and &lt;code&gt;PERRY_SAVE_LL=&amp;lt;dir&amp;gt; perry compile&lt;/code&gt; (Perry).
The accompanying &lt;code&gt;METHODOLOGY.md&lt;/code&gt; in that directory has the exact
iteration counts, clocks, and timing methodology.&lt;/p&gt;
</content:encoded></item><item><title>On publishing slowly</title><link>https://amlug.net/posts/on-publishing-slowly/</link><guid isPermaLink="true">https://amlug.net/posts/on-publishing-slowly/</guid><description>Why this site will only see four to eight posts a year — and why that&apos;s the point.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I have a drafts folder full of technical writing that never shipped. Some of
it was too half-formed to post; most of it was fine, but I kept pushing it
to &quot;after I run one more benchmark,&quot; and after enough rounds of that, the
piece stops feeling urgent to anyone, myself included.&lt;/p&gt;
&lt;p&gt;This site is an attempt to fix the second failure mode without giving in to
the first. The plan is narrow: four to eight posts a year, each one about
something I actually had to work out, and nothing in between. No weekly
cadence, no &quot;thinking out loud&quot; threads, no TIL dumps. If I have a
half-formed thought worth sharing, a Mastodon post is the right shape for
it, not a page on a domain with my name on it.&lt;/p&gt;
&lt;h2&gt;What goes here&lt;/h2&gt;
&lt;p&gt;Most of what I&apos;ll write about lives near compilers and systems. I spend
my days on &lt;a href=&quot;https://perry.dev/&quot;&gt;Perry&lt;/a&gt;, a TypeScript-to-native compiler,
and the questions I get stuck on tend to be the kind that take a week of
measurement to answer honestly. Things like: what does &lt;code&gt;for...of&lt;/code&gt; actually
cost in a lowered IR, where is the line between &quot;clever abstraction&quot; and
&quot;extra two memory loads per iteration,&quot; and when is the answer to a
performance question &quot;the benchmark was wrong.&quot;&lt;/p&gt;
&lt;p&gt;Before Perry I spent years on Swift server-side work and contributed to
&lt;a href=&quot;https://vapor.codes/&quot;&gt;Vapor&lt;/a&gt;; some of that will show up here too, when I
have something to add that isn&apos;t already in the docs.&lt;/p&gt;
&lt;h2&gt;What does not&lt;/h2&gt;
&lt;p&gt;No company updates. No product announcements. No &quot;10 things I learned
this year.&quot; If you&apos;re looking for Skelpo news or Perry release notes,
those live on their own sites and will stay there.&lt;/p&gt;
&lt;h2&gt;The first real post&lt;/h2&gt;
&lt;p&gt;There&apos;s a benchmark investigation in the queue — a longer piece on the
cost of abstraction in the Perry front-end, with enough numbers attached
that I&apos;d rather over-verify before publishing than push it and have to
correct it in public. That article is the reason this site exists now
instead of in six months.&lt;/p&gt;
&lt;p&gt;Until then, this is the note on the door.&lt;/p&gt;
</content:encoded></item></channel></rss>