Assembly Optimization Techniques

Optimizing assembly code requires understanding both the processor architecture and the specific requirements of your application. These techniques can help you write faster, more efficient code.

Basic Optimization Principles

Profile before optimizing: Identify bottlenecks first
Understand the pipeline: Modern CPUs have deep pipelines, so stalls and mispredicted branches are costly
Minimize dependencies: Break long dependency chains so that independent instructions can execute in parallel
Use registers efficiently: Register access is fastest
Align data and code: Proper alignment improves memory access
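
The "minimize dependencies" principle can be sketched in C rather than assembly, which makes it easy to compare the compiler's output for both variants. The function names below are hypothetical, and integers are used so that both versions return exactly the same value. Whether the split version is actually faster depends on the CPU and on what the compiler already does, so treat it as an illustration, not a guaranteed win.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single accumulator: every addition depends on the previous one. */
uint64_t sum_chained(const uint32_t *v, size_t n) {
    assert(v != NULL || n == 0);
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: four shorter, independent chains that can overlap. */
uint64_t sum_split(const uint32_t *v, size_t n) {
    assert(v != NULL || n == 0);
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)          /* leftover 0..3 elements */
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both functions return the same sum for any input; only the shape of the dependency chain differs.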

Instruction Selection

Choosing the right instructions can significantly impact performance:

Use Specialized Instructions


; Instead of this:
mov eax, 0

; Use this (shorter encoding and often faster; note that it modifies flags):
xor eax, eax

Use LEA for Arithmetic


; Instead of:
mov eax, ebx
add eax, 10

; Use (doesn't modify flags):
lea eax, [ebx + 10]

Register Usage

Efficient register usage is critical for performance:

Minimize register spills to memory
Reuse registers when possible
Use caller-saved registers for temporary values that do not need to survive a call
Consider register pressure when inlining functions

Loop Optimization

Loops are often performance-critical sections:

Loop Unrolling


; Original loop:
mov ecx, 100
loop_start:
    ; loop body
    dec ecx
    jnz loop_start

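; Unrolled version (4 iterations per pass; the iteration count must be
; a multiple of 4, otherwise handle the remainder in a separate loop):
mov ecx, 25
unrolled_loop:
    ; loop body iteration 1
    ; loop body iteration 2
    ; loop body iteration 3
    ; loop body iteration 4
    dec ecx
    jnz unrolled_loop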
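
When the iteration count is not known to be a multiple of the unroll factor, the leftover iterations need their own loop. A minimal C sketch of that pattern follows; the function name and the choice of "add a constant to each element" as the loop body are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adds `delta` to every element, four elements per pass. */
void add_const_unrolled(int32_t *a, size_t n, int32_t delta) {
    assert(a != NULL || n == 0);
    size_t i = 0;
    size_t blocks = n / 4;              /* number of full 4-element groups */
    for (size_t b = 0; b < blocks; b++, i += 4) {
        a[i]     += delta;
        a[i + 1] += delta;
        a[i + 2] += delta;
        a[i + 3] += delta;
    }
    for (; i < n; i++)                  /* remainder: n % 4 elements */
        a[i] += delta;
}
```

With `n = 7` the first loop runs once (elements 0 to 3) and the remainder loop handles elements 4 to 6.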

Align Loop Entries


align 16      ; Align the loop entry to a 16-byte boundary
aligned_loop:
    ; loop body
    dec ecx
    jnz aligned_loop

Memory Access Optimization

Memory access is often the bottleneck:

Access memory sequentially when possible
Align data to cache line boundaries (typically 64 bytes)
Use prefetch instructions for predictable access patterns
Minimize cache misses by keeping working sets small
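
To illustrate sequential access, the C sketch below sums the same row-major matrix in two traversal orders. The function names and the flat `rows * cols` layout are assumptions for the example. Both return the same result; on large matrices the row-major walk usually touches each cache line once, while the column-major walk jumps `cols * sizeof(int32_t)` bytes between loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Row-major walk: consecutive addresses, friendly to caches and prefetchers. */
int64_t sum_row_major(const int32_t *m, size_t rows, size_t cols) {
    assert(m != NULL || rows * cols == 0);
    int64_t s = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            s += m[r * cols + c];
    return s;
}

/* Column-major walk over the same data: strided access, same result. */
int64_t sum_col_major(const int32_t *m, size_t rows, size_t cols) {
    assert(m != NULL || rows * cols == 0);
    int64_t s = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            s += m[r * cols + c];
    return s;
}
```

Timing the two functions on a matrix much larger than the last-level cache is a simple way to see the effect on a given machine.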

Branch Optimization

Branches can disrupt the instruction pipeline:

Branch Prediction


; Arrange code so the most likely path is the fall-through
cmp eax, ebx
jne rare_case   ; Rarely taken, so the common path stays contiguous
; Common case code here
jmp common_done
rare_case:
    ; Rare case code
common_done:

Branch Elimination


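; Instead of:
test eax, eax
jz zero_case
mov ebx, 1
jmp select_done
zero_case:
mov ebx, 0
select_done:

; Use a branchless sequence (SETcc here; CMOVcc covers general values):
xor ebx, ebx    ; ebx = 0 (must come before TEST because XOR modifies flags)
test eax, eax
setnz bl        ; ebx = (eax != 0) ? 1 : 0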
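
The same idea can be written in C, where compilers commonly (though not always) turn such expressions into SETcc or CMOVcc sequences. The helper names below are hypothetical; the mask-based select is one possible branchless formulation, not the only one.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if x is nonzero, else 0. A C comparison already yields 0 or 1. */
uint32_t is_nonzero(uint32_t x) {
    return x != 0;
}

/* Returns a when cond is nonzero, otherwise b, without a branch in the source. */
uint32_t select_u32(uint32_t cond, uint32_t a, uint32_t b) {
    /* mask is all ones when cond != 0 and all zeros otherwise */
    uint32_t mask = 0u - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

Branchless code pays off mainly when the branch is hard to predict; for well-predicted branches the plain conditional is often just as fast.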

Function Call Optimization

Function calls have overhead that can be minimized:

Inline Small Functions


; Instead of calling a small function:
call small_func

; Copy the function body directly
; (contents of small_func here)

Tail Call Optimization


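; Instead of (when the call is the last thing the function does):
call func
ret

; Use (valid only once the current stack frame has been released):
jmp func   ; Tail call: func returns directly to our caller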

SIMD Optimization

Single Instruction Multiple Data (SIMD) instructions can process multiple data elements in parallel:


; Example using SSE to add 4 floats at once
; (movaps requires 16-byte-aligned addresses; use movups for unaligned data)
movaps xmm0, [array1]  ; Load 4 floats
movaps xmm1, [array2]  ; Load 4 floats
addps xmm0, xmm1       ; Add all 4 floats in parallel
movaps [result], xmm0  ; Store result
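
For arrays longer than four elements, the same SSE operations can be reached from C through intrinsics. The sketch below is an assumption-laden illustration: the function name is hypothetical, it uses the unaligned load/store forms so no alignment is required, and it falls back to a scalar loop when SSE is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>          /* SSE intrinsics */
#define HAVE_SSE 1
#endif

/* out[i] = a[i] + b[i] for i in [0, n). */
void add_floats(const float *a, const float *b, float *out, size_t n) {
    assert((a != NULL && b != NULL && out != NULL) || n == 0);
    size_t i = 0;
#ifdef HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);            /* load 4 floats */
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* add and store 4 floats */
    }
#endif
    for (; i < n; i++)                              /* scalar tail or fallback */
        out[i] = a[i] + b[i];
}
```

If the data is known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` correspond to the `movaps` form used in the assembly above.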

Alignment Optimization

Proper alignment improves memory access performance:


section .data
align 16      ; Align to 16-byte boundary
my_array dd 1.0, 2.0, 3.0, 4.0

section .text
align 16      ; Align code to 16-byte boundary
fast_function:
    ; Code here
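
When the data is declared in C and consumed by assembly routines, C11 `alignas` gives the same guarantee as the `align` directive. In this sketch the 64-byte value is an assumed cache-line size, and `samples`, `aligned_samples`, and `is_aligned` are hypothetical names.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64           /* assumed cache-line size in bytes */

/* Buffer placed on a cache-line boundary. */
static alignas(CACHE_LINE) float samples[16];

/* Returns 1 when p lies on an `alignment`-byte boundary (power of two). */
int is_aligned(const void *p, uintptr_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Accessor so other translation units can reach the aligned buffer. */
const float *aligned_samples(void) {
    return samples;
}
```

A check such as `is_aligned(ptr, 16)` is a cheap debug-time guard before handing a pointer to code that uses aligned loads like `movaps`.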

Cache Optimization

Optimizing for CPU cache can dramatically improve performance:

Structure data to fit in cache lines
Use temporal and spatial locality
Avoid cache thrashing in loops
Consider cache-oblivious algorithms for large datasets
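
One common way to apply these points is loop tiling (blocking), shown here as a C sketch of a matrix transpose. The tile edge of 8 is an assumption to be tuned per target, and the function name is hypothetical. Working tile by tile keeps both the source rows and the destination rows that are in use inside the cache at the same time.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 8 };              /* assumed tile edge; tune for the target cache */

/* dst (cols x rows) receives the transpose of src (rows x cols). */
void transpose_tiled(const int *src, int *dst, size_t rows, size_t cols) {
    assert(src != dst);         /* in-place transpose is not supported */
    for (size_t r0 = 0; r0 < rows; r0 += TILE) {
        for (size_t c0 = 0; c0 < cols; c0 += TILE) {
            /* clamp the tile at the matrix edges */
            size_t r_end = r0 + TILE < rows ? r0 + TILE : rows;
            size_t c_end = c0 + TILE < cols ? c0 + TILE : cols;
            for (size_t r = r0; r < r_end; r++)
                for (size_t c = c0; c < c_end; c++)
                    dst[c * rows + r] = src[r * cols + c];
        }
    }
}
```

The result is identical to a plain two-loop transpose; only the order in which elements are visited changes.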

Floating-Point Optimization

Floating-point operations require special consideration:

Use SSE/AVX instead of legacy x87 FPU when possible
Minimize conversions between integer and floating-point
Reorder operations to reduce pipeline stalls
Consider precision requirements when choosing instructions

Common Pitfalls

Over-optimizing non-critical code
Creating register pressure with excessive inlining
Ignoring cache effects in memory access patterns
Assuming optimizations that work on one CPU will work on all
Optimizing before measuring actual performance

Performance Measurement

Essential tools for measuring performance:

Time-stamp and hardware performance counters (RDTSC/RDTSCP, RDPMC, perf events)
Profiling tools (perf, VTune, Callgrind)
Microbenchmarking frameworks
Cycle-accurate simulators for deep analysis
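
As a starting point before reaching for these tools, a very small timing harness can be written in portable C. This is a rough sketch under stated assumptions: it uses the standard `clock()` function, which measures CPU time at coarse resolution, and `sum_squares` is a hypothetical workload standing in for the routine under test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical workload: sum of i*i for i in [0, n). */
uint64_t sum_squares(size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += (uint64_t)i * i;
    return s;
}

/* Runs fn(n) `repeats` times and returns the best CPU time in seconds.
   The last result is written to *result_out so the work stays observable. */
double time_best_of(uint64_t (*fn)(size_t), size_t n, int repeats,
                    uint64_t *result_out) {
    assert(fn != NULL && repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        clock_t t0 = clock();
        uint64_t res = fn(n);
        clock_t t1 = clock();
        double secs = (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best)
            best = secs;
        if (result_out != NULL)
            *result_out = res;
    }
    return best;
}
```

Taking the best of several runs reduces noise from interrupts and cold caches; for cycle-level detail, the counters and profilers listed above are the better fit.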

Modern CPU Considerations

CPU features that influence how code should be optimized:

Out-of-order execution
Speculative execution
SIMD extensions (SSE, AVX, AVX-512)
Hyperthreading/SMT
Multiple cache levels

Next Steps

To learn more about specific architectures:

ARM Assembly
Read CPU vendor optimization manuals
Study compiler output for optimization ideas