Assembly Optimization Techniques
Optimizing assembly code requires understanding both the processor architecture and the specific requirements of your application. These techniques can help you write faster, more efficient code.
Basic Optimization Principles
- Profile before optimizing: Identify bottlenecks first
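- Understand the pipeline: Modern CPUs have deep pipelines, so stalls and branch mispredictions are expensive
- Minimize dependencies: Break up long dependency chains so independent instructions can execute in parallel
- Use registers efficiently: Registers are the fastest storage, so keep frequently used values in them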
- Align data and code: Proper alignment improves memory access
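The dependency principle above can be sketched with a summation loop that keeps two independent accumulators, so consecutive adds do not wait on each other. This is a minimal sketch, not a tuned routine: the function name `sum_i32` is hypothetical, and it assumes x86-64 with the System V calling convention (`rdi` = pointer, `rsi` = element count) and an even element count.

```nasm
; Hypothetical function: int32_t sum_i32(const int32_t *data, size_t count)
; Assumptions: x86-64 System V ABI (rdi = data, rsi = count), count is even.
; Assemble with: nasm -f elf64
        global  sum_i32
        section .text
sum_i32:
        xor     eax, eax            ; accumulator 0
        xor     edx, edx            ; accumulator 1 (independent chain)
        test    rsi, rsi
        jz      .done               ; nothing to add
.loop:
        add     eax, [rdi]          ; chain 0 depends only on eax
        add     edx, [rdi + 4]      ; chain 1 depends only on edx
        add     rdi, 8              ; advance two elements
        sub     rsi, 2
        jnz     .loop
.done:
        add     eax, edx            ; combine the two partial sums
        ret
```

The benefit grows with the latency of the operation in the chain: it tends to matter more for multi-cycle operations such as floating-point adds than for single-cycle integer adds.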
Instruction Selection
Choosing the right instructions can significantly impact performance:
Use Specialized Instructions
; Instead of this:
mov eax, 0
; Use this (shorter encoding, recognized by CPUs as a zeroing idiom; note that it modifies flags):
xor eax, eax
Use LEA for Arithmetic
; Instead of:
mov eax, ebx
add eax, 10
; Use (one instruction, doesn't modify flags):
lea eax, [ebx + 10]
Register Usage
Efficient register usage is critical for performance:
- Minimize register spills to memory
- Reuse registers when possible
- Use caller-saved registers for temporary values, since they need no save and restore in the function itself
- Consider register pressure when inlining functions
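As an illustration of the register guidelines above, the following sketch keeps a value that must survive a call in a callee-saved register and leaves short-lived temporaries in caller-saved ones. Both function names are hypothetical, and the register roles assume the x86-64 System V ABI (`rbx` callee-saved; `rax`, `rcx`, `rdx`, `rsi`, `rdi`, `r8`-`r11` caller-saved).

```nasm
; Assumptions: x86-64 System V ABI. All names here are hypothetical.
; Assemble with: nasm -f elf64
        global  add_with_helper
        section .text

; int helper(int x): returns x * 2 (stand-in for any callee)
helper:
        lea     eax, [rdi + rdi]    ; temporary result in a caller-saved register
        ret

; int add_with_helper(int x): returns helper(x) + x
add_with_helper:
        push    rbx                 ; rbx is callee-saved: preserve it (also realigns the stack to 16 bytes)
        mov     ebx, edi            ; x must survive the call, so keep it in rbx
        call    helper              ; may clobber any caller-saved register
        add     eax, ebx            ; helper(x) + x
        pop     rbx                 ; restore the caller's rbx
        ret
```

A leaf function that makes no calls can usually avoid the `push`/`pop` pair entirely by using only caller-saved registers.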
Loop Optimization
Loops are often performance-critical sections:
Loop Unrolling
; Original loop:
mov ecx, 100
loop_start:
; loop body
dec ecx
jnz loop_start
; Unrolled version (4 iterations per pass; 100 is divisible by 4, so no remainder loop is needed):
mov ecx, 25
unrolled_loop:
; loop body iteration 1
; loop body iteration 2
; loop body iteration 3
; loop body iteration 4
dec ecx
jnz unrolled_loop
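When the trip count is not known to be a multiple of the unroll factor, the unrolled loop needs a remainder loop. A minimal sketch of that pattern follows; the function name `sum_i32_unrolled` is hypothetical, and the register assignments assume the x86-64 System V ABI.

```nasm
; Hypothetical function: int32_t sum_i32_unrolled(const int32_t *data, size_t n)
; Assumptions: x86-64 System V ABI (rdi = data, rsi = n).
; Assemble with: nasm -f elf64
        global  sum_i32_unrolled
        section .text
sum_i32_unrolled:
        xor     eax, eax            ; running sum
        mov     rcx, rsi
        shr     rcx, 2              ; rcx = number of full 4-element blocks
        jz      .tail               ; fewer than 4 elements: skip the unrolled loop
.block:
        add     eax, [rdi]          ; iteration 1
        add     eax, [rdi + 4]      ; iteration 2
        add     eax, [rdi + 8]      ; iteration 3
        add     eax, [rdi + 12]     ; iteration 4
        add     rdi, 16
        dec     rcx
        jnz     .block
.tail:
        and     esi, 3              ; leftover elements (n mod 4)
        jz      .done
.rest:
        add     eax, [rdi]          ; one element per pass
        add     rdi, 4
        dec     esi
        jnz     .rest
.done:
        ret
```

The unrolled body pays the `dec`/`jnz` overhead once per four elements, while the remainder loop runs at most three times.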
Align Loop Entries
align 16 ; Align the loop entry to a 16-byte boundary
aligned_loop:
; loop body
dec ecx
jnz aligned_loop
Memory Access Optimization
Memory access is often the bottleneck:
- Access memory sequentially when possible
- Align data to cache line boundaries (typically 64 bytes)
- Use prefetch instructions for predictable access patterns
- Minimize cache misses by keeping working sets small
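The sequential-access and prefetch points above can be sketched as a loop that walks a buffer one cache line at a time and issues a prefetch hint for a line further ahead. The function name `sum_u64_prefetch`, the 64-byte line size, and the 256-byte prefetch distance are all assumptions for illustration; a real distance would have to be tuned by measurement.

```nasm
; Hypothetical function: uint64_t sum_u64_prefetch(const uint64_t *buf, size_t bytes)
; Assumptions: x86-64 System V ABI (rdi = buf, rsi = bytes),
;              bytes is a nonzero multiple of 64, cache lines are 64 bytes.
; Assemble with: nasm -f elf64
        global  sum_u64_prefetch
        section .text
sum_u64_prefetch:
        xor     eax, eax            ; rax = running sum (upper half is zeroed too)
.line:
        prefetcht0 [rdi + 256]      ; hint: fetch the line 4 lines ahead (untuned guess)
        mov     ecx, 8              ; 8 qwords per 64-byte line
.qword:
        add     rax, [rdi]          ; sequential access within the line
        add     rdi, 8
        dec     ecx
        jnz     .qword
        sub     rsi, 64
        jnz     .line
        ret
```

Prefetch instructions are hints and do not fault, so running past the end of the buffer is harmless here. For a purely sequential scan like this one, the hardware prefetcher often does the job already; software prefetch tends to pay off for patterns that are predictable to the programmer but not to the hardware.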
Branch Optimization
Branches can disrupt the instruction pipeline:
Branch Prediction
; Arrange code so the most likely path is the fall-through
cmp eax, ebx
jne rare_case ; Rarely taken forward branch; the common path falls through
; Common case code here
jmp end
rare_case:
; Rare case code
end:
Branch Elimination
; Instead of:
test eax, eax
jz zero_case
mov ebx, 1
jmp end
zero_case:
mov ebx, 0
end:
; Use a branchless SETcc instead:
xor ebx, ebx ; ebx = 0
test eax, eax
setnz bl ; ebx = (eax != 0) ? 1 : 0
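A related branchless tool is CMOVcc, which selects between two register values based on the flags. The sketch below computes a signed maximum without a jump; the function name `max_i32` is hypothetical and the argument registers assume the x86-64 System V ABI.

```nasm
; Hypothetical function: int32_t max_i32(int32_t a, int32_t b)
; Assumptions: x86-64 System V ABI (edi = a, esi = b, result in eax).
; Assemble with: nasm -f elf64
        global  max_i32
        section .text
max_i32:
        mov     eax, edi            ; start with a
        cmp     esi, edi            ; flags from b - a
        cmovg   eax, esi            ; if b > a (signed), take b instead
        ret
```

Branchless code mainly helps when the branch is hard to predict. When the branch is well predicted, a conditional jump can be faster, because CMOVcc makes the result depend on both inputs.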
Function Call Optimization
Function calls have overhead that can be minimized:
Inline Small Functions
; Instead of calling a small function:
call small_func
; Copy the function body directly
; (contents of small_func here)
Tail Call Optimization
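; Instead of (when the call is the last thing the function does):
call func
ret
; Use (only valid once the current stack frame has been released):
jmp func ; Tail call: func returns directly to our caller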
SIMD Optimization
Single Instruction Multiple Data (SIMD) instructions can process multiple data elements in parallel:
; Example using SSE to add 4 floats at once (movaps requires 16-byte-aligned addresses; use movups otherwise)
movaps xmm0, [array1] ; Load 4 floats
movaps xmm1, [array2] ; Load 4 floats
addps xmm0, xmm1 ; Add all 4 floats in parallel
movaps [result], xmm0 ; Store result
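The four-float pattern extends naturally to whole arrays. The following sketch processes four floats per iteration and uses unaligned loads and stores so it makes no alignment assumptions; the function name `add_f32` is hypothetical, and the register assignments assume the x86-64 System V ABI.

```nasm
; Hypothetical function: void add_f32(float *dst, const float *a, const float *b, size_t n)
; Assumptions: x86-64 System V ABI (rdi = dst, rsi = a, rdx = b, rcx = n),
;              n is a nonzero multiple of 4.
; Assemble with: nasm -f elf64
        global  add_f32
        section .text
add_f32:
        xor     eax, eax            ; rax = byte offset into the arrays
        shl     rcx, 2              ; element count -> byte count
.loop:
        movups  xmm0, [rsi + rax]   ; load 4 floats from a (no alignment required)
        movups  xmm1, [rdx + rax]   ; load 4 floats from b
        addps   xmm0, xmm1          ; 4 additions in parallel
        movups  [rdi + rax], xmm0   ; store 4 results to dst
        add     rax, 16
        cmp     rax, rcx
        jb      .loop
        ret
```

If all three pointers are known to be 16-byte aligned, `movaps` can be used instead. A production version would also need a scalar tail for counts that are not a multiple of four.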
Alignment Optimization
Proper alignment improves memory access performance:
section .data
align 16 ; Align to 16-byte boundary
my_array dd 1.0, 2.0, 3.0, 4.0
section .text
align 16 ; Align code to 16-byte boundary
fast_function:
; Code here
Cache Optimization
Optimizing for CPU cache can dramatically improve performance:
- Structure data to fit in cache lines
- Use temporal and spatial locality
- Avoid cache thrashing in loops
- Consider cache-oblivious algorithms for large datasets
Floating-Point Optimization
Floating-point operations require special consideration:
- Use SSE/AVX instead of legacy x87 FPU when possible
- Minimize conversions between integer and floating-point
- Reorder operations to reduce pipeline stalls
- Consider precision requirements when choosing instructions
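To illustrate the first two points above, this sketch computes the mean of an integer array using scalar SSE2 instead of x87, and converts to floating point only once at the end rather than once per element. The function name `mean_i32` is hypothetical, and the register assignments assume the x86-64 System V ABI.

```nasm
; Hypothetical function: double mean_i32(const int32_t *data, size_t n)
; Assumptions: x86-64 System V ABI (rdi = data, rsi = n, result in xmm0),
;              n is nonzero and fits in a signed 64-bit integer.
; Assemble with: nasm -f elf64
        global  mean_i32
        section .text
mean_i32:
        xor     eax, eax            ; rax = 64-bit integer running sum
        mov     rcx, rsi            ; rcx = elements remaining
.loop:
        movsxd  rdx, dword [rdi]    ; sign-extend one 32-bit element
        add     rax, rdx            ; accumulate in integer registers
        add     rdi, 4
        dec     rcx
        jnz     .loop
        cvtsi2sd xmm0, rax          ; single conversion for the sum
        cvtsi2sd xmm1, rsi          ; single conversion for the count
        divsd   xmm0, xmm1          ; mean = sum / n
        ret
```

Accumulating in integers also keeps the sum exact, whereas converting each element first would round at every addition once the values exceed the precision of the floating-point type.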
Common Pitfalls
- Over-optimizing non-critical code
- Creating register pressure with excessive inlining
- Ignoring cache effects in memory access patterns
- Assuming optimizations that work on one CPU will work on all
- Optimizing before measuring actual performance
Performance Measurement
Essential tools for measuring performance:
- Time-stamp counter (RDTSC/RDTSCP) and hardware performance counters (RDPMC, perf events)
- Profiling tools (perf, VTune, Callgrind)
- Microbenchmarking frameworks
- Cycle-accurate simulators for deep analysis
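As a rough illustration of timing with the time-stamp counter, the sketch below brackets a placeholder region with RDTSC and RDTSCP and returns the elapsed tick count. The function name `time_region` and the spin loop standing in for the code under test are hypothetical, and the sketch assumes an x86-64 CPU that supports RDTSCP.

```nasm
; Hypothetical function: uint64_t time_region(void)
; Assumptions: x86-64 System V ABI (result in rax), CPU supports RDTSCP.
; Assemble with: nasm -f elf64
        global  time_region
        section .text
time_region:
        lfence                      ; keep earlier instructions out of the timed region
        rdtsc                       ; edx:eax = start timestamp
        shl     rdx, 32
        or      rax, rdx            ; rax = 64-bit start value
        mov     r8, rax             ; keep start in a caller-saved scratch register

        ; --- placeholder for the code under test ---
        mov     ecx, 1000
.spin:
        dec     ecx
        jnz     .spin

        rdtscp                      ; waits for prior instructions; edx:eax = end, ecx = TSC_AUX
        lfence                      ; keep later instructions from starting early
        shl     rdx, 32
        or      rax, rdx            ; rax = 64-bit end value
        sub     rax, r8             ; elapsed ticks
        ret
```

On most recent x86 processors the time-stamp counter ticks at a constant rate that is independent of the current core frequency, so the result is a measure of elapsed time rather than of core clock cycles. Repeating the measurement and taking the minimum or median helps filter out interrupts and other noise.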
Modern CPU Considerations
Modern CPU features that affect optimization decisions:
- Out-of-order execution
- Speculative execution
- SIMD extensions (SSE, AVX, AVX-512)
- Hyperthreading/SMT
- Multiple cache levels
Next Steps
To learn more about specific architectures:
- ARM Assembly
- Read CPU vendor optimization manuals
- Study compiler output for optimization ideas