DSPRelated.com

Floating-Point (DSP)

Category: Numerics | Also known as: floating point

Floating-point is a number representation that encodes a value as a signed mantissa (significand) and a signed exponent relative to a fixed base, allowing a wide dynamic range with a nearly constant relative precision. In DSP and embedded contexts, the IEEE 754 standard dominates, defining 32-bit single-precision and 64-bit double-precision formats as well as the rules for rounding, infinities, and NaN (Not a Number) values.
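
As a rough illustration of the sign/exponent/significand encoding described above, the following Python sketch unpacks the three bit fields of an IEEE 754 single-precision value. The helper names `decode_float32` and `field_value` are hypothetical, chosen for this example, and the reconstruction handles normal numbers only (not zeros, subnormals, infinities, or NaNs).

```python
import struct

def decode_float32(x):
    """Split x (rounded to IEEE 754 single precision) into its bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF         # 23 stored bits of the significand
    return sign, exponent, fraction

def field_value(sign, exponent, fraction):
    """Rebuild the value of a normal number (exponent field 1..254)."""
    significand = 1.0 + fraction / 2**23   # implicit leading 1
    return (-1.0) ** sign * significand * 2.0 ** (exponent - 127)

sign, exponent, fraction = decode_float32(-6.25)
print(sign, exponent - 127, hex(fraction))
print(field_value(sign, exponent, fraction))
```

Here -6.25 is stored as -1.5625 x 2^2, so the unbiased exponent is 2 and the fraction field holds the 0.5625 part of the significand.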

In practice

In embedded DSP, floating-point is attractive because algorithms can be developed and debugged without manually tracking the scaling, overflow, and saturation that fixed-point requires. Filters, FFTs, and control loops written in floating-point closely mirror textbook math, making it easier to validate against reference implementations. The tradeoff is cost: on MCUs without a hardware floating-point unit (FPU), every operation is emulated in software and can be 10x to 100x slower than the equivalent integer operation. On cores that do include an FPU, such as the ARM Cortex-M4F, Cortex-M7, and Cortex-M33 (with the FPU option fitted), single-precision (float) throughput can approach integer throughput in well-pipelined code, though real-world performance depends on instruction mix, data dependencies, and compiler code generation. Double-precision (double) is a different matter: the FPUs in the Cortex-M4F and Cortex-M33 are single-precision only, so double arithmetic on those cores falls back to software emulation, while the Cortex-M7 FPU has an optional double-precision extension that some devices include and others omit.

Precision is the subtlest pitfall. A 32-bit IEEE 754 float has approximately 7 significant decimal digits of precision. Accumulating many additions, or subtracting two nearly equal values (catastrophic cancellation), can produce results far less accurate than the word length implies. In audio and control DSP, long accumulator chains that look fine in MATLAB may exhibit audible noise or instability on hardware if the same algorithm runs in single precision. Using double precision during prototyping and then narrowing to float for production is a common strategy, but requires re-validation. The blog post "Feedback Controllers - Making Hardware with Firmware. Part 10. DSP/FPGAs Behaving Irrationally" covers concrete cases where floating-point behavior surprises engineers moving algorithms to hardware.
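
The accumulation and cancellation effects described above can be reproduced on a desktop with a small sketch. It emulates single precision in plain Python by rounding every intermediate result to float32 through the `struct` module; the helper names `f32` and `repeated_add_f32` are hypothetical. Rounding a double-precision sum of two float32 values back to float32 is assumed here to match a native single-precision add, which holds for individual additions and subtractions.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Gap between 1.0 and the next float32 value: 2**-23, about 1.19e-7
EPS32 = 2.0 ** -23

def repeated_add_f32(step, count):
    """Add step to an accumulator count times, rounding to float32 each time."""
    step = f32(step)
    acc = 0.0
    for _ in range(count):
        acc = f32(acc + step)
    return acc

# Long accumulation: one million additions of 0.1 should give 100000.
total = repeated_add_f32(0.1, 1_000_000)
print("float32 running sum:", total)
print("absolute error     :", abs(total - 100000.0))

# Catastrophic cancellation: the inputs differ by 1e-7 on paper, but the
# nearest float32 to 1.0000001 is 1.0 + 2**-23, so the difference is off.
a = f32(1.0000001)
b = f32(1.0)
diff = a - b
print("computed difference:", diff)
print("relative error     :", abs(diff - 1e-7) / 1e-7)
```

The running sum drifts because, once the accumulator is large, each 0.1 is effectively rounded to a multiple of the accumulator's much coarser spacing. Accumulating in double, or using a compensated summation scheme, are common mitigations, each of which needs its own validation on the target.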

For determinism-critical or resource-constrained targets, such as 8-bit AVR, PIC16/18, or small Cortex-M0/M0+ devices, none of which include an FPU, fixed-point arithmetic is often preferred. The blog posts "Simple Concepts Explained: Fixed-Point" and "Fixed-Point Simulation in GNU Octave—Without MATLAB" provide practical starting points for that alternative. FPGAs occupy a middle ground: hard floating-point DSP blocks appear in some higher-end families (for example, Intel Arria 10 and Stratix 10, and the DSP58 blocks in AMD/Xilinx Versal), while most other FPGAs build floating-point operators from LUTs and fixed-point DSP slices at a significant resource cost.

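In real-time embedded systems, using floating-point in ISRs or in more than one RTOS task means the FPU register context must be saved and restored along with the integer context. On ARM Cortex-M4F/M7, once the FPU has been used, exception entry reserves an extended stack frame (26 words instead of the basic 8) to hold S0-S15 and FPSCR. Lazy stacking is an architectural feature that reserves this space but defers the actual register writes until the handler executes a floating-point instruction, which keeps overhead low when most ISRs do not touch the FPU. The remaining registers, S16-S31, are callee-saved and are left to software, so RTOSes such as FreeRTOS save them at context-switch time when the port and configuration support the FPU. Misconfiguring FPU context handling, or failing to allocate the larger task stack that FPU stacking requires, is a common source of hard-to-reproduce corruption bugs.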

Frequently asked

Does having a hardware FPU mean floating-point is free?
Not entirely. A Cortex-M4F or M7 FPU brings single-precision add and multiply down to roughly one cycle per operation in pipelined code, close to integer speed, but divide and square root take considerably longer (on the order of 14 cycles on the Cortex-M4F). Double-precision is still software-emulated on single-precision-only FPUs such as those in the Cortex-M4F and Cortex-M33, context save/restore adds overhead in RTOS tasks and ISRs, and code size can increase. Throughput is much better than pure software emulation, but floating-point is not zero-cost even with hardware support.
When should I choose fixed-point over floating-point for DSP?
Fixed-point is typically preferred when the target lacks a hardware FPU and software emulation is too slow, when deterministic bit-exact results are required (e.g., for certification or interoperability), or when power and silicon area are tightly constrained. Floating-point is usually easier to develop and maintain when the hardware supports it, provided precision is validated carefully. The blog post "Simple Concepts Explained: Fixed-Point" is a good reference for understanding the tradeoffs.
What is the precision of a 32-bit IEEE 754 float, and why does it matter in DSP?
A single-precision float stores 23 fraction bits plus an implicit leading 1, for a 24-bit significand. That gives about 7 significant decimal digits of precision and a machine epsilon (the gap between 1.0 and the next representable value) of 2^-23, roughly 1.19e-7; with round-to-nearest, the relative error of a single operation is at most half of that. In DSP, operations like long summations, recursive IIR filters, and poorly conditioned transforms can accumulate rounding errors well beyond this per-operation limit. High-order IIR filters and long FFTs are classic cases where single-precision can produce noticeable degradation compared to double-precision or carefully scaled fixed-point arithmetic.
What is a NaN and when does it appear in embedded DSP?
NaN (Not a Number) is a special IEEE 754 value produced by invalid operations such as 0/0, the square root of a negative number, or infinity minus infinity. In embedded DSP, NaNs typically appear due to uninitialized buffers, 0/0 in normalization or AGC code, or log/sqrt applied to a negative value after numerical instability. NaN is contagious: almost any arithmetic operation with a NaN operand produces another NaN, and every ordered comparison with a NaN is false, so a single bad value can silently corrupt an entire output stream. Whether this can be caught in hardware depends on the part: some FPUs can trap on invalid operations, while the ARM Cortex-M FPUs instead record them in sticky status flags in FPSCR (such as the invalid-operation flag) that software can poll, and some vendors route those flags to an interrupt.
Do FPGAs support floating-point DSP?
Some FPGA families include hard floating-point DSP blocks (for example, the single-precision-capable DSP blocks in Intel Arria 10 and Stratix 10, or the DSP58 in AMD/Xilinx Versal), but most mid-range and smaller FPGAs, including devices built around DSP48-style slices, implement floating-point in the fabric from LUTs and fixed-point DSP slices. A single-precision adder or multiplier built this way typically consumes a few DSP slices and several hundred LUTs, depending on the latency and resource options chosen, and adds pipeline latency. For most FPGA DSP work, fixed-point remains the practical choice unless the algorithm genuinely requires the dynamic range of floating-point.
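
The NaN propagation described in the FAQ above can be sketched with a one-pole low-pass filter, used here only as a stand-in for any recursive DSP stage. The function names and the policy of replacing non-finite inputs with 0.0 are assumptions made for illustration; a real system might instead hold the last good sample, reset the filter state, or raise a fault.

```python
import math

def one_pole(samples, a=0.9):
    """y[n] = a*y[n-1] + (1-a)*x[n], with no protection against bad input."""
    y = 0.0
    out = []
    for x in samples:
        y = a * y + (1.0 - a) * x
        out.append(y)
    return out

def one_pole_guarded(samples, a=0.9):
    """Same filter, but non-finite inputs are replaced with 0.0 before use."""
    y = 0.0
    out = []
    for x in samples:
        if not math.isfinite(x):   # catches NaN and +/-infinity
            x = 0.0
        y = a * y + (1.0 - a) * x
        out.append(y)
    return out

bad = float("inf") - float("inf")   # an invalid operation, yields NaN
stream = [1.0, 1.0, bad, 1.0, 1.0]

print(one_pole(stream))           # NaN from the third sample onward
print(one_pole_guarded(stream))   # output stays finite
```

Because the NaN is stored in the filter state, the unguarded filter never recovers even though later inputs are valid, which is why a check at the input (or on the state) is usually cheaper than debugging the corrupted output downstream.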

Differentiators vs similar concepts

Floating-point is often contrasted with fixed-point representation. In fixed-point, the binary point is at a programmer-defined static position, giving constant absolute precision across the number range but limited dynamic range; overflow and scaling must be managed explicitly. In floating-point, the binary point moves with the exponent, giving roughly constant relative precision (in fractional terms) across a very wide dynamic range, at the cost of more complex hardware, non-uniform rounding behavior, and the possibility of special values (NaN, Inf). A third alternative, block floating-point, applies a single shared exponent to a block of fixed-point values, approximating floating-point dynamic range with cheaper hardware, and is common in some DSP processor architectures.
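
To make the block floating-point idea concrete, here is a minimal sketch that quantizes a block of values to integer mantissas sharing one exponent. The function names and the 16-bit mantissa width are assumptions for illustration; actual DSP architectures differ in how they choose the exponent, round, and handle saturation.

```python
import math

def bfp_encode(block, mantissa_bits=16):
    """Quantize a block to signed integer mantissas sharing one exponent."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return [0] * len(block), 0
    # frexp gives peak = m * 2**e with 0.5 <= m < 1, so every |x| / 2**e < 1
    _, shared_exp = math.frexp(peak)
    scale = 2 ** (mantissa_bits - 1)    # 32768 for 16-bit mantissas
    mantissas = []
    for x in block:
        q = round(math.ldexp(x, -shared_exp) * scale)
        mantissas.append(max(-scale, min(scale - 1, q)))   # saturate
    return mantissas, shared_exp

def bfp_decode(mantissas, shared_exp, mantissa_bits=16):
    """Convert shared-exponent mantissas back to ordinary floats."""
    scale = 2 ** (mantissa_bits - 1)
    return [math.ldexp(m / scale, shared_exp) for m in mantissas]

block = [0.5, -0.25, 0.001, 100.0]
mantissas, shared_exp = bfp_encode(block)
print(mantissas, shared_exp)
print(bfp_decode(mantissas, shared_exp))
```

In this example the largest value, 100.0, sets the shared exponent to 7, so the quantization step for the whole block is 2^7 / 32768 = 2^-8. The value 0.001 is smaller than half a step and decodes as 0.0, which illustrates the tradeoff: within a block the absolute precision is constant, as in fixed-point, and it is dictated by the block's largest element.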