Hello, I am trying to write a function for implementing a FIR filter in C. I am implementing fixed point integer arithmetic but have a filter specified in floating point so I need to quantise it. I'm ok with float to fixed conversion but I am unsure how one would typically deal with the limits of the accumulator from the perspective of coefficient scaling.
My options, as I see it:
- scale my coefficients such that the accumulator will never overflow -> maybe lots of unnecessary quantisation noise due to small coefs
- allow overflow -> less quant noise, some signals could cause overflow/saturation/badness.
An example might be:
- 16 bit coefs, 16 bit data, 40 bit accumulator, 128 taps -> easy, coefs use all available 16 bits
- 16 bit coefs, 16 bit data, 40 bit accumulator, 4096 taps -> ??
Thanks for any advice
This is a tiny bit off topic. If you want to get the very best use of your available data lengths and accumulator lengths, you should make some effort to exploit every known feature of the data. For example, if the data is not like white noise and the filter has some frequency selectivity, you can make a tighter specification than the obvious one.
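Remember that an overflow in the accumulator during the computation of some output isn't what's important, provided the accumulator wraps around (two's complement) rather than saturating. It is the final value of the contents of the accumulator at the end of the computation of an output sample that matters: if that final value fits, intermediate overflows cancel out.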
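To illustrate the point about intermediate overflow, here is a minimal sketch with a deliberately tiny 16-bit accumulator. The function name wrap_sum16 is made up for this example. It accumulates in an unsigned type, because signed overflow is undefined behaviour in C, and reinterprets the result as two's complement at the end.

```c
#include <assert.h>
#include <stdint.h>

/* Sum n samples in a 16-bit wrapping accumulator (hypothetical helper).
   Unsigned arithmetic is used so that wraparound is well defined in C. */
static int16_t wrap_sum16(const int16_t *x, int n)
{
    assert(n >= 0);
    uint16_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = (uint16_t)(acc + (uint16_t)x[i]);   /* arithmetic modulo 2^16 */

    /* Reinterpret the bit pattern as a two's-complement value by hand,
       avoiding an implementation-defined unsigned-to-signed cast. */
    if (acc >= 0x8000u)
        return (int16_t)((int32_t)acc - 65536);
    return (int16_t)acc;
}
```

With the inputs 30000, 20000, -25000, -15000 the running sum passes through 50000, which does not fit in 16 signed bits, yet the returned total is the correct 10000. This only works with a wrapping accumulator; a saturating one would clip at the intermediate step and lose the information.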
There are a lot of optimizations you can make but most of them are only necessary if you're working in hardware (ASIC or FPGA design) and need to minimize resources. For C programs things are a lot less critical. Probably the most important is (as you noted) to correctly scale the accumulator. I generally do this by adding up the absolute values of all the coefficients to determine the worst-case bit growth, then adding that growth to the input data size to get the accumulator size needed to ensure no overflows. You'll likely not want to carry all of that on the output of the filter, so if the overall gain of the filter is near 0 dB then it's usually safe to just saturate back down to the input word size.
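The sum-of-absolute-values rule above can be sketched as follows. The function name accumulator_bits is hypothetical, and the coefficients are assumed to be already quantised to 16-bit integers. The growth estimate is deliberately conservative: it can come out one bit larger than strictly needed, so that the most negative input value is always covered.

```c
#include <assert.h>
#include <stdint.h>

/* Conservative accumulator width, in bits, for signed data of data_bits
   bits filtered with the integer coefficients h[0..ntaps-1]. */
static int accumulator_bits(const int16_t *h, int ntaps, int data_bits)
{
    assert(ntaps > 0 && data_bits > 0);

    /* Worst-case gain: the sum of absolute coefficient values. */
    uint64_t l1 = 0;
    for (int i = 0; i < ntaps; i++)
        l1 += (h[i] < 0) ? (uint64_t)(-(int32_t)h[i]) : (uint64_t)h[i];

    /* Smallest growth such that 2^growth > l1. */
    int growth = 0;
    while (((uint64_t)1 << growth) <= l1)
        growth++;

    return data_bits + growth;
}
```

For the absolute worst case in the original question (every 16-bit coefficient at 32767, 16-bit data), this gives 38 bits for 128 taps, which fits a 40-bit accumulator, and 43 bits for 4096 taps, which does not. A real filter's coefficients are mostly far below full scale, so the actual figure for a given design is usually much smaller.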
Andrew-
Or you can saturate, the old "DSP chip" method. Some source code that does this is here:
https://github.com/signalogic/SigSRF_SDK/blob/mast...
The limit and shift amounts in lines 358-360 are set for 64-bit accumulator with 16-bit output. They could be changed. Filter coefficients quantized to 16-bit are in filt_coeffs.h.
-Jeff
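For readers who don't want to open the link, a saturating output stage might look roughly like the sketch below. This is not the code from the repository above; the function name sat16_from_acc and the shift parameter are assumptions for illustration. It assumes the accumulator has enough headroom that adding the rounding constant cannot overflow, and that right-shifting a negative value is an arithmetic shift (true on mainstream compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Round a wide accumulator, shift it down, and saturate to 16 bits. */
static int16_t sat16_from_acc(int64_t acc, int shift)
{
    assert(shift >= 1 && shift < 48);

    /* Add half an output LSB, then drop the low bits (round to nearest). */
    int64_t rounded = (acc + ((int64_t)1 << (shift - 1))) >> shift;

    /* Clip instead of wrapping if the result does not fit in 16 bits. */
    if (rounded > INT16_MAX) return INT16_MAX;
    if (rounded < INT16_MIN) return INT16_MIN;
    return (int16_t)rounded;
}
```

With coefficients scaled by 2^15, a call such as sat16_from_acc(acc, 15) would bring the result back to the input scale.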
The multiplier output precision will also be a concern. The quantization noise due to any truncation at the multiplier outputs, or at the accumulator if it is truncated multiple times, will grow according to 10Log(N), where N is the number of additions of the independent noise sources. I typically aim to keep the total noise due to quantization 10 dB below the datapath precision (for a 0.4 dB SNR penalty), and with that you can take different approaches as you accumulate. Due to the possibility of adding unexpected noise I always do an SNR test to confirm I am actually getting the number of bits I had expected to get.
Further, a good rule of thumb is to use a coefficient precision that is at least 2 bits higher than the datapath precision (or rather the precision required in the datapath). One should never scale the coefficients down just so that the accumulator doesn’t overflow. Always use an extended precision accumulator and, as fred harris has taught us, “let the filter grow the signal”, for the noise accumulation reasons quantified above.
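The two figures quoted above (10Log(N) growth and the 0.4 dB penalty) follow from adding independent noise powers. A short worked version, where sigma_q^2 (the variance of one truncation noise source) and P_n (the existing noise floor) are symbols introduced here for illustration:

```latex
% N independent truncation noise sources, each of variance \sigma_q^2,
% add in power:
\sigma_{\mathrm{total}}^2 = N\,\sigma_q^2
\quad\Longrightarrow\quad
10\log_{10}\frac{\sigma_{\mathrm{total}}^2}{\sigma_q^2} = 10\log_{10} N \ \text{dB}

% Example: N = 4096 gives 10\log_{10} 4096 \approx 36.1 dB,
% i.e. about 6 bits at roughly 6.02 dB per bit.

% Added noise sitting 10 dB below an existing noise floor P_n:
10\log_{10}\frac{P_n + 0.1\,P_n}{P_n} = 10\log_{10}(1.1) \approx 0.41 \ \text{dB}
```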
I am a bit unsure why all these concerns.
On FPGAs (and I guess same applies to DSP) we control filtering through coefficient scaling (from floating point) and final truncation.
With these two nodes of control we allow full internal bit growth.
Coefficient scaling + final truncation commonly targets power unity to optimise dynamic range. Quantised scaled coefficients can be pre-checked for any filtering degradation. We normally scale by some 2^n factor.
The gain control is finalised by discarding LSBs (with rounding) from the final output to offset the coefficient scaling, i.e. scaling down by 2^-n.
Some final MSBs can be discarded according to well known rules.
As such there is no need to worry about any overflow or re-adjust the gain afterwards.
This also allows for bit-true matching with standard filter functions
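The two control points described above (scale the coefficients by 2^n, then discard n LSBs with rounding at the output) can be sketched in C as below. The names quantise_coefs and fir_sample are made up for this example, 16-bit data and coefficients are assumed, and an arithmetic right shift of negative values is assumed as well. Discarding MSBs afterwards is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Quantise floating-point coefficients to 16-bit integers scaled by 2^n. */
static void quantise_coefs(const double *hf, int16_t *hq, int ntaps, int n)
{
    assert(n >= 1 && n <= 15);
    const double scale = (double)(1 << n);
    for (int i = 0; i < ntaps; i++) {
        double v = hf[i] * scale;
        long r = (long)(v >= 0.0 ? v + 0.5 : v - 0.5);  /* round to nearest */
        if (r > INT16_MAX) r = INT16_MAX;               /* clamp, e.g. 1.0 at n = 15 */
        if (r < INT16_MIN) r = INT16_MIN;
        hq[i] = (int16_t)r;
    }
}

/* One output sample: full internal bit growth, one rounding at the end.
   x[0..ntaps-1] holds the samples already ordered to pair with hq[]. */
static int64_t fir_sample(const int16_t *x, const int16_t *hq, int ntaps, int n)
{
    assert(n >= 1 && n <= 15);
    int64_t acc = 0;
    for (int i = 0; i < ntaps; i++)
        acc += (int32_t)x[i] * hq[i];                   /* exact 32-bit products */
    return (acc + ((int64_t)1 << (n - 1))) >> n;        /* undo the 2^n scaling */
}
```

As a usage example, the float coefficients 0.25, 0.5, 0.25 quantised with n = 14 become 4096, 8192, 4096, and the samples 100, 200, 300 then produce an output of 200.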
The concern is when the goal is to optimize resources. It is easy to confirm coefficient scaling through simulation, but quantization noise growth due to truncating the results (selecting the MSBs) is another matter, especially with higher order filters given the 10Log(N) factor that I mention. If someone isn't paying attention to this, they may have significantly fewer effective bits than they think they have (regardless of how many actual bits are on the data path and output). I show what I am referring to in this diagram showing the primary noise sources as additive to otherwise pure signals affecting overall quantization noise. Due to the delays at each node, the truncation error introduced is both white and independent, so the errors sum in power (total variance). Truncating at the output is clearly the best strategy, but there does need to be a decision on how much to truncate the multiplier outputs, so understanding this is important to anyone designing digital filters with fixed point (IMO).
FPGA filter IPs and designers deal with limited coefficient sets and I have never seen anyone doing internal truncations. They allow full internal bit growth.
Modules like FFTs, however, do internal truncations and vendors supply their model for reference.
So it could be DSP filter designers using hundreds or thousands of taps may opt for internal truncation. That is sad ??
If you want to minimize resources this can be a significant factor and low hanging fruit to get there (there is no need to accumulate 32 bits from each 16x16 multiplier!). This is referred to as a "Truncated Multiplier FIR" and is a common approach / consideration for reduced size, cost and power.
With fixed-point design, it's a LOT easier when you don't care about those things (and then proceed just as you describe). There are many applications where size, cost and power (of the FIR filter itself compared to the overall design) is not the primary driver, so I understand your perspective.
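To make the trade-off concrete, here is a rough C model of the two approaches: accumulating full-precision products versus dropping LSBs from every product before accumulation. The function names and the drop parameter are assumptions for illustration, not taken from any vendor IP. An arithmetic right shift of negative products is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Full internal bit growth: every 16x16 product is accumulated exactly. */
static int64_t fir_full(const int16_t *x, const int16_t *h, int ntaps)
{
    int64_t acc = 0;
    for (int i = 0; i < ntaps; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Truncated-multiplier model: the low `drop` bits of each product are
   discarded before accumulation. The result is in units of 2^drop. */
static int64_t fir_truncated_products(const int16_t *x, const int16_t *h,
                                      int ntaps, int drop)
{
    assert(drop >= 0 && drop < 31);
    int64_t acc = 0;
    for (int i = 0; i < ntaps; i++) {
        int32_t p = (int32_t)x[i] * h[i];
        acc += p >> drop;            /* keep the MSBs only (floor) */
    }
    return acc;
}
```

Each truncated product loses a residue between 0 and 2^drop - 1, so after rescaling by 2^drop the truncated result falls short of the full-precision one by the sum of ntaps such residues. That accumulated error is what grows with the number of taps, and a model like this can be run on representative data for the SNR test mentioned earlier.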
Ok thanks,
So there is a tipping point for internal truncation based on number of taps. Similarly that same tipping point should apply to our discussion.