fixed-point in C

original material: artist-embedded.org/EmbeddedControl Slides
reference 1: fixedpt.html
reference 2: Q_format

1. Fixed-point Representation

  • x: real number
  • X: fixed-point number
  • N: wordlength
  • m: number of integer bits (excluding the sign bit)
  • f: number of fraction bits
  • “Q-format”: Qm.f, with N = 1 + m + f
example) Q4.3 (N = 8): 0/1101/011
sign bit / 4-bit integer / 3-bit fraction, value 107 · 2^(-3) = 13.375

2. Conversion to and from fixed-point

  • real to fixed

    • Multiply the floating point number by 2^f
    • Round to the nearest integer

      X = round(x · 2^f)

  • fixed to real

    x = X · 2^(-f)

example) 13.4 to Q4.3 format

X = round(13.4 · 2^3) = round(107.2) = 107 (= 01101011 in binary)
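The two conversions above can be sketched in C as follows. The helper names `real_to_fixed` and `fixed_to_real` are hypothetical (not from the original slides), and the sketch assumes f is small enough that `1 << f` fits in an `int`.

```c
#include <stdint.h>

/* Hypothetical helper: real -> fixed, X = round(x * 2^f).
 * Rounds half away from zero by adding +/-0.5 before truncating. */
static int32_t real_to_fixed(double x, int f)
{
    double scaled = x * (double)(1 << f);
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Hypothetical helper: fixed -> real, x = X * 2^(-f). */
static double fixed_to_real(int32_t X, int f)
{
    return (double)X / (double)(1 << f);
}
```

With f = 3, `real_to_fixed(13.4, 3)` gives 107, and `fixed_to_real(107, 3)` gives 13.375, so the quantization error of this example is 0.025.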

3. Range of fixed-point representation

  • negative number: 2’s complement

    • N=8: -2^(N-1) ~ 2^(N-1) - 1, i.e. -128 ~ 127
      binary representation   decimal
      00000000                0
      00000001                1
      00000010                2
      01111111                127
      10000000                -128
      10000001                -127
      11111111                -1
  • range of Qm.f [ref]

    [-2^m, 2^m - 2^(-f)]
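As a quick check of the range [-2^m, 2^m - 2^(-f)], the limits can be computed from m and f. This is only a sketch: the function names are hypothetical, and it assumes m + f is at most 30 so that the raw values fit in an `int32_t`.

```c
#include <stdint.h>

/* Raw (integer) limits of a signed Qm.f word with N = 1 + m + f bits. */
static int32_t q_raw_min(int m, int f) { return -(INT32_C(1) << (m + f)); }
static int32_t q_raw_max(int m, int f) { return (INT32_C(1) << (m + f)) - 1; }

/* The same limits as real values: raw * 2^(-f). */
static double q_real_min(int m, int f)
{
    return (double)q_raw_min(m, f) / (double)(INT32_C(1) << f);
}

static double q_real_max(int m, int f)
{
    return (double)q_raw_max(m, f) / (double)(INT32_C(1) << f);
}
```

For Q4.3 the raw limits are -128 and 127 (the same as the N=8 two's complement range), and the real limits are -16 and 15.875.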

4. Arithmetic operations of fixed-point

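  • Saturation check
#include <stdint.h>

// Clamp a 32-bit intermediate result to the int16_t range.
int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;        // 0x7FFF
    else if (x < INT16_MIN) return INT16_MIN;   // -0x8000
    else return (int16_t)x;
}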
  • Addition
int16_t q_add_sat(int16_t a, int16_t b)
{
    int16_t result;
    int32_t tmp;

    tmp = (int32_t)a + (int32_t)b;
    if (tmp > 0x7FFF)
        tmp = 0x7FFF;
    if (tmp < -1 * 0x8000)
        tmp = -1 * 0x8000;
    result = (int16_t)tmp;

    return result;
}
  • Subtraction
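// Plain subtraction without a saturation check: a result outside the
// int16_t range is not clamped (the narrowing conversion is implementation-defined).
int16_t q_sub(int16_t a, int16_t b)
{
    int16_t result;
    result = (int16_t)(a - b);
    return result;
}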
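If subtraction should clamp like the addition does, a saturating variant might look like the sketch below. The name `q_sub_sat` is hypothetical; it simply applies the same 32-bit intermediate and range check that the addition uses.

```c
#include <stdint.h>

/* Hypothetical saturating subtraction, mirroring the saturating addition. */
int16_t q_sub_sat(int16_t a, int16_t b)
{
    /* Compute in 32 bits so the difference cannot overflow. */
    int32_t tmp = (int32_t)a - (int32_t)b;

    if (tmp > INT16_MAX)
        tmp = INT16_MAX;
    if (tmp < INT16_MIN)
        tmp = INT16_MIN;

    return (int16_t)tmp;
}
```

For example, `q_sub_sat(-32768, 1)` stays at -32768 instead of wrapping around to a positive value.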
  • Multiplication
// Q: number of fraction bits of the fixed-point format.
// The source does not define it; 8 is an assumed example value.
#define Q   8
// precomputed value for rounding: half of one LSB of the result
#define K   (1 << (Q - 1))

int16_t q_mul(int16_t a, int16_t b)
{
    int16_t result;
    int32_t temp;

    temp = (int32_t)a * (int32_t)b; // cast first so the product is computed in 32 bits
    // Rounding; mid values are rounded up
    temp += K;
    // Correct by dividing by base and saturate result
    result = sat16(temp >> Q);

    return result;
}
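A worked example of the multiplication steps, written as a self-contained sketch. The names `EX_Q`, `EX_K` and `ex_mul` are hypothetical, and 8 fraction bits is an assumed format, not something the original fixes.

```c
#include <stdint.h>

#define EX_Q 8                      /* assumed number of fraction bits */
#define EX_K (1 << (EX_Q - 1))      /* half an LSB, used for rounding */

/* Hypothetical multiply with rounding and saturation. */
int16_t ex_mul(int16_t a, int16_t b)
{
    /* The 32-bit product has 2 * EX_Q fraction bits. */
    int32_t temp = (int32_t)a * (int32_t)b;

    temp += EX_K;
    /* Back to EX_Q fraction bits; assumes an arithmetic shift for negative values. */
    temp >>= EX_Q;

    if (temp > INT16_MAX) return INT16_MAX;
    if (temp < INT16_MIN) return INT16_MIN;
    return (int16_t)temp;
}
```

With 8 fraction bits, 1.5 is 384 and 2.25 is 576. `ex_mul(384, 576)` gives 864, which is 3.375 · 2^8. A product that does not fit, such as `ex_mul(25600, 25600)` (100.0 · 100.0), is clamped to 32767.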
  • Division
int16_t q_div(int16_t a, int16_t b)
{
    int16_t result;
    int32_t temp;

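    // pre-multiply by the base: the dividend then has 2*Q fraction bits,
    // so the quotient is back in the same Q format as a and b
    // (multiplication is used because left-shifting a negative value is undefined in C)
    temp = (int32_t)a * ((int32_t)1 << Q);
    // Rounding: mid values are rounded up (down for negative values).
    if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0))
        temp += b / 2;
    else
        temp -= b / 2;
    // b must be nonzero; the quotient is not saturated here
    result = (int16_t)(temp / b);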

    return result;
}
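The division above assumes a nonzero divisor and a quotient that fits in 16 bits. A more defensive variant could look like the sketch below. The names `EX_Q` and `ex_div` are hypothetical, 8 fraction bits is an assumed format, and returning the saturated limit on division by zero is only one possible policy.

```c
#include <stdint.h>

#define EX_Q 8   /* assumed number of fraction bits */

/* Hypothetical divide with zero check, rounding and saturation. */
int16_t ex_div(int16_t a, int16_t b)
{
    int32_t temp;

    /* Assumed policy: clamp to the limit with the sign of the dividend. */
    if (b == 0)
        return (a >= 0) ? INT16_MAX : INT16_MIN;

    /* Pre-multiply by the base so the quotient keeps EX_Q fraction bits. */
    temp = (int32_t)a * ((int32_t)1 << EX_Q);

    /* Round half away from zero: push the dividend outward by |b| / 2. */
    if ((temp >= 0) == (b > 0))
        temp += b / 2;
    else
        temp -= b / 2;

    temp /= b;

    if (temp > INT16_MAX) return INT16_MAX;
    if (temp < INT16_MIN) return INT16_MIN;
    return (int16_t)temp;
}
```

With 8 fraction bits, `ex_div(864, 384)` (3.375 / 1.5) gives 576, which is 2.25 · 2^8.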
