DSPRelated.com
Forums

Writing DSP routines in C or assembly?

Started by DandiKain 7 years ago | 7 replies | latest reply 7 years ago | 1674 views

I'm working on a DSP project on an Analog Devices Blackfin digital signal processor. It seems that to benefit from the hardware designed for DSP applications (multiply-and-add instructions and so on), I need to program in assembly. My question: if I just program in C, won't the compiler (which also comes from the DSP chip vendor) optimize the code for that DSP and use its capabilities? Or do I really need to write DSP routines directly in assembly? That would be cumbersome.

Reply by jms_nh, August 1, 2017

C compilers are optimized for general-purpose programming. If the compiler writers are aware of special instructions on the processor (like multiply-and-accumulate), they can try to take advantage of them. But this is really difficult to do with DSP instructions, in part because the instructions tend to be pipelined: when you write DSP assembly manually, it's not just a series of isolated instructions; each one is part of a sequence and feeds into the next. There also tend to be mode bits, such as saturation or shift counts, that the compiler has to keep track of. Data type sizes are another issue: if the accumulator is not a standard C type like int16_t, int32_t, or int64_t, then there is no direct way to express the accumulator in C.
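
To make the accumulator point concrete, here is a minimal sketch in portable C. It assumes a 40-bit saturating accumulator (32 bits plus 8 guard bits, a width found on several fixed-point DSPs) and emulates it inside an int64_t. The names acc40_mac, ACC40_MAX and ACC40_MIN are made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical limits of a 40-bit two's-complement accumulator. */
#define ACC40_MAX ((int64_t)0x7FFFFFFFFFLL)   /*  2^39 - 1 */
#define ACC40_MIN (-ACC40_MAX - 1)            /* -2^39     */

/* Multiply-accumulate with 40-bit saturation, emulated in 64 bits. */
static int64_t acc40_mac(int64_t acc, int16_t a, int16_t b)
{
    /* The 16x16 product is widened first, so the add cannot overflow int64_t
       as long as acc is already within the 40-bit range. */
    int64_t sum = acc + (int64_t)a * (int64_t)b;

    /* What is a saturation mode bit in hardware becomes explicit code here. */
    if (sum > ACC40_MAX) return ACC40_MAX;
    if (sum < ACC40_MIN) return ACC40_MIN;
    return sum;
}
```

The compare-and-clamp logic is the part a compiler would have to recognize and fold back into one saturating MAC instruction, which is a lot to ask of pattern matching.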

So yes, you either have to write in assembly or use processor-specific intrinsics (for gcc they are usually called __builtin_XYZABC() for some XYZABC). These map essentially directly to DSP instructions, and you can call them in the order you need.
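
As a rough illustration of the builtin style, the sketch below wraps a saturating 32-bit add. It uses __builtin_add_overflow, a generic GCC/Clang builtin (GCC 5 or later), not a DSP-specific one; the assumption is that a DSP toolchain would offer its own saturating-add intrinsic to put in the same place. The wrapper name sat_add32 is hypothetical.

```c
#include <stdint.h>

/* Saturating 32-bit add behind a small wrapper, so the intrinsic used
   underneath can be swapped per toolchain. */
static inline int32_t sat_add32(int32_t a, int32_t b)
{
#if defined(__GNUC__) || defined(__clang__)
    int32_t r;
    /* The builtin returns nonzero when the mathematical sum does not fit. */
    if (__builtin_add_overflow(a, b, &r))
        return (a < 0) ? INT32_MIN : INT32_MAX;  /* overflow implies same signs */
    return r;
#else
    /* Portable fallback: do the add in 64 bits and clamp. */
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
#endif
}
```

Keeping the intrinsic behind one small function also keeps the rest of the code portable to a host PC for testing.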

Reply by SteveSmith, August 1, 2017

Here are a couple of links you may find useful, describing my experience with this very issue:


http://www.dspguide.com/CH28.PDF

http://www.dspguide.com/CH29.PDF


Reply by BEBSynthesizers, August 1, 2017

Hi DandiKain,

Simply put: there is currently *no* C compiler able to optimize DSP code as efficiently as native assembly programming. And simply forget C++: it consumes a lot of CPU cycles (it's made for programming simplification, not for optimization).

For example, this instruction (from SigmaDSP), coded in assembly:

acc0 = acc0 + (coeffRAM[xxxx]*dataRAM[yyyy])

is a single instruction (you can find similar ones on SHARC and Blackfin), even though it performs two memory accesses, one multiplication, one addition and a final hidden shift (so 5 basic operations). If you code this in C, you will at best get 2 or 3 instructions, because most of the time the compiler is unable to recognize that it can use the single MAC instruction.
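
For comparison, this is roughly how the same multiply-accumulate looks in plain C. It is a sketch only: the Q15 format, the function name dot_q15 and the array names are assumptions for illustration, and the scaling shift is applied once after the loop rather than inside each MAC.

```c
#include <stddef.h>
#include <stdint.h>

/* Q15 dot product; the loop body is the C spelling of
   acc = acc + coeff[i] * data[i]. */
static int16_t dot_q15(const int16_t *coeff, const int16_t *data, size_t n)
{
    int32_t acc = 0;  /* assumes n and the inputs are small enough not to overflow */

    for (size_t i = 0; i < n; i++) {
        /* two loads, one multiply, one add per iteration */
        acc += (int32_t)coeff[i] * data[i];
    }

    /* Q30 -> Q15 scaling shift (arithmetic shift on the usual compilers) */
    return (int16_t)(acc >> 15);
}
```

Whether the loop body turns into one MAC instruction or several separate ones depends entirely on the compiler, so the generated assembly is worth reading for a loop like this.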

The drawback of assembly is that development is much more time-consuming (and you can more easily make errors, which are sometimes extremely hard to detect). When you know assembly language well, it takes roughly 20 to 30% more time to code an algorithm in ASM than in C (but the resulting code is easily 20% to 30% faster than if coded in C, and most of the time it has a lower memory footprint).

To help C programmers (almost) reach ASM performance, most compilers (especially those from ADI) provide two workarounds:

- you can code critical sections in assembly directly in the C source using inline assembly (so you write in assembly, not C)

- they provide many functions for FFT, FIR, IIR, etc., natively coded in assembly, that you can call from C source

But you will never reach the level of optimization of a "pure" assembly program.

So you can rephrase the question like this: is the time cost of writing in assembly justified by the performance gain over C?

Hope this helps you to define the best solution.

Benoit


Reply by Braddon Van Slyke, August 1, 2017

Analog Devices, I'm sure, provides a number of DSP library routines that can be linked into your C program. I use mostly TI DSPs, and they provide linkable libraries for functions such as FIR filters, FFTs, and a bunch of other routines that don't come to mind. These libraries have been optimized for the DSP hardware target.

My C code is the glue, so to speak, but when it comes to number crunching, I first check whether a library function is available. And/or, like jms_nh said, look for intrinsics as well.
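
One way to set up that glue, sketched here under assumptions: keep the number crunching behind a small function with a portable reference body, so that a vendor-optimized routine can later replace it without touching the callers. The name fir_block and its signature are invented for this example and do not come from any vendor library.

```c
#include <stddef.h>

/* Portable reference FIR. The input x must hold n_out + n_taps - 1 samples;
   each output is y[i] = sum over k of h[k] * x[i + n_taps - 1 - k]. */
static void fir_block(const float *x, size_t n_out,
                      const float *h, size_t n_taps,
                      float *y)
{
    for (size_t i = 0; i < n_out; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < n_taps; k++)
            acc += h[k] * x[i + n_taps - 1 - k];
        y[i] = acc;
    }
}
```

The reference body also doubles as the known-good result to compare against once the optimized library call is dropped in.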

Reply by probbie, August 1, 2017

It really depends on how much processing you need to squeeze in. You will get very dramatic improvements in efficiency if you write in assembler. The compiler will not come anywhere close to what can be achieved with careful hand-coding. However, the SHARC assembler is much easier to use than most, so I don't think the learning curve is that steep. Of course, if you can do what you need to do with C, then use that.

Reply by rrlagic, August 1, 2017

Though I am from the TI church, I am pretty impressed by how many people are suggesting assembly. Guys, are you serious about keeping dozens of registers in mind and trying to balance the load between functional units?

In the TI world, the C compiler does true magic. Of course, we read the generated assembly for critical pieces to make sure the algorithm was coded properly and the compiler understood the programmer's intention correctly. But beyond that, it's hell: very few people can do that reasonably well, and they definitely don't ask this kind of question.

Other contributors already mentioned intrinsics. These are a C-style way of calling assembly instructions. Sometimes they help, especially with packed data.

Just to give you an idea about writing assembly, see the following fragment:
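
To show what "packed data" means here, the following is a portable emulation of a dual 16x16 multiply-add on two values packed into one 32-bit word. It is only a sketch: pack2 and dotp2_emul are hypothetical names, and a real toolchain would supply its own intrinsic that does this in a single instruction.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hi in the upper half). */
static uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Emulated packed dot product: a.hi * b.hi + a.lo * b.lo.
   The narrowing casts rely on the usual two's-complement wraparound.
   The int32_t sum overflows only if all four halves are -32768. */
static int32_t dotp2_emul(uint32_t a, uint32_t b)
{
    int16_t a_hi = (int16_t)(a >> 16), a_lo = (int16_t)(a & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16), b_lo = (int16_t)(b & 0xFFFFu);
    return (int32_t)a_hi * b_hi + (int32_t)a_lo * b_lo;
}
```

With a real intrinsic, one 32-bit load fetches both samples and one instruction does both multiplies, which is where the gain over scalar 16-bit code comes from.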

void ul_equalizer_spectral_flatness
(
    const   int32                   cnt,    // allocated reference symbol length
            __float2_t * RESTRICT   ec,     // equalizer coefficients buffer
            float      * RESTRICT   flat,   // output buffer
            float      * RESTRICT   fmax,
            float      * RESTRICT   fmin
)
{
    int32 i;
    float nmr;

    *fmax = 0;
    *fmin = 0;

    nmr = 0;
    LOOP_COUNT_INFO(N_SC_RB, N_SC_RB)
    for ( i = 0; i < cnt; i++ ) nmr += F2RE( _complex_conjugate_mpysp(ec[i], ec[i]) );

Only one loop is shown, for clarity. Now see what the compiler generated for that single loop, and consider whether you could beat it by hand:

ul_equalizer_spectral_flatness:
;** --------------------------------------------------------------------------*
;          EXCLUSIVE CPU CYCLES: 18
;* 1477    -----------------------    ec = ec;
;* 1477    -----------------------    flat = flat;
;* 1477    -----------------------    fmax = fmax;
;* 1477    -----------------------    fmin = fmin;
;* 1481    -----------------------    U$11 = C$31 = 0.0F;
;* 1481    -----------------------    *fmax = U$11;
;* 1482    -----------------------    U$12 = U$11;
;* 1482    -----------------------    *fmin = U$12;
;**      -----------------------    U$24 = ec;
;* 1486    -----------------------    L$1 = cnt;
;* 1486    -----------------------    I$5 = C$31;
;* 1486    -----------------------    I$6 = I$5;
;* 1486    -----------------------    I$7 = I$6;
;* 1486    -----------------------    I$8 = I$7;
;* 1486    -----------------------    I$9 = I$8;
;* 1486    -----------------------    I$10 = I$9;
;**      -----------------------    #pragma MUST_ITERATE(2, 357913940, 2)
;**      -----------------------    // LOOP BELOW UNROLLED BY FACTOR(6)
;**      -----------------------    #pragma LOOP_FLAGS(4098u)
    .dwcfi    cfa_offset, 0
           STW     .D2T2   B11,*SP--(8)      ; |1477|
    .dwcfi    cfa_offset, 8
    .dwcfi    save_reg_to_mem, 27, 0
           STW     .D2T2   B10,*SP--(8)      ; |1477|
    .dwcfi    cfa_offset, 16
    .dwcfi    save_reg_to_mem, 26, -8
           STW     .D2T2   B3,*SP--(8)       ; |1477|
    .dwcfi    cfa_offset, 24
    .dwcfi    save_reg_to_mem, 19, -16
           STDW    .D2T1   A11:A10,*SP--     ; |1477|
    .dwcfi    cfa_offset, 32
    .dwcfi    save_reg_to_mem, 11, -20
    .dwcfi    save_reg_to_mem, 10, -24

           MV      .L2X    A8,B10            ; |1477|
||         MV      .L1X    B6,A2             ; |1477|
||         MV      .S1     A6,A26            ; |1477|
||         MV      .D1     A4,A24            ; |1477|

           MV      .L1X    B4,A23            ; |1477|
    .dwpsn    file "frame.c",line 1481,column 5,is_stmt,isa 0

           ZERO    .L1     A28               ; |1481|
||         ZERO    .S1     A22               ; |1481|

           STW     .D1T1   A28,*A2           ; |1481|
    .dwpsn    file "frame.c",line 1482,column 5,is_stmt,isa 0

           MV      .L2X    A28,B11           ; |1482|
||         MV      .L1     A23,A20

           STW     .D2T2   B11,*B10          ; |1482|
    .dwpsn    file "frame.c",line 1486,column 33,is_stmt,isa 0

           MVK     .S1     24,A4
||         MV      .L1     A24,A3            ; |1486|
||         MV      .L2X    A22,B9            ; |1486|

           CMPLT   .L1     A24,A4,A0
||         MV      .L2X    A22,B8            ; |1486|

   [!A0]   B       .S1     $C$L70
|| [ A0]   LDDW    .D1T2   *+A20(16),B25:B24 ; |1486|
||         MV      .L2X    A22,B16           ; |1486|

   [ A0]   LDDW    .D1T2   *+A20(24),B23:B22 ; |1486|
||         MV      .L2X    A22,B19           ; |1486|

   [ A0]   LDDW    .D1T2   *+A20(8),B21:B20  ; |1486|
||         MV      .L2X    A22,B18           ; |1486|

   [ A0]   LDDW    .D1T2   *+A20(32),B7:B6   ; |1486|
   [ A0]   LDDW    .D1T1   *+A20(40),A5:A4   ; |1486|
   [ A0]   LDDW    .D1T2   *A20,B5:B4        ; |1486|
           ; BRANCHCC OCCURS {$C$L70} {0}
;** --------------------------------------------------------------------------*
;**   BEGIN LOOP $C$L69
;** --------------------------------------------------------------------------*
$C$L69:    
$C$DW$L$ul_equalizer_spectral_flatness$2$B:
;          EXCLUSIVE CPU CYCLES: 3
;**    -----------------------g2:
;* 1486    -----------------------    C$30 = *U$24;
;* 1486    -----------------------    I$5 += _hif(_complex_conjugate_mpysp(C$30, C$30));
;* 1486    -----------------------    C$29 = U$24[1];
;* 1486    -----------------------    I$6 += _hif(_complex_conjugate_mpysp(C$29, C$29));
;* 1486    -----------------------    C$28 = U$24[2];
;* 1486    -----------------------    I$7 += _hif(_complex_conjugate_mpysp(C$28, C$28));
;* 1486    -----------------------    C$27 = U$24[3];
;* 1486    -----------------------    I$8 += _hif(_complex_conjugate_mpysp(C$27, C$27));
;* 1486    -----------------------    C$26 = U$24[4];
;* 1486    -----------------------    I$9 += _hif(_complex_conjugate_mpysp(C$26, C$26));
;* 1486    -----------------------    C$25 = U$24[5];
;* 1486    -----------------------    I$10 += _hif(_complex_conjugate_mpysp(C$25, C$25));
;* 1486    -----------------------    U$24 += 6;
;* 1486    -----------------------    if ( (L$1 = L$1-6) >= 6 ) goto g2;
           NOP             3
$C$DW$L$ul_equalizer_spectral_flatness$2$E:
;** --------------------------------------------------------------------------*
$C$DW$L$ul_equalizer_spectral_flatness$3$B:
;          EXCLUSIVE CPU CYCLES: 23
;**      -----------------------    nmr = I$5+I$6+I$7+I$8+I$9+I$10;
           CMPYSP  .M2     B25:B24,B25:B24,B31:B30:B29:B28 ; |1486|
           CMPYSP  .M2     B21:B20,B21:B20,B3:B2:B1:B0 ; |1486|
           CMPYSP  .M2     B23:B22,B23:B22,B27:B26:B25:B24 ; |1486|
           CMPYSP  .M2     B7:B6,B7:B6,B23:B22:B21:B20 ; |1486|
           CMPYSP  .M2     B5:B4,B5:B4,B7:B6:B5:B4 ; |1486|
           CMPYSP  .M1     A5:A4,A5:A4,A7:A6:A5:A4 ; |1486|
           NOP             1
           DSUBSP  .L2     B31:B30,B29:B28,B29:B28 ; |1486|

           DSUBSP  .L2     B7:B6,B5:B4,B7:B6 ; |1486|
||         DSUBSP  .S2     B27:B26,B25:B24,B25:B24 ; |1486|

           DSUBSP  .L2     B23:B22,B21:B20,B5:B4 ; |1486|
||         DSUBSP  .S2     B3:B2,B1:B0,B27:B26 ; |1486|
||         DSUBSP  .L1     A7:A6,A5:A4,A5:A4 ; |1486|

           FADDSP  .L2     B29,B19,B19       ; |1486|

           FADDSP  .L2     B7,B18,B18        ; |1486|
||         FADDSP  .S2     B25,B9,B9         ; |1486|

           FADDSP  .L2     B5,B16,B16        ; |1486|
||         FADDSP  .S2     B27,B8,B8         ; |1486|
||         FADDSP  .S1     A5,A22,A22        ; |1486|

    .dwpsn    file "frame.c",line 1486,column 27,is_stmt,isa 0
           ADDK    .S1     48,A20            ; |1486|
    .dwpsn    file "frame.c",line 1486,column 18,is_stmt,isa 0
           SUB     .L1     A3,6,A3           ; |1486|
           CMPLTU  .L1     A3,6,A0           ; |1486|
   [!A0]   LDDW    .D1T2   *+A20(32),B7:B6   ; |1486|

   [!A0]   B       .S1     $C$L69            ; |1486|
|| [!A0]   LDDW    .D1T2   *+A20(16),B25:B24 ; |1486|

   [ A0]   B       .S1     $C$L74
|| [!A0]   LDDW    .D1T2   *+A20(24),B23:B22 ; |1486|

   [!A0]   LDDW    .D1T2   *+A20(8),B21:B20  ; |1486|
   [!A0]   LDDW    .D1T2   *A20,B5:B4        ; |1486|
   [!A0]   LDDW    .D1T1   *+A20(40),A5:A4   ; |1486|
           NOP             1
           ; BRANCHCC OCCURS {$C$L69}        ; |1486|
$C$DW$L$ul_equalizer_spectral_flatness$3$E:
;** --------------------------------------------------------------------------*
;          EXCLUSIVE CPU CYCLES: 1

           FADDSP  .L2     B8,B18,B4
||         ZERO    .L1     A3
||         MV      .D1     A23,A19
||         INTSP   .S1     A24,A27

           ; BRANCH OCCURS {$C$L74}  
;** --------------------------------------------------------------------------*
$C$L70:    
;          EXCLUSIVE CPU CYCLES: 1

           MV      .L1X    B19,A21
||         ADD     .L2X    8,A20,B24
||         DINT                              ; interrupts off

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : frame.c
;*      Loop source line                 : 1486
;*      Loop opening brace source line   : 1486
;*      Loop closing brace source line   : 1486
;*      Loop Unroll Multiple             : 6x
;*      Known Minimum Trip Count         : 2                    
;*      Known Max Trip Count Factor      : 2
;*      Loop Carried Dependency Bound(^) : 3
;*      Unpartitioned Resource Bound     : 4
;*      Partitioned Resource Bound(*)    : 4
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        1     
;*      .S units                     1        0     
;*      .D units                     4*       2     
;*      .M units                     3        3     
;*      .X cross paths               0        0     
;*      .T address paths             3        3     
;*      Long read paths              0        0     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)           6        6     (.L or .S unit)
;*      Addition ops (.LSD)          0        1     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             4*       4*    
;*      Bound(.L .S .D .LS .LSD)     4*       4*    
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 4  Schedule found with 5 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |*    *******        ***         |**  **** *       * ******       |
;*       1: |*   ********    ** ***          |**  ******       * ***  *       |
;*       2: |*  * *******        *           |**   *  *       ****    *       |
;*       3: |*  * *******        * *         |**  ****        ****** **       |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Redundant loop generated
;*      Epilog not removed
;*      Collapsed epilog stages       : 0
;*
;*      Prolog not entirely removed
;*      Collapsed prolog stages       : 3
;*
;*      Minimum required memory pad   : 0 bytes
;*
;*      For further improvement on this loop, try option -mh192
;*
;*      Minimum safe trip count       : 4 (after unrolling)
;*
;*
;*      Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;*      Mem bank perf. penalty (est.) : 0.0%
;*
;*
;*      Total cycles (est.)         : 15 + trip_cnt * 4        
;*----------------------------------------------------------------------------*
;*       SETUP CODE
;*
;*                  MV              A20,B24
;*                  ADD             8,B24,B24
;*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C1038:
;*   0              LDDW    .D1T1   *A20++(48),A7:A6  ; |1486|
;*     ||           LDDW    .D2T2   *B24++(48),B19:B18 ; |1486|
;*   1              LDDW    .D1T2   *-A20(24),B21:B20 ; |1486|
;*   2              LDDW    .D2T1   *-B24(40),A7:A6   ; |1486|
;*     ||           LDDW    .D1T2   *-A20(16),B23:B22 ; |1486|
;*   3              LDDW    .D1T1   *-A20(8),A17:A16  ; |1486|
;*   4              NOP             2
;*   6              CMPYSP  .M1     A7:A6,A7:A6,A11:A10:A9:A8 ; |1486|
;*     ||           CMPYSP  .M2     B19:B18,B19:B18,B7:B6:B5:B4 ; |1486|
;*   7              CMPYSP  .M1     A7:A6,A7:A6,A11:A10:A9:A8 ; |1486|
;*     ||           CMPYSP  .M2     B23:B22,B23:B22,B7:B6:B5:B4 ; |1486|
;*     ||           SUB     .L2     B17,6,B17         ; |1486|
;*   8              CMPYSP  .M2     B21:B20,B21:B20,B7:B6:B5:B4 ; |1486|
;*     ||           CMPYSP  .M1     A17:A16,A17:A16,A7:A6:A5:A4 ; |1486|
;*   9      [!B0]   CMPLTU  .L2     B17,6,B0          ; |1486|
;*  10              DSUBSP  .L1     A11:A10,A9:A8,A5:A4 ; |1486|
;*     ||           DSUBSP  .S2     B7:B6,B5:B4,B5:B4 ; |1486|
;*  11              DSUBSP  .S2     B7:B6,B5:B4,B23:B22 ; |1486|
;*  12              DSUBSP  .L2     B7:B6,B5:B4,B19:B18 ; |1486|
;*     ||           DSUBSP  .S1     A7:A6,A5:A4,A5:A4 ; |1486|
;*  13              FADDSP  .S2     B5,B8,B8          ; |1486|  ^
;*     ||           DSUBSP  .L1     A11:A10,A9:A8,A19:A18 ; |1486|
;*     ||   [!B0]   B       .S1     $C$C1038          ; |1486|
;*  14              FADDSP  .S1     A5,A3,A3          ; |1486|  ^
;*     ||           FADDSP  .L2     B23,B16,B16       ; |1486|  ^
;*  15              FADDSP  .L1     A5,A22,A22        ; |1486|  ^
;*  16              FADDSP  .L1     A19,A21,A21       ; |1486|  ^
;*     ||           FADDSP  .S2     B19,B9,B9         ; |1486|  ^
;*  17              NOP             2
;*  19              ; BRANCHCC OCCURS {$C$C1038}      ; |1486|
;*----------------------------------------------------------------------------*
$C$L71:    ; PIPED LOOP PROLOG
;          EXCLUSIVE CPU CYCLES: 3

           MVK     .L1     0x3,A0            ; init prolog collapse predicate
||         MV      .L2     B18,B0
||         MV      .S2X    A3,B17
||         MV      .S1X    B18,A3
||         LDDW    .D1T1   *A20++(48),A7:A6  ; |1486| (P) <0,0>
||         LDDW    .D2T2   *B24++(48),B19:B18 ; |1486| (P) <0,0>

           SUB     .D2     B17,18,B17
||         LDDW    .D1T2   *-A20(24),B21:B20 ; |1486| (P) <0,1>
||         B       .S1     $C$L72            ; |1486| (P) <0,13>

           MV      .L2X    A0,B1
||         LDDW    .D2T1   *-B24(40),A7:A6   ; |1486| (P) <0,2>
||         LDDW    .D1T2   *-A20(16),B23:B22 ; |1486| (P) <0,2>

;** --------------------------------------------------------------------------*
$C$L72:    ; PIPED LOOP KERNEL
$C$DW$L$ul_equalizer_spectral_flatness$7$B:
;          EXCLUSIVE CPU CYCLES: 4

   [ A0]   SUB     .S1     A0,1,A0           ; <0,15>
|| [!B1]   FADDSP  .L1     A5,A22,A22        ; |1486| <0,15>  ^
||         DSUBSP  .S2     B7:B6,B5:B4,B23:B22 ; |1486| <1,11>
||         SUB     .L2     B17,6,B17         ; |1486| <2,7>
||         CMPYSP  .M1     A7:A6,A7:A6,A11:A10:A9:A8 ; |1486| <2,7>
||         CMPYSP  .M2     B23:B22,B23:B22,B7:B6:B5:B4 ; |1486| <2,7>
||         LDDW    .D1T1   *-A20(8),A17:A16  ; |1486| <3,3>

   [!B1]   FADDSP  .S2     B19,B9,B9         ; |1486| <0,16>  ^
|| [!B1]   FADDSP  .L1     A19,A21,A21       ; |1486| <0,16>  ^
||         DSUBSP  .S1     A7:A6,A5:A4,A5:A4 ; |1486| <1,12>
||         DSUBSP  .L2     B7:B6,B5:B4,B19:B18 ; |1486| <1,12>
||         CMPYSP  .M2     B21:B20,B21:B20,B7:B6:B5:B4 ; |1486| <2,8>
||         CMPYSP  .M1     A17:A16,A17:A16,A7:A6:A5:A4 ; |1486| <2,8>
||         LDDW    .D1T1   *A20++(48),A7:A6  ; |1486| <4,0>
||         LDDW    .D2T2   *B24++(48),B19:B18 ; |1486| <4,0>

   [ B1]   SUB     .D2     B1,1,B1           ; <0,17>
|| [!A0]   FADDSP  .S2     B5,B8,B8          ; |1486| <1,13>  ^
||         DSUBSP  .L1     A11:A10,A9:A8,A19:A18 ; |1486| <1,13>
|| [!B0]   B       .S1     $C$L72            ; |1486| <1,13>
|| [!B0]   CMPLTU  .L2     B17,6,B0          ; |1486| <2,9>
||         LDDW    .D1T2   *-A20(24),B21:B20 ; |1486| <4,1>

   [!A0]   FADDSP  .S1     A5,A3,A3          ; |1486| <1,14>  ^
|| [!A0]   FADDSP  .L2     B23,B16,B16       ; |1486| <1,14>  ^
||         DSUBSP  .L1     A11:A10,A9:A8,A5:A4 ; |1486| <2,10>
||         DSUBSP  .S2     B7:B6,B5:B4,B5:B4 ; |1486| <2,10>
||         CMPYSP  .M1     A7:A6,A7:A6,A11:A10:A9:A8 ; |1486| <3,6>
||         CMPYSP  .M2     B19:B18,B19:B18,B7:B6:B5:B4 ; |1486| <3,6>
||         LDDW    .D2T1   *-B24(40),A7:A6   ; |1486| <4,2>
||         LDDW    .D1T2   *-A20(16),B23:B22 ; |1486| <4,2>

$C$DW$L$ul_equalizer_spectral_flatness$7$E:
;** --------------------------------------------------------------------------*
$C$L73:    ; PIPED LOOP EPILOG
;          EXCLUSIVE CPU CYCLES: 12

           FADDSP  .L1     A5,A22,A16        ; |1486| (E) <1,15>  ^
||         DSUBSP  .S2     B7:B6,B5:B4,B23:B22 ; |1486| (E) <2,11>
||         CMPYSP  .M1     A7:A6,A7:A6,A11:A10:A9:A8 ; |1486| (E) <3,7>
||         CMPYSP  .M2     B23:B22,B23:B22,B7:B6:B5:B4 ; |1486| (E) <3,7>
||         LDDW    .D1T1   *-A20(8),A17:A16  ; |1486| (E) <4,3>

           FADDSP  .S2     B19,B9,B9         ; |1486| (E) <1,16>  ^
||         FADDSP  .L1     A19,A21,A3        ; |1486| (E) <1,16>  ^
||         DSUBSP  .S1     A7:A6,A5:A4,A5:A4 ; |1486| (E) <2,12>
||         DSUBSP  .L2     B7:B6,B5:B4,B19:B18 ; |1486| (E) <2,12>
||         CMPYSP  .M2     B21:B20,B21:B20,B7:B6:B5:B4 ; |1486| (E) <3,8>
||         CMPYSP  .M1     A17:A16,A17:A16,A7:A6:A5:A4 ; |1486| (E) <3,8>

           FADDSP  .S2     B5,B8,B8          ; |1486| (E) <2,13>  ^
||         DSUBSP  .L1     A11:A10,A9:A8,A19:A18 ; |1486| (E) <2,13>

           FADDSP  .S1     A5,A3,A6          ; |1486| (E) <2,14>  ^
||         FADDSP  .L2     B23,B16,B9        ; |1486| (E) <2,14>  ^
||         DSUBSP  .L1     A11:A10,A9:A8,A5:A4 ; |1486| (E) <3,10>
||         DSUBSP  .S2     B7:B6,B5:B4,B5:B4 ; |1486| (E) <3,10>
||         CMPYSP  .M1     A7:A6,A7:A6,A11:A10:A9:A8 ; |1486| (E) <4,6>
||         CMPYSP  .M2     B19:B18,B19:B18,B7:B6:B5:B4 ; |1486| (E) <4,6>

           FADDSP  .L1     A5,A16,A3         ; |1486| (E) <2,15>  ^
||         DSUBSP  .S2     B7:B6,B5:B4,B23:B22 ; |1486| (E) <3,11>
||         CMPYSP  .M1     A7:A6,A7:A6,A11:A10:A9:A8 ; |1486| (E) <4,7>
||         CMPYSP  .M2     B23:B22,B23:B22,B7:B6:B5:B4 ; |1486| (E) <4,7>

           FADDSP  .S2     B19,B9,B8         ; |1486| (E) <2,16>  ^
||         FADDSP  .L1     A19,A3,A16        ; |1486| (E) <2,16>  ^
||         DSUBSP  .S1     A7:A6,A5:A4,A5:A4 ; |1486| (E) <3,12>
||         DSUBSP  .L2     B7:B6,B5:B4,B19:B18 ; |1486| (E) <3,12>
||         CMPYSP  .M2     B21:B20,B21:B20,B7:B6:B5:B4 ; |1486| (E) <4,8>
||         CMPYSP  .M1     A17:A16,A17:A16,A7:A6:A5:A4 ; |1486| (E) <4,8>

           FADDSP  .S2     B5,B8,B9          ; |1486| (E) <3,13>  ^
||         DSUBSP  .L1     A11:A10,A9:A8,A19:A18 ; |1486| (E) <3,13>

           FADDSP  .S1     A5,A6,A3          ; |1486| (E) <3,14>  ^
||         FADDSP  .L2     B23,B9,B6         ; |1486| (E) <3,14>  ^
||         DSUBSP  .L1     A11:A10,A9:A8,A5:A4 ; |1486| (E) <4,10>
||         DSUBSP  .S2     B7:B6,B5:B4,B5:B4 ; |1486| (E) <4,10>

           FADDSP  .L1     A5,A3,A6          ; |1486| (E) <3,15>  ^
||         DSUBSP  .S2     B7:B6,B5:B4,B23:B22 ; |1486| (E) <4,11>

           FADDSP  .S2     B19,B8,B4         ; |1486| (E) <3,16>  ^
||         FADDSP  .L1     A19,A16,A3        ; |1486| (E) <3,16>  ^
||         DSUBSP  .S1     A7:A6,A5:A4,A5:A4 ; |1486| (E) <4,12>
||         DSUBSP  .L2     B7:B6,B5:B4,B19:B18 ; |1486| (E) <4,12>

           FADDSP  .S2     B5,B9,B8          ; |1486| (E) <4,13>  ^
||         DSUBSP  .L1     A11:A10,A9:A8,A19:A18 ; |1486| (E) <4,13>

           FADDSP  .S1     A5,A3,A4          ; |1486| (E) <4,14>  ^
||         FADDSP  .L2     B23,B6,B16        ; |1486| (E) <4,14>  ^

;** --------------------------------------------------------------------------*
;          EXCLUSIVE CPU CYCLES: 6
           FADDSP  .S2     B19,B4,B9         ; |1486| (E) <4,16>  ^
           FADDSP  .L1     A19,A3,A3         ; |1486| (E) <4,16>  ^
           INTSP   .L1     A24,A27
           MV      .L2X    A4,B18
           FADDSP  .L2     B8,B18,B4

           MV      .S1     A23,A19
||         RINT                              ; interrupts on
||         FADDSP  .L1     A5,A6,A22         ; |1486| (E) <4,15>  ^
||         MV      .L2X    A3,B19
||         ZERO    .D1     A3
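
For readers who do not want to decode the listing, here is a plain C sketch of the transformation the compiler reports in its comments: the loop is unrolled by 6 and each lane gets its own partial sum (I$5 through I$10 in the listing), added together at the end. It assumes that F2RE extracts the real part, so each term is the squared magnitude of ec[i]. The cfloat type and the function name are stand-ins for illustration, not TI's types.

```c
#include <stddef.h>

/* Illustrative complex type; the original code uses TI's __float2_t. */
typedef struct { float re, im; } cfloat;

/* Sum of |ec[i]|^2 using six independent partial sums. */
static float sum_power_unrolled6(const cfloat *ec, size_t cnt)
{
    float s[6] = {0};
    size_t i = 0;

    /* Main loop: six elements per pass, one accumulator per lane, so the
       additions do not depend on each other and can overlap in the pipeline. */
    for (; i + 6 <= cnt; i += 6)
        for (size_t k = 0; k < 6; k++)
            s[k] += ec[i + k].re * ec[i + k].re + ec[i + k].im * ec[i + k].im;

    /* Leftover elements when cnt is not a multiple of 6. */
    for (; i < cnt; i++)
        s[0] += ec[i].re * ec[i].re + ec[i].im * ec[i].im;

    return s[0] + s[1] + s[2] + s[3] + s[4] + s[5];
}
```

Splitting the sum changes the order of the floating-point additions, so the result can differ in the last bits from the straightforward single-accumulator loop.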



Reply by dszabo, August 1, 2017

I agree with everyone else, but I would like to play devil's advocate. There is a school of thought that you shouldn't worry about performance until you have a metric by which that performance can be evaluated. It's a waste of money and time to try to optimize software that doesn't contribute in a meaningful way to the overall CPU load. So you would ideally design software tests that check the accuracy of the software as well as its execution time. If it turns out that you need to decrease the execution time, you now have a test suite that checks that your hand-assembled code works and a benchmark to compare its performance against.
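
A minimal sketch of such a harness in C, under stated assumptions: dot_ref is a trusted plain C reference, dot_opt stands in for whatever optimized version you write (here it is just C with two partial sums), and timing uses the standard clock() function, where on a real DSP you would more likely read a cycle counter. All names are hypothetical.

```c
#include <stddef.h>
#include <time.h>

typedef float (*dot_fn)(const float *a, const float *b, size_t n);

/* Reference implementation: simple enough to trust by inspection. */
static float dot_ref(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) acc += a[i] * b[i];
    return acc;
}

/* Stand-in for the optimized candidate (assembly or intrinsics in practice). */
static float dot_opt(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) { s0 += a[i] * b[i]; s1 += a[i + 1] * b[i + 1]; }
    if (i < n) s0 += a[i] * b[i];
    return s0 + s1;
}

/* Accuracy test: candidate must agree with the reference within tol. */
static int matches_ref(dot_fn f, const float *a, const float *b, size_t n, float tol)
{
    float d = f(a, b, n) - dot_ref(a, b, n);
    if (d < 0.0f) d = -d;
    return d <= tol;
}

/* Benchmark: CPU seconds spent on `reps` calls of the candidate. */
static double time_calls(dot_fn f, const float *a, const float *b, size_t n, int reps)
{
    volatile float sink = 0.0f;   /* keeps the calls from being optimized away */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) sink += f(a, b, n);
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Run matches_ref first and time_calls second on both dot_ref and the candidate; a faster routine that fails the accuracy check is not an optimization.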

That being said, it's pretty rare that I put in that much effort. I don't know what your application is, but usually you have a pretty good idea of where your problem areas will be before you start, so you would end up taking care of them early to mitigate risk.