Tuesday, May 24, 2016

Performance


First, some information about the hardware timings. The calculator's screen redisplays its contents ("refreshes") at 64 Hz, which means we have to create ("render") a new image every 0.015625 seconds, or 15625 millionths of a second ("μs", microseconds).

Remembering that we have 3.5 million CPU ticks available every second this means that we have 3.5*15625= 54687.5 cycles available to render each new screen. Since the CPU frequency fluctuates +/- 0.2% (spread spectrum modulated clock to reduce EMI) we can't be that accurate in our expectations. More than that, we might be running on hardware that has a slower clock. Let's round it down to 50 thousand cycles ("kc", kilocycles) just to make it easier to remember and to be safe.

Ok so we're going to fill the screen with two different-looking tiles. The outer ones are supposed to represent walls and those inside represent floors. Every game needs a background, right? Over it, we will have our hero, his (or her) enemies, powerups, traps and crates (you must always have crates!)

"Game" area
Each tile is 8x8 pixels which, if you've read the previous post, means 16 consecutive memory locations. We will write it to memory one row at a time, with each successive row being 136/4=34 memory locations away, or in other words, each row takes 34 memory locations and we're only writing the first two for each row. (The darker column on the right edge is the part of the memory that we have to write but is not displayed on the screen)

label   command operand cycles
        |       |       |
loop    A=DAT0  B       * 15
        DAT1=A  B       * 13
        D0=D0+  2       * 6.25
        D1=D1+  16      * 6.25
        D1=D1+  16      * 6.25
        D1=D1+  2       * 6.25
        B=B-1   A       * 5
        GONC    loop    * 9.5


I won't bore you with explaining in detail what each instruction does but it's good to know that the first line reads the first byte ("B") from the tile into the A register, the second line writes it to the screen, from third to sixth it advances the memory pointers (which can only be adjusted by 16 locations each time), the seventh decrements a counter (which is initialized to 7 before the loop) and the last one loops back to the start if the previous operation did not over/underflow.

More important right now is the cycles column. It reads how many cycles the CPU needs to complete it. For one iteration of the loop, the total is 67.5. For eight iterations, it's 8*67.5=540. To fill the entire screen with eight rows of 17 tiles each, it's 8*17*540= 73440.

73440 cycles. When we only have ~50000 cycles available.

Fuck. We don't have time to do the innermost loop of rendering the background in time... even without our hero, enemies, powerups, traps or crates (no crates ;( ). And we'd need extra time to handle the interactions between those entities, too!

Let's try optimizing the tile rendering, with complete disregard for memory and register usage. Loop unrolled, switched to using B to increment C so doing D1=D1+34 indirectly, which is 7 cycles faster:

A=DAT0 B * 15 DAT1=A B * 13 D0=D0+ 2 * 6.25 C=C+B A * 5 D1=C * 6.75

This is repeated seven times and after that we have one more read/write ("copy") to do:

        A=DAT0 B * 15 DAT1=A B * 13

The total time now is 7*46+28=350 and 8*17*350=47600. Nope. Still not remotely good. Spending 95% of the time available just to write what is essentially an empty screen will not get us anywhere.

But there must be a way... and i will find it. Otherwise...

... this wouldn't be possible :)

No comments:

Post a Comment