Assembly programming adventures for a scientific calculator.: 2016

Wednesday, May 25, 2016

Taking advantage of a device's capabilities

Know thy hardware

In the previous post, I showed you that rendering, in a straighforward way, a background the size of the display is so slow that it's unusable. There are of course much more efficient methods of displaying images, background, foreground, sprites, etc. The first step towards finding a better method is learning more about the hardware, what it can do and how to program it to do it. Today I will try explaining an example of this.

The display shows only a section of what is called the hardbuffer (actually, there are two hardbuffers, but that's for later). In reality the hardbuffer (from now on "HARDBUFF") may be any memory area. It's just that you can tell the calculator "use this memory location as the start of the image you will show on the screen". It's like you give someone an entire book but you tell them "start reading on page five". You set HARDBUFF to be a specific location by writing that address (which is 20 bits wide) into a special memory location by the human-readable name =DISP1CTL (=DISP2CTL is for the other hardbuffer and the equal sign is not very important). There is another special memory location that accepts how long each row of HARDBUFF is, in nibbles, which is called =LINENIBS. There are some other peculiarities about =DISP1CTL and some other special memory locations, but these will do for now.

Suppose we write the background image, as we did in the previous post, starting at address #01000h. That display would appear if we wrote the nibbles 0,0,0,1,0 to the locations =DISP1CTL, =DISP1CTL+1 ... =DISP1CTL+4 (or "write #01000h to =DISP1CTL"). This is how the screen will look if we write #01022h (#01000h + 34) instead:

One pixel/row/step up

Essentially, we scrolled the game area up, as if our player character moved down. The last row looks random. It certainly isn't our image, we didn't write that far anyway. What is it? It's memory. It's the memory that happens to (already) exist, just after the area our image's data exists. Doesn't sound very useful. Yet.

We observe that we can change what is shown on the screen without redrawing it. We just wrote a different number in =DISP1CTL and the display changed. Not much, but it changed. Can we use it? Yes. Can we abuse it? Of course!

So by rendering a 136*128 background (instead of the current 136*64) and decrementing/incrementing the value we write to =DISP1CTL by 34, we can move the image down/up (scroll up/down) respectively.

What about horizontal scrolling? Well, things now get tricky: =DISP1CTL can only accept addresses on byte boundaries, so we can only do a big step left/right by 8 pixels. There is, however, a 4bit address, called =BITOFFSET, which sets the pixel offset of what is shown on the screen. Right now it is set to 0. To scroll towards the left, as the player moves right, we would have to increment =BITOFFSET by 1 for each pixel towards the right and when we're about to increase it from 7 to 8, write 0 to it instead and increment =DISP1CTL by 2. That, however, implies the image we are displaying is wider than 136 pixels: if we did this with the image we have shown until now, the screen would become garbled and at the start of each row of pixels, pixels from the ending of the previous row would appear. We need to adjust =LINENIBS, to configure how many nibbles the display controller should skip after reaching the end of each row, by an amount appropriate to the size of the displayed image.

So, we need a larger image. And with a larger image comes heavier memory usage. And longer rendering times. For now however, we can, theoretically, scroll a static image back and forth, up and down, without much cpu usage: just a couple writes. That's a big step from earlier when we didn't have time to render one screen's worth of data in time.

Tuesday, May 24, 2016

Performance

First, some information about the hardware timings. The calculator's screen redisplays its contents ("refreshes") at 64 Hz, which means we have to create ("render") a new image every 0.015625 seconds, or 15625 millionths of a second ("μs", microseconds).

Remembering that we have 3.5 million CPU ticks available every second this means that we have 3.5*15625= 54687.5 cycles available to render each new screen. Since the CPU frequency fluctuates +/- 0.2% (spread spectrum modulated clock to reduce EMI) we can't be that accurate in our expectations. More than that, we might be running on hardware that has a slower clock. Let's round it down to 50 thousand cycles ("kc", kilocycles) just to make it easier to remember and to be safe.

Ok so we're going to fill the screen with two different-looking tiles. The outer ones are supposed to represent walls and those inside represent floors. Every game needs a background, right? Over it, we will have our hero, his (or her) enemies, powerups, traps and crates (you must always have crates!)

"Game" area

Each tile is 8x8 pixels which, if you've read the previous post, means 16 consecutive memory locations. We will write it to memory one row at a time, with each successive row being 136/4=34 memory locations away, or in other words, each row takes 34 memory locations and we're only writing the first two for each row. (The darker column on the right edge is the part of the memory that we have to write but is not displayed on the screen)


label   command operand cycles

        |       |       |

loop    A=DAT0  B       * 15

        DAT1=A  B       * 13

        D0=D0+  2       * 6.25

        D1=D1+  16      * 6.25

        D1=D1+  16      * 6.25

        D1=D1+  2       * 6.25

        B=B-1   A       * 5

        GONC    loop    * 9.5

I won't bore you with explaining in detail what each instruction does but it's good to know that the first line reads the first byte ("B") from the tile into the A register, the second line writes it to the screen, from third to sixth it advances the memory pointers (which can only be adjusted by 16 locations each time), the seventh decrements a counter (which is initialized to 7 before the loop) and the last one loops back to the start if the previous operation did not over/underflow.

More important right now is the cycles column. It reads how many cycles the CPU needs to complete it. For one iteration of the loop, the total is 67.5. For eight iterations, it's 8*67.5=540. To fill the entire screen with eight rows of 17 tiles each, it's 8*17*540= 73440.

73440 cycles. When we only have ~50000 cycles available.

Fuck. We don't have time to do the innermost loop of rendering the background in time... even without our hero, enemies, powerups, traps or crates (no crates ;( ). And we'd need extra time to handle the interactions between those entities, too!

Let's try optimizing the tile rendering, with complete disregard for memory and register usage. Loop unrolled, switched to using B to increment C so doing D1=D1+34 indirectly, which is 7 cycles faster:

A=DAT0 B * 15 DAT1=A B * 13 D0=D0+ 2 * 6.25 C=C+B A * 5 D1=C * 6.75

This is repeated seven times and after that we have one more read/write ("copy") to do:

A=DAT0 B * 15 DAT1=A B * 13

The total time now is 7*46+28=350 and 8*17*350=47600. Nope. Still not remotely good. Spending 95% of the time available just to write what is essentially an empty screen will not get us anywhere.

But there must be a way... and i will find it. Otherwise...

... this wouldn't be possible :)

Monday, May 23, 2016

Monochrome pixel graphics

Little black squares

This is the third step of the introduction to the nightmare that is realtime graphics coding for the HP48 calculator and now i'll try explaining how pixel graphics work.

The display is 131 pixels (little squares) wide and 64 pixels high. Well, actually the display is 136 pixels wide, but the rightmost 5 pixels are not there. We still have to take them into account when programming, but the player will never see them.

The display

In the above image, most pixels are some sickly shade of green, but some pixels (the grid pattern) are very dark blue. I show you this grid so you can see the resolution and scale of the display. The dark blue pixels are "on" and the green are "off". On a more common display, they would be respectively black (off) and white (on). We will rarely see anything "common" applicable to this calculator.

Now focus on this square of 16 pixels:

Some pixels

(The black grid is not part of the display, it's there only so you can tell how the individual pixels are divided.)

Let's assign the digit 0 for each pixel that is off (green) and the digit 1 for each pixel that is on (dark blue). The above square can be thus represented with the following sequence of lines:

0001

0000

1111

0000

But if we interpret the sequence "0001" as a binary number, it represents the quantity "one". Similarly, the other three lines represent the quantities 0, 15 and 0, in decimal notation, the notation you use daily. In hexadecimal notation, they are the numbers 1, 0, F and 0.

This method is important because it means we can represent an image with numbers. Just take it one row at a time, divide each row into sets of four pixels, "read" those sets them from left to right, represent each four pixels with one number from 0 to 15 and write them down. The first row on the larger, first, image above (The display) would be 34 repetitions (because 34*4=136) of 10, the second row would be 34 repetitions of 0 (because the row has only pixels in the "off" state) and the third row would be the pattern 8, 0 repeated 17 times. That is, of course, in decimal. In hexadecimal, and writing each row sequentially, it would be writen ("stored in memory") as:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA00000000000000000000000000000000008080808080808080808080808080808080

, with each digit (number) stored in each memory location and in increasing order (what, you thought memory is organized in bytes? Hahahaha, no. Half bytes, or nibbles). Thankfully, we don't have to mess with such cumbersome representations of memory, we mostly consider the display divided into tiles, such as the grid in the first picture, which is the 8-by-8-pixel tile:

, repeated in specific locations in memory (not all sequentially).

That representation is somewhat easier to imagine, even if it doesn't always correspond directly to what exists in memory. An all black, 4-by-4 square would be FFFF. All white, 0000. Checkerboard pattern, A5A5.

The following picture shows all possible patters represented with the numbers 0 to 15 (decimal) or 0 to F (hexadecimal), from top to bottom.

All numbers

In other words, how each number, from 0 to F, is shown on screen.

The conclusion is: if we want to display a character on the screen, we have to take the image:

Duuh...

, divide it into sets of 4 pixels in a row, find out what numbers correspond to each pattern and write those numbers to the appropriate memory locations. In the case of our perplexed friend above, the pattern of the first (=leftmost) four pixels of the first row corresponds to the fourth row in the previous image, the row which shows how the number three (the first line shows the number zero) appears on screen. Another way is considering the pixels in sets of four, as binary digits (0011) and reading them as hex digits (3). First row: 0011, 1111, 1111, 0000 or 3,F,F,0. Second row, 400C, etc.

Thanks for reading. More, later.

Sunday, May 22, 2016

Quick, dirty and incomplete description of the instruction set

The CPU

Here is the summary.

It runs from 3.75 MHz to 4 MHz (depending on date of manufacture and model G/G+/GX), which is internally (probably... we're not sure yet. Could be clock stretching.) subdivided into quarter cycles. Since that wouldn't be weird enough, as long as the LCD is enabled (which is pretty much always) about 10% of CPU time is always spent refreshing the display. So practically speaking, we have 3.5 million CPU cycles ("ticks") available each second. Might sound much, but it really (really) isn't.

Memory runs at 2 MHz. It's a "DX2".

What does it have?

Four 64bit, semi-general purpose registers: A,B,C and D.

Five 64bit, "scratch" registers: R0, R1, R2, R3, R4 and R5.

Two 20bit address registers: D0 and D1.

One 16bit status register: ST.

One 4bit pointer register: P.

One 20bit, 8-level hardware stack: "RSTK".

One 20bit program counter: PC.

One carry flag: "CRY" or "C" (not to be confused with the C register).

Four hardware status flags: MP, SR, SB, XM.

Its address space is 512 KB, not 1 MB, because the data bus is 4 bits wide, not 8.

What can you do with those?

With A, B, C and D, you can...

... clear it: r=0,

negate it (2's complement): r=-r,

invert it (logical NOT): r=-r-1,

increment or decrement it: r=r+1, r=r-1,

add or subtract a constant from 0 to 15: r=r+CON, r=r-CON,

shift it one bit to the left or right: r=r+r, rSRB,

shift it one nibble (4bits) to the left or right: rSL, rSR,

shift it one nibble to the left or right *circularly*: rSLC, rSRC,

test it: ?r=0, ?r#0,

(for r= A,B and D) add, subtract, OR or AND with C: r=r+C, r=r-C, r=C-r, r=r!C, r=r&C,

(for r= A,B and D) exchange it with C: rCEX,

(for r= A, B and D)compare it with C: ?r<C, ?r<=C, ?r=C, ?r>=C, ?r>C, ?r#C.

You can not do anything between B and D.

So, A can interact with B and C.
B can interact with A and C.
C can interact with A, B and D.
D can interact only with C.

With A and C you can also...

access the address registers: D0=A, AD0EX and similarly for D1 and/or C,

read from and write to memory: A=DAT0, C=DAT0, A=DAT1, C=DAT1, DAT0=A, DAT0=C, ...

access R0...R4: A=R0, R1=A, AR0EX, AR3EX, CR2EX, etc

test their first 16 bits: ?ABIT=0 13

set/clear their first 16 bits: ABIT=0 12, CBIT=1 10

With C you can also...
access the stack: C=RSTK, RSTK=C,
access the program counter: C=PC, PC=C, PC=(C), CPCEX
access the pointer register: C=P, P=C, CPEX, C=C+P+1 (lolwat)

One note: for most of the commands, you can (or must) specify a "field" operand. For example, "A=0 B" clears only the first byte, leaving the other 14 bytes unaffected. This extends to arithmetic, logic and some other operations. There are the B, X, XS, M, S, A, W and P fields.

You can do conditional ( GOC, GONC ) and unconditional jumps ( GOTO, GOLONG). You can also call subroutines ( GOSUB ) and RTN from them

There are some various other bits and pieces here and there, but these should keep your noggin dizzy for a while.

Next time, we'll attempt some graphics and observe just how miserably slow is this CPU.

A short description of the calculator.

A calculator, you say?

It's an HP48G/G+/GX. It was manufactured by Hewlett-Packard, until a couple decades ago. It's the last, decent, old-school, HP calculator. Of course it's RPN only. That is because it was made before they bonked their heads on a wall, releasing the HP49G, before they went mad releasing the HP49G+ and before they went completely apeshit and releasing the HP50G. Now they write "apps" which you can run on your phone to do calculations.

I've been using my gx for quite some time now. It's gotten me through the last year of high school, university, post-grad and (mandatory) army "service". It's as sturdy as a rock, as reliable as the sun rises every morning, it can be powered by a hamster running in a wheel and it's as fast as a sloth covered in molasses and stoned off his gourd.

My (computer's) graphics card, an AMD R9 380, claims memory transfer rates of about 180 GB/s. The computer's RAM, which is DDR4-2133, claims about 17 GB/s. My cellphone's RAM goes at about 15 GB/s. The HP48GX, you ask? 70 KB/s. Or, to put it in perspective, 0.00007 GB/s.

It's so slow that when you perform an addition, you can see the "busy" icon stay on for half a second.

Of course, this means I must try to write a game for it. An action game. With bullets, enemies, powerups, randomized levels and explosions and stuff.

I'm sure the DSM-V has a name for my condition.