Programming stories
January 5, 2014 8:46 AM   Subscribe

For your Sunday reading, a couple of stories of ye olden computing days: Why MacPaint's Original Canvas was 416 Pixels Wide and A Great Old Timey Game Programming Hack.
posted by curious nu (29 comments total) 37 users marked this as a favorite
I love heroic graphics programming. My favorite examples are in the original Quake, especially the trick of doing a parallel FDIV during the inner rendering loop (Pentiums had just come out with fast floating-point instructions) and the low-quality subdivision rasterization trick.

GPUs and shaders are fun, but you're not getting the huge bang for your buck that One Weird Coding Trick could get back then.
posted by RobotVoodooPower at 9:55 AM on January 5, 2014 [2 favorites]

Having taken a course on assembly programming, I basically understood and enjoyed these stories. If you have not, clicking any of these links triggers a sanity check.
posted by LogicalDash at 10:12 AM on January 5, 2014 [3 favorites]

If you miss the days of heroic programming check out the feats performed by the Parallax Propeller, an inexpensive embedded chip almost designed from the ground up to require this kind of thinking. It was designed to be able to bit-bang video, but at only 80 MHz it's only barely able to generate VGA and all the drivers involve heroic compromises.
posted by localroger at 10:26 AM on January 5, 2014 [2 favorites]

I'm still pleased with the improvements I found for Michael Mahon's NadaNet project in 2005. Transferring bit-serial data from one 1MHz Apple II to another without using a hardware shift register, at 8μs per bit (which worked out to 94-95μs per byte including loop control, DPLL and checksum generation overhead) took a lot of jiggling and fiddling. My part starts at page 60 of this source code listing PDF (6502 assembly).

I loved writing timing-critical code, and do sometimes mourn its passing as a useful skill.
posted by flabdablet at 10:32 AM on January 5, 2014 [3 favorites]

The relevant things to know about the 6502 instruction set when thinking about that problem, by the way, are that forcing the output pin high or low, or reading an input pin, all cost 4μs; shifting the accumulator costs 2μs; conditional branches cost 2μs if not taken or 3 if taken.
posted by flabdablet at 10:38 AM on January 5, 2014 [1 favorite]

I try to avoid anything assembly, so I've got no hardware hacks. The assembler talk reminded me though of dabbling with "core wars" as a kid. This was a game where you have a block of memory and some machine code ("red code"), and you have to write a program that will corrupt or subvert your opponent's code before theirs does yours.
It wasn't very popular at the time, because a program (called Mouse, iirc) was considered unbeatable. Mouse was a tight loop that rapidly copied itself to a random spot in memory. Under the rules you had to take out the copy as well as the original, but by the time you found it, there would be another ten copies, and eventually so many copies that they'd be landing in the memory occupied by your program(s) destroying them.
I made a program that was extremely effective against traditional programs, and it could also beat mouse more often than not. It worked by rapidly systematically dropping code bombs every x addresses in memory, also a tight loop, so whether it or mouse won depended on whether mouse got a lucky hit by chance before the steam roller caught a mouse and subverted its processes.
A problem though, was that my program would rapidly lace all the memory, loop around the core and bomb memory that it was using. My little "yes!" moment was finding an optimisation that tightened the code such that in the traditional memory sizes for core wars, the program would perfectly align so as to drop the bomb directly onto it's own bomb - thus repairing instead of corrupting itself when the moment came that it would target memory it occupied. Meanwhile the optimization also meant bomb address was now in the center of the code, so with only half the program on each side of the bomb, the remaining lines of code on either side of the bomb were small enough to fall between the cracks between bomb drops. The program became inoculated against itself, freeing it to blast through all the memory over and over.
posted by anonymisc at 10:42 AM on January 5, 2014 [17 favorites]

There is still room in the world for hacks like this, which always makes me smile.

The most recent winner is John Papandriopoulos, the creator of SnappyCam for iOS. SnappyCam is an epic creation of hand-tuned ARM assembly that allowed him to achieve massive image capture and decoding rates in software, even faster than Apple was doing it in hardware. Technical details are (well, were) on his blog and it's a testament to what a dedicated assembly programmer can do.

Apple apparently was just as impressed since they bought SnappyLabs and we're assuming that Papandriopoulos is now working for Apple. If you have SnappyCam, keep it, since it's now been pulled off the App Store.

We've come a long way from Atkinson's 68000 hack, that's for sure.
posted by JoeZydeco at 11:03 AM on January 5, 2014 [5 favorites]

This seems like the proper time and place to post this again: The Story of Mel, a Real Programmer.
posted by Mr. Bad Example at 12:31 PM on January 5, 2014 [9 favorites]

This made me happy, and I thank curious nu for posting it. I work several universes up the stack, in database query land, but ingenuity and hard work like this always makes me so happy.

People aren't awful. Good stuff is being done. So thanks for that, OP.
posted by mrdaneri at 12:38 PM on January 5, 2014 [2 favorites]

BTW - the 6809 trick and the 68K trick are one and the same. In 1991, me and another dedicated hack that I worked with, got our hands on a Stargate machine (actually a boardset that we could install in our Robotron). My buddy wrote a 6809 disassembler and both of us started taking the code apart. The PSHS/PULU trick was in there. Stargate had no dedicated hardware for bit-blitting (unlike Robotron). It was pretty awesome code.

I've done my share - for example, I worked on a printer that was supposed to be cheap, cheap, cheap. It was running a pokey 68K, but DEC really wanted it to have serial, parallel, and appletalk all running simultaneously. The problem was that if you were printing and getting interrupts from communications at the same time, you'd either drop scan lines when the printer FIFO starved or you'd lose communication bytes (this changed depending on how you ordered the priority of the interrupts). I was able to mitigate the problem by fully unrolling the FIFO code, but it would still happen every hundred pages or so if communications was getting pounded. Ultimately, we had to upgrade to a slightly deeper FIFO.

Later I worked on the PostScript cartridge for the HP LaserJet III series printers. On the version with the duplex unit, pages printed in the order 2-1-4-3-6-5 because of the duplex unit. Even pages had to be printed rotated 180 degrees. The problem was that for legal-sized paper, there was an unacceptable delay from when the paper reached the registration point until it got pulled into the engine. The reason was that the frame buffer was 1.3M, 1 bit per pixel and the code looked reasonable:
while (low < high) {
   char temp = revtable[*low];
   *low++ = revtable[*high];
   *high-- = temp;

One issue was that that while test was being run 1.3M times and only failed once at the end. I unrolled that with Duff's device to keep the code portable and got a nice preformance boost.

HP said it wasn't good enough, so I rewrote this one in assembly. The problem with PostScript was that it was complicated enough that it tripped compiler bugs (oh yeah, Sun's 68K was pretty buggy with optimization on) if you turned on optimization, so that was out.

So in rewriting the code, I made sure that everything I needed was in a register and that the loop was unrolled enough so that I could fit everything in a sobgtr instruction (subtract one and branch on greated), which helped the performance tremendously. HP was still not happy, but I had two tricks up my sleeve - if they could grant us an additional 128K of ROM space, I could rewrite the code to use words instead of bytes (since the 68K was really a 16 bit machine internally, but a byte reveral table is 256 bytes, but a word based one is 2 * 64K = 128K). In addition, I could skip reversal on any 0 word, which would be a probabilistic win since most pages are white (no bits on). HP gave us the ROM budget, we gave them the performance.

At this point, I liked to think that there is a software engineer's Valhalla where we get drunk discuss brilliant hacks used to solve tight resource constraints.
posted by plinth at 4:58 PM on January 5, 2014 [7 favorites]

I feel old, btw.
posted by plinth at 4:58 PM on January 5, 2014

These are just fascinating. I would love a whole book about olde optimization stories of yore, along with its natural counterpart, Horrible Bugfixes. That graphics story combines both! Perfect.
posted by Jon Mitchell at 5:01 PM on January 5, 2014 [1 favorite]

Last year I needed a propeller VGA driver with properties that none of the existing objects had. I needed 40 columns and at least two different text colors on the same line, and using the Propeller's ROM font but also allowing for extended characters. This simply was not possible (I thought) in the inner loop that feeds 32-bit pixel values to WAITVID. So I wrote a driver using three of the Propeller's 8 cores, two of them processing odd and even lines respectively to turn text into raw bit patterns, fed to the video system by the third cog.

Then later in the year one of the other Propeller maven implemented most of the same features using a single cog.

I just love this shit.
posted by localroger at 5:31 PM on January 5, 2014 [1 favorite]

the 6809 trick and the 68K trick are one and the same

and there's a related one for the 65816 found in the Apple //gs and various Nintendo consoles.

One of the things I was very pleased to find in the 65816 instruction set when that processor was first released was a block-move instruction. The way it was supposed to work was that you'd load the X register with a source address, Y with a destination address, A with a byte count, and then use MVP or MVN to do the block move (where MVP increments the address registers after each byte and MVN decrements them).

But my pleasure turned to disappointment once I found out how the instruction actually worked. MVP and MVN are actually three-byte instructions: the two bytes following the opcode are source and destination memory bank addresses (the '816 has a 24-bit address space, but X and Y are both only 16-bit registers). And rather than make the processor do internal looping, which would complicate what it would have to do when dealing with interrupts, all the MVP/MVN instructions do is decrement the program counter by 3 at the end of each byte-move cycle: the processor actually has to fetch and decode the block move instruction anew for each byte moved. All of which works out to the instruction taking seven cycles to move each byte: better than the obvious hand-coded copy loop, but very disappointing compared to the two cycles per byte (one read, one write) that I had expected.

I'll come back and explain the trick I found for improving that once I'm released from small child chess duty.
posted by flabdablet at 9:28 PM on January 5, 2014

Meanwhile, the seminal Massalin paper on superoptimization captures the spirit of the time quite nicely and its results are actually in current use.
posted by flabdablet at 9:32 PM on January 5, 2014 [1 favorite]

Specifically Massalin begat the Synthesis kernel which begat the Self project which begat the idea of JIT compilers in production code which then was incorporated into the Java virtual machine, modern Javascript VMs, .NET, etc.

I love hacking bits and optimizing specific CPU cycles too. They're fun little puzzles, immensely satisfying when you come up with a complete solution. But it's a whole lot of work for a relatively small result: "look, I can copy pixels to the screen!" So I'm happy to program way up in Python or something, far removed from the details of the hardware but in the celestial realm where I can make complex things happen with relatively little code. I just hope the fiddly assembly programmers are still there making my VM run fast. (Go PyPy!)
posted by Nelson at 9:37 PM on January 5, 2014

Ha I just noticed JoeZydeco's post about SnappyCam linked to the Wayback archive of Papandriopoulos' post. Because true to form, once a brilliant hacker is hired into Apple, Inc their work disappears from the Internet too, taken down. I've known a few map experts who've been disappeared into Apple too. It's really stupid; forcing bright people to stop talking to their peers about their work is no way to build a hacker culture. Shame on you, Apple.
posted by Nelson at 9:41 PM on January 5, 2014 [4 favorites]

On some systems around that time (e.g. the Amiga and the Commodore 64) re-drawing the entire screen wasn't necessary - you could scroll the contents of the screen to the left or right simply by setting the hardware register that pointed to the start of screen memory. Increase it by one and the contents of the screen would appear to move 1 pixel to the left. Because it wasn't quite as simple as 1 pixel = 1 byte, there was a second register to allow scrolling in single pixel increments, which effectively moved the start of screen memory by a fraction of a byte. Of course, you can't just keep incrementing the screen address forever or you will get to the end of memory, you had to do separate tricks to make it "wrap around" to the bottom of the screen memory address space. Still, scrolling the entire screen with only a couple of memory writes - you can't get more optimal than that!
posted by L.P. Hatecraft at 9:48 PM on January 5, 2014

OK, so.

Another 65816 extensions to the 6502 instruction set was PEA (push effective address). Like so much of the '816, this one promised more than it delivered: the 6502's address generation logic was really very flexible, and a true Push Effective Address instruction that allowed for all the address modes available to, say, Load Accumulator would have been tremendously useful. Unfortunately there really wasn't room for one in the opcode map, and all PEA actually did was push the two bytes following the opcode.

But there was also PEI (push effective indirect address), which treated one byte following the opcode as a Page 0 address, retrieved two bytes from that address and the one following, then pushed both. Also, the '816 had a 16-bit Direct Page register: you could load that with any 16-bit address, and that would be taken as a base address for subsequent Page 0 address references.

PEI took 6 cycles: one for opcode fetch, another for opcode decode overlapped with operand fetch, two for reading two bytes from page 0, and two for writing those to the stack.

So provided that the memory block I wanted to move resided in the bottom 64K of RAM, and provided I didn't need to move it outside that region (both of which conditions were true for the screen-scrolling stuff I was writing at the time) I could load Direct Page with the end address of the source block, turn off interrupts, load the stack pointer with the end address of the destination, and run an unrolled sequence of PEI instructions to do the move. At 6 cycles per two bytes moved that came to 3 cycles per byte, more than twice as fast as the inbuilt memory move instruction.
posted by flabdablet at 11:40 PM on January 5, 2014 [1 favorite]

Couple of years ago I saw a beautiful trick for doing unsigned 8 bit multiply at reasonable speed on a 6502, using lookup tables.

The basis of the trick is the arithmetic identity a2 - b2 = (a + b)(a - b). You transform that a little to get x * y = [(x + y) / 2]2 - [(x - y) / 2]2, then capture (n/2)2 for n = 0..512 as 1K of lookup tables. You can then do your multiply using an 8-bit subtract, an 8-bit add, two 16-bit lookups and a 16-bit subtract.
posted by flabdablet at 11:51 PM on January 5, 2014 [1 favorite]

I once worked on a 68000-based video poker machine, where the hardware designer had invented a very cheap way to get video sprite data written to memory at high speed without needing a proper DMA controller.

There was a separate static RAM for sprite data, with address and data ports you could write to from code; the SRAM address register behind the address port would automatically post-increment on each access to the data port.

There was also a Bypass bit you could turn on, and if you did that, then any write cycle generated by the CPU would source its data not from the CPU data bus but from the sprite SRAM's data port instead. The intent was that you'd turn on the Bypass bit, do a 68000 Move Multiple instruction generating enough write cycles to dump the sprite data to VRAM, then turn Bypass off again.

Which was all well and good, except that you had to have interrupts masked when you did this: the effects of a processor trying to push execution context with Bypass turned on would be Not Good. Also, Move Multiple with interrupts masked caused horrendous interrupt latency. Also, the machine had to speak AppleTalk, and it had the usual Z85C30 Serial Communication Controller with the usual 3-byte receive FIFO, and of course no DMA controller: the hardware guy had just looked at the worse-case interrupt latency specified in the 68000 data sheet and decided that interrupt-per-byte was going to be workable.

And of course the guy in charge of the project as a whole was a portable software ideologue to whom the idea that you might need to write at least the time-critical parts of this thing in assembler was anathema.

I eventually got called in as a consultant on that job after it had stalled for a few months. The timing worked out. Just.
posted by flabdablet at 12:13 AM on January 6, 2014

Does anyone know how QuickDraw "regions" (the ones Bill famously still remembered) worked? It's a headscratcher to me how they drew only the visible parts of the screen with lots of overlapping windows.
posted by bonaldi at 8:07 AM on January 6, 2014

At the point where drawing into screen memory happened, it would need to be bounding boxes and bitmasks AFAIK. I don't think there's a magic way to do that.

There is a neat instruction sequence for combining pixel data, mask data and existing screen data so that only those pixels not masked out get their data changed, which I first saw used in the Apple II lo-res plot routines.

In 68000 assembler, the simple-minded way to do that job would go something like
        move.l (a0),d0  ;get existing screen data
        move.l (a1)+,d1 ;get new image data
        move.l (a2)+,d2 ;get clipping bitmask
        move.l d2,d3    ;make inverse clipping mask
        not.l  d3
        and.l  d2,d0    ;clip existing screen data
        and.l  d3,d1    ;inverse-clip new image data
        or.l   d1,d0    ;combine existing and new data
        move.l d0,(a0)+ ;write to screen
The neat way works like this:
        move.l (a0),d0  ;get existing screen data
        move.l (a1)+,d1 ;get new image data
        eor.l  d1,d0    ;mix them for masking
        and.l  (a2)+,d0 ;clear mixture where clipped
        eor.l  d0,(a0)+ ;update screen
The first eor.l generates a mixture of image pixels and pre-existing screen pixels. The and.l clears all bits that are supposed to stay the same on the screen to zeroes. The final eor.l unmixes the old screen pixels away in the regions the and.l didn't affect, replacing what's in screen memory with new image data; in regions the and.l did clear, it exclusive-ors existing screen data with zeroes, leaving it unchanged.

Kind of annoying that the reputedly orthogonal 68000 was missing an EA-to-register form of the EOR instruction.
posted by flabdablet at 9:35 AM on January 6, 2014 [1 favorite]

Yeah, XOR is cool.

Sometimes it can save a lot of memory. If you have a need for a list of arbitrary length that can be traversed in either direction, one of the conventional ways to do that is using a doubly linked list, where each list node contains two pointers: one to its successor node and another to its predecessor.

If instead of maintaining a single "current node" pointer into such a list you maintain two pointers to a pair of neighbouring nodes, then instead of keeping separate backward and forward pointers inside each list node, you can use a single pointer-sized link consisting of the backward and forward pointers XOR'd together.
posted by flabdablet at 7:10 PM on January 6, 2014

Yeah, and nobody seems to think that sort of thing is worth bothering with any more. And they're so wrong that it's never necessary. There is no computer so powerful or with so much memory that you can't flatline it with a totally ridiculous algorithm.
posted by localroger at 7:44 PM on January 6, 2014

posted by flabdablet at 7:50 PM on January 6, 2014

Waffling about posting this since it's not really focused on specific hacks, but James Hague's Halcyon Days: Interviews with Classic Computer and Video Game Programmers is a fun read for nostalgia alone.
posted by mubba at 1:34 PM on January 7, 2014

To build the first TiVo boxes, we had to make a 50 Mhz PowerPC do a crazy amount of work in real-time* (and your TV, unlike a PC, cannot crash), so we spent a fair amount of time optimizing code (and custom hardware).

After we shipped our first few models (eventually with a 200 Mhz MIPS core, woohoo!), my favorite engineer left for Google, thinking he would have all the computer horsepower he could ever use, but when a million copies of your code are running on a million cores, it's worth optimizing that code too, both for power savings and speed, plus you also have a million users waiting.

*Record a video stream to disk while "simultaneously" playing back another video stream from disk without losing audio sync on either, and maintaining a UI, a database, and miscellaneous I/O hardware like the IR receiver, LEDs, fan, and a modem.
posted by Hello Dad, I'm in Jail at 2:23 PM on January 7, 2014 [1 favorite]

« Older Dark Incantations In Corrupt Languages   |   This is Mr Maupin. He invented San Francisco. Newer »

This thread has been archived and is closed to new comments