My newschool desktop PC's motherboard is severely sick. Searching through my old junk in the hope of finding at least a Pentium 2 or better motherboard and CPU, so that I can continue working on modern compilers, I stumbled upon my old 386. It still has a Gravis UltraSound in it! I watched some demos. I even reconsidered my old dream of optimizing some effects and maybe coding an oldschool 386 demo. I guess I'll just keep it around and give it a go in my free time.
I even have some old code of a fire routine on the HD, done years ago when I first wanted to teach myself x86 assembly. The old code was predictable and slow. My friend Antitec helped me optimize it on an old Pentium in the gas station where he was working at the time. It was fun! A bit later (still years ago) I tried what I thought was a neat trick: simply doing the blur algorithm with 32-bit registers, 4 pixels at once. I assume that the pixel gradient goes from 0 to 63, so each byte can hold the sum of four pixels (at most 4*63 = 252) without overflowing into the byte on its left. That means I can safely read a 32-bit value from memory, add three more of them, shift the result right by two for the division by four, and then AND the 32-bit register with 0x3F3F3F3F to zero the bits that leaked into each byte from its neighbour during the shift. 4 pixels at once!
; out of inner loop
MOV AX, Blurbuffer
MOV DS, AX
MOV ES, AX
; inner loop
; Unroll the inner loop 80 times (one scanline: 80 dwords = 320 pixels)
It does the work for 4 pixels at once and still looks like a per-pixel blur! It actually works well!!! (Originally, I thought it would just produce junk or blocky pixels and would be hard to maintain.) This way, on my 386DX40 with a Tseng Labs ET4000 able to display 320*200*8bpp at 85fps, I got this fire to somewhere around 35fps (exactly two frames per update; I haven't implemented a timer yet to know exactly, only the lousy raster CPU meter).
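To make the arithmetic of the trick concrete, here is a minimal sketch in C rather than 386 assembly, so it is a model of the idea and not my actual inner loop. The function names `blur4` and `blur4_bytewise` are made up for this illustration, and the code assumes every pixel is in the 0..63 range.

```c
#include <stdint.h>

/* Average four packed dwords, each holding four pixels in the range 0..63.
   Hypothetical helper for illustration only. */
static uint32_t blur4(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    /* Each byte lane sums to at most 4 * 63 = 252, so no lane carries
       into the lane on its left. */
    uint32_t sum = a + b + c + d;

    /* The shift divides every lane by four, but the two low bits of each
       lane slide into the top of the lane on its right; the mask clears
       them again. */
    return (sum >> 2) & 0x3F3F3F3Fu;
}

/* Reference version: the same average computed one byte at a time. */
static uint32_t blur4_bytewise(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = lane * 8;
        uint32_t s = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu)
                   + ((c >> shift) & 0xFFu) + ((d >> shift) & 0xFFu);
        out |= (s >> 2) << shift;
    }
    return out;
}
```

As long as the inputs stay within 0..63 per byte, both functions should return the same value, which is why the result still looks like a per-pixel blur.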
To take advantage of data alignment, I experimented with both code alignment (doesn't gain much) and data alignment. Normally, it would be great if the DI-1 and DI+1 in the code above could be DI-4 and DI+4, because aligned reads really gain some speed, but this produces a little garbage near the pixels. I tried other variations, and keeping only one of them as -4 or +4 and the other as it originally was worked a bit better visually, merely changing the direction of the fire a bit. My last version must be DI-1, DI, DI+4 and DI+320, and the fire goes upwards and maybe a bit to the right, iirc.
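Here is a rough C model of one pass with that last set of taps (-1, 0, +4, +320), working in place on a single buffer. It is only a sketch: `fire_blur`, `load32` and `store32` are names invented for this example, the loop bounds are simply chosen so that no read leaves the buffer, and the real thing would be unrolled assembly.

```c
#include <stdint.h>
#include <string.h>

#define WIDTH  320
#define HEIGHT 200

/* memcpy keeps the unaligned dword access (the -1 tap) legal in C;
   on the 386 this would be a plain unaligned 32-bit read. */
static uint32_t load32(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static void store32(uint8_t *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* One in-place pass over a WIDTH*HEIGHT buffer of 0..63 pixels.
   Hypothetical sketch: starts at offset 4 so the -1 tap stays inside the
   buffer, and stops while the +WIDTH tap is still inside it. */
static void fire_blur(uint8_t *buf)
{
    for (int i = 4; i + WIDTH + 4 <= WIDTH * HEIGHT; i += 4) {
        uint32_t sum = load32(buf + i - 1)      /* unaligned tap      */
                     + load32(buf + i)          /* aligned tap        */
                     + load32(buf + i + 4)      /* aligned tap        */
                     + load32(buf + i + WIDTH); /* the scanline below */
        store32(buf + i, (sum >> 2) & 0x3F3F3F3Fu);
    }
}
```

With a single hot pixel on one scanline, a pass leaves a quarter of its value in the same column on the scanline above, which is the upward movement of the fire; the +4 tap is the only one that mixes in pixels from a different group of four.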
Anyway, that's not the point. I have another good idea to try: keeping another buffer in memory that describes the state of pixel blocks, say blocks of 8*8 or 16*16 pixels. When I have some bobs moving on the screen, a flame or a burning wireframe cube for example, not every region of the screen has to be blurred. While writing a bob or line into the blobbuffer, I could also make two shifts on the coordinates of the pixel I am writing to find where in the gridbuffer it is located, then write a nonzero number there to indicate that something has recently been written in that block of the blobbuffer. Only the blobbuffer blocks whose grid number is currently nonzero will be processed, both for blurring and for writing to VRAM. And at each frame, I will decrease the value of every element in the gridbuffer. So a bob pixel landing in a block sets a defined number there (let's say 16), and if nothing returns to that block, the number decreases each frame; the assumption is that after 16 frames that area will be totally blurred and black again.
It's hard to explain, and maybe it produces some artifacts (unless I use a bigger define than 16). You'd also say it is slower, but if I write a few nicely blurred bobs or the few line pixels of the wireframe cube, I only have to write to the grid buffer for very few pixels, so I may gain more than I waste. This way I hope to achieve full frame rate (70fps). I could just show a fire cube in a smaller screen area, but I want something moving all over the screen and looking like a fullscreen blur (without necessarily updating and blurring the same screen blocks all the time :)
Other than that, I was thinking of some other assembly optimization ideas or parts where I can improve things, but they won't gain me more speed than a good trick ;)
I also tried this 32-bits-at-once trick on other effects where possible, like a software per-pixel plasma, which gets a bit more than full frame rate, though a bit less if I make the plasma look better or more complex. It works for the moment; I'll just have to fix that or use some ModeX tricks and go for a fake translucent one!
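A rough C sketch of the grid idea, since it is easier to show than to explain. Nothing here is tested on the 386: `plot`, `process_frame`, `BLOCK_SHIFT` and `BLOCK_LIFE` are names I am making up for the illustration, the blocks are assumed to be 8*8, and the actual blur and VRAM copy of a block are left as a comment.

```c
#include <stdint.h>

#define WIDTH        320
#define HEIGHT       200
#define BLOCK_SHIFT  3                        /* 8*8 blocks (assumption)   */
#define GRID_W       (WIDTH  >> BLOCK_SHIFT)  /* 40 blocks per row         */
#define GRID_H       (HEIGHT >> BLOCK_SHIFT)  /* 25 rows of blocks         */
#define BLOCK_LIFE   16                       /* frames a block stays live */

static uint8_t blobbuffer[WIDTH * HEIGHT];
static uint8_t gridbuffer[GRID_W * GRID_H];

/* Write one pixel and mark its block as recently touched.
   The two shifts turn pixel coordinates into grid coordinates. */
static void plot(int x, int y, uint8_t colour)
{
    blobbuffer[y * WIDTH + x] = colour;
    gridbuffer[(y >> BLOCK_SHIFT) * GRID_W + (x >> BLOCK_SHIFT)] = BLOCK_LIFE;
}

/* Visit only the live blocks and age them by one frame.
   Returns how many blocks were processed this frame. */
static int process_frame(void)
{
    int processed = 0;
    for (int by = 0; by < GRID_H; by++) {
        for (int bx = 0; bx < GRID_W; bx++) {
            uint8_t *life = &gridbuffer[by * GRID_W + bx];
            if (*life == 0)
                continue;   /* nothing written here recently: skip */
            /* ...blur block (bx, by) and copy it to VRAM here... */
            (*life)--;
            processed++;
        }
    }
    return processed;
}
```

A single `plot` call keeps its block alive for `BLOCK_LIFE` frames and then the block drops out of the loop. One thing the sketch does not handle is the blur spreading across a block edge into a neighbour that was never marked, which might be one source of the artifacts I'm worried about.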
Maybe the next thing I'd like to code is a 3d starfield or a rotozoomer. And a precalculated spherical mapping too. Always wishing to make them as fast as possible.
But I am dreaming of my first oldschool PC demo (I wonder why there isn't an oldschool PC coding scene the way there is a C64 or CPC scene). I will use the GUS. I found some players out there. On the little 386, I can't afford to lose speed to some lousy Sound Blaster player. It's the first time I have seen Second Reality running so well on a 386, and that's because of my good VGA card and the GUS, which gains a lot of speed there. I'd prefer to use the GUS on an old PC that needs the power and the SB on my 486.
Maybe it doesn't make any sense to you out there, but I simply love doing this. Optimizing oldschool PCs was an old dream of mine that I only recently had the chance to fulfill. A waste of time? One day I'd like to write an essay called "The meaning of useless." =)