About 3 weeks ago, I got hooked by another senseless project on the 3DO. It's not so much a complete project as an experiment, fueled by my love for the game Daggerfall. I just had to see whether I could load the Daggerfall map data, let the player move around anywhere in the world of Daggerfall, and do it at a semi-decent framerate on the 3DO. The project isn't finished yet, as all I have so far are examples of clicking on a map pixel and rendering that area from a distance, not from the player's eye. Each quad I render in the above twitter link covers around 164 square meters, and I would have to get much closer and subdivide the geometry further to match the exact scale of Daggerfall.
That's a project I will continue soon; in the meanwhile it gave me the opportunity to run some performance tests on the real machine and get some numbers I was curious about. One of them, trying to reach the theoretical limit of how many polygons per second the hardware can take, was posted in my latest tweet. Of course, when I use the CPU to transform the polygons (maybe I can optimize this, or learn to use the alleged matrix hardware of the machine if I find out how, or maybe the Portfolio math API already utilizes it), a geometry of 2916 quads drops down to 5 frames per second. But at the press of a key, I make it skip calculating the next transformation and keep sending the already transformed list of quads from the last frame again and again. This way I am testing exclusively the limits of purely feeding the hardware with polygons, mimicking the way old console marketing would claim thousands or millions of polygons per second (numbers you would never see in a game, even if you divided them by 30 or 60 to get a rough per-frame figure). They never talk about the actual polygon counts you can expect in games, which will be far less. Still, such a test gives a good idea of what to expect: even if you were the god of optimizing, you still wouldn't get over these numbers, or even half of them. Being conservative, you might divide them by 4 or 8 to get a rough idea of a practical budget for scene complexity.
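A rough sketch of that benchmark toggle, so it's clear what is being measured (the helper names and variables are from my own framework, not the SDK; only DrawCels is a Portfolio call):

```c
/* When the button is held, skip the CPU transform and just resubmit last
   frame's already-built CEL list, so only the hardware feeding is timed. */
if (!skipTransform) {
    transformAndProjectQuads(quads, numQuads, celList);  /* CPU side: ~5fps at 2916 quads */
}
DrawCels(BitmapItems[currentScreenPage], celList);       /* pure polygon throughput test   */
```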
I was really curious about the whole thing, and I do enjoy pushing things and benchmarking on real hardware. This way I did hit some of the theoretical maximums I had heard about in the past. You usually test under ideal conditions: with small, flat, untextured polygons I got a bit over 50fps, which at 2916 quads is around 150,000 polygons per second, an exact number I kept hearing around the internet. With textures (small ones, 8*8, size really matters) I dropped to 90,000 (I might have heard that number somewhere too). It's not over though, I want to do a more robust test. I realized that a small percentage of the polygons (maybe 5-10%) might be hiding slightly outside the screen while I have hardware clipping enabled, and others might be facing backwards, so the CW/CCW settings might have discarded another small percentage. I'll prepare a more robust test where I subdivide a 2D screen into a finer grid that always faces the player like 2D sprites. That's a next test I want to do, and I will do it in a way that tests both high poly throughput and pure fill rate at low resolution.
But that's not why I opened this post. I did some more tests, which I didn't post, that have to do exclusively with fillrate. I also tried some new methods of scrolling the big 2D map more efficiently. Before the 3D rendering, I had to load the lowres map (1000*500 pixels), which is later subdivided into 5*5 subpixels (making it 5000*2500, though those subpixels have to be streamed from CD every time I click a pixel on the lowres 2D map). I would scroll it around so that the player can choose a pixel to display in 3D (it was really choosing an area of 3*3 such pixels, subdivided into 14*14 quads; a grid line consists of 3*5 vertices, but you subtract one, so they fit 14 quads. That's for the simple tests; for the 2916 quads test it was 11*11 pixels and 54*54 quads). Even the scrolling of this big bitmap became a performance problem with the brute-force way I was doing it at first. It wasn't bad, but it was missing the full 50-60fps (I get 50 on my PAL 3DO here).
Simply put, I had created a CEL of 1000*500 16bit pixels (and that's near the limit, as from what I've read in the docs the maximum texture width or height the GPU accepts is 1024). The regular screen is a fraction of this, 320*240 at 16bpp. So I would simply render it as a sprite (with my code, where you define a sprite object with position, zoom and rotation (although we don't need the last two) and it prepares a CEL object with the necessary vectors to display it) and let the hardware deal with the fact that the majority of the CEL is outside the screen.
There is another interesting bottleneck of the 3DO: CELs are not clipped against the screen as efficiently as a modern GPU would do it. First of all, a trap is that when you start a new 3DO project, the GPU's superclipping (as it's called) is disabled by default. CELs carry flags where you can enable or disable various things by changing bits, and two flags for clipping (CEL clipping and line clipping) are available, but in my first test I would enable them and nothing would change in performance. This is something I knew well before this experiment, from some old code where I zoomed some 3D stars (really 2D billboards); when some of them got very close but sat just outside the frame, so it wasn't obvious, I would get enormous frame drops. I found out that it's not enough to enable the two superclipping flags on every one of your CELs (every sprite is one CEL object). You first have to remember, when you init the BitmapItems of your videobuffer at the beginning of your code, to set some CEL engine controls, for example enabling the superclipping functionality for all of them by calling SetCEControl(BitmapItems[i], 0xffffffff, ASCALL);
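As a minimal sketch (the CCB flag names are the ones I believe the SDK headers use for the two clipping bits; BitmapItems, NUM_SCREEN_PAGES and cel are my own variables):

```c
/* Enable the CEL engine control bits once per screen BitmapItem... */
int i;
for (i = 0; i < NUM_SCREEN_PAGES; ++i) {
    SetCEControl(BitmapItems[i], 0xffffffff, ASCALL);
}

/* ...and set the superclipping flags on every CEL you draw. */
cel->ccb_Flags |= CCB_ACSC | CCB_ALSC;   /* allow CEL super clipping + line super clipping */
```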
Now, if you didn't do that, my scrolling map of 1000*500, with enormous amounts of it going outside the screen, would be slow as hell. That wasn't the case, but still something was off, and I knew the reason well before too. When I start my test with the position nailed to the upper left of the map (which means most of the out-of-bounds texture extends outside the screen to the right and below), I got over 50fps (with vsync disabled it hit 60). But when scrolling the map to the very right I got 25. When scrolling down to the lower right corner (so most of my bitmap is outside the screen on the left and upper sides) I got 33. I understand the 25 now, but I'm still a bit puzzled by the 33 (based on my understanding of how it works and how I provide the direction vectors for this sprite, I would expect the 25 and 33 to be reversed; anyway, this needs more investigation). The reason is that the GPU clipping doesn't geometrically work out a smaller sub-quad region to start rendering from; it walks a horizontal rasterizer for every line (or maybe builds a grid) and each step decides whether it can stop rendering preemptively. For example, if I render a horizontal scanline of the texture from X=0 to 1000 with the direction pointing right, every step I check whether I am still inside the framebuffer, and once I am past 319 I can say: "since my direction vector points right and my current position is already off the right side of the framebuffer, there is no reason to continue, so skip to the next line". But if I am rendering from -680 to 319, then while I am off the left side of the framebuffer I have to say: "I may be outside for a lot of pixels, but since my direction goes to the right I can't predict whether I'll ever hit the framebuffer, so I am not allowed to stop". Your scanline could run from -1000 to -1 for all it knows and it would still waste time, since the GPU doesn't geometrically calculate bounds and skip ahead, from what I understand so far. So there is partial clipping (still faster than no clipping at all), and nothing is rendered outside, but the engine still walks through work that could be avoided by finding the smaller region that fits. My understanding comes from the CEL Clipping section on this link.
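Here is that mental model as pseudocode (my own illustration of the behaviour as I read it in the docs, not actual hardware logic):

```c
/* Rough mental model of line super clipping, assuming a scanline walked left to
   right: the engine can only bail out once it has provably left the screen. */
for (x = startX; x <= endX; ++x) {
    if (x >= SCREEN_W) break;        /* past the right edge while moving right: safe to stop early */
    if (x >= 0) plotTexel(x, y);     /* inside the framebuffer: render the texel */
    /* x < 0: off the left edge but still moving right, so it might yet enter the
       screen; the walk continues and that off-screen lead-in is the wasted time */
}
```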
But in the latest version of my project I thought of a much better and easier way, where you make the CEL read and render only a specified region of a much bigger texture. There are two numbers for the horizontal width in the flags of a CEL, set in what are called the preamble words. One of them (really certain bits in those flags) is the exact number of pixels in a CEL bitmap line (I set this to 320 instead of the original 1000). The other also affects the width and is the number of words, minus 2, to skip in the bitmap data to reach the next line. So I set this to correspond to 1000 pixels (500-2 = 498: since pixels are 16bit = 2 bytes, 1000 shorts (16bit) are 500 longs (32bit)). It's a bit complicated and takes bit fiddling and reading the manuals until you get it. And now I render 320*240 pixels from the big bitmap and never anything outside. Of course there are more problems, like: how do you scroll? You change the bitmap start address, but this is dword (32bit) aligned, so with a 16bpp texture you can only scroll every 2 pixels horizontally (vertically you can at least scroll smoothly). However, by combining other CEL flags like SKIPX (which I learned from the original Doom 3DO column rendering code) you can offset by a single pixel (you have to self-correct by toggling the texture width between 320 and 319 with a modulo over the X offset), and combining all these you achieve pixel-perfect scrolling (or really, selection of an exact 320*240 subtexture rendered over the 320*240 screen).
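Roughly, the setup looks like this. This is a sketch under my current understanding; the PRE0/PRE1 field names and shifts are the ones I believe the hardware headers define, so check them against the docs and the Doom 3DO source before trusting it:

```c
#define MAP_W   1000   /* full source bitmap width in pixels, 16bpp */
#define VIEW_W   320   /* visible window rendered to the screen */
#define VIEW_H   240

uint16 *mapData;       /* the 1000x500 16bpp map bitmap */
CCB    *cel;           /* the CEL that displays the window */
int     offsetX, offsetY;

int rowWords = MAP_W / 2;          /* 2 pixels per 32bit word at 16bpp -> 500 words per source row */
int wordX    = offsetX >> 1;       /* word-aligned part of the horizontal offset */
int skipX    = offsetX & 1;        /* sub-word pixel offset handled by SKIPX */

cel->ccb_SourcePtr = (CelData*)(mapData + offsetY * MAP_W + wordX * 2);

cel->ccb_PRE0 = (cel->ccb_PRE0 & ~(PRE0_VCNT_MASK | PRE0_SKIPX_MASK))
              | ((VIEW_H - 1) << PRE0_VCNT_SHIFT)           /* rows, minus 1 */
              | (skipX << PRE0_SKIPX_SHIFT);                /* skip 0 or 1 leading pixel */

cel->ccb_PRE1 = (cel->ccb_PRE1 & ~(PRE1_WOFFSET10_MASK | PRE1_TLHPCNT_MASK))
              | ((rowWords - 2) << PRE1_WOFFSET10_SHIFT)    /* words to the next source row, minus 2 */
              | ((VIEW_W - skipX) - 1);                     /* visible pixels per row, minus 1 (TLHPCNT) */
```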
So, anyway, I achieved this, and here is where I started some more performance tests I was curious about. What are the limits of the fillrate on this machine? What can you expect for 2D games? It seems not that fast, but not so bad either, and with more optimizations and careful choices you could achieve a lot. So, what are some results?
In this one, I had the perfectly simple test of reading exactly a 320*240 16bpp bitmap and writing it back to the 320*240 16bpp screen. One fullscreen read and one fullscreen write. I got 85fps overall (regardless of where I scrolled the texture; no variance, as no superclipping is needed anymore). It's over the target fps (50 or 60) if we want to vsync for perfectly smooth output. That's good. But of course, if we had a heavy multi-layer parallax 2D game, maybe things would become problematic. Would they?
Usually the 2nd/3rd parallax layers won't be full screens of pixels; they will have transparent gaps over the majority of the bitmap, and that is handled very efficiently on the 3DO. So, if you don't do stupid things and you optimize your graphics design, you can achieve much more. Secondly, my texture is 16bit, but as I realized there is more to be gained by designing your graphics to fit in 8 bits or less. A CEL texture can be 1, 2, 4, 6, 8 or 16 bits. So I tried the same experiment with different bit depths. I tested 8bit and 16bit Uncoded (pure RGB) and 4bit, 6bit and 8bit Coded (which is what the 3DO manuals and API call the texture formats based on an indexed palette). Here are my results.
- 16bit - 85fps
- 8bit - 113fps
- 6bit - 129fps
- 4bit - 148fps
It's not exactly linear (though you could fit a line through it); 8bit is not twice as fast as 16bit, but then there's also the fixed cost of writing to the 16bit videoram. Maybe I can come up with a simple equation to predict it as I learn more from my tests. Another thing: there is no difference between the Coded (palettized) and Uncoded 8bit versions here (both at 113fps), but I read somewhere in the manual that Coded might be preferred when you have transparent pixels (index 0 skips rendering instead of being a solid color, maybe easier to check than RGB 0,0,0, which does also become transparent?). There is more to explore for optimization; there is even a texture format with RLE compression, which I guess could skip big transparent runs of pixels very cheaply. So it really depends on which texture formats you use, and you have to be strategic about lowering the color depth as far as your art allows without looking too bad.
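Out of curiosity, here is one naive model of my own (an assumption, not something from the docs): treat the frame cost as proportional to texture bytes read plus framebuffer bytes written, with the 16bpp write side fixed.

```c
#include <stdio.h>

/* Naive bandwidth model: cost ~ bytes read + bytes written per frame, with the
   destination always 320*240 at 16bpp. Calibrated on the measured 85fps at 16bpp. */
int main(void)
{
    const double writeBytes = 320.0 * 240.0 * 2.0;      /* fixed framebuffer write */
    const double baseCost   = writeBytes + writeBytes;  /* 16bpp source: read == write */
    const double baseFps    = 85.0;
    const int    depths[]   = { 16, 8, 6, 4 };
    int i;

    for (i = 0; i < 4; ++i) {
        double readBytes = 320.0 * 240.0 * depths[i] / 8.0;
        printf("%2dbpp: predicted ~%.0f fps\n", depths[i], baseFps * baseCost / (readBytes + writeBytes));
    }
    return 0;   /* prints ~85, ~113, ~124, ~136 against the measured 85, 113, 129, 148 */
}
```

It nails the 8bit case and undershoots the lower depths, so reading is apparently not the only thing that gets cheaper; still, it supports the idea that the fixed 16bit write is why halving the texture depth doesn't halve the frame time.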
Another thing: I wanted to try almost pure writing with very little texture reading. The original test reads exactly 320*240*16bpp of data and writes back the same amount. What if I fed it a tiny 4*4 16bpp texture and scaled it up (so that it's not 4*4 pixels on screen, but stretched to fill it)? This mostly measures pure writing to VRAM, with minimal loss from texture reading. That's what I tried, and the result was 239fps!
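For reference, the stretch is just the CEL's direction vectors doing the scaling. A sketch, with the 12.20/16.16 fixed-point layouts of the deltas as I understand them from the CCB docs:

```c
/* Stretch a 4x4 texel CEL over the whole 320x240 screen. */
cel->ccb_XPos = 0 << 16;               /* 16.16 screen position */
cel->ccb_YPos = 0 << 16;
cel->ccb_HDX  = (320 / 4) << 20;       /* 12.20: each texel spans 80 pixels horizontally */
cel->ccb_HDY  = 0;
cel->ccb_VDX  = 0;
cel->ccb_VDY  = (240 / 4) << 16;       /* 16.16: each texel row spans 60 screen lines */
```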
And another test. Did you know that there is not one but two CEL engines in the GPU? There are two parallel processors, in other words, and from some reading and later experiments it seems the rendering is split between the odd and even lines of the texture bitmap: CEL engine 1 takes care of bitmap lines 0, 2, 4, 6, etc. and CEL engine 2 works on 1, 3, 5, 7, etc. There is a flag in the CEL that decides whether each engine waits for the other to finish its work or whether they work independently. There is also another flag that disables one of them and has a single engine do all the work. Thankfully, in contrast to superclipping, the two parallel engines are enabled by default when you create CELs via the API functions (if you were to fill in the bits on your own, you'd have to remember it yourself), so it's harder to get caught out and underperform because the high-level API didn't do it for you.
With the 4*4 texture stretched to the screen, disabling one of the two CEL engines dropped the result from 239fps to 133fps. In the regular tests (16, 8, 6, 4bit) I got 60, 82, 90 and 99fps. Then something weird: when I enabled the dual CEL engines again in my code, I suddenly got slightly better fps than before: 89, 129, 143 and 163fps. That's a significant difference. But why? Here is the kicker: what I had actually enabled was the flag that says "lock the two corner engines together". But wait, the default was to let them operate independently, which would make me think that if one finishes (one of its scanlines, I assume), it doesn't wait for the other. I would expect a slight increase, not a decrease, from not locking them together. But what do I know about how the hardware works deep down? And does it behave differently for smaller polygons or other conditions? This needs more investigation. (You can have a look at some of these CEL flags here, by the way.)
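If I map my tests onto the flag names I've seen in the CEL documentation (take the names and their exact semantics as my assumption), the three configurations look like this:

```c
/* Three configurations compared, one line per test (not cumulative):
   CCB_ACE = allow the second corner engine, CCB_LCE = lock the two engines together. */
cel->ccb_Flags = (cel->ccb_Flags | CCB_ACE) & ~CCB_LCE;   /* default: both engines, independent         */
cel->ccb_Flags =  cel->ccb_Flags & ~CCB_ACE;              /* single engine: 239fps -> 133fps (4x4 test) */
cel->ccb_Flags =  cel->ccb_Flags | CCB_ACE | CCB_LCE;     /* both engines, locked: the faster variant   */
```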
I finally tested the DMA. Another part of the hardware is called SPORT, and it's what I currently use to erase the VRAM each frame with a background color. At some point I wondered if this costs time: in my default framework I always erase the frame with this transfer before rendering anything, and here the Daggerfall map overdraws the whole screen anyway, so maybe it was unnecessary and eating into the framerate. But no, it's essentially free. I tested calling only the SPORT clear (rendering of the Daggerfall map completely disabled) and got 1198fps! Then I removed that too (nothing rendered except my tiny fps counter) and got 1593fps. Keeping it in or leaving it out while rendering the map doesn't affect the 85fps (it sometimes jumps between 84 and 85, but it does that anyway). So, if you want to clear the screen before rendering, don't do it the way I did before, by rendering a big black CEL over the screen; use the SPORT DMA instead.
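The clear itself is a one-liner once you have the VRAM IOReq. A sketch based on the usual Portfolio example-library calls; the Bitmaps/currentScreenPage names are from my own setup and BG_COLOR_16BIT stands for whatever 16bit clear color you use:

```c
/* Clear the current draw buffer to a background color via the SPORT. */
Item   vramIOReq = GetVRAMIOReq();                          /* created once at init */
int32  pageSize  = GetPageSize(MEMTYPE_VRAMPAGESIZE);
int32  numPages  = (320 * 240 * 2 + pageSize - 1) / pageSize;
uint32 bgColor   = (BG_COLOR_16BIT << 16) | BG_COLOR_16BIT; /* 16bit color doubled into 32 bits */

SetVRAMPages(vramIOReq, Bitmaps[currentScreenPage]->bm_Buffer, bgColor, numPages, 0xFFFFFFFF);
```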
One final thing: if I understood correctly (and I still need to write code and test it), the SPORT can be used for one more thing. You can blit a static background from another buffer into VRAM. So, instead of just clearing quickly to a single color, you can "clear" to a static background image that doesn't move. You could have, for example, a nice picture of space galaxies and nebulas or whatever, which wouldn't scroll, but it would be your static background layer and you could add parallax layers in front. You can't scroll it, if I understand correctly, but you can blit either a color or a fullscreen 16bit static image. I don't know yet how fast the second one is (maybe, because it's SPORT DMA, it's also essentially free, and you can use it instead of a screen-fitting quad CEL that would waste a lot). I'll definitely test it in the future.
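If I read the docs right, that would be the CopyVRAMPages variant of the same transfer, something like this (untested on my side, and I assume the source buffer has to live in VRAM for the SPORT to see it):

```c
/* Blit a prepared fullscreen background buffer into the draw buffer via the SPORT. */
CopyVRAMPages(vramIOReq, Bitmaps[currentScreenPage]->bm_Buffer, backgroundBuffer, numPages, 0xFFFFFFFF);
```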
If you combine all these things cleverly (and possibly more that I could still find), you have great power in your hands: you could have good 2D parallax layers and many sprites, sometimes zoomed/rotated, as long as you know the hardware and are clever with the limitations and your art. But of course it's also very easy to screw up if you brute-force things and don't think about it. That happens on modern hardware too, where it's so easy to just not think about it, and sometimes very retro-ish indies still underperform on my modern rig. No matter how fast the system, you can always fall into traps. On the 3DO it's more interesting, because it's so easy to fall in, and yet there is great functionality and there are cool workarounds to get that sweet full framerate.
p.s. Not related, but I bought a 3DO mouse. Someone gave me the idea of supporting it in OptiDoom, while I didn't even know this thing existed. That would be fun, and quite easy to try, I think!
p.p.s. I forgot to mention another very interesting thing. When I was doing my STNICCC demo and, in its last versions, converted it to a benchmark, I noticed in some tests comparing big textured and untextured polygons that the untextured ones were slightly slower than polygons textured with very small textures. How is that possible? The reason is that the two parallel CEL processors need more than one bitmap row to cooperate on the job. The original CEL values for flat rendering (as taken from Doom) effectively make a 1*1 texture, from what I understand (maybe they also set certain flags; I still have to investigate, because you also give a null pointer as the source and the 16bit color as the palette pointer, which is weird), and a single row can't be split between the two processors. By using a tiny single-color texture (e.g. 4*4, even 1*2 would work) you force the processors to really split the job in half for big, untextured, single-color polygons. This is why my present test uses 4*4 (also, 4 pixels at 16bit is 8 bytes, the minimal texture row size in the flags; and I have good reasons to prefer an n*n texture over n*m, which isn't useful in this test and wouldn't change the results, but it's something else I discovered that sometimes makes me prefer n*n, and I'll talk about it in the future).
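A sketch of that workaround (makeSolidCel16 is a hypothetical helper of my own, not an SDK call):

```c
/* Give the CCB a tiny solid-color bitmap spanning several rows, so both corner
   engines get rows of their own instead of the single row of a 1x1 "flat" CEL. */
static uint16 solidTexels[4 * 4];   /* 4x4 at 16bpp: 8 bytes per row, the minimum row width */

void makeSolidCel16(CCB *cel, uint16 color)
{
    int i;
    for (i = 0; i < 4 * 4; ++i) solidTexels[i] = color;
    cel->ccb_SourcePtr = (CelData*)solidTexels;
    /* PRE0/PRE1 would be set up for a 4x4 16bpp uncoded bitmap, the same way as in
       the windowing sketch earlier, and the HDX/VDY deltas scale it over the quad. */
}
```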