Super FX speed test #340

paulb-nl · 2022-06-16T15:45:47Z

Here is a test that measures how long an instruction takes to complete. It counts in a loop until the SFX is stopped, so higher numbers mean the instruction took longer. Small differences don't matter much. One 21MHz cycle results in a difference of around $66 (102) loops. For example, nop gives $0134 in 21MHz cache mode, which is 1 cycle; add # is 2 cycles and gives $0199 loops.
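As a rough illustration of the conversion described above, here is a small Python sketch. The function name and the choice of the nop result as the 1-cycle baseline are assumptions made for this example; the counts are the two values quoted above for 21MHz with cache on.

```python
LOOPS_PER_21MHZ_CYCLE = 0x66  # about 102 loops per 21MHz cycle (from the post)
NOP_COUNT = 0x0134            # nop at 21MHz with cache on, taken as 1 cycle


def estimate_cycles(count, baseline=NOP_COUNT,
                    loops_per_cycle=LOOPS_PER_21MHZ_CYCLE):
    """Estimate 21MHz cycles for a loop count, relative to a 1-cycle nop."""
    return 1 + (count - baseline) / loops_per_cycle


print(round(estimate_cycles(0x0134)))  # nop
print(round(estimate_cycles(0x0199)))  # add #
```

Rounding absorbs the small run-to-run differences mentioned above; it would not be appropriate for tests that need sub-cycle resolution, such as the plot tests.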

It can be run on an original Super FX cart by swapping the cartridge while the console is on. The code runs in WRAM on SNES and Cart RAM/Cache on Super FX.

Here are reference captures of a StarFox cart (Mario Chip), Stunt Race FX (GSU1) and Yoshi's Island (GSU2)
https://drive.google.com/drive/folders/15ac9U-x__n0AgOlWa3FGo5eEMShZYl5g?usp=sharing

The Mario Chip (v1) is unstable when reading from or writing to Cart RAM. Some tests time out, which doesn't happen with the GSU chips.

Another difference is the cache opcode: on the GSU it works immediately, while the Mario Chip seems to need 16 bytes to fill first, so not all instructions are faster in this test with the StarFox cart.

The ljmp instruction is also quite weird. It takes much longer on the GSU chip than on Mario Chip. Not sure what's going on there.

With cache off, the MiSTer core runs faster at 10MHz than at 21MHz, which is strange.

Buttons:
Left/Right: Switch to different tests
Select: Toggle 10/21MHz
Y: Toggle High speed multiplier
B: Toggle Cache

MiSTer captures:
sfx_test_MiSTer_captures.zip
https://drive.google.com/drive/folders/1noo2pRPoexCtVPgqSbzaexr61WOvjHNW?usp=sharing

Test rom:
SuperFX.sfc.zip

Source:
https://github.com/paulb-nl/sfx_speed_test


FitzRoyX · 2022-06-18T01:29:01Z

Nice test! Would it be possible to color the text green/red to reflect correct/incorrect based on the hw results?

paulb-nl · 2022-06-25T19:25:01Z

That would get a bit complicated with the 2 different chip versions and 8 setting combinations. It is also hard to decide whether a small difference is acceptable, because some tests, like the plot tests, need to be accurate to 1/8th of a cycle. The cycle count can change after 8 plot instructions because the pixel data is then written to RAM.

paulb-nl · 2022-06-25T19:33:19Z

Here is a comparison of some differences. These are not all the differences but I think it is enough for now :)

The cycles mentioned below with the 10MHz tests are 10MHz cycles so 1 cycle = 2x 21MHz cycles.

MiSTer vs Stunt Race FX (GSU1):

10MHz, MS0, No cache:
Everything is too fast.
NOP $72F-$4C8 = $267 = 3 cycles too fast
ADC # (2 NOPS) $993-$660 = $333 = 4 cycles too fast
MiSTer NOP vs 2 NOPS $660 - $4C8 = $198 = 2 cycles
GSU NOP vs 2 NOPS $993 - $72F = $264 = 3 cycles
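The "cycles too fast/slow" figures in these lists can be reproduced with a short calculation. The sketch below is in Python, with a hypothetical function name, and assumes that the first number on each line is the hardware (GSU) result and the second is the MiSTer result, which is how the "MiSTer NOP" and "GSU NOP" lines above read.

```python
LOOPS_PER_21MHZ_CYCLE = 0x66  # about 102 loops per 21MHz cycle


def cycles_too_fast(hw_count, core_count, speed_mhz):
    """Positive: the core is faster than hardware. Negative: it is slower.

    At 10MHz one cycle equals two 21MHz cycles, so it spans twice as many loops.
    """
    if speed_mhz not in (10, 21):
        raise ValueError("speed_mhz must be 10 or 21")
    loops_per_cycle = LOOPS_PER_21MHZ_CYCLE * (2 if speed_mhz == 10 else 1)
    return (hw_count - core_count) / loops_per_cycle


# 10MHz, MS0, no cache
print(round(cycles_too_fast(0x72F, 0x4C8, 10)))  # NOP
print(round(cycles_too_fast(0x993, 0x660, 10)))  # ADC # (2 NOPS)
```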

10MHz, MS0, Cache on
FMULT $8C9 - $7F9 = $D0 = 1 cycle too fast
GETB* $7FC - $662 = $19A = 2 cycles too fast
GETB_2 $730 - $595 = $19B = 2 cycles too fast
LDB $663 - $595 = $CE = 1 cycle too fast
LDW $730 - $661 = $CF = 1 cycle too fast
LM $994 - $8C5 = $CF = 1 cycle too fast
LMS $8C7 - $7F8 = $CF = 1 cycle too fast
LMULT $994 - $8C5 = $CF = 1 cycle too fast
SBK $4CB - $3FE = $CD = 1 cycle too fast
SM $663 - $595 = $CE = 1 cycle too fast
SMS $597 - $4C9 = $CE = 1 cycle too fast
STW $4CB - $3FD = $CE = 1 cycle too fast

10MHz, MS1, Cache on
FMULT $598 - $4C9 = $CF = 1 cycle too fast
LMULT $663 - $595 = $CE = 1 cycle too fast

10MHz PLOT, Cache on
PLOT 4 color: $267 - $29A = -$33 = 0.25 cycles too slow (2 cycles every 8 plots?)
PLOT 16 color: $266 - $2FE = -$98 = 0.75 cycles too slow (6 cycles every 8 plots?)
PLOT 256 color: $280 - $3CA = -$14A = 1.625 cycles too slow (13 cycles every 8 plots?)

The PLOT -> LOOP -> NOP loop takes 3 cycles, so 8 plots take 8 x 3 = 24 cycles. That is enough time to save the secondary pixel cache to RAM for 4 & 16 color data without waiting, so PLOT should only take 1 cycle. For 256 color, PLOT is 0.125 cycles slower ($280 vs $266), so it seems to wait 1 cycle every 8 plots.
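The "(N cycles every 8 plots?)" guesses in the PLOT list follow from multiplying the per-plot difference by 8. A minimal Python sketch, using a hypothetical helper name and the 10MHz numbers above (hardware first, MiSTer second, as assumed from the NOP lines):

```python
LOOPS_PER_10MHZ_CYCLE = 2 * 0x66  # one 10MHz cycle = two 21MHz cycles


def extra_cycles_per_8_plots(hw_count, core_count,
                             loops_per_cycle=LOOPS_PER_10MHZ_CYCLE):
    """Extra cycles the core spends over 8 plots compared with hardware."""
    per_plot = (core_count - hw_count) / loops_per_cycle
    return per_plot * 8


for name, hw, core in [("4 color", 0x267, 0x29A),
                       ("16 color", 0x266, 0x2FE),
                       ("256 color", 0x280, 0x3CA)]:
    print(name, round(extra_cycles_per_8_plots(hw, core)))
```

The same helper gives the 256 color hardware wait mentioned above: ($280 - $266) spread over 8 plots comes out at roughly 1 cycle.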

PLOT with color #$FC should be treated as no-plot in 4 color transparent mode since low 2 bits are zero.

21MHz, MS0, No cache
FMULT $AC5 - $7F8 = $2CD = 7 cycles too fast
GETB* $CC4 - $BF4 = $D0 = 2 cycles too fast
GETB_2 $AC6-$9F6 = $D0 = 2 cycles too fast
LDB $A60 - $C5A = -$1FA = 5 cycles too slow
LDW $9F9 - $C5A = -$261 = 6 cycles too slow
LM $FF4 - $1253 = -$25F = 6 cycles too slow
LMS $DF6 - $1055 = -$25F = 6 cycles too slow
LMULT $CC4 - $9F6 = $2CE = 7 cycles too fast
MULT $861 - $7F8 = $69 = 1 cycle too fast
SBK $BF8 - $C5A = -$62 = 1 cycle too slow
SM $FF4 - $1055 = -$61 = 1 cycle too slow
SMS $DF6 - $E58 = -$62 = 1 cycle too slow
STW $9F9 - $A5C = -$63 = 1 cycle too slow
UMULT $861 - $7F8 = $69 = 1 cycle too fast

21MHz, MS1, No cache
FMULT $92D - $7F8 = $135 = 3 cycles too fast
LMULT $B2B - $9F6 = $135 = 3 cycles too fast

21MHz, MS0, Cache on
FMULT $466 - $3FD = $69 = 1 cycle too fast
GETB* $4CB - $3FE = $CD = 2 cycles too fast
GETB_2 $465 - $397 = $CE = 2 cycles too fast
LDW $531 - $595 = -$64 = 1 cycle too slow
LM $663 - $6C7 = -$64 = 1 cycle too slow
LMS $5FD - $661 = -$64 = 1 cycle too slow
LMULT $4CB - $463 = $68 = 1 cycle too fast
SBK $3FF - $463 = -$64 = 1 cycle too slow
SM $4CB - $52F = -$64 = 1 cycle too slow
SMS $465 - $4C9 = -$64 = 1 cycle too slow
STW $3FF - $463 = -$64 = 1 cycle too slow

21MHz, MS1, Cache on
FMULT $2CD - $3FD = -$130 = 3 cycles too slow
LMULT $332 - $463 = -$131 = 3 cycles too slow

21MHz PLOT Cache on
PLOT 4 color: $134 - $19B = -$67 = 1 cycle too slow (8 cycles every 8 plots?)
PLOT 16 color: $133 - $218 = -$E5 = 2.25 cycles too slow (18 cycles every 8 plots?)
PLOT 256 color: $20C - $317 = -$10B = 2.625 cycles too slow (21 cycles every 8 plots?)

sorgelig · 2022-06-25T20:12:38Z

If I remember right, the GSU code was written as a functional analog, not a cycle-accurate one. So most likely it needs to be reworked for cycle accuracy.

paulb-nl · 2022-06-25T20:55:13Z

With this list it may seem that not much is accurate, but many of the instructions in 21MHz mode (and 10MHz with cache) are accurate.

Almost all of the instructions that are not accurate are about reading/writing from ROM/RAM and the multiplier instructions.

srg320 · 2022-08-15T13:54:53Z

Fixed some timings. I do not yet understand the logic of instructions rpix and ljmp.

birdybro · 2022-08-15T16:19:46Z

Some ljmp and rpix info for quick reference:

from https://en.wikibooks.org/wiki/Super_NES_Programming/Super_FX_tutorial#Instruction_Set

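| Instruction | Description | ALT(Hex) | CODE(HEX) | ARG | Length(B) | B | ALT1 | ALT2 | O/V | S | CY | Z | ROM | RAM | Cache | Classification | Note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LJMP | Long jump | 3D | 0x9 | Rn | 2 | 0 | 0 | 0 | / | / | / | / | 6 | 6 | 2 | Jump, Branch and Loop Instructions | |
| RPIX | Read pixel color | 3D | 0x4C | / | 2 | 0 | 0 | 0 | / | * | / | * | 24-80 | 24-78 | 20-74 | Plot/related instructions | |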

ROM/RAM/Cache columns are execution time in cycles.

LJMP seems pretty tight. o_O

paulb-nl · 2022-08-16T16:03:39Z

Thanks @srg320. I have some findings.

RAM_CYCLES for 10MHz should be "010" instead of "001". Otherwise it will access RAM in only 2 cycles instead of 3.

SNES_MiSTer/rtl/chip/GSU/GSU.vhd

Lines 680 to 681 in a6daf9b

    
    elsif SPEED = '0' then
        RAM_CYCLES := "001";

4-color transparency should only check the lower 2 bits, so this check should be added: if COLR(1 downto 0) /= "00"

SNES_MiSTer/rtl/chip/GSU/GSU.vhd

Lines 1123 to 1131 in a6daf9b

    
    elsif SCMR_MD /= "11" or POR_FH = '1' then
        if COLR(3 downto 0) /= "0000" then
            PLOT_EXEC <= '1';
        end if;
    else
        if COLR /= "00000000" then
            PLOT_EXEC <= '1';
        end if;
    end if;
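To make the proposed check concrete, here is a small Python model of the color-0 test with the extra 4-color case added. It is only a sketch and not the core's code: the function name is hypothetical, the SCMR MD encodings for 4 and 16 colors are assumptions (only "11" for 256 colors appears in the excerpt), and placing the 4-color test ahead of the other branches is also an assumption, since the excerpt does not show where the new check would go.

```python
MD_4_COLOR = 0b00    # assumed SCMR MD encoding for 4 colors
MD_16_COLOR = 0b01   # assumed SCMR MD encoding for 16 colors
MD_256_COLOR = 0b11  # matches the "11" comparison in the excerpt


def plot_executes(colr, scmr_md, por_fh):
    """Return True if PLOT would draw the pixel when the color-0 check applies."""
    colr &= 0xFF
    if scmr_md == MD_4_COLOR:
        # Proposed addition: only the low 2 bits matter in 4-color mode.
        return (colr & 0x03) != 0
    if scmr_md != MD_256_COLOR or por_fh:
        # Existing branch: low 4 bits.
        return (colr & 0x0F) != 0
    # Existing branch: all 8 bits (256 colors without POR_FH).
    return colr != 0


print(plot_executes(0xFC, MD_4_COLOR, False))   # the #$FC case from the tests
print(plot_executes(0xFC, MD_16_COLOR, False))
```

With this model, color #$FC is skipped in 4-color mode but still plotted in 16-color mode, which is the behavior described for hardware earlier in the thread.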

I did some tests to figure out the PLOT pixel cache save logic:
PLOT will save the pixel cache to RAM after the 8th PLOT if it is full, not at the 9th PLOT.

If code is executing from ROM or cache while the pixel cache is being saved to RAM, and an STB or STW instruction writes to RAM, then the pixel cache save is paused and continues after the RAM write buffer has finished. This is probably the same for the other instructions that use the RAM write buffer, like SM, SMS and SBK.

For example, executing the loop STB->PLOT->LOOP->NOP takes only 5 cycles at 10MHz because it doesn't wait for the RAM writes. It must be interrupting the pixel cache save at the end of writing a byte, because otherwise both pixel caches would fill up and PLOT would go into a wait state.

Here are some test roms. sfx_stb will use STB to write to RAM while the pixel cache is writing to RAM and reads the values after the SFX is stopped. The value $FF means the pixel cache write has overwritten the data written by STB. There is a cache instruction before the STB writes so you can ignore the NO CACHE text in the test rom.

sfx_speed_test_stb_plot has some tests removed to make room for two STB/STW PLOT speed tests. The result of the STB PLOT test at 10MHz with cache on is $3FE-$400 for 4, 16 & 256 color. This is only 2 cycles more than the PLOT tests, and STB is a 2-cycle opcode, so that means it didn't wait.

sfx_stb.zip
sfx_speed_test_stb_plot.zip

Reference captures:

srg320 · 2022-08-16T16:58:54Z

> I did some tests to figure out the PLOT pixel cache save logic:
> PLOT will save the pixel cache to RAM after the 8th PLOT if it is full, not at the 9th PLOT.

That's interesting. Thanks.

> If code is executing from ROM or cache while the pixel cache is being saved to RAM, and an STB or STW instruction writes to RAM, then the pixel cache save is paused and continues after the RAM write buffer has finished. This is probably the same for the other instructions that use the RAM write buffer, like SM, SMS and SBK.
> For example, executing the loop STB->PLOT->LOOP->NOP takes only 5 cycles at 10MHz because it doesn't wait for the RAM writes. It must be interrupting the pixel cache save at the end of writing a byte, because otherwise both pixel caches would fill up and PLOT would go into a wait state.

I agree, executing any RAM write instruction does not stall the following instructions until another RAM access appears. This is implemented in the core in the last commit.

srg320 · 2022-08-16T17:21:13Z

I am also interested in the ROM access time when the cache is loaded. I suspect that this time is shorter than the time to load a byte from ROM.

paulb-nl · 2022-08-20T16:29:47Z

The tests on the first page at 21MHz with cache on all seem to be fixed. The plot tests also look good. 21MHz without cache and 10MHz still need to be fixed.

However, the latest fixes caused everything executing from ROM at 21MHz to be 2 cycles too slow: 7 cycles per byte instead of 5. I have attached a test rom that runs the SFX code from ROM. Without cache, most results should match the version that runs from Cart RAM, except for instructions that access RAM/ROM. For example, PLOT without cache should be faster executing from ROM than from RAM.

SuperFX_rom.sfc.zip

Unfortunately I am unable to make reference captures for the ROM versions because that would need a modified Super FX cartridge.

> I agree, executing any RAM write instruction does not stall the following instructions until another RAM access appears. This is implemented in the core in the last commit.

Ok but I meant the RAM write buffer will have priority and will pause the pixel cache write. I will give an example from my test:

    ibt R0, #$34
    iwt R3, #$1031

    plots 7
    cache
    plot ; 8th plot, start pixel cache write (256-color 8 bytes)
    
    stb (R3) ;  pause pixel cache write, RAM buffer will write $34 to $701031
    inc R0

    ; pixel cache will overwrite $701031 ($34) with $FF
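The reasoning behind this test can be shown with a toy Python model. It is a deliberate simplification with hypothetical names: both writers store one byte to the same address, whichever goes last wins, and the pixel cache save is assumed not to have reached that address yet when STB executes.

```python
def final_byte(stb_value, pixel_value, write_buffer_has_priority):
    """Return the byte left at the shared address after both writers finish."""
    if write_buffer_has_priority:
        # STB goes first; the paused pixel cache save resumes afterwards.
        order = [stb_value, pixel_value]
    else:
        # STB would wait until the pixel cache save has finished.
        order = [pixel_value, stb_value]
    ram = None
    for value in order:
        ram = value  # last write wins
    return ram


print(hex(final_byte(0x34, 0xFF, True)))   # 0xff
print(hex(final_byte(0x34, 0xFF, False)))  # 0x34
```

Under these assumptions, reading back $FF instead of $34 indicates that the RAM write buffer went first and the pixel cache save finished afterwards.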

> I am also interested in the ROM access time when the cache is loaded. I suspect that this time is shorter than the time to load a byte from ROM.

Which ROM access do you mean? As far as I know, ROM access is the same as RAM: 3 cycles at 10MHz and 5 cycles at 21MHz. The GETB instructions test ROM reading, so we know what the results should be.

srg320 · 2022-08-20T17:23:19Z

> Ok but I meant the RAM write buffer will have priority and will pause the pixel cache write. I will give an example from my test:
>
>     ibt R0, #$34
>     iwt R3, #$1031
>
>     plots 7
>     cache
>     plot ; 8th plot, start pixel cache write (256-color 8 bytes)
>
>     stb (R3) ;  pause pixel cache write, RAM buffer will write $34 to $701031
>     inc R0
>
>     ; pixel cache will overwrite $701031 ($34) with $FF

Ok. I wonder what the result would be if you add one or two nop before stb (R3).

> As far as I know, ROM access is the same as RAM: 3 cycles at 10MHz and 5 cycles at 21MHz. The GETB instructions test ROM reading, so we know what the results should be.

From this test you can see that in the Load/Store Word to/from RAM commands the second (MSB) access is shorter by 1 cycle. Perhaps when loading the cache (16 bytes sequential access) the access time is less than 5 cycles (some kind of burst mode).

Test version.

paulb-nl mentioned this issue Jun 17, 2022

Stunt Race FX (and Yoshi’s Island) run slower than real hardware #335

Closed

Max833 mentioned this issue Jun 17, 2022

[SNES] Super FX speed test ares-emulator/ares#621

Open

srg320 referenced this issue Aug 15, 2022

GSU: fix some instructions timing.

a6daf9b

srg320 referenced this issue Sep 1, 2022

GSU: reworking memory bus logic.

b299792

Test version.
