Reverse Engineering and Modding Mario Pinball Land (GBA)

Initial publication: March 19th, 2019

I've been meaning to improve my reverse engineering skills for a while now, and with the recent release of Ghidra, I've finally been motivated to attempt a game modding project.

I chose to look at Mario Pinball Land for GBA, as it's a game which doesn't have a substantial existing modding community (unusual for a Mario title), and has lots of quirks which I'd like to try to improve!

The most common complaints with this game are:

Limited controls:

Binary flipper controls for a pinball game are not ideal,
Lack of table nudge control,

Misc:

Some of the soundtrack is very generic,
If you power-off without pressing "SAVE & QUIT" your save will be erased,

Lack of gameplay variety:

Rooms are simple and feature no scrolling,
Bosses are reused,

Artificial gameplay difficulty:

If you open all doors in a stage they all shut,
Falling down some holes takes you back further than 1 room,
When accidentally leaving a room, your progress there is lost, as the room resets once you re-enter,

The last point in particular is very frequently cited as the most annoying problem, and is explained well in IGN's review of the game:

But what makes the game unbelievably irritating is its insistence of resetting the playfield if the ball (Mario) ever leaves the area. In other words, if the player was one target away from earning that screen's Star, and they accidentally shoot the ball through the gate into another board, the previous board will be completely reset as if it was the player's first appearance in the area. This happens a lot in Mario Pinball Land, and it's absolutely unfair to the player since there are so many instances where the ball will leave the playing field...either by the hand of the player or simply due to a random bounce on the board. It's this element that artificially inflates the difficulty and length of the game, and what's more, it really makes Mario Pinball Land an absolute chore to play. If it were a memory restriction we might be able to forgive it...but the game seems to have no problem tracking Yoshi balls when a player knocks it into another board in a given area.

- IGN

In this article I provide notes of my first impressions using Ghidra to reverse engineer the game, and use these notes to try to address some of these complaints by modifying several game mechanics. Perhaps in a future article I will actually add new content such as new stages, music, and boss fights.

Prior work

Before diving into reversing the game, let's outline all prior research on this game publicly available.

Cheats

There are some cheat codes which give us the addresses of a few useful variables:

Lives - 0x3000000
Score - 0x3000004
Yellow Coins - 0x3000008
Blue Coins - 0x300000C and 0x300001C
Stars - 0x3000014 and 0x300578C
Item - 0x300002C
Stage - 0x3000030

Documented regional differences

The Cutting Room Floor states that the flipper physics were altered between the original Japanese release, and the later versions.

Tools and setup

Aside from Ghidra, I'll be using NO$GBA as my main emulator, for its excellent debugging capabilities, and occasionally VisualBoyAdvance-M for its cheat finder functionality.

Importing the file into Ghidra will automatically detect it as a "Raw Binary", so just set the language for Little Endian ARMv4t and map the ROM section at 0x8000000 - 0xA000000.

The other sections (EWRAM, IWRAM, IO, VRAM, ...) should be mapped to the addresses as documented on gbatek). Additionally this cartridge has an EEPROM chip; since the ROM size is not greater than 16MB, EEPROM is mapped at 0xD000000 (see here on gbatek).

Since the game doesn't utilize compression for code, all code is now mapped out correctly for any static analysis.

Fixing game saving

If you turn off the power without having selected to "SAVE & QUIT" your save game will be invalidated, and on the next boot you will only be given the option to start a "NEW GAME".

The developers' reasoning for this quirk was probably to prevent people from being able to rollback their mistakes. Consider for example saving your progress once you make it to a boss, and then if you accidentally fall down just restarting the game to immediately retry at the boss, without having to make it back there again.

Personally, I think the player should be allowed to use this trick if they want to, and that blocking it just contributes to the common frustrations of the game's difficulty. However, the bigger problem from forcibly erasing progress this way is the massive inconvenience induced by triggering it on accident; this can easily happen if the player doesn't know about the quirk and just wants to put the game down after a quick session, or even deliberately doesn't attempt to save as a courtesy if they are borrowing the cartridge. Even if everyone who plays the game knows about the quirk, it still prevents you from playing the game once you reach low battery (a reasonable scenario for a handheld device) for fear of losing your entire progress if the system powers off!

Assuming that the EEPROM save data would be DMAed somewhere into RAM, I decided to search for references to the DMA registers. I quickly realized that Ghidra wasn't detecting all references as it hadn't detected all functions in the program, which was a little disappointing:

After manually creating the functions I had found with dynamic analysis, I quickly found all uses of DMA channel 3. One in particular is the function at 0x803645C, which is a wrapper for starting a data transfer using DMA channel 3 and waiting for it to complete. After naming the parameters and IO registers the decompilation output looks like this:

Although this is a relatively simple function, there are a lot of things we can notice from the decompiler output. The obvious things are:

Lack of spacing - it would look much nicer if there were spaces after commas. I found no way to configure the syntax, but since it's open source I expect this to be addressed at some point.
Redundant return statement at the end of the function.
Since the IO section is mapped as volatile, the decompiler wraps all loads and stores it detects to this region with read/write_volatile_x functions. I find this slightly less readable, but I suppose it's useful to have these operations marked more explicitly than non-volatile read/writes.

A slightly more concerning thing is the weird choice of control flow when handling loops; it has repeated the load of DMACNT3_H, even though in the assembly there is only a single load instruction. A much more natural choice would have been to display the loop condition once:

while (read_volatile_2(DMA3CNT._2_2_) & 0x8000);

Even if it is desired to explicitly assign to an intermediate variable, the pseudo-code could still more accurately describe the assembly by using a do statement:

do {
	dma_3_control = read_volatile_2(DMA3CNT._2_2_);
} while (dma_3_control & 0x8000);

Small quirks aside, this is a strong decompilation effort, and it enables us to quickly understand what this function does. By then looking for function call references to this function, and inspecting the source and destination arguments passed, we can quickly find the 2 functions which copy to/from EEPROM (0xD000000) and use this as a starting point for understanding how saving and loading works.

After some analysis we can discover the layout of the save:

Description	Offset	Size
Magic	0	8
High scores	8	140
High scores checksum	148	4
Game state	152	348
Game state checksum	500	4

The magic for Mario Pinball land saves is 05 f6 50 4d 6f 4d e6 5f.

Checksums are calculated by simply adding every byte for the section, and adding each 4 bytes of the magic (so + 0xAD374374 in total).

To identify how a save was being invalidated when loaded I copied the save after selecting to "SAVE & QUIT", and then copied the save immediately after loading it. Diffing the 2 files revealed 8 bytes from offset 496 were different (the last 4 bytes of game state, and game state checksum).

Breaking on the last 4 bytes of stage data revealed that it's the marker for whether the save is valid or not. When loading the save it is set to 0xC769D9A9, and when pressing "SAVE & QUIT" it is set to 0x38962656. I quickly located the code responsible for this, function restoreLoadedData at 0x8002CD0:

To fix the problem, we can change the constant it sets saveValid to from 0xC769D9A9 to 0x38962656; the location of this constant in the ROM is located at 0x8002D14.

Adding analog flippers

Our next modification will be to locate the code that moves the flippers up and down, and modifying it to operate on an analog value.

Locating flipper code

With VisualBoyAdvance-M cheat finder functionality, we can quickly locate probable flipper variables by searching for 32-bit values that are zero when the desired flipper is lowered, and non-zero when raised:

0x030014A8
0x03002B20
0x03002B30
0x03004D10

We can then locate the code that is responsible for modifying these values by setting memory write breakpoints on those addresses (in NO$GBA notation this is [030014A8]!), and then raising the flippers again.

As soon as I press L, the instruction at 0x080153D0 writes 0x55554 to 0x030014A8. We can lookup the enclosing function, starting at 0x080153A4, in Ghidra and try to understand what is going on. I came up with the following psuedo-code for this function (where flipperValue's address is 0x030014A8):

void updateFlipper(int flipper) {
	if(flipperPressed[flipper]) {
		int flipperDelta = upper32bitsof64bitmultiply(0x147AE, flipperMultiplier);
		
		if(flipperValue[flipper] != 0x2AAAAA) {
			flipperValue[flipper] += flipperDelta;
			if(flipperValue[flipper] > 0x2AAAAA) flipperValue[flipper] = 0x2AAAAA;
		}
	}
	else {
		int flipperDelta = upper32bitsof64bitmultiply(0xA3D7, flipperMultiplier);
		
		if(flipperValue[flipper] != 0) {
			flipperValue[flipper] -= flipperDelta;
			if(flipperValue[flipper] < 0) flipperValue[flipper] = 0;
		}
	}
}

flipperMultiplier is set to 0x42AAA during initialisation (at 0x8028C02), and then remains unchanged throughout the game.

So in this function, the flipper is either incremented by ((0x147ae * 0x42aaa) >> 32) = 0x55554 until 0x2AAAAA, or decremented by ((0xA3D7 * 0x42aaa) >> 32) = 0x2AAAA (half as quickly) until 0.

Modding this function

Now that we've located the relevant function and have a good idea of how it works, we can attempt to modify it.

Our aim is to just refine the controls without altering the gameplay. We want to preserve the original behaviour when going from a trigger value of 0 to 100% in one frame, where it is interpolated between the range and not just set to fully upright immediately, since we don't want to increase the maximum flipper acceleration possible. In other words, we just want to be able to either raise a flipper slower or be able to hold it in a position between the two states.

We have a slight problem in that when reading an analog value in a range like 0 - 256, it does not map perfectly to the game's 0 - 0x2AAAAA range; 256 * 10922 = 0x2AAA00, so we'll just add 0xAA after passing half way (this is less than 0.006% so shouldn't jerk too much).

We also need to keep the size of our new function under 0x7C if we want to overwrite it inline. I came up with the following:

#define analogReading ((unsigned int *)0x4000400)
#define flipperValue ((int *)0x30014A8)

void updateFlipper(int flipper) {
	unsigned int analog = analogReading[flipper];
	
	unsigned int v = analog * 10922;
	if(v >= 128 * 1092) v += 0xaa;

	int delta = v - flipperValue[flipper];

	if(delta > 0x55554) delta = 0x55554;
	if(delta < -0x2AAAA) delta = -0x2AAAA;

	flipperValue[flipper] += delta;
	if(flipperValue[flipper] < 0) flipperValue[flipper] = 0;
	if(flipperValue[flipper] > 0x2AAAAA) flipperValue[flipper] = 0x2AAAAA;
}

Compile as thumb code and insert into ROM file at offset 0x153A4:

arm-none-eabi-gcc updateFlipper.c -mthumb -Os -nostartfiles -nostdlib -o updateFlipper.elf -Ttext=0x080153A4
arm-none-eabi-objcopy -O binary updateFlipper.elf updateFlipper.bin

Note that as our code modifications get more complex and will require more than just a text section (data section, etc), we will need to ensure that the compiler and linker do not insert any padding between the sections, to keep our binaries as small as possible. Compiling with -Wl,-z,max-page-size=0x1 and removing all calls to ALIGN in the linker script, such as the below, seems to work (run arm-none-eabi-ld -verbose to get the default linker script):

  /* Adjust the address for the data segment.  We want to adjust up to
	 the same address within the page on the next page up.  */
  . = ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1));

Inserting analog support

For now, let's just modify VisualBoyAdvance-M, to expose the analog values we want at an unused IO address:

static inline uint32_t CPUReadMemory(uint32_t address)
{
	switch (address >> 24) {
	...
	case 4:
		if (address == 0x4000400) {
			value = 128;
		}
		else if (address == 0x4000404) {
			value = 64;
		}
		else if ((address < 0x4000400) && ioReadable[address & 0x3fc]) {
			...

When running this, the flippers are in the expected positions (left 1/2 up, right 1/4 up), and the ball seems to collide with the flippers at the correct location.

Modding the emulator further is out of scope of this article since I want to focus on the game itself, but it's just a matter of reading from a game controller and returning these trigger values instead of the constants shown above. If you want a solution which would work on a real GBA, you could read the flipper values over the GBA's link cable, and create a piece of external hardware to control the flippers (perhaps this will be a future project).

Locating flipper physics

The ball is still programmed to fly off the flippers once the trigger buttons are pressed.

From reversing the updateFlipper function, we know that there exist flipperPressed variables, which are likely to be controlling this.

The only function which reads this value is the one at 0x801531C, which seems to be updating a value likely used for the velocity calculation:

const int leftFlipper1_rightFlipperMinus1[2] = { 1, -1 };

void updateFlipperVelocity(int flipper) {
  int *vel = &flipperVelocity[flipper];
  *vel = 0;

  if(flipperPressed[flipper]) {
	if(flipperValue[flipper] == 0)
	  playSoundEffect(0x2F);

	if(flipperValue[flipper]) != 0x2AAAAA)
	  *vel = 0x147AE * leftFlipper1_rightFlipperMinus1[flipper];
  }
  else if(flipperValue[flipper]) {
	*vel = 0xA3D7 * leftFlipper1_rightFlipperMinus1[flipper];
  }
}

Those constants match what we saw in updateFlipper, which is a good sign we're looking at the right thing.

Let's try modifying this function to use the delta of the analog readings (again, the values don't divide perfectly, 0x55554 / 4 = 0x15555, which is not exactly 0x147AE):

#define analogReading ((unsigned int *)0x4000400)
#define flipperValue ((int *)0x30014A8)
#define flipperVelocity ((int *)0x30014B8)
#define leftFlipper1_rightFlipperMinus1 ((int *)0x81DEAAC)

typedef void playSoundEffect_t(int);

void updateFlipperVelocity(int flipper) {
	int *vel = &flipperVelocity[flipper];
	*vel = 0;

	unsigned int analog = analogReading[flipper];
	
	unsigned int v = analog * 10922;
	if(v >= 128 * 1092) v += 0xaa;

	int delta = v - flipperValue[flipper];
	int d = delta / 4;

	if(v > 0) {
		if(flipperValue[flipper] == 0) ((playSoundEffect_t *)0x802BADD)(0x2F);

		if(flipperValue[flipper] != 0x2AAAAA) {
			if(d > 0x147AE) d = 0x147AE;

			*vel = d * leftFlipper1_rightFlipperMinus1[flipper];
		}
	}
	else if(flipperValue[flipper]) {
		if (d < -0xA3D7) d = -0xA3D7;

		*vel = d * leftFlipper1_rightFlipperMinus1[flipper];
	}
}

Our new function is too big to be inserted over the original function, so let's place a jump there, and insert the function at some free space at the end of the ROM (0x87FD3D4).

The game now has analog flipper control support!

Allowing all doors to open

Once again, the game doesn't allow you to leave all doors in a stage open. Once you open the last door, all doors (including the one you just opened) close! Our next modification will be removing the code responsible for this.

The door data is located at 0x300140C; it's an array of 20 byte structures. The first field is the door number (4 bytes), which I'm not sure why is required considering you can just access doors by index without requiring a loop, and the second field is the door's state (4 bytes). The door state can be one of the following:

0 - closed and unopenable
1 - closed but openable
2 - opening
3 - open
4 - closing

Setting a write breakpoint on one of the door's states and opening all of the doors quickly revealed the function responsible for this.

checkAllDoorsOpen at 0x8014280 has a loop which checks if all doors (numberOfDoors at 0x3001448) are 3 (open), and if so, sets them all to 4 (closing).

We can just replace the first instruction of this function with bx lr to immediately return, which disables this functionality.

Door star requirements

To reach the final boss in Mario Pinball Land you only need to collect 15 stars (out of 35 total), but it may be desirable to lower some of the door star requirements even further to make the game easier.

Doors are created in each stage's setup function (we will go over stage control flow in more detail later) by calling createDoor. For example, stage 2 creates 2 doors, with identifiers 0 and 1, resources 0xBF and 0xCA, and the same palette at address 0x87FB15C:

createDoor(0, 0xBF, (struct palette *)0x87FB15C);
createDoor(1, 0xCA, (struct palette *)0x87FB15C);

The star limits are implemented in each stage's loop function. For example, stage 2 loop function has the following code to control door access:

if(getStars() > 0 || inTimeAttackMode())
	allowOpeningDoor(0);
if(getStars() > 1)
	allowOpeningDoor(1);

Swapping the resource number passed to createDoor for a different door sprite, and modifying the check in the stages's loop function is sufficient to lower the star requirement of a door.

Space limitations

Before moving onto the more involved modifications, let's first assess how much free RAM we have to work with.

The game doesn't use any dynamic memory allocation for IWRAM (0x3000000 - 0x3007FFF), everything has been statically pre-assigned an address, which makes it relatively easy to see that the developers have used practically everything!

EWRAM (0x2000000 - 0x203FFFF) on the other hand is a bit more interesting as the entirety of it is just used as a massive buffer to decompress data to. There is a pointer, nextSpaceForDecompressedData (0x300531C), which is used to track where to decompress data to. This pointer is set to the start of EWRAM (0x2000000) when transitioning to a new stage, and is then advanced each time the stage decompresses a resource by calling decompressResource (0x802E6A0).

Different stages require a different amount of data, for example stage 1 requires 0x18020 bytes and stage 3 requires 0x3e824 bytes. We could try to identify the area of the game which uses the most space for decompressed data, and then use whatever is left over as free space (maybe something like 5000 bytes).

However, I had an idea for a more practical solution. If the ROM was bigger we could just store all resources uncompressed, removing the need for decompressing the resources into EWRAM on the fly. This would give us the whole of EWRAM (256KB) for modifications to use!

Changing the ROM size from 8MB to 16MB is as simple as increasing the file size and won't affect any other mappings. The size could be increased even further to 32MB if necessary, but this would affect how the game accesses EEPROM. For any sizes larger than 32MB we could use a mapper for bank switching, giving practically unlimited storage (think about the GBA video carts storing feature length films, or flashcards reading from SD cards). However for the purposes of this article, 16MB will be more than sufficient.

Extracting resources

The resource table starts at 0x87FCCD8, it's an array of pointers to resources structured the following way:

1 byte - type
3 bytes - decompressedSize
[data]

The type is either 0 if the resource is not compressed, or 0x10 if it is compressed with LZ77 (to be decompressed via GBA BIOS function 0x11); many open source decompressors for this exist, such as CUE's DS/GBA compressors.

The following resources use their entry in the resource table for uncompressed metadata, and their real compressed data is stored elsewhere (to be fetched by calling getCompressedData (0x802D7A4)):

0x64 - 0x8382C48
0xa7 - 0x852E36C
0xf9 - 0x85C5BB4
0x91 - 0x849C568
0x92 - 0x84EF718
0x81 - 0x83EFDD8
0x83 - 0x8428594
0x3a - 0x8364CAC
0x51 - 0x837D0A8
0xb - 0x8269430
0xc - 0x82C948C
0x8 - 0x81FBBA0

I wrote a small hacky program which can decompress all of these resources and then update the resource table to point to the decompressed data. Note that some of the resources overlap, so I decided to do this in a couple of stages (extract all resources without writing anything, then copy the decompressed resources and write to the resource table).

unsigned int resourceAddressAddress(int resource, int resolve) {
	if(resolve) {
		switch(resource) {
			case 0x64: return 0x802D834;
			case 0xa7: return 0x802D83C;
			case 0xf9: return 0x802D82C;
			case 0x91: return 0x802D824;
			case 0x92: return 0x802D844;
			case 0x81: return 0x802D804;
			case 0x83: return 0x802D81C;
			case 0x3a: return 0x802D7F4;
			case 0x51: return 0x802D84C;
			case 0xb: return 0x802D814;
			case 0xc: return 0x802D80C;
			case 0x8: return 0x802D7FC;
		}
	}

	return 0x87FCCD8 + resource * 4;
}

...

	// First loop just to fetch data
	int i;
	for(i = 0; i < 255; i++) {
		unsigned int aa = resourceAddressAddress(i, 1);
		unsigned int aau = resourceAddressAddress(i, 0);

		unsigned int a = *(unsigned int *)(rom + aa - 0x8000000);
		
		printf("  [+] Decompressing resource %d from 0x%08x\n", i, a);

...

	printf("  [+] Writing decompressed resources\n");

	// Decompress to end of ROM
	unsigned int offset = 8 * 1024 * 1024;

	// Second loop to write decompressed data
	for(i = 0; i < 255; i++) {
		*(unsigned int *)(rom + offset) = resources[i].decompressedSize << 8;
		memcpy(rom + offset + 4, resources[i].decompressedData, resources[i].decompressedSize);

		// Update the pointer
		*(unsigned int *)(rom + resources[i].addressOffset - 0x8000000) = offset + 0x8000000;

		offset += 4 + resources[i].decompressedSize;

There's just one final modification we need to make. Since the game expects certain resources to have been decompressed when requested, it tries to locate the data by calling getDecompressedData (0x802FB04); this function will return NULL if the resource has not been decompressed. To fix this, we'll update the function to return the resource's uncompressed address either from the resource table or from getCompressedData:

void *(*getCompressedData)(int resource) = (void *)(0x802D7A4 + 1);

#define compressedDataAddresses ((void **)0x87FCCD8)

void *_start(int resource) {
	void *r = getCompressedData(resource);
	
	if(r == (void *)0x600B800) {
		r = compressedDataAddresses[resource];
	}

	return r + 4;
}

Note that the resources mentioned above are not the only compressed data in the game, but are the only resources which are decompressed into EWRAM (other data is decompressed into VRAM for example).

Understanding stage control flow

We already know from the cheat codes that the current stage number is located at 0x3000030, so we can search for references to it in our disassembler or break on writing it to identify the pertinent functions.

It turns out that there are 9 main functions involved in stage control flow:

0x08000A8C - switchStage
0x0801EC64 - enterGameRoom
0x0801E6C8 - setupStage
(stage specific setup function)
0x0801E9F0 - mainGameFunction
(stage specific loop function)
0x8034150 - isReadyForSetup
0x801E8DC - clearUpStage
(stage specific clear up function)

enterGameRoom calls setupStage once, and then mainGameFunction in a loop until the status is changed, and finally clearUpStage:

int enterGameRoom(...) {
	int result;

	if (stageIsShop(currentStage)) {
		...
	}
	else {
		setupStage(stage, ...);
		updateTimer(...);
		do
			mainGameFunction(stage);
		while (getMarioStatus() == 13);
		clearUpStage(stage);
		result = getMarioStatus();
	}

	return result;
}

setupStage, mainGameFunction and clearUpStage have switch statements to execute a different function according to the current stage, such as setupStage1 (0x801F290), stage1 (0x801F52C), and clearUpStage1 (0x801F5DC):

int setupStage(...) {
	...
	switch (stage) {
		case 0:
			setupStage1();
			break;
		case 1:
			setupStage12();
			break;
			...
	}
}

Many of the stage specific loop functions call isReadyForSetup, which returns 1 once the ball's Y passes above a certain position, and then 0 for all future calls. The idea is that the stage specific setup functions execute whilst the screen is still faded out, but some routines such as creating enemies need to be done later so that they can be seen appearing:

int stage12() {
	...

	if (isReadyForSetup()) {
		createGhost(0xFF6A0000, 0x3520000);
		createGhost(0x960000, 0x3520000);
		createGhost(0xFF060000, 0x2260000);
		createGhost(0xFA0000, 0x2260000);
	}

To clarify this, here's an overview of the key points when transitioning from stage 1 to stage 1-2:

Stage 1 loop function detects you went through the ghost door, which sets ball[0].state from 13 to 0,
This causes enterGameRoom to break out of the loop that was executing mainGameFunction,
enterGameRoom calls clearUpStage,
clearUpStage switches on the stage number (0), and calls clearUpStage1,
The clear up functions return and enterGameRoom returns to switchRoom,
switchRoom changes stage to 1 (stage 1-2),
switchRoom continues its loop which calls enterGameRoom,
enterGameRoom calls setupStage,
setupStage switches on stage number (1), and calls setupStage12,
setupStage12 calls some functions to decompress some data, reset the setup variable, and set enemy type to ghost (0xd),
The setup functions return to enterGameRoom,
enterGameRoom enters its loop which calls mainGameFunction,
mainGameFunction switches on the stage number (1), and calls stage12,
stage12 checks the setup variable, and if it's ready will create the ghosts,

Stage links

The only other thing to understand is how the game desides the target of the next stage in switchStage. This will be useful for us, as we want to change one stage transition to take us to a different stage.

There is a hole in stage 3-3 (0x12) which will send you to the underwater stage, 3-4 (0x13), however if you fall from stage 3-4 you will be sent all the way back to stage 3 (0x0f), not stage 3-3!

To modify the game to return from stage 3-4 (0x13) to stage 3-3 (0x12), instead of stage 3 (0x0f), we need to understand the data structure which stores the relationships between stages:

struct stageLink {
	int stage;
	int forwardsTransition;
	int linkedStage;
	int backwardsTransition;
};

There is an array of these structures, stageLinks, starting at 0x87F5618, which is used by switchStage to determine the next stage like so:

if stageLinks[i].stage == currentStage && transitionMechanic == stageLinks[i].forwardsTransition {
	nextStage = stageLinks[i].linkedStage;
}
else if stageLinks[i].linkedStage == currentStage && transitionMechanic == stageLinks[i].backwardsTransition {
	nextStage = stageLinks[i].stage;
}

Where falling passed the flippers is mechanic 0xa. This is enough information for us to search through the table for stage numbers 0x12, 0x13, and 0x0f, yielding the entries at indexes 46 and 48:

87f58f8: stageLink { 0x12, 1, 0x13, 4 }
...
87f5918: stageLink { 0x13, 0xA, 0xF, 2 }

Entry 46 states that you can transition from stage 0x12 to 0x13 by going through transition mechanic 1 (the ice whole), but cannot return from stage 0x13 to stage 0x12 as there is nothing in stage 0x13 that sets transition mechanic 4.

Entry 48 states that you can transition from stage 0x13 to 0xf by falling (transition mechanic 0xa), but cannot return from stage 0xf to 0x13 as there is nothing in stage 0xf that sets transition mechanic 2.

To fix the problem, we can either modify entry 46's backwards transition to 0xa, or entry 48's linkedStage to 0x12.

Preserving stage data

Now we come to the biggest modification, removing the stage reset mechanic by preserving more stage data across stage transitions. This will be the new default behaviour, but we will allow the player to opt out of restoring stage data if they hold L+R while transitioning to a new stage.

The only stage data that is currently preserved across stage transitions is the stageFlags array, located directly after the stage variable, at 0x3000034. This is a 32-bit integer per stage, which holds information about the actions you have completed in the stage, such as whether you have collected the star, opened the question boxes, raised the pyramid, etc.

Our aim is to preserve state about which doors are currently open, the enemies, and any coins and stars (unlocking a star and then accidentally leaving the stage before collecting it is a huge problem). We will preserve this data in EWRAM, which has now been completely freed up by eliminating the decompression code.

Ignoring the shops, bonus stages, and boss stages (we wouldn't want to make the game too easy ;)), there are 28 stages:

0x00 - Stage 1
0x01 - Stage 1-2
0x07 - Stage 2
0x08 - Stage 2-2
0x0A - Stage 2-3
0x0B - Stage 2-4
0x0C - Stage 2-5
0x0D - Stage 2-6
0x0E - Stage 2-7
0x0F - Stage 3
0x10 - Stage 3-2
0x12 - Stage 3-3
0x13 - Stage 3-4
0x14 - Stage 3-5
0x15 - Stage 3-6
0x16 - Stage 4
0x17 - Stage 4-2
0x18 - Stage 4-3
0x19 - Stage 4-4
0x1A - Stage 4-5
0x1B - Stage 4-6
0x1C - Stage 4-7
0x1E - Stage 4-8
0x1F - Stage 5
0x21 - Stage 5-2
0x22 - Stage 5-3
0x23 - Stage 5-4
0x27 - Stage 5-8

It's kind of annoying that they are not perfectly in order, but we can just use an array to map from stage number to preservation index:

const unsigned char stageToStageDataIndex[] = {
	/*0x00:*/ 0,
	/*0x01:*/ 1,
	/*0x02:*/ 0xff,
	/*0x03:*/ 0xff,
	/*0x04:*/ 0xff,
	/*0x05:*/ 0xff,
	/*0x06:*/ 0xff,
	/*0x07:*/ 2,
	/*0x08:*/ 3,
	/*0x09:*/ 0xff,
	/*0x0A:*/ 4,
	/*0x0B:*/ 5,
	/*0x0C:*/ 6,
	/*0x0D:*/ 7,
	/*0x0E:*/ 8,
	/*0x0F:*/ 9,
	/*0x10:*/ 10,
	/*0x11:*/ 0xff,
	/*0x12:*/ 11,
	/*0x13:*/ 12,
	/*0x14:*/ 13,
	/*0x15:*/ 14,
	/*0x16:*/ 15,
	/*0x17:*/ 16,
	/*0x18:*/ 17,
	/*0x19:*/ 18,
	/*0x1A:*/ 19,
	/*0x1B:*/ 20,
	/*0x1C:*/ 21,
	/*0x1D:*/ 0xff,
	/*0x1E:*/ 22,
	/*0x1F:*/ 23,
	/*0x20:*/ 0xff,
	/*0x21:*/ 24,
	/*0x22:*/ 25,
	/*0x23:*/ 26,
	/*0x24:*/ 0xff,
	/*0x25:*/ 0xff,
	/*0x26:*/ 0xff,
	/*0x27:*/ 27,
	/*0x28:*/ 0xff,
	/*0x2E:*/ 0xff,
	/*0x2F:*/ 0xff,
	/*0x30:*/ 0xff,
	/*0x31:*/ 0xff,
	/*0x32:*/ 0xff,
	/*0x33:*/ 0xff,
	/*0x34:*/ 0xff
};

If we were to use the whole of EWRAM, we would have more than 9000 bytes per stage (256 * 1024 / 28), so storage space shouldn't be a problem.

Hijacking control flow (optimizing compiled ARM code for space)

Now that we understand the control flow when transitioning between stages, we need to find a way to execute our own code at key points; preserving data before clearUpStage, and restoring data after setupStage.

In armv4t there's no blx memory, or even blx register instruction, so you have to set the target, set the return address, and then execute the branch in 3 different instructions. It turns out that we need at least 10 bytes for an arbitrary call, the same amount of space required for a 64bit call on x86_64!

ldr r3, [pc, #x]
mov lr, pc
bx r3

...

@ 4 byte target

This is assuming that you are using exclusively Thumb code; if you wanted to support interopability with ARM functions, you would need an additional instruction to add 1 to that lr value so that the target function would know to return to Thumb code and not ARM (there is no add lr, pc, #1 instruction to do this in a single instruction).

There isn't enough space in these functions to insert a new function call, so usually we would have to insert this call by overwriting some of the function's original instructions, and then reimplement them in the new function before executing any custom functionality.

However, it turns out that the game is using a very old compiler which doesn't optimize space efficiency well for ARM target. I took it as a challenge to try to eliminate enough instructions to create space for an arbitrary call, whilst keeping the game functionally equivilant.

`setupStage`

This function has a switch statement, which completely throws off Ghidra and causes it to try to disassemble the code as ARM instead of Thumb, even though the whole function is Thumb:

The instruction it cites as confusing it is the last instruction of the final switch case, which is a 4-byte bl Thumb instruction. Perhaps the root cause is that it somewhere mistakenly checks for ARM/Thumb mode based on instruction size, assuming all Thumb instructions are 2-bytes?

Ignoring the disappointment of Ghidra's ARM disassembler, the setupStage function has the following prologue and epilogue (from a different disassembler):

0801E6C8 setupStage:
0801E6C8                 sub             sp, sp, #8
0801E6CA                 push            {r4-r6, lr}
...
0801E6D0                 str             r2, [sp, #0x18+var_8]
0801E6D2                 str             r3, [sp, #0x18+var_4]

...

0801E8D2                 pop             {r4-r6}
0801E8D4                 pop             {r3}
0801E8D6                 add             sp, sp, #8
0801E8D8                 bx              r3

Which means it has the following stack layout:

R4
R5
R6
LR
8 bytes local variables (R2 and R3)

Where the local stack variables are just used for storing arguments 2 and 3, to be used later for a single function call:

0801E6D0                 str             r2, [sp, #0x18+var_8]
0801E6D2                 str             r3, [sp, #0x18+var_4]

...

0801E738                 ldr             r0, [sp, #0x18+var_8]
0801E73A                 ldr             r1, [sp, #0x18+var_4]
0801E73C                 ldr             r2, [sp, #0x18+arg_0]
0801E73E                 bl              setupBall

Our strategy to optimize the epilogue for space will be:

Eliminate add by popping into trashed registers,
Eliminate bx instruction by popping into pc,
Combine 2 pops into 1 by reordering data,

It turns out that the 8 instructions originally used for the prologue and epilogue can be reduced to just 2 instructions!

prologue:
	push {r2-r6, lr}

epilogue:
	pop {r2-r6, pc}

This reorders the stack layout slightly:

Which just means we have to edit the 2 instructions which reference the local stack variables holding R2 and R3:

Before:

ldr r0, [sp, #16]
ldr r1, [sp, #20]

After:

ldr r0, [sp, #0]
ldr r1, [sp, #4]

Since there are also 2 bytes of padding between the end of setupStage and the next function (clearUpStage), we now have 8 bytes of space for code before setupStage epilogue, just 2 bytes short of being able to insert a call.

Note that even though our call will be the last instruction in the setupStage function, we can't perform a tail call optimization (setupStage call -> loadStageData return -> setupStage return to just setupStage branch -> loadStageData return) to reduce the number of instructions since there is no way of popping directly into lr.

Instead, our strategy at this point is to use the free space we created in the prologue to move the target into the stack, and then we can take advantage of being able to pop into pc to combine the load of the target, and the branch, into a single instruction:

prologue:
	...
	
	ldr r5, [pc, #0x20C]
	push {r2, r5}
	
	...

	@ bl loadCurrentStage
	mov lr, pc
	pop {r2, pc}

epilogue:
	pop {r2-r6, pc}

loadCurrentStageTarget:
	.word loadCurrentStage

Note that I also pushed r2 in order to keep sp 8-byte aligned.

Once again, this changes the stack layout, so we'll have to adjust all instructions accessing stack memory (to use offset +8), but this is still less effort than relocating the whole function body up a few instructions to move the free space in the prologue to the epilogue.

`clearUpStage`

For clearUpStage our aim is to create space to insert a function call at the start of the function, instead of at the end (but we can store the branch target at the end of the function so we don't have to jump over it).

0801E8DC clearUpStage:
0801E8DC                 push            {r4, lr}
0801E8DE                 movs            r4, r0

...

0801E9E8                 pop             {r4}
0801E9EA                 pop             {r0}
0801E9EC                 bx              r0

This one is pretty simple to optimize the epilogue for:

pop {r4, pc}

Once again, there are 2 bytes of padding at the end of this function, which means that with our epilogue optimization, we have 6 bytes of space after the function, which is more than enough to store an indirect branch target.

Now we just need to optimize the start of the function to create enough space for a new call:

0801E8DC clearUpStage:
0801E8DC                 push            {r4, lr}
0801E8DE                 movs            r4, r0
0801E8E0                 bl              sub_8033F34
0801E8E4                 bl              sub_801B4F8
0801E8E8                 bl              sub_800E020
0801E8EC                 cmp             r4, #0x36
0801E8EE                 bhi             def_801E8F8
0801E8F0                 lsls            r0, r4, #2
0801E8F2                 ldr             r1, =jpt_801E8F8
0801E8F4                 adds            r0, r0, r1
0801E8F6                 ldr             r0, [r0]
0801E8F8                 mov             pc, r0

As you can see, it's saving R0 in R4 before executing the 3 calls, but it's then immediately shifting it. Why don't we eliminate the redundant save and just shift it immediately (lsls r4, r0, #2)? This saves us 2 bytes.

The switch code is also sub-optimal, the ADDS and LDR for the switch table can be combined into a single operation (ldr r0, [r1, r4]), which saves another 2 bytes.

The comparison would have been generated to handle the default case where the stage number is none of the ones specified, but since we already know the that the stage number is always within the range we can omit the check here, which will save another 4 bytes.

Overall, this function can be shrunk to give us an additional 8 bytes at the start:

clearUpStage:
	push {r4, lr}
	lsls r4, r0, #2
	
	@ 8 new bytes of data

	bl sub_8033F34
	bl sub_801B4F8
	bl sub_800E020

	ldr r1, =jpt_801E8F8
	ldr r0, [r1, r4]
	mov pc, r0

We now have enough space to perform a call at the start of clearUpStage, with the target stored after the epilogue using the free space we optimized there.

Preserving door state

Now that we have enough free RAM to store stage data, and are able to cleanly hijack control flow at the desired places during stage transitions, we can begin implementing the preserve and restore functionality! Let's start by just preserving doors, as we've already talked about how they work.

Because of the way that numberOfDoors gets set (initialized to 0 on stage transition, and then incremented each time createDoor is called), it is very hard to statically know the number of doors for each stage.

Instead, I decided to use a data structure which stores the number of doors dynamically:

struct stageDataEntry {
	unsigned char dataType, count;
	unsigned char data[0];
};

So a stage with three doors would be stored as:

struct stageDataEntry {
	DATA_TYPE_DOOR,
	3,
	struct door[3] = {
		20 bytes door data,
		20 bytes door data,
		20 bytes door data,
	}
}

I came up with the following save and load functions:

void *(*copy)(void *destination, void *source, unsigned int size) = (void *)(0x8036798 + 1);

#define numberOfDoors (*((int *)0x3001448))
#define doorData ((struct door *)0x300140C)

struct preserve {
	void *data;
	unsigned int *count;
	unsigned char size;
} preservitives[] = {
	{ doorData, &numberOfDoors, sizeof(struct door) },
};

struct stageDataEntry *saveStageData(struct stageDataEntry *d) {
	int p;
	
	for(p = 0; p < sizeof(preservitives) / sizeof(preservitives[0]); p++) {
		struct preserve *preserve = &preservitives[p];

		unsigned int count = *preserve->count;

		if(count > 0) {
			unsigned int size = preserve->size * count;

			d->dataType = p;
			d->count = count;
			copy(d->data, preserve->data, size);
			
			d = (void *)d + offsetof(struct stageDataEntry, data) + size;
		}
	}

	d->dataType = 0xdb;

	return d;
}

void loadStageData(struct stageDataEntry *d) {
	while(d->dataType != 0xdb) {
		struct preserve *preserve = &preservitives[d->dataType];

		unsigned int count = d->count;
		unsigned int size = preserve->size * count;

		copy(preserve->data, d->data, size);
		*preserve->count = count;

		d = (void *)d + offsetof(struct stageDataEntry, data) + size;
	}
}

Link the target of the function calls we added to setupStage and cleanUpStage to the following two wrappers and door state will be preserved across the desired stages!

void saveCurrentStageData(void) {
	int i = stageToStageDataIndex[stage];
	
	if(i != 0xff) {
		saveStageData(stagePreserveData[i].entries);
	}
}

void loadCurrentStageData(void) {
	if(!(REG_KEYINPUT & (KEY_L | KEY_R))) {
		return;
	}
	
	int i = stageToStageDataIndex[stage];
	
	if(i != 0xff) {
		loadStageData(stagePreserveData[i].entries);
	}
}

Correcting other door behaviour

After testing the new door preserving code, a couple of issues become apparent.

If we open a door and then die by falling through the last hole, when we respawn the door state will be open, but the closed door frame will still appear.
If we open a door, go through, and then fall back, the door will be open for Mario to fall through, but will then close itself.

For the first issue, we can call setAnimationFrame (0x802F0C4) on all doors after loading the new door states.

For the second issue, we can find the code responsible for closing the door after Mario falls through, by setting a memory write breakpoint on the door state, it's in handleDoors (0x8014300), and looks something like this:

if(numberOfDoors > 0) {
	for(doorIndex = 0; doorIndex < numberOfDoors; doorIndex++) {
		if(doorData[doorIndex]->number == fellThroughDoor) {
			int *doorState = &doorData[doorIndex].state;
			
			if(*doorState == DOOR_OPEN) {
				playSoundEffect(46);
				*doorState = DOOR_CLOSING;
			}

			break;
		}
	}
}

We could nop out this snippet to disable this functionality, but we want to keep the functionality present in the game for when the player wants to reset the next room by holding L+R. Instead, we can set the fellThroughDoor variable to a non-existant door when loading stage data to prevent the above snippet from having any effect.

With these two fixes, loadCurrentStageData now looks like:

void loadCurrentStageData(void) {
	if(!(REG_KEYINPUT & (KEY_L | KEY_R))) {
		return;
	}
	
	int i = stageToStageDataIndex[stage];
	
	if(i != 0xff) {
		loadStageData(stagePreserveData[i].entries);

		// Prevent the door we fell through from closing
		fellThroughDoor = 0xc2;

		// Update door graphics
		int i;
		for(i = 0; i < numberOfDoors; i++) {
			setAnimationFrame(doorData[i].byte_8, doorData[i].uint_16, 0, doorData[i].frame);
		}
	}
}

Enemy data

With the stage data preservation code working as expected for doors, we can extend it to also preserve enemy state.

I reversed some of the enemies, and came up with the following memory addresses used. There's so many that it would take too long to fully reverse exactly how all of them work, but this should be enough to get started with demonstrating the concept of preserving them cross-stage:

Number	Name	Address	Size * Count
0x0	goomba	0x300166C	72 * 5
0x1	bee	0x3001AB0	20 * 3?
0x2	koopa	0x3002784	40 * 6
0x3	shy guy	0x300308C	52 * 4
0x4	snowman	0x30031E8	44 * 4?
0x5	penguin	0x3002DE8	40 * 4
0x6	fish	0x3000760	52 * 6
0x7	pokey	0x3002EF8	44 * 4
0x8	spiny	0x3003368	40 * 4
0x9	klepto	0x300593C	40 * 4
0xa	snake	0x30009C0	36 * 6
0xb	?	0x30019FC	44 * 4
0xc	mole	0x3002D50	48 * 1?
0xd	ghosts	0x30004D4	48 * 4
0xe	robot	0x3002884	36 * 3
0xf	?	0x3000428	40 * 4
0x10	?	0x3000668	36 * 4
0x11	?	0x3000924	36 * 4
0x12	flying shy guy	0x30014CC	44 * 6
?	lakitu	0x3002904	40

Summary

I decided to stop the article here as it's quite long, and it isn't my intention to create a full mod for this game, but just to describe my experience of reverse engineering it. I was successfully able to reverse engineer enough of the game to modify and rewrite core functionality including input handling, resource decompression, stage transitions, etc to make the game better.

Regarding Ghidra, I was slightly disappointed with how it handled ARM code, especially considering I tested it with probably the easiest target possible, some C code from 2004 that used a compiler that couldn't optimise well for the architecture, as I've shown.

Ghidra frequently messed up disassembly by not recognising all functions, and getting confused between ARM and Thumb mode, even though the whole program is Thumb. The UI also leaves a lot to be desired; scrolling is completely broken, selecting variables/registers doesn't highlight other uses of it, assembly output is cluttered as it displays multiple lines per instruction whenever there are any xrefs, etc. This is largely irrelevant though as no one is excited about Ghidra just to use its disassembler; all of the hype is around its decompiler, which is very promising aside from a few quirks I encountered.

I'm assuming that the primary use case for Ghidra is decompiling x86 (/x64), but even for ARM it is easily the best free decompiler available, and the complaints with the UI are relatively trivial, so I am optimistic for the future.