I've been meaning to improve my reverse engineering skills for a while now, and with the recent release of Ghidra, I've finally been motivated to attempt a game modding project.
I chose to look at Mario Pinball Land for GBA, as it's a game which doesn't have a substantial existing modding community (unusual for a Mario title), and has lots of quirks which I'd like to try to improve!
The most common complaints with this game are:
Limited controls:
Misc:
Lack of gameplay variety:
Artificial gameplay difficulty:
The last point in particular is very frequently cited as the most annoying problem, and is explained well in IGN's review of the game:
In this article I provide notes of my first impressions using Ghidra to reverse engineer the game, and use these notes to try to address some of these complaints by modifying several game mechanics. Perhaps in a future article I will actually add new content such as new stages, music, and boss fights.
Before diving into reversing the game, let's outline all prior research on this game publicly available.
There are some cheat codes which give us the addresses of a few useful variables:
0x3000000
0x3000004
0x3000008
0x300000C
and 0x300001C
0x3000014
and 0x300578C
0x300002C
0x3000030
The Cutting Room Floor states that the flipper physics were altered between the original Japanese release, and the later versions.
Aside from Ghidra, I'll be using NO$GBA as my main emulator, for its excellent debugging capabilities, and occasionally VisualBoyAdvance-M for its cheat finder functionality.
Importing the file into Ghidra will automatically detect it as a "Raw Binary", so just set the language for Little Endian ARMv4t and map the ROM section at 0x8000000 - 0xA000000
.
The other sections (EWRAM, IWRAM, IO, VRAM, ...) should be mapped to the addresses as documented on gbatek). Additionally this cartridge has an EEPROM chip; since the ROM size is not greater than 16MB, EEPROM is mapped at 0xD000000
(see here on gbatek).
Since the game doesn't utilize compression for code, all code is now mapped out correctly for any static analysis.
If you turn off the power without having selected to "SAVE & QUIT" your save game will be invalidated, and on the next boot you will only be given the option to start a "NEW GAME".
The developers' reasoning for this quirk was probably to prevent people from being able to rollback their mistakes. Consider for example saving your progress once you make it to a boss, and then if you accidentally fall down just restarting the game to immediately retry at the boss, without having to make it back there again.
Personally, I think the player should be allowed to use this trick if they want to, and that blocking it just contributes to the common frustrations of the game's difficulty. However, the bigger problem from forcibly erasing progress this way is the massive inconvenience induced by triggering it on accident; this can easily happen if the player doesn't know about the quirk and just wants to put the game down after a quick session, or even deliberately doesn't attempt to save as a courtesy if they are borrowing the cartridge. Even if everyone who plays the game knows about the quirk, it still prevents you from playing the game once you reach low battery (a reasonable scenario for a handheld device) for fear of losing your entire progress if the system powers off!
Assuming that the EEPROM save data would be DMAed somewhere into RAM, I decided to search for references to the DMA registers. I quickly realized that Ghidra wasn't detecting all references as it hadn't detected all functions in the program, which was a little disappointing:
After manually creating the functions I had found with dynamic analysis, I quickly found all uses of DMA channel 3. One in particular is the function at 0x803645C
, which is a wrapper for starting a data transfer using DMA channel 3 and waiting for it to complete. After naming the parameters and IO registers the decompilation output looks like this:
Although this is a relatively simple function, there are a lot of things we can notice from the decompiler output. The obvious things are:
return
statement at the end of the function.volatile
, the decompiler wraps all loads and stores it detects to this region with read/write_volatile_x
functions. I find this slightly less readable, but I suppose it's useful to have these operations marked more explicitly than non-volatile read/writes.
A slightly more concerning thing is the weird choice of control flow when handling loops; it has repeated the load of DMACNT3_H
, even though in the assembly there is only a single load instruction. A much more natural choice would have been to display the loop condition once:
while (read_volatile_2(DMA3CNT._2_2_) & 0x8000);
Even if it is desired to explicitly assign to an intermediate variable, the pseudo-code could still more accurately describe the assembly by using a do
statement:
do {
dma_3_control = read_volatile_2(DMA3CNT._2_2_);
} while (dma_3_control & 0x8000);
Small quirks aside, this is a strong decompilation effort, and it enables us to quickly understand what this function does. By then looking for function call references to this function, and inspecting the source
and destination
arguments passed, we can quickly find the 2 functions which copy to/from EEPROM (0xD000000
) and use this as a starting point for understanding how saving and loading works.
After some analysis we can discover the layout of the save:
Description | Offset | Size |
---|---|---|
Magic | 0 | 8 |
High scores | 8 | 140 |
High scores checksum | 148 | 4 |
Game state | 152 | 348 |
Game state checksum | 500 | 4 |
The magic for Mario Pinball land saves is 05 f6 50 4d 6f 4d e6 5f
.
Checksums are calculated by simply adding every byte for the section, and adding each 4 bytes of the magic (so + 0xAD374374
in total).
To identify how a save was being invalidated when loaded I copied the save after selecting to "SAVE & QUIT", and then copied the save immediately after loading it. Diffing the 2 files revealed 8 bytes from offset 496
were different (the last 4 bytes of game state, and game state checksum).
Breaking on the last 4 bytes of stage data revealed that it's the marker for whether the save is valid or not. When loading the save it is set to 0xC769D9A9
, and when pressing "SAVE & QUIT" it is set to 0x38962656
. I quickly located the code responsible for this, function restoreLoadedData
at 0x8002CD0
:
To fix the problem, we can change the constant it sets saveValid
to from 0xC769D9A9
to 0x38962656
; the location of this constant in the ROM is located at 0x8002D14
.
Our next modification will be to locate the code that moves the flippers up and down, and modifying it to operate on an analog value.
With VisualBoyAdvance-M cheat finder functionality, we can quickly locate probable flipper variables by searching for 32-bit values that are zero when the desired flipper is lowered, and non-zero when raised:
0x030014A8
0x03002B20
0x03002B30
0x03004D10
We can then locate the code that is responsible for modifying these values by setting memory write breakpoints on those addresses (in NO$GBA notation this is [030014A8]!
), and then raising the flippers again.
As soon as I press L, the instruction at 0x080153D0
writes 0x55554
to 0x030014A8
. We can lookup the enclosing function, starting at 0x080153A4
, in Ghidra and try to understand what is going on. I came up with the following psuedo-code for this function (where flipperValue
's address is 0x030014A8
):
void updateFlipper(int flipper) {
if(flipperPressed[flipper]) {
int flipperDelta = upper32bitsof64bitmultiply(0x147AE, flipperMultiplier);
if(flipperValue[flipper] != 0x2AAAAA) {
flipperValue[flipper] += flipperDelta;
if(flipperValue[flipper] > 0x2AAAAA) flipperValue[flipper] = 0x2AAAAA;
}
}
else {
int flipperDelta = upper32bitsof64bitmultiply(0xA3D7, flipperMultiplier);
if(flipperValue[flipper] != 0) {
flipperValue[flipper] -= flipperDelta;
if(flipperValue[flipper] < 0) flipperValue[flipper] = 0;
}
}
}
flipperMultiplier
is set to 0x42AAA
during initialisation (at 0x8028C02
), and then remains unchanged throughout the game.
So in this function, the flipper is either incremented by ((0x147ae * 0x42aaa) >> 32) = 0x55554
until 0x2AAAAA
, or decremented by ((0xA3D7 * 0x42aaa) >> 32) = 0x2AAAA
(half as quickly) until 0
.
Now that we've located the relevant function and have a good idea of how it works, we can attempt to modify it.
Our aim is to just refine the controls without altering the gameplay. We want to preserve the original behaviour when going from a trigger value of 0 to 100% in one frame, where it is interpolated between the range and not just set to fully upright immediately, since we don't want to increase the maximum flipper acceleration possible. In other words, we just want to be able to either raise a flipper slower or be able to hold it in a position between the two states.
We have a slight problem in that when reading an analog value in a range like 0 - 256
, it does not map perfectly to the game's 0 - 0x2AAAAA
range; 256 * 10922 = 0x2AAA00
, so we'll just add 0xAA
after passing half way (this is less than 0.006% so shouldn't jerk too much).
We also need to keep the size of our new function under 0x7C
if we want to overwrite it inline. I came up with the following:
#define analogReading ((unsigned int *)0x4000400)
#define flipperValue ((int *)0x30014A8)
void updateFlipper(int flipper) {
unsigned int analog = analogReading[flipper];
unsigned int v = analog * 10922;
if(v >= 128 * 1092) v += 0xaa;
int delta = v - flipperValue[flipper];
if(delta > 0x55554) delta = 0x55554;
if(delta < -0x2AAAA) delta = -0x2AAAA;
flipperValue[flipper] += delta;
if(flipperValue[flipper] < 0) flipperValue[flipper] = 0;
if(flipperValue[flipper] > 0x2AAAAA) flipperValue[flipper] = 0x2AAAAA;
}
Compile as thumb code and insert into ROM file at offset 0x153A4
:
arm-none-eabi-gcc updateFlipper.c -mthumb -Os -nostartfiles -nostdlib -o updateFlipper.elf -Ttext=0x080153A4
arm-none-eabi-objcopy -O binary updateFlipper.elf updateFlipper.bin
Note that as our code modifications get more complex and will require more than just a text section (data section, etc), we will need to ensure that the compiler and linker do not insert any padding between the sections, to keep our binaries as small as possible. Compiling with -Wl,-z,max-page-size=0x1
and removing all calls to ALIGN
in the linker script, such as the below, seems to work (run arm-none-eabi-ld -verbose
to get the default linker script):
/* Adjust the address for the data segment. We want to adjust up to
the same address within the page on the next page up. */
. = ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1));
For now, let's just modify VisualBoyAdvance-M, to expose the analog values we want at an unused IO address:
static inline uint32_t CPUReadMemory(uint32_t address)
{
switch (address >> 24) {
...
case 4:
if (address == 0x4000400) {
value = 128;
}
else if (address == 0x4000404) {
value = 64;
}
else if ((address < 0x4000400) && ioReadable[address & 0x3fc]) {
...
When running this, the flippers are in the expected positions (left 1/2 up, right 1/4 up), and the ball seems to collide with the flippers at the correct location.
Modding the emulator further is out of scope of this article since I want to focus on the game itself, but it's just a matter of reading from a game controller and returning these trigger values instead of the constants shown above. If you want a solution which would work on a real GBA, you could read the flipper values over the GBA's link cable, and create a piece of external hardware to control the flippers (perhaps this will be a future project).
The ball is still programmed to fly off the flippers once the trigger buttons are pressed.
From reversing the updateFlipper
function, we know that there exist flipperPressed
variables, which are likely to be controlling this.
The only function which reads this value is the one at 0x801531C
, which seems to be updating a value likely used for the velocity calculation:
const int leftFlipper1_rightFlipperMinus1[2] = { 1, -1 };
void updateFlipperVelocity(int flipper) {
int *vel = &flipperVelocity[flipper];
*vel = 0;
if(flipperPressed[flipper]) {
if(flipperValue[flipper] == 0)
playSoundEffect(0x2F);
if(flipperValue[flipper]) != 0x2AAAAA)
*vel = 0x147AE * leftFlipper1_rightFlipperMinus1[flipper];
}
else if(flipperValue[flipper]) {
*vel = 0xA3D7 * leftFlipper1_rightFlipperMinus1[flipper];
}
}
Those constants match what we saw in updateFlipper
, which is a good sign we're looking at the right thing.
Let's try modifying this function to use the delta of the analog readings (again, the values don't divide perfectly, 0x55554 / 4 = 0x15555
, which is not exactly 0x147AE
):
#define analogReading ((unsigned int *)0x4000400)
#define flipperValue ((int *)0x30014A8)
#define flipperVelocity ((int *)0x30014B8)
#define leftFlipper1_rightFlipperMinus1 ((int *)0x81DEAAC)
typedef void playSoundEffect_t(int);
void updateFlipperVelocity(int flipper) {
int *vel = &flipperVelocity[flipper];
*vel = 0;
unsigned int analog = analogReading[flipper];
unsigned int v = analog * 10922;
if(v >= 128 * 1092) v += 0xaa;
int delta = v - flipperValue[flipper];
int d = delta / 4;
if(v > 0) {
if(flipperValue[flipper] == 0) ((playSoundEffect_t *)0x802BADD)(0x2F);
if(flipperValue[flipper] != 0x2AAAAA) {
if(d > 0x147AE) d = 0x147AE;
*vel = d * leftFlipper1_rightFlipperMinus1[flipper];
}
}
else if(flipperValue[flipper]) {
if (d < -0xA3D7) d = -0xA3D7;
*vel = d * leftFlipper1_rightFlipperMinus1[flipper];
}
}
Our new function is too big to be inserted over the original function, so let's place a jump there, and insert the function at some free space at the end of the ROM (0x87FD3D4
).
The game now has analog flipper control support!
Once again, the game doesn't allow you to leave all doors in a stage open. Once you open the last door, all doors (including the one you just opened) close! Our next modification will be removing the code responsible for this.
The door data is located at 0x300140C
; it's an array of 20
byte structures. The first field is the door number (4 bytes
), which I'm not sure why is required considering you can just access doors by index without requiring a loop, and the second field is the door's state (4 bytes
). The door state can be one of the following:
Setting a write breakpoint on one of the door's states and opening all of the doors quickly revealed the function responsible for this.
checkAllDoorsOpen
at 0x8014280
has a loop which checks if all doors (numberOfDoors
at 0x3001448
) are 3
(open), and if so, sets them all to 4
(closing).
We can just replace the first instruction of this function with bx lr
to immediately return, which disables this functionality.
To reach the final boss in Mario Pinball Land you only need to collect 15 stars (out of 35 total), but it may be desirable to lower some of the door star requirements even further to make the game easier.
Doors are created in each stage's setup function (we will go over stage control flow in more detail later) by calling createDoor
. For example, stage 2 creates 2 doors, with identifiers 0
and 1
, resources 0xBF
and 0xCA
, and the same palette at address 0x87FB15C
:
createDoor(0, 0xBF, (struct palette *)0x87FB15C);
createDoor(1, 0xCA, (struct palette *)0x87FB15C);
The star limits are implemented in each stage's loop function. For example, stage 2 loop function has the following code to control door access:
if(getStars() > 0 || inTimeAttackMode())
allowOpeningDoor(0);
if(getStars() > 1)
allowOpeningDoor(1);
Swapping the resource number passed to createDoor
for a different door sprite, and modifying the check in the stages's loop function is sufficient to lower the star requirement of a door.
Before moving onto the more involved modifications, let's first assess how much free RAM we have to work with.
The game doesn't use any dynamic memory allocation for IWRAM (0x3000000 - 0x3007FFF
), everything has been statically pre-assigned an address, which makes it relatively easy to see that the developers have used practically everything!
EWRAM (0x2000000 - 0x203FFFF
) on the other hand is a bit more interesting as the entirety of it is just used as a massive buffer to decompress data to. There is a pointer, nextSpaceForDecompressedData
(0x300531C
), which is used to track where to decompress data to. This pointer is set to the start of EWRAM (0x2000000
) when transitioning to a new stage, and is then advanced each time the stage decompresses a resource by calling decompressResource
(0x802E6A0
).
Different stages require a different amount of data, for example stage 1 requires 0x18020
bytes and stage 3 requires 0x3e824
bytes. We could try to identify the area of the game which uses the most space for decompressed data, and then use whatever is left over as free space (maybe something like 5000 bytes).
However, I had an idea for a more practical solution. If the ROM was bigger we could just store all resources uncompressed, removing the need for decompressing the resources into EWRAM on the fly. This would give us the whole of EWRAM (256KB) for modifications to use!
Changing the ROM size from 8MB to 16MB is as simple as increasing the file size and won't affect any other mappings. The size could be increased even further to 32MB if necessary, but this would affect how the game accesses EEPROM. For any sizes larger than 32MB we could use a mapper for bank switching, giving practically unlimited storage (think about the GBA video carts storing feature length films, or flashcards reading from SD cards). However for the purposes of this article, 16MB will be more than sufficient.
The resource table starts at 0x87FCCD8
, it's an array of pointers to resources structured the following way:
1 byte - type
3 bytes - decompressedSize
[data]
The type
is either 0
if the resource is not compressed, or 0x10
if it is compressed with LZ77 (to be decompressed via GBA BIOS function 0x11
); many open source decompressors for this exist, such as CUE's DS/GBA compressors.
The following resources use their entry in the resource table for uncompressed metadata, and their real compressed data is stored elsewhere (to be fetched by calling getCompressedData
(0x802D7A4
)):
0x64 - 0x8382C48
0xa7 - 0x852E36C
0xf9 - 0x85C5BB4
0x91 - 0x849C568
0x92 - 0x84EF718
0x81 - 0x83EFDD8
0x83 - 0x8428594
0x3a - 0x8364CAC
0x51 - 0x837D0A8
0xb - 0x8269430
0xc - 0x82C948C
0x8 - 0x81FBBA0
I wrote a small hacky program which can decompress all of these resources and then update the resource table to point to the decompressed data. Note that some of the resources overlap, so I decided to do this in a couple of stages (extract all resources without writing anything, then copy the decompressed resources and write to the resource table).
unsigned int resourceAddressAddress(int resource, int resolve) {
if(resolve) {
switch(resource) {
case 0x64: return 0x802D834;
case 0xa7: return 0x802D83C;
case 0xf9: return 0x802D82C;
case 0x91: return 0x802D824;
case 0x92: return 0x802D844;
case 0x81: return 0x802D804;
case 0x83: return 0x802D81C;
case 0x3a: return 0x802D7F4;
case 0x51: return 0x802D84C;
case 0xb: return 0x802D814;
case 0xc: return 0x802D80C;
case 0x8: return 0x802D7FC;
}
}
return 0x87FCCD8 + resource * 4;
}
...
// First loop just to fetch data
int i;
for(i = 0; i < 255; i++) {
unsigned int aa = resourceAddressAddress(i, 1);
unsigned int aau = resourceAddressAddress(i, 0);
unsigned int a = *(unsigned int *)(rom + aa - 0x8000000);
printf(" [+] Decompressing resource %d from 0x%08x\n", i, a);
...
printf(" [+] Writing decompressed resources\n");
// Decompress to end of ROM
unsigned int offset = 8 * 1024 * 1024;
// Second loop to write decompressed data
for(i = 0; i < 255; i++) {
*(unsigned int *)(rom + offset) = resources[i].decompressedSize << 8;
memcpy(rom + offset + 4, resources[i].decompressedData, resources[i].decompressedSize);
// Update the pointer
*(unsigned int *)(rom + resources[i].addressOffset - 0x8000000) = offset + 0x8000000;
offset += 4 + resources[i].decompressedSize;
There's just one final modification we need to make. Since the game expects certain resources to have been decompressed when requested, it tries to locate the data by calling getDecompressedData
(0x802FB04
); this function will return NULL
if the resource has not been decompressed. To fix this, we'll update the function to return the resource's uncompressed address either from the resource table or from getCompressedData
:
void *(*getCompressedData)(int resource) = (void *)(0x802D7A4 + 1);
#define compressedDataAddresses ((void **)0x87FCCD8)
void *_start(int resource) {
void *r = getCompressedData(resource);
if(r == (void *)0x600B800) {
r = compressedDataAddresses[resource];
}
return r + 4;
}
Note that the resources mentioned above are not the only compressed data in the game, but are the only resources which are decompressed into EWRAM (other data is decompressed into VRAM for example).
We already know from the cheat codes that the current stage number is located at 0x3000030
, so we can search for references to it in our disassembler or break on writing it to identify the pertinent functions.
It turns out that there are 9 main functions involved in stage control flow:
0x08000A8C
- switchStage
0x0801EC64
- enterGameRoom
0x0801E6C8
- setupStage
0x0801E9F0
- mainGameFunction
0x8034150
- isReadyForSetup
0x801E8DC
- clearUpStage
enterGameRoom
calls setupStage
once, and then mainGameFunction
in a loop until the status is changed, and finally clearUpStage
:
int enterGameRoom(...) {
int result;
if (stageIsShop(currentStage)) {
...
}
else {
setupStage(stage, ...);
updateTimer(...);
do
mainGameFunction(stage);
while (getMarioStatus() == 13);
clearUpStage(stage);
result = getMarioStatus();
}
return result;
}
setupStage
, mainGameFunction
and clearUpStage
have switch statements to execute a different function according to the current stage, such as setupStage1
(0x801F290
), stage1
(0x801F52C
), and clearUpStage1
(0x801F5DC
):
int setupStage(...) {
...
switch (stage) {
case 0:
setupStage1();
break;
case 1:
setupStage12();
break;
...
}
}
Many of the stage specific loop functions call isReadyForSetup
, which returns 1
once the ball's Y passes above a certain position, and then 0
for all future calls. The idea is that the stage specific setup functions execute whilst the screen is still faded out, but some routines such as creating enemies need to be done later so that they can be seen appearing:
int stage12() {
...
if (isReadyForSetup()) {
createGhost(0xFF6A0000, 0x3520000);
createGhost(0x960000, 0x3520000);
createGhost(0xFF060000, 0x2260000);
createGhost(0xFA0000, 0x2260000);
}
To clarify this, here's an overview of the key points when transitioning from stage 1 to stage 1-2:
ball[0].state
from 13
to 0
,enterGameRoom
to break out of the loop that was executing mainGameFunction
,enterGameRoom
calls clearUpStage
,clearUpStage
switches on the stage number (0
), and calls clearUpStage1
,enterGameRoom
returns to switchRoom
,switchRoom
changes stage
to 1
(stage 1-2),switchRoom
continues its loop which calls enterGameRoom
,enterGameRoom
calls setupStage
,setupStage
switches on stage number (1
), and calls setupStage12
,setupStage12
calls some functions to decompress some data, reset the setup variable, and set enemy type to ghost (0xd
),enterGameRoom
,enterGameRoom
enters its loop which calls mainGameFunction
,mainGameFunction
switches on the stage number (1
), and calls stage12
,stage12
checks the setup variable, and if it's ready will create the ghosts,
The only other thing to understand is how the game desides the target of the next stage in switchStage
. This will be useful for us, as we want to change one stage transition to take us to a different stage.
There is a hole in stage 3-3 (0x12
) which will send you to the underwater stage, 3-4 (0x13
), however if you fall from stage 3-4 you will be sent all the way back to stage 3 (0x0f
), not stage 3-3!
To modify the game to return from stage 3-4 (0x13
) to stage 3-3 (0x12
), instead of stage 3 (0x0f
), we need to understand the data structure which stores the relationships between stages:
struct stageLink {
int stage;
int forwardsTransition;
int linkedStage;
int backwardsTransition;
};
There is an array of these structures, stageLinks
, starting at 0x87F5618
, which is used by switchStage
to determine the next stage like so:
if stageLinks[i].stage == currentStage && transitionMechanic == stageLinks[i].forwardsTransition {
nextStage = stageLinks[i].linkedStage;
}
else if stageLinks[i].linkedStage == currentStage && transitionMechanic == stageLinks[i].backwardsTransition {
nextStage = stageLinks[i].stage;
}
Where falling passed the flippers is mechanic 0xa
. This is enough information for us to search through the table for stage numbers 0x12
, 0x13
, and 0x0f
, yielding the entries at indexes 46
and 48
:
87f58f8: stageLink { 0x12, 1, 0x13, 4 }
...
87f5918: stageLink { 0x13, 0xA, 0xF, 2 }
Entry 46
states that you can transition from stage 0x12
to 0x13
by going through transition mechanic 1 (the ice whole), but cannot return from stage 0x13
to stage 0x12
as there is nothing in stage 0x13
that sets transition mechanic 4.
Entry 48
states that you can transition from stage 0x13
to 0xf
by falling (transition mechanic 0xa
), but cannot return from stage 0xf
to 0x13
as there is nothing in stage 0xf
that sets transition mechanic 2
.
To fix the problem, we can either modify entry 46
's backwards transition to 0xa
, or entry 48
's linkedStage
to 0x12
.
Now we come to the biggest modification, removing the stage reset mechanic by preserving more stage data across stage transitions. This will be the new default behaviour, but we will allow the player to opt out of restoring stage data if they hold L+R while transitioning to a new stage.
The only stage data that is currently preserved across stage transitions is the stageFlags
array, located directly after the stage
variable, at 0x3000034
. This is a 32-bit integer per stage, which holds information about the actions you have completed in the stage, such as whether you have collected the star, opened the question boxes, raised the pyramid, etc.
Our aim is to preserve state about which doors are currently open, the enemies, and any coins and stars (unlocking a star and then accidentally leaving the stage before collecting it is a huge problem). We will preserve this data in EWRAM, which has now been completely freed up by eliminating the decompression code.
Ignoring the shops, bonus stages, and boss stages (we wouldn't want to make the game too easy ;)), there are 28 stages:
It's kind of annoying that they are not perfectly in order, but we can just use an array to map from stage number to preservation index:
const unsigned char stageToStageDataIndex[] = {
/*0x00:*/ 0,
/*0x01:*/ 1,
/*0x02:*/ 0xff,
/*0x03:*/ 0xff,
/*0x04:*/ 0xff,
/*0x05:*/ 0xff,
/*0x06:*/ 0xff,
/*0x07:*/ 2,
/*0x08:*/ 3,
/*0x09:*/ 0xff,
/*0x0A:*/ 4,
/*0x0B:*/ 5,
/*0x0C:*/ 6,
/*0x0D:*/ 7,
/*0x0E:*/ 8,
/*0x0F:*/ 9,
/*0x10:*/ 10,
/*0x11:*/ 0xff,
/*0x12:*/ 11,
/*0x13:*/ 12,
/*0x14:*/ 13,
/*0x15:*/ 14,
/*0x16:*/ 15,
/*0x17:*/ 16,
/*0x18:*/ 17,
/*0x19:*/ 18,
/*0x1A:*/ 19,
/*0x1B:*/ 20,
/*0x1C:*/ 21,
/*0x1D:*/ 0xff,
/*0x1E:*/ 22,
/*0x1F:*/ 23,
/*0x20:*/ 0xff,
/*0x21:*/ 24,
/*0x22:*/ 25,
/*0x23:*/ 26,
/*0x24:*/ 0xff,
/*0x25:*/ 0xff,
/*0x26:*/ 0xff,
/*0x27:*/ 27,
/*0x28:*/ 0xff,
/*0x2E:*/ 0xff,
/*0x2F:*/ 0xff,
/*0x30:*/ 0xff,
/*0x31:*/ 0xff,
/*0x32:*/ 0xff,
/*0x33:*/ 0xff,
/*0x34:*/ 0xff
};
If we were to use the whole of EWRAM, we would have more than 9000 bytes per stage (256 * 1024 / 28
), so storage space shouldn't be a problem.
Now that we understand the control flow when transitioning between stages, we need to find a way to execute our own code at key points; preserving data before clearUpStage
, and restoring data after setupStage
.
In armv4t there's no blx memory
, or even blx register
instruction, so you have to set the target, set the return address, and then execute the branch in 3 different instructions. It turns out that we need at least 10 bytes for an arbitrary call, the same amount of space required for a 64bit call on x86_64!
ldr r3, [pc, #x]
mov lr, pc
bx r3
...
@ 4 byte target
This is assuming that you are using exclusively Thumb code; if you wanted to support interopability with ARM functions, you would need an additional instruction to add 1 to that lr
value so that the target function would know to return to Thumb code and not ARM (there is no add lr, pc, #1
instruction to do this in a single instruction).
There isn't enough space in these functions to insert a new function call, so usually we would have to insert this call by overwriting some of the function's original instructions, and then reimplement them in the new function before executing any custom functionality.
However, it turns out that the game is using a very old compiler which doesn't optimize space efficiency well for ARM target. I took it as a challenge to try to eliminate enough instructions to create space for an arbitrary call, whilst keeping the game functionally equivilant.
setupStage
This function has a switch
statement, which completely throws off Ghidra and causes it to try to disassemble the code as ARM instead of Thumb, even though the whole function is Thumb:
The instruction it cites as confusing it is the last instruction of the final switch
case
, which is a 4-byte bl
Thumb instruction. Perhaps the root cause is that it somewhere mistakenly checks for ARM/Thumb mode based on instruction size, assuming all Thumb instructions are 2-bytes?
Ignoring the disappointment of Ghidra's ARM disassembler, the setupStage
function has the following prologue and epilogue (from a different disassembler):
0801E6C8 setupStage:
0801E6C8 sub sp, sp, #8
0801E6CA push {r4-r6, lr}
...
0801E6D0 str r2, [sp, #0x18+var_8]
0801E6D2 str r3, [sp, #0x18+var_4]
...
0801E8D2 pop {r4-r6}
0801E8D4 pop {r3}
0801E8D6 add sp, sp, #8
0801E8D8 bx r3
Which means it has the following stack layout:
Where the local stack variables are just used for storing arguments 2 and 3, to be used later for a single function call:
0801E6D0 str r2, [sp, #0x18+var_8]
0801E6D2 str r3, [sp, #0x18+var_4]
...
0801E738 ldr r0, [sp, #0x18+var_8]
0801E73A ldr r1, [sp, #0x18+var_4]
0801E73C ldr r2, [sp, #0x18+arg_0]
0801E73E bl setupBall
Our strategy to optimize the epilogue for space will be:
add
by pop
ping into trashed registers,bx
instruction by pop
ping into pc
,pop
s into 1 by reordering data,It turns out that the 8 instructions originally used for the prologue and epilogue can be reduced to just 2 instructions!
prologue:
push {r2-r6, lr}
epilogue:
pop {r2-r6, pc}
This reorders the stack layout slightly:
Which just means we have to edit the 2 instructions which reference the local stack variables holding R2
and R3
:
Before:
ldr r0, [sp, #16]
ldr r1, [sp, #20]
After:
ldr r0, [sp, #0]
ldr r1, [sp, #4]
Since there are also 2 bytes of padding between the end of setupStage
and the next function (clearUpStage
), we now have 8 bytes of space for code before setupStage
epilogue, just 2 bytes short of being able to insert a call.
Note that even though our call will be the last instruction in the setupStage
function, we can't perform a tail call optimization (setupStage call -> loadStageData return -> setupStage return
to just setupStage branch -> loadStageData return
) to reduce the number of instructions since there is no way of pop
ping directly into lr
.
Instead, our strategy at this point is to use the free space we created in the prologue to move the target into the stack, and then we can take advantage of being able to pop
into pc
to combine the load of the target, and the branch, into a single instruction:
prologue:
...
ldr r5, [pc, #0x20C]
push {r2, r5}
...
@ bl loadCurrentStage
mov lr, pc
pop {r2, pc}
epilogue:
pop {r2-r6, pc}
loadCurrentStageTarget:
.word loadCurrentStage
Note that I also pushed r2
in order to keep sp
8-byte aligned.
Once again, this changes the stack layout, so we'll have to adjust all instructions accessing stack memory (to use offset +8
), but this is still less effort than relocating the whole function body up a few instructions to move the free space in the prologue to the epilogue.
clearUpStage
For clearUpStage
our aim is to create space to insert a function call at the start of the function, instead of at the end (but we can store the branch target at the end of the function so we don't have to jump over it).
0801E8DC clearUpStage:
0801E8DC push {r4, lr}
0801E8DE movs r4, r0
...
0801E9E8 pop {r4}
0801E9EA pop {r0}
0801E9EC bx r0
This one is pretty simple to optimize the epilogue for:
pop {r4, pc}
Once again, there are 2 bytes of padding at the end of this function, which means that with our epilogue optimization, we have 6 bytes of space after the function, which is more than enough to store an indirect branch target.
Now we just need to optimize the start of the function to create enough space for a new call:
0801E8DC clearUpStage:
0801E8DC push {r4, lr}
0801E8DE movs r4, r0
0801E8E0 bl sub_8033F34
0801E8E4 bl sub_801B4F8
0801E8E8 bl sub_800E020
0801E8EC cmp r4, #0x36
0801E8EE bhi def_801E8F8
0801E8F0 lsls r0, r4, #2
0801E8F2 ldr r1, =jpt_801E8F8
0801E8F4 adds r0, r0, r1
0801E8F6 ldr r0, [r0]
0801E8F8 mov pc, r0
As you can see, it's saving R0
in R4
before executing the 3 calls, but it's then immediately shifting it. Why don't we eliminate the redundant save and just shift it immediately (lsls r4, r0, #2
)? This saves us 2 bytes.
The switch code is also sub-optimal, the ADDS
and LDR
for the switch table can be combined into a single operation (ldr r0, [r1, r4]
), which saves another 2 bytes.
The comparison would have been generated to handle the default
case where the stage number is none of the ones specified, but since we already know the that the stage number is always within the range we can omit the check here, which will save another 4 bytes.
Overall, this function can be shrunk to give us an additional 8 bytes at the start:
clearUpStage:
push {r4, lr}
lsls r4, r0, #2
@ 8 new bytes of data
bl sub_8033F34
bl sub_801B4F8
bl sub_800E020
ldr r1, =jpt_801E8F8
ldr r0, [r1, r4]
mov pc, r0
We now have enough space to perform a call at the start of clearUpStage
, with the target stored after the epilogue using the free space we optimized there.
Now that we have enough free RAM to store stage data, and are able to cleanly hijack control flow at the desired places during stage transitions, we can begin implementing the preserve and restore functionality! Let's start by just preserving doors, as we've already talked about how they work.
Because of the way that numberOfDoors
gets set (initialized to 0
on stage transition, and then incremented each time createDoor
is called), it is very hard to statically know the number of doors for each stage.
Instead, I decided to use a data structure which stores the number of doors dynamically:
struct stageDataEntry {
unsigned char dataType, count;
unsigned char data[0];
};
So a stage with three doors would be stored as:
struct stageDataEntry {
DATA_TYPE_DOOR,
3,
struct door[3] = {
20 bytes door data,
20 bytes door data,
20 bytes door data,
}
}
I came up with the following save and load functions:
void *(*copy)(void *destination, void *source, unsigned int size) = (void *)(0x8036798 + 1);
#define numberOfDoors (*((int *)0x3001448))
#define doorData ((struct door *)0x300140C)
struct preserve {
void *data;
unsigned int *count;
unsigned char size;
} preservitives[] = {
{ doorData, &numberOfDoors, sizeof(struct door) },
};
struct stageDataEntry *saveStageData(struct stageDataEntry *d) {
int p;
for(p = 0; p < sizeof(preservitives) / sizeof(preservitives[0]); p++) {
struct preserve *preserve = &preservitives[p];
unsigned int count = *preserve->count;
if(count > 0) {
unsigned int size = preserve->size * count;
d->dataType = p;
d->count = count;
copy(d->data, preserve->data, size);
d = (void *)d + offsetof(struct stageDataEntry, data) + size;
}
}
d->dataType = 0xdb;
return d;
}
void loadStageData(struct stageDataEntry *d) {
while(d->dataType != 0xdb) {
struct preserve *preserve = &preservitives[d->dataType];
unsigned int count = d->count;
unsigned int size = preserve->size * count;
copy(preserve->data, d->data, size);
*preserve->count = count;
d = (void *)d + offsetof(struct stageDataEntry, data) + size;
}
}
Link the target of the function calls we added to setupStage
and cleanUpStage
to the following two wrappers and door state will be preserved across the desired stages!
void saveCurrentStageData(void) {
int i = stageToStageDataIndex[stage];
if(i != 0xff) {
saveStageData(stagePreserveData[i].entries);
}
}
void loadCurrentStageData(void) {
if(!(REG_KEYINPUT & (KEY_L | KEY_R))) {
return;
}
int i = stageToStageDataIndex[stage];
if(i != 0xff) {
loadStageData(stagePreserveData[i].entries);
}
}
After testing the new door preserving code, a couple of issues become apparent.
For the first issue, we can call setAnimationFrame
(0x802F0C4
) on all doors after loading the new door states.
For the second issue, we can find the code responsible for closing the door after Mario falls through, by setting a memory write breakpoint on the door state, it's in handleDoors
(0x8014300
), and looks something like this:
if(numberOfDoors > 0) {
for(doorIndex = 0; doorIndex < numberOfDoors; doorIndex++) {
if(doorData[doorIndex]->number == fellThroughDoor) {
int *doorState = &doorData[doorIndex].state;
if(*doorState == DOOR_OPEN) {
playSoundEffect(46);
*doorState = DOOR_CLOSING;
}
break;
}
}
}
We could nop
out this snippet to disable this functionality, but we want to keep the functionality present in the game for when the player wants to reset the next room by holding L+R. Instead, we can set the fellThroughDoor
variable to a non-existant door when loading stage data to prevent the above snippet from having any effect.
With these two fixes, loadCurrentStageData
now looks like:
void loadCurrentStageData(void) {
if(!(REG_KEYINPUT & (KEY_L | KEY_R))) {
return;
}
int i = stageToStageDataIndex[stage];
if(i != 0xff) {
loadStageData(stagePreserveData[i].entries);
// Prevent the door we fell through from closing
fellThroughDoor = 0xc2;
// Update door graphics
int i;
for(i = 0; i < numberOfDoors; i++) {
setAnimationFrame(doorData[i].byte_8, doorData[i].uint_16, 0, doorData[i].frame);
}
}
}
With the stage data preservation code working as expected for doors, we can extend it to also preserve enemy state.
I reversed some of the enemies, and came up with the following memory addresses used. There's so many that it would take too long to fully reverse exactly how all of them work, but this should be enough to get started with demonstrating the concept of preserving them cross-stage:
Number | Name | Address | Size * Count |
---|---|---|---|
0x0 | goomba | 0x300166C | 72 * 5 |
0x1 | bee | 0x3001AB0 | 20 * 3? |
0x2 | koopa | 0x3002784 | 40 * 6 |
0x3 | shy guy | 0x300308C | 52 * 4 |
0x4 | snowman | 0x30031E8 | 44 * 4? |
0x5 | penguin | 0x3002DE8 | 40 * 4 |
0x6 | fish | 0x3000760 | 52 * 6 |
0x7 | pokey | 0x3002EF8 | 44 * 4 |
0x8 | spiny | 0x3003368 | 40 * 4 |
0x9 | klepto | 0x300593C | 40 * 4 |
0xa | snake | 0x30009C0 | 36 * 6 |
0xb | ? | 0x30019FC | 44 * 4 |
0xc | mole | 0x3002D50 | 48 * 1? |
0xd | ghosts | 0x30004D4 | 48 * 4 |
0xe | robot | 0x3002884 | 36 * 3 |
0xf | ? | 0x3000428 | 40 * 4 |
0x10 | ? | 0x3000668 | 36 * 4 |
0x11 | ? | 0x3000924 | 36 * 4 |
0x12 | flying shy guy | 0x30014CC | 44 * 6 |
? | lakitu | 0x3002904 | 40 |
I decided to stop the article here as it's quite long, and it isn't my intention to create a full mod for this game, but just to describe my experience of reverse engineering it. I was successfully able to reverse engineer enough of the game to modify and rewrite core functionality including input handling, resource decompression, stage transitions, etc to make the game better.
Regarding Ghidra, I was slightly disappointed with how it handled ARM code, especially considering I tested it with probably the easiest target possible, some C code from 2004 that used a compiler that couldn't optimise well for the architecture, as I've shown.
Ghidra frequently messed up disassembly by not recognising all functions, and getting confused between ARM and Thumb mode, even though the whole program is Thumb. The UI also leaves a lot to be desired; scrolling is completely broken, selecting variables/registers doesn't highlight other uses of it, assembly output is cluttered as it displays multiple lines per instruction whenever there are any xrefs, etc. This is largely irrelevant though as no one is excited about Ghidra just to use its disassembler; all of the hype is around its decompiler, which is very promising aside from a few quirks I encountered.
I'm assuming that the primary use case for Ghidra is decompiling x86 (/x64), but even for ARM it is easily the best free decompiler available, and the complaints with the UI are relatively trivial, so I am optimistic for the future.