Real Time Status Port or Store and Dump?
So, we have two options, both of which have their advantages and costs. The main concern is which will waste the fewest clock cycles. Every clock cycle "wasted" on debugging is another chance to alter the program state enough to hide the bug (we're back to Jacob Sorber's arguments against this methodology in the first place).
Option 1 - Store and Dump
Hopefully we can reduce the state we need to examine to one byte, store it in SRAM and, after the device has been sufficiently exercised, dump it out.
This dump can be done over a UART, as we no longer care much about speed once all our precious data is stashed away safely in SRAM. But, whilst I don't have any idea what the problem is, the fact that a store operation takes 2 cycles rather than the 1 cycle of option 2 is "bothersome".
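To make the idea concrete, here's a rough sketch of what store and dump might look like in C. Everything named here is my own invention for illustration: LOG_SIZE, log_byte(), log_dump() and the uart_put() stand-in, which just captures bytes in an array so the sketch can be tried on a PC. On the real device it would be replaced by whatever routine pushes a byte out of the UART. Note that the hot path is more than the bare store: the length check and index update cost cycles too, so the "2 cycles" figure is a lower bound at best.

```c
#include <stdint.h>

#define LOG_SIZE 64u                 /* hypothetical: bytes of SRAM we can spare */

static uint8_t log_buf[LOG_SIZE];    /* the one-byte log entries */
static uint8_t log_len;              /* number of entries stored so far */

/* Hot path: called from the code under test. Stops logging when full. */
static inline void log_byte(uint8_t code)
{
    if (log_len < LOG_SIZE) {
        log_buf[log_len++] = code;
    }
}

/* Stand-in for a UART transmit routine: captures bytes instead of sending. */
static uint8_t sent[LOG_SIZE];
static uint8_t sent_len;

static void uart_put(uint8_t b)
{
    if (sent_len < LOG_SIZE) {
        sent[sent_len++] = b;
    }
}

/* Cold path: push every stored entry, oldest first, through a byte sink.
   Returns the number of entries dumped. */
static uint8_t log_dump(void (*put)(uint8_t))
{
    uint8_t i;
    for (i = 0; i < log_len; i++) {
        put(log_buf[i]);
    }
    return log_len;
}
```

Usage would be a log_byte(code) at each point of interest and a single log_dump(uart_put) once the device has been exercised. This version keeps the first LOG_SIZE entries and drops the rest; a ring buffer that keeps the most recent entries would be the other obvious choice.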
Option 2 - Status Port
An alternative would be to write a byte or two to IO ports and log them on an external device.
The problem here is that now there's yet another device that has to match the speed of the CPU. According to the datasheet, an IO operation only takes 1 clock cycle so that's good.
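As a sketch of the status port idea, assuming a whole 8-bit port is free: I've picked PORTB purely as an example, and status_init() and status_write() are hypothetical names. The non-AVR branch is only there so the code compiles and can be poked at on a PC. On parts where the port lives in the lower IO space, avr-gcc will typically turn the assignment into a single `out` instruction, but if the value isn't already sitting in a register there's a load in front of it, so it's worth checking the generated assembly rather than trusting the 1-cycle figure.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#define STATUS_PORT PORTB   /* assumption: all 8 pins of port B are spare */
#define STATUS_DDR  DDRB
#else
/* Host stand-ins so the sketch also builds off-target. */
static volatile uint8_t fake_port;
static volatile uint8_t fake_ddr;
#define STATUS_PORT fake_port
#define STATUS_DDR  fake_ddr
#endif

/* Configure all eight pins of the status port as outputs. */
static inline void status_init(void)
{
    STATUS_DDR = 0xFF;
}

/* Hot path: put one log byte on the pins for the external logger to sample. */
static inline void status_write(uint8_t code)
{
    STATUS_PORT = code;
}
```

The catch is that the byte only exists on the pins until the next write, so the external device has to sample at least as fast as the code writes.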
Logic Analyser
In both cases it'd really help to have the "log entry" be just a single byte. Writing two bytes to SRAM takes twice as long. Outputting two bytes means writing to two IO ports and needs a 16-bit logic analyser to read them.
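One way to squeeze a log entry into a single byte is to split it into bit fields. The layout below is an assumption of mine, not anything decided yet: the top 3 bits say which checkpoint in the code produced the entry (the "site") and the bottom 5 bits carry a small value. The names entry_pack(), entry_site() and entry_value() are made up for the sketch.

```c
#include <stdint.h>

#define SITE_BITS   3u                          /* up to 8 checkpoints */
#define VALUE_BITS  5u                          /* values 0..31 */
#define SITE_MASK   ((1u << SITE_BITS) - 1u)    /* 0x07 */
#define VALUE_MASK  ((1u << VALUE_BITS) - 1u)   /* 0x1F */

/* Build one log byte; out-of-range inputs are truncated to fit their field. */
static inline uint8_t entry_pack(uint8_t site, uint8_t value)
{
    return (uint8_t)(((site & SITE_MASK) << VALUE_BITS) | (value & VALUE_MASK));
}

/* Recover the checkpoint number from a log byte. */
static inline uint8_t entry_site(uint8_t entry)
{
    return (uint8_t)(entry >> VALUE_BITS);
}

/* Recover the small value from a log byte. */
static inline uint8_t entry_value(uint8_t entry)
{
    return (uint8_t)(entry & VALUE_MASK);
}
```

Packing at run time costs cycles of its own, so where the whole entry is a constant it's better to let the compiler fold it and just write the finished byte. The unpacking only ever needs to happen on the PC side.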
What's all this about logic analysers?
I've been reading articles about using a Blue Pill, a Pi Pico, etc, etc as a DIY data logger. And there's an issue with all of these options: compared to oscilloscopes or dedicated logic analysers they're SO SLOW. Although the μCs on these boards "go like the clappers" compared to my humble AVRs, there's a considerable delay caused by their IO subsystems.
This is filling me with doubts about the certainty of "IO operations will take 1 clock cycle" that I mentioned above and I'm going to have to do even more experiments to alleviate them.
Anyway... anyway... anyway... let's worry about that later.
This problem has a nice, simple, "ready rolled" solution. The Cypress EZ-USB chips have a nice firmware bundle that can drag data off an 8-bit or 16-bit parallel bus and cram it down USB as fast as can be done.
Again, even here, we're trying to constrain the debug information "record" to one byte, as the EZ-USB can do 8 bits much faster than it can do 16 bits.