The BBC Micro comes equipped with a very good 6502 assembler that makes the translation of 6502 mnemonics to 6502 machine code easy. The reverse process - converting 6502 machine code to 6502 mnemonics - may seem pointless until you attempt to decipher how a machine code program works. Obviously if you have the assembly language source for the machine code then all you have to do is study it. If for any reason you don't have access to the assembly language source then one way or another you have to attempt to reconstruct it from the machine code.
A program that changes machine code to assembly language is called a disassembler. The main theme of this chapter is the production of a 6502 disassembler but this takes us into some surprising areas concerning the very fundamentals of microprocessors, look-up tables and program design. As well as providing plenty of scope for discussion, the final program is something that every BBC Micro owner needs now and again. The reason for this is that the BBC Micro contains 32K of very important software that is available only as machine code - BBC BASIC and the MOS. Some of the topics are admittedly rather advanced. In fact, one or two are the sort of thing that you would normally only encounter on a computer science degree so you can be pleased if they cause you no trouble! On the other hand, if you find any of the material tough going do not despair: you can still make use of the disassembler and understand many of the points of its implementation without understanding everything in this chapter.
The most obvious approach to constructing a disassembler, or an assembler for that matter, is to use a look-up table (see Chapter Five). The basic algorithm of a 'table driven' disassembler is very simple: the contents of each memory location are used to index a pair of tables that give:
(1) the three letter mnemonic to which it corresponds and
(2) the addressing mode (e.g. absolute, immediate, etc.) that is in use.
If the contents of a memory location do not correspond to a legal 6502 instruction then a special 'invalid operation' mnemonic, "***" for example, would be returned and the next memory location examined. Values within a machine code program that do not correspond to any 6502 operation are most likely to be generated by data within the original assembly language source. However, the whole question of how to differentiate automatically between areas of code and data is one stage beyond the simple disassembler that is currently under discussion.
The only real difficulty with the look-up tables for the mnemonics and addressing modes is their size. To allow an entry for every possible single byte value needs two arrays each with 256 elements. That is,
DIM M$(255),AM(255)
where M$(I) holds the three letter mnemonic for the op code I and AM(I) holds an integer that indicates the addressing mode. For example, M$(&65) would hold "ADC" and AM(&65) would indicate zero page addressing. (&65 is the machine code for ADC using zero page addressing!) If you think about the pattern of entries in these tables you will quickly realise that there are many duplicate entries. For example, the mnemonic for ADC occurs eight times, once for each of its possible addressing modes. Entering the complete set of 6502 mnemonics to form a table is a tedious enough task without entering some of them as many as eight times!
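The idea can be tried out with a minimal sketch of the full-size tables. Only the two ADC entries mentioned above are filled in, every other entry is left as "***", and the values 3 and 1 stored in AM() for zero page and immediate addressing are arbitrary codes chosen for this illustration only.

```basic
10 REM Sketch only: full-size tables with just two entries filled in
20 DIM M$(255),AM(255)
30 FOR I=0 TO 255
40 M$(I)="***":AM(I)=0
50 NEXT I
60 REM Mode codes assumed for this sketch: 1=immediate, 3=zero page
70 M$(&65)="ADC":AM(&65)=3
80 M$(&69)="ADC":AM(&69)=1
90 INPUT "Op code (hex) ",A$
100 D=EVAL("&"+A$) AND &FF
110 PRINT M$(D);" mode code ";AM(D)
120 GOTO 90
```

Typing 65 or 69 should report ADC with the corresponding mode code, and any other value reports "***". Press ESCAPE to leave the loop.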
If you examine a 6502 instruction table (there is one at the back of the BBC Micro's User Guide) you should be able to see a pattern in the values for the op codes corresponding to a single instruction in each of its different address modes. Any pattern in the entries in a look-up table is worth investigating because it is often possible to make use of it to reduce the number of entries in the table.
To a certain extent the 6502 chip within the BBC Micro has the problem of machine code disassembly every time it executes a program. After reading an op code from memory the 6502 has to decode it and determine what actions have to take place. For example, after reading the op code &65 the 6502 has to decode it to 'discover' that it means ADC using zero page addressing mode and then carry out the addition. The restrictions of building logic circuits on silicon make it necessary to find a way of making the decoding process as simple as possible. For this reason there is a very regular pattern to be found in the op code values used in any microprocessor and the 6502 is no exception.
By examining the table of instruction codes it is possible to deduce that the internal format of a 6502 op code is:
b7 b6 b5 | b4 b3 b2 | b1 b0 |
instruction code | addressing mode code | group code |
Only the top three bits (b7, b6 and b5) of each op code are used to indicate the instruction that is to be carried out. The middle three bits (b4, b3 and b2) indicate which addressing mode is in use. However, using only three bits to determine the operation would limit the 6502 to a total of eight different instructions! To increase the total number of instructions the bottom two bits of the op code (b1 and b0) are used to provide four different groups of instructions. Thus the ADC zero page instruction, &65, can be analysed by writing down its op code in binary as a group one instruction:
011 | 001 | 01 |
instruction code 3 | mode code 1 | group 1 |
The instruction code for ADC is 3 and the mode code for zero page addressing is 1. In the same way ADC immediate, &69, is:
011 | 010 | 01 |
instruction code 3 | mode code 2 | group 1 |
which again shows that ADC is a group 1 instruction and its instruction code is 3. The only change to the op code in going from ADC zero page to ADC immediate is carried in the three mode bits, the addressing mode code for zero page addressing being 1 and for immediate addressing 2.
You should be able to appreciate that this division of the bits that make up an op code into an instruction code, an addressing mode code and a group code can greatly simplify the decoding that is involved in both a microprocessor and a disassembler. For example, if the bottom two bits of an op code indicate a group one instruction and the top three bits give an instruction code of 3 then the command is ADC. Subsequent examination of the mode bits to give the current addressing mode completes the decoding.
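The same division can be checked in the opposite direction by building op codes from their three fields. The following short sketch, which is an illustration rather than part of the disassembler, prints the op code for ADC (instruction code 3, group 1) in each of the eight mode codes.

```basic
10 REM Build all eight ADC op codes from their three bit fields
20 OP=3:REM instruction code for ADC
30 GR=1:REM group code
40 FOR MO=0 TO 7
50 PRINT "Mode ";MO;" op code &";~(OP*32+MO*4+GR)
60 NEXT MO
```

The values produced are &61, &65, &69, &6D, &71, &75, &79 and &7D, which you can compare with the ADC entries in a 6502 instruction table.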
Before the internal structure of the op code can be used to produce a simplified disassembler, it is necessary to classify each 6502 instruction by group code and produce tables relating instruction codes to mnemonics and addressing mode codes to actual addressing modes. The way that this task was tackled originally was with the aid of the following program:
10 INPUT A$
20 A=FNdec(A$)
30 GR=A AND &03
40 MO=(A AND &1C)/4
50 OP=(A AND &E0)/32
60 PRINT "Code=";~A;" Group=";~GR;" OP=";OP;" Mode ";MO
70 GOTO 10
80 DEF FNdec(A$)=EVAL("&"+A$)
This will analyse any 6502 op code and print out its group, instruction code and addressing mode code.
For reasons that will become apparent, it is better to start with an examination of the group 1 table:
Instruction code | mode 0 | mode 1 | mode 2 | mode 3 | mode 4 | mode 5 | mode 6 | mode 7 |
0 | ORA | ORA | ORA | ORA | ORA | ORA | ORA | ORA |
1 | AND | AND | AND | AND | AND | AND | AND | AND |
2 | EOR | EOR | EOR | EOR | EOR | EOR | EOR | EOR |
3 | ADC | ADC | ADC | ADC | ADC | ADC | ADC | ADC |
4 | STA | STA | STA | STA | STA | STA | STA | STA |
5 | LDA | LDA | LDA | LDA | LDA | LDA | LDA | LDA |
6 | CMP | CMP | CMP | CMP | CMP | CMP | CMP | CMP |
7 | SBC | SBC | SBC | SBC | SBC | SBC | SBC | SBC |
address mode | (z,X) | zero page | imm | abs | (z),Y | z,X | c,Y | c,X |
(z = zero page address and c = two byte constant)
From this table you can see that all eight group 1 instructions are operations on the A register. Another striking feature is that each instruction can be used in any of the eight possible addressing modes. (As will be demonstrated this is not the case with the other instruction groups.) Also notice that the use of just three bits to code the addressing mode limits any 6502 instruction to eight addressing modes at most despite the fact that the 6502 supports 13 different addressing modes. This, to a certain extent, accounts for the odd collection of rules about which addressing mode can be used with which instruction.
As far as the disassembler is concerned, group 1 instructions are very easy to decode. All that has to be done is to use the instruction code bits to index a table of the eight mnemonics and the address mode bits to index a table that gives the addressing mode in use. The mnemonic table is simply:
Instruction Code | Mnemonic |
0 | ORA |
1 | AND |
2 | EOR |
3 | ADC |
4 | STA |
5 | LDA |
6 | CMP |
7 | SBC |
If the 13 addressing modes of the 6502 are coded as:
mode | imm | abs | zero page | acc | imp | (z,X) | (z),Y | z,X | c,X | c,Y | rel | indir | z,Y |
code | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
then the correspondence between the addressing mode code in group 1 instructions and the actual addressing mode is:
addressing mode code (b4,b3,b2) | actual mode code |
0 | 6 |
1 | 3 |
2 | 1 |
3 | 2 |
4 | 7 |
5 | 8 |
6 | 10 |
7 | 9 |
This may seem a little complicated at first but the reward is reducing the 64 entries in the simple look-up tables to 8. For example, to decode the opcode &1D it is first split up to give:
Group=1, Instruction code=0, Mode=7
As this is a group 1 instruction the two tables given above yield ORA for the mnemonic and address mode 9 or c,X (absolute indexed addressing using the X register), which are both correct.
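The worked example can be reproduced with a few lines of BASIC. This is a sketch of the group 1 look-up on its own, using the two eight-entry tables given above; the array names G$ and AM% are chosen just for this illustration.

```basic
10 REM Group 1 decoding with two eight-entry tables
20 DIM G$(7),AM%(7)
30 FOR I=0 TO 7:READ G$(I):NEXT I
40 FOR I=0 TO 7:READ AM%(I):NEXT I
50 DATA ORA,AND,EOR,ADC,STA,LDA,CMP,SBC
60 DATA 6,3,1,2,7,8,10,9
70 D=&1D
80 IF (D AND 3)<>1 THEN PRINT "Not a group 1 op code":END
90 OP=(D AND &E0) DIV 32
100 MO=(D AND &1C) DIV 4
110 PRINT G$(OP);" uses addressing mode ";AM%(MO)
```

With D set to &1D the sketch reports ORA and addressing mode 9. Changing line 70 lets you try other group 1 op codes.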
The situation with group 0 instructions is not so simple. The table for group 0 is:
Instruction code | mode 0 | mode 1 | mode 2 | mode 3 | mode 4 | mode 5 | mode 6 | mode 7 |
0 | BRK | *** | PHP | *** | BPL | *** | CLC | *** |
1 | JSRabs | BIT | PLP | BIT | BMI | *** | SEC | *** |
2 | RTI | *** | PHA | JMP | BVC | *** | CLI | *** |
3 | RTS | *** | PLA | JMPind | BVS | *** | SEI | *** |
4 | *** | STY | DEY | STY | BCC | *** | TYA | *** |
5 | LDYimm | LDY | TAY | LDY | BCS | *** | CLV | *** |
6 | CPYimm | CPY | INY | CPY | BNE | *** | CLD | *** |
7 | CPXimm | CPX | INX | CPX | BEQ | *** | SED | *** |
address mode | impl | zero page | impl | abs | rel | - | impl | - |
(where the addressing modes are as shown at the bottom of each column unless otherwise indicated within the table next to the instruction concerned)
This table reveals a much more complicated pattern. In particular, the instruction code part of the op code is not the sole determinant of the instruction type. Indeed, for group 0 instructions the addressing mode code also determines the instruction type. For example, there are five different instructions corresponding to instruction code=3 - that is, RTS, PLA, JMP, BVS and SEI. However, this is not to say that the neat pattern of group 1 instructions is entirely lost. In the main, the instructions corresponding to a particular value of mode do use the same addressing mode. For example, all the instructions with mode=1 use zero page addressing, those with mode=2 use implied addressing, those with mode=4 use relative addressing and those with mode=6 use implied addressing. When mode=0 or mode=3 the situation is a little more difficult. The instructions with mode=0 are a mixture of implied addressing (BRK, RTI and RTS), absolute addressing (JSR) and immediate addressing (LDY, CPY and CPX), while those with mode=3 are all absolute except for the JMP with instruction code 3, which is indirect. Even so you should be able to see that there is a good overall regularity which simplifies the decoding process for the 6502's logic circuits. (For example, all mode 4 instructions are branches and all mode 6 instructions except TYA clear or set flags.) As far as the disassembler is concerned the simplest solution is to use the table as it stands, i.e. as a two-dimensional look-up table indexed by the instruction and address mode parts of the op code. The look-up table for the addressing modes is best treated as a one-dimensional table with corrections, as in the following:
addressing mode code | actual addressing mode |
0 | 5 |
1 | 3 |
2 | 5 |
3 | 2 |
4 | 11 |
5 | 0 |
6 | 5 |
7 | 0 |
In the above table actual addressing code 0 is used to indicate that there are no 6502 instructions using this mode in group 0. Of course this table is not always correct - a mode 0 instruction doesn't always use implied addressing and mode 3 instructions don't always use absolute addressing but the departures can be easily detected and corrected using IF statements.
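As a sketch of what 'corrected using IF statements' means in practice, the following fragment looks up only the addressing mode of a group 0 op code, starting from the one-dimensional table above. The array name T% is an assumption made for this illustration, and the mnemonic look-up is left out, so illegal op codes are not detected here.

```basic
10 REM Group 0 addressing modes: one table plus corrections
20 DIM T%(7)
30 FOR I=0 TO 7:READ T%(I):NEXT I
40 DATA 5,3,5,2,11,0,5,0
50 INPUT "Group 0 op code (hex) ",A$
60 D=EVAL("&"+A$) AND &FF
70 IF (D AND 3)<>0 THEN PRINT "Not group 0":GOTO 50
80 OP=D DIV 32
90 MO=(D AND &1C) DIV 4
100 ATYPE=T%(MO)
110 IF MO=0 AND OP=1 THEN ATYPE=2:REM JSR is absolute
120 IF MO=0 AND OP>=5 THEN ATYPE=1:REM LDY, CPY, CPX immediate
130 IF MO=3 AND OP=3 THEN ATYPE=12:REM JMP indirect
140 PRINT "Actual addressing mode code ";ATYPE
150 GOTO 50
```

For example, 20 (JSR) should give mode code 2, A0 (LDY immediate) should give 1 and 6C (JMP indirect) should give 12. Press ESCAPE to stop.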
Group 2 instructions are also messy but in a different way to group 0 instructions:
Instruction code | mode 0 | mode 1 | mode 2 | mode 3 | mode 4 | mode 5 | mode 6 | mode 7 |
0 | *** | ASL | ASL | ASL | *** | ASL | *** | ASL |
1 | *** | ROL | ROL | ROL | *** | ROL | *** | ROL |
2 | *** | LSR | LSR | LSR | *** | LSR | *** | LSR |
3 | *** | ROR | ROR | ROR | *** | ROR | *** | ROR |
4 | *** | STX | TXAimp | STX | *** | STXz,Y | TXS | *** |
5 | LDX | LDX | TAXimp | LDX | *** | LDXz,Y | TSX | LDXc,Y |
6 | *** | DEC | DEXimp | DEC | *** | DEC | *** | DEC |
7 | *** | INC | NOPimp | INC | *** | INC | *** | INC |
address mode | imm | zero page | acc | abs | - | z,X | impl | c,X |
(where the addressing modes are as shown at the bottom of each column unless otherwise indicated within the table next to the instruction concerned)
The interesting thing about this table is that the top half (i.e. Instruction code = 0 to 3) is entirely regular but the bottom half is irregular. The look-up table for this group could also be created as a two-dimensional array as for the group 0 instructions. However, overall the regularities in the table outnumber the irregularities and it is easier to use a one-dimensional table based on the pattern of mnemonics in the mode=1 column and use IF statements to correct for the departures from regularity. In the same way the addressing mode look-up table can be one-dimensional:
Addressing mode code | Actual addressing mode used |
0 | 1 |
1 | 3 |
2 | 4 |
3 | 2 |
4 | 0 |
5 | 8 |
6 | 5 |
7 | 9 |
The departures from this pattern are taken care of by using IF statements.
After the complications introduced by the irregularities of group 0 and group 2 instructions it is a relief to discover that the 6502 doesn't have any group 3 instructions!
Apart from reducing the size of the look-up tables needed for a disassembler, this study of the structure of the 6502 instruction set explains many of the odd patterns of addressing modes that can be used with any particular instruction. One opinion is that an advanced microprocessor should allow every instruction to use any and all addressing modes that make sense. From this point of view the 6502 is a very poor micro indeed as its rules for which addressing modes can be used with which instructions contain many seemingly arbitrary restrictions. For example, why can't any addressing mode that uses the Y index register be used with the ROL or ROR instructions? A simple explanation would be to say that ROL and ROR are group 2 instructions but this misses the point that there are three unused addressing modes in the group 2 table and the whole of the group 3 table is empty! However, this argument can be countered by pointing out that by using only a single byte for the op code (other microprocessors can use two or even three byte op codes) the 6502 provides a good range of instructions, each of which can be used with an appropriate collection of addressing modes. From this point of view the 6502's design sacrifices simplicity of use for a compact and efficient instruction set.
After so much analysis the 6502 disassembler is fairly easy to write. The procedures used are shown in Table 11.1.
Table 11.1
Name | Line number | Function |
initialise | 120 | Sets up the look-up tables for each instruction group. |
getparams | 520 | Inputs the start and end addresses for the disassembly. |
code(D) | 610 | Returns the mnemonic (in M$) and addressing mode (in ATYPE) corresponding to the op code in D. |
gzero(OP,MO) | 710 | Performs the look-up for group 0 instructions. |
gone(OP,MO) | 780 | Performs the look-up for group 1 instructions. |
gtwo(OP,MO) | 820 | Performs the look-up for group 2 instructions. |
gthree(OP,MO) | 990 | Returns M$=*** and ATYPE=0 for non-existent group 3 instructions. |
add(ATYPE) | 1030 | Uses ATYPE to return a string (AD$) that contains the address field of the instruction. Also returns the contents of the memory locations that it uses to construct the address as hex numbers in B$. |
PROCinitialise sets up six different look-up tables, one pair for each instruction group. G1$ and A1% hold the mnemonics and addressing modes for group 1 instructions, G0$ and A0% hold those for group 0 instructions and G2$ and A2% hold those for group 2 instructions. If you look at PROCcode you will see that, apart from dividing up the three parts of the op code (lines 630 to 650), all it does is call the correct PROC to deal with the particular group that the instruction belongs to (lines 660 to 690).
PROCgone is the simplest because the look-up tables can be indexed by OP and MO and the result used without correction. PROCgzero uses the two-dimensional look-up table for the mnemonic and a one-dimensional table for the addressing mode. Unlike PROCgone, the results from the look-up table have to be corrected for irregularities in the addressing modes, such as JSR (in the column corresponding to mode=0) using absolute addressing rather than implied. These addressing mode irregularities are corrected by the IF statements in lines 740 to 760. (The listing stores immediate addressing as the default for mode 0 and corrects BRK, RTI and RTS to implied; the result is the same as starting from implied and correcting LDY, CPY and CPX.) PROCgtwo uses a pair of one-dimensional look-up tables like PROCgone but in this case there are a large number of corrections to both addressing mode and mnemonic. These are dealt with by the IF statements in lines 850 to 970. PROCgthree is almost a dummy routine that returns "***" and ATYPE=0. It is included more for completeness than anything else.
PROCadd uses the information about the addressing mode stored in ATYPE to construct the address field that goes with the instruction. For example, if ATYPE is 1 then the addressing mode is immediate and the address field is constructed by 'peeking' the value in the next memory location, converting it to a string of hex digits and then adding a "#" in front. Each of the addressing modes has a corresponding IF statement that constructs the instruction's address field in AD$.
The only other complication is the need to return the contents of any memory locations that are used to construct the address field as hex digits in B$ so that they can be listed alongside the disassembled instructions.
Three functions are also used:
Table 11.2
Name | Line number | Action |
dec(A$) | 1230 | Converts hex string to decimal number. |
rel(A) | 1240 | Returns the positive or negative offset used in relative addressing from the positive value returned by 'peeking' the address field. |
hex(A) | 1260 | Converts decimal number in A to a hex string. |
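The arithmetic behind relative addressing is worth seeing on its own. A branch offset byte of 0 to 127 means a forward jump and 128 to 255 means a backward jump of 256 minus the byte, measured from the address of the instruction that follows the two-byte branch. The sketch below is a stand-alone illustration; the variable names P and B are assumptions and the function is a copy of the idea behind FNrel.

```basic
10 REM Where does a branch at address P with offset byte B go?
20 INPUT "Address of branch op code (hex) ",A$
30 P=EVAL("&"+A$)
40 INPUT "Offset byte (hex) ",A$
50 B=EVAL("&"+A$) AND &FF
60 PRINT "Branch target is &";~(P+2+FNrel(B))
70 END
80 DEF FNrel(A)
90 IF A<128 THEN =A ELSE =A-256
```

For a branch at &1900 with an offset byte of &FB the offset is -5, so the target is &1902-5, that is &18FD.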
The complete program is:
10 PROCinitialise
20 PROCgetparams:REM SADD,EADD returned
30 A=SADD
40 REPEAT
50 PRINT TAB(0);~A;" ";
60 PRINT TAB(8);FNhex(?A);" ";
70 PROCcode(?A):REM M$,ATYPE returned
80 PROCadd(ATYPE):REM AD$ returned
90 PRINT B$,M$;" ";AD$
100 UNTIL A>EADD
110 END
120 DEF PROCinitialise
130 LOCAL I,J
140 DATA ORA,AND,EOR,ADC,STA,LDA,CMP,SBC
150 DATA 6,3,1,2,7,8,10,9
160 DATA BRK,JSR,RTI,RTS,***,LDY,CPY,CPX
170 DATA ***,BIT,***,***,STY,LDY,CPY,CPX
180 DATA PHP,PLP,PHA,PLA,DEY,TAY,INY,INX
190 DATA ***,BIT,JMP,JMP,STY,LDY,CPY,CPX
200 DATA BPL,BMI,BVC,BVS,BCC,BCS,BNE,BEQ
210 DATA CLC,SEC,CLI,SEI,TYA,CLV,CLD,SED
220 DATA 1,3,5,2,11,0,5,0
230 DATA ASL,ROL,LSR,ROR,STX,LDX,DEC,INC
240 DATA 1,3,4,2,0,8,5,9
250 DIM G1$(7)
260 FOR I=0 TO 7
270 READ G1$(I)
280 NEXT I
290 DIM A1%(7)
300 FOR I=0 TO 7
310 READ A1%(I)
320 NEXT I
330 DIM G0$(7,7)
340 FOR I=0 TO 7
350 FOR J=0 TO 7
360 IF I=5 OR I=7 THEN G0$(I,J)="***" ELSE READ G0$(I,J)
370 NEXT J
380 NEXT I
390 DIM A0%(7)
400 FOR I=0 TO 7
410 READ A0%(I)
420 NEXT I
430 DIM G2$(7)
440 FOR I=0 TO 7
450 READ G2$(I)
460 NEXT I
470 DIM A2%(7)
480 FOR I=0 TO 7
490 READ A2%(I)
500 NEXT I
510 ENDPROC
520 DEF PROCgetparams
530 LOCAL A$
540 REPEAT
550 INPUT "START AT (HEX)",A$
560 SADD=FNdec(A$)
570 INPUT "END AT (HEX)",A$
580 EADD=FNdec(A$)
590 UNTIL SADD<EADD
600 ENDPROC
610 DEF PROCcode(D)
620 LOCAL MO,GROUP,OP
630 OP=(D AND &E0)/32
640 MO=(D AND &1C)/4
650 GROUP=D AND &03
660 IF GROUP=0 THEN PROCgzero(OP,MO)
670 IF GROUP=1 THEN PROCgone(OP,MO)
680 IF GROUP=2 THEN PROCgtwo(OP,MO)
690 IF GROUP=3 THEN PROCgthree(OP,MO)
700 ENDPROC
710 DEF PROCgzero(OP,MO)
720 M$=G0$(MO,OP)
730 ATYPE=A0%(MO)
740 IF MO=0 AND OP=1 THEN ATYPE=2
750 IF MO=0 AND (OP=0 OR OP=2 OR OP=3) THEN ATYPE=5
760 IF MO=3 AND OP=3 THEN ATYPE=12
770 ENDPROC
780 DEF PROCgone(OP,MO)
790 M$=G1$(OP)
800 ATYPE=A1%(MO)
810 ENDPROC
820 DEF PROCgtwo(OP,MO)
830 M$=G2$(OP)
840 ATYPE=A2%(MO)
850 IF ATYPE=0 THEN M$="***"
860 IF MO=0 AND OP<>5 THEN M$="***"
870 IF MO=2 AND OP=4 THEN M$="TXA":ATYPE=5
880 IF MO=2 AND OP=5 THEN M$="TAX":ATYPE=5
890 IF MO=2 AND OP=6 THEN M$="DEX":ATYPE=5
895 IF MO=2 AND OP=7 THEN M$="NOP":ATYPE=5
900 IF MO=4 THEN M$="***"
910 IF MO=5 AND (OP=4 OR OP=5) THEN ATYPE=13
920 IF MO=6 THEN M$="***"
930 IF MO=6 AND OP=4 THEN M$="TXS":ATYPE=5
940 IF MO=6 AND OP=5 THEN M$="TSX":ATYPE=5
950 IF MO=7 AND OP=4 THEN M$="***"
960 IF MO=7 AND OP=5 THEN M$="LDX":ATYPE=10
970 IF MO=7 AND OP=6 THEN M$="DEC"
980 ENDPROC
990 DEF PROCgthree(OP,MO)
1000 M$="***"
1010 ATYPE=0
1020 ENDPROC
1030 DEF PROCadd(ATYPE)
1035 IF M$="***" THEN ATYPE=0:REM invalid op codes take no operand bytes
1040 IF ATYPE=1 THEN A=A+1:AD$="#"+FNhex(?A)
1050 IF ATYPE=2 THEN A=A+1:AD$=FNhex(?A+256*?(A+1)):A=A+1
1060 IF ATYPE=3 THEN A=A+1:AD$=FNhex(?A)
1070 IF ATYPE=4 THEN AD$="A"
1080 IF ATYPE=5 THEN AD$=""
1090 IF ATYPE=6 THEN A=A+1:AD$="("+FNhex(?A)+",X)"
1100 IF ATYPE=7 THEN A=A+1:AD$="("+FNhex(?A)+"),Y"
1110 IF ATYPE=8 THEN A=A+1:AD$=FNhex(?A)+",X"
1120 IF ATYPE=9 THEN A=A+1:AD$=FNhex(?A+256*?(A+1))+",X":A=A+1
1130 IF ATYPE=10 THEN A=A+1:AD$=FNhex(?A+256*?(A+1))+",Y":A=A+1
1140 IF ATYPE=11 THEN A=A+1:AD$=FNhex(FNrel(?A)+A+1)
1150 IF ATYPE=12 THEN A=A+1:AD$="("+FNhex(?A+256*?(A+1))+")":A=A+1
1160 IF ATYPE=13 THEN A=A+1:AD$=FNhex(?A)+",Y"
1170 A=A+1
1180 IF M$="***" THEN AD$="":ATYPE=0
1190 B$=FNhex(?(A-1))
1200 IF ATYPE=2 OR ATYPE=9 OR ATYPE=10 OR ATYPE=12 THEN B$=FNhex(?(A-2))+" "+FNhex(?(A-1))
1210 IF ATYPE=4 OR ATYPE=5 OR ATYPE=0 THEN B$=""
1220 ENDPROC
1230 DEF FNdec(A$)=EVAL("&"+A$)
1240 DEF FNrel(A)
1250 IF A<128 THEN =A ELSE =A-256
1260 DEF FNhex(A)
1270 LOCAL B,I,A$
1280 FOR I=1 TO 4
1290 B=A AND &F
1300 A=A DIV &10
1310 IF B<10 THEN A$=CHR$(B+48)+A$ ELSE A$=CHR$(B+55)+A$
1320 NEXT I
1330 IF MID$(A$,1,2)="00" THEN A$=RIGHT$(A$,2)
1340 =A$
The main program is easy enough to understand by reference to the procedure table (Table 11.1). The only other information that might help is that A contains the address of the memory location that is currently being examined for a possible op code.
The disassembler listed above is a debugged first version. That is, although a few lines have been changed and even a few lines added during debugging, the overall structure, the procedures and their internal operation are unchanged from the first attempt. After the extensive analysis of the form of the look-up tables given in the first part of this chapter it is not surprising that no major changes were necessary but this does not mean that the structure of the program is entirely up to standard. PROCinitialise could be made more compact by using a pair of two-dimensional arrays for the mnemonic and address mode look-up tables but this would make the rest of the program much more difficult to follow. The most unsatisfactory procedure in the whole program is PROCadd. An examination of the statements that follow the THENs shows immediately that there is much duplication of effort. It would be better to assemble the data items that most of the IF statements use before the IF statements are executed. For example, FNhex(?A) and FNhex(?A+256*?(A+1)) should both be worked out and stored in string variables at the start of the procedure. Also the variable that keeps track of the current location in memory, A, should not be incremented by each of the IF statements. This makes the task of printing the hex values stored in the memory locations alongside the disassembled instruction much more difficult than it need be (hence lines 1190 to 1210).
Even though there is plenty of scope for improvement (definite proof that it is often better to rewrite first attempts at procedures), the program works and is a useful tool.
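One way the rewrite of PROCadd might go is sketched below. It is not a tested replacement. It is given a new name, PROCadd2, and line numbers from 2000 so that it can be added to the end of the listing, and it assumes the listing's FNhex, FNrel and global variables A, M$, AD$ and B$. The local names N, Z$ and W$ are inventions for this sketch. To try it, line 80 of the main program would have to call PROCadd2(ATYPE) instead of PROCadd(ATYPE).

```basic
2000 DEF PROCadd2(ATYPE)
2010 LOCAL N,Z$,W$
2020 IF M$="***" THEN ATYPE=0
2030 REM Work out both possible operand strings once, before any decisions
2040 Z$=FNhex(?(A+1))
2050 W$=FNhex(?(A+1)+256*?(A+2))
2060 REM N is the number of operand bytes for this addressing mode
2070 N=1
2080 IF ATYPE=0 OR ATYPE=4 OR ATYPE=5 THEN N=0
2090 IF ATYPE=2 OR ATYPE=9 OR ATYPE=10 OR ATYPE=12 THEN N=2
2100 AD$=""
2110 IF ATYPE=1 THEN AD$="#"+Z$
2120 IF ATYPE=2 THEN AD$=W$
2130 IF ATYPE=3 THEN AD$=Z$
2140 IF ATYPE=4 THEN AD$="A"
2150 IF ATYPE=6 THEN AD$="("+Z$+",X)"
2160 IF ATYPE=7 THEN AD$="("+Z$+"),Y"
2170 IF ATYPE=8 THEN AD$=Z$+",X"
2180 IF ATYPE=9 THEN AD$=W$+",X"
2190 IF ATYPE=10 THEN AD$=W$+",Y"
2200 IF ATYPE=11 THEN AD$=FNhex(A+2+FNrel(?(A+1)))
2210 IF ATYPE=12 THEN AD$="("+W$+")"
2220 IF ATYPE=13 THEN AD$=Z$+",Y"
2230 REM The operand bytes are listed low byte first, as they lie in memory
2240 B$=""
2250 IF N=1 THEN B$=Z$
2260 IF N=2 THEN B$=Z$+" "+FNhex(?(A+2))
2270 REM A is advanced in one place only: op code plus operand bytes
2280 A=A+1+N
2290 ENDPROC
```

Because A still points at the op code while the address field is being built, the operand bytes are always at A+1 and A+2 and the special cases that pick them out after the event are no longer needed.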
After using the program a few times the only fatal error that occurred was BAD HEX due to mistyping of hex numbers in PROCgetparams. This is easily cured by adding:
1 ON ERROR RUN
This is a crude but effective error catch-all.
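One side effect of ON ERROR RUN is that pressing ESCAPE also counts as an error, so the program restarts instead of stopping. A slightly less crude variant is sketched below. It assumes that lines 1 to 4 are free and that the main program starts at line 10 as in the listing, and it restarts on any error except Escape (error number 17).

```basic
1 ON ERROR GOTO 3
2 GOTO 10
3 IF ERR=17 THEN REPORT:PRINT:END
4 PRINT "Input error, starting again":RUN
```

The message printed in line 4 is only a suggestion. REPORT prints the usual error message so that an ESCAPE is still acknowledged on the screen.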
Using a disassembler is as easy as specifying the start and end address of the block of code that you want disassembled! Of course interpreting the output is much more difficult. Knowing where the machine code program starts is a great help but even then it is possible to run into an area of data and illegal op codes. Although illegal op codes are an almost sure sign that you have encountered a data area it is possible for a data area to contain legal op codes and so give the impression that you are still disassembling part of a program. In practice, data areas are identified both by containing the occasional illegal op code and by being referenced in address fields of other parts of the program.
The other main technique in understanding a disassembly is tracing the flow of control. Starting from the first instruction of the program it is possible to follow through each branch, JMP and JSR instruction and mark their destinations. In this way all of the possible paths through the program can be identified before the task of trying to understand what is happening in each part is begun. Going through and marking RTS instructions serves to identify candidates for the end points of subroutines. Similarly, the instructions that follow them are possible starting points of subroutines.
Even after using these hints you will still need a great deal of skill and ingenuity to decipher a disassembly. The one guiding principle is that you should always work out how you would write a program or subroutine to achieve the same result before you look at the output of a disassembler. If you have chosen the same method as the program being disassembled then you will quickly recognise the essential features of the method. If you haven't, then the disassembly will remain unfathomable and after a while you should stop studying it and try to think of another way of achieving the same ends. When you have identified the overall algorithm that is being used it is surprising how quickly the details fall into place.
The disassembler described in this chapter is far from the last word in sophistication on how to produce an easy-to-understand disassembly of a program. The following are some suggestions that you might like to add to the program, none of them very difficult:
(1) Allow the user to specify the location of known data areas. In these data areas the disassembler should output the contents of memory as hex and, where possible, ASCII characters.
(2) Allow the user to specify a list of known labels and their corresponding addresses. Each address that the disassembler produces should be checked against the list to see if a label has been defined. If it has then the label should be used in preference to the numeric address. For example, instead of JSR &FFF4 the disassembler would print JSR OSBYTE (assuming that OSBYTE=&FFF4 had been included in the list of label definitions).
(3) Mark all branch, JSR and JMP instructions automatically so that they can be found quickly. Also mark their corresponding destination addresses. In other words, mark all the exit and entry points within the disassembly.
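As a starting point for suggestion (2), the label list can be held in a pair of arrays and searched by a function. The following stand-alone sketch shows the idea. The names NL%, L$, LA% and FNlabel are assumptions made for this illustration and the two labels in the DATA statement are only examples; in the disassembler itself the function would be applied to each address before it is printed.

```basic
10 REM Sketch of a label table; the two entries are examples only
20 NL%=2
30 DIM L$(NL%-1),LA%(NL%-1)
40 FOR I%=0 TO NL%-1
50 READ L$(I%),A$
60 LA%(I%)=EVAL(A$)
70 NEXT I%
80 DATA OSBYTE,&FFF4,OSWRCH,&FFEE
90 PRINT "JSR ";FNlabel(&FFF4)
100 PRINT "JSR ";FNlabel(&1234)
110 END
120 DEF FNlabel(V%)
130 LOCAL I%,R$
140 REM Default to the numeric address, then look for a matching label
150 R$="&"+STR$~(V%)
160 FOR I%=0 TO NL%-1
170 IF LA%(I%)=V% THEN R$=L$(I%)
180 NEXT I%
190 =R$
```

The first PRINT shows JSR OSBYTE because &FFF4 is in the list, and the second falls back to the numeric form JSR &1234. A simple search like this is slow for a long list but is easy to understand and to extend.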
It is possible to produce disassemblers that automatically identify program and data areas by tracing all of the possible paths for the flow of control through the program but this is a much more difficult problem.