A beginner's guide to assembly language using emu8086
What is assembly language
Assembly language is a low-level programming language. Because an assembler translates it directly into machine language, well-written assembly programs can run very fast and use fewer resources than programs written in higher-level languages. According to Wikipedia:
In computer programming, assembly language is any low-level programming language with a very strong correspondence between the instructions in the language and the architecture's machine code instructions.
We know that a processor (also known as the CPU, or Central Processing Unit) executes all types of operations, effectively working as the brain of a computer. However, it only recognizes strings of 0s and 1s. As you can imagine, it's cumbersome to code in machine language. So, each assembly language was designed for a specific family of processors and represents that family's instructions in symbolic code, which is far easier for a human being to understand. But, as you can also guess, developing in assembly language is still difficult and somewhat inconvenient.
So, why should we learn assembly language in today's world?
Well, here are some points to consider when deciding whether to learn it or not.
Enhance your skill set.
Learn the fastest language aside from machine language.
Embed assembly language in a higher-level language to use features unsupported by the higher-level language or for performance reasons.
Fill in the knowledge gap for understanding how the higher-level languages came to be.
Assemblers and editors
Assemblers are programs that translate assembly language code to its equivalent machine language code. There are many assemblers targeting various microprocessors in the market today like MASM, TASM, NASM, etc. For a list of different assemblers, visit this Wikipedia page.
Code editors are programs in which you can write code, modify it, and save it to a file. Some editors that support assembly language are VS Code, DOSBox, emu8086, and so on. Online assemblers are also available, like the popular online editor Ideone. We will use emu8086, which comes with the environment needed to start our journey in assembly language.
Code structure
We can simply write the assembly code and emulate it in emu8086, and it'll run. However, without an exit routine or a halt instruction, the processor will keep executing whatever comes next in memory until the OS or emu8086 itself stops it. Assembly code is saved in a file with the .asm extension.
There are also some good practices, like defining the memory model and stack size at the very beginning. For the small model, define the data and code segments after the stack. The code segment contains the code to execute. In the example structure given here, I have created a main procedure (called a function or method in other programming languages), in which code execution starts. At the end of it, I load a predefined value into AH and call an interrupt to indicate that the code has finished executing.
.model small
.stack 100H
; Data segment
.data ; if there is nothing in the data segment, you can omit this line.
; Code segment
.code
main PROC
; Write your code here
exit:
MOV AH, 4CH
INT 21H
main ENDP
END main
The first line, .model small, defines the memory model to use. Some recognized memory models are tiny, small, medium, compact, large, and so on. The small memory model supports one data segment and one code segment, which is usually enough for small programs. The following line, .stack 100H, sets the stack size as a hexadecimal number; 100H is 256 in decimal. Anything from a ; to the end of the line is a comment that the assembler ignores.
Registers and flags
Registers are small, very fast storage locations inside the CPU. The emu8086 can emulate all internal registers of the Intel 8086 microprocessor. All of these registers are 16 bits long and are grouped into several categories as follows:
General purpose registers: There are four general purpose registers, each of which can also be accessed as two 8-bit halves, low and high. For example, AX is divided into AL (low byte) and AH (high byte).
Accumulator (AX)
Base (BX)
Counter (CX)
Data (DX)
Segment registers: There are also four segment registers.
Code Segment (CS)
Data Segment (DS)
Stack Segment (SS)
Extra Segment (ES)
Special purpose registers: There are two index registers and three pointer registers.
Source Index (SI)
Destination Index (DI)
Base Pointer (BP)
Stack Pointer (SP)
Instruction Pointer (IP)
Flag register: This is a 16-bit register of which 9 bits are used by the 8086 to indicate the current state of the processor. The nine flags are categorized into two groups.
Status flags: Six status flags reflect the result of the most recently executed instruction.
Carry flag (CF)
Parity flag (PF)
Auxiliary flag (AF)
Zero flag (ZF)
Sign flag (SF)
Overflow flag (OF)
Control flags: There are three control flags that control certain operations of the processor.
Interrupt flag (IF)
Direction flag (DF)
Trap flag (TF)
To read more about these registers and what they are used for, visit this page.
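As a small illustration of how the 8-bit halves relate to the full register, the following sketch writes to AX and then to AL and AH separately. The values are chosen arbitrarily for this example, and the comments track the register contents after each instruction.

```asm
MOV AX, 1234H   ; AX = 1234H, so AH = 12H and AL = 34H
MOV AL, 0FFH    ; only the low byte changes: AX = 12FFH
MOV AH, 0       ; only the high byte changes: AX = 00FFH
MOV BX, AX      ; copy the whole 16-bit value: BX = 00FFH
```

Stepping through these lines in the emu8086 emulator should show the AH and AL columns of AX changing independently.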
Assembly language instructions
A total of 116 instructions are available for the Intel 8086 microprocessor. All these instructions with related examples are provided in this link.
In this article, I'll focus only on a few instructions necessary for understanding the later parts.
Copy data (MOV): This instruction copies a byte (8-bit) or a word (16-bit) from source to destination. Both operands should be of the same type (byte or word). The syntax of this instruction is:
MOV destination, source
The destination operand can be any register or a memory location, whereas the source operand can be a register, memory address, or a constant/immediate value.
Addition (ADD) and Subtraction (SUB): ADD adds the data of the destination and source operand and stores the result in destination. Both operands should be of the same type (words or bytes), otherwise, the assembler will generate an error. The subtraction instruction subtracts the source from destination and stores the result in destination.
; Addition
ADD destination, source
ADD BL, 10
; Subtraction
SUB destination, source
SUB BL, 10
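To make the operand rules concrete, here is a short sketch of some valid MOV, ADD, and SUB combinations. The variable count is hypothetical: it is assumed to be declared as a byte (DB) in the data segment, with DS already pointing at that segment, both of which are covered later in this article.

```asm
MOV AL, 5       ; immediate to register: AL = 5
MOV BL, AL      ; register to register: BL = 5
ADD BL, 10      ; BL = 5 + 10 = 15
SUB BL, AL      ; BL = 15 - 5 = 10
MOV count, BL   ; register to memory (count is an assumed DB variable)
ADD AL, count   ; memory to register: AL = 5 + 10 = 15
; MOV AX, BL    ; not allowed: AX is a word, BL is a byte
```

The commented-out last line shows the kind of size mismatch the assembler rejects.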
Label: A label is a symbolic name for the address of the instruction that is given immediately after the label declaration. It can be placed at the beginning of a statement and serve as an instruction operand. The exit: used before is a label. Labels are of two types.
Symbolic Labels: A symbolic label consists of an identifier or symbol followed by a colon (:). They must be defined only once as they have global scope and appear in the object file's symbol table.
Numeric Labels: A numeric label consists of a single digit in the range zero (0) through nine (9) followed by a colon (:). They are used only for local reference and are excluded from the object file's symbol table. Hence, they have a limited scope and can be re-defined repeatedly.
; Symbolic label
label:
MOV AX, 5
; Numeric label
1:
MOV AX, 5
Compare (CMP): This instruction takes two operands, subtracts operand2 from operand1, and sets the OF, SF, ZF, AF, PF, and CF flags accordingly. The result is not stored anywhere.
CMP operand1, operand2
The operand1 operand can be a register or memory address, and operand2 can be a register, memory, or immediate value.
Jump instructions: The jump instructions transfer the program control to a new set of instructions indicated by the label provided as an operand. There are two types of jump instructions.
Unconditional jump (JMP): It directly jumps to the provided label.
Conditional jump: These instructions jump only if a condition is satisfied, and are typically placed right after a CMP instruction. A conditional jump checks the flags to decide whether its condition holds, and if so, jumps to the label given as its operand. It is pretty similar to if statements in other programming languages. There are 31 conditional jump instructions available in 8086 assembly language.
Working with variables
In an assembly program, all variables are declared in the data segment. The emu8086 provides some define directives for declaring variables. Specifically, we'll use the DB (define byte) and DW (define word) directives in this article, which allocate 1 byte and 2 bytes respectively.
[variable-name] define-directive initial-value [,initial-value]...
Here, variable-name is the identifier for each storage space. The assembler associates an offset value with each variable name defined in the data segment.
Following is an example of variable declaration, where we initialize num and char with values that can be changed later. The output variable is initialized with a string and has a dollar symbol ($) at the end to mark the end of the string. The input_char variable is declared without any initial value; we use ? to indicate that the value is currently unknown.
; Data segment
.data
num DB 31H
char DB 'A'
output DB "Hello, World!!$" ; a string is a sequence of bytes, so it is declared with DB
input_char DB ?
We cannot use the variables in the code segment just yet! To use these variables in the code segment, we first have to load the address of the data segment into the DS (data segment) register. Put these lines at the beginning of the code segment to make all variables accessible.
; Load the data segment address into DS (via AX, since DS cannot take an immediate value)
MOV AX, @data
MOV DS, AX
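Once DS is set up, variables can be read and written by name. A minimal sketch, assuming the num and char variables from the data segment above:

```asm
MOV AL, num     ; AL = 31H (the ASCII code of '1')
INC AL          ; AL = 32H
MOV num, AL     ; num now holds 32H
MOV char, 'B'   ; overwrite char with a new constant
```

Because num and char were declared with DB, the assembler knows these are byte-sized accesses, which is why they pair with the 8-bit register AL.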
Taking user input
The emu8086 emulator supports user input by setting a predefined value, 01 or 01H, in the AH register and then calling the interrupt (INT) 21H. It takes a single character from the user and saves the ASCII value of that character in the AL register. The emu8086 emulator displays all values in hexadecimal.
; input a character from user
MOV AH, 1
INT 21h ; the input will be stored in AL register
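Because AL receives the ASCII code rather than the numeric value, a digit typed by the user usually has to be converted before doing arithmetic with it. A minimal sketch, assuming the user presses the 5 key:

```asm
MOV AH, 1
INT 21H         ; user presses '5', so AL = 35H
SUB AL, '0'     ; '0' is 30H, so AL = 5
```

The same subtraction works for any digit from '0' to '9', since their ASCII codes are consecutive.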
Displaying output
The emu8086 supports single-character output as well as multi-character or string output. Similar to taking input, we have to provide a predefined value in the AH register and call the interrupt. The predefined value is 02 or 02H for single-character output and 09 or 09H for string output. The value to output must be stored in the data register (DX, or its low byte DL) before calling the interrupt.
; Output a character
MOV AH, 2
MOV DL, 35
INT 21H
; Output a string
MOV AH, 9
LEA DX, output
INT 21H
As shown in the code, for a single character output, we store the value in the DL register because a character is one byte (8 bits) long. String output is a bit different: we must load the effective address of the string variable (its offset within the data segment) into the DX register using the LEA instruction. The string variable must be defined in the data segment.
The complete code containing variable declaration, input, and output is provided on GitHub.
Branching or using conditions
We can simulate the if-else conditions supported by higher-level programming languages using CMP and jump instructions. Some frequently used conditional jump instructions are:
Instruction | Jump if | Similar to |
JE | equal | == |
JL | less | < |
JLE | less than or equal | <= |
JG | greater | > |
JGE | greater than or equal | >= |
There is also the JMP instruction, which can play the role of the else statement found in higher-level languages. Following is an assembly code that compares the value in the AL register to 5 and sets an appropriate value in the BL register.
; setting a test value
MOV AL, 5
; Compare
CMP AL, 5
JG greater ; if greater
JE equal ; else if equal
JMP less ; else
greater:
MOV BL, 'G'
JMP after
equal:
MOV BL, 'E'
JMP after
less:
MOV BL, 'L'
after:
; Other codes
; Note: BL will contain 'E' at this point
The complete code is available in this GitHub repository.
Using loops
We can also use loops in assembly language. However, unlike higher-level languages, it does not provide different loop types. Though the emu8086 emulator supports five loop instructions (LOOP, LOOPE, LOOPNE, LOOPNZ, and LOOPZ), they are not flexible enough for many situations. We can create our own loops using compare and jump instructions. Following are various types of loops implemented in assembly language, all of which are equivalent.
For loop
The for loop has an initialization section where loop variables are initialized, a loop condition section, and finally, an increment/decrement section to do some calculation or change loop variables before the next iteration. Following is an example for loop in the C language.
char bl = '0';
for (int cl = 0; cl < 5; cl++) {
// body
bl++;
}
The equivalent assembly code is as follows:
MOV BL, '0'
init_for:
; initialize loop variables
MOV CL, 0
for:
; condition
CMP CL, 5
JGE outside_for
; body
INC BL
; increment/decrement and next iteration
INC CL
JMP for
outside_for:
; other codes
While loop
Unlike the for loop, the while loop has no initialization section. It only has a loop condition section, which, if satisfied, executes the body. In the body, we can do some calculations before the next iteration. Following is an example while loop in the C language.
char bl = '0';
int cl = 0;
while (cl < 5) {
// body
bl++;
cl++;
}
The equivalent assembly code is:
MOV CL, 0
MOV BL, '0'
while:
; condition
CMP CL, 5
JGE outside_while
; body
INC BL
INC CL
; next iteration
JMP while
outside_while:
; other codes
Do-while loop
Similar to the while loop, the do-while loop has a loop condition section and body. The only difference is that the code in the body executes at least once, even if the condition evaluates to false. Following is an example do-while loop in the C language.
char bl = '0';
int cl = 0;
do {
// body
bl++;
cl++;
} while (cl < 5);
The matching assembly code is as follows:
MOV CL, 0
MOV BL, '0'
do_while:
; body
INC BL
INC CL
; condition
CMP CL, 5
JL do_while
; other codes
Using LOOP syntax
We can use the predefined LOOP instruction with the CX register as a counter. Each LOOP decrements CX and jumps back to the label while CX is not zero. Following is an example of the LOOP syntax, which does the same thing as the previous loops.
MOV BL, '0'
; initialize counter
MOV CX, 5
loop1:
INC BL
LOOP loop1
The complete code containing the various loops is available on GitHub.
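For intuition, LOOP behaves roughly like a decrement of CX followed by a jump while CX is not zero. The sketch below spells that out by hand. One difference is that DEC updates the flags, whereas LOOP leaves them unchanged. The label name loop2 is just for this example.

```asm
MOV BL, '0'
MOV CX, 5       ; counter
loop2:
    INC BL
    DEC CX      ; CX = CX - 1
    JNZ loop2   ; repeat while CX is not zero
; BL = '5' here, the same result as the LOOP version
```

Writing the loop this way also makes it easy to count with a different register or step size, which LOOP itself does not allow.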
Include directive
The include directive is used to access and use procedures and macros defined in other files. The syntax is include followed by a file name with an extension.
include file_name
The assembler automatically searches for the file in two locations and shows an error if it cannot find it. The locations are:
The folder where the source file is located
The Inc folder
In the Inc folder, there is a file emu8086.inc, which defines some useful procedures and macros that can make coding easier. We have to include the file at the beginning of our source code to use these functionalities.
include 'emu8086.inc'
Now, we can use these macros in the code segment. Some of these macros and procedures that I find most useful are:
PRINT macro to print a string. Example usage: PRINT output.
PUTC macro to print an ASCII character. Example usage: PUTC char.
GET_STRING procedure to get a null-terminated string from a user until the Enter key is pressed. Declare DEFINE_GET_STRING before the END directive to use this procedure.
CLEAR_SCREEN procedure to clear the entire screen and set the cursor position to the beginning. Declare DEFINE_CLEAR_SCREEN before the END directive to use this procedure.
To learn more about the macros and procedures inside the emu8086.inc file, visit this page.
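Putting these pieces together, the following is a minimal sketch of a program that uses the library. It assumes the standard emu8086.inc shipped with emu8086, and it passes literal values to PRINT and PUTC; the message text is made up for this example.

```asm
include 'emu8086.inc'

.model small
.stack 100H

.code
main PROC
    CALL CLEAR_SCREEN               ; procedure, needs DEFINE_CLEAR_SCREEN below
    PRINT 'Hello from emu8086.inc'  ; macro: prints the given string
    PUTC 13                         ; carriage return
    PUTC 10                         ; line feed
    PUTC 'A'                        ; macro: prints a single character

    MOV AH, 4CH
    INT 21H
main ENDP

DEFINE_CLEAR_SCREEN                 ; emits the CLEAR_SCREEN procedure here
END main
```

Macros such as PRINT and PUTC are expanded inline wherever they are used, while procedures such as CLEAR_SCREEN exist only once, which is why they need a matching DEFINE_ line before END.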
Extra: Reverse triangle problem
Let's solve a problem that uses everything we have learned so far. The task is to input a number (1-9) from the user and print a reverse triangle shape using # in the console. Also, appropriate error messages should be displayed if the user inputs an invalid character. A demo output is shown in the image.
Try it yourself first and if you cannot solve it, then read on.
To solve this problem, we have to do the following tasks:
Input a number from the user
Validate the input
Display user-friendly messages
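The three tasks above could be sketched as follows. This is only one possible approach, not necessarily the one used in the final solution; the variable names prompt, bad_msg, and x and the message texts are assumptions made for this example. It uses JB and JA, the unsigned counterparts of JL and JG, to check that the character lies between '1' and '9'.

```asm
.model small
.stack 100H

.data
prompt  DB "Enter a number (1-9): $"
bad_msg DB 0DH, 0AH, "Invalid input!$"
x       DB ?                ; will hold the validated number

.code
main PROC
    MOV AX, @data
    MOV DS, AX

    MOV AH, 9               ; show the prompt
    LEA DX, prompt
    INT 21H

    MOV AH, 1               ; read one character into AL
    INT 21H

    CMP AL, '1'
    JB  invalid             ; below '1' is not accepted
    CMP AL, '9'
    JA  invalid             ; above '9' is not accepted

    SUB AL, '0'             ; ASCII digit to number (1-9)
    MOV x, AL
    JMP finish

invalid:
    MOV AH, 9               ; show the error message
    LEA DX, bad_msg
    INT 21H

finish:
    MOV AH, 4CH
    INT 21H
main ENDP
END main
```

After the jump to finish, x holds a value from 1 to 9 that the printing loops can use directly.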
Now comes the tricky part. We cannot use a single for loop to print a reverse triangle shape. For this, we have to use two loops, one inside the other, also known as nested loops. In the outer loop, we can check how many lines are to be printed and also print the new line at the beginning or the end. The inner loop can be used to print #.
Following is a demo code for the nested loop:
; Initialize outer loop counter
MOV BL, 0 ; counts line number starting from 0
outer_loop: ; using while loop format
CMP BL, x ; assuming x contains user input
JE outside_loop
; Print new-line
; Initialize inner loop counter
MOV CH, 0
MOV CL, x
SUB CL, BL ; subtract current line number from x
inner_loop:
; Print #
LOOP inner_loop
; Increment outer loop counter
INC BL
JMP outer_loop
outside_loop:
; other codes
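For reference, here is a self-contained sketch of the same nested loop with the two placeholder comments filled in using the single-character output interrupt. To keep it short, x is hard-coded to 4 instead of being read from the user, and the new line is printed at the end of each row; both are choices made for this example.

```asm
.model small
.stack 100H

.data
x DB 4                  ; assumed row count (normally the user input)

.code
main PROC
    MOV AX, @data
    MOV DS, AX

    MOV BL, 0           ; current line number, starting from 0

outer_loop:
    CMP BL, x
    JE  done            ; all x lines have been printed

    MOV CH, 0           ; inner counter: CX = x - BL
    MOV CL, x
    SUB CL, BL

inner_loop:
    MOV AH, 2           ; print one '#'
    MOV DL, '#'
    INT 21H
    LOOP inner_loop

    MOV AH, 2           ; new line: carriage return, then line feed
    MOV DL, 0DH
    INT 21H
    MOV AH, 2
    MOV DL, 0AH
    INT 21H

    INC BL
    JMP outer_loop

done:
    MOV AH, 4CH
    INT 21H
main ENDP
END main
```

With x set to 4, the rows contain 4, 3, 2, and 1 # characters. Because BL is always smaller than x inside the loop, CX is at least 1 when the inner loop starts, so LOOP never begins with a zero counter.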
The final output of my code is as follows:
The complete solution is available in my GitHub repository.
Summary
We covered a lot of ground in this article. First, we looked at what assembly language is and named some assemblers. Then, we went through the code structure and listed all the registers and flags in the 8086 microprocessor. After going over some assembly instructions, we learned how to define a variable, how to take input from the user, and how to output something on the screen. Then we learned about conditions and loops, and finally, to wrap up, we solved a problem using assembly language.
Written by Amrito Das Tipu