Correct assembly instructions in Intel syntax

The processor.

What is a processor? The processor is the "brain" of the computer. A processor is a device that transforms input information (an input word) into output information (an output word) according to an algorithm written in program code.

Physically, a processor can be built either as a single large monolithic integrated circuit (a chip) or as several chips, circuit boards, and devices.

A personal computer contains quite a few different processors.

However, the architecture and design of a personal computer are determined by the processor or processors that control and service the system bus and main memory and, more importantly, execute the object code of programs. Such processors are called central or main processors (Central Processing Unit, CPU). The architecture of motherboards, and the architecture and design of the computer as a whole, are built around the architecture of the central processor.

The language of the processor.

The processor's native language is machine instructions. The form of a program that the processor "understands" is called machine language. A machine instruction must answer the following questions:

• which operation should be performed?

• where are the operands?

• where should the result of the operation be placed?

Operands are the data on which an operation is performed. In arithmetic operations these are the addends, the factors, the minuend and subtrahend, the dividend and divisor.

A machine instruction describes an elementary operation that the computer must perform. Instructions are stored in memory cells in binary code.

Each processor model has, in principle, its own instruction set and a corresponding assembly language (or dialect).

Assembly as a programming language.

Assembly language is a low-level programming language: a notation for machine instructions that is convenient for humans to read.

Assembly language instructions correspond one-to-one to processor instructions and are, in effect, a convenient symbolic notation (mnemonic code) for the instructions and their arguments.

Advantages of assembly language:

Very efficient use of the processor's facilities, with fewer instructions and fewer memory accesses, and as a consequence higher speed and smaller program size. The code can be tuned closely to the target platform.

Disadvantages:

Large volumes of source text and a large number of small auxiliary tasks. Code is laborious to read and debug (although much depends on the comments and the programming style).

Intel assembly syntax.

There is no generally accepted standard for assembly language syntax. However, there are de facto standards: traditional approaches followed by most developers of assemblers. The main ones are Intel syntax and AT&T syntax. The general format of an instruction is the same in both.

Intel syntax is one of the notations for processor instruction mnemonics; it is used in Intel documentation and in assemblers for MS-DOS and Windows (MASM, TASM, the Visual Studio inline assembler, and so on).

Features of Intel syntax:

1. In an instruction, the destination is written to the left of the source (for example, mov eax,ebx, where mov is the instruction and eax and ebx are its operands: eax is the destination and ebx is the source).

2. Register names are reserved (labels named eax, ebx, and so on cannot be used). For example, the code mov eax, ebx copies the value held in ebx into the register eax, which is functionally equivalent to the following code:

push ebx ; push the value of ebx onto the stack

pop eax ; pop the value from the stack into eax

Each instruction is written on a separate line: [label:] mnemonic [operands] [;comment]

The mnemonic is the symbolic name of a processor instruction; it stands for the opcode (operation code). Prefixes (repeat, address-size override, and so on) may be added to it. Operands can be constants, register names, addresses in main memory, and so on.

Mnemonics are, in general, techniques that make information easier to remember. Here the binary code of an instruction is replaced by a symbolic name: for example, the binary code of the instruction that adds the values in registers eax and ebx is 0000000111011000 (01D8 in hexadecimal), and its mnemonic form is add eax,ebx.
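The correspondence between add eax,ebx and the bytes 01 D8 can be reproduced by hand. The following is a minimal Python sketch, not an assembler: it assumes 32-bit operand size and the register-to-register form of ADD (opcode 01 followed by a ModRM byte). The names REG32 and encode_add_reg_reg are hypothetical, introduced only for this illustration.

```python
# Register numbers used in the ModRM byte for 32-bit general registers.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def encode_add_reg_reg(dst: str, src: str) -> bytes:
    """Encode `add dst, src` for two 32-bit registers (assumed form: 01 /r)."""
    opcode = 0x01  # ADD r/m32, r32
    # mod=11 selects a register operand; reg holds the source, r/m the destination.
    modrm = (0b11 << 6) | (REG32[src] << 3) | REG32[dst]
    return bytes([opcode, modrm])


if __name__ == "__main__":
    code = encode_add_reg_reg("eax", "ebx")
    print(code.hex())                                   # 01d8
    print(format(int.from_bytes(code, "big"), "016b"))  # 0000000111011000
```

Swapping the operands changes only the ModRM byte, which is why the mnemonic form is so much easier to read than the raw bytes.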

The full format of each instruction line is: label: code ; comment

where label is the name of the label, code is the assembly instruction itself, and comment is a comment,

for example: summa: add eax,ebx ; sum of the register values (summa is the label; add eax,ebx is the instruction; "sum of the register values" is the comment).

One or two components of a line may be omitted: a line may consist, for example, of only a comment, or contain only a label or only an instruction.
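The line format above can be sketched as a small splitter. This is a simplified Python illustration, not how any particular assembler tokenizes its input: it assumes that a label is an identifier at the start of the line followed by a colon, and that a semicolon always starts a comment (so a semicolon inside a string literal is not handled). The names LINE_RE and split_line are hypothetical.

```python
import re

# Optional "label:", then the instruction text, then an optional ";comment".
LINE_RE = re.compile(
    r"^\s*(?:(?P<label>[A-Za-z_?$@][A-Za-z0-9_?$@]*)\s*:)?"
    r"\s*(?P<code>[^;]*?)\s*(?:;\s*(?P<comment>.*))?$"
)


def split_line(line: str) -> dict:
    """Split one source line into label, code and comment (None when absent)."""
    match = LINE_RE.match(line)
    return {name: (text or None) for name, text in match.groupdict().items()}


if __name__ == "__main__":
    for src in ("summa: add eax,ebx ; sum of the register values",
                "; only a comment",
                "start:",
                "mov eax, ebx"):
        print(split_line(src))
```

Each of the three components comes back as None when the line omits it, which mirrors the rule that one or two components may be missing.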

The objects that instructions operate on are processor registers and regions of main memory. The notation for them is also part of the syntax.

An assembly instruction consists of an instruction mnemonic and a comma-separated list of arguments (none, one, two, or three, depending on the instruction).

Instruction mnemonics are three- or four-letter abbreviations of their counterparts, usually English words, for example:

jmp: continue execution from a new memory address (from "jump")

mov: move data (from "move")

sub: compute the difference of two values (from "subtract")

xchg: exchange values in registers/memory cells (from "exchange")

The program text can be supplemented with assembler directives (parameters that affect the assembly process and the properties of the output file).

Macros are used to simplify and speed up the writing of assembly programs.

The characters allowed in program text are:

1. all Latin letters: A-Z, a-z. Uppercase and lowercase letters are treated as equivalent;

2. digits from 0 to 9;

3. the characters ?, @, $, _, &;

4. separators , . [ ] ( ) < > { } + / * % ! ' " ? \ = # ^.

Tokens.

Assembly statements are built from tokens: syntactically indivisible sequences of allowed characters that are meaningful to the translator.

The tokens are:

- identifiers: sequences of allowed characters used to denote program objects such as operation codes, variable names, and label names. The rule for writing identifiers is as follows: an identifier consists of one or more characters. Letters of the Latin alphabet, digits, and some special characters (_, ?, $, @) may be used. An identifier cannot begin with a digit. An identifier may be up to 255 characters long, although the translator recognizes only the first 32 and ignores the rest.

- character strings: sequences of characters enclosed in single or double quotes;

- integers in one of the following number systems: binary, decimal, hexadecimal.

Numbers written in assembly programs are recognized by the following rules:

Decimal numbers need no additional characters to be recognized, for example 25 or 139.

To mark a binary number in the source text, the Latin letter "b" is written after its zeros and ones, for example 10010101b.

Hexadecimal numbers consist of the digits 0-9 and the lowercase or uppercase Latin letters a, b, c, d, e, f or A, B, C, D, E, F. A hexadecimal number ends with the Latin letter "h"; if it begins with a letter, a leading zero is written in front of it, for example 0ef15h.
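The literal rules above can be sketched as a small parser. This is a minimal Python illustration under stated assumptions: the default radix is decimal, and the suffixes d (decimal) and o or q (octal) are accepted in addition to b and h. The function name parse_int_literal is hypothetical.

```python
def parse_int_literal(text: str) -> int:
    """Parse an assembler-style integer literal such as 25, 10010101b or 0ef15h."""
    s = text.strip().lower()
    # A literal must start with a digit, otherwise it would look like an identifier.
    if not s or not s[0].isdigit():
        raise ValueError(f"a numeric literal must start with a digit: {text!r}")
    radix_by_suffix = {"h": 16, "b": 2, "o": 8, "q": 8, "d": 10}
    if s[-1] in radix_by_suffix:
        # The last character decides the radix, so "0bh" is hexadecimal 0B.
        return int(s[:-1], radix_by_suffix[s[-1]])
    return int(s, 10)


if __name__ == "__main__":
    for literal in ("25", "139", "10010101b", "0ef15h", "123o"):
        print(literal, "=", parse_int_literal(literal))
```

Checking the suffix before anything else is what makes the leading-zero rule necessary: without the zero, ef15h would be read as an identifier rather than a number.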

Operands.

Practically every statement (that is, an instruction that forms a line of program code) describes an object on which, or with the help of which, some action is performed. These objects are called operands. Operands can be numbers, registers, memory cells, and symbolic identifiers.

Assembler operators and the syntactic rules for forming assembler expressions.

The possible types of assembler operators:

• Arithmetic operators

o unary "+" and "-";

o binary "+" and "-";

o multiplication "*";

o integer division "/";

o remainder of division "mod".

• Shift operators

• Comparison operators

• Logical operators

• Index operator

• Type override operator

• Structure type naming operator

• Operator for obtaining the offset of an expression

Directives.

Assembler directives make it possible to lay out a sequence of instructions so that the translator can process it and the microprocessor can execute it.

Assembly instructions.

aaa, aad, aam, aas, adc, add, and, bound, bsf, bsr, bswap, bt, btc, btr, bts, call, cbw, cwde, clc, cld, cli, cmc, cmp, cmps/cmpsb/cmpsw/cmpsd, cmpxchg, cwd, cdq, daa, das, dec, div, enter, hlt, idiv, imul, in, inc, ins/insb/insw/insd, int, into, iret/iretd, jcc, jcxz, jecxz, jmp, lahf, lds, les, lfs, lgs, lss, lea, leave, lgdt, lidt, lods/lodsb/lodsw/lodsd, loop, loope, loopz, loopne, loopnz, mov, movs/movsb/movsw/movsd, movsx, movzx, mul, neg, nop, not, or, out, outs, pop, popa, popad, popf, popfd, push, pusha, pushad, pushf, pushfd, rcl, rcr, rep/repe/repz/repne/repnz, ret/retf, rol, ror, sahf, sal, sar, sbb, scas/scasb/scasw/scasd, setcc, sgdt, sidt, shl, shld, shr, shrd, stc, std, sti, stos/stosb/stosw/stosd, sub, test, xadd, xchg, xlat/xlatb, xor…

Data types.

Byte, word (2 bytes), doubleword (4 bytes), quadword (8 bytes).

Directives for reserving and initializing data: db, dw, dd, dq, dt.
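As a rough illustration of what these directives reserve, the Python sketch below maps each directive to its size and shows the little-endian byte order x86 uses for an unsigned integer initializer. It is a model, not assembler output. The names DIRECTIVE_SIZE and emit are hypothetical, and dt (10 bytes) is listed for its size only because its initializers follow different rules.

```python
# Number of bytes reserved by each data directive.
DIRECTIVE_SIZE = {"db": 1, "dw": 2, "dd": 4, "dq": 8, "dt": 10}


def emit(directive: str, value: int) -> bytes:
    """Return the little-endian bytes of an unsigned integer initializer."""
    if directive == "dt":
        raise ValueError("dt initializers are not modelled in this sketch")
    return value.to_bytes(DIRECTIVE_SIZE[directive], "little")


if __name__ == "__main__":
    print(emit("db", 0x12).hex())        # 12
    print(emit("dw", 0x1234).hex())      # 3412
    print(emit("dd", 0x12345678).hex())  # 78563412
```

The least significant byte comes first in memory, so a doubleword 12345678h is stored as the bytes 78 56 34 12.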

Preparing a program in MASM32. Exercise 1

1. Start the MASM32 program.

2. Type the text of your program in the window

3. Save the program text under the name additive_reg.asm

4. Assemble the program via the menu Project - Console link OBJ File and press Enter

5. Link the program via the menu Project - Assemble & Link and press Enter

6. Run the program via the menu Project - Run Program and press Enter

7. If error messages appear, fix the errors, save the corrected file via the menu File - Save, and repeat from step 4

TITLE MASM32 program template ; title: an optional line

; The program opens a console window. This is a comment; it starts after the ';' character

comment * a long comment; it can be used to comment out
all or part of a program *

.386	; 32-bit mode; 486, 586, 686 are also possible
;.model flat, stdcall	; flat memory model with stdcall API function calls
option casemap :none	; identifiers are case-sensitive

; include the libraries the program needs

include c:\masm32\include\masm32rt.inc

; section where all constants are declared

.const

; section where variables that already have a value are declared

.data

; section where variables that do not have a value yet are declared

.data?

; program code section

.code

start:	; program code starts at this label
invoke ExitProcess,0
end start	; this directive ends the program

; Example using the MASM32 Debug Window

Exercise 2. Arithmetic instructions

.386

;.model flat, stdcall

option casemap :none

include <\masm32\include\masm32rt.inc>

include <\masm32\include\debug.inc>

includelib <\masm32\lib\debug.lib>

.data

.code

start:

; message in the assembly console:

; a text message and a separator line:

PrintText "Arithmetic operations in assembly language"

PrintLine

mov eax, 123 ; put the decimal number 123 in register EAX

mov ebx, 321 ; and 321 in EBX

PrintDec eax,"Initial value" ; print the value of register EAX

add eax,ebx ; add: EAX=EAX+EBX

PrintDec eax,"Sum EAX=EAX+EBX" ; print the value of register EAX

PrintDec ebx,"Initial value" ; print the value of register EBX

invoke crt_exit

end start

Exercise 3.

1) Put the binary number 101b in EAX and 111b in EBX. Look at the result.

2) Put 123h in EAX and 987h in EBX. Look at the result.

3) Put 123d in EAX and 321d in EBX. Look at the result.

4) Put 123o in EAX and 754o in EBX. Look at the result.

Exercise 4.

Perform the operations sub eax,ebx; mul ebx; div ebx


x86 assembly language is the name for the family of assembly languages which provide some level of backward compatibility with CPUs back to the Intel 8008 microprocessor, which was launched in April 1972.^[1]^[2] It is used to produce object code for the x86 class of processors.

Regarded as a programming language, assembly is machine-specific and low-level. Like all assembly languages, x86 assembly uses mnemonics to represent fundamental CPU instructions, or machine code.^[3] Assembly languages are most often used for detailed and time-critical applications such as small real-time embedded systems, operating-system kernels, and device drivers, but can also be used for other applications. A compiler will sometimes produce assembly code as an intermediate step when translating a high-level program into machine code.

Keywords

Reserved keywords of x86 assembly language^[4]^[5]

lds
les
lfs
lgs
lss
pop
push
in
ins
out
outs
lahf
sahf
popf
pushf
cmc
clc
stc
cli
sti
cld
std
add
adc
sub
sbb
cmp
inc
dec
test
sal
shl
sar
shr
shld
shrd
not
neg
bound
and
or
xor
imul
mul
div
idiv
cbtw
cwtl
cwtd
cltd
daa
das
aaa
aas
aam
aad
wait
fwait
movs
cmps
stos
lods
scas
xlat
rep
repnz
repz
lcall
call
ret
lret
enter
leave
jcxz
loop
loopnz
loopz
jmp
ljmp
int
into
iret
sldt
str
lldt
ltr
verr
verw
sgdt
sidt
lgdt
lidt
smsw
lmsw
lar
lsl
clts
arpl
bsf
bsr
bt
btc
btr
bts
cmpxchg
fsin
fcos
fsincos
fld
fldcw
fldenv
fprem
fucom
fucomp
fucompp
lea
mov
movw
movsx
movzb
popa
pusha
rcl
rcr
rol
ror
setcc
bswap
xadd
xchg
wbinvd
invd
invlpg
lock
nop
hlt
fld
fst
fstp
fxch
fild
fist
fistp
fbld
fbstp
fadd
faddp
fiadd
fsub
fsubp
fsubr
fsubrp
fisubrp
fisubr
fmul
fmulp
fimul
fdiv
fdivp
fdivr
fdivrp
fidiv
fidivr
fsqrt
fscale
fprem
frndint
fxtract
fabs
fchs
fcom
fcomp
fcompp
ficom
ficomp
ftst
fxam
fptan
fpatan
f2xm1
fyl2x
fyl2xp1
fldl2e
fldl2t
fldlg2
fldln2
fldpi
fldz
finit
fninit
fnop
fsave
fnsave
fstcw
fnstcw
fstenv
fnstenv
fstsw
fnstsw
frstor
fclex
fnclex
fdecstp
ffree
fincstp

Mnemonics and opcodes

Each x86 assembly instruction is represented by a mnemonic which, often combined with one or more operands, translates to one or more bytes called an opcode; the NOP instruction translates to 0x90, for instance, and the HLT instruction translates to 0xF4.^[3] There are potential opcodes with no documented mnemonic which different processors may interpret differently, making a program using them behave inconsistently or even generate an exception on some processors. These opcodes often turn up in code writing competitions as a way to make the code smaller, faster, more elegant or just show off the author’s prowess.

Syntax

x86 assembly language has two main syntax branches: Intel syntax and AT&T syntax.^[6] Intel syntax is dominant in the DOS and Windows world, and AT&T syntax is dominant in the Unix world, since Unix was created at AT&T Bell Labs.^[7]
Here is a summary of the main differences between Intel syntax and AT&T syntax:

Parameter order
- AT&T: movl $5, %eax. Source before the destination.
- Intel: mov eax, 5. Destination before source.

Parameter size
- AT&T: addl $0x24, %esp; movslq %ecx, %rax; paddd %xmm1, %xmm2. Mnemonics are suffixed with a letter indicating the size of the operands: q for qword (64 bits), l for long (dword, 32 bits), w for word (16 bits), and b for byte (8 bits).^[6]
- Intel: add esp, 24h; movsxd rax, ecx; paddd xmm2, xmm1. Derived from the name of the register that is used (e.g. rax, eax, ax, al imply q, l, w, b, respectively). Width-based names may still appear in instructions when they define a different operation. MOVSXD refers to sign extension with dword input, unlike MOVSX. SIMD registers have width-named instructions that determine how to split up the register. AT&T tends to keep the names unchanged, so PADDD is not renamed to "paddl".

Sigils
- AT&T: Immediate values prefixed with a "$", registers prefixed with a "%".^[6]
- Intel: The assembler automatically detects the type of symbols; i.e., whether they are registers, constants or something else.

Effective addresses
- AT&T: movl offset(%ebx,%ecx,4), %eax. General syntax of DISP(BASE,INDEX,SCALE).
- Intel: mov eax, [ebx + ecx*4 + offset]. Arithmetic expressions in square brackets; additionally, size keywords like byte, word, or dword have to be used if the size cannot be determined from the operands.^[6]
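The effective-address row can be made concrete with a small converter from the AT&T form DISP(BASE,INDEX,SCALE) to an Intel bracket expression. This is a minimal Python sketch, not a full syntax translator: it assumes register names prefixed with "%" and a scale of 1, 2, 4 or 8, and it emits a negative displacement as "+ -8" without tidying it. The names ATT_MEM and att_to_intel_mem are hypothetical.

```python
import re

# DISP(BASE,INDEX,SCALE); every component except the parentheses is optional.
ATT_MEM = re.compile(
    r"^(?P<disp>[^()]*)\(\s*(?P<base>%\w+)?\s*"
    r"(?:,\s*(?P<index>%\w+)\s*(?:,\s*(?P<scale>[1248]))?)?\s*\)$"
)


def att_to_intel_mem(operand: str) -> str:
    """Rewrite an AT&T memory operand as an Intel-style [base + index*scale + disp]."""
    m = ATT_MEM.match(operand.strip())
    if m is None:
        raise ValueError(f"not a DISP(BASE,INDEX,SCALE) operand: {operand!r}")
    parts = []
    if m["base"]:
        parts.append(m["base"].lstrip("%"))
    if m["index"]:
        index = m["index"].lstrip("%")
        scale = m["scale"] or "1"
        parts.append(index if scale == "1" else f"{index}*{scale}")
    if m["disp"].strip():
        parts.append(m["disp"].strip())
    return "[" + " + ".join(parts) + "]"


if __name__ == "__main__":
    print(att_to_intel_mem("offset(%ebx,%ecx,4)"))  # [ebx + ecx*4 + offset]
    print(att_to_intel_mem("8(%ebp)"))              # [ebp + 8]
```

Both notations describe the same address computation; only the order of the components and the punctuation differ.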

Many x86 assemblers use Intel syntax, including FASM, MASM, NASM, TASM, and YASM. GAS, which originally used AT&T syntax, has supported both syntaxes since version 2.10 via the .intel_syntax directive.^[6]^[8]^[9] A quirk in the AT&T syntax for x86 is that x87 operands are reversed, an inherited bug from the original AT&T assembler.^[10]

The AT&T syntax is nearly universal to all other architectures (retaining the same mov order); it was originally a syntax for PDP-11 assembly. The Intel syntax is specific to the x86 architecture, and is the one used in the x86 platform's documentation. The Intel 8080, which predates the x86, also uses the "destination-first" order for mov.^[11]

Registers

x86 processors have a collection of registers available to be used as stores for binary data. Collectively the data and address registers are called the general registers. Each register has a special purpose in addition to what they can all do:^[3]

AX multiply/divide, string load & store
BX index register for MOVE
CX count for string operations & shifts
DX port address for IN and OUT
SP points to top of the stack
BP points to base of the stack frame
SI points to a source in stream operations
DI points to a destination in stream operations

Along with the general registers there are additionally the:

IP instruction pointer
FLAGS
segment registers (CS, DS, ES, FS, GS, SS) which determine where a 64k segment starts (no FS & GS in 80286 & earlier)
extra extension registers (MMX, 3DNow!, SSE, etc.) (Pentium & later only).

The IP register points to the memory offset of the next instruction in the code segment (it points to the first byte of the instruction). The IP register cannot be accessed by the programmer directly.

The x86 registers can be loaded and copied with the MOV instruction. For example, in Intel syntax:

mov ax, 1234h ; copies the value 1234hex (4660d) into register AX

mov bx, ax    ; copies the value of the AX register into the BX register

Segmented addressing

The x86 architecture in real and virtual 8086 mode uses a process known as segmentation to address memory, not the flat memory model used in many other environments. Segmentation involves composing a memory address from two parts, a segment and an offset; the segment points to the beginning of a 64 KiB (64×2¹⁰) group of addresses and the offset determines how far from this beginning address the desired address is. In segmented addressing, two registers are required for a complete memory address. One to hold the segment, the other to hold the offset. In order to translate back into a flat address, the segment value is shifted four bits left (equivalent to multiplication by 2⁴ or 16) then added to the offset to form the full address, which allows breaking the 64k barrier through clever choice of addresses, though it makes programming considerably more complex.

In real mode, for example, if DS contains the hexadecimal number 0xDEAD and DX contains the number 0xCAFE, they would together point to the memory address 0xDEAD * 0x10 + 0xCAFE == 0xEB5CE. Combining segment and offset values in this way yields a 20-bit address, so the CPU can address up to 1,048,576 bytes (1 MB) in real mode.
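The calculation above can be checked with a short Python sketch. It models only the real-mode arithmetic (segment shifted left by four bits plus offset); the function name linear_address is hypothetical.

```python
def linear_address(segment: int, offset: int) -> int:
    """Real-mode address: the segment shifted left 4 bits plus the 16-bit offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset are 16-bit values")
    return (segment << 4) + offset


if __name__ == "__main__":
    print(hex(linear_address(0xDEAD, 0xCAFE)))  # 0xeb5ce
    # Different segment:offset pairs can name the same byte.
    print(hex(linear_address(0xEB5C, 0x000E)))  # 0xeb5ce
```

Because segments start every 16 bytes and overlap, many segment:offset pairs refer to the same physical byte.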

The original IBM PC restricted programs to 640 KB but an expanded memory specification was used to implement a bank switching scheme that fell out of use when later operating systems, such as Windows, used the larger address ranges of newer processors and implemented their own virtual memory schemes.

Protected mode, starting with the Intel 80286, was utilized by OS/2. Several shortcomings, such as the inability to access the BIOS and the inability to switch back to real mode without resetting the processor, prevented widespread usage.^[12] The 80286 was also still limited to addressing memory in 16-bit segments, meaning only 2¹⁶ bytes (64 kilobytes) could be accessed at a time.
To access the extended functionality of the 80286, the operating system would set the processor into protected mode, enabling 24-bit addressing and thus 2²⁴ bytes of memory (16 megabytes).

In protected mode, the segment selector can be broken down into three parts: a 13-bit index, a Table Indicator bit that determines whether the entry is in the GDT or LDT and a 2-bit Requested Privilege Level; see x86 memory segmentation.
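The three fields of a selector can be pulled apart with shifts and masks. The following Python sketch assumes the usual bit layout (bits 0-1 for the Requested Privilege Level, bit 2 for the Table Indicator, bits 3-15 for the index); the names Selector and decode_selector are hypothetical.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # 13-bit descriptor table index
    table: str  # "GDT" or "LDT", chosen by the Table Indicator bit
    rpl: int    # 2-bit Requested Privilege Level


def decode_selector(value: int) -> Selector:
    """Split a 16-bit protected-mode segment selector into its three fields."""
    return Selector(index=(value >> 3) & 0x1FFF,
                    table="LDT" if value & 0b100 else "GDT",
                    rpl=value & 0b11)


if __name__ == "__main__":
    print(decode_selector(0x23))  # Selector(index=4, table='GDT', rpl=3)
```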

When referring to an address with a segment and an offset the notation of segment:offset is used, so in the above example the flat address 0xEB5CE can be written as 0xDEAD:0xCAFE or as a segment and offset register pair; DS:DX.

There are some special combinations of segment registers and general registers that point to important addresses:

CS:IP (CS is Code Segment, IP is Instruction Pointer) points to the address where the processor will fetch the next byte of code.
SS:SP (SS is Stack Segment, SP is Stack Pointer) points to the address of the top of the stack, i.e. the most recently pushed byte.
SS:BP (SS is Stack Segment, BP is Stack Frame Pointer) points to the address of the top of the stack frame, i.e. the base of the data area in the call stack for the currently active subprogram.
DS:SI (DS is Data Segment, SI is Source Index) is often used to point to string data that is about to be copied to ES:DI.
ES:DI (ES is Extra Segment, DI is Destination Index) is typically used to point to the destination for a string copy, as mentioned above.

The Intel 80386 featured three operating modes: real mode, protected mode and virtual mode. The protected mode which debuted in the 80286 was extended to allow the 80386 to address up to 4 GB of memory. The all-new virtual 8086 mode (VM86) made it possible to run one or more real mode programs in a protected environment which largely emulated real mode, though some programs were not compatible (typically as a result of memory addressing tricks or using unspecified op-codes).

The 32-bit flat memory model of the 80386’s extended protected mode may be the most important feature change for the x86 processor family until AMD released x86-64 in 2003, as it helped drive large scale adoption of Windows 3.1 (which relied on protected mode) since Windows could now run many applications at once, including DOS applications, by using virtual memory and simple multitasking.

Execution modes

The x86 processors support five modes of operation for x86 code, Real Mode, Protected Mode, Long Mode, Virtual 86 Mode, and System Management Mode, in which some instructions are available and others are not. A 16-bit subset of instructions is available on the 16-bit x86 processors, which are the 8086, 8088, 80186, 80188, and 80286. These instructions are available in real mode on all x86 processors, and in 16-bit protected mode (80286 onwards), additional instructions relating to protected mode are available. On the 80386 and later, 32-bit instructions (including later extensions) are also available in all modes, including real mode; on these CPUs, V86 mode and 32-bit protected mode are added, with additional instructions provided in these modes to manage their features. SMM, with some of its own special instructions, is available on some Intel i386SL, i486 and later CPUs. Finally, in long mode (AMD Opteron onwards), 64-bit instructions, and more registers, are also available. The instruction set is similar in each mode but memory addressing and word size vary, requiring different programming strategies.

The modes in which x86 code can be executed are:

Real mode (16-bit)
- 20-bit segmented memory address space (meaning that only 1 MB of memory can be addressed; since the 80286, slightly more through the HMA), direct software access to peripheral hardware, and no concept of memory protection or multitasking at the hardware level. Computers that use BIOS start up in this mode.
Protected mode (16-bit and 32-bit)
- Expands addressable physical memory to 16 MB and addressable virtual memory to 1 GB. Provides privilege levels and protected memory, which prevents programs from corrupting one another. 16-bit protected mode (used during the end of the DOS era) used a complex, multi-segmented memory model. 32-bit protected mode uses a simple, flat memory model.
Long mode (64-bit)
- Mostly an extension of the 32-bit (protected mode) instruction set, but unlike the 16–to–32-bit transition, many instructions were dropped in the 64-bit mode. Pioneered by AMD.
Virtual 8086 mode (16-bit)
- A special hybrid operating mode that allows real mode programs and operating systems to run while under the control of a protected mode supervisor operating system.
System Management Mode (16-bit)
- Handles system-wide functions like power management, system hardware control, and proprietary OEM designed code. It is intended for use only by system firmware. All normal execution, including the operating system, is suspended. An alternate software system (which usually resides in the computer’s firmware, or a hardware-assisted debugger) is then executed with high privileges.

Switching modes

The processor runs in real mode immediately after power on, so an operating system kernel, or other program, must explicitly switch to another mode if it wishes to run in anything but real mode. Switching modes is accomplished by modifying certain bits of the processor’s control registers after some preparation, and some additional setup may be required after the switch.

Examples

With a computer running legacy BIOS, the BIOS and the boot loader run in Real mode. The 64-bit operating system kernel checks and switches the CPU into Long mode and then starts new kernel-mode threads running 64-bit code.

With a computer running UEFI, the UEFI firmware (except CSM and legacy Option ROM), the UEFI boot loader and the UEFI operating system kernel all run in Long mode.

Instruction types

In general, the features of the modern x86 instruction set are:

A compact encoding
- Variable length and alignment independent (encoded as little endian, as is all data in the x86 architecture)
- Mainly one-address and two-address instructions, that is to say, the first operand is also the destination.
- Memory operands as both source and destination are supported (frequently used to read/write stack elements addressed using small immediate offsets).
- Both general and implicit register usage; although all seven (counting ebp) general registers in 32-bit mode, and all fifteen (counting rbp) general registers in 64-bit mode, can be freely used as accumulators or for addressing, most of them are also implicitly used by certain (more or less) special instructions; affected registers must therefore be temporarily preserved (normally stacked), if active during such instruction sequences.
Produces conditional flags implicitly through most integer ALU instructions.
Supports various addressing modes including immediate, offset, and scaled index, but PC-relative addressing only for jumps (PC-relative data addressing was introduced as an improvement in the x86-64 architecture).
Includes floating-point instructions that operate on a stack of registers.
Contains special support for atomic read-modify-write instructions (xchg, cmpxchg/cmpxchg8b, xadd, and integer instructions which combine with the lock prefix)
SIMD instructions (instructions which perform parallel simultaneous single instructions on many operands encoded in adjacent cells of wider registers).

Stack instructions

The x86 architecture has hardware support for an execution stack mechanism. Instructions such as push, pop, call and ret are used with the properly set up stack to pass parameters, to allocate space for local data, and to save and restore call-return points. The ret size instruction is very useful for implementing space efficient (and fast) calling conventions where the callee is responsible for reclaiming stack space occupied by parameters.

When setting up a stack frame to hold local data of a recursive procedure there are several choices; the high level enter instruction (introduced with the 80186) takes a procedure-nesting-depth argument as well as a local size argument, and may be faster than more explicit manipulation of the registers (such as push bp ; mov bp, sp ; sub sp, size). Whether it is faster or slower depends on the particular x86-processor implementation as well as the calling convention used by the compiler, programmer or particular program code; most x86 code is intended to run on x86-processors from several manufacturers and on different technological generations of processors, which implies highly varying microarchitectures and microcode solutions as well as varying gate- and transistor-level design choices.

The full range of addressing modes (including immediate and base+offset), even for instructions such as push and pop, makes direct usage of the stack for integer, floating point and address data simple, and keeps the ABI specifications and mechanisms relatively simple compared to some RISC architectures (which require more explicit call stack details).

Integer ALU instructions

x86 assembly has the standard mathematical operations, add, sub, neg, imul and idiv (for signed integers), with mul and div (for unsigned integers); the logical operators and, or, xor, not; bitshift arithmetic and logical, sal/sar (for signed integers), shl/shr (for unsigned integers); rotate with and without carry, rcl/rcr, rol/ror, a complement of BCD arithmetic instructions, aaa, aad, daa and others.
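The one-operand forms of mul and div use registers implicitly, which is easy to miss when reading code. The Python sketch below models the 32-bit unsigned forms under the usual rules: mul src leaves the 64-bit product in EDX:EAX, and div src divides EDX:EAX by src, leaving the quotient in EAX and the remainder in EDX. The names mul32 and div32 are hypothetical, and flags are not modelled.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF


def mul32(eax: int, src: int) -> Tuple[int, int]:
    """Unsigned `mul src`: return (edx, eax) holding the 64-bit product EDX:EAX."""
    product = (eax & MASK32) * (src & MASK32)
    return product >> 32, product & MASK32


def div32(edx: int, eax: int, src: int) -> Tuple[int, int]:
    """Unsigned `div src`: divide EDX:EAX by src, return (edx=remainder, eax=quotient)."""
    src &= MASK32
    dividend = ((edx & MASK32) << 32) | (eax & MASK32)
    # The CPU raises a divide error for a zero divisor or a quotient over 32 bits.
    if src == 0 or dividend // src > MASK32:
        raise ZeroDivisionError("divide error (#DE)")
    return dividend % src, dividend // src


if __name__ == "__main__":
    print(mul32(123, 321))     # (0, 39483)
    print(div32(0, 444, 321))  # (123, 1)
```

A common consequence is that EDX has to be cleared (or otherwise set up) before an unsigned div, because whatever it holds is treated as the upper half of the dividend.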

Floating-point instructions

x86 assembly language includes instructions for a stack-based floating-point unit (FPU). The FPU was an optional separate coprocessor for the 8086 through the 80386, it was an on-chip option for the 80486 series, and it is a standard feature in every Intel x86 CPU since the 80486, starting with the Pentium. The FPU instructions include addition, subtraction, negation, multiplication, division, remainder, square roots, integer truncation, fraction truncation, and scale by power of two. The operations also include conversion instructions, which can load or store a value from memory in any of the following formats: binary-coded decimal, 32-bit integer, 64-bit integer, 32-bit floating-point, 64-bit floating-point or 80-bit floating-point (upon loading, the value is converted to the currently used floating-point mode). x86 also includes a number of transcendental functions, including sine, cosine, tangent, arctangent, exponentiation with the base 2 and logarithms to bases 2, 10, or e.

The stack register to stack register format of the instructions is usually fop st, st(n) or fop st(n), st, where st is equivalent to st(0), and st(n) is one of the 8 stack registers (st(0), st(1), …, st(7)). Like the integers, the first operand is both the first source operand and the destination operand. fsubr and fdivr should be singled out as first swapping the source operands before performing the subtraction or division. The addition, subtraction, multiplication, division, store and comparison instructions include instruction modes that pop the top of the stack after their operation is complete. So, for example, faddp st(1), st performs the calculation st(1) = st(1) + st(0), then removes st(0) from the top of stack, thus making what was the result in st(1) the top of the stack in st(0).
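The faddp example can be traced with a list standing in for the register stack. This is a simplified Python model that assumes stack[0] plays the role of st(0); it ignores the fixed depth of eight registers, tag words, and floating-point precision. The function name faddp mirrors the mnemonic but is otherwise hypothetical.

```python
def faddp(stack: list, n: int = 1) -> None:
    """Model `faddp st(n), st`: st(n) = st(n) + st(0), then pop st(0).

    stack[0] stands for st(0), the top of the register stack.
    """
    if not 1 <= n < len(stack):
        raise IndexError("st(n) must name a register below the top that holds a value")
    stack[n] += stack[0]
    del stack[0]  # popping renumbers the registers: the old st(1) becomes st(0)


if __name__ == "__main__":
    regs = [2.0, 3.0, 10.0]  # st(0)=2.0, st(1)=3.0, st(2)=10.0
    faddp(regs, 1)
    print(regs)              # [5.0, 10.0]
```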

SIMD instructions

Modern x86 CPUs contain SIMD instructions, which largely perform the same operation in parallel on many values encoded in a wide SIMD register. Various instruction technologies support different operations on different register sets, but taken as a complete whole (from MMX to SSE4.2) they include general computations on integer or floating-point arithmetic (addition, subtraction, multiplication, shift, minimization, maximization, comparison, division or square root). So for example, paddw mm0, mm1 performs 4 parallel 16-bit (indicated by the w) integer adds (indicated by the padd) of mm0 values to mm1 and stores the result in mm0. Streaming SIMD Extensions or SSE also includes a floating-point mode in which only the very first value of the registers is actually modified (expanded in SSE2). Some other unusual instructions have been added including a sum of absolute differences (used for motion estimation in video compression, such as is done in MPEG) and a 16-bit multiply accumulation instruction (useful for software-based alpha-blending and digital filtering). SSE (since SSE3) and 3DNow! extensions include addition and subtraction instructions for treating paired floating-point values like complex numbers.
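The lane-wise behaviour of paddw can be modelled on plain integers. The Python sketch below treats each 64-bit value as four 16-bit lanes and adds them independently with wraparound; it assumes the non-saturating form of the instruction, and the function name paddw is used only for illustration.

```python
def paddw(mm0: int, mm1: int) -> int:
    """Model `paddw mm0, mm1`: four independent 16-bit adds with wraparound."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        a = (mm0 >> shift) & 0xFFFF
        b = (mm1 >> shift) & 0xFFFF
        # Masking the sum means a carry never crosses into the next lane.
        result |= ((a + b) & 0xFFFF) << shift
    return result


if __name__ == "__main__":
    total = paddw(0x0001_FFFF_0003_0004, 0x0001_0001_0001_0001)
    print(f"{total:016x}")  # 0002000000040005
```

A plain 64-bit add of the same two values would carry out of the second-highest lane into the highest one; the per-lane mask is what distinguishes the packed operation.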

These instruction sets also include numerous fixed sub-word instructions for shuffling, inserting and extracting the values around within the registers. In addition there are instructions for moving data between the integer registers and XMM (used in SSE)/FPU (used in MMX) registers.

Memory instructions

The x86 processor also includes complex addressing modes for addressing memory with an immediate offset, a register, a register with an offset, a scaled register with or without an offset, and a register with an optional offset and another scaled register. So for example, one can encode mov eax, [Table + ebx + esi*4] as a single instruction which loads 32 bits of data from the address computed as (Table + ebx + esi * 4) offset from the ds selector, and stores it to the eax register. In general, x86 processors can load and use memory matched to the size of any register they are operating on. (The SIMD instructions also include half-load instructions.)
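The address arithmetic in that example can be written out as a short Python sketch. The function `effective_address`, the value chosen for `TABLE` and the register contents are all hypothetical; segment translation through ds is deliberately left out.

```python
def effective_address(disp=0, base=0, index=0, scale=1):
    """Compute disp + base + index*scale as a 32-bit offset (segment base ignored)."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only encodes scale factors 1, 2, 4 and 8")
    return (disp + base + index * scale) & 0xFFFFFFFF   # wrap to 32 bits


TABLE = 0x0040_2000    # assumed address of the label Table
ebx = 0x10             # assumed register contents
esi = 3
print(hex(effective_address(disp=TABLE, base=ebx, index=esi, scale=4)))
```

With a scale of 4, esi acts as an index into an array of 32-bit elements, which is the usual reason for using this addressing form.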

Most 2-operand x86 instructions, including integer ALU instructions, use a standard «addressing mode byte»^[13] often called the MOD-REG-R/M byte.^[14]^[15]^[16] Many 32-bit x86 instructions also have a SIB addressing mode byte that follows the MOD-REG-R/M byte.^[17]^[18]^[19]^[20]^[21]

In principle, because the instruction opcode is separate from the addressing mode byte, those instructions are orthogonal because any of those opcodes can be mixed-and-matched with any addressing mode.
However, the x86 instruction set is generally considered non-orthogonal because many other opcodes have some fixed addressing mode (they have no addressing mode byte), and every register is special.^[21]^[22]

The x86 instruction set includes string load, store, move, scan and compare instructions (lods, stos, movs, scas and cmps) which perform each operation on an element of a specified size (b for 8-bit byte, w for 16-bit word, d for 32-bit double word) and then increment or decrement (depending on DF, the direction flag) the implicit address register (si for lods, di for stos and scas, and both for movs and cmps). For the load, store and scan operations, the implicit target/source/comparison register is al, ax or eax (depending on size). The implicit segment registers used are ds for si and es for di. When a repeat prefix is used, the cx or ecx register acts as a decrementing counter, and the operation stops when the counter reaches zero or (for scans and comparisons) when inequality is detected. Unfortunately, over the years the performance of some of these instructions became neglected and in certain cases it is now possible to get faster results by writing out the algorithms yourself. Intel and AMD have refreshed some of the instructions though, and a few now have very respectable performance, so the programmer should consult recent, reputable benchmarks before choosing a particular instruction from this group.
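A repeated byte move can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions: memory is one flat bytearray, the ds and es segments are ignored, and the function name `rep_movsb` and its parameters are invented for the illustration.

```python
def rep_movsb(mem, si, di, cx, df=0):
    """Model `rep movsb` on a flat bytearray; returns the final (si, di, cx)."""
    step = -1 if df else 1       # DF=0 walks upward, DF=1 walks downward
    while cx != 0:
        mem[di] = mem[si]        # copy one byte from [si] to [di]
        si += step
        di += step
        cx -= 1                  # cx counts down to zero
    return si, di, cx


mem = bytearray(b"hello.....")
print(rep_movsb(mem, si=0, di=5, cx=5), mem)
```

Both index registers advance together and the counter ends at zero, so the registers are left pointing just past the copied data.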

The stack is a region of memory with an associated ‘stack pointer’, which points to the bottom of the stack, that is, the lowest address in use, because the stack grows downward. The stack pointer is decremented when items are added (‘push’) and incremented after items are removed (‘pop’). In 16-bit mode, this implicit stack pointer is addressed as SS:[SP], in 32-bit mode it is SS:[ESP], and in 64-bit mode it is [RSP]. The stack pointer actually points to the last value that was stored, under the assumption that its size will match the operating mode of the processor (i.e., 16, 32, or 64 bits) to match the default width of the push/pop/call/ret instructions. Also included are the instructions enter and leave, which reserve and remove data from the top of the stack while setting up a stack frame pointer in bp/ebp/rbp. However, setting the sp/esp/rsp register directly, or adding to and subtracting from it, is also supported, so the enter/leave instructions are generally unnecessary.
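The decrement-then-store and load-then-increment behaviour can be sketched as a toy Python class. `Stack64`, its 64-byte memory and its method names are assumptions for this example only; a real stack lives in ordinary process memory.

```python
class Stack64:
    """Toy 64-bit stack: rsp holds the address of the most recently pushed value."""

    def __init__(self, size=64):
        self.mem = bytearray(size)
        self.rsp = size                      # empty stack: rsp sits just past the region

    def push(self, value):
        if self.rsp < 8:
            raise OverflowError("stack region exhausted")
        self.rsp -= 8                        # decrement first...
        data = (value & 0xFFFFFFFFFFFFFFFF).to_bytes(8, "little")
        self.mem[self.rsp:self.rsp + 8] = data   # ...then store at the new rsp

    def pop(self):
        value = int.from_bytes(self.mem[self.rsp:self.rsp + 8], "little")
        self.rsp += 8                        # increment after reading
        return value


s = Stack64()
s.push(0x1111)
s.push(0x2222)
print(s.rsp, hex(s.pop()), s.rsp)
```

Values come back in the reverse of the order they were pushed, and rsp returns to its starting value once every push has been matched by a pop.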

This code is the beginning of a function typical for a high-level language when compiler optimisation is turned off for ease of debugging:

 push    rbp       ; Save the calling function’s stack frame pointer (rbp register)
 mov     rbp, rsp  ; Make a new stack frame below our caller’s stack
 sub     rsp, 32   ; Reserve 32 bytes of stack space for this function’s local variables.
                   ; Local variables will be below rbp and can be referenced relative to rbp,
                   ; again best for ease of debugging, but for best performance rbp will not
                   ; be used at all, and local variables would be referenced relative to rsp
                   ; because, apart from the code saving, rbp then is free for other uses.
  …       …        ; However, if rbp is altered here, its value should be preserved for the caller.
 mov [rbp-8], rdx  ; Example of accessing a local variable: stores register rdx into the memory location at rbp-8

…where the first three instructions are functionally equivalent to just:

 enter   32, 0

Other instructions for manipulating the stack include pushfd (32-bit) / pushfq (64-bit) and popfd/popfq for storing and retrieving the EFLAGS (32-bit) / RFLAGS (64-bit) register.

Values for a SIMD load or store are assumed to be packed in adjacent positions for the SIMD register, and the load or store places them in sequential little-endian order. Some SSE load and store instructions require 16-byte alignment to function properly. The SIMD instruction sets also include «prefetch» instructions which perform the load but do not target any register, used for cache loading. The SSE instruction sets also include non-temporal store instructions which perform stores straight to memory without performing a cache allocate if the destination is not already cached (otherwise they behave like a regular store).

Most generic integer and floating-point (but no SIMD) instructions can use one parameter as a complex address as the second source parameter. Integer instructions can also accept one memory parameter as a destination operand.

Program flow

The x86 assembly has an unconditional jump operation, jmp, which can take an immediate address, a register or an indirect address as a parameter (note that most RISC processors only support a link register or short immediate displacement for jumping).

Also supported are several conditional jumps, including jz (jump on zero), jnz (jump on non-zero), jg (jump on greater than, signed), jl (jump on less than, signed), ja (jump on above/greater than, unsigned), jb (jump on below/less than, unsigned). These conditional operations are based on the state of specific bits in the (E)FLAGS register. Many arithmetic and logic operations set, clear or complement these flags depending on their result. The comparison cmp (compare) and test instructions set the flags as if they had performed a subtraction or a bitwise AND operation, respectively, without altering the values of the operands. There are also instructions such as clc (clear carry flag) and cmc (complement carry flag) which work on the flags directly. Floating point comparisons are performed via fcom or ficom instructions which eventually have to be converted to integer flags.
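To make the link between cmp, the flag bits and the conditional jumps concrete, here is a Python sketch for 32-bit operands. The helper names `cmp_flags` and `taken` are hypothetical, and only the four flags needed by the jumps listed above are modelled.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags that `cmp a, b` would produce for 32-bit operands (a - b, result discarded)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "ZF": int(result == 0),                  # operands were equal
        "CF": int(a < b),                        # unsigned borrow
        "SF": result >> 31,                      # sign bit of the result
        "OF": ((a ^ b) & (a ^ result)) >> 31,    # signed overflow
    }


def taken(jcc, f):
    """Return True if the named conditional jump would be taken with flags f."""
    conditions = {
        "jz": f["ZF"] == 1,
        "jnz": f["ZF"] == 0,
        "ja": f["CF"] == 0 and f["ZF"] == 0,     # unsigned greater than
        "jb": f["CF"] == 1,                      # unsigned less than
        "jg": f["ZF"] == 0 and f["SF"] == f["OF"],   # signed greater than
        "jl": f["SF"] != f["OF"],                # signed less than
    }
    return conditions[jcc]


flags = cmp_flags(0xFFFFFFFF, 1)     # 0xFFFFFFFF is 4294967295 unsigned, -1 signed
print(flags, taken("ja", flags), taken("jl", flags))
```

The same bit pattern is "above" 1 when read as unsigned and "less than" 1 when read as signed, which is why the signed and unsigned jumps test different flags.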

Each jump operation has three different forms, depending on the size of the operand. A short jump uses an 8-bit signed operand, which is a relative offset counted from the address of the instruction that follows the jump. A near jump is similar to a short jump but uses a 16-bit signed operand (in real or protected mode) or a 32-bit signed operand (in 32-bit protected mode only). A far jump is one that uses the full segment base:offset value as an absolute address. There are also indirect and indexed forms of each of these.
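The target of a short jump can be worked out with a small Python sketch. It assumes the usual two-byte encoding (one opcode byte plus the 8-bit operand) and counts the displacement from the end of the jump instruction; the function name `short_jump_target` is hypothetical.

```python
def short_jump_target(jmp_address, rel8, address_bits=32):
    """Target of a 2-byte short jump (opcode + rel8) located at jmp_address."""
    displacement = rel8 - 0x100 if rel8 & 0x80 else rel8   # sign-extend the byte
    next_instruction = jmp_address + 2                      # opcode byte + rel8 byte
    return (next_instruction + displacement) & ((1 << address_bits) - 1)


# An operand byte of 0xFE is -2, so the jump lands on itself (an endless loop).
print(hex(short_jump_target(0x100, 0xFE)))
```

Because the operand is a signed byte, a short jump can only reach targets from 128 bytes before to 127 bytes after the following instruction.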

In addition to the simple jump operations, there are the call (call a subroutine) and ret (return from subroutine) instructions. Before transferring control to the subroutine, call pushes the segment offset address of the instruction following the call onto the stack; ret pops this value off the stack, and jumps to it, effectively returning the flow of control to that part of the program. In the case of a far call, the segment base is pushed before the offset, so the offset ends up on top; far ret pops the offset and then the segment base to return.

There are also two related instructions, int and iret. int (interrupt) saves the current (E)FLAGS register value on the stack, then performs a far call, except that instead of an address, it uses an interrupt vector, an index into a table of interrupt handler addresses. Typically, the interrupt handler saves all other CPU registers it uses, unless they are used to return the result of an operation to the calling program (in software-called interrupts). The matching return-from-interrupt instruction is iret, which restores the flags after returning. Soft interrupts of the type described above are used by some operating systems for system calls, and can also be used in debugging hard interrupt handlers. Hard interrupts are triggered by external hardware events, and their handlers must preserve all register values, as the state of the currently executing program is unknown. In Protected Mode, interrupts may be set up by the OS to trigger a task switch, which will automatically save all registers of the active task.

Examples

The following examples use the so-called Intel-syntax flavor as used by the assemblers Microsoft MASM, NASM and many others. (Note: There is also an alternative AT&T-syntax flavor where the order of source and destination operands is swapped, among many other differences.)^[23]

«Hello world!» program for MS-DOS in MASM-style assembly

This example uses the software interrupt instruction int 21h to call the MS-DOS operating system for output to the display; other samples use libc’s C printf() routine to write to stdout. Note that the first example is a 30-year-old example using 16-bit mode, as on an Intel 8086. The second example is Intel 386 code in 32-bit mode. Modern code will be in 64-bit mode.^[24]

.model small
.stack 100h

.data
msg	db	'Hello world!$'

.code
start:
    mov ax, @DATA  ; Initializes Data segment
    mov ds, ax
	mov	ah, 09h    ; Sets 8-bit register ‘ah’, the high byte of register ax, to 9, to
                   ; select a sub-function number of an MS-DOS routine called below
                   ; via the software interrupt int 21h to display a message
	lea	dx, msg    ; Takes the address of msg, stores the address in 16-bit register dx
	int	21h        ; Various MS-DOS routines are callable by the software interrupt 21h
                   ; Our required sub-function was set in register ah above

	mov	ax, 4C00h  ; Sets register ax to the sub-function number for MS-DOS’s software
                   ; interrupt int 21h for the service ‘terminate program’.
	int	21h        ; Calling this MS-DOS service never returns, as it ends the program.

end start

«Hello world!» program for Windows in MASM style assembly

; requires /coff switch on 6.15 and earlier versions
.386
.model small,c
.stack 1000h

.data
msg     db "Hello world!",0

.code
includelib libcmt.lib
includelib libvcruntime.lib
includelib libucrt.lib
includelib legacy_stdio_definitions.lib

extrn printf:near
extrn exit:near

public main
main proc
        push    offset msg
        call    printf
        push    0
        call    exit
main endp

end

«Hello world!» program for Windows in NASM style assembly

; Image base = 0x00400000
%define RVA(x) (x-0x00400000)
section .text
push dword hello
call dword [printf]
push byte +0
call dword [exit]
ret

section .data
hello db "Hello world!", 0   ; printf expects a NUL-terminated string

section .idata
dd RVA(msvcrt_LookupTable)
dd -1
dd 0
dd RVA(msvcrt_string)
dd RVA(msvcrt_imports)
times 5 dd 0 ; ends the descriptor table

msvcrt_string dd "msvcrt.dll", 0
msvcrt_LookupTable:
dd RVA(msvcrt_printf)
dd RVA(msvcrt_exit)
dd 0

msvcrt_imports:
printf dd RVA(msvcrt_printf)
exit dd RVA(msvcrt_exit)
dd 0

msvcrt_printf:
dw 1
dw "printf", 0
msvcrt_exit:
dw 2
dw "exit", 0
dd 0

«Hello world!» program for Linux in its native AT&T style assembly

.data                         # section for initialized data
str: .ascii "Hello, world!\n" # define a string of text containing "Hello, world!" and then a new line
str_len = . - str             # get the length of str by subtracting its address from the current address

.text                         # section for program functions
.globl _start                 # export the _start symbol so the linker can find the entry point
_start:                       # begin the _start function
    movl $4, %eax             # select the system call 'sys_write'
    movl $1, %ebx             # first argument: file descriptor 1, the standard output 'stdout'
    movl $str, %ecx           # second argument: address of our defined string
    movl $str_len, %edx       # third argument: number of bytes to write, the length of our string
    int $0x80                 # raise software interrupt 0x80 to make the system call set up above

    movl $1, %eax             # select the system call 'sys_exit'
    movl $0, %ebx             # exit code 0, meaning success
    int $0x80                 # raise the interrupt again to end the program

«Hello world!» program for Linux in NASM style assembly

;
; This program runs in 32-bit protected mode.
;  build: nasm -f elf -F stabs name.asm
;  link:  ld -o name name.o
;
; In 64-bit long mode you can use 64-bit registers (e.g. rax instead of eax, rbx instead of ebx, etc.)
; Also change "-f elf " for "-f elf64" in build command.
;
section .data                           ; section for initialized data
str:     db 'Hello world!', 0Ah         ; message string with new-line char at the end (10 decimal)
str_len: equ $ - str                    ; calcs length of string (bytes) by subtracting the str's start address
                                          ; from ‘here, this address’ (‘$’ symbol meaning ‘here’)

section .text                           ; this is the code section (program text) in memory 
global _start                           ; _start is the entry point and needs global scope to be 'seen' by the
                                        ; linker --equivalent to main() in C/C++
_start:                                 ; definition of _start procedure begins here
	mov	eax, 4                   ; specify the sys_write function code (from OS vector table)
	mov	ebx, 1                   ; specify file descriptor stdout --in gnu/linux, everything's treated as a file,
                                 ; even hardware devices
	mov	ecx, str                 ; move start _address_ of string message to ecx register
	mov	edx, str_len             ; move length of message (in bytes)
	int	80h                      ; interrupt kernel to perform the system call we just set up -
                                 ; in gnu/linux services are requested through the kernel
	mov	eax, 1                   ; specify sys_exit function code (from OS vector table)
	mov	ebx, 0                   ; specify return code for OS (zero tells OS everything went fine)
	int	80h                      ; interrupt kernel to perform system call (to exit)

In 64-bit long mode, the address of the message could be loaded with «lea rcx, [str]»; note the 64-bit register rcx.

«Hello world!» program for Linux in NASM style assembly using the C standard library

;
;  This program runs in 32-bit protected mode.
;  gcc links the standard-C library by default

;  build: nasm -f elf -F stabs name.asm
;  link:  gcc -o name name.o
;
; In 64-bit long mode you can use 64-bit registers (e.g. rax instead of eax, rbx instead of ebx, etc..)
; Also change "-f elf " for "-f elf64" in build command.
;
        global  main                            ; ‘main’ must be defined, as the program is
                                                ; linked against the C Standard Library
        extern  printf                          ; declares the use of external symbol, as printf
                                                ; printf is declared in a different object-module.
                                                ; The linker resolves this symbol later.

segment .data                                   ; section for initialized data
	string db 'Hello world!', 0Ah, 0            ; message string ending with a newline char (10
                                                ; decimal) and the zero byte ‘NUL’ terminator
                                                ; ‘string’ now refers to the starting address
                                                ; at which 'Hello world!' is stored.

segment .text
main:
        push    string                          ; Push the address of ‘string’ onto the stack.
                                                ; This reduces esp by 4 bytes before storing
                                                ; the 4-byte address ‘string’ into memory at
                                                ; the new esp, the new bottom of the stack.

                                                ; This will be an argument to printf()
        call    printf                          ; calls the C printf() function.
        add     esp, 4                          ; Increases the stack-pointer by 4 to put it back
                                                ; to where it was before the ‘push’, which
                                                ; reduced it by 4 bytes.
        ret                                     ; Return to our caller.

«Hello world!» program for 64-bit mode Linux in NASM style assembly

This example is in modern 64-bit mode.

;  build: nasm -f elf64 -F dwarf hello.asm
;  link:  ld -o hello hello.o

DEFAULT REL			    ; use RIP-relative addressing modes by default, so [foo] = [rel foo]

SECTION .rodata			; read-only data should go in the .rodata section on GNU/Linux, like .rdata on Windows
Hello:		db "Hello world!", 10   ; Ending with a byte 10 = newline (ASCII LF)
len_Hello:	equ $-Hello             ; Get NASM to calculate the length as an assembly-time constant
                                    ; the ‘$’ symbol means ‘here’. write() takes a length so that
                                    ; a zero-terminated C-style string isn't needed.
                                    ; It would be for C puts()

SECTION .text

global _start
_start:
	mov eax, 1				; __NR_write syscall number from Linux asm/unistd_64.h (x86_64)
	mov edi, 1				; int fd = STDOUT_FILENO
	lea rsi, [rel Hello]			; x86-64 uses RIP-relative LEA to put static addresses into regs
	mov rdx, len_Hello		; size_t count = len_Hello
	syscall					; write(1, Hello, len_Hello);  call into the kernel to actually do the system call
     ;; return value in RAX.  RCX and R11 are also overwritten by syscall

	mov eax, 60				; __NR_exit call number (x86_64) is stored in register eax.
	xor edi, edi		    ; This zeros edi and also rdi.
                            ; This xor-self trick is the preferred common idiom for zeroing
                            ; a register, and is generally the most efficient method.
                            ; When a 32-bit value is written to a register such as edi, bits 63:32
                            ; of the full 64-bit register are automatically zeroed too. This saves
                            ; an extra instruction in the very common case where an entire
                            ; 64-bit register has to be filled with a 32-bit value.
                            ; This sets our routine’s exit status = 0 (exit normally)
	syscall					; _exit(0)

Running it under strace verifies that no extra system calls are made in the process. The printf version would make many more system calls to initialize libc and do dynamic linking. But this is a static executable because we linked using ld without -pie or any shared libraries; the only instructions that run in user-space are the ones you provide.

$ strace ./hello > /dev/null                    # without a redirect, your program's stdout is mixed with strace's logging on stderr.  Which is normally fine
execve("./hello", ["./hello"], 0x7ffc8b0b3570 /* 51 vars */) = 0
write(1, "Hello world!\n", 13)          = 13
exit(0)                                 = ?
+++ exited with 0 +++

Using the flags register

Flags are heavily used for comparisons in the x86 architecture. When a comparison is made between two values, the CPU sets the relevant flag or flags. Following this, conditional jump instructions can be used to check the flags and branch to code that should run, e.g.:

	cmp	eax, ebx
	jne	do_something
	; ...
do_something:
	; do something here

Aside from compare instructions, there are a great many arithmetic and other instructions that set bits in the flags register. Other examples are the instructions sub, test and add, and there are many more. Common combinations such as cmp + conditional jump are internally ‘fused’ (‘macro fusion’) into one single micro-instruction (μ-op) and are fast provided the processor can guess which way the conditional jump will go, jump vs continue.

The flags register is also used in the x86 architecture to turn on and off certain features or execution modes. For example, to disable all maskable interrupts, you can use the instruction:

	cli

The flags register can also be directly accessed. The low 8 bits of the flag register can be loaded into ah using the lahf instruction. The entire flags register can also be moved on and off the stack using the instructions pushfd/pushfq, popfd/popfq, int (including into) and iret.
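The byte that lahf places in ah can be decoded with a few lines of Python. The bit positions used here (CF in bit 0, PF in bit 2, AF in bit 4, ZF in bit 6, SF in bit 7) are the standard layout of the low flags byte; the names `FLAG_BITS` and `decode_lahf` are made up for this sketch.

```python
# Bit positions of the status flags in the low byte of the flags register.
FLAG_BITS = {"CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7}


def decode_lahf(ah):
    """Split the byte that `lahf` stores in ah into individual flag values."""
    return {name: (ah >> bit) & 1 for name, bit in FLAG_BITS.items()}


print(decode_lahf(0b0100_0011))      # example byte with ZF and CF set
```

Bit 1 of the byte is reserved and always reads as 1, which is why it appears set in the example value but is not decoded.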

The x87 floating point maths subsystem also has its own independent ‘flags’-type register, the fp status word. In the 1990s it was an awkward and slow procedure to access the flag bits in this register, but on modern processors there are ‘compare two floating point values’ instructions that can be used with the normal conditional jump/branch instructions directly without any intervening steps.

Using the instruction pointer register

The instruction pointer is called ip in 16-bit mode, eip in 32-bit mode, and rip in 64-bit mode. The instruction pointer register points to the address of the next instruction that the processor will attempt to execute. It cannot be directly accessed in 16-bit or 32-bit mode, but a sequence like the following can be written to put the address of next_line into eax (32-bit code):

	call	next_line
next_line:
	pop	eax

Writing to the instruction pointer is simple: a jmp instruction stores the given target address into the instruction pointer. So, for example, a sequence like the following will put the contents of rax into rip (64-bit code):

	jmp	rax

In 64-bit mode, instructions can reference data relative to the instruction pointer, so there is less need to copy the value of the instruction pointer to another register.

References

[1] «Intel 8008 (i8008) microprocessor family». www.cpu-world.com. Retrieved 2021-03-25.
[2] «Intel 8008». CPU MUSEUM, MUSEUM OF MICROPROCESSORS & DIE PHOTOGRAPHY. Retrieved 2021-03-25.
[3] «Intel 8008 OPCODES». www.pastraiser.com. Retrieved 2021-03-25.
[4] «Assembler language reference». www.ibm.com. Retrieved 2022-11-28.
[5] «x86 Assembly Language Reference Manual» (PDF).
[6] Narayam, Ram (2007-10-17). «Linux assemblers: A comparison of GAS and NASM». IBM. Archived from the original on October 3, 2013. Retrieved 2008-07-02.
[7] «The Creation of Unix». Archived from the original on April 2, 2014.
[8] Hyde, Randall. «Which Assembler is the Best?». Retrieved 2008-05-18.
[9] «GNU Assembler News, v2.1 supports Intel syntax». 2008-04-04. Retrieved 2008-07-02.
[10] «i386-Bugs (Using as)». Binutils documentation. Retrieved 15 January 2020.
[11] «Intel 8080 Assembly Language Programming Manual» (PDF). Retrieved 12 May 2023.
[12] Mueller, Scott (March 24, 2006). «P2 (286) Second-Generation Processors». Upgrading and Repairing PCs, 17th Edition (Book) (17 ed.). Que. ISBN 0-7897-3404-4. Retrieved 2017-12-06.
[13] Curtis Meadow. «Encoding of 8086 Instructions».
[14] Igor Kholodov. «6. Encoding x86 Instruction Operands, MOD-REG-R/M Byte».
[15] «Encoding x86 Instructions».
[16] Michael Abrash. «Zen of Assembly Language: Volume I, Knowledge». «Chapter 7: Memory Addressing». Section «mod-reg-rm Addressing».
[17] Intel 80386 Reference Programmer’s Manual. «17.2.1 ModR/M and SIB Bytes».
[18] «X86-64 Instruction Encoding: ModR/M and SIB bytes».
[19] «Figure 2-1. Intel 64 and IA-32 Architectures Instruction Format».
[20] «x86 Addressing Under the Hood».
[21] Stephen McCamant. «Manual and Automated Binary Reverse Engineering».
[22] «X86 Instruction Wishlist».
[23] Peter Cordes (18 December 2011). «NASM (Intel) versus AT&T Syntax: what are the advantages?». Stack Overflow.
[24] «I just started Assembly». daniweb.com. 2008.

Introduction

This is my full and final article about Intel Assembly. It includes all the previous hardware articles (Internals, Virtualization, Multicore, DMMI) along with some new information (HIMEM.SYS, Flat mode, EMM386.EXE, Expanded Memory, DPMI information).

Reading this through will enable you to understand how operating systems work, how memory is allocated and addressed and, perhaps, how to make your own OS-level drivers and applications.

To help you understand what’s happening, the github project includes many aspects of the article (and I’m still adding stuff). It’s a ready-to-run tool which includes a Bochs binary, VMWare and VirtualBox configurations and a Visual Studio solution. The entire project is built in assembly using Flat Assembler.

Assemblers like TASM or MASM will not work, for they only support specific architectures.

Bochs is the best environment to experiment, because it includes a hardware GUI debugger (I’m proud of developing it myself) which can help you understand the internals. Debugging without Bochs is impossible, because the debuggers are either real mode only (like MSDOS Debug) and assume you will always have some sort of control (which is not the case in most debugging areas), or are able to run only in an existing environment (like Visual Studio).

If you have good C knowledge, then this will be a benefit in understanding the internals. Assembly knowledge is recommended, but you can follow the article even if you know nothing about assembly.

Generic Information

Architecture and CPU

Assembly is a language in which everything must be done manually. A single printf() call may take thousands of assembly instructions to execute. While this article does not attempt to teach you assembly, bear in mind that a lot of work is needed to achieve even the smallest result (that is actually why higher level languages were created). Assembly language is also specific to the architecture (here, we discuss Intel x86 and x64), whereas a language like C is portable.

Assembly has a small (comparatively) set of commands:

Commands that move data between various places
Commands that perform mathematical operations (simple to complex)
Commands that check conditions (like if)
Other commands (to be later discussed)

The CPU is the unit that executes assembly instructions. The way they are executed depends on the running mode of the processor, and there are 4 modes:

Real mode
Protected mode (in two versions, segmented and flat)
Long mode
Virtualization (not exactly a mode, but we will talk about it later)

The next paragraphs in this chapter discuss various elements of the assembly language in general.

Memory

Physically, the memory is one big array. If you have 4GB, you could describe it as unsigned char mem[4294967296]. However, the way it is used greatly differs depending on the processor mode and the configuration of the operating system. Therefore, you do not access it as a big array.

Stack and Functions

The stack is special memory that is set up for temporary storage. Parameters passed to a function are «pushed» onto the stack, and when the function ends they are «popped», so the stack is cleared. A C function’s local variables also go there, which is why they vanish when the function terminates. The stack memory is, technically, nothing but normal memory used for special purposes.

This is (oversimplified for now) what approximately happens in assembly with a function:

int x(int a,int b)
{
return a + b;
}

int c = x(5,10);

x:
mov ax,[first stack element]
mov bx,[second stack element]
add ax,bx
ret 4

main:
push 5
push 10 
call x

The variables «a» and «b» are «pushed» to temporary memory (which now has 4 bytes less free space if int = 16 bits). The function is called, and then it returns with the stack cleared and ax containing the return value. Note that the above is a big oversimplification of what the assembly code actually looks like, but let’s move on for now.
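The sequence of pushes, the call and the cleanup done by ret 4 can be followed step by step in a Python sketch. It is as simplified as the listing above: the stack is a plain list, the return address is a placeholder string, and the function name `call_x` is invented for the example.

```python
def call_x(a, b):
    """Walk through `push a / push b / call x / ret 4` with 16-bit stack slots."""
    stack = []                          # the end of the list is the top of the stack
    stack.append(a)                     # push 5
    stack.append(b)                     # push 10
    stack.append("return address")      # call x pushes where to resume
    # Inside x: the two arguments sit just under the return address.
    ax = (stack[-2] + stack[-3]) & 0xFFFF   # mov ax,[..] / mov bx,[..] / add ax,bx
    stack.pop()                         # ret pops the return address...
    del stack[-2:]                      # ...and `ret 4` drops 4 more bytes: two 16-bit arguments
    return ax, stack


print(call_x(5, 10))
```

The function returns the value left in ax together with the stack, which is empty again because ret 4 removed the arguments on behalf of the caller.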

Registers

In addition to memory, each CPU has some auxiliary places to store data, called registers. What registers are available depends on the current running mode. Some registers have special meanings, some are for generic purposes.

Interrupts

An interrupt is code that interrupts other running code. For the moment, just assume it’s a function that can run while you are inside another function. There are interrupts that are automatically generated by the CPU (either hardware or when an exception occurs), and interrupts that are «called» by software. The way they work depends on the running mode, and there can be a maximum of 256 interrupts.

Exceptions

An exception is an interrupt triggered by either the CPU (for example, when a divide by zero occurs in your C++ code, int 00 functions are executed), or by using the API (via the throw keyword, for example), which generates a software interrupt. In the lower level we are discussing, there is no difference between exceptions and interrupts.

Now that we have an idea of the basics, let’s proceed to CPU modes.

Real Mode

Architecture

Real mode is the oldest mode. DOS runs in it. Windows 3.0 also runs in it when started with the /r switch. Everything is 16 bit. It is the weakest mode of operation, but not the simplest one. Memory is addressed through a 20-bit address bus, making it possible to access up to 1MB of memory. Available memory over this limit is useless in real mode.

Segmentation

Memory is not accessed as an array, but in segments. Each pointer is described by a 16 bit segment, which is a memory address divided by 16, and an offset, which describes how far from the start of the segment we will go. Here are some simple examples (in hex):

0000:0000 -> memory address 0
0000:0010 -> memory address 16 (hex 10)
0001:0002 -> memory address 18. Segment 1*16 + offset 2
0010:0034 -> 0x10*16 + 0x34
0011:0024 -> 0x11*16 + 0x24, same pointer as above
FFFF:000F -> Maximum available address (0xFFFFF); specifying an offset of 0010h or more results in wrapping around zero.

We can see that segments can overlap. Specifying the 0ffffh segment and an offset of 0010h or larger results in wrapping. A segment’s maximum capacity is 64KB. Although we can go up to a FFFF segment, only the lower 640KB were available for DOS applications, because the upper segments (over 0xA000) were reserved for the BIOS.
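The segment arithmetic can be checked with a short Python sketch. The function name `linear_address` is hypothetical, and the 20-bit wrap models a machine whose address bus is limited to 20 bits, as described for real mode above.

```python
def linear_address(segment, offset, address_bits=20):
    """Real-mode translation: segment * 16 + offset, truncated to the address bus width."""
    return ((segment << 4) + offset) & ((1 << address_bits) - 1)


# Two different segment:offset pairs that name the same byte.
print(hex(linear_address(0x0010, 0x0034)), hex(linear_address(0x0011, 0x0024)))

# One byte past the top of the 1MB range wraps around to zero.
print(hex(linear_address(0xFFFF, 0x000F)), hex(linear_address(0xFFFF, 0x0010)))
```

Since the segment is only shifted by four bits, consecutive segment values start 16 bytes apart, which is what makes the overlap possible.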

All segments have read/write/execute access from anywhere (that is, any program can read/write or execute code within any segment). Any application can read from or write to any part of memory, including the part in which the OS resides. That is why a real mode OS is a single tasking OS and if one app crashes, you have to reboot.

Registers

Real mode registers are 16 bits, and they include:

Four general purpose registers: AX, BX, CX, DX. The upper 8 bits of each can be accessed as AH, BH, CH, DH and the lower 8 bits as AL, BL, CL, DL.
A register to hold the offset of the currently executing code: IP.
Four registers to be used as pointers: SI, DI, BP, SP. SP points to the top of the stack (it cannot be used as an index like the rest). Each time we push something to the stack, SP decreases. On POP, SP increases. These registers have no 8-bit halves.
Four registers to contain segments: CS, which always holds the segment of the currently executing code, DS, ES and SS. SS holds the segment of the stack memory, DS holds the segment of the data, and ES is an auxiliary register.

So the code is always executing at CS:IP, and the stack is pointed to by SS:SP.

The 386 CPU adds more registers, also accessible in real mode:

32 bit extensions to the non segment registers: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EIP.
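Two more auxiliary segment registers, GS and FS.
Control registers CR0, CR2 and CR3 (CR1 is reserved; CR4 was added in later processors).
6 debug registers, DR0, DR1, DR2, DR3, DR6, DR7, used for hardware breakpoints.

DS is the default data segment, unless another segment is specified or BP (or EBP/ESP) is used as the base, in which case SS is the default:

mov ax,[100]    ; DS:100
mov ax,[si]     ; DS:SI
mov ax,[bp]     ; SS:BP
mov ax,[es:si]  ; explicit override: ES:SI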

ESI, EDI, EBP and ESP can also be used as pointers. If the resulting offset exceeds 0xFFFF, an exception occurs (unless you are in unreal mode, discussed below).

When string operations store data (movsb, stosw etc), DI is used as the index and ES is its segment; this destination segment cannot be overridden.

COM and EXE files

A COM file is a memory image that fits in one segment. The first 256 bytes (0x100) contain the PSP (Program Segment Prefix), a data structure holding information about the program, and the rest of the segment contains all code, data, and stack memory for the program. CS = DS = ES = SS. SP is set to 0xFFFE to point to the end of the segment. Execution starts at CS:0x100 (right after the PSP).

An EXE file might have multiple segments, so an EXE can be larger than 64KB. DS and ES initially point to the PSP. When an EXE is loaded, «relocations» are resolved. A relocation is a position within the executable that the assembler leaves empty, to be filled with a segment value which is only known at run time.

Interrupts

All the functions that DOS and the BIOS provide are available through real mode software interrupts. In real mode, the first 1024 bytes of RAM (starting at 0000:0000) contain a table of 256 far pointers, one per interrupt, each stored as a 16-bit offset followed by a 16-bit segment. On the 286+ this location can be changed with the LIDT instruction, which takes a pointer to a 6-byte array:

Bytes 0-1 contain the limit of the IDT (its length in bytes minus 1), maximum 1KB => 256 entries.
Bytes 2-5 contain the physical address of the first entry of the IDT, in memory.
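
As a rough illustration of the real mode interrupt table layout, the following Python sketch looks up one vector in a simulated copy of the first 1024 bytes of memory. The function name read_vector and the handler address 0019:1460 are invented for the example.

```python
import struct


def read_vector(memory, number):
    """Return (segment, offset) of interrupt `number` from a real mode table.

    Each entry is 4 bytes long: the 16-bit offset comes first, then the
    16-bit segment, both little-endian.
    """
    offset, segment = struct.unpack_from("<HH", memory, number * 4)
    return segment, offset


# Simulated first 1024 bytes of RAM with a made-up handler for int 21h.
ivt = bytearray(1024)
struct.pack_into("<HH", ivt, 0x21 * 4, 0x1460, 0x0019)

segment, offset = read_vector(ivt, 0x21)
print(f"int 21h -> {segment:04X}:{offset:04X}")
```

Hooking an interrupt in real mode amounts to overwriting these 4 bytes with a pointer to your own handler.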

Some interrupts are automatically issued by the processor when some event occurs. In real mode, the most significant are:

Interrupt 0, called on divide by zero.
Interrupt 1, called when using a debugger for single step.
Interrupt 3, called on breakpoints.
Interrupt 6, called on invalid opcode.
Interrupt 9, called on key press.

Software interrupts provide various services to real mode apps. The most important interrupts are:

0x10, BIOS display functions
0x13, BIOS disk functions
0x14, BIOS serial port functions
0x16, BIOS keyboard functions
0x17, BIOS parallel port functions
0x21, DOS functions (files, input, output, application, configuration etc)
0x2F, TSR functions
0x31, DPMI functions
0x33, Mouse functions

Using the excellent Ralf Brown's Interrupt List you can learn about every interrupt in the world.

Models

Because of the segmented memory, different sets of programming models were created, which mostly resulted in incompatibilities between compilers and libraries. C pointers were described as near or far, depending on whether they included a segment or not:

The tiny model. Everything has to be included in a single segment (COM file). Pointers are near.
The small model. One segment for the code, one for the data. All pointers are near.
The medium model. One data segment, multiple code segments. Code pointers far, data pointers near.
The compact model. One code segment, multiple data segments. Code pointers near, data pointers far.
The large model. Multiple code and data segments, code and data pointers far. Single data structures still limited to 64KB.
The huge model. Multiple code and data segments, all pointers far.

Benefits

The only benefit in real mode is that you have DOS and BIOS functions available as software interrupts. Therefore, all techniques used by DOS extenders (which allowed applications to run in protected mode) involved temporarily switching to real mode to call DOS.

Here is a quick hello world in tiny model:

org 0x100 ; code starts at offset 100h
use16               ; use 16-bit code
mov ax,0900h
mov dx,Msg
int 21h
mov ax,4c00h
int 21h
Msg db "Hello World!$"

This very simple program calls two DOS functions. The first is function 9 (selected in the AH register), which accepts a pointer to a '$'-terminated string in DS:DX and writes it to the screen (DS already holds the segment, since this is a COM file). The second is function 4Ch, which terminates the program.

Here is the same application in EXE format:

FORMAT MZ               ; DOS 16-bit EXE format
ENTRY CODE16:Main       ; Specify the entry point (i.e. the start address)
STACK STACK16:stackdata ; Specify the stack segment and the initial SP

SEGMENT CODE16_2 USE16  ; Declare a 16-bit segment
    ShowMsg:
        mov ax,DATA16   ; segment value, patched by the loader
        mov ds,ax
        mov ax,0900h    ; DOS function 9: print the '$'-terminated string at DS:DX
        mov dx,Msg
        int 21h
        retf            ; far return to the caller's segment

SEGMENT CODE16 USE16    ; Declare a 16-bit segment
ORG 0                   ; Says that the offset of the first opcode of this segment is 0
    Main:
        call CODE16_2:ShowMsg ; direct far call into the other code segment
        mov ax,4c00h    ; DOS function 4Ch: terminate
        int 21h

SEGMENT DATA16 USE16
    Msg db "Hello World!$"

SEGMENT STACK16 USE16
    dw 1024 dup(0)      ; use 2048 bytes as stack
    stackdata:          ; SS:SP point here when the program is initialized

How does the assembler know the actual values of the DATA16, CODE16, CODE16_2, and STACK16 segments? It doesn’t. It emits placeholder values and creates entries in the EXE header (known as «relocations») so that the loader, once it has copied the code to memory, can write the true segment values at the specified addresses. Because the relocation table lives in the EXE header, which a COM file does not have, COM files cannot have multiple segments even if they sum to less than 64KB in total.

This program calls a function ShowMsg in another segment via a far call, which uses a DOS function (09h, INT 21h) to display text.

Problems

If multiple applications are running, one application can overwrite any other without any notification.
Up to 1MB of memory only, and the upper 384KB were reserved for video memory, adapter ROMs and the BIOS, so only 640KB were available.
Mixing far and near pointers between applications and libraries led to incompatibilities and, usually, crashes.
If something goes wrong, the PC has to be rebooted.

Expanded Memory

To cope with the 640KB limitation, an additional type of memory, called expanded memory or EMS memory, was created. This was not a processor feature, but rather a hardware extension (an ISA card) with a driver that performed bank switching, i.e. replaced portions of the address space with memory from that card. It offered up to 32MB more, but it was mapped into one of the high segments (A000, B000, C000, D000, E000 or F000), which means that this extra memory could not all be visible at once. The expansion card came with a driver which had to be installed in config.sys and which, using the LIM EMS protocol, offered its services via interrupt 67h.

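Detecting EMS by testing for the existence of a device called EMMXXXX0:

EMSName db 'EMMXXXX0',0
mov  dx,EMSName       ; DS:DX -> device name
mov  ax,3D00h         ; DOS function 3Dh: open for reading
int  21h
jc   NotThere         ; open failed: no such device
mov  bx,ax            ; BX = handle
mov  ax,4407h         ; IOCTL: get output status
int  21h
jc   NotThere
cmp  al,0FFh          ; 0FFh = device ready
jne  NotThere
mov  ah,3Eh           ; DOS function 3Eh: close the handle in BX
int  21h
jmp  ItIsThere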

Allocating EMS

Interrupt 0x67, AH = 0x43, BX = # of pages (1 page = 16KB)

Get the page frame segment to be used

Interrupt 0x67, AH = 0x41

Save previous EMS map

Interrupt 0x67, AH = 0x47

Map our allocated memory

Interrupt 0x67, AH = 0x44

Restore previous EMS map

Interrupt 0x67, AH = 0x48

Release EMS

Interrupt 0x67, AH = 0x45

Various other functions are provided by int 0x67.

A20 line

In real mode on the 8088, addresses above FFFF:000F wrap around to zero, because the CPU has only 20 address lines. The 286 and later, however, have a 21st address line (known as the A20 line) and, when it is enabled, FFFF:0010 to FFFF:FFFF can be used without wrapping (almost 64KB more). This memory (known as the High Memory Area, HMA) is then accessible from real mode, and with HIMEM.SYS it can be used to load parts of DOS, which makes more low memory available to applications.
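
To make the "almost 64KB" figure concrete, here is a small Python sketch of the address calculation with and without the A20 line. The function name real_mode_address and its a20_enabled flag are hypothetical and only model the behaviour described above.

```python
def real_mode_address(segment, offset, a20_enabled):
    """Compute the address a real mode pointer reaches.

    With A20 disabled, address line 20 is forced low, so everything
    wraps at 1MB; with A20 enabled the 21st bit is kept.
    """
    address = (segment << 4) + offset
    if not a20_enabled:
        address &= 0xFFFFF
    return address


first = real_mode_address(0xFFFF, 0x0010, a20_enabled=True)
last = real_mode_address(0xFFFF, 0xFFFF, a20_enabled=True)
hma_size = last - first + 1

print(hex(first), hex(last), hma_size)
```

The HMA is 16 bytes short of 64KB because the first 16 bytes of segment FFFF still lie below the 1MB boundary.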

Enabling or disabling A20 manually requires us to communicate with the keyboard controller:

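WaitKBC:
   mov cx,0ffffh     ; timeout counter
   A20L:
   in al,64h         ; read the keyboard controller status
   test al,2         ; bit 1 set = input buffer still full
   loopnz A20L       ; wait until it is empty (or CX runs out)
ret

ChangeA20:
   call WaitKBC
   mov al,0d1h       ; command: write the controller output port
   out 64h,al
   call WaitKBC
   mov al,0dfh       ; output port value with A20 enabled (0ddh disables it)
   out 60h,al
ret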

Segmented Protected Mode

Architecture

Protected mode solves the real mode problems. In particular:

Up to 16 MB (286) and up to 4GB (386+) are directly accessible.
Memory access is checked, protections and protection levels are available.
If something goes wrong, the problem can be isolated and the rest of the applications are not affected.
There is a 16-bit protected mode (286+) and a 32-bit protected mode (386+).

DOS never ran in protected mode. Windows 3.0 ran in 16-bit segmented protected mode when started with the /s switch. Windows 95+, Linux and the rest of the 32-bit OSes run in flat protected mode, but before looking at flat mode we will dive into the complex mechanisms that protected mode has. Flat mode greatly simplifies many things that are complex in normal segmented protected mode.

Protected mode introduces «rings», that is, levels of authorization. There are four rings (Ring 0, 1, 2 and 3), of which Ring 0 is the most privileged and Ring 3 the least. Code running in a less privileged ring cannot access (without OS supervision) code in a more privileged ring.

Memory

A segment in memory no longer has a fixed position or a fixed 64KB size. A protected mode segment can have any size, from 1 byte to 4GB. Each segment has its own access restrictions (read, write, execute) and its own protection ring.

Registers

The same set of registers that exists in real mode is available. Also, every 32-bit general purpose register can be used as an index, for example mov ax,[ebx] will work.

Global Descriptor Table

The Global Descriptor Table (GDT) is a set of entries that describes all segments for the CPU. Each entry is 8 bytes long and has the following format:

Bits	Meaning
0-15	Limit low 16 bits
16-31	Base low 16 bits
32-39	Base medium 8 bits
40	Ac
41	RW
42	DC
43	Ex
44	S
45-46	Priv
47	Pr
48-51	Limit upper 4 bits
52-53	Reserved (0)
54	Sz
55	Gr
56-63	Base upper 8 bits

The base is a 32-bit value that indicates the address in memory at which this segment starts.
The limit is a 20-bit value indicating the length of the segment, interpreted according to the Gr bit. If the Gr bit is 1, the limit is counted in 4096-byte units instead of bytes.
The Ex flag is 1 to indicate a code segment, or 0 to indicate a data segment.
The DC flag has a different meaning, depending on the Ex flag:
- For a code segment (Ex = 1), if DC is 0 then the segment is non conforming. A non conforming segment can only be called from a segment with the same privilege level. If DC is 1 then the segment is conforming and can also be called from segments with lower privilege. For example, a ring 2 conforming segment can be called from a ring 3 segment.
- For a data segment (Ex = 0), if DC is 0 then the data segment expands up, else it expands down. In an expand-down segment the valid offsets lie above the limit instead of below it. This flag was created so that a stack segment could be easily expanded, but it is not used today.
The RW flag has a different meaning, depending on the Ex flag:
- For a code segment (Ex = 1), if 0, the segment is not readable. If 1, the code segment is readable.
- For a data segment (Ex = 0), if 0, the segment is read only, else read-write.
  Note that a code segment is never writable. However, because segments can overlap, you can create a writable data segment with the same base address and limit as a code segment.
The Priv field indicates the ring of the segment (00 to 11).
The Ac bit indicates access. The CPU sets this bit when the segment is accessed, so the OS gets an idea of how frequently the segment is used and whether it can swap it out to disk or not.
The S bit must be 1 for code and data segments, and 0 for system segments (see below).
The Pr bit is set to 1 to indicate that the segment is present in memory. If the OS swaps this segment out to disk, it sets Pr to 0. Any attempt to access the removed segment causes an exception. The OS catches this exception and reloads the segment into memory, setting Pr to 1 again.
The Sz bit can have two values:
- 0, in which case the default operand and address size is 16-bit. The segment can still execute 32-bit instructions (386+) by putting the 0x66 or 0x67 prefix in front of them.
- 1 (386+), in which case the default is 32-bit. The segment can still execute 16-bit instructions by putting the 0x66 or 0x67 prefix in front of them.
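
The scattered base and limit fields are easier to follow when an entry is assembled by hand. The following Python sketch packs the 8 bytes according to the bit table above; make_descriptor and its parameter names are illustrative only, and the two example entries are the usual flat 4GB ring 0 code and data segments.

```python
import struct


def make_descriptor(base, limit, access, flags):
    """Pack one 8-byte GDT entry.

    access: bits 40-47 as one byte (Ac, RW, DC, Ex, S, Priv, Pr).
    flags:  bits 52-55 as a 4-bit value (two reserved bits, Sz, Gr).
    """
    value = limit & 0xFFFF                   # bits 0-15: limit low
    value |= (base & 0xFFFFFF) << 16         # bits 16-39: base low and medium
    value |= (access & 0xFF) << 40           # bits 40-47: access byte
    value |= ((limit >> 16) & 0xF) << 48     # bits 48-51: limit upper
    value |= (flags & 0xF) << 52             # bits 52-55: flags
    value |= ((base >> 24) & 0xFF) << 56     # bits 56-63: base upper
    return struct.pack("<Q", value)


# Pr=1, Priv=0, S=1, Ex=1, RW=1 gives 0x9A; Sz=1 and Gr=1 give 0xC.
flat_code = make_descriptor(0, 0xFFFFF, 0x9A, 0xC)
# Same, but Ex=0 (data segment, writable) gives 0x92.
flat_data = make_descriptor(0, 0xFFFFF, 0x92, 0xC)

print(flat_code.hex(), flat_data.hex())
```

With Gr set, the 20-bit limit 0xFFFFF is counted in 4096-byte units, which is how a segment covers the full 4GB.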

In real mode, the segment registers (CS, DS, ES, SS, FS, GS) specify a real mode segment. You can put anything in them, no matter where it points, and you can read, write and execute from that segment. In protected mode, these registers are loaded with selectors. The selectors are indices into the GDT and have the following format:

Bits	Meaning
0-1	RPL. Requested privilege level; for data access it must be numerically equal to or lower than the privilege level of the segment.
2	0 to take the entry from GDT, 1 from the LDT (see below)
3-15	0-based index to the table.
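
A selector can be taken apart with simple masks and shifts. This Python sketch assumes the layout in the table above (RPL in bits 0-1, the table bit in bit 2, the index in bits 3-15); the function names are made up for the example.

```python
def decode_selector(selector):
    """Split a 16-bit selector into its three fields."""
    return {
        "rpl": selector & 0x3,         # bits 0-1: requested privilege level
        "ldt": bool(selector & 0x4),   # bit 2: 0 = GDT, 1 = LDT
        "index": selector >> 3,        # bits 3-15: entry number in the table
    }


def make_selector(index, ldt=False, rpl=0):
    """Build a selector from a table index, table bit and RPL."""
    return (index << 3) | (int(ldt) << 2) | rpl


print(decode_selector(0x08))
print(hex(make_selector(3, rpl=3)))
```

Since each descriptor is 8 bytes long, a selector with its low three bits cleared is also the byte offset of the entry inside the table.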

In protected mode, you can’t just load arbitrary values into the segment registers like in real mode. You must load valid selectors or you will get an exception. The one special case is the first entry in the GDT, which is always set to 0. The CPU does not read information from entry 0 and thus it is considered a «dummy» entry. This allows the programmer to load the value 0 into a segment register (DS, ES, FS, GS) without causing an exception.

The GDT is loaded to the CPU by executing the LGDT instruction, which takes a pointer to a 6-byte array:

Bytes 0-1 contain the limit of the GDT (its length in bytes minus 1), maximum 64KB => 8192 entries.
Bytes 2-5 contain the linear address of the first entry of the GDT, in memory.
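
The 6-byte operand can be sketched in Python as follows. The name gdtr_operand is hypothetical; the sketch assumes the first field is a 16-bit limit (table size minus 1), which is why at most 8192 eight-byte entries fit.

```python
import struct


def gdtr_operand(base, entry_count):
    """Pack the 6-byte array that LGDT reads: 16-bit limit, 32-bit base."""
    limit = entry_count * 8 - 1        # each descriptor is 8 bytes long
    if not 0 <= limit <= 0xFFFF:
        raise ValueError("the GDT must hold between 1 and 8192 entries")
    return struct.pack("<HI", limit, base)


# A table with 3 entries (null, code, data) placed at address 0x1000.
print(gdtr_operand(0x1000, 3).hex())
```

The same layout is used for the LIDT operand, only with the size and address of the interrupt table.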

Interrupts

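Each entry of the interrupt table is now 8 bytes long, having the following structure:

struc IDT_STR ofs0_15,sel,zero,flags,ofs16_31
{
 .ofs0_15 dw ofs0_15     ; handler offset, bits 0-15
 .sel dw sel             ; code segment selector of the handler
 .zero db zero           ; must be 0
 .flags db flags         ; gate type, privilege level and present bit
 .ofs16_31 dw ofs16_31   ; handler offset, bits 16-31
}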
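
As a cross-check of the structure above, this Python sketch builds one 8-byte entry. The function make_gate, the handler address 0x00123456 and the selector 0x08 are assumptions for the example; the flags value 0x8E means present, ring 0, 32-bit interrupt gate (type 1110).

```python
import struct


def make_gate(offset, selector, flags):
    """Pack one 8-byte protected mode IDT entry.

    Field order follows the structure: offset bits 0-15, selector,
    a zero byte, the flags byte, then offset bits 16-31.
    """
    return struct.pack(
        "<HHBBH",
        offset & 0xFFFF,
        selector,
        0,
        flags,
        (offset >> 16) & 0xFFFF,
    )


gate = make_gate(0x00123456, 0x08, 0x8E)
print(gate.hex())
```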

Each interrupt also has a protection level. The LIDT instruction has the same functionality as in real mode, pointing to a 6-byte array (containing the size and the location of the first entry).

After the LIDT instruction is executed, real mode interrupts no longer work, so a real mode debugger is useless.

Local Descriptor Table

The Local Descriptor Table (LDT) is a way for each application, in multitasking scenarios, to have a private set of segments, loaded with the LLDT assembly instruction. The LDT bit in the selector specifies whether the segment is loaded from the GDT or from the LDT.

System Segments in the GDT

When the S bit in the GDT is 0, this indicates a system-related segment. In this case, GDT entries describe the following kinds of system segments:

Task Segments
Call Gates
Interrupt Gates
Trap Gates (same as interrupt gates, with the exception that when a trap occurs, interrupts are still enabled)

Bits 40-43 in a GDT entry have the following meaning:

0000 — Reserved
0001 — Available 16-bit TSS
0010 — Local Descriptor Table (LDT)
0011 — Busy 16-bit TSS
0100 — 16-bit Call Gate
0101 — Task Gate
0110 — 16-bit Interrupt Gate
0111 — 16-bit Trap Gate
1000 — Reserved
1001 — Available 32-bit TSS
1010 — Reserved
1011 — Busy 32-bit TSS
1100 — 32-bit Call Gate
1101 — Reserved
1110 — 32-bit Interrupt Gate
1111 — 32-bit Trap Gate

Call Gates

Call gates are a mechanism to switch from a low privilege code to a higher one, used for user-level code to call system-level code. You specify a 1100 type entry in the GDT with the following format:

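struct CALLGATE
{
    unsigned short offs0_15;    // target offset, bits 0-15
    unsigned short selector;    // target code segment selector
    unsigned short argnum:5;    // number of parameters to copy
    unsigned short r:3;         // reserved
    unsigned short type:5;      // 01100 for a 32-bit call gate
    unsigned short dpl:2;       // privilege level required to use the gate
    unsigned short P:1;         // present
    unsigned short offs16_31;   // target offset, bits 16-31
};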

Using CALL FAR with the selector of this call gate (the offset is ignored) will switch to the gate and execute the higher privilege code. The CPU switches to the new stack and pushes the caller's SS and ESP; if argnum is not zero, it then copies that many parameters from the old stack, and finally pushes CS and EIP. Using RETF will return from the gate call.

Call gates are a slow mechanism for transitions between rings in the CPU.

TSS Descriptors, Task Gates and Hardware Multitasking

Because the GDT can hold Task Segments and Local Descriptor Tables, the CPU provides hardware task switching. The Task State Segment (TSS) is where the CPU saves information about a task (the current registers). Executing a far JMP or a CALL (offsets are ignored, like in call gates) with a selector pointing to a TSS descriptor in the GDT will «switch» to that task, restoring its saved registers. The TSS descriptor specifies the base address and limit of the TSS from which the new CPU state is loaded. The CPU has a register named Task Register (TR) which tells which TSS will receive the old CPU state. When TR is loaded with an LTR instruction, the CPU looks at the GDT entry (specified with LTR) and loads the visible part of TR with the selector, and the hidden part with the base and limit of the GDT entry. When the CPU state is saved, the hidden part of TR is used.

In addition to the far call and jmp, a context switch can be triggered by using a Task Gate Descriptor. Unlike TSS Descriptors, task-gate descriptors can be in the GDT, LDT or IDT (so you can force a task switch when an interrupt occurs).

Entering protected mode

The steps to follow are:

Enable A20
Set the GDT
Set the IDT (if you need interrupts in protected mode)
Enter protected mode with the MSW or the CR0 register.

On the 386+ you set bit 0 (PE) of the CR0 register:

mov eax,cr0
or eax,1
mov cr0,eax

On the 286 you set the same bit in the MSW register:

smsw ax
or al,1
lmsw ax

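After that, you must execute a far jump to a protected mode code segment in order to flush the instruction prefetch queue and to load CS with a valid selector. If this code segment is a 16-bit code segment, you must do:

db 0eah        ; opcode of a far JMP
dw StartPM     ; 16-bit offset
dw xx          ; selector of the code segment

If this code segment is a 32-bit code segment, you must do:

db 66h         ; operand size prefix: use a 32-bit offset
db 0eah        ; opcode of a far JMP
dd StartPM     ; 32-bit offset
dw xx          ; selector of the code segment

You must also set up the stack and the other segment registers: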

mov ax, data_selector
mov ds,ax
mov ax, stack_selector
mov ss,ax
mov esp,1000h 
              
sti
...

Exiting protected mode

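cli
mov eax,cr0
and eax,0fffffffeh   ; clear bit 0 (PE)
mov cr0,eax
mov ax,data16        ; real mode segment values from here on
mov ds,ax
mov ax,stack16
mov ss,ax
mov sp,1000h
mov bx,RealMemoryInterruptTableSavedWithSidt
lidt [bx]            ; restore the real mode interrupt table
sti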

On the 286 you cannot get back to real mode this way, because LMSW cannot clear the protected mode flag. The only way out is a processor reset, which keeps the memory intact. Before forcing the reset, the program registers a routine to be executed after the reset with the following code:

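MOV ax,40h
MOV es,ax
MOV di,67h           ; ES:DI = 0040:0067, where the BIOS looks for the restart pointer
MOV al,8fh           ; select CMOS register 0Fh (shutdown status), NMI disabled
OUT 70h,al
MOV ax,ShutdownProc
STOSW                ; store the offset of the routine
MOV ax,cs
STOSW                ; store its segment
MOV al,0ah           ; shutdown code 0Ah: after the reset, jump to the pointer at 0040:0067
OUT 71h,al
MOV al,8dh           ; select CMOS register 0Dh
OUT 70h,al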

In 386+, normal exit back to the real mode can be done.

Problems

While you can access all the memory directly, there is still a lot of segmentation and slow task switching or slow movement between rings.

Flat Protected Mode

Paging

Paging is the method to redirect a memory address to another address. The requested address is called linear address and the target address is called physical address. When a linear address is the same as a physical address, we say that we are in a «see through» area.

To accomplish paging, two tables are used: the page directory and the page table.

The Page Directory is an array of 1024 32-bit entries with the following format:

P,R,U,W,D,A,N,S,G,AA,Addr

P — Page is present in memory. This flag allows the OS to swap pages out to disk, clear P, and reload them when a page fault is generated because software attempts to access the page.
R — Page is read-write if set, else read only. This restriction applies only to ring 3 unless the WP bit in CR0 is set.
U — If unset, only supervisor code (rings 0 to 2) can access this page.
W — If set, write-through caching is enabled.
D — If set, the page will not be cached. (The address translations themselves are cached by the CPU in its Translation Lookaside Buffer, TLB.)
A — Set by the CPU when the page is accessed. As with the Ac bit in the GDT, the CPU never clears it; that is left to the OS.
N — Set to 0.
S — Set to 0. If Page Size Extensions (PSE) are enabled, S can be 1, in which case the page size is 4MB instead, and the pages must be 4MB aligned. This mode was introduced to avoid lots of small pages, at the expense of more memory wasted if the needed memory is somewhat larger than 4MB. Fortunately, modes can be mixed.
G — Set to 0.
Addr — The upper 20 bits (the lower 12 are always zero because the table must be 4096-aligned) of the physical address of the Page Table that this Page Directory entry points to.

The Page Table is an array of 1024 32-bit entries with a similar format:

P,R,U,W,C,A,D,N,G,AA,Addr

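The C bit is the same as the previous D bit (cache disable).
The D bit marks dirty pages: the CPU sets it when the page has been written, and the OS uses it to decide whether the page must be saved before it is discarded.
The G flag, if set, marks the page as global: its TLB entry is kept when CR3 is reloaded (this requires the PGE bit in CR4).
The Addr is the 4096-aligned physical address that this entry points to. The linear address is split accordingly: bits 22-31 select the entry in the page directory, bits 12-21 select the entry in the page table, and bits 0-11 are the offset within the page.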
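
The two-level lookup can be simulated in Python. This is a simplified sketch, not how an OS would do it: the tables are plain lists, only the P bit is checked, 4MB pages are ignored, and all names and addresses (split_linear, translate, the table at 0x5000) are invented for the example.

```python
def split_linear(address):
    """Split a 32-bit linear address into directory index, table index, offset."""
    return (address >> 22) & 0x3FF, (address >> 12) & 0x3FF, address & 0xFFF


def translate(address, page_directory, page_tables):
    """Walk the two tables and return the physical address.

    page_directory: list of 1024 entries.
    page_tables: dict mapping a page table's physical address to its
    list of 1024 entries (a stand-in for physical memory).
    """
    directory_index, table_index, offset = split_linear(address)
    pde = page_directory[directory_index]
    if not pde & 1:
        raise LookupError("page fault: page table not present")
    pte = page_tables[pde & 0xFFFFF000][table_index]
    if not pte & 1:
        raise LookupError("page fault: page not present")
    return (pte & 0xFFFFF000) | offset


directory = [0] * 1024
table = [0] * 1024
directory[1] = 0x00005000 | 0x3    # page table at 0x5000, present and writable
table[2] = 0x00ABC000 | 0x3        # page frame at 0xABC000, present and writable
tables = {0x5000: table}

print(hex(translate(0x00402345, directory, tables)))
```

Each page directory entry covers 4MB of linear address space (1024 pages of 4KB), and the 1024 directory entries together cover 4GB.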

To enable paging:

Load CR3 with the address of the first entry in the Page Directory (must be 4096-aligned).
Set CR0 bit 31. This requires protected mode, with the exception of LOADALL (see below).

Once the tables are loaded, their entries are cached in the TLB. Reloading CR3 will reset the cache. The 486+ also has an INVLPG instruction to invalidate only the TLB entry of a particular page, not the entire TLB.

Architecture

The segmented protected mode is very complex. Using paging, protected mode can be «flat», enabling the following:

All processes get a 4GB virtual address space. Protection is done at the paging level. All segments are 4GB, all segment selectors always point to the same segment.
Programming is way simpler since only «near» pointers are needed.
The OS can map shared libraries (residing once in physical memory) to multiple virtual destinations per application.
The application only sees memory paged to its own virtual address space, so processes are protected by hardware.

In addition, all modern OSes now use only 2 of the 4 protection rings, ring 0 for their kernel and ring 3 for all the user applications. Call gates are no longer used.

SYSENTER/SYSEXIT

To make transitions between user mode (ring 3) and kernel mode (ring 0) faster, a method other than call gates had to be implemented. The SYSENTER/SYSEXIT instructions are the current way to switch from ring 3 to ring 0 and back. You use WRMSR to set the ring 0 values for CS (MSR 0x174), ESP (MSR 0x175) and EIP (MSR 0x176). For SYSEXIT, ECX must hold the ring 3 stack pointer and EDX the ring 3 EIP. The value stored for CS must select the first of 4 consecutive GDT entries: the first is the ring 0 code, the second is the ring 0 data, the third is the ring 3 code and the fourth is the ring 3 data. This layout is fixed, so in order to use SYSENTER your GDT must contain these entries in this order.
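
The fixed layout means that all four selectors follow from the single value written to MSR 0x174. The Python sketch below derives them; sysenter_selectors is a hypothetical helper, and the example assumes the ring 0 code descriptor is GDT entry 1 (selector 0x08).

```python
def sysenter_selectors(sysenter_cs):
    """Derive the four selectors used by SYSENTER and SYSEXIT (32-bit).

    The descriptors must be consecutive GDT entries, 8 bytes apart.
    The ring 3 selectors get RPL 3 in their low two bits.
    """
    base = sysenter_cs & 0xFFFC            # the RPL bits of the MSR value are ignored
    return {
        "ring0_code": base,
        "ring0_data": base + 8,
        "ring3_code": (base + 16) | 3,
        "ring3_data": (base + 24) | 3,
    }


for name, selector in sysenter_selectors(0x08).items():
    print(name, hex(selector))
```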

These opcodes only support switching between ring 3 and ring 0, but they are much faster. They are used today instead of the way slower call gates.

Software multitasking

Task gates are no longer used by today’s operating systems. Instead, they apply software multitasking to switch between processes:

A «scheduler» is run (triggered by a timer interrupt).
It switches stack and EIP based on thread and process priorities.

Because a software scheduler saves only what is necessary for task switching, it is faster than the segmented mode hardware switching.

Protected Mode Facts

Unreal mode

Because protected mode cannot call DOS or BIOS interrupts, it is generally not very useful to DOS applications. However, a ‘bug’ in the 386+ processor turned out to be a feature called unreal mode. Unreal mode is a method to access the entire 4GB of memory from real mode. This trick is undocumented, however a large number of applications use it. The trick is based on the fact that a segment register can be loaded in protected mode with the selector of a 4GB data segment (set in the GDT), and when the CPU goes back to real mode the «invisible part» of the register remains intact, still holding the 4GB limit.

To use unreal mode, you must:

Enable A20.
Enter protected mode.
Load a segment register (ES or FS or GS) with a 4GB data segment.
Return to real mode.

After returning from protected mode, you can easily do:

mov ax,0
mov fs,ax            ; base 0; the hidden 4GB limit is kept
mov edi,1048576      ; 1MB, beyond the normal real mode range
mov byte [fs:edi],0

286 lacks this capability because to exit protected mode, the CPU has to be reset, so all registers are destroyed (but see LOADALL below).

Huge real mode

The above unreal mode theory can be applied to CS as well, making it possible to execute code at a position over 1MB when EIP > 0xFFFF. However when calling an interrupt, the upper 16 bits of EIP are not pushed to the stack, so on return you will not return where you were. Therefore, huge real mode was not very much used.

LOADALL

At that time, a now non-existent and mostly undocumented instruction existed, LOADALL (0xF 0x5 in 286, 0xF 0x7 in 386). LOADALL was used, as the name implies, to load all the registers (including the GDTR and IDTR) from one table in memory. In the 286 LOADALL (which was not accessible from the 386), this table was fixed at memory address 0x800, whereas the 386 LOADALL reads the buffer pointed to by real mode ES:EDI. Because the CPU does not check in any way whether the values loaded by LOADALL are valid, LOADALL was used by many tools at the time, including HIMEM.SYS, for various infamous actions:

To access the entire memory from real mode without entering protected mode and unreal mode.
To run real code with paging.
To run 32bit code in real mode.
To run normal 16-bit code inside protected mode without VM86 (which was not there in 286). This was done by trapping each memory access (which would lead to GPF because all the segments were marked non-present) and emulating the desired result by using another LOADALL. Of course this was too slow, but it led to the creation of the VM86 mode in 386, where LOADALL eventually faded out.

LOADALL cannot switch the 286 back to real mode, but using LOADALL removes the need to enter protected mode altogether.

LOADALL 286 itself was mentioned in the manuals and was partially documented; by contrast, LOADALL 386 was heavily obscure, probably to induce the programmers to take advantage of the new VM86 mode.

HIMEM.SYS

Protected mode is complex and, without a debugger available, it is prone to lots of unsolvable crashes. To help programmers, Microsoft created a driver that was able to manage protected mode on behalf of a normal 16-bit DOS application, allowing it to access high memory. At that time, extended memory was mostly, if not totally, used to cache data from the disk, especially by big apps. HIMEM puts the CPU in unreal mode (or it uses LOADALL on the 286) and provides a simple interface to applications that want more memory without messing with the protected mode details. By enabling the A20 line, HIMEM allowed a portion of DOS to reside in the high memory area when config.sys had a DOS=HIGH directive.

Detect HIMEM.SYS

Interrupt 0x2F, AX = 0x4300

Return HIMEM.SYS function pointer

Interrupt 0x2F, AX = 0x4310

All the following functions are provided by calling the entry point returned in ES:BX by the above interrupt.

Detect/Enable/Disable A20

AH = 0x7 (detect), 0x3 (enable), 0x4 (disable)

Allocate HMA

AH = 0x1

Free HMA

AH = 0x2

Allocate extended memory

AH = 0x9

Free extended memory

AH = 0xA

Copy real/protected memory from/to real/protected memory

AH = 0xB

Lock/Unlock protected mode memory

AH = 0xC (Lock), 0xD (Unlock)

HIMEM.SYS moves memory in order to defragment it. Locking memory is useful when you will access the memory directly, within protected mode. Actually, because HIMEM puts the CPU in unreal mode, you can use the very same returned pointers directly.

VM86 Mode

Many of the existing applications were real-mode at the time protected mode was introduced. Even today, many (mostly games) are played under Windows. To force these applications (which think they own the machine) to cooperate, a special mode had to be created.

VM86 mode is enabled by a special flag (VM, bit 17) in the EFLAGS register. It provides a normal 16-bit DOS memory map of 640KB which is forwarded via paging to the actual memory. This makes it possible to run multiple DOS applications at the same time without any risk of one application overwriting another. EMM386.EXE puts the processor in that state. The OS supervises the process, making sure that it won’t execute something illegal. Normally you also want to map all your other critical structures (GDT, IDT etc) above 1MB so they are not visible to any VM86 process.

To trigger VM86 mode, you can use PUSHFD and IRET:

mov ebp,esp                   ; save the stack pointer
push dword  [ebp+4]           ; ss
push dword  [ebp+8]           ; esp
pushfd                        ; eflags
or dword [esp], (1 << 17)     ; set the VM flag
push dword [ebp+12]           ; cs
push dword  [ebp+16]          ; eip
iret

Once the VM flag is set, you can load a normal «segment» to a segment register. Interrupt calls by DOS applications are caught by the OS and emulated through it — if possible. Also, some instructions are ignored, for example, if you do a CLI, the interrupts are not actually disabled. The OS sees that you prefer to not be interrupted and acts accordingly, but interrupts are still there.

All VM86 code executes in PL 3, the lowest privilege level. Ins/Outs to ports are also captured and emulated if possible. The interesting thing about VM86 is that there are two interrupt tables, one for the real and one for the protected mode. But only protected mode interrupts are executed.

VM86 is not available in 64-bit mode, so a 64-bit OS cannot execute 16-bit DOS code anymore. In order to execute such code, you need an emulator such as DOSBox.

Many applications were also written to take advantage of the expanded memory, but the modern standard was the protected mode. EMM386 puts the CPU in VM86 mode and maps via paging memory over 1MB to real mode segments (over 0xA0000), so an application that would like to use expanded memory can use it via EMM386.EXE, which provides an LIM EMS int 0x67 interface. In addition, EMM386 allowed «devicehigh» and «loadhigh» commands in CONFIG.SYS, allowing applications to get loaded to these high segments if possible.

Physical Address Extensions (PAE)

PAE is the ability of x86 to use 36 address bits instead of 32. This increases the available memory from 4GB to 64GB. The 32-bit applications still see only a 4GB address space, but the OS can map (via paging) memory from the high area to the lower 4GB address space. This extension was added to x86 to cope with the (nowadays not enough) limit of 4GB, before 64-bit CPUs came to the foreground.

Enabling PAE (CR4 bit 5) means that you now have 3 paging levels: in addition to the Page Directory and the Page Table, there is the PDPT, the Page Directory Pointer Table, which has four 64-bit entries. Each of the PDPT entries points to a Page Directory of 4KB (like in normal paging). Each entry in the new Page Directory is now 64 bits long (so there are 512 entries) and points to a Page Table of 4KB (like in normal paging). Each entry in the new Page Table is also 64 bits long, so there are 512 entries. One Page Directory therefore maps only 512 * 512 * 4KB = 1GB, a quarter of the original mapping, which is why the PDPT holds 4 entries. The first entry maps the first GB, the 2nd the 2nd GB, the 3rd the 3rd GB and finally, the 4th entry maps the 4th GB.
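
With three levels, a 32-bit linear address is split into 2 + 9 + 9 + 12 bits. The following Python sketch shows the split for 4KB pages; the function name split_pae and the sample address are only for illustration.

```python
def split_pae(address):
    """Split a 32-bit linear address for PAE paging with 4KB pages."""
    pdpt_index = (address >> 30) & 0x3          # 4 entries, 1GB each
    directory_index = (address >> 21) & 0x1FF   # 512 entries, 2MB each
    table_index = (address >> 12) & 0x1FF       # 512 entries, 4KB each
    offset = address & 0xFFF
    return pdpt_index, directory_index, table_index, offset


print(split_pae(0xC0201003))

# One page directory covers 1GB, so four of them cover the 4GB address space.
covered_by_one_directory = 512 * 512 * 4096
print(covered_by_one_directory == 1 << 30)
```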

But now the «S» bit in the PDT has a different meaning: If not set, it means that the page entry is 4KB but if set, it means that this entry does not point to a PT entry, but it describes itself a 2MB page. So you can have different levels of paging traversal depending on the S bit.

There is a new flag in the Page Directory entry as well, the NX bit (Bit 63) which, if set, prevents code execution in that page.

This system allows the OS to handle memory over 4GB, but since the address space is still 4GB, each process is still limited to 4GB. The memory can be up to 64GB but a process cannot see the entire memory.

Direct Memory Access drivers however have a problem, because they don’t use paged memory. If working in 32 bits, the driver has to manage the paging tables itself in order to be able to manipulate memory over 4GB, and this could mean incompatibilities with the operating system, unless a safe DMA API was exposed to the driver. For this reason, PAE quickly faded out in favor of 64-bit operating systems, in which it still remains a required paging level.

DPMI

For DOS applications, unreal mode was not enough; eventually fully 32-bit applications had to be possible. DPMI (DOS Protected Mode Interface) was a driver that provided a (relatively complex) interface to applications wishing to run in 32-bit protected mode. DOS extenders based on DPMI, like DOS/4GW and DOS32A, were created to support applications (mostly games) that wanted to run in 32 bit while still having access to DOS interrupts. DPMI catches the interrupt call, switches to real mode, executes the interrupt and goes back to protected mode. DPMI even allows multitasking and multiple «virtual» 32-bit machines.

DOS extenders use a «Linear Executable» (LE or LX format) which contains native 32-bit code. DOS32A can load and run such an executable. Here is a FASM example of creating a LE executable with DPMI.

Detect DPMI using interrupt 2F:

Interrupt 0x2F, AX = 0x1687

Example from DJGPP:

modesw	dd	0			; far pointer to DPMI host's
					    ; mode switch entry point
	mov	ax,1687h		; get address of DPMI host's
	int	2fh			; mode switch entry point (http://www.delorie.com/djgpp/doc/dpmi/api/2f1687.html)
	or	ax,ax			; exit if no DPMI host
	jnz	error
	mov	word ptr modesw,di	; save far pointer to host's
	mov	word ptr modesw+2,es	; mode switch entry point
	or	si,si			; check private data area size
	jz	@@1		     	; jump if no private data area

	mov	bx,si			; allocate DPMI private area
	mov	ah,48h			; allocate memory
	int	21h			    ;  transfer to DOS
	jc	error			; jump, allocation failed
	mov	es,ax			; let ES=segment of data area

@@1:	mov	ax,0			; bit 0=0 indicates 16-bit app
	call	modesw			; switch to protected mode
	jc	error			; jump if mode switch failed
					; else we're in prot. mode now

The application terminates via int 0x21 function 0x4C (as in real mode). The rest of the DPMI functions are provided through int 0x31 and include:

Real mode interrupt capturing (as function 0x25 int 0x21)
Real mode exception trapping
Call DOS interrupts either directly, or through int 0x31 function 3
Real mode callbacks
Sharing memory between DPMI clients
Paging
Setting hardware breakpoints
TSR capabilities

Many good games like The Dig were running under DPMI.

Long Mode

Architecture

Whatever methods were created to overcome the 4GB limit of the x86, they would eventually lead to full 64-bit processors. Having discussed all the protected mode complexities, we are lucky to observe that the x64 CPU architecture is way simpler. The x64 CPU has 3 operation modes:

Real mode
Protected mode (called legacy mode)
Long mode, containing two submodes:
- Compatibility mode, 32 bit. This allows a 64-bit OS to run 32-bit applications natively.
- 64-bit mode

To work in Long mode, the programmer must take into consideration the facts below:

Unlike Protected mode, which can run with or without paging, long mode runs only with PAE and paging and in flat mode. All the segments are flat, from 0 to 0xFFFFFFFFFFFFFFFF and all memory addressing is linear. DS, ES, SS are ignored. The «flat» mode is the only valid mode in long mode. No segmentation.
You can get into long mode directly from real mode, by enabling protected mode and long mode within one instruction (this can work because Control Registers are accessible from real mode).
Although in theory any 64-bit value could be used as an address, in practice we do not yet need 2^64 bytes of memory. Therefore, current implementations only implement 48-bit addressing, which forces all pointers to have bits 47-63 either all 0 or all 1. This means that you have 2 ranges of valid «canonical» addresses, one from 0 to 0x00007FFF’FFFFFFFF and one from 0xFFFF8000’00000000 through 0xFFFFFFFF’FFFFFFFF, for a total of 256TB. Most OSes reserve the upper area for the kernel, and the lower area for the user space.
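The canonical-address rule from the last point can be written as a short C check. This is a sketch that assumes a 48-bit implementation; the function name `is_canonical48` is made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if bits 63:47 of addr are all 0 or all 1, i.e. the address is
   canonical on a CPU that implements 48-bit linear addresses. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the 17 bits 63:47 */
    return top == 0 || top == 0x1FFFF;
}
```

Addresses that fail this test raise a general protection fault when used, so a loader or allocator can use such a check to reject them early.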

To verify that long mode is supported, we must check extended CPUID features:

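mov eax, 0x80000000     ; ask for the highest extended CPUID leaf
cpuid
cmp eax, 0x80000001
jb .NoLongMode          ; extended leaf 0x80000001 does not exist
mov eax, 0x80000001     ; extended feature flags
cpuid
bt edx, 29              ; EDX bit 29 = LM (long mode available)
jnc .NoLongMode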

Registers

When running in 64-bit mode, the following 64-bit extensions are available:

RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP, RIP
8 new 64-bit registers added: R8 to R15. Their lower 32 bits are accessed as R8D to R15D, their lower 16 bits as R8W to R15W and their lower 8 bits as R8B to R15B.

These registers are only available in 64-bit mode. In all other modes, including compatibility mode, they are not available.

GDT/IDT

Bit 53 of a GDT code segment descriptor, previously reserved, is now the «L» bit. When it is 1, the Sz bit must be 0, and this indicates a 64-bit code segment (the combination L = 1 and Sz = 1 is reserved and will throw an exception if used). The limits are always 0 to 0xFFFFFFFFFFFFFFFF and the base is always 0.
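A small C sketch of how the L and Sz bits combine, treating a descriptor as one 64-bit value. The enum and function names are invented for this illustration; only the bit positions (L at bit 53, Sz at bit 54) come from the descriptor layout.

```c
#include <stdint.h>

#define GDT_L_BIT   (1ULL << 53)   /* long mode (64-bit) code segment */
#define GDT_SZ_BIT  (1ULL << 54)   /* default operand size: 0 = 16, 1 = 32 */

typedef enum { CODE_16, CODE_32, CODE_64, CODE_RESERVED } code_kind;

/* Hypothetical helper: classify a code segment descriptor by its L/Sz bits. */
static code_kind classify_code_descriptor(uint64_t desc)
{
    int l  = (desc & GDT_L_BIT)  != 0;
    int sz = (desc & GDT_SZ_BIT) != 0;
    if (l)
        return sz ? CODE_RESERVED : CODE_64;
    return sz ? CODE_32 : CODE_16;
}
```

For instance, the commonly seen flat descriptors 0x00CF9A000000FFFF and 0x00AF9A000000FFFF differ only in these two bits.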

If your GDT resides in the lower 4GB of memory, you need not change it after entering long mode. However, if you plan to call SGDT or LGDT while in long mode, you must now deal with the 10-byte GDTR, which holds two bytes for the length of the GDT and 8 bytes for its linear base address.

Any selector you might load to access a 64-bit segment is ignored, and DS, ES, SS are not used at all. All the segments are flat, and everything is done via paging. However GS and FS can still be used as auxiliary registers and their values are still subject to verification from the GDT. In 32-bit Windows, FS points to the Thread Information Block; 64-bit Windows uses GS for that purpose.

The IDT is similar to the protected mode one, the difference being that each entry is expanded to 16 bytes so that it can hold the 64-bit address of the interrupt handler:

struc IDT_STR ofs0_15,sel,zero,flags,ofs16_31,ofs32_63
{
 .ofs0_15 dw ofs0_15
 .sel dw sel
 .zero db zero
 .flags db flags            ; bits 0-3 gate type, bit 4 zero, bits 5-6 DPL, bit 7 P
 .ofs16_31 dw ofs16_31
 .ofs32_63 dd ofs32_63
 .reserved dd 0             ; must be zero
}
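For readers more comfortable with C, the same 16-byte entry can be sketched as a struct. The field names follow the listing above; `idt_entry64` and `idt_make_entry` are names invented for this example, and the flags value 0x8E used in the usage note (present, DPL 0, interrupt gate) is just a typical choice.

```c
#include <stdint.h>

/* C mirror of the 64-bit IDT entry; all fields are naturally aligned,
   so no packing attribute is needed. */
typedef struct {
    uint16_t ofs0_15;    /* handler address bits 0-15  */
    uint16_t sel;        /* code segment selector      */
    uint8_t  zero;       /* zero in this article       */
    uint8_t  flags;      /* gate type, DPL, P          */
    uint16_t ofs16_31;   /* handler address bits 16-31 */
    uint32_t ofs32_63;   /* handler address bits 32-63 */
    uint32_t reserved;   /* must be zero               */
} idt_entry64;

_Static_assert(sizeof(idt_entry64) == 16, "long mode IDT entries are 16 bytes");

/* Hypothetical helper: split a 64-bit handler address over the entry. */
static idt_entry64 idt_make_entry(uint64_t handler, uint16_t sel, uint8_t flags)
{
    idt_entry64 e;
    e.ofs0_15  = (uint16_t)(handler & 0xFFFF);
    e.sel      = sel;
    e.zero     = 0;
    e.flags    = flags;
    e.ofs16_31 = (uint16_t)((handler >> 16) & 0xFFFF);
    e.ofs32_63 = (uint32_t)(handler >> 32);
    e.reserved = 0;
    return e;
}
```

A call such as `idt_make_entry(handler_address, 0x08, 0x8E)` would describe an interrupt gate through selector 0x08.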

There is no VM86, DPMI or unreal mode in long mode, and the LDT and call gates survive only in an expanded 16-byte descriptor form. The missing VM86 mode is the reason that 64-bit OSes cannot run 16-bit software without an emulator.

Long Mode Paging

In long mode the paging system adds a new top level structure, the PML4T, which has 512 entries of 64 bits. Each PML4T entry points to one PDPT, and the PDPT now has 512 entries as well (instead of 4 in the x86 mode). So you can have up to 512 PDPTs, which means that one PT entry manages 4KB, one PDT entry manages 2MB (4KB * 512 PT entries), one PDPT entry manages 1GB (2MB * 512 PDT entries), and one PML4T entry manages 512GB (1GB * 512 PDPT entries). Since there are 512 PML4T entries, a total of 256TB (512GB * 512 PML4T entries) can be addressed.
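The four-level arithmetic above can be checked with a short C sketch. The names `lm_indices`, `lm_split` and `lm_entry_span` are invented for this illustration; the code assumes 4KB pages and 48-bit addresses.

```c
#include <stdint.h>

/* Index fields of a 48-bit linear address in long mode (4KB pages). */
typedef struct {
    unsigned pml4;    /* bits 47:39 */
    unsigned pdpt;    /* bits 38:30 */
    unsigned pd;      /* bits 29:21 */
    unsigned pt;      /* bits 20:12 */
    unsigned offset;  /* bits 11:0  */
} lm_indices;

static lm_indices lm_split(uint64_t va)
{
    lm_indices ix;
    ix.pml4   = (unsigned)((va >> 39) & 0x1FF);
    ix.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    ix.pd     = (unsigned)((va >> 21) & 0x1FF);
    ix.pt     = (unsigned)((va >> 12) & 0x1FF);
    ix.offset = (unsigned)(va & 0xFFF);
    return ix;
}

/* Bytes managed by one entry: level 0 = PT, 1 = PDT, 2 = PDPT, 3 = PML4T. */
static uint64_t lm_entry_span(int level)
{
    return 4096ULL << (9 * level);
}
```

Each level contributes 9 index bits, so `lm_entry_span(3) * 512` comes out as 2^48 bytes, the 256TB mentioned above.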

This is another reason not to use the entire 64-bit for addressing. Using the entire thing would force us to have 6 levels of paging, where now four are needed.

Each of the «S» bits in the PDPT/PDT can be 0 to indicate that there is a lower level structure below, or 1 to indicate that the traversal ends here. If the PDPT S flag is 1 and the CPU supports it, then the page size is 1GB.

Intel has also specified PML5, a new top level structure which allows 5 levels of paging on CPUs that support 57 bits of addressing.

To verify that 1GB pages are supported, we test EDX bit 26 of CPUID leaf 0x80000001:

mov eax,80000001h
cpuid
bt edx,26
jnc .no1gbpg

Entering Long Mode

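mov eax, cr0            ; turn off paging (clear CR0.PG, bit 31)
and eax,7FFFFFFFh
mov cr0, eax 
mov eax, cr4            ; set PAE (CR4 bit 5)
bts eax, 5
mov cr4, eax 
mov ecx, 0c0000080h     ; EFER MSR
rdmsr 
bts eax, 8              ; set LME (long mode enable)
wrmsr 

; Here CR3 is loaded with the physical address of the page table.
mov eax, cr0            ; turn paging back on, which activates long mode
or eax,80000000h 
mov cr0, eax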

Turn off paging, if enabled. To do that, you must ensure that you are running in a «see through» area.
Set PAE, by setting bit 5 of CR4.
Create the new page tables and load CR3 with them. Because CR3 is still 32-bits before entering Long mode, the page table must reside in the lower 4GB.
Enable Long mode (note, this does not enter Long mode, it just enables it).
Enable paging. Enabling paging activates and enters Long mode.

Because the rdmsr/wrmsr opcodes are also available in Real mode, you can activate Long mode from Real mode directly by setting both the PE and PG bits of CR0 simultaneously.

Entering 64-bit

Now you are in compatibility mode. Enter 64-bit mode by jumping to a 64-bit code segment:

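db 0eah                       ; far JMP ptr16:32 opcode
dd LinearAddressOfStart64     ; 32-bit offset of the 64-bit entry point
dw code64_idx                 ; selector of the 64-bit code segment (name assumed here)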

The initial 64-bit segment must reside in the lower 4GB because compatibility mode does not see 64-bit addresses. Note that you must use the linear address, because 64-bit segments always start from 0. Note also that if the current compatibility segment is 16-bit default, you have to use the 066h prefix.

The only thing you have to do in 64-bit mode is to reset the RSP:

mov rsp,STACK64
shl rsp,4
add rsp,stack64_end

SS, DS, ES, are not used in 64-bit mode. That is, if you want to access data in another segment, you cannot load DS with that segment’s selector and access the data. You must specify the linear address of the data. Data and stack are always accessed with linear addresses. «Flat» mode is not only the default, it is the only one for 64-bit.

Once you are in 64-bit mode, the default operand size for most opcodes (jmp/call excepted) is still 32-bit. So a REX prefix with the W bit set (0x48 to 0x4F, out of the REX range 0x40 to 0x4F) is required to mark a 64-bit operand. Your assembler handles that automatically if it supports a «code64» segment.

In addition, a 64-bit interrupt table must now be set with a new LIDT instruction, this time taking a 10-byte operand (2 for the length and 8 for the location).

Returning to Compatibility Mode

To exit 64-bit mode, it is first necessary to return to compatibility mode. Because 0eah is not a valid jump when in 64-bit mode, you have to use a RETF trick to get back to a compatibility mode segment.

push code32_idx    
xor rcx,rcx    

mov ecx,Back32    
                  
push rcx
retf

This gets you back to compatibility mode. 64-bit OSs keep jumping from 64-bit to compatibility mode in order to be able to run both 64-bit and 32-bit applications.

Exiting from Long Mode

You have to set up all the registers again with 32-bit selectors, which means going back to segmentation. Also you must be in a see-through area because to exit long mode you must deactivate paging. Of course, you can switch immediately to real mode by resetting the PE bit as well.

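mov ax,stack32_idx      ; reload the segment registers with 32-bit selectors
mov ss,ax 
mov esp,stack32_end 
mov ax,data32_idx 
mov ds,ax
mov es,ax
mov ax,data16_idx
mov gs,ax
mov fs,ax

mov eax, cr0            ; turn off paging (clear CR0.PG)
and eax,7fffffffh 
mov cr0, eax 

mov ecx, 0c0000080h     ; EFER MSR
rdmsr 
btr eax, 8              ; clear LME (btr always clears; btc would only toggle the bit)
wrmsr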

Interrupt priorities

Driver developers in Windows will know the meaning of IRQL. An IRQL is a CPU feature to prioritize interrupts. x86 and x64 have the CLI instruction to disable interrupts entirely, but a modern multithreading system also needs a way to prioritize them. The Windows driver functions KeRaiseIrql and KeLowerIrql modify the CR8 register, setting the CPU interrupt priority (0 to 15, where 0 is PASSIVE_LEVEL and 2 is DISPATCH_LEVEL). When an interrupt is pending, its priority is compared to CR8. If the priority is greater, it is serviced, otherwise it is held pending until CR8 is set to a lower value. CR8 starts with 0 on CPU reset.

According to Intel’s Vol 3A, section 10.8.3, the interrupt priority is given by bits 7:4 (the high nibble) of the interrupt vector.
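The comparison against CR8 can be illustrated in C. This is a simplified sketch that ignores interrupts already in service; the function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Priority class of a vector: its upper four bits (7:4). */
static unsigned priority_class(uint8_t vector)
{
    return vector >> 4;
}

/* Hypothetical model of the rule in the text: a pending vector is
   serviced only if its priority class is greater than CR8. */
static bool is_deliverable(uint8_t vector, unsigned cr8)
{
    return priority_class(vector) > (cr8 & 0xF);
}
```

With CR8 set to 4, for example, vectors 0x50 and above are delivered while vectors 0x40 to 0x4F stay pending.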

Multiple Cores

A single CPU can execute one instruction at a time. Multitasking in single processors is generally the fast switching (at the software level) between different registers/paging for each process running, and this is so fast that it appears that processes run simultaneously.

A multiple core CPU is similar to having many single CPUs that share the same memory. Everything else (registers, modes, etc.) is specific to each CPU. That means that if we have an 8 core processor, we have to execute the same procedure 8 times to put all of it in, say, long mode. We can have one processor in real mode, another processor in protected mode, another processor in long mode etc.

In multiple core configurations we are concerned with three things:

How to discover multiple processors and their properties
How to communicate from one CPU to another
How to synchronize access to sensitive data

Discovery

The Advanced Configuration and Power Interface (ACPI) is a set of tables, found in memory, that will provide us the information we need. First we discover the presence of the APIC (Advanced Programmable Interrupt Controller) on the CPU:

mov eax,1
cpuid
bt edx,9
jc ApicFound

Second, we search for ACPI in memory. Its entry point, the Root System Description Pointer (RSDP), resides somewhere in BIOS memory between physical addresses 0xE0000 and 0xFFFFF, and it has the following header:

struct RSDPDescriptor 
{
 char Signature[8];
 uint8_t Checksum;
 char OEMID[6];
 uint8_t Revision;
 uint32_t RsdtAddress;

 // The following fields are present only in ACPI 2.0 and later
 uint32_t Length;
 uint64_t XsdtAddress;
 uint8_t ExtendedChecksum;
 uint8_t reserved[3];
};

The above RSDP Descriptor contains the signature value which, for this first ACPI structure, is 0x2052545020445352 (the ASCII string «RSD PTR » read as a little-endian 64-bit value). If this signature is not found in memory, then we don’t have ACPI and we cannot discover additional CPU cores this way.
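The signature search can be sketched in C over a buffer that stands in for the BIOS area. The function name `find_rsdp` is invented for this example, and the 16-byte stride assumes the usual alignment of the RSDP, which the text above does not state.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scans buf on 16-byte boundaries (an assumption of this sketch) for the
   8-byte "RSD PTR " signature. Returns its offset, or -1 if absent. */
static long find_rsdp(const uint8_t *buf, size_t len)
{
    for (size_t off = 0; off + 8 <= len; off += 16) {
        if (memcmp(buf + off, "RSD PTR ", 8) == 0)
            return (long)off;
    }
    return -1;
}
```

On real hardware `buf` would be the physical range 0xE0000 to 0xFFFFF; a match should still be confirmed with the checksum before it is trusted.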

Each descriptor also has a checksum, which is verified with the following algorithm:

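IsChecksumValid:            ; in: FS:EDI = table, ECX = length in bytes (non-zero)
    PUSH ECX                ; out: EAX = 1 if the byte sum is 0 modulo 256, else 0
    PUSH EDI
    PUSH EDX
    XOR EAX,EAX
    .St:
    MOVZX EDX,BYTE [FS:EDI] ; add one byte at a time
    ADD EAX,EDX
    INC EDI
    DEC ECX
    JNZ .St
    TEST EAX,0xFF           ; only the low byte of the sum matters
    MOV EAX,0               ; MOV does not change the flags
    JNZ .F
    MOV EAX,1
    .F:
    POP EDX
    POP EDI
    POP ECX
    RETF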
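The same checksum rule, all bytes of the structure summing to zero modulo 256, reads more directly in C. The function name is invented for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the bytes of the table add up to 0 modulo 256. */
static bool acpi_checksum_ok(const void *table, size_t len)
{
    const uint8_t *p = (const uint8_t *)table;
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + p[i]);
    return sum == 0;
}
```

For the RSDP, the ACPI 1.0 checksum covers the first 20 bytes, while the ExtendedChecksum covers the whole structure of Length bytes.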

In case we succeed in finding an ACPI 2.0 table and its ExtendedChecksum is verified, then we must use the XsdtAddress (a 64-bit physical address, normally in the lower 4GB) to find the other tables. If it is an ACPI 1.0 table then we use the RsdtAddress.

Having found the address, we use it to locate the first ACPI table. The starting table contains pointers to all the other tables (32-bit in the RSDT, 64-bit in the XSDT of ACPI 2.x+) after the header. This physical address is above 1MB and hence it is only accessible from protected (or unreal) mode. There are many ACPI tables but we are only interested in a few of them.

All of them have the following header:

struct ACPISDTHeader 
  {
  char Signature[4];
  uint32_t Length;           // length of the entire table, header included
  uint8_t Revision;
  uint8_t Checksum;
  char OEMID[6];
  char OEMTableID[8];
  uint32_t OEMRevision;
  uint32_t CreatorID;
  uint32_t CreatorRevision;
  };                         // 36 (0x24) bytes in total

The first table that we will find contains the pointers to all other ACPI tables after this header. The Length member contains the length of the entire table, including the header.

To find how many processors we have, we find the «MADT» table, a table which has the signature «APIC» in its header. After the standard header, we have:

At offset 0x24, the Local APIC Address, which we will need later.
At offset 0x2C, the rest of the MADT table contains a sequence of variable length records which enumerate the interrupt devices. Each record begins with 2 header bytes, one for the type and one for the length. If the type byte is 0, then the length byte is followed by 6 bytes describing a physical CPU. The first byte is the ACPI Processor ID and the second byte is the APIC ID of this processor.

Looping over the above table reveals all the installed processors along with their ACPI and APIC IDs.
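A minimal C sketch of that loop, working on a copy of the MADT in a byte buffer. The function name and parameters are invented for this example; it only uses the layout described above (records from offset 0x2C, type byte, length byte, then the ACPI Processor ID and the APIC ID).

```c
#include <stddef.h>
#include <stdint.h>

/* Walks the MADT records and stores the APIC ID of every type-0 record.
   madt_len is the Length field from the table header.
   Returns the number of IDs written to apic_ids (at most max_ids). */
static size_t madt_collect_apic_ids(const uint8_t *madt, size_t madt_len,
                                    uint8_t *apic_ids, size_t max_ids)
{
    size_t count = 0;
    size_t off = 0x2C;                    /* first record */

    while (off + 2 <= madt_len) {
        uint8_t type = madt[off];
        uint8_t len  = madt[off + 1];
        if (len < 2 || off + len > madt_len)
            break;                        /* malformed record: stop */
        if (type == 0 && len >= 8 && count < max_ids)
            apic_ids[count++] = madt[off + 3];   /* off+2 is the ACPI ID */
        off += len;                       /* the length includes the 2 header bytes */
    }
    return count;
}
```

Records of other types are skipped by their length byte, so the walk does not need to understand them.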

Initial Startup

A CPU can communicate with another CPU by issuing an «Interprocessor Interrupt» (IPI). To prepare the APIC to manage interrupts, we have to enable the «Spurious Interrupt Vector Register», indexed at 0xF0:

MOV EDI,[LocalApic]
ADD EDI,0x0F0
MOV EDX,[FS:EDI]
OR EDX,0x1FF
MOV [FS:EDI],EDX

After that, we are ready to send IPIs. An IPI (Interprocessor Interrupt) is sent by using the Interrupt Command Register of the Local APIC. This consists of two 32-bit registers, one at offset 0x300 and one at offset 0x310 (All Local APIC registers are aligned to 16 bytes):

The register at 0x310 is the one we write first, and it contains the Local APIC ID of the processor we want to send the interrupt to, in bits 24 to 31 (only bits 24 to 27 on older processors).
The register at 0x300 has the following structure:

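struct R300
    {
    unsigned int VectorNumber:8;         // bits 0-7
    unsigned int DestinationMode:3;      // bits 8-10: 5 = INIT, 6 = SIPI
    unsigned int DestinationModeType:1;  // bit 11
    unsigned int DeliveryStatus:1;       // bit 12
    unsigned int R1:1;                   // bit 13, reserved
    unsigned int InitDeAssertClear:1;    // bit 14
    unsigned int InitDeAssertSet:1;      // bit 15
    unsigned int R2:2;                   // bits 16-17, reserved
    unsigned int DestinationType:2;      // bits 18-19
    unsigned int R3:12;                  // bits 20-31, reserved
    };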

Writing to register 0x300 will actually send the IPI (that is why you must write to 0x310 first). Note that if DestinationType is not 0, the Destination target in the register 0x310 is ignored. Under Windows, IPIs are sent with an IRQL level 29 (x86) or 14 (x64).

As we know, the CPU starts in real mode at 0xF000:0xFFF0, but this is only true for the first CPU (the bootstrap processor). All other CPUs stay «asleep» until woken up, in a special state called Wait-for-SIPI. The main CPU awakes other CPUs by sending a SIPI (Startup Inter-Processor Interrupt) which contains the startup address for that CPU. Later on, there are other Inter-processor Interrupts to communicate between the CPUs.

To awake the processor, we send two special IPIs. The first is the «Init» IPI, DestinationMode 5, which resets the target CPU and leaves it waiting for a SIPI. The second IPI is the SIPI, DestinationMode 6, which starts the CPU. Because the processor starts in real mode, we have to give it a real mode address, stored in VectorNumber as a page number: the CPU begins executing at physical address VectorNumber * 4096. The starting address must therefore be 4096 aligned and below 1MB.
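The values written to the two ICR registers for this wake-up sequence can be sketched in C. The helper names are invented for this example, and setting bit 14 (level assert) follows common practice rather than anything stated above; the actual writes to LocalApic + 0x310 and LocalApic + 0x300 are left out.

```c
#include <stdint.h>

#define ICR_MODE_INIT     (5u << 8)    /* DestinationMode 5 in bits 8-10 */
#define ICR_MODE_SIPI     (6u << 8)    /* DestinationMode 6 in bits 8-10 */
#define ICR_LEVEL_ASSERT  (1u << 14)   /* assumption: assert level, as is customary */

/* Value for the register at 0x310: target APIC ID in bits 24-31. */
static uint32_t icr_high(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}

/* Value for the register at 0x300 that sends an INIT IPI. */
static uint32_t icr_low_init(void)
{
    return ICR_MODE_INIT | ICR_LEVEL_ASSERT;
}

/* Value for the register at 0x300 that sends a SIPI.
   start_addr must be 4096-aligned and below 1MB; the vector is its page number. */
static uint32_t icr_low_sipi(uint32_t start_addr)
{
    return ICR_MODE_SIPI | ICR_LEVEL_ASSERT | ((start_addr >> 12) & 0xFF);
}
```

A startup routine placed at physical address 0x8000, for example, is encoded as vector 8 in the SIPI.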

Later Communication

Apart from INIT and SIPI, which we saw above, the local APIC can be used to send a normal interrupt, i.e., merely executing INT XX in the context of the target CPU. We have to take into consideration the following:

If the CPU is in HLT state, the interrupt awakes it, and when the interrupt returns the CPU resumes with the instruction after the HLT opcode. If interrupts were also disabled with CLI, then we must send an NMI (selected through the Interrupt Command Register) to wake the CPU.
If the CPU is in HLT state and we send again an INIT and a SIPI, the CPU starts all over again from real mode.
The interrupt must exist in the target processor. For example, in protected mode, the interrupt must have been defined in IDT.
The Local APIC is common to all CPUS (memorywise), therefore, we must lock for write access (mutex) before we can issue the interrupt.
Because the registers cannot be passed from CPU to CPU, we have to write all the registers (that will be used for the interrupt, if any) in a separated memory area.
The interrupt might fail, so, you have to rely on some inter-cpu communication (via shared memory and mutexes) to verify the delivery.
Finally, the handler of the interrupt must tell its own Local APIC that there is an «End of Interrupt». This is similar to the out 020h,al that handlers sent to the old interrupt controller in the past. Now we write the value 0 to the EOI register (LocalApic + 0xB0).

Synchronization

Since the CPUS share the same memory, it is crucial to synchronize write and read accesses to critical parts of it. In Windows of course we have mutexes ready to be used, but here some extra work has to be done. We can create our own mutex variable as follows:

Initialization, put a byte to value 0xFF
Lock mutex, decrease its value
Unlock mutex, increase its value unless already 0xFF
Wait for a mutex, but not lock it: A simple loop.

; assuming edi has the address
.Loop1:        
CMP byte [edi],0xff
JZ .OutLoop1
pause 
JMP .Loop1
.OutLoop1:

Note the pause opcode (encoded as rep nop). This is a hint to the CPU that we are inside a spin loop, which improves performance and reduces power consumption while we wait.

Our problem is to wait for a mutex, then grab it when it is free (similar to WaitForSingleObject()). This code is not going to work:

.Loop1: 
CMP byte [edi],0xff 
JZ .OutLoop1 
pause 
JMP .Loop1 
.OutLoop1:
.MutexIsFree:
DEC byte [edi]

The reason is that, between the JZ command (which has verified that the mutex is free) and before the DEC [edi] is executed, another CPU might grab the mutex (race condition).

Fortunately for us, the CPU provides a LOCK CMPXCHG opcode which atomically grabs the lock for us:

.Loop1:        
CMP byte [edi],0xff
JZ .OutLoop1
pause 
JMP .Loop1
.OutLoop1:

mov bl,0xfe
MOV AL,0xFF
LOCK CMPXCHG [EDI],bl
JNZ .Loop1 
.OutLoop2:

We use the CMPXCHG instruction which, along with the LOCK prefix, atomically tests [edi] if it is still 0xFF (the value in AL), and if yes, then it writes BL to it and sets the ZF. If another CPU has done the same meanwhile, the ZF is cleared and BL is not moved to the [edi].
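In C the same idea is available through C11 atomics, where a compare-exchange on a byte is typically compiled to LOCK CMPXCHG on x86. The sketch below uses invented names and simplifies the unlock rule to storing 0xFF, since this mutex only ever holds 0xFF (free) or 0xFE (locked).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MUTEX_FREE    0xFFu
#define MUTEX_LOCKED  0xFEu

typedef _Atomic uint8_t byte_mutex;

static void mutex_init(byte_mutex *m)
{
    atomic_store(m, MUTEX_FREE);
}

/* One atomic attempt: succeeds only if the byte is still 0xFF. */
static bool mutex_try_lock(byte_mutex *m)
{
    uint8_t expected = MUTEX_FREE;
    return atomic_compare_exchange_strong(m, &expected, MUTEX_LOCKED);
}

/* Wait for the mutex and grab it, like the assembly loop above. */
static void mutex_lock(byte_mutex *m)
{
    while (!mutex_try_lock(m)) {
        while (atomic_load(m) != MUTEX_FREE)
            ;   /* spin; the assembly version issues PAUSE here */
    }
}

static void mutex_unlock(byte_mutex *m)
{
    atomic_store(m, MUTEX_FREE);
}
```

The test and the write happen in one atomic step, so no other CPU can slip in between them.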

Virtualization

Virtualization, technically, is a «system» inside the system. It is a clone of the processor running inside the same processor. It is not very complex to set up and it greatly enhances computing since you are able to run another OS inside an existing OS.

Each CPU (called Host) can run one Virtual Machine (called guest) at a time. You can configure multiple guests per CPU and pause/resume each guest, much like multitasking. If you have 8 CPU cores of course, you can have 8 guests running simultaneously.

The lifecycle of VM operations is as follows:

Test if the CPU supports virtualization:

mov eax,1
cpuid
bt ecx,5
jc VMX_Supported
jmp VMX_NotSupported

Check CPU-specific revision from the IA32_VMX_BASIC register:
```
mov ecx, 0480h
rdmsr
```
This 64-bit register contains important information for our project:
- Bits 0 to 30: 31-bit VMX Revision Number (bit 31 is always 0)
- Bits 32 to 44: Number of bytes (up to 4096) which we will need to allocate later.
Enable VMX operations
```
mov rax,cr4
bts rax,13
mov cr4,rax
```
Configure a VMXON structure. This is a 4096-aligned CPU-specific array and its size must be the number we got from the IA32_VMX_BASIC register. A VMXON structure contains:
- 4 bytes which hold the revision number
- 4 bytes that are used for VMX Abort data (we will check that later)
Execute the VMXON command
For each guest, configure a VMCS. A VMCS is a 4096-aligned CPU-specific array which we need to allocate for each guest, and its size must be the number we got from the IA32_VMX_BASIC register. To load a VMCS for configuration we use the VMPTRLD opcode. To read or write into the VMCS we use VMREAD and VMWRITE, and to initialize it we use VMCLEAR. A VMCS contains:
- 4 bytes which hold the revision number
- 4 bytes that are used for VMX Abort data (we will check that later),
- The rest of the bytes are used by VMCS groups (we will check that later).
Configure the memory available to the guests.
Launch a guest with VMLAUNCH.
Guest returns (exits) to the host on specific conditions.
Host uses VMRESUME to resume a guest after an exit. There is no separate pause instruction: a guest is paused whenever it exits to the host.
When the guest terminates, host uses VMXOFF to turn off VMX operations.
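The two IA32_VMX_BASIC fields used in the steps above can be decoded as follows. This is a sketch with invented function names; the 64-bit input is assumed to have been read with rdmsr (EDX:EAX combined), which is not shown.

```c
#include <stdint.h>
#include <string.h>

/* Bits 0 to 30: revision number to store in the VMXON structure and VMCS. */
static uint32_t vmx_revision(uint64_t vmx_basic)
{
    return (uint32_t)(vmx_basic & 0x7FFFFFFFu);
}

/* Bits 32 to 44: number of bytes to allocate for each structure. */
static uint32_t vmx_region_size(uint64_t vmx_basic)
{
    return (uint32_t)((vmx_basic >> 32) & 0x1FFF);
}

/* Hypothetical helper: zero a region and stamp the revision into its
   first 4 bytes, as required before VMXON or VMPTRLD. */
static void vmx_prepare_region(uint8_t *region, uint64_t vmx_basic)
{
    uint32_t rev = vmx_revision(vmx_basic);
    memset(region, 0, vmx_region_size(vmx_basic));
    memcpy(region, &rev, sizeof rev);
}
```

The caller is responsible for giving `region` the required 4096-byte alignment.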

VMCS Groups

The rest of the VMCS (that is, everything after the first 8 bytes holding the revision and the VMX Abort data) is divided into 6 subgroups:

Guest State
Host State
Non root controls
VMExit controls
VMEntry controls
VMExit information

Each of the above fields contains important information. We will look at them one by one. To mark a VMCS for further reading/writing with VMREAD or VMWRITE, you would first initialize its first 4 bytes to the revision (as with the VMXON structure above), and then execute a VMPTRLD with its address.

Appendix H of the 3B Intel Manual has a list of all indices. For example, the index of the RIP of the guest is 0x681e. To write the value 0 to that field, we would use:

mov rax,0681eh
mov rbx,0
vmwrite rax,rbx

Not all features are always present in all processors. We must check the VMX MSRs for available features before using them. Intel’s 3B Appendix G contains all these MSRs. To read an MSR, you put its number in ECX and execute the rdmsr opcode. The 64-bit result is returned in EDX:EAX (high dword in EDX, low dword in EAX).

IA32_VMX_BASIC (0x480): Basic VMX information including revision, VMCS size, memory types and others.
IA32_VMX_PINBASED_CTLS (0x481): Allowed settings for pin-based VM execution controls.
IA32_VMX_PROCBASED_CTLS (0x482): Allowed settings for processor based VM execution controls.
IA32_VMX_PROCBASED_CTLS2 (0x48B): Allowed settings for secondary processor based VM execution controls.
IA32_VMX_EXIT_CTLS (0x483): Allowed settings for VM Exit controls.
IA32_VMX_ENTRY_CTLS (0x484): Allowed settings for VM Entry controls.
IA32_VMX_MISC MSR (0x485): Allowed settings for miscellaneous data, such as RDTSC options, unrestricted guest availability, activity state and others.
IA32_VMX_CR0_FIXED0 (0x486) and IA32_VMX_CR0_FIXED1 (0x487): Indicate which bits of CR0 are allowed to be 0 or 1 in VMX operation.
IA32_VMX_CR4_FIXED0 (0x488) and IA32_VMX_CR4_FIXED1 (0x489): Same for CR4.
IA32_VMX_VMCS_ENUM (0x48A): enumerator helper for VMCS.
IA32_VMX_EPT_VPID_CAP (0x48C): provides information for capabilities regarding VPIDs and EPT.
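For the «allowed settings» MSRs of the control fields (0x481 to 0x484 and 0x48B), the Intel manual defines the low dword as the bits that must be 1 and the high dword as the bits that may be 1. A common way to use them is sketched below; the function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Adjusts a desired control value to what the CPU accepts.
   cap_msr low dword:  bits that must be 1.
   cap_msr high dword: bits that are allowed to be 1. */
static uint32_t vmx_adjust_controls(uint32_t desired, uint64_t cap_msr)
{
    uint32_t must_be_one = (uint32_t)cap_msr;
    uint32_t may_be_one  = (uint32_t)(cap_msr >> 32);
    return (desired | must_be_one) & may_be_one;
}

/* True if control bit `bit` (0 to 31) may be set on this CPU. */
static bool vmx_control_supported(uint64_t cap_msr, unsigned bit)
{
    return ((cap_msr >> (32 + bit)) & 1) != 0;
}
```

The adjusted value is what would be written into the matching VMCS control field with vmwrite.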

The Host State

This contains the following information (in parentheses, the size in bits):

CR0,CR3,CR4,RSP,RIP (64 each)
CS,SS,DS,ES,FS,GS,TR selectors (16 each)
FS,GS,TR,GDTR,IDTR base addresses (64 each)
IA32_SYSENTER_CS (32)
IA32_SYSENTER_ESP (64)
IA32_SYSENTER_EIP (64)
*IA32_PERF_GLOBAL_CTRL (64)
*IA32_PAT (64)
*IA32_EFER (64)

The host state tells the CPU how to return to the host after the guest exits. After a successful VMLAUNCH or VMRESUME command (if this command fails, execution resumes after it), the host is paused until the guest exits. When the guest exits, the host is reloaded with values from this VMCS group.

The Guest State

This contains the following information (in parentheses, the size in bits):

CR0,CR3,CR4,DR7,RSP,RIP,RFLAGS, (64 each)
For each of CS,SS,DS,ES,FS,GS,LDTR,TR:
- Selector (16)
- Base address (64)
- Segment limits (32)
- Access rights (32)
For GDTR and IDTR:
- Base address (64)
- Limit (32)
IA32_DEBUGCTL (64)
IA32_SYSENTER_CS (32)
IA32_SYSENTER_ESP (64)
IA32_SYSENTER_EIP (64)
IA32_PERF_GLOBAL_CTRL (64)
IA32_PAT (64)
IA32_EFER (64)
SMBASE (32)
Activity State (32): 0 Active, 1 Inactive (HLT executed), 2 Shutdown (triple fault occurred), 3 Waiting for startup IPI (SIPI).
Interruptibility state (32): a state that defines some features that should be blocked in the VM.
Pending debug exceptions (64): to facilitate hardware breakpoints with DR7.
VMCS Link pointer (64): reserved, set to 0xFFFFFFFFFFFFFFFF.
VMX Preemption timer value (32)
Page Directory pointer table entries (4×64), pointers to pages.

This group defines how the guest will start. The guest can be started in two modes:

Paged 32 bit protected mode.
Real mode (unrestricted guest), if the CPU supports it.

Starting a guest in paged protected mode does not allow the guest to switch to long mode later and does not allow modifications of the GDT. If a guest expects a real mode start but unrestricted guest is not available, then you can start it in VM86 mode.

In unrestricted guest, the guest starts in real mode and can modify any register allowed by the VMCS control fields. Note that you still load protected mode style segments for CS and the real mode starts with a protected mode selector, but you can immediately load a new real mode segment with a JMP.

The Execution Control Fields

These fields configure what is allowed to be executed in the guest and what is not. Everything not allowed causes a VMEXIT. The sections are:

Pin-Based (32b) : Interrupts
Processor-Based (2x32b)
- Primary: Single Step, TSC HLT INVLPG MWAIT CR3 CR8 DR0 I/O Bitmaps
- Secondary: EPT, Descriptor Table Change, Unrestricted Guest and others
Exception bitmap (32b): One bit for each exception. If bit is 1, the exception causes a VMExit.
I/O bitmap addresses (2x64b): Controls when IN/OUT cause VMExit.
Time Stamp Counter offset
CR0/CR4 guest/host masks
CR3 Targets
APIC Access
MSR Bitmaps

For example, you can configure it so that an exception makes it to the host, instead of being caught in the guest. Similarly you might not allow GDT changes, Control Register changes etc.

Exit Control Fields

These fields tell the CPU what to load and what to discard in case of a VMExit:

VMExit Controls (32b)
VMExit Controls for MSRs

Entry Control Fields

These fields tell the CPU what to load and what to inject into the guest on a VMEntry:

VMEntry Controls (32b)
VMEntry Controls for MSRs
VMEntry Controls for event injection

Exit Information Field (Read only)

Basic information
- Exit Reason (32)
- Exit Qualification (64)
- Guest Linear Address (64)
- Guest Physical Address (64)
Vectored exit information
Event delivery exits
Instruction execution exits
Error field

EPT

EPT (Extended Page Tables) is a mechanism that translates guest physical addresses to host physical addresses. Its tables are walked in the same way as the long mode paging structures.

If you start the guest in Paged Protected Mode, then EPT is not required. Using Unrestricted Guest requires us to use EPT. You can check the 0x48B (IA32_VMX_PROCBASED_CTLS2) MSR to see if Unrestricted Guest is supported: the control is bit 7, and whether it may be set is reported in the high dword (bit 39 of the MSR).

Manual Exits

A guest that knows that it is a guest might want to deliberately exchange information with its host. For this reason, the instruction VMCALL is provided to manually trigger an exit.

DMMI

DPMI works, but a long mode driver is also needed. Therefore I have decided to create a TSR service, included in the github project. I’ve called it DOS Multicore Mode Interface. It is a driver which helps you develop 32 and 64 bit applications for DOS, using int 0xF0. This interrupt is accessible from real, protected and long mode. Put the function number in AH.

To check for existence, check the vector for INT 0xF0. It should not be pointing to 0 or to an IRET, ES:BX+2 should point to a dword ‘dmmi’.

Int 0xF0 provides the following functions to all modes (real, protected, long)

AH = 0, verify existence. Return values, AX = 0xFACE if the driver exists, DL = total CPUs, DH = virtualization support (0 none, 1 PM only, 2 Unrestricted guest). This function is accessible from real, protected and long mode.
AH = 1, begin thread. BL is the CPU index (1 to max-1). The function creates a thread, depending on AL:
- 0, begin (un)real mode thread. ES:DX = new thread seg:ofs. The thread is run with FS capable of unreal mode addressing, must use RETF to return.
- 1, begin 32 bit protected mode thread. EDX is the linear address of the thread. The thread must return with RETF.
- 2, begin 64 bit long mode thread. EDX holds the linear address of the code to start in 64-bit long mode. The thread must terminate with RET.
- 3, begin virtualized thread. BH contains the virtualization mode (1 for unrestricted guest real mode thread, and 2 for protected mode), and EDX the virtualized linear stack (or in seg:ofs format if unrestricted guest). The thread must return with RETF or VMCALL.
AH = 5, mutex functions. This function is accessible from all modes.
- AL = 0 => initialize mutex to ES:DI (real) , EDI linear (protected), RDI linear (long).
- AL = 1 => Lock mutex
- AL = 2 => Unlock mutex
- AL = 3 => Wait for mutex

AH = 4, execute real mode interrupt. This function is accessible from all modes. AL is the interrupt number, BP holds the AX value and BX,CX,DX,SI,DI are passed to the interrupt. DS and ES are loaded from the high 16 bits of ESI and EDI.
AH = 9, switch to mode. AL = 0 -> Unreal mode, returns immediately (also available from protected and long mode int 0xF0). AL = 1 -> Protected mode, ECX = linear address to start. AL = 2 -> Long Mode, ECX = linear address to start.

If you have more than one CPU, your DOS applications and games can now directly access all 2^64 bytes of memory and all your CPUs, while still being able to call DOS directly. So that int 0xF0 need not be called directly from assembly, and to make the driver compatible with higher level languages, an INT 0x21 redirection handler is installed. If you call INT 0x21 from the main thread, it is executed directly. If you call INT 0x21 from a protected or long mode thread, INT 0xF0 function AX = 0x0421 is executed automatically.

Virtualization Debugger

Debugging protected or long mode code under DOS is next to impossible. I am now trying to create a simple DEBUG enhancement, called VDEBUG, which should be able to debug any DOS app under virtualization.

This app should perform the following:

Load the debuggee (int 0x21, function 0x4B01)
Enter long mode (int 0xf0, function 0x0902)
Prepare virtualization structures (int 0xf0, function 0x0801)
Launch an unrestricted guest VM
In the VM, set the trap flag so each opcode causes a VMEXIT.
Jump to the entry point of the debuggee
When the target process calls int 0x21 function 0x4C to terminate, control returns to the instruction following the int 0x21 function 0x4B01 call. Check there whether the code is running under a virtual machine. If so, do VMCALL to exit.
Go back to real mode and exit.

At the moment, the implemented functions are:

r — (registers) — shows Control, General, Segment regs, Disassembly and bytes using UDIS86
g — (go) — runs program
t — (trace) — traces commands
h — (help) — shows help
q — (quit) — quits

Compile with VDEBUG=1 in config.asm to enable VDebug.

Multicore Debugger

Debugging protected or long mode code under DOS is next to impossible (again). I am now trying to create a simple DEBUG enhancement, called MDEBUG, which should be able to debug any DOS app from another CPU core.

This app should perform the following:

Jump to another core
Load the debuggee (int 0x21, function 0x4B01)
Set the trap flag
On exception, HLT the first processor then go to the MDEBUG processor
On resume, send resume IPI to the first processor

This project is not yet created, but I hope that it will be here soon!

Switcher

True DOS multitasking with this DMMI client. This app should perform the following:

Prompt for core, executable and parameters.
Run the executable in virtualization mode within the specific processor.
On some key combination (for example Ctrl+Alt+Ins), VMCALL and pause the VM
Switch between applications on demand.

Soon to be created!

The project

The full GitHub project includes many functions discussed in this article. It’s arranged with 4 filters: 16 bit code, 32 bit code, data, DMMI client and configuration files.

The fact that you made it to the end means that you are truly determined. Have fun and good luck!


Keyword

Reserved keywords of x86 assembly language^[4]^[5]

lds
les
lfs
lgs
lss
pop
push
in
ins
out
outs
lahf
sahf
popf
pushf
cmc
clc
stc
cli
sti
cld
std
add
adc
sub
sbb
cmp
inc
dec
test
sal
shl
sar
shr
shld
shrd
not
neg
bound
and
or
xor
imul
mul
div
idiv
cbtw
cwtl
cwtd
cltd
daa
das
aaa
aas
aam
aad
wait
fwait
movs
cmps
stos
lods
scas
xlat
rep
repnz
repz
lcall
call
ret
lret
enter
leave
jcxz
loop
loopnz
loopz
jmp
ljmp
int
into
iret
sldt
str
lldt
ltr
verr
verw
sgdt
sidt
lgdt
lidt
smsw
lmsw
lar
lsl
clts
arpl
bsf
bsr
bt
btc
btr
bts
cmpxchg
fsin
fcos
fsincos
fld
fldcw
fldenv
fprem
fucom
fucomp
fucompp
lea
mov
movw
movsx
movzb
popa
pusha
rcl
rcr
rol
ror
setcc
bswap
xadd
xchg
wbinvd
invd
invlpg
lock
nop
hlt
fld
fst
fstp
fxch
fild
fist
fistp
fbld
fbstp
fadd
faddp
fiadd
fsub
fsubp
fsubr
fsubrp
fisubrp
fisubr
fmul
fmulp
fimul
fdiv
fdivp
fdivr
fdivrp
fidiv
fidivr
fsqrt
fscale
fprem
frndint
fxtract
fabs
fchs
fcom
fcomp
fcompp
ficom
ficomp
ftst
fxam
fptan
fpatan
f2xm1
fyl2x
fyl2xp1
fldl2e
fldl2t
fldlg2
fldln2
fldpi
fldz
finit
fninit
fnop
fsave
fnsave
fstcw
fnstcw
fstenv
fnstenv
fstsw
fnstsw
frstor
fclex
fnclex
fdecstp
ffree
fincstp

Mnemonics and opcodes

Syntax

Parameter order
- AT&T: movl $5, %eax. Source before the destination.
- Intel: mov eax, 5. Destination before source.
Parameter size
- AT&T: addl $0x24, %esp; movslq %ecx, %rax; paddd %xmm1, %xmm2. Mnemonics are suffixed with a letter indicating the size of the operands: q for qword (64 bits), l for long (dword, 32 bits), w for word (16 bits), and b for byte (8 bits).^[6]
- Intel: add esp, 24h; movsxd rax, ecx; paddd xmm2, xmm1. The size is derived from the name of the register that is used (e.g. rax, eax, ax, al imply q, l, w, b, respectively). Width-based names may still appear in instructions when they define a different operation: MOVSXD refers to sign extension with dword input, unlike MOVSX. SIMD registers have width-named instructions that determine how to split up the register. AT&T tends to keep these names unchanged, so PADDD is not renamed to «paddl».
Sigils
- AT&T: Immediate values are prefixed with a «$», registers with a «%».^[6]
- Intel: The assembler automatically detects the type of symbols, i.e. whether they are registers, constants or something else.
Effective addresses
- AT&T: movl offset(%ebx,%ecx,4), %eax. General syntax of DISP(BASE,INDEX,SCALE).
- Intel: mov eax, [ebx + ecx*4 + offset]. Arithmetic expressions in square brackets; additionally, size keywords like byte, word, or dword have to be used if the size cannot be determined from the operands.^[6]
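
As a rough illustration of what both notations compute, the DISP(BASE,INDEX,SCALE) form can be modelled in C. The function name and parameters below are illustrative assumptions, not part of any assembler; the sketch only shows the address arithmetic, wrapping at 32 bits as in 32-bit mode.

```c
#include <stdint.h>

/* Effective address for AT&T disp(base,index,scale), which is the Intel
   form [base + index*scale + disp]. The hardware accepts a scale of
   1, 2, 4 or 8; this model does not check that. */
uint32_t effective_address(uint32_t base, uint32_t index,
                           uint32_t scale, int32_t disp)
{
    /* Unsigned arithmetic wraps modulo 2^32, like a 32-bit address. */
    return base + index * scale + (uint32_t)disp;
}
```

For instance, with ebx = 0x1000, ecx = 3 and a displacement of 8, movl 8(%ebx,%ecx,4), %eax would read from address 0x1014.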

Registers

AX multiply/divide, string load & store
BX base register for memory addressing (for example with MOV or XLAT)
CX count for string operations & shifts
DX port address for IN and OUT
SP points to top of the stack
BP points to base of the stack frame
SI points to a source in stream operations
DI points to a destination in stream operations

Along with the general registers there are additionally the:

IP instruction pointer
FLAGS
segment registers (CS, DS, ES, FS, GS, SS) which determine where a 64k segment starts (no FS & GS in 80286 & earlier)
extra extension registers (MMX, 3DNow!, SSE, etc.) (Pentium & later only).

Registers are loaded and copied with the MOV instruction. For example, in Intel syntax:

mov ax, 1234h ; copies the value 1234hex (4660d) into register AX

mov bx, ax    ; copies the value of the AX register into the BX register

Segmented addressing

There are some special combinations of segment registers and general registers that point to important addresses:

CS:IP (CS is Code Segment, IP is Instruction Pointer) points to the address where the processor will fetch the next byte of code.
SS:SP (SS is Stack Segment, SP is Stack Pointer) points to the address of the top of the stack, i.e. the most recently pushed byte.
SS:BP (SS is Stack Segment, BP is Stack Frame Pointer) points to the address of the top of the stack frame, i.e. the base of the data area in the call stack for the currently active subprogram.
DS:SI (DS is Data Segment, SI is Source Index) is often used to point to string data that is about to be copied to ES:DI.
ES:DI (ES is Extra Segment, DI is Destination Index) is typically used to point to the destination for a string copy, as mentioned above.
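
In real mode, a segment:offset pair such as the ones above is turned into a physical address by shifting the segment value left by four bits and adding the offset. A minimal C sketch of that calculation follows. The function name is hypothetical, and the 20-bit mask models the address bus of the original 8086 (on the 80286 and later, addresses just above 1 MB can be reached instead of wrapping).

```c
#include <stdint.h>

/* Real-mode address translation: linear = segment * 16 + offset.
   The mask keeps 20 bits, an assumption that models 8086-style
   wrap-around at 1 MB. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    uint32_t linear = ((uint32_t)segment << 4) + offset;
    return linear & 0xFFFFFu;
}
```

With this model, the pair 0x1234:0x0010 maps to linear address 0x12350, and many different segment:offset pairs name the same byte.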

Execution modes

The modes in which x86 code can be executed are:

Real mode (16-bit)
- 20-bit segmented memory address space (meaning that only 1 MB of memory can be addressed; since the 80286, slightly more is reachable through the HMA), direct software access to peripheral hardware, and no concept of memory protection or multitasking at the hardware level. Computers that use BIOS start up in this mode.
Protected mode (16-bit and 32-bit)
- Expands addressable physical memory to 16 MB and addressable virtual memory to 1 GB. Provides privilege levels and protected memory, which prevents programs from corrupting one another. 16-bit protected mode (used during the end of the DOS era) used a complex, multi-segmented memory model. 32-bit protected mode uses a simple, flat memory model.
Long mode (64-bit)
- Mostly an extension of the 32-bit (protected mode) instruction set, but unlike the 16–to–32-bit transition, many instructions were dropped in the 64-bit mode. Pioneered by AMD.
Virtual 8086 mode (16-bit)
- A special hybrid operating mode that allows real mode programs and operating systems to run while under the control of a protected mode supervisor operating system
System Management Mode (16-bit)
- Handles system-wide functions like power management, system hardware control, and proprietary OEM designed code. It is intended for use only by system firmware. All normal execution, including the operating system, is suspended. An alternate software system (which usually resides in the computer’s firmware, or a hardware-assisted debugger) is then executed with high privileges.

Switching modes

Examples

With a computer running UEFI, the UEFI firmware (except CSM and legacy Option ROM), the UEFI boot loader and the UEFI operating system kernel all run in Long mode.

Instruction types

In general, the features of the modern x86 instruction set are:

A compact encoding
- Variable length and alignment independent (encoded as little endian, as is all data in the x86 architecture)
- Mainly one-address and two-address instructions, that is to say, the first operand is also the destination.
- Memory operands as both source and destination are supported (frequently used to read/write stack elements addressed using small immediate offsets).
- Both general and implicit register usage; although all seven (counting ebp) general registers in 32-bit mode, and all fifteen (counting rbp) general registers in 64-bit mode, can be freely used as accumulators or for addressing, most of them are also implicitly used by certain (more or less) special instructions; affected registers must therefore be temporarily preserved (normally stacked), if active during such instruction sequences.
Produces conditional flags implicitly through most integer ALU instructions.
Supports various addressing modes including immediate, offset, and scaled index. PC-relative addressing is available only for jumps and calls in 16-bit and 32-bit code; the x86-64 architecture added it for data as an improvement.
Includes floating-point instructions that operate on a stack of registers.
Contains special support for atomic read-modify-write instructions (xchg, cmpxchg/cmpxchg8b, xadd, and integer instructions which combine with the lock prefix)
SIMD instructions (instructions which perform parallel simultaneous single instructions on many operands encoded in adjacent cells of wider registers).

Stack instructions

Integer ALU instructions

Floating-point instructions

SIMD instructions

Memory instructions

This code is the beginning of a function typical for a high-level language when compiler optimisation is turned off for ease of debugging:

 push    rbp       ; Save the calling function’s stack frame pointer (rbp register)
 mov     rbp, rsp  ; Make a new stack frame below our caller’s stack
 sub     rsp, 32   ; Reserve 32 bytes of stack space for this function’s local variables.
                   ; Local variables will be below rbp and can be referenced relative to rbp,
                   ; again best for ease of debugging, but for best performance rbp will not
                   ; be used at all, and local variables would be referenced relative to rsp
                   ; because, apart from the code saving, rbp then is free for other uses.
  …       …        ; However, if rbp is altered here, its value should be preserved for the caller.
 mov [rbp-8], rdx  ; Example of accessing a local variable: stores register rdx into the memory location at rbp-8

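The three-instruction prologue above (push rbp, then mov rbp, rsp, then sub rsp, 32) is functionally equivalent to the single instruction enter 32, 0.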

Other instructions for manipulating the stack include pushfd (32-bit) / pushfq (64-bit) and popfd / popfq for storing and retrieving the EFLAGS (32-bit) / RFLAGS (64-bit) register.

Program flow

Examples

«Hello world!» program for MS-DOS in MASM-style assembly

.model small
.stack 100h

.data
msg	db	'Hello world!$'

.code
start:
    mov ax, @DATA  ; Initializes Data segment
    mov ds, ax
	mov	ah, 09h    ; Sets 8-bit register ‘ah’, the high byte of register ax, to 9, to
                   ; select a sub-function number of an MS-DOS routine called below
                   ; via the software interrupt int 21h to display a message
	lea	dx, msg    ; Takes the address of msg, stores the address in 16-bit register dx
	int	21h        ; Various MS-DOS routines are callable by the software interrupt 21h
                   ; Our required sub-function was set in register ah above

	mov	ax, 4C00h  ; Sets register ax to the sub-function number for MS-DOS’s software
                   ; interrupt int 21h for the service ‘terminate program’.
	int	21h        ; Calling this MS-DOS service never returns, as it ends the program.

end start

«Hello world!» program for Windows in MASM style assembly

; requires /coff switch on 6.15 and earlier versions
.386
.model small,c
.stack 1000h

.data
msg     db "Hello world!",0

.code
includelib libcmt.lib
includelib libvcruntime.lib
includelib libucrt.lib
includelib legacy_stdio_definitions.lib

extrn printf:near
extrn exit:near

public main
main proc
        push    offset msg
        call    printf
        push    0
        call    exit
main endp

end

«Hello world!» program for Windows in NASM style assembly

; Image base = 0x00400000
%define RVA(x) (x-0x00400000)
section .text
push dword hello
call dword [printf]
push byte +0
call dword [exit]
ret

section .data
hello db "Hello world!", 0 ; printf expects a NUL-terminated string

section .idata
dd RVA(msvcrt_LookupTable)
dd -1
dd 0
dd RVA(msvcrt_string)
dd RVA(msvcrt_imports)
times 5 dd 0 ; ends the descriptor table

msvcrt_string dd "msvcrt.dll", 0
msvcrt_LookupTable:
dd RVA(msvcrt_printf)
dd RVA(msvcrt_exit)
dd 0

msvcrt_imports:
printf dd RVA(msvcrt_printf)
exit dd RVA(msvcrt_exit)
dd 0

msvcrt_printf:
dw 1
dw "printf", 0
msvcrt_exit:
dw 2
dw "exit", 0
dd 0

«Hello world!» program for Linux in its native AT&T style assembly

.data                         # section for initialized data
str: .ascii "Hello, world!\n" # define a string of text containing "Hello, world!" followed by a newline
str_len = . - str             # length of str: the current address minus the address of str

.text                         # section for program code
.globl _start                 # export _start so the linker can find the entry point
_start:                       # entry point of the program
    movl $4, %eax             # system call number 4 = sys_write
    movl $1, %ebx             # file descriptor 1 = standard output (stdout)
    movl $str, %ecx           # address of the text to write
    movl $str_len, %edx       # number of bytes to write
    int $0x80                 # trap into the kernel to perform the system call set up above

    movl $1, %eax             # system call number 1 = sys_exit
    movl $0, %ebx             # exit code 0, meaning success
    int $0x80                 # trap into the kernel again to end the program

«Hello world!» program for Linux in NASM style assembly

;
; This program runs in 32-bit protected mode.
;  build: nasm -f elf -F stabs name.asm
;  link:  ld -o name name.o
;
; In 64-bit long mode you can use 64-bit registers (e.g. rax instead of eax, rbx instead of ebx, etc.)
; Also change "-f elf " for "-f elf64" in build command.
;
section .data                           ; section for initialized data
msg:     db 'Hello world!', 0Ah         ; message string with new-line char at the end (10 decimal)
                                        ; ('str' is avoided as a label name: it is also an instruction mnemonic)
msg_len: equ $ - msg                    ; length of the string in bytes: ‘here’ (‘$’) minus the start address of msg

section .text                           ; this is the code section (program text) in memory
global _start                           ; _start is the entry point and needs global scope to be 'seen' by the
                                        ; linker; it plays the role of main() in C/C++
_start:                                 ; definition of _start procedure begins here
	mov	eax, 4                   ; specify the sys_write function code (system call number)
	mov	ebx, 1                   ; specify file descriptor stdout; in GNU/Linux, everything is treated as a file,
                                 ; even hardware devices
	mov	ecx, msg                 ; move start _address_ of string message to ecx register
	mov	edx, msg_len             ; move length of message (in bytes)
	int	80h                      ; interrupt kernel to perform the system call we just set up;
                                 ; in GNU/Linux, services are requested through the kernel
	mov	eax, 1                   ; specify sys_exit function code (system call number)
	mov	ebx, 0                   ; specify return code for OS (zero tells OS everything went fine)
	int	80h                      ; interrupt kernel to perform system call (to exit)

For 64-bit long mode, «lea rcx, [msg]» would load the address of the message; note the 64-bit register rcx.

«Hello world!» program for Linux in NASM style assembly using the C standard library

;
;  This program runs in 32-bit protected mode.
;  gcc links the standard-C library by default

;  build: nasm -f elf -F stabs name.asm
;  link:  gcc -o name name.o
;
; In 64-bit long mode you can use 64-bit registers (e.g. rax instead of eax, rbx instead of ebx, etc..)
; Also change "-f elf " for "-f elf64" in build command.
;
        global  main                            ; ‘main’ must be defined and global, because the program
                                                ; is linked against the C Standard Library
        extern  printf                          ; declares the use of external symbol, as printf
                                                ; printf is declared in a different object-module.
                                                ; The linker resolves this symbol later.

segment .data                                   ; section for initialized data
	string db 'Hello world!', 0Ah, 0            ; message string ending with a newline char (10
                                                ; decimal) and the zero byte ‘NUL’ terminator
                                                ; ‘string’ now refers to the starting address
                                                ; at which 'Hello world!' is stored.

segment .text
main:
        push    string                          ; Push the address of ‘string’ onto the stack.
                                                ; This reduces esp by 4 bytes before storing
                                                ; the 4-byte address ‘string’ into memory at
                                                ; the new esp, the new bottom of the stack.

                                                ; This will be an argument to printf()
        call    printf                          ; calls the C printf() function.
        add     esp, 4                          ; Increases the stack-pointer by 4 to put it back
                                                ; to where it was before the ‘push’, which
                                                ; reduced it by 4 bytes.
        ret                                     ; Return to our caller.

«Hello world!» program for 64-bit mode Linux in NASM style assembly

This example is in modern 64-bit mode.

;  build: nasm -f elf64 -F dwarf hello.asm
;  link:  ld -o hello hello.o

DEFAULT REL			    ; use RIP-relative addressing modes by default, so [foo] = [rel foo]

SECTION .rodata			; read-only data should go in the .rodata section on GNU/Linux, like .rdata on Windows
Hello:		db "Hello world!", 10   ; Ending with a byte 10 = newline (ASCII LF)
len_Hello:	equ $-Hello             ; Get NASM to calculate the length as an assembly-time constant
                                    ; the ‘$’ symbol means ‘here’. write() takes a length so that
                                    ; a zero-terminated C-style string isn't needed.
                                    ; It would be for C puts()

SECTION .text

global _start
_start:
	mov eax, 1				; __NR_write syscall number from Linux asm/unistd_64.h (x86_64)
	mov edi, 1				; int fd = STDOUT_FILENO
	lea rsi, [rel Hello]			; x86-64 uses RIP-relative LEA to put static addresses into regs
	mov rdx, len_Hello		; size_t count = len_Hello
	syscall					; write(1, Hello, len_Hello);  call into the kernel to actually do the system call
     ;; return value in RAX.  RCX and R11 are also overwritten by syscall

	mov eax, 60				; __NR_exit call number (x86_64) is stored in register eax.
	xor edi, edi		    ; This zeros edi and also rdi.
                            ; The xor-self trick is the common idiom for zeroing a register:
                            ; it is compact and at least as fast as loading an immediate 0.
                            ; Writing a 32-bit register such as edi always zeroes bits 63:32
                            ; of the full 64-bit register, so no extra instruction is needed
                            ; to clear the upper half when a 64-bit register is to hold a
                            ; 32-bit value.
                            ; This sets our routine’s exit status = 0 (exit normally)
	syscall					; _exit(0)

$ strace ./hello > /dev/null                    # without a redirect, your program's stdout is mixed with strace's logging on stderr.  Which is normally fine
execve("./hello", ["./hello"], 0x7ffc8b0b3570 /* 51 vars */) = 0
write(1, "Hello world!\n", 13)          = 13
exit(0)                                 = ?
+++ exited with 0 +++

Using the flags register

	cmp	eax, ebx
	jne	do_something
	; ...
do_something:
	; do something here

The flags register is also used in the x86 architecture to turn certain features or execution modes on and off. For example, to disable all maskable interrupts, you can use the cli instruction, which clears the interrupt flag (IF).

Using the instruction pointer register

	call	next_line
next_line:
	pop	eax

In 64-bit mode, instructions can reference data relative to the instruction pointer, so there is less need to copy the value of the instruction pointer to another register.

References

[1] «Intel 8008 (i8008) microprocessor family». www.cpu-world.com. Retrieved 2021-03-25.
[2] «Intel 8008». CPU MUSEUM — MUSEUM OF MICROPROCESSORS & DIE PHOTOGRAPHY. Retrieved 2021-03-25.
[3] «Intel 8008 OPCODES». www.pastraiser.com. Retrieved 2021-03-25.
[4] «Assembler language reference». www.ibm.com. Retrieved 2022-11-28.
[5] «x86 Assembly Language Reference Manual» (PDF).
[6] Narayam, Ram (2007-10-17). «Linux assemblers: A comparison of GAS and NASM». IBM. Archived from the original on October 3, 2013. Retrieved 2008-07-02.
[7] «The Creation of Unix». Archived from the original on April 2, 2014.
[8] Hyde, Randall. «Which Assembler is the Best?». Retrieved 2008-05-18.
[9] «GNU Assembler News, v2.1 supports Intel syntax». 2008-04-04. Retrieved 2008-07-02.
[10] «i386-Bugs (Using as)». Binutils documentation. Retrieved 15 January 2020.
[11] «Intel 8080 Assembly Language Programming Manual» (PDF). Retrieved 12 May 2023.
[12] Mueller, Scott (March 24, 2006). «P2 (286) Second-Generation Processors». Upgrading and Repairing PCs, 17th Edition (Book) (17 ed.). Que. ISBN 0-7897-3404-4. Retrieved 2017-12-06.
[13] Curtis Meadow. «Encoding of 8086 Instructions».
[14] Igor Kholodov. «6. Encoding x86 Instruction Operands, MOD-REG-R/M Byte».
[15] «Encoding x86 Instructions».
[16] Michael Abrash. «Zen of Assembly Language: Volume I, Knowledge». «Chapter 7: Memory Addressing». Section «mod-reg-rm Addressing».
[17] Intel 80386 Reference Programmer’s Manual. «17.2.1 ModR/M and SIB Bytes».
[18] «X86-64 Instruction Encoding: ModR/M and SIB bytes».
[19] «Figure 2-1. Intel 64 and IA-32 Architectures Instruction Format».
[20] «x86 Addressing Under the Hood».
[21] Stephen McCamant. «Manual and Automated Binary Reverse Engineering».
[22] «X86 Instruction Wishlist».
[23] Peter Cordes (18 December 2011). «NASM (Intel) versus AT&T Syntax: what are the advantages?». Stack Overflow.
[24] «I just started Assembly». daniweb.com. 2008.

Contents

Tools and testing
Basic execution environment
Basic arithmetic instructions
Flags register and comparisons
Working with memory
Jumps, labels, and machine code
The stack
Calling convention
Repeatable string instructions
Floating point and SIMD
Virtual memory
64-bit mode
Comparison with other architectures
Summary
Further reading

1. Tools and testing

While reading, it is also useful to write and test your own assembly programs. This is easiest on Linux (and less convenient on Windows). Here is a sample assembly function:

.globl myfunc
myfunc:
    retl

Save it in a file named my-asm.s and assemble it with the command gcc -m32 -c -o my-asm.o my-asm.s. For now there is no way to run this code, because that would require either interfacing with a C program or writing boilerplate code that interacts with the OS to handle startup, output, termination and so on. At the very least, being able to assemble the code gives us a way to check that our assembly programs are syntactically correct.

Keep in mind that this tutorial uses AT&T syntax rather than Intel syntax. The two differ only in notation; the underlying concepts are the same. A program can always be translated mechanically from one syntax to the other, so there is little to worry about.

2. Basic execution environment

An x86 CPU has eight 32-bit general-purpose registers. For historical reasons they are named {eax, ecx, edx, ebx, esp, ebp, esi, edi}. (Other CPU architectures simply call them r0, r1, ..., r7.) Each of them can hold any 32-bit integer value. The x86 architecture actually has more than a hundred registers, but we will cover only the ones we need.

Broadly speaking, the CPU executes instructions one after another, in the order given in the source code. Later we will see how code can leave this linear path, when we cover concepts such as if-then, loops and function calls.

There are in fact eight 16-bit and eight 8-bit registers that are parts of the eight 32-bit general-purpose registers. They date from the 16-bit era of x86 processors but are still used occasionally in 32-bit mode. The 16-bit registers are named {ax, cx, dx, bx, sp, bp, si, di} and represent the low 16 bits of the corresponding 32-bit registers {eax, ecx, ..., edi} (the prefix e stands for «extended»). The 8-bit registers are named {al, cl, dl, bl, ah, ch, dh, bh} and represent the low and high 8 bits of the registers {ax, cx, dx, bx}. When the value of a 16-bit or 8-bit register is modified, the remaining bits of the full 32-bit register are left unchanged.
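
The way the narrow registers overlap the full register can be modelled with a few bit masks. The following is a minimal C sketch, not actual CPU code; the helper names write_ax, write_al and write_ah are illustrative assumptions.

```c
#include <stdint.h>

/* Writing ax replaces bits 15..0 of eax; bits 31..16 are preserved. */
uint32_t write_ax(uint32_t eax, uint16_t value)
{
    return (eax & 0xFFFF0000u) | value;
}

/* Writing al replaces bits 7..0 of eax; bits 31..8 are preserved. */
uint32_t write_al(uint32_t eax, uint8_t value)
{
    return (eax & 0xFFFFFF00u) | value;
}

/* Writing ah replaces bits 15..8 of eax; all other bits are preserved. */
uint32_t write_ah(uint32_t eax, uint8_t value)
{
    return (eax & 0xFFFF00FFu) | ((uint32_t)value << 8);
}
```

Starting from eax = 0x12345678, writing 0xABCD to ax gives 0x1234ABCD, and writing 0xEF to ah gives 0x1234EF78.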

3. Basic arithmetic instructions

The basic x86 arithmetic instructions operate on 32-bit registers. The first operand acts as a source, and the second acts as both a source and the destination. For example, addl %ecx, %eax means, in C notation, eax = eax + ecx;, where eax and ecx have the type uint32_t.

Many instructions follow this important pattern, for example:

xorl %esi, %ebp means ebp = ebp ^ esi;.
subl %edx, %ebx means ebx = ebx - edx;.
andl %esp, %eax means eax = eax & esp;.

Some instructions take only one register as an argument, for example:

notl %eax means eax = ~eax;.
incl %ecx means ecx = ecx + 1;.

Shift and rotate instructions take a 32-bit register holding the value to be shifted and the fixed 8-bit register cl, which gives the shift count.

For example, shll %cl, %ebx means ebx = ebx << cl;.
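
One detail the C notation hides: for 32-bit operands the processor uses only the low 5 bits of cl, so the effective shift count is always in the range 0 to 31. A minimal C model of this behaviour is sketched below; the function name is an illustrative assumption.

```c
#include <stdint.h>

/* Models shll %cl, %ebx for a 32-bit operand: the count is masked to
   5 bits before shifting. The mask also keeps the C code well defined,
   since shifting a 32-bit value by 32 or more is undefined in C. */
uint32_t shll_cl(uint32_t ebx, uint8_t cl)
{
    return ebx << (cl & 31u);
}
```

Under this model a count of 32 behaves like a count of 0, and a count of 33 behaves like a count of 1.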

Many arithmetic instructions can take an immediate value as the first operand. An immediate value is fixed (not variable) and is encoded into the instruction itself.

Immediate values are prefixed with $. Some examples:

movl $0xFF, %esi means esi = 0xFF;.
addl $-2, %edi means edi = edi + (-2);.
shrl $3, %edx means edx = edx >> 3;.

Note that the movl instruction copies the value from the first argument to the second (it does not strictly «move» anything, but that is the customary name). With registers, as in movl %eax, %ebx, this means copying the value of register eax into ebx (which overwrites the previous value of ebx).

This is a good moment to discuss a principle of assembly programming: not every desired operation can be expressed directly in one instruction. In typical programming languages many constructs are composable and adapt to different situations, and arithmetic can be nested. In assembly language, however, you can only write what the instruction set allows. Some examples:

You cannot add two immediate constants, even though C allows it. In assembly you either compute the value at assembly time or express it as a sequence of instructions.

One instruction can add the contents of two 32-bit registers, but it cannot add three; that operation has to be split into two instructions.

You cannot add the contents of a 16-bit register to a 32-bit register. You need one instruction to widen the value from 16 to 32 bits and another to perform the addition.

When performing a bit shift, the shift count must be either a hard-coded immediate value or the register cl. If the count is held in some other register, the value must first be copied to cl.

The takeaway is that you should not try to guess or invent syntax that does not exist (such as addl %eax, %ebx, %ecx). Likewise, if you cannot find the instruction you need in the long list of supported ones, you have to implement it by hand as a sequence of existing instructions (possibly allocating registers to hold intermediate values).
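
Two of the cases above can be sketched in C, with each statement standing for one instruction. The function names are illustrative assumptions, and the register choices in the comments are only one possible assignment.

```c
#include <stdint.h>

/* eax = eax + ebx + ecx cannot be a single instruction,
   so it is split into two two-operand additions. */
uint32_t add_three(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    eax += ebx;            /* addl %ebx, %eax */
    eax += ecx;            /* addl %ecx, %eax */
    return eax;
}

/* Adding a 16-bit register to a 32-bit one: first widen the value
   into a scratch 32-bit register, then add. */
uint32_t add_word(uint32_t eax, uint16_t bx)
{
    uint32_t edx = bx;     /* movzwl %bx, %edx  (zero-extend 16 -> 32) */
    eax += edx;            /* addl %edx, %eax */
    return eax;
}
```

Note that the second function needs a spare register (edx here) for the intermediate value, which is exactly the kind of bookkeeping the paragraph above describes.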

4. Flags register and comparisons

Among others, there is a 32-bit register named eflags, which is implicitly read or written by many instructions. In other words, its value plays a role in executing an instruction, but the register itself is not mentioned in the assembly code.

Arithmetic instructions such as addl usually update eflags based on the computed result. The instruction sets or clears flags such as carry (CF), overflow (OF), sign (SF), parity (PF), zero (ZF) and so on. Some instructions read these flags: for example, adcl adds two numbers and uses the carry flag as a third operand, so adcl %ebx, %eax means eax = eax + ebx + cf;. Some instructions set a register based on a flag: for example, setz %al sets the 8-bit register al to 0 if ZF is clear or to 1 if ZF is set. Some instructions directly affect a single flag bit, such as cld, which clears the direction flag (DF).
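
The addl/adcl pair is what makes arithmetic on numbers wider than a register possible. The following C sketch models a 64-bit addition built from two 32-bit halves; the function names and the explicit cf variable are illustrative assumptions standing in for the instructions and the carry flag.

```c
#include <stdint.h>

/* Models addl: returns a + b and stores the carry-out in *cf. */
uint32_t add32(uint32_t a, uint32_t b, unsigned *cf)
{
    uint32_t sum = a + b;
    *cf = sum < a;                 /* the sum wrapped around => carry */
    return sum;
}

/* Models adcl: returns a + b + carry-in and updates *cf. */
uint32_t adc32(uint32_t a, uint32_t b, unsigned *cf)
{
    uint64_t wide = (uint64_t)a + b + *cf;
    *cf = (unsigned)(wide >> 32);  /* bit 32 is the carry-out */
    return (uint32_t)wide;
}

/* 64-bit addition from 32-bit pieces: low halves with addl,
   high halves with adcl so the carry propagates. */
uint64_t add64_via_adc(uint32_t a_lo, uint32_t a_hi,
                       uint32_t b_lo, uint32_t b_hi)
{
    unsigned cf;
    uint32_t lo = add32(a_lo, b_lo, &cf);
    uint32_t hi = adc32(a_hi, b_hi, &cf);
    return ((uint64_t)hi << 32) | lo;
}
```

The same pattern extends to numbers of any length by chaining further adcl steps, one per 32-bit word.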

Comparison instructions affect eflags without changing any general-purpose register. For example, cmpl %eax, %ebx compares the values of the two registers by subtracting them in an unnamed temporary location and sets the flags according to the result, which lets you determine, in either unsigned or signed mode, whether eax < ebx, eax == ebx or eax > ebx. Similarly, testl %eax, %ebx computes eax & ebx in a temporary location and sets the flags accordingly. In most cases the instruction that follows a comparison is a conditional jump (covered later).
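
How one subtraction can answer both the signed and the unsigned question is easier to see in a small model. The C sketch below computes four flags for a comparison; the type and function names are illustrative assumptions. In AT&T operand order, cmpl %eax, %ebx corresponds to cmp32(ebx, eax), because the difference computed is ebx - eax.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool zf, cf, sf, of; } flags_t;

/* Models cmpl src, dst: computes dst - src, discards the difference
   and keeps only the flags. */
flags_t cmp32(uint32_t dst, uint32_t src)
{
    uint32_t diff = dst - src;
    flags_t f;
    f.zf = (diff == 0);                 /* operands are equal */
    f.cf = (dst < src);                 /* unsigned borrow */
    f.sf = (diff >> 31) != 0;           /* sign bit of the difference */
    /* Signed overflow: the operands have different signs and the sign
       of the difference differs from the sign of dst. */
    f.of = (((dst ^ src) & (dst ^ diff)) >> 31) != 0;
    return f;
}

/* Condition tested by jb: dst < src as unsigned numbers. */
bool below(flags_t f) { return f.cf; }

/* Condition tested by jl: dst < src as signed numbers. */
bool less(flags_t f) { return f.sf != f.of; }
```

For the bit pattern 0xFFFFFFFF compared with 1, below() is false (4294967295 is not below 1) while less() is true (-1 is less than 1), even though both are derived from the same flags.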

So far we know that some flag bits relate to arithmetic operations. Other flag bits concern how the processor behaves: whether hardware interrupts are accepted, virtual 8086 mode, and other system management matters that concern OS developers rather than application writers. For the most part, the eflags register can be ignored. The system flags can safely be ignored, and so can the arithmetic flags, except around comparisons and bigint arithmetic.

5. Working with memory

The CPU alone does not make a useful computer. Having only 8 data registers severely limits what can be computed, because so little information can be stored. To increase the CPU's computing potential we have RAM, a large system memory. In essence, RAM is a huge array of bytes: for example, 128 MiB of RAM is 134,217,728 bytes in which values can be stored.

When a value larger than one byte is stored, it is encoded in little-endian byte order. For example, if a 32-bit register contains the value 0xDEADBEEF and this register is stored to memory at address 10, then the byte 0xEF goes to RAM address 10, 0xBE to address 11, 0xAD to address 12, and 0xDE to address 13. The same rule applies when reading values from memory: bytes at lower memory addresses are loaded into the lower parts of the register.
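The byte layout described above can be sketched in a few lines of C. The ram array and the helper names store32_le and load32_le are hypothetical; the code only models the byte order and says nothing about how real hardware moves the bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stores a 32-bit value at ram[addr..addr+3], lowest byte first. */
void store32_le(uint8_t *ram, size_t addr, uint32_t value) {
    assert(ram != NULL);
    for (size_t i = 0; i < 4; i++)
        ram[addr + i] = (uint8_t)(value >> (8 * i));
}

/* Loads a 32-bit value from ram[addr..addr+3]; the byte at the
   lowest address becomes the lowest 8 bits of the result. */
uint32_t load32_le(const uint8_t *ram, size_t addr) {
    assert(ram != NULL);
    uint32_t value = 0;
    for (size_t i = 0; i < 4; i++)
        value |= (uint32_t)ram[addr + i] << (8 * i);
    return value;
}
```

Storing 0xDEADBEEF at address 10 with store32_le leaves 0xEF, 0xBE, 0xAD, 0xDE at addresses 10 through 13, matching the example in the text.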

Naturally, the CPU has instructions for reading and writing memory. Specifically, you can load or store one or more bytes at any memory address you choose. The simplest thing to do is to read or write a single byte:

movb (%ecx), %al means al = *ecx;. (reads the byte at memory address ecx into the 8-bit register al).
movb %bl, (%edx) means *edx = bl;. (writes the byte in bl to the byte at memory address edx).
(In typical C terms, the registers al and bl have type uint8_t, while ecx and edx are cast from uint32_t to uint8_t*.)

Next, many arithmetic instructions can take one memory operand (never two). For example:

addl (%ecx), %eax means eax = eax + (*ecx);. (reads 32 bits from memory).
addl %ebx, (%edx) means *edx = (*edx) + ebx;. (reads and writes 32 bits in memory).

Addressing modes

When writing code with loops, one register often holds the base address of an array while another holds the index currently being processed. Although the address of the element could be computed by hand, the x86 ISA offers a more elegant solution: memory addressing modes that let you add and multiply the contents of certain registers.

This is easier to show than to explain:

movb (%eax,%ecx), %bh means bh = *(eax + ecx);.
movb -10(%eax,%ecx,4), %bh means bh = *(eax + (ecx * 4) - 10);.

Here the address format is offset(base, index, scale), where offset is an integer constant (positive, negative, or zero), base and index are 32-bit registers (though some combinations are disallowed), and scale is one of {1,2,4,8}. For example, if an array holds 64-bit integers, we would use a scale of 8, since each element is 8 bytes long.
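The address arithmetic behind offset(base, index, scale) can be written out as a small C function. The name effective_address and its parameter names are assumptions for illustration; the computation wraps modulo 2^32, as 32-bit address arithmetic does.

```c
#include <assert.h>
#include <stdint.h>

/* Computes the address selected by disp(base, index, scale).
   scale must be 1, 2, 4 or 8; the sum wraps modulo 2^32. */
uint32_t effective_address(int32_t disp, uint32_t base,
                           uint32_t index, uint32_t scale) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint32_t)disp;
}
```

For instance, -10(%eax,%ecx,4) with eax = 100 and ecx = 3 selects address 100 + 3*4 - 10 = 102, and element 5 of an array of 64-bit integers based at 0x1000 sits at 0x1000 + 5*8 = 0x1028.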

Memory addressing modes are valid wherever a memory operand is allowed. So if you can write sbbl %eax, (%eax) and you need indexing, you can certainly write sbbl %eax, (%eax,%ecx,2). Also note that the computed address is a temporary value that is not stored in any register. That is a good thing, because computing the address explicitly would require allocating a register for it, and with only 8 general-purpose registers there is little room to spare when other variables need storing too.

There is one special instruction that uses memory addressing but never actually accesses memory. The leal (load effective address) instruction computes the final memory address according to the addressing mode and stores the result in a register. For example, leal 5(%eax,%ebx,8), %ecx means ecx = eax + ebx*8 + 5;. Note that this is purely an arithmetic operation and involves no dereferencing of a memory address.

6. Jumps, labels, and machine code

Each assembly instruction can be preceded by any number of labels. These labels come in handy when you need to jump to a particular instruction. Here are a few examples:

foo:  /* A label */
negl %eax  /* One label */

addl %eax, %eax  /* No labels */

bar: qux: sbbl %eax, %eax  /* Two labels */

The jmp instruction tells the CPU to continue execution at the labeled instruction instead of the next one below, which is the default. Here is a simple infinite loop:

top: incl %ecx
jmp top

Although jmp is unconditional, it has sibling instructions that look at the state of eflags and either jump to the label (if the condition holds) or fall through to the next instruction below. Conditional jump instructions include ja (jump if above), jle (jump if less than or equal), jo (jump if overflow), jnz (jump if not zero), and so on.

There are 16 of them in all, and some have synonyms: for example, jz (jump if zero) is the same as je (jump if equal), and ja (jump if above) is the same as jnbe (jump if not below or equal).

Here is an example of a conditional jump:

jc skip  /* If the carry flag is set, jump */
/* Otherwise, execute this */
notl %eax
/* Implicitly falls through to the next instruction */
skip:
adcl %eax, %eax
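For readers who think more easily in C, here is a rough equivalent of the snippet above. The cpu_state struct and the function run_snippet are hypothetical names; the model tracks only eax and CF, and relies on the fact that notl leaves the flags untouched.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the state the snippet touches. */
typedef struct { uint32_t eax; unsigned cf; } cpu_state;

void run_snippet(cpu_state *s) {
    assert(s->cf <= 1u);
    if (!s->cf) {              /* jc skip: the jump is taken when CF is set */
        s->eax = ~s->eax;      /* notl %eax (does not modify flags) */
    }
    /* skip: adcl %eax, %eax, i.e. eax = eax + eax + CF */
    uint64_t wide = (uint64_t)s->eax + s->eax + s->cf;
    s->eax = (uint32_t)wide;
    s->cf = (unsigned)(wide >> 32);
}
```

The conditional jump becomes an if whose condition is inverted: the body of the if is exactly the code that the jump would skip.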

Label addresses are fixed in the code when it is assembled, but it is also possible to jump to an arbitrary memory address computed at run time. In particular, you can jump to the value of a register: jmp *%ecx essentially means "copy the value of ecx into eip".

Now is a good time to discuss a point about instructions and execution that I touched on back in the first section. Each assembly instruction is ultimately translated into 1 to 15 bytes of machine code, and these machine instructions are then laid out together to form an executable file. The CPU has a 32-bit register named eip (extended instruction pointer) which, during program execution, holds the memory address of the instruction currently being processed. Note that there are very few ways to read or write eip, so it behaves quite differently from the 8 main general-purpose registers. Each time an instruction executes, the CPU knows its length in bytes and advances eip by that amount so that it points to the next instruction.

While we are on the subject of machine code, it is worth adding that assembly language is not actually the lowest level a programmer can reach. The true foundation is raw binary machine code. (Intel insiders have access to even lower levels, such as pipeline debugging and microcode, but ordinary programmers cannot get there.) Writing machine code by hand is no easy task (writing assembly is hard enough already), but it does unlock a couple of useful capabilities. When writing machine code you can encode some instructions in alternative ways (for example, using a longer byte sequence that has the same effect when executed), and you can deliberately generate invalid instructions to test CPU behavior (not all CPUs handle errors the same way).

7. The stack

The stack is a region of memory addressed by the esp register. The x86 ISA has a number of instructions for manipulating the stack. Although all of this functionality could be implemented with movl, addl, and so on, using other registers, the stack instructions are more idiomatic and concise.

On x86 the stack grows downward, from higher memory addresses toward lower ones. For example, pushing a 32-bit value onto the stack means decrementing esp by 4 and then storing the 4-byte value in memory starting at address esp. Popping a value does the reverse: it loads the 4 bytes starting at address esp (either into a given register or discarding them) and then increments esp by 4.
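The push and pop steps just described can be modeled in C as follows. The machine struct, RAM_SIZE, and the C functions pushl and popl are hypothetical stand-ins for real memory and the real instructions; values are stored in little-endian order, as on x86.

```c
#include <assert.h>
#include <stdint.h>

#define RAM_SIZE 64u

/* Hypothetical toy machine: a small RAM plus the stack pointer. */
typedef struct {
    uint8_t  ram[RAM_SIZE];
    uint32_t esp;            /* starts at RAM_SIZE (empty stack) */
} machine;

/* Models pushl: decrement esp by 4, then store 4 bytes at esp. */
void pushl(machine *m, uint32_t value) {
    assert(m->esp >= 4u && m->esp <= RAM_SIZE);
    m->esp -= 4;
    for (uint32_t i = 0; i < 4; i++)
        m->ram[m->esp + i] = (uint8_t)(value >> (8 * i));
}

/* Models popl: load 4 bytes from esp, then increment esp by 4. */
uint32_t popl(machine *m) {
    assert(m->esp + 4u <= RAM_SIZE);
    uint32_t value = 0;
    for (uint32_t i = 0; i < 4; i++)
        value |= (uint32_t)m->ram[m->esp + i] << (8 * i);
    m->esp += 4;
    return value;
}
```

Values come back in last-in, first-out order, and after matching pushes and pops esp returns to where it started.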

The stack is very important for function calls. The call instruction is like jmp, except that before jumping it first pushes the address of the next instruction onto the stack. This makes it possible to return later by executing the retl instruction, which pops that address into eip. In addition, the standard C calling convention places some or all function arguments on the stack.

Note that stack memory can be used to read/write the eflags register and to read the eip register. Accessing these two registers is awkward because they cannot be used in a typical movl or arithmetic instruction.

8. Calling convention

When we compile C code, it is translated into assembly code and subsequently into machine code. The calling convention defines how C functions receive arguments and return values, by placing values on the stack and/or in registers. The convention applies to a C function calling another C function, to a piece of assembly code calling a C function, and to a C function calling an assembly function. (It does not apply to a piece of assembly code calling an arbitrary piece of assembly code; in that case there are no restrictions.)

On 32-bit x86 under Linux, the calling convention is named cdecl. The caller pushes the arguments onto the stack from right to left, calls the target function, receives the return value in eax, and pops the arguments off the stack.

For example:

int print(const char *str, int foo);  /* Declared before its first use */

int main(int argc, char **argv) {
  print("Hello", argc);
  /*
  The call to print() above turns into this assembly code:
  
  pushl %registerContainingArgc
  pushl $ADDRESS_OF_HELLO_STRING_CONSTANT
  call print
  // The result is now in %eax
  popl %ecx  // Pop the str argument
  popl %ecx  // Pop the foo argument
  */
}

int print(const char *str, int foo) {
  ....
  /*
  In assembly, these 32-bit values exist on the stack:
    0(%esp) holds the address of the caller's next instruction.
    4(%esp) holds the value of the str argument (a char pointer).
    8(%esp) holds the value of the foo argument (a signed integer).
  Before the function executes retl, it needs to put some number into %eax as the return value.
  */
}
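To see the stack offsets from the callee's point of view, here is a small C simulation of a cdecl call. Everything in it is a hypothetical model: the cpu struct, push32, pop32, stack_at, the callee add2 and the placeholder return address 0xC0DE are invented for this sketch and do not correspond to any real API.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_BYTES = 64 };

/* Hypothetical toy CPU: stack memory, stack pointer, return register. */
typedef struct {
    uint32_t mem[STACK_BYTES / 4];  /* one slot per 32-bit word */
    uint32_t esp;                   /* byte offset of the top of the stack */
    uint32_t eax;
} cpu;

void push32(cpu *c, uint32_t v) {
    assert(c->esp >= 4u);
    c->esp -= 4;
    c->mem[c->esp / 4] = v;
}

uint32_t pop32(cpu *c) {
    assert(c->esp + 4u <= (uint32_t)STACK_BYTES);
    uint32_t v = c->mem[c->esp / 4];
    c->esp += 4;
    return v;
}

/* Reads the 32-bit value at disp(%esp). */
uint32_t stack_at(const cpu *c, uint32_t disp) {
    return c->mem[(c->esp + disp) / 4];
}

/* Callee: behaves like int add2(int a, int b), reading the raw stack. */
void add2(cpu *c) {
    uint32_t a = stack_at(c, 4);  /* 4(%esp): first argument */
    uint32_t b = stack_at(c, 8);  /* 8(%esp): second argument */
    c->eax = a + b;               /* the return value goes in eax */
}

/* Caller: pushes right to left, calls, then removes the arguments. */
uint32_t call_add2(cpu *c, uint32_t a, uint32_t b) {
    push32(c, b);
    push32(c, a);
    push32(c, 0xC0DEu);  /* stand-in for the return address pushed by call */
    add2(c);
    (void)pop32(c);      /* retl pops the return address */
    (void)pop32(c);      /* caller pops argument a */
    (void)pop32(c);      /* caller pops argument b */
    return c->eax;
}
```

Because the arguments are pushed right to left, the first argument always ends up closest to the return address, at 4(%esp), no matter how many arguments follow it.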

9. Repeatable string instructions

There are a number of instructions that make it easier to process long sequences of bytes/words, informally grouped as "string" instructions. Each such instruction uses the registers esi and edi as memory addresses and automatically increments/decrements them after the instruction executes. For example, movsb %esi, %edi means *edi = *esi; esi++; edi++; (copies one byte). (Strictly speaking, esi and edi are incremented if DF is 0; if DF is 1, they are decremented instead.) Other examples of string instructions include cmpsb, scasb, and stosb.

A string instruction can be modified with the rep prefix (and likewise repe and repne) so that it executes ecx times (with ecx decremented automatically). For example, rep movsb %esi, %edi means:

while (ecx > 0) {
  *edi = *esi;
  esi++;
  edi++;
  ecx--;
}
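The loop above assumes DF is 0. A slightly fuller C sketch that also honors the direction flag might look like this; the string_regs struct and the function rep_movsb are hypothetical names, and ram stands in for real memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the registers rep movsb uses. */
typedef struct { uint32_t esi, edi, ecx; bool df; } string_regs;

/* Models rep movsb: copy ecx bytes from [esi] to [edi], moving both
   pointers up when DF is 0 and down when DF is 1. */
void rep_movsb(uint8_t *ram, size_t ram_size, string_regs *r) {
    while (r->ecx > 0) {
        assert(r->esi < ram_size && r->edi < ram_size);
        ram[r->edi] = ram[r->esi];
        if (r->df) { r->esi--; r->edi--; }
        else       { r->esi++; r->edi++; }
        r->ecx--;
    }
}
```

Copying backward (DF set, pointers starting at the last byte) is what makes it safe to shift an overlapping block toward higher addresses, much like memmove does.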

These string instructions and the rep prefixes bring some iterated compound operations into assembly language. They reflect part of the CISC design mindset, in which it was considered normal for programmers to write code directly in assembly, so the ISA provided higher-level features to make that work easier. (The modern approach, however, is to write in C or an even higher-level language and leave generating the tedious assembly code to a compiler.)

10. Floating point and SIMD

The x87 math coprocessor has eight 80-bit floating-point registers (although all x87 functionality is now built into the main x86 CPU), and the x86 CPU also has eight 128-bit xmm registers for SSE instructions. I have little experience with FP/x87, so you should consult other guides on that topic. The x87 FP stack works in a rather odd way, and these days it is easier to do floating-point arithmetic with the xmm registers and SSE/SSE2 instructions instead.

An xmm register can be interpreted in different ways depending on the instruction being executed: as sixteen byte values, as eight 16-bit words, or as four 32-bit doublewords or single-precision floating-point numbers. For example, one SSE instruction copies 16 bytes (128 bits) from memory into an xmm register, while another SSE instruction adds the contents of two xmm registers, treating each as eight parallel 16-bit words. The idea behind SIMD is to execute one instruction that processes a lot of data at once, which is faster than processing each value individually, because fetching and executing each instruction carries a certain overhead.

Note that every SSE/SIMD operation can be emulated, more slowly, using basic scalar operations (for example, the 32-bit arithmetic covered in section 3). A careful programmer can prototype a program using scalar operations, verify its correctness, and then gradually convert it to use the faster SSE instructions while making sure the results stay the same.
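As an example of such a scalar prototype, the sketch below emulates adding two xmm registers as eight parallel 16-bit words, with each lane wrapping independently (the behavior of the SSE2 paddw instruction). The xmm_words type and the function paddw_scalar are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of a 128-bit xmm register as eight 16-bit lanes. */
typedef struct { uint16_t lane[8]; } xmm_words;

/* Scalar emulation of a packed 16-bit add: each lane is added
   independently and wraps modulo 2^16; no carry crosses lanes. */
xmm_words paddw_scalar(xmm_words a, xmm_words b) {
    assert(sizeof(xmm_words) == 16);
    xmm_words r;
    for (int i = 0; i < 8; i++)
        r.lane[i] = (uint16_t)(a.lane[i] + b.lane[i]);
    return r;
}
```

A real SIMD version would produce the same eight results with a single instruction, which is what makes a scalar reference like this useful for checking correctness.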

11. Virtual memory

Up to this point we have assumed that when an instruction asks to read from or write to a memory address, that is the address handled by the RAM. But if we add a translation layer in between, we can do interesting things. This concept is known as virtual memory, paging, and by other names.

The basic idea is that there is a page table describing what each page (block) of 4096 bytes of the 32-bit virtual address space is mapped to. For example, if a page is mapped to nothing, then trying to read or write a memory address in that page triggers an interrupt/exception/trap. Or, for instance, the same virtual address 0x08000000 can be mapped to a different page of physical RAM in each running application process. Moreover, each process can have its own set of pages and never see the contents of other processes or of the operating system kernel. Paging is mostly a concern for OS writers, but its behavior sometimes affects application developers too, so they should be aware of it.
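The lookup can be sketched in C with a deliberately simplified, flat page table. This is only a toy model under stated assumptions: the pte struct, PAGE_BYTES and the translate function are hypothetical, and real x86 page tables are multi-level structures with many more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PAGE_BYTES = 4096 };

/* Hypothetical page table entry: is the page mapped, and to which
   physical page frame? */
typedef struct { bool present; uint32_t frame; } pte;

/* Translates a virtual address using a flat table of num_pages entries.
   Returns true and writes *phys on success; returning false models
   the interrupt/exception/trap raised for an unmapped page. */
bool translate(const pte *table, uint32_t num_pages,
               uint32_t vaddr, uint32_t *phys) {
    assert(table != NULL && phys != NULL);
    uint32_t page   = vaddr / PAGE_BYTES;  /* which 4096-byte page */
    uint32_t offset = vaddr % PAGE_BYTES;  /* position inside the page */
    if (page >= num_pages || !table[page].present)
        return false;
    *phys = table[page].frame * PAGE_BYTES + offset;
    return true;
}
```

Giving each process its own table is what lets the same virtual address refer to different physical memory in different processes.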

Note that the address mapping does not have to be 32 bits to 32 bits. For example, a 32-bit virtual address space can be mapped onto a 36-bit physical memory space (PAE). Or a 64-bit virtual address space can be mapped onto a 32-bit physical memory space on a computer with only 1 GiB of RAM.

12. 64-bit mode

Here I will say only a little about x86-64 mode and roughly outline what changes it brought. If you wish, you can find plenty of articles and reference material online that explain all the differences in detail.

Most obviously, the 8 general-purpose registers were extended to 64 bits. The new registers are named {rax, rcx, rdx, rbx, rsp, rbp, rsi, rdi}, and the old 32-bit {eax, ..., edi} now occupy the low 32 bits of those 64-bit registers.

There are also eight new 64-bit registers {r8, r9, r10, r11, r12, r13, r14, r15}, bringing the total number of general-purpose registers to 16. This greatly eases register pressure when working with many variables. The new registers have subregisters too: for example, the 64-bit r9 contains the 32-bit r9d, the 16-bit r9w, and the 8-bit r9l. In addition, the low byte of {rsp, rbp, rsi, rdi} is now addressable as {spl, bpl, sil, dil}.

Arithmetic instructions can operate on 8-, 16-, 32-, or 64-bit registers. When operating on 32-bit registers the upper 32 bits are cleared to zero, but at narrower operand widths all the high bits are left unchanged. Many niche instructions were removed from the 64-bit instruction set, such as those related to BCD, most instructions involving the 16-bit segment registers, and those that push/pop 32-bit values on the stack.
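The asymmetry between 32-bit and narrower writes is easy to get wrong, so here is a short C model of what happens to a 64-bit register when one of its subregisters is written. The function names write32, write16 and write8 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Writing a 32-bit subregister (e.g. eax) clears the upper 32 bits. */
uint64_t write32(uint64_t reg, uint32_t v) {
    (void)reg;                       /* old contents are discarded entirely */
    uint64_t result = (uint64_t)v;
    assert((result >> 32) == 0);
    return result;
}

/* Writing a 16-bit subregister (e.g. ax) keeps the upper 48 bits. */
uint64_t write16(uint64_t reg, uint16_t v) {
    return (reg & ~0xFFFFull) | v;
}

/* Writing the low 8-bit subregister (e.g. al) keeps the upper 56 bits. */
uint64_t write8(uint64_t reg, uint8_t v) {
    return (reg & ~0xFFull) | v;
}
```

So after a 32-bit write of 1 into a register holding all ones, the full 64-bit register reads 1, whereas after a 16-bit write it reads 0xFFFFFFFFFFFF0001.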

Not many of the differences between x86-64 and the old x86-32 concern application developers specifically. Generally speaking, life got easier thanks to the larger number of registers and the removal of unneeded features. All memory pointers must be 64 bits wide (this takes some getting used to), whereas data values can be 32, 64, or 8 bits wide and so on, depending on the situation (data is not required to be 64 bits).

The calling convention in 64-bit mode makes retrieving function arguments in assembly code much easier, because the first ~6 arguments are passed in registers rather than on the stack. Otherwise, things work the same way as before. (For systems programmers, though, the x86-64 architecture introduces new modes, new features, new problems, and new cases to handle.)

13. Comparison with other architectures

RISC CPU architectures work differently from x86 in some respects. Only explicit load/store instructions touch memory; ordinary arithmetic instructions do not. Instructions have a fixed length, namely 2 or 4 bytes each. Memory operations usually must be aligned: for example, loading a 4-byte word requires a memory address that is a multiple of 4.

By comparison, in the x86 ISA memory operations are embedded in arithmetic instructions, instructions are encoded as variable-length byte sequences, and unaligned memory access is almost always allowed. Furthermore, while x86 has a full set of 8-, 16-, and 32-bit arithmetic operations for backward compatibility, RISC architectures are usually purely 32-bit. To work with shorter values they load a byte or word from memory, extend its value to a full 32-bit register, perform the arithmetic in 32 bits, and finally store the low 8 or 16 bits back to memory. Popular RISC ISAs include ARM, MIPS, and RISC-V.
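The RISC-style sequence for narrow arithmetic (load, extend, compute in 32 bits, store the low bits) can be spelled out in C roughly like this. The function name add_bytes_risc_style and the variables r1 to r3 are hypothetical, and zero extension is assumed for the load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adds the bytes at mem[a] and mem[b] and stores the 8-bit result
   at mem[out], using only 32-bit arithmetic in between. */
void add_bytes_risc_style(uint8_t *mem, size_t a, size_t b, size_t out) {
    assert(mem != NULL);
    uint32_t r1 = mem[a];    /* load byte, zero-extended to 32 bits */
    uint32_t r2 = mem[b];    /* load byte, zero-extended to 32 bits */
    uint32_t r3 = r1 + r2;   /* full 32-bit addition */
    mem[out] = (uint8_t)r3;  /* store only the low 8 bits */
}
```

On x86 the same effect could be had with 8-bit instructions operating directly on byte registers or memory, which is exactly the backward-compatibility baggage the paragraph describes.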

VLIW architectures let you explicitly execute several sub-instructions in parallel. For example, you can write add a, b; sub c, d on one line because the CPU has two independent arithmetic units working at the same time. x86 CPUs can also execute multiple instructions in parallel (superscalar processing), but in that case the parallelism is not spelled out explicitly: the CPU internally analyzes the parallelism in the instruction stream and dispatches eligible instructions to several execution units.

14. Summary

We began our look at the x86 CPU architecture by treating it as a simple machine that contains a handful of registers and follows a list of instructions in sequence. We got to know the basic arithmetic operations that can be performed on those registers. Next we learned about jumping to different places in the code, about comparisons, and about conditional jumps. After that we examined RAM as a huge addressable data store and saw how x86 addressing modes can be used to compute addresses concisely. Finally, we briefly covered how the stack works, the calling convention, advanced instructions, virtual memory address translation, and the differences in x86-64 mode.

I hope this guide was enough to orient you in the general workings of the x86 architecture. There are a great many details I could not fit into this introductory article: writing a simple function in full, debugging common mistakes, using SSE/AVX effectively, working with segmentation, system data structures such as page tables and interrupt descriptors, and much more. Nevertheless, you now have a solid mental model of how an x86 CPU works, and you can move on to more advanced tutorials, try writing code while understanding what happens underneath, and even venture to browse Intel's extremely detailed CPU manuals.

15. Further reading

University of Virginia CS216: x86 Assembly Guide
Wikipedia: x86 instruction listings
Intel® 64 and IA-32 Architectures Software Developer Manuals
Carnegie Mellon University: Introduction to Computer Systems: Machine-Level Programming I: Basics
Carnegie Mellon University: Introduction to Computer Systems: Machine-Level Programming II: Control

