A Two-Stage x86 Bootloader That Draws a Penguin
I wanted to understand what happens in the first seconds after a regular x86 PC powers on, before any operating system or driver loads, and to get a grasp of how bootloaders work. I remembered that 30 years ago "Energy Star" logos and OEM splash screens appeared at boot time, drawn directly to the screen by firmware. That implied a framebuffer was accessible with no OS in between.
With some LLM assistance and a lot of trial and error, I built a two-stage bootloader in x86 real mode assembly that loads a 256-color bitmap from disk and fades it in and out, running on actual hardware, without an operating system or drivers.
The First 512 Bytes
When you turn on an x86 PC, the CPU doesn't jump straight into an operating system. It starts executing code stored in firmware ROM (the BIOS), which runs POST (Power-On Self Test), initializes hardware, and then looks for something bootable. It checks the configured boot devices in a user-defined order. For each candidate it reads the very first sector, exactly 512 bytes, into memory at a fixed address: 0x7C00. If those 512 bytes end with the signature bytes 0x55 0xAA, the BIOS considers the device bootable and hands control over by jumping to 0x7C00.
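The signature check is easy to reproduce on the host side before writing an image to a USB stick. This is a minimal Python sketch, not something the BIOS or the bootloader runs; the function name `is_bootable_sector` is made up for illustration.

```python
def is_bootable_sector(sector: bytes) -> bool:
    """Return True if a 512-byte sector ends with the 0x55 0xAA boot signature."""
    return len(sector) == 512 and sector[510] == 0x55 and sector[511] == 0xAA

# A fake boot sector: 510 zero bytes followed by the signature.
fake_sector = bytes(510) + bytes([0x55, 0xAA])
print(is_bootable_sector(fake_sector))  # True
print(is_bootable_sector(bytes(512)))   # False
```

On disk the order is 0x55 then 0xAA, which is why the assembly source writes the little-endian word 0xAA55 at the end of the sector.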
The entire budget for stage 1 is those 512 bytes: code loaded at a known address, with the CPU in real mode.
About real mode: it is the operating mode every x86 CPU starts in, for historical reasons stretching back to the 8086 in 1978. It uses 16-bit registers and a segmented memory model where addresses are formed by combining a segment register and an offset: physical address = segment × 16 + offset. The practical ceiling this creates is about 1MB of addressable memory. On the IBM PC the top 384KB of that range was reserved for video memory and ROM, which left the famous 640K barrier that defined a generation of software constraints. There is no memory protection, no virtual memory, no privilege separation. Any code can read or write anything. It's a different world from the protected mode environment that modern operating systems run in.
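The segment arithmetic is worth trying with a few of the addresses that show up in this post. A small Python sketch of the formula (the helper name `physical_address` is mine):

```python
def physical_address(segment: int, offset: int) -> int:
    """Real-mode address translation: segment * 16 + offset."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)

# Different segment:offset pairs can name the same byte.
print(hex(physical_address(0x0000, 0x7C00)))  # 0x7c00
print(hex(physical_address(0x07C0, 0x0000)))  # 0x7c00
# The VGA framebuffer segment used later in stage 2.
print(hex(physical_address(0xA000, 0x0000)))  # 0xa0000
```

The largest value the formula can produce is 0xFFFF:0xFFFF = 0x10FFEF, slightly above 1MB; the original 8086 had only 20 address lines, so that overflow wrapped around to low memory.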
With only 512 bytes available, stage 1 has to be minimal. Ours does exactly four things: normalizes the segment registers to a known state, saves the boot drive number that the BIOS passed in the DL register, loads stage 2 from disk into memory at 0x8000, and jumps to it.
The disk read is where the first real complexity appears. The traditional method is CHS (Cylinder, Head, Sector), a coordinate system inherited from the physical geometry of spinning disks. You tell the BIOS which cylinder, which read head, and which sector on that track to read. This worked fine for decades, but it encodes assumptions about disk geometry that vary between device types. When we first tried to boot from a USB drive on the Athlon 64 test machine, stage 1 loaded correctly but stage 2 immediately threw a disk error. The BIOS was presenting the USB drive with hard disk geometry (63 sectors per track) while our code assumed floppy geometry (18 sectors per track). The BIOS error code 0x01 (invalid parameter) was the clue.
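To make the geometry mismatch concrete, here is a small Python sketch of the textbook LBA-to-CHS conversion. It is not part of the bootloader. The function name is mine, and the 16-head hard disk geometry is an assumed example, since the only figure we actually observed was 63 sectors per track.

```python
def lba_to_chs(lba: int, heads: int, sectors_per_track: int):
    """Convert a flat sector number to (cylinder, head, sector) for a given geometry."""
    cylinder = lba // (heads * sectors_per_track)
    head = (lba // sectors_per_track) % heads
    sector = lba % sectors_per_track + 1  # CHS sector numbers start at 1
    return cylinder, head, sector

# The same logical sector lands on different coordinates under each geometry.
print(lba_to_chs(20, heads=2, sectors_per_track=18))   # floppy: (0, 1, 3)
print(lba_to_chs(20, heads=16, sectors_per_track=63))  # assumed hard disk: (0, 0, 21)
```

Code that hardcodes one geometry asks for the wrong place, or for a coordinate that does not exist, as soon as the BIOS reports the other one.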
The fix was switching to LBA, Logical Block Addressing, which treats the disk as a flat array of numbered sectors with no geometry assumptions at all. Sector 0 is the MBR (stage 1), sectors 1 to 4 hold stage 2, and the bitmap starts at sector 5, immediately after stage 2. You ask for sector N and the BIOS figures out the physical location. Everything after that worked cleanly on both QEMU and real hardware.
; stage1.asm - LBA version
BITS 16
ORG 0x7C00
start:
xor ax, ax
mov ds, ax
mov es, ax
mov ss, ax
mov sp, 0x7C00
mov [boot_drive], dl
cld ; make sure string instructions such as lodsb count upward
; Load stage2 via LBA (LBA 1-4 → 0x0000:0x8000)
mov si, dap
mov ah, 0x42
mov dl, [boot_drive]
int 0x13
jc disk_error
jmp 0x0000:0x8000
disk_error:
mov si, err_msg
.p: lodsb
or al, al
jz .h
mov ah, 0x0E
xor bh, bh
int 0x10
jmp .p
.h: cli
hlt
jmp .h ; a non-maskable interrupt can wake the CPU, so halt again
boot_drive db 0
align 4
dap:
db 0x10 ; packet size
db 0x00 ; reserved
dw 4 ; read 4 sectors (stage2 = 2048 bytes)
dw 0x8000 ; destination offset
dw 0x0000 ; destination segment
dd 1 ; LBA = 1 (second sector, 0-based)
dd 0
err_msg db "Disk error!", 0
times 510-($-$$) db 0
dw 0xAA55
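The `dap` block in the listing above is just a 16-byte little-endian structure. As a cross-check, this Python sketch packs the same fields with the standard `struct` module; `build_dap` is a hypothetical helper name, and the script is for inspection on the host, not something the BIOS sees.

```python
import struct

def build_dap(sector_count: int, buf_offset: int, buf_segment: int, lba: int) -> bytes:
    """Pack an int 13h AH=42h Disk Address Packet (16-byte form)."""
    return struct.pack(
        "<BBHHHQ",
        0x10,          # packet size
        0x00,          # reserved
        sector_count,  # number of sectors to read
        buf_offset,    # destination offset
        buf_segment,   # destination segment
        lba,           # 64-bit starting LBA
    )

dap = build_dap(sector_count=4, buf_offset=0x8000, buf_segment=0x0000, lba=1)
print(len(dap))   # 16
print(dap.hex())  # 10000400008000000100000000000000
```

The two `dd` lines in the assembly are the low and high halves of the single 64-bit LBA field packed here as `Q`.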
Drawing Without an Operating System
Once stage 2 has control, the first order of business is getting access to graphics. On a modern system this would involve a kernel driver negotiating with the GPU over a complex hardware interface. In real mode, it's a single BIOS interrupt call:
mov ax, 0x0013
int 0x10
This switches the display into VGA mode 13h: 320×200 pixels, 256 colors, with pixels laid out linearly in memory starting at 0xA0000. One byte per pixel, row by row, left to right, top to bottom. To draw a pixel at coordinates (x, y) you write one byte to address 0xA0000 + y × 320 + x. The byte value is not a color directly; it's an index into a palette of 256 entries, where each entry defines an RGB color.
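The addressing rule can be modeled with a plain byte array standing in for the framebuffer. A minimal Python sketch, with names of my own choosing, that mirrors the arithmetic and does not touch real video memory:

```python
SCREEN_W, SCREEN_H = 320, 200
VGA_BASE = 0xA0000  # physical address of the mode 13h framebuffer

def pixel_offset(x: int, y: int) -> int:
    """Offset of pixel (x, y) from the start of the framebuffer."""
    if not (0 <= x < SCREEN_W and 0 <= y < SCREEN_H):
        raise ValueError("pixel outside the 320x200 screen")
    return y * SCREEN_W + x

framebuffer = bytearray(SCREEN_W * SCREEN_H)  # stand-in for video memory
framebuffer[pixel_offset(160, 100)] = 15      # palette index 15 at the screen center

print(pixel_offset(160, 100))                   # 32160
print(hex(VGA_BASE + pixel_offset(319, 199)))   # 0xaf9ff, the last pixel
```

The whole screen is 64000 bytes, which fits inside a single 64KB segment. That is why stage 2 can address every pixel with ES = 0xA000 and a 16-bit offset.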
This is why there's no driver needed: the VGA standard guarantees this memory layout and this BIOS interface. It has been consistent since IBM defined it in 1987. The hardware is just there, mapped into the address space, ready to be written to.
Loading the bitmap follows the same pattern. An 8-bit BMP file is mostly straightforward: a 54-byte header, a 256-entry color palette (1024 bytes, four bytes per color: blue, green, red, and one reserved byte), and then raw pixel data, one byte per pixel. Two quirks are worth knowing. BMP stores rows bottom-up, so the first row of pixel data in the file is the bottom row of the image. And the pixel data doesn't necessarily start immediately after the palette; its offset is stored as a four-byte value at byte 10 of the file and needs to be read explicitly.
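Both quirks are easy to check on the host before the file ever goes on disk. The sketch below parses the relevant header fields with Python's `struct` module. It assumes the common 40-byte info header, builds a synthetic header with this post's numbers instead of reading the real tux.bmp, and uses helper names of my own.

```python
import struct

def parse_bmp_header(data: bytes) -> dict:
    """Extract the fields the loader cares about (assumes a 40-byte info header)."""
    if data[:2] != b"BM":
        raise ValueError("not a BMP file")
    (pixel_offset,) = struct.unpack_from("<I", data, 10)   # where pixel data starts
    width, height = struct.unpack_from("<ii", data, 18)    # signed 32-bit values
    (bits_per_pixel,) = struct.unpack_from("<H", data, 28)
    row_bytes = ((width * bits_per_pixel + 31) // 32) * 4  # rows are padded to 4 bytes
    return {
        "pixel_offset": pixel_offset,
        "width": width,
        "height": abs(height),
        "bits_per_pixel": bits_per_pixel,
        "row_bytes": row_bytes,
        "bottom_up": height > 0,  # a positive height means bottom-up rows
    }

def file_row_offset(header: dict, screen_row: int) -> int:
    """Byte offset in the file of the row displayed at screen_row (0 = top)."""
    if header["bottom_up"]:
        row_in_file = header["height"] - 1 - screen_row
    else:
        row_in_file = screen_row
    return header["pixel_offset"] + row_in_file * header["row_bytes"]

# Synthetic header: 128x128, 8 bits per pixel, pixel data at byte 1078.
file_header = struct.pack("<2sIHHI", b"BM", 1078 + 128 * 128, 0, 0, 1078)
info_header = struct.pack("<IiiHHIIiiII", 40, 128, 128, 1, 8, 0, 128 * 128, 0, 0, 256, 0)
header = parse_bmp_header(file_header + info_header)

print(header["pixel_offset"], header["row_bytes"])  # 1078 128
print(file_row_offset(header, 0))                   # 17334: the top row is last in the file
```

The value 1078 is 54 header bytes plus 1024 palette bytes, which is where the pixel data starts when nothing else is stored in between.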
We load the entire BMP into memory at segment 0x0900 (physical address 0x9000) using the same LBA disk read mechanism as stage 1, then blit it to the framebuffer row by row, reversing the bottom-up order as we go.
The palette is where the color mapping happens. The VGA DAC (Digital-to-Analog Converter) has 256 registers, each holding an RGB triplet in 6-bit format (0–63 per channel). To program it, you write the starting index to port 0x3C8, then stream R, G, B values to port 0x3C9, one channel at a time, advancing automatically with each write. We read the palette directly from the BMP's own palette data, shift each 8-bit value right by 2 to convert to 6-bit, and write it straight to the DAC. After that, the palette indices in the pixel data map correctly to the colors the image was designed with.
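The conversion from BMP palette entries to DAC values is a reorder plus a shift. A Python sketch of that step (the function name is mine; it only models the values, since real port I/O isn't available from a script like this):

```python
def bmp_palette_to_dac(palette: bytes) -> list:
    """Convert BMP palette entries (B, G, R, reserved) to 6-bit (R, G, B) tuples."""
    entries = []
    for i in range(0, len(palette), 4):
        blue, green, red = palette[i], palette[i + 1], palette[i + 2]
        entries.append((red >> 2, green >> 2, blue >> 2))  # 8-bit to 6-bit
    return entries

# Two sample entries: the near-black background (R=10 G=9 B=10) and pure white.
sample = bytes([10, 9, 10, 0, 255, 255, 255, 0])
print(bmp_palette_to_dac(sample))  # [(2, 2, 2), (63, 63, 63)]
```

The values would be streamed to port 0x3C9 in the order red, green, blue for each entry, which is why the assembly reads `[si+2]` first.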
; stage2.asm
BITS 16
ORG 0x8000
BMP_LOAD_SEG equ 0x0900
BMP_SECTOR equ 5 ; first LBA of the bitmap: 1 boot sector + 4 stage-2 sectors
BMP_SECTOR_COUNT equ 35
BMP_PIXEL_OFFSET equ 1078
MAX_PER_READ equ 18
SCREEN_W equ 320
SCREEN_H equ 200
IMG_W equ 128
IMG_H equ 128
IMG_ROW_BYTES equ 128
ORIGIN_X equ (SCREEN_W - IMG_W) / 2
ORIGIN_Y equ (SCREEN_H - IMG_H) / 2
FADE_STEPS equ 64 ; brightness levels 0..63
VBLANKS_PER_STEP equ 3 ; ~3s total fade (64 * 3 / 70Hz ≈ 2.7s)
HOLD_VBLANKS equ 210 ; ~3s hold (210 / 70Hz = 3s)
start:
xor ax, ax
mov ds, ax
mov es, ax
mov ss, ax
mov sp, 0x7C00
mov [boot_drive], dl
cld ; rep stosb / rep movsb / lodsb below must count upward
mov ax, 0x0013
int 0x10
call load_bmp
call clear_screen
call blit_tux ; draw once, never touch framebuffer again
fade_loop:
; Fade in: brightness 0 → 63
mov byte [brightness], 0
.fi:
mov cl, VBLANKS_PER_STEP
.fi_vb: call wait_vblank
dec cl
jnz .fi_vb
movzx bx, byte [brightness]
call apply_palette
inc byte [brightness]
cmp byte [brightness], FADE_STEPS
jl .fi
; Hold at full brightness
mov cx, HOLD_VBLANKS
.hold:
call wait_vblank
loop .hold
; Fade out: brightness 63 → 0
mov byte [brightness], FADE_STEPS - 1
.fo:
mov cl, VBLANKS_PER_STEP
.fo_vb: call wait_vblank
dec cl
jnz .fo_vb
movzx bx, byte [brightness]
call apply_palette
cmp byte [brightness], 0
je .fo_done
dec byte [brightness]
jmp .fo
.fo_done:
; Brief black pause (~0.5s) before looping
mov cx, 35
.pause: call wait_vblank
loop .pause
jmp fade_loop
; ─────────────────────────────────────────────
; apply_palette: BX = brightness (0..63)
; Reads original palette from BMP, scales each component, writes to DAC
apply_palette:
push ds
push bx ; save brightness
mov ax, BMP_LOAD_SEG
mov ds, ax
mov si, 54 ; BMP palette offset
mov dx, 0x3C8
xor al, al
out dx, al ; DAC write index = 0
mov dx, 0x3C9
mov cx, 256
.pal_loop:
; R
mov al, [si+2]
shr al, 2 ; 8-bit → 6-bit (0..63)
call scale_component
out dx, al
; G
mov al, [si+1]
shr al, 2
call scale_component
out dx, al
; B
mov al, [si]
shr al, 2
call scale_component
out dx, al
add si, 4
loop .pal_loop
pop bx
pop ds
ret
; al = component (0..63), bx = brightness (0..63) → al = al*bx/63
; uses AX for mul so save/restore carefully
scale_component:
push cx
push dx
movzx ax, al
mul bx ; DX:AX = component * brightness (max 63*63=3969, fits in AX)
mov cx, 63
div cx ; AX = result / 63, DX = remainder
pop dx
pop cx
ret ; AL = scaled component
; ─────────────────────────────────────────────
; Wait for vertical blank rising edge (port 0x3DA bit 3)
wait_vblank:
mov dx, 0x3DA
.not_blank:
in al, dx
test al, 0x08
jnz .not_blank ; spin until we're OUT of vblank
.blank:
in al, dx
test al, 0x08
jz .blank ; spin until vblank STARTS
ret
; ─────────────────────────────────────────────
clear_screen:
mov ax, 0xA000
mov es, ax
xor di, di
mov cx, SCREEN_W * SCREEN_H
xor al, al
rep stosb
xor ax, ax
mov es, ax
ret
; ─────────────────────────────────────────────
blit_tux:
xor bx, bx
.row_loop:
cmp bx, IMG_H
jge .done
movzx eax, bx
mov ecx, IMG_ROW_BYTES
mul ecx
add eax, BMP_PIXEL_OFFSET
mov si, ax
mov ax, IMG_H - 1
sub ax, bx
add ax, ORIGIN_Y
mov cx, SCREEN_W
mul cx
add ax, ORIGIN_X
mov di, ax
push ds
mov ax, BMP_LOAD_SEG
mov ds, ax
mov ax, 0xA000
mov es, ax
mov cx, IMG_ROW_BYTES
rep movsb
pop ds
inc bx
jmp .row_loop
.done:
ret
; ─────────────────────────────────────────────
; Disk Address Packet for int 13h extended read
align 4
dap:
db 0x10 ; packet size
db 0x00 ; reserved
dw 0 ; sectors to read (filled at runtime)
dw 0 ; buffer offset (filled at runtime)
dw BMP_LOAD_SEG ; buffer segment
dd 0 ; LBA low dword (filled at runtime)
dd 0 ; LBA high dword (always 0 for us)
load_bmp:
mov ax, BMP_LOAD_SEG
mov es, ax
xor bx, bx
mov byte [sectors_remaining], BMP_SECTOR_COUNT
mov dword [dap+8], BMP_SECTOR ; starting LBA
.chunk:
; Clamp to MAX_PER_READ sectors per call (a conservative size for older BIOSes)
mov al, [sectors_remaining]
cmp al, MAX_PER_READ
jbe .set_count
mov al, MAX_PER_READ
.set_count:
movzx ax, al
mov [dap+2], ax ; sector count into DAP
mov [sectors_this_read], al
; Buffer offset advances as we load
mov [dap+4], bx
mov ah, 0x42 ; extended read
mov dl, [boot_drive]
mov si, dap ; DS:SI → DAP
int 0x13
jc disk_error
; Advance buffer pointer
movzx ax, byte [sectors_this_read]
mov cx, 512
mul cx
add bx, ax
; Advance LBA
movzx eax, byte [sectors_this_read]
add [dap+8], eax
mov al, [sectors_this_read]
sub [sectors_remaining], al
jnz .chunk
xor ax, ax
mov es, ax
ret
; ─────────────────────────────────────────────
disk_error:
mov ax, 0x0003
int 0x10
mov si, err_msg
.p: lodsb
or al, al
jz .h
mov ah, 0x0E
xor bh, bh
int 0x10
jmp .p
.h: cli
hlt
jmp .h ; a non-maskable interrupt can wake the CPU, so halt again
; ─────────────────────────────────────────────
boot_drive db 0
brightness db 0
sectors_remaining db 0
sectors_this_read db 0
err_msg db "Disk read error!", 0
times 2048-($-$$) db 0
The Penguin Appears
With stage 1 and stage 2 working together, the execution path is clean: BIOS loads stage 1, stage 1 loads stage 2, stage 2 loads the bitmap, programs the palette, and blits pixels to the framebuffer. Tux appears on screen.
The fade effect works by manipulating the DAC palette rather than touching the framebuffer. At full brightness the palette entries reflect the original BMP colors. To fade out, we scale every RGB component toward zero over 64 steps, multiplying each value by a brightness factor that decrements from 63 to 0 and dividing by 63. Fade in is the reverse. The timing is synchronized to the VGA vertical blank signal, readable from bit 3 of port 0x3DA: each step waits for three retrace pulses and then rewrites the palette. In mode 13h on real VGA hardware the refresh rate is roughly 70Hz, giving a smooth, roughly three-second fade in each direction with no tearing.
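The scaling and the timing are both small integer calculations, so they can be checked away from the hardware. A Python sketch using the constants from stage 2 (the Python names are mine):

```python
FADE_STEPS = 64         # brightness levels 0..63
VBLANKS_PER_STEP = 3
REFRESH_HZ = 70         # approximate mode 13h refresh rate

def scale_component(component: int, brightness: int) -> int:
    """Scale a 6-bit color component by a 0..63 brightness factor."""
    return (component * brightness) // 63

print(scale_component(63, 63))  # 63: full brightness leaves the color unchanged
print(scale_component(40, 32))  # 20: roughly half brightness
print(scale_component(40, 0))   # 0: fully faded out

fade_seconds = FADE_STEPS * VBLANKS_PER_STEP / REFRESH_HZ
print(round(fade_seconds, 2))   # 2.74
```

The largest intermediate product is 63 × 63 = 3969, which is why the assembly version can do the whole calculation in 16-bit registers.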
nasm -f bin stage1.asm -o stage1.bin
nasm -f bin stage2.asm -o stage2.bin
cat stage1.bin stage2.bin tux.bmp > disk.img
qemu-system-i386 -drive format=raw,file=disk.img,if=ide
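Because `cat` simply concatenates the files, every piece has to be a whole number of sectors or everything after it shifts. The Python sketch below computes where each file lands. The file sizes are assumptions derived from this post (a 2048-byte stage 2, and a BMP of 1078 header and palette bytes plus 128 × 128 pixels), and `layout` is a hypothetical helper.

```python
SECTOR = 512

def layout(files):
    """Map each file to (first LBA, sector count) when files are concatenated in order."""
    result = {}
    offset = 0
    for name, size in files:
        if offset % SECTOR != 0:
            raise ValueError(f"{name} would start mid-sector at byte {offset}")
        sector_count = -(-size // SECTOR)  # ceiling division
        result[name] = (offset // SECTOR, sector_count)
        offset += size
    return result

files = [
    ("stage1.bin", 512),
    ("stage2.bin", 2048),
    ("tux.bmp", 1078 + 128 * 128),  # assumed size: header + palette + pixels
]
for name, (first_lba, count) in layout(files).items():
    print(name, first_lba, count)
# stage1.bin 0 1
# stage2.bin 1 4
# tux.bmp 5 35
```

The bitmap's first LBA and sector count computed this way are the numbers that `BMP_SECTOR` and `BMP_SECTOR_COUNT` in stage 2 have to match.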
The experiment answered the original question. Those boot-time logos from the 90s were small but nice pieces of code doing exactly this: asking the BIOS to switch video modes, loading pixel data from storage, writing bytes to a known memory address. Real mode is constrained and archaic but it's also remarkably direct. There's almost nothing between your code and the hardware. Writing one byte to 0xA0000 + y × 320 + x puts a colored dot on the screen. That's it.
Glitches, Pitfalls, and the Cursed Half-Penguin
Getting to a working result involved a reliable series of failures, each one clarifying something the documentation hadn't made obvious.
The cursed half-penguin. The first time we got pixels on screen they were recognizably Tux, but only the bottom third of him, floating in the lower portion of the display. This happened twice during development, for two different reasons. The first time, stage 2 was not padded to an exact sector boundary, so the bitmap data started mid-sector at an offset the loader wasn't expecting, and we were reading pixel data from somewhere in the middle of the file. The second time, BMP_SECTOR_COUNT was set to 34 instead of 35, one sector short, which silently dropped the tail of the pixel data. Since BMP stores rows bottom-up, the end of the file is the top of the image, so the missing data belonged to the top of Tux. His feet survived. Debugging bitmaps by counting sectors is a particular kind of humbling.
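Counting sectors by hand is exactly the kind of thing worth scripting. This Python sketch derives the sector count and the per-call chunk sizes that a loader with an 18-sector limit would use. The file size is assumed from this post's numbers (1078 bytes before the pixel data, 128 × 128 pixels), and both helper names are mine.

```python
SECTOR = 512

def sectors_needed(size_bytes: int) -> int:
    """Number of whole sectors required to hold size_bytes."""
    return -(-size_bytes // SECTOR)  # ceiling division

def chunk_plan(total_sectors: int, max_per_read: int = 18) -> list:
    """Split a read into chunks of at most max_per_read sectors."""
    chunks = []
    remaining = total_sectors
    while remaining > 0:
        this_read = min(remaining, max_per_read)
        chunks.append(this_read)
        remaining -= this_read
    return chunks

bmp_size = 1078 + 128 * 128      # assumed file size in bytes
total = sectors_needed(bmp_size)
print(total)                     # 35
print(chunk_plan(total))         # [18, 17]
```

Rounding down instead of up gives 34, which is the off-by-one described above.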
CHS addressing and the geometry mismatch. The traditional BIOS disk read interface, int 13h with Cylinder, Head, Sector coordinates, encodes assumptions about physical disk geometry that vary between device types. Our code assumed floppy geometry (18 sectors per track, 2 heads) because we were testing with if=floppy in QEMU. On the Athlon 64, the BIOS presented the USB drive with hard disk geometry, and the very first disk read in stage 2 returned error code 0x01: invalid parameter. The fix was switching to LBA (Logical Block Addressing), which treats the disk as a flat numbered sequence of sectors and lets the BIOS handle the geometry translation. The int 13h extensions that provide LBA have been standard since the mid-90s and behave the same on USB drives and hard disks. It should have been the first choice.
QEMU floppy versus real hardware. With an emulated floppy drive, QEMU does not support int 13h extended reads (the LBA interface). So when we switched to LBA in stage 1, QEMU immediately started throwing disk errors that real hardware didn't. The fix was changing the -drive option in the QEMU invocation from if=floppy to if=ide, which emulates a hard disk and supports the full extended BIOS interface. This made QEMU and real hardware behave consistently, which is what you actually want from an emulator.
The palette mystery and the yellow penguin. Mode 13h is an indexed color mode: each pixel byte is not a color but an index into a 256-entry palette programmed into the VGA DAC. The BMP format stores its own 256-entry palette right after the file header, and you have to load that palette into the DAC explicitly or the colors will be wrong. When we first got the colored version working in QEMU the background was yellow instead of near-black, despite the BMP palette showing the background color as R=10 G=9 B=10, essentially black. The cause was a palette loading bug: the wrong memory segment was active during the DAC write, so the hardware received garbage RGB values for certain entries. On real hardware the same yellow image rendered with glitchy colored borders around Tux's edges, caused by antialiased pixels in the source image using palette indices that mapped to incorrect DAC values. The underlying lesson: in indexed color modes, the image and the hardware palette must be kept in sync precisely. Any mismatch between what the BMP expects and what the DAC contains shows up immediately and visibly as wrong colors.
Reading pixel data offset from the wrong segment. At one point stage 2 tried to read the pixel data offset from the BMP header dynamically, a four-byte value at offset 10 in the file. The code loaded it with DS=0x0000 instead of DS=0x0900, the segment where the BMP actually lived, so it read from low memory and got garbage, blitting from a completely wrong position in the file. The fix was hardcoding the known offset (1078 for this specific BMP) rather than reading it at runtime. Not elegant, but correct, and in tightly budgeted real mode assembly, correct beats elegant.