A Two-Stage x86 Bootloader That Draws a Penguin

I wanted to understand what happens in the first seconds after a regular x86 PC powers on, before any operating system or driver loads, and to get a grasp of how bootloaders work. I remembered that 30 years ago "Energy Star" logos and OEM splash screens appeared at boot time, drawn directly to the screen by firmware. That implied a framebuffer was accessible with no OS in between.

With some LLM assistance and a lot of trial and error, I built a two-stage bootloader in x86 real mode assembly that loads a 256-color bitmap from disk and fades it in and out, running on actual hardware, without an operating system or drivers.

The First 512 Bytes

When you turn on an x86 PC, the CPU doesn't jump straight into an operating system. It starts executing code stored in firmware/ROM (the BIOS), which runs POST (Power-On Self Test), initializes hardware, and then looks for something bootable. It checks the configured boot devices in a user-defined order. For each candidate it reads the very first sector, exactly 512 bytes, and loads it into memory at a fixed address: 0x7C00. If those 512 bytes end with the magic bytes 0x55 0xAA, the BIOS treats the sector as a valid boot sector and hands control over by jumping to 0x7C00.

The entire budget for stage 1 is those 512 bytes of code, loaded at a known address, with the CPU in real mode.
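The signature test is easy to reproduce on the build machine before writing anything to a disk. A minimal Python sketch, assuming a helper name (`is_bootable_sector`) that is not part of the original project:

```python
SECTOR_SIZE = 512

def is_bootable_sector(sector: bytes) -> bool:
    """Return True if a 512-byte sector ends with the 0x55 0xAA boot signature."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == b"\x55\xAA"

# A zero-filled sector with the signature in its last two bytes. The 16-bit
# word 0xAA55 is stored little-endian, so the bytes on disk are 0x55 then 0xAA.
sector = bytes(510) + (0xAA55).to_bytes(2, "little")
print(is_bootable_sector(sector))
```

The little-endian detail is why the assembly source writes the word 0xAA55 while the bytes on disk read 0x55 0xAA.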

Real mode is the operating mode the x86 CPU starts in, for historical reasons stretching back to the 8086 in 1978. It uses 16-bit registers and a segmented memory model where addresses are formed by combining a segment register and an offset: physical address = segment × 16 + offset. The practical ceiling this creates is about 1MB of addressable memory. On the PC, the upper part of that megabyte is reserved for video memory and ROM, which leaves programs the lower 640KB: the famous 640K barrier that defined a generation of software constraints. There is no memory protection, no virtual memory, no privilege separation. Any code can read or write anything. It's a different world from the protected mode environment that modern operating systems run in.
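The segment arithmetic is worth trying with the addresses this project uses. A small Python sketch of the formula (the function name is ours, chosen for illustration):

```python
def physical_address(segment: int, offset: int) -> int:
    """Real-mode address translation: segment * 16 + offset."""
    return (segment << 4) + offset

print(hex(physical_address(0x0000, 0x7C00)))  # where the BIOS loads the boot sector
print(hex(physical_address(0x07C0, 0x0000)))  # the same byte, written as another pair
print(hex(physical_address(0xA000, 0x0000)))  # start of VGA graphics memory
print(hex(physical_address(0xFFFF, 0xFFFF)))  # the largest value a pair can express
```

Many different segment:offset pairs name the same byte, which is why stage 1 sets the segment registers to a known value before touching memory.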

With only 512 bytes available, stage 1 has to be minimal. Ours does exactly four things: normalizes the segment registers to a known state, saves the boot drive number that the BIOS passed in the DL register, loads stage 2 from disk into memory at 0x8000, and jumps to it.

The disk read is where the first real complexity appears. The traditional method is CHS (Cylinder, Head, Sector), a coordinate system inherited from the physical geometry of spinning disks. You tell the BIOS which cylinder, which read head, and which sector on that track to read. This worked fine for decades, but it encodes assumptions about disk geometry that vary between device types. When we first tried to boot from a USB drive on real hardware (an Athlon 64 machine), stage 1 loaded correctly but stage 2 immediately threw a disk error: the BIOS was presenting the USB drive with hard disk geometry (63 sectors per track) while our code assumed floppy geometry (18 sectors per track). The BIOS error code 0x01 (invalid parameter) was the clue.

The fix was switching to LBA (Logical Block Addressing), which treats the disk as a flat array of numbered sectors with no geometry assumptions at all. Sector 0 is the MBR, sectors 1 through 4 hold stage 2, and the bitmap starts at sector 5. You ask for sector N and the BIOS figures out the physical location. Everything after that worked cleanly on both QEMU and real hardware.
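To see why the geometry assumption matters, here is a Python sketch of the standard LBA-to-CHS conversion. The 63 sectors per track comes from the failure described above; the 16 heads for the hard disk case is an assumed example value, since the text only gives the sectors-per-track figure.

```python
def lba_to_chs(lba: int, heads: int, sectors_per_track: int) -> tuple[int, int, int]:
    """Convert a 0-based LBA to (cylinder, head, sector); CHS sectors are 1-based."""
    cylinder, rest = divmod(lba, heads * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return cylinder, head, sector_index + 1

FLOPPY = dict(heads=2, sectors_per_track=18)       # 1.44 MB floppy geometry
HARD_DISK = dict(heads=16, sectors_per_track=63)   # assumed example geometry

for lba in (1, 5, 20):
    print(lba, lba_to_chs(lba, **FLOPPY), lba_to_chs(lba, **HARD_DISK))
```

The same logical sector gets different coordinates under the two geometries once the LBA passes the first floppy track, so CHS values computed for a floppy point at the wrong place, or at a sector that does not exist, on a device reporting hard disk geometry.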

; stage1.asm - LBA version
BITS 16
ORG 0x7C00

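start:
    cld                     ; string instructions (lodsb) must count upward
    xor ax, ax
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov sp, 0x7C00          ; stack grows down from just below the boot sector
    mov [boot_drive], dl    ; BIOS passes the boot drive number in DL

    ; Load stage2 via LBA (LBA 1-4, the 2nd through 5th sectors → 0x8000)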
    mov si, dap
    mov ah, 0x42
    mov dl, [boot_drive]
    int 0x13
    jc disk_error

    jmp 0x0000:0x8000

disk_error:
    mov si, err_msg
.p: lodsb
    or al, al
    jz .h
    mov ah, 0x0E
    xor bh, bh
    int 0x10
    jmp .p
.h: cli
    hlt
    jmp .h                  ; stay halted even if an NMI wakes the CPU

boot_drive  db 0

align 4
dap:
    db 0x10         ; packet size
    db 0x00         ; reserved
    dw 4            ; read 4 sectors (stage2 = 2048 bytes)
    dw 0x8000       ; destination offset
    dw 0x0000       ; destination segment
    dd 1            ; LBA = 1 (second sector, 0-based)
    dd 0

err_msg db "Disk error!", 0

times 510-($-$$) db 0
dw 0xAA55
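
The Disk Address Packet in the listing is a plain 16-byte little-endian structure, so it can be modelled on the host to double-check field order and sizes. A Python sketch (the `build_dap` helper is ours, for illustration only):

```python
import struct

def build_dap(sector_count: int, offset: int, segment: int, lba: int) -> bytes:
    """Pack a 16-byte Disk Address Packet for int 13h, AH=42h (little-endian)."""
    # size byte, reserved byte, sector count, buffer offset, buffer segment, 64-bit LBA
    return struct.pack("<BBHHHQ", 0x10, 0x00, sector_count, offset, segment, lba)

# The same values stage 1 uses: 4 sectors from LBA 1 to 0x0000:0x8000.
dap = build_dap(sector_count=4, offset=0x8000, segment=0x0000, lba=1)
print(len(dap), dap.hex(" "))
```

Note that the buffer is given as offset first, then segment, which matches the order of the two `dw` lines in the assembly.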

Drawing Without an Operating System

Once stage 2 has control, the first order of business is getting access to graphics. On a modern system this would involve a kernel driver negotiating with the GPU over a complex hardware interface. In real mode, it's a single BIOS interrupt call:

mov ax, 0x0013
int 0x10

This switches the display into VGA mode 13h: 320×200 pixels, 256 colors, with pixels laid out linearly in memory starting at 0xA0000. One byte per pixel, row by row, left to right, top to bottom. To draw a pixel at coordinates (x, y) you write one byte to address 0xA0000 + y × 320 + x. The byte value is not a color directly; it's an index into a palette of 256 entries, where each entry defines an RGB color.
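The addressing rule is simple enough to model with a byte array standing in for video memory. A Python sketch, with `put_pixel` as an illustrative name:

```python
SCREEN_W, SCREEN_H = 320, 200

def put_pixel(framebuffer: bytearray, x: int, y: int, color_index: int) -> None:
    """Write one palette index at (x, y) in a linear 320x200 buffer."""
    if not (0 <= x < SCREEN_W and 0 <= y < SCREEN_H):
        raise ValueError("pixel outside the 320x200 screen")
    framebuffer[y * SCREEN_W + x] = color_index

# Stand-in for the 64000 bytes of video memory that start at 0xA0000.
framebuffer = bytearray(SCREEN_W * SCREEN_H)
put_pixel(framebuffer, 160, 100, 15)          # palette index 15 near the centre
print(hex(0xA0000 + 100 * SCREEN_W + 160))    # physical address of that pixel
```

The whole screen is 64000 bytes, so it fits inside a single 64KB real-mode segment and a 16-bit offset is enough to reach every pixel.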

This is why there's no driver needed: the VGA standard guarantees this memory layout and this BIOS interface. It has been consistent since IBM defined it in 1987. The hardware is just there, mapped into the address space, ready to be written to.

Loading the bitmap follows the same pattern. An 8-bit BMP file is mostly straightforward: a fixed header, a 256-entry color palette (1024 bytes, four bytes per color: blue, green, red, and a reserved byte), and then raw pixel data, one byte per pixel. Two quirks are worth knowing. BMP stores rows bottom-up, so the first row of pixel data in the file is the bottom row of the image. And the pixel data doesn't necessarily start immediately after the header: its offset is stored as a four-byte value at byte 10 of the file and needs to be read explicitly.
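Before hardcoding anything in assembly it helps to inspect the header on the host. The following Python sketch assumes the common 40-byte info header; the synthetic header stands in for the real file, which is not reproduced here. It also computes the row stride, because BMP rows are padded to a multiple of four bytes (a no-op for a 128-pixel-wide 8-bit image).

```python
import struct

def parse_bmp_header(data: bytes) -> dict:
    """Extract the fields a blitter needs from a BMP with a 40-byte info header."""
    if data[:2] != b"BM":
        raise ValueError("not a BMP file")
    pixel_offset, = struct.unpack_from("<I", data, 10)
    width, height, _planes, bpp = struct.unpack_from("<iiHH", data, 18)
    row_bytes = (width * bpp + 31) // 32 * 4   # rows are padded to 4 bytes
    return {"pixel_offset": pixel_offset, "width": width, "height": height,
            "bpp": bpp, "row_bytes": row_bytes, "bottom_up": height > 0}

# Synthetic 128x128 8-bit header: 14-byte file header + 40-byte info header.
header = (b"BM" + struct.pack("<IHHI", 1078 + 128 * 128, 0, 0, 1078)
          + struct.pack("<IiiHHIIiiII", 40, 128, 128, 1, 8, 0, 128 * 128, 0, 0, 256, 0))
print(parse_bmp_header(header))
```

For this layout the offset is 14 + 40 + 1024 = 1078 bytes, the value the stage 2 listing uses for BMP_PIXEL_OFFSET.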

We load the entire BMP into memory at segment 0x0900 (physical address 0x9000) using the same LBA disk read mechanism as stage 1, then blit it to the framebuffer row by row, reversing the bottom-up order as we go.
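The row reversal is the part that is easy to get backwards, so here is a Python model of it. It is a sketch that assumes unpadded rows (true when the width in bytes is a multiple of four, as with a 128-pixel-wide image); the function name is ours.

```python
SCREEN_W, SCREEN_H = 320, 200

def blit_bottom_up(framebuffer, pixels, img_w, img_h, origin_x, origin_y):
    """Copy a bottom-up image into a linear framebuffer, flipping it upright."""
    for file_row in range(img_h):
        screen_y = origin_y + (img_h - 1 - file_row)   # file row 0 is the bottom row
        src = file_row * img_w
        dst = screen_y * SCREEN_W + origin_x
        framebuffer[dst:dst + img_w] = pixels[src:src + img_w]

# Tiny 2x2 example: the file stores the bottom row (3, 4) before the top row (1, 2).
fb = bytearray(SCREEN_W * SCREEN_H)
blit_bottom_up(fb, bytes([3, 4, 1, 2]), 2, 2, origin_x=0, origin_y=0)
print(list(fb[0:2]), list(fb[SCREEN_W:SCREEN_W + 2]))
```

The source offset walks forward through the file while the destination row walks upward on screen, which is the same relationship the assembly blitter expresses with `IMG_H - 1 - row`.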

The palette is where the color mapping happens. The VGA DAC (Digital-to-Analog Converter) has 256 registers, each holding an RGB triplet in 6-bit format (0–63 per channel). To program it, you write the starting index to port 0x3C8, then stream R, G, B values to port 0x3C9, one channel at a time; the DAC advances to the next entry automatically after every third write. We read the palette directly from the BMP's own palette data, shift each 8-bit value right by 2 to convert to 6-bit, and write it straight to the DAC. After that, the palette indices in the pixel data map correctly to the colors the image was designed with.
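Two things change between the file and the hardware: the channel order and the bit depth. A Python sketch of the conversion, producing the byte stream that would be written to port 0x3C9 (the helper name is illustrative):

```python
def bmp_palette_to_dac(palette: bytes) -> bytes:
    """Convert BMP entries (B, G, R, reserved; 8-bit) to DAC order (R, G, B; 6-bit)."""
    out = bytearray()
    for i in range(0, len(palette), 4):
        blue, green, red = palette[i], palette[i + 1], palette[i + 2]
        out += bytes((red >> 2, green >> 2, blue >> 2))
    return bytes(out)

# Two entries: a near-black colour (R=10, G=9, B=10) and pure white.
dac = bmp_palette_to_dac(bytes([10, 9, 10, 0, 255, 255, 255, 0]))
print(list(dac))
```

A full 256-entry palette shrinks from 1024 bytes in the file to 768 bytes on the wire, because the reserved byte is dropped.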

Figure: the source 8-bit BMP.

; stage2.asm
BITS 16
ORG 0x8000

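BMP_LOAD_SEG        equ 0x0900  ; segment 0x0900 = physical address 0x9000
BMP_SECTOR          equ 5       ; first LBA of tux.bmp (stage1 = LBA 0, stage2 = LBA 1-4)
BMP_SECTOR_COUNT    equ 35      ; ceil((1078 + 128*128) bytes / 512)
BMP_PIXEL_OFFSET    equ 1078    ; 14 + 40 header bytes + 1024 palette bytes
MAX_PER_READ        equ 18      ; sectors per int 13h call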

SCREEN_W            equ 320
SCREEN_H            equ 200
IMG_W               equ 128
IMG_H               equ 128
IMG_ROW_BYTES       equ 128

ORIGIN_X            equ (SCREEN_W - IMG_W) / 2
ORIGIN_Y            equ (SCREEN_H - IMG_H) / 2

FADE_STEPS          equ 64      ; brightness levels 0..63
VBLANKS_PER_STEP    equ 3       ; ~3s total fade (64 * 3 / 70Hz ≈ 2.7s)
HOLD_VBLANKS        equ 210     ; ~3s hold (210 / 70Hz = 3s)

start:
    cld                     ; rep movsb, rep stosb and lodsb must count upward
    xor ax, ax
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov sp, 0x7C00
    mov [boot_drive], dl

    mov ax, 0x0013
    int 0x10

    call load_bmp
    call clear_screen
    call blit_tux           ; draw once, never touch framebuffer again

fade_loop:
    ; Fade in: brightness 0 → 63
    mov byte [brightness], 0
.fi:
    mov cl, VBLANKS_PER_STEP
.fi_vb: call wait_vblank
    dec cl
    jnz .fi_vb
    movzx bx, byte [brightness]
    call apply_palette
    inc byte [brightness]
    cmp byte [brightness], FADE_STEPS
    jl .fi

    ; Hold at full brightness
    mov cx, HOLD_VBLANKS
.hold:
    call wait_vblank
    loop .hold

    ; Fade out: brightness 63 → 0
    mov byte [brightness], FADE_STEPS - 1
.fo:
    mov cl, VBLANKS_PER_STEP
.fo_vb: call wait_vblank
    dec cl
    jnz .fo_vb
    movzx bx, byte [brightness]
    call apply_palette
    cmp byte [brightness], 0
    je .fo_done
    dec byte [brightness]
    jmp .fo
.fo_done:

    ; Brief black pause (~0.5s) before looping
    mov cx, 35
.pause: call wait_vblank
    loop .pause

    jmp fade_loop

; ─────────────────────────────────────────────
; apply_palette: BX = brightness (0..63)
; Reads original palette from BMP, scales each component, writes to DAC
apply_palette:
    push ds
    push bx                 ; save brightness

    mov ax, BMP_LOAD_SEG
    mov ds, ax
    mov si, 54              ; BMP palette offset

    mov dx, 0x3C8
    xor al, al
    out dx, al              ; DAC write index = 0

    mov dx, 0x3C9
    mov cx, 256

.pal_loop:
    ; R
    mov al, [si+2]
    shr al, 2               ; 8-bit → 6-bit (0..63)
    call scale_component
    out dx, al
    ; G
    mov al, [si+1]
    shr al, 2
    call scale_component
    out dx, al
    ; B
    mov al, [si]
    shr al, 2
    call scale_component
    out dx, al
    add si, 4
    loop .pal_loop

    pop bx
    pop ds
    ret

; al = component (0..63), bx = brightness (0..63) → al = al*bx/63
; uses AX for mul so save/restore carefully
scale_component:
    push cx
    push dx
    movzx ax, al
    mul bx                  ; DX:AX = component * brightness (max 63*63=3969, fits in AX)
    mov cx, 63
    div cx                  ; AX = result / 63, DX = remainder
    pop dx
    pop cx
    ret                     ; AL = scaled component

; ─────────────────────────────────────────────
; Wait for vertical blank rising edge (port 0x3DA bit 3)
wait_vblank:
    mov dx, 0x3DA
.not_blank:
    in al, dx
    test al, 0x08
    jnz .not_blank          ; spin until we're OUT of vblank
.blank:
    in al, dx
    test al, 0x08
    jz .blank               ; spin until vblank STARTS
    ret

; ─────────────────────────────────────────────
clear_screen:
    mov ax, 0xA000
    mov es, ax
    xor di, di
    mov cx, SCREEN_W * SCREEN_H
    xor al, al
    rep stosb
    xor ax, ax
    mov es, ax
    ret

; ─────────────────────────────────────────────
blit_tux:
    xor bx, bx

.row_loop:
    cmp bx, IMG_H
    jge .done

    movzx eax, bx
    mov ecx, IMG_ROW_BYTES
    mul ecx
    add eax, BMP_PIXEL_OFFSET
    mov si, ax

    mov ax, IMG_H - 1
    sub ax, bx
    add ax, ORIGIN_Y
    mov cx, SCREEN_W
    mul cx
    add ax, ORIGIN_X
    mov di, ax

    push ds
    mov ax, BMP_LOAD_SEG
    mov ds, ax
    mov ax, 0xA000
    mov es, ax
    mov cx, IMG_ROW_BYTES
    rep movsb
    pop ds

    inc bx
    jmp .row_loop
.done:
    ret

; ─────────────────────────────────────────────
; Disk Address Packet for int 13h extended read
align 4
dap:
    db 0x10         ; packet size
    db 0x00         ; reserved
    dw 0            ; sectors to read (filled at runtime)
    dw 0            ; buffer offset (filled at runtime)
    dw BMP_LOAD_SEG ; buffer segment
    dd 0            ; LBA low dword (filled at runtime)
    dd 0            ; LBA high dword (always 0 for us)

load_bmp:
    mov ax, BMP_LOAD_SEG
    mov es, ax
    xor bx, bx

    mov byte [sectors_remaining], BMP_SECTOR_COUNT
    mov dword [dap+8], BMP_SECTOR   ; starting LBA

.chunk:
    ; Clamp to MAX_PER_READ sectors per call (conservative for older BIOSes)
    mov al, [sectors_remaining]
    cmp al, MAX_PER_READ
    jbe .set_count
    mov al, MAX_PER_READ
.set_count:
    movzx ax, al
    mov [dap+2], ax             ; sector count into DAP
    mov [sectors_this_read], al

    ; Buffer offset advances as we load
    mov [dap+4], bx

    mov ah, 0x42                ; extended read
    mov dl, [boot_drive]
    mov si, dap                 ; DS:SI → DAP
    int 0x13
    jc disk_error

    ; Advance buffer pointer
    movzx ax, byte [sectors_this_read]
    mov cx, 512
    mul cx
    add bx, ax

    ; Advance LBA
    movzx eax, byte [sectors_this_read]
    add [dap+8], eax

    mov al, [sectors_this_read]
    sub [sectors_remaining], al
    jnz .chunk

    xor ax, ax
    mov es, ax
    ret

; ─────────────────────────────────────────────
disk_error:
    mov ax, 0x0003
    int 0x10
    mov si, err_msg
.p: lodsb
    or al, al
    jz .h
    mov ah, 0x0E
    xor bh, bh
    int 0x10
    jmp .p
.h: cli
    hlt
    jmp .h                  ; stay halted even if an NMI wakes the CPU

; ─────────────────────────────────────────────
boot_drive          db 0
brightness          db 0
sectors_remaining   db 0
sectors_this_read   db 0
err_msg             db "Disk read error!", 0

times 2048-($-$$) db 0
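
The chunking in load_bmp is easier to reason about as a list of planned reads. This Python sketch mirrors the loop's bookkeeping; the starting LBA of 5 is an example value, and the function name is ours.

```python
SECTOR_SIZE = 512

def plan_reads(start_lba: int, total_sectors: int, max_per_read: int = 18):
    """Split one large read into (lba, sector_count, buffer_offset) chunks."""
    chunks, lba, offset, remaining = [], start_lba, 0, total_sectors
    while remaining > 0:
        count = min(remaining, max_per_read)
        chunks.append((lba, count, offset))
        lba += count                      # next chunk starts where this one ends
        offset += count * SECTOR_SIZE     # buffer pointer advances by bytes read
        remaining -= count
    return chunks

# 35 sectors in chunks of at most 18 sectors each.
for chunk in plan_reads(start_lba=5, total_sectors=35):
    print(chunk)
```

Each tuple corresponds to one filled-in Disk Address Packet. The buffer offset has to stay below 64KB because the segment in the packet never changes; 35 sectors (17920 bytes) are well within that limit.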

The Penguin Appears

With stage 1 and stage 2 working together, the execution path is clean: BIOS loads stage 1, stage 1 loads stage 2, stage 2 loads the bitmap, programs the palette, and blits pixels to the framebuffer. Tux appears on screen.

Figure: getting the palette right is not that easy...

The fade effect works by manipulating the DAC palette rather than touching the framebuffer. At full brightness the palette entries reflect the original BMP colors. To fade out, we scale every RGB component toward zero over 64 steps, multiplying each value by a brightness factor that decrements from 63 to 0 and dividing by 63. Fade in is the reverse. The timing is synchronized to the VGA vertical blank signal (!), readable from bit 3 of port 0x3DA: we wait for three retrace pulses, update the palette, and repeat. In mode 13h on real VGA hardware the refresh rate is roughly 70Hz, so 64 steps of three frames each give a smooth fade of just under three seconds in each direction with no tearing.
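The scaling and the timing are both small integer calculations. A Python sketch using the constants from the stage 2 listing; the 70Hz refresh rate is the nominal figure quoted above, not a measured one.

```python
FADE_STEPS = 64          # brightness levels 0..63
VBLANKS_PER_STEP = 3
REFRESH_HZ = 70          # nominal mode 13h refresh rate

def scale_component(component: int, brightness: int) -> int:
    """Scale a 6-bit colour component by a 0..63 brightness level."""
    return component * brightness // 63

fade_seconds = FADE_STEPS * VBLANKS_PER_STEP / REFRESH_HZ
print(round(fade_seconds, 2))
print([scale_component(63, b) for b in (0, 21, 42, 63)])
```

Dividing by 63 rather than shifting by 6 means brightness 63 reproduces the original component exactly, at the cost of a `div` per channel in the assembly version.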

nasm -f bin stage1.asm -o stage1.bin
nasm -f bin stage2.asm -o stage2.bin
cat stage1.bin stage2.bin tux.bmp > disk.img
qemu-system-i386 -drive format=raw,file=disk.img,if=ide
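
Because the image is a plain concatenation, every sector number in the loaders follows from the sizes of the parts. A Python sketch that derives the starting LBA of each part and refuses layouts where a part would start mid-sector; the sizes are taken from the listings, and the BMP size assumes a 128x128 8-bit file with nothing after the pixel data.

```python
SECTOR_SIZE = 512

def image_layout(parts):
    """Map each (name, size_in_bytes) part to its starting LBA when concatenated."""
    layout, offset = {}, 0
    for name, size in parts:
        if offset % SECTOR_SIZE:
            raise ValueError(f"{name} does not start on a sector boundary")
        layout[name] = offset // SECTOR_SIZE
        offset += size
    return layout

# A 512-byte stage 1, a 2048-byte stage 2, and the bitmap.
print(image_layout([("stage1", 512), ("stage2", 2048), ("tux.bmp", 1078 + 128 * 128)]))
```

This is a useful cross-check for the LBA in the stage 1 packet and for BMP_SECTOR in stage 2 whenever the size of either stage changes.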

The experiment answered the original question. Those boot-time logos from the 90s were small but nice pieces of code doing exactly this: asking the BIOS to switch video modes, loading pixel data from storage, writing bytes to a known memory address. Real mode is constrained and archaic but it's also remarkably direct. There's almost nothing between your code and the hardware. Writing one byte to 0xA0000 + y × 320 + x puts a colored dot on the screen. That's it.

Glitches, Pitfalls, and the Cursed Half-Penguin

Getting to a working result involved a reliable series of failures, each one clarifying something the documentation hadn't made obvious.

The cursed half-penguin. The first time we got pixels on screen they were recognizably Tux, but only the bottom third of him, floating in the lower portion of the display. This happened twice during development, for two different reasons. The first time, stage 2 was not padded to an exact sector boundary, so the bitmap data started mid-sector at an offset the loader wasn't expecting, and we were reading pixel data from somewhere in the middle of the file. The second time, BMP_SECTOR_COUNT was set to 34 instead of 35, one sector short, which silently dropped the last 512 bytes of pixel data. Since BMP stores rows bottom-up, the missing data was the top of the image: Tux's head and body. His feet survived. Debugging bitmaps by counting sectors is a particular kind of humbling.
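
Both failure modes come down to rounding byte counts to sectors, which is quick to check on the host. A Python sketch with illustrative helper names, assuming a 128x128 8-bit BMP whose pixel data starts at offset 1078:

```python
SECTOR_SIZE = 512

def sectors_needed(size_in_bytes: int) -> int:
    """Round a byte count up to whole 512-byte sectors."""
    return -(-size_in_bytes // SECTOR_SIZE)

def padding_needed(size_in_bytes: int) -> int:
    """Bytes of zero padding required to reach the next sector boundary."""
    return -size_in_bytes % SECTOR_SIZE

bmp_size = 1078 + 128 * 128   # header + palette + pixels
print(sectors_needed(bmp_size), padding_needed(bmp_size))
```

`sectors_needed` gives the value for BMP_SECTOR_COUNT, and a non-zero `padding_needed` for stage 2 is the signal that whatever is concatenated after it will start mid-sector.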

CHS addressing and the geometry mismatch. The traditional BIOS disk read interface, int 13h with Cylinder, Head, Sector coordinates, encodes assumptions about physical disk geometry that vary between device types. Our code assumed floppy geometry (18 sectors per track, 2 heads) because we were testing with if=floppy in QEMU. On the Athlon 64, the BIOS presented the USB drive with hard disk geometry, and the very first disk read in stage 2 returned error code 0x01: invalid parameter. The fix was switching to LBA (Logical Block Addressing), which treats the disk as a flat numbered sequence of sectors and lets the BIOS handle the geometry translation. LBA has been standard since the mid-90s and works the same on any drive the BIOS exposes through the int 13h extensions, USB drives and hard disks included; it should have been the first choice.

QEMU floppy versus real hardware. QEMU's floppy emulation does not support int 13h extended reads (the LBA interface). So when we switched to LBA in stage 1, QEMU immediately started throwing disk errors that real hardware didn't. The fix was changing the drive option in the QEMU invocation from if=floppy to if=ide, which emulates a hard disk and supports the full extended BIOS interface. This made QEMU and real hardware behave consistently, which is what you actually want from an emulator.

The palette mystery and the yellow penguin. Mode 13h is an indexed color mode: each pixel byte is not a color but an index into a 256-entry palette programmed into the VGA DAC. The BMP format stores its own 256-entry palette right after the file headers, and you have to load that palette into the DAC explicitly or the colors will be wrong. When we first got the colored version working in QEMU, the background was yellow instead of near-black, despite the BMP palette showing the background color as R=10 G=9 B=10, essentially black. The cause was a palette loading bug: the wrong memory segment was active during the DAC write, so the hardware received garbage RGB values for certain entries. On real hardware the same yellow image rendered with glitchy colored borders around Tux's edges, caused by antialiased pixels in the source image using palette indices that mapped to incorrect DAC values. The underlying lesson: in indexed color modes, the image and the hardware palette must be kept in sync precisely. Any mismatch between what the BMP expects and what the DAC contains shows up immediately and visibly as wrong colors.

Reading the pixel data offset from the wrong segment. At one point stage 2 tried to read the pixel data offset from the BMP header dynamically, a four-byte value at offset 10 in the file. The code loaded it with DS=0x0000 instead of DS=0x0900 (physical address 0x9000), where the BMP actually lived, so it read from low memory and got garbage, blitting from a completely wrong position in the file. The fix was hardcoding the known offset (1078 for this specific BMP) rather than reading it at runtime. Not elegant, but correct, and in size-constrained real mode assembly, correct beats elegant.
