User mode

Privilege separation

Operating system’s task:

Privilege separation on x86:

4 rings: 0 (kernel), 1, 2, 3 (user processes)
segmentation:
- memory divided into several segments
- each pointer deref has segment associated with it:
  - coe pointer: CS (code segment)
  - heap pointer: data segment (DS)
  - local variable pointer: stack segment (SS)
- x86_64 pretends not to have segments – all segments point to same memory, more or less
  - but, low-order 2 bits of CS register determine current ring
switching x86 rings
- no instruction to set CS irectly
- traditional way is using interrupts:
  - user → kernel
    - interrupt CPU (int instruction)
    - CPU switches to interrupt handler in kernel
  - kernel → user
    - kernel interrupt handler returns (iret instruction)
    - CPU restores user mode state, including CS
Global Descriptor Table
- defines segments, among other things
- code segment is an offset into GDT, so is data segment
- GDTR register points to GDT
- 64-bit GDT entries (crossed out parts are legacy):
  - DPL: descriptor privilege level. Which rings can access the segment.
Address spaces and user processes:
- different page tables create different address spaces
- address spaces isolate processes from each other
- most modern OSes provide a one-to-one mapping between address spaces and user processes

Meltdown (rogue data cache load):

bypasses supervisor bit
when CPU tries to read data that it can’t read, it will fault
modern CPUs speculate on what might happen after an operation
CPU speculates on the read, generates fault only later, so the data will be in the cache
you end up with arbitrary kernel reads from user mode
Kernel page table isolation:
- kernel just isn’t mapped into app address space, except for a small area for interrupt handlers
- kernel has its own address space
- but it impacts performance: switch page tables, flush TLB, etc.

Foreshadow (L1TF)

MDS (RIDL):

Defending kernel attacks:

SMEP: supervisor mode execution protection
SMAP: supervisor mode access protection
ASLR: address space layout randomization
- map user mode code and data in random locations in virtual address space

events “interrupting” execution flow
kernel handles interrupt before program execution continues
external: key presses, network packets
- device signals CPU by setting a pin using an electrical signal
- most hardware interrupts can be masked (disabled), using IF in EFLAGS register
internal: divide by zero, page fault, system call

Most software interrupts are synchronous: directly before/after instruction

Most hardware interrupts are asynchronous: can come at any time, proper masking is important

During an interrupt:

Dealing with livelocks (when the CPU is doing work, but not useful work):

do as little as you can in interrupt handler, schedule work for later
reduce number of interrupts: use hardware demuxing, poll instead of large number of interrupts

Kernel support for servicing user apps.

Originally only issues using int instruction. Kernel-user communication dictated by calling convention.

Each OS specifies its own calling convention.

X86 Linux:
- int 0x80 to issue system call
- %rax has syscall number
- arguments specified in %rdi, %rsi, %rdx, %r10, %r8, and %r9
- kernel places return value in %rax

syscall

caching IDT entry for system call in special CPU register
syscall/sysret and dedicated registers
requires setup through MSR (model-specific registers) via rdmsr/wrmsr
on syscall:
- saves RFLAGS in %r11, masks RFLAGS
- saves %rip in %rcx (and switches %rip to handler)
- switches CS, SS, and privilege level to ring 0
on sysret:
- restores RFLAGS from %r11
- restores %rip from %rcx
- switches CS, SS, privilege level to ring 3

VDSO:

in 32-bit x86, fast syscall instruction (sysenter) was optional in CPU
need to retain legacy int support
so kernel figures out what CPU supports using cpuid instruciton
VDSO: virtual syscall page with optimal system call instruction
- replaces int 0x80 with a call <addr>
- allows for fixed return point
- looks like a dynamic shared object