Introduction
Why binary analysis?
- Code improvement: performance/security (maybe the source code has been lost)
- Vulnerabilities: find exploits, pentest
- Malware: what does it do, how can we be safe, how can we stop it?
Static analysis: staring at the bytes and trying to see what they mean
- can be hindered by obfuscation, packing, encryption
Dynamic analysis:
- can be hindered by anti-debugging techniques
- incomplete, maybe not all functionality actually runs
Getting code from binary
Disassembler:
- interpret binary files and decode their instructions
- assembling maps each instruction to a sequence of bytes
- but the reverse mapping is not easy to recover
- practical limitations
- overlapping instructions
- on e.g. x86, instructions have variable length
- start addresses of instructions are not known in advance
- depending on which byte you start disassembling from, you may get different instructions
- desynchronisation: how do you distinguish data from code? (see the sketch below)
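A quick way to see desynchronisation in action is to disassemble the same bytes from two different start offsets. A minimal sketch, assuming the capstone Python bindings are installed; the byte string is a contrived example, not taken from a real binary:

```python
# The classic overlap: eb ff is a jmp whose target (0x1001) lands inside
# the jmp itself; decoded from there, ff c0 is 'inc eax'. Starting at
# offset 0 vs offset 1 yields two different, equally valid streams.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

code = bytes.fromhex("ebffc0c3")
md = Cs(CS_ARCH_X86, CS_MODE_64)

for start in (0, 1):
    print(f"--- starting at offset {start}")
    for insn in md.disasm(code[start:], 0x1000 + start):
        print(f"{insn.address:#x}: {insn.mnemonic} {insn.op_str}")
```

From offset 0 the decoder sees only `jmp 0x1001` (the trailing bytes form an incomplete instruction); from offset 1 it sees `inc eax; ret`.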
- practical approaches
- linear sweep (objdump, gdb, windbg):
- start at the `.text` section
- disassemble one instruction after the other
- assume that well-behaving compiler tightly packs instructions
- recursive traversal (IDA, OllyDbg)
- start at program entry point
- disassemble one instruction after the other until a control flow instruction
- recursively follow the instructions' targets (e.g. addresses of `jmp`)
- pros: better at handling interleaved data and code
- cons: coverage, what to do with indirect jumps?
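The difference between the two approaches fits in a few lines. A rough sketch of recursive traversal over a flat code buffer, again assuming capstone; real tools like IDA layer heuristics on top of this, and indirect jumps simply fall through the cracks here:

```python
from capstone import Cs, CS_ARCH_X86, CS_MODE_64, CS_GRP_JUMP, CS_GRP_CALL
from capstone.x86 import X86_OP_IMM

md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True  # needed to inspect operands

def recursive_traversal(code: bytes, entry: int, base: int) -> set:
    """Addresses of instructions reachable from entry via direct control flow."""
    seen, worklist = set(), [entry]
    while worklist:
        addr = worklist.pop()
        if addr in seen or not base <= addr < base + len(code):
            continue
        for insn in md.disasm(code[addr - base:], addr):
            if insn.address in seen:
                break                          # rejoined already-decoded code
            seen.add(insn.address)
            if insn.group(CS_GRP_JUMP) or insn.group(CS_GRP_CALL):
                op = insn.operands[0]
                if op.type == X86_OP_IMM:      # direct target: queue it
                    worklist.append(op.imm)
                # indirect targets (register/memory) are the coverage problem
                if insn.mnemonic == "jmp":
                    break                      # unconditional: no fall-through
            elif insn.mnemonic == "ret":
                break                          # end of this path
    return seen
```

A linear sweep, by contrast, is just `md.disasm(code, base)` over the whole section, with no notion of reachability.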
Decompilation:
- issues:
- structure lost, data types lost, no semantic information
- no one-to-one mapping between source code and assembly blocks
- types of analysis:
- static analysis: examine without running, could in principle tell us everything the program could do
- levels of analysis
- program level: tools like strings (strings used in a program), readelf (examine the structure of a binary), ldd (shared libraries used), nm (symbols in a program), file (identify file type), `cat /proc/<pid>/maps` (show memory mappings); a bare-bones version of strings is sketched after this list
- instruction level: disassemblers like IDA Pro
- limitations: in principle undecidable; may be obfuscated/encrypted; doesn’t scale to real-world programs because of the cost for huge programs; needs to model library/system calls and the environment; hard to deal with indirect addressing and compiler optimizations
- dynamic analysis: run and observe, tells us what the program does in a given environment with a particular input
- containment is important (but maybe that changes its behaviour)
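Several of these program-level tools are simple at heart. The core of strings(1), for instance, is just a scan for runs of printable bytes; a minimal sketch (the 4-byte minimum and ASCII-only range mirror strings' defaults, offsets printed like `-t d`):

```python
# Bare-bones strings(1): print runs of >= 4 printable ASCII bytes with
# their decimal file offsets (roughly what 'strings -a -t d' reports).
import re
import sys

with open(sys.argv[1], "rb") as f:
    data = f.read()

for m in re.finditer(rb"[\x20-\x7e]{4,}", data):
    print(m.start(), m.group().decode("ascii"))
```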
Analyzing a binary:

|  | Application level | Instruction level |
| --- | --- | --- |
| Static analysis | Identify file type: `file foo`<br>Extract strings: `strings -a -t d foo`<br>Identify libraries and imported symbols: `ldd` (list shared libraries), `nm` (list symbols, unless stripped) | Tracking control flow<br>Path slices<br>Data flow graphs<br>Value set analysis<br>Symbolic execution |
| Dynamic analysis | General info about the process: `cat /proc/<pid>/maps`<br>Library/system call trace: `strace` (reveal system calls), `ltrace` (like strace, but for calls into dynamically linked libraries)<br>Network inspection with netstat or a sniffer like tcpdump | Improve accuracy of static analyses<br>Dynamic information flow tracking, e.g. input and variable types<br>Function call monitoring<br>Combination of symbolic and dynamic execution |
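Symbolic execution, which appears in both rows of the table above, treats input as a symbolic variable and asks a solver which concrete values reach a given branch. A toy illustration, assuming the z3-solver Python package is available (the arithmetic condition is made up for the example):

```python
# Find an input that takes the branch 'if (x * 3 + 7 == 0x25)' by solving
# the path condition symbolically instead of trying inputs at random.
from z3 import BitVec, Solver, sat

x = BitVec("x", 32)          # symbolic stand-in for the program's input
s = Solver()
s.add(x * 3 + 7 == 0x25)     # path condition collected along the branch
if s.check() == sat:
    print("input reaching the branch:", s.model()[x])  # -> 10
```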
What’s a binary?
Common file formats:
- PE (Windows)
- ELF (Linux and others)
The format defines what the file looks like on disk and what it should look like in memory.
It contains info about the machine to run it on, whether the file is an executable or a library, the entry point, the sections, and which parts should be writable or executable (see the header sketch below).
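Much of that per-file info sits in a fixed-size header at offset 0. A minimal sketch that pulls a few fields out of an ELF header by hand (64-bit little-endian assumed, offsets from the ELF spec; `/bin/ls` is just a convenient test subject):

```python
# Read file type, machine, and entry point straight from an ELF64 header.
import struct

def elf_header_info(path):
    with open(path, "rb") as f:
        hdr = f.read(64)                       # ELF64 header is 64 bytes
    assert hdr[:4] == b"\x7fELF", "not an ELF file"
    e_type, e_machine = struct.unpack_from("<HH", hdr, 16)
    (e_entry,) = struct.unpack_from("<Q", hdr, 24)
    return {
        "type": {2: "executable", 3: "shared object / PIE"}.get(e_type, e_type),
        "machine": {0x3E: "x86-64"}.get(e_machine, e_machine),
        "entry point": hex(e_entry),
    }

print(elf_header_info("/bin/ls"))
```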
What’s malware?
Executable that
- hates debuggers and VMs
- hates being analyzed
- does bad things
- frequently controlled in a centralised or peer-to-peer fashion (botnet)
- often packed:
- compressed to reduce size on disk
- may have anti-debugging techniques
- can’t conclude that a packed binary is malware, since legitimate software is also packed
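A common (and imperfect) heuristic for spotting packing is byte entropy: compressed or encrypted data looks nearly random, so it sits close to the 8 bits/byte maximum. A sketch; any specific threshold is a rough rule of thumb, not a standard:

```python
# Shannon entropy of a file's bytes; values near 8.0 suggest compressed or
# encrypted content, but remember: packed does not imply malicious.
import math
import sys
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    counts = Counter(data)
    return -sum(c / len(data) * math.log2(c / len(data))
                for c in counts.values())

with open(sys.argv[1], "rb") as f:
    e = shannon_entropy(f.read())
print(f"entropy: {e:.2f} bits/byte")
```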