SubX: A minimalist assembly language for a subset of the x86 ISA
SubX: A minimalist assembly language for a subset of the x86 ISA
SubX is a simple, minimalist stack for programming your computer.
$ git clone https://github.com/akkartik/mu $ cd mu/subx $ ./subx # print out a help message
You can generate native ELF binaries with it that run on a bare Linux kernel. No other dependencies needed.
$ ./subx translate examples/ex1.subx -o examples/ex1 $ ./examples/ex1 # only on Linux $ echo $? 42
You can emulate programs on an interpreter/VM for better error messages.
$ ./subx run examples/ex1 # on Linux or BSD or OS X $ echo $? 42
Emulated runs generate a trace that permits time-travel debugging .
$ ./subx --map translate examples/factorial.subx -o examples/factorial $ ./subx --map --trace run examples/factorial saving trace to 'last_run' $ ../browse_trace/browse_trace last_run # text-mode debugger UI
You can write tests for your assembly programs. The entire stack is thoroughly covered by automated tests. SubX's tagline: tests before syntax.
$ ./subx test $ ./subx run apps/factorial test
You can use it to learn about the x86 processor that (almost certainly) runs your computer. (See below.)
You can read its tiny zero-dependency internals and understand how they work. You can hack on it, and its thorough tests will raise the alarm when you break something.
Eventually you will be able to program in higher-level notations. But you'll always have tests as guardrails and traces for inspecting runs. The entire stack will always be designed for others to comprehend. You'll always be empowered to understand how things work, and change what doesn't work for you. You'll always be expected to make small changes during upgrades.
What it looks like
Here is the first example we ran above, a program that just returns 42:
bb/copy-to-EBX 0x2a/imm32 # 42 in hex b8/copy-to-EAX 1/imm32/exit cd/syscall 0x80/imm8
Every line contains at most one instruction. Instructions consist of words separated by whitespace. Words may be opcodes (defining the operation being performed) or arguments (specifying the data the operation acts on). Any word can have extra metadata attached to it after
/ . Some metadata is required (like the
/imm8 above), but unrecognized metadata is silently skipped so you can attach comments to words (like the instruction name
/copy-to-EAX above, or the
SubX doesn't provide much syntax (there aren't even the usual mnemonics for opcodes), but it does provide error-checking. If you miss an operand or accidentally add an extra operand you'll get a nice error. SubX won't arbitrarily interpret bytes of data as instructions or vice versa.
So much for syntax. What do all these numbers actually mean ? SubX supports a small subset of the 32-bit x86 instruction set that likely runs on your computer. (Think of the name as short for "sub-x86".) Instructions operate on a few registers:
- 6 general-purpose 32-bit registers: EAX, EBX, ECX, EDX, ESI and EDI
- 2 additional 32-bit registers: ESP and EBP (I suggest you only use these to manage the call stack.)
- 3 bit-size flag registers for conditional branching:
- zero/equal flag ZF
- sign flag SF
- overflow flag OF
SubX programs consist of instructions like
52/push-ECX which modify these registers as well as a byte-addressable memory. For a complete list of supported instructions, run
subx help opcodes .
(SubX doesn't support floating-point registers yet. Intel processors support an 8-bit mode, 16-bit mode and 64-bit mode. SubX will never support them. There are other flags. SubX will never support them. There are also many more instructions that SubX will never support.)
It's worth distinguishing between an instruction's operands and its arguments . Arguments are provided directly in instructions. Operands are pieces of data in register or memory that are operated on by instructions. Intel processors determine operands from arguments in fairly complex ways.
Lengthy interlude: How x86 instructions compute operands
The Intel processor manual is the final source of truth on the x86 instruction set, but it can be forbidding to make sense of, so here's a quick orientation. You will need familiarity with binary numbers, and maybe a few other things. Email me any time if something isn't clear. I love explaining this stuff for as long as it takes. The bad news is that it takes some getting used to. The good news is that internalizing the next 500 words will give you a significantly deeper understanding of your computer.
Most instructions operate on an operand in register or memory ('reg/mem'), and a second operand in a register. The register operand is specified fairly directly using the 3-bit
EAX ECX EDX EBX ESP EBP ESI EDI
The reg/mem operand, however, gets complex. It can be specified by 1-7 arguments, each ranging in size from 2 bits to 4 bytes.
The key argument that's always present for reg/mem operands is
/mod , the addressing mode . This is a 2-bit argument that can take 4 possible values, and it determines what other arguments are required, and how to interpret them.
3: the operand is the register described by the 3-bit
/rm32argument similarly to
0: the operand is the address provided in the register described by
*rm32in C syntax.
1: the operand is the address provided by adding the register in
/rm32with the (1-byte) displacement. That's
*(rm32 + disp8)in C syntax.
2: the operand is the address provided by adding the register in
/rm32with the (4-byte) displacement. That's
*(/rm32 + disp32)in C syntax.
In the last three cases, one exception occurs when the
/rm32 argument contains
4 . Rather than encoding register
ESP , it means the address is provided by three whole new arguments (
/scale ) in a totally different way:
reg/mem = *(base + index * 2^scale)
Phew, that was a lot to take in. Some examples to work through as you reread and digest it:
To read directly from the EAX register,
3(direct mode), and
0. There must be no
To read from
*EAX(in C syntax),
0(indirect mode), and the
/rm32argument must be
0. There must be no
To read from
1(indirect + disp8 mode),
0, there must be no SIB byte, and there must be a single displacement byte containing
To read from
*(EAX+ECX+4), one approach would be to set
4(SIB byte next),
1(ECX) and a single displacement byte to
4. (What should the
scalebits be? Can you think of another approach?)
To read from
*(EAX+ECX+1000), one approach would be:
2(indirect + disp32)
/disp32: 4 bytes containing
Putting it all together
Here's a more meaty example:
This program sums the first 10 natural numbers. By convention I use horizontal tabstops to help read instructions, dots to help follow the long lines, comments before groups of instructions to describe their high-level purpose, and comments at the end of complex instructions to state the low-level operation they perform. Numbers are always in hexadecimal (base 16); the '0x' prefix is optional, and I tend to include it as a reminder when numbers look like decimal numbers or words.
Try running this example now:
$ ./subx translate examples/ex3.subx -o examples/ex3 $ ./subx run examples/ex3 $ echo $? 55
If you're on Linux you can also run it natively:
$ ./examples/ex3 $ echo $? 55
Use it now to follow along for a more complete tour of SubX syntax.
The syntax of SubX programs
SubX programs map to the same ELF binaries that a conventional Linux system uses. Linux ELF binaries consist of a series of segments . In particular, they distinguish between code and data. Correspondingly, SubX programs consist of a series of segments, each starting with a header line:
== followed by a name. The first segment must be named
code ; the second must be named
Execution begins at the start of the
code segment by default.
You can reuse segment names:
== code ...A... == data ...B... == code ...C...
code segment now contains the instructions of
A as well as
code segment, each line contains a comment, label or instruction. Comments start with a
# and are ignored. Labels should always be the first word on a line, and they end with a
Instruction arguments must specify their type, from:
/r32bits in an instruction are used as an extra opcode)
Different instructions (opcodes) require different arguments. SubX will validate each instruction in your programs, and raise an error anytime you miss or spuriously add an argument.
I recommend you order arguments consistently in your programs. SubX allows arguments in any order, but only because that's simplest to explain/implement. Switching order from instruction to instruction is likely to add to the reader's burden. Here's the order I've been using after opcodes:
|<--------- reg/mem --------->| |<- reg/mem? ->| /subop /mod /rm32 /base /index /scale /r32 /displacement /immediate
Instructions can refer to labels in displacement or immediate arguments, and they'll obtain a value based on the address of the label: immediate arguments will contain the address directly, while displacement arguments will contain the difference between the address and the address of the current instruction. The latter is mostly useful for
Functions are defined using labels. By convention, labels internal to functions (that must only be jumped to) start with a
$ . Any other labels must only be called, never jumped to. All labels must be unique.
A special label is
Entry , which can be used to specify/override the entry point of the program. It doesn't have to be unique, and the latest definition will override earlier ones.
Entry label, along with duplicate segment headers, allows programs to be built up incrementally out of multiple layers .)
The data segment consists of labels as before and byte values. Referring to data labels in either
code segment instructions or
data segment values (using the
imm32 metadata either way) yields their address.
Automatic tests are an important part of SubX, and there's a simple mechanism to provide a test harness: all functions that start with
test- are called in turn by a special, auto-generated function called
run-tests . How you choose to call it is up to you.
I try to keep things simple so that there's less work to do when I eventually implement SubX in SubX. But there is one convenience: instructions can provide a string literal surrounded by quotes (
" ) in an
imm32 argument. SubX will transparently copy it to the
data segment and replace it with its address. Strings are the only place where a SubX word is allowed to contain spaces.
That should be enough information for writing SubX programs. The
examples/ directory provides some fodder for practice, giving a more gradual introduction to SubX features. This repo includes the binary for all examples. At any commit, an example's binary should be identical bit for bit with the result of translating the corresponding
.subx file. The binary should also be natively runnable on a Linux system running on Intel x86 processors, either 32- or 64-bit. If either of these invariants is broken it's a bug on my part.
Roadmap and status
Bootstrapping a SubX->ELF translator in SubX
- Converting ascii hex bytes to binary. (✓)
- Packing bitfields for x86 instructions into bytes. (✓)
- Combining segments with the same name. (30% complete)
- Replacing addresses with labels.
- Support for string literals.
Testable, dependency-injected vocabulary of primitives
- Concurrency, and a framework for testing blocking code
Using the trace in white-box tests for performance, fault tolerance, etc.
Higher-level notations. Like programming languages, but with thinner implementations that you can -- and are expected to! -- modify.
subx will transparently compile it as necessary.
subx currently has the following sub-commands:
subx help: some helpful documentation to have at your fingertips.
subx test: runs all automated tests.
subx translate <input files> -o <output ELF binary>: translates
.subxfiles into an executable ELF binary.
subx run <ELF binary>: simulates running the ELF binaries emitted by
subx translate. Useful for debugging, and also enables more thorough testing of
Remember, not all 32-bit Linux binaries are guaranteed to run. I'm not building general infrastructure here for all of the x86 instruction set. SubX is about programming with a small, regular subset of 32-bit x86.
A few hints for debugging
Writing programs in SubX is surprisingly pleasant and addictive. Reading programs is a work in progress, and hopefully the extensive unit tests help. However, debugging programs is where one really faces up to the low-level nature of SubX. Even the smallest modifications need testing to make sure they work. In my experience, there is no modification so small that I get it working on the first attempt. And when it doesn't work, there are no clear error messages. Machine code is too simple-minded for that. You can't use a debugger, since SubX's simplistic ELF binaries contain no debugging information. So debugging requires returning to basics and practicing with a new, more rudimentary but hopefully still workable toolkit:
Start by nailing down a concrete set of steps for reproducibly obtaining the error or erroneous behavior.
If possible, turn the steps into a failing test. It's not always possible, but SubX's primary goal is to keep improving the variety of tests one can write.
Start running the single failing test alone. This involves modifying the top of the program (or the final
.subxfile passed in to
subx translate) by replacing the call to
run-testswith a call to the appropriate
Generate a trace for the failing test while running your program in emulated mode (
$ ./subx translate input.subx -o binary $ ./subx --trace run binary arg1 arg2 2>trace
The ability to generate a trace is the essential reason for the existence of
subx runmode. It gives far better visibility into program internals than running natively.
As a further refinement, it is possible to render label names in the trace by adding a second flag to both the
$ ./subx --map translate input.subx -o binary $ ./subx --map --trace run binary arg1 arg2 2>trace
subx --map translateemits a mapping from label to address in a file called
subx --map --trace runreads in the
mapfile at the start and prints out any matching label name as it traces each instruction executed.
Here's a sample of what a trace looks like, with a few boxes highlighted:
Each of the green boxes shows the trace emitted for a single instruction. It starts with a line of the form
run: inst: ___followed by the opcode for the instruction, the state of registers before the instruction executes, and various other facts deduced during execution. Some instructions first print a matching label. In the above screenshot, the red boxes show that address
0x0900005emaps to label
$loopand presumably marks the start of some loop. Function names get similar
run: == labellines.
One trick when emitting traces with labels:
$ grep label trace
This is useful for quickly showing you the control flow for the run, and the function executing when the error occurred. I find it useful to start with this information, only looking at the complete trace after I've gotten oriented on the control flow. Did it get to the loop I just modified? How many times did it go through the loop?
Once you have SubX displaying labels in traces, it's a short step to modify the program to insert more labels just to gain more insight. For example, consider the following function:
This function contains a series of jump instructions. If a trace shows
is-hex-lowercase-byte?being encountered, and then
$is-hex-lowercase-byte?:endbeing encountered, it's still ambiguous what happened. Did we hit an early exit, or did we execute all the way through? To clarify this, add temporary labels after each jump:
Now the trace should have a lot more detail on which of these labels was reached, and precisely when the exit was taken.
If you find yourself wondering, "when did the contents of this memory address change?",
subx runhas some rudimentary support for watch points . Just insert a label starting with
$watch-before an instruction that writes to the address, and its value will start getting dumped to the trace after every instruction thereafter.
Once we have a sense for precisely which instructions we want to look at, it's time to look at the trace as a whole. Key is the state of registers before each instruction. If a function is receiving bad arguments it becomes natural to inspect what values were pushed on the stack before calling it, tracing back further from there, and so on.
I occasionally want to see the precise state of the stack segment, in which case I uncomment a commented-out call to
vm.cclayer. It makes the trace a lot more verbose and a lot less dense, necessitating a lot more scrolling around, so I keep it turned off most of the time.
If the trace seems overwhelming, try browsing it in the 'time-travel debugger'.
Hopefully these hints are enough to get you started. The main thing to remember is to not be afraid of modifying the sources. A good debugging session gets into a nice rhythm of generating a trace, staring at it for a while, modifying the sources, regenerating the trace, and so on. Email me if you'd like another pair of eyes to stare at a trace, or if you have questions or complaints.
Reference documentation on available primitives
Kernel strings: null-terminated arrays of bytes. Unsafe and to be avoided, but needed for interacting with the kernel.
Strings: length-prefixed arrays of bytes. String contents are preceded by 4 bytes (32 bytes) containing the
lengthof the array.
Slices: a pair of 32-bit addresses denoting a half-open [
end) interval to live memory with a consistent lifetime.
Streams: strings prefixed by 32-bit
readindexes that the next write or read goes to, respectively.
- offset 0: write index
- offset 4: read index
- offset 8: length of array (in bytes)
- offset 12: start of array data
Invariant: 0 <=
File descriptors (fd): Low-level 32-bit integers that the kernel uses to track files opened by the program.
File: 32-bit value containing either a fd or an address to a stream (fake file).
Buffered files (buffered-file): Contain a file descriptor and a stream for buffering reads/writes. Each
buffered-filemust exclusively perform either reads or writes.
A major goal of SubX is testable wrappers for operating system syscalls. Here's what I've built so far:
write: takes two arguments, a file
fand an address to array
Comparing this interface with the Unix
write()syscall shows two benefits:
SubX can handle 'fake' file descriptors in tests.
write()accepts buffer and its length in separate arguments, which requires callers to manage the two separately and so can be error-prone. SubX's wrapper keeps the two together to increase the chances that we never accidentally go out of array bounds.
read: takes two arguments, a file
fand an address to stream
s. Reads as much data from
fas can fit in (the free space of)
write(), this wrapper around the Unix
read()syscall adds the ability to handle 'fake' file descriptors in tests, and reduces the chances of clobbering outside array bounds.
One bit of weirdness here: in tests we do a redundant copy from one stream to another. See the comments before the implementation for a discussion of alternative interfaces.
stop: takes two arguments:
edis an address to an exit descriptor . Exit descriptors allow us to
exit()the program in production, but return to the test harness within tests. That allows tests to make assertions about when
valueis the status code to
For more details on exit descriptors and how to create one, see the comments before the implementation .
Allocates a whole new segment of memory for the program, discontiguous with both existing code and data (heap) segments. Just a more opinionated form of
allocate: takes two arguments, an address to allocation-descriptor
adand an integer
Allocates a contiguous range of memory that is guaranteed to be exclusively available to the caller. Returns the starting address to the range in
An allocation descriptor tracks allocated vs available addresses in some contiguous range of memory. The int specifies the number of bytes to allocate.
Explicitly passing in an allocation descriptor allows for nested memory management, where a sub-system gets a chunk of memory and further parcels it out to individual allocations. Particularly helpful for (surprise) tests.
... (to be continued)
primitives built atop system calls
(Compound arguments are usually passed in by reference. Where the results are compound objects that don't fit in a register, the caller usually passes in allocated memory for it.)
assertions for tests
check-ints-equal check-stream-equal check-next-stream-line-equal
error: takes three arguments, an exit-descriptor, a file and a string (message)
Prints out the message to the file and then exits using the provided exit-descriptor.
errorbut takes an extra byte value that it prints out at the end of the message.
kernel-string-equal?: compares a kernel string with a string
string-equal?: compares two strings
stream-data-equal?: compares a stream with a string
next-stream-line-equal?: compares with string the next line in a stream, from
readindex to newline
slice-empty?: checks if the
endof a slice are equal
slice-equal?: compares a slice with a string
slice-starts-with?: compares the start of a slice with a string
slice-ends-with?: compares the end of a slice with a string
writing to disk
write-stream write-buffered write-slice flush print-byte
reading from disk
non-IO operations on streams
new-stream: allocates space for a stream of size
clear-stream: resets everything in the stream to
rewind-stream: resets the read index of the stream to
0without modifying its contents.
reading/writing hex representations of integers
is-hex-int?: takes a slice argument, returns boolean result in
parse-hex-int: takes a slice argument, returns int result in
is-hex-digit?: takes a 32-bit word containing a single byte, returns boolean result in
from-hex-char: takes a hexadecimal digit character in EAX, returns its numeric value in
to-hex-char: takes a single-digit numeric value in EAX, returns its corresponding hexadecimal character in
from a stream:
next-token skip-chars-matching skip-chars-not-matching
from a slice:
next-token-from-slice: start, end, delimiter byte -> slice Given a slice and a delimiter byte, returns a new slice inside the input that ends at the delimiter byte.
skip-chars-matching-in-slice: curr, end, delimiter byte -> new-curr (in
skip-chars-not-matching-in-slice: curr, end, delimiter byte -> new-curr (in
- String literals support no escape sequences. In particular, no way to represent newlines.
- Single-page cheatsheet for the x86 ISA (pdf; cached local copy )
- Concise reference for the x86 ISA
- Intel processor manual (pdf)
- Some details on the unconventional organization of this project.
- 在一张 24 GB 的消费级显卡上用 RLHF 微调 20B LLMs
- 开源月刊《HelloGitHub》第 84 期
- 穷人版生产力工具，好用得飞起 「GitHub 热点速览」
- GitHub Copilot X来了，我让 GPT-4 帮我写了一篇分析文章｜太强了
- 大数据计算引擎 EasyMR：拥抱开源，引领技术创新
- 10 款开源的在线游戏，点开就能玩的那种
- 下一代代码助手 GitHub Copilot X 发布
- 千亿参数开源大模型 BLOOM 背后的技术
- 程序员“摸鱼”神器，GitHub Copilot“凭本事”完全免费！！
- 快看！这只猫两次登上 Github Trending !!!
- 穷人版生产力工具，好用得飞起 「GitHub 热点速览」
- Kakao Brain 的开源 ViT、ALIGN 和 COYO 文字-图片数据集
- 官宣：OpenDAL 成功进入 Apache 孵化器
- GPT-4 来了！这些开源的 GPT 应用又要变强了
- WebGPU 尚未发布，Orillusion 提前公测，我们先和创始人聊了聊