grad

This project implements a custom programming language grad and its compiler, written in Rust. The compiler follows a multi-stage process to transform source code into executable bytecode, which is then interpreted by a custom Stack Based Virtual Machine (VM).

Getting Started

Try the language in the playground.

Example

cargo install grad
echo "let a = 10; print(a);" > example.grad
grad run example.grad

Table of Contents

  1. Compiler Overview
  2. Lexical Analysis
  3. Parsing
  4. Abstract Syntax Tree (AST)
  5. Code Generation
  6. Virtual Machine
  7. String Interning
  8. Example: Program Compilation and Execution
  9. Future Improvements

Compiler Overview

The compiler follows these main stages:

  1. Lexical Analysis - Tokenizes the input source code.
  2. Parsing - Builds an Abstract Syntax Tree (AST).
  3. Code Generation - Transforms the AST into bytecode.
  4. Virtual Machine - Executes the generated bytecode.

Lexical Analysis

The lexical analysis is performed by the Lexer struct, which splits the input source code into a sequence of Tokens.

pub struct Lexer {
    pub tokens: Vec<Token>,
}

#[derive(Debug, PartialEq, Clone)]
pub struct Token {
    pub token_type: TokenType,
    pub lexeme: String,
    pub span: std::ops::Range<usize>,
}

The Lexer uses the logos crate to identify token types such as keywords, identifiers, literals, and operators. It also records the span (start and end byte positions) of each token in the source code.
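
To illustrate the logos-based approach, here is a minimal, hedged sketch of what a token definition could look like, written against a recent logos API. The variant names mirror the token dump later in this README, but the exact attributes and patterns in grad's source may differ.

use logos::Logos;

// Illustrative token definition (not grad's actual source); logos derives the
// lexer state machine from the attribute on each variant.
#[derive(Logos, Debug, PartialEq, Clone)]
#[logos(skip r"[ \t\r\n]+")]
pub enum TokenType {
    #[token("let")]
    LET,
    #[token("print")]
    PRINT,
    #[token("(")]
    LeftParen,
    #[token(")")]
    RightParen,
    #[token(";")]
    SEMICOLON,
    #[regex(r#""[^"]*""#)]
    String,
    #[regex(r"[A-Za-z_][A-Za-z0-9_]*")]
    Identifier,
}

// Collect tokens together with their lexemes and spans.
pub fn tokenize(source: &str) -> Vec<(TokenType, String, std::ops::Range<usize>)> {
    let mut lexer = TokenType::lexer(source);
    let mut tokens = Vec::new();
    while let Some(result) = lexer.next() {
        if let Ok(token_type) = result {
            tokens.push((token_type, lexer.slice().to_string(), lexer.span()));
        }
    }
    tokens
}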

Parsing

The parsing stage is implemented using a recursive descent parser with Pratt parsing for expressions.

pub struct Parser<'a> {
    lexer: &'a mut Lexer,
}

The parser uses methods like parse_statement(), parse_expression(), and various other parsing functions to build the Abstract Syntax Tree (AST).

The expression parsing uses the Pratt parsing technique for handling operator precedence:

fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> ParseResult<ASTNode> {
    // ... (Pratt parsing implementation by matklad)
}

This allows for efficient and correct parsing of complex expressions with different operator precedences.
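
To make the binding-power idea concrete, here is a small, self-contained sketch of Pratt parsing limited to two binary operators. It is simplified for illustration and is not grad's actual expr_bp; the real implementation follows matklad's write-up and also handles prefix and postfix operators.

// Minimal Pratt parser over digits, '+' and '*'; names are illustrative only.
#[derive(Debug)]
enum Expr {
    Num(i64),
    Op(char, Box<Expr>, Box<Expr>),
}

// Each infix operator gets a (left, right) binding power; higher binds tighter.
fn infix_binding_power(op: char) -> (u8, u8) {
    match op {
        '+' => (1, 2),
        '*' => (3, 4),
        _ => panic!("unknown operator: {op}"),
    }
}

fn expr_bp(tokens: &mut std::iter::Peekable<std::vec::IntoIter<char>>, min_bp: u8) -> Expr {
    // A leading digit becomes a number literal.
    let mut lhs = match tokens.next() {
        Some(c) if c.is_ascii_digit() => Expr::Num(c.to_digit(10).unwrap() as i64),
        t => panic!("unexpected token: {:?}", t),
    };
    // Keep consuming operators whose left binding power is at least `min_bp`.
    while let Some(&op) = tokens.peek() {
        let (l_bp, r_bp) = match op {
            '+' | '*' => infix_binding_power(op),
            _ => break,
        };
        if l_bp < min_bp {
            break;
        }
        tokens.next();
        let rhs = expr_bp(tokens, r_bp);
        lhs = Expr::Op(op, Box::new(lhs), Box::new(rhs));
    }
    lhs
}

// Usage: parsing "1+2*3" yields Op('+', Num(1), Op('*', Num(2), Num(3))).
// let mut tokens = "1+2*3".chars().collect::<Vec<_>>().into_iter().peekable();
// let ast = expr_bp(&mut tokens, 0);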

The AST is represented using the ASTNode enum:

pub enum ASTNode {
    IntNumber(i64),
    FloatNumber(f64),
    Identifier(String),
    Boolean(bool),
    String(String),
    Op(Ops, Vec<ASTNode>),
    Callee(String, Vec<ASTNode>),
    Let(String, Vec<ASTNode>),
    Assign(String, Vec<ASTNode>),
    If(Vec<ASTNode>, Vec<ASTNode>, Option<Vec<ASTNode>>),
    While(Vec<ASTNode>, Vec<ASTNode>),
    Print(Vec<ASTNode>),
    Function(String, Vec<String>, Vec<ASTNode>),
    Block(Vec<ASTNode>),
}

This structure allows for representing various language constructs, including literals, variables, function calls, control flow statements, and more.
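
For example, a statement like let a = 1 + 2; would be built roughly as follows. The Ops constructors are assumptions based on the AST dump later in this README, since the Ops enum itself is not reproduced here.

// Rough shape only; the exact Ops variant names are assumed.
let ast = ASTNode::Let(
    "a".to_string(),
    vec![ASTNode::Op(
        Ops::BinaryOp(BinaryOp::Add), // assumed variant names
        vec![ASTNode::IntNumber(1), ASTNode::IntNumber(2)],
    )],
);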

Code Generation

The code generation phase transforms the AST into bytecode that can be executed by the Virtual Machine. This process is handled by the Compiler struct:

pub struct Compiler {
    chunk: Chunk,
    interner: Interner,
    locals: Vec<Local>,
    local_count: usize,
    scope_depth: u8,
    functions: Vec<Function>,
    function_count: usize,
}

The compiler emits bytecode instructions represented by the OpCode enum:

#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
#[repr(u8)]
pub enum OpCode {
    OpConstant,
    OpNil,
    OpTrue,
    OpFalse,
    // ... (other opcodes)
}

These instructions, along with their operands, are stored in a Chunk:

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chunk {
    pub code: Vec<VectorType>,
    pub constants: Vec<ValueType>,
}

The VectorType enum allows for storage of both opcodes and constant indices in the same vector.
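
A plausible shape for this, shown here as a hedged sketch rather than the repository's actual definition, is a two-variant enum plus a small helper on Chunk for emitting a constant:

// Sketch: opcodes and constant-pool indices share one code vector.
#[derive(Debug, Clone, Copy)]
pub enum VectorType {
    Code(OpCode),
    Constant(usize),
}

impl Chunk {
    // Push `value` into the constant pool and emit OpConstant followed by its index.
    pub fn emit_constant(&mut self, value: ValueType) -> usize {
        let index = self.constants.len();
        self.constants.push(value);
        self.code.push(VectorType::Code(OpCode::OpConstant));
        self.code.push(VectorType::Constant(index));
        index
    }
}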

Virtual Machine

The Virtual Machine (VM) executes the generated bytecode. It's implemented in the VM struct:

pub struct VM {
    pub chunk: Chunk,
    ip: usize,
    stack: [ValueType; STACK_MAX],
    stack_top: usize,
    pub interner: Interner,
    globals: HashMap<StringObjIdx, ValueType>,
    call_frames: Vec<CallFrame>,
    frame_index: usize,
}

The VM uses a stack-based architecture for executing instructions. It maintains a stack for operands and local variables, a global variable table, and call frames for function calls.

The main execution loop of the VM interprets each opcode and performs the corresponding operation:

pub fn run(&mut self) -> Result {
    // ... (main execution loop)
}
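
The fetch-decode-execute pattern is easiest to see in miniature. The following is a tiny, self-contained stack machine, not grad's VM, that mirrors the same structure the run loop uses:

// A toy stack machine illustrating the dispatch pattern of a bytecode VM.
#[derive(Clone, Copy)]
enum Instr {
    Constant(f64), // push a constant onto the stack
    Add,           // pop two values, push their sum
    Print,         // pop a value and print it
    Return,        // stop execution
}

fn run(code: &[Instr]) {
    let mut stack: Vec<f64> = Vec::new();
    let mut ip = 0;
    loop {
        // Fetch the next instruction and advance the instruction pointer.
        let instr = code[ip];
        ip += 1;
        // Decode and execute.
        match instr {
            Instr::Constant(v) => stack.push(v),
            Instr::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Instr::Print => println!("{}", stack.pop().unwrap()),
            Instr::Return => break,
        }
    }
}

fn main() {
    // Equivalent of `print(4.0 + 4.0);`
    run(&[
        Instr::Constant(4.0),
        Instr::Constant(4.0),
        Instr::Add,
        Instr::Print,
        Instr::Return,
    ]);
}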

String Interning

To optimize string handling, the compiler uses string interning via the Interner struct:

pub struct Interner {
    pub map: HashMap<String, usize>,
    vec: Vec<String>,
}

Interning stores each distinct string once and assigns it a unique index, so strings can be compared and looked up by index rather than by value.
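
The intern operation usually follows the standard pattern sketched below (a common implementation, not necessarily grad's exact code): check the map first, and only allocate for strings that have not been seen.

impl Interner {
    // Return the existing index for `name`, or insert it and assign the next one.
    pub fn intern(&mut self, name: &str) -> usize {
        if let Some(&idx) = self.map.get(name) {
            return idx;
        }
        let idx = self.vec.len();
        self.map.insert(name.to_string(), idx);
        self.vec.push(name.to_string());
        idx
    }

    // Resolve an interned index back to its string.
    pub fn lookup(&self, idx: usize) -> &str {
        &self.vec[idx]
    }
}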

Example: Program Compilation and Execution

Here's a simple program that demonstrates how source code progresses through each stage of the compiler.

Sample Program

print("Hello, world!");

let a = 4.0;
let b = 2**2;

print(a + b);

Stage 1: Lexical Analysis

The lexer breaks down the program into tokens:

[
    Token { token_type: PRINT, lexeme: "print", span: 0..5 }
    Token { token_type: LeftParen, lexeme: "(", span: 5..6 }
    Token { token_type: String, lexeme: "\"Hello, world!\"", span: 6..21 }
    Token { token_type: RightParen, lexeme: ")", span: 21..22 }
    Token { token_type: SEMICOLON, lexeme: ";", span: 22..23 }
    Token { token_type: LET, lexeme: "let", span: 25..28 }
    Token { token_type: Identifier, lexeme: "a", span: 29..30 }
    ...
]

Stage 2: Parsing and AST Generation

The parser creates an Abstract Syntax Tree:

Print
  String(""Hello, world!"")
Let(a)
  FloatNumber(4)
Let(b)
  Op(PostfixOp(StarStar))
    IntNumber(2)
    IntNumber(2)
Print
  Op(BinaryOp(Add))
    Identifier(a)
    Identifier(b)

Stage 3: Code Generation

The compiler generates bytecode:

0000 OP_CONSTANT                    0 | intr->"Hello, world!"
0002 OP_PRINT
0003 OP_CONSTANT                    2 | 4
0005 OP_DEFINE_GLOBAL               1 | intr->a
0007 OP_CONSTANT                    4 | 2
0009 OP_CONSTANT                    5 | 2
0011 OP_POWER
0012 OP_DEFINE_GLOBAL               3 | intr->b
0014 OP_GET_GLOBAL                  6 | intr->a
0016 OP_GET_GLOBAL                  7 | intr->b
0018 OP_ADD
0019 OP_PRINT
0020 OP_RETURN

Stage 4: Execution

The VM executes the bytecode, resulting in the output:

Hello, world!
8

Future Improvements

While this compiler implements the core functionality, there are several areas for potential improvement:

  1. Type System: Implement a static type system with type inference for improved safety and performance.
  2. Optimization: Add optimization passes to improve the generated bytecode's efficiency.
  3. Error Handling: Enhance error reporting with more detailed messages and source code locations.
  4. Garbage Collection: Implement a garbage collector for automatic memory management.
  5. REPL: Implement a Read-Eval-Print Loop for interactive programming.
