1. Background

SQLite is a software library that translates high-level disk I/O requests generated by an application into low-level I/O operations that can be carried out by the operating system. The application constructs high-level I/O requests using the SQL language. SQLite translates each high-level SQL statement into a sequence of many low-level I/O requests (open a file, read a few bytes from a file, write a few bytes into a file, etc.) that do the work requested by the SQL.

An application program could do all its disk I/O by direct calls to operating system I/O routines or by using a key/value storage engine like Berkeley DB or RocksDB (to name but two). But there are advantages to using a higher-level interface based on the SQL language.

SQL is a very high-level language. A few lines of SQL can replace hundreds or thousands of lines of procedural code. SQL thus reduces the amount of work needed to develop and maintain the application, and thereby helps to reduce the number of bugs in the application.
SQL and SQLite are transactional. The use of a transactional storage system makes it much easier to reason about the behavior of the application, and to write applications that are robust, even in the face of software bugs, hardware faults, or power losses.
SQLite is often faster than direct low-level I/O. This is counterintuitive. One would expect that a high-level interface such as SQLite would impose a run-time penalty. And, theoretically, that is correct. But in practice, SQL-based systems such as SQLite do so many behind-the-scenes optimizations that an application developer would never have time to create and maintain, that the SQL-based systems end up providing a net performance gain.

1.1. SQLite Is Different From Most Other SQL Databases

There are many SQL-based database management systems available, besides SQLite. Common options include MySQL, PostgreSQL, and SQL-Server. All these systems use the SQL langauge to communicate with the application, just like SQLite. But these other systems different from SQLite in important respects.

SQLite is a serverless software library, whereas the other systems are client-server based. With MySQL, PostgreSQL, SQL-Server, and others, the application sends a message containing some SQL over to a separate server thread or process. That separate thread or process performs the requested I/O, then send the results back to the application. But there is no separate thread or process with SQLite. SQLite runs in the same address space as the application, using the same program counter and heap storage. SQLite does no interprocess communication (IPC). When an application sends an SQL statement into SQLite (by invoking a the appropriate SQLite library subroutine), SQLite interprets the SQL in the same thread as the caller. When an SQLite API routine returns, it does not leave behind any background tasks that run separately from the application.
An SQLite database is a single ordinary file on disk (with a well-defined file format). With other systems, a "database" is usually a large number of separate files hidden away in obscure directories of the filesystem, or even spread across multiple machines. But with SQLite, a complete database is just an ordinary disk file.

2. SQL Is A Programming Language

The best way to understand how SQL database engines work is to think of SQL as a programming language, not as a "query language". Each SQL statement is a separate program. Applications construct SQL program source files and send them to the database engine. The database engine compiles the SQL source code into executable form, runs that executable, then sends the result back to the application.

While SQL is a programming language, it is different from other programming languages like C, Javascript, Python, or Go in that SQL is a declarative language where the others are imperative languages. This is an important difference that has implications for the design of the compiler used to translate program source text into an executable format. However, those details should not detract from the fact that SQL is really just another programming language.

2.1. Programming Language Processing Steps

All programming languages are processed in two steps:

Translate the program source text into an executable format.
Run the executable generated in the previous step in order to carry out the desired action.

All programming languages uses those two basic steps. The main difference is in the executable format.

"Compiled" languages like C++ and Rust translate the source text into machine code that can be directly executed by the underlying hardware. There exist SQL database systems that do the same with SQL - they translate each SQL statement directly into machine code. But that approach is uncommon and is not the approach taken by SQLite.

Other languages like Java, Perl, Python, and TCL typically translate the program source text into bytecode. This bytecode is then run through an interpreter that reads the bytecode and carries out the desired operations. SQLite uses this bytecode approach. If you preceed any SQL statement with the "EXPLAIN" keyword in SQLite, it will show you the bytecode that is generated rather than run the bytecode.

Another approach is to translate the program source text into a tree of objects in memory. This tree is the "executable". An interpret runs the executable by walking the tree. This is the technique used by MySQL, PostgreSQL, and SQL-Server.

Of course, not every language fits neatly into one of the above categories. This applies to both SQL database engines and more familiar imperative programming languages. Javascript is famous for using a hybrid execution model, where the code is initially compiled into a tree of objects, but might be further translating (using just-in-time compilation) down into more efficient bytecode or machine code, as a means of boosting performance.

The executable format really ends up being just an implementation detail. The key point is that all languages have a compiler step which translates programs into an executable format and a run step that executes the compiled program.

2.2. Compiling SQLite Programs

When an SQL program is submitted to SQLite, the first step is to split the source text into "tokens". A token might be:

A language keyword like "SELECT" or "UPDATE".
An identifier for a table or column or variable.
Punctuation characters like "," or "==" or ";".
Literal values: numeric or string constants.
Whitespace or comments.

Whitespace and comment tokens are discarded. All other tokens are fed into an LALR(1) Parser that analyzes the structure of the input program and generates an Abstract Syntax Tree (AST) for the input program.

The parser forwards the AST on to the code generator. The code generator is the heart of SQLite, and is where most of the magic happens. The code generator resolves symbolic names in the AST - matching the names of columns and tables in the input SQL into actual columns and tables of the database. The code generator also does various transformations on the AST to "optimize" it. Finally the code generator chooses appropriate algorithms to implement the operations requested by the AST and constructs bytecode to carry out those operations.

The bytecode generated by the code generator is called a "prepared statement". Translating SQL source text into a prepared statement is analogous to converting a C++ program into machine code by invoking gcc or clang. Human-readable source text (SQL or C++) goes in, and a machine readable executable (bytecode or machine code) comes out.

3. Further Reading

The Atomic Commit document describes how SQLite implements transactions.
The bytecode engine document has more information on the bytecode format used by SQLite, and how to view and interpret an SQLite prepared statement.
The SQLite query planner and Next Generation Query Planner documents have further details on the algorithms SQLite uses to implement SQL statements and how it goes above choosing an appropriate algorithm for each individual SQL statement.

This page last modified on 2022-06-16 15:42:19 UTC

*** DRAFT ***