SQLite traps and pitfalls

SQLite is a popular, relational database that you embed into your application. However, there are many traps and pitfalls you should avoid. This article discusses several pitfalls (and how to avoid them), such as the use of ORMs, how to reclaim disk space, minding the maximum number of query variables, column data types and how to handle large integers.

Introduction

SQLite is a popular relational database (DB) system. It has a very similar feature set as its bigger brothers, such as MySQL, which are client/server-based systems. However, SQLite is an embedded database. It can be included in your program as a static (or dynamic) library. This simplifies the deployment, because no separate server process is necessary. Bindings and wrapper libraries let you access SQLite in most programming languages.

I’ve been working with SQLite extensively while developing BSync as part of my PhD dissertation. This article is a (random) list of traps and pitfalls I stumbled open during development. I hope that you will find them useful and avoid making the same mistakes I once did.

Traps and pitfalls

Use ORM libraries with caution

Object-Relational Mapping (ORM) libraries abstract the details from concrete database engines and their syntax (such as specific SQL statements) to a high-level, object-oriented API. There are many third party libraries out there (see Wikipedia). ORM libraries have a few advantages:

  • They save time during development, because they quickly map your code/classes to DB structures,
  • They are often cross-platform, i.e., allow substitution of the concrete DB technology (e.g. SQLite with MySQL),
  • They offer helper code for schema migration.

However, they also have several severe disadvantages you should be aware of:

  • They make working with databases appear easy. However, in reality, DB engines have intricate details you simply have to know. Once something goes wrong, e.g. when the ORM library throws exceptions you don’t understand, or when run-time performance degrades, the development time you saved by using ORM will be quickly eaten up by the efforts required to debug the problem. For instance, if you do not known what indices are, you’d have a hard time troubleshooting performance bottlenecks caused by the ORM, when it did not automatically create all the required indices. In essence: there is no free lunch.
  • Due to the abstraction of the concrete DB vendor, vendor-specific functionality is either hard to access, not at accessible at all.
  • There is some computational overhead compared to writing and executing SQL queries directly. However, I’d say this point is moot in practice, as it is common that you lose performance once you switch to a higher level of abstraction.

In the end, using an ORM library is a matter of personal preference. If you do, just be prepared that you’ll have to learn about the quirks of relational databases (and vendor-specific caveats), once unexpected behavior or performance bottlenecks occur.

Include a migrations table from the start

If you are not using a ORM library, you will have to take care of the DB’s schema migration. This involves writing migration code that alters your table schemas and transforms the stored data in some way. I recommend you create a table called “migrations” or “version”, with a single row and column, that simply stores the schema version, e.g. using a monotonically increasing integer. This lets your migration function detect which migrations still need to be applied. Whenever a migration step was completed successfully, your migration tooling code increments this counter via an UPDATE SQL statement.

Auto-created rowid column

Whenever you create a table, SQLite will automatically create an INTEGER column named rowid for you – unless you provided the WITHOUT ROWID clause (but chances are you didn’t know about this clause). The rowid row is a primary key column. If you also specify such a primary key column yourself (e.g. using the syntax some_column INTEGER PRIMARY KEY) this column will simply be an alias for rowid. See here for further information, which describes the same thing in rather cryptic words. Note that a SELECT * FROM table statement will not include rowid by default – you need to ask for the rowid column explicitly.

Verify that PRAGMAs really work

Among other things, PRAGMA statements are used to configure database settings, or to invoke various functionality (official docs). However, there are undocumented side effects where sometimes setting a variable actually has no effect. In other words, it does not work and fails silently.

For instance, if you issue the following statements in the given order, the last statement will not have any effect. Variable auto_vacuum still has value 0 (NONE), for no good reason.


PRAGMA journal_mode = WAL PRAGMA synchronous = NORMAL PRAGMA auto_vacuum = INCREMENTAL

You can read the value of a variable by executing PRAGMA variableName and omitting the equal sign and value.

To fix the above example, use a different order. Using the row ordering 3, 1, 2 will work as expected.

You may even want to include such checks into your production code, because these side effects may depend on the concrete SQLite version and how it was built. The library used in production may differ from the one you used during development.

Claiming disk space for large databases

By default, a SQLite database file’s size is monotonically growing. Deleting rows only marks specific pages as free, so that they can be used to INSERT data in the future. To actually reclaim disk space, and to speed up performance, there are two options:

  1. Execute the VACUUM statement. However, this has several side effects:
    • It locks the entire DB. No concurrent operations can take place during the VACUUM operation.
    • It takes a long time (for larger databases), because it internally recreates the DB in a separate, temporary file, and finally deletes the original database, replacing it with that temporary file.
    • The temporary file consumes additional disk space while the operation is running. Thus, it is not a good idea to run VACUUM in case you are low on disk space. You could still do it, but would have to regularly check that (freeDiskSpace - currentDbFileSize) > 0.
  2. Use PRAGMA auto_vacuum = INCREMENTAL when creating the DB. Make this PRAGMA the first statement after creating the file! This enables some internal house-keeping, helping the database to reclaim space whenever you call PRAGMA incremental_vacuum(N). This call reclaims up to N pages. The official docs provide further details, and also other possible values for auto_vacuum.
    • Note: you can determine how much free disk space (in bytes) would be gained when calling PRAGMA incremental_vacuum(N): multiply the value returned by PRAGMA freelist_count with PRAGMA page_size.

The better option depends on your context. For very large database files I recommend option 2, because option 1 would annoy your users with minutes or hours of waiting for the database to clean up. Option 1 is suitable for smaller databases. Its additional advantage is that the performance of the DB will improve (which is not the case for option 2), because the recreation eliminates side-effects of data fragmentation.

Mind the maximum number of variables in queries

By default, the maximum number of variables (“host parameters”) you can use in a query is hard-coded to 999 (see here, section Maximum Number Of Host Parameters In A Single SQL Statement). This limit may vary, because it is a compile-time parameter, whose default value you (or whoever else compiled SQLite) may have altered.

This is problematic in practice, because it is not uncommon that your application provides an (arbitrarily large) list to the DB engine. For instance if you want to mass-DELETE (or SELECT) rows based on, say, a list of IDs. A statement such as

DELETE FROM some_table WHERE rowid IN (?, ?, ?, ?, <999 times "?, ">, ?)

will throw an error and won’t complete.

To fix this, consider the following steps:

  • Analyze your lists and split them into smaller lists,
  • If a split was necessary, make sure to use BEGIN TRANSACTION and COMMIT to emulate the atomicity a single statement would have had.
  • Make sure to also consider other ? variables you might use in your query that are not related to the incoming list (e.g. ? variables used in a ORDER BY condition), so that the total number of variables does not exceed the limit.

An alternative solution is the use of temporary tables. The idea is to create a temporary table, insert the query variables as rows, and then use that temporary table in a subquery, e.g.

DROP TABLE IF EXISTS temp.input_data CREATE TABLE temp.input_data (some_column TEXT UNIQUE) # Insert input data, running the next query multiple times INSERT INTO temp.input_data (some_column) VALUES (...) # The above DELETE statement now changes to this one: DELETE FROM some_table WHERE rowid IN (SELECT some_column from temp.input_data)

Beware of SQLite’s type affinity

SQLite columns are not strictly typed, and conversions don’t necessarily happen as you might expect. The types you provide are just hints. SQLite will often store data of any type in its original type, and only convert data to the type of the column in case the conversion is lossless. For instance, you can simply insert a "hello" string into an INTEGER column. SQLite won’t complain, or warn you about type mismatches. Conversely, you may not expect that data returned by a SELECT statement of an INTEGER column is always an INTEGER. These type hints are referred to as “type affinity” in SQLite-speak, see here. Make sure to study this part of the SQLite manual closely, to better understand the meaning of the column types you specify when creating new tables.

Beware of large integers

SQLite supports signed 64-bit integers, which it can store, or do computations with. In other words, only numbers from -2^63 to (2^63) - 1 are supported, because one bit is needed to represent the sign!

That means that if you expect to work with larger numbers, e.g. 128-bit (signed) integers or unsigned 64-bit integers, you must convert the data to text before inserting it.

The horror starts when you ignore this and simply insert larger numbers (as integers). SQLite won’t complain and store a rounded number instead! For instance, if you insert 2^63 (which is already outside the supported range), the SELECTed value will be 9223372036854776000, and not 2^63=9223372036854775808. Depending on the programming language and binding library you use, the behavior may differ, though! For instance, Python’s sqlite3 binding does check for such integer overflows!

Do not use REPLACE() for file paths

Imagine you store relative or absolute file paths in a TEXT column in SQLite, e.g. to keep track of files on the actual file system. Here is an example of three rows:

foo/test.txt foo/bar/ foo/bar/x.y

Suppose you want to rename directory “foo” to “xyz”. What SQL command would you use? This one?

REPLACE(path_column, old_path, new_path)

This is what I did, until weird things started to happen. The problem with REPLACE() is that it will replace all occurrences. If there was a row with path “foo/bar/foo/”, then REPLACE(column_name, 'foo/', 'xyz/') will wreak havoc, as the result won’t be “xyz/bar/foo/”, but “xyz/bar/xyz/”.

A better solution is something like

UPDATE mytable SET path_column = 'xyz/' || substr(path_column, 4) WHERE path_column GLOB 'foo/*'"

The 4 reflects the length of the old path (‘foo/’ in this case). Note that I used GLOB instead of LIKE to update only those rows that start with ‘foo/’.

Conclusion

SQLite is a fantastic database engine, where most commands work as expected. However, specific intricacies, as those I just presented, still require a developer’s attention. In addition to this article, make sure you also read the official SQLite caveats documentation.

Have you encountered other caveats in the past? If so, let me know in the comments.

Leave a Comment