100000 TPS over a billion rows: the unreasonable effectiveness of SQLite
SQLite doesn't have MVCC! It only has a single writer! SQLite is for phones and mobile apps (and the occasional airliner)! For web servers use a proper database like Postgres! In this article I'll go over why being embedded and having a single writer are not deficiencies, but actually allow SQLite to scale so unreasonably well.
Prelude
For the code examples I will be using Clojure, but what they cover should apply to most programming languages.
The machine these benchmarks run on has the following specs:
- MacBook Pro (2021)
- Chip: Apple M1 Pro
- Memory: 16 GB
These benchmarks are not meant to be perfect or even optimal. They are merely to illustrate that it's relatively easy to achieve decent write throughput with SQLite. Usual benchmark disclaimers apply.
Defining TPS
When I say TPS I don't mean writes/updates per second. I'm talking about transactions per second, specifically interactive transactions that are common when building web applications. By interactive transactions I mean transactions where you execute some queries, run some application code and then execute more queries. For example:
BEGIN;
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
-- some application code runs
UPDATE accounts SET balance = balance + 100.00
WHERE name = 'Bob';
COMMIT;
Transactions are useful because they let you roll back your changes if your application encounters a problem.
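As a sketch of what this looks like in application code (assuming the pg-db connection pool defined in the next section): next.jdbc's with-transaction rolls back automatically if the body throws.
;; Minimal sketch (not from the article): with-transaction rolls back
;; automatically if the body throws, so neither UPDATE becomes visible.
(try
  (jdbc/with-transaction [tx pg-db]
    (jdbc/execute! tx ["UPDATE accounts SET balance = balance - 100.00 WHERE name = 'Alice'"])
    ;; some application code runs and fails
    (throw (ex-info "payment provider rejected the charge" {}))
    (jdbc/execute! tx ["UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob'"]))
  (catch Exception _ :rolled-back))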
The benchmark harness
To simulate requests we spin up n virtual threads (green threads) that each execute a function f. This is analogous to handlers on a web server and will give us similar contention. Worth noting that this is high burst, i.e. we will reach n concurrent requests as fast as the system can spin up the virtual threads.
(defmacro tx-per-second [n & body]
`(let [ids# (range 0 ~n)
start# (. System (nanoTime))]
(->> ids#
;; Futures are using virtual threads so blocking is not slow
(mapv (fn [_#] (future ~@body)))
(run! deref))
(int (/ ~n (/ (double (- (. System (nanoTime)) start#)) 1000000000.0)))))
For the Clojure programmers among you: future has been altered to use virtual threads, so we can spin up millions if we need to.
;; Make futures use virtual threads
(set-agent-send-executor!
(Executors/newVirtualThreadPerTaskExecutor))
(set-agent-send-off-executor!
(Executors/newVirtualThreadPerTaskExecutor))
We'll be using Postgres as our network database (I'm using Postgres, but the same applies to MySQL etc.) with a high-performance connection pool optimised for our number of cores.
(defonce pg-db
(jdbc/with-options
(connection/->pool
HikariDataSource
{:dbtype "postgres"
:dbname "thedb"
:username (System/getProperty "user.name")
:password ""
:minimumIdle 8
:maximumPoolSize 8})
{}))
We'll be using SQLite with a single writer connection and a number of reader connections equal to our number of cores.
(defonce lite-db
(d/init-db! "database.db"
{:pool-size 8
:pragma {:cache_size 15625
:page_size 4096
:journal_mode "WAL"
:synchronous "NORMAL"
:temp_store "MEMORY"
:busy_timeout 5000}}))
Our databases will have a simple schema:
(jdbc/execute! pg-db
["CREATE TABLE IF NOT EXISTS account(id INT PRIMARY KEY, balance INT)"])
(d/q (lite-db :writer)
["CREATE TABLE IF NOT EXISTS account(id PRIMARY KEY, balance INT)"])
And each contains a billion rows:
(->> (range 0 (* 1000 1000 1000))
(partition-all 32000)
(run!
(fn [batch]
(jdbc-sql/insert-multi! pg-db :account
(mapv (fn [id] {:id id :balance 1000000000}) batch)))))
(->> (range 0 (* 1000 1000 1000))
(partition-all 100000)
(run!
(fn [batch]
(d/with-write-tx [tx (lite-db :writer)]
(run!
(fn [id]
(d/q tx
["INSERT INTO account(id, balance) VALUES (?,?)" id 1000000000]))
batch)))))
Our user distribution will follow a power law, i.e. the top X percent of users will be involved in most of the transactions. We have a billion users, so in practice most of those won't be active, or will be active rarely. 0.9995 means 99.95% of transactions will be done by 0.05% of users. This still means around 100000 unique active users at any given time.
The reason we are using a power law is that it's a very common distribution for a lot of real products. If you think about a credit card payment system in the context of retail, the largest number of transactions are most likely with a few large retailers (Amazon, Walmart etc).
(defn pareto-user []
(rand-pareto (* 1000 1000 1000) 0.9995))
rand-pareto turns a uniform random distribution into a power-law distribution.
(defn rand-pareto [r p]
(let [a (/ (Math/log (- 1.0 p)) (Math/log p))
x (rand)
y (/ (- (+ (Math/pow x a) 1.0)
(Math/pow (- 1.0 x) (/ 1.0 a)))
2.0)]
(long (* r y))))
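The credit-random-account and debit-random-account helpers used in the benchmarks below aren't shown in this article. A minimal sketch of what they might look like, assuming they build parameterised query vectors against a pareto-distributed account id:
;; Hypothetical sketch: the article doesn't show these helpers.
(defn credit-random-account []
  ["UPDATE account SET balance = balance + 100 WHERE id = ?" (pareto-user)])

(defn debit-random-account []
  ["UPDATE account SET balance = balance - 100 WHERE id = ?" (pareto-user)])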
Network database
Let's start with a network database.
(tx-per-second 100000
(jdbc/with-transaction [tx pg-db]
(jdbc/execute! tx (credit-random-account))
(jdbc/execute! tx (debit-random-account))))
;; => 13756 TPS
A respectable 13756 TPS.
However, normally a network database will not be on the same server as our application. So let's simulate some network latency. Let's say you have 5ms latency between your app server and your database.
(tx-per-second 10000
(jdbc/with-transaction [tx pg-db]
(jdbc/execute! tx (credit-random-account))
(Thread/sleep 5)
(jdbc/execute! tx (debit-random-account))))
;; => 1214 TPS
Note: virtual threads do not sleep a real thread. They instead park, allowing the underlying carrier thread to resume another virtual thread.
What if we increase that latency to 10ms?
(tx-per-second 10000
(jdbc/with-transaction [tx pg-db]
(jdbc/execute! tx (credit-random-account))
(Thread/sleep 10)
(jdbc/execute! tx (debit-random-account))))
;; => 702 TPS
But wait, our transactions are not serialisable, which they need to be if we want consistent transaction processing (SQLite's isolation level is serialisable by design). We'd better fix that and handle retries.
(tx-per-second 10000
(loop []
(let [result
(try
(jdbc/with-transaction [tx pg-db {:isolation :serializable}]
(jdbc/execute! tx (credit-random-account))
(Thread/sleep 10)
(jdbc/execute! tx (debit-random-account)))
(catch Exception _ nil))]
(when-not result (recur)))))
;; => 660 TPS
What if the interactive transaction has an extra query (an extra network hop)?
(tx-per-second 10000
(loop []
(let [result
(try
(jdbc/with-transaction [tx pg-db {:isolation :serializable}]
(jdbc/execute! tx (credit-random-account))
(Thread/sleep 10)
(jdbc/execute! tx (debit-random-account))
(Thread/sleep 10)
(jdbc/execute! tx (debit-random-account)))
(catch Exception _ nil))]
(when-not result (recur)))))
;; => 348 TPS
348 TPS! What's going on here? Amdahl's Law strikes!
> The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used.
We're holding transactions with row locks across a network, with high contention because of the power law. What's terrifying about this is that no amount of additional hardware (CPU/servers/memory) is going to save us. This is a hard limit caused by the network. What's worse, any unexpected increase in latency will exacerbate the problem. This also means you can't have application servers in different data centres from your database (because of the increased latency).
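A back-of-envelope illustration of that hard limit (my numbers, purely illustrative): if the hottest account's row lock is held across two 10ms round trips, every transaction touching that row serialises behind a 20ms critical section.
;; Illustrative only: throughput on a single hot row is bounded by how
;; long its lock is held, no matter how much hardware you add.
(def lock-hold-seconds 0.020) ;; two 10ms round trips while holding the lock

(int (/ 1.0 lock-hold-seconds))
;; => 50 transactions per second on that one row, as a hard ceiling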
I learnt this the hard way building an emoji-based tipping bot for Discord. At the time I didn't understand why we were hitting this hard limit in TPS. We ended up sacrificing the convenience of interactive transactions and moving everything into stored procedures (meaning no locks held across the network). However, in a lot of domains this isn't possible.
Embedded means no network
Let's see how SQLite fares.
(tx-per-second 1000000
(d/with-write-tx [tx (lite-db :writer)]
(d/q tx (credit-random-account))
(d/q tx (debit-random-account))))
;; => 44096 TPS
44096 TPS! By eliminating the network SQLite massively reduces the impact of Amdahl's law.
Single writer lets you batch (trivially)
We don't need to stop there though. Because SQLite has a single writer, we can batch. sqlite4clj provides a convenient dynamic batching function. Batch size grows dynamically with the workload, and producers don't have to block when the consumer is busy. Effectively it self-optimises for latency and throughput.
(defn batch-fn [db batch]
@(on-pool! lite-write-pool
(d/with-write-tx [tx db]
(run! (fn [thunk] (thunk tx)) batch))))
(defonce tx!
(b/async-batcher-init! lite-db
{:batch-fn #'batch-fn}))
Note: for Clojure/Java programmers, we're using a thread pool as SQLite work should be treated as CPU-bound, not IO-bound, so we don't want it starving our virtual threads (IO green threads).
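The lite-write-pool, lite-read-pool and on-pool! helpers aren't shown in this article either. A minimal sketch under those assumptions: a single-threaded pool for the writer, a core-sized pool for readers, and a macro that submits work and returns a deref-able Future.
;; Hypothetical sketch: the article doesn't show these definitions.
(defonce lite-write-pool (Executors/newFixedThreadPool 1))
(defonce lite-read-pool (Executors/newFixedThreadPool 8))

(defn submit!
  "Submits f to pool, returning a java.util.concurrent.Future (deref with @)."
  [^java.util.concurrent.ExecutorService pool ^java.util.concurrent.Callable f]
  (.submit pool f))

(defmacro on-pool! [pool & body]
  `(submit! ~pool (fn [] ~@body)))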
(tx-per-second 1000000
@(tx!
(fn [tx]
(d/q tx (credit-random-account))
(d/q tx (debit-random-account)))))
;; => 186157 TPS
But wait, I hear you cry! That's cheating: we no longer have isolated transaction failure. Batching sacrifices fine-grained transactions. You're right! Let's fix that.
(tx-per-second 1000000
@(tx!
(fn [tx]
(d/q tx ["SAVEPOINT inner_tx"])
(try
(d/q tx (credit-random-account))
(d/q tx (debit-random-account))
(catch Throwable _
(d/q tx ["ROLLBACK inner_tx"])))
(d/q tx ["RELEASE inner_tx"]))))
;; => 121922 TPS
SQLite supports nested transactions with SAVEPOINT. This lets us have fine-grained transaction rollback whilst still batching our writes. If a transaction fails it won't cause the batch to fail. The only case where the whole batch will fail is power loss or a hard crash.
What about concurrent reads?
Generally systems have a mix of reads and writes, somewhere in the region of 75% reads to 25% writes. So let's add some reads.
(tx-per-second 1000000
(on-pool! lite-read-pool
(d/q (lite-db :reader)
["select * from account where id = ? limit 1" (pareto-user)]))
(on-pool! lite-read-pool
(d/q (lite-db :reader)
["select * from account where id = ? limit 1" (pareto-user)]))
(on-pool! lite-read-pool
(d/q (lite-db :reader)
["select * from account where id = ? limit 1" (pareto-user)]))
@(tx!
(fn [tx]
(d/q tx ["SAVEPOINT inner_tx"])
(try
(d/q tx (credit-random-account))
(d/q tx (debit-random-account))
(catch Throwable _
(d/q tx ["ROLLBACK inner_tx"])))
(d/q tx ["RELEASE inner_tx"]))))
;; => 102545 TPS
102545 TPS!
Note: for Clojure/Java programmers, we're using a separate read thread pool so that reads don't starve writes.
TPS Report
| Scenario | Postgres (TPS) | SQLite (TPS) |
|---|---|---|
| no network | 13756 | 44096 |
| 5ms | 1214 | n/a |
| 10ms | 702 | n/a |
| 10ms serializable | 660 | n/a |
| batch | n/a | 186157 |
| batch savepoint | n/a | 121922 |
| batch savepoint + reads | n/a | 102545 |
Conclusion
Hopefully this post helps illustrate the unreasonable effectiveness of SQLite, as well as the challenges you can run into with Amdahl's law and network databases like Postgres.
The full benchmark code can be found here.
Epilogue
This article is not intended as a comparison between SQLite and Postgres.
It is intended to highlight:
A. SQLite can handle more TPS than most applications will ever need.
B. Networks can cause brutal hard limits when you use interactive transactions and hit a power law. This includes SQLite if you're using a network drive (a lot of permanent storage in the cloud is networked). This hard limit can kill your product if you run into it; there's no way to scale out of it. You have to fundamentally change your design.
C. The concern often raised with SQLite, that it has a single writer, is not a problem in practice.
However, I clearly failed to convey this.
Disclaimer: There's nothing wrong with Postgres
If you are not in the business of running on a single monolithic server, or you need a network database, or you are not confident handling backups, or you do not fully understand the nuances and limitations of SQLite: DO NOT USE IT. There's nothing wrong with a managed Postgres service.
Pragma synchronous "FULL"
Some smart hackers on Hacker News pointed out that using synchronous NORMAL was not a fair comparison, as technically under some conditions it can sacrifice some durability.
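Presumably the only change needed is the synchronous pragma in the earlier init:
;; Same init as before, with (presumably) only synchronous changed to FULL.
(defonce lite-db
  (d/init-db! "database.db"
    {:pool-size 8
     :pragma {:cache_size 15625
              :page_size 4096
              :journal_mode "WAL"
              :synchronous "FULL"
              :temp_store "MEMORY"
              :busy_timeout 5000}}))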
So here are the updated numbers.
| Scenario | Postgres (TPS) | SQLite (TPS) |
|---|---|---|
| no network | 13756 | 20654 |
| 5ms | 1214 | n/a |
| 10ms | 702 | n/a |
| 10ms serializable | 660 | n/a |
| batch | n/a | 161868 |
| batch savepoint | n/a | 98163 |
| batch savepoint + reads | n/a | 100405 |
Interestingly, you can see the power of dynamic batching here. The addition of concurrent reads means the batches have time to get larger, so we have fewer write transactions and are therefore less impacted by synchronous "FULL". This is why batch savepoint is slower than batch savepoint + reads. What the numbers don't show is that there will be a fractional latency increase.
You don't need isolation serializable (you really do for a ledger)
Some comments on Hacker News argued you don't need serializable isolation. In the context of this example you very much do (see Testing the 'I' in ACID by Martin Kleppmann). But let's cover all the numbers. I've also added results for 1ms latency for those who wanted to see that.
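For reference, the no-network serializable run is presumably the earlier retry loop without the simulated latency:
;; Presumed shape of the "no net serializable" run, based on the earlier examples.
(tx-per-second 100000
  (loop []
    (let [result
          (try
            (jdbc/with-transaction [tx pg-db {:isolation :serializable}]
              (jdbc/execute! tx (credit-random-account))
              (jdbc/execute! tx (debit-random-account)))
            (catch Exception _ nil))]
      (when-not result (recur)))))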
| Scenario | Postgres (TPS) | SQLite (TPS) |
|---|---|---|
| no net | 13756 | 20654 |
| 1ms | 5428 | n/a |
| 5ms | 1214 | n/a |
| 10ms | 702 | n/a |
| no net serializable | 10026 | 20654 |
| 1ms serializable | 4691 | n/a |
| 5ms serializable | 1182 | n/a |
| 10ms serializable | 660 | n/a |
| batch | n/a | 161868 |
| batch savepoint | n/a | 98163 |
| batch savepoint + reads | n/a | 100405 |
Give Postgres a bigger connection pool!
Some comments on Hacker News argued that increasing Postgres' connection pool size to 64 would improve the numbers. It made some better and some worse (note: there was a larger number of queries exceeding 30s, and the ability to handle bursts was reduced). Serializable results became much worse.
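Presumably the only change to the earlier Hikari config is the pool sizing:
;; Same pool as before, with (presumably) only the sizing changed to 64.
(defonce pg-db
  (jdbc/with-options
    (connection/->pool
      HikariDataSource
      {:dbtype "postgres"
       :dbname "thedb"
       :username (System/getProperty "user.name")
       :password ""
       :minimumIdle 64
       :maximumPoolSize 64})
    {}))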
| Scenario | Postgres (TPS) | SQLite (TPS) |
|---|---|---|
| no net | 17399 | 20654 |
| 1ms | 8430 | n/a |
| 5ms | 7563 | n/a |
| 10ms | 5574 | n/a |
| no net serializable | 10283 | 20654 |
| 1ms serializable | 83 | n/a |
| 5ms serializable | 75 | n/a |
| 10ms serializable | 66 | n/a |
| batch | n/a | 161868 |
| batch savepoint | n/a | 98163 |
| batch savepoint + reads | n/a | 100405 |
Also check out this article on connection pool sizing.
Why didn't you batch with postgres?
There's no benefit to batching interactive transactions over the network, as you still have to do as many network round trips as there are statements, because application code runs between statements. This means you cannot amortise the network cost by batching.
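A quick illustration (my numbers, not the article's): with application code running between statements, each statement costs a full round trip whether or not the driver batches.
;; Illustrative: an interactive transaction with n statements and l ms of
;; latency has a floor of n * l ms, batched or not, because the application
;; must see each statement's result before issuing the next one.
(defn interactive-tx-floor-ms [n-statements latency-ms]
  (* n-statements latency-ms))

(interactive-tx-floor-ms 3 10) ;; => 30ms floor per transaction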
Further Reading:
If you want to learn more about Amdahl's law, power laws and how they interact with network databases, I highly recommend listening to this interview with Joran Greef and watching his talk 1000x: The Power of an Interface for Performance by Joran Dirk Greef.
If you want to read about how much further you can scale SQLite, check out Scaling SQLite to 4M QPS on a single server (EC2 vs Bare Metal).
If you're thinking of running SQLite in production and wondering how to create streaming replicas, backups and projections, check out Litestream.
If you still don't think a single machine can handle your workload it's worth reading Scalability! But at what COST?.
Discussion
Thanks to Everyone on the Datastar discord who read drafts of this and gave me feedback.