Impact of the big UNIX mutex lock on performance and scalability
(1) By zhongren on 2022-10-23 02:51:41
Hello, SQLite developers,
I work for a technology company. We are trying to use SQLite in a very large application, and we have encountered a serious performance and scalability issue with the global "big" UNIX mutex lock.
Here is our use case: the application manages hundreds of millions of databases, ranging in size from a few MB to a few GB. The databases are independent, live on NFS, and are served from thousands of machines. Most operations are reads.
Since the application hosting SQLite must manage many databases (over 500K) on each machine, it has to open and close database connections frequently (replacing open database instances in its connection cache). We run into a serious performance and scalability issue with the big UNIX mutex lock. The issue is not very noticeable when databases are on local disk, but it is when they are on NFS, and especially so when multiple threads try to access the same database. I know this issue is specific to our use case: our application needs to open and close a large number of database connections constantly.
I have written a test program that reproduces this performance issue, and I have a sample code change that removes the big UNIX mutex lock. With its removal, query execution throughput jumps 5X, and latency drops greatly.
We would like to use SQLite in our application and hope SQLite can address this scalability issue.
Thank you, Zhongren
PS. Not sure how to attach my test program and patch to remove the big UNIX mutex lock in SQLite.
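For illustration, a reproduction harness of the kind described would look roughly like this. This is a minimal sketch, not the OP's actual test program: the /nfs/test-N.db paths, thread count, and iteration count are made up, and timing/measurement code is omitted.
```c
/* Sketch: NTHREAD threads each repeatedly open, query, and close their
** own database file, so any measured contention comes from SQLite's
** internal open/close locking rather than from the data itself. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sqlite3.h>

#define NTHREAD 10
#define NITER   1000

static void *worker(void *arg){
  char zPath[64];
  int i;
  snprintf(zPath, sizeof(zPath), "/nfs/test-%d.db", (int)(intptr_t)arg);
  for(i=0; i<NITER; i++){
    sqlite3 *db = 0;
    if( sqlite3_open_v2(zPath, &db,
            SQLITE_OPEN_READWRITE|SQLITE_OPEN_CREATE, 0)==SQLITE_OK ){
      /* A trivial read forces the file locks to actually be taken. */
      sqlite3_exec(db, "SELECT count(*) FROM sqlite_schema", 0, 0, 0);
    }
    sqlite3_close(db);  /* harmless no-op if the open failed */
  }
  return 0;
}

int main(void){
  pthread_t aTid[NTHREAD];
  int i;
  for(i=0; i<NTHREAD; i++){
    pthread_create(&aTid[i], 0, worker, (void*)(intptr_t)i);
  }
  for(i=0; i<NTHREAD; i++) pthread_join(aTid[i], 0);
  return 0;
}
```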
(2) By zhongren on 2022-10-23 02:54:20 in reply to 1
Please let me know how I can send my test program (for reproducing the performance issue) and patch to remove the big UNIX mutex lock in SQLite.
Thank you.
(3) By anonymous on 2022-10-23 09:08:12 in reply to 1
Another way to approach this problem, without breaking SQLite's semantics, is to stop using NFS. With NFS, all your locking happens across the network, hence the extra latency you are experiencing. A better option would be to host the data on a Ceph cluster that you could then mount locally. That way, filesystem locks happen in the host machine's memory and don't incur the cost of network round trips.
Or, if you really don't want locking (and consistency), you can open the database with the nolock parameter (I forget the exact name; you will need to look it up), which disables all locking and could be useful for a read-only database.
(4) By Stephan Beal (stephan) on 2022-10-23 09:58:57 in reply to 3
> ... you can open the database with the nolock parameter (I forget the exact name; you will need to look it up), which disables all locking and could be useful for a read-only database.

The VFS name is "unix-none".
@OP, as our anonymous user says: locking over NFS is terribly slow. In a recent post on the Fossil SCM forum (SQLite's sister project), one user reported performance hits of 10-15x attributable entirely to NFS locking speed.
(6) By anonymous on 2022-10-23 11:48:09 in reply to 4
Anon meant the query string param from https://www.sqlite.org/uri.html
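For reference, both spellings point at the same documented mechanism. A minimal sketch (the database path is hypothetical, and this is only safe when nothing can write the file concurrently):
```c
#include <sqlite3.h>

/* Open a read-only database with all file locking disabled. */
int open_without_locks(sqlite3 **pDb){
  /* Option 1: the nolock=1 URI query parameter (needs SQLITE_OPEN_URI). */
  return sqlite3_open_v2("file:/nfs/app.db?nolock=1", pDb,
                         SQLITE_OPEN_READONLY|SQLITE_OPEN_URI, 0);
  /* Option 2, roughly equivalent: name the no-op locking VFS directly.
  ** return sqlite3_open_v2("/nfs/app.db", pDb, SQLITE_OPEN_READONLY,
  **                        "unix-none");
  */
}
```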
(7) By anonymous on 2022-10-23 11:54:13 in reply to 4
I think the OP is more referring to the POSIX issue Richard describes in https://sqlite.org/forum/forumpost/dbf245f2b7.
Thus I'm not even sure that disabling locking avoids that mutex.
(8) By zhongren on 2022-10-23 17:01:12 in reply to 7
To all the people who replied to my post, thank you for your replies and suggestions.
Our application has requirements for data reliability and availability at all times, so keeping data on local disk is not an option. Also, because a database can be accessed from multiple hosts, or from multiple processes (or threads) on one host, running without locks on a database is not an option either.
We tested hosting SQLite data on NetApp NFS. Yes, performance is much slower than with data on local disk, but it is still very good and meets our performance requirements.
The main issue is the global UNIX mutex lock with a very expensive critical section (a few system calls and, over NFS, file-locking operations inside it). All database open and close calls are serialized by this mutex. Our application needs to open and close a large number of databases constantly; when the number of open and close calls rises to a few hundred, the latency of each open or close call grows to many milliseconds.
The big UNIX mutex protects a linked list of file inode entries. To solve the scalability issue, it can be changed to a hash table (with a lookup key of either (st_dev, st_ino) or the database pathname) using a per-bucket lock, as sketched below. We have a patch and hope to pass it to the SQLite developers.
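A standalone sketch of that per-bucket idea (illustrative only; the actual patch against the amalgamation appears in post 13 below, and the hash function here is made up):
```c
#include <pthread.h>
#include <sys/stat.h>

#define NBUCKET 13971   /* number of buckets; a prime spreads inodes well */

typedef struct InodeEntry InodeEntry;  /* one entry per (st_dev, st_ino) */
struct InodeEntry {
  dev_t dev;
  ino_t ino;
  int nRef;               /* connections currently sharing this inode */
  InodeEntry *pNext;      /* next entry in the same bucket's chain */
};

static struct {
  pthread_mutex_t mutex;  /* protects only this one bucket's chain */
  InodeEntry *pList;
} aBucket[NBUCKET];

static void bucketInit(void){  /* call once before any lookups */
  int i;
  for(i=0; i<NBUCKET; i++) pthread_mutex_init(&aBucket[i].mutex, 0);
}

/* Opens of files that hash to different buckets never contend: each
** thread locks only its own bucket's mutex around lookup/insert. */
static unsigned bucketOf(dev_t dev, ino_t ino){
  return ((unsigned)dev ^ (unsigned)ino) % NBUCKET;
}
```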
Thanks, Zhongren
(9) By anonymous on 2022-10-23 17:22:30 in reply to 8
If your app can guarantee that the same DB file cannot be opened multiple times within your app (sketched below), then you could avoid that mutex and code completely, in fact. Otherwise, using a more scalable scheme than a linked list for inodes is indeed the only alternative. You could start by posting here a diff for your changes against the canonical sources in Fossil (not the amalgamation). The chances that Richard accepts them are slim, especially as-is, but you never know.
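For what it's worth, a sketch of the kind of application-side guarantee meant here. This is a hypothetical helper, not anyone's actual code: it hands out one shared connection per path so the process never opens the same file twice, which requires SQLite's serialized threading mode if a handle is shared across threads, and it assumes callers pass already-canonicalized paths.
```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sqlite3.h>

/* Tiny process-wide registry: canonical path -> its single connection. */
typedef struct DbEntry DbEntry;
struct DbEntry {
  char *zPath;
  sqlite3 *db;
  DbEntry *pNext;
};
static DbEntry *dbList = 0;
static pthread_mutex_t dbListLock = PTHREAD_MUTEX_INITIALIZER;

/* Return the one connection for zPath, opening it on first use. */
static sqlite3 *dbForPath(const char *zPath){
  DbEntry *p;
  pthread_mutex_lock(&dbListLock);
  for(p=dbList; p; p=p->pNext){
    if( strcmp(p->zPath, zPath)==0 ) break;
  }
  if( p==0 && (p = calloc(1, sizeof(*p)))!=0 ){
    p->zPath = strdup(zPath);
    if( p->zPath==0 || sqlite3_open(zPath, &p->db)!=SQLITE_OK ){
      sqlite3_close(p->db);   /* harmless no-op if db is still NULL */
      free(p->zPath);
      free(p);
      p = 0;
    }else{
      p->pNext = dbList;
      dbList = p;
    }
  }
  pthread_mutex_unlock(&dbListLock);
  return p ? p->db : 0;
}
```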
(13.1) Originally by zhongren with edits by Stephan Beal (stephan) on 2022-10-25 08:35:39 from 13.0 in reply to 9
"Anonymous", thank you.
My diff is against the amalgamation of version 3380300. Knowing that the chance of being accepted is 0, I post it below as it is, without addressing coding style and other things. Just would like to let SQLite developers know what this fix looks like. I am sure SQLite developers are able to produce a better fix. In our use case, it increases query execution throughput by 5X. It would be great if SQLite fixed the issue. Thanks!
```
35007,35008d35006
<   dev_t st_dev;
<   ino_t st_ino;
35767a35766
>   unixEnterMutex();
35783a35783
>   unixLeaveMutex();
35791a35792
>   unixEnterMutex();
35800a35802
>   unixLeaveMutex();
35952c35954,35955
<   unixInodeInfo *next;
---
>   unixInodeInfo *pNext;           /* List of all unixInodeInfo objects */
>   unixInodeInfo *pPrev;           /*    .... doubly linked */
35962,36047c35965,35970
< #include <stdbool.h>
< #include <inttypes.h>
<
< #define NBUCKET 13971
<
< typedef struct inode_bucket_s {
<   unixInodeInfo *node;
<   pthread_mutex_t mutex;
< } inode_bucket_t;
<
< static inode_bucket_t g_inode_bucket[NBUCKET];
<
< static void inode_init (void)
< {
<   int i;
<
<   for (i = 0; i < NBUCKET; i++)
<     pthread_mutex_init(&g_inode_bucket[i].mutex, NULL);
< }
<
<
< static inline inode_bucket_t* get_inode_bucket_x (dev_t dev, ino_t ino)
< {
<   return (g_inode_bucket + (dev * ino) % NBUCKET);
< }
<
< static inline inode_bucket_t* get_inode_bucket (unixFile *pFile)
< {
<   return (get_inode_bucket_x(pFile->st_dev, pFile->st_ino));
< }
<
< /* data protection is done by caller with mutex in bucket */
< static void inode_bucket_remove (inode_bucket_t *bucket, unixInodeInfo *p)
< {
<   unixInodeInfo *prev = NULL;
<   unixInodeInfo *node = bucket->node;
<
<   while (node != NULL) {
<     if (node == p) {
<       if (prev == NULL) {
<         bucket->node = node->next;
<       } else {
<         prev->next = node->next;
<       }
<       return;
<     }
<     prev = node;   /* advance prev alongside node */
<     node = node->next;
<   }
<   printf("inode_bucket_remove failed\n");
<   abort();
< }
<
< /* data protection is done by caller with mutex in bucket */
< static unixInodeInfo* inode_bucket_get (inode_bucket_t *bucket, struct stat *st, bool create)
< {
<   unixInodeInfo *node = bucket->node;
<
<   while (node != NULL) {
<     if (node->fileId.dev == st->st_dev && node->fileId.ino == st->st_ino) {
<       node->nRef++;
<       return node;
<     }
<     node = node->next;
<   }
<   if (create == false) {
<     return NULL;
<   }
<
<   node = sqlite3_malloc64( sizeof(*node) );
<   if (node == NULL) {
<     abort();
<   }
<   memset(node, 0, sizeof(*node));
<   if( sqlite3GlobalConfig.bCoreMutex ){
<     node->pLockMutex = sqlite3_mutex_alloc(SQLITE_MUTEX_FAST);
<     if( node->pLockMutex==0 ){
<       sqlite3_free(node);
<       abort();
<     }
<   }
<
<   node->fileId.dev = st->st_dev;
<   node->fileId.ino = st->st_ino;
<   node->nRef = 1;
<   node->next = bucket->node;
<   bucket->node = node;
<
<   return node;
< }
---
> /*
> ** A lists of all unixInodeInfo objects.
> **
> ** Must hold unixBigLock in order to read or write this variable.
> */
> static unixInodeInfo *inodeList = 0;  /* All unixInodeInfo objects */
36184c36107
< static void releaseInodeInfo(unixFile *pFile, inode_bucket_t *bucket){
---
> static void releaseInodeInfo(unixFile *pFile){
36195c36118,36128
<       inode_bucket_remove(bucket, pInode);
---
>       if( pInode->pPrev ){
>         assert( pInode->pPrev->pNext==pInode );
>         pInode->pPrev->pNext = pInode->pNext;
>       }else{
>         assert( inodeList==pInode );
>         inodeList = pInode->pNext;
>       }
>       if( pInode->pNext ){
>         assert( pInode->pNext->pPrev==pInode );
>         pInode->pNext->pPrev = pInode->pPrev;
>       }
36213,36215c36146
<   unixInodeInfo **ppInode,        /* Return the unixInodeInfo object here */
<   struct stat *statbuf,
<   inode_bucket_t *bucket
---
>   unixInodeInfo **ppInode         /* Return the unixInodeInfo object here */
36219a36151
>   struct stat statbuf;            /* Low-level file information */
36227a36160,36167
>   rc = osFstat(fd, &statbuf);
>   if( rc!=0 ){
>     storeLastErrno(pFile, errno);
> #if defined(EOVERFLOW) && defined(SQLITE_DISABLE_LFS)
>     if( pFile->lastErrno==EOVERFLOW ) return SQLITE_NOLFS;
> #endif
>     return SQLITE_IOERR;
>   }
36240c36180
<   if( statbuf->st_size==0 && (pFile->fsFlags & SQLITE_FSFLAGS_IS_MSDOS)!=0 ){
---
>   if( statbuf.st_size==0 && (pFile->fsFlags & SQLITE_FSFLAGS_IS_MSDOS)!=0 ){
36246c36186
<     rc = osFstat(fd, statbuf);
---
>     rc = osFstat(fd, &statbuf);
36253a36194,36200
>   memset(&fileId, 0, sizeof(fileId));
>   fileId.dev = statbuf.st_dev;
> #if OS_VXWORKS
>   fileId.pId = pFile->pId;
> #else
>   fileId.ino = (u64)statbuf.st_ino;
> #endif
36255c36202,36229
<   *ppInode = inode_bucket_get(bucket, statbuf, true);
---
>   pInode = inodeList;
>   while( pInode && memcmp(&fileId, &pInode->fileId, sizeof(fileId)) ){
>     pInode = pInode->pNext;
>   }
>   if( pInode==0 ){
>     pInode = sqlite3_malloc64( sizeof(*pInode) );
>     if( pInode==0 ){
>       return SQLITE_NOMEM_BKPT;
>     }
>     memset(pInode, 0, sizeof(*pInode));
>     memcpy(&pInode->fileId, &fileId, sizeof(fileId));
>     if( sqlite3GlobalConfig.bCoreMutex ){
>       pInode->pLockMutex = sqlite3_mutex_alloc(SQLITE_MUTEX_FAST);
>       if( pInode->pLockMutex==0 ){
>         sqlite3_free(pInode);
>         return SQLITE_NOMEM_BKPT;
>       }
>     }
>     pInode->nRef = 1;
>     assert( unixMutexHeld() );
>     pInode->pNext = inodeList;
>     pInode->pPrev = 0;
>     if( inodeList ) inodeList->pPrev = pInode;
>     inodeList = pInode;
>   }else{
>     pInode->nRef++;
>   }
>   *ppInode = pInode;
36938d36911
<   inode_bucket_t *node = get_inode_bucket(pFile);
36944c36917
<   pthread_mutex_lock(&node->mutex);
---
>   unixEnterMutex();
36960c36933
<   releaseInodeInfo(pFile, node);
---
>   releaseInodeInfo(pFile);
36963c36936
<   pthread_mutex_unlock(&node->mutex);
---
>   unixLeaveMutex();
37556d37528
<   inode_bucket_t *node = get_inode_bucket(pFile);
37560,37562c37532,37534
<   pthread_mutex_lock(&node->mutex);
<   releaseInodeInfo(pFile, node);
<   pthread_mutex_unlock(&node->mutex);
---
>   unixEnterMutex();
>   releaseInodeInfo(pFile);
>   unixLeaveMutex();
38020,38021c37992
<   inode_bucket_t *node = get_inode_bucket(pFile);
<   pthread_mutex_lock(&node->mutex);
---
>   unixEnterMutex();
38035c38006
<     releaseInodeInfo(pFile, node);
---
>     releaseInodeInfo(pFile);
38038c38009
<   pthread_mutex_unlock(&node->mutex);
---
>   unixLeaveMutex();
39419,39420c39390
<   inode_bucket_t *node = get_inode_bucket(pDbFd);
<   pthread_mutex_lock(&node->mutex);
---
>   unixEnterMutex();
39502c39472
<     pthread_mutex_unlock(&node->mutex);
---
>     unixLeaveMutex();
39521c39491
<   pthread_mutex_unlock(&node->mutex);
---
>   unixLeaveMutex();
39907,39908c39877
<   inode_bucket_t *node = get_inode_bucket(pDbFd);
<   pthread_mutex_lock(&node->mutex);
---
>   unixEnterMutex();
39917c39886
<   pthread_mutex_unlock(&node->mutex);
---
>   unixLeaveMutex();
40458d40426
<   struct stat st;
40502,40512d40469
<   if (osFstat(h, &st) != 0) {
<     storeLastErrno(pNew, errno);
<     rc = SQLITE_IOERR;
<     goto out;
<   }
<
<   pNew->st_dev = st.st_dev;
<   pNew->st_ino = st.st_ino;
<
<   inode_bucket_t *node = get_inode_bucket(pNew);
<
40518,40519c40475,40476
<     pthread_mutex_lock(&node->mutex);
<     rc = findInodeInfo(pNew, &pNew->pInode, &st, node);
---
>     unixEnterMutex();
>     rc = findInodeInfo(pNew, &pNew->pInode);
40521d40477
< out:
40543c40499
<     pthread_mutex_unlock(&node->mutex);
---
>     unixLeaveMutex();
40562,40563c40518,40519
<     pthread_mutex_lock(&node->mutex);
<     rc = findInodeInfo(pNew, &pNew->pInode, &st, node);
---
>     unixEnterMutex();
>     rc = findInodeInfo(pNew, &pNew->pInode);
40569c40525
<     pthread_mutex_unlock(&node->mutex);
---
>     unixLeaveMutex();
40596,40597c40552,40553
<     pthread_mutex_lock(&node->mutex);
<     rc = findInodeInfo(pNew, &pNew->pInode, &st, node);
---
>     unixEnterMutex();
>     rc = findInodeInfo(pNew, &pNew->pInode);
40611c40567
<     pthread_mutex_unlock(&node->mutex);
---
>     unixLeaveMutex();
40742,40743d40697
<   unixInodeInfo *pInode;
<   inode_bucket_t *node;
40745,40750c40699
<   if (osStat(zPath, &sStat) != 0) {
<     return NULL;
<   }
<
<   node = get_inode_bucket_x(sStat.st_dev, sStat.st_ino);
<   pthread_mutex_lock(&node->mutex);
---
>   unixEnterMutex();
40760c40709,40717
<   if ((pInode = inode_bucket_get(node, &sStat, false)) != NULL) {
---
>   if( inodeList!=0 && 0==osStat(zPath, &sStat) ){
>     unixInodeInfo *pInode;
>
>     pInode = inodeList;
>     while( pInode && (pInode->fileId.dev!=sStat.st_dev
>                      || pInode->fileId.ino!=(u64)sStat.st_ino) ){
>       pInode = pInode->pNext;
>     }
>     if( pInode ){
40770a40728
>     }
40772c40730
<   pthread_mutex_unlock(&node->mutex);
---
>   unixLeaveMutex();
41445a41404
>   unixEnterMutex();
41449a41409
>   unixLeaveMutex();
42921d42880
<   inode_init();
```
(15) By ddevienne on 2022-10-25 08:13:44 in reply to 13.0
"Anonymous", thank you.
You're welcome.
My diff is against the amalgamation of version 3380300
You're not helping your case by doing that, against my advice :)
Knowing that the chance of being accepted is 0
It's not quite that bad I think, although the chances are still slim.
Just would like to let SQLite developers know what this fix looks like.
Sure. I think you've achieve that part. FWIW, I've read the diff (w/o applying it),
and I see what you are doing. Looks reasonable to me too, again FWIW.
I am sure SQLite developers are able to have a better fix
I think the main issue with your fix is that it affects:
- the code space / size, a tiny bit.
- the runtime size, quite a bit (200-300K at least, don't know the
sizeof
of a mutex, for that large 13K element array). - the runtime perf (a.k.a. valgrind instruction count), a tiny bit
In your case, the patch makes complete sense, and it's demonstrated by the excellent 5x speedup.
But for the general population of SQLite uses and users, not so much, most likely.
You have several possible solutions going forward IMHO:
- Work with Richard via his company, for paid support
- Fork/patch SQLite in your own code (the license allows it), but then going forward, that's a PITA.
- Avoid forking, and instead write your own Linux VFS, if as I suspect, the code you are changing is entirely VFS specific.
Not much else I can say, at this point. Good luck. And interesting use case, and nice work-around, BTW.
PS: Your last post formats badly, in the diff part. Please edit it, to use a triple-backquotes or perhaps a pre, so it formats OK. Use that Preview button.
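On the third option: the usual shim pattern copies the built-in "unix" VFS, renames it, and overrides only the entry points one needs. A hedged sketch using the documented sqlite3_vfs_find()/sqlite3_vfs_register() API ("unix-bucket" is a made-up name, and a real shim would also wrap the opened file's I/O methods with bucket-locked versions):
```c
#include <string.h>
#include <sqlite3.h>

static sqlite3_vfs bucketVfs;
static sqlite3_vfs *pUnixVfs;   /* the real "unix" VFS we delegate to */

static int bucketOpen(sqlite3_vfs *pVfs, const char *zName,
                      sqlite3_file *pFile, int flags, int *pOutFlags){
  /* Delegate to the real unix xOpen; this is where the inode
  ** bookkeeping (and hence the bucket locking) would be hooked in. */
  return pUnixVfs->xOpen(pUnixVfs, zName, pFile, flags, pOutFlags);
}

int register_bucket_vfs(void){
  pUnixVfs = sqlite3_vfs_find("unix");
  if( pUnixVfs==0 ) return SQLITE_ERROR;
  bucketVfs = *pUnixVfs;            /* start from a copy of "unix" */
  bucketVfs.zName = "unix-bucket";  /* hypothetical name */
  bucketVfs.xOpen = bucketOpen;
  return sqlite3_vfs_register(&bucketVfs, 0 /* not the default VFS */);
}
```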
(10) By zhongren on 2022-10-23 17:57:09 in reply to 8
I hope some SQLite developers are reading my posts.
In the code change I described above, the issue is not the linked-list operations; it is the expensive operations performed inside the critical section while the global UNIX lock is held. Only one thread is allowed in at a time, and those operations can be very slow when databases are over NFS.
If a hash table with per-bucket locks is used, then as long as two database files don't hash to the same bucket, their open and close operations never wait on each other.
Zhongren
(11) By Simon Slavin (slavin) on 2022-10-23 19:12:59 in reply to 10
One or more members of the development team read this forum. You don't seem to have suggested any solution to the problem; you're just pointing out that NFS locks are slow, and we knew that already.
You can implement your own MUTEX lock if you want. If you think you can make a MUTEX which works faster than the standard NFS one, you're welcome to try it. It would involve replacing some or all of
https://www3.sqlite.org/c3ref/mutex_alloc.html
https://www.sqlite.org/c3ref/mutex_methods.html
However, the standard suggestion (that connections which read and don't write can ignore MUTEXes, or implement them in a fast manner that doesn't really lock) doesn't really work: you need to think about the situation where a read-only connection tries to read a table or row while a read/write connection is changing it.
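For reference, the hook behind those two links is SQLITE_CONFIG_MUTEX, which installs an sqlite3_mutex_methods object before SQLite is initialized. A skeletal sketch backed by pthreads; a real implementation must also special-case the SQLITE_MUTEX_STATIC_* ids and make SQLITE_MUTEX_RECURSIVE mutexes recursive, and the held/notheld stubs here exist only to satisfy debug asserts:
```c
#include <pthread.h>
#include <stdlib.h>
#include <sqlite3.h>

/* SQLite leaves sqlite3_mutex opaque, so the implementation defines it. */
struct sqlite3_mutex { pthread_mutex_t m; };

static int myInit(void){ return SQLITE_OK; }
static int myEnd(void){ return SQLITE_OK; }
static sqlite3_mutex *myAlloc(int id){
  /* Sketch only: a real version returns preallocated shared objects
  ** for the SQLITE_MUTEX_STATIC_* ids instead of a fresh mutex. */
  sqlite3_mutex *p = malloc(sizeof(*p));
  (void)id;
  if( p ) pthread_mutex_init(&p->m, 0);
  return p;
}
static void myFree(sqlite3_mutex *p){ pthread_mutex_destroy(&p->m); free(p); }
static void myEnter(sqlite3_mutex *p){ pthread_mutex_lock(&p->m); }
static int myTry(sqlite3_mutex *p){
  return pthread_mutex_trylock(&p->m)==0 ? SQLITE_OK : SQLITE_BUSY;
}
static void myLeave(sqlite3_mutex *p){ pthread_mutex_unlock(&p->m); }
static int myHeld(sqlite3_mutex *p){ (void)p; return 1; }  /* debug stub */

int install_my_mutexes(void){
  static const sqlite3_mutex_methods methods = {
    myInit, myEnd, myAlloc, myFree, myEnter, myTry, myLeave,
    myHeld, myHeld
  };
  /* Must run before sqlite3_initialize() or any other SQLite call. */
  return sqlite3_config(SQLITE_CONFIG_MUTEX, &methods);
}
```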
(12) By zhongren on 2022-10-24 01:30:32 in reply to 11
Thank you for the reply.
From your reply, it seems we are talking about two different things.
It is not about whether mutex or NFS file locking is fast or slow. The issue is that the global UNIX mutex protects a big chunk of code in the database open and close calls, and that chunk contains expensive operations: system calls such as stat/fstat, malloc/free, a nested mutex lock on the inode, and file locking over NFS (if databases are over NFS).
Since it is a global mutex, only one thread is allowed in at a time, and since the critical section is long in latency, the wait time can be long. If a program tries to open 10 different databases in 10 threads at the same time, there is no reason why they have to be affected by the global mutex, or at least by the restriction of a single thread over a big chunk of code.
Yes, UNIX file inode lookup needs to be protected by a mutex, but a global mutex is not needed. If one thread tries to open database file-a and another thread tries to open database file-b, they do inode lookups on two different database files, so they can execute that big chunk of code concurrently. This can be done with a hash table (of inodes) with per-bucket locks.
I believe that a SQLite developer who understands the mutex locking code in that area could produce an easy fix.
Thanks, Zhongren
(14) By Paul Stuart (pstuart) on 2022-10-24 19:10:54 in reply to 12
Maybe reconsider how you work with the databases, rather than optimizing how you use them now? Perhaps consider some form of file versioning or a "checkout" system?
A common developer trap is to keep solving for your initial solution rather than for the original problem at hand.
That might not be possible here, but it may be worth revisiting if your current approach doesn't work out.
(5) By Stephan Beal (stephan) on 2022-10-23 10:25:54 in reply to 1
> PS. Not sure how to attach my test program and patch to remove the big UNIX mutex lock in SQLite.

The forum does not support attachments. The preferred approach is that "small" data are pasted directly into forum posts and larger data are posted on an external site (e.g. Google Drive or Dropbox) and linked here.
Regarding patches, please see:
https://www.sqlite.org/copyright.html
in particular the last section of the page.