SQLite Forum

Slow SELECT after many inserts into different tables

(1) By anonymous on 2021-02-11 09:07:03

Hi all,

I am using sqlite3 with Python 3. When I insert rows one at a time into different tables and then select all columns, it is really slow. If I insert all rows at once, everything is fine. One strange thing: if I only add rows to one table after finishing all the rows for the other table, everything is also fine. So the slowdown only occurs when I interleave inserts to different tables within one transaction. One workaround is running VACUUM, but my original dataset is large enough that the VACUUM takes far too long. Here is the sample code:

import sqlite3
import os
import time

os.system('rm -f *.db')
conn1 = sqlite3.connect('db1.db', timeout=1000.0)
conn2 = sqlite3.connect('db2.db', timeout=1000.0)
conn3 = sqlite3.connect('db3.db', timeout=1000.0)
cursor1 = conn1.cursor()
cursor2 = conn2.cursor()
cursor3 = conn3.cursor()

v = 'a' * 10000                            # 10,000-character text payload
rows = [(str(i), v) for i in range(1000)]  # 1,000 rows per table

cursor1.execute('CREATE TABLE if not exists Test1(ID text, value text)')
cursor1.execute('CREATE TABLE if not exists Test2(ID text, value text)')
cursor1.execute('CREATE TABLE if not exists Test3(ID text, value text)')
cursor2.execute('CREATE TABLE if not exists Test1(ID text, value text)')
cursor2.execute('CREATE TABLE if not exists Test2(ID text, value text)')
cursor2.execute('CREATE TABLE if not exists Test3(ID text, value text)')
cursor3.execute('CREATE TABLE if not exists Test1(ID text, value text)')
cursor3.execute('CREATE TABLE if not exists Test2(ID text, value text)')
cursor3.execute('CREATE TABLE if not exists Test3(ID text, value text)')

################# db1: single-row inserts, interleaved across the three tables
n = 1
for i in range(len(rows)//n+1):
    cursor1.executemany('INSERT INTO Test1 VALUES(?,?);', rows[i*n:(i+1)*n])
    cursor1.executemany('INSERT INTO Test2 VALUES(?,?);', rows[i*n:(i+1)*n])
    cursor1.executemany('INSERT INTO Test3 VALUES(?,?);', rows[i*n:(i+1)*n])
conn1.commit()

################# db2: single-row inserts, but one table at a time
for i in range(len(rows)//n+1):
    cursor2.executemany('INSERT INTO Test1 VALUES(?,?);', rows[i*n:(i+1)*n])
for i in range(len(rows)//n+1):
    cursor2.executemany('INSERT INTO Test2 VALUES(?,?);', rows[i*n:(i+1)*n])
for i in range(len(rows)//n+1):
    cursor2.executemany('INSERT INTO Test3 VALUES(?,?);', rows[i*n:(i+1)*n])
conn2.commit()

################# db3: one executemany per table (all rows at once)
cursor3.executemany('INSERT INTO Test1 VALUES(?,?);', rows)
cursor3.executemany('INSERT INTO Test2 VALUES(?,?);', rows)
cursor3.executemany('INSERT INTO Test3 VALUES(?,?);', rows)
conn3.commit()

#cursor1.execute('VACUUM;')
#conn1.commit()
#cursor2.execute('VACUUM;')
#conn2.commit()
#cursor3.execute('VACUUM;')
#conn3.commit()

t1 = time.time()
cursor1.execute('SELECT * FROM Test1')
rows = cursor1.fetchall()
t2 = time.time()
print (t2-t1, len(rows))

t1 = time.time()
cursor2.execute('SELECT * FROM Test1')
rows = cursor2.fetchall()
t2 = time.time()
print (t2-t1, len(rows))

t1 = time.time()
cursor3.execute('SELECT * FROM Test1')
rows = cursor3.fetchall()
t2 = time.time()
print (t2-t1, len(rows))

Output:

5.287451267242432 1000
0.15462756156921387 1000
0.1925675868988037 1000

(2) By David Raymond (dvdraymond) on 2021-02-11 13:18:20 in reply to 1

My guess as to what's going on:

For db1 you're shuffling all the pages of the three tables together as you build the file. Since you deal the inserts out evenly between them, as the db file grows it's basically going to be laid out as: page for Test1, page for Test2, page for Test3, page for Test1, page for Test2, page for Test3, etc. So when you read everything from Test1, you need to read every third page over the entire length of the file.

When you do the inserts into db2 and db3, you're keeping each table together. So as the db file grows you have a big swath of pages all for Test1, then a big swath of pages all for Test2, then a big swath of pages all for Test3. So when you read all of Test1, you're basically reading contiguous pages in the first third of the file.

That's going to make a difference, especially if you have a spinning hard disk for the file. Whether that's the difference you're seeing I'll let the more knowledgeable folks reply.
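
If you want to check whether the pages really are interleaved like that, the dbstat virtual table will show which table owns each page. A rough sketch (this only works if the SQLite library was compiled with SQLITE_ENABLE_DBSTAT_VTAB, which the one bundled with Python may or may not have been):

import sqlite3

conn = sqlite3.connect('db1.db')
try:
    # dbstat has one row per page; ordering by pageno walks the file from
    # start to end and shows which table each page belongs to.
    query = "SELECT name, pageno FROM dbstat WHERE name LIKE 'Test%' ORDER BY pageno"
    for name, pageno in conn.execute(query):
        print(pageno, name)
except sqlite3.OperationalError as e:
    print('dbstat not available in this build:', e)
conn.close()

For db1 you'd expect the Test1/Test2/Test3 names to alternate; for db2 and db3 you'd expect long runs of the same name.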

(4) By anonymous on 2021-02-11 14:58:25 in reply to 2

Thank you for the explanation. I agree with your theory. In my application it is unavoidable both to insert rows into different tables and to shuffle the order of insertion. Is there any smart solution?

(8) By David Raymond (dvdraymond) on 2021-02-11 16:24:38 in reply to 4

Not sure. I did a couple of test runs on my own system and didn't get anywhere near as big a difference. So I definitely might be wrong about what's causing it.

(11) By Keith Medcalf (kmedcalf) on 2021-02-11 21:29:08 in reply to 2

I get the following output:

0.009074687957763672 1000
0.00870060920715332 1000
0.009019851684570312 1000

which are all less than half-a-tick, and all about the same as each other.

(12) By anonymous on 2021-02-12 08:10:37 in reply to 11

Did you run exactly the same code I attached?

(13) By Keith Medcalf (kmedcalf) on 2021-02-12 08:30:14 in reply to 12

Yes, other than changing your os.system('rm -f *.db') to:

try:
    os.unlink('db1.db')
except:
    pass
try:
    os.unlink('db2.db')
except:
    pass
try:
    os.unlink('db3.db')
except:
    pass
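
(For what it's worth, a closer portable equivalent of the original rm -f *.db would be a glob plus os.unlink loop, something like:)

import glob
import os

for f in glob.glob('*.db'):   # same effect as rm -f *.db
    try:
        os.unlink(f)
    except OSError:
        pass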

(14) By anonymous on 2021-02-12 13:25:19 in reply to 13

Do you have any guess why there is a huge gap between your and my results?

(3) By Larry Brasfield (larrybr) on 2021-02-11 14:45:30 in reply to 1

The main reason your scenario is slow is that you are forcing disk I/O at every insert. If you were to wrap your inserts into a transaction, they would go much faster. If that's too many at once, group them into multiple transactions, but keep the group size much greater than 1.

(5) By anonymous on 2021-02-11 15:01:03 in reply to 3

Thank you for the proposal. I am very unfamiliar with sqlite3, so it would be really helpful if you could provide a small code snippet or relevant references. Thank you in advance.

(6) By Larry Brasfield (larrybr) on 2021-02-11 15:40:30 in reply to 5

See Transaction, and links therein. Example:

BEGIN TRANSACTION;
INSERT INTO MyTable ...;
-- many more inserts
COMMIT TRANSACTION;
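
In Python, with the stdlib sqlite3 module, that might look roughly like this (a sketch only: isolation_level=None keeps the module from issuing its own BEGINs, and the batch size of 100 is just an example to tune):

import sqlite3

# same shape of data as in your script
rows = [(str(i), 'a' * 10000) for i in range(1000)]

conn = sqlite3.connect('db1.db', isolation_level=None)  # we manage transactions ourselves
cur = conn.cursor()
for t in ('Test1', 'Test2', 'Test3'):
    cur.execute('CREATE TABLE IF NOT EXISTS {}(ID text, value text)'.format(t))

BATCH = 100  # rows per transaction -- example value, tune for your workload
for start in range(0, len(rows), BATCH):
    chunk = rows[start:start + BATCH]
    cur.execute('BEGIN')
    cur.executemany('INSERT INTO Test1 VALUES(?,?)', chunk)
    cur.executemany('INSERT INTO Test2 VALUES(?,?)', chunk)
    cur.executemany('INSERT INTO Test3 VALUES(?,?)', chunk)
    cur.execute('COMMIT')
conn.close()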

Try it -- you'll like it!

(7) By David Raymond (dvdraymond) on 2021-02-11 15:45:57 in reply to 3

No, with the way he opened the connection it's doing implicit transactions. So once you start doing selects, inserts, etc you have an open transaction until you explicitly commit or rollback. So his inserts are fine. Also, the timing was only done on the select part of it, and the insert time wasn't included at all.

(9) By Larry Brasfield (larrybr) on 2021-02-11 16:35:21 in reply to 7

I have not done the research into Python to gainsay your claim, if that is something done by its .executemany() method. But if you refer to what SQLite does, you're mistaken.

It is true that every individual statement is wrapped into a transaction, absent an already-open transaction by the SQLite library. But the automatic transactions are over once the statement executes. Otherwise, COMMITs unpaired with BEGINs would have to appear in code. They do not, at least not without error.

(10) By David Raymond (dvdraymond) on 2021-02-11 16:42:31 in reply to 9

Python's default sqlite3 module is the part that does the implicit transaction stuff I'm talking about, yes.

I'm not referring to the SQLite C library/API/etc itself.
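
To make the distinction concrete, here is roughly how the stdlib module behaves (Python 3.6+; the connection's in_transaction attribute shows whether a transaction is currently open):

import sqlite3

# Default isolation_level: the module issues an implicit BEGIN before DML.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE t(x)')
print(conn.in_transaction)               # False: DDL does not open a transaction
cur.execute('INSERT INTO t VALUES (1)')
print(conn.in_transaction)               # True: implicit BEGIN was issued before the INSERT
conn.commit()
print(conn.in_transaction)               # False: the transaction ended at commit()
conn.close()

# isolation_level=None: no implicit BEGINs; the C library's plain autocommit
# applies, unless you issue BEGIN/COMMIT yourself.
conn = sqlite3.connect(':memory:', isolation_level=None)
cur = conn.cursor()
cur.execute('CREATE TABLE t(x)')
cur.execute('INSERT INTO t VALUES (1)')
print(conn.in_transaction)               # False: the INSERT autocommitted on its own
conn.close()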

(15.1) By Simon Slavin (slavin) on 2021-02-12 15:29:25 edited from 15.0 in reply to 1

Various posters in this thread have covered the time taken by the software very well. You also asked about the difference between the amount of time your computer took and the amount of time one of the responders' computers took.

SQLite is fast and efficient, and most of the time it takes is spent in storage access: reading and writing files. If two computers execute SQLite APIs at different speeds, the cause is usually hardware operating as designed. The time is taken up by, among other things,

  • speed of storage (hard disk rotation speed, spinning rust or SSD)
  • number and sizes of caches (application cache, OS cache, storage subsystem cache)
  • whether other apps are accessing storage, thus busting the caches
  • how much CPU the database app is able to use (what other processes are doing)
  • how efficient the OS is at handling the database app's requirements

As an example, a recent bug in Lenovo's System Update application for Windows, which runs in the background, caused it to use 6% of available CPU in a typical Windows installation the whole time the computer was on. You wouldn't find this problem by profiling your database program.

So don't worry if someone else's computer is faster than yours, unless your computer isn't fast enough to make your program useful; in that case, use a dedicated computer with a simple OS and storage setup.