SQLite Forum

SQLite3.exe fails to handle huge multi-line-comment

(1) By MBL (RoboManni) on 2021-09-08 09:05:02 [link] [source]

What happened

I generated a text file which starts with a huge multi-line comment of about 258,000 lines and is followed by 46,000 SQL statements to execute.


>sqlite3.exe version
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
>sqlite3.exe TagList-ua.sqb ".read sql-command.sql"

This command (Windows OS) did not finish within 15 minutes before I aborted it.


When I delete the huge comment block from the .sql file and reduce the comment block to e.g. 15 lines, then the SQL statements, which are framed by a transaction begin and end, complete within 1 second.

With 16,000 lines of multi-line comment the execution still took a few seconds, but at least it did the intended job.


Skipping multi-line comments is not performant and can almost inhibit their use.


Why does processing the /* multi-line-comment */ take soooo much time? Can this be improved with the next SQLite3 version, 3.37?

(2.1) By Larry Brasfield (larrybr) on 2021-09-08 13:15:17 edited from 2.0 in reply to 1 [link] [source]


Why does processing the /* multi-line-comment */ take soooo much time?

The most slowish (O n^2) thing happening with your extreme input is that, for almost every input line of your humongous comment, a realloc is being done for the portion of the SQL collected to that point.

Can this be improved with the next SQLite3 version, 3.37?

I doubt that the O n^2 behavior will change by then. However, if you were to find a line in shell.c reading nAlloc = nSql+nLine+100; and change it to nAlloc = 3*(nSql+nLine)/2+100; you should see a noticeable improvement. This gets to O n*log(n) behavior. Please report back if this solves your problem.

Edited to add: With above change, it's still O n^2. Dunno why, yet.

(3) By Richard Hipp (drh) on 2021-09-08 14:00:35 in reply to 1 [link] [source]

I think the following shell script demonstrates the problem:

cat >script.tcl <<\EOF
  puts {/*}
  for {set i 0} {$i<10000} {incr i} {puts "SELECT null;"}
  puts {*/}
  puts {SELECT 1, 2, 3;}
EOF
tclsh script.tcl | time sqlite3

The TCL script generates 10000 lines of comment prior to a single SELECT statement. Each comment line ends with ";".

What I think is going wrong is this: The sqlite3 shell reads its input line by line. As each line is input, it asks "do I have a complete SQL statement, or do I need to wait for more?" To implement this test, it first checks to see if the input ended with ";". If it did, then it also calls "sqlite3_complete()" to see whether the input really is complete, or if the final ";" was perhaps in the middle of a string literal or CREATE TRIGGER statement or within a comment. Only if sqlite3_complete() returns true does the sqlite3 shell try to process the input.

So what is happening here is that because each line of the input comment ends with ";", sqlite3_complete() gets called over and over again, each time on successively larger inputs.

(4) By MBL (RoboManni) on 2021-09-08 18:43:22 in reply to 3 [link] [source]

My comment block did NOT contain any ; semicolon (I searched for it), but only text, numbers, commas and colons, and a few other special symbols which are used in paths and cmd commands.

(5) By Larry Brasfield (larrybr) on 2021-09-08 19:28:47 in reply to 2.1 [link] [source]

With above change, it's still O n^2. Dunno why, yet.

My experiment was with non-semicolon-terminated comments. So, regarding Richard's post, (about calls to line_contains_semicolon() contributing to the O n^2 behavior), that clearly makes a difference, but less than one might imagine.

To convert the reallocation calls to O log(n), my favored computation is: nAlloc = nSql+(nSql>>1)+nLine+100; . With that in place, the execution time remains O n^2 whether the commented lines end with ';' or not. With my test [a], a screen-scrape of which is:

  > timer `big_comment 20000 "select null;" | sqlite3`
  ... Elapsed: 0:00:04.422
  > timer `big_comment 40000 "select null;" | sqlite3`
  ... Elapsed: 0:00:17.143
  > timer `big_comment 20000 "select null." | sqlite3`
  ... Elapsed: 0:00:02.619
  > timer `big_comment 40000 "select null." | sqlite3`
  ... Elapsed: 0:00:09.429

I see that the O scale factor changes by about 2x with trailing ';', but the elapsed time clearly varies with n^2.

I intend to investigate why execution time remains O n^2 when I can see no code path explaining such growth. Curiously, the realloc() calls play a very small role. More later.

a. The big_comment script reads:

#!/usr/bin/perl
if (@ARGV < 2){
  print stderr "Args are comment_line_count and comment_line_content.\n";
  exit 1;
}
$nc = shift(@ARGV) + 0;
$sql = shift(@ARGV);
print <<'_';
.print Starting big SQL comment ...
/*
_
while ($nc-- > 0){
  print $sql, "\n";
}
print <<'_';
*/;
.print End big SQL comment and run.
.q
_

(6) By RandomCoder on 2021-09-08 20:02:40 in reply to 5 [link] [source]

Looks like it's this bit in shell.c:

        else if (nSql && _all_whitespace(zSql)) {
            if (ShellHasFlag(p, SHFLG_Echo)) printf("%s\n", zSql);
            nSql = 0;

As zSql is built up line by line during the comment, _all_whitespace works over the string character by character. Pretty much the poster child for an O n^2 loop.

Not sure the best way to solve it. Commenting out this bit prevents the slowdown, but also prevents the printf call, of course.

(7) By Larry Brasfield (larrybr) on 2021-09-08 20:58:49 in reply to 6 [link] [source]

Yes, that's O n^2 expensive when the accumulation is not all whitespace. And it is easily fixed, as a check-in will soon show. Thanks for spotting this.

As for the O n^2 response to block-commented, ';'-terminated lines, that's a harder problem which I do not see as solvable in the 3.37 release timeframe.

(8) By Keith Medcalf (kmedcalf) on 2021-09-08 21:53:03 in reply to 1 [link] [source]

The answer to the question "Doctor, Doctor, it hurts when I do this" is "Do not do that".

Have you considered that the SQL comment introducer is --?

Does the problem "go away" if you use proper SQL style comments (that is, do not do that anymore)?

Try using a good text editor to prepend the comment introducer -- to each line of your commentary and see if that solves your problem.

(9) By Larry Brasfield (larrybr) on 2021-09-08 22:01:52 in reply to 1 [link] [source]

Can this be improved with next SQLite3 version 3.37 ?

For the case where few of the block-commented lines are semicolon-terminated, the CLI code now on trunk is quite fast. (about 1.6e6 40-char lines/second on my Gen 7 Core i7 machine)

Avoiding the O n^2 issue that Richard mentioned is a harder nut to crack, unlikely to be done soon.

(10) By Keith Medcalf (kmedcalf) on 2021-09-08 22:14:51 in reply to 9 [link] [source]

This works, but it should be noted that it is still 33% slower (at best) than using the correct comment introducer. Using the "in-line comment" capability for whole-line or block comments is ill-conceived from the get-go.

(11) By Larry Brasfield (larrybr) on 2021-09-08 22:22:35 in reply to 10 [link] [source]

No argument with your timing claim.

I'm sure you have enjoyed the convenience of disabling code blocks by inserting block-comment delimiters at the boundaries. That is the reason this was worth a near-costless (in code space and execution time) remedy, and why I was somewhat sympathetic to the OP's plight.

(12) By Keith Medcalf (kmedcalf) on 2021-09-08 22:38:49 in reply to 11 [link] [source]

Well, actually, no, I have not enjoyed using the wrong comment indicators for any purpose because I do not do so.

The purpose of /* this is an inline comment */ is to insert comments in-line.

The purpose of -- is to say: the remainder of the line is useless to the computer and should be ignored.

So for example, if I want to comment out AN ENTIRE LINE OR THE REMAINDER OF AN ENTIRE LINE, then I use the comment indicator designed for that use, namely "--". If I wish to comment out something in-line, then I use the inline comment designators "/* comment */".

The only place that mixed use of comment indicators does not matter is when a pre-processor removes all the "crud and crappola" before passing the input to the processor (as with a C compiler, for example). Otherwise, discarding the "crud and crappola" takes a long time. Choose the correct comment introducer: "this is a comment IN LINE with a valid command that should be processed" vs "the remainder of the line can be discarded because it is meaningless". Making the wrong choice should have the same deleterious effect on the chooser as any other ill-conceived choice.

(13) By Scott Robison (casaderobison) on 2021-09-09 01:19:28 in reply to 10 [link] [source]

33% slower? What is the basis for that claim? I must be missing something beyond a pathological case.

(14) By MBL (RoboManni) on 2021-09-09 11:06:23 in reply to 8 [link] [source]

The solution, as mentioned above, is to write a preprocessor and pipe the output through it before it becomes input to sqlite3.exe, and/or to write a wrapper which scans the .sql file to remove all the multi-line comments. As I wrote at the beginning, the sql file was generated. Since the generator is under my control, I will try to work around the problem at the source (as explained above as my solution).

Regarding the explanation from Richard: I was thinking that each line is read and checked before being added to the realloc'd buffer. (That's what Command Line Interpreter means to me.) Once a multi-line comment has been started, and if the next line does not contain the closing */ sequence, then simply do NOT add that line to the realloc'd buffer. I now understand that the issue lies in the unnecessary buffering, which is kept so that the SQLite core can later throw out the comment block. My suggestion would be to avoid adding unneeded lines to that buffer; then, while the comment continues on each next line, the buffer would not grow.

There are some cases and combinations to consider:

  • normal line, not collecting the lines of a multi-line comment: add the line to the buffer
  • if a line starts with --, then such a line could be dropped from buffering
  • line containing a single-line comment (any /* behind -- is to be ignored): add this line to the buffer if -- does not start the line
  • line starting a multi-line comment (/* detected): add this line to the buffer and remember the state
  • while in the 'multi-line comment started' state, check the next line for */: if it is contained, add the line to the buffer; if not, drop the line
  • finally the buffer is handed to sqlite3_prepare, sqlite3_exec, etc. with no more than a few lines of comment

I hope my thinking and explanations are understandable enough.

(15) By Richard Hipp (drh) on 2021-09-09 12:26:59 in reply to 14 [link] [source]

My suggestion would be to avoid adding not required lines to that buffer

That's a lot of complication and potential for bugs for an obscure performance-optimization case. In particular the function that determines whether or not a line is required seems likely to be subtle and error-prone.

(16) By MBL (RoboManni) on 2021-09-09 13:11:02 in reply to 15 [link] [source]

Understandable and acceptable ... the SQL statements themselves could also contain strings which look like comment introducers ... quoted -- and /*

(17) By Scott Robison (casaderobison) on 2021-09-09 16:16:59 in reply to 14 [link] [source]

To add on to what DRH said, the point of the shell is not to be a tool in its own right (though it is a valuable tool in its own right), but as a way to feed input into the SQLite library for processing. As such, the ability to unconditionally preprocess commands before they get to SQLite would diminish the ability to feed arbitrary input to the shell to pass on to SQLite for testing and processing. Some of that is unavoidable in as much as the shell already processes certain commands itself (the dot commands). But the rest is intended for SQLite to process and report outcomes.

I agree it would be nice to have the functionality you suggest if it were not at odds with the purpose of the project. The fact that it is error prone and difficult to get just right is of course another very good reason to not re-create code to strip comments.

(18) By MBL (RoboManni) on 2021-09-09 17:18:18 in reply to 17 [link] [source]

Even if SQLite3.exe remains as it is, it is now good to know where "performance killers" might come from ... did you ever think of comments as being a reason?

At first I also suspected that something in my SQL statements might be bad when the whole thing did not finish within 15 minutes - but then I tried one framing transaction for all the individual inserts and it did not become significantly faster (this has usually helped me in the past) ... and then I reduced the text to find the bottleneck. Finally I was left with almost nothing but the remaining comment block. I did not expect the text to be ignored to be the reason for the performance drop.

Sorry for having bothered you with my exceptional finding; however, hopefully someone some day will find it helpful that it is mentioned here.

(19.1) By Scott Robison (casaderobison) on 2021-09-09 19:43:46 edited from 19.0 in reply to 18 [source]

Absolutely. I wasn't being critical, just suggesting another reason why it would be a less than optimum idea to modify the shell to parse the text of SQL statements.

I understand it was a surprising situation, and you didn't "bother me". This is a forum where people share ideas and exchange information. If I were of a mind to, I could say "sorry for having bothered you by trying to share a rationale you might not have considered". But that would be a rude thing for me to do, so I won't.

It is sentences like your last that lead people to not want to be helpful, because it becomes simply easier to ignore than to engage. For the second time in recent history, I'm going to assume that was not your intent, but wording like "sorry to have bothered you" comes off as sarcasm, as though there is no "sorry", and "my exceptional finding" reads as though you deserve high praise for reporting a finding that came from feeding pathological text into a library.

It is good to know this so that people can be aware of the issue in the future (though the fact that it never came up before this report suggests there aren't many people trying to use it this way; nevertheless, it is suboptimal that it behaves this way with unanticipated text). Even so, in almost every other circumstance it would be preferable to improve the performance of the shell program when such behavior is found.

I thought another perspective might have been helpful. There is no more or less to my comment.

(20) By Larry Brasfield (larrybr) on 2021-09-09 20:56:14 in reply to 17 [link] [source]

As I contemplated what "efficient" [a] pre-scanning would look like, I realized it would need to handle input such as this:

  create table " Travails of a Verbose Identifier Afficionado (Or, ""How I came to avoid obtuse names."")" (
    " When I learned to program, identifiers could contain up to 6 dark characters, with any amount of embedded whitespace.",
    " For larger programs, I sometimes kept a symbol table; There detailing what names variables would have if not for such limitations. /* Crucial, but too much work! */",
    " Now, in SQL(ite), identifiers may have virtually any length. (Hooray!)",
    " My main problem today is that complex queries no longer fit on a screen and so they are harder to read in their totality.",
    " -- (Funny looks from my coworkers are tomorrow's problem.)"
  );

This dampens my enthusiasm somewhat.

a. I put "efficient" in jest quotes because it is unclear what the measure of success should be.

(21) By Larry Brasfield (larrybr) on 2021-09-10 02:12:23 in reply to 16 [link] [source]

The new branch, speedy-cli, deals with the problems and challenges mentioned in this thread. It runs faster than the current shell for all cases readily at hand, including the gulping of millions of semicolon-terminated SQL lines within a block-comment. [a] As far as I can see, the O n^2 runtime behavior is gone.

Whether this branch will be merged to trunk remains to be seen. It's a lot of new and critical code in a tool used extensively for internal testing and widely used externally. So the perceived risk/benefit ratio may doom it.

Any problem reports against the branch will be very welcome.

a. For such input with 50-character statements, about 2e6 lines/second on my non-sloucher Core i7.