union vs left join for select
which query should i be using?
SELECT Name FROM Names WHERE NameID IN (SELECT DISTINCT NameID FROM AllNouns) UNION SELECT Name FROM Names WHERE NameID IN (SELECT DISTINCT NameID FROM AllVerbs)
SELECT Name FROM Names WHERE NameID IN (SELECT DISTINCT AllNouns.NameID FROM AllNouns LEFT JOIN AllVerbs ON AllNouns.FirstID = AllVerbs.FirstID AND AllNouns.NameID = AllVerbs.NameID AND AllNouns.SecondID = AllVerbs.SecondID)
These queries are different, so you should use the one that returns the results you want.
What is the actual problem you're trying to solve?
i need different name ids from both tables.
the problem is that when i union there might still be duplicates
UNION actually removes duplicates.
Can you show with some example data what you (not) want?
Run the queries with .eqp on and see which generates the most efficient plan.
Since you are using the DISTINCT qualifier I would suspect that you would have better results by having correct schema declarations (indexing and collation).
Once you have the correct schema then I would suspect that something like the following would be most efficient:
select name from ( select nameid from allnouns union select nameid from allverbs ) join names on name == nameid ;
which also happens to be the one that most closely expresses your unspoken problem definition (ie, it is what your other statements appear to be trying to accomplish) and gives the optimizer the widest latitude to generate an optimum plan.
In any case, running the queries against the data and/or reading the generated plan (via .eqp on or explain query plan) would be highly enlightening.
Well, the ON condition is typo'd. Sorry about that ...
select name from ( select nameid from allnouns union select nameid from allverbs ) as T join names on names.nameid == T.nameid ;
This is just the same as your first query but does it differently.
okay, this took 24s vs my 1st query 23s
Trillions of records or missing indexes? I would presume the latter.
around 150000 records. indexes do exist.
There is not much difference between the two queries when I run them with such a small amount of data, though the merge then lookup is somewhat faster than the lookup then merge. This is probably entirely due to the fact that doing a "union" (B-Tree sort/merge) is faster when working with simple integers than it is with longer text strings.
Interestingly, even if the output is sorted by name, the first query is faster than the second when passing the results back to an application than the second.
(Are you including the time taken to produce the output on the screen rather than the time to run the query? It takes about 19 seconds or so more to "output the data to the screen" on my computer, or an additional .4 seconds or so to pass the records back to an application (python) when not output to the screen.)
sqlite> .timer on sqlite> .eqp on sqlite> select count(name) ...> from ( ...> select nameid ...> from allnouns ...> union ...> select nameid ...> from allverbs ...> ) as T ...> join names ...> on names.nameid = T.nameid ...> ; QUERY PLAN |--MATERIALIZE 2 | `--COMPOUND QUERY | |--LEFT-MOST SUBQUERY | | `--SCAN TABLE allnouns (~1048576 rows) | `--UNION USING TEMP B-TREE | `--SCAN TABLE allverbs (~1048576 rows) |--SCAN SUBQUERY 2 AS T (~2097152 rows) `--SEARCH TABLE names USING INTEGER PRIMARY KEY (rowid=?) (~1 row) 94813 Run Time: real 0.165 user 0.156250 sys 0.000000 sqlite> sqlite> select count(name) ...> from ( ...> select name ...> from names ...> where nameid in ( ...> select distinct nameid ...> from allnouns ...> ) ...> union ...> select name ...> from names ...> where nameid in ( ...> select distinct nameid ...> from allverbs ...> ) ...> ) ...> ; QUERY PLAN |--CO-ROUTINE 4 | `--COMPOUND QUERY | |--LEFT-MOST SUBQUERY | | |--SEARCH TABLE names USING INTEGER PRIMARY KEY (rowid=?) (~24 rows) | | `--LIST SUBQUERY 1 | | `--SCAN TABLE allnouns USING COVERING INDEX an_nid (~1048576 rows) | `--UNION USING TEMP B-TREE | |--SEARCH TABLE names USING INTEGER PRIMARY KEY (rowid=?) (~24 rows) | `--LIST SUBQUERY 3 | `--SCAN TABLE allverbs USING COVERING INDEX av_nid (~1048576 rows) `--SCAN SUBQUERY 4 (~48 rows) 94813 Run Time: real 0.175 user 0.171875 sys 0.000000