union vs left join for select

(1) By bb8 (thebb8) on 2020-06-09 10:43:24 [link]

which query should i be using?

SELECT Name FROM Names WHERE NameID IN (SELECT DISTINCT NameID FROM AllNouns)
UNION
SELECT Name FROM Names WHERE NameID IN (SELECT DISTINCT NameID FROM AllVerbs)

versus

SELECT Name FROM Names WHERE NameID IN (SELECT DISTINCT AllNouns.NameID FROM AllNouns
LEFT JOIN AllVerbs ON
AllNouns.FirstID = AllVerbs.FirstID AND
AllNouns.NameID = AllVerbs.NameID AND
AllNouns.SecondID = AllVerbs.SecondID)

(2) By Clemens Ladisch (cladisch) on 2020-06-09 11:32:57 in reply to 1

These queries are different, so you should use the one that returns the results you want.

What is the actual problem you're trying to solve?

(3) By Keith Medcalf (kmedcalf) on 2020-06-09 11:45:23 in reply to 1 [link]

Run the queries with .eqp on and see which generates the most efficient plan.

Since you are using the DISTINCT qualifier I would suspect that you would have better results by having correct schema declarations (indexing and collation).

Once you have the correct schema then I would suspect that something like the following would be most efficient:

```
select name
  from (
        select nameid
          from allnouns
        union
        select nameid
          from allverbs
       )
  join names
    on name == nameid
;
```

which also happens to be the one that most closely expresses your unspoken problem definition (ie, it is what your other statements appear to be trying to accomplish) and gives the optimizer the widest latitude to generate an optimum plan.

In any case, running the queries against the data and/or reading the generated plan (via .eqp on or explain query plan) would be highly enlightening.

(4) By bb8 (thebb8) on 2020-06-09 12:14:16 in reply to 3 [link]

is the join correct?

(5) By bb8 (thebb8) on 2020-06-09 12:15:25 in reply to 2 [link]

i need different name ids from both tables.

the problem is that when i union there might still be duplicates

(6) By Clemens Ladisch (cladisch) on 2020-06-09 12:39:54 in reply to 5 [link]

UNION actually removes duplicates.

Can you show with some example data what you (not) want?

(7) By Keith Medcalf (kmedcalf) on 2020-06-09 12:53:08 in reply to 4 [link]

Well, the ON condition is typo'd.  Sorry about that ...

```
select name
  from (
         select nameid
           from allnouns
        union
         select nameid
           from allverbs
       ) as T
  join names
    on names.nameid == T.nameid
;
```

This is just the same as your first query but does it differently.

(8) By bb8 (thebb8) on 2020-06-09 13:07:34 in reply to 7 [link]

okay, this took 24s vs my 1st query 23s

(9) By Keith Medcalf (kmedcalf) on 2020-06-09 13:18:39 in reply to 8 [link]

Trillions of records or missing indexes?  I would presume the latter.

(10) By bb8 (thebb8) on 2020-06-09 15:30:40 in reply to 9 [link]

around 150000 records. indexes do exist.

(11) By Keith Medcalf (kmedcalf) on 2020-06-09 18:29:52 in reply to 10 [link]

There is not much difference between the two queries when I run them with such a small amount of data, though the merge then lookup is somewhat faster than the lookup then merge.  This is probably entirely due to the fact that doing a "union" (B-Tree sort/merge) is faster when working with simple integers than it is with longer text strings.  

Interestingly, even if the output is sorted by name, the first query is faster than the second when passing the results back to an application than the second.

(Are you including the time taken to produce the output on the screen rather than the time to run the query?  It takes about 19 seconds or so more to "output the data to the screen" on my computer, or an additional .4 seconds or so to pass the records back to an application (python) when not output to the screen.)

```
sqlite> .timer on
sqlite> .eqp on
sqlite> select count(name)
   ...>   from (
   ...>         select nameid
   ...>           from allnouns
   ...>         union
   ...>         select nameid
   ...>          from allverbs
   ...>        ) as T
   ...>   join names
   ...>     on names.nameid = T.nameid
   ...> ;
QUERY PLAN
|--MATERIALIZE 2
|  `--COMPOUND QUERY
|     |--LEFT-MOST SUBQUERY
|     |  `--SCAN TABLE allnouns (~1048576 rows)
|     `--UNION USING TEMP B-TREE
|        `--SCAN TABLE allverbs (~1048576 rows)
|--SCAN SUBQUERY 2 AS T (~2097152 rows)
`--SEARCH TABLE names USING INTEGER PRIMARY KEY (rowid=?) (~1 row)
94813
Run Time: real 0.165 user 0.156250 sys 0.000000
sqlite>
sqlite> select count(name)
   ...>   from (
   ...>         select name
   ...>           from names
   ...>          where nameid in (
   ...>                           select distinct nameid
   ...>                             from allnouns
   ...>                          )
   ...>        union
   ...>         select name
   ...>           from names
   ...>          where nameid in (
   ...>                           select distinct nameid
   ...>                             from allverbs
   ...>                          )
   ...>        )
   ...> ;
QUERY PLAN
|--CO-ROUTINE 4
|  `--COMPOUND QUERY
|     |--LEFT-MOST SUBQUERY
|     |  |--SEARCH TABLE names USING INTEGER PRIMARY KEY (rowid=?) (~24 rows)
|     |  `--LIST SUBQUERY 1
|     |     `--SCAN TABLE allnouns USING COVERING INDEX an_nid (~1048576 rows)
|     `--UNION USING TEMP B-TREE
|        |--SEARCH TABLE names USING INTEGER PRIMARY KEY (rowid=?) (~24 rows)
|        `--LIST SUBQUERY 3
|           `--SCAN TABLE allverbs USING COVERING INDEX av_nid (~1048576 rows)
`--SCAN SUBQUERY 4 (~48 rows)
94813
Run Time: real 0.175 user 0.171875 sys 0.000000
```