SQLite User Forum

Best way to retrieve enumerated "vectors"

(1) By nugend on 2022-05-09 18:55:16 [link] [source]

Hello,

I have a bit of an efficiency question and was wondering if anyone had suggestions on which approach would be better for large numbers of values.

I have a table schema as follows:

CREATE TABLE enums(
  name TEXT, 
  symbol TEXT, 
  idx INTEGER, 
  created_at DATETIME DEFAULT (unixepoch(CURRENT_TIMESTAMP)),
  PRIMARY KEY (name, symbol)
);

I want to use the database in order to track the enumerated values for each symbol. The idx field is managed by an AFTER trigger and I don't believe its value is relevant to the question. The name field will be low cardinality and the symbol field will be very high cardinality (hundreds of millions in the larger cases).
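
(For context, the trigger is roughly of this shape. This is a hypothetical sketch rather than our actual trigger, with created_at omitted for brevity:)

import sqlite3

# Hypothetical sketch only; the real trigger may differ and its details
# are not important to the question.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE enums(
  name TEXT,
  symbol TEXT,
  idx INTEGER,
  PRIMARY KEY (name, symbol)
);
CREATE TRIGGER enums_ai AFTER INSERT ON enums BEGIN
  UPDATE enums
     SET idx = (SELECT coalesce(max(idx), -1) + 1
                  FROM enums WHERE name = new.name)
   WHERE rowid = new.rowid;
END;
""")
db.execute("INSERT INTO enums(name, symbol) VALUES ('color', 'red')")
db.execute("INSERT INTO enums(name, symbol) VALUES ('color', 'green')")
print(db.execute("SELECT symbol, idx FROM enums").fetchall())
# [('red', 0), ('green', 1)]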

Basically, I want to retrieve the list of symbols for each name in the insertion order. The result will be iterated upon in the client code in order to intern the enumeration values in-memory.

Obviously, the straightforward way to do this is as follows:

select name, symbol from enums ORDER BY ROWID;

And simply iterate through the results with the API.
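
For example, in a Python client (a minimal sketch; our actual client may differ):

import sqlite3
from collections import defaultdict

db = sqlite3.connect("enums.db")   # hypothetical database file
interned = defaultdict(list)       # name -> symbols, in insertion order
for name, symbol in db.execute("SELECT name, symbol FROM enums ORDER BY ROWID"):
    interned[name].append(symbol)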

But I was wondering whether that is necessarily the most efficient way to operate, since it involves a large number of SQLite API calls when an enum has many millions of rows, and splitting on a specific character is quite fast (presuming, for example, that the unit separator is not used in the symbols, which should be the case):

select name, group_concat(symbol,char(31)) from enums GROUP BY name ORDER BY ROWID;
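
The client-side work is then one split per name instead of one API call per row; continuing the Python sketch above (FS is an assumed name for the separator constant):

# One row comes back per name; split on the unit separator client-side.
FS = chr(31)
sql = "SELECT name, group_concat(symbol, char(31)) FROM enums GROUP BY name ORDER BY ROWID"
for name, syms in db.execute(sql):
    interned[name] = syms.split(FS)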

I was also supposing that the following might be a good split between up-front allocation and iteration as well:

select name, floor(idx/100000) as batch, group_concat(symbol,char(31)) as syms FROM enums GROUP BY name, batch ORDER by ROWID;

The important semantic aspect of the result is that the symbols are interned in idx order per enumeration name. That should be the same as ROWID order, since this is an append-only table, and I believe it would be better to use ROWID for retrieval ordering, as the ROWID is going to be accessed regardless. (I realize this is a separate matter and maybe I should just be ordering by idx.)

Or would there be an alternate approach which would be obviously better? I'm not married to any idea, I was just looking for suggestions since the scale of the table can be fairly significant.

Additionally, perhaps the scale that I am suggesting indicates I shouldn't be attempting this with SQLite at all? The approach we have been using is a very simple packed file, but it has a number of issues we are hoping to address with some of the features SQLite makes available. I don't believe insertions should be an issue, since we would be careful to do very large batch insertions, and these are much less frequent than reads.
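
For what it's worth, the batched writes would look roughly like this, continuing the same hypothetical Python client (the data here is made up):

# One transaction around executemany keeps the per-row overhead low.
rows = [("color", f"sym{i}") for i in range(1_000_000)]   # made-up data
with db:   # a single transaction for the whole batch
    db.executemany("INSERT OR IGNORE INTO enums(name, symbol) VALUES (?, ?)", rows)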

Thank you, and I appreciate all criticism.

(2.1) By Keith Medcalf (kmedcalf) on 2022-05-09 19:33:09 edited from 2.0 in reply to 1 [link] [source]

Although the question is unclear, one would presume that you want the concatenated rows "in insertion order of name" and each concatenated string to "be in insertion order". This may be achieved as follows:

with enumname(name) as
( -- get all the distinct names in insert order
    select distinct name
      from enum
  order by rowid
)
select name,
       (
        select group_concat(symbol,char(31))
          from (
                  select symbol
                    from enum
                   where name == o.name
                order by rowid
               )
       )
  from enumname as o
;

or, without the CTE:

select name,
       (
        select group_concat(symbol,char(31))
          from (
                  select symbol
                    from enum
                   where name == o.name
                order by rowid
               )
       )
  from (
          select distinct name
            from enum
        order by rowid
       ) as o
;

Personally, I would just select name, symbol from enum order by rowid and process the returned information at the application level. This is especially true considering that the distance from the application to the data is measured in nanoseconds, as compared to the distance between the application and some other database server (which could be measured in seconds). That is, the turnaround time for each request is "immediate", not "gotta wait for the server on the other side of the planet to get around to sending me back the rows, one by each; since each turnaround takes so long, I need to diddle the data so there are not so many turnarounds".

I think you are trying to optimize away a problem that does not exist.

(3) By Keith Medcalf (kmedcalf) on 2022-05-09 19:58:54 in reply to 2.1 [link] [source]

You also need to consider the "massive" resources required to "diddle" the data inside the SQLite3 library and your application to "undiddle" the returned data.

This overhead will likely far exceed that of a straightforward solution.

(4) By nugend on 2022-05-09 20:00:43 in reply to 2.0 [link] [source]

Thanks! I somehow missed that the group_concat ordering was arbitrary even with an ordering clause. I suppose the above also isn't guaranteed should the implementation change to some extent?

I will probably do as you suggest. I thought that given the scale of the potential table sizes that maybe a little evil premature optimization would be worth asking about.

(5) By Keith Medcalf (kmedcalf) on 2022-05-09 21:52:54 in reply to 4 [link] [source]

Which performs better depends on your host language and what you are doing. Here is a simple comparison, written in Python, that builds a dictionary keyed by name, each value a tuple of symbols, all in insertion order.

import os
import sqlite3
import time

from collections import defaultdict
from random import randint, shuffle

if os.path.isfile('test.db'):
    os.unlink('test.db')

db = sqlite3.connect('test.db', isolation_level=None)

# Build our database
# nNames is the size of the  "names"  space
# nSymbol is the size of the "symbol" space
# nRows is the number of rows to create

nNames  =     100
nSymbol =  100000
nRows   = 5000000

fs = chr(31)   # unit separator, used as the group_concat delimiter

print(f"nNames = {nNames}; nSymbol = {nSymbol}; nRows = {nRows}")

# pool of letters to build random 8-character strings from
letters = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')

def randomlist(size):
    a = []
    for i in range(size):
        shuffle(letters)
        a.append(''.join(letters[:8]))
    return a

st = time.time()
nameList = randomlist(nNames)
symbolList = randomlist(nSymbol)

print(f"Random lists created in {round(time.time()-st,3)} seconds")

db.executescript("create table enum(name text not null, symbol text not null, primary key(name, symbol))")

st = time.time()
lNames = nNames - 1
lSymbols = nSymbol - 1
db.executescript('begin')
for i in range(nRows):
    db.execute('insert or ignore into enum values (?,?)', (nameList[randint(0, lNames)], symbolList[randint(0, lSymbols)]))
db.execute('commit')
print(f"Database Build Complete; {nRows} in {round(time.time()-st,3)} seconds")
print(f"Table enum contains {db.execute('select count() from enum').fetchone()[0]} rows")
print()

# retrieve the data one by each and build an in-memory structure

print("Building internal data structure method simple")

sql = """
  select name,
         symbol
    from enum
order by rowid
"""
data = None
st = time.time()
data = defaultdict(list)
for row in db.execute(sql):
    data[row[0]].append(row[1])
for key in data.keys():
    data[key] = tuple(data[key])
print(f"Internal Structure built in {round(time.time()-st, 3)} seconds")
print()
data = None

# retrieve the data in "diddled" form

print("Building internal data structure diddling the data so there is one row per name")

sql = """
select name,
       (
        select group_concat(symbol,char(31))
          from (
                  select symbol
                    from enum
                   where name == o.name
                order by rowid
               )
       ) as symbols
  from (
          select distinct name
            from enum
        order by rowid
       ) as o
"""
data = {}
st = time.time()
for row in db.execute(sql):
    data[row[0]] = tuple(row[1].split(fs))
print(f"Internal Structure built in {round(time.time()-st, 3)} seconds")
print()

from which the following result was obtained:

>fetchtest
nNames = 100; nSymbol = 100000; nRows = 5000000
Random lists created in 1.289 seconds
Database Build Complete; 5000000 in 31.081 seconds
Table enum contains 3933777 rows

Building internal data structure method simple
Internal Structure built in 4.025 seconds

Building internal data structure diddling the data so there is one row per name
Internal Structure built in 2.856 seconds

As can be seen, in this particular case the "retrieve diddled data" approach can be performed faster in an SQL query than the application "retrieving then diddling". This will be dependent on your application's data structure and language.

(6) By Keith Medcalf (kmedcalf) on 2022-05-09 23:34:06 in reply to 4 [source]

I somehow missed that the group_concat ordering was arbitrary even with an ordering clause. I suppose the above also isn't guaranteed should the implementation change to some extent?

group_concat ordering is not arbitrary. Values are concatenated in the order of presentation. This means that if you do not know the order of presentation, then the ordering appears (to the lay observer) to be arbitrary, even though it is not.
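
A tiny demonstration of presentation order, using a throwaway table:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("CREATE TABLE t(x); INSERT INTO t VALUES (3),(1),(2);")

# No ordering imposed: group_concat sees the rows in whatever order the
# chosen plan presents them (no promise what that is).
print(db.execute("SELECT group_concat(x) FROM t").fetchone()[0])

# Ordering imposed on the presentation: the inner ORDER BY fixes the order
# in which group_concat sees the rows, so this prints 3,2,1.
print(db.execute(
    "SELECT group_concat(x) FROM (SELECT x FROM t ORDER BY x DESC)"
).fetchone()[0])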

Similarly, select distinct ... returns the distinct rows as they are encountered in order of presentation (rows which have already been seen are skipped). That is, DISTINCT operates as a filter on the presented candidates, removing duplicates as they are encountered (this depends on the Query Planner, which will produce different output depending on what you ask for).

select distinct name from enum; will use a b-tree to produce the projection, since you have not specified an ordering (as if the statement were select name from enum group by name or select distinct name from enum order by name). Since you did not specify an ordering, the ordering resulting from the lowest-cost execution plan will be used. However, if you specify an ordering for the presentation of output to the projection, it changes the traversal order and thus the order of appearance of the results.
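
You can watch the planner make these choices with EXPLAIN QUERY PLAN. A sketch against the test database from above; the exact output varies by SQLite version:

# The point is to see whether the plan uses a temp b-tree for the DISTINCT
# and which traversal order it picks; output text differs across versions.
for row in db.execute("EXPLAIN QUERY PLAN SELECT DISTINCT name FROM enum"):
    print(row)
for row in db.execute(
        "EXPLAIN QUERY PLAN SELECT DISTINCT name FROM enum ORDER BY name"):
    print(row)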

(7) By nugend on 2022-05-10 01:34:18 in reply to 6 [link] [source]

I think I understand now. Because I didn't directly impose a presentation order on the symbols, SQLite chose to traverse the symbols in whatever order was lowest-cost. By imposing the presentation order in the sub-expression and correlating the rows retrieved to the group, we get a fully ordered result.

So, if I don't care about the order that the names are in, I can do the following:

select o.name, 
       (
        select group_concat(symbol,char(31))
          from ( 
                  select symbol 
                    from enum 
                   where name == o.name 
               order by rowid
               )
       ) 
 from enum as o 
 group by o.name

And while the symbols will be in rowid order within each group, SQLite will return the enumerations in whatever order it chooses.

(8) By Ryan Smith (cuz) on 2022-05-10 08:39:00 in reply to 7 [link] [source]

And while the symbols will be in rowid order within each group, SQLite will return the enumerations in whatever order it chooses.

Yes, which should still be in row-id order as long as the SQLite query planner sticks with the decision to visit the rows in row-id order. Of course, some future query planner, or perhaps adding another index, could change that decision.
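
If the name ordering actually matters, it would be safer to state it explicitly rather than rely on the current plan. A sketch along the lines of Keith's query (first_rowid is an introduced alias):

# Pin the name order explicitly instead of relying on the planner's
# current choice; assumes the same enum table as above.
sql = """
SELECT o.name,
       (SELECT group_concat(symbol, char(31))
          FROM (SELECT symbol
                  FROM enum
                 WHERE name == o.name
              ORDER BY rowid))
  FROM (SELECT name, min(rowid) AS first_rowid
          FROM enum
         GROUP BY name) AS o
 ORDER BY o.first_rowid
"""
for name, syms in db.execute(sql):
    ...   # split and intern as before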

A note on your original question: reshuffling the data, reshaping it, or handling it in any way requires work, whether that work happens inside your code or inside SQLite.

In general, it is faster for this additional work to happen in your own code (unless it is very primitive in nature), simply because SQLite's code has to cater for the general case, while your code can easily be optimized to your data specifically.

That said, some platforms, such as Python, JavaScript, etc., are just slow behemoths when measured against C, and in those cases the well-optimized C code inside SQLite would probably be faster, even if it technically does more work (as Keith demonstrated).

Takeaway: It's all a bit relative. Test it and see which is faster for you.