sqlite db design question
(1) By Eric (herdingcat) on 2022-04-26 04:17:31 [source]
Hi, I have around 160K items that need to be inserted into a SQLite db, and each item has 4 different item-related attributes. My initial plan is to create 4 columns to store those values. However, I could also create 4 rows per item and use only 1 column to store the attribute values. Which approach is more reasonable? If I do heavy querying to search the attributes, which one is more efficient? Thanks.
(2) By anonymous on 2022-04-26 11:17:44 in reply to 1 [link] [source]
Some queries will be more efficient one way. Some will be more efficient the other. So start with the queries you expect to make. Your most-frequent queries. Try writing them out, twice, once for each table layout. This should give you an idea of what is likely to work best.
(3) By Bill Wade (billwade) on 2022-04-26 11:20:12 in reply to 1 [link] [source]
Which is more efficient? To find out, test it with your "heavy query".
Some queries are just easier to write and understand if one item = one row.
CREATE TABLE items(itemid PRIMARY KEY, attributeA, attributeB, attributeC, attributeD);
SELECT max(attributeB) FROM items WHERE attributeC=17;
Compare that to the equivalent query (left as an exercise for the reader) for
CREATE TABLE attributes(itemid, attributeid, attributevalue, primary key (itemid, attributeid));
The first version will probably be smaller (N rows with five columns each, compared to 4N rows with three columns each), and it is often the case that smaller means faster.
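For reference, one possible answer to that exercise (a sketch, assuming attributeid values 'B' and 'C' identify the corresponding attributes):
SELECT max(attributevalue) FROM attributes
 WHERE attributeid = 'B'
   AND itemid IN (SELECT itemid FROM attributes
                   WHERE attributeid = 'C' AND attributevalue = 17);
The extra self-reference to the attributes table is part of why the one-row-per-item version is often easier to follow.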
(4) By JayKreibich (jkreibich) on 2022-04-26 14:37:15 in reply to 1 [link] [source]
Object-Attribute data is a tricky thing. The answer is heavily dependent on the nature of the attributes and how you plan on doing queries.
To back up a bit to square one: if you have a set of items that have generic attributes, such as arbitrary keywords, it seems normal to use a one-to-many/item-to-attribute set of tables (or, better yet, a many-to-many/item-to-keyword design that allows keywords to be reused). This is basically the second design you're talking about. The issue with this design is that it is often used for searches of the "give me all the objects with (at least) these keywords" type. This type of search, at its core, requires a "relational divide." As the name implies, it is basically an SQL JOIN in reverse: you start with a product of objects and attributes (one pair per row) and "divide by" the searched keywords, so that you're left with a set of objects.
Unfortunately, relational divide is the one core Relational Theory operation that is not directly supported by SQL. There are work-arounds, but they're ugly and often slow. I would strongly suggest looking it up in a book, as it is way too complex to explain here. I would suggest either "SQL for Smarties" by Joe Celko or "Introduction to SQL" by Rick F. van der Lans. IIRC, the first book has a whole chapter on exactly this problem.
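For a flavor of the work-arounds, here is a common counting trick for the "all of these keywords" search (a sketch, assuming a hypothetical item_keyword(itemid, keyword) linkage table):
SELECT itemid
  FROM item_keyword
 WHERE keyword IN ('red', 'large')
 GROUP BY itemid
HAVING count(DISTINCT keyword) = 2; -- item must match every searched keyword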
That's the super generic version of the problem. You can muck with the data representation to try to simplify the query. A common approach is putting all the keywords into a single column using JSON or comma-separated strings, and then doing wildcard searches. That has different problems. To keep queries efficient, you need to do stuff like sort the attributes. Most queries will still require a full table scan. It gets messy in other ways.
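As a sketch of that wildcard approach (assuming a hypothetical keywords column holding a comma-separated, sorted keyword list):
-- Wrap both sides in commas so 'red' does not also match 'bored' or 'redwood'
SELECT * FROM items WHERE ',' || keywords || ',' LIKE '%,red,%';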
That's the problem when dealing with generic attributes. The good news is that if you don't have generic attributes, the situation gets much simpler.
You said you have four attributes. If those attributes are fundamental to the data items, in that the attributes' nature and value domains are extremely unlikely to change, along with the number of possible attributes, I would go with a single table, with each attribute having a column. Keywords for an image are neither fundamental nor fixed in number, but something like the weight, size, and color of an object might be. Regardless, fixed, ordered data is much easier to deal with, and once the data is in columns it is super easy and straightforward to query. The danger is if you have five attributes next month, and sixteen attributes in a year.
The specific advice I can offer in this case is to look at the fundamental nature of these attributes. Are they generic "attributes," or are they fixed "auxiliary" or "detail" data? That should guide your data representation and storage format. Make sure you understand the fundamental nature of the data you're trying to store; this is where most designers screw up. I could go on for a long time here about Relational Theory, which offers some really interesting insights into information theory, but there are literally whole classes taught on that kind of thing.
The more general advice I can offer for database design questions is to look at the data flows. How is the data coming in? How is it going out? What queries do you need to write for this data? Very often, the data design is influenced heavily by how the data will be inserted, updated, and queried. Think about use-cases for the data in your application and environment, and make sure you can write those queries. Going through that process will often shed light on what design choices will work best with your data, in your application, with your use cases.
-j
(5) By Eric (herdingcat) on 2022-04-26 18:48:52 in reply to 4 [link] [source]
Thanks Bill and Jay. And yes, those attributes are fundamental to the data items; I decided to use integers to mark the values, like 0, 1, 2, 3, etc.
My query is more like: for attrA and attrC, or whatever combination, what are the items where attrX == 3 and/or attrY == 0? For this query pattern, would one single table with 4 attribute columns be better? And yes, it's possible I will add more attributes later, but my use case is kind of special: the data is repopulated every day, which means a new table is created every time, so old data really does not matter. So I assume changing the database schema won't be a heavy task(?). Once the data is committed, there is no change to the data until a new batch comes in, which rotates the old db and creates a new one.
BTW, will I get any benefit if I use FTS for this query pattern?
Thanks.
(6.1) By Keith Medcalf (kmedcalf) on 2022-04-26 19:39:14 edited from 6.0 in reply to 5 [link] [source]
Once upon a time I did a combination of both in order to create a rapidly queryable store of tag dumps from a Honeywell control system. Basically, the "backup" procedure dumped the configuration of each tag as a text file of applicable attributes (i.e., attribute: value).
The database was built to contain each 'tag' and its associated 'text dump', together with a table of attribute references (i.e., tag / attribute / value).
You could then do things like query for the list of tags (and configuration) for all tags that had specific attribute values thusly:
select * from mastertags where tagid in (
select tagid from attrs where name='blah' and value='a'
intersect
select tagid from attrs where name='blink' and value='b'
intersect
select tagid from attrs where name='broke' and value!='c'
);
The attrs table was in fact a view over the linkage table, because the 'attribute names' and 'attribute values' were highly duplicated. I.e.,
create table mastertags
(
tagid integer primary key, ...
);
create table masterattrs
(
attrid integer primary key,
name text collate nocase unique
);
create table mastervalues
(
valueid integer primary key,
value text collate nocase unique
);
create table linkage
(
tagid integer not null references mastertags,
attrid integer not null references masterattrs,
valueid integer not null references mastervalues,
primary key(tagid, attrid, valueid),
unique (attrid, valueid, tagid),
unique(valueid, attrid, tagid)
) without rowid;
create view attrs
as select tagid, name, value
from linkage, masterattrs, mastervalues
where linkage.attrid == masterattrs.attrid
and linkage.valueid == mastervalues.valueid
;
The cost of building the tables was paid at load time. Queries could be very complicated and took only microseconds to execute, because the query planner (once you have run ANALYZE) can generate a very efficient retrieval plan.
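For illustration, loading one tag/attribute/value triple into this deduplicated schema might look like the following (a sketch; tag 42 and the 'blah'/'a' values are hypothetical, and the tag row is assumed to already exist in mastertags):
insert or ignore into masterattrs(name) values ('blah');
insert or ignore into mastervalues(value) values ('a');
insert or ignore into linkage(tagid, attrid, valueid)
select 42, a.attrid, v.valueid
  from masterattrs a, mastervalues v
 where a.name = 'blah'
   and v.value = 'a';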
Edited to correct linkage definition and formatting
(7) By David Jones (vman59) on 2022-04-26 22:20:32 in reply to 4 [link] [source]
My county's board of elections makes available a file that combines voter information and voting history. It's a tab-delimited text file where each record represents a unique voter: 32 fields describing the voter (name, address, precinct, etc.) and ~160 fields holding the voting history (2 or 3 fields for each primary, special, or general election over the last 20 years). As a single table, it's pretty sparse, since people move around and few vote in primaries and special elections.
I ended up changing my import program to populate 2 tables:
- a voter table, where each row holds the voter profile; one of the columns holds a unique persistent ID number assigned by the county.
- a history table, where each row is the consolidated activity data for a single voter (specified by ID number) and a single election (as long as it is not null).
On average, each voter record has about 8 associated history records. Using 2 tables also means I can update it from later voter files without having to add columns to a table.
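A minimal sketch of that two-table layout (hypothetical column names; the real profile has ~32 columns):
create table voter
(
  voter_id integer primary key, -- persistent ID assigned by the county
  name text,
  address text,
  precinct text
  -- ... remaining profile columns
);
create table history
(
  voter_id integer not null references voter,
  election text not null, -- which primary, special, or general election
  activity text,          -- consolidated activity data for that election
  primary key (voter_id, election)
);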
(8) By JayKreibich (jkreibich) on 2022-04-27 02:44:19 in reply to 5 [link] [source]
> For this query pattern, would one single table with 4 attribute columns be better?
I would approach it that way, yes. If you only have four attributes and they're unlikely to expand very much, it is pretty easy to just write a query that directly captures the filters and limits you want to put on the data.
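For example, a direct filter query over the four-column layout might look like this (a sketch; the column names and values echo the question above):
SELECT itemid FROM items WHERE attributeA = 3 AND attributeC = 0;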
> And yes, it's possible I will add more attributes later,
The issue with that is table size and query complexity. It isn't a big deal to add a unique attribute or two, but if you have 50 attributes, it might be time for a different approach. Hence the difference between structured data, such as a measurement, and unstructured data, such as a keyword association where the keyword pool can grow basically without bound.
> the data is repopulated every day, which means a new table is created every time, so old data really does not matter. So I assume changing the database schema won't be a heavy task(?)
That's a really excellent observation. If you don't have to worry too much about preserving the data long-term, it definitely makes it easier to modify and refactor the database schema. If you need to make really significant changes, you can likely run the new and old database systems side by side for a few days to verify the new system gives the same answers as the old one. That kind of thing is much harder to do if you have "living" data whose state must be preserved continuously.
> Once the data is committed, there is no change to the data until a new batch comes in, which rotates the old db and creates a new one.
Keith alluded to this, but if you have a "load once, query many" situation, especially with no updates, it can be worth looking at data models or designs that have a heavy load time/cost but very fast query structures. For example, building a few indexes to optimize specific query structures (don't go too crazy with indexes, however... make sure you understand how they work and how they can or cannot be used... they're not the magic bullet most people think they are; remember, a given query can generally only use one index per table, so putting an index on every column does not do what you think it does).
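As a sketch, using the single-table layout and column names from earlier in the thread, a query-specific index plus ANALYZE might look like:
-- Covers queries that filter on attributeC and read attributeB,
-- e.g. SELECT max(attributeB) FROM items WHERE attributeC=17;
CREATE INDEX items_c_b ON items(attributeC, attributeB);
ANALYZE;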
That said, if most of your data sets are under 200K items, it isn't, to be frank, really enough to be interesting. Unless you are trying to shave every last millisecond off query times, 200K rows just isn't that many. Bump up the page cache enough to hold the whole database in memory and you likely won't need to worry about structure too much. It would be a different story with a hundred million items.
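For example, one way to bump the page cache (the 64 MB figure is an arbitrary assumption; pick something at least as large as the database file):
-- A negative value is in KiB, so this requests roughly a 64 MB page cache
PRAGMA cache_size = -64000;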
> BTW, will I get any benefit if I use FTS for this query pattern?
For the stuff we've been talking about, no, not really. FTS is designed for document searches, like searching hundreds or thousands of web pages or similar large data sets, where you're looking for a handful of words in data blocks that contain hundreds of words. It is a somewhat heavyweight tool for searching large blocks of text, and I don't see how it would relate to anything you've talked about here.
-j
(9) By Eric (herdingcat) on 2022-04-29 07:17:25 in reply to 8 [link] [source]
Thanks. What if the data stored is a JSON payload; will FTS improve performance when searching for a keyword?