store dimensions that never change in normalized form or columns?
(1) By biodrool on 2022-10-18 16:51:23
Hi,
For some time I've debated this design choice with myself: if I'm writing code that talks to many read-only SQLite databases, where each db has its own set of unique dimensions, is it "better" for this dimensional data to be stored in normalized row form, so that schemas are consistent?
person table:

    name   salary  dim
    ed     5       1
    ed     5       2
    susie  9       3
    susie  9       4

dim table:

    id  name    value
    1   career  jouster
    2   hobby   running
    3   career  journalist
    4   hobby   poetry
or as separate columns (currently my preferred approach):
-- in "database 1" -- db-specific dims
create table person (name text, salary integer, hobby text, career text)
-- in "database 2" -- db-specific dims
create table person (name text, salary integer, language text, mood text)
The trouble with my preferred approach is that my code must dynamically build queries through relatively unsafe and inefficient string formatting, since whether person contains hobby or language information isn't known until we've established, at runtime, which database we're connecting to.
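For instance (a hypothetical pair of queries, using the column names from the schemas above), the same logical lookup needs different SQL text depending on the database:

-- against "database 1":
select name, salary from person where hobby = ?;

-- against "database 2":
select name, salary from person where language = ?;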
However, these databases are strictly read-only, so the chance of changes being made is 0%, and the normalized form therefore offers no advantage in making it easier to add more dimensions. Also, I still couldn't make as much use of cached statements, since I'd be querying dimensions with a variable number of predicates whenever I filter by an arbitrary set of dimensions, e.g. (dim.name = ? and dim.value = ?) or (dim.name = ? and dim.value = ?) ...
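To make that concrete, here is a sketch of such a query against the normalized schema above; with two filters it already needs two placeholder pairs, and the SQL text keeps growing with the number of dimensions filtered:

-- find people matching both (career = ?) and (hobby = ?)
select p.name, p.salary
from person p
join dim d on d.id = p.dim
where (d.name = ? and d.value = ?)
   or (d.name = ? and d.value = ?)
group by p.name, p.salary
having count(distinct d.name) = 2; -- both filters must match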
The only reason I'm frustrated by the column approach is that I prefer to use prepared statements as much as possible, so when I hit an awkward use case where I'm dynamically building queries from strings, I'd appreciate hearing from the experts what the best practices are for such cases.
(2) By Simon Slavin (slavin) on 2022-10-18 19:11:45 in reply to 1
My initial preference is for the dim table. The things you're specifying (hobby, mood) should not be stored in the schema; they are data and belong /in/ the tables. This design also allows you to merge all the tables together easily: just add a new column holding the name of the source database (see the sketch after the questions below). You may find this useful sometime in the future. However, some things about how the data is used may provide your answer.
What data do you need to read together?
If you read something about one person, do you usually need to know everything about them? If I need to read someone's hobby, do I probably also need to know their career? Or is it more likely that if I read one person's hobby I need to know the hobbies of everyone else in that table?
What's expensive in that system?
Processing? It runs on a portable device, or an embedded processor? Speed is unusually important?
Storage? The read-only databases are accessed across a network? The databases are stored on a tiny storage medium?
Documentation or ease of maintenance? Many people will read/approve/maintain your code, and must understand what it does very quickly?
Code size? Your code must work on many platforms, or is likely to be translated to another language soon?
You may look at these and say "none of the above", or come up with something I didn't list. But the answers may answer your question.
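Returning to the merge idea above, a hypothetical sketch (the file names and the merged_person table are made up for illustration); it works because the normalized schema is identical in every database:

attach 'database2.db' as db2;

create table merged_person as
select 'database1' as src_db, name, salary, dim from main.person
union all
select 'database2' as src_db, name, salary, dim from db2.person;
-- note: dim ids are per-database, so they would also need to be
-- remapped or qualified by src_db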
Another option
Just to annoy you, I'll give you another option. Had you considered storing the attributes in JSON format, all in one column? This would be flexible and fast for looking up everything about one person, but would make it harder to find everyone who has been a jouster.
create table person (name text, salary integer, attributes text);

'ed',    5, '{"career":"jouster","hobby":"running"}'
'susie', 9, '{"career":"journalist","hobby":"poetry"}'
You can't actually use the above for input, because SQLite stores JSON as text, so you need to change the quotes. But it might make it easier for you to think about the option.
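For what it's worth, SQLite's built-in JSON functions make this option directly queryable; a minimal sketch, with the quoting sorted out:

create table person (name text, salary integer, attributes text);
insert into person values
('ed',    5, '{"career":"jouster","hobby":"running"}'),
('susie', 9, '{"career":"journalist","hobby":"poetry"}');

-- everything about one person: read one row, then parse the JSON
select attributes from person where name = 'ed';

-- everyone who has been a jouster: a full scan with json_extract()
select name from person where json_extract(attributes, '$.career') = 'jouster';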
(3) By anonymous on 2022-10-18 20:11:04 in reply to 1
Regarding dynamic query building: maybe you shouldn't do that. I am surprised by how many developers tend to do this when, in reality, the number of queries they run against the database is quite limited.
Store your queries and reuse them, preferably in prepared form. This way you can even test them thoroughly before mixing them with your application code.
(4) By biodrool on 2022-10-25 16:54:47 in reply to 3
I would have to generate all combinations of where predicates, since an in expression can contain a variable number of ? parameters. So that is not a reliable option, unfortunately, which is why dynamic queries are needed. Otherwise I agree with you.
(5) By biodrool on 2022-10-25 16:58:38 in reply to 2
Unfortunately, even with my proposed dim table, I would still not be able to prepare statements, since I would use a dynamic list of variables in predicates containing in expressions. So that is why I didn't see any benefit to it. Yes, I read all of the dims all of the time, which is why I opted to denormalize them rather than having to continually join on dim to get the results.
Thanks for the suggestions.
(6) By biodrool on 2022-10-25 22:35:19 in reply to 2
Actually, I just realized one slick way of preparing statements with a variable in clause: insert the values into an in-memory temp table, then query it with where something in (select values from temp.values). That way I don't have my issue of a variable number of ? arguments to the prepared statement.
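A sketch of that approach, with hypothetical table and column names:

-- one-time setup per connection
create temp table filter_values (value text);

-- reuse a single prepared insert, run once per filter value
insert into filter_values values (?);

-- the main query is now fixed text, so it can be prepared once
select name, salary
from person
where hobby in (select value from filter_values);

-- clear the filters between uses
delete from filter_values;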