Feature Request - ORDER BY random(seed)

(1) By Denis TRUFFAUT (DenisTRUFFAUT) on 2023-01-11 01:51:55 [source]

Why a seed ?

Passing a seed to random() allows developers to sort by random(), while keeping a stable pagination, ensuring no result is shown twice.

Indeed, while the seed remains the same, the randomly generated order also remains the same.

Seeded random is useful to deliver consistent random results to scrolling customers (Infinite scroll).

Think about an image gallery.

Seeded random is fairly common among engines:

1. MySQL allows to seed rand():

SELECT * FROM your_table ORDER BY RAND(351);

https://dev.mysql.com/doc/refman/8.0/en/mathematical-functions.html#function_rand

https://stackoverflow.com/a/23706906/7776828

2. ElasticSearch allows to seed random():

curl -XGET 'localhost:9200/_search' -d '{
  "query": {
    "function_score" : {
      "query" : { "match_all": {} },
      "random_score" : { "seed" : 1376773391128418000 }
    }
  }
}';

https://github.com/elastic/elasticsearch/issues/1170#issuecomment-22819670

3. But SQLite does not :(

That said, SQLite implements a non-exposed seedable PRNG function:

void sqlite3_randomness(int N, void *P);

https://www.sqlite.org/c3ref/randomness.html

The exposed random() function just needs to accept a seed.

This feature does not seem too hard to implement, and would bring great value !

Thanks,

Denis TRUFFAUT

https://twitter.com/DenisTRUFFAUT

https://www.linkedin.com/in/denistruffaut/

(2.1) By Aask (AAsk1902) on 2023-01-11 07:22:24 edited from 2.0 in reply to 1 [link] [source]

Deleted

(3) By Keith Medcalf (kmedcalf) on 2023-01-11 12:56:11 in reply to 1 [link] [source]

The exposed random() function just needs to accept a seed.

It needs to do more than that. If random() returned the next random number generated from the PRNG, and random(seed) applied the seed value to the PRNG, then ORDER BY random(seed) would be moot (that is, it always returns the same value, so no ordering is done).

random(seed) would need to keep track of its context and only re-seed the PRNG once per context. How to handle multiple uses of random in one statement would be problematic -- do you have two PRNG's, each seeded separately? How do you control the seed (ie, if you do select random() from wholenumber where value between 1 and 1000 order by random(345) how do you apply the seed when the first random() call is not seeded).

The whole thing sounds somewhat ill-conceived from the get-go.

(5) By Denis TRUFFAUT (DenisTRUFFAUT) on 2023-01-11 17:53:38 in reply to 3 [link] [source]

Yes.

If the random() function is now able to accept a seed, it supposes every random() call in the query defines its own PRNG context.

Several calls cannot reference the same PRNG context.

So if you have an edge case query like:

SELECT RANDOM() ORDER BY RANDOM(seed)

Then:

The first random() references to PRNG context #1 (Auto Seed)
The second random(seed) references to PRNG context #2 (Custom Seed)

In practice, you face N + 1 functions, N being the number the number of custom seeds, and +1 being the auto seed (which is re-used on multiple random() calls without seed).

Clearly an edge case.

(4) By Richard Damon (RichardDamon) on 2023-01-11 15:57:56 in reply to 1 [link] [source]

One big issue is that SQL doesn't define the order the rows will be touched by the select clause, so even if every run generated the same sequence of numbers, they might still be assigned to the rows in a different order.

Yes, if you absolutely don't change the database, AT ALL, you should get the same order, but that is a fairly unusual case.

(6) By Denis TRUFFAUT (DenisTRUFFAUT) on 2023-01-11 18:08:08 in reply to 4 [link] [source]

Is it better if score is present in the SELECT clause ?

SELECT id, RAND(seed) as score FROM your_table ORDER BY score;

Or maybe a hash function is needed ?

SELECT id, SHA512(id + seed) as score FROM your_table ORDER BY score;

This way the score is a nicely distributed hash (gaussian, normal distribution), and the source of the hash depends on id and seed that are both stable and ensure idempotency over repeated calls.

I did not see any hash function in sqlite :/

I plan to run this kind of queries on Cloudflare D1, which uses SQLite under the hood.

(7) By Keith Medcalf (kmedcalf) on 2023-01-11 19:15:58 in reply to 6 [link] [source]

I did not see any hash function in sqlite :/

Yes, there are loadable extensions (which may be compiled in) that implement User Defined Functions that perform hashing functions.

SELECT id, SHA512(id + seed) as score FROM your_table ORDER BY score;

This is pretty useless. Assuming that id is the integer primary key and seed is some constant number, then you may as well use sin(id + seed), or just id+seed.

This whole concept is for what I cannot fathom. I suppose it implements the "endless scrolling" used on social drivel sites which once experienced, leads to the drivel site never being used again because it is totally useless for any purpose.

(8.1) By Denis TRUFFAUT (DenisTRUFFAUT) on 2023-01-11 21:17:01 edited from 8.0 in reply to 7 [link] [source]

UX Vision

Randomness ensures normal distribution (central limit theorem) which brings same chance for all record to appear.

Infinite scroll avoids pagination clicks (which require to target a link with a finger on mobile, which is more brain expensive rather than to scroll wherever on the screen).

It doesn't mean the user is completely inactive, as user can filter the results (WHERE).

Why SQL ?

The reason I need to use SQL rather than Search engines is:

The price : Search engines are extremely expensive. ElasticSearch starts at $90 per month which is huge when the products exposed to users cost less than $0.50. Low cost is required to target low income countries. Google Custom Search API is even more, $5 per 1000 queries. Meilisearch is off the charts with $1200 per month for 10M searches. There is currently no real affordable MVP scalable solution to start a business, excepted a plain old SQL engine.
Exclusion : I want to filter products already purchased from the results. ElasticSearch has a limit of 1024 items in arrays. All SQL engines have limits for enumerations queries such :

WHERE id IN (a,b,c,...etc.)

The only way I see to bypass such limits is to reference an external table, using a subquery :

WHERE product_id NOT IN (SELECT purchased_id from purchase WHERE purchaser_id = ${userId})

Search engines don't support subqueries.

Of course by using SQL I have to forgot the natural language search features (i18n), which is sad, but with the help of some static filters, the infinite scroll can be customized so it seems acceptable for a MVP.

What better/different approach would you take ?

SIN() is great !

id+seed

Sadly, simple operators like + - * / don't work, as the order of id is kept : the order is not random :(

sin(id + seed)

I tested sin(id + seed) with a small JS simulation of 1000 records, and it seems to perform the job as expected :)

https://stackoverflow.com/a/75089040/7776828

Thanks !