How to rerank one row in a big set of rows?

(1) By Haujet Zhao (HaujetZhao) on 2021-11-25 06:05:22 [link] [source]

The situation

Suppose there is a big dataset (say 200,000 items).

Each item requires a specified unique rank (from 1 to 200,000)

Need to update some item's rank frequently.

Temporary solution and problem

Store all items in a table:

CREATE TABLE dataset (
    id     INTEGER PRIMARY KEY,
    item   TEXT
);

Each row stores one item, and make the id as it's rank.

But when updating one item's rank, say from m to n, I need to alter all the ids between 'm' and n. The time complexity is O(m-n).

Say the m is 100000 and the n is 100, then updating one item's rand requires updating 99900 ids, making that kind of rank update 1000 times requires updating 99900000 ids. That is insane.

Is there a more efficient way to organize this kind of dataset?

Each item requires a specified unique and continuous rank, which can be used for ordering.
The item's rank often need to update.

(2) By ddevienne on 2021-11-25 10:09:00 in reply to 1 [link] [source]

I don't have a solution to what you describe, but here are some thoughts.

If you make your rank the id PRIMARY KEY column, as an alias to the ROWID,
then updating the ranks means physically moving the rows around, rebalancing the BTREE,
which means a lot of IO. Very expensive.

While if the rank is a separate and explicit column of its own,
then updating it means updating the pages with actually modified rows only.

If you make queries by that rank column, then you need an index on it though.
So the index will need updating too, but then you don't need to move the item columns,
as when id and rank are one-and-the-same and the PK.

If you ranks change often, that's a problem indeed. You could switch to a floating point rank,
to insert in-the-middle, w/o disturbing surrounding rows, but that only works so many times,
you eventually run out of precision. But if you can detect those conflicts when they arise,
you've at least reduced the frequency of re-ranking the tail/rest of a rank-update.

A similar strategy is to split your rank in two, the primary and secondary.
The latter starts out at zero, and when re-ranking, you just update that one instead of the primary,
again limiting the number of rows that need updating on re-ranks. Now some rows share a primary rank
(after 1 or more reranking), but not a secondary one, and you need to index/order-by both ranks of course,
but again, it's a strategy to limit to number of rows affected by a rerank.

In both cases (REAL ranks or SPLIT ranks),
once in a while you probably want to pay for a full rerank to go back to INT-reals or ALL-0-sub-ranks.

I'm just thinking aloud here, I haven't tried any of the above. --DD

(3) By Haujet Zhao (HaujetZhao) on 2021-11-25 11:33:13 in reply to 2 [link] [source]

Float number rank is a kind of tricky approach, sorting may be slower but acceptable.

I also wonder if there is a bidirectional pointer approach of ranking items can be used in SQLite, so that each re-rank only changes a few pointers at the cost of some searching speed (basically it should have same sorting speed with float number rank method).

(4) By Adrian Ho (lexfiend) on 2021-11-25 12:31:55 in reply to 3 [link] [source]

I also wonder if there is a bidirectional pointer approach of ranking items can be used in SQLite

You mean like a doubly-linked list? That's orthogonal to maintaining a rank number, and almost certainly a 200,000-deep recursive CTE. If you're even considering this approach, it begs the question: Why did you start with "maintaining a strictly contiguous integer rank column between 1 and 200,000" as the appropriate solution?

If (as in many cases I've seen) you just need the top or bottom n rows in ranked order, then Dominique's floating-point rank suggestion would work just fine in the general case, because you'd be doing an ORDER BY rank LIMIT n regardless of the actual type of rank.

In any event, more information about your actual use case would help generate more appropriate solutions.

(9) By Haujet Zhao (HaujetZhao) on 2021-11-26 01:56:42 in reply to 4 [link] [source]

You are right, doubly-linked list. Without index, it does need 200,000-deep recursive following.

But with a index, let's say generate 1 index for each 1000 items. The 200,000 items then have 200 indexes. When locating item 5500, the program just need to find the index of 5000, getting the rowid, then go through a 500-deep path.

When altering one item's rank, only a few indexes need to be updated.

But indeed, the float number solution is the most implementable method so far.

(11) By anonymous on 2021-12-05 12:01:11 in reply to 9 [link] [source]

Joe Nelson (author of PostgREST) showed a few ways to do this in Postgres in a post titled "User-defined Order in SQL": https://begriffs.com/posts/2018-03-20-user-defined-order.html

I discovered this after I wrote a small library to use lexicographically-subdivided strings to handle this problem, but in JavaScript (readily portable) and intended to be run from the application layer: https://github.com/fasiha/mudderjs#readme

(5) By Gunter Hick (gunter_hick) on 2021-11-25 12:48:11 in reply to 1 [link] [source]

In many use cases, the rank is implied by comparing some kind of score (e.g. elo points in chess) and updating a records' score only needs to rebalance the index on that score to faciliate returning records ordered by the score (plus any criteria to tie-break equal scores).

What makes it necessary to quickly extract the 4711th record, ond only that one?

(6) By Simon Slavin (slavin) on 2021-11-25 16:56:44 in reply to 1 [link] [source]

Usually you don't store ranks. Calculate a rank by iterating through an index of scores until you find the row you want.

If you have a situation where it is useful to store ranks, and ranks don't change often, don't store ranks in the same table where your scores are stored. Make a special table for just id, score and rank, and have a procedure which deletes all the rows in the table and makes new ones. Call this procedure when a new set of scores are in.

(7) By Keith Medcalf (kmedcalf) on 2021-11-25 17:48:36 in reply to 1 [link] [source]

Python code to maintain an Ordered Table:


def OpenRowid(db, table, rowid, count=1):
    if count < 1:
        raise ValueError('Count must be >= 1')
    if db.cursor().execute(OpenRowid.test.format(table=table), (rowid, count)).fetchone()[0] != 0:
        db.cursor().execute(OpenRowid.sql.format(table=table), (rowid, count))

def RenumberTable(db, table):
    db.cursor().execute(RenumberTable.sql.format(table=table))


OpenRowid.test = '''
select count(*)
  from {table}
 where _rowid_ >= ?1
   and _rowid_ < ?1 + ?2
;
'''

OpenRowid.sql = '''
insert into {table}
     select *
       from {table}
      where _rowid_ >= ?
   order by _rowid_ desc
on conflict (_rowid_) do update
        set _rowid_ = excluded._rowid_ + ?;
'''

RenumberTable.sql = '''
begin immediate;

-- make sure we have the correct temporary table r

drop table if exists temp.r
;
create temporary table if not exists r
(
   old primary key,
   new
)
without rowid
;

-- for all _rowid_ > 0 compute new _rowid_ to close gaps

insert into temp.r
     select _rowid_ as old,
            row_number() over (order by _rowid_) as new
       from {table}
      where _rowid_ > 0
   order by _rowid_
;

-- apply _rowid_ updates in ascending order of _rowid_ (move _rowid_ down)

insert into {table}
     select *
       from {table}
      where _rowid_ in (
                        select old
                          from temp.r
                         where old != new
                       )
   order by _rowid_
on conflict (_rowid_) do update
        set _rowid_ = (
                       select new
                         from temp.r
                        where old == excluded._rowid_
                      )
;

-- clear temporary table

delete from temp.r;

-- for _rowid_ > 0 compute offset _rowid_ so rows with _rowid_ < 1 can be moved into place

insert into r
     select _rowid_ as old,
            row_number() over (order by _rowid_) + (select count(*) from {table} where _rowid_ < 1) as new
       from {table}
      where _rowid_ > 0
   order by _rowid_
;

-- apply _rowid_ updates in descending order of _rowid_ (move _rowid_ up)

insert into {table}
     select *
      from {table}
     where _rowid_ in (
                       select old
                         from temp.r
                        where old != new
                      )
   order by _rowid_ desc
on conflict (_rowid_) do update
        set _rowid_ = (
                       select new
                         from temp.r
                        where old == excluded._rowid_
                      )
;

-- clear temporary table

delete from temp.r;

-- compute final _rowid_ for rows with _rowid_ < 1

insert into temp.r
     select _rowid_ as old,
            row_number() over (order by _rowid_) as new
       from {table}
      where _rowid_ < 1
   order by _rowid_
;

-- apply _rowid_ updates in descending order of _rowid_ (move _rowid_ up)

insert into {table}
     select *
       from {table}
      where _rowid_ in (
                        select old
                          from temp.r
                         where old != new
                       )
   order by _rowid_ desc
on conflict (_rowid_) do update
        set _rowid_ = (
                       select new
                         from temp.r
                        where old == excluded._rowid_
                      )
;

-- clean up our temporary table

drop table if exists temp.r;

-- commit our transaction

commit;
'''

if __name__ == '__main__':
    import apsw
    db = apsw.Connection()
    db.cursor().execute('create table x(id integer primary key, v integer);')
    db.cursor().execute('insert into x select value*10, value from generate_series where start=1 and stop=10;')
    db.cursor().execute('insert into x select -(value-1)*10, -value from generate_series where start=1 and stop=10;')
    db.cursor().execute('delete from x where id=50;')
    db.cursor().execute('delete from x where id=-30;')
    for row in db.cursor().execute('select * from x;'):
        print(row)
    print()
    RenumberTable(db, 'x')
    for row in db.cursor().execute('select * from x;'):
        print(row)
    print()
    OpenRowid(db, 'x', 3)
    OpenRowid(db, 'x', -4)
    for row in db.cursor().execute('select * from x;'):
        print(row)
    print()
    db.cursor().execute('insert into x values (3, 103);')
    db.cursor().execute('insert into x values (-4, -104);')
    for row in db.cursor().execute('select * from x;'):
        print(row)
    print()
    RenumberTable(db, 'x')
    for row in db.cursor().execute('select * from x;'):
        print(row)
    print()

(8) By Keith Medcalf (kmedcalf) on 2021-11-25 17:50:53 in reply to 7 [source]

Opening a rowid and renumbering the table is performed by changing the traversal to minimize row movement.

(10) By Haujet Zhao (HaujetZhao) on 2021-11-26 02:02:35 in reply to 8 [link] [source]

It helps me a lot, much thanks!!!