SQLite Forum

Window function is not functioning as expected
Login

Window function is not functioning as expected

(1) By Balaji Ramanathan (balaji) on 2021-03-27 01:29:24 [link] [source]

Hi,

    Please consider the following set of simple queries:

create table Test (x integer);

insert into Test values (1), (3), (2), (5), (4), (6), (7), (9), (7), (4), (3), (1);

select x, row_number() over Win1 as Row, count() over Win1 as TotalRows
from Test
window Win1 as (order by x);

    12 values have been inserted into table Test.  I expect the select query to have 12 under TotalRows in all rows of the output.  Instead, count() seems to be partitioning the window by the value of x.  This is the output I get:

x	Row	TotalRows
1	1	2
1	2	2
2	3	3
3	4	5
3	5	5
4	6	7
4	7	7
5	8	8
6	9	9
7	10	11
7	11	11
9	12	12

    This is even though my window definition does not include a Partition by clause, and the documentation for window functions says the following:

For the purpose of computing window functions, the result set of a query is divided into one or more "partitions". A partition consists of all rows that have the same value for all terms of the PARTITION BY clause in the window-defn. If there is no PARTITION BY clause, then the entire result set of the query is a single partition.

    Based on the last sentence, I expected the count() to operate over all the rows in the window since it should be considered a single partition.  But instead, there seems to be some automatic partitioning going on based on the value of x.

    Am I misunderstanding the documentation on Window functions?  Or am I using count() wrong?  Any help would be appreciated.  What I want is each row to include the row number as well as the total number of rows in the window.  Thank you.

(2) By Keith Medcalf (kmedcalf) on 2021-03-27 01:56:14 in reply to 1 [link] [source]

It would appear that for the purposes of count() as a window function it is treating rows with the same order by value as peers and providing the count of rows in the current and all prior peer groups.

Unless I missed it, this does not appear to be documented.

(3) By Balaji Ramanathan (balaji) on 2021-03-27 02:09:40 in reply to 2 [source]

You are right, Keith. It is not just partitioning by the value of x and counting within each partition, it is doing something cumulative also. Thank you for confirming that this is not expected behavior based on your reading of the documentation either.

(4) By Balaji Ramanathan (balaji) on 2021-03-27 02:19:51 in reply to 2 [link] [source]

To provide more context, I am trying to calculate the median of x using the code below:

with RowNumbers as ( select x, row_number() over (order by x) as Row, count(x) over (order by x) as TotalRows from Test)

select avg(x) from RowNumbers where Row between TotalRows/2.0 and TotalRows/2.0 + 1;

Unfortunately, because of the way count() is working, this is giving me nonsense results.

If there is a simpler way to calculate the median, that would solve my immediate problem (though I think it would be a good idea to fix either count() or the windows function documentation so that the result is as expected and documented).

If SQLite could provide an aggregate function for median the way some other databases do (AWS Redshift being one I am familiar with that has a built-in median() aggregate function), that would be really great. Same thing for mode() also. Thanks again.

(7) By anonymous on 2021-03-27 11:09:18 in reply to 4 [link] [source]

with RowNumbers as ( 
  select x, 
  row_number() over (order by x) as Row,
  count(x) over () as TotalRows  /*no order by here*/
from Test)

select avg(x)
from RowNumbers
where Row between TotalRows/2.0 and TotalRows/2.0 + 1;

--> 4.0

(5) By Keith Medcalf (kmedcalf) on 2021-03-27 02:20:16 in reply to 2 [link] [source]

That is, it operates like rank() except that where rank() returns the number of the first row of each peer group, count() returns the number of the last row of each peer group.

(6) By anonymous on 2021-03-27 08:47:16 in reply to 2 [link] [source]

I think it is documented: The default frame specification is

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW EXCLUDE NO OTHERS
(see https://sqlite.org/windowfunctions.html#frame_specifications).

What you probably were expecting for TotalRows would be

select x, row_number() over Win1 as Row, count() over Win1 as TotalRows
from Test
window Win1 as (order by x range between unbounded preceding and unbounded following);

which returns 12 as TotalRows for all rows. Or a Window which does not contain any ORDER BY clause like

...
count() over () as TotalRows
...
See also the "concept of peer groups" in https://sqlite.org/windowfunctions.html#built_in_window_functions.

I tripped over this in Oracle SQL as well - but once understood, makes perfectly sense as often one wants "cumulative" sums/counts as default ...

(8) By Balaji Ramanathan (balaji) on 2021-03-27 15:03:14 in reply to 6 [link] [source]

You are correct. I never noticed that there was a default frame specification. Given that frame specification, the output makes perfect sense now. Thank you very much. It looks like aggregate functions (like count() ) take this default frame specification into account while built-in functions ignore the frame specification, so this default frame_spec has never been visible to me while using only the built-in window functions. Now my query functions correctly and gives me the correct median. Thanks again.

(10) By anonymous on 2021-03-27 22:45:30 in reply to 8 [link] [source]

You are welcome.

This "ignoring" is mentioned in https://sqlite.org/windowfunctions.html#built_in_window_functions : ... Most built-in window functions ignore the frame-spec, the exceptions being first_value(), last_value() and nth_value() ...

Especially for 'aggregate' window functions, beeing able to specify frame specification makes sense.

(9) By Keith Medcalf (kmedcalf) on 2021-03-27 20:34:39 in reply to 6 [link] [source]

Actually, what appears to happen (by observation and experimentation) is that count() has behaviour that is dependent on the window specification.

PARTITION BY and ORDER BY work as expected.

If a frame specification is given, then that frame specification in used. The default frame specification is groups from unbounded preceding to current row or range from unbounded preceding to current row (both of which apply count to the number of rows in the peer groups included, including all the rows in the peer group to which the current row belongs), thus count() works like rank() except that it returns the number of the last row for each peer group.

count() over (order by x rows between unbounded preceding and current row) returns the same result as row_number() over (order by x).

count() over () (an empty window specification) returns the total number of rows, that is the max(row_number()), for the entire result set.

count() over (partition by x) returns the total count of rows in each partition.

And so on and so forth.

(11) By anonymous on 2021-03-27 22:57:55 in reply to 9 [link] [source]

Which all makes sense for me. My personal rule of thumb (mainly "derived" from experience with Oracle)

  • aggregate window function (plus first_value, last_value)
    • without ORDER BY - results in aggregation over whole partition
    • with ORDER BY - results in cumulative aggregation "along" the ORDER BY ( unless another frame specification is given)

seems to be valid for sqlite as well, count() being an aggregate function.