Numerical stability of AVG and SUM
(1) By Justin Meiners (justinmeiners) on 2022-06-24 23:12:41 [link] [source]
The implementation of AVG and SUM is straightforward but would usually be considered numerically unstable. This is a concern because the typical use case is aggregating a long sequence of values, which is exactly when numerical accuracy becomes a problem. It's particularly a problem for AVG, which first accumulates a large sum and then divides by the count to get the final average. This tends to be inaccurate because the intermediate sum is MUCH larger than the resulting average.
Postgres and Boost.Accumulators both address these issues. Here are some straightforward options to consider, with relatively minor performance tradeoffs:
SUM: Kahan (compensated) summation. If fast-math or other exotic floating-point options are enabled, this might be a problem (a minimal sketch follows this list of options).
SUM: Balanced reduction. The order of summation can be re-associated into a divide-and-conquer form. I have described this in another context here.
AVG: Here is a description. The code looks something like:
    double mean = 0.0, n = 0.0;
    for (size_t i = 0; i < count; i++) { n += 1.0; mean += (x[i] - mean) / n; }   /* x, count: the input values */
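Returning to the Kahan option above: here is a minimal sketch of compensated summation, assuming an in-memory array x of length n purely for illustration (SQLite's actual SUM works incrementally inside an aggregate context, so this is not a drop-in patch):

    #include <stddef.h>

    /* Minimal sketch of Kahan (compensated) summation; x and n are
       hypothetical inputs, not SQLite internals. Fast-math style
       optimizations may legally delete the compensation term. */
    double kahan_sum(const double *x, size_t n) {
        double sum = 0.0, c = 0.0;           /* c holds the running compensation */
        for (size_t i = 0; i < n; i++) {
            double y = x[i] - c;             /* apply the correction */
            double t = sum + y;              /* low-order bits of y may be lost here */
            c = (t - sum) - y;               /* recover what was lost */
            sum = t;
        }
        return sum;
    }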
I looked in the old mailing list briefly and did not see this brought up, but maybe it has. If we think this is a good idea, I would be happy to write a patch.
Thanks,
- Justin
(2.2) By Keith Medcalf (kmedcalf) on 2022-06-25 00:44:26 edited from 2.1 in reply to 1 [link] [source]
Successive approximation can be used to compute a lot of statistical functions (a rough sketch of the running-update idea follows the list below).
http://www.dessus.com/files/sqlfunc.c
This calculates:
avg
aavg (aavg(x) <--> avg(abs(x)))
gavg (gavg(x) <--> exp(avg(log(x))))
rms
var
varP
stdev
stdevP
sem
ci
skew
skewP
kurt
kurtP
avg, stdev, stdevP, var, varP, sem, and ci also allow "weights" as a second parameter to the aggregate/window function.
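For readers unfamiliar with these running ("successive approximation") updates, here is a rough sketch in the style of Welford's online algorithm; it illustrates the general technique only and is not the code from sqlfunc.c:

    #include <math.h>

    /* One-pass running mean/variance (Welford-style update); illustrative
       only, not a copy of sqlfunc.c. */
    typedef struct { double n, mean, m2; } running_stats;

    static void rs_push(running_stats *s, double x) {
        s->n += 1.0;
        double d = x - s->mean;
        s->mean += d / s->n;                 /* same recurrence as the AVG update above */
        s->m2   += d * (x - s->mean);        /* running sum of squared deviations */
    }

    static double rs_var(const running_stats *s)   { return s->m2 / (s->n - 1.0); }   /* sample variance */
    static double rs_stdev(const running_stats *s) { return sqrt(rs_var(s)); }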
For the SUM window function, I just changed the builtin to maintain the floating accumulator as LONGDOUBLE_TYPE rather than just a double. In the odd case where this is not sufficient precision, you can simply recompile using LONGDOUBLE_TYPE=__float128 and/or presort the input.
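As a rough sketch of that idea (not the actual change to SQLite's built-in SUM): keep the accumulator in a wider type and round back to double only once at the end.

    #include <stddef.h>

    /* Sketch only: accumulate in a wider type, narrow once at the end.
       LONGDOUBLE_TYPE stands in for whatever wide type the build selects. */
    #ifndef LONGDOUBLE_TYPE
    #define LONGDOUBLE_TYPE long double
    #endif

    double wide_sum(const double *x, size_t n) {
        LONGDOUBLE_TYPE acc = 0.0;
        for (size_t i = 0; i < n; i++) acc += x[i];
        return (double)acc;                  /* a single rounding back to double */
    }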
(3) By Justin Meiners (justinmeiners) on 2022-06-25 18:10:53 in reply to 2.2 [link] [source]
That does look like some great work. Have you discussed improving those in the main repository before? Do you have any license or distribution information about this code?
(4) By Donal Fellows (dkfellows) on 2022-06-27 12:18:09 in reply to 1 [link] [source]
> It's particularly a problem for AVG, which first accumulates a large sum and then divides by the count to get the final average.
It's not more of a problem for AVG than for SUM, because the divide reduces the scale of the error too. SQLite uses floating point rather than fixed point, and much of the work in a floating-point divide is shouldered by the exponent (the information loss in the mantissa is probably unavoidable with any representation using a fixed number of bits, and doesn't usually matter). The problem with SUM is cumulative loss of significance when the values being added have significantly different magnitudes.
The real problem comes when you take the difference between two large similar numbers (where at least one has some sort of cumulative error issue) and then use the result to take some sort of binary decision. An example of this issue is what you get when you ask "is this sum or average producing a value greater than this threshold?" When the values being compared are close, things get decidedly tricky. (In at least one application I know, the answer has been to use stochastic decision making, which maintains substantially better overall numeric accuracy despite being really counterintuitive at first glance.)
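A toy illustration of that failure mode, with made-up numbers: near 1e16 the spacing between adjacent doubles is 2, so a small contribution vanishes entirely and the subsequent threshold test gives the wrong answer.

    #include <stdio.h>

    int main(void) {
        double big  = 1e16;
        double a    = big + 1.0;             /* 1.0 is below the local spacing; a == big */
        double diff = a - big;               /* mathematically 1.0, actually 0.0 */
        printf("diff = %g, diff > 0.5 ? %d\n", diff, diff > 0.5);   /* prints: diff = 0, diff > 0.5 ? 0 */
        return 0;
    }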
(5) By Keith Medcalf (kmedcalf) on 2022-06-27 18:28:18 in reply to 3 [source]
I wrote the code, and it is released to the public domain. Break it and you own all the pieces. The "running calculations" are all well described numerical methods and should be stable.
(6.1) By Justin Meiners (justinmeiners) on 2022-06-28 04:51:49 edited from 6.0 in reply to 4 [link] [source]
> It's not more of a problem for AVG than for SUM, because the divide reduces the scale of the error too.
That's true. I guess part of my concern is that implementing AVG with summing as an intermediate step may produce a far larger value than expected. This intermediate accumulation may enter a range of values that cannot be represented accurately. The user has in mind the average result, not how big the sum gets if you add it all up.
It sounds like you have good knowledge of this subject, probably more than I do. Are you also concerned about the existing implementation? What do you think should be done?
(7) By Justin Meiners (justinmeiners) on 2022-06-28 00:08:05 in reply to 4 [link] [source]
> The problem with SUM is cumulative loss of significance when the values being added have significantly different magnitudes.
The balanced summation I proposed is based on that idea. Instead of adding each element successively, so that a large running sum receives small incremental additions, it re-associates the additions so that roughly equal numbers of elements (and hence values of similar magnitude) are combined at each step.
However, I don't know how it compares to compensated sum methods like Kahan.
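For concreteness, here is a sketch of the balanced (pairwise) reduction over an in-memory array; a streaming aggregate like SQLite's would instead keep a small stack of partial sums, so treat this as an illustration of the idea rather than a proposed patch.

    #include <stddef.h>

    /* Pairwise (balanced) summation: split, sum each half, add the halves.
       Operands combined at each level tend to have comparable magnitudes. */
    double pairwise_sum(const double *x, size_t n) {
        if (n == 0) return 0.0;
        if (n <= 8) {                        /* small base case: plain loop */
            double s = x[0];
            for (size_t i = 1; i < n; i++) s += x[i];
            return s;
        }
        size_t half = n / 2;
        return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
    }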
(8) By Justin Meiners (justinmeiners) on 2022-06-28 00:09:37 in reply to 5 [link] [source]
Thanks. This is also an improvement on the stdev and other statistical functions implemented in the extensions listing.
(9) By Donal Fellows (dkfellows) on 2022-06-28 09:08:27 in reply to 7 [link] [source]
I'd want to compare with another (expensive!) technique: sorting the values by their absolute magnitudes and adding them from small magnitudes to large. That should minimise the information loss, since it gives the many a mickle a chance to make a muckle, i.e., the least significant bits have a maximum chance to carry over into higher bits that will end up retained. I'm not convinced that balanced sums will work with similar effectiveness unless they also have the sorting. Compensated summations I think effectively just add a bunch more bits (like having an extended mantissa) which might be workable in the huge majority of cases.
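A sketch of that sort-then-add approach, purely for illustration (sorting every aggregate input would of course be expensive, which is the point being made):

    #include <math.h>
    #include <stdlib.h>

    /* Sort by absolute magnitude, then add smallest-first so low-order bits
       can accumulate before they meet the large terms. Reorders x in place. */
    static int cmp_abs(const void *pa, const void *pb) {
        double a = fabs(*(const double *)pa), b = fabs(*(const double *)pb);
        return (a > b) - (a < b);
    }

    double sorted_sum(double *x, size_t n) {
        qsort(x, n, sizeof *x, cmp_abs);
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    }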
(10) By Keith Medcalf (kmedcalf) on 2022-06-28 16:46:24 in reply to 9 [link] [source]
Whether you pre-sort into ascending order depends on the relative size of the sum or avg in comparison to the input values.
sqlite> with s(x) as (values (1), (1e100), (1), (-1e100)) select sum(x) from s;
┌────────┐
│ sum(x) │
├────────┤
│ 0.0 │
└────────┘
VM-steps: 45
Run Time: real 0.008 user 0.000000 sys 0.000000
sqlite> with s(x) as (values (1), (1e100), (1), (-1e100)) select sum(x) from (select x from s order by abs(x) desc);
┌────────┐
│ sum(x) │
├────────┤
│ 2.0 │
└────────┘
(11) By Justin Meiners (justinmeiners) on 2022-06-29 00:52:43 in reply to 9 [link] [source]
> Compensated summations I think effectively just add a bunch more bits (like having an extended mantissa) which might be workable in the huge majority of cases.
I think compensated sums are a much more common solution, and I suspect the right one. I mentioned the balanced sum in case there was some reason the authors previously did not want to do this (e.g. fast-math compatibility).
In either case, I think you're right that some comparisons or an example might be a good idea.
(12) By Donal Fellows (dkfellows) on 2022-06-30 12:21:01 in reply to 10 [link] [source]
The best technique is probably to group by the magnitudes.
I suspect it's more of a problem for matrix algebra than for databases, where the values being summed will usually be of similar magnitude (and no special action is required). By contrast, some matrix ops are really tricky. I can't think of any real-world dataset where the magnitudes of the values vary enough to be tricky with double-precision floats. (I run into this stuff more with fixed point and hybrid stiff ODE systems.)
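One way to read "group by the magnitudes" concretely, sketched here under my own assumptions about bucket layout (this is not anything SQLite or Postgres does): bucket each value by its binary exponent via frexp, sum within buckets, then combine the buckets from smallest to largest.

    #include <math.h>
    #include <stddef.h>

    /* Sketch: per-exponent buckets so each addition combines values of
       similar magnitude; the 2100-slot layout is an illustrative choice
       covering the full double exponent range (subnormals included). */
    double bucketed_sum(const double *x, size_t n) {
        double bucket[2100] = {0};
        for (size_t i = 0; i < n; i++) {
            int e;
            frexp(x[i], &e);                 /* e is roughly log2(|x[i]|) */
            bucket[e + 1073] += x[i];        /* offset so the smallest exponents land at index 0 */
        }
        double s = 0.0;
        for (size_t b = 0; b < 2100; b++) s += bucket[b];   /* smallest magnitudes first */
        return s;
    }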