Delete every record except the first of the month

(1) By Cecil (CecilWesterhof) on 2022-08-17 18:49:27 [link] [source]

I have a table where there should be a record for every day in the format '2022-08-17'. For the previous years I want to delete all the records except the first record of the month. At the moment I use:

DELETE
FROM   table
WHERE  date < '2022-'
   AND date NOT LIKE '%-01'

This works, but would go wrong at the moment there will not be a record for the first day of the month. If there is not a record for the first day, but there is a record for the second, the record for the second day should not be removed. What would be a good way to rewrite this query?

(2) By anonymous on 2022-08-17 19:35:10 in reply to 1 [link] [source]

table is a bad example table name, so I'll use daily instead...

delete from daily
    where date<'2022' and exists
        (select * from daily earlier
            where earlier.date>substr(daily.date,1,9)
                and earlier.date<daily.date);

(3.1) By jose isaias cabrera (jicman) on 2022-08-17 19:49:31 edited from 3.0 in reply to 1 [link] [source]

Deleted

(4) By jose isaias cabrera (jicman) on 2022-08-17 19:50:00 in reply to 3.1 [link] [source]

Mis-read the request. :-)

(5) By Cecil (CecilWesterhof) on 2022-08-17 20:33:55 in reply to 2 [link] [source]

table is a bad example table name, so I'll use daily instead...

Yeah, that was not smart using the variable name instead of a real table name. :'-(

delete from daily
    where date<'2022' and exists
        (select * from daily earlier
            where earlier.date>substr(daily.date,1,9)
                and earlier.date<daily.date);

That looks likes it is going to work.

Thanks.

(6.1) By Keith Medcalf (kmedcalf) on 2022-08-17 21:00:37 edited from 6.0 in reply to 1 [link] [source]

You could use something like this:

with rows(rowid, rank) as
   (
     select rowid,
            rank() over (partition by substring(date,1,7) order by date) as rank
       from data
      where date < strftime('%Y')
   )
delete from data where data.rowid in (select rowid from rows where rank <> 1)
;

Changed hardcoded '2022' to strftime('%Y') so that the "current year" is always retrieved. Note that this will be the UTC year. If you want localtime-years then you need to use strftime('%Y','now','localtime').

(7.1) By Keith Medcalf (kmedcalf) on 2022-08-17 21:16:36 edited from 7.0 in reply to 5 [link] [source]

The 9 should be a 7. And it works well and quickly provided the date field is indexed.

(8) By Cecil (CecilWesterhof) on 2022-08-17 21:56:42 in reply to 7.1 [link] [source]

The 9 should be a 7. And it works well and quickly provided the date field is indexed.

I would say 8. But for sure: 7 is better as 9.

I did not see that even. :'-(

(9) By Keith Medcalf (kmedcalf) on 2022-08-17 23:13:06 in reply to 6.1 [link] [source]

Note that this runs in about the same time whether the date is indexed or not. The other one (that does an exists test to see if the current candidate should be deleted) degenerates very rapidly if date is not indexed.

If date is indexed then the delete ... exists is more performant in all cases.

NOTE indexed means, in this case, the first column of an index.

(10) By Cecil (CecilWesterhof) on 2022-08-18 08:34:42 in reply to 6.1 [link] [source]

Very interesting. Just as query I just used:

SELECT rowid
,      date
,      rank() OVER  (
           PARTITION BY substring(date, 1, 7) 
           ORDER     BY date
       ) AS   rank
FROM   partitionUsage

With this I see that in the old thinned out data there was no problem. (It was already thinned out.) But there is no data for 2022-01-01. So for next year it is good to have a better query. :-D

2022-01-01 is the only date that does not have data. So the simple query would not be a real problem, but it never hurts to do things correctly.

For 2022 it starts with (I monitor two partitions):

4209	2022-01-02	1
4210	2022-01-02	1
4211	2022-01-03	3
4212	2022-01-03	3

(11) By Cecil (CecilWesterhof) on 2022-08-18 08:42:19 in reply to 9 [link] [source]

If date is indexed then the delete ... exists is more performant in all cases.

I have:

    PRIMARY KEY     (date, partition)

So it is better to use the delete … exists variant.

But it never hurts to have a little bit extra knowledge.

Both could go wrong when there is data for one partition, but not the other one.

But the change that this will happen is almost non existent. But maybe add some extra complexity. Better safe as sorry.

(12) By Cecil (CecilWesterhof) on 2022-08-18 10:35:33 in reply to 2 [source]

To try it out I used:

SELECT partitionUsage.*
FROM   partitionUsage
,      partitionUsage AS earlier
WHERE  earlier.partition = partitionUsage.partition
   AND earlier.date > substr(partitionUsage.date, 1, 8)
   AND earlier.date < partitionUsage.date

This gives 6462 records.

When I use:

SELECT DISTINCT partitionUsage.*
FROM   partitionUsage
,      partitionUsage AS earlier
WHERE  earlier.partition = partitionUsage.partition
   AND earlier.date > substr(partitionUsage.date, 1, 8)
   AND earlier.date < partitionUsage.date

I get 442 records.

Would it be better to use DISTINCT?

Maybe I go to far for a query that will only run once a year. :-D