SQLite Forum

Avoiding adding duplicate entries
Login

Avoiding adding duplicate entries

(1) By anonymous on 2020-08-29 20:23:10 [link] [source]

I'm adding once a week simple timeseries data like this: (each week will be new date so record will be different)

timestamp,number1,number2,number3
2020-08-05,111,222,333
2020-08-12,444,555,666

I'm trying to avoid duplicated entries in case, if I would run code twice or three times within one week for example. I tried commands "INSERT OR REPLACE INTO" or "INSERT OR IGNORE INTO" and both commands add duplicates at every run, unless I have set table1, key: timestamp as UNIQUE. Is this how it should work?

Shouldn't "INSERT OR REPLACE" delete the record then insert it again, that way avoiding duplicate entry?

Shouldn't "INSERT OR IGNORE" ignore writing duplicate record? (not arguing, just asking, I'm beginner :) )

My problem is, when I created database + table, I didn't know I need to set UNIQUE to one of the keys. Can I edit my database/table now, afterward, and set UNIQUE key, after table is already populated with data? If not, what are my options? Bellow is simplified code, I'm using:

add_data = [timestamp, number1, number2, number3]

conn = sqlite3.connect('./data.db')
c = conn.cursor()
c.execute("INSERT OR REPLACE INTO table1 VALUES (?,?,?)", add_data)
# c.execute("INSERT OR IGNORE INTO table1 VALUES (?,?,?)", add_data)
conn.commit()
conn.close()

(2) By Keith Medcalf (kmedcalf) on 2020-08-29 20:51:56 in reply to 1 [link] [source]

"OR REPLACE" and "OR IGNORE" are what are known as "conflict resolution method". In order for them to have any effect whatsoever, you must firstly have a conflict.

However, you have not specified "a conflict", so without a "conflict" to resolve, the "conflict resolution method" is never invoked.

So if you want the column table1.timestamp to be unique you can either declare that column to be unique in the table declaration OR create a unique index on that column. The former is merely syntactic sugar for the latter. (Syntactic sugar means that it does the same thing but is a different (sweeter, probably as in simpler) way of expressing the same result. ((Note that this may not be the case in all RDBMS implementations, however, it is the case for SQLite3))

So, if you create a unique index:

create unique index table1timestamp on table1(timestamp);

then a conflict will arise if you attempt to INSERT a record with a duplicate timestamp into the table1 table. This will require the use of the "conflict resolution method" to handle if you do not want to "see" an error being "bubbled up" to the application.

The "REPLACE" conflict resolution method means that conflicting records (in this case those with the same timestamp value) will be deleted and then the new record inserted.

The "IGNORE" conflict resolution method means that the fact of the conflict is ignored and the insert is ignored.

So the long and short of it is that before a "conflict resolution method" will be invoked, there must be a conflict which the specified "resolution method" will be capable of resolutioning.

(3) By Keith Medcalf (kmedcalf) on 2020-08-29 22:02:42 in reply to 1 [link] [source]

It should be noted however that a unique index on timestamp will not actually enforce this rule. It will prevent duplicate values of timestamp but will not require "once a week".

In order to enforce "once a week" you would need to create the index thusly:

create unique index table1weekly on table1(strftime('%Y-%W', timestamp);

Note however that this will not work correctly for the "divided week" between years (for example, where December 31 and January 1 occur in the same week).

Firstly, you must be storing your dates in UT1 (not in local/wall clock time) since wall clock/local time/date is dependant on the whim of politicians and is non-deterministric.

So lets pick 0000-01-02 as "week 0" in our arbitrary time system. The date does not matter provided that it is (a) consistent and (b) is the first day of a "week". Week numbers are only valid after this date.

So the formula that you want to use is:

cast ((julianday(timestamp)-julianday(epochday))/7 as integer)

which will give you the number of the week since the epoch week started, For example:

create unique index table1week on (cast ((julianday(timestamp)-1721060.5)/7 as integer));

This will raise a "conflict" if you try to add two data points in the same week (assuming that the "first day of a week is Sunday"). You can adjust the epoch juliandate if you wish to use some other definition of a "week".

Proof of concept:

with dates(date, week)
  as (
      select '2019-01-01', null
     UNION ALL
      select date(date, '+1 day'),
             cast ( (julianday(date, '+1 day') - 1721060.5) / 7 as integer)
        from dates
       where date < '2021-12-31'
      )
  select min(date), week
    from dates
group by week
order by week;

which will show you the starting date of each week.

(4) By anonymous on 2020-08-29 23:46:50 in reply to 2 [link] [source]

Keith Medcalf, thank you very much for detailed response. So, my python script is working as intended. This also explain, why my testing database works when I set UNIQUE to one key. All makes sense now. From now on, whenever I will create new database, I will make sure I will have UNIQUE set if I will want to prevent duplicate entries.

I have question about creating unique index. Can I create it on already existing batabase/table?

I tried your command several times, through Python and directly inside SQLIte environment (never before I have ran SQLite command on it's own, so I probably made a mistake. I will show you what I did:

I python I ran these commands, I tried several different variations what you suggested:

import sqlite3 conn = sqlite3.connect('./test3.db') c = conn.cursor() c.execute(create unique index table1.timestamp on table1(timestamp))

c.execute(create unique index table1timestamp on table1(timestamp))

c.execute(create unique index table1(timestamp) on table1(timestamp))

c.execute(CREATE UNIQUE index table1 timestamp on table1(timestamp))

I always get syntax error. Then I tried directly inside SQLite3 from the terminal. Commands were:

sqlite3 sqlite> .open test3.db sqlite> create unique index table1.timestamp on table1(timestamp) ...> .close test3.db

test3.db is mock database with same schema I mentioned in first post with one record already inserted, to test if I can create unique index). I didn't get any error running upper SQLite commands, but table didn't change, there is no UNIQUE property for timestamp and there is no collusion, same data is being added, which tells me, database and table are the same as before i ran upper commands.

I'm completely noob, beginner in Python, first time dealing with database...

Is there a simple way, preferably in Python command, that I can convert existing database/table into UNIQUE? I would like to use existing database file if possible. By creating new one, I would lose all data already inserted.

(5) By Keith Medcalf (kmedcalf) on 2020-08-30 02:30:07 in reply to 4 [link] [source]

The argument to the execute method of a cursor is the SQL Statement String:

c.execute('create unique index table1timestamp on table1(timestamp)')

Note that this will create the index if it does not exist or throw an exception if an error occurs (such as the index already exists).

(6) By anonymous on 2020-08-30 14:14:40 in reply to 5 [source]

This worked, I don't know what I was doing before.

Thank you very much for all the help and patience with me!

(7) By anonymous on 2020-08-30 14:22:49 in reply to 3 [link] [source]

It should be noted however that a unique index on timestamp will not actually enforce this rule. It will prevent duplicate values of timestamp but will not require "once a week".

I would like to thank you for this detailed response. While code is above my head right now, I understand the point. I'm saving this for further use. In my case, just UNIQUE flag on timestamp will work as once a week filter, since new weekly data is generated once a week, so timestamp will always be different. If timestamp is the same, then I'm still having the same weekly data, hence needs to be ignored.

Your code will come handy one day, when I will need to make sure, it runs weekly, so I'm putting it down on my 'Study ToDo' list, thanks again for all your help!