Import TXT file into SQLite - End of Line character
(1) By TimKendal on 2023-04-16 10:01:51 [link] [source]
I want to import a TXT file into a new SQLite database. There are no columns in this data - just a single field for each record, comprising simple text. This is a novel by Charles Dickens, so it is quite long - about 40,000 lines.
Each record is to contain the text in a single line of the TXT file. Each line is a maximum of 90 characters long (depends on word wrapping).
I've examined my text file with a Hex editor (HexEdit, online only) and the end of each line of text in the file is Hex 0A.
This is LF or Line Feed in the ASCII table I have, and importing the file into e.g. TextEdit or Sublime Text produces the correct results, so I think the file is ok
Because I want to be able to search the file and e.g. display all occurrences of a name, I've tried to load the file into a new database using DB Browser for SQLite, and there are errors, in that the EOL code 0A is sometimes ignored, making for 2 or more lines in a DB record. I've checked the Hex codes as above for where the error happens and the line ends have the 0A code just like all the lines that have imported correctly.
I've struggled with the SQLite documentation to see how to do the import using SQLite in Terminal on MacOS Catalina - can someone please point me to the required commands, and say if there is an easier way!
Many thanks
(2) By Simon Slavin (slavin) on 2023-04-16 13:14:16 in reply to 1 [link] [source]
Your problem is with DB Browser, which is a program which uses the SQLite library. Your problem is not with the SQLite library itself. If you want DB Browser to do what you want, you'll have to contact the developers of that program.
There are two other solutions I can suggest:
- Write your own program to read the text file and make rows in your database. Always the best solution.
- Use the SQLite CLI program, which is written and supported by the same team which supports the SQLite API, to import the file.
If you use solution (2) then I think the best solution is to use .ascii
mode as described here:
https://sqlite.org/cli.html#importing_files_as_csv_or_other_formats
and set field and separators to characters which never appear in your source file. That may work. If not, please post again with details of what happens and perhaps someone can help.
(3) By Larry Brasfield (larrybr) on 2023-04-16 13:52:59 in reply to 1 [link] [source]
I've struggled with the SQLite documentation to see how to do the import using SQLite in Terminal on MacOS Catalina - can someone please point me to the required commands, and say if there is an easier way!
I do not gainsay Simon's advice here, although I question whether writing a new program is the best way. This might be considered a program understood by the SQLite CLI:
CREATE TABLE EmacsInit(line text);
.mode ascii
.separator "\a" "\n"
.import .emacs EmacsInit
Now, as thanks for that little program, (which you can adapt with minor edits), would you please answer:
Did you find the SQLite CLI doc?
If "No", under what terms would you have expected to find it in the documentation start page or the index available there?
If "Yes", did you notice the section titled "Importing ..."?
If you did notice, what was unclear about it?
If you did not notice, under what term(s) would you have expected to see an explanation of importing data?
We strive to make the documentation clear and useful. But we sometimes fail to understand how its discovery features fail to lead to appropriate content. Your help with that understanding would be appreciated.
(4) By punkish on 2023-04-16 14:10:38 in reply to 3 [link] [source]
.separator "\a" "\n"
what is the delimiter \a
? Seems like that is the bell, but are you using that here simply to add a noop for the col delimiter? (I couldn't figure out how to set a row delimiter without setting a col delimiter. It was not clear from .help .separator
or .help .mode
)
(6) By Larry Brasfield (larrybr) on 2023-04-16 14:38:54 in reply to 4 [link] [source]
what is the delimiter a ? Seems like that is the bell, but are you using that here simply to add a noop for the col delimiter?
Unfortunately, there is no NOOP for the column delimiter in the CLI. So I used what used to be known as ASCII "BEL" because that is embedded in text only by those who are or wish to become insane.
I couldn't figure out how to set a row delimiter without setting a col delimiter. It was not clear from .help .separator or .help .mode
The blatting from ".help separator" is pretty clear on this provided one knows its lingo. It says
.separator COL ?ROW? Change the column and row separators
, which means the COL argument is required and the ROW argument is optional. IOW, there is no way to set just the row separator because there are no other commands other than .mode that mess with separators.
I doubt there is much use for just setting the row delimiter. With .import, one must always be concerned with the column delimiter. The row delimiter is often left alone because it rarely has to be different from newline.
(8) By punkish on 2023-04-16 16:13:58 in reply to 6 [link] [source]
what is the delimiter a ? Seems like that is the bell, but are you using that here simply to add a noop for the col delimiter?
Unfortunately, there is no NOOP for the column delimiter in the CLI. So I used what used to be known as ASCII "BEL" because that is embedded in text only by those who are or wish to become insane.
This is both a very clever trick and almost non-discoverable by a normal, non-savvy user (a set of users that includes me). It would be very helpful to add this as a trick to the CLI documentation in section 8.5
(10) By anonymous on 2023-04-17 10:04:10 in reply to 6 [link] [source]
Excellent! Any ideas on what would be the best way to add the actual source line number as column? Using CREATE TABLE ImpTxt(Line text, LineNo INTEGER PRIMARY KEY AUTOINCREMENT); .mode ascii .separator "\a" "\n" .import c:\a.txt ImpTxt results in expected 2 columns but found 1 - filling the rest with NULL and using the SQLite RowId cannot be guaranteed as being the source line number. Philip
(11) By Larry Brasfield (larrybr) on 2023-04-17 13:00:21 in reply to 10 [link] [source]
This would do:
CREATE TEMP TABLE PreImpTxt(line TEXT);
.mode ascii
.separator "\a" "\n"
.import c:\a.txt --schema temp PreImpTxt
CREATE TABLE ImpTxt(Line text, LineNo INTEGER PRIMARY KEY);
INSERT INTO ImpTxt(Line, LineNo) SELECT line, rowid FROM temp.PreImpTxt;
DROP TABLE temp.PreImpTxt;
(13) By TimKendal on 2023-04-18 12:53:30 in reply to 3 [link] [source]
Thanks Larry. As you'll see from my post just submitted, my problem is solved. Great.
In reply to your questions:
Yes
Yes
Nothing that I can point too specifically - but I'm a beginner and learning Python, Sqlite and several other things as I go along. So I do not expect to understand everything without further research
The documentation is clear, provided one has a certain amount of background knowledge - if I just ask what to me is a simple question, the documentation does not necessarily give a simple answer straight away. But the documentation needs to provide all the detail an expert needs, and that's where the problem I have stems from. No criticism at all in this. These things are complex and can't be learnt and understood in a moment and I don't expect that. When working on this and similar projects I nearly always find that the answer I find to one question leads to one or more further questions - so it can be a long time before I et to the answer I need. My problem, and one I will continue to live with.
It also doesn't always help when on some forums (not this one) you get conflicting answers not all of which are understandable. They make me wonder which one to believe. But perhaps I have not understood that answer well enough, as well as not understanding whatever it was that made me ask the question in the first place!
(5) By punkish on 2023-04-16 14:20:04 in reply to 1 [link] [source]
first, use the SQLite command line program to import the data (see Larry's post above for how to import a text file. Fwiw, I downloaded the text version of Hard Times by Dickens from Project Gutenberg and imported it flawlessly into SQLite, one table with a single column, one line per record.
Once you have the table, you can create a full text index using FTS5 (see the SQLite website for instructions) and then you can search for arbitrary text and get your results very fast. Once you have the data in your table, then you can use DB Browser to browse the data as you wish.
Yes, some of the built-in help for sqlite3
is a bit mysterious, but once you get it, it is super fast.
(7) By Chris Locke (chrisjlocke1) on 2023-04-16 15:17:33 in reply to 1 [link] [source]
The EOL code is ignored
What were your import settings? If the field separator was a comma, then the field would have been quoted. Was the quote character correct?
You can submit an issue on the DB Browser for SQLite github page (click 'bug report' in the 'Help' menu). It's unlikely to be an issue in DB4S itself, but the import settings and/or the file.
(9.1) By punkish on 2023-04-16 16:21:56 edited from 9.0 in reply to 7 [link] [source]
I can confirm that the import works very well with DB Browser if the field separator is changed to "Other (code)" and the value left at 0. Each line imports into its own row in a table.
(12) By TimKendal on 2023-04-18 12:39:23 in reply to 9.1 [link] [source]
Many thanks for all the replies to this. Most instructive!
I now have a working database, so that's great.
I used DB Browser as recommended by punkish - the key is to Import / Table from CSV then choose Textfile and then choose the TX file to be imported. Then in the options screen that follows set both the Field Separator and Quote Character to Other (Code) and a value of 0.
I also tried using SQLite in Terminal, as recommended and this also worked
I don't really understand what the Quote Character field is for, but it needs to be 0.
Nice to find another Dickens fan!
(14.1) By Chris Locke (chrisjlocke1) on 2023-04-18 13:09:45 edited from 14.0 in reply to 12 [source]
I don't really understand what the Quote Character field is for
While a csv is 'standard' due to its name (comma separated values), the quote character is needed if the field contains commas.
However, you may have pipes as a separator. Obviously this isn't a csv, but a psv.. but the point is, you may not want quotes as a quote character. It might be stars.
*this is a field*|*field2*
Changing these options allows the above text line to be read.
(15) By punkish on 2023-04-18 13:30:59 in reply to 14.1 [link] [source]
However, you may have pipes as a separator. Obviously this isn't a csv, but a psv.. but the point is, you may not want quotes as a quote character. It might be stars.
sure, but why "0"? Or is that 0 meant to imply a non-existent quote character?
(16) By Chris Locke (chrisjlocke1) on 2023-04-18 16:21:07 in reply to 15 [link] [source]
If you're wanting everything on a single row, you don't want any character, so using chr(0) is a character thats not going exist. So yes, its to imply a non-existent character.
(17) By Larry Brasfield (larrybr) on 2023-04-18 16:33:52 in reply to 16 [link] [source]
I say the following knowing almost nothing about the application discussed.
I would think that it is additionally needed, to solve the OP's stated problem, to avoid CSV dequoting logic. (That is why ".mode ascii" is used in the SQLite CLI script.) Does that happen somehow with the solution presented in post #7?