select * from fcc where unique_system_identifier='4350131'; returns 1000+ records
(1) By Kyle (kylekrieg) on 2021-05-05 20:30:58 [link] [source]
Hello all, I'm new to sqlite and databases. I have a Node Red flow which looks up an users FCC amateur radio call sign in a database I imported from the FCC. The schema is as follows; sqlite> .schema CREATE TABLE fcc( record_type char(2) not null, unique_system_identifier numeric(9) not null, uls_file_number char(14) null, ebf_number varchar(30) null, call_sign char(10) null, entity_type char(2) null, licensee_id char(9) null, entity_name varchar(200) null, first_name varchar(20) null, mi char(1) null, last_name varchar(20) null, suffix char(3) null, phone char(10) null, fax char(10) null, email varchar(50) null, street_address varchar(60) null, city varchar(20) null, state char(2) null, zip_code char(9) null, po_box varchar(20) null, attention_line varchar(35) null, sgin char(3) null, frn char(10) null, applicant_type_code char(1) null, applicant_type_other char(40) null, status_code char(1) null, status_date datetime null, lic_category_code char(1) null, linked_license_id numeric(9) null, linked_callsign char(10) null ); CREATE INDEX call_sign_idx on fcc(call_sign); CREATE INDEX unique_system_identifier_idx on fcc(unique_system_identifier); The import was created via a dat file with the command .import EN.dat fcc. Importing had a few errors, but I'm not sure how to decpher the errors. sqlite> .import EN.dat fcc EN.dat:874847: unescaped " character EN.dat:876288: unescaped " character EN.dat:876288: unescaped " character EN.dat:876288: unescaped " character EN.dat:874847: expected 30 columns but found 29 - filling the rest with NULL EN.dat:893284: unescaped " character EN.dat:900994: unescaped " character EN.dat:900994: unescaped " character EN.dat:910606: unescaped " character EN.dat:910606: unescaped " character EN.dat:910606: unescaped " character EN.dat:893284: expected 30 columns but found 29 - filling the rest with NULL EN.dat:1479385: unescaped " character EN.dat:1479385: unescaped " character EN.dat:1479385: unterminated "-quoted field EN.dat:1479385: expected 30 columns but found 10 - filling the rest with NULL sqlite> Here is some sample data from the EN.dat file (all public info if you were wondering).... EN|4442768|||KD9SPT|L|L02444991|Lynn, Brian S|Brian|S|Lynn|||||312 Benedictine Ct|O Fallon|IL|62269|||000|0030508675|I|||||| EN|4442769|||KD9SPU|L|L02444992|Lambert, Dawn R|Dawn|R|Lambert|||||3924 Old Hwy 51|Makanda|IL|62958|||000|0030818413|I|||||| EN|4442770|||KO4PYH|L|L02437414|Parisi, Tyler J|Tyler|J|Parisi|||||812 PREAKNESS DR|ALPHARETTA|GA|300222706|||000|0030782544|I|||||| EN|4442771|||KI5PRJ|L|L02445028|Berardi, Vincent C|Vincent|C|Berardi|||||10245 E RANCHO DIEGO LN|CROWLEY|TX|760369435|||000|0030869424|I|||||| EN|4442772|||KJ7YJS|L|L02445029|Black, Robert|Robert||Black|||||395 E MEADOW LARK LN|ELK RIDGE|UT|846518000|||000|0030864474|I|||||| When looking up a friend's new call sign in the database via call_sign, I was seeing the database wasn't returning the correct record, so I got to digging. When I issued a select * from fcc where unique_system_identifier='4442936'; I get zero records. That unique system identifier is my friends in the FCC database. I knew he was one of the last people issued a call sign from this EN.dat file, so I entered in the following sql statement select * from fcc order by unique_system_identifier desc limit 10; There must of been 2000+ records come back. I scrolled to the top of the list to find the last unique_system_identifier as 4350131. If I search for 4350130 it returns the correct record. Search for 4350131 and it returns 2000+ records. Search for anything past 4350132 and beyond returns no records, no matter what field your searching for. I have no idea why this would occur? Sorry for the long email, just trying to give someone all the info I have. It's like the indexing is messed up but I dropped and created a new index file. Thanks Kyle
(2) By Warren Young (wyoung) on 2021-05-05 22:57:16 in reply to 1 [link] [source]
EN.dat file you're asking us to speculate on a few points.
However, what immediately catches my eye is that first field,
record_type. That's a good sign of a single flat table that's trying to do the job of multiple tables interlinked by some relation, probably this
What then follows from that is this guess: your
SELECT statement should include a
WHERE record_type='EN' or similar, to filter out all of the other record types.
Once you sort these initial confusions out, if you're going to use this data long-term, I'd do some serious refactoring on the data to get it into normal form.
(4) By Kyle (kylekrieg) on 2021-05-06 02:13:00 in reply to 2 [link] [source]
What would help you make more of a better guess of what's going on?
The data can be downloaded here : ftp://wirelessftp.fcc.gov/pub/uls/complete/l_amat.zip. It's about 250MB in size.
All of the records that I imported have EN as the record_type.
(6.2) By Warren Young (wyoung) on 2021-05-06 05:48:38 edited from 6.1 in reply to 4 [source]
The data can be downloaded here
Okay, that puts a different spin on things. It's simply not well-formed for direct SQLite input.
I was able to get a conversion that allowed your queries to work with:
$ pip3 install --user csvs-to-sqlite $ dos2unix EN.dat $ echo 'record_type unique_system_identifier uls_file_number ebf_number call_sign entity_type licensee_id entity_name first_name mi last_name suffix phone fax email street_address city state zip_code po_box attention_line sgin frn applicant_type_code applicant_type_other status_code status_date lic_category_code linked_license_id linked_callsign' > fcc.tsv $ tr '|' '\t' < EN.dat >> fcc.tsv $ csvs-to-sqlite fcc.tsv fcc.db -s $'\t'
The third command is necessary to give the TSV file (
fcc.tsv) the necessary headers. The hard tabs in the command made it through the forum posting's formatting path, so you should be able to copy the command directly.
This path still causes a few complaints, and it results in a somewhat different schema than the one you posted, but the result seems cleaner.
All of the records that I imported have EN as the record_type.
Sorry, the wrapping in your message made that unclear.
(11) By Kyle (kylekrieg) on 2021-05-06 14:27:57 in reply to 6.2 [link] [source]
Hey Warren, thanks for the info and the reply. I'm a bit over my head here but thanks for hanging with me.
Here is what I've gotten so far. I had to install some packages as I didn't have any python libraries installed. I also had to install some other dev libraries to get the csvs-to-sqlite to work as it was throwing an error. Had to issue this command sudo apt-get install libatlas-base-dev to get csvs-to-sqlite to work after installing.
sudo apt-get install python3-pip (which installed a ton of dependencies).
sudo pip3 install --user csvs-to-sqlite which installed in /.local/bin
sudo apt-get install dos2unix (installed fine)
echo to > fcc.tsv completed correctly
|tr '||' 't' < EN.dat >> fcc.tsv completed correctly|
csvs-to-sqlite fcc.tsv fcc.db -s $'t' has some errors as below;
pi@raspberrypi:~ $./.local/bin/csvs-to-sqlite fcc.tsv fcc.db -s $'t' /root/.local/lib/python3.7/site-packages/click/core.py:1066: DtypeWarning: Columns (12,18,21) have mixed types.Specify dtype option on import or set low_memory=False. return ctx.invoke(self.callback, **ctx.params) Loaded 1 dataframes Created fcc.db from 1 CSV file
Now the schema for the table looks a bit funky? Is this correct?
sqlite> .schema CREATE TABLE IF NOT EXISTS "fcc" ( "record_typeunique_system_identifieruls_file_numberebf_numbercall_signentity_typelicensee_identity_namefirst_namemilast_namesuffixphonefaxemailstreet_addresscitystatezip_codepo_boxattention_linesginfrnapplicant_type_codeapplicant_type_otherstatus_codestatus_datelic_category_codelinked_license_idlinked_callsign" INTEGER ); sqlite>
When I import the EN.dat (.import EN.dat fcc) not in CSV mode, I get the following error while it increments up. I stopped the import, this was the last line.
EN.dat:578088: expected 1 columns but found 30 - extras ignored
(12) By Warren Young (wyoung) on 2021-05-06 14:36:40 in reply to 11 [link] [source]
You aren’t copying or pasting the hard tabs in the third command. It’s a single 30-element tab-separated line, repeating your field list from your initial post.
The single quotes should preserve the tabs, but perhaps they aren’t being pasted into your terminal properly.
Worst case, you can construct the line manually.
(13) By Kyle (kylekrieg) on 2021-05-06 15:42:23 in reply to 12 [link] [source]
Ok, here is where I'm at after creating the fcc.tsv file manually. pi@raspberrypi:~ $ ./.local/bin/csvs-to-sqlite fcc.tsv fcctest -s $'\t' /home/pi/.local/lib/python3.7/site-packages/click/core.py:1066: DtypeWarning: Columns (12,18,21) have mixed types.Specify dtype option on import or set low_memory=False. return ctx.invoke(self.callback, **ctx.params) Loaded 1 dataframes /home/pi/.local/lib/python3.7/site-packages/pandas/core/generic.py:2789: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores. method=method, Created fcctest.db from 1 CSV file Here is the .schema for fcctest.db. When I complete a search, I get 0 records. pi@raspberrypi:~ $ sqlite3 fcctest.db SQLite version 3.27.2 2019-02-25 16:06:06 Enter ".help" for usage hints. sqlite> .schema CREATE TABLE IF NOT EXISTS "fcc" ( "record_type" INTEGER, "unique_system_identifier" INTEGER, "uls_file_number" TEXT, "ebf_number" TEXT, "call_sign" TEXT, "entity_type" TEXT, "licensee_identity_name" TEXT, "first_name" TEXT, "mi" TEXT, "last_name" TEXT, "suffix" TEXT, "phone" INTEGER, "fax" INTEGER, "email" TEXT, "street_address" TEXT, "city" TEXT, "state" INTEGER, "zip_code" TEXT, "po_box" TEXT, "attention_line" TEXT, "sgin" INTEGER, "frn" TEXT, "applicant_type_code" INTEGER, "applicant_type_other" INTEGER, "status_code status_date" INTEGER, "lic_category_code" INTEGER, "linked_license_id" INTEGER, "linked_callsign" INTEGER ); sqlite> select * from fcc where call_sign='AA0Z'; sqlite> select count(*) from fcc; 1505803 sqlite> select * from fcc where call_sign="AA0Z"; sqlite> select * from fcc where call_sign='AA0Z'; sqlite> select * from fcc where unique_system_identifier='4442936';
(3.2) By Simon Slavin (slavin) on 2021-05-06 01:51:20 edited from 3.1 in reply to 1 [link] [source]
The CSV file you're downloading has a format error which is preventing SQLite from being able to import it all.
Here's an example line from the file:
EN|261932|||K7SAS|L|L00171357|SCHURMAN SR, ROBERT W|ROBERT|W|SCHURMAN|SR||||1175 1/2 "D" ST|BAKER CITY|OR|97814| ||000|0002101574|I||||||
You can't do this in a file in CSV format. Double quotes are used to delimit text strings. You can't use them inside strings unless you double them. So that line should really look like this:
EN|261932|||K7SAS|L|L00171357|SCHURMAN SR, ROBERT W|ROBERT|W|SCHURMAN|SR||||1175 1/2 ""D"" ST|BAKER CITY|OR|97814| ||000|0002101574|I||||||
But CSV files use commas, not pipes, so it should really look more like
EN,261932,,,K7SAS,L,L00171357,SCHURMAN SR, ROBERT W,ROBERT,W,SCHURMAN,SR,,,,1175 1/2 ""D"" ST,BAKER CITY,OR,97814, ,,000,0002101574,I,,,,,,
But some of those commas are inside the fields, not delimiters, so this particular line should look more like
EN,261932,,,K7SAS,L,L00171357,"SCHURMAN SR, ROBERT W",ROBERT,W,SCHURMAN,SR,,,,1175 1/2 ""D"" ST,BAKER CITY,OR,97814, ,,000,0002101574,I,,,,,,
I think you would need to find some way to make these global changes to the file before you proceed. It's possible that there's some combination of settings for the SQLite tool which allows importing without it, but I don't know one.
I suspect you would find it simplest to import your data from this file into a SQLite database by writing your own program, than to use the command-line tool.
(5) By Kyle (kylekrieg) on 2021-05-06 02:16:07 in reply to 3.2 [link] [source]
I don't think I changed my mode to CSV when importing the EN.dat file. I've imported CSV files in the past and I've had to change the mode to CSV before importing.
The problem I have is this database is 1.4 million lines long. Any program (that I know of) just doesn't cut it with importing something like that and converting the pipes to commas.
(8.2) By Stephan Beal (stephan) on 2021-05-06 06:29:57 edited from 8.1 in reply to 5 [link] [source]
Any program (that I know of) just doesn't cut it with importing something like that
Such files are trivial to slice up from the CLI. e.g. to split them into 100k-line chunks using the bash shell and sed:
$ x=1; y=1; chunk=100000; while sed -n $x,$((x + $chunk))p EN.dat > EN_$y.dat; do test -s EN_$y.dat || break; x=$((x + $chunk + 1)); y=$((y + 1)); done $ wc -l EN.dat 1505803 EN.dat $ wc -l EN_*.dat 100001 EN_10.dat 100001 EN_11.dat 100001 EN_12.dat 100001 EN_13.dat 100001 EN_14.dat 100001 EN_15.dat 5788 EN_16.dat 0 EN_17.dat 100001 EN_1.dat 100001 EN_2.dat 100001 EN_3.dat 100001 EN_4.dat 100001 EN_5.dat 100001 EN_6.dat 100001 EN_7.dat 100001 EN_8.dat 100001 EN_9.dat 1505803 total
That 0-sized file at the end will always exist using this simple automation - simply remove it before importing the chunks.
EDIT: fixed off-by-one error which caused each successive group to start with the last line of the previous group. Now the line counts match up.
(10) By Warren Young (wyoung) on 2021-05-06 08:16:50 in reply to 8.2 [link] [source]
(9.1) By Keith Medcalf (kmedcalf) on 2021-05-06 07:19:06 edited from 9.0 in reply to 5 [link] [source]
The data file is fine, it is just in a very weird format.
| as the field separator but does not do quoting or escaping according to any commonly used method of quoting or escaping.
For example, the record:
EN|4350131|||KM0BUL|L|L02375253|Bulla 495, Kerry "|Kerry|"|Bulla|495||||5835 Best Rd|Larkspur|CO|80118|80118|Kerry "Mark" Bulla|000|0029933348|I||||||
It requires a custom parser written according to whatever rules apply for the data file format since it is non-standard.
If you read the file, split records on
\n and fields on
| then rewrite as a properly formatted csv it works just fine. (Save that it has fubar termination, but that is ok).
Python 3.9.4 (tags/v3.9.4:1f2e308, Apr 6 2021, 13:40:21) [MSC v.1928 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import csv >>> f = open('en.dat', 'r') >>> data = f.read().split('\n') >>> f.close() >>> data = [row.split('|') for row in data] >>> f = open('en.csv', 'w') >>> c = csv.writer(f) >>> c.writerows(data) >>> f.close() >>> quit()
If you want to do it row-by-each rather than processing the whole file in RAM at once the following should do that:
>>> import csv >>> fi = open('en.dat', 'r') >>> fo = open('en.csv', 'w') >>> c = csv.writer(fo) >>> c.writerows(row.strip('\n').split('|') for row in fi) >>> fi.close() >>> fo.close()
(7) By Stephan Beal (stephan) on 2021-05-06 06:13:17 in reply to 3.2 [link] [source]
I think you would need to find some way to make these global changes to the file before you proceed.
Any decent stream editor (sed or perl) can do that. Here's a line of sed which fully quotes all of the records, double-quoting any existing quote characters:
$ sed -e 's/\r//g' -e 's/"/""/g' -e 's/|/","/g' -e 's/^/"/g' -e 's/$/"/' < EN.dat > _en.dat $ grep 'SCHURMAN SR' _en.dat "EN","261932","","","K7SAS","L","L00171357","SCHURMAN SR, ROBERT W","ROBERT","W","SCHURMAN","SR","","","","1175 1/2 ""D"" ST","BAKER CITY","OR","97814"," ","","000","0002101574","I","","","","","","" ...
\r removal is needed with GNU sed, else it apparently doesn't recognize
\r\n as the end-of-line for purposes of the
(Ach - it looks like Warren has responded while i was writing this, but i'll go ahead and post it for completeness's sake.)
(14) By Kyle (kylekrieg) on 2021-05-06 16:00:16 in reply to 7 [link] [source]
I think this is the ticket! This worked!