select * from fcc where unique_system_identifier='4350131'; returns 1000+ records

(1.1) Originally by Kyle (kylekrieg) with edits by Richard Hipp (drh) on 2023-04-19 00:51:42 from 1.0 [link] [source]

Hello all,

I'm new to sqlite and databases.  I have a Node Red flow which looks up an users FCC amateur radio call sign in a database I imported from the FCC.  The schema is as follows;

sqlite> .schema
CREATE TABLE fcc(
record_type               char(2)              not null,
unique_system_identifier  numeric(9)         not null,
uls_file_number           char(14)             null,
ebf_number                varchar(30)          null,
call_sign                 char(10)             null,
entity_type               char(2)              null,
licensee_id               char(9)              null,
entity_name               varchar(200)         null,
first_name                varchar(20)          null,
mi                        char(1)              null,
last_name                 varchar(20)          null,
suffix                    char(3)              null,
phone                     char(10)             null,
fax                       char(10)             null,
email                     varchar(50)          null,
street_address            varchar(60)          null,
city                      varchar(20)          null,
state                     char(2)              null,
zip_code                  char(9)              null,
po_box                    varchar(20)          null,
attention_line            varchar(35)          null,
sgin                      char(3)              null,
frn                       char(10)             null,
applicant_type_code       char(1)              null,
applicant_type_other      char(40)             null,
status_code               char(1)       null,
status_date               datetime             null,
lic_category_code         char(1)              null,
linked_license_id         numeric(9)           null,
linked_callsign           char(10)             null
);
CREATE INDEX call_sign_idx on fcc(call_sign);
CREATE INDEX unique_system_identifier_idx on fcc(unique_system_identifier);


The import was created via a dat file with the command .import EN.dat fcc.  Importing had a few errors, but I'm not sure how to decpher the errors.

sqlite> .import EN.dat fcc
EN.dat:874847: unescaped " character
EN.dat:876288: unescaped " character
EN.dat:876288: unescaped " character
EN.dat:876288: unescaped " character
EN.dat:874847: expected 30 columns but found 29 - filling the rest with NULL
EN.dat:893284: unescaped " character
EN.dat:900994: unescaped " character
EN.dat:900994: unescaped " character
EN.dat:910606: unescaped " character
EN.dat:910606: unescaped " character
EN.dat:910606: unescaped " character
EN.dat:893284: expected 30 columns but found 29 - filling the rest with NULL
EN.dat:1479385: unescaped " character
EN.dat:1479385: unescaped " character
EN.dat:1479385: unterminated "-quoted field
EN.dat:1479385: expected 30 columns but found 10 - filling the rest with NULL
sqlite>


Here is some sample data from the EN.dat file (all public info if you were wondering)....

EN|4442768|||KD9SPT|L|L02444991|Lynn, Brian S|Brian|S|Lynn|||||312 Benedictine Ct|O Fallon|IL|62269|||000|0030508675|I||||||
EN|4442769|||KD9SPU|L|L02444992|Lambert, Dawn R|Dawn|R|Lambert|||||3924 Old Hwy 51|Makanda|IL|62958|||000|0030818413|I||||||
EN|4442770|||KO4PYH|L|L02437414|Parisi, Tyler J|Tyler|J|Parisi|||||812 PREAKNESS DR|ALPHARETTA|GA|300222706|||000|0030782544|I||||||
EN|4442771|||KI5PRJ|L|L02445028|Redardi, Cinvent C|Cinvent|C|Rebardi|||||86641 E CHORAN AVE LN|CROWLEY|TX|760369435|||000|0030869424|I||||||
EN|4442772|||KJ7YJS|L|L02445029|Black, Robert|Robert||Black|||||395 E MEADOW LARK LN|ELK RIDGE|UT|846518000|||000|0030864474|I||||||

When looking up a friend's new call sign in the database via call_sign, I was seeing the database wasn't returning the correct record, so I got to digging.

When I issued a select * from fcc where unique_system_identifier='4442936'; I get zero records.  That unique system identifier is my friends in the FCC database.  

I knew he was one of the last people issued a call sign from this EN.dat file, so I entered in the following sql statement select * from fcc order by unique_system_identifier desc limit 10;  There must of been 2000+ records come back.  I scrolled to the top of the list to find the last unique_system_identifier as 4350131.  

If I search for 4350130 it returns the correct record.
Search for 4350131 and it returns 2000+ records.
Search for anything past 4350132 and beyond returns no records, no matter what field your searching for.

I have no idea why this would occur?  Sorry for the long email, just trying to give someone all the info I have.  It's like the indexing is messed up but I dropped and created a new index file.

Thanks

Kyle

(2) By Warren Young (wyoung) on 2021-05-05 22:57:16 in reply to 1.0 [source]

Without this EN.dat file you're asking us to speculate on a few points.

However, what immediately catches my eye is that first field, record_type. That's a good sign of a single flat table that's trying to do the job of multiple tables interlinked by some relation, probably this unique_system_identifier value.

What then follows from that is this guess: your SELECT statement should include a WHERE record_type='EN' or similar, to filter out all of the other record types.

Once you sort these initial confusions out, if you're going to use this data long-term, I'd do some serious refactoring on the data to get it into normal form.

(4) By Kyle (kylekrieg) on 2021-05-06 02:13:00 in reply to 2 [link] [source]

What would help you make more of a better guess of what's going on?

The data can be downloaded here : ftp://wirelessftp.fcc.gov/pub/uls/complete/l_amat.zip. It's about 250MB in size.

All of the records that I imported have EN as the record_type.

Kyle

(6.2) By Warren Young (wyoung) on 2021-05-06 05:48:38 edited from 6.1 in reply to 4 [link] [source]

The data can be downloaded here

Okay, that puts a different spin on things. It's simply not well-formed for direct SQLite input.

I was able to get a conversion that allowed your queries to work with:

$ pip3 install --user csvs-to-sqlite

$ dos2unix EN.dat

$ echo 'record_type	unique_system_identifier	uls_file_number	ebf_number	call_sign	entity_type	licensee_id	entity_name	first_name	mi	last_name	suffix	phone	fax	email	street_address	city	state	zip_code	po_box	attention_line	sgin	frn	applicant_type_code	applicant_type_other	status_code	status_date	lic_category_code	linked_license_id	linked_callsign' > fcc.tsv

$ tr '|' '\t' < EN.dat >> fcc.tsv 

$ csvs-to-sqlite fcc.tsv fcc.db -s $'\t'

The third command is necessary to give the TSV file (fcc.tsv) the necessary headers. The hard tabs in the command made it through the forum posting's formatting path, so you should be able to copy the command directly.

This path still causes a few complaints, and it results in a somewhat different schema than the one you posted, but the result seems cleaner.

All of the records that I imported have EN as the record_type.

Sorry, the wrapping in your message made that unclear.

(11) By Kyle (kylekrieg) on 2021-05-06 14:27:57 in reply to 6.2 [link] [source]

Hey Warren, thanks for the info and the reply. I'm a bit over my head here but thanks for hanging with me.

Here is what I've gotten so far. I had to install some packages as I didn't have any python libraries installed. I also had to install some other dev libraries to get the csvs-to-sqlite to work as it was throwing an error. Had to issue this command sudo apt-get install libatlas-base-dev to get csvs-to-sqlite to work after installing.

sudo apt-get install python3-pip (which installed a ton of dependencies).

sudo pip3 install --user csvs-to-sqlite which installed in /.local/bin

sudo apt-get install dos2unix (installed fine)

echo to > fcc.tsv completed correctly

tr '	' 't' < EN.dat >> fcc.tsv completed correctly

csvs-to-sqlite fcc.tsv fcc.db -s $'t' has some errors as below;

pi@raspberrypi:~ $./.local/bin/csvs-to-sqlite fcc.tsv fcc.db -s $'t' /root/.local/lib/python3.7/site-packages/click/core.py:1066: DtypeWarning: Columns (12,18,21) have mixed types.Specify dtype option on import or set low_memory=False. return ctx.invoke(self.callback, **ctx.params) Loaded 1 dataframes Created fcc.db from 1 CSV file

Now the schema for the table looks a bit funky? Is this correct?

sqlite> .schema CREATE TABLE IF NOT EXISTS "fcc" ( "record_typeunique_system_identifieruls_file_numberebf_numbercall_signentity_typelicensee_identity_namefirst_namemilast_namesuffixphonefaxemailstreet_addresscitystatezip_codepo_boxattention_linesginfrnapplicant_type_codeapplicant_type_otherstatus_codestatus_datelic_category_codelinked_license_idlinked_callsign" INTEGER ); sqlite>

When I import the EN.dat (.import EN.dat fcc) not in CSV mode, I get the following error while it increments up. I stopped the import, this was the last line.

EN.dat:578088: expected 1 columns but found 30 - extras ignored

Thanks Kyle

(12) By Warren Young (wyoung) on 2021-05-06 14:36:40 in reply to 11 [link] [source]

You aren’t copying or pasting the hard tabs in the third command. It’s a single 30-element tab-separated line, repeating your field list from your initial post.

The single quotes should preserve the tabs, but perhaps they aren’t being pasted into your terminal properly.

Worst case, you can construct the line manually.

(13) By Kyle (kylekrieg) on 2021-05-06 15:42:23 in reply to 12 [link] [source]

Ok, here is where I'm at after creating the fcc.tsv file manually.

pi@raspberrypi:~ $ ./.local/bin/csvs-to-sqlite fcc.tsv fcctest -s $'\t'
/home/pi/.local/lib/python3.7/site-packages/click/core.py:1066: DtypeWarning: Columns (12,18,21) have mixed types.Specify dtype option on import or set low_memory=False.
  return ctx.invoke(self.callback, **ctx.params)
Loaded 1 dataframes
/home/pi/.local/lib/python3.7/site-packages/pandas/core/generic.py:2789: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  method=method,
Created fcctest.db from 1 CSV file

Here is the .schema for fcctest.db.  When I complete a search, I get 0 records.

pi@raspberrypi:~ $ sqlite3 fcctest.db
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE IF NOT EXISTS "fcc" (
"record_type" INTEGER,
  "unique_system_identifier" INTEGER,
  "uls_file_number" TEXT,
  "ebf_number" TEXT,
  "call_sign" TEXT,
  "entity_type" TEXT,
  "licensee_identity_name" TEXT,
  "first_name" TEXT,
  "mi" TEXT,
  "last_name" TEXT,
  "suffix" TEXT,
  "phone" INTEGER,
  "fax" INTEGER,
  "email" TEXT,
  "street_address" TEXT,
  "city" TEXT,
  "state" INTEGER,
  "zip_code" TEXT,
  "po_box" TEXT,
  "attention_line" TEXT,
  "sgin" INTEGER,
  "frn" TEXT,
  "applicant_type_code" INTEGER,
  "applicant_type_other" INTEGER,
  "status_code status_date" INTEGER,
  "lic_category_code" INTEGER,
  "linked_license_id" INTEGER,
  "linked_callsign" INTEGER
);
sqlite> select * from fcc where call_sign='AA0Z';
sqlite> select count(*) from fcc;
1505803
sqlite> select * from fcc where call_sign="AA0Z";
sqlite> select * from fcc where call_sign='AA0Z';
sqlite> select * from fcc where unique_system_identifier='4442936';

(3.2) By Simon Slavin (slavin) on 2021-05-06 01:51:20 edited from 3.1 in reply to 1.0 [link] [source]

The CSV file you're downloading has a format error which is preventing SQLite from being able to import it all.

Here's an example line from the file:

EN|261932|||K7SAS|L|L00171357|SCHURMAN SR, ROBERT W|ROBERT|W|SCHURMAN|SR||||1175 1/2 "D" ST|BAKER CITY|OR|97814| ||000|0002101574|I||||||

You can't do this in a file in CSV format. Double quotes are used to delimit text strings. You can't use them inside strings unless you double them. So that line should really look like this:

EN|261932|||K7SAS|L|L00171357|SCHURMAN SR, ROBERT W|ROBERT|W|SCHURMAN|SR||||1175 1/2 ""D"" ST|BAKER CITY|OR|97814| ||000|0002101574|I||||||

But CSV files use commas, not pipes, so it should really look more like

EN,261932,,,K7SAS,L,L00171357,SCHURMAN SR, ROBERT W,ROBERT,W,SCHURMAN,SR,,,,1175 1/2 ""D"" ST,BAKER CITY,OR,97814, ,,000,0002101574,I,,,,,,

But some of those commas are inside the fields, not delimiters, so this particular line should look more like

EN,261932,,,K7SAS,L,L00171357,"SCHURMAN SR, ROBERT W",ROBERT,W,SCHURMAN,SR,,,,1175 1/2 ""D"" ST,BAKER CITY,OR,97814, ,,000,0002101574,I,,,,,,

I think you would need to find some way to make these global changes to the file before you proceed. It's possible that there's some combination of settings for the SQLite tool which allows importing without it, but I don't know one.

I suspect you would find it simplest to import your data from this file into a SQLite database by writing your own program, than to use the command-line tool.

(5) By Kyle (kylekrieg) on 2021-05-06 02:16:07 in reply to 3.2 [link] [source]

I don't think I changed my mode to CSV when importing the EN.dat file. I've imported CSV files in the past and I've had to change the mode to CSV before importing.

The problem I have is this database is 1.4 million lines long. Any program (that I know of) just doesn't cut it with importing something like that and converting the pipes to commas.

Kyle

(8.2) By Stephan Beal (stephan) on 2021-05-06 06:29:57 edited from 8.1 in reply to 5 [link] [source]

Any program (that I know of) just doesn't cut it with importing something like that

Such files are trivial to slice up from the CLI. e.g. to split them into 100k-line chunks using the bash shell and sed:

$ x=1; y=1; chunk=100000; while sed -n $x,$((x + $chunk))p EN.dat > EN_$y.dat; do test -s EN_$y.dat || break; x=$((x + $chunk + 1)); y=$((y + 1)); done

$ wc -l EN.dat 
1505803 EN.dat
$ wc -l EN_*.dat 
   100001 EN_10.dat
   100001 EN_11.dat
   100001 EN_12.dat
   100001 EN_13.dat
   100001 EN_14.dat
   100001 EN_15.dat
     5788 EN_16.dat
        0 EN_17.dat
   100001 EN_1.dat
   100001 EN_2.dat
   100001 EN_3.dat
   100001 EN_4.dat
   100001 EN_5.dat
   100001 EN_6.dat
   100001 EN_7.dat
   100001 EN_8.dat
   100001 EN_9.dat
  1505803 total

That 0-sized file at the end will always exist using this simple automation - simply remove it before importing the chunks.

EDIT: fixed off-by-one error which caused each successive group to start with the last line of the previous group. Now the line counts match up.

(10) By Warren Young (wyoung) on 2021-05-06 08:16:50 in reply to 8.2 [link] [source]

$ man 1 split

😁

(9.1) By Keith Medcalf (kmedcalf) on 2021-05-06 07:19:06 edited from 9.0 in reply to 5 [link] [source]

The data file is fine, it is just in a very weird format.

It uses | as the field separator but does not do quoting or escaping according to any commonly used method of quoting or escaping.

For example, the record:

EN|4350131|||KM0BUL|L|L02375253|Bulla 495, Kerry "|Kerry|"|Bulla|495||||5835 Best Rd|Larkspur|CO|80118|80118|Kerry "Mark" Bulla|000|0029933348|I||||||

It requires a custom parser written according to whatever rules apply for the data file format since it is non-standard.

If you read the file, split records on \n and fields on | then rewrite as a properly formatted csv it works just fine. (Save that it has fubar termination, but that is ok).

Python 3.9.4 (tags/v3.9.4:1f2e308, Apr  6 2021, 13:40:21) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> f = open('en.dat', 'r')
>>> data = f.read().split('\n')
>>> f.close()
>>> data = [row.split('|') for row in data]
>>> f = open('en.csv', 'w')
>>> c = csv.writer(f)
>>> c.writerows(data)
>>> f.close()
>>> quit()

If you want to do it row-by-each rather than processing the whole file in RAM at once the following should do that:

>>> import csv
>>> fi = open('en.dat', 'r')
>>> fo = open('en.csv', 'w')
>>> c = csv.writer(fo)
>>> c.writerows(row.strip('\n').split('|') for row in fi)
>>> fi.close()
>>> fo.close()

(7) By Stephan Beal (stephan) on 2021-05-06 06:13:17 in reply to 3.2 [link] [source]

I think you would need to find some way to make these global changes to the file before you proceed.

Any decent stream editor (sed or perl) can do that. Here's a line of sed which fully quotes all of the records, double-quoting any existing quote characters:

$ sed -e 's/\r//g' -e 's/"/""/g' -e 's/|/","/g' -e 's/^/"/g' -e 's/$/"/'  < EN.dat > _en.dat
$ grep 'SCHURMAN SR' _en.dat 
"EN","261932","","","K7SAS","L","L00171357","SCHURMAN SR, ROBERT W","ROBERT","W","SCHURMAN","SR","","","","1175 1/2 ""D"" ST","BAKER CITY","OR","97814"," ","","000","0002101574","I","","","","","",""
...

The initial \r removal is needed with GNU sed, else it apparently doesn't recognize \r\n as the end-of-line for purposes of the $ character.

(Ach - it looks like Warren has responded while i was writing this, but i'll go ahead and post it for completeness's sake.)

(14) By Kyle (kylekrieg) on 2021-05-06 16:00:16 in reply to 7 [link] [source]

I think this is the ticket! This worked!