Selecting certain sub string

(1) By Ertan Küçükoğlu (ertank) on 2021-12-23 21:19:31 [link] [source]

Hello,

I am using 3.27.2 version of SQLite. I can switch to the latest if needed.

There is a table with column information in each row like following:
D:\DOSYALAR\2021-07\0012090829\0012090829-202107-K-000000.zip
D:\DOSYALAR\2021-07\0012090829\GIB-0012090829-202107-KB-000000.zip
D:\DOSYALAR\2021-08\16270558554\GIB-16270558554-202108-YB-000000.zip
D:\DOSYALAR\2021-08\35345023780\35345023780-202108-K-000000.zip
D:\DOSYALAR\2021-08\9920362895\GIB-9920362895-202108-YB-000000.zip
D:\DOSYALAR\2021-08\3310378950\3310378950-202108-K-000001.zip

These are recorded values by an application that I have no control of. I just need to parse that data for my own use.

What I would like is to select following information in separate columns:
1- Filename only
2- 202107, 202108 (These are the sections before K, KB, Y, YB letters in filename. They mean year and month)
3- 000000, 000001 (Last part of the filename in numbers. They are index numbers for a multi-part file.)

I tried to find something like lastindexof() or similar function but all I could find is instr() function that returns first occurrence of a character.

There is a solution here: https://stackoverflow.com/questions/21388820/how-to-get-the-last-index-of-a-substring-in-sqlite

I tried to use it. But, I all messed it up with number 2 and 3 above. Functions in other functions something I cannot easily handle.

Since above answer is years back. I wonder if there is any new function/enhanced function to get last index of a character in string or another easier solution.

Thanks & Regards,
Ertan

(2) By Simon Slavin (slavin) on 2021-12-24 03:49:18 in reply to 1 [link] [source]

There is no simple function, or simple combination of functions, in SQLite which can do what you want.

You could write such a function in your favourite language, and get SQLite to load it:

https://sqlite.org/c3ref/create_function.html

Note that your favourite language library or shim make have a version of sqlite3_create_function() that looks more like that language and would be easier for you to use.

But frankly I'd just have SQLite return that entire pathfilename string and process it in my software..

(4) By Ertan Küçükoğlu (ertank) on 2021-12-24 10:09:17 in reply to 2 [link] [source]

Thank you.

I need to do some filters based on the extracted values and that would be easier using SQLite to have them filtered.

I really need to check creating a function.

(3) By Larry Brasfield (larrybr) on 2021-12-24 08:42:22 in reply to 1 [link] [source]

Something like the below query with capitalized keywords would do. (I include a few shell inputs to get things setup.)

create table Gloms( fpath text );
insert into Gloms values
    ('D:\DOSYALAR\2021-07\0012090829\0012090829-202107-K-000000.zip'),
    ('D:\DOSYALAR\2021-07\0012090829\GIB-0012090829-202107-KB-000000.zip'),
    ('D:\DOSYALAR\2021-08\16270558554\GIB-16270558554-202108-YB-000000.zip'),
    ('D:\DOSYALAR\2021-08\35345023780\35345023780-202108-K-000000.zip'),
    ('D:\DOSYALAR\2021-08\9920362895\GIB-9920362895-202108-YB-000000.zip'),
    ('D:\DOSYALAR\2021-08\3310378950\3310378950-202108-K-000001.zip')
;

SELECT (SELECT chopee FROM (
  WITH RECURSIVE PluckFest(chopee, pluck, niter) AS (
    SELECT
    '' AS chopee,
    fpath AS pluck,
    0 AS niter
    -- At each iteration:
    --   chopee is what is being removed, sans any separator
    --   pluck is what remains to have a leading delimited clump removed
    --   niter is how many removals were done upto present iteration
    UNION
    SELECT
      iif( instr(pluck, '\'), substr(pluck, 1, instr(pluck, '\')-1), pluck ),
      iif( instr(pluck, '\'), substr(pluck, instr(pluck, '\')+1), ''),
      niter+1
    FROM PluckFest
    WHERE pluck <> ''
  ) SELECT chopee, pluck, niter FROM PluckFest
) WHERE niter=5) AS favoredPathPart
FROM Gloms;

(5) By Ertan Küçükoğlu (ertank) on 2021-12-24 10:11:41 in reply to 3 [link] [source]

Thanks for the help.

If it was only filename part that can be extracted with below (from URL in my initial post)

select replace(fpath, rtrim(fpath, replace(fpath, '', '')), '') from Gloms;

However, extracting more from this becomes a pain since I have read only access to that source database.

(6) By anonymous on 2021-12-24 10:13:26 in reply to 1 [link] [source]

Is the filename always '.zip' (or at least guaranteed to have a 3-character extension)? Is the section with K, KB, Y, YB always guaranteed to be either one or two characters?

If yes to both of those questions, then you can do it simply by counting characters, something like...

WITH filenames(f) AS (
  VALUES ('D:\DOSYALAR\2021-07\0012090829\0012090829-202107-K-000000.zip'),
         ('D:\DOSYALAR\2021-07\0012090829\GIB-0012090829-202107-KB-000000.zip'),
         ('D:\DOSYALAR\2021-08\16270558554\GIB-16270558554-202108-YB-000000.zip'),
         ('D:\DOSYALAR\2021-08\35345023780\35345023780-202108-K-000000.zip'),
         ('D:\DOSYALAR\2021-08\9920362895\GIB-9920362895-202108-YB-000000.zip'),
         ('D:\DOSYALAR\2021-08\3310378950\3310378950-202108-K-000001.zip'))
SELECT f,
  SUBSTR(f, -10, 6),
  SUBSTR(f, IIF(SUBSTR(f,-13,1)='-', -19, -20), 6)
FROM filenames;

Use a negative index for SUBSTR() to count from the right. Note that you'll need Sqlite 3.32.0 or later for the IIF() function (used to check the length of the letter code in the middle).

(9) By Ertan Küçükoğlu (ertank) on 2021-12-25 17:42:35 in reply to 6 [link] [source]

Thank you.

Among other suggestions, this is the most simple solution which I can understand. Never thought that substr() would be working from right side.

That will force me to use latest version, but that can be managed.

(7) By Ryan Smith (cuz) on 2021-12-24 12:04:52 in reply to 1 [source]

Something fun to do - thanks.

Here is one solution (using many of the tricks already expressed by others) and some Common Table expressions to weed out the rest.

Assumptions:

There is always a full path given
The filename always contain both the year-month and the file number
The year-month is always 6 digits long
The file-number is always 6 digits long

Everything else should be dynamically handled.

  -- SQLite version 3.35.4  [ Release: 2021-04-02 ]  on  SQLitespeed version 2.1.3.11.
  -- ================================================================================================

  -- Create an example table containing example paths...

CREATE TABLE paths(pathname TEXT);

INSERT INTO paths VALUES
 ('D:\DOSYALAR\2021-07\0012090829\0012090829-202107-K-000000.zip')
,('D:\DOSYALAR\2021-07\0012090829\GIB-0012090829-202107-KB-000000.zip')
,('D:\DOSYALAR\2021-08\16270558554\GIB-16270558554-202108-YB-000000.zip')
,('D:\DOSYALAR\2021-08\35345023780\35345023780-202108-K-000000.zip')
,('D:\DOSYALAR\2021-08\9920362895\GIB-9920362895-202108-YB-000000.zip')
,('D:\DOSYALAR\2021-08\3310378950\3310378950-202108-K-000001.zip')
;


-- Show how the path-length identifier works. It keeps looking for the
-- next slash in the name, skips to it and cuts off the handled part,
-- using the row-id to identify the rows. (It could use the primary key
-- or unique index too, but my example did not have any):

WITH pathpos(path_id, slash_idx, remaining_path) AS (
    SELECT rowid, 0, pathname FROM paths
    UNION ALL
    SELECT path_id, slash_idx + instr(remaining_path,'\'), substr(remaining_path, instr(remaining_path, '\') + 1)
      FROM pathpos
     WHERE instr(remaining_path, '\') > 0  LIMIT 1000
), pathlen(path_id, last_slash_idx, remaining_path) AS (
    SELECT path_id, MAX(slash_idx), remaining_path
      FROM pathpos
     GROUP BY path_id
)
SELECT path_id, last_slash_idx AS directory_length, remaining_path AS filename
  FROM pathlen
;

  --             |director-|                                      
  --    path_id  | y_length|filename                              
  -- ------------|---------|--------------------------------------
  --       1     |    31   |0012090829-202107-K-000000.zip        
  --       2     |    31   |GIB-0012090829-202107-KB-000000.zip   
  --       3     |    32   |GIB-16270558554-202108-YB-000000.zip  
  --       4     |    32   |35345023780-202108-K-000000.zip       
  --       5     |    31   |GIB-9920362895-202108-YB-000000.zip   
  --       6     |    31   |3310378950-202108-K-000001.zip        




-- The next query shows using the above and more string searches and
-- sub-string handlers to get the rest of the information.

WITH pathpos(path_id, slash_idx, remaining_path) AS (
    SELECT rowid, 0, pathname FROM paths
    UNION ALL
    SELECT path_id, slash_idx + instr(remaining_path,'\'), substr(remaining_path, instr(remaining_path, '\') + 1)
      FROM pathpos
     WHERE instr(remaining_path, '\') > 0  LIMIT 1000
), pathlen(path_id, last_slash_idx, remaining_path) AS (
    SELECT path_id, MAX(slash_idx), remaining_path
      FROM pathpos
     GROUP BY path_id
), ym_start(path_id, yearmonth_idx, yearmonth_part) AS (
    SELECT path_id,
           instr(substr(remaining_path,5), '-') + 1,
	   substr(remaining_path, 5)
      FROM pathlen
), ym_end(path_id, yearmonth, fileno_idx, yearmonth_part) AS (
    SELECT path_id,
           substr(yearmonth_part, yearmonth_idx, 6),
           instr(substr(yearmonth_part, yearmonth_idx + 7), '-') + 1,
           substr(yearmonth_part, yearmonth_idx + 7)
      FROM ym_start
), all_info(path_id, directory, filename, type, yearmonth, fileno) AS (
    SELECT pathlen.path_id,
           substr(paths.pathname, 1, pathlen.last_slash_idx),
	   pathlen.remaining_path,
	   substr(ym_end.yearmonth_part, 1, ym_end.fileno_idx - 2),
	   ym_end.yearmonth,
	   substr(ym_end.yearmonth_part, ym_end.fileno_idx, 6)
      FROM pathlen
      JOIN ym_end ON ym_end.path_id = pathlen.path_id
      JOIN paths  ON paths.rowid    = pathlen.path_id
)
SELECT *
  FROM all_info
;


  --    path_id  |directory                         |filename                              | type|yearmonth| fileno 
  -- ------------|----------------------------------|--------------------------------------|-----|---------|--------
  --       1     |D:\DOSYALAR\2021-07\0012090829\   |0012090829-202107-K-000000.zip        |  K  |  202107 | 000000 
  --       2     |D:\DOSYALAR\2021-07\0012090829\   |GIB-0012090829-202107-KB-000000.zip   |  KB |  202107 | 000000 
  --       3     |D:\DOSYALAR\2021-08\16270558554\  |GIB-16270558554-202108-YB-000000.zip  |  YB |  202108 | 000000 
  --       4     |D:\DOSYALAR\2021-08\35345023780\  |35345023780-202108-K-000000.zip       |  K  |  202108 | 000000 
  --       5     |D:\DOSYALAR\2021-08\9920362895\   |GIB-9920362895-202108-YB-000000.zip   |  YB |  202108 | 000000 
  --       6     |D:\DOSYALAR\2021-08\3310378950\   |3310378950-202108-K-000001.zip        |  K  |  202108 | 000001 

  -- ------------------------------------------------------------------------------------------------

Please note: On a rather large dataset, this query will get the job done, but it will not be fast. For speed, nothing will beat the UDF suggestion made in another post.

Good luck!

(8) By Ertan Küçükoğlu (ertank) on 2021-12-25 17:38:16 in reply to 7 [link] [source]

Thank you for your time.

This whole thing looks intimidating to me. I was actually after something simpler. It does work.