howto combine rows with recursive query

(1) By Maik (MaikScholz) on 2022-04-04 19:27:05 [link] [source]

Hi,
i have a table with complete messages or with message segments.

gidx, just counting each row in t_segments
segm, a complete message or just a segment

optional for segments:
tsize, total size of a reassembled message
segmidx, segment index (index of segment in a message)

All full messages or segments are stored row by row with a incrementing gidx/rowid.

Full/complete messages are just stored in segm. tsize and segmidx are void.
Message Segments are stored in segm, tsize is the size of the complete reassembled message. segmidx counts the segment in the complete message (0,1,2,...).


Example dataset:
CREATE TABLE t_segments (gidx INTEGER PRIMARY KEY,segm BLOB,tsize INTEGER,segmidx INTEGER);
INSERT INTO t_segments (gidx,segm) VALUES(1,"This is a complete message 1"); 
INSERT INTO t_segments (gidx,segm) VALUES(2,"This is a complete message 2"); 
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(3,"This is a ",28,0); 
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(4,"complete m",28,1); 
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(5,"essage 3",28,2); 
INSERT INTO t_segments (gidx,segm) VALUES(6,"This is a complete message 4"); 

The expected output is:
1 | This is a complete message 1
2 | This is a complete message 2
3 | This is a complete message 3
6 | This is a complete message 3

I made some experiments, with a recursive query.
Unfortunately, the output is not as expected.

WITH RECURSIVE
  v_message(gidx,segm,tsize,segmidx,isfirst) AS (
     SELECT S.gidx,S.segm,S.tsize,S.segmidx,(S.segmidx IS NOT NULL AND S.segmidx=0) FROM t_segments AS S 
     UNION ALL
     SELECT S.gidx,printf("%s%s",M.segm,S.segm),S.tsize,S.segmidx,M.isfirst FROM t_segments AS S,v_message AS M
     WHERE ((S.gidx>M.gidx AND S.segmidx IS NOT NULL AND M.isfirst=1 AND S.segmidx=M.segmidx+1))
)
SELECT * FROM v_message;


1|This is a complete message 1|||0
2|This is a complete message 2|||0
3|This is a |28|0|1
4|complete m|28|1|0
5|essage 3|28|2|0
6|This is a complete message 4|||0
4|This is a complete m|10|1|1
5|This is a complete message 3|28|2|1

Any hint, howto solve this?

(2) By Michael A. Cleverly (cleverly) on 2022-04-04 22:46:04 in reply to 1 [link] [source]

This works for your sample data:

WITH recombined(gidx, segm) AS (
    SELECT min(gidx) AS gidx, group_concat(segm, '') AS segm 
      FROM t_segments 
     WHERE tsize IS NOT NULL 
     ORDER BY segmidx
)
SELECT gidx, segm
  FROM recombined
 UNION ALL
SELECT gidx, segm
  FROM t_segments
 WHERE tsize IS NULL
 ORDER BY gidx;

I believe the documentation for the group_concat() aggregate function says that the order of concatenation is not guaranteed, but I tested all six permutations of insert order for the three INSERT statements in question (WHERE gidx=3) with SQLite versions 3.32.3, 3.37.2, and 3.38.2 (the ones I have installed on my Mac) and the reconstructed message was "This is a complete message 3" in all cases.

(3) By Michael A. Cleverly (cleverly) on 2022-04-04 22:50:07 in reply to 1 [link] [source]

I just realized (doh!) that if your sample data had had a 2nd segmented message in it that the ORDER BY in the CTE would need to be

ORDER BY gidx, segmidx

Rather than just

ORDER BY segmidx

I can't edit my original reply since it is still pending moderation.

(4) By Michael A. Cleverly (cleverly) on 2022-04-04 23:07:30 in reply to 3 [link] [source]

Actually, no, that won't work either, if there is more than one segment, because gidx is different for each row.

From your data you appear to know that the full length of the message is 28 characters before you ever do your first insert. (And, if you do, why not just insert it as a single message to begin with?)

Can you guarantee that a single message will be split up and inserted with increasing gidx values without an intervening message being split up too?

In other words, would the following NEVER happen?

INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(3,"This is a ",28,0); -- your gidx=3
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(4,'ANOTHER MESS',28,0);
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(5,"complete m",28,1); -- your gidx=4
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(6,'AGE BUT ',28,2);
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(7,"essage 3",28,2);   -- your gidx=5
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(8,'NUMBER 4',28,2);

If you had an contofgidx column, and then segmidx (you could define them to be UNIQUE(contofgidx, segmidx) then you could group on contofgidx,segmidx. Rows that are complete would be (NULL,NULL) but every NULL is unique from every other NULL, so there wouldn't be a constraint violation. You'd be able to get rid of tsize then too.

(5.1) By Keith Medcalf (kmedcalf) on 2022-04-04 23:51:55 edited from 5.0 in reply to 1 [link] [source]

You mean like this? Note that there is absolutely no need for recursion. You simply have to firstly reconstruct the missing data that someone failed to store (MessageNo).

  select MessageNo,
         group_concat(segm,'')
    from (
            select gidx - coalesce(segmidx,0) as MessageNo,
                   coalesce(segmidx, 0) as LineNo,
                   segm
              from t_segments
          order by gidx, coalesce(segmidx,0)
         )
group by MessageNo
;

and the projection result is:

┌───────────┬────────────────────────────────┐
│ MessageNo │     group_concat(segm,'')      │
├───────────┼────────────────────────────────┤
│ 1         │ 'This is a complete message 1' │
│ 2         │ 'This is a complete message 2' │
│ 3         │ 'This is a complete message 3' │
│ 6         │ 'This is a complete message 4' │
└───────────┴────────────────────────────────┘

(6) By Maik (MaikScholz) on 2022-04-05 06:03:41 in reply to 5.1 [link] [source]

Hi,
thank you very much for the hint. So far it is working.

CREATE TABLE t_segments (gidx INTEGER PRIMARY KEY,segm BLOB,tsize INTEGER,segmidx INTEGER);
INSERT INTO t_segments (gidx,segm) VALUES(1,"This is a complete message 1"); 
INSERT INTO t_segments (gidx,segm) VALUES(2,"This is a complete message 2"); 
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(3,"This is a ",28,0); 
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(4,"complete m",28,1); 
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(5,"essage 3",28,2); 
INSERT INTO t_segments (gidx,segm) VALUES(6,"This is a complete message 4"); 
SELECT MessageNo,
         group_concat(segm,'')
    FROM (
            SELECT gidx - coalesce(segmidx,0) as MessageNo,
                   coalesce(segmidx, 0) as LineNo,
                   segm
              FROM t_segments
          ORDER BY gidx, coalesce(segmidx,0)
         )
GROUP BY MessageNo;

=>

1|This is a complete message 1
2|This is a complete message 2
3|This is a complete message 3
6|This is a complete message 4

=============================

Unfortunately, the real application is a step more complicate.
There are different message classes in the same table.

CREATE TABLE t_segments (gidx INTEGER PRIMARY KEY,msgclass,segm BLOB,tsize INTEGER,segmidx INTEGER);

INSERT INTO t_segments (gidx,segm) VALUES(1,"A","This is a complete message 1"); 
INSERT INTO t_segments (gidx,segm) VALUES(2,"B","This is for class B"); 
INSERT INTO t_segments (gidx,segm) VALUES(3,"A","This is a complete message 2"); 
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(4,"A","This is a ",28,0);8
INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(5,"A","complete m",28,1); 

INSERT INTO t_segments (gidx,segm) VALUES(6,"B","This is for class B"); 

INSERT INTO t_segments (gidx,segm,tsize,segmidx) VALUES(7,"A","essage 3",28,2); 
INSERT INTO t_segments (gidx,segm) VALUES(7,"A","This is a complete message 4"); 

Expected output:
1|This is a complete message 1
2|This is a for class B
3|This is a complete message 2
4or7|This is a complete message 3
6|This is a for class B
7|This is a complete message 4

Alternatively, i could store the rowid of the first splitted message in a new column of each subsequent message. This would be easy in the C code layer.

Thank you very much for the help.

(7.1) By midijohnny on 2022-04-05 12:41:18 edited from 7.0 in reply to 6 [link] [source]

Please note - you need to fix-up your example - just to make sure everybody understands the modified use-case.

There is an error (too few columns for values) in the latest version:

Result: 3 values for 2 columns
At line 2:
INSERT INTO t_segments (gidx,segm) VALUES(1,"A","This is a complete message 1");

(8) By Maik (MaikScholz) on 2022-04-05 12:24:30 in reply to 7.0 [source]

Hi,
sorry. This is the related SQL example:

CREATE TABLE t_segments (gidx INTEGER PRIMARY KEY,msgclass,segm BLOB,tsize INTEGER,segmidx INTEGER);

INSERT INTO t_segments (gidx,msgclass,segm) VALUES(1,"A","This is a complete message 1"); 

INSERT INTO t_segments (gidx,msgclass,segm) VALUES(2,"B","This is for class B"); 

INSERT INTO t_segments (gidx,msgclass,segm) VALUES(3,"A","This is a complete message 2"); 

INSERT INTO t_segments (gidx,msgclass,segm,tsize,segmidx) VALUES(4,"A","This is a ",28,0);
INSERT INTO t_segments (gidx,msgclass,segm,tsize,segmidx) VALUES(5,"A","complete m",28,1); 

INSERT INTO t_segments (gidx,msgclass,segm) VALUES(6,"B","This is for class B"); 

INSERT INTO t_segments (gidx,msgclass,segm,tsize,segmidx) VALUES(7,"A","essage 3",28,2); 

INSERT INTO t_segments (gidx,msgclass,segm) VALUES(8,"A","This is a complete message 4"); 
SELECT MessageNo,
         group_concat(segm,'')
    FROM (
            SELECT gidx - coalesce(segmidx,0) as MessageNo,
                   coalesce(segmidx, 0) as LineNo,
                   segm
              FROM t_segments
          ORDER BY gidx, coalesce(segmidx,0)
         )
GROUP BY MessageNo;

Output:
1|This is a complete message 1
2|This is for class B
3|This is a complete message 2
4|This is a complete m
5|essage 3
6|This is for class B
8|This is a complete message 4

Expected Output:
1|This is a complete message 1
2|This is for class B
3|This is a complete message 2
4|This is a complete message 3
6|This is for class B
8|This is a complete message 4

=====

From the C code, i could add a additional column holding the rowid reference to the first row in all child rows. Does this help?

# modified table with additional reference to parent/fist row for, splitted messages.

CREATE TABLE t_test2 (gidx INTEGER PRIMARY KEY,p,msgclass,segm BLOB,tsize INTEGER,segmidx INTEGER);

INSERT INTO t_test2 (gidx,p,msgclass,segm) VALUES(1,NULL,"A","This is a complete message 1"); 

INSERT INTO t_test2 (gidx,p,msgclass,segm) VALUES(2,NULL,"B","This is for class B"); 

INSERT INTO t_test2 (gidx,p,msgclass,segm) VALUES(3,NULL,"A","This is a complete message 2"); 

INSERT INTO t_test2 (gidx,p,msgclass,segm,tsize,segmidx) VALUES(4,NULL,"A","This is a ",28,0);
INSERT INTO t_test2 (gidx,p,msgclass,segm,tsize,segmidx) VALUES(5,4,"A","complete m",28,1); 

INSERT INTO t_test2 (gidx,p,msgclass,segm) VALUES(6,NULL,"B","This is for class B"); 

INSERT INTO t_test2 (gidx,p,msgclass,segm,tsize,segmidx) VALUES(7,4,"A","essage 3",28,2); 

INSERT INTO t_test2 (gidx,p,msgclass,segm) VALUES(8,NULL,"A","This is a complete message 4"); 
SELECT MessageNo,
         group_concat(segm,'')
    FROM (
            SELECT gidx - coalesce(segmidx,0) as MessageNo,
                   coalesce(segmidx, 0) as LineNo,
                   segm
              FROM t_test2
          ORDER BY gidx, coalesce(segmidx,0)
         )
GROUP BY MessageNo;

(9.1) By Ryan Smith (cuz) on 2022-04-06 10:02:13 edited from 9.0 in reply to 8 [link] [source]

I feel like there are too many unanswered questions here to really provide a good solution, and that once a solution is given for the current supplied data, a simple new added bit will come to light regarding the "real data" to further shift the goal-posts.

Either way, this query handles the current given data correctly. Some things to note:

I have used CTE's but only for clarity, they are not special nor recursive and can be refactored into sub-queries if needed.
The first entries build the table and adds the data, then the working query is shown, and then a few more queries that simply breaks down the working query into its segments so that you can see what is going on in each step to better understand how it works, in case you want to alter it for your use case.
I've added some field to show the message-parts count, it's informative only and can be removed safely.

Lastly - If this does not work on "real data", kindly give actual real data next time.

  -- SQLite version 3.37.2  [ Release: 2022-01-06 ]  on  SQLitespeed version 2.1.3.11.
  -- ================================================================================================

CREATE TABLE t_segments (
    gidx INTEGER PRIMARY KEY,
    msgclass,
    segm BLOB,
    tsize INTEGER,
    segmidx INTEGER
);


INSERT INTO t_segments (gidx, msgclass, segm, tsize, segmidx) VALUES
 (1,'A','This is a complete message 1', NULL, NULL)
,(2,'B','This is for class B',          NULL, NULL)
,(3,'A','This is a complete message 2', NULL, NULL)
,(4,'A','This is a ',                   28,   0   )
,(5,'A','complete m',                   28,   1   )
,(6,'B','This is for class B',          NULL, NULL)
,(7,'A','essage 3',                     28,   2   )
,(8,'A','This is a complete message 4', NULL, NULL)
;



WITH TSEGM(GIdx, MsgClass, MsgSegment, FullSize, PartIdx) AS (
    SELECT gidx, msgclass, segm, tsize, IFNULL(segmidx,0)
      FROM t_segments
), GRANGE(GIdx, NextSiblingGIdx) AS (
    SELECT A.GIdx, IFNULL((SELECT MIN(GIdx) FROM TSEGM AS B WHERE B.GIdx > A.GIdx AND B.MsgClass=A.MsgClass AND B.PartIdx = 0), A.GIdx+1)
      FROM TSEGM AS A
     WHERE A.PartIdx = 0
), AMALG(MsgIdx, SubIdx, PartIdx, MsgSegment) AS (
    SELECT P.GIdx, C.GIdx, C.PartIdx, C.MsgSegment
      FROM TSEGM AS P
      JOIN GRANGE AS R ON R.GIdx = P.GIdx
      JOIN TSEGM AS C ON C.GIdx >= R.GIdx AND C.GIdx < R.NextSiblingGIdx AND C.MsgClass = P.MsgClass
     WHERE P.PartIdx = 0
     ORDER BY P.GIdx, C.GIdx
), MSGS(MsgIdx, Message, PartCount) AS (
    SELECT MsgIdx, GROUP_CONCAT(MsgSegment,''), COUNT(*)
      FROM AMALG
     GROUP BY MsgIdx
)
SELECT * FROM MSGS;

  --    MsgIdx   |Message                       |PartCount
  -- ------------|------------------------------|---------
  --       1     |This is a complete message 1  |    1    
  --       2     |This is for class B           |    1    
  --       3     |This is a complete message 2  |    1    
  --       4     |This is a complete message 3  |    3    
  --       6     |This is for class B           |    1    
  --       8     |This is a complete message 4  |    1    




WITH TSEGM(GIdx, MsgClass, MsgSegment, FullSize, PartIdx) AS (
    SELECT gidx, msgclass, segm, tsize, IFNULL(segmidx,0)
      FROM t_segments
)
SELECT * FROM TSEGM;

  --     GIdx    |MsgClass|MsgSegment                    |FullSize|PartIdx
  -- ------------|--------|------------------------------|--------|-------
  --       1     |    A   |This is a complete message 1  |NULL    |   0   
  --       2     |    B   |This is for class B           |NULL    |   0   
  --       3     |    A   |This is a complete message 2  |NULL    |   0   
  --       4     |    A   |This is a                     |28      |   0   
  --       5     |    A   |complete m                    |28      |   1   
  --       6     |    B   |This is for class B           |NULL    |   0   
  --       7     |    A   |essage 3                      |28      |   2   
  --       8     |    A   |This is a complete message 4  |NULL    |   0   




WITH TSEGM(GIdx, MsgClass, MsgSegment, FullSize, PartIdx) AS (
    SELECT gidx, msgclass, segm, tsize, IFNULL(segmidx,0)
      FROM t_segments
), GRANGE(GIdx, NextSiblingGIdx) AS (
    SELECT A.GIdx, IFNULL((SELECT MIN(GIdx) FROM TSEGM AS B WHERE B.GIdx > A.GIdx AND B.MsgClass=A.MsgClass AND B.PartIdx = 0), A.GIdx+1)
      FROM TSEGM AS A
     WHERE A.PartIdx = 0
)
SELECT * FROM GRANGE;

  --             |NextSibl-
  --     GIdx    | ingGIdx 
  -- ------------|---------
  --       1     |    3    
  --       2     |    6    
  --       3     |    4    
  --       4     |    8    
  --       6     |    7    
  --       8     |    9    



WITH TSEGM(GIdx, MsgClass, MsgSegment, FullSize, PartIdx) AS (
    SELECT gidx, msgclass, segm, tsize, IFNULL(segmidx,0)
      FROM t_segments
), GRANGE(GIdx, NextSiblingGIdx) AS (
    SELECT A.GIdx, IFNULL((SELECT MIN(GIdx) FROM TSEGM AS B WHERE B.GIdx > A.GIdx AND B.MsgClass=A.MsgClass AND B.PartIdx = 0), A.GIdx+1)
      FROM TSEGM AS A
     WHERE A.PartIdx = 0
), AMALG(MsgIdx, SubIdx, PartIdx, MsgSegment) AS (
    SELECT P.GIdx, C.GIdx, C.PartIdx, C.MsgSegment
      FROM TSEGM AS P
      JOIN GRANGE AS R ON R.GIdx = P.GIdx
      JOIN TSEGM AS C ON C.GIdx >= R.GIdx AND C.GIdx < R.NextSiblingGIdx AND C.MsgClass = P.MsgClass
     WHERE P.PartIdx = 0
     ORDER BY P.GIdx, C.GIdx
)
SELECT * FROM AMALG;

  -- MsgIdx|SubIdx|PartIdx|MsgSegment                    
  -- ------|------|-------|------------------------------
  --    1  |   1  |   0   |This is a complete message 1  
  --    2  |   2  |   0   |This is for class B           
  --    3  |   3  |   0   |This is a complete message 2  
  --    4  |   4  |   0   |This is a                     
  --    4  |   5  |   1   |complete m                    
  --    4  |   7  |   2   |essage 3                      
  --    6  |   6  |   0   |This is for class B           
  --    8  |   8  |   0   |This is a complete message 4

(10) By Ryan Smith (cuz) on 2022-04-06 10:18:13 in reply to 9.1 [link] [source]

Someone pointed out that the output may need to include the message class - a rather trivial addition, but shown here for completeness:

WITH TSEGM(GIdx, MsgClass, MsgSegment, FullSize, PartIdx) AS (
    SELECT gidx, msgclass, segm, tsize, IFNULL(segmidx,0)
      FROM t_segments
), GRANGE(GIdx, NextSiblingGIdx) AS (
    SELECT A.GIdx, IFNULL((SELECT MIN(GIdx) FROM TSEGM AS B WHERE B.GIdx > A.GIdx AND B.MsgClass=A.MsgClass AND B.PartIdx = 0), A.GIdx+1)
      FROM TSEGM AS A
     WHERE A.PartIdx = 0
), AMALG(MsgIdx, MsgClass, SubIdx, PartIdx, MsgSegment) AS (
    SELECT P.GIdx, P.MsgClass, C.GIdx, C.PartIdx, C.MsgSegment
      FROM TSEGM AS P
      JOIN GRANGE AS R ON R.GIdx = P.GIdx
      JOIN TSEGM AS C ON C.GIdx >= R.GIdx AND C.GIdx < R.NextSiblingGIdx AND C.MsgClass = P.MsgClass
     WHERE P.PartIdx = 0
     ORDER BY P.GIdx, C.GIdx
), MSGS(MsgIdx, MsgClass, Message, PartCount) AS (
    SELECT MsgIdx, MsgClass, GROUP_CONCAT(MsgSegment,''), COUNT(*)
      FROM AMALG
     GROUP BY MsgIdx
)
SELECT *
  FROM MSGS
;


  --    MsgIdx   |MsgClass|Message                       |PartCount
  -- ------------|--------|------------------------------|---------
  --       1     |    A   |This is a complete message 1  |    1    
  --       2     |    B   |This is for class B           |    1    
  --       3     |    A   |This is a complete message 2  |    1    
  --       4     |    A   |This is a complete message 3  |    3    
  --       6     |    B   |This is for class B           |    1    
  --       8     |    A   |This is a complete message 4  |    1

(11.3) By midijohnny on 2022-04-06 11:21:24 edited from 11.2 in reply to 9.1 [link] [source]

My attempt - I added in 'message 5' for testing. It uses a window function to enumerate all the zero 'segmidx' rows (the 'head' of each list), then groups the non-zeros (the 'rest' of each list) - assuming that gidx never interleaves in between 'segmidx', then group_concat to get the combined message as before; then combines the already-complete messages in an ordered output. Note: see 'edits' below - this doesn't work properly (doesn't detect upper-bound of 'gidx/sedmidx' combination).

.version
.mode column

DROP TABLE IF EXISTS t_segments;
CREATE TABLE t_segments (gidx INTEGER PRIMARY KEY,msgclass,segm BLOB,tsize INTEGER,segmidx INTEGER);
INSERT INTO t_segments (gidx,msgclass,segm) VALUES
    (1,'A','This is a complete message 1'),
    (2,'B','This is for class B'),
    (3,'A','This is a complete message 2'),
    (6,'B','This is for class B'),
    (8,'A','This is a complete message 4');
INSERT INTO t_segments (gidx,msgclass,segm,tsize,segmidx) VALUES
    (4,'A','This is a ',28,0),
    (5,'A','complete m',28,1),
    (7,'A','essage 3',28,2),
    (9,'B','mess',9,0),
    (10,'B','age_5',9,1);

WITH
complete(gidx,segm) AS (
    SELECT gidx,segm FROM t_segments
        WHERE segmidx IS NULL), -- implies message is already complete
head(gidx,msgclass,segm,tsize,segmidx,i) AS (
        SELECT *, row_number() OVER (ORDER BY gidx) FROM t_segments
        WHERE segmidx==0 -- implies message is multipart, enumerate each of them (row_number()/OVER)
),
head_and_rest(gidx,msgclass,segm,tsize,segmidx,i) AS ( -- put the head and rest of all mulitparts together, with the enumeration to group them.
        SELECT t.*,h.i FROM head h
        INNER JOIN t_segments t USING(msgclass)
        WHERE t.gidx>=h.gidx AND t.segmidx>=h.segmidx AND t.tsize==h.tsize -- (last clause is kinda of a safety-check, but could be confusing if data is wrong).
),
combined(gidx, segm) AS ( -- join the message text
    SELECT gidx, GROUP_CONCAT(segm,'') AS segm
    FROM head_and_rest GROUP BY i
)
SELECT * FROM complete -- combine the already-complete and the joined-up multi-parts in single ordered output.
  UNION ALL
SELECT * FROM combined
ORDER BY gidx;

Output:

SQLite 3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837balt1
zlib version 1.2.11
gcc-9.3.0
gidx        segm
----------  ----------------------------
1           This is a complete message 1
2           This is for class B
3           This is a complete message 2
4           This is a complete message 3
6           This is for class B
8           This is a complete message 4
9           message_5
~

EDIT: I'm a little unclear about the significance of the 'msgClass' here - if it is just another field that doesn't affect 'gidx'/'segmidx' counts - the logic could be changed to remove the mention of it altogether.

EDIT2: Pretty sure my logic is wrong actually - since I'm not properly differentiating between items which belong to a particular group. (That is: the data I generated (class='B') is only co-incidentally making this work I think).

EDIT3: yeah, broken.

The following data shows that:

INSERT INTO t_segments (gidx,msgclass,segm,tsize,segmidx) VALUES
    (4,'A','This is a ',28,0),
    (5,'A','complete m',28,1),
	(7,'A','essage 3',28,2),
	(9,'B','mess',9,0),
	(10,'B','age_5',9,1),
	(11,'A','mess',9,0),
	(12,'A','age_6',9,1),
	(13,'A','mess',9,0),
	(14,'A','age_7',9,1);

gidx	segm
1	This is a complete message 1
2	This is for class B
3	This is a complete message 2
4	This is a complete message 3
6	This is for class B
8	This is a complete message 4
9	message_5
11	message_6message_7 --- <----- nope!
13	message_7

(12) By midijohnny on 2022-04-06 11:29:26 in reply to 11.3 [link] [source]

Tweaked the window function. (Now the naming of the CTEs are a bit odd) This might be right now - not sure - but added in some additional data to try and test more.

DROP TABLE IF EXISTS t_segments;
CREATE TABLE t_segments (gidx INTEGER PRIMARY KEY,msgclass,segm BLOB,tsize INTEGER,segmidx INTEGER);
INSERT INTO t_segments (gidx,msgclass,segm) VALUES
    (1,'A','This is a complete message 1'),
    (2,'B','This is for class B'),
    (3,'A','This is a complete message 2'),
	(6,'B','This is for class B'),
	(8,'A','This is a complete message 4');
INSERT INTO t_segments (gidx,msgclass,segm,tsize,segmidx) VALUES
    (4,'A','This is a ',28,0),
    (5,'A','complete m',28,1),
	(7,'A','essage 3',28,2),
	(9,'B','mess',9,0),
	(10,'B','age_5',9,1),
	(11,'A','mess',9,0),
	(12,'A','age_6',9,1),
	(13,'A','mess',9,0),
	(14,'A','age_7',9,1);

	
WITH
complete(gidx,segm) AS ( 
    SELECT gidx,segm FROM t_segments 
	WHERE segmidx IS NULL), -- implies message is already complete
head_and_rest AS (
SELECT *, row_number() OVER (PARTITION BY segmidx ORDER BY gidx) AS i FROM t_segments
WHERE segmidx IS NOT NULL -- implies message is multipart
),
combined(gidx, segm) AS ( -- join the message text
    SELECT gidx, GROUP_CONCAT(segm,'') AS segm 
    FROM head_and_rest GROUP BY i
)
SELECT * FROM complete -- combine the already-complete and the joined-up partials in single output.
  UNION ALL
SELECT * FROM combined
ORDER BY gidx;

Output:

SQLite 3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837balt1
zlib version 1.2.11
gcc-9.3.0
gidx        segm
----------  ----------------------------
1           This is a complete message 1
2           This is for class B
3           This is a complete message 2
4           This is a complete message 3
6           This is for class B
8           This is a complete message 4
9           message_5
11          message_6
13          message_7

(14) By midijohnny on 2022-04-06 11:55:40 in reply to 12 [link] [source]

Tidied up version:

WITH 
  single AS (SELECT * FROM t_segments WHERE segmidx IS NULL),
  multi  AS (SELECT * FROM t_segments EXCEPT SELECT * FROM single),
  enum   AS (SELECT row_number() OVER w1 AS i,* FROM multi
             WINDOW w1 AS (PARTITION BY segmidx ORDER BY gidx)),
  concat AS (SELECT gidx,GROUP_CONCAT(segm,'') AS segm FROM enum
             GROUP BY i)
SELECT gidx, segm FROM concat
  UNION ALL
SELECT gidx, segm FROM single
ORDER BY gidx;

Output:

gidx	segm
1	This is a complete message 1
2	This is for class B
3	This is a complete message 2
4	This is a complete message 3
6	This is for class B
8	This is a complete message 4
9	message_5
11	message_6
13	message_7

(16) By midijohnny on 2022-04-06 16:18:14 in reply to 14 [link] [source]

Variation: includes a check to make sure the total length of the re-generated message is the same as the 'tsize' specified in the first part of the message.

WITH 
  single AS (SELECT * FROM t_segments WHERE segmidx IS NULL),
  multi  AS (SELECT * FROM t_segments EXCEPT SELECT * FROM single),
  enum   AS (SELECT rank() OVER w1 AS i,* FROM multi
             WINDOW w1 AS (PARTITION BY segmidx ORDER BY gidx)),
  concat AS (SELECT
              gidx,
              GROUP_CONCAT(segm,'') AS segm,
              SUM(LENGTH(segm))     AS csize
            FROM enum
            GROUP BY i),
  chk    AS (SELECT
               gidx,
               segmidx, 
              (tsize==csize) AS "valid_size?"
             FROM concat
             JOIN multi USING(gidx))
SELECT gidx, segmidx, segm, "valid_size?" FROM concat
INNER JOIN chk USING (gidx)
  UNION ALL
SELECT gidx, segmidx, segm,1 FROM single
ORDER BY gidx

Given the following data (below) - it does detect any discrepancy between the zero-indexed 't.size' and the resultant size. Feel free to hate that fact that I used "?" characters in my column aliasing ;-)

gidx        segmidx     segm                          valid_size?
----------  ----------  ----------------------------  -----------
1                       This is a complete message 1  1
2                       This is for class B           1
3                       This is a complete message 2  1
4           0           This is a complete message 3  1
6                       This is for class B           1
8                       This is a complete message 4  1
9           0           message_5                     1
11          0           message_6                     1
13          0           message_7                     1
15          0           badone                        0
17          0           This is one for class C       0

Random notes/thoughts:

The other thing that could be done, but I haven't: is to check all component 'tsize' items are the same as each other.

I was trying to get lead and lag to do this for me. But then I went a period of not understanding why the Window function worked at all; then re-working it out why it does.

I believe the reason the window function works: all the zeros are gathered up into a single partition, all the ones are gathered up into a second, twos etc. That is: you can a partition for 0,1,2.... then within those partitions the individual members are sorted by their 'gidx' and these relative rankings are the same (if the data is true to its promise at least).

But that does mean you cannot use 'lag' and 'lead' (since they work within partitions, rather than across them). Another window function across the results could probably be done to check this. That every row.tsize==lag(row.tsize)

Data used:

DROP TABLE IF EXISTS "t_segments";
CREATE TABLE IF NOT EXISTS "t_segments" (
	"gidx"	INTEGER,
	"msgclass"	TEXT,
	"segm"	BLOB,
	"tsize"	INTEGER,
	"segmidx"	INTEGER,
	PRIMARY KEY("gidx")
);
INSERT INTO "t_segments"("gidx","msgclass","segm","tsize","segmidx")
VALUES 
 (1,'A','This is a complete message 1',NULL,NULL),
 (2,'B','This is for class B',NULL,NULL),
 (3,'A','This is a complete message 2',NULL,NULL),
 (4,'A','This is a ',28,0),
 (5,'A','complete m',28,1),
 (6,'B','This is for class B',NULL,NULL),
 (7,'A','essage 3',28,2),
 (8,'A','This is a complete message 4',NULL,NULL),
 (9,'B','mess',9,0),
 (10,'B','age_5',9,1),
 (11,'A','mess',9,0),
 (12,'A','age_6',9,1),
 (13,'A','mess',9,0),
 (14,'A','age_7',9,1),
 (15,'A','bad',3,0),
 (16,'A','one',6,1),
 (17,'C','This is one f',25,0),
 (18,'C','or class C',25,1);

(13) By Ryan Smith (cuz) on 2022-04-06 11:53:28 in reply to 11.3 [link] [source]

Nice work Johnny.

FWIW - I've thought of a possible flaw in my logic and devised a more devious table to test it (adding a split-up message inside an already split message), which did work fine for me, though I did add another tweak, and while I was at it, saw your post and grabbed your query to test alongside, and happy to confirm it also works perfectly on the devious data.

@Maik: See here revised table and the two queries, mine followed by midijohnny's, both of which work perfectly.

Only testing them on real data will reveal which is fastest and should be used (since the made-up data sample is too small to show any efficiency differences in the query plan).

Good luck!

CREATE TABLE t_segments (
    gidx INTEGER PRIMARY KEY,
    msgclass,
    segm BLOB,
    tsize INTEGER,
    segmidx INTEGER
);

INSERT INTO t_segments (gidx, msgclass, segm, tsize, segmidx) VALUES
 ( 1,'A','This is a complete message 1', NULL, NULL)
,( 2,'B','This is for class B',          NULL, NULL)
,( 3,'A','This is a complete message 2', NULL, NULL)
,( 4,'A','This is a ',                   28,   0   )
,( 5,'A','complete m',                   28,   1   )
,( 6,'B','This is for class B',          NULL, NULL)
,( 7,'C','This is one f',                25,   0   )
,( 8,'C','or class C',                   25,   1   )
,( 9,'A','essage 3',                     28,   2   )
,(10,'A','This is a complete message 4', NULL, NULL)
,(11,'B','mess',                          9,   0   )
,(12,'B','age_5',                         9,   1   )
;


WITH TSEGM(GIdx, MsgClass, MsgSegment, FullSize, PartIdx) AS (
    SELECT gidx, msgclass, segm, tsize, IFNULL(segmidx,0)
      FROM t_segments
), GRANGE(GIdx, NextSiblingGIdx) AS (
    SELECT A.GIdx, IFNULL((SELECT MIN(GIdx) FROM TSEGM AS B WHERE B.GIdx > A.GIdx AND B.MsgClass=A.MsgClass AND B.PartIdx = 0), A.GIdx+16)
      FROM TSEGM AS A
     WHERE A.PartIdx = 0
), AMALG(MsgIdx, MsgClass, SubIdx, PartIdx, MsgSegment) AS (
    SELECT P.GIdx, P.MsgClass, C.GIdx, C.PartIdx, C.MsgSegment
      FROM TSEGM AS P
      JOIN GRANGE AS R ON R.GIdx = P.GIdx
      JOIN TSEGM AS C ON C.GIdx >= R.GIdx AND C.GIdx < R.NextSiblingGIdx AND C.MsgClass = P.MsgClass
     WHERE P.PartIdx = 0
     ORDER BY P.GIdx, C.GIdx
), MSGS(MsgIdx, MsgClass, Message, PartCount) AS (
    SELECT MsgIdx, MsgClass, GROUP_CONCAT(MsgSegment,''), COUNT(*)
      FROM AMALG
     GROUP BY MsgIdx
)
SELECT * FROM MSGS;

  --    MsgIdx   |MsgClass|Message                       |PartCount
  -- ------------|--------|------------------------------|---------
  --       1     |    A   |This is a complete message 1  |    1    
  --       2     |    B   |This is for class B           |    1    
  --       3     |    A   |This is a complete message 2  |    1    
  --       4     |    A   |This is a complete message 3  |    3    
  --       6     |    B   |This is for class B           |    1    
  --       7     |    C   |This is one for class C       |    2    
  --      10     |    A   |This is a complete message 4  |    1    
  --      11     |    B   |message_5                     |    2    



WITH
complete(gidx,segm) AS (
    SELECT gidx,segm FROM t_segments
	WHERE segmidx IS NULL),
head_and_rest AS (
SELECT *, row_number() OVER (PARTITION BY segmidx ORDER BY gidx) AS i FROM t_segments
WHERE segmidx IS NOT NULL
),
combined(gidx, segm) AS (
    SELECT gidx, GROUP_CONCAT(segm,'') AS segm
    FROM head_and_rest GROUP BY i
)
SELECT * FROM complete
  UNION ALL
SELECT * FROM combined
ORDER BY gidx;


  --     gidx    |segm                          
  -- ------------|------------------------------
  --       1     |This is a complete message 1  
  --       2     |This is for class B           
  --       3     |This is a complete message 2  
  --       4     |This is a complete message 3  
  --       6     |This is for class B           
  --       7     |This is one for class C       
  --      10     |This is a complete message 4  
  --      11     |message_5

(15) By midijohnny on 2022-04-06 11:59:31 in reply to 13 [link] [source]

Nice one! Posts crossed! I have tidied-up my CTE, hopefully a little more compact and readable than my first effort. Interesting puzzle @Maik - cheers!