Why is utf8 IN char* but utf8 OUT is unsigned char*

(1) By curmudgeon on 2020-05-15 08:21:20

I noticed in the sqlite3.c code that utf8 IN is accepted as char*
e.g. int sqlite3_bind_text(sqlite3_stmt*,int,const char*,int,void(*)(void*));
whereas utf8 OUT is returned as
e.g. const unsigned char *sqlite3_column_text(sqlite3_stmt*, int iCol);

Within the code the char* IN is converted to unsigned char* and then manipulated. Can anyone explain to me why? I think I know how utf8 works but I'm unsure as to why the conversion is necessary.

e.g. this define plays a big part in the sqlite3.c utf8 code (uz is the unsigned char* conversion of the received char* z):

#define SQLITE_SKIP_UTF8(uz) {                  \
  if( (*(uz++))>=0xc0 ){                        \
    while( (*uz & 0xc0)==0x80 ){ uz++; }        \
  }                                             \
}
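For anyone following along, here is a minimal sketch of how the macro can be driven. The macro is as quoted above; utf8_char_count is a hypothetical helper written for this illustration and is not part of SQLite. It assumes a NUL-terminated string.

```c
#include <stddef.h>

/* The macro as quoted from sqlite3.c: step uz past one UTF-8 character. */
#define SQLITE_SKIP_UTF8(uz) {                  \
  if( (*(uz++))>=0xc0 ){                        \
    while( (*uz & 0xc0)==0x80 ){ uz++; }        \
  }                                             \
}

/* Hypothetical helper (not part of SQLite): count the characters in a
** NUL-terminated UTF-8 string by skipping one character at a time. */
static size_t utf8_char_count(const char *z){
  /* Reinterpret the same bytes as unsigned; nothing is copied or changed. */
  const unsigned char *uz = (const unsigned char*)z;
  size_t n = 0;
  while( *uz ){
    SQLITE_SKIP_UTF8(uz);
    n++;
  }
  return n;
}
```

The cast on entry is the whole "conversion": the pointer is reinterpreted once and every later comparison against 0xc0 or 0x80 then reads the byte as 0..255.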

Couldn't the conversion to unsigned be avoided and the SKIP macro replaced with

do z++; while (*z<-64);

I know the latter doesn't check that the initial byte indicates a UTF8 char follows but does it matter? I mean if there's embedded chars < -64 without the leading utf8 byte is it not a malformed utf8 string in any case? If it did matter you could always change it to

if ((*z++ & 192) == 192) while (*z<-64) z++;

(In the above I'm assuming that & applied to a negative char sees its two's-complement bit pattern on all compilers, e.g. -64 & 192 gives 192.)
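A minimal sketch of the two approaches side by side, so they can be compared on the same input. Both function names are hypothetical. The signed version uses signed char explicitly rather than plain char, and it assumes two's-complement bytes, so that continuation bytes 0x80..0xBF read as -128..-65. The lead-byte test is applied to the first byte before stepping past it.

```c
#include <stddef.h>

/* Hypothetical helper: skip one UTF-8 character by testing bytes as
** signed values. Assumes two's complement, so 0x80..0xBF is -128..-65. */
static const char *skip_utf8_signed(const char *z){
  const signed char *sz = (const signed char*)z;
  if( (*sz++ & 0xc0)==0xc0 ){        /* first byte is in 0xC0..0xFF */
    while( *sz < -64 ) sz++;         /* step over continuation bytes */
  }
  return (const char*)sz;
}

/* The same step with unsigned bytes, mirroring SQLITE_SKIP_UTF8. */
static const char *skip_utf8_unsigned(const char *z){
  const unsigned char *uz = (const unsigned char*)z;
  if( *uz++ >= 0xc0 ){
    while( (*uz & 0xc0)==0x80 ) uz++;
  }
  return (const char*)uz;
}
```

On well-formed input both return the same pointer. The difference is in what the code relies on: the signed form depends on the type being signed and on the representation of negative values, while the unsigned form only compares values in 0..255.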

I'm just wondering what I'm missing about utf8.

(2) By Stephan Beal (stephan) on 2020-05-15 08:32:59 in reply to 1

i don't know if this changes your analysis, but don't forget that whether or not char is signed or unsigned by default is platform-dependent. e.g. on ARM platforms (the ones i've developed on, anyway), char is unsigned by default, whereas it's signed on most platforms. Also don't forget that numeric overflow/underflow for signed types technically has undefined behavior (though it's likely to work identically/predictably on all recent platforms, it's not guaranteed to).
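A small sketch of how that platform dependence shows up in code. CHAR_MIN comes from the standard <limits.h>; the three function names are hypothetical, made up for this illustration.

```c
#include <limits.h>

/* Hypothetical helper: 1 if plain char is signed on this platform. */
static int plain_char_is_signed(void){
  return CHAR_MIN < 0;
}

/* Hypothetical helper: does the byte 0xC3 look negative through plain
** char? The answer differs between signed-char and unsigned-char
** platforms, which is why a test such as (*z < -64) is not portable. */
static int lead_byte_looks_negative(void){
  char c = (char)0xC3;
  return c < 0;
}

/* Portable classification: cast to unsigned char first, so the result
** does not depend on the signedness of plain char. */
static int is_utf8_continuation(char c){
  return ((unsigned char)c & 0xC0) == 0x80;
}
```

Casting to unsigned char before testing gives the same answer on every platform, which is one plausible reason to work with unsigned char internally.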

(3) By curmudgeon on 2020-05-15 09:02:31 in reply to 2

It certainly further confuses the issue for me, Stephan :-(

(4) By Keith Medcalf (kmedcalf) on 2020-05-15 18:04:22 in reply to 3

In the bytes that are stored there is no difference between "char" and "unsigned char". There is no "conversion" that takes place when you tell the compiler that something is "signed char" vs "unsigned char". All you are doing is telling the compiler to emit different instructions.

If a is a "signed char" then the statement a > 0xc0 is always false unless one assumes that 0xc0 is also a signed char, in which case the meaning is entirely opposite from what one expects.
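To make the comparison concrete, here is a minimal sketch with three hypothetical helpers (none of them SQLite code). Each reads one byte and compares it against 0xc0 under a different interpretation. The signed reads assume two's complement.

```c
/* Hypothetical helpers, not SQLite code. Each reads one byte at p. */

/* Signed view: the value is in -128..127, so it can never reach 192. */
static int ge_c0_signed(const char *p){
  int v = *(const signed char*)p;
  return v >= 0xc0;
}

/* Unsigned view: true exactly for bytes in the range 0xC0..0xFF. */
static int ge_c0_unsigned(const char *p){
  int v = *(const unsigned char*)p;
  return v >= 0xc0;
}

/* Signed view with 0xc0 also read as a signed char, i.e. -64. */
static int ge_minus64_signed(const char *p){
  int v = *(const signed char*)p;
  return v >= -64;
}
```

With the byte 0xC3, the first returns 0 and the second returns 1. The third returns 1 for 0xC3 but also for plain ASCII such as 'A', so it no longer singles out lead bytes.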

So telling the compiler that a "char" is signed or unsigned is merely a convention so that you can (a) write code that does what it appears to be saying and (b) have the compiler emit the correct instruction to "do what you mean". The signed/unsigned has absolutely zero "conversion" effect on the data, merely how the compiler implements your manipulations.

That is, it merely distinguishes whether the same data is bytes from 0 to 255, or bytes from -128 to 127 in two's complement. The value (bit pattern) of the byte does not change.
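A minimal sketch of the two readings of one stored byte. The helper names are hypothetical, and the signed reading assumes two's complement.

```c
/* Hypothetical helper: read the byte at p as an unsigned value, 0..255. */
static int byte_as_unsigned(const void *p){
  return *(const unsigned char*)p;
}

/* Hypothetical helper: read the same byte as a signed value, -128..127. */
static int byte_as_signed(const void *p){
  return *(const signed char*)p;
}

/* The two readings are equal below 128 and differ by exactly 256 when
** the top bit is set. The stored byte itself is never modified. */
static int views_agree(const void *p){
  int u = byte_as_unsigned(p);
  int s = byte_as_signed(p);
  return u < 128 ? (s == u) : (s == u - 256);
}
```

For the byte 0xC3, byte_as_unsigned gives 195 and byte_as_signed gives -61, both read from the same memory location.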

(5) By curmudgeon on 2020-05-16 07:56:15 in reply to 4

Thanks Keith. I was on board with what you say when I first posted. I was just wondering why convert back and forth if it's not needed. Not that I have a problem with that, I was just worried I was missing something.