Support of unicode operators like ≠?

(1) By MBS on 2020-12-03 08:42:00 [link]

Hello,

how about this little feature request to expand the parser to also understand the unicode characters below?

≠, not equal to, U+2260 = 8800 or composed with 61 + 824
≤, less-than or equal to, U+2264 = 8804
≥, greater-than or equal to, U+2265 = 8805

While you may not use those characters in source code directly, you could just check for the matching UTF-8 sequences in your parser.
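As a rough illustration of what "checking for the matching UTF-8 sequences" could look like, here is a minimal sketch in Python rather than in SQLite's own C. The table `UTF8_OPERATORS` and the helper `match_operator` are hypothetical names made up for this example; nothing like them is claimed to exist in SQLite.

```python
# UTF-8 byte sequences for the requested operators, mapped to the ASCII
# spellings SQLite already understands. Hypothetical table, for illustration.
UTF8_OPERATORS = {
    "\u2260".encode("utf-8"): "<>",   # ≠  U+2260 -> E2 89 A0
    "\u2264".encode("utf-8"): "<=",   # ≤  U+2264 -> E2 89 A4
    "\u2265".encode("utf-8"): ">=",   # ≥  U+2265 -> E2 89 A5
    "=\u0338".encode("utf-8"): "<>",  # "=" + combining overlay (61 + 824) -> 3D CC B8
}

def match_operator(sql: bytes, pos: int):
    """Return (ascii_operator, byte_length) if a Unicode operator starts at pos, else None."""
    for seq, op in UTF8_OPERATORS.items():
        if sql.startswith(seq, pos):
            return op, len(seq)
    return None

sql = "a ≠ b".encode("utf-8")
print(match_operator(sql, 2))  # the operator starts at byte offset 2
print(match_operator(sql, 0))  # no operator at offset 0
```

Note that the composed form begins with a plain `=`, so a tokenizer supporting it would have to look ahead after every `=` before deciding which token it has seen.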

Thanks.

(2) By Ryan Smith (cuz) on 2020-12-03 12:15:13 in reply to 1 [link]

I don't program in unicode and cannot fathom why anyone would want to do it in SQL (other languages sure, especially if you are Chinese or Russian or such), but that is beside the point. I don't mind SQLite being able to interpret certain unicode characters to fulfill its query planning.  
Should be fairly easy to do too, SQLite already only talks UTF8.
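Until something like this exists in the parser, one way to experiment is to translate the operators in the application before the statement is prepared. The sketch below assumes Python's bundled `sqlite3` module; `translate_operators` and `ASCII_OPERATORS` are hypothetical names, and the scanner only skips quoted text, not SQL comments.

```python
import sqlite3

# Hypothetical mapping from Unicode operators to ASCII spellings.
ASCII_OPERATORS = {"\u2260": "<>", "\u2264": "<=", "\u2265": ">="}

def translate_operators(sql: str) -> str:
    """Replace ≠ ≤ ≥ with ASCII operators, leaving quoted text untouched."""
    out = []
    closing = None  # closing quote character we are waiting for, if any
    for ch in sql:
        if closing is not None:
            out.append(ch)
            if ch == closing:
                closing = None  # a doubled quote simply re-opens on the next char
        elif ch in ("'", '"', "`"):
            closing = ch
            out.append(ch)
        elif ch == "[":
            closing = "]"  # bracket-quoted identifier
            out.append(ch)
        else:
            out.append(ASCII_OPERATORS.get(ch, ch))
    return "".join(out)

conn = sqlite3.connect(":memory:")
query = translate_operators("SELECT 1 ≠ 2, '≠'")
print(query)
print(conn.execute(query).fetchone())
```

The string literal keeps its `≠` while the operator outside the quotes becomes `<>`. This costs one extra pass over the SQL text in the application, which is the kind of per-query overhead discussed in this post.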

I DO however mind if it spends even a few CPU cycles on having to do that, or even check for it, because I care about speed and SQL isn't like normal programming where the compiler takes a couple milliseconds more to do it and you only compile once in a while.  
SQLite (and SQL engines in general) have to parse and interpret each of the thousands of queries we send them per second, and paying extra cycles for that really matters.

I am quite curious though:

- Is this possible in any other SQL engines?
- Does the SQL standard have any guidelines in this regard?

(3) By Richard Hipp (drh) on 2020-12-03 12:20:40 in reply to 1 [link]

Do any of PostgreSQL, MySQL, SQL Server, or Oracle support this?

(4) By anonymous on 2020-12-03 13:26:29 in reply to 3 [link]

SQL Server sees <> OR != as 'not equal to'
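For comparison, SQLite also accepts both ASCII spellings. A quick check, assuming Python's bundled `sqlite3` module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Both <> and != mean "not equal to"; <= and >= are the ASCII comparison forms.
row = conn.execute("SELECT 1 <> 2, 1 != 2, 1 <= 2, 2 >= 1").fetchone()
print(row)
```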

The <i>only</i> programming language I know that will accept ≠ as 'not equal to' is APL (A Programming Language)

(5) By anonymous on 2020-12-03 16:42:45 in reply to 4 [link]

Another such language is [Raku](https://docs.raku.org/language/operators) (formerly known as Perl6), including such operators as `≠` `≤` `≥` `≅` `∈` `∉` `∋` `∌` `⊂` `⊄` `⊃` `⊅` `⊆` `⊈` `⊇` `⊉` `≼` `≽` and other syntax elements like `｢｣` `«»`, but while it presents some compelling and interesting ideas, it's removed too far from SQL to be considered an argument in favour of adding such operators to SQLite.

(6) By Keith Medcalf (kmedcalf) on 2020-12-03 17:05:42 in reply to 5 [link]

I don't know about you, but my keyboard does not have any of those funny characters.

(7) By MBS on 2020-12-03 17:07:49 in reply to 6 [link]

I can type ≠ with option-= and ≤ with option-< here (German layout).

Since those unicode characters are newer than the original SQL standard, I just wondered whether those could be supported.

(8) By David Raymond (dvdraymond) on 2020-12-03 17:37:06 in reply to 3 [link]

> Do any of PostgreSQL, MySQL, SQL Server, or Oracle support this?

For Postgres:
[4.1.3. Operators](https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-OPERATORS)

```
4.1.3. Operators

An operator name is a sequence of up to NAMEDATALEN-1 (63 by default) characters from the following list:


+ - * / < > = ~ ! @ # % ^ & | ` ?

There are a few restrictions on operator names, however:

    -- and /* cannot appear anywhere in an operator name, since they will be taken as the start of a comment.

    A multiple-character operator name cannot end in + or -, unless the name also contains at least one of these characters:


    ~ ! @ # % ^ & | ` ?

    For example, @- is an allowed operator name, but *- is not. This restriction allows PostgreSQL to parse SQL-compliant queries without requiring spaces between tokens.

When working with non-SQL-standard operator names, you will usually need to separate adjacent operators with spaces to avoid ambiguity. For example, if you have defined a left unary operator named @, you cannot write X*@Y; you must write X* @Y to ensure that PostgreSQL reads it as two operator names not one.
```

(9.1) By Keith Medcalf (kmedcalf) on 2020-12-04 02:00:39 edited from 9.0 in reply to 7 [link]

I don't have an "option" key, whatever that is.

(10) By Larry Brasfield (LarryBrasfield) on 2020-12-04 02:05:37 in reply to 9.0 [link]

That is Apple's name for the Alt key. Innovation!

(11) By Trudge on 2020-12-04 16:30:40 in reply to 2 [link]

One of the reasons I have for trying to understand Unicode is my collection of music and books. Many artists and authors have non-ASCII characters in their names or titles of their work. The reason Unicode exists is because we no longer live in a mono-language / mono-character set world. As a Perl programmer, I have torn more hair out than I've lost naturally. There are several layers at work in a digital environment and they all have to talk to each other correctly if Unicode is going to work. 

<ol>
<li>the OS</li>
<li>the programming language</li>
<li>the editor / authoring tool used to write the code</li>
<li>the database engine</li>
<li>the browser or agent used to view a page</li>
</ol>

There may be more but it's almost like quantum physics - if you think you understand it you're wrong.

In my work I've given up trying to get all the parts working together and gone straight back to ASCII. A search form on my sites contains a drop-down list of authors, artists and titles because a user-supplied search for any non-ASCII characters would not work. 
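For what it is worth, a user-supplied search with non-ASCII characters can work at the database layer when the text is bound as a parameter and both sides use the same Unicode normalization form. The following is only a sketch in Python (not Perl), with a made-up `artists` table and a hypothetical `find_artist` helper; the other layers listed above still have to agree on UTF-8.

```python
import sqlite3
import unicodedata

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (name TEXT)")

# Store names in NFC so composed and decomposed input compare equal later.
names = ["Björk", "Sigur Rós", "Motörhead", "Dvořák"]
conn.executemany(
    "INSERT INTO artists (name) VALUES (?)",
    [(unicodedata.normalize("NFC", n),) for n in names],
)

def find_artist(term: str) -> list:
    """Substring search; the term is normalized and bound, never pasted into the SQL."""
    term = unicodedata.normalize("NFC", term)
    rows = conn.execute(
        "SELECT name FROM artists WHERE name LIKE '%' || ? || '%'", (term,)
    )
    return [r[0] for r in rows]

print(find_artist("jör"))
print(find_artist("Bjo\u0308rk"))  # "o" + combining diaeresis, as some inputs send it
```

One caveat: SQLite's built-in `LIKE` folds case only for ASCII letters, so a search for a lower-case `ö` will not match an upper-case `Ö` unless the application handles that itself.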

But at some point I will figure it all out. When that happens I'll tackle quantum physics.

(12) By Tim Streater (Clothears) on 2020-12-04 17:41:40 in reply to 11 [link]

Just make sure they are UTF-8 compliant, which most stuff is these days. That does not, however, mean that such as SQLite, in terms of its language syntax, has to support large numbers of UTF-8 characters. That would risk expanding the library size rather a lot.

I find this site useful:

https://www.utf8-chartable.de/unicode-utf8-table.pl?number=1024

I prefer UTF-8 to other encodings as its bottom page is the same as ASCII, and being a variable length encoding, there are no endianness issues.

(13) By Ryan Smith (cuz) on 2020-12-05 10:46:05 in reply to 11 [link]

I think you mistake my statements for being "against Unicode". I adore Unicode, I couldn't write a letter in my native tongue if it weren't for Unicode, and all my created software can handle Unicode perfectly well. Yet, the code that I wrote to make that software is technical and contains only ASCII characters.

What I am contesting, is writing that software in Unicode.
The tools used to build engines do not themselves need to behave the same way as the built engines.

Most modern code editors can speak Unicode perfectly well, but I still prefer the larger-or-equal sign to be the two characters ">=" rather than a single special Unicode character. It's not that it is more "correct" in any sense, it's just faster that way and limits your required keys to produce the needed characters on ANY keyboard to a set of about 100 or so keys to get the job done.

It's like how some people view luxury to mean being special in every way, better than normal, but even if you buy a right fancy luxurious car, you'll find the tire-iron to be of a very standard format. I pose that it is extreme luxury to be able to walk into any little roadside shop out in desert-country after suffering a flat tire and purchase a tire iron that will work on your otherwise wildly innovative expensive out-of-the-norm vehicle.

Your computer understanding Unicode and being able to write your correct umlauted name on an email or document is a MUST. Needing that for basic programming input commands however, is the opposite of luxury.

Lastly, Unicode is nothing like quantum physics. It's fully understood, used almost universally everywhere (where general language input/output is needed) and is perfectly sensible.

If you are having trouble with it, perhaps you are trying to do things in Unicode where it isn't needed (like coding), or perhaps some more reading may enlighten. It's hard to guess at your problems, but if you can pose a specific difficulty, we might be able to suggest a remedy.

(14) By Ryan Smith (cuz) on 2020-12-05 10:55:40 in reply to 12 [link]

> I prefer UTF-8 to other encodings as its bottom page is the same as ASCII, and being a variable length encoding, there are no endianness issues.

And let's not forget, it's helluva compact, winning hands-down over any other encoding for sheer size.
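A small sketch to put numbers on the size question, using made-up sample strings. For ASCII-heavy text such as SQL the advantage is clear; for some scripts the picture differs, since most Chinese characters take three bytes in UTF-8 but two in UTF-16.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the same text in three Unicode encodings (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = {
    "English": "not equal",
    "Russian": "не равно",
    "Chinese": "不等于",
}

for language, text in samples.items():
    print(language, encoded_sizes(text))
```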

(15) By anonymous on 2020-12-05 13:59:54 in reply to 2 [link]

> I don't program in unicode and cannot fathom why anyone would want to do it in SQL (other languages sure, especially if you are Chinese or Russian or such)

In middle school we had an experimental course in programming. The environment had been optimised for children-friendliness, which meant LogoWriter translated into Russian. For some reason, using roots from the same language I think in to instruct a computer felt far weirder than other, more mainstream programming languages with English-based keywords felt later. This was in DOS, so the encoding was probably CP866.

The actual reason for the weirdness may have been Logo's LISP ancestry mixed with an attempt to read like natural language, though.

(Sorry for an off-topic response.)

(16) By Trudge on 2020-12-05 16:50:29 in reply to 12 [link]

Thank you for your insight. I use that site as well but carefully - I still need to get a real understanding of the whole UTF-8 / Unicode situation as used in my environment.

(17) By Trudge on 2020-12-05 16:56:26 in reply to 13 [link]

"If you are having trouble with it, perhaps you are trying to do things in Unicode which isn't needed (like coding) or perhaps some more reading may enlighten. It's hard to guess at your problems, but if you can pose a specific difficulty, we might be able to suggest a remedy."

I believe you are correct. I am still struggling to fully understand the whole UTF-8 / Unicode situation, thus my reference to quantum physics. Yes, I know it is understood (by some people) but certainly not me. Sort of like Unicode. Maybe I'm making it more complicated than it is.

Thanks for your thoughts and insight.

(18) By Tim Streater (Clothears) on 2020-12-05 17:09:41 in reply to 17

You may also like to look here:

https://en.wikipedia.org/wiki/UTF-8

UTF-8 is just one way of encoding Unicode code-points for storage. There are others, but its main advantages are:

1) Variable length units from one to four bytes. One byte units map exactly to ASCII.
2) As it's byte-oriented, it has no endianness issue.
3) It's the most-used encoding.
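The first two points can be checked directly. This is a minimal sketch in Python; the sample characters are arbitrary picks and `utf8_length` is a hypothetical helper name.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single code point."""
    return len(ch.encode("utf-8"))

# One to four bytes, depending on the code point.
for ch in ["A", "\u00e9", "\u2260", "\U0001F600"]:  # A, é, ≠, 😀
    print(f"U+{ord(ch):04X}", utf8_length(ch), ch.encode("utf-8").hex(" "))

# One-byte units map exactly to ASCII.
print("SELECT".encode("utf-8") == "SELECT".encode("ascii"))

# UTF-16 has two possible byte orders for the same code point; UTF-8 has one.
print("\u2260".encode("utf-16-le").hex(" "))  # low byte first
print("\u2260".encode("utf-16-be").hex(" "))  # high byte first
```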