/ Check-in [7b692782]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add a README.txt file for the ICU extension. (CVS 4055)
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 7b6927829f18d39052e67eebca4275e7aa496035
User & Date: danielk1977 2007-06-11 08:00:00
Context
2007-06-11
12:56
Fix a bug in the LIKE optimization that was causing the pattern string to be dequoted twice. Ticket #2407. (CVS 4056) check-in: abf64d1d user: drh tags: trunk
08:00
Add a README.txt file for the ICU extension. (CVS 4055) check-in: 7b692782 user: danielk1977 tags: trunk
2007-06-10
22:57
Define isnan() on windows. Ticket #2399. (CVS 4054) check-in: fed9373e user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Added ext/icu/README.txt.









































































































































































































































































































































>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164

This directory contains source code for the SQLite "ICU" extension, an
integration of the "International Components for Unicode" library with
SQLite. Documentation follows.

    1. Features
    
        1.1  SQL Scalars upper() and lower()
        1.2  Unicode Aware LIKE Operator
        1.3  ICU Collation Sequences
        1.4  SQL REGEXP Operator
    
    2. Compilation and Usage
    
    3. Bugs, Problems and Security Issues
    
        3.1  The "case_sensitive_like" Pragma
        3.2  The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
        3.3  Collation Sequence Security Issue


1. FEATURES

  1.1  SQL Scalars upper() and lower()

    SQLite's built-in implementations of these two functions only 
    provide case mapping for the 26 letters used in the English
    language. The ICU based functions provided by this extension
    provide case mapping, where defined, for the full range of 
    unicode characters.

    ICU provides two types of case mapping, "general" case mapping and
    "language specific". Refer to ICU documentation for the differences
    between the two. Specifically:

       http://www.icu-project.org/userguide/caseMappings.html
       http://www.icu-project.org/userguide/posix.html#case_mappings

    To utilise "general" case mapping, the upper() or lower() scalar 
    functions are invoked with one argument:

        upper('ABC') -> 'abc'
        lower('abc') -> 'ABC'

    To access ICU "language specific" case mapping, upper() or lower()
    should be invoked with two arguments. The second argument is the name
    of the locale to use. Passing an empty string ("") or SQL NULL value
    as the second argument is the same as invoking the 1 argument version
    of upper() or lower():

        lower('I', 'en_us') -> 'i'
        lower('I', 'tr_tr') -> 'ı' (small dotless i)

  1.2  Unicode Aware LIKE Operator

    Similarly to the upper() and lower() functions, the built-in SQLite LIKE
    operator understands case equivalence for the 26 letters of the English
    language alphabet. The implementation of LIKE included in this
    extension uses the ICU function u_foldCase() to provide case
    independent comparisons for the full range of unicode characters.  

    The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the
    dotless 'I' character used in the Turkish language is considered
    to be in the same equivalence class as the dotted 'I' character
    used by many languages (including English).

  1.3  ICU Collation Sequences

    A special SQL scalar function, icu_load_collation() is provided that 
    may be used to register ICU collation sequences with SQLite. It
    is always called with exactly two arguments, the ICU locale 
    identifying the collation sequence to ICU, and the name of the
    SQLite collation sequence to create. For example, to create an
    SQLite collation sequence named "turkish" using Turkish language
    sorting rules, the SQL statement:

        SELECT icu_load_collation('tr_TR', 'turkish');

    Or, for Australian English:

        SELECT icu_load_collation('en_AU', 'australian');

    The identifiers "turkish" and "australian" may then be used
    as collation sequence identifiers in SQL statements:

        CREATE TABLE aust_turkish_penpals(
          australian_penpal_name TEXT COLLATE australian,
          turkish_penpal_name    TEXT COLLATE turkish
        );
  
  1.4 SQL REGEXP Operator

    This extension provides an implementation of the SQL binary
    comparision operator "REGEXP", based on the regular expression functions
    provided by the ICU library. The syntax of the operator is as described
    in SQLite documentation:

        <string> REGEXP <re-pattern>

    This extension uses the ICU defaults for regular expression matching
    behaviour. Specifically, this means that:

        * Matching is case-sensitive,
        * Regular expression comments are not allowed within patterns, and
        * The '^' and '$' characters match the beginning and end of the
          <string> argument, not the beginning and end of lines within
          the <string> argument.

    Even more specifically, the value passed to the "flags" parameter
    of ICU C function uregex_open() is 0.


2  COMPILATION AND USAGE

  The easiest way to compile and use the ICU extension is to build
  and use it as a dynamically loadable SQLite extension.




3 BUGS, PROBLEMS AND SECURITY ISSUES

  3.1 The "case_sensitive_like" Pragma

    This extension does not work well with the "case_sensitive_like"
    pragma. If this pragma is used before the ICU extension is loaded,
    then the pragma has no effect. If the pragma is used after the ICU
    extension is loaded, then SQLite ignores the ICU implementation and
    always uses the built-in LIKE operator.

    The ICU extension LIKE operator is always case insensitive.

  3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro

    Passing very long patterns to the built-in SQLite LIKE operator can
    cause a stack overflow. To curb this problem, SQLite defines the
    SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a
    pattern in bytes (irrespective of encoding). The default value is
    defined in internal header file "limits.h".
    
    The ICU extension LIKE implementation suffers from the same 
    problem and uses the same solution. However, since the ICU extension
    code does not include the SQLite file "limits.h", modifying
    the default value therein does not affect the ICU extension.
    The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by
    the ICU extension LIKE operator is 50000, defined in source 
    file "icu.c".

  3.3 Collation Sequence Security Issue

    Internally, SQLite assumes that indices stored in database files
    are sorted according to the collation sequence indicated by the
    SQL schema. Changing the definition of a collation sequence after
    an index has been built is therefore equivalent to database
    corruption. The SQLite library is not very well tested under
    these conditions, and may contain potential buffer overruns
    or other programming errors that could be exploited by a malicious
    programmer.

    If the ICU extension is used in an environment where potentially
    malicious users may execute arbitrary SQL (i.e. gears), they
    should be prevented from invoking the icu_load_collation() function,
    possibly using the authorisation callback.

Changes to ext/icu/icu.c.

1
2












3
4
5
6
7
8
9
10
11
12
13
14
15
16

17
18
19
20

21
22
23
24
25
26
27
..
28
29
30
31
32
33
34
35
36
37
38
39
40



41
42
43
44
45
46
47
..
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
...
146
147
148
149
150
151
152









153
154
155
156
157
158
159
...
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301

/*












** This file implements an integration between the ICU library 
** ("International Components for Unicode", an open-source library 
** for handling unicode data) and SQLite. The integration uses 
** ICU to provide the following to SQLite:
**
**   * An implementation of the SQL regexp() function (and hence REGEXP
**     operator) using the ICU uregex_XX() APIs.
**
**   * Implementations of the SQL scalar upper() and lower() 
**     functions for case mapping.
**
**   * Collation sequences
**
**   * LIKE

*/

#if !defined(SQLITE_CORE) || defined(SQLITE_ENABLE_ICU)


#include <unicode/utypes.h>
#include <unicode/uregex.h>
#include <unicode/ustring.h>
#include <unicode/ucol.h>

#include <assert.h>
#include "sqlite3.h"
................................................................................

#ifndef SQLITE_CORE
  #include "sqlite3ext.h"
  SQLITE_EXTENSION_INIT1
#endif

/*
** Collation sequences:
**
**   ucol_open()
**   ucol_strcoll()
**   ucol_close()
*/




/*
** Version of sqlite3_free() that is always a function, never a macro.
*/
static void xFree(void *p){
  sqlite3_free(p);
}
................................................................................

/*
** Compare two UTF-8 strings for equality where the first string is
** a "LIKE" expression. Return true (1) if they are the same and 
** false (0) if they are different.
*/
static int icuLikeCompare(
  const uint8_t *zPattern,   /* The UTF-8 LIKE pattern */
  const uint8_t *zString,    /* The UTF-8 string to compare against */
  const UChar32 uEsc         /* The escape character */
){
  static const int MATCH_ONE = (UChar32)'_';
  static const int MATCH_ALL = (UChar32)'%';

  int iPattern = 0;       /* Current byte index in zPattern */
................................................................................
  sqlite3_context *context, 
  int argc, 
  sqlite3_value **argv
){
  const unsigned char *zA = sqlite3_value_text(argv[0]);
  const unsigned char *zB = sqlite3_value_text(argv[1]);
  UChar32 uEsc = 0;










  if( argc==3 ){
    /* The escape character string must consist of a single UTF-8 character.
    ** Otherwise, return an error.
    */
    int nE= sqlite3_value_bytes(argv[2]);
    const unsigned char *zE = sqlite3_value_text(argv[2]);
................................................................................
**
**     upper('ABC') -> 'abc'
**     lower('abc') -> 'ABC'
**
** To access ICU "language specific" case mapping, upper() or lower()
** should be invoked with two arguments. The second argument is the name
** of the locale to use. Passing an empty string ("") or SQL NULL value
** as the second argument is the smae as invoking the 1 argument version
** of upper() or lower().
**
**     lower('I', 'en_us') -> 'i'
**     lower('I', 'tr_tr') -> 'ı' (small dotless i)
**
** http://www.icu-project.org/userguide/posix.html#case_mappings
*/
<

>
>
>
>
>
>
>
>
>
>
>
>








|
|

|

|
>




>







 







|
|
<
<
<

>
>
>







 







|







 







>
>
>
>
>
>
>
>
>







 







|








1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
..
41
42
43
44
45
46
47
48
49



50
51
52
53
54
55
56
57
58
59
60
..
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
...
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
...
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323

/*
** 2007 May 6
**
** The author disclaims copyright to this source code.  In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
** $Id: icu.c,v 1.5 2007/06/11 08:00:00 danielk1977 Exp $
**
** This file implements an integration between the ICU library 
** ("International Components for Unicode", an open-source library 
** for handling unicode data) and SQLite. The integration uses 
** ICU to provide the following to SQLite:
**
**   * An implementation of the SQL regexp() function (and hence REGEXP
**     operator) using the ICU uregex_XX() APIs.
**
**   * Implementations of the SQL scalar upper() and lower() functions
**     for case mapping.
**
**   * Integration of ICU and SQLite collation seqences.
**
**   * An implementation of the LIKE operator that uses ICU to 
**     provide case-independent matching.
*/

#if !defined(SQLITE_CORE) || defined(SQLITE_ENABLE_ICU)

/* Include ICU headers */
#include <unicode/utypes.h>
#include <unicode/uregex.h>
#include <unicode/ustring.h>
#include <unicode/ucol.h>

#include <assert.h>
#include "sqlite3.h"
................................................................................

#ifndef SQLITE_CORE
  #include "sqlite3ext.h"
  SQLITE_EXTENSION_INIT1
#endif

/*
** Maximum length (in bytes) of the pattern in a LIKE or GLOB
** operator.



*/
#ifndef SQLITE_MAX_LIKE_PATTERN_LENGTH
# define SQLITE_MAX_LIKE_PATTERN_LENGTH 50000
#endif

/*
** Version of sqlite3_free() that is always a function, never a macro.
*/
static void xFree(void *p){
  sqlite3_free(p);
}
................................................................................

/*
** Compare two UTF-8 strings for equality where the first string is
** a "LIKE" expression. Return true (1) if they are the same and 
** false (0) if they are different.
*/
static int icuLikeCompare(
  const uint8_t *zPattern,   /* LIKE pattern */
  const uint8_t *zString,    /* The UTF-8 string to compare against */
  const UChar32 uEsc         /* The escape character */
){
  static const int MATCH_ONE = (UChar32)'_';
  static const int MATCH_ALL = (UChar32)'%';

  int iPattern = 0;       /* Current byte index in zPattern */
................................................................................
  sqlite3_context *context, 
  int argc, 
  sqlite3_value **argv
){
  const unsigned char *zA = sqlite3_value_text(argv[0]);
  const unsigned char *zB = sqlite3_value_text(argv[1]);
  UChar32 uEsc = 0;

  /* Limit the length of the LIKE or GLOB pattern to avoid problems
  ** of deep recursion and N*N behavior in patternCompare().
  */
  if( sqlite3_value_bytes(argv[0])>SQLITE_MAX_LIKE_PATTERN_LENGTH ){
    sqlite3_result_error(context, "LIKE or GLOB pattern too complex", -1);
    return;
  }


  if( argc==3 ){
    /* The escape character string must consist of a single UTF-8 character.
    ** Otherwise, return an error.
    */
    int nE= sqlite3_value_bytes(argv[2]);
    const unsigned char *zE = sqlite3_value_text(argv[2]);
................................................................................
**
**     upper('ABC') -> 'abc'
**     lower('abc') -> 'ABC'
**
** To access ICU "language specific" case mapping, upper() or lower()
** should be invoked with two arguments. The second argument is the name
** of the locale to use. Passing an empty string ("") or SQL NULL value
** as the second argument is the same as invoking the 1 argument version
** of upper() or lower().
**
**     lower('I', 'en_us') -> 'i'
**     lower('I', 'tr_tr') -> 'ı' (small dotless i)
**
** http://www.icu-project.org/userguide/posix.html#case_mappings
*/