module Ustring: sig .. end
Unicode string library - Ustring. Version 0.01
This module implements Unicode variants of functions using strings in the OCaml standard library, e.g., modules String and Pervasives.
There are also a number of additional functions available. Several basic operators (e.g., equality and string concatenation) are defined in submodule Ustring.Op. It is recommended to open this submodule rather than Ustring itself.
This module currently supports ASCII, Latin-1, and UTF-8 encoding. If another encoding is used, an exception will be raised. In all functions below, a ustring s is indexed from 0 to (Ustring.length s) - 1.
Copyright (C) 2010 David Broman. All rights reserved. This file
is distributed under the "New BSD License".
type uchar = int
module Op: sig .. end
Ustring.Op contains Unicode versions of several functions available in the Pervasives module, as well as operators and functions for simple string creation and manipulation (for example, the string concatenation operator ^. , the ustring equality operator =. , and string creation with us"string").
type t = Op.ustring
type ustring = Op.ustring
type encoding =
  | Ascii    (* Standard ASCII (values 0-127). *)
  | Latin1   (* Latin-1 encoding. Supports most European languages. *)
  | Utf8     (* UTF-8 encoding. Full Unicode encoding, where each character is represented by 1-4 bytes. *)
  | Utf16le  (* Not yet supported. *)
  | Utf16be  (* Not yet supported. *)
  | Utf32le  (* Not yet supported. *)
  | Utf32be  (* Not yet supported. *)
  | Auto     (* Not a specific encoding, but a method for how encoded data is interpreted. If the input data has a byte order mark (BOM) at the beginning of the sequence, the encoding type stated in the BOM is used. If there is no BOM, ASCII is initially assumed. Then, when a byte sequence appears that is not ASCII (values 0-127), a choice is made for the rest of the input: if the sequence represents a legal UTF-8 encoded character, the rest of the input is treated as UTF-8 encoded; otherwise, Latin-1 is assumed for the rest of the data sequence. Any later illegal decoding is then reported as an error. *)
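The decision procedure described for Auto can be sketched with the standard library alone. This is not the library's implementation: the type detected and the functions utf8_seq_len, utf8_seq_at, has_utf8_bom, and detect are hypothetical names introduced for illustration. Only the UTF-8 BOM is checked, and overlong forms are not rejected.

```ocaml
(* Hypothetical stand-in for the encodings that Auto can resolve to. *)
type detected = Ascii | Latin1 | Utf8

(* Length of the UTF-8 sequence that starts with lead byte [b],
   or 0 if [b] cannot start a sequence. *)
let utf8_seq_len b =
  if b land 0x80 = 0 then 1
  else if b land 0xE0 = 0xC0 then 2
  else if b land 0xF0 = 0xE0 then 3
  else if b land 0xF8 = 0xF0 then 4
  else 0

(* True if a structurally valid multi-byte UTF-8 sequence starts at [i]:
   a legal lead byte followed by the right number of 10xxxxxx bytes. *)
let utf8_seq_at s i =
  let len = utf8_seq_len (Char.code s.[i]) in
  len >= 2
  && i + len <= String.length s
  && (let ok = ref true in
      for k = 1 to len - 1 do
        if Char.code s.[i + k] land 0xC0 <> 0x80 then ok := false
      done;
      !ok)

(* Assumption: only the UTF-8 byte order mark is recognised here. *)
let has_utf8_bom s =
  String.length s >= 3 && String.sub s 0 3 = "\xEF\xBB\xBF"

(* Assume ASCII until the first non-ASCII byte, then commit to
   UTF-8 or Latin-1 for the rest of the input. *)
let detect s =
  if has_utf8_bom s then Utf8
  else begin
    let n = String.length s in
    let rec scan i =
      if i >= n then Ascii
      else if Char.code s.[i] < 128 then scan (i + 1)
      else if utf8_seq_at s i then Utf8
      else Latin1
    in
    scan 0
  end
```

In this sketch the choice is made once, at the first non-ASCII byte, which mirrors the statement that later illegal sequences are reported as errors rather than triggering a new guess.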
Unicode variants of functions in module String.
val length : ustring -> int
Ustring.length s returns the number of characters of s.
val get : ustring -> int -> uchar
Ustring.get s n returns character number n in ustring s. Raises Invalid_argument "Ustring.get" if out of range.
val set : ustring -> int -> uchar -> unit
Ustring.set s n c modifies ustring s in place by replacing the uchar at index n with uchar c. Raises Invalid_argument "Ustring.get" if out of range.
val create : int -> ustring
Ustring.create n returns a fresh ustring with a length of n characters. Raises Invalid_argument "Ustring.create" if n < 0 or n > Sys.max_array_length.
val make : int -> uchar -> ustring
Ustring.make n c returns a fresh ustring of length n, filled with character c. Raises Invalid_argument "Ustring.make" if n < 0 or n > Sys.max_array_length.
val copy : ustring -> ustring
val sub : ustring -> int -> int -> ustring
Ustring.sub s start len returns a new ustring of length len, consisting of the substring of s that starts at index position start. Raises exception Invalid_argument "Ustring.sub" if start and len do not give a valid substring.
val concat : ustring -> ustring list -> ustring
Ustring.concat sep sl returns a ustring in which the ustrings of list sl are concatenated, with the separator string sep inserted between each pair of list elements. The function always returns a fresh string. If performance is critical, use function fast_concat.
val rindex : ustring -> uchar -> int
Ustring.rindex s c returns the index of the last occurrence of character c in ustring s. Raises Not_found if c does not occur in s.
val rindex_from : ustring -> int -> uchar -> int
Ustring.rindex_from s i c returns the index of the last occurrence of character c in ustring s before index position i+1. Note that function calls Ustring.rindex s c and Ustring.rindex_from s (Ustring.length s - 1) c are equivalent. Raises Not_found if c does not occur in s. Raises Invalid_argument "Ustring.rindex_from" if i+1 is an illegal index in ustring s.
val append : ustring -> ustring -> ustring
Equivalent to the infix operator ^.. (see module Ustring.Op for more details about the operator). Note that this function always creates a fresh string after the append. For performance-demanding applications, function fast_append or operator ^. is recommended instead.
val fast_append : ustring -> ustring -> ustring
In contrast to function append and the standard string concatenation operator ^, function fast_append does not allocate new memory or create a fresh string. Instead, the concatenation is internally stored as a tree, making the operation constant time. When the string is later used by some function in module Ustring, the internal tree representation is automatically collapsed to a plain string. Note that, in contrast to append, this function does not create a fresh string directly. Hence, if the substrings are shared and modified in place, this string will also be updated. To make sure that the result string is unique, call function copy. This function is equivalent to the infix operator ^. (see module Ustring.Op for more details about the operator).
val fast_concat : ustring -> ustring list -> ustring
Same as function concat, with the difference that it does not return a fresh string. Function fast_concat instead uses fast_append for string concatenation.
val count : ustring -> uchar -> int
Ustring.count s c returns the number of occurrences of character c in ustring s.
val trim_left : ustring -> ustring
val trim_right : ustring -> ustring
val trim : ustring -> ustring
val empty : unit -> ustring
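The tree representation described for fast_append above can be modelled in a few lines. The sketch below is an assumption about how such a structure might look, not the library's actual code: the types tree and ustr and the functions of_array, flatten, and collapse are hypothetical, and a string is modelled as an int array of code points.

```ocaml
(* Hypothetical model: a string is either a flat array of code points
   or a pending concatenation of two strings. *)
type tree =
  | Leaf of int array
  | Concat of tree * tree

type ustr = tree ref

let of_array a : ustr = ref (Leaf a)

(* Constant time: one node is allocated, no characters are copied. *)
let fast_append (a : ustr) (b : ustr) : ustr = ref (Concat (!a, !b))

(* Flatten a tree into a single array. *)
let rec flatten = function
  | Leaf a -> a
  | Concat (l, r) -> Array.append (flatten l) (flatten r)

(* Collapse on first use and cache the flat form in place. *)
let collapse (s : ustr) : int array =
  match !s with
  | Leaf a -> a
  | t ->
    let a = flatten t in
    s := Leaf a;
    a

let length s = Array.length (collapse s)
let get s i = (collapse s).(i)
```

In this model the leaves of an uncollapsed tree are still the arrays of the original strings, so an in-place change to one of them made before the collapse is visible in the concatenation. That is the sharing effect the description of fast_append warns about.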
val unix2dos : string -> string
Ustring.unix2dos s returns a string where the newline characters in string s are converted to the DOS and Windows standard. The Ustring module handles all strings internally using the line feed (LF) code 0x0A, which is standard in Unix-like systems (e.g., GNU/Linux, Mac OS X, and FreeBSD). All input functions (e.g., from_utf8 or from_latin1) automatically convert to this format. Hence, when ustrings are encoded using e.g. Latin-1 or UTF-8, they will only contain the LF character for a new line. However, Windows, DOS, OS/2, and Symbian OS, for example, use the sequence carriage return (CR) 0x0D followed by LF 0x0A. This function converts from Unix style to this format.
val string2hex : string -> ustring
Ustring.string2hex s returns a comma-separated list of hex values for the bytes in string s. For example, for input string "an_", the ustring "61,6e,5f" is returned.
val convert_escaped_chars : ustring -> ustring
Raises Invalid_argument "convert_escaped_chars" if the string contains illegal escape sequences.
val read_file : ?encode_type:encoding -> string -> ustring
Ustring.read_file fn returns a ustring with the whole contents of the file named fn. By default, the encoding type is Auto (see type encoding for details). The input can also be forced to be interpreted as another encoding type. For example, expression Ustring.read_file ~encode_type:Ustring.Utf8 fn creates a ustring under the assumption that the input file is encoded using UTF-8. If there is a decoding error, exception Decode_error enc pos is raised. Argument enc is the encoding type and pos is the number of bytes read from the file when the decoding error occurred. Raises Sys_error if there were problems opening or reading from the file.
val read_from_channel : ?encode_type:encoding -> Pervasives.in_channel -> int -> ustring
Ustring.read_from_channel ic returns a function that can read from the in_channel. The returned function takes one argument stating the number of Unicode characters that should be read from the stream, and returns a ustring with the characters read. The length of the returned ustring is approximately the requested number of characters, i.e., the function can do a partial read. If the returned ustring has length zero, the end of the character stream has been reached. If there is a decoding error, exception Decode_error enc pos is raised. Argument enc is the encoding type and pos is the number of bytes read from the file when the decoding error occurred. Note that this function should only be called once to get the read function int -> ustring, and no other function is allowed to read from this channel at the same time.
exception Decode_error of (encoding * int)
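The read protocol described for read_from_channel above (obtain the read function once, call it repeatedly, stop on a zero-length result) can be illustrated without the library. In this sketch make_reader and read_all are hypothetical names, plain OCaml strings stand in for ustrings, and the reader works on an in-memory string rather than a channel.

```ocaml
(* Hypothetical reader over an in-memory string. It stands in for the
   function returned by read_from_channel and may return fewer
   characters than requested. *)
let make_reader (data : string) : int -> string =
  let pos = ref 0 in
  fun n ->
    let len = min n (String.length data - !pos) in
    let chunk = String.sub data !pos len in
    pos := !pos + len;
    chunk

(* Keep reading until a zero-length result signals end of stream. *)
let read_all (read : int -> string) : string =
  let buf = Buffer.create 64 in
  let rec loop () =
    let chunk = read 4 in
    if String.length chunk > 0 then begin
      Buffer.add_string buf chunk;
      loop ()
    end
  in
  loop ();
  Buffer.contents buf
```

The caller never relies on getting exactly the requested number of characters, which is the point of the partial-read clause in the description.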
val from_latin1 : string -> ustring
val from_latin1_char : char -> ustring
val from_utf8 : string -> ustring
Raises Invalid_argument "Ustring.from_utf8" if the input string does not have a valid UTF-8 encoding. The input string must not have a byte order mark (BOM).
val from_uchars : uchar array -> ustring
Raises Invalid_argument "Ustring.from_uchars" if uchar values are illegal (must be in range 0x0-0x1FFFFF).
val latin1_to_uchar : char -> uchar
val to_latin1 : ustring -> string
Raises Invalid_argument "Ustring.to_latin1" if the characters are not within the ASCII and Latin-1 character set (values 0-255).
val to_utf8 : ustring -> string
val to_uchars : ustring -> uchar array
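The relationship between a uchar array and its UTF-8 byte string, which the conversion functions above rely on, can be sketched as follows. The names add_utf8, utf8_of_uchars, and uchars_of_utf8 are hypothetical and are not part of Ustring. The range check reuses the 0x0-0x1FFFFF bound stated for from_uchars, and overlong encodings are not rejected in this sketch.

```ocaml
(* Encode one code point into buffer [b] as UTF-8 (1-4 bytes). *)
let add_utf8 b c =
  let add x = Buffer.add_char b (Char.chr x) in
  if c < 0 || c > 0x1FFFFF then invalid_arg "add_utf8"
  else if c < 0x80 then add c
  else if c < 0x800 then begin
    add (0xC0 lor (c lsr 6));
    add (0x80 lor (c land 0x3F))
  end else if c < 0x10000 then begin
    add (0xE0 lor (c lsr 12));
    add (0x80 lor ((c lsr 6) land 0x3F));
    add (0x80 lor (c land 0x3F))
  end else begin
    add (0xF0 lor (c lsr 18));
    add (0x80 lor ((c lsr 12) land 0x3F));
    add (0x80 lor ((c lsr 6) land 0x3F));
    add (0x80 lor (c land 0x3F))
  end

let utf8_of_uchars (a : int array) : string =
  let b = Buffer.create (Array.length a) in
  Array.iter (add_utf8 b) a;
  Buffer.contents b

let uchars_of_utf8 (s : string) : int array =
  let n = String.length s in
  let byte i = Char.code s.[i] in
  (* Payload of the continuation byte at [i]; it must exist and
     have the form 10xxxxxx. *)
  let cont i =
    if i >= n || byte i land 0xC0 <> 0x80 then invalid_arg "uchars_of_utf8"
    else byte i land 0x3F
  in
  let rec go i acc =
    if i >= n then Array.of_list (List.rev acc)
    else begin
      let b0 = byte i in
      if b0 < 0x80 then go (i + 1) (b0 :: acc)
      else if b0 land 0xE0 = 0xC0 then
        go (i + 2) ((((b0 land 0x1F) lsl 6) lor cont (i + 1)) :: acc)
      else if b0 land 0xF0 = 0xE0 then
        go (i + 3)
          ((((b0 land 0x0F) lsl 12)
            lor (cont (i + 1) lsl 6)
            lor cont (i + 2)) :: acc)
      else if b0 land 0xF8 = 0xF0 then
        go (i + 4)
          ((((b0 land 0x07) lsl 18)
            lor (cont (i + 1) lsl 12)
            lor (cont (i + 2) lsl 6)
            lor cont (i + 3)) :: acc)
      else invalid_arg "uchars_of_utf8"
    end
  in
  go 0 []
```

A round trip through both functions returns the original array, and a truncated sequence such as a lone lead byte raises Invalid_argument in this sketch.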
val validate_utf8_string : string -> int -> int
Ustring.validate_utf8_string s n checks if the first n characters of string s have valid UTF-8 encoding. If the input is valid, but not all data is available (e.g., at the end, only 2 bytes are available for a character that needs 3 bytes), the number of characters that represent whole characters is returned. Raises exception Decode_error enc pos if the string does not have a valid UTF-8 encoding. Argument enc is the encoding type and pos is the position in string s of the decode error.
The Ustring module is especially designed for simple support of Unicode lexing and parsing. The functions below are defined for this purpose.
val lexing_from_channel : ?encode_type:encoding -> Pervasives.in_channel -> Lexing.lexbuf
Creates a Lexing.lexbuf on a given input channel. Expression Ustring.lexing_from_channel inchan returns a lexer buffer that reads from input channel inchan. By default, the encoding type is Auto (see type encoding for details). The input can also be forced to be interpreted as another encoding type. For example, expression Ustring.lexing_from_channel ~encode_type:Ustring.Utf8 inchan creates a lexbuf that assumes that the input data is encoded using UTF-8. The stream of characters returned to the lexical analyzer is always UTF-8, regardless of the input encoding. Hence, this function is a simple and safe way to do lexical analysis of arbitrarily encoded text data. Raises exception Decode_error enc pos if there is a decoding error in the data read from inchan. Argument enc is the encoding type and pos is the number of bytes read from inchan when the decoding error occurred.
val lexing_from_ustring : ustring -> Lexing.lexbuf
Creates a Lexing.lexbuf that reads from a ustring. The stream of characters returned to the lexical analyzer is always UTF-8.
val equal : t -> t -> bool
Equivalent to operator =. (defined in module Ustring.Op).
val not_equal : t -> t -> bool
Equivalent to operator <>. (defined in module Ustring.Op).
val compare : t -> t -> int
Comparison function corresponding to Pervasives.compare. Since both type t and function compare are implemented, module Ustring can be passed directly to functors such as Set.Make and Map.Make. For example, to create a map that uses ustrings as keys, use the following source code line:
module USMap = Map.Make (Ustring)
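The functor application works for any module that exposes a type t and a function compare. As a self-contained illustration, the sketch below uses a hypothetical stand-in module Ustr (an int array of code points) instead of Ustring, so that it runs with the standard library alone.

```ocaml
(* Hypothetical stand-in for Ustring: a module with [t] and [compare]
   satisfies the Map.OrderedType signature. *)
module Ustr = struct
  type t = int array                 (* one code point per element *)
  let compare (a : t) (b : t) = compare a b
end

module USMap = Map.Make (Ustr)

(* Build a one-entry map and look the key up again. *)
let m = USMap.add [| 0x6B; 0x65; 0x79 |] 1 USMap.empty
let v = USMap.find [| 0x6B; 0x65; 0x79 |] m
```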
val hash : t -> int
Since type t and functions hash and equal are implemented, module Ustring can be passed directly to functor Hashtbl.Make, making it simple to use ustrings as keys in a hash table. For example, to create a hash table that uses ustrings as keys, use the following source code line:
module USHash = Hashtbl.Make(Ustring)
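The same pattern can be tried without the library. In the sketch below, Ustr is a hypothetical stand-in for Ustring that provides t, equal, and hash over an int array of code points, which is all that Hashtbl.Make requires.

```ocaml
(* Hypothetical stand-in for Ustring: [t], [equal] and [hash] satisfy
   the Hashtbl.HashedType signature. *)
module Ustr = struct
  type t = int array
  let equal (a : t) (b : t) = a = b
  let hash (a : t) = Hashtbl.hash a
end

module USHash = Hashtbl.Make (Ustr)

(* Store a value under a key and look it up with an equal key. *)
let tbl : int USHash.t = USHash.create 16
let () = USHash.replace tbl [| 0x6B; 0x65; 0x79 |] 42
let found = USHash.find_opt tbl [| 0x6B; 0x65; 0x79 |]
```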