Module Ustring

module Ustring: sig .. end

Unicode string library - Ustring. Version 0.01. This module implements Unicode variants of the string functions in the OCaml standard library, e.g., those in modules String and Pervasives. There are also a number of additional functions available. Several basic operators (e.g., equality and string concatenation) are defined in sub module Ustring.Op. It is recommended to open this sub module, but not Ustring directly.

This module currently supports ASCII, Latin-1, and UTF-8 encoding. If another encoding is used, an exception will be raised. In all functions below, a ustring s is indexed from 0 to (Ustring.length s)-1.

type uchar = int

Unicode char type. It is a basic integer.

module Op: sig .. end

Sub module Ustring.Op contains Unicode versions of several functions available in the Pervasives module, as well as operators and functions for simple string creation and manipulation (for example, the string concatenation operator ^., the ustring equality operator =., and string creation us"string").

type t = Op.ustring

Alias for the type Op.ustring

type ustring = Op.ustring

Alias for the type Op.ustring

type encoding =

`\|`	`Ascii`	`(*`	Standard ASCII (values 0-127)	`*)`
`\|`	`Latin1`	`(*`	Latin-1 encoding. Supports most European languages.	`*)`
`\|`	`Utf8`	`(*`	UTF-8 encoding. Full Unicode encoding, where each character is represented by 1-4 bytes.	`*)`
`\|`	`Utf16le`	`(*`	Not yet supported	`*)`
`\|`	`Utf16be`	`(*`	Not yet supported	`*)`
`\|`	`Utf32le`	`(*`	Not yet supported	`*)`
`\|`	`Utf32be`	`(*`	Not yet supported	`*)`
`\|`	`Auto`	`(*`	Not a specific encoding, but a method for how encoded data is interpreted. If the input data has a byte order mark (BOM) at the beginning of the sequence, the encoding type stated in the BOM will be used. If no BOM is available, ASCII will initially be assumed to be the encoding type. Then, when a byte sequence appears that is not ASCII (values 0-127), a choice is made for the encoding of the remaining data: if the sequence represents a legal UTF-8 encoded character, the rest of the input will be treated as UTF-8 encoded. If it is not a UTF-8 encoded character, Latin-1 will be assumed for the rest of the data sequence. Any later illegal decoding will then be reported as an error.	`*)`
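
The Auto rule described above can be sketched on plain byte strings. This is only an illustration under simplifying assumptions, not the library's code: the names detected, utf8_len, utf8_seq_at and detect_auto are hypothetical, only the UTF-8 BOM (bytes EF BB BF) is recognised, and a multi-byte sequence is checked for structure only (lead byte followed by continuation bytes).

```ocaml
(* Hypothetical sketch of the Auto rule; not the library's implementation. *)
type detected = Ascii_only | Latin1_guess | Utf8_guess

(* Number of bytes announced by a UTF-8 lead byte, or 0 if [b] cannot
   start a multi-byte sequence. *)
let utf8_len b =
  if b land 0xE0 = 0xC0 then 2
  else if b land 0xF0 = 0xE0 then 3
  else if b land 0xF8 = 0xF0 then 4
  else 0

(* True if a structurally valid multi-byte UTF-8 sequence starts at [i]. *)
let utf8_seq_at s i =
  let n = utf8_len (Char.code s.[i]) in
  n > 0
  && i + n <= String.length s
  && (let ok = ref true in
      for k = 1 to n - 1 do
        if Char.code s.[i + k] land 0xC0 <> 0x80 then ok := false
      done;
      !ok)

let detect_auto s =
  let len = String.length s in
  (* A UTF-8 BOM decides the encoding immediately. *)
  if len >= 3 && String.sub s 0 3 = "\xEF\xBB\xBF" then Utf8_guess
  else
    (* Otherwise assume ASCII until the first byte above 127 is seen. *)
    let rec scan i =
      if i >= len then Ascii_only
      else if Char.code s.[i] < 0x80 then scan (i + 1)
      else if utf8_seq_at s i then Utf8_guess
      else Latin1_guess
    in
    scan 0
```

In this sketch the decision is taken once, at the first non-ASCII byte, which mirrors the statement that later illegal sequences are reported as errors rather than triggering a new guess.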

Module String's functions

The following section implements the Unicode version of the functions available in standard library module String.

val length : ustring -> int

Ustring.length s returns the number of characters of s.

val get : ustring -> int -> uchar

Ustring.get s n returns character number n in ustring s. Raises Invalid_argument "Ustring.get" if out of range.

val set : ustring -> int -> uchar -> unit

Ustring.set s n c modifies ustring s in place by replacing the uchar at index n by uchar c. Raises Invalid_argument "Ustring.set" if n is out of range.

val create : int -> ustring

Function Ustring.create n returns a new fresh ustring with length n characters. Raises Invalid_argument "Ustring.create" if n < 0 or n > Sys.max_array_length.

val make : int -> uchar -> ustring

Function Ustring.make n c returns a new fresh ustring of length n, filled with character c. Raises Invalid_argument "Ustring.make" if n < 0 or n > Sys.max_array_length.

val copy : ustring -> ustring

Returns a fresh copy of the string, i.e., the result no longer shares data with the original.

val sub : ustring -> int -> int -> ustring

Function Ustring.sub s start len returns a new ustring of length len, consisting of the sub-string of s that starts at index position start. Raises exception Invalid_argument "Ustring.sub" if start and len do not give a valid sub-string.

val concat : ustring -> ustring list -> ustring

Ustring.concat sep sl returns a ustring where the ustrings in list sl are concatenated, with separator string sep inserted between each pair of list elements. The function always returns a new fresh string. If performance is critical, use function fast_concat.

val rindex : ustring -> uchar -> int

Ustring.rindex s c returns the index of the last occurrence of character c in ustring s. Raises Not_found if c does not occur in s.

val rindex_from : ustring -> int -> uchar -> int

Ustring.rindex_from s i c returns the index of the last occurrence of character c in ustring s before index position i+1. Note that function calls Ustring.rindex s c and Ustring.rindex_from s (Ustring.length s - 1) c are equivalent. Raises Not_found if c does not occur in s before position i+1. Raises Invalid_argument "Ustring.rindex_from" if i+1 is an illegal index in ustring s.
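
The standard library functions String.rindex and String.rindex_from follow the same specification for byte strings, so the equivalence can be checked without Ustring. The sketch below uses only module String; the names s, last, i1, i2 and i3 are illustrative.

```ocaml
let s = "a/b/c"
let last = String.length s - 1

(* Searching the whole string and searching from the last index agree. *)
let i1 = String.rindex s '/'
let i2 = String.rindex_from s last '/'

(* Restricting the search to positions 0..2 finds the earlier separator. *)
let i3 = String.rindex_from s 2 '/'
```

Here i1 and i2 are both 3, while i3 is 1.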

Additional string functions

val append : ustring -> ustring -> ustring

Append/concatenation of two strings. This function is equivalent to the infix operator ^.. defined in module Ustring.Op; please see that module for more details about the operator. Note that this function always creates a new fresh string. For performance demanding applications, function fast_append or operator ^. is recommended instead.

val fast_append : ustring -> ustring -> ustring

Fast append/concatenation of two strings. Compared to function append and the standard string concatenation operator ^, function fast_append does not allocate new memory or create a new fresh string. Instead, the concatenation is internally stored as a tree, making the operation constant time. When the string is later used by some function in module Ustring, the internal tree representation will be automatically collapsed to a plain string. Note that, in contrast to append, this function does not create a fresh new string directly. Hence, if the sub strings are shared and modified in place, this string will also be updated. To make sure that the result string is unique, call function copy. This function is equivalent to the infix operator ^. defined in module Ustring.Op; please see that module for more details about the operator.
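
The tree idea can be illustrated with a small rope over arrays of code points. This is a minimal sketch under stated assumptions and not the actual Ustring representation: the types tree and rope and the function collapse are hypothetical, and fast_append here is a local stand-in with the same name.

```ocaml
(* Hypothetical rope; the real Ustring representation may differ. *)
type tree =
  | Leaf of int array            (* flat sequence of code points *)
  | Node of tree * tree * int    (* left, right, cached total length *)

type rope = tree ref

let tree_length = function
  | Leaf a -> Array.length a
  | Node (_, _, n) -> n

(* Constant time: builds one node, copies no characters. *)
let fast_append (a : rope) (b : rope) : rope =
  ref (Node (!a, !b, tree_length !a + tree_length !b))

(* Flatten the tree into one array and cache the result in place. *)
let collapse (r : rope) : int array =
  match !r with
  | Leaf a -> a
  | t ->
    let out = Array.make (tree_length t) 0 in
    let rec fill pos = function
      | Leaf a ->
        Array.blit a 0 out pos (Array.length a);
        pos + Array.length a
      | Node (l, rt, _) -> fill (fill pos l) rt
    in
    ignore (fill 0 t);
    r := Leaf out;
    out
```

Because the leaves are shared until the first collapse, an in-place change to an operand made before that point is visible in the appended result, which is the sharing behaviour the paragraph above warns about.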

val fast_concat : ustring -> ustring list -> ustring

Same as function concat, with the difference that it does not return a fresh string. Function fast_concat instead uses fast_append for string concatenation.

val count : ustring -> uchar -> int

Ustring.count s c returns the number of occurrences of character c in ustring s.

val trim_left : ustring -> ustring

Returns a new ustring where white space (e.g. space, newline and tab) is removed from the beginning of the input ustring.

val trim_right : ustring -> ustring

Returns a new ustring where white space (e.g. space, newline and tab) is removed from the end of the input ustring.

val trim : ustring -> ustring

Returns a new ustring where white space (e.g. space, newline and tab) is removed from the beginning and end of the input ustring.

val empty : unit -> ustring

Returns an empty ustring.

val unix2dos : string -> string

Function Ustring.unix2dos s returns a string where newline characters in string s are converted to the DOS and Windows standard. The Ustring module handles all strings internally using line feed (LF), code 0x0A, which is standard in Unix-like systems (e.g., GNU/Linux, Mac OS X, and FreeBSD). All input functions (e.g., from_utf8 or from_latin1) automatically convert to this format. Hence, when ustrings are encoded using e.g. Latin-1 or UTF-8, they will only contain the LF character for new line. However, Windows, DOS, OS/2, and Symbian OS, for example, use the sequence carriage return (CR) 0x0D followed by LF 0x0A. This function converts from the Unix-style format to this format.
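
The conversion itself amounts to replacing each LF by CR LF. A minimal sketch using only the standard library follows; the name unix2dos_sketch is hypothetical, and the code assumes, as stated above, that the input contains LF-only newlines (an existing CR LF pair would become CR CR LF).

```ocaml
(* Replace every LF (0x0A) by the two-byte sequence CR LF (0x0D 0x0A). *)
let unix2dos_sketch (s : string) : string =
  let b = Buffer.create (String.length s + 16) in
  String.iter
    (fun c ->
       if c = '\n' then Buffer.add_string b "\r\n"
       else Buffer.add_char b c)
    s;
  Buffer.contents b
```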

val string2hex : string -> ustring

Function Ustring.string2hex s returns a comma separated list of hex values for the bytes in string s. For example, from input string "an_" a ustring "61,6e,5f" is returned.
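
The output format can be reproduced on ordinary strings as follows. This is a sketch, not the library function: string2hex_sketch is a hypothetical name, it returns a string rather than a ustring, and the use of lowercase two-digit values for every byte is an assumption based on the single example above.

```ocaml
(* Comma separated, lowercase, two hex digits per byte (assumed format). *)
let string2hex_sketch (s : string) : string =
  List.init (String.length s)
    (fun i -> Printf.sprintf "%02x" (Char.code s.[i]))
  |> String.concat ","
```

Applied to "an_" this yields "61,6e,5f", matching the example in the description.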

val convert_escaped_chars : ustring -> ustring

Converts escaped characters. Raises Invalid_argument "convert_escaped_chars" if the string contains illegal escape sequences.

val read_file : ?encode_type:encoding -> string -> ustring

Function Ustring.read_file fn returns a ustring with the whole contents of the file named fn. By default, the encoding type is Auto (see type encoding for details). The input can also be forced to be interpreted using another encoding type. For example, expression Ustring.read_file ~encode_type:Ustring.Utf8 fn creates a ustring under the assumption that the input file is encoded using UTF-8. If there is a decoding error, exception Decode_error enc pos is raised. Argument enc is the encoding type and pos is the number of bytes read from the file when the decoding error occurred. Raises Sys_error if there were problems opening or reading from the file.

val read_from_channel : ?encode_type:encoding ->
       Pervasives.in_channel -> int -> ustring

Function Ustring.read_from_channel ic returns a function which can read from the in_channel. The returned function has one argument stating the number of Unicode characters that should be read from the stream. It returns a ustring with the read characters. The length of the returned ustring is approximately the requested number of characters, i.e., the function can do a partial read. If the returned ustring has length zero, the end of the character stream has been reached. If there is a decoding error, exception Decode_error enc pos is raised. Argument enc is the encoding type and pos is the number of bytes read from the channel when the decoding error occurred. Note that this function should only be called once to get the read function int -> ustring, and no other function is allowed to read from this channel at the same time.

Encoding

exception Decode_error of (encoding * int)

Exception raised when a decode error occurs. The first parameter represents the encoding method used when the error occurred and the second parameter is the position in the stream/string/channel.

val from_latin1 : string -> ustring

Creates a ustring from a string that is assumed to be encoded with Latin-1.

val from_latin1_char : char -> ustring

Creates a ustring from a Latin-1 encoded char.

val from_utf8 : string -> ustring

Creates a ustring from a string that is assumed to be encoded using UTF-8. Raises exception Invalid_argument "Ustring.from_utf8" if the input string is not valid UTF-8. The input string must not have a byte order mark (BOM).

val from_uchars : uchar array -> ustring

Converts an array of uchars to a ustring. Raises exception Invalid_argument "Ustring.from_uchars" if any uchar value is illegal (values must be in the range 0x0-0x1FFFFF).

val latin1_to_uchar : char -> uchar

Converts a Latin-1 encoded char to a uchar.

val to_latin1 : ustring -> string

Creates a new string encoded using Latin-1. Raises Invalid_argument "Ustring.to_latin1" if the characters are not within the ASCII and Latin-1 character set (values 0-255).
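
Since Latin-1 maps code points 0-255 directly to single bytes, the range check is the only interesting part. A minimal sketch on an array of code points, where latin1_of_uchars is a hypothetical name and not part of the module:

```ocaml
(* One byte per code point; anything above 255 has no Latin-1 encoding. *)
let latin1_of_uchars (a : int array) : string =
  String.init (Array.length a) (fun i ->
    let c = a.(i) in
    if c < 0 || c > 255 then invalid_arg "latin1_of_uchars"
    else Char.chr c)
```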

val to_utf8 : ustring -> string

Returns a UTF-8 encoded string. A UTF-8 string consists of a sequence of bytes where each Unicode character is encoded into 1 to 4 bytes. ASCII characters are always encoded into 1 byte.
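
The 1 to 4 byte scheme can be made concrete for a single code point. The sketch below is illustrative only: utf8_of_uchar is a hypothetical helper, and the upper bound 0x1FFFFF is taken from the range stated for from_uchars rather than from the library source.

```ocaml
(* Encode one code point as UTF-8 using the standard 1-4 byte layout. *)
let utf8_of_uchar (c : int) : string =
  let b = Buffer.create 4 in
  let add n = Buffer.add_char b (Char.chr n) in
  if c < 0 || c > 0x1FFFFF then invalid_arg "utf8_of_uchar"
  else if c < 0x80 then add c                      (* 1 byte: ASCII *)
  else if c < 0x800 then begin                     (* 2 bytes *)
    add (0xC0 lor (c lsr 6));
    add (0x80 lor (c land 0x3F))
  end else if c < 0x10000 then begin               (* 3 bytes *)
    add (0xE0 lor (c lsr 12));
    add (0x80 lor ((c lsr 6) land 0x3F));
    add (0x80 lor (c land 0x3F))
  end else begin                                   (* 4 bytes *)
    add (0xF0 lor (c lsr 18));
    add (0x80 lor ((c lsr 12) land 0x3F));
    add (0x80 lor ((c lsr 6) land 0x3F));
    add (0x80 lor (c land 0x3F))
  end;
  Buffer.contents b
```

For example, U+0041 encodes to the single byte 0x41, while U+20AC (the euro sign) encodes to the three bytes E2 82 AC.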

val to_uchars : ustring -> uchar array

Returns an array of uchars.

val validate_utf8_string : string -> int -> int

Expression Ustring.validate_utf8_string s n checks if the first n characters of string s have valid UTF-8 encoding. If the input is valid, but not all data is available (e.g., at the end, only 2 bytes are available for a character that needs 3 bytes), the number of characters of s that represent whole Unicode characters is returned. Raises exception Decode_error enc pos if the string is not valid UTF-8. Argument enc is the encoding type and pos is the position in string s of the decode error.
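
The partial-input behaviour can be sketched as a prefix validator. This is an assumption-laden illustration, not the library's algorithm: validate_prefix and the exception Bad_utf8 are hypothetical stand-ins, positions are counted in bytes, and only the structure of each sequence is checked (overlong forms are not rejected).

```ocaml
exception Bad_utf8 of int   (* hypothetical stand-in for Decode_error *)

(* Returns how many of the first [n] bytes of [s] form whole characters. *)
let validate_prefix (s : string) (n : int) : int =
  if n < 0 || n > String.length s then invalid_arg "validate_prefix";
  let byte i = Char.code s.[i] in
  let rec go i =
    if i >= n then i
    else
      let b = byte i in
      let need =
        if b < 0x80 then 1
        else if b land 0xE0 = 0xC0 then 2
        else if b land 0xF0 = 0xE0 then 3
        else if b land 0xF8 = 0xF0 then 4
        else raise (Bad_utf8 i)
      in
      (* Check the continuation bytes that are actually available. *)
      let avail = min need (n - i) in
      for k = 1 to avail - 1 do
        if byte (i + k) land 0xC0 <> 0x80 then raise (Bad_utf8 (i + k))
      done;
      if avail < need then i   (* truncated final character: stop here *)
      else go (i + need)
  in
  go 0
```

With s = "a\xE2\x82\xAC", validate_prefix s 4 returns 4, whereas validate_prefix s 3 returns 1 because the three-byte character is cut off after two bytes.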

Lexing

The Ustring module is especially designed for simple support of Unicode lexing and parsing. The below functions are defined for this purpose.

val lexing_from_channel : ?encode_type:encoding -> Pervasives.in_channel -> Lexing.lexbuf

Creates a new Lexing.lexbuf on a given input channel. Expression Ustring.lexing_from_channel inchan returns a lexer buffer that reads from input channel inchan. By default, the encoding type is Auto (see type encoding for details). The input can also be forced to be interpreted using another encoding type. For example, expression Ustring.lexing_from_channel ~encode_type:Ustring.Utf8 inchan creates a lexbuf under the assumption that the input data is encoded using UTF-8. The stream of characters that is returned to the lexical analyzer is always UTF-8, regardless of the input encoding. Hence, this function is a simple and safe way to do lexical analysis of arbitrarily encoded text data. Raises exception Decode_error enc pos if there is a decoding error in the data read from inchan. Argument enc is the encoding type and pos is the number of bytes read from inchan when the decoding error occurred.

val lexing_from_ustring : ustring -> Lexing.lexbuf

Creates a new Lexing.lexbuf that reads from a ustring. The stream of characters that are returned to the lexical analyzer is always UTF-8.

Comparison and Standard Collections

val equal : t -> t -> bool

Safe structural equality comparison function for ustrings. For easy usage, use the equivalent operator =. which is defined in module Ustring.Op.

val not_equal : t -> t -> bool

Safe structural inequality comparison function for ustrings. For easy usage, use the equivalent operator <>. which is defined in module Ustring.Op.

val compare : t -> t -> int

Comparison function for ustrings. Uses the same specification as Pervasives.compare. Since both type t and function compare are implemented, module Ustring can be passed directly to functors such as Set.Make and Map.Make. For example, to create a map that uses ustrings as keys, use the following source code line:

module USMap = Map.Make (Ustring)

val hash : t -> int

Implements a safe hash function for ustrings. Since type t and functions hash and equal are implemented, module Ustring can be passed directly to functor Hashtbl.Make, making it simple to use ustrings as keys in a hash table. For example, to create a hash table that uses ustrings as keys, use the following source code line:

module USHash = Hashtbl.Make(Ustring)
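
What the functors require can be seen with a stand-in module. In the sketch below, Ustr_like is a hypothetical substitute for Ustring that stores code points in an int array; it provides the same three ingredients (type t plus compare, equal and hash), which is all Map.Make and Hashtbl.Make need.

```ocaml
(* Hypothetical stand-in for Ustring: code points in an int array. *)
module Ustr_like = struct
  type t = int array
  let compare (a : t) (b : t) = compare a b   (* polymorphic compare *)
  let equal (a : t) (b : t) = a = b
  let hash (a : t) = Hashtbl.hash a
end

module M = Map.Make (Ustr_like)
module H = Hashtbl.Make (Ustr_like)

let key = [| 0x68; 0x69 |]            (* the code points of "hi" *)

(* Keys are compared structurally, so an equal array finds the binding. *)
let m = M.add key 1 M.empty
let h : int H.t = H.create 16
let () = H.replace h key 2
```

With the real library, the same pattern applies with Ustring in place of Ustr_like, as in the two module definitions shown above.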