UTF-8
Faster UTF-8 encoding and decoding.
To use the bindings from this module:
(import :std/text/utf8)
string->utf8
(string->utf8 str [start = 0] [end = (string-length str)]) -> u8vector | error
str := string to encode as UTF-8
start := exact integer for starting index
end := exact integer for end index
Returns a newly allocated u8vector containing the UTF-8 encoding of str. The optional start and end limit the operation to a substring of str.
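A minimal sketch of typical calls follows. The name uber is only an illustration, and the comments assume that end is an exclusive index, which the text above does not state explicitly.

```scheme
(import :std/text/utf8)

;; Build "über" from code points so the result does not depend on
;; how the source file or terminal input is decoded.
(def uber (string (integer->char #xFC) #\b #\e #\r))

(string->utf8 "abc")        ; => #u8(97 98 99)
(string->utf8 uber)         ; => #u8(195 188 98 101 114), U+00FC takes two bytes
(string->utf8 "hello" 1 3)  ; => #u8(101 108), i.e. "el" (end assumed exclusive)
```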
utf8->string
(utf8->string u8v [start = 0] [end = (u8vector-length u8v)]) -> string | error
u8v := u8vector of data to convert
start := exact integer for starting index
end := exact integer for ending index
Returns a newly allocated string decoded from the UTF-8 bytes in u8v. The optional start and end parameters limit the operation to the sub-vector between the given indexes. The replacement character U+FFFD is substituted for any unknown, unrecognized or unrepresentable character. An error is raised if the input ends in an incomplete character.
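The following sketch shows decoding with and without index bounds, plus the incomplete-character case described above. The helper try-decode is hypothetical and exists only to turn the raised error into a symbol; the exclusive end index is an assumption.

```scheme
(import :std/text/utf8)

(utf8->string (u8vector 104 105 33))      ; => "hi!"
(utf8->string (u8vector 104 105 33) 0 2)  ; => "hi" (end assumed exclusive)

;; Hypothetical helper: return 'decode-error instead of raising.
(def (try-decode bytes)
  (with-catch (lambda (e) 'decode-error)
              (lambda () (utf8->string bytes))))

;; 195 188 is the complete two-byte encoding of U+00FC.
(string-length (try-decode (u8vector 195 188)))  ; => 1

;; 195 alone starts a two-byte sequence that never finishes, so per the
;; description above this is expected to yield decode-error.
(try-decode (u8vector 195))
```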
utf8-encode
(utf8-encode str start end) -> u8vector
str := string to encode as UTF-8
start := exact integer for starting index
end := exact integer for ending index
Returns a newly allocated u8vector containing the UTF-8 encoding of the characters of str from start to end. Unlike string->utf8, both start and end are required.
utf8-decode
(utf8-decode u8v start end) -> string | error
u8v := u8vector of input data
start := exact integer for starting index
end := exact integer for ending index
Decodes the bytes in u8v from index start up to index end and returns the result as a string. Unlike utf8->string, both start and end are required. Signals an error if the bytes cannot be parsed as UTF-8.
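Because utf8-encode and utf8-decode take explicit indexes, a round trip might look like the sketch below. The names s and bytes are illustrative, and the end index is assumed to be exclusive.

```scheme
(import :std/text/utf8)

(def s "hello, world")

;; Encode the whole string by passing the full index range explicitly.
(def bytes (utf8-encode s 0 (string-length s)))

(u8vector-length bytes)                        ; => 12, one byte per ASCII character
(utf8-decode bytes 0 (u8vector-length bytes))  ; => "hello, world"
(utf8-decode bytes 7 12)                       ; => "world" (end assumed exclusive)
```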
string-utf8-length
(string-utf8-length str [start = 0] [end = (string-length str)]) -> integer | error
Returns the number of bytes needed to encode the string str as UTF-8. The optional start and end indexes limit the operation to a substring. Signals an error if str is not a string.
Examples
> (import :std/format)
> (import :std/text/utf8)
> (let ((s "uber")
        (us "über"))
    (printf "s length: ~a\n" (string-length s))
    (printf "u length: ~a\n" (string-length us))
    (newline)
    (printf "s utf8-length: ~a\n" (string-utf8-length s))
    (printf "u utf8-length: ~a\n" (string-utf8-length us)))
s length: 4
u length: 4

s utf8-length: 4
u utf8-length: 5
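As a further sketch, string-utf8-length can be cross-checked against the length of the vector that string->utf8 produces. The string is built from code points here so that the counts do not depend on how the REPL or source file decodes non-ASCII input; the name uber is illustrative and the end index is assumed to be exclusive.

```scheme
(import :std/text/utf8)

;; U+00FC followed by "ber": four characters, five UTF-8 bytes.
(def uber (string (integer->char #xFC) #\b #\e #\r))

(string-length uber)                   ; => 4
(string-utf8-length uber)              ; => 5
(u8vector-length (string->utf8 uber))  ; => 5, matches the computed byte length
(string-utf8-length uber 1 4)          ; => 3, "ber" only (end assumed exclusive)
```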